2009-06-11 10:48:06

by Mel Gorman

Subject: [PATCH 0/3] Fix malloc() stall in zone_reclaim() and bring behaviour more in line with expectations V3

The big change with this release is that the patch reintroducing
zone_reclaim_interval has been dropped as Ram reports the malloc() stalls
have been resolved. If this bug occurs again, the counter will be there to
help us identify the situation.

Changelog since V2
o Add reviews/acks
o Take advantage of Kosaki's work on the estimate of tmpfs pages
o Watch for underflow with Kosaki's calculation
o Drop the zone_reclaim_interval patch again after Ram reported that the
scan-avoidance-heuristic works for the malloc() test case

Changelog since V1
o Rebase to mmotm
o Add various acks
o Documentation and patch leader fixes
o Use Kosaki's method for calculating the number of unmapped pages
o Consider the zone full in more situations than all pages being unreclaimable
o Add a counter to detect when scan-avoidance heuristics are failing
o Handle jiffie wraps for zone_reclaim_interval
o Move zone_reclaim_interval to the end of the set with the view to dropping
it. If Kosaki's calculation is accurate, then the problem being dealt with
should also be addressed

A bug was brought to my attention against a distro kernel, but it affects
mainline, and I believe problems like this have been reported in various guises
on the mailing lists, although I don't have specific examples at the moment.

The reported problem was that malloc() stalled for a long time (minutes
in some cases) if a large tmpfs mount was occupying a large percentage of
memory overall. The pages did not get cleaned or reclaimed by zone_reclaim()
because the zone_reclaim_mode was unsuitable, but the lists were uselessly
scanned frequently, making the CPU spin at near 100%.

This patchset intends to address that bug and bring the behaviour of
zone_reclaim() more in line with expectations that arose during the
investigation. It is based on top of mmotm and takes advantage of Kosaki's
work with respect to zone_reclaim().

Patch 1 fixes the heuristics that zone_reclaim() uses to determine if a
scan should go ahead. The broken heuristic is what was causing the
malloc() stall, as it uselessly scanned the LRU constantly. Currently,
zone_reclaim() assumes zone_reclaim_mode is 1 and historically it
could not deal with tmpfs pages at all. This fixes up the heuristic so
that an unnecessary scan is more likely to be correctly avoided.

Patch 2 notes that zone_reclaim() returning a failure automatically means
the zone is marked full. This is not always true. It could have
failed because the GFP mask or zone_reclaim_mode were unsuitable.

Patch 3 introduces a counter zreclaim_failed that will increment each
time the zone_reclaim scan-avoidance heuristics fail. If that
counter is rapidly increasing, then zone_reclaim_mode should be
set to 0 as a temporary resolution and a bug reported, because
the scan-avoidance heuristic is still broken.

include/linux/vmstat.h | 3 ++
mm/internal.h | 4 +++
mm/page_alloc.c | 26 +++++++++++++++---
mm/vmscan.c | 69 ++++++++++++++++++++++++++++++++++-------------
mm/vmstat.c | 3 ++
5 files changed, 82 insertions(+), 23 deletions(-)


2009-06-11 10:48:22

by Mel Gorman

Subject: [PATCH 1/3] Properly account for the number of page cache pages zone_reclaim() can reclaim

On NUMA machines, the administrator can configure zone_reclaim_mode, which
is a more targeted form of direct reclaim. On machines with large NUMA
distances, for example, zone_reclaim_mode defaults to 1, meaning that clean
unmapped pages will be reclaimed if the zone watermarks are not being met.

There is a heuristic that determines if the scan is worthwhile, but the problem
is that the heuristic is not being properly applied and basically assumes
zone_reclaim_mode is 1 if it is enabled. The lack of proper detection can
manifest as high CPU usage as the LRU list is scanned uselessly.

Historically, once enabled, the heuristic depended on NR_FILE_PAGES, which may
include swapcache pages that the reclaim_mode cannot deal with. Patch
vmscan-change-the-number-of-the-unmapped-files-in-zone-reclaim.patch by
Kosaki Motohiro noted that zone_page_state(zone, NR_FILE_PAGES) included
pages that were not file-backed, such as swapcache, and made a calculation
based on the inactive, active and mapped file pages. This is far superior
when zone_reclaim_mode==1, but if RECLAIM_SWAP is set, then NR_FILE_PAGES is a
reasonable starting figure.

This patch alters how zone_reclaim() works out how many pages it might be
able to reclaim given the current reclaim_mode. If RECLAIM_SWAP is set
in the reclaim_mode, it will consider NR_FILE_PAGES as potential
candidates; otherwise it uses NR_INACTIVE_FILE + NR_ACTIVE_FILE -
NR_FILE_MAPPED to discount swapcache and other non-file-backed pages.
If RECLAIM_WRITE is not set, then NR_FILE_DIRTY pages are not candidates.
If RECLAIM_SWAP is not set, then NR_FILE_MAPPED pages are not.

[mmotm note: This patch should be merged with or replace
vmscan-change-the-number-of-the-unmapped-files-in-zone-reclaim. Kosaki?]

[[email protected]: Estimate unmapped pages minus tmpfs pages]
[[email protected]: Fix underflow problem in Kosaki's estimate]
Signed-off-by: Mel Gorman <[email protected]>
Reviewed-by: Rik van Riel <[email protected]>
Acked-by: Christoph Lameter <[email protected]>
---
mm/vmscan.c | 55 +++++++++++++++++++++++++++++++++++++++++--------------
1 files changed, 41 insertions(+), 14 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 2ddcfc8..d832ba8 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2333,6 +2333,44 @@ int sysctl_min_unmapped_ratio = 1;
*/
int sysctl_min_slab_ratio = 5;

+static inline unsigned long zone_unmapped_file_pages(struct zone *zone)
+{
+ unsigned long file_mapped = zone_page_state(zone, NR_FILE_MAPPED);
+ unsigned long file_lru = zone_page_state(zone, NR_INACTIVE_FILE) +
+ zone_page_state(zone, NR_ACTIVE_FILE);
+
+ /*
+ * It's possible for there to be more file mapped pages than
+ * accounted for by the pages on the file LRU lists because
+ * tmpfs pages accounted for as ANON can also be FILE_MAPPED
+ */
+ return (file_lru > file_mapped) ? (file_lru - file_mapped) : 0;
+}
+
+/* Work out how many page cache pages we can reclaim in this reclaim_mode */
+static long zone_pagecache_reclaimable(struct zone *zone)
+{
+ long nr_pagecache_reclaimable;
+ long delta = 0;
+
+ /*
+ * If RECLAIM_SWAP is set, then all file pages are considered
+ * potentially reclaimable. Otherwise, we have to worry about
+ * pages like swapcache and zone_unmapped_file_pages() provides
+ * a better estimate
+ */
+ if (zone_reclaim_mode & RECLAIM_SWAP)
+ nr_pagecache_reclaimable = zone_page_state(zone, NR_FILE_PAGES);
+ else
+ nr_pagecache_reclaimable = zone_unmapped_file_pages(zone);
+
+ /* If we can't clean pages, remove dirty pages from consideration */
+ if (!(zone_reclaim_mode & RECLAIM_WRITE))
+ delta += zone_page_state(zone, NR_FILE_DIRTY);
+
+ return nr_pagecache_reclaimable;
+}
+
/*
* Try to free up some pages from this zone through reclaim.
*/
@@ -2355,7 +2393,6 @@ static int __zone_reclaim(struct zone *zone, gfp_t gfp_mask, unsigned int order)
.isolate_pages = isolate_pages_global,
};
unsigned long slab_reclaimable;
- long nr_unmapped_file_pages;

disable_swap_token();
cond_resched();
@@ -2368,11 +2405,7 @@ static int __zone_reclaim(struct zone *zone, gfp_t gfp_mask, unsigned int order)
reclaim_state.reclaimed_slab = 0;
p->reclaim_state = &reclaim_state;

- nr_unmapped_file_pages = zone_page_state(zone, NR_INACTIVE_FILE) +
- zone_page_state(zone, NR_ACTIVE_FILE) -
- zone_page_state(zone, NR_FILE_MAPPED);
-
- if (nr_unmapped_file_pages > zone->min_unmapped_pages) {
+ if (zone_pagecache_reclaimable(zone) > zone->min_unmapped_pages) {
/*
* Free memory by calling shrink zone with increasing
* priorities until we have enough memory freed.
@@ -2419,8 +2452,6 @@ int zone_reclaim(struct zone *zone, gfp_t gfp_mask, unsigned int order)
{
int node_id;
int ret;
- long nr_unmapped_file_pages;
- long nr_slab_reclaimable;

/*
* Zone reclaim reclaims unmapped file backed pages and
@@ -2432,12 +2463,8 @@ int zone_reclaim(struct zone *zone, gfp_t gfp_mask, unsigned int order)
* if less than a specified percentage of the zone is used by
* unmapped file backed pages.
*/
- nr_unmapped_file_pages = zone_page_state(zone, NR_INACTIVE_FILE) +
- zone_page_state(zone, NR_ACTIVE_FILE) -
- zone_page_state(zone, NR_FILE_MAPPED);
- nr_slab_reclaimable = zone_page_state(zone, NR_SLAB_RECLAIMABLE);
- if (nr_unmapped_file_pages <= zone->min_unmapped_pages &&
- nr_slab_reclaimable <= zone->min_slab_pages)
+ if (zone_pagecache_reclaimable(zone) <= zone->min_unmapped_pages &&
+ zone_page_state(zone, NR_SLAB_RECLAIMABLE) <= zone->min_slab_pages)
return 0;

if (zone_is_all_unreclaimable(zone))
--
1.5.6.5

2009-06-11 10:48:37

by Mel Gorman

Subject: [PATCH 2/3] Do not unconditionally treat zones that fail zone_reclaim() as full

On NUMA machines, the administrator can configure zone_reclaim_mode, which
is a more targeted form of direct reclaim. On machines with large NUMA
distances, for example, zone_reclaim_mode defaults to 1, meaning that clean
unmapped pages will be reclaimed if the zone watermarks are not being
met. The problem is that zone_reclaim() failing at all means the zone
gets marked full.

This can cause situations where a zone is usable, but is being skipped
because it has been considered full. Take a situation where a large tmpfs
mount is occupying a large percentage of memory overall. The pages do not
get cleaned or reclaimed by zone_reclaim(), but the zone gets marked full
and the zonelist cache considers it not worth trying in the future.

This patch makes zone_reclaim() return more fine-grained information about
what occurred when it failed. The zone only gets marked full if
it really is unreclaimable. If the scan did not occur, or if not enough
pages were reclaimed with the limited reclaim_mode, then the zone is
simply skipped.

There is a side-effect to this patch. Currently, if zone_reclaim()
successfully reclaimed SWAP_CLUSTER_MAX, an allocation attempt would
go ahead. With this patch applied, zone watermarks are rechecked after
zone_reclaim() does some work.

Signed-off-by: Mel Gorman <[email protected]>
Reviewed-by: Wu Fengguang <[email protected]>
Reviewed-by: Rik van Riel <[email protected]>
Reviewed-by: KOSAKI Motohiro <[email protected]>
---
mm/internal.h | 4 ++++
mm/page_alloc.c | 26 ++++++++++++++++++++++----
mm/vmscan.c | 11 ++++++-----
3 files changed, 32 insertions(+), 9 deletions(-)

diff --git a/mm/internal.h b/mm/internal.h
index f02c750..f290c4d 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -259,4 +259,8 @@ int __get_user_pages(struct task_struct *tsk, struct mm_struct *mm,
unsigned long start, int len, int flags,
struct page **pages, struct vm_area_struct **vmas);

+#define ZONE_RECLAIM_NOSCAN -2
+#define ZONE_RECLAIM_FULL -1
+#define ZONE_RECLAIM_SOME 0
+#define ZONE_RECLAIM_SUCCESS 1
#endif
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index d35e753..667ffbb 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1477,15 +1477,33 @@ zonelist_scan:
BUILD_BUG_ON(ALLOC_NO_WATERMARKS < NR_WMARK);
if (!(alloc_flags & ALLOC_NO_WATERMARKS)) {
unsigned long mark;
+ int ret;
+
mark = zone->watermark[alloc_flags & ALLOC_WMARK_MASK];
- if (!zone_watermark_ok(zone, order, mark,
- classzone_idx, alloc_flags)) {
- if (!zone_reclaim_mode ||
- !zone_reclaim(zone, gfp_mask, order))
+ if (zone_watermark_ok(zone, order, mark,
+ classzone_idx, alloc_flags))
+ goto try_this_zone;
+
+ if (zone_reclaim_mode == 0)
+ goto this_zone_full;
+
+ ret = zone_reclaim(zone, gfp_mask, order);
+ switch (ret) {
+ case ZONE_RECLAIM_NOSCAN:
+ /* did not scan */
+ goto try_next_zone;
+ case ZONE_RECLAIM_FULL:
+ /* scanned but unreclaimable */
+ goto this_zone_full;
+ default:
+ /* did we reclaim enough */
+ if (!zone_watermark_ok(zone, order, mark,
+ classzone_idx, alloc_flags))
goto this_zone_full;
}
}

+try_this_zone:
page = buffered_rmqueue(preferred_zone, zone, order,
gfp_mask, migratetype);
if (page)
diff --git a/mm/vmscan.c b/mm/vmscan.c
index d832ba8..7b8eb3f 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2465,16 +2465,16 @@ int zone_reclaim(struct zone *zone, gfp_t gfp_mask, unsigned int order)
*/
if (zone_pagecache_reclaimable(zone) <= zone->min_unmapped_pages &&
zone_page_state(zone, NR_SLAB_RECLAIMABLE) <= zone->min_slab_pages)
- return 0;
+ return ZONE_RECLAIM_FULL;

if (zone_is_all_unreclaimable(zone))
- return 0;
+ return ZONE_RECLAIM_FULL;

/*
* Do not scan if the allocation should not be delayed.
*/
if (!(gfp_mask & __GFP_WAIT) || (current->flags & PF_MEMALLOC))
- return 0;
+ return ZONE_RECLAIM_NOSCAN;

/*
* Only run zone reclaim on the local zone or on zones that do not
@@ -2484,10 +2484,11 @@ int zone_reclaim(struct zone *zone, gfp_t gfp_mask, unsigned int order)
*/
node_id = zone_to_nid(zone);
if (node_state(node_id, N_CPU) && node_id != numa_node_id())
- return 0;
+ return ZONE_RECLAIM_NOSCAN;

if (zone_test_and_set_flag(zone, ZONE_RECLAIM_LOCKED))
- return 0;
+ return ZONE_RECLAIM_NOSCAN;
+
ret = __zone_reclaim(zone, gfp_mask, order);
zone_clear_flag(zone, ZONE_RECLAIM_LOCKED);

--
1.5.6.5

2009-06-11 10:48:50

by Mel Gorman

Subject: [PATCH 3/3] Count the number of times zone_reclaim() scans and fails

On NUMA machines, the administrator can configure zone_reclaim_mode, which
is a more targeted form of direct reclaim. On machines with large NUMA
distances, for example, zone_reclaim_mode defaults to 1, meaning that clean
unmapped pages will be reclaimed if the zone watermarks are not being met.

There is a heuristic that determines if the scan is worthwhile, but it is
possible for the heuristic to fail and tie up the CPU scanning
uselessly. Detecting the situation requires some guesswork and experimentation,
so this patch adds a counter "zreclaim_failed" to /proc/vmstat. If this
counter is increasing rapidly during high CPU utilisation, then the resolution
to the problem may be to set /proc/sys/vm/zone_reclaim_mode to 0.

Signed-off-by: Mel Gorman <[email protected]>
Reviewed-by: Rik van Riel <[email protected]>
---
include/linux/vmstat.h | 3 +++
mm/vmscan.c | 3 +++
mm/vmstat.c | 3 +++
3 files changed, 9 insertions(+), 0 deletions(-)

diff --git a/include/linux/vmstat.h b/include/linux/vmstat.h
index ff4696c..416f748 100644
--- a/include/linux/vmstat.h
+++ b/include/linux/vmstat.h
@@ -36,6 +36,9 @@ enum vm_event_item { PGPGIN, PGPGOUT, PSWPIN, PSWPOUT,
FOR_ALL_ZONES(PGSTEAL),
FOR_ALL_ZONES(PGSCAN_KSWAPD),
FOR_ALL_ZONES(PGSCAN_DIRECT),
+#ifdef CONFIG_NUMA
+ PGSCAN_ZONERECLAIM_FAILED,
+#endif
PGINODESTEAL, SLABS_SCANNED, KSWAPD_STEAL, KSWAPD_INODESTEAL,
PAGEOUTRUN, ALLOCSTALL, PGROTATED,
#ifdef CONFIG_HUGETLB_PAGE
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 7b8eb3f..42c1013 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2492,6 +2492,9 @@ int zone_reclaim(struct zone *zone, gfp_t gfp_mask, unsigned int order)
ret = __zone_reclaim(zone, gfp_mask, order);
zone_clear_flag(zone, ZONE_RECLAIM_LOCKED);

+ if (!ret)
+ count_vm_event(PGSCAN_ZONERECLAIM_FAILED);
+
return ret;
}
#endif
diff --git a/mm/vmstat.c b/mm/vmstat.c
index 1e3aa81..02677d1 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -673,6 +673,9 @@ static const char * const vmstat_text[] = {
TEXTS_FOR_ZONES("pgscan_kswapd")
TEXTS_FOR_ZONES("pgscan_direct")

+#ifdef CONFIG_NUMA
+ "zreclaim_failed",
+#endif
"pginodesteal",
"slabs_scanned",
"kswapd_steal",
--
1.5.6.5

2009-06-11 11:33:18

by KOSAKI Motohiro

Subject: Re: [PATCH 3/3] Count the number of times zone_reclaim() scans and fails

> On NUMA machines, the administrator can configure zone_reclaim_mode that
> is a more targetted form of direct reclaim. On machines with large NUMA
> distances for example, a zone_reclaim_mode defaults to 1 meaning that clean
> unmapped pages will be reclaimed if the zone watermarks are not being met.
>
> There is a heuristic that determines if the scan is worthwhile but it is
> possible that the heuristic will fail and the CPU gets tied up scanning
> uselessly. Detecting the situation requires some guesswork and experimentation
> so this patch adds a counter "zreclaim_failed" to /proc/vmstat. If during
> high CPU utilisation this counter is increasing rapidly, then the resolution
> to the problem may be to set /proc/sys/vm/zone_reclaim_mode to 0.
>
> Signed-off-by: Mel Gorman <[email protected]>
> Reviewed-by: Rik van Riel <[email protected]>
> ---
> include/linux/vmstat.h | 3 +++
> mm/vmscan.c | 3 +++
> mm/vmstat.c | 3 +++
> 3 files changed, 9 insertions(+), 0 deletions(-)
>
> diff --git a/include/linux/vmstat.h b/include/linux/vmstat.h
> index ff4696c..416f748 100644
> --- a/include/linux/vmstat.h
> +++ b/include/linux/vmstat.h
> @@ -36,6 +36,9 @@ enum vm_event_item { PGPGIN, PGPGOUT, PSWPIN, PSWPOUT,
> FOR_ALL_ZONES(PGSTEAL),
> FOR_ALL_ZONES(PGSCAN_KSWAPD),
> FOR_ALL_ZONES(PGSCAN_DIRECT),
> +#ifdef CONFIG_NUMA
> + PGSCAN_ZONERECLAIM_FAILED,
> +#endif
> PGINODESTEAL, SLABS_SCANNED, KSWAPD_STEAL, KSWAPD_INODESTEAL,
> PAGEOUTRUN, ALLOCSTALL, PGROTATED,
> #ifdef CONFIG_HUGETLB_PAGE
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 7b8eb3f..42c1013 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -2492,6 +2492,9 @@ int zone_reclaim(struct zone *zone, gfp_t gfp_mask, unsigned int order)
> ret = __zone_reclaim(zone, gfp_mask, order);
> zone_clear_flag(zone, ZONE_RECLAIM_LOCKED);
>
> + if (!ret)
> + count_vm_event(PGSCAN_ZONERECLAIM_FAILED);
> +
> return ret;
> }
> #endif
> diff --git a/mm/vmstat.c b/mm/vmstat.c
> index 1e3aa81..02677d1 100644
> --- a/mm/vmstat.c
> +++ b/mm/vmstat.c
> @@ -673,6 +673,9 @@ static const char * const vmstat_text[] = {
> TEXTS_FOR_ZONES("pgscan_kswapd")
> TEXTS_FOR_ZONES("pgscan_direct")
>
> +#ifdef CONFIG_NUMA
> + "zreclaim_failed",
> +#endif
> "pginodesteal",
> "slabs_scanned",
> "kswapd_steal",

Looks good.
Reviewed-by: KOSAKI Motohiro <[email protected]>


2009-06-11 11:37:24

by KOSAKI Motohiro

Subject: Re: [PATCH 1/3] Properly account for the number of page cache pages zone_reclaim() can reclaim

> On NUMA machines, the administrator can configure zone_reclaim_mode that
> is a more targetted form of direct reclaim. On machines with large NUMA
> distances for example, a zone_reclaim_mode defaults to 1 meaning that clean
> unmapped pages will be reclaimed if the zone watermarks are not being met.
>
> There is a heuristic that determines if the scan is worthwhile but the problem
> is that the heuristic is not being properly applied and is basically assuming
> zone_reclaim_mode is 1 if it is enabled. The lack of proper detection can
> manfiest as high CPU usage as the LRU list is scanned uselessly.
>
> Historically, once enabled it was depending on NR_FILE_PAGES which may
> include swapcache pages that the reclaim_mode cannot deal with. Patch
> vmscan-change-the-number-of-the-unmapped-files-in-zone-reclaim.patch by
> Kosaki Motohiro noted that zone_page_state(zone, NR_FILE_PAGES) included
> pages that were not file-backed such as swapcache and made a calculation
> based on the inactive, active and mapped files. This is far superior
> when zone_reclaim==1 but if RECLAIM_SWAP is set, then NR_FILE_PAGES is a
> reasonable starting figure.
>
> This patch alters how zone_reclaim() works out how many pages it might be
> able to reclaim given the current reclaim_mode. If RECLAIM_SWAP is set
> in the reclaim_mode it will either consider NR_FILE_PAGES as potential
> candidates or else use NR_{IN}ACTIVE}_PAGES-NR_FILE_MAPPED to discount
> swapcache and other non-file-backed pages. If RECLAIM_WRITE is not set,
> then NR_FILE_DIRTY number of pages are not candidates. If RECLAIM_SWAP is
> not set, then NR_FILE_MAPPED are not.
>
> [mmotm note: This patch should be merged with or replace
> vmscan-change-the-number-of-the-unmapped-files-in-zone-reclaim. Kosaki?]
>
> [[email protected]: Estimate unmapped pages minus tmpfs pages]
> [[email protected]: Fix underflow problem in Kosaki's estimate]
> Signed-off-by: Mel Gorman <[email protected]>
> Reviewed-by: Rik van Riel <[email protected]>
> Acked-by: Christoph Lameter <[email protected]>
> ---
> mm/vmscan.c | 55 +++++++++++++++++++++++++++++++++++++++++--------------
> 1 files changed, 41 insertions(+), 14 deletions(-)
>
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 2ddcfc8..d832ba8 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -2333,6 +2333,44 @@ int sysctl_min_unmapped_ratio = 1;
> */
> int sysctl_min_slab_ratio = 5;
>
> +static inline unsigned long zone_unmapped_file_pages(struct zone *zone)
> +{
> + unsigned long file_mapped = zone_page_state(zone, NR_FILE_MAPPED);
> + unsigned long file_lru = zone_page_state(zone, NR_INACTIVE_FILE) +
> + zone_page_state(zone, NR_ACTIVE_FILE);
> +
> + /*
> + * It's possible for there to be more file mapped pages than
> + * accounted for by the pages on the file LRU lists because
> + * tmpfs pages accounted for as ANON can also be FILE_MAPPED
> + */
> + return (file_lru > file_mapped) ? (file_lru - file_mapped) : 0;
> +}
> +
> +/* Work out how many page cache pages we can reclaim in this reclaim_mode */
> +static long zone_pagecache_reclaimable(struct zone *zone)
> +{
> + long nr_pagecache_reclaimable;
> + long delta = 0;
> +
> + /*
> + * If RECLAIM_SWAP is set, then all file pages are considered
> + * potentially reclaimable. Otherwise, we have to worry about
> + * pages like swapcache and zone_unmapped_file_pages() provides
> + * a better estimate
> + */
> + if (zone_reclaim_mode & RECLAIM_SWAP)
> + nr_pagecache_reclaimable = zone_page_state(zone, NR_FILE_PAGES);
> + else
> + nr_pagecache_reclaimable = zone_unmapped_file_pages(zone);
> +
> + /* If we can't clean pages, remove dirty pages from consideration */
> + if (!(zone_reclaim_mode & RECLAIM_WRITE))
> + delta += zone_page_state(zone, NR_FILE_DIRTY);

delta is computed but never used?

> +
> + return nr_pagecache_reclaimable;
> +}
> +
> /*
> * Try to free up some pages from this zone through reclaim.
> */
> @@ -2355,7 +2393,6 @@ static int __zone_reclaim(struct zone *zone, gfp_t gfp_mask, unsigned int order)
> .isolate_pages = isolate_pages_global,
> };
> unsigned long slab_reclaimable;
> - long nr_unmapped_file_pages;
>
> disable_swap_token();
> cond_resched();
> @@ -2368,11 +2405,7 @@ static int __zone_reclaim(struct zone *zone, gfp_t gfp_mask, unsigned int order)
> reclaim_state.reclaimed_slab = 0;
> p->reclaim_state = &reclaim_state;
>
> - nr_unmapped_file_pages = zone_page_state(zone, NR_INACTIVE_FILE) +
> - zone_page_state(zone, NR_ACTIVE_FILE) -
> - zone_page_state(zone, NR_FILE_MAPPED);
> -
> - if (nr_unmapped_file_pages > zone->min_unmapped_pages) {
> + if (zone_pagecache_reclaimable(zone) > zone->min_unmapped_pages) {

Documentation/sysctl/vm.txt says
=============================================================

min_unmapped_ratio:

This is available only on NUMA kernels.

A percentage of the total pages in each zone. Zone reclaim will only
occur if more than this percentage of pages are file backed and unmapped.
This is to insure that a minimal amount of local pages is still available for
file I/O even if the node is overallocated.

The default is 1 percent.

==============================================================

but your code considers additional things. Can you please update the document too?


> /*
> * Free memory by calling shrink zone with increasing
> * priorities until we have enough memory freed.
> @@ -2419,8 +2452,6 @@ int zone_reclaim(struct zone *zone, gfp_t gfp_mask, unsigned int order)
> {
> int node_id;
> int ret;
> - long nr_unmapped_file_pages;
> - long nr_slab_reclaimable;
>
> /*
> * Zone reclaim reclaims unmapped file backed pages and
> @@ -2432,12 +2463,8 @@ int zone_reclaim(struct zone *zone, gfp_t gfp_mask, unsigned int order)
> * if less than a specified percentage of the zone is used by
> * unmapped file backed pages.
> */
> - nr_unmapped_file_pages = zone_page_state(zone, NR_INACTIVE_FILE) +
> - zone_page_state(zone, NR_ACTIVE_FILE) -
> - zone_page_state(zone, NR_FILE_MAPPED);
> - nr_slab_reclaimable = zone_page_state(zone, NR_SLAB_RECLAIMABLE);
> - if (nr_unmapped_file_pages <= zone->min_unmapped_pages &&
> - nr_slab_reclaimable <= zone->min_slab_pages)
> + if (zone_pagecache_reclaimable(zone) <= zone->min_unmapped_pages &&
> + zone_page_state(zone, NR_SLAB_RECLAIMABLE) <= zone->min_slab_pages)
> return 0;
>
> if (zone_is_all_unreclaimable(zone))
> --
> 1.5.6.5
>


2009-06-11 13:49:16

by Christoph Lameter

Subject: Re: [PATCH 2/3] Do not unconditionally treat zones that fail zone_reclaim() as full

It needs to be mentioned that this fixes a bug introduced in 2.6.19.
Possibly a portion of this code needs to be backported to stable.

On Thu, 11 Jun 2009, Mel Gorman wrote:

> On NUMA machines, the administrator can configure zone_reclaim_mode that
> is a more targetted form of direct reclaim. On machines with large NUMA
> distances for example, a zone_reclaim_mode defaults to 1 meaning that clean
> unmapped pages will be reclaimed if the zone watermarks are not being
> met. The problem is that zone_reclaim() failing at all means the zone
> gets marked full.
>
> This can cause situations where a zone is usable, but is being skipped
> because it has been considered full. Take a situation where a large tmpfs
> mount is occuping a large percentage of memory overall. The pages do not
> get cleaned or reclaimed by zone_reclaim(), but the zone gets marked full
> and the zonelist cache considers them not worth trying in the future.
>
> This patch makes zone_reclaim() return more fine-grained information about
> what occured when zone_reclaim() failued. The zone only gets marked full if
> it really is unreclaimable. If it's a case that the scan did not occur or
> if enough pages were not reclaimed with the limited reclaim_mode, then the
> zone is simply skipped.
>
> There is a side-effect to this patch. Currently, if zone_reclaim()
> successfully reclaimed SWAP_CLUSTER_MAX, an allocation attempt would
> go ahead. With this patch applied, zone watermarks are rechecked after
> zone_reclaim() does some work.
>
> Signed-off-by: Mel Gorman <[email protected]>
> Reviewed-by: Wu Fengguang <[email protected]>
> Reviewed-by: Rik van Riel <[email protected]>
> Reviewed-by: KOSAKI Motohiro <[email protected]>
> ---
> mm/internal.h | 4 ++++
> mm/page_alloc.c | 26 ++++++++++++++++++++++----
> mm/vmscan.c | 11 ++++++-----
> 3 files changed, 32 insertions(+), 9 deletions(-)
>
> diff --git a/mm/internal.h b/mm/internal.h
> index f02c750..f290c4d 100644
> --- a/mm/internal.h
> +++ b/mm/internal.h
> @@ -259,4 +259,8 @@ int __get_user_pages(struct task_struct *tsk, struct mm_struct *mm,
> unsigned long start, int len, int flags,
> struct page **pages, struct vm_area_struct **vmas);
>
> +#define ZONE_RECLAIM_NOSCAN -2
> +#define ZONE_RECLAIM_FULL -1
> +#define ZONE_RECLAIM_SOME 0
> +#define ZONE_RECLAIM_SUCCESS 1
> #endif
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index d35e753..667ffbb 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -1477,15 +1477,33 @@ zonelist_scan:
> BUILD_BUG_ON(ALLOC_NO_WATERMARKS < NR_WMARK);
> if (!(alloc_flags & ALLOC_NO_WATERMARKS)) {
> unsigned long mark;
> + int ret;
> +
> mark = zone->watermark[alloc_flags & ALLOC_WMARK_MASK];
> - if (!zone_watermark_ok(zone, order, mark,
> - classzone_idx, alloc_flags)) {
> - if (!zone_reclaim_mode ||
> - !zone_reclaim(zone, gfp_mask, order))
> + if (zone_watermark_ok(zone, order, mark,
> + classzone_idx, alloc_flags))
> + goto try_this_zone;
> +
> + if (zone_reclaim_mode == 0)
> + goto this_zone_full;
> +
> + ret = zone_reclaim(zone, gfp_mask, order);
> + switch (ret) {
> + case ZONE_RECLAIM_NOSCAN:
> + /* did not scan */
> + goto try_next_zone;
> + case ZONE_RECLAIM_FULL:
> + /* scanned but unreclaimable */
> + goto this_zone_full;
> + default:
> + /* did we reclaim enough */
> + if (!zone_watermark_ok(zone, order, mark,
> + classzone_idx, alloc_flags))
> goto this_zone_full;
> }
> }
>
> +try_this_zone:
> page = buffered_rmqueue(preferred_zone, zone, order,
> gfp_mask, migratetype);
> if (page)
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index d832ba8..7b8eb3f 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -2465,16 +2465,16 @@ int zone_reclaim(struct zone *zone, gfp_t gfp_mask, unsigned int order)
> */
> if (zone_pagecache_reclaimable(zone) <= zone->min_unmapped_pages &&
> zone_page_state(zone, NR_SLAB_RECLAIMABLE) <= zone->min_slab_pages)
> - return 0;
> + return ZONE_RECLAIM_FULL;
>
> if (zone_is_all_unreclaimable(zone))
> - return 0;
> + return ZONE_RECLAIM_FULL;
>
> /*
> * Do not scan if the allocation should not be delayed.
> */
> if (!(gfp_mask & __GFP_WAIT) || (current->flags & PF_MEMALLOC))
> - return 0;
> + return ZONE_RECLAIM_NOSCAN;
>
> /*
> * Only run zone reclaim on the local zone or on zones that do not
> @@ -2484,10 +2484,11 @@ int zone_reclaim(struct zone *zone, gfp_t gfp_mask, unsigned int order)
> */
> node_id = zone_to_nid(zone);
> if (node_state(node_id, N_CPU) && node_id != numa_node_id())
> - return 0;
> + return ZONE_RECLAIM_NOSCAN;
>
> if (zone_test_and_set_flag(zone, ZONE_RECLAIM_LOCKED))
> - return 0;
> + return ZONE_RECLAIM_NOSCAN;
> +
> ret = __zone_reclaim(zone, gfp_mask, order);
> zone_clear_flag(zone, ZONE_RECLAIM_LOCKED);
>
>

2009-06-11 23:30:58

by Andrew Morton

Subject: Re: [PATCH 0/3] Fix malloc() stall in zone_reclaim() and bring behaviour more in line with expectations V3

On Thu, 11 Jun 2009 11:47:50 +0100
Mel Gorman <[email protected]> wrote:

> The big change with this release is that the patch reintroducing
> zone_reclaim_interval has been dropped as Ram reports the malloc() stalls
> have been resolved. If this bug occurs again, the counter will be there to
> help us identify the situation.

What is the exact relationship between this work and the somewhat
mangled "[PATCH for mmotm 0/5] introduce swap-backed-file-mapped count
and fix
vmscan-change-the-number-of-the-unmapped-files-in-zone-reclaim.patch"
series?

That five-patch series had me thinking that it was time to drop

vmscan-change-the-number-of-the-unmapped-files-in-zone-reclaim.patch
vmscan-drop-pf_swapwrite-from-zone_reclaim.patch
vmscan-zone_reclaim-use-may_swap.patch

(they can be removed cleanly, but I haven't tried compiling the result)

but your series is based on those.

We have 142 MM patches queued, and we need to merge next week.

2009-06-12 10:17:45

by Mel Gorman

Subject: Re: [PATCH 1/3] Properly account for the number of page cache pages zone_reclaim() can reclaim

On Thu, Jun 11, 2009 at 08:37:04PM +0900, KOSAKI Motohiro wrote:
> > On NUMA machines, the administrator can configure zone_reclaim_mode that
> > is a more targetted form of direct reclaim. On machines with large NUMA
> > distances for example, a zone_reclaim_mode defaults to 1 meaning that clean
> > unmapped pages will be reclaimed if the zone watermarks are not being met.
> >
> > There is a heuristic that determines if the scan is worthwhile but the problem
> > is that the heuristic is not being properly applied and is basically assuming
> > zone_reclaim_mode is 1 if it is enabled. The lack of proper detection can
> > manfiest as high CPU usage as the LRU list is scanned uselessly.
> >
> > Historically, once enabled it was depending on NR_FILE_PAGES which may
> > include swapcache pages that the reclaim_mode cannot deal with. Patch
> > vmscan-change-the-number-of-the-unmapped-files-in-zone-reclaim.patch by
> > Kosaki Motohiro noted that zone_page_state(zone, NR_FILE_PAGES) included
> > pages that were not file-backed such as swapcache and made a calculation
> > based on the inactive, active and mapped files. This is far superior
> > when zone_reclaim==1 but if RECLAIM_SWAP is set, then NR_FILE_PAGES is a
> > reasonable starting figure.
> >
> > This patch alters how zone_reclaim() works out how many pages it might be
> > able to reclaim given the current reclaim_mode. If RECLAIM_SWAP is set
> > in the reclaim_mode, it will consider NR_FILE_PAGES as potential
> > candidates; otherwise it uses NR_{INACTIVE,ACTIVE}_FILE - NR_FILE_MAPPED
> > to discount swapcache and other non-file-backed pages. If RECLAIM_WRITE is not set,
> > then NR_FILE_DIRTY number of pages are not candidates. If RECLAIM_SWAP is
> > not set, then NR_FILE_MAPPED are not.
> >
> > [mmotm note: This patch should be merged with or replace
> > vmscan-change-the-number-of-the-unmapped-files-in-zone-reclaim. Kosaki?]
> >
> > [[email protected]: Estimate unmapped pages minus tmpfs pages]
> > [[email protected]: Fix underflow problem in Kosaki's estimate]
> > Signed-off-by: Mel Gorman <[email protected]>
> > Reviewed-by: Rik van Riel <[email protected]>
> > Acked-by: Christoph Lameter <[email protected]>
> > ---
> > mm/vmscan.c | 55 +++++++++++++++++++++++++++++++++++++++++--------------
> > 1 files changed, 41 insertions(+), 14 deletions(-)
> >
> > diff --git a/mm/vmscan.c b/mm/vmscan.c
> > index 2ddcfc8..d832ba8 100644
> > --- a/mm/vmscan.c
> > +++ b/mm/vmscan.c
> > @@ -2333,6 +2333,44 @@ int sysctl_min_unmapped_ratio = 1;
> > */
> > int sysctl_min_slab_ratio = 5;
> >
> > +static inline unsigned long zone_unmapped_file_pages(struct zone *zone)
> > +{
> > + unsigned long file_mapped = zone_page_state(zone, NR_FILE_MAPPED);
> > + unsigned long file_lru = zone_page_state(zone, NR_INACTIVE_FILE) +
> > + zone_page_state(zone, NR_ACTIVE_FILE);
> > +
> > + /*
> > + * It's possible for there to be more file mapped pages than
> > + * accounted for by the pages on the file LRU lists because
> > + * tmpfs pages accounted for as ANON can also be FILE_MAPPED
> > + */
> > + return (file_lru > file_mapped) ? (file_lru - file_mapped) : 0;
> > +}
> > +
> > +/* Work out how many page cache pages we can reclaim in this reclaim_mode */
> > +static long zone_pagecache_reclaimable(struct zone *zone)
> > +{
> > + long nr_pagecache_reclaimable;
> > + long delta = 0;
> > +
> > + /*
> > + * If RECLAIM_SWAP is set, then all file pages are considered
> > + * potentially reclaimable. Otherwise, we have to worry about
> > + * pages like swapcache and zone_unmapped_file_pages() provides
> > + * a better estimate
> > + */
> > + if (zone_reclaim_mode & RECLAIM_SWAP)
> > + nr_pagecache_reclaimable = zone_page_state(zone, NR_FILE_PAGES);
> > + else
> > + nr_pagecache_reclaimable = zone_unmapped_file_pages(zone);
> > +
> > + /* If we can't clean pages, remove dirty pages from consideration */
> > + if (!(zone_reclaim_mode & RECLAIM_WRITE))
> > + delta += zone_page_state(zone, NR_FILE_DIRTY);
>
> no use delta?
>

delta was used twice in an interim version when it was possible to underflow
the counter. I left it as is because if another counter is added that must
be subtracted from nr_pagecache_reclaimable, it'll be tidier to patch in with
delta already there. I can take it out if you prefer.
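For readers following along, the interim-version subtraction Mel is describing amounts to something like the following user-space sketch. This is illustrative only: the RECLAIM_* values mirror the kernel's flag bits, but the function shape and page counts are invented, not the kernel code.

```c
#include <assert.h>

/* Illustrative reclaim_mode bits, mirroring the kernel's RECLAIM_* values */
#define RECLAIM_WRITE 2
#define RECLAIM_SWAP  4

/*
 * Sketch of the interim logic: pick the candidate-page estimate for the
 * given reclaim_mode, accumulate non-candidates into delta, then
 * subtract delta behind an underflow guard.
 */
static long pagecache_reclaimable(int reclaim_mode, long file_pages,
                                  long unmapped_file_pages, long dirty_pages)
{
    long nr = (reclaim_mode & RECLAIM_SWAP) ? file_pages
                                            : unmapped_file_pages;
    long delta = 0;

    /* If we can't clean pages, dirty pages are not candidates */
    if (!(reclaim_mode & RECLAIM_WRITE))
        delta += dirty_pages;

    /* Beware of double accounting: only subtract if it cannot underflow */
    if (delta < nr)
        nr -= delta;

    return nr;
}
```

With zone_reclaim_mode == 1, the dirty pages are discounted from the unmapped-file estimate; with RECLAIM_SWAP and RECLAIM_WRITE both set, all file pages remain candidates.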

> > +
> > + return nr_pagecache_reclaimable;
> > +}
> > +
> > /*
> > * Try to free up some pages from this zone through reclaim.
> > */
> > @@ -2355,7 +2393,6 @@ static int __zone_reclaim(struct zone *zone, gfp_t gfp_mask, unsigned int order)
> > .isolate_pages = isolate_pages_global,
> > };
> > unsigned long slab_reclaimable;
> > - long nr_unmapped_file_pages;
> >
> > disable_swap_token();
> > cond_resched();
> > @@ -2368,11 +2405,7 @@ static int __zone_reclaim(struct zone *zone, gfp_t gfp_mask, unsigned int order)
> > reclaim_state.reclaimed_slab = 0;
> > p->reclaim_state = &reclaim_state;
> >
> > - nr_unmapped_file_pages = zone_page_state(zone, NR_INACTIVE_FILE) +
> > - zone_page_state(zone, NR_ACTIVE_FILE) -
> > - zone_page_state(zone, NR_FILE_MAPPED);
> > -
> > - if (nr_unmapped_file_pages > zone->min_unmapped_pages) {
> > + if (zone_pagecache_reclaimable(zone) > zone->min_unmapped_pages) {
>
> Documentation/sysctl/vm.txt says
> =============================================================
>
> min_unmapped_ratio:
>
> This is available only on NUMA kernels.
>
> A percentage of the total pages in each zone. Zone reclaim will only
> occur if more than this percentage of pages are file backed and unmapped.
> This is to insure that a minimal amount of local pages is still available for
> file I/O even if the node is overallocated.
>
> The default is 1 percent.
>
> ==============================================================
>
> but your code considers additional things. Can you please change the document too?
>

How does this look?

==============================================================
min_unmapped_ratio:

This is available only on NUMA kernels.

This is a percentage of the total pages in each zone. Zone reclaim will only
occur if more than this percentage are in a state that zone_reclaim_mode
allows to be reclaimed.

If zone_reclaim_mode has the value 4 OR'd, then the percentage is compared
against all file-backed unmapped pages including swapcache pages and tmpfs
files. Otherwise, only unmapped pages backed by normal files but not tmpfs
files and similar are considered.

The default is 1 percent.
==============================================================
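To make the percentage concrete: it is converted into a per-zone page threshold, which zone_reclaim() compares against the reclaimable estimate. The sketch below is a simplified user-space model with an invented zone size, not the kernel's actual sysctl handler:

```c
#include <assert.h>

/*
 * Sketch of how the min_unmapped_ratio percentage becomes the per-zone
 * min_unmapped_pages threshold. zone_pages stands in for the zone's
 * total page count; the kernel recomputes this when the sysctl changes.
 */
static unsigned long min_unmapped_pages(unsigned long zone_pages,
                                        unsigned int ratio_percent)
{
    return zone_pages * ratio_percent / 100;
}

/* zone_reclaim() only scans if the reclaimable estimate exceeds the threshold */
static int zone_reclaim_worthwhile(long reclaimable, unsigned long threshold)
{
    return reclaimable > (long)threshold;
}
```

For a 1GB zone of 4K pages (262144 pages) at the default 1 percent, the threshold is 2621 pages.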

>
> > /*
> > * Free memory by calling shrink zone with increasing
> > * priorities until we have enough memory freed.
> > @@ -2419,8 +2452,6 @@ int zone_reclaim(struct zone *zone, gfp_t gfp_mask, unsigned int order)
> > {
> > int node_id;
> > int ret;
> > - long nr_unmapped_file_pages;
> > - long nr_slab_reclaimable;
> >
> > /*
> > * Zone reclaim reclaims unmapped file backed pages and
> > @@ -2432,12 +2463,8 @@ int zone_reclaim(struct zone *zone, gfp_t gfp_mask, unsigned int order)
> > * if less than a specified percentage of the zone is used by
> > * unmapped file backed pages.
> > */
> > - nr_unmapped_file_pages = zone_page_state(zone, NR_INACTIVE_FILE) +
> > - zone_page_state(zone, NR_ACTIVE_FILE) -
> > - zone_page_state(zone, NR_FILE_MAPPED);
> > - nr_slab_reclaimable = zone_page_state(zone, NR_SLAB_RECLAIMABLE);
> > - if (nr_unmapped_file_pages <= zone->min_unmapped_pages &&
> > - nr_slab_reclaimable <= zone->min_slab_pages)
> > + if (zone_pagecache_reclaimable(zone) <= zone->min_unmapped_pages &&
> > + zone_page_state(zone, NR_SLAB_RECLAIMABLE) <= zone->min_slab_pages)
> > return 0;
> >
> > if (zone_is_all_unreclaimable(zone))
> > --
> > 1.5.6.5
> >
>
>
>

--
Mel Gorman
Part-time Phd Student Linux Technology Center
University of Limerick IBM Dublin Software Lab

2009-06-12 10:36:27

by Mel Gorman

Subject: Re: [PATCH 2/3] Do not unconditionally treat zones that fail zone_reclaim() as full

On Thu, Jun 11, 2009 at 09:48:53AM -0400, Christoph Lameter wrote:
> It needs to be mentioned that this fixes a bug introduced in 2.6.19.
> Possibly a portion of this code needs to be backported to stable.
>

Andrew has sucked up the patch already so I can't patch it. Andrew, there
is a further note below on the patch if you'd like to pick it up.

On the stable front, I think that patches 1 and 2 should be considered
-stable candidates. Patch 1 is certainly needed because it fixes up the
malloc() stall and should be picked up by distro kernels as well. This patch
closes another obvious hole albeit one harder to trigger.

Ideally patch 3 would also be in -stable so distro kernels will suck it up
as it will help identify this problem in the field if it occurs again but
I'm not sure what the -stable policy on such things is.

> On Thu, 11 Jun 2009, Mel Gorman wrote:
>
> > On NUMA machines, the administrator can configure zone_reclaim_mode that
> > is a more targeted form of direct reclaim. On machines with large NUMA
> > distances for example, zone_reclaim_mode defaults to 1, meaning that clean
> > unmapped pages will be reclaimed if the zone watermarks are not being
> > met. The problem is that zone_reclaim() failing at all means the zone
> > gets marked full.
> >
> > This can cause situations where a zone is usable, but is being skipped
> > because it has been considered full. Take a situation where a large tmpfs
> > mount is occupying a large percentage of memory overall. The pages do not
> > get cleaned or reclaimed by zone_reclaim(), but the zone gets marked full
> > and the zonelist cache considers them not worth trying in the future.
> >
> > This patch makes zone_reclaim() return more fine-grained information about
> > what occurred when zone_reclaim() failed. The zone only gets marked full if
> > it really is unreclaimable. If it's a case that the scan did not occur or
> > if enough pages were not reclaimed with the limited reclaim_mode, then the
> > zone is simply skipped.
> >
> > There is a side-effect to this patch. Currently, if zone_reclaim()
> > successfully reclaimed SWAP_CLUSTER_MAX, an allocation attempt would
> > go ahead. With this patch applied, zone watermarks are rechecked after
> > zone_reclaim() does some work.

(Noted by Christoph)

This bug was introduced by commit 9276b1bc96a132f4068fdee00983c532f43d3a26
way back in 2.6.19 when the zonelist_cache was introduced. It was not intended
that zone_reclaim() aggressively consider the zone to be full when it failed,
as full direct reclaim can still be an option. Due to the age of the bug,
it should be considered a -stable candidate.

> >
> > Signed-off-by: Mel Gorman <[email protected]>
> > Reviewed-by: Wu Fengguang <[email protected]>
> > Reviewed-by: Rik van Riel <[email protected]>
> > Reviewed-by: KOSAKI Motohiro <[email protected]>
> > ---
> > mm/internal.h | 4 ++++
> > mm/page_alloc.c | 26 ++++++++++++++++++++++----
> > mm/vmscan.c | 11 ++++++-----
> > 3 files changed, 32 insertions(+), 9 deletions(-)
> >
> > diff --git a/mm/internal.h b/mm/internal.h
> > index f02c750..f290c4d 100644
> > --- a/mm/internal.h
> > +++ b/mm/internal.h
> > @@ -259,4 +259,8 @@ int __get_user_pages(struct task_struct *tsk, struct mm_struct *mm,
> > unsigned long start, int len, int flags,
> > struct page **pages, struct vm_area_struct **vmas);
> >
> > +#define ZONE_RECLAIM_NOSCAN -2
> > +#define ZONE_RECLAIM_FULL -1
> > +#define ZONE_RECLAIM_SOME 0
> > +#define ZONE_RECLAIM_SUCCESS 1
> > #endif
> > diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> > index d35e753..667ffbb 100644
> > --- a/mm/page_alloc.c
> > +++ b/mm/page_alloc.c
> > @@ -1477,15 +1477,33 @@ zonelist_scan:
> > BUILD_BUG_ON(ALLOC_NO_WATERMARKS < NR_WMARK);
> > if (!(alloc_flags & ALLOC_NO_WATERMARKS)) {
> > unsigned long mark;
> > + int ret;
> > +
> > mark = zone->watermark[alloc_flags & ALLOC_WMARK_MASK];
> > - if (!zone_watermark_ok(zone, order, mark,
> > - classzone_idx, alloc_flags)) {
> > - if (!zone_reclaim_mode ||
> > - !zone_reclaim(zone, gfp_mask, order))
> > + if (zone_watermark_ok(zone, order, mark,
> > + classzone_idx, alloc_flags))
> > + goto try_this_zone;
> > +
> > + if (zone_reclaim_mode == 0)
> > + goto this_zone_full;
> > +
> > + ret = zone_reclaim(zone, gfp_mask, order);
> > + switch (ret) {
> > + case ZONE_RECLAIM_NOSCAN:
> > + /* did not scan */
> > + goto try_next_zone;
> > + case ZONE_RECLAIM_FULL:
> > + /* scanned but unreclaimable */
> > + goto this_zone_full;
> > + default:
> > + /* did we reclaim enough */
> > + if (!zone_watermark_ok(zone, order, mark,
> > + classzone_idx, alloc_flags))
> > goto this_zone_full;
> > }
> > }
> >
> > +try_this_zone:
> > page = buffered_rmqueue(preferred_zone, zone, order,
> > gfp_mask, migratetype);
> > if (page)
> > diff --git a/mm/vmscan.c b/mm/vmscan.c
> > index d832ba8..7b8eb3f 100644
> > --- a/mm/vmscan.c
> > +++ b/mm/vmscan.c
> > @@ -2465,16 +2465,16 @@ int zone_reclaim(struct zone *zone, gfp_t gfp_mask, unsigned int order)
> > */
> > if (zone_pagecache_reclaimable(zone) <= zone->min_unmapped_pages &&
> > zone_page_state(zone, NR_SLAB_RECLAIMABLE) <= zone->min_slab_pages)
> > - return 0;
> > + return ZONE_RECLAIM_FULL;
> >
> > if (zone_is_all_unreclaimable(zone))
> > - return 0;
> > + return ZONE_RECLAIM_FULL;
> >
> > /*
> > * Do not scan if the allocation should not be delayed.
> > */
> > if (!(gfp_mask & __GFP_WAIT) || (current->flags & PF_MEMALLOC))
> > - return 0;
> > + return ZONE_RECLAIM_NOSCAN;
> >
> > /*
> > * Only run zone reclaim on the local zone or on zones that do not
> > @@ -2484,10 +2484,11 @@ int zone_reclaim(struct zone *zone, gfp_t gfp_mask, unsigned int order)
> > */
> > node_id = zone_to_nid(zone);
> > if (node_state(node_id, N_CPU) && node_id != numa_node_id())
> > - return 0;
> > + return ZONE_RECLAIM_NOSCAN;
> >
> > if (zone_test_and_set_flag(zone, ZONE_RECLAIM_LOCKED))
> > - return 0;
> > + return ZONE_RECLAIM_NOSCAN;
> > +
> > ret = __zone_reclaim(zone, gfp_mask, order);
> > zone_clear_flag(zone, ZONE_RECLAIM_LOCKED);
> >
> >
>

--
Mel Gorman
Part-time Phd Student Linux Technology Center
University of Limerick IBM Dublin Software Lab

2009-06-12 11:04:33

by Mel Gorman

Subject: Re: [PATCH 0/3] Fix malloc() stall in zone_reclaim() and bring behaviour more in line with expectations V3

On Thu, Jun 11, 2009 at 04:30:06PM -0700, Andrew Morton wrote:
> On Thu, 11 Jun 2009 11:47:50 +0100
> Mel Gorman <[email protected]> wrote:
>
> > The big change with this release is that the patch reintroducing
> > zone_reclaim_interval has been dropped as Ram reports the malloc() stalls
> > have been resolved. If this bug occurs again, the counter will be there to
> > help us identify the situation.
>
> What is the exact relationship between this work and the somewhat
> mangled "[PATCH for mmotm 0/5] introduce swap-backed-file-mapped count
> and fix
> vmscan-change-the-number-of-the-unmapped-files-in-zone-reclaim.patch"
> series?
>

The patch series "Fix malloc() stall in zone_reclaim() and bring
behaviour more in line with expectations V3" replaces
vmscan-change-the-number-of-the-unmapped-files-in-zone-reclaim.patch.

Portions of the patch series "Introduce swap-backed-file-mapped count" are
potentially follow-on work if a failure case can be identified. The series
brings the kernel behaviour more in line with documentation, but it's easier
to fix the documentation.

> That five-patch series had me thinking that it was time to drop
>
> vmscan-change-the-number-of-the-unmapped-files-in-zone-reclaim.patch

This patch gets replaced. All the lessons in the new patch are included.
They could be merged together.

> vmscan-drop-pf_swapwrite-from-zone_reclaim.patch

This patch is wrong, but only sort of. It should be dropped or replaced with
another version. Kosaki, could you resubmit this patch except that you check
if RECLAIM_SWAP is set in zone_reclaim_mode when deciding whether to set
PF_SWAPWRITE or not please?

Your patch is correct if zone_reclaim_mode is 1, but incorrect if it's 7, for
example.

> vmscan-zone_reclaim-use-may_swap.patch
>

This is a tricky one. Kosaki, I think this patch is a little dangerous. With
this applied, pages get unmapped whether RECLAIM_SWAP is set or not. This
means that zone_reclaim() now has more work to do when it's enabled and it
incurs a number of minor faults for no reason as a result of trying to avoid
going off-node. I don't believe that is desirable because it would manifest
as high minor fault counts on NUMA and would be difficult to pin down why
that was happening.

I think the code makes more sense than the documentation and it's the
documentation that should be fixed. Our current behaviour is to discard
clean, swap-backed, unmapped pages that require no further IO. This is
reasonable behaviour for zone_reclaim_mode == 1 so maybe the patch
should change the documentation to

1 = Zone reclaim discards clean unmapped disk-backed pages
2 = Zone reclaim writes dirty pages out
4 = Zone reclaim unmaps and swaps pages

If you really wanted to be strict about the meaning of RECLAIM_SWAP, then
something like the following would be reasonable;

.may_unmap = !!(zone_reclaim_mode & RECLAIM_SWAP),
.may_swap = !!(zone_reclaim_mode & RECLAIM_SWAP),

because a system administrator is not going to distinguish between
unmapping and swap. I would assume at least that RECLAIM_SWAP implies
unmapping pages for swapping but an updated documentation wouldn't hurt
with

4 = Zone reclaim unmaps and swaps pages
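The `!!` construction matters here because RECLAIM_SWAP is bit value 4, not 1, and the fields expect a boolean. A minimal user-space sketch of the idea (the struct is a stand-in for the relevant scan_control fields, not the kernel's definition):

```c
#include <assert.h>

#define RECLAIM_SWAP 4  /* bit 2 of zone_reclaim_mode */

/* Stand-in for the two scan_control fields being discussed */
struct sc_flags {
    int may_unmap;
    int may_swap;
};

/*
 * !! collapses any non-zero bit-test result to exactly 1, so a
 * zone_reclaim_mode of 7 yields may_swap == 1 rather than 4.
 */
static struct sc_flags init_flags(int zone_reclaim_mode)
{
    struct sc_flags sc = {
        .may_unmap = !!(zone_reclaim_mode & RECLAIM_SWAP),
        .may_swap  = !!(zone_reclaim_mode & RECLAIM_SWAP),
    };
    return sc;
}
```

With zone_reclaim_mode == 1, both flags come out 0; with 4 or 7, both come out 1, matching the "RECLAIM_SWAP implies unmapping" reading.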

> (they can be removed cleanly, but I haven't tried compiling the result)
>
> but your series is based on those.
>

The patchset only depends on
vmscan-change-the-number-of-the-unmapped-files-in-zone-reclaim.patch
and then only because of merge conflicts. All the lessons in
vmscan-change-the-number-of-the-unmapped-files-in-zone-reclaim.patch are
incorporated.

> We have 142 MM patches queued, and we need to merge next week.
>

I'm sorry my timing for coming out with the zone_reclaim() patches sucks
and that I failed to spot these patches earlier. Despite the abundance
of evidence, I'm not trying to be deliberately awkward :/

--
Mel Gorman
Part-time Phd Student Linux Technology Center
University of Limerick IBM Dublin Software Lab

2009-06-12 15:47:14

by Andrew Morton

Subject: Re: [PATCH 2/3] Do not unconditionally treat zones that fail zone_reclaim() as full

On Fri, 12 Jun 2009 11:36:17 +0100 Mel Gorman <[email protected]> wrote:

> On Thu, Jun 11, 2009 at 09:48:53AM -0400, Christoph Lameter wrote:
> > It needs to be mentioned that this fixes a bug introduced in 2.6.19.
> > Possibly a portion of this code needs to be backported to stable.
> >
>
> Andrew has sucked up the patch already so I can't patch it. Andrew, there
> is a further note below on the patch if you'd like to pick it up.

OK.

> On the stable front, I think that patches 1 and 2 should be considered
> -stable candidates. Patch 1 is certainly needed because it fixes up the
> malloc() stall and should be picked up by distro kernels as well. This patch
> closes another obvious hole albeit one harder to trigger.
>
> Ideally patch 3 would also be in -stable so distro kernels will suck it up
> as it will help identify this problem in the field if it occurs again but
> I'm not sure what the -stable policy on such things is.

Well, I tagged the patches for stable but they don't apply at all well
to even 2.6.30 base.

2009-06-12 16:09:12

by Andrew Morton

Subject: Re: [PATCH 0/3] Fix malloc() stall in zone_reclaim() and bring behaviour more in line with expectations V3

On Fri, 12 Jun 2009 12:04:24 +0100 Mel Gorman <[email protected]> wrote:

> On Thu, Jun 11, 2009 at 04:30:06PM -0700, Andrew Morton wrote:
> > On Thu, 11 Jun 2009 11:47:50 +0100
> > Mel Gorman <[email protected]> wrote:
> >
> > > The big change with this release is that the patch reintroducing
> > > zone_reclaim_interval has been dropped as Ram reports the malloc() stalls
> > > have been resolved. If this bug occurs again, the counter will be there to
> > > help us identify the situation.
> >
> > What is the exact relationship between this work and the somewhat
> > mangled "[PATCH for mmotm 0/5] introduce swap-backed-file-mapped count
> > and fix
> > vmscan-change-the-number-of-the-unmapped-files-in-zone-reclaim.patch"
> > series?
> >
>
> The patch series "Fix malloc() stall in zone_reclaim() and bring
> behaviour more in line with expectations V3" replaces
> vmscan-change-the-number-of-the-unmapped-files-in-zone-reclaim.patch.
>
> Portions of the patch series "Introduce swap-backed-file-mapped count" are
> potentially follow-on work if a failure case can be identified. The series
> brings the kernel behaviour more in line with documentation, but it's easier
> to fix the documentation.
>
> > That five-patch series had me thinking that it was time to drop
> >
> > vmscan-change-the-number-of-the-unmapped-files-in-zone-reclaim.patch
>
> This patch gets replaced. All the lessons in the new patch are included.
> They could be merged together.

OK, I'll fold
vmscan-change-the-number-of-the-unmapped-files-in-zone-reclaim.patch
and
vmscan-properly-account-for-the-number-of-page-cache-pages-zone_reclaim-can-reclaim.patch,
using
vmscan-properly-account-for-the-number-of-page-cache-pages-zone_reclaim-can-reclaim.patch's
changelog verbatim.


> > vmscan-drop-pf_swapwrite-from-zone_reclaim.patch
>
> This patch is wrong, but only sort of. It should be dropped or replaced with
> another version. Kosaki, could you resubmit this patch except that you check
> if RECLAIM_SWAP is set in zone_reclaim_mode when deciding whether to set
> PF_SWAPWRITE or not please?
>
> Your patch is correct if zone_reclaim_mode is 1, but incorrect if it's 7, for
> example.

OK, I can drop that.

> > vmscan-zone_reclaim-use-may_swap.patch
> >
>
> This is a tricky one. Kosaki, I think this patch is a little dangerous. With
> this applied, pages get unmapped whether RECLAIM_SWAP is set or not. This
> means that zone_reclaim() now has more work to do when it's enabled and it
> incurs a number of minor faults for no reason as a result of trying to avoid
> going off-node. I don't believe that is desirable because it would manifest
> as high minor fault counts on NUMA and would be difficult to pin down why
> that was happening.
>
> I think the code makes more sense than the documentation and it's the
> documentation that should be fixed. Our current behaviour is to discard
> clean, swap-backed, unmapped pages that require no further IO. This is
> reasonable behaviour for zone_reclaim_mode == 1 so maybe the patch
> should change the documentation to
>
> 1 = Zone reclaim discards clean unmapped disk-backed pages
> 2 = Zone reclaim writes dirty pages out
> 4 = Zone reclaim unmaps and swaps pages
>
> If you really wanted to be strict about the meaning of RECLAIM_SWAP, then
> something like the following would be reasonable;
>
> .may_unmap = !!(zone_reclaim_mode & RECLAIM_SWAP),
> .may_swap = !!(zone_reclaim_mode & RECLAIM_SWAP),
>
> because a system administrator is not going to distinguish between
> unmapping and swap. I would assume at least that RECLAIM_SWAP implies
> unmapping pages for swapping but an updated documentation wouldn't hurt
> with
>
> 4 = Zone reclaim unmaps and swaps pages

OK, I can drop vmscan-zone_reclaim-use-may_swap.patch also.

> > (they can be removed cleanly, but I haven't tried compiling the result)
> >
> > but your series is based on those.
> >
>
> The patchset only depends on
> vmscan-change-the-number-of-the-unmapped-files-in-zone-reclaim.patch
> and then only because of merge conflicts. All the lessons in
> vmscan-change-the-number-of-the-unmapped-files-in-zone-reclaim.patch are
> incorporated.

OK.

> > We have 142 MM patches queued, and we need to merge next week.
> >
>
> I'm sorry my timing for coming out with the zone_reclaim() patches sucks
> and that I failed to spot these patches earlier. Despite the abundance
> of evidence, I'm not trying to be deliberately awkward :/

Well. Speaking of bad timing, my next 2.5 days are dedicated to
zooming around a racetrack. I'll do an mmotm now, if it looks like
it'll slightly compile. Please check carefully.

2009-06-15 04:51:30

by KOSAKI Motohiro

Subject: Re: [PATCH 1/3] Properly account for the number of page cache pages zone_reclaim() can reclaim

> > > +/* Work out how many page cache pages we can reclaim in this reclaim_mode */
> > > +static long zone_pagecache_reclaimable(struct zone *zone)
> > > +{
> > > + long nr_pagecache_reclaimable;
> > > + long delta = 0;
> > > +
> > > + /*
> > > + * If RECLAIM_SWAP is set, then all file pages are considered
> > > + * potentially reclaimable. Otherwise, we have to worry about
> > > + * pages like swapcache and zone_unmapped_file_pages() provides
> > > + * a better estimate
> > > + */
> > > + if (zone_reclaim_mode & RECLAIM_SWAP)
> > > + nr_pagecache_reclaimable = zone_page_state(zone, NR_FILE_PAGES);
> > > + else
> > > + nr_pagecache_reclaimable = zone_unmapped_file_pages(zone);
> > > +
> > > + /* If we can't clean pages, remove dirty pages from consideration */
> > > + if (!(zone_reclaim_mode & RECLAIM_WRITE))
> > > + delta += zone_page_state(zone, NR_FILE_DIRTY);
> >
> > no use delta?
> >
>
> delta was used twice in an interim version when it was possible to overflow
> the counter. I left it as is because if another counter is added that must
> be subtracted from nr_pagecache_reclaimable, it'll be tidier to patch in if
> delta was there. I can take it out if you prefer.

Honestly, I'm confused now.

Your last version had the following usage of "delta":

/* Beware of double accounting */
if (delta < nr_pagecache_reclaimable)
nr_pagecache_reclaimable -= delta;

but your current patch doesn't have it.
IOW, nobody uses the delta variable. I'm not sure whether you forgot to
subtract it from nr_pagecache_reclaimable or forgot to remove the
"delta += zone_page_state(zone, NR_FILE_DIRTY);" line.

Or am I missing something?
To be clear, I don't oppose this change; I only want to clarify your intention.



> > > - nr_unmapped_file_pages = zone_page_state(zone, NR_INACTIVE_FILE) +
> > > - zone_page_state(zone, NR_ACTIVE_FILE) -
> > > - zone_page_state(zone, NR_FILE_MAPPED);
> > > -
> > > - if (nr_unmapped_file_pages > zone->min_unmapped_pages) {
> > > + if (zone_pagecache_reclaimable(zone) > zone->min_unmapped_pages) {
> >
> > Documentation/sysctl/vm.txt says
> > =============================================================
> >
> > min_unmapped_ratio:
> >
> > This is available only on NUMA kernels.
> >
> > A percentage of the total pages in each zone. Zone reclaim will only
> > occur if more than this percentage of pages are file backed and unmapped.
> > This is to insure that a minimal amount of local pages is still available for
> > file I/O even if the node is overallocated.
> >
> > The default is 1 percent.
> >
> > ==============================================================
> >
> > but your code considers additional things. Can you please change the document too?
> >
>
> How does this look?
>
> ==============================================================
> min_unmapped_ratio:
>
> This is available only on NUMA kernels.
>
> This is a percentage of the total pages in each zone. Zone reclaim will only
> occur if more than this percentage are in a state that zone_reclaim_mode
> allows to be reclaimed.
>
> If zone_reclaim_mode has the value 4 OR'd, then the percentage is compared
> against all file-backed unmapped pages including swapcache pages and tmpfs
> files. Otherwise, only unmapped pages backed by normal files but not tmpfs
> files and similar are considered.
>
> The default is 1 percent.
> ==============================================================

Great! thanks.


2009-06-15 09:42:47

by KOSAKI Motohiro

Subject: Re: [PATCH 0/3] Fix malloc() stall in zone_reclaim() and bring behaviour more in line with expectations V3

Hi

> On Thu, Jun 11, 2009 at 04:30:06PM -0700, Andrew Morton wrote:
> > On Thu, 11 Jun 2009 11:47:50 +0100
> > Mel Gorman <[email protected]> wrote:
> >
> > > The big change with this release is that the patch reintroducing
> > > zone_reclaim_interval has been dropped as Ram reports the malloc() stalls
> > > have been resolved. If this bug occurs again, the counter will be there to
> > > help us identify the situation.
> >
> > What is the exact relationship between this work and the somewhat
> > mangled "[PATCH for mmotm 0/5] introduce swap-backed-file-mapped count
> > and fix
> > vmscan-change-the-number-of-the-unmapped-files-in-zone-reclaim.patch"
> > series?
> >
>
> The patch series "Fix malloc() stall in zone_reclaim() and bring
> behaviour more in line with expectations V3" replaces
> vmscan-change-the-number-of-the-unmapped-files-in-zone-reclaim.patch.
>
> Portions of the patch series "Introduce swap-backed-file-mapped count" are
> potentially follow-on work if a failure case can be identified. The series
> brings the kernel behaviour more in line with documentation, but it's easier
> to fix the documentation.

Agreed.


> > That five-patch series had me thinking that it was time to drop
> >
> > vmscan-change-the-number-of-the-unmapped-files-in-zone-reclaim.patch
>
> This patch gets replaced. All the lessons in the new patch are included.
> They could be merged together.

Sure.


> > vmscan-drop-pf_swapwrite-from-zone_reclaim.patch
>
> This patch is wrong, but only sort of. It should be dropped or replaced with
> another version. Kosaki, could you resubmit this patch except that you check
> if RECLAIM_SWAP is set in zone_reclaim_mode when deciding whether to set
> PF_SWAPWRITE or not please?

OK. I'll test it again with your patch.

> Your patch is correct if zone_reclaim_mode is 1, but incorrect if it's 7, for
> example.

May I ask what your worry is?

Perhaps my patch description was wrong. I should have written that the patch's
effect differs between small and large servers.

First, our dirty page limiting is sane, so we don't need to worry that
all pages in the system are dirty.

Thus, on a large server, turning off PF_SWAPWRITE doesn't cause off-node
allocation; there are always clean, droppable pages in the system.

On the other hand, on a small server we need to worry about racing with
write-back because system memory is relatively small.
Thus, turning off PF_SWAPWRITE might cause off-node allocation.

- typically, small servers are more latency-aware than larger ones.
- zone reclaim is not a feature that guarantees no off-node allocation.
- on a small server, the off-node allocation penalty is often not as large
as on a larger one.

I still think this patch is valuable.


> > vmscan-zone_reclaim-use-may_swap.patch
> >
>
> This is a tricky one. Kosaki, I think this patch is a little dangerous. With
> this applied, pages get unmapped whether RECLAIM_SWAP is set or not. This
> means that zone_reclaim() now has more work to do when it's enabled and it
> incurs a number of minor faults for no reason as a result of trying to avoid
> going off-node. I don't believe that is desirable because it would manifest
> as high minor fault counts on NUMA and would be difficult to pin down why
> that was happening.

(cc to Hannes)

First, if this patch should be dropped, then I think commit bd2f6199
(vmscan: respect higher order in zone_reclaim()) should be too.

The combination of lumpy reclaim and !may_unmap is really ineffective.
It can isolate neighboring pages, give up on unmapping them, and put the
pages back at the tail of the LRU.
That amounts to shuffling the LRU list.

I don't think it is desirable.


Second, we learned during the split-LRU discussion that "mapped or not
mapped" is not an appropriate basis for reclaim boosting.
So I think being consistent is better. If disregarding may_unmap causes a
serious performance issue, we need to fix the try_to_free_pages() path too.


Third, if we consider an MPI program on NUMA, each process frequently
accesses only part of the array data and never touches the rest.
So, AFAIK, pages that are mapped but only rarely accessed are not a major
source of performance loss.


I have one question: is the "difficulty of pinning down" you mentioned the major issue?


>
> I think the code makes more sense than the documentation and it's the
> documentation that should be fixed. Our current behaviour is to discard
> clean, swap-backed, unmapped pages that require no further IO. This is
> reasonable behaviour for zone_reclaim_mode == 1 so maybe the patch
> should change the documentation to
>
> 1 = Zone reclaim discards clean unmapped disk-backed pages
> 2 = Zone reclaim writes dirty pages out
> 4 = Zone reclaim unmaps and swaps pages
>
> If you really wanted to be strict about the meaning of RECLAIM_SWAP, then
> something like the following would be reasonable;
>
> .may_unmap = !!(zone_reclaim_mode & RECLAIM_SWAP),
> .may_swap = !!(zone_reclaim_mode & RECLAIM_SWAP),
>
> because a system administrator is not going to distinguish between
> unmapping and swap. I would assume at least that RECLAIM_SWAP implies
> unmapping pages for swapping but an updated documentation wouldn't hurt
> with
>
> 4 = Zone reclaim unmaps and swaps pages
>
> > (they can be removed cleanly, but I haven't tried compiling the result)
> >
> > but your series is based on those.
> >
>
> The patchset only depends on
> vmscan-change-the-number-of-the-unmapped-files-in-zone-reclaim.patch
> and then only because of merge conflicts. All the lessons in
> vmscan-change-the-number-of-the-unmapped-files-in-zone-reclaim.patch are
> incorporated.
>
> > We have 142 MM patches queued, and we need to merge next week.
> >
>
> I'm sorry my timing for coming out with the zone_reclaim() patches sucks
> and that I failed to spot these patches earlier. Despite the abundance
> of evidence, I'm not trying to be deliberately awkward :/
>
> --
> Mel Gorman
> Part-time Phd Student Linux Technology Center
> University of Limerick IBM Dublin Software Lab


2009-06-15 10:05:54

by Mel Gorman

Subject: Re: [PATCH 1/3] Properly account for the number of page cache pages zone_reclaim() can reclaim

On Mon, Jun 15, 2009 at 01:51:16PM +0900, KOSAKI Motohiro wrote:
> > > > +/* Work out how many page cache pages we can reclaim in this reclaim_mode */
> > > > +static long zone_pagecache_reclaimable(struct zone *zone)
> > > > +{
> > > > + long nr_pagecache_reclaimable;
> > > > + long delta = 0;
> > > > +
> > > > + /*
> > > > + * If RECLAIM_SWAP is set, then all file pages are considered
> > > > + * potentially reclaimable. Otherwise, we have to worry about
> > > > + * pages like swapcache and zone_unmapped_file_pages() provides
> > > > + * a better estimate
> > > > + */
> > > > + if (zone_reclaim_mode & RECLAIM_SWAP)
> > > > + nr_pagecache_reclaimable = zone_page_state(zone, NR_FILE_PAGES);
> > > > + else
> > > > + nr_pagecache_reclaimable = zone_unmapped_file_pages(zone);
> > > > +
> > > > + /* If we can't clean pages, remove dirty pages from consideration */
> > > > + if (!(zone_reclaim_mode & RECLAIM_WRITE))
> > > > + delta += zone_page_state(zone, NR_FILE_DIRTY);
> > >
> > > delta is unused?
> > >
> >
> > delta was used twice in an interim version when it was possible to overflow
> > the counter. I left it as is because if another counter is added that must
> > be subtracted from nr_pagecache_reclaimable, it'll be tidier to patch in if
> > delta was there. I can take it out if you prefer.
>
> Honestly, I'm confused now.
>
> Your last version had the following usage of "delta":
>
> /* Beware of double accounting */
> if (delta < nr_pagecache_reclaimable)
> nr_pagecache_reclaimable -= delta;
>
> but your current patch doesn't have it.
> IOW, nobody uses the delta variable. I'm not sure whether you forgot to
> apply it to nr_pagecache_reclaimable or forgot to remove the
> "delta += zone_page_state(zone, NR_FILE_DIRTY);" line.
>
> Or am I missing something?
> Now, I don't oppose this change; I only want to clarify your intention.
>

You're not missing anything. I accidentally removed where delta was
being used. Sorry.

>
>
> > > > - nr_unmapped_file_pages = zone_page_state(zone, NR_INACTIVE_FILE) +
> > > > - zone_page_state(zone, NR_ACTIVE_FILE) -
> > > > - zone_page_state(zone, NR_FILE_MAPPED);
> > > > -
> > > > - if (nr_unmapped_file_pages > zone->min_unmapped_pages) {
> > > > + if (zone_pagecache_reclaimable(zone) > zone->min_unmapped_pages) {
> > >
> > > Documentation/sysctl/vm.txt says
> > > =============================================================
> > >
> > > min_unmapped_ratio:
> > >
> > > This is available only on NUMA kernels.
> > >
> > > A percentage of the total pages in each zone. Zone reclaim will only
> > > occur if more than this percentage of pages are file backed and unmapped.
> > > This is to insure that a minimal amount of local pages is still available for
> > > file I/O even if the node is overallocated.
> > >
> > > The default is 1 percent.
> > >
> > > ==============================================================
> > >
> > > but your code considers additional things. Can you please update the document too?
> > >
> >
> > How does this look?
> >
> > ==============================================================
> > min_unmapped_ratio:
> >
> > This is available only on NUMA kernels.
> >
> > This is a percentage of the total pages in each zone. Zone reclaim will only
> > occur if more than this percentage are in a state that zone_reclaim_mode
> > allows to be reclaimed.
> >
> > If zone_reclaim_mode has the value 4 OR'd, then the percentage is compared
> > against all file-backed unmapped pages including swapcache pages and tmpfs
> > files. Otherwise, only unmapped pages backed by normal files but not tmpfs
> > files and similar are considered.
> >
> > The default is 1 percent.
> > ==============================================================
>
> Great! thanks.
>

I'll prepare two patches after reviewing mmotm. The first will use delta
properly and the second will fix the documentation. Thanks

--
Mel Gorman
Part-time Phd Student Linux Technology Center
University of Limerick IBM Dublin Software Lab

2009-06-15 10:28:46

by Mel Gorman

Subject: Re: [PATCH 2/3] Do not unconditionally treat zones that fail zone_reclaim() as full

On Fri, Jun 12, 2009 at 08:44:56AM -0700, Andrew Morton wrote:
> On Fri, 12 Jun 2009 11:36:17 +0100 Mel Gorman <[email protected]> wrote:
>
> > On Thu, Jun 11, 2009 at 09:48:53AM -0400, Christoph Lameter wrote:
> > > It needs to be mentioned that this fixes a bug introduced in 2.6.19.
> > > Possibly a portion of this code needs to be backported to stable.
> > >
> >
> > Andrew has sucked up the patch already so I can't patch it. Andrew, there
> > is a further note below on the patch if you'd like to pick it up.
>
> OK.
>
> > On the stable front, I think that patches 1 and 2 should be considered
> > -stable candidates. Patch 1 is certainly needed because it fixes up the
> > malloc() stall and should be picked up by distro kernels as well. This patch
> > closes another obvious hole albeit one harder to trigger.
> >
> > Ideally patch 3 would also be in -stable so distro kernels will suck it up
> > as it will help identify this problem in the field if it occurs again but
> > I'm not sure what the -stable policy is on such things.
>
> Well, I tagged the patches for stable but they don't apply at all well
> to even 2.6.30 base.
>

What's the proper way to handle such a situation? Wait until the patches
go to mainline and post a rebased version to stable?

--
Mel Gorman
Part-time Phd Student Linux Technology Center
University of Limerick IBM Dublin Software Lab

2009-06-15 10:57:00

by Mel Gorman

Subject: Re: [PATCH 0/3] Fix malloc() stall in zone_reclaim() and bring behaviour more in line with expectations V3

On Mon, Jun 15, 2009 at 06:42:37PM +0900, KOSAKI Motohiro wrote:
> Hi
>
> > On Thu, Jun 11, 2009 at 04:30:06PM -0700, Andrew Morton wrote:
> > > On Thu, 11 Jun 2009 11:47:50 +0100
> > > Mel Gorman <[email protected]> wrote:
> > >
> > > > The big change with this release is that the patch reintroducing
> > > > zone_reclaim_interval has been dropped as Ram reports the malloc() stalls
> > > > have been resolved. If this bug occurs again, the counter will be there to
> > > > help us identify the situation.
> > >
> > > What is the exact relationship between this work and the somewhat
> > > mangled "[PATCH for mmotm 0/5] introduce swap-backed-file-mapped count
> > > and fix
> > > vmscan-change-the-number-of-the-unmapped-files-in-zone-reclaim.patch"
> > > series?
> > >
> >
> > The patch series "Fix malloc() stall in zone_reclaim() and bring
> > behaviour more in line with expectations V3" replaces
> > vmscan-change-the-number-of-the-unmapped-files-in-zone-reclaim.patch.
> >
> > Portions of the patch series "Introduce swap-backed-file-mapped count" are
> > potentially follow-on work if a failure case can be identified. The series
> > brings the kernel behaviour more in line with documentation, but it's easier
> > to fix the documentation.
>
> Agreed.
>
>
> > > That five-patch series had me thinking that it was time to drop
> > >
> > > vmscan-change-the-number-of-the-unmapped-files-in-zone-reclaim.patch
> >
> > This patch gets replaced. All the lessons in the new patch are included.
> > They could be merged together.
>
> Sure.
>
>
> > > vmscan-drop-pf_swapwrite-from-zone_reclaim.patch
> >
> > This patch is wrong, but only sort of. It should be dropped or replaced with
> > another version. Kosaki, could you resubmit this patch except that you check
> > if RECLAIM_SWAP is set in zone_reclaim_mode when deciding whether to set
> > PF_SWAPWRITE or not please?
>
> OK. I'll test it again with your patch.
>
> > Your patch is correct if zone_reclaim_mode is 1, but incorrect if it's 7, for
> > example.
>
> May I ask your worry?
>

Simply that I believe the intention of PF_SWAPWRITE here was to allow
zone_reclaim() to aggressively reclaim memory if the reclaim_mode allowed
it as it was a statement that off-node accesses are really not desired.

> Perhaps my patch description was wrong. I should have written that the patch's
> effect differs between small and large servers.
>
> First, our dirty page limiting is sane. Thus we don't need to worry about
> all pages in the system being dirty.
>

For dirty pages, that seems fair, but what about pages that have been recently
unmapped and are being written to swap? If PF_SWAPWRITE is not set, then is it
possible we'll fail to pageout due to IO congestion even though zone_reclaim()
queues only a small number of pages?

> Thus, on a large server, turning off PF_SWAPWRITE doesn't cause off-node allocation.
> There are always clean and droppable pages in the system.
>

And it would be expected that if the scan-avoidance heuristics are working,
then we know there are pages that can be reclaimed. However, if the reclaim
mode is allowing the writing, unmapping and swapping of pages, there could
still be a lot of IO to generate.

> On the other hand, on a small server, we need to worry about write-back races
> because system memory is relatively small.
> Thus, turning off PF_SWAPWRITE might cause off-node allocation.
>
> - typically, small servers are more latency-aware than larger ones.
> - zone reclaim is not a feature that guarantees no off-node allocation.
> - on a small server, the off-node allocation penalty is often smaller than
> on a larger one.
>
> I still think this patch is valuable.
>

Ok. I am not fully convinced but I'll not block it either if you believe it's
necessary. My current understanding is that this patch only makes a difference
if the server is IO congested in which case the system is struggling anyway
and an off-node access is going to be relatively small penalty overall.
Conceivably, having PF_SWAPWRITE set makes things worse in that situation
and the patch makes some sense.

>
> > > vmscan-zone_reclaim-use-may_swap.patch
> > >
> >
> > This is a tricky one. Kosaki, I think this patch is a little dangerous. With
> > this applied, pages get unmapped whether RECLAIM_SWAP is set or not. This
> > means that zone_reclaim() now has more work to do when it's enabled and it
> > incurs a number of minor faults for no reason as a result of trying to avoid
> > going off-node. I don't believe that is desirable because it would manifest
> > as high minor fault counts on NUMA and would be difficult to pin down why
> > that was happening.
>
> (cc to hanns)
>
> First, if this patch should be dropped, commit bd2f6199
> (vmscan: respect higher order in zone_reclaim()) should be too, I think.
> The combination of lumpy reclaim and !may_unmap is really ineffective.

Whether it's ineffective or not, it's what the user has asked for. They
want a high-order page found if possible within the limits of
zone_reclaim_mode. If it fails, they will enter full direct reclaim
later in the path and try again.

How effective lumpy reclaim is in this case really depends on what the
system has been used for in the past. It's impossible to know in advance
how effective lumpy reclaim will be in every case.

> it might cause isolating neighbouring pages, giving up on unmapping, and pages
> being put back at the tail of the LRU.
> That means shuffling the LRU list.
>
> I don't think it is desirable.
>

With Kamezawa Hiroyuki's patch that avoids unnecessary shuffles of the LRU
list due to lumpy reclaim, the situation might be better?

> Second, we learned during the split-lru discussion that "mapped or not mapped"
> is not an appropriate basis for reclaim boosting.
> So I think being consistent is better. If ignoring may_unmap causes a serious
> performance issue, we need to fix the try_to_free_pages() path too.
>

I don't understand this paragraph.

If zone_reclaim_mode is set to 1, I don't believe the expected behaviour is
for pages to be unmapped from page tables. I think it will lead to mysterious
bug reports of higher numbers of minor faults when running applications on
NUMA machines in some situations.

>
> Third, if we consider an MPI program on NUMA, each process accesses only
> a part of the array data frequently and never touches the rest of the array.
> So, AFAIK, "rarely, but accessed" pages really are rare, and infrequent access is not a major performance source.
>
> I have one question: is the "difficulty of pinning down" the major issue?
>

Yes. If an administrator notices that minor fault rates are higher than
expected, it's going to be very difficult for them to understand why
it is happening and why setting reclaim_mode to 0 apparently fixes the
problem. oprofile for example might just show that a lot of time is being
spent in the fault paths but not explain why.

>
> >
> > I think the code makes more sense than the documentation and it's the
> > documentation that should be fixed. Our current behaviour is to discard
> > clean, swap-backed, unmapped pages that require no further IO. This is
> > reasonable behaviour for zone_reclaim_mode == 1 so maybe the patch
> > should change the documentation to
> >
> > 1 = Zone reclaim discards clean unmapped disk-backed pages
> > 2 = Zone reclaim writes dirty pages out
> > 4 = Zone reclaim unmaps and swaps pages
> >
> > If you really wanted to be strict about the meaning of RECLAIM_SWAP, then
> > something like the following would be reasonable;
> >
> > .may_unmap = !!(zone_reclaim_mode & RECLAIM_SWAP),
> > .may_swap = !!(zone_reclaim_mode & RECLAIM_SWAP),
> >
> > because a system administrator is not going to distinguish between
> > unmapping and swap. I would assume at least that RECLAIM_SWAP implies
> > unmapping pages for swapping, but updated documentation wouldn't hurt
> > with
> >
> > 4 = Zone reclaim unmaps and swaps pages
> >
> > > (they can be removed cleanly, but I haven't tried compiling the result)
> > >
> > > but your series is based on those.
> > >
> >
> > The patchset only depends on
> > vmscan-change-the-number-of-the-unmapped-files-in-zone-reclaim.patch
> > and then only because of merge conflicts. All the lessons in
> > vmscan-change-the-number-of-the-unmapped-files-in-zone-reclaim.patch are
> > incorporated.
> >
> > > We have 142 MM patches queued, and we need to merge next week.
> > >
> >
> > I'm sorry my timing for coming out with the zone_reclaim() patches sucks
> > and that I failed to spot these patches earlier. Despite the abundance
> > of evidence, I'm not trying to be deliberately awkward :/
> >

--
Mel Gorman
Part-time Phd Student Linux Technology Center
University of Limerick IBM Dublin Software Lab

2009-06-15 15:10:22

by Christoph Lameter

Subject: Re: [PATCH 0/3] Fix malloc() stall in zone_reclaim() and bring behaviour more in line with expectations V3

On Mon, 15 Jun 2009, Mel Gorman wrote:

> > May I ask your worry?
> >
>
> Simply that I believe the intention of PF_SWAPWRITE here was to allow
> zone_reclaim() to aggressively reclaim memory if the reclaim_mode allowed
> it as it was a statement that off-node accesses are really not desired.

Right.

> Ok. I am not fully convinced but I'll not block it either if you believe it's
> necessary. My current understanding is that this patch only makes a difference
> if the server is IO congested in which case the system is struggling anyway
> and an off-node access is going to be relatively small penalty overall.
> Conceivably, having PF_SWAPWRITE set makes things worse in that situation
> and the patch makes some sense.

We could drop support for RECLAIM_SWAP if that simplifies things.

2009-06-15 15:25:50

by Mel Gorman

Subject: Re: [PATCH 0/3] Fix malloc() stall in zone_reclaim() and bring behaviour more in line with expectations V3

On Mon, Jun 15, 2009 at 11:01:41AM -0400, Christoph Lameter wrote:
> On Mon, 15 Jun 2009, Mel Gorman wrote:
>
> > > May I ask your worry?
> > >
> >
> > Simply that I believe the intention of PF_SWAPWRITE here was to allow
> > zone_reclaim() to aggressively reclaim memory if the reclaim_mode allowed
> > it as it was a statement that off-node accesses are really not desired.
>
> Right.
>
> > Ok. I am not fully convinced but I'll not block it either if you believe it's
> > necessary. My current understanding is that this patch only makes a difference
> > if the server is IO congested in which case the system is struggling anyway
> > and an off-node access is going to be relatively small penalty overall.
> > Conceivably, having PF_SWAPWRITE set makes things worse in that situation
> > and the patch makes some sense.
>
> We could drop support for RECLAIM_SWAP if that simplifies things.
>

I don't think that is necessary. While I expect it's very rarely used, I
imagine a situation where it would be desirable on a system that had large
amounts of tmpfs pages but where it wasn't critical they remain in-memory.

Removing PF_SWAPWRITE would make it less aggressive and if you were
happy with that, then that would be good enough for me.

--
Mel Gorman
Part-time Phd Student Linux Technology Center
University of Limerick IBM Dublin Software Lab

2009-06-15 16:01:06

by Andrew Morton

Subject: Re: [PATCH 2/3] Do not unconditionally treat zones that fail zone_reclaim() as full

On Mon, 15 Jun 2009 11:28:30 +0100 Mel Gorman <[email protected]> wrote:

> On Fri, Jun 12, 2009 at 08:44:56AM -0700, Andrew Morton wrote:
> > On Fri, 12 Jun 2009 11:36:17 +0100 Mel Gorman <[email protected]> wrote:
> >
> > > On Thu, Jun 11, 2009 at 09:48:53AM -0400, Christoph Lameter wrote:
> > > > It needs to be mentioned that this fixes a bug introduced in 2.6.19.
> > > > Possibly a portion of this code needs to be backported to stable.
> > > >
> > >
> > > Andrew has sucked up the patch already so I can't patch it. Andrew, there
> > > is a further note below on the patch if you'd like to pick it up.
> >
> > OK.
> >
> > > On the stable front, I think that patches 1 and 2 should be considered
> > > -stable candidates. Patch 1 is certainly needed because it fixes up the
> > > malloc() stall and should be picked up by distro kernels as well. This patch
> > > closes another obvious hole albeit one harder to trigger.
> > >
> > > Ideally patch 3 would also be in -stable so distro kernels will suck it up
> > > as it will help identify this problem in the field if it occurs again but
> > > I'm not sure what the -stable policy is on such things.
> >
> > Well, I tagged the patches for stable but they don't apply at all well
> > to even 2.6.30 base.
> >
>
> What's the proper way to handle such a situation? Wait until the patches
> go to mainline and post a rebased version to stable?

Yes please. I assume that when Greg&Chris try to apply the patch,
we'll hear squawks to remind us of this.

Of course, it'd be better if the patch didn't get rejects. Perhaps
whatever-patch-clashed should also be backported.

2009-06-15 21:20:53

by Andrew Morton

Subject: Re: [PATCH 3/3] Count the number of times zone_reclaim() scans and fails

On Thu, 11 Jun 2009 11:47:53 +0100
Mel Gorman <[email protected]> wrote:

> + PGSCAN_ZONERECLAIM_FAILED,
> @@ -2492,6 +2492,9 @@ int zone_reclaim(struct zone *zone, gfp_t gfp_mask, unsigned int order)
> + "zreclaim_failed",

So we have "zonereclaim", "zone_reclaim" and "zreclaim", which isn't
terribly developer-friendly.


This?

--- a/include/linux/vmstat.h~vmscan-count-the-number-of-times-zone_reclaim-scans-and-fails-fix
+++ a/include/linux/vmstat.h
@@ -37,7 +37,7 @@ enum vm_event_item { PGPGIN, PGPGOUT, PS
FOR_ALL_ZONES(PGSCAN_KSWAPD),
FOR_ALL_ZONES(PGSCAN_DIRECT),
#ifdef CONFIG_NUMA
- PGSCAN_ZONERECLAIM_FAILED,
+ PGSCAN_ZONE_RECLAIM_FAILED,
#endif
PGINODESTEAL, SLABS_SCANNED, KSWAPD_STEAL, KSWAPD_INODESTEAL,
PAGEOUTRUN, ALLOCSTALL, PGROTATED,
diff -puN mm/vmscan.c~vmscan-count-the-number-of-times-zone_reclaim-scans-and-fails-fix mm/vmscan.c
--- a/mm/vmscan.c~vmscan-count-the-number-of-times-zone_reclaim-scans-and-fails-fix
+++ a/mm/vmscan.c
@@ -2520,7 +2520,7 @@ int zone_reclaim(struct zone *zone, gfp_
zone_clear_flag(zone, ZONE_RECLAIM_LOCKED);

if (!ret)
- count_vm_event(PGSCAN_ZONERECLAIM_FAILED);
+ count_vm_event(PGSCAN_ZONE_RECLAIM_FAILED);

return ret;
}
diff -puN mm/vmstat.c~vmscan-count-the-number-of-times-zone_reclaim-scans-and-fails-fix mm/vmstat.c
--- a/mm/vmstat.c~vmscan-count-the-number-of-times-zone_reclaim-scans-and-fails-fix
+++ a/mm/vmstat.c
@@ -674,7 +674,7 @@ static const char * const vmstat_text[]
TEXTS_FOR_ZONES("pgscan_direct")

#ifdef CONFIG_NUMA
- "zreclaim_failed",
+ "zone_reclaim_failed",
#endif
"pginodesteal",
"slabs_scanned",
_

2009-06-16 09:05:23

by Mel Gorman

Subject: Re: [PATCH 3/3] Count the number of times zone_reclaim() scans and fails

On Mon, Jun 15, 2009 at 02:19:12PM -0700, Andrew Morton wrote:
> On Thu, 11 Jun 2009 11:47:53 +0100
> Mel Gorman <[email protected]> wrote:
>
> > + PGSCAN_ZONERECLAIM_FAILED,
> > @@ -2492,6 +2492,9 @@ int zone_reclaim(struct zone *zone, gfp_t gfp_mask, unsigned int order)
> > + "zreclaim_failed",
>
> So we have "zonereclaim", "zone_reclaim" and "zreclaim", which isn't
> terribly developer-friendly.
>
>
> This?
>

Acked-by: Mel Gorman <[email protected]>

> --- a/include/linux/vmstat.h~vmscan-count-the-number-of-times-zone_reclaim-scans-and-fails-fix
> +++ a/include/linux/vmstat.h
> @@ -37,7 +37,7 @@ enum vm_event_item { PGPGIN, PGPGOUT, PS
> FOR_ALL_ZONES(PGSCAN_KSWAPD),
> FOR_ALL_ZONES(PGSCAN_DIRECT),
> #ifdef CONFIG_NUMA
> - PGSCAN_ZONERECLAIM_FAILED,
> + PGSCAN_ZONE_RECLAIM_FAILED,
> #endif
> PGINODESTEAL, SLABS_SCANNED, KSWAPD_STEAL, KSWAPD_INODESTEAL,
> PAGEOUTRUN, ALLOCSTALL, PGROTATED,
> diff -puN mm/vmscan.c~vmscan-count-the-number-of-times-zone_reclaim-scans-and-fails-fix mm/vmscan.c
> --- a/mm/vmscan.c~vmscan-count-the-number-of-times-zone_reclaim-scans-and-fails-fix
> +++ a/mm/vmscan.c
> @@ -2520,7 +2520,7 @@ int zone_reclaim(struct zone *zone, gfp_
> zone_clear_flag(zone, ZONE_RECLAIM_LOCKED);
>
> if (!ret)
> - count_vm_event(PGSCAN_ZONERECLAIM_FAILED);
> + count_vm_event(PGSCAN_ZONE_RECLAIM_FAILED);
>
> return ret;
> }
> diff -puN mm/vmstat.c~vmscan-count-the-number-of-times-zone_reclaim-scans-and-fails-fix mm/vmstat.c
> --- a/mm/vmstat.c~vmscan-count-the-number-of-times-zone_reclaim-scans-and-fails-fix
> +++ a/mm/vmstat.c
> @@ -674,7 +674,7 @@ static const char * const vmstat_text[]
> TEXTS_FOR_ZONES("pgscan_direct")
>
> #ifdef CONFIG_NUMA
> - "zreclaim_failed",
> + "zone_reclaim_failed",
> #endif
> "pginodesteal",
> "slabs_scanned",
> _
>

--
Mel Gorman
Part-time Phd Student Linux Technology Center
University of Limerick IBM Dublin Software Lab

2009-06-16 12:08:58

by KOSAKI Motohiro

Subject: Re: [PATCH 0/3] Fix malloc() stall in zone_reclaim() and bring behaviour more in line with expectations V3

> On Mon, Jun 15, 2009 at 11:01:41AM -0400, Christoph Lameter wrote:
> > On Mon, 15 Jun 2009, Mel Gorman wrote:
> >
> > > > May I ask your worry?
> > > >
> > >
> > > Simply that I believe the intention of PF_SWAPWRITE here was to allow
> > > zone_reclaim() to aggressively reclaim memory if the reclaim_mode allowed
> > > it as it was a statement that off-node accesses are really not desired.
> >
> > Right.
> >
> > > Ok. I am not fully convinced but I'll not block it either if you believe it's
> > > necessary. My current understanding is that this patch only makes a difference
> > > if the server is IO congested in which case the system is struggling anyway
> > > and an off-node access is going to be relatively small penalty overall.
> > > Conceivably, having PF_SWAPWRITE set makes things worse in that situation
> > > and the patch makes some sense.
> >
> > We could drop support for RECLAIM_SWAP if that simplifies things.
> >
>
> I don't think that is necessary. While I expect it's very rarely used, I
> imagine a situation where it would be desirable on a system that had large
> amounts of tmpfs pages but where it wasn't critical they remain in-memory.
>
> Removing PF_SWAPWRITE would make it less aggressive and if you were
> happy with that, then that would be good enough for me.

This surprised me a bit; I had imagined Christoph would never agree to remove it.
Currently, the users I know who hit trouble don't use this feature. Thus, if it can be
removed, I don't need to worry about it being abused again, and I'm happy.

Mel, have you seen an actual user of this?


2009-06-16 12:21:08

by Mel Gorman

Subject: Re: [PATCH 0/3] Fix malloc() stall in zone_reclaim() and bring behaviour more in line with expectations V3

On Tue, Jun 16, 2009 at 09:08:47PM +0900, KOSAKI Motohiro wrote:
> > On Mon, Jun 15, 2009 at 11:01:41AM -0400, Christoph Lameter wrote:
> > > On Mon, 15 Jun 2009, Mel Gorman wrote:
> > >
> > > > > May I ask your worry?
> > > > >
> > > >
> > > > Simply that I believe the intention of PF_SWAPWRITE here was to allow
> > > > zone_reclaim() to aggressively reclaim memory if the reclaim_mode allowed
> > > > it as it was a statement that off-node accesses are really not desired.
> > >
> > > Right.
> > >
> > > > Ok. I am not fully convinced but I'll not block it either if you believe it's
> > > > necessary. My current understanding is that this patch only makes a difference
> > > > if the server is IO congested in which case the system is struggling anyway
> > > > and an off-node access is going to be relatively small penalty overall.
> > > > Conceivably, having PF_SWAPWRITE set makes things worse in that situation
> > > > and the patch makes some sense.
> > >
> > > We could drop support for RECLAIM_SWAP if that simplifies things.
> > >
> >
> > I don't think that is necessary. While I expect it's very rarely used, I
> > imagine a situation where it would be desirable on a system that had large
> > amounts of tmpfs pages but where it wasn't critical they remain in-memory.
> >
> > Removing PF_SWAPWRITE would make it less aggressive and if you were
> > happy with that, then that would be good enough for me.
>
> This surprised me a bit; I had imagined Christoph would never agree to remove it.
> Currently, the users I know who hit trouble don't use this feature. Thus, if it can be
> removed, I don't need to worry about it being abused again, and I'm happy.
>
> Mel, have you seen an actual user of this?
>

No, but then again the usage for it is quite specific. Namely for use on
systems that use a large amount of tmpfs where the remote NUMA penalty is
high and it's acceptable to swap tmpfs pages to avoid remote accesses. I
don't see the harm in having the option available.

--
Mel Gorman
Part-time Phd Student Linux Technology Center
University of Limerick IBM Dublin Software Lab

2009-06-16 12:30:40

by KOSAKI Motohiro

Subject: Re: [PATCH 0/3] Fix malloc() stall in zone_reclaim() and bring behaviour more in line with expectations V3

> On Tue, Jun 16, 2009 at 09:08:47PM +0900, KOSAKI Motohiro wrote:
> > > On Mon, Jun 15, 2009 at 11:01:41AM -0400, Christoph Lameter wrote:
> > > > On Mon, 15 Jun 2009, Mel Gorman wrote:
> > > >
> > > > > > May I ask your worry?
> > > > > >
> > > > >
> > > > > Simply that I believe the intention of PF_SWAPWRITE here was to allow
> > > > > zone_reclaim() to aggressively reclaim memory if the reclaim_mode allowed
> > > > > it as it was a statement that off-node accesses are really not desired.
> > > >
> > > > Right.
> > > >
> > > > > Ok. I am not fully convinced but I'll not block it either if you believe it's
> > > > > necessary. My current understanding is that this patch only makes a difference
> > > > > if the server is IO congested in which case the system is struggling anyway
> > > > > and an off-node access is going to be relatively small penalty overall.
> > > > > Conceivably, having PF_SWAPWRITE set makes things worse in that situation
> > > > > and the patch makes some sense.
> > > >
> > > > We could drop support for RECLAIM_SWAP if that simplifies things.
> > > >
> > >
> > > I don't think that is necessary. While I expect it's very rarely used, I
> > > imagine a situation where it would be desirable on a system that had large
> > > amounts of tmpfs pages but where it wasn't critical they remain in-memory.
> > >
> > > Removing PF_SWAPWRITE would make it less aggressive and if you were
> > > happy with that, then that would be good enough for me.
> >
> > This surprised me a bit; I had imagined Christoph would never agree to remove it.
> > Currently, the users I know who hit trouble don't use this feature. Thus, if it can be
> > removed, I don't need to worry about it being abused again, and I'm happy.
> >
> > Mel, have you seen an actual user of this?
> >
>
> No, but then again the usage for it is quite specific. Namely for use on
> systems that use a large amount of tmpfs where the remote NUMA penalty is
> high and it's acceptable to swap tmpfs pages to avoid remote accesses. I
> don't see the harm in having the option available.

ok.
I understand your opinion.

Thanks.


2009-06-16 12:58:09

by KOSAKI Motohiro

Subject: Re: [PATCH 0/3] Fix malloc() stall in zone_reclaim() and bring behaviour more in line with expectations V3

Hi


> > > > vmscan-zone_reclaim-use-may_swap.patch
> > > >
> > >
> > > This is a tricky one. Kosaki, I think this patch is a little dangerous. With
> > > this applied, pages get unmapped whether RECLAIM_SWAP is set or not. This
> > > means that zone_reclaim() now has more work to do when it's enabled and it
> > > incurs a number of minor faults for no reason as a result of trying to avoid
> > > going off-node. I don't believe that is desirable because it would manifest
> > > as high minor fault counts on NUMA and would be difficult to pin down why
> > > that was happening.
> >
> > (cc to hanns)
> >
> > First, if this patch should be dropped, commit bd2f6199
> > (vmscan: respect higher order in zone_reclaim()) should be too, I think.
> > The combination of lumpy reclaim and !may_unmap is really ineffective.
>
> Whether it's ineffective or not, it's what the user has asked for. They
> want a high-order page found if possible within the limits of
> zone_reclaim_mode. If it fails, they will enter full direct reclaim
> later in the path and try again.
>
> How effective lumpy reclaim is in this case really depends on what the
> system has been used for in the past. It's impossible to know in advance
> how effective lumpy reclaim will be in every case.

In general, a performance discussion needs to consider the typical use-case.
Almost no zone-reclaim-enabled machine is a file server, so unmapped file
pages are not a high proportion of memory.

I am pessimistic about the success rate of lumpy reclaim on those servers.
Of course, it doesn't cause allocation failure; it only leads to full direct reclaim.

But I don't want strange and unnecessary LRU shuffling, and I don't think
it brings a performance improvement.


> > it might cause isolating neighbouring pages, giving up on unmapping, and pages
> > being put back at the tail of the LRU.
> > That means shuffling the LRU list.
> >
> > I don't think it is desirable.
> >
>
> With Kamezawa Hiroyuki's patch that avoids unnecessary shuffles of the LRU
> list due to lumpy reclaim, the situation might be better?

I still think dropping may_unmap is the better choice, but if we keep it, I
agree that adding a may_unmap and page_mapped() condition to
isolate_pages_global() is a good choice.

Nice idea.

> > Second, we learned during the split-lru discussion that "mapped or not mapped"
> > is not an appropriate basis for reclaim boosting.
> > So I think being consistent is better. If ignoring may_unmap causes a serious
> > performance issue, we need to fix the try_to_free_pages() path too.
> >
>
> I don't understand this paragraph.
>
> If zone_reclaim_mode is set to 1, I don't believe the expected behaviour is
> for pages to be unmapped from page tables. I think it will lead to mysterious
> bug reports of higher numbers of minor faults when running applications on
> NUMA machines in some situations.

AFAIK, 99.9% of users read the documentation, not the actual code, and the
documentation doesn't describe this behaviour.
So I don't think it is the expected behaviour.

That's my point.


> > Third, if we consider an MPI program on NUMA, each process frequently
> > accesses only a part of the array data and never touches the rest of it.
> > So, AFAIK, "rarely, but still accessed" pages are rare; infrequent access
> > is not a major source of performance.
> >
> > I have one question: is the "difficulty of pinning it down" the major issue?
> >
>
> Yes. If an administrator notices that minor fault rates are higher than
> expected, it's going to be very difficult for them to understand why
> it is happening and why setting reclaim_mode to 0 apparently fixes the
> problem. oprofile for example might just show that a lot of time is being
> spent in the fault paths but not explain why.

I don't quite understand this paragraph. I feel this is only a theoretical
issue. try_to_unmap_one() succeeding means the pte does not have the
accessed bit set, which is a clear sign that the pte can be unmapped.

If we consider an MPI program, pages untouched for a long time often mean
pages that will never be touched again.
Am I missing anything? Or are you talking about non-HPC workloads?



2009-06-16 13:44:32

by Mel Gorman

[permalink] [raw]
Subject: Re: [PATCH 0/3] Fix malloc() stall in zone_reclaim() and bring behaviour more in line with expectations V3

On Tue, Jun 16, 2009 at 09:57:53PM +0900, KOSAKI Motohiro wrote:
> Hi
>
>
> > > > > vmscan-zone_reclaim-use-may_swap.patch
> > > > >
> > > >
> > > > This is a tricky one. Kosaki, I think this patch is a little dangerous. With
> > > > this applied, pages get unmapped whether RECLAIM_SWAP is set or not. This
> > > > means that zone_reclaim() now has more work to do when it's enabled and it
> > > > incurs a number of minor faults for no reason as a result of trying to avoid
> > > > going off-node. I don't believe that is desirable because it would manifest
> > > > as high minor fault counts on NUMA and would be difficult to pin down why
> > > > that was happening.
> > >
> > > (cc to hanns)
> > >
> > > First, if this patch should be dropped, commit bd2f6199
> > > (vmscan: respect higher order in zone_reclaim()) should be too, I think.
> > > The combination of lumpy reclaim and !may_unmap is really ineffective.
> >
> > Whether it's ineffective or not, it's what the user has asked for. They
> > want a high-order page found if possible within the limits of
> > zone_reclaim_mode. If it fails, they will enter full direct reclaim
> > later in the path and try again.
> >
> > How effective lumpy reclaim is in this case really depends on what the
> > system has been used for in the past. It's impossible to know in advance
> > how effective lumpy reclaim will be in every case.
>
> In general, a performance discussion needs to consider the typical use case.

What typical use case? zone_reclaim logic is enabled by default on NUMA
machines that report a large latency for remote node access. I do not
believe we can draw conclusions on what a typical use case is just
because the machine happens to be a particular NUMA type.

And this isn't a performance discussion as such either. The patch isn't
going to improve performance. I believe it'll have the opposite effect.

> Almost no zone-reclaim-enabled machine is a file server, so the ratio of
> unmapped file pages is not that high.
>
> I am pessimistic about the success rate of lumpy reclaim on such servers.

While I agree on that particular case, I don't think it justifies the patch
to unmap pages so easily just because zone_reclaim() is enabled.

> Of course, it doesn't cause allocation failures; it only leads to full direct reclaim.
>
> But I don't want strange and unnecessary LRU shuffling, and I don't think
> it improves performance either.
>

I'm all for avoiding unnecessary LRU shuffling.

> > > it might isolate neighbouring pages, give up on unmapping them, and put
> > > the pages back at the tail of the LRU.
> > > That means shuffling the LRU list.
> > >
> > > I don't think that is desirable.
> > >
> >
> > With Kamezawa Hiroyuki's patch that avoids unnecessary shuffles of the LRU
> > list due to lumpy reclaim, the situation might be better?
>
> I still think may_unmap is the better choice, but if we keep it, I agree
> that adding a may_unmap && page_mapped() condition to
> isolate_pages_global() is a good choice.
>
> Nice idea.
>

Ok, I agree that lumpy reclaim should be checking may_unmap and page_mapped()
but I still don't think that means that reclaim_mode of 1 allows zone_reclaim()
to unmap pages.

> > > Second, we learned during the split-LRU discussion that "mapped or not
> > > mapped" is not an appropriate basis for boosting reclaim.
> > > So I think being consistent is better. If ignoring may_unmap causes a
> > > serious performance issue, we need to fix the try_to_free_pages() path too.
> > >
> >
> > I don't understand this paragraph.
> >
> > If zone_reclaim_mode is set to 1, I don't believe the expected behaviour is
> > for pages to be unmapped from page tables. I think it will lead to mysterious
> > bug reports of higher numbers of minor faults when running applications on
> > NUMA machines in some situations.
>
> AFAIK, 99.9% of users read the documentation, not the actual code, and the
> documentation doesn't describe this behaviour.
> So I don't think it is the expected behaviour.
>
> That's my point.
>

Which part of the documentation for zone_reclaim_mode == 1 implies that
pages will be unmapped from page tables? If the documentation is misleading,
I would prefer for it to be fixed up than pages be unmapped by default
causing performance regressions due to increased minor faults on NUMA.

>
> > > Third, if we consider an MPI program on NUMA, each process frequently
> > > accesses only a part of the array data and never touches the rest of it.
> > > So, AFAIK, "rarely, but still accessed" pages are rare; infrequent access
> > > is not a major source of performance.
> > >
> > > I have one question: is the "difficulty of pinning it down" the major issue?
> > >
> >
> > Yes. If an administrator notices that minor fault rates are higher than
> > expected, it's going to be very difficult for them to understand why
> > it is happening and why setting reclaim_mode to 0 apparently fixes the
> > problem. oprofile for example might just show that a lot of time is being
> > spent in the fault paths but not explain why.
>
> I don't quite understand this paragraph. I feel this is only a theoretical
> issue. try_to_unmap_one() succeeding means the pte does not have the
> accessed bit set, which is a clear sign that the pte can be unmapped.
>

Ok, even if we are depending on the accessed bit, we are making assumptions
about how frequently the bit is cleared and how often zone_reclaim() is
called when judging whether it will cause more minor faults or not.

Yes, what I'm saying about minor faults being potentially increased is a
theoretical issue and I have no proof, but it feels like a real
possibility. I would like to be convinced that setting may_unmap to 1 by
default when zone_reclaim == 1 is not going to result in this problem
occurring, or at least that it will not happen very often.

I would be much happier if setting may_unmap and may_swap only happened when
RECLAIM_SWAP was enabled.

> If we consider an MPI program, pages untouched for a long time often mean
> pages that will never be touched again.
> Am I missing anything? Or are you talking about non-HPC workloads?
>

I don't have a particular workload in mind to be perfectly honest. I'm just not
convinced of the wisdom of trying to unmap pages by default in zone_reclaim()
just because the NUMA distances happen to be large.

--
Mel Gorman
Part-time Phd Student Linux Technology Center
University of Limerick IBM Dublin Software Lab

2009-06-16 14:51:42

by Christoph Lameter

Subject: Re: [PATCH 0/3] Fix malloc() stall in zone_reclaim() and bring behaviour more in line with expectations V3

On Tue, 16 Jun 2009, Mel Gorman wrote:

> I don't have a particular workload in mind to be perfectly honest. I'm just not
> convinced of the wisdom of trying to unmap pages by default in zone_reclaim()
> just because the NUMA distances happen to be large.

zone reclaim = 1 is supposed to be lightweight with minimal impact. The
intent was just to remove potentially unused pagecache pages so that node
local allocations can succeed again. So let's not unmap pages.

2009-06-17 10:06:56

by KOSAKI Motohiro

Subject: Re: [PATCH 0/3] Fix malloc() stall in zone_reclaim() and bring behaviour more in line with expectations V3

> On Tue, 16 Jun 2009, Mel Gorman wrote:
>
> > I don't have a particular workload in mind to be perfectly honest. I'm just not
> > convinced of the wisdom of trying to unmap pages by default in zone_reclaim()
> > just because the NUMA distances happen to be large.
>
> zone reclaim = 1 is supposed to be lightweight with minimal impact. The
> intent was just to remove potentially unused pagecache pages so that node
> local allocations can succeed again. So let's not unmap pages.

Hm, at least the two major zone reclaim developers disagree with my patch.
Thus I have to agree with you, because I really don't want to ignore other
developers' opinions.

So, as far as I understand, the conclusions of this thread are:
- Drop my patch
- Instead, implement an improvement patch for the (may_unmap && page_mapped()) case
- The documentation should be changed
- It's my homework(?)

Do you agree with this?


2009-06-17 12:03:51

by Mel Gorman

Subject: Re: [PATCH 0/3] Fix malloc() stall in zone_reclaim() and bring behaviour more in line with expectations V3

On Wed, Jun 17, 2009 at 07:06:46PM +0900, KOSAKI Motohiro wrote:
> > On Tue, 16 Jun 2009, Mel Gorman wrote:
> >
> > > I don't have a particular workload in mind to be perfectly honest. I'm just not
> > > convinced of the wisdom of trying to unmap pages by default in zone_reclaim()
> > > just because the NUMA distances happen to be large.
> >
> > zone reclaim = 1 is supposed to be lightweight with minimal impact. The
> > intent was just to remove potentially unused pagecache pages so that node
> > local allocations can succeed again. So let's not unmap pages.
>
> Hm, at least the two major zone reclaim developers disagree with my patch.
> Thus I have to agree with you, because I really don't want to ignore other
> developers' opinions.
>
> So, as far as I understand, the conclusions of this thread are:
> - Drop my patch
> - Instead, implement an improvement patch for the (may_unmap && page_mapped()) case
> - The documentation should be changed
> - It's my homework(?)
>
> Do you agree with this?
>

Yes.

--
Mel Gorman
Part-time Phd Student Linux Technology Center
University of Limerick IBM Dublin Software Lab

2009-06-17 18:49:18

by Christoph Lameter

Subject: Re: [PATCH 0/3] Fix malloc() stall in zone_reclaim() and bring behaviour more in line with expectations V3

On Wed, 17 Jun 2009, KOSAKI Motohiro wrote:

> Hm, at least the two major zone reclaim developers disagree with my patch.
> Thus I have to agree with you, because I really don't want to ignore other
> developers' opinions.
>
> So, as far as I understand, the conclusions of this thread are:
> - Drop my patch
> - Instead, implement an improvement patch for the (may_unmap && page_mapped()) case
> - The documentation should be changed
> - It's my homework(?)
>
> Do you agree with this?

As far as I understand you: Yes. Unmapping can occur in more advanced zone
reclaim modes but the default needs to be as lightweight as possible.