2009-06-08 13:01:42

by Mel Gorman

[permalink] [raw]
Subject: [PATCH 0/3] [RFC] Functional fix to zone_reclaim() and bring behaviour more in line with expectations

A bug was brought to my attention against a distro kernel, but it affects
mainline as well, and I believe problems like this have been reported in
various guises on the mailing lists, although I don't have specific examples
at the moment.

The problem that led to this patchset was that malloc() stalled
for a long time (minutes in some cases) when a large tmpfs mount
was occupying a large percentage of memory overall. The pages did not get
cleaned or reclaimed by zone_reclaim() because the zone_reclaim_mode was
unsuitable, but the lists were uselessly scanned frequently, making the CPU
spin at near 100%.

I do not have the bug resolved yet, although I believe patch 1 of this series
addresses it, and I am waiting to hear back from the bug reporter. However,
whether or not that fix works, the two other patches in this series should
also bring zone_reclaim() more in line with expectations.

Patch 1 reintroduces zone_reclaim_interval to catch the situation where
zone_reclaim() cannot tell in advance that the scan is a waste
of time.

Patch 2 alters the heuristics that zone_reclaim() uses to determine if the
scan should go ahead. Currently, it basically assumes
zone_reclaim_mode is 1.

Patch 3 notes that zone_reclaim() returning a failure automatically means
the zone is marked full. This is not always true. It could have failed
because the GFP mask or zone_reclaim_mode is unsuitable. The patch
makes zone_reclaim() more careful about marking zones temporarily full.

Note, this patchset has not been tested heavily.

Comments?

Documentation/sysctl/vm.txt | 13 +++++++++++
include/linux/mmzone.h | 9 ++++++++
include/linux/swap.h | 1 +
kernel/sysctl.c | 9 ++++++++
mm/internal.h | 4 +++
mm/page_alloc.c | 26 +++++++++++++++++++---
mm/vmscan.c | 48 +++++++++++++++++++++++++++++++++++++-----
7 files changed, 100 insertions(+), 10 deletions(-)


2009-06-08 13:01:53

by Mel Gorman

[permalink] [raw]
Subject: [PATCH 2/3] Properly account for the number of page cache pages zone_reclaim() can reclaim

On NUMA machines, the administrator can configure zone_reclaim_mode that
is a more targeted form of direct reclaim. On machines with large NUMA
distances for example, zone_reclaim_mode defaults to 1 meaning that clean
unmapped pages will be reclaimed if the zone watermarks are not being met.

There is a heuristic that determines if the scan is worthwhile but the
problem is that the heuristic is not being properly applied and is basically
assuming zone_reclaim_mode is 1 if it is enabled.

This patch makes zone_reclaim() make a better attempt at working out how
many pages it might be able to reclaim given the current reclaim_mode. If it
cannot clean pages, then NR_FILE_DIRTY number of pages are not candidates. If
it cannot swap, then NR_FILE_MAPPED are not. This indirectly addresses tmpfs
as those pages tend to be dirty as they are not cleaned by pdflush or sync.

The ideal would be that the number of tmpfs pages would also be known
and accounted for like NR_FILE_MAPPED, as swap is required to discard them.
A means of working this out quickly was not obvious but a comment is added
noting the problem.

Signed-off-by: Mel Gorman <[email protected]>
---
mm/vmscan.c | 18 ++++++++++++++++--
1 files changed, 16 insertions(+), 2 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index ba211c1..ffe2f32 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2380,6 +2380,21 @@ int zone_reclaim(struct zone *zone, gfp_t gfp_mask, unsigned int order)
{
int node_id;
int ret;
+ int pagecache_reclaimable;
+
+ /*
+ * Work out how many page cache pages we can reclaim in this mode.
+ *
+ * NOTE: Ideally, tmpfs pages would be accounted as if they were
+ * NR_FILE_MAPPED as swap is required to discard those
+ * pages even when they are clean. However, there is no
+ * way of quickly identifying the number of tmpfs pages
+ */
+ pagecache_reclaimable = zone_page_state(zone, NR_FILE_PAGES);
+ if (!(zone_reclaim_mode & RECLAIM_WRITE))
+ pagecache_reclaimable -= zone_page_state(zone, NR_FILE_DIRTY);
+ if (!(zone_reclaim_mode & RECLAIM_SWAP))
+ pagecache_reclaimable -= zone_page_state(zone, NR_FILE_MAPPED);

/*
* Zone reclaim reclaims unmapped file backed pages and
@@ -2391,8 +2406,7 @@ int zone_reclaim(struct zone *zone, gfp_t gfp_mask, unsigned int order)
* if less than a specified percentage of the zone is used by
* unmapped file backed pages.
*/
- if (zone_page_state(zone, NR_FILE_PAGES) -
- zone_page_state(zone, NR_FILE_MAPPED) <= zone->min_unmapped_pages
+ if (pagecache_reclaimable <= zone->min_unmapped_pages
&& zone_page_state(zone, NR_SLAB_RECLAIMABLE)
<= zone->min_slab_pages)
return 0;
--
1.5.6.5

2009-06-08 13:02:08

by Mel Gorman

[permalink] [raw]
Subject: [PATCH 1/3] Reintroduce zone_reclaim_interval for when zone_reclaim() scans and fails to avoid CPU spinning at 100% on NUMA

On NUMA machines, the administrator can configure zone_reclaim_mode that is a
more targeted form of direct reclaim. On machines with large NUMA distances,
zone_reclaim_mode defaults to 1 meaning that clean unmapped pages will be
reclaimed if the zone watermarks are not being met. The problem is that
zone_reclaim() can be in a situation where it scans excessively without
making progress.

One such situation is where a large tmpfs mount is occupying a large
percentage of memory overall. The pages do not get cleaned or reclaimed by
zone_reclaim(), but the lists are uselessly scanned frequently making the
CPU spin at 100%. The scanning occurs because zone_reclaim() cannot tell
in advance the scan is pointless because the counters do not distinguish
between pagecache pages backed by disk and by RAM. The observation in
the field is that malloc() stalls for a long time (minutes in some cases)
when this situation occurs.

Accounting for ram-backed file pages was considered but not implemented on
the grounds it would be introducing new branches and expensive checks into
the page cache add/remove paths and increase the number of statistics
needed in the zone. As zone_reclaim() failing is currently considered a
corner case, this seemed like overkill. Note, if there are a large number
of reports about CPU spinning at 100% on NUMA that is fixed by disabling
zone_reclaim, then this assumption is false and zone_reclaim() scanning
and failing is not a corner case but a common occurrence.

This patch reintroduces zone_reclaim_interval which was removed by commit
34aa1330f9b3c5783d269851d467326525207422 [zoned vm counters: zone_reclaim:
remove /proc/sys/vm/zone_reclaim_interval] because the zone counters were
considered sufficient to determine in advance if the scan would succeed.
As unsuccessful scans can still occur, zone_reclaim_interval is still
required.

Signed-off-by: Mel Gorman <[email protected]>
---
Documentation/sysctl/vm.txt | 13 +++++++++++++
include/linux/mmzone.h | 9 +++++++++
include/linux/swap.h | 1 +
kernel/sysctl.c | 9 +++++++++
mm/vmscan.c | 22 ++++++++++++++++++++++
5 files changed, 54 insertions(+), 0 deletions(-)

diff --git a/Documentation/sysctl/vm.txt b/Documentation/sysctl/vm.txt
index c302ddf..f9b8db5 100644
--- a/Documentation/sysctl/vm.txt
+++ b/Documentation/sysctl/vm.txt
@@ -52,6 +52,7 @@ Currently, these files are in /proc/sys/vm:
- swappiness
- vfs_cache_pressure
- zone_reclaim_mode
+- zone_reclaim_interval


==============================================================
@@ -620,4 +621,16 @@ Allowing regular swap effectively restricts allocations to the local
node unless explicitly overridden by memory policies or cpuset
configurations.

+================================================================
+
+zone_reclaim_interval:
+
+The time allowed for off node allocations after zone reclaim
+has failed to reclaim enough pages to allow a local allocation.
+
+Time is set in seconds and set by default to 30 seconds.
+
+Reduce the interval if undesired off node allocations occur. However, too
+frequent scans will have a negative impact on off-node allocation performance.
+
============ End of Document =================================
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index a47c879..f1f0fb2 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -337,6 +337,15 @@ struct zone {
atomic_long_t vm_stat[NR_VM_ZONE_STAT_ITEMS];

/*
+ * timestamp (in jiffies) of the last zone_reclaim that scanned
+ * but failed to free enough pages. This is used to avoid repeated
+ * scans when zone_reclaim() is unable to detect in advance that
+ * the scanning is useless. This can happen for example if a zone
+ * has large numbers of clean unmapped file pages on tmpfs
+ */
+ unsigned long zone_reclaim_failure;
+
+ /*
* prev_priority holds the scanning priority for this zone. It is
* defined as the scanning priority at which we achieved our reclaim
* target at the previous try_to_free_pages() or balance_pgdat()
diff --git a/include/linux/swap.h b/include/linux/swap.h
index d476aad..6a71368 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -224,6 +224,7 @@ extern long vm_total_pages;

#ifdef CONFIG_NUMA
extern int zone_reclaim_mode;
+extern int zone_reclaim_interval;
extern int sysctl_min_unmapped_ratio;
extern int sysctl_min_slab_ratio;
extern int zone_reclaim(struct zone *, gfp_t, unsigned int);
diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index b2970d5..cc0623c 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -1192,6 +1192,15 @@ static struct ctl_table vm_table[] = {
.extra1 = &zero,
},
{
+ .ctl_name = CTL_UNNUMBERED,
+ .procname = "zone_reclaim_interval",
+ .data = &zone_reclaim_interval,
+ .maxlen = sizeof(zone_reclaim_interval),
+ .mode = 0644,
+ .proc_handler = &proc_dointvec_jiffies,
+ .strategy = &sysctl_jiffies,
+ },
+ {
.ctl_name = VM_MIN_UNMAPPED,
.procname = "min_unmapped_ratio",
.data = &sysctl_min_unmapped_ratio,
diff --git a/mm/vmscan.c b/mm/vmscan.c
index d254306..ba211c1 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2272,6 +2272,13 @@ int zone_reclaim_mode __read_mostly;
#define RECLAIM_SWAP (1<<2) /* Swap pages out during reclaim */

/*
+ * Minimum time between zone_reclaim() scans that failed. Ordinarily, a
+ * scan will not fail because it will be determined in advance if it can
+ * succeed but this does not always work. See mmzone.h
+ */
+int zone_reclaim_interval __read_mostly = 30*HZ;
+
+/*
* Priority for ZONE_RECLAIM. This determines the fraction of pages
* of a node considered for each zone_reclaim. 4 scans 1/16th of
* a zone.
@@ -2390,6 +2397,11 @@ int zone_reclaim(struct zone *zone, gfp_t gfp_mask, unsigned int order)
<= zone->min_slab_pages)
return 0;

+ /* Do not attempt a scan if scanning failed recently */
+ if (time_before(jiffies,
+ zone->zone_reclaim_failure + zone_reclaim_interval))
+ return 0;
+
if (zone_is_all_unreclaimable(zone))
return 0;

@@ -2414,6 +2426,16 @@ int zone_reclaim(struct zone *zone, gfp_t gfp_mask, unsigned int order)
ret = __zone_reclaim(zone, gfp_mask, order);
zone_clear_flag(zone, ZONE_RECLAIM_LOCKED);

+ if (!ret) {
+ /*
+ * We were unable to reclaim enough pages to stay on node and
+ * unable to detect in advance that the scan would fail. Allow
+ * off node accesses for zone_reclaim_interval jiffies before
+ * trying zone_reclaim() again
+ */
+ zone->zone_reclaim_failure = jiffies;
+ }
+
return ret;
}
#endif
--
1.5.6.5

2009-06-08 13:02:23

by Mel Gorman

[permalink] [raw]
Subject: [PATCH 3/3] Do not unconditionally treat zones that fail zone_reclaim() as full

On NUMA machines, the administrator can configure zone_reclaim_mode that
is a more targeted form of direct reclaim. On machines with large NUMA
distances for example, zone_reclaim_mode defaults to 1 meaning that clean
unmapped pages will be reclaimed if the zone watermarks are not being
met. The problem is that zone_reclaim() failing at all means the zone
gets marked full.

This can cause situations where a zone is usable, but is being skipped
because it has been considered full. Take a situation where a large tmpfs
mount is occupying a large percentage of memory overall. The pages do not
get cleaned or reclaimed by zone_reclaim(), but the zone gets marked full
and the zonelist cache considers them not worth trying in the future.

This patch makes zone_reclaim() return more fine-grained information about
what occurred when zone_reclaim() failed. The zone only gets marked full if
it really is unreclaimable. If it's a case that the scan did not occur or
if enough pages were not reclaimed with the limited reclaim_mode, then the
zone is simply skipped.

There is a side-effect to this patch. Currently, if zone_reclaim()
successfully reclaimed SWAP_CLUSTER_MAX, an allocation attempt would
go ahead. With this patch applied, zone watermarks are rechecked after
zone_reclaim() does some work.

Signed-off-by: Mel Gorman <[email protected]>
---
mm/internal.h | 4 ++++
mm/page_alloc.c | 26 ++++++++++++++++++++++----
mm/vmscan.c | 10 +++++-----
3 files changed, 31 insertions(+), 9 deletions(-)

diff --git a/mm/internal.h b/mm/internal.h
index 987bb03..090c267 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -284,4 +284,8 @@ int __get_user_pages(struct task_struct *tsk, struct mm_struct *mm,
unsigned long start, int len, int flags,
struct page **pages, struct vm_area_struct **vmas);

+#define ZONE_RECLAIM_NOSCAN -2
+#define ZONE_RECLAIM_FULL -1
+#define ZONE_RECLAIM_SOME 0
+#define ZONE_RECLAIM_SUCCESS 1
#endif
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index fe753ec..ce2f684 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1420,20 +1420,38 @@ zonelist_scan:

if (!(alloc_flags & ALLOC_NO_WATERMARKS)) {
unsigned long mark;
+ int ret;
if (alloc_flags & ALLOC_WMARK_MIN)
mark = zone->pages_min;
else if (alloc_flags & ALLOC_WMARK_LOW)
mark = zone->pages_low;
else
mark = zone->pages_high;
- if (!zone_watermark_ok(zone, order, mark,
- classzone_idx, alloc_flags)) {
- if (!zone_reclaim_mode ||
- !zone_reclaim(zone, gfp_mask, order))
+ if (zone_watermark_ok(zone, order, mark,
+ classzone_idx, alloc_flags))
+ goto try_this_zone;
+
+ if (zone_reclaim_mode == 0)
+ goto this_zone_full;
+
+ ret = zone_reclaim(zone, gfp_mask, order);
+ switch (ret) {
+ case ZONE_RECLAIM_NOSCAN:
+ /* did not scan */
+ goto try_next_zone;
+ case ZONE_RECLAIM_FULL:
+ /* scanned but unreclaimable */
goto this_zone_full;
+ default:
+ /* did we reclaim enough */
+ if (!zone_watermark_ok(zone, order,
+ mark, classzone_idx,
+ alloc_flags))
+ goto try_next_zone;
}
}

+try_this_zone:
page = buffered_rmqueue(preferred_zone, zone, order, gfp_mask);
if (page)
break;
diff --git a/mm/vmscan.c b/mm/vmscan.c
index ffe2f32..84cdae2 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2409,7 +2409,7 @@ int zone_reclaim(struct zone *zone, gfp_t gfp_mask, unsigned int order)
if (pagecache_reclaimable <= zone->min_unmapped_pages
&& zone_page_state(zone, NR_SLAB_RECLAIMABLE)
<= zone->min_slab_pages)
- return 0;
+ return ZONE_RECLAIM_NOSCAN;

/* Do not attempt a scan if scanning failed recently */
if (time_before(jiffies,
@@ -2417,13 +2417,13 @@ int zone_reclaim(struct zone *zone, gfp_t gfp_mask, unsigned int order)
return 0;

if (zone_is_all_unreclaimable(zone))
- return 0;
+ return ZONE_RECLAIM_FULL;

/*
* Do not scan if the allocation should not be delayed.
*/
if (!(gfp_mask & __GFP_WAIT) || (current->flags & PF_MEMALLOC))
- return 0;
+ return ZONE_RECLAIM_NOSCAN;

/*
* Only run zone reclaim on the local zone or on zones that do not
@@ -2433,10 +2433,10 @@ int zone_reclaim(struct zone *zone, gfp_t gfp_mask, unsigned int order)
*/
node_id = zone_to_nid(zone);
if (node_state(node_id, N_CPU) && node_id != numa_node_id())
- return 0;
+ return ZONE_RECLAIM_NOSCAN;

if (zone_test_and_set_flag(zone, ZONE_RECLAIM_LOCKED))
- return 0;
+ return ZONE_RECLAIM_NOSCAN;
ret = __zone_reclaim(zone, gfp_mask, order);
zone_clear_flag(zone, ZONE_RECLAIM_LOCKED);

--
1.5.6.5

2009-06-08 13:32:17

by Rik van Riel

[permalink] [raw]
Subject: Re: [PATCH 1/3] Reintroduce zone_reclaim_interval for when zone_reclaim() scans and fails to avoid CPU spinning at 100% on NUMA

Mel Gorman wrote:

> The scanning occurs because zone_reclaim() cannot tell
> in advance the scan is pointless because the counters do not distinguish
> between pagecache pages backed by disk and by RAM.

Yes it can. Since 2.6.27, filesystem backed and swap/ram backed
pages have been living on separate LRU lists. This allows you to
fix the underlying problem, instead of having to add a retry
interval.
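
For reference, a sketch of the split LRU lists as they appear in
include/linux/mmzone.h around 2.6.28; filesystem backed pages go on the
_FILE lists, swap/ram backed pages on the _ANON lists:

enum lru_list {
	LRU_INACTIVE_ANON = LRU_BASE,
	LRU_ACTIVE_ANON = LRU_BASE + LRU_ACTIVE,
	LRU_INACTIVE_FILE = LRU_BASE + LRU_FILE,
	LRU_ACTIVE_FILE = LRU_BASE + LRU_FILE + LRU_ACTIVE,
	LRU_UNEVICTABLE,
	NR_LRU_LISTS
};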

--
All rights reversed.

2009-06-08 13:54:43

by Mel Gorman

[permalink] [raw]
Subject: Re: [PATCH 1/3] Reintroduce zone_reclaim_interval for when zone_reclaim() scans and fails to avoid CPU spinning at 100% on NUMA

On Mon, Jun 08, 2009 at 09:31:09AM -0400, Rik van Riel wrote:
> Mel Gorman wrote:
>
>> The scanning occurs because zone_reclaim() cannot tell
>> in advance the scan is pointless because the counters do not distinguish
>> between pagecache pages backed by disk and by RAM.
>
> Yes it can. Since 2.6.27, filesystem backed and swap/ram backed
> pages have been living on separate LRU lists.

Yes, they're on separate LRU lists but they are not the only pages on those
lists. The tmpfs pages are mixed in together with anonymous pages so we
cannot use NR_*_ANON.

Look at patch 2, where I introduced:

/*
* Work out how many page cache pages we can reclaim in this mode.
*
* NOTE: Ideally, tmpfs pages would be accounted as if they were
* NR_FILE_MAPPED as swap is required to discard those
* pages even when they are clean. However, there is no
* way of quickly identifying the number of tmpfs pages
*/
pagecache_reclaimable = zone_page_state(zone, NR_FILE_PAGES);
if (!(zone_reclaim_mode & RECLAIM_WRITE))
pagecache_reclaimable -= zone_page_state(zone, NR_FILE_DIRTY);
if (!(zone_reclaim_mode & RECLAIM_SWAP))
pagecache_reclaimable -= zone_page_state(zone, NR_FILE_MAPPED);

If the tmpfs pages can be accounted for there, then chances are that patch
1 goes away - at least until some other situation is encountered where
we scan erroneously.

> This allows you to
> fix the underlying problem, instead of having to add a retry
> interval.
>

Which is obviously my preference but after looking around for a bit, I
didn't spot an obvious answer.

--
Mel Gorman
Part-time Phd Student Linux Technology Center
University of Limerick IBM Dublin Software Lab

2009-06-08 14:26:17

by Christoph Lameter

[permalink] [raw]
Subject: Re: [PATCH 2/3] Properly account for the number of page cache pages zone_reclaim() can reclaim

Note that I am not aware of any current user of the advanced zone reclaim
modes that include writeback and swap.

Acked-by: Christoph Lameter <[email protected]>

2009-06-08 14:32:28

by Christoph Lameter

[permalink] [raw]
Subject: Re: [PATCH 3/3] Do not unconditionally treat zones that fail zone_reclaim() as full

Ok this patch addresses a bug in zone reclaim introduced by Paul Jackson
in commit 9276b1bc96a132f4068fdee00983c532f43d3a26. Before that commit
zone reclaim would not mark a zone as full if it failed but simply
continue scanning.

Acked-by: Christoph Lameter <[email protected]>

2009-06-08 14:33:59

by Christoph Lameter

[permalink] [raw]
Subject: Re: [PATCH 1/3] Reintroduce zone_reclaim_interval for when zone_reclaim() scans and fails to avoid CPU spinning at 100% on NUMA

On Mon, 8 Jun 2009, Mel Gorman wrote:

> Yes, they're on separate LRU lists but they are not the only pages on those
> lists. The tmpfs pages are mixed in together with anonymous pages so we
> cannot use NR_*_ANON.

The tmpfs pages are unreclaimable and therefore should not be on the anon
lru.

2009-06-08 14:36:45

by Mel Gorman

[permalink] [raw]
Subject: Re: [PATCH 2/3] Properly account for the number of page cache pages zone_reclaim() can reclaim

On Mon, Jun 08, 2009 at 10:25:27AM -0400, Christoph Lameter wrote:
> Note that I am not aware of any current user for the advanced zone reclaim
> modes that include writeback and swap.
>

Neither am I, but it's the documented behaviour so we might as well act on
it as advertised.

> Acked-by: Christoph Lameter <[email protected]>
>

Thanks.

--
Mel Gorman
Part-time Phd Student Linux Technology Center
University of Limerick IBM Dublin Software Lab

2009-06-08 14:39:12

by Mel Gorman

[permalink] [raw]
Subject: Re: [PATCH 1/3] Reintroduce zone_reclaim_interval for when zone_reclaim() scans and fails to avoid CPU spinning at 100% on NUMA

On Mon, Jun 08, 2009 at 10:33:41AM -0400, Christoph Lameter wrote:
> On Mon, 8 Jun 2009, Mel Gorman wrote:
>
> > Yes, they're on separate LRU lists but they are not the only pages on those
> > lists. The tmpfs pages are mixed in together with anonymous pages so we
> > cannot use NR_*_ANON.
>
> The tmpfs pages are unreclaimable and therefore should not be on the anon
> lru.
>

tmpfs pages can be swap-backed so can be reclaimable. Regardless of what
list they are on, we still need to know how many of them there are if
this patch is to be avoided.
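
They are swap backed in the sense that PageSwapBacked is set on them,
which is what routes them to the anon lists in the first place; from
memory, the helper in include/linux/mm_inline.h looks roughly like:

static inline int page_is_file_cache(struct page *page)
{
	if (PageSwapBacked(page))
		return 0;

	/* The page is page cache backed by a normal filesystem */
	return LRU_FILE;
}

so the distinction exists per-page, but there is no cheap per-zone count
of just the tmpfs pages.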

--
Mel Gorman
Part-time Phd Student Linux Technology Center
University of Limerick IBM Dublin Software Lab

2009-06-08 14:43:54

by Mel Gorman

[permalink] [raw]
Subject: Re: [PATCH 3/3] Do not unconditionally treat zones that fail zone_reclaim() as full

On Mon, Jun 08, 2009 at 10:32:12AM -0400, Christoph Lameter wrote:
> Ok this patch addresses a bug in zone reclaim introduced by Paul Jackson
> in commit 9276b1bc96a132f4068fdee00983c532f43d3a26. Before that commit
> zone reclaim would not mark a zone as full if it failed but simply
> continue scanning.
>

Well spotted. I hadn't checked out the history as to when this problem
was introduced but it looks like this bug was introduced way back around
2.6.19.

> Acked-by: Christoph Lameter <[email protected]>
>

Thanks

--
Mel Gorman
Part-time Phd Student Linux Technology Center
University of Limerick IBM Dublin Software Lab

2009-06-08 14:49:27

by Rik van Riel

[permalink] [raw]
Subject: Re: [PATCH 1/3] Reintroduce zone_reclaim_interval for when zone_reclaim() scans and fails to avoid CPU spinning at 100% on NUMA

Mel Gorman wrote:
> On Mon, Jun 08, 2009 at 09:31:09AM -0400, Rik van Riel wrote:
>> Mel Gorman wrote:
>>
>>> The scanning occurs because zone_reclaim() cannot tell
>>> in advance the scan is pointless because the counters do not distinguish
>>> between pagecache pages backed by disk and by RAM.
>> Yes it can. Since 2.6.27, filesystem backed and swap/ram backed
>> pages have been living on separate LRU lists.
>
> Yes, they're on separate LRU lists but they are not the only pages on those
> lists. The tmpfs pages are mixed in together with anonymous pages so we
> cannot use NR_*_ANON.
>
> Look at patch 2 and where I introduced;

I have to admit I did not read patches 2 and 3 before
replying to the (strange looking, at the time) text
above patch 1.

With that logic from patch 2 in place, patch 1 makes
perfect sense.

Acked-by: Rik van Riel <[email protected]>

--
All rights reversed.

2009-06-08 14:56:18

by Christoph Lameter

[permalink] [raw]
Subject: Re: [PATCH 1/3] Reintroduce zone_reclaim_interval for when zone_reclaim() scans and fails to avoid CPU spinning at 100% on NUMA

On Mon, 8 Jun 2009, Mel Gorman wrote:

> > The tmpfs pages are unreclaimable and therefore should not be on the anon
> > lru.
> >
>
> tmpfs pages can be swap-backed so can be reclaimable. Regardless of what
> list they are on, we still need to know how many of them there are if
> this patch is to be avoided.

If they are reclaimable then why does it matter? They can be pushed out if
you configure zone reclaim to be that aggressive.

2009-06-08 15:11:59

by Mel Gorman

[permalink] [raw]
Subject: Re: [PATCH 1/3] Reintroduce zone_reclaim_interval for when zone_reclaim() scans and fails to avoid CPU spinning at 100% on NUMA

On Mon, Jun 08, 2009 at 10:55:55AM -0400, Christoph Lameter wrote:
> On Mon, 8 Jun 2009, Mel Gorman wrote:
>
> > > The tmpfs pages are unreclaimable and therefore should not be on the anon
> > > lru.
> > >
> >
> > tmpfs pages can be swap-backed so can be reclaimable. Regardless of what
> > list they are on, we still need to know how many of them there are if
> > this patch is to be avoided.
>
> If they are reclaimable then why does it matter? They can be pushed out if
> you configure zone reclaim to be that aggressive.
>

Because they are reclaimable by kswapd or normal direct reclaim but *not*
reclaimable by zone_reclaim() if the zone_reclaim_mode is not configured
appropriately. I briefly considered setting zone_reclaim_mode to 7 instead of
1 by default for large NUMA distances but that has other serious consequences
such as paging in preference to going off-node as a default out-of-box
behaviour.
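
As a reminder, the mode is a bitmask; from mm/vmscan.c:

#define RECLAIM_OFF 0
#define RECLAIM_ZONE (1<<0) /* Run shrink_inactive_list on the zone */
#define RECLAIM_WRITE (1<<1) /* Writeout pages during reclaim */
#define RECLAIM_SWAP (1<<2) /* Swap pages out during reclaim */

so 7 would allow zone reclaim to write pages out and swap them, not just
drop clean ones.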

The point of the patch is that the heuristics that avoid the scan are not
perfect. In the event they are wrong and a useless scan occurs, the kernel's
response should not be to uselessly scan the LRU lists a load more times
while making no progress.

--
Mel Gorman
Part-time Phd Student Linux Technology Center
University of Limerick IBM Dublin Software Lab

2009-06-09 01:58:49

by Fengguang Wu

[permalink] [raw]
Subject: Re: [PATCH 1/3] Reintroduce zone_reclaim_interval for when zone_reclaim() scans and fails to avoid CPU spinning at 100% on NUMA

On Mon, Jun 08, 2009 at 09:01:28PM +0800, Mel Gorman wrote:
> On NUMA machines, the administrator can configure zone_reclaim_mode that is a
> more targeted form of direct reclaim. On machines with large NUMA distances,
> zone_reclaim_mode defaults to 1 meaning that clean unmapped pages will be
> reclaimed if the zone watermarks are not being met. The problem is that
> zone_reclaim() can be in a situation where it scans excessively without
> making progress.
>
> One such situation is where a large tmpfs mount is occupying a large
> percentage of memory overall. The pages do not get cleaned or reclaimed by
> zone_reclaim(), but the lists are uselessly scanned frequently making the
> CPU spin at 100%. The scanning occurs because zone_reclaim() cannot tell
> in advance the scan is pointless because the counters do not distinguish
> between pagecache pages backed by disk and by RAM. The observation in
> the field is that malloc() stalls for a long time (minutes in some cases)
> when this situation occurs.
>
> Accounting for ram-backed file pages was considered but not implemented on
> the grounds it would be introducing new branches and expensive checks into
> the page cache add/remove paths and increase the number of statistics
> needed in the zone. As zone_reclaim() failing is currently considered a
> corner case, this seemed like overkill. Note, if there are a large number
> of reports about CPU spinning at 100% on NUMA that is fixed by disabling
> zone_reclaim, then this assumption is false and zone_reclaim() scanning
> and failing is not a corner case but a common occurrence.
>
> This patch reintroduces zone_reclaim_interval which was removed by commit
> 34aa1330f9b3c5783d269851d467326525207422 [zoned vm counters: zone_reclaim:
> remove /proc/sys/vm/zone_reclaim_interval] because the zone counters were
> considered sufficient to determine in advance if the scan would succeed.
> As unsuccessful scans can still occur, zone_reclaim_interval is still
> required.

Can we avoid the user visible parameter zone_reclaim_interval?

That would mean introducing some heuristics for it. Since the whole point
is to avoid 100% CPU usage, we could record the time taken by this
failed zone reclaim (T) and forbid zone reclaim until (NOW + 100*T).
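
A rough sketch of what I mean; zone_reclaim_disallow_until is a
hypothetical new field in struct zone:

	unsigned long t0 = jiffies;

	ret = __zone_reclaim(zone, gfp_mask, order);
	if (!ret) {
		/* back off for ~100x the time the failed scan cost us */
		zone->zone_reclaim_disallow_until =
					jiffies + 100 * (jiffies - t0);
	}

and then at the top of zone_reclaim():

	if (time_before(jiffies, zone->zone_reclaim_disallow_until))
		return 0;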

Thanks,
Fengguang

2009-06-09 02:26:10

by Fengguang Wu

[permalink] [raw]
Subject: Re: [PATCH 2/3] Properly account for the number of page cache pages zone_reclaim() can reclaim

On Mon, Jun 08, 2009 at 09:01:29PM +0800, Mel Gorman wrote:
> On NUMA machines, the administrator can configure zone_reclaim_mode that
> is a more targeted form of direct reclaim. On machines with large NUMA
> distances for example, zone_reclaim_mode defaults to 1 meaning that clean
> unmapped pages will be reclaimed if the zone watermarks are not being met.
>
> There is a heuristic that determines if the scan is worthwhile but the
> problem is that the heuristic is not being properly applied and is basically
> assuming zone_reclaim_mode is 1 if it is enabled.
>
> This patch makes zone_reclaim() make a better attempt at working out how
> many pages it might be able to reclaim given the current reclaim_mode. If it
> cannot clean pages, then NR_FILE_DIRTY number of pages are not candidates. If
> it cannot swap, then NR_FILE_MAPPED are not. This indirectly addresses tmpfs
> as those pages tend to be dirty as they are not cleaned by pdflush or sync.

No, tmpfs pages are not accounted in NR_FILE_DIRTY because of the
BDI_CAP_NO_ACCT_AND_WRITEBACK bits.
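
Roughly speaking, shmem marks its backing_dev_info with
BDI_CAP_NO_ACCT_AND_WRITEBACK, and the dirty accounting path only bumps
the counter for mappings that account dirty pages, along the lines of:

	if (mapping_cap_account_dirty(mapping)) {
		__inc_zone_page_state(page, NR_FILE_DIRTY);
		task_dirty_inc(current);
	}

so a dirty tmpfs page never shows up in NR_FILE_DIRTY.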

> The ideal would be that the number of tmpfs pages would also be known
> and accounted for like NR_FILE_MAPPED, as swap is required to discard them.
> A means of working this out quickly was not obvious but a comment is added
> noting the problem.

I'd prefer it to be accounted separately rather than mucking up NR_FILE_MAPPED :)

> + int pagecache_reclaimable;
> +
> + /*
> + * Work out how many page cache pages we can reclaim in this mode.
> + *
> + * NOTE: Ideally, tmpfs pages would be accounted as if they were
> + * NR_FILE_MAPPED as swap is required to discard those
> + * pages even when they are clean. However, there is no
> + * way of quickly identifying the number of tmpfs pages
> + */

So can you remove the note on NR_FILE_MAPPED?

> + pagecache_reclaimable = zone_page_state(zone, NR_FILE_PAGES);
> + if (!(zone_reclaim_mode & RECLAIM_WRITE))
> + pagecache_reclaimable -= zone_page_state(zone, NR_FILE_DIRTY);

> + if (!(zone_reclaim_mode & RECLAIM_SWAP))
> + pagecache_reclaimable -= zone_page_state(zone, NR_FILE_MAPPED);

So the "if" can be removed because NR_FILE_MAPPED is not related to swapping?

Thanks,
Fengguang

> /*
> * Zone reclaim reclaims unmapped file backed pages and
> @@ -2391,8 +2406,7 @@ int zone_reclaim(struct zone *zone, gfp_t gfp_mask, unsigned int order)
> * if less than a specified percentage of the zone is used by
> * unmapped file backed pages.
> */
> - if (zone_page_state(zone, NR_FILE_PAGES) -
> - zone_page_state(zone, NR_FILE_MAPPED) <= zone->min_unmapped_pages
> + if (pagecache_reclaimable <= zone->min_unmapped_pages
> && zone_page_state(zone, NR_SLAB_RECLAIMABLE)
> <= zone->min_slab_pages)
> return 0;
> --
> 1.5.6.5

2009-06-09 02:37:51

by Fengguang Wu

[permalink] [raw]
Subject: Re: [PATCH 2/3] Properly account for the number of page cache pages zone_reclaim() can reclaim

On Mon, Jun 08, 2009 at 09:01:29PM +0800, Mel Gorman wrote:
> On NUMA machines, the administrator can configure zone_reclaim_mode that
> is a more targeted form of direct reclaim. On machines with large NUMA
> distances for example, zone_reclaim_mode defaults to 1 meaning that clean
> unmapped pages will be reclaimed if the zone watermarks are not being met.
>
> There is a heuristic that determines if the scan is worthwhile but the
> problem is that the heuristic is not being properly applied and is basically
> assuming zone_reclaim_mode is 1 if it is enabled.
>
> This patch makes zone_reclaim() make a better attempt at working out how
> many pages it might be able to reclaim given the current reclaim_mode. If it
> cannot clean pages, then NR_FILE_DIRTY number of pages are not candidates. If
> it cannot swap, then NR_FILE_MAPPED are not. This indirectly addresses tmpfs
> as those pages tend to be dirty as they are not cleaned by pdflush or sync.
>
> The ideal would be that the number of tmpfs pages would also be known
> and accounted for like NR_FILE_MAPPED, as swap is required to discard them.
> A means of working this out quickly was not obvious but a comment is added
> noting the problem.
>
> Signed-off-by: Mel Gorman <[email protected]>
> ---
> mm/vmscan.c | 18 ++++++++++++++++--
> 1 files changed, 16 insertions(+), 2 deletions(-)
>
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index ba211c1..ffe2f32 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -2380,6 +2380,21 @@ int zone_reclaim(struct zone *zone, gfp_t gfp_mask, unsigned int order)
> {
> int node_id;
> int ret;
> + int pagecache_reclaimable;
> +
> + /*
> + * Work out how many page cache pages we can reclaim in this mode.
> + *
> + * NOTE: Ideally, tmpfs pages would be accounted as if they were
> + * NR_FILE_MAPPED as swap is required to discard those
> + * pages even when they are clean. However, there is no
> + * way of quickly identifying the number of tmpfs pages
> + */
> + pagecache_reclaimable = zone_page_state(zone, NR_FILE_PAGES);
> + if (!(zone_reclaim_mode & RECLAIM_WRITE))
> + pagecache_reclaimable -= zone_page_state(zone, NR_FILE_DIRTY);

> + if (!(zone_reclaim_mode & RECLAIM_SWAP))
> + pagecache_reclaimable -= zone_page_state(zone, NR_FILE_MAPPED);

Your patch seems to conflict with KOSAKI's earlier patch "vmscan: change the
number of the unmapped files in zone reclaim", where he offers a better way
of getting rid of the tmpfs pages:

+ nr_file_pages = zone_page_state(zone, NR_INACTIVE_FILE) +
+ zone_page_state(zone, NR_ACTIVE_FILE);
+ nr_mapped = zone_page_state(zone, NR_FILE_MAPPED);
+ if (likely(nr_file_pages >= nr_mapped))
+ nr_unmapped_file_pages = nr_file_pages - nr_mapped;

if (nr_unmapped_file_pages > zone->min_unmapped_pages) {

Thanks,
Fengguang

> /*
> * Zone reclaim reclaims unmapped file backed pages and
> @@ -2391,8 +2406,7 @@ int zone_reclaim(struct zone *zone, gfp_t gfp_mask, unsigned int order)
> * if less than a specified percentage of the zone is used by
> * unmapped file backed pages.
> */
> - if (zone_page_state(zone, NR_FILE_PAGES) -
> - zone_page_state(zone, NR_FILE_MAPPED) <= zone->min_unmapped_pages
> + if (pagecache_reclaimable <= zone->min_unmapped_pages
> && zone_page_state(zone, NR_SLAB_RECLAIMABLE)
> <= zone->min_slab_pages)
> return 0;
> --
> 1.5.6.5

2009-06-09 03:14:07

by Fengguang Wu

[permalink] [raw]
Subject: Re: [PATCH 3/3] Do not unconditionally treat zones that fail zone_reclaim() as full

On Mon, Jun 08, 2009 at 09:01:30PM +0800, Mel Gorman wrote:
> On NUMA machines, the administrator can configure zone_reclaim_mode that
> is a more targeted form of direct reclaim. On machines with large NUMA
> distances for example, zone_reclaim_mode defaults to 1 meaning that clean
> unmapped pages will be reclaimed if the zone watermarks are not being
> met. The problem is that zone_reclaim() failing at all means the zone
> gets marked full.
>
> This can cause situations where a zone is usable, but is being skipped
> because it has been considered full. Take a situation where a large tmpfs
> mount is occupying a large percentage of memory overall. The pages do not
> get cleaned or reclaimed by zone_reclaim(), but the zone gets marked full
> and the zonelist cache considers them not worth trying in the future.
>
> This patch makes zone_reclaim() return more fine-grained information about
> what occurred when zone_reclaim() failed. The zone only gets marked full if
> it really is unreclaimable. If it's a case that the scan did not occur or
> if enough pages were not reclaimed with the limited reclaim_mode, then the
> zone is simply skipped.
>
> There is a side-effect to this patch. Currently, if zone_reclaim()
> successfully reclaimed SWAP_CLUSTER_MAX, an allocation attempt would
> go ahead. With this patch applied, zone watermarks are rechecked after
> zone_reclaim() does some work.
>
> Signed-off-by: Mel Gorman <[email protected]>

Thanks for making the code a lot more readable :)

Reviewed-by: Wu Fengguang <[email protected]>

> /*
> * Do not scan if the allocation should not be delayed.
> */
> if (!(gfp_mask & __GFP_WAIT) || (current->flags & PF_MEMALLOC))
> - return 0;
> + return ZONE_RECLAIM_NOSCAN;

Why not kill the extra tab?

2009-06-09 07:48:18

by KOSAKI Motohiro

[permalink] [raw]
Subject: Re: [PATCH 3/3] Do not unconditionally treat zones that fail zone_reclaim() as full

Hi

> On NUMA machines, the administrator can configure zone_reclaim_mode that
> is a more targeted form of direct reclaim. On machines with large NUMA
> distances for example, zone_reclaim_mode defaults to 1 meaning that clean
> unmapped pages will be reclaimed if the zone watermarks are not being
> met. The problem is that zone_reclaim() failing at all means the zone
> gets marked full.
>
> This can cause situations where a zone is usable, but is being skipped
> because it has been considered full. Take a situation where a large tmpfs
> mount is occupying a large percentage of memory overall. The pages do not
> get cleaned or reclaimed by zone_reclaim(), but the zone gets marked full
> and the zonelist cache considers them not worth trying in the future.
>
> This patch makes zone_reclaim() return more fine-grained information about
> what occurred when zone_reclaim() failed. The zone only gets marked full if
> it really is unreclaimable. If it's a case that the scan did not occur or
> if enough pages were not reclaimed with the limited reclaim_mode, then the
> zone is simply skipped.
>
> There is a side-effect to this patch. Currently, if zone_reclaim()
> successfully reclaimed SWAP_CLUSTER_MAX, an allocation attempt would
> go ahead. With this patch applied, zone watermarks are rechecked after
> zone_reclaim() does some work.
>
> Signed-off-by: Mel Gorman <[email protected]>
> ---
> mm/internal.h | 4 ++++
> mm/page_alloc.c | 26 ++++++++++++++++++++++----
> mm/vmscan.c | 10 +++++-----
> 3 files changed, 31 insertions(+), 9 deletions(-)
>
> diff --git a/mm/internal.h b/mm/internal.h
> index 987bb03..090c267 100644
> --- a/mm/internal.h
> +++ b/mm/internal.h
> @@ -284,4 +284,8 @@ int __get_user_pages(struct task_struct *tsk, struct mm_struct *mm,
> unsigned long start, int len, int flags,
> struct page **pages, struct vm_area_struct **vmas);
>
> +#define ZONE_RECLAIM_NOSCAN -2
> +#define ZONE_RECLAIM_FULL -1
> +#define ZONE_RECLAIM_SOME 0
> +#define ZONE_RECLAIM_SUCCESS 1
> #endif
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index fe753ec..ce2f684 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -1420,20 +1420,38 @@ zonelist_scan:
>
> if (!(alloc_flags & ALLOC_NO_WATERMARKS)) {
> unsigned long mark;
> + int ret;

Please insert one empty line here.

> if (alloc_flags & ALLOC_WMARK_MIN)
> mark = zone->pages_min;
> else if (alloc_flags & ALLOC_WMARK_LOW)
> mark = zone->pages_low;
> else
> mark = zone->pages_high;
> - if (!zone_watermark_ok(zone, order, mark,
> - classzone_idx, alloc_flags)) {
> - if (!zone_reclaim_mode ||
> - !zone_reclaim(zone, gfp_mask, order))
> + if (zone_watermark_ok(zone, order, mark,
> + classzone_idx, alloc_flags))
> + goto try_this_zone;
> +
> + if (zone_reclaim_mode == 0)
> + goto this_zone_full;
> +
> + ret = zone_reclaim(zone, gfp_mask, order);
> + switch (ret) {
> + case ZONE_RECLAIM_NOSCAN:
> + /* did not scan */
> + goto try_next_zone;
> + case ZONE_RECLAIM_FULL:
> + /* scanned but unreclaimable */
> goto this_zone_full;
> + default:
> + /* did we reclaim enough */
> + if (!zone_watermark_ok(zone, order,
> + mark, classzone_idx,
> + alloc_flags))
> + goto try_next_zone;

hmmm
I haven't caught your meaning yet, sorry.
Could you please explain more?

My points of confusion are:

1.
----
I think your patch almost reverts the essence of Paul's 9276b1bc96a132f4068fdee00983c532f43d3a26.
After your patch is applied, zlc_mark_zone_full() is called only when zone_is_all_unreclaimable()==1
or when memory is stolen after the zone_watermark_ok() recheck.

but zone_is_all_unreclaimable() is very rare on large NUMA machines. Thus
your patch makes the zlc_zone_worth_trying() check worthless.
So, I would prefer simply reverting 9276b1bc rather than introducing
more mess, if necessary.

but is it necessary? why?
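
For context, the zonelist cache check in question looks roughly like
this in mm/page_alloc.c:

static int zlc_zone_worth_trying(struct zonelist *zonelist, struct zoneref *z,
						nodemask_t *allowednodes)
{
	struct zonelist_cache *zlc;
	int i, n;

	zlc = zonelist->zlcache_ptr;
	if (!zlc)
		return 1;

	i = z - zonelist->_zonerefs;
	n = zlc->z_to_n[i];

	/* This zone is worth trying if it is allowed but not full */
	return node_isset(n, *allowednodes) && !test_bit(i, zlc->fullzones);
}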


2.
-----
Why is the following simple switch-case wrong?

case ZONE_RECLAIM_NOSCAN:
goto try_next_zone;
case ZONE_RECLAIM_FULL:
case ZONE_RECLAIM_SOME:
goto this_zone_full;
case ZONE_RECLAIM_SUCCESS:
; /* do nothing */

I mean,
(1) ZONE_RECLAIM_SOME and zone_watermark_ok()==1

are rare.
Is the recheck really worth it?
In my experience, zone_watermark_ok() is not such a fast function.

And,

(2) ZONE_RECLAIM_SUCCESS and zone_watermark_ok()==0

is also rare.
What bad thing are you afraid of?

I guess the high-order allocation case with ZONE_RECLAIM_SUCCESS and
zone_watermark_ok()==0, right?

If so, why does your system make high-order allocations so frequently?

3.
------
your patch does:

1. call zone_reclaim() and return ZONE_RECLAIM_SUCCESS
2. another thread steal memory
3. call zone_watermark_ok() and return 0

Then, jump to try_next_zone

but

1. call zone_reclaim() and return ZONE_RECLAIM_SUCCESS
2. call zone_watermark_ok() and return 1
3. another thread steal memory
4. call buffered_rmqueue() and return NULL

Then, it call zlc_mark_zone_full().

it seems a bit inconsistent.




> }
> }
>
> +try_this_zone:
> page = buffered_rmqueue(preferred_zone, zone, order, gfp_mask);
> if (page)
> break;
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index ffe2f32..84cdae2 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -2409,7 +2409,7 @@ int zone_reclaim(struct zone *zone, gfp_t gfp_mask, unsigned int order)
> if (pagecache_reclaimable <= zone->min_unmapped_pages
> && zone_page_state(zone, NR_SLAB_RECLAIMABLE)
> <= zone->min_slab_pages)
> - return 0;
> + return ZONE_RECLAIM_NOSCAN;
>
> /* Do not attempt a scan if scanning failed recently */
> if (time_before(jiffies,
> @@ -2417,13 +2417,13 @@ int zone_reclaim(struct zone *zone, gfp_t gfp_mask, unsigned int order)
> return 0;
>
> if (zone_is_all_unreclaimable(zone))
> - return 0;
> + return ZONE_RECLAIM_FULL;
>
> /*
> * Do not scan if the allocation should not be delayed.
> */
> if (!(gfp_mask & __GFP_WAIT) || (current->flags & PF_MEMALLOC))
> - return 0;
> + return ZONE_RECLAIM_NOSCAN;
>
> /*
> * Only run zone reclaim on the local zone or on zones that do not
> @@ -2433,10 +2433,10 @@ int zone_reclaim(struct zone *zone, gfp_t gfp_mask, unsigned int order)
> */
> node_id = zone_to_nid(zone);
> if (node_state(node_id, N_CPU) && node_id != numa_node_id())
> - return 0;
> + return ZONE_RECLAIM_NOSCAN;
>
> if (zone_test_and_set_flag(zone, ZONE_RECLAIM_LOCKED))
> - return 0;
> + return ZONE_RECLAIM_NOSCAN;
> ret = __zone_reclaim(zone, gfp_mask, order);
> zone_clear_flag(zone, ZONE_RECLAIM_LOCKED);
>
> --
> 1.5.6.5
>


2009-06-09 07:48:46

by KOSAKI Motohiro

[permalink] [raw]
Subject: Re: [PATCH 1/3] Reintroduce zone_reclaim_interval for when zone_reclaim() scans and fails to avoid CPU spinning at 100% on NUMA

Hi

> On NUMA machines, the administrator can configure zone_reclaim_mode that is a
> more targeted form of direct reclaim. On machines with large NUMA distances,
> zone_reclaim_mode defaults to 1 meaning that clean unmapped pages will be
> reclaimed if the zone watermarks are not being met. The problem is that
> zone_reclaim() can be in a situation where it scans excessively without
> making progress.
>
> One such situation is where a large tmpfs mount is occupying a large
> percentage of memory overall. The pages do not get cleaned or reclaimed by
> zone_reclaim(), but the lists are uselessly scanned frequently making the
> CPU spin at 100%. The scanning occurs because zone_reclaim() cannot tell
> in advance the scan is pointless because the counters do not distinguish
> between pagecache pages backed by disk and by RAM. The observation in
> the field is that malloc() stalls for a long time (minutes in some cases)
> when this situation occurs.
>
> Accounting for ram-backed file pages was considered but not implemented on
> the grounds it would be introducing new branches and expensive checks into
> the page cache add/remove paths and increase the number of statistics
> needed in the zone. As zone_reclaim() failing is currently considered a
> corner case, this seemed like overkill. Note, if there are a large number
> of reports about CPU spinning at 100% on NUMA that is fixed by disabling
> zone_reclaim, then this assumption is false and zone_reclaim() scanning
> and failing is not a corner case but a common occurrence.
>
> This patch reintroduces zone_reclaim_interval which was removed by commit
> 34aa1330f9b3c5783d269851d467326525207422 [zoned vm counters: zone_reclaim:
> remove /proc/sys/vm/zone_reclaim_interval] because the zone counters were
> considered sufficient to determine in advance if the scan would succeed.
> As unsuccessful scans can still occur, zone_reclaim_interval is still
> required.
>
> Signed-off-by: Mel Gorman <[email protected]>
> ---
> Documentation/sysctl/vm.txt | 13 +++++++++++++
> include/linux/mmzone.h | 9 +++++++++
> include/linux/swap.h | 1 +
> kernel/sysctl.c | 9 +++++++++
> mm/vmscan.c | 22 ++++++++++++++++++++++
> 5 files changed, 54 insertions(+), 0 deletions(-)
>
> diff --git a/Documentation/sysctl/vm.txt b/Documentation/sysctl/vm.txt
> index c302ddf..f9b8db5 100644
> --- a/Documentation/sysctl/vm.txt
> +++ b/Documentation/sysctl/vm.txt
> @@ -52,6 +52,7 @@ Currently, these files are in /proc/sys/vm:
> - swappiness
> - vfs_cache_pressure
> - zone_reclaim_mode
> +- zone_reclaim_interval
>
>
> ==============================================================
> @@ -620,4 +621,16 @@ Allowing regular swap effectively restricts allocations to the local
> node unless explicitly overridden by memory policies or cpuset
> configurations.
>
> +================================================================
> +
> +zone_reclaim_interval:
> +
> +The time allowed for off node allocations after zone reclaim
> +has failed to reclaim enough pages to allow a local allocation.
> +
> +Time is set in seconds and set by default to 30 seconds.
> +
> +Reduce the interval if undesired off node allocations occur. However, too
> +frequent scans will have a negative impact on off-node allocation performance.
> +
> ============ End of Document =================================
> diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
> index a47c879..f1f0fb2 100644
> --- a/include/linux/mmzone.h
> +++ b/include/linux/mmzone.h
> @@ -337,6 +337,15 @@ struct zone {
> atomic_long_t vm_stat[NR_VM_ZONE_STAT_ITEMS];
>
> /*
> + * timestamp (in jiffies) of the last zone_reclaim that scanned
> + * but failed to free enough pages. This is used to avoid repeated
> + * scans when zone_reclaim() is unable to detect in advance that
> + * the scanning is useless. This can happen for example if a zone
> + * has large numbers of clean unmapped file pages on tmpfs
> + */
> + unsigned long zone_reclaim_failure;
> +
> + /*
> * prev_priority holds the scanning priority for this zone. It is
> * defined as the scanning priority at which we achieved our reclaim
> * target at the previous try_to_free_pages() or balance_pgdat()
> diff --git a/include/linux/swap.h b/include/linux/swap.h
> index d476aad..6a71368 100644
> --- a/include/linux/swap.h
> +++ b/include/linux/swap.h
> @@ -224,6 +224,7 @@ extern long vm_total_pages;
>
> #ifdef CONFIG_NUMA
> extern int zone_reclaim_mode;
> +extern int zone_reclaim_interval;
> extern int sysctl_min_unmapped_ratio;
> extern int sysctl_min_slab_ratio;
> extern int zone_reclaim(struct zone *, gfp_t, unsigned int);
> diff --git a/kernel/sysctl.c b/kernel/sysctl.c
> index b2970d5..cc0623c 100644
> --- a/kernel/sysctl.c
> +++ b/kernel/sysctl.c
> @@ -1192,6 +1192,15 @@ static struct ctl_table vm_table[] = {
> .extra1 = &zero,
> },
> {
> + .ctl_name = CTL_UNNUMBERED,
> + .procname = "zone_reclaim_interval",
> + .data = &zone_reclaim_interval,
> + .maxlen = sizeof(zone_reclaim_interval),
> + .mode = 0644,
> + .proc_handler = &proc_dointvec_jiffies,
> + .strategy = &sysctl_jiffies,
> + },

hmmm, I think nobody can know the proper interval setting for his own system.
I agree with Wu. It can be hidden.


> + {
> .ctl_name = VM_MIN_UNMAPPED,
> .procname = "min_unmapped_ratio",
> .data = &sysctl_min_unmapped_ratio,
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index d254306..ba211c1 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -2272,6 +2272,13 @@ int zone_reclaim_mode __read_mostly;
> #define RECLAIM_SWAP (1<<2) /* Swap pages out during reclaim */
>
> /*
> + * Minimum time between zone_reclaim() scans that failed. Ordinarily, a
> + * scan will not fail because it will be determined in advance if it can
> + * succeed but this does not always work. See mmzone.h
> + */
> +int zone_reclaim_interval __read_mostly = 30*HZ;
> +
> +/*
> * Priority for ZONE_RECLAIM. This determines the fraction of pages
> * of a node considered for each zone_reclaim. 4 scans 1/16th of
> * a zone.
> @@ -2390,6 +2397,11 @@ int zone_reclaim(struct zone *zone, gfp_t gfp_mask, unsigned int order)
> <= zone->min_slab_pages)
> return 0;
>
> + /* Do not attempt a scan if scanning failed recently */
> + if (time_before(jiffies,
> + zone->zone_reclaim_failure + zone_reclaim_interval))
> + return 0;
> +
> if (zone_is_all_unreclaimable(zone))
> return 0;
>
> @@ -2414,6 +2426,16 @@ int zone_reclaim(struct zone *zone, gfp_t gfp_mask, unsigned int order)
> ret = __zone_reclaim(zone, gfp_mask, order);
> zone_clear_flag(zone, ZONE_RECLAIM_LOCKED);
>
> + if (!ret) {
> + /*
> + * We were unable to reclaim enough pages to stay on node and
> + * unable to detect in advance that the scan would fail. Allow
> + * off node accesses for zone_reclaim_interval jiffies before
> + * trying zone_reclaim() again
> + */
> + zone->zone_reclaim_failure = jiffies;

Oops, this simple assignment doesn't take jiffies wraparound into account.
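
For reference, the wraparound-safe helpers in include/linux/jiffies.h
are roughly:

#define time_after(a,b) \
	(typecheck(unsigned long, a) && \
	 typecheck(unsigned long, b) && \
	 ((long)(b) - (long)(a) < 0))
#define time_before(a,b) time_after(b,a)

They only behave while the stored stamp is less than half the jiffies
range old; a zone_reclaim_failure stamp that is never refreshed can
eventually make time_before() spuriously true and suppress scans that
should happen.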


> + }
> +
> return ret;
> }
> #endif
> --
> 1.5.6.5
>


2009-06-09 08:08:58

by Mel Gorman

[permalink] [raw]
Subject: Re: [PATCH 1/3] Reintroduce zone_reclaim_interval for when zone_reclaim() scans and fails to avoid CPU spinning at 100% on NUMA

On Mon, Jun 08, 2009 at 10:48:16AM -0400, Rik van Riel wrote:
> Mel Gorman wrote:
>> On Mon, Jun 08, 2009 at 09:31:09AM -0400, Rik van Riel wrote:
>>> Mel Gorman wrote:
>>>
>>>> The scanning occurs because zone_reclaim() cannot tell
>>>> in advance the scan is pointless because the counters do not distinguish
>>>> between pagecache pages backed by disk and by RAM.
>>> Yes it can. Since 2.6.27, filesystem backed and swap/ram backed
>>> pages have been living on separate LRU lists.
>>
>> Yes, they're on separate LRU lists but they are not the only pages on those
>> lists. The tmpfs pages are mixed in together with anonymous pages so we
>> cannot use NR_*_ANON.
>>
>> Look at patch 2 and where I introduced;
>
> I have to admit I did not read patches 2 and 3 before
> replying to the (strange looking, at the time) text
> above patch 1.
>

Sorry about that. The ordering of the patches was "patch that fixes the
bug, patch that addresses expectations, and patch that fixes an imaginary
bug but that makes sense". If it were a real patchset, patch 2 would have
come first.

> With that logic from patch 2 in place, patch 1 makes
> perfect sense.
>
> Acked-by: Rik van Riel <[email protected]>
>

Thanks

--
Mel Gorman
Part-time Phd Student Linux Technology Center
University of Limerick IBM Dublin Software Lab

2009-06-09 08:14:40

by Mel Gorman

[permalink] [raw]
Subject: Re: [PATCH 1/3] Reintroduce zone_reclaim_interval for when zone_reclaim() scans and fails to avoid CPU spinning at 100% on NUMA

On Tue, Jun 09, 2009 at 09:58:22AM +0800, Wu Fengguang wrote:
> On Mon, Jun 08, 2009 at 09:01:28PM +0800, Mel Gorman wrote:
> > On NUMA machines, the administrator can configure zone_reclaim_mode that is a
> > more targeted form of direct reclaim. On machines with large NUMA distances,
> > zone_reclaim_mode defaults to 1 meaning that clean unmapped pages will be
> > reclaimed if the zone watermarks are not being met. The problem is that
> > zone_reclaim() can be in a situation where it scans excessively without
> > making progress.
> >
> > One such situation is where a large tmpfs mount is occupying a large
> > percentage of memory overall. The pages do not get cleaned or reclaimed by
> > zone_reclaim(), but the lists are uselessly scanned frequently making the
> > CPU spin at 100%. The scanning occurs because zone_reclaim() cannot tell
> > in advance the scan is pointless because the counters do not distinguish
> > between pagecache pages backed by disk and by RAM. The observation in
> > the field is that malloc() stalls for a long time (minutes in some cases)
> > when this situation occurs.
> >
> > Accounting for ram-backed file pages was considered but not implemented on
> > the grounds it would be introducing new branches and expensive checks into
> > the page cache add/remove paths and increase the number of statistics
> > needed in the zone. As zone_reclaim() failing is currently considered a
> > corner case, this seemed like overkill. Note, if there are a large number
> > of reports about CPU spinning at 100% on NUMA that is fixed by disabling
> > zone_reclaim, then this assumption is false and zone_reclaim() scanning
> > and failing is not a corner case but a common occurrence.
> >
> > This patch reintroduces zone_reclaim_interval which was removed by commit
> > 34aa1330f9b3c5783d269851d467326525207422 [zoned vm counters: zone_reclaim:
> > remove /proc/sys/vm/zone_reclaim_interval] because the zone counters were
> > considered sufficient to determine in advance if the scan would succeed.
> > As unsuccessful scans can still occur, zone_reclaim_interval is still
> > required.
>
> Can we avoid the user visible parameter zone_reclaim_interval?
>

You could, but then there is no way of disabling it by setting it to 0
either. I can't imagine why but the desired behaviour might really be to
spin and never go off-node unless there is no other option. They might
want to set it to 0 for example when determining what the right value for
zone_reclaim_mode is for their workloads.

> That would mean introducing some heuristics for it.

I suspect the vast majority of users will ignore it unless they are running
zone_reclaim_mode at the same time, and even then will probably just leave
it at 30, as an LRU scan every 30 seconds worst case is not going to show up
on many profiles.

> Since the whole point
> is to avoid 100% CPU usage, we could record the time taken by this
> failed zone reclaim (T) and forbid zone reclaim until (NOW + 100*T).
>

i.e. just fix it internally at 100 seconds? How is that better than
having an obscure tunable? I think if this heuristic exists at all, it's
important that an administrator be able to turn it off if absolutely
necessary and so something must be user-visible.

--
Mel Gorman
Part-time Phd Student Linux Technology Center
University of Limerick IBM Dublin Software Lab

2009-06-09 08:18:38

by Mel Gorman

[permalink] [raw]
Subject: Re: [PATCH 1/3] Reintroduce zone_reclaim_interval for when zone_reclaim() scans and fails to avoid CPU spinning at 100% on NUMA

On Tue, Jun 09, 2009 at 04:48:28PM +0900, KOSAKI Motohiro wrote:
> Hi
>
> > On NUMA machines, the administrator can configure zone_reclaim_mode that is a
> > more targeted form of direct reclaim. On machines with large NUMA distances,
> > zone_reclaim_mode defaults to 1 meaning that clean unmapped pages will be
> > reclaimed if the zone watermarks are not being met. The problem is that
> > zone_reclaim() can be in a situation where it scans excessively without
> > making progress.
> >
> > One such situation is where a large tmpfs mount is occupying a large
> > percentage of memory overall. The pages do not get cleaned or reclaimed by
> > zone_reclaim(), but the lists are uselessly scanned frequently making the
> > CPU spin at 100%. The scanning occurs because zone_reclaim() cannot tell
> > in advance the scan is pointless because the counters do not distinguish
> > between pagecache pages backed by disk and by RAM. The observation in
> > the field is that malloc() stalls for a long time (minutes in some cases)
> > when this situation occurs.
> >
> > Accounting for ram-backed file pages was considered but not implemented on
> > the grounds it would be introducing new branches and expensive checks into
> > the page cache add/remove paths and increase the number of statistics
> > needed in the zone. As zone_reclaim() failing is currently considered a
> > corner case, this seemed like overkill. Note, if there are a large number
> > of reports about CPU spinning at 100% on NUMA that is fixed by disabling
> > zone_reclaim, then this assumption is false and zone_reclaim() scanning
> > and failing is not a corner case but a common occurrence.
> >
> > This patch reintroduces zone_reclaim_interval which was removed by commit
> > 34aa1330f9b3c5783d269851d467326525207422 [zoned vm counters: zone_reclaim:
> > remove /proc/sys/vm/zone_reclaim_interval] because the zone counters were
> > considered sufficient to determine in advance if the scan would succeed.
> > As unsuccessful scans can still occur, zone_reclaim_interval is still
> > required.
> >
> > Signed-off-by: Mel Gorman <[email protected]>
> > ---
> > Documentation/sysctl/vm.txt | 13 +++++++++++++
> > include/linux/mmzone.h | 9 +++++++++
> > include/linux/swap.h | 1 +
> > kernel/sysctl.c | 9 +++++++++
> > mm/vmscan.c | 22 ++++++++++++++++++++++
> > 5 files changed, 54 insertions(+), 0 deletions(-)
> >
> > diff --git a/Documentation/sysctl/vm.txt b/Documentation/sysctl/vm.txt
> > index c302ddf..f9b8db5 100644
> > --- a/Documentation/sysctl/vm.txt
> > +++ b/Documentation/sysctl/vm.txt
> > @@ -52,6 +52,7 @@ Currently, these files are in /proc/sys/vm:
> > - swappiness
> > - vfs_cache_pressure
> > - zone_reclaim_mode
> > +- zone_reclaim_interval
> >
> >
> > ==============================================================
> > @@ -620,4 +621,16 @@ Allowing regular swap effectively restricts allocations to the local
> > node unless explicitly overridden by memory policies or cpuset
> > configurations.
> >
> > +================================================================
> > +
> > +zone_reclaim_interval:
> > +
> > +The time allowed for off node allocations after zone reclaim
> > +has failed to reclaim enough pages to allow a local allocation.
> > +
> > +Time is set in seconds and set by default to 30 seconds.
> > +
> > +Reduce the interval if undesired off node allocations occur. However, too
> > +frequent scans will have a negative impact on off-node allocation performance.
> > +
> > ============ End of Document =================================
> > diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
> > index a47c879..f1f0fb2 100644
> > --- a/include/linux/mmzone.h
> > +++ b/include/linux/mmzone.h
> > @@ -337,6 +337,15 @@ struct zone {
> > atomic_long_t vm_stat[NR_VM_ZONE_STAT_ITEMS];
> >
> > /*
> > + * timestamp (in jiffies) of the last zone_reclaim that scanned
> > + * but failed to free enough pages. This is used to avoid repeated
> > + * scans when zone_reclaim() is unable to detect in advance that
> > + * the scanning is useless. This can happen for example if a zone
> > + * has large numbers of clean unmapped file pages on tmpfs
> > + */
> > + unsigned long zone_reclaim_failure;
> > +
> > + /*
> > * prev_priority holds the scanning priority for this zone. It is
> > * defined as the scanning priority at which we achieved our reclaim
> > * target at the previous try_to_free_pages() or balance_pgdat()
> > diff --git a/include/linux/swap.h b/include/linux/swap.h
> > index d476aad..6a71368 100644
> > --- a/include/linux/swap.h
> > +++ b/include/linux/swap.h
> > @@ -224,6 +224,7 @@ extern long vm_total_pages;
> >
> > #ifdef CONFIG_NUMA
> > extern int zone_reclaim_mode;
> > +extern int zone_reclaim_interval;
> > extern int sysctl_min_unmapped_ratio;
> > extern int sysctl_min_slab_ratio;
> > extern int zone_reclaim(struct zone *, gfp_t, unsigned int);
> > diff --git a/kernel/sysctl.c b/kernel/sysctl.c
> > index b2970d5..cc0623c 100644
> > --- a/kernel/sysctl.c
> > +++ b/kernel/sysctl.c
> > @@ -1192,6 +1192,15 @@ static struct ctl_table vm_table[] = {
> > .extra1 = &zero,
> > },
> > {
> > + .ctl_name = CTL_UNNUMBERED,
> > + .procname = "zone_reclaim_interval",
> > + .data = &zone_reclaim_interval,
> > + .maxlen = sizeof(zone_reclaim_interval),
> > + .mode = 0644,
> > + .proc_handler = &proc_dointvec_jiffies,
> > + .strategy = &sysctl_jiffies,
> > + },
>
> hmmm, I think nobody can know the proper interval setting for his own system.
> I agree with Wu. It can be hidden.
>

For the few users that care, I expect the majority of those will choose
either 0 or the default value of 30. They might want to alter this while
setting zone_reclaim_mode if they don't understand the different values
it can have, for example.

My preference would be that this not exist at all but the
scan-avoidance-heuristic has to be perfect to allow that.

>
> > + {
> > .ctl_name = VM_MIN_UNMAPPED,
> > .procname = "min_unmapped_ratio",
> > .data = &sysctl_min_unmapped_ratio,
> > diff --git a/mm/vmscan.c b/mm/vmscan.c
> > index d254306..ba211c1 100644
> > --- a/mm/vmscan.c
> > +++ b/mm/vmscan.c
> > @@ -2272,6 +2272,13 @@ int zone_reclaim_mode __read_mostly;
> > #define RECLAIM_SWAP (1<<2) /* Swap pages out during reclaim */
> >
> > /*
> > + * Minimum time between zone_reclaim() scans that failed. Ordinarily, a
> > + * scan will not fail because it will be determined in advance if it can
> > * succeed but this does not always work. See mmzone.h
> > + */
> > +int zone_reclaim_interval __read_mostly = 30*HZ;
> > +
> > +/*
> > * Priority for ZONE_RECLAIM. This determines the fraction of pages
> > * of a node considered for each zone_reclaim. 4 scans 1/16th of
> > * a zone.
> > @@ -2390,6 +2397,11 @@ int zone_reclaim(struct zone *zone, gfp_t gfp_mask, unsigned int order)
> > <= zone->min_slab_pages)
> > return 0;
> >
> > + /* Do not attempt a scan if scanning failed recently */
> > + if (time_before(jiffies,
> > + zone->zone_reclaim_failure + zone_reclaim_interval))
> > + return 0;
> > +
> > if (zone_is_all_unreclaimable(zone))
> > return 0;
> >
> > @@ -2414,6 +2426,16 @@ int zone_reclaim(struct zone *zone, gfp_t gfp_mask, unsigned int order)
> > ret = __zone_reclaim(zone, gfp_mask, order);
> > zone_clear_flag(zone, ZONE_RECLAIM_LOCKED);
> >
> > + if (!ret) {
> > + /*
> > + * We were unable to reclaim enough pages to stay on node and
> > + * unable to detect in advance that the scan would fail. Allow
> > + * off node accesses for zone_reclaim_interval jiffies before
> > + * trying zone_reclaim() again
> > + */
> > + zone->zone_reclaim_failure = jiffies;
>
> Oops, this simple assignment doesn't account for jiffies wraparound.
>

Here it is just recording the jiffies value. The real smarts with the counter
use time_before(), which I assumed could handle jiffies wraparound. Even
if it doesn't, the consequence is that one scan will occur that could have
been avoided around the time of the jiffies wraparound. The value will then
be reset and it will be fine.
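
As a minimal sketch of the pattern in question (not part of the patch; the
helper name is invented here), time_before() compares its arguments with a
signed subtraction, so it stays correct across a jiffies wrap as long as the
two timestamps are within LONG_MAX/2 ticks of each other:

	/* Sketch only: true if the last failed scan was recent enough
	 * that another scan would likely be a waste of time */
	static int zone_reclaim_recently_failed(struct zone *zone)
	{
		return time_before(jiffies, zone->zone_reclaim_failure +
						zone_reclaim_interval);
	}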

>
> > + }
> > +
> > return ret;
> > }
> > #endif

--
Mel Gorman
Part-time Phd Student Linux Technology Center
University of Limerick IBM Dublin Software Lab

2009-06-09 08:20:04

by KOSAKI Motohiro

[permalink] [raw]
Subject: Re: [PATCH 2/3] Properly account for the number of page cache pages zone_reclaim() can reclaim

Hi

> On NUMA machines, the administrator can configure zone_reclaim_mode that
> is a more targeted form of direct reclaim. On machines with large NUMA
> distances, for example, zone_reclaim_mode defaults to 1, meaning that clean
> unmapped pages will be reclaimed if the zone watermarks are not being met.
>
> There is a heuristic that determines if the scan is worthwhile but the
> problem is that the heuristic is not being properly applied and is basically
> assuming zone_reclaim_mode is 1 if it is enabled.
>
> This patch makes zone_reclaim() make a better attempt at working out how
> many pages it might be able to reclaim given the current reclaim_mode. If it
> cannot clean pages, then NR_FILE_DIRTY number of pages are not candidates. If
> it cannot swap, then NR_FILE_MAPPED are not. This indirectly addresses tmpfs
> as those pages tend to be dirty as they are not cleaned by pdflush or sync.
>
> The ideal would be that the number of tmpfs pages would also be known
> and accounted for, like NR_FILE_MAPPED, as swap is required to discard them.
> A means of working this out quickly was not obvious but a comment is added
> noting the problem.
>
> Signed-off-by: Mel Gorman <[email protected]>
> ---
> mm/vmscan.c | 18 ++++++++++++++++--
> 1 files changed, 16 insertions(+), 2 deletions(-)
>
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index ba211c1..ffe2f32 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -2380,6 +2380,21 @@ int zone_reclaim(struct zone *zone, gfp_t gfp_mask, unsigned int order)
> {
> int node_id;
> int ret;
> + int pagecache_reclaimable;
> +
> + /*
> + * Work out how many page cache pages we can reclaim in this mode.
> + *
> + * NOTE: Ideally, tmpfs pages would be accounted as if they were
> + * NR_FILE_MAPPED as swap is required to discard those
> + * pages even when they are clean. However, there is no
> + * way of quickly identifying the number of tmpfs pages
> + */

I think you and I are tackling the same issue.
Please see vmscan-change-the-number-of-the-unmapped-files-in-zone-reclaim.patch in -mm.

My intention was: tmpfs pages and swap cache increase NR_FILE_PAGES,
but they can't be reclaimed when zone_reclaim_mode==1.

So I decided to use the following calculation.

+ nr_unmapped_file_pages = zone_page_state(zone, NR_INACTIVE_FILE) +
+ zone_page_state(zone, NR_ACTIVE_FILE) -
+ zone_page_state(zone, NR_FILE_MAPPED);


> + pagecache_reclaimable = zone_page_state(zone, NR_FILE_PAGES);
> + if (!(zone_reclaim_mode & RECLAIM_WRITE))
> + pagecache_reclaimable -= zone_page_state(zone, NR_FILE_DIRTY);
> + if (!(zone_reclaim_mode & RECLAIM_SWAP))
> + pagecache_reclaimable -= zone_page_state(zone, NR_FILE_MAPPED);

if you hope to solve the tmpfs issue, RECLAIM_WRITE/RECLAIM_SWAP are unrelated, I think.
Plus, could you please see vmscan-zone_reclaim-use-may_swap.patch in -mm?
It improves RECLAIM_SWAP in another way.



>
> /*
> * Zone reclaim reclaims unmapped file backed pages and
> @@ -2391,8 +2406,7 @@ int zone_reclaim(struct zone *zone, gfp_t gfp_mask, unsigned int order)
> * if less than a specified percentage of the zone is used by
> * unmapped file backed pages.
> */
> - if (zone_page_state(zone, NR_FILE_PAGES) -
> - zone_page_state(zone, NR_FILE_MAPPED) <= zone->min_unmapped_pages
> + if (pagecache_reclaimable <= zone->min_unmapped_pages
> && zone_page_state(zone, NR_SLAB_RECLAIMABLE)
> <= zone->min_slab_pages)
> return 0;
> --
> 1.5.6.5
>


2009-06-09 08:25:58

by Fengguang Wu

[permalink] [raw]
Subject: Re: [PATCH 1/3] Reintroduce zone_reclaim_interval for when zone_reclaim() scans and fails to avoid CPU spinning at 100% on NUMA

On Tue, Jun 09, 2009 at 04:14:25PM +0800, Mel Gorman wrote:
> On Tue, Jun 09, 2009 at 09:58:22AM +0800, Wu Fengguang wrote:
> > On Mon, Jun 08, 2009 at 09:01:28PM +0800, Mel Gorman wrote:
> > > On NUMA machines, the administrator can configure zone_reclaim_mode that is a
> > > more targeted form of direct reclaim. On machines with large NUMA distances,
> > > zone_reclaim_mode defaults to 1 meaning that clean unmapped pages will be
> > > reclaimed if the zone watermarks are not being met. The problem is that
> > > zone_reclaim() can be in a situation where it scans excessively without
> > > making progress.
> > >
> > > One such situation is where a large tmpfs mount is occupying a large
> > > percentage of memory overall. The pages do not get cleaned or reclaimed by
> > > zone_reclaim(), but the lists are uselessly scanned frequently, making the
> > > CPU spin at 100%. The scanning occurs because zone_reclaim() cannot tell
> > > in advance the scan is pointless because the counters do not distinguish
> > > between pagecache pages backed by disk and by RAM. The observation in
> > > the field is that malloc() stalls for a long time (minutes in some cases)
> > > when this situation occurs.
> > >
> > > Accounting for ram-backed file pages was considered but not implemented on
> > > the grounds it would be introducing new branches and expensive checks into
> > > the page cache add/remove patches and increase the number of statistics
> > > needed in the zone. As zone_reclaim() failing is currently considered a
> > > corner case, this seemed like overkill. Note, if there are a large number
> > > of reports about CPU spinning at 100% on NUMA that is fixed by disabling
> > > zone_reclaim, then this assumption is false and zone_reclaim() scanning
> > > and failing is not a corner case but a common occurrence.
> > >
> > > This patch reintroduces zone_reclaim_interval which was removed by commit
> > > 34aa1330f9b3c5783d269851d467326525207422 [zoned vm counters: zone_reclaim:
> > > remove /proc/sys/vm/zone_reclaim_interval] because the zone counters were
> > > considered sufficient to determine in advance if the scan would succeed.
> > > As unsuccessful scans can still occur, zone_reclaim_interval is still
> > > required.
> >
> > Can we avoid the user visible parameter zone_reclaim_interval?
> >
>
> You could, but then there is no way of disabling it by setting it to 0
> either. I can't imagine why but the desired behaviour might really be to
> spin and never go off-node unless there is no other option. They might
> want to set it to 0 for example when determining what the right value for
> zone_reclaim_mode is for their workloads.
>
> > That means to introduce some heuristics for it.
>
> I suspect the vast majority of users will ignore it unless they are running
> zone_reclaim_mode at the same time, and even then will probably just leave
> it at 30, as an LRU scan every 30 seconds worst case is not going to show up
> on many profiles.
>
> > Since the whole point
> > is to avoid 100% CPU usage, we can take down the time used for this
> > failed zone reclaim (T) and forbid zone reclaim until (NOW + 100*T).
> >
>
> i.e. just fix it internally at 100 seconds? How is that better than
> having an obscure tunable? I think if this heuristic exists at all, it's
> important that an administrator be able to turn it off if absolutely
> necessary, and so something must be user-visible.

That 100*T doesn't mean 100 seconds. It means keeping CPU usage under 1%:
after busy scanning for time T, relax for 100*T.

2009-06-09 08:27:48

by Mel Gorman

[permalink] [raw]
Subject: Re: [PATCH 2/3] Properly account for the number of page cache pages zone_reclaim() can reclaim

On Tue, Jun 09, 2009 at 10:25:49AM +0800, Wu Fengguang wrote:
> On Mon, Jun 08, 2009 at 09:01:29PM +0800, Mel Gorman wrote:
> > On NUMA machines, the administrator can configure zone_reclaim_mode that
> > is a more targeted form of direct reclaim. On machines with large NUMA
> > distances, for example, zone_reclaim_mode defaults to 1, meaning that clean
> > unmapped pages will be reclaimed if the zone watermarks are not being met.
> >
> > There is a heuristic that determines if the scan is worthwhile but the
> > problem is that the heuristic is not being properly applied and is basically
> > assuming zone_reclaim_mode is 1 if it is enabled.
> >
> > This patch makes zone_reclaim() make a better attempt at working out how
> > many pages it might be able to reclaim given the current reclaim_mode. If it
> > cannot clean pages, then NR_FILE_DIRTY number of pages are not candidates. If
> > it cannot swap, then NR_FILE_MAPPED are not. This indirectly addresses tmpfs
> > as those pages tend to be dirty as they are not cleaned by pdflush or sync.
>
> No, tmpfs pages are not accounted in NR_FILE_DIRTY because of the
> BDI_CAP_NO_ACCT_AND_WRITEBACK bits.
>

Ok, that explains why the dirty page count was not as high as I was
expecting. Thanks.

> > The ideal would be that the number of tmpfs pages would also be known
> > and accounted for, like NR_FILE_MAPPED, as swap is required to discard them.
> > A means of working this out quickly was not obvious but a comment is added
> > noting the problem.
>
> I'd rather prefer it be accounted separately than to muck up NR_FILE_MAPPED :)
>

Maybe I used a poor choice of words. What I meant was that the ideal would
be that we had a separate count for tmpfs pages. As tmpfs pages and mapped
pages both have to be unmapped and potentially swapped, they are "like" each
other with respect to the zone_reclaim_mode and how it behaves. We would end
up with something like

pagecache_reclaimable -= zone_page_state(zone, NR_FILE_MAPPED);
pagecache_reclaimable -= zone_page_state(zone, NR_FILE_TMPFS);

> > + int pagecache_reclaimable;
> > +
> > + /*
> > + * Work out how many page cache pages we can reclaim in this mode.
> > + *
> > + * NOTE: Ideally, tmpfs pages would be accounted as if they were
> > + * NR_FILE_MAPPED as swap is required to discard those
> > + * pages even when they are clean. However, there is no
> > + * way of quickly identifying the number of tmpfs pages
> > + */
>
> So can you remove the note on NR_FILE_MAPPED?
>

Why would I remove the note? I can alter the wording but the intention is
to show we cannot count the number of tmpfs pages quickly and it would be
nice if we could. Maybe this is clearer?

Note: Ideally tmpfs pages would be accounted for as NR_FILE_TMPFS or
similar and treated similarly to NR_FILE_MAPPED, as both require
unmapping from page tables and potentially swap to reclaim.
However, no such counter exists.

> > + pagecache_reclaimable = zone_page_state(zone, NR_FILE_PAGES);
> > + if (!(zone_reclaim_mode & RECLAIM_WRITE))
> > + pagecache_reclaimable -= zone_page_state(zone, NR_FILE_DIRTY);
>
> > + if (!(zone_reclaim_mode & RECLAIM_SWAP))
> > + pagecache_reclaimable -= zone_page_state(zone, NR_FILE_MAPPED);
>
> So the "if" can be removed because NR_FILE_MAPPED is not related to swapping?
>

It's partially related with respect to what zone_reclaim() is doing.
Once something is mapped, we need RECLAIM_SWAP set on the
zone_reclaim_mode to do anything useful with them.

> Thanks,
> Fengguang
>
> > /*
> > * Zone reclaim reclaims unmapped file backed pages and
> > @@ -2391,8 +2406,7 @@ int zone_reclaim(struct zone *zone, gfp_t gfp_mask, unsigned int order)
> > * if less than a specified percentage of the zone is used by
> > * unmapped file backed pages.
> > */
> > - if (zone_page_state(zone, NR_FILE_PAGES) -
> > - zone_page_state(zone, NR_FILE_MAPPED) <= zone->min_unmapped_pages
> > + if (pagecache_reclaimable <= zone->min_unmapped_pages
> > && zone_page_state(zone, NR_SLAB_RECLAIMABLE)
> > <= zone->min_slab_pages)
> > return 0;
> > --
> > 1.5.6.5
>

--
Mel Gorman
Part-time Phd Student Linux Technology Center
University of Limerick IBM Dublin Software Lab

2009-06-09 08:32:08

by Mel Gorman

[permalink] [raw]
Subject: Re: [PATCH 1/3] Reintroduce zone_reclaim_interval for when zone_reclaim() scans and fails to avoid CPU spinning at 100% on NUMA

On Tue, Jun 09, 2009 at 04:25:39PM +0800, Wu Fengguang wrote:
> On Tue, Jun 09, 2009 at 04:14:25PM +0800, Mel Gorman wrote:
> > On Tue, Jun 09, 2009 at 09:58:22AM +0800, Wu Fengguang wrote:
> > > On Mon, Jun 08, 2009 at 09:01:28PM +0800, Mel Gorman wrote:
> > > > On NUMA machines, the administrator can configure zone_reclaim_mode that is a
> > > > more targeted form of direct reclaim. On machines with large NUMA distances,
> > > > zone_reclaim_mode defaults to 1 meaning that clean unmapped pages will be
> > > > reclaimed if the zone watermarks are not being met. The problem is that
> > > > zone_reclaim() can be in a situation where it scans excessively without
> > > > making progress.
> > > >
> > > > One such situation is where a large tmpfs mount is occupying a large
> > > > percentage of memory overall. The pages do not get cleaned or reclaimed by
> > > > zone_reclaim(), but the lists are uselessly scanned frequently, making the
> > > > CPU spin at 100%. The scanning occurs because zone_reclaim() cannot tell
> > > > in advance the scan is pointless because the counters do not distinguish
> > > > between pagecache pages backed by disk and by RAM. The observation in
> > > > the field is that malloc() stalls for a long time (minutes in some cases)
> > > > when this situation occurs.
> > > >
> > > > Accounting for ram-backed file pages was considered but not implemented on
> > > > the grounds it would be introducing new branches and expensive checks into
> > > > the page cache add/remove patches and increase the number of statistics
> > > > needed in the zone. As zone_reclaim() failing is currently considered a
> > > > corner case, this seemed like overkill. Note, if there are a large number
> > > > of reports about CPU spinning at 100% on NUMA that is fixed by disabling
> > > > zone_reclaim, then this assumption is false and zone_reclaim() scanning
> > > > and failing is not a corner case but a common occurrence.
> > > >
> > > > This patch reintroduces zone_reclaim_interval which was removed by commit
> > > > 34aa1330f9b3c5783d269851d467326525207422 [zoned vm counters: zone_reclaim:
> > > > remove /proc/sys/vm/zone_reclaim_interval] because the zone counters were
> > > > considered sufficient to determine in advance if the scan would succeed.
> > > > As unsuccessful scans can still occur, zone_reclaim_interval is still
> > > > required.
> > >
> > > Can we avoid the user visible parameter zone_reclaim_interval?
> > >
> >
> > You could, but then there is no way of disabling it by setting it to 0
> > either. I can't imagine why but the desired behaviour might really be to
> > spin and never go off-node unless there is no other option. They might
> > want to set it to 0 for example when determining what the right value for
> > zone_reclaim_mode is for their workloads.
> >
> > > That means to introduce some heuristics for it.
> >
> > I suspect the vast majority of users will ignore it unless they are running
> > zone_reclaim_mode at the same time, and even then will probably just leave
> > it at 30, as an LRU scan every 30 seconds worst case is not going to show up
> > on many profiles.
> >
> > > Since the whole point
> > > is to avoid 100% CPU usage, we can take down the time used for this
> > > failed zone reclaim (T) and forbid zone reclaim until (NOW + 100*T).
> > >
> >
> > i.e. just fix it internally at 100 seconds? How is that better than
> > having an obscure tunable? I think if this heuristic exists at all, it's
> > important that an administrator be able to turn it off if absolutely
> > necessary, and so something must be user-visible.
>
> That 100*T doesn't mean 100 seconds. It means keeping CPU usage under 1%:
> after busy scanning for time T, relax for 100*T.
>

Do I have a means of calculating what my CPU usage is as a result of
scanning the LRU list?

If I don't and the machine is busy, would I not avoid scanning even in
situations where it should have been scanned?

--
Mel Gorman
Part-time Phd Student Linux Technology Center
University of Limerick IBM Dublin Software Lab

2009-06-09 08:45:20

by KOSAKI Motohiro

[permalink] [raw]
Subject: Re: [PATCH 1/3] Reintroduce zone_reclaim_interval for when zone_reclaim() scans and fails to avoid CPU spinning at 100% on NUMA

Hi

> > > @@ -1192,6 +1192,15 @@ static struct ctl_table vm_table[] = {
> > > .extra1 = &zero,
> > > },
> > > {
> > > + .ctl_name = CTL_UNNUMBERED,
> > > + .procname = "zone_reclaim_interval",
> > > + .data = &zone_reclaim_interval,
> > > + .maxlen = sizeof(zone_reclaim_interval),
> > > + .mode = 0644,
> > > + .proc_handler = &proc_dointvec_jiffies,
> > > + .strategy = &sysctl_jiffies,
> > > + },
> >
> > hmmm, I think nobody can know the proper interval setting for his own system.
> > I agree with Wu. It can be hidden.
> >
>
> For the few users that care, I expect the majority of those will choose
> either 0 or the default value of 30. They might want to alter this while
> setting zone_reclaim_mode if they don't understand the different values
> it can have, for example.
>
> My preference would be that this not exist at all but the
> scan-avoidance-heuristic has to be perfect to allow that.

Ah, I hadn't considered interval==0. Thanks.
I can ack this now, but please add documentation about what interval==0 means?
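
A possible wording for the vm.txt addition, as a suggestion only:

	A value of 0 disables the back-off entirely: a failed
	zone_reclaim() may be retried immediately, which keeps
	allocations local at the cost of repeated LRU scans.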




> > > @@ -2414,6 +2426,16 @@ int zone_reclaim(struct zone *zone, gfp_t gfp_mask, unsigned int order)
> > > ret = __zone_reclaim(zone, gfp_mask, order);
> > > zone_clear_flag(zone, ZONE_RECLAIM_LOCKED);
> > >
> > > + if (!ret) {
> > > + /*
> > > + * We were unable to reclaim enough pages to stay on node and
> > > + * unable to detect in advance that the scan would fail. Allow
> > > + * off node accesses for zone_reclaim_interval jiffies before
> > > + * trying zone_reclaim() again
> > > + */
> > > + zone->zone_reclaim_failure = jiffies;
> >
> > Oops, this simple assignment doesn't account for jiffies wraparound.
> >
>
> Here it is just recording the jiffies value. The real smarts with the counter
> use time_before(), which I assumed could handle jiffies wraparound. Even
> if it doesn't, the consequence is that one scan will occur that could have
> been avoided around the time of the jiffies wraparound. The value will then
> be reset and it will be fine.

time_before() assumes its two arguments are sufficiently close in time.
On a 32-bit CPU with HZ=1000, jiffies wraps around in roughly one month.

Then,

1. a zone reclaim failure occurs
2. the system works fine for one month
3. jiffies wraps and time_before() miscalculates.

I think.
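
One way to bound the damage, as a sketch only (this is not in the posted
patch): check that the recorded failure falls inside the window of the last
zone_reclaim_interval jiffies with time_in_range(), so a stale timestamp
from before a wrap can only be misread within a small window rather than
suppressing reclaim for a long stretch:

	/* Sketch: only suppress the scan if the recorded failure is
	 * within the last zone_reclaim_interval jiffies. The bounded
	 * window makes a stale, pre-wraparound timestamp far less
	 * likely to compare as "recent". */
	if (time_in_range(zone->zone_reclaim_failure,
			  jiffies - zone_reclaim_interval, jiffies))
		return 0;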


2009-06-09 08:46:03

by Fengguang Wu

[permalink] [raw]
Subject: Re: [PATCH 2/3] Properly account for the number of page cache pages zone_reclaim() can reclaim

On Tue, Jun 09, 2009 at 04:27:29PM +0800, Mel Gorman wrote:
> On Tue, Jun 09, 2009 at 10:25:49AM +0800, Wu Fengguang wrote:
> > On Mon, Jun 08, 2009 at 09:01:29PM +0800, Mel Gorman wrote:
> > > On NUMA machines, the administrator can configure zone_reclaim_mode that
> > > is a more targeted form of direct reclaim. On machines with large NUMA
> > > distances, for example, zone_reclaim_mode defaults to 1, meaning that clean
> > > unmapped pages will be reclaimed if the zone watermarks are not being met.
> > >
> > > There is a heuristic that determines if the scan is worthwhile but the
> > > problem is that the heuristic is not being properly applied and is basically
> > > assuming zone_reclaim_mode is 1 if it is enabled.
> > >
> > > This patch makes zone_reclaim() make a better attempt at working out how
> > > many pages it might be able to reclaim given the current reclaim_mode. If it
> > > cannot clean pages, then NR_FILE_DIRTY number of pages are not candidates. If
> > > it cannot swap, then NR_FILE_MAPPED are not. This indirectly addresses tmpfs
> > > as those pages tend to be dirty as they are not cleaned by pdflush or sync.
> >
> > No, tmpfs pages are not accounted in NR_FILE_DIRTY because of the
> > BDI_CAP_NO_ACCT_AND_WRITEBACK bits.
> >
>
> Ok, that explains why the dirty page count was not as high as I was
> expecting. Thanks.
>
> > > The ideal would be that the number of tmpfs pages would also be known
> > > and accounted for, like NR_FILE_MAPPED, as swap is required to discard them.
> > > A means of working this out quickly was not obvious but a comment is added
> > > noting the problem.
> >
> > I'd rather prefer it be accounted separately than to muck up NR_FILE_MAPPED :)
> >
>
> Maybe I used a poor choice of words. What I meant was that the ideal would
> be that we had a separate count for tmpfs pages. As tmpfs pages and mapped
> pages both have to be unmapped and potentially swapped, they are "like" each
> other with respect to the zone_reclaim_mode and how it behaves. We would end
> up with something like
>
> pagecache_reclaimable -= zone_page_state(zone, NR_FILE_MAPPED);
> pagecache_reclaimable -= zone_page_state(zone, NR_FILE_TMPFS);

OK. But tmpfs pages may be mapped, so there will be double counting.
We must at least make sure pagecache_reclaimable won't underflow.
(Or make another LRU list for tmpfs pages?)
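
A minimal sketch of that guard, using the names from the patch under
discussion (only the clamp at the end is new):

	pagecache_reclaimable = zone_page_state(zone, NR_FILE_PAGES);
	if (!(zone_reclaim_mode & RECLAIM_WRITE))
		pagecache_reclaimable -= zone_page_state(zone, NR_FILE_DIRTY);
	if (!(zone_reclaim_mode & RECLAIM_SWAP))
		pagecache_reclaimable -= zone_page_state(zone, NR_FILE_MAPPED);

	/* The counters are sampled independently and a page can be both
	 * dirty and mapped, so the subtractions can double count. Clamp
	 * rather than let the estimate go negative. */
	if (pagecache_reclaimable < 0)
		pagecache_reclaimable = 0;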

> > > + int pagecache_reclaimable;
> > > +
> > > + /*
> > > + * Work out how many page cache pages we can reclaim in this mode.
> > > + *
> > > + * NOTE: Ideally, tmpfs pages would be accounted as if they were
> > > + * NR_FILE_MAPPED as swap is required to discard those
> > > + * pages even when they are clean. However, there is no
> > > + * way of quickly identifying the number of tmpfs pages
> > > + */
> >
> > So can you remove the note on NR_FILE_MAPPED?
> >
>
> Why would I remove the note? I can alter the wording but the intention is
> to show we cannot count the number of tmpfs pages quickly and it would be
> nice if we could. Maybe this is clearer?
>
> Note: Ideally tmpfs pages would be accounted for as NR_FILE_TMPFS or
> similar and treated similarly to NR_FILE_MAPPED, as both require
> unmapping from page tables and potentially swap to reclaim.
> However, no such counter exists.

That's better. Thanks.

> > > + pagecache_reclaimable = zone_page_state(zone, NR_FILE_PAGES);
> > > + if (!(zone_reclaim_mode & RECLAIM_WRITE))
> > > + pagecache_reclaimable -= zone_page_state(zone, NR_FILE_DIRTY);
> >
> > > + if (!(zone_reclaim_mode & RECLAIM_SWAP))
> > > + pagecache_reclaimable -= zone_page_state(zone, NR_FILE_MAPPED);
> >
> > So the "if" can be removed because NR_FILE_MAPPED is not related to swapping?
> >
>
> It's partially related with respect to what zone_reclaim() is doing.
> Once something is mapped, we need RECLAIM_SWAP set on the
> zone_reclaim_mode to do anything useful with them.

You are referring to mapped anonymous/tmpfs pages? But I mean
NR_FILE_MAPPED pages won't go to swap when unmapped.

Thanks,
Fengguang

> > > /*
> > > * Zone reclaim reclaims unmapped file backed pages and
> > > @@ -2391,8 +2406,7 @@ int zone_reclaim(struct zone *zone, gfp_t gfp_mask, unsigned int order)
> > > * if less than a specified percentage of the zone is used by
> > > * unmapped file backed pages.
> > > */
> > > - if (zone_page_state(zone, NR_FILE_PAGES) -
> > > - zone_page_state(zone, NR_FILE_MAPPED) <= zone->min_unmapped_pages
> > > + if (pagecache_reclaimable <= zone->min_unmapped_pages
> > > && zone_page_state(zone, NR_SLAB_RECLAIMABLE)
> > > <= zone->min_slab_pages)
> > > return 0;
> > > --
> > > 1.5.6.5
> >
>
> --
> Mel Gorman
> Part-time Phd Student Linux Technology Center
> University of Limerick IBM Dublin Software Lab

2009-06-09 08:47:55

by Mel Gorman

[permalink] [raw]
Subject: Re: [PATCH 2/3] Properly account for the number of page cache pages zone_reclaim() can reclaim

On Tue, Jun 09, 2009 at 05:19:41PM +0900, KOSAKI Motohiro wrote:
> Hi
>
> > On NUMA machines, the administrator can configure zone_reclaim_mode that
> > is a more targeted form of direct reclaim. On machines with large NUMA
> > distances, for example, zone_reclaim_mode defaults to 1, meaning that clean
> > unmapped pages will be reclaimed if the zone watermarks are not being met.
> >
> > There is a heuristic that determines if the scan is worthwhile but the
> > problem is that the heuristic is not being properly applied and is basically
> > assuming zone_reclaim_mode is 1 if it is enabled.
> >
> > This patch makes zone_reclaim() make a better attempt at working out how
> > many pages it might be able to reclaim given the current reclaim_mode. If it
> > cannot clean pages, then NR_FILE_DIRTY number of pages are not candidates. If
> > it cannot swap, then NR_FILE_MAPPED are not. This indirectly addresses tmpfs
> > as those pages tend to be dirty as they are not cleaned by pdflush or sync.
> >
> > The ideal would be that the number of tmpfs pages would also be known
> > and accounted for, like NR_FILE_MAPPED, as swap is required to discard them.
> > A means of working this out quickly was not obvious but a comment is added
> > noting the problem.
> >
> > Signed-off-by: Mel Gorman <[email protected]>
> > ---
> > mm/vmscan.c | 18 ++++++++++++++++--
> > 1 files changed, 16 insertions(+), 2 deletions(-)
> >
> > diff --git a/mm/vmscan.c b/mm/vmscan.c
> > index ba211c1..ffe2f32 100644
> > --- a/mm/vmscan.c
> > +++ b/mm/vmscan.c
> > @@ -2380,6 +2380,21 @@ int zone_reclaim(struct zone *zone, gfp_t gfp_mask, unsigned int order)
> > {
> > int node_id;
> > int ret;
> > + int pagecache_reclaimable;
> > +
> > + /*
> > + * Work out how many page cache pages we can reclaim in this mode.
> > + *
> > + * NOTE: Ideally, tmpfs pages would be accounted as if they were
> > + * NR_FILE_MAPPED as swap is required to discard those
> > + * pages even when they are clean. However, there is no
> > + * way of quickly identifying the number of tmpfs pages
> > + */
>
> I think you and I are tackling the same issue.
> Please see vmscan-change-the-number-of-the-unmapped-files-in-zone-reclaim.patch in -mm.
>

Awesome. This is why I posted the patches a bit earlier than I would
normally. Stuff like this is found :D

> My intention was: tmpfs pages and swap cache increase NR_FILE_PAGES,
> but they can't be reclaimed when zone_reclaim_mode==1.
>

Sounds familiar!

> So I decided to use the following calculation.
>
> + nr_unmapped_file_pages = zone_page_state(zone, NR_INACTIVE_FILE) +
> + zone_page_state(zone, NR_ACTIVE_FILE) -
> + zone_page_state(zone, NR_FILE_MAPPED);
>

That should now be in a helper. If I use that calculation, it'll appear
in three different places. I'll do the shuffling.
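
As a sketch of what such a helper might look like (the name is invented
here), folding KOSAKI's LRU-based calculation in with the reclaim_mode
checks; starting from the file LRU counters also sidesteps swap cache
pages, which inflate NR_FILE_PAGES but sit on the anon LRU:

	/* Sketch only: how many page cache pages might zone_reclaim()
	 * free in the current zone_reclaim_mode? */
	static long zone_pagecache_reclaimable(struct zone *zone)
	{
		long nr;

		/* If we can swap, mapped file pages are candidates too,
		 * so start from all file pages. Otherwise count only
		 * unmapped pages on the file LRU lists. */
		if (zone_reclaim_mode & RECLAIM_SWAP)
			nr = zone_page_state(zone, NR_FILE_PAGES);
		else
			nr = zone_page_state(zone, NR_INACTIVE_FILE) +
				zone_page_state(zone, NR_ACTIVE_FILE) -
				zone_page_state(zone, NR_FILE_MAPPED);

		/* If we cannot write pages back, dirty pages are not
		 * candidates either */
		if (!(zone_reclaim_mode & RECLAIM_WRITE))
			nr -= zone_page_state(zone, NR_FILE_DIRTY);

		/* Independently-sampled counters can double count */
		return nr > 0 ? nr : 0;
	}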

>
> > + pagecache_reclaimable = zone_page_state(zone, NR_FILE_PAGES);
> > + if (!(zone_reclaim_mode & RECLAIM_WRITE))
> > + pagecache_reclaimable -= zone_page_state(zone, NR_FILE_DIRTY);
> > + if (!(zone_reclaim_mode & RECLAIM_SWAP))
> > + pagecache_reclaimable -= zone_page_state(zone, NR_FILE_MAPPED);
>
> if you hope to solve the tmpfs issue, RECLAIM_WRITE/RECLAIM_SWAP are unrelated, I think.

For zone_reclaim() to do anything useful, the pages have to be cleaned
and swapped, and that needs RECLAIM_WRITE and RECLAIM_SWAP. So how are
they unrelated?

> Plus, could you please see vmscan-zone_reclaim-use-may_swap.patch in -mm?
> It improves RECLAIM_SWAP in another way.
>

Looking now, I'm going to rebase this patchset on top of -mm where I can
take advantage of that patch.

Thanks a lot

>
>
> >
> > /*
> > * Zone reclaim reclaims unmapped file backed pages and
> > @@ -2391,8 +2406,7 @@ int zone_reclaim(struct zone *zone, gfp_t gfp_mask, unsigned int order)
> > * if less than a specified percentage of the zone is used by
> > * unmapped file backed pages.
> > */
> > - if (zone_page_state(zone, NR_FILE_PAGES) -
> > - zone_page_state(zone, NR_FILE_MAPPED) <= zone->min_unmapped_pages
> > + if (pagecache_reclaimable <= zone->min_unmapped_pages
> > && zone_page_state(zone, NR_SLAB_RECLAIMABLE)
> > <= zone->min_slab_pages)
> > return 0;

--
Mel Gorman
Part-time Phd Student Linux Technology Center
University of Limerick IBM Dublin Software Lab

2009-06-09 08:50:47

by Mel Gorman

[permalink] [raw]
Subject: Re: [PATCH 3/3] Do not unconditionally treat zones that fail zone_reclaim() as full

On Tue, Jun 09, 2009 at 11:11:19AM +0800, Wu Fengguang wrote:
> On Mon, Jun 08, 2009 at 09:01:30PM +0800, Mel Gorman wrote:
> > On NUMA machines, the administrator can configure zone_reclaim_mode that
> > is a more targeted form of direct reclaim. On machines with large NUMA
> > distances, for example, zone_reclaim_mode defaults to 1, meaning that clean
> > unmapped pages will be reclaimed if the zone watermarks are not being
> > met. The problem is that zone_reclaim() failing at all means the zone
> > gets marked full.
> >
> > This can cause situations where a zone is usable, but is being skipped
> > because it has been considered full. Take a situation where a large tmpfs
> > mount is occupying a large percentage of memory overall. The pages do not
> > get cleaned or reclaimed by zone_reclaim(), but the zone gets marked full
> > and the zonelist cache considers them not worth trying in the future.
> >
> > This patch makes zone_reclaim() return more fine-grained information about
> > what occurred when zone_reclaim() failed. The zone only gets marked full if
> > it really is unreclaimable. If it's a case that the scan did not occur or
> > if enough pages were not reclaimed with the limited reclaim_mode, then the
> > zone is simply skipped.
> >
> > There is a side-effect to this patch. Currently, if zone_reclaim()
> > successfully reclaimed SWAP_CLUSTER_MAX, an allocation attempt would
> > go ahead. With this patch applied, zone watermarks are rechecked after
> > zone_reclaim() does some work.
> >
> > Signed-off-by: Mel Gorman <[email protected]>
>
> Thanks for making the code a lot more readable :)
>
> Reviewed-by: Wu Fengguang <[email protected]>
>

Thanks.

> > /*
> > * Do not scan if the allocation should not be delayed.
> > */
> > if (!(gfp_mask & __GFP_WAIT) || (current->flags & PF_MEMALLOC))
> > - return 0;
> > + return ZONE_RECLAIM_NOSCAN;
>
> Why not kill the extra tab?
>

Why not indeed. Tab is now killed.

--
Mel Gorman
Part-time Phd Student Linux Technology Center
University of Limerick IBM Dublin Software Lab

2009-06-09 08:55:23

by KOSAKI Motohiro

[permalink] [raw]
Subject: Re: [PATCH 2/3] Properly account for the number of page cache pages zone_reclaim() can reclaim

> > > The ideal would be that the number of tmpfs pages would also be known
> > > and accounted for, like NR_FILE_MAPPED, as swap is required to discard them.
> > > A means of working this out quickly was not obvious but a comment is added
> > > noting the problem.
> >
> > I'd rather prefer it be accounted separately than to muck up NR_FILE_MAPPED :)
> >
>
> Maybe I used a poor choice of words. What I meant was that the ideal would
> be that we had a separate count for tmpfs pages. As tmpfs pages and mapped
> pages both have to be unmapped and potentially swapped, they are "like" each
> other with respect to the zone_reclaim_mode and how it behaves. We would end
> up with something like
>
> pagecache_reclaimable -= zone_page_state(zone, NR_FILE_MAPPED);
> pagecache_reclaimable -= zone_page_state(zone, NR_FILE_TMPFS);

Please see shmem_writepage(). tmpfs writeout creates swap cache, so we also
need to take swap cache into account.

note: swap cache also increases NR_FILE_PAGES, see add_to_swap_cache().


2009-06-09 09:07:45

by Fengguang Wu

[permalink] [raw]
Subject: Re: [PATCH 1/3] Reintroduce zone_reclaim_interval for when zone_reclaim() scans and fails to avoid CPU spinning at 100% on NUMA

On Tue, Jun 09, 2009 at 04:31:54PM +0800, Mel Gorman wrote:
> On Tue, Jun 09, 2009 at 04:25:39PM +0800, Wu Fengguang wrote:
> > On Tue, Jun 09, 2009 at 04:14:25PM +0800, Mel Gorman wrote:
> > > On Tue, Jun 09, 2009 at 09:58:22AM +0800, Wu Fengguang wrote:
> > > > On Mon, Jun 08, 2009 at 09:01:28PM +0800, Mel Gorman wrote:
> > > > > On NUMA machines, the administrator can configure zone_reclaim_mode that is a
> > > > > more targeted form of direct reclaim. On machines with large NUMA distances,
> > > > > zone_reclaim_mode defaults to 1 meaning that clean unmapped pages will be
> > > > > reclaimed if the zone watermarks are not being met. The problem is that
> > > > > zone_reclaim() can be in a situation where it scans excessively without
> > > > > making progress.
> > > > >
> > > > > One such situation is where a large tmpfs mount is occupying a large
> > > > > percentage of memory overall. The pages do not get cleaned or reclaimed by
> > > > > zone_reclaim(), but the lists are uselessly scanned frequently, making the
> > > > > CPU spin at 100%. The scanning occurs because zone_reclaim() cannot tell
> > > > > in advance the scan is pointless because the counters do not distinguish
> > > > > between pagecache pages backed by disk and by RAM. The observation in
> > > > > the field is that malloc() stalls for a long time (minutes in some cases)
> > > > > when this situation occurs.
> > > > >
> > > > > Accounting for ram-backed file pages was considered but not implemented on
> > > > > the grounds it would be introducing new branches and expensive checks into
> > > > > the page cache add/remove patches and increase the number of statistics
> > > > > needed in the zone. As zone_reclaim() failing is currently considered a
> > > > > corner case, this seemed like overkill. Note, if there are a large number
> > > > > of reports about CPU spinning at 100% on NUMA that is fixed by disabling
> > > > > zone_reclaim, then this assumption is false and zone_reclaim() scanning
> > > > > and failing is not a corner case but a common occurrence.
> > > > >
> > > > > This patch reintroduces zone_reclaim_interval which was removed by commit
> > > > > 34aa1330f9b3c5783d269851d467326525207422 [zoned vm counters: zone_reclaim:
> > > > > remove /proc/sys/vm/zone_reclaim_interval] because the zone counters were
> > > > > considered sufficient to determine in advance if the scan would succeed.
> > > > > As unsuccessful scans can still occur, zone_reclaim_interval is still
> > > > > required.
> > > >
> > > > Can we avoid the user visible parameter zone_reclaim_interval?
> > > >
> > >
> > > You could, but then there is no way of disabling it by setting it to 0
> > > either. I can't imagine why but the desired behaviour might really be to
> > > spin and never go off-node unless there is no other option. They might
> > > want to set it to 0 for example when determining what the right value for
> > > zone_reclaim_mode is for their workloads.
> > >
> > > > That means to introduce some heuristics for it.
> > >
> > > I suspect the vast majority of users will ignore it unless they are running
> > > zone_reclaim_mode at the same time, and even then will probably just leave
> > > it at 30, as an LRU scan every 30 seconds worst case is not going to show up
> > > on many profiles.
> > >
> > > > Since the whole point
> > > > is to avoid 100% CPU usage, we can take down the time used for this
> > > > failed zone reclaim (T) and forbid zone reclaim until (NOW + 100*T).
> > > >
> > >
> > > i.e. just fix it internally at 100 seconds? How is that better than
> > > having an obscure tunable? I think if this heuristic exists at all, it's
> > > important that an administrator be able to turn it off if absolutely
> > > necessary, and so something must be user-visible.
> >
> > That 100*T doesn't mean 100 seconds. It means keeping CPU usage under 1%:
> > after busy scanning for time T, relax for 100*T.
> >
>
> Do I have a means of calculating what my CPU usage is as a result of
> scanning the LRU list?
>
> If I don't and the machine is busy, would I not avoid scanning even in
> situations where it should have been scanned?

I guess we don't really care about the exact number for the ratio 100.
If the box is busy, it automatically scales the effective ratio to 200
or more, which I think is reasonable behavior.

Something like this.

Thanks,
Fengguang

---
include/linux/mmzone.h | 2 ++
mm/vmscan.c | 11 +++++++++++
2 files changed, 13 insertions(+)

--- linux.orig/include/linux/mmzone.h
+++ linux/include/linux/mmzone.h
@@ -334,6 +334,8 @@ struct zone {
/* Zone statistics */
atomic_long_t vm_stat[NR_VM_ZONE_STAT_ITEMS];

+ unsigned long zone_reclaim_relax;
+
/*
* prev_priority holds the scanning priority for this zone. It is
* defined as the scanning priority at which we achieved our reclaim
--- linux.orig/mm/vmscan.c
+++ linux/mm/vmscan.c
@@ -2453,6 +2453,7 @@ int zone_reclaim(struct zone *zone, gfp_
int ret;
long nr_unmapped_file_pages;
long nr_slab_reclaimable;
+ unsigned long t;

/*
* Zone reclaim reclaims unmapped file backed pages and
@@ -2475,6 +2476,11 @@ int zone_reclaim(struct zone *zone, gfp_
if (zone_is_all_unreclaimable(zone))
return 0;

+ if (time_in_range(zone->zone_reclaim_relax - 10000 * HZ,
+ jiffies,
+ zone->zone_reclaim_relax))
+ return 0;
+
/*
* Do not scan if the allocation should not be delayed.
*/
@@ -2493,7 +2499,12 @@ int zone_reclaim(struct zone *zone, gfp_

if (zone_test_and_set_flag(zone, ZONE_RECLAIM_LOCKED))
return 0;
+ t = jiffies;
ret = __zone_reclaim(zone, gfp_mask, order);
+ if (sc.nr_reclaimed == 0) {
+ t = min_t(unsigned long, 10000 * HZ, 100 * (jiffies - t));
+ zone->zone_reclaim_relax = jiffies + t;
+ }
zone_clear_flag(zone, ZONE_RECLAIM_LOCKED);

return ret;

2009-06-09 09:26:06

by Mel Gorman

[permalink] [raw]
Subject: Re: [PATCH 3/3] Do not unconditionally treat zones that fail zone_reclaim() as full

On Tue, Jun 09, 2009 at 04:48:02PM +0900, KOSAKI Motohiro wrote:
> Hi
>
> > On NUMA machines, the administrator can configure zone_reclaim_mode that
> > is a more targeted form of direct reclaim. On machines with large NUMA
> > distances, for example, zone_reclaim_mode defaults to 1, meaning that clean
> > unmapped pages will be reclaimed if the zone watermarks are not being
> > met. The problem is that zone_reclaim() failing at all means the zone
> > gets marked full.
> >
> > This can cause situations where a zone is usable, but is being skipped
> > because it has been considered full. Take a situation where a large tmpfs
> > mount is occupying a large percentage of memory overall. The pages do not
> > get cleaned or reclaimed by zone_reclaim(), but the zone gets marked full
> > and the zonelist cache considers them not worth trying in the future.
> >
> > This patch makes zone_reclaim() return more fine-grained information about
> > what occurred when zone_reclaim() failed. The zone only gets marked full if
> > it really is unreclaimable. If it's a case that the scan did not occur or
> > if enough pages were not reclaimed with the limited reclaim_mode, then the
> > zone is simply skipped.
> >
> > There is a side-effect to this patch. Currently, if zone_reclaim()
> > successfully reclaimed SWAP_CLUSTER_MAX, an allocation attempt would
> > go ahead. With this patch applied, zone watermarks are rechecked after
> > zone_reclaim() does some work.
> >
> > Signed-off-by: Mel Gorman <[email protected]>
> > ---
> > mm/internal.h | 4 ++++
> > mm/page_alloc.c | 26 ++++++++++++++++++++++----
> > mm/vmscan.c | 10 +++++-----
> > 3 files changed, 31 insertions(+), 9 deletions(-)
> >
> > diff --git a/mm/internal.h b/mm/internal.h
> > index 987bb03..090c267 100644
> > --- a/mm/internal.h
> > +++ b/mm/internal.h
> > @@ -284,4 +284,8 @@ int __get_user_pages(struct task_struct *tsk, struct mm_struct *mm,
> > unsigned long start, int len, int flags,
> > struct page **pages, struct vm_area_struct **vmas);
> >
> > +#define ZONE_RECLAIM_NOSCAN -2
> > +#define ZONE_RECLAIM_FULL -1
> > +#define ZONE_RECLAIM_SOME 0
> > +#define ZONE_RECLAIM_SUCCESS 1
> > #endif
> > diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> > index fe753ec..ce2f684 100644
> > --- a/mm/page_alloc.c
> > +++ b/mm/page_alloc.c
> > @@ -1420,20 +1420,38 @@ zonelist_scan:
> >
> > if (!(alloc_flags & ALLOC_NO_WATERMARKS)) {
> > unsigned long mark;
> > + int ret;
>
> Please insert one empty line here.
>

Done.

> > if (alloc_flags & ALLOC_WMARK_MIN)
> > mark = zone->pages_min;
> > else if (alloc_flags & ALLOC_WMARK_LOW)
> > mark = zone->pages_low;
> > else
> > mark = zone->pages_high;
> > - if (!zone_watermark_ok(zone, order, mark,
> > - classzone_idx, alloc_flags)) {
> > - if (!zone_reclaim_mode ||
> > - !zone_reclaim(zone, gfp_mask, order))
> > + if (zone_watermark_ok(zone, order, mark,
> > + classzone_idx, alloc_flags))
> > + goto try_this_zone;
> > +
> > + if (zone_reclaim_mode == 0)
> > + goto this_zone_full;
> > +
> > + ret = zone_reclaim(zone, gfp_mask, order);
> > + switch (ret) {
> > + case ZONE_RECLAIM_NOSCAN:
> > + /* did not scan */
> > + goto try_next_zone;
> > + case ZONE_RECLAIM_FULL:
> > + /* scanned but unreclaimable */
> > goto this_zone_full;
> > + default:
> > + /* did we reclaim enough */
> > + if (!zone_watermark_ok(zone, order,
> > + mark, classzone_idx,
> > + alloc_flags))
> > + goto try_next_zone;
>
> hmmm
> I haven't caught your meaning yet, sorry.
> Could you please explain more?
>
> My points of confusion are:
>
> 1.
> ----
> I think your patch almost reverts the essence of Paul's 9276b1bc96a132f4068fdee00983c532f43d3a26.
> After your patch is applied, zlc_mark_zone_full() is called only when zone_is_all_unreclaimable()==1
> or memory is stolen after the zone_watermark_ok() recheck.
>

It's true that the zone is only being marked full when it's .... full due
to all pages being unreclaimable. Maybe this is too aggressive.

> but zone_is_all_unreclaimable() is very rare on large NUMA machines. Thus
> your patch makes the zlc_zone_worth_trying() check worthless.
> So I would prefer simply reverting 9276b1bc rather than introducing more mess, if necessary.
>
> but is it necessary? why?
>

Allegedly the ZLC reduces scanning overhead on large NUMA machines, but I
have no figures proving or disproving that, so I'm wary of a full revert.

The danger as I see it is that zones get skipped when there was no need,
simply because a previous caller failed to scan; the case of the GFP flags
causing the zone to be marked full is of particular concern.

I was also concerned that once it was marked full, the zone was unconditionally
skipped even though the next caller might be using a different watermark
level like ALLOC_WMARK_LOW or ALLOC_NO_WATERMARKS.

How about the following.

o If the zone is fully unreclaimable - mark full
o If the zone_reclaim() avoids the scan because of the number of pages
and the current setting of reclaim_mode - mark full
o If the scan occurs but enough pages were not reclaimed to meet the
watermarks - mark full

This is the important part

o Push down the zlc_zone_worth_trying() further down to take place after
the watermark check has failed but before reclaim_zone() is considered

The last part in particular is important because it might mean the
zone_reclaim_interval can be later dropped because the zlc does the necessary
scan avoidance for a period of time. It also means that a check of a bitmap
is happening outside of a fast path.
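
As a control-flow sketch of what that ordering would look like in
get_page_from_freelist() (the labels and zlc helpers exist in page_alloc.c;
the exact placement here is illustrative, not a tested patch):

	if (zone_watermark_ok(zone, order, mark, classzone_idx, alloc_flags))
		goto try_this_zone;

	/* Watermark failed: only now consult the zonelist cache, out of
	 * the fast path */
	if (zlc_active && !zlc_zone_worth_trying(zonelist, z, allowednodes))
		goto try_next_zone;

	if (zone_reclaim_mode == 0)
		goto this_zone_full;

	switch (zone_reclaim(zone, gfp_mask, order)) {
	case ZONE_RECLAIM_NOSCAN:
		/* Did not scan for GFP or locking reasons; do not mark full */
		goto try_next_zone;
	case ZONE_RECLAIM_FULL:
		/* Scanned but unreclaimable, or not worth scanning given
		 * the page counts and reclaim_mode; mark full */
		goto this_zone_full;
	default:
		/* Scanned; recheck whether enough was reclaimed */
		if (!zone_watermark_ok(zone, order, mark,
					classzone_idx, alloc_flags))
			goto this_zone_full;
	}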

>
> 2.
> -----
> Why simple following switch-case is wrong?
>
> case ZONE_RECLAIM_NOSCAN:
> goto try_next_zone;
> case ZONE_RECLAIM_FULL:
> case ZONE_RECLAIM_SOME:
> goto this_zone_full;
> case ZONE_RECLAIM_SUCCESS
> ; /* do nothing */
>
> I mean,
> (1) ZONE_RECLAIM_SOME and zone_watermark_ok()==1
> are rare.

How rare? In the event the zone is under pressure, we could be just on the
watermark. If we're within 32 pages of that watermark, then reclaiming some
pages might just be enough to meet the watermark so why consider it full?

> Is rechecking really worth it?

If we don't recheck and we reclaimed just 1 page, we allow a caller
to go below watermarks. This could have an impact on GFP_ATOMIC
allocations.

> In my experience, zone_watermark_ok() is not such a fast function.
>

It's not, but watermarks can't be ignored just because the function is not
fast. For what it's worth, we are already in a horrible slow path by the
time we're reclaiming pages and the cost of zone_watermark_ok() is less
of a concern?

> And,
>
> (2) ZONE_RECLAIM_SUCCESS and zone_watermark_ok()==0
>
> is also rare.

Again, how rare? I don't actually know myself.

> What bad thing are you afraid of?
>

Because watermarks are important.

> I guess you mean the high-order allocation with ZONE_RECLAIM_SUCCESS and
> zone_watermark_ok()==0 case, right?
>
> If so, why does your system make high-order allocations so frequently?
>

This is not about high-order allocations.

> 3.
> ------
> your patch does:
>
> 1. call zone_reclaim() and return ZONE_RECLAIM_SUCCESS
> 2. another thread steals memory
> 3. call zone_watermark_ok() and return 0
>
> but
>
> 1. call zone_reclaim() and return ZONE_RECLAIM_SUCCESS
> 2. call zone_watermark_ok() and return 1
> 3. another thread steals memory
> 4. call buffered_rmqueue() and return NULL
>
> Then it calls zlc_mark_zone_full().
>
> That seems a bit inconsistent.
>

There is a relatively harmless race in there when memory is extremely
tight and there are multiple threads contending. Potentially, we go one
page below the watermark per thread contending on the one zone because
we are not locking in this path and the allocation could be satisfied
from the per-cpu allocator.

However, I do not see this issue as being serious enough to warrant
fixing because it would require a lock just to very strictly adhere to
the watermarks. It's different to the case above where if we did not check
watermarks, a thread can go below the watermark without any other thread
contending.

>
>
>
> > }
> > }
> >
> > +try_this_zone:
> > page = buffered_rmqueue(preferred_zone, zone, order, gfp_mask);
> > if (page)
> > break;
> > diff --git a/mm/vmscan.c b/mm/vmscan.c
> > index ffe2f32..84cdae2 100644
> > --- a/mm/vmscan.c
> > +++ b/mm/vmscan.c
> > @@ -2409,7 +2409,7 @@ int zone_reclaim(struct zone *zone, gfp_t gfp_mask, unsigned int order)
> > if (pagecache_reclaimable <= zone->min_unmapped_pages
> > && zone_page_state(zone, NR_SLAB_RECLAIMABLE)
> > <= zone->min_slab_pages)
> > - return 0;
> > + return ZONE_RECLAIM_NOSCAN;
> >
> > /* Do not attempt a scan if scanning failed recently */
> > if (time_before(jiffies,
> > @@ -2417,13 +2417,13 @@ int zone_reclaim(struct zone *zone, gfp_t gfp_mask, unsigned int order)
> > return 0;
> >
> > if (zone_is_all_unreclaimable(zone))
> > - return 0;
> > + return ZONE_RECLAIM_FULL;
> >
> > /*
> > * Do not scan if the allocation should not be delayed.
> > */
> > if (!(gfp_mask & __GFP_WAIT) || (current->flags & PF_MEMALLOC))
> > - return 0;
> > + return ZONE_RECLAIM_NOSCAN;
> >
> > /*
> > * Only run zone reclaim on the local zone or on zones that do not
> > @@ -2433,10 +2433,10 @@ int zone_reclaim(struct zone *zone, gfp_t gfp_mask, unsigned int order)
> > */
> > node_id = zone_to_nid(zone);
> > if (node_state(node_id, N_CPU) && node_id != numa_node_id())
> > - return 0;
> > + return ZONE_RECLAIM_NOSCAN;
> >
> > if (zone_test_and_set_flag(zone, ZONE_RECLAIM_LOCKED))
> > - return 0;
> > + return ZONE_RECLAIM_NOSCAN;
> > ret = __zone_reclaim(zone, gfp_mask, order);
> > zone_clear_flag(zone, ZONE_RECLAIM_LOCKED);
> >
> > --
> > 1.5.6.5
> >
>
>
>

--
Mel Gorman
Part-time Phd Student Linux Technology Center
University of Limerick IBM Dublin Software Lab

2009-06-09 09:41:03

by Mel Gorman

[permalink] [raw]
Subject: Re: [PATCH 1/3] Reintroduce zone_reclaim_interval for when zone_reclaim() scans and fails to avoid CPU spinning at 100% on NUMA

On Tue, Jun 09, 2009 at 05:07:35PM +0800, Wu Fengguang wrote:
> On Tue, Jun 09, 2009 at 04:31:54PM +0800, Mel Gorman wrote:
> > On Tue, Jun 09, 2009 at 04:25:39PM +0800, Wu Fengguang wrote:
> > > On Tue, Jun 09, 2009 at 04:14:25PM +0800, Mel Gorman wrote:
> > > > On Tue, Jun 09, 2009 at 09:58:22AM +0800, Wu Fengguang wrote:
> > > > > On Mon, Jun 08, 2009 at 09:01:28PM +0800, Mel Gorman wrote:
> > > > > > On NUMA machines, the administrator can configure zone_reclaim_mode that is a
> > > > > > more targeted form of direct reclaim. On machines with large NUMA distances,
> > > > > > zone_reclaim_mode defaults to 1 meaning that clean unmapped pages will be
> > > > > > reclaimed if the zone watermarks are not being met. The problem is that
> > > > > > zone_reclaim() can be in a situation where it scans excessively without
> > > > > > making progress.
> > > > > >
> > > > > > One such situation is where a large tmpfs mount is occupying a large
> > > > > > percentage of memory overall. The pages do not get cleaned or reclaimed by
> > > > > > zone_reclaim(), but the lists are uselessly scanned frequently, making the
> > > > > > CPU spin at 100%. The scanning occurs because zone_reclaim() cannot tell
> > > > > > in advance the scan is pointless because the counters do not distinguish
> > > > > > between pagecache pages backed by disk and by RAM. The observation in
> > > > > > the field is that malloc() stalls for a long time (minutes in some cases)
> > > > > > when this situation occurs.
> > > > > >
> > > > > > Accounting for ram-backed file pages was considered but not implemented on
> > > > > > the grounds it would be introducing new branches and expensive checks into
> > > > > > the page cache add/remove patches and increase the number of statistics
> > > > > > needed in the zone. As zone_reclaim() failing is currently considered a
> > > > > > corner case, this seemed like overkill. Note, if there are a large number
> > > > > > of reports about CPU spinning at 100% on NUMA that is fixed by disabling
> > > > > > zone_reclaim, then this assumption is false and zone_reclaim() scanning
> > > > > > and failing is not a corner case but a common occurrence.
> > > > > >
> > > > > > This patch reintroduces zone_reclaim_interval which was removed by commit
> > > > > > 34aa1330f9b3c5783d269851d467326525207422 [zoned vm counters: zone_reclaim:
> > > > > > remove /proc/sys/vm/zone_reclaim_interval] because the zone counters were
> > > > > > considered sufficient to determine in advance if the scan would succeed.
> > > > > > As unsuccessful scans can still occur, zone_reclaim_interval is still
> > > > > > required.
> > > > >
> > > > > Can we avoid the user visible parameter zone_reclaim_interval?
> > > > >
> > > >
> > > > You could, but then there is no way of disabling it by setting it to 0
> > > > either. I can't imagine why but the desired behaviour might really be to
> > > > spin and never go off-node unless there is no other option. They might
> > > > want to set it to 0 for example when determining what the right value for
> > > > zone_reclaim_mode is for their workloads.
> > > >
> > > > > That means to introduce some heuristics for it.
> > > >
> > > > I suspect the vast majority of users will ignore it unless they are running
> > > > zone_reclaim_mode at the same time, and even then will probably just leave
> > > > it at 30, as an LRU scan every 30 seconds worst case is not going to show up
> > > > on many profiles.
> > > >
> > > > > Since the whole point
> > > > > is to avoid 100% CPU usage, we can take down the time used for this
> > > > > failed zone reclaim (T) and forbid zone reclaim until (NOW + 100*T).
> > > > >
> > > >
> > > > i.e. just fix it internally at 100 seconds? How is that better than
> > > > having an obscure tunable? I think if this heuristic exists at all, it's
> > > > important that an administrator be able to turn it off if absolutely
> > > > necessary, and so something must be user-visible.
> > >
> > > That 100*T doesn't mean 100 seconds. It means keeping CPU usage under 1%:
> > > after busy scanning for time T, let's relax for 100*T.
> > >
> >
> > Do I have a means of calculating what my CPU usage is as a result of
> > scanning the LRU list?
> >
> > If I don't and the machine is busy, would I not avoid scanning even in
> > situations where it should have been scanned?
>
> I guess we don't really care about the exact number for the ratio 100.
> If the box is busy, it automatically scales the effective ratio to 200
> or more, which I think is reasonable behavior.
>
> Something like this.
>
> Thanks,
> Fengguang
>
> ---
> include/linux/mmzone.h | 2 ++
> mm/vmscan.c | 11 +++++++++++
> 2 files changed, 13 insertions(+)
>
> --- linux.orig/include/linux/mmzone.h
> +++ linux/include/linux/mmzone.h
> @@ -334,6 +334,8 @@ struct zone {
> /* Zone statistics */
> atomic_long_t vm_stat[NR_VM_ZONE_STAT_ITEMS];
>
> + unsigned long zone_reclaim_relax;
> +
> /*
> * prev_priority holds the scanning priority for this zone. It is
> * defined as the scanning priority at which we achieved our reclaim
> --- linux.orig/mm/vmscan.c
> +++ linux/mm/vmscan.c
> @@ -2453,6 +2453,7 @@ int zone_reclaim(struct zone *zone, gfp_
> int ret;
> long nr_unmapped_file_pages;
> long nr_slab_reclaimable;
> + unsigned long t;
>
> /*
> * Zone reclaim reclaims unmapped file backed pages and
> @@ -2475,6 +2476,11 @@ int zone_reclaim(struct zone *zone, gfp_
> if (zone_is_all_unreclaimable(zone))
> return 0;
>
> + if (time_in_range(zone->zone_reclaim_relax - 10000 * HZ,
> + jiffies,
> + zone->zone_reclaim_relax))
> + return 0;
> +

So. zone_reclaim_relax is some value between now and 100 times the approximate
time it takes to scan the LRU list. This check ensures that we do not scan
multiple times within the same interval. Is that right?

> /*
> * Do not scan if the allocation should not be delayed.
> */
> @@ -2493,7 +2499,12 @@ int zone_reclaim(struct zone *zone, gfp_
>
> if (zone_test_and_set_flag(zone, ZONE_RECLAIM_LOCKED))
> return 0;
> + t = jiffies;
> ret = __zone_reclaim(zone, gfp_mask, order);
> + if (sc.nr_reclaimed == 0) {
> + t = min_t(unsigned long, 10000 * HZ, 100 * (jiffies - t));
> + zone->zone_reclaim_relax = jiffies + t;
> + }

This appears to be a way of automatically selecting a value for
zone_reclaim_interval, but is 100 times the length of time it takes to scan
the LRU list enough to avoid excessive scanning by zone_reclaim()?

I don't know and unlike zone_reclaim_interval, we have no way for the
administrator to intervene in the event we get the calculation wrong.

Conceivably though, zone_reclaim_interval could automatically tune
itself based on a heuristic like this if the administrator does not give
a specific value. I think that would be an interesting follow on once
we've brought back zone_reclaim_interval and get a feeling for how often
it is actually used.
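
For illustration only, such a fallback might look like the following
(a rough sketch; note_zone_reclaim_failure() is a hypothetical helper,
zone->zone_reclaim_relax is from Fengguang's patch above, and treating 0
as "auto-tune" here is purely for illustration, as 0 means "always scan"
in patch 1):

	/*
	 * Sketch: honour an administrator-supplied interval if one is
	 * set, otherwise relax for 100 times the cost of the failed
	 * scan, capped as in Fengguang's patch.
	 */
	static void note_zone_reclaim_failure(struct zone *zone,
						unsigned long scan_start)
	{
		unsigned long relax = zone_reclaim_interval;

		if (!relax)
			relax = min_t(unsigned long, 10000 * HZ,
					100 * (jiffies - scan_start));

		zone->zone_reclaim_relax = jiffies + relax;
	}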

> zone_clear_flag(zone, ZONE_RECLAIM_LOCKED);
>
> return ret;
>

--
Mel Gorman
Part-time Phd Student Linux Technology Center
University of Limerick IBM Dublin Software Lab

2009-06-09 09:42:39

by Mel Gorman

[permalink] [raw]
Subject: Re: [PATCH 1/3] Reintroduce zone_reclaim_interval for when zone_reclaim() scans and fails to avoid CPU spinning at 100% on NUMA

On Tue, Jun 09, 2009 at 05:45:02PM +0900, KOSAKI Motohiro wrote:
> Hi
>
> > > > @@ -1192,6 +1192,15 @@ static struct ctl_table vm_table[] = {
> > > > .extra1 = &zero,
> > > > },
> > > > {
> > > > + .ctl_name = CTL_UNNUMBERED,
> > > > + .procname = "zone_reclaim_interval",
> > > > + .data = &zone_reclaim_interval,
> > > > + .maxlen = sizeof(zone_reclaim_interval),
> > > > + .mode = 0644,
> > > > + .proc_handler = &proc_dointvec_jiffies,
> > > > + .strategy = &sysctl_jiffies,
> > > > + },
> > >
> > > hmmm, I think nobody can know the proper interval setting for their own
> > > systems. I agree with Wu. It can be hidden.
> > >
> >
> > For the few users that care, I expect the majority of those will choose
> > either 0 or the default value of 30. They might want to alter this while
> > setting zone_reclaim_mode, for example, if they don't understand the
> > different values it can have.
> >
> > My preference would be that this not exist at all but the
> > scan-avoidance-heuristic has to be perfect to allow that.
>
> Ah, I hadn't considered interval==0. Thanks.
> I can ack this now, but please add documentation about the meaning of interval==0?
>

I will.
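
Something along these lines for Documentation/sysctl/vm.txt, perhaps (a
draft only):

zone_reclaim_interval:

	The approximate interval in seconds between retries of
	zone_reclaim() for a zone after a scan has failed to reclaim
	any pages. A value of 0 disables the interval and zone_reclaim()
	will always scan, even if a recent scan made no progress.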

>
>
>
> > > > @@ -2414,6 +2426,16 @@ int zone_reclaim(struct zone *zone, gfp_t gfp_mask, unsigned int order)
> > > > ret = __zone_reclaim(zone, gfp_mask, order);
> > > > zone_clear_flag(zone, ZONE_RECLAIM_LOCKED);
> > > >
> > > > + if (!ret) {
> > > > + /*
> > > > + * We were unable to reclaim enough pages to stay on node and
> > > > + * unable to detect in advance that the scan would fail. Allow
> > > > + * off node accesses for zone_reclaim_inteval jiffies before
> > > > + * trying zone_reclaim() again
> > > > + */
> > > > + zone->zone_reclaim_failure = jiffies;
> > >
> > > Oops, this simple assignment doesn't take jiffies wraparound into account.
> > >
> >
> > Here it is just recording the jiffies value. The real smarts with the counter
> > use time_before() which I assumed could handle jiffie wrap-arounds. Even
> > if it doesn't, the consequence is that one scan will occur that could have
> > been avoided around the time of the jiffie wraparound. The value will then
> > be reset and it will be fine.
>
> time_before() assumes its two arguments are close together in time.
> If we use a 32-bit CPU and HZ=1000, jiffies wraps around roughly every month.
>
> Then,
>
> 1. zone reclaim failure occurs
> 2. system works fine for one month
> 3. jiffies wraps and time_before() miscalculates.
>

And the scan occurs uselessly and zone_reclaim_failure gets set again.
I believe the one useless scan is not significant enough to warrant dealing
with jiffies wraparound.

> I think.
>
>
>

--
Mel Gorman
Part-time Phd Student Linux Technology Center
University of Limerick IBM Dublin Software Lab

2009-06-09 09:45:40

by KOSAKI Motohiro

[permalink] [raw]
Subject: Re: [PATCH 1/3] Reintroduce zone_reclaim_interval for when zone_reclaim() scans and fails to avoid CPU spinning at 100% on NUMA

> > > Here it is just recording the jiffies value. The real smarts with the counter
> > > use time_before() which I assumed could handle jiffie wrap-arounds. Even
> > > if it doesn't, the consequence is that one scan will occur that could have
> > > been avoided around the time of the jiffie wraparound. The value will then
> > > be reset and it will be fine.
> >
> > time_before() assume two argument are enough nearly time.
> > if we use 32bit cpu and HZ=1000, about jiffies wraparound about one month.
> >
> > Then,
> >
> > 1. zone reclaim failure occur
> > 2. system works fine for one month
> > 3. jiffies wrap and time_before() makes mis-calculation.
> >
>
> And the scan occurs uselessly and zone_reclaim_failure gets set again.
> I believe the one useless scan is not significant enough to warrent dealing
> with jiffie wraparound.

Thank you for the kind explanation.
I fully agree.


2009-06-09 09:59:17

by KOSAKI Motohiro

[permalink] [raw]
Subject: Re: [PATCH 1/3] Reintroduce zone_reclaim_interval for when zone_reclaim() scans and fails to avoid CPU spinning at 100% on NUMA

> > > > Here it is just recording the jiffies value. The real smarts with the counter
> > > > use time_before() which I assumed could handle jiffie wrap-arounds. Even
> > > > if it doesn't, the consequence is that one scan will occur that could have
> > > > been avoided around the time of the jiffie wraparound. The value will then
> > > > be reset and it will be fine.
> > >
> > > time_before() assume two argument are enough nearly time.
> > > if we use 32bit cpu and HZ=1000, about jiffies wraparound about one month.
> > >
> > > Then,
> > >
> > > 1. zone reclaim failure occur
> > > 2. system works fine for one month
> > > 3. jiffies wrap and time_before() makes mis-calculation.
> > >
> >
> > And the scan occurs uselessly and zone_reclaim_failure gets set again.
> > I believe the one useless scan is not significant enough to warrent dealing
> > with jiffie wraparound.
>
> Thank you for kindful explanation.
> I fully agreed.

Bah, no, not agreed.
Simple last-failure recording leads to the following scenario.


1. zone reclaim failure occurs; update zone_reclaim_failure.
^
| time_before() returns 1, and zone_reclaim() returns immediately.
v
2. after 32 seconds
^
| time_before() returns 0, and zone_reclaim() works normally
v
3. after one month
^
| time_before() returns 1, and zone_reclaim() returns immediately,
| although recent zone_reclaim() never failed.
v
4. after one more month



2009-06-09 10:44:19

by Mel Gorman

[permalink] [raw]
Subject: Re: [PATCH 1/3] Reintroduce zone_reclaim_interval for when zone_reclaim() scans and fails to avoid CPU spinning at 100% on NUMA

On Tue, Jun 09, 2009 at 06:59:03PM +0900, KOSAKI Motohiro wrote:
> > > > > Here it is just recording the jiffies value. The real smarts with the counter
> > > > > use time_before() which I assumed could handle jiffie wrap-arounds. Even
> > > > > if it doesn't, the consequence is that one scan will occur that could have
> > > > > been avoided around the time of the jiffie wraparound. The value will then
> > > > > be reset and it will be fine.
> > > >
> > > > time_before() assume two argument are enough nearly time.
> > > > if we use 32bit cpu and HZ=1000, about jiffies wraparound about one month.
> > > >
> > > > Then,
> > > >
> > > > 1. zone reclaim failure occur
> > > > 2. system works fine for one month
> > > > 3. jiffies wrap and time_before() makes mis-calculation.
> > > >
> > >
> > > And the scan occurs uselessly and zone_reclaim_failure gets set again.
> > > I believe the one useless scan is not significant enough to warrent dealing
> > > with jiffie wraparound.
> >
> > Thank you for kindful explanation.
> > I fully agreed.
>
> Bah, no, not agreed.
> simple last failure recording makes following scenario.
>
>
> 1. zone reclaim failure occur. update zone_reclaim_failure.
> ^
> | time_before() return 1, and zone_reclaim() return immediately.
> v
> 2. after 32 second.
> ^
> | time_before() return 0, and zone_reclaim() works normally
> v
> 3. after one month
> ^
> | time_before() return 1, and zone_reclaim() return immediately.
> | although recent zone_reclaim() never failed.
> v
> 4. after more one month
>

Pants.

/me slaps self

+ /* Watch for jiffie wraparound */
+ if (unlikely(jiffies < zone->zone_reclaim_failure))
+ zone->zone_reclaim_failure = jiffies;
+
+ /* Do not attempt a scan if scanning failed recently */
+ if (time_before(jiffies,
+ zone->zone_reclaim_failure + zone_reclaim_interval))
+ return 0;
+

?

--
Mel Gorman
Part-time Phd Student Linux Technology Center
University of Limerick IBM Dublin Software Lab

2009-06-09 10:48:49

by Mel Gorman

[permalink] [raw]
Subject: Re: [PATCH 2/3] Properly account for the number of page cache pages zone_reclaim() can reclaim

On Tue, Jun 09, 2009 at 04:45:50PM +0800, Wu Fengguang wrote:
> On Tue, Jun 09, 2009 at 04:27:29PM +0800, Mel Gorman wrote:
> > On Tue, Jun 09, 2009 at 10:25:49AM +0800, Wu Fengguang wrote:
> > > On Mon, Jun 08, 2009 at 09:01:29PM +0800, Mel Gorman wrote:
> > > > On NUMA machines, the administrator can configure zone_reclaim_mode, which
> > > > is a more targeted form of direct reclaim. On machines with large NUMA
> > > > distances for example, zone_reclaim_mode defaults to 1, meaning that clean
> > > > unmapped pages will be reclaimed if the zone watermarks are not being met.
> > > >
> > > > There is a heuristic that determines if the scan is worthwhile but the
> > > > problem is that the heuristic is not being properly applied and is basically
> > > > assuming zone_reclaim_mode is 1 if it is enabled.
> > > >
> > > > This patch makes zone_reclaim() make a better attempt at working out how
> > > > many pages it might be able to reclaim given the current reclaim_mode. If it
> > > > cannot clean pages, then NR_FILE_DIRTY number of pages are not candidates. If
> > > > it cannot swap, then NR_FILE_MAPPED are not. This indirectly addresses tmpfs
> > > > as those pages tend to be dirty as they are not cleaned by pdflush or sync.
> > >
> > > No, tmpfs pages are not accounted in NR_FILE_DIRTY because of the
> > > BDI_CAP_NO_ACCT_AND_WRITEBACK bits.
> > >
> >
> > Ok, that explains why the dirty page count was not as high as I was
> > expecting. Thanks.
> >
> > > > The ideal would be that the number of tmpfs pages would also be known
> > > > and account for like NR_FILE_MAPPED as swap is required to discard them.
> > > > A means of working this out quickly was not obvious but a comment is added
> > > > noting the problem.
> > >
> > > I'd rather prefer it be accounted separately than to muck up NR_FILE_MAPPED :)
> > >
> >
> > Maybe I used a poor choice of words. What I meant was that the ideal would
> > be we had a separate count for tmpfs pages. As tmpfs pages and mapped pages
> > both have to be unmapped and potentially, they are "like" each other with
> > respect to the zone_reclaim_mode and how it behaves. We would end up
> > with something like
> >
> > pagecache_reclaimable -= zone_page_state(zone, NR_FILE_MAPPED);
> > pagecache_reclaimable -= zone_page_state(zone, NR_FILE_TMPFS);
>
> OK. But tmpfs pages may be mapped, so there will be double counting.
> We must at least make sure pagecache_reclaimable won't get underflowed.

True. What vmscan-change-the-number-of-the-unmapped-files-in-zone-reclaim.patch
does might be better overall.
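
To illustrate the double-counting concern, an underflow-safe version of
the accounting might look like the following (a sketch only; NR_FILE_TMPFS
is a hypothetical counter and does not exist):

	long pagecache_reclaimable;

	/*
	 * Sketch: a tmpfs page may also be mapped, so subtracting both
	 * counters can count the same page twice. Clamp the result at
	 * zero rather than allowing it to underflow.
	 */
	pagecache_reclaimable = zone_page_state(zone, NR_FILE_PAGES);
	if (!(zone_reclaim_mode & RECLAIM_WRITE))
		pagecache_reclaimable -= zone_page_state(zone, NR_FILE_DIRTY);
	if (!(zone_reclaim_mode & RECLAIM_SWAP)) {
		pagecache_reclaimable -= zone_page_state(zone, NR_FILE_MAPPED);
		pagecache_reclaimable -= zone_page_state(zone, NR_FILE_TMPFS);
	}
	if (pagecache_reclaimable < 0)
		pagecache_reclaimable = 0;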

> (Or make another LRU list for tmpfs pages?)
>

Another LRU won't help the accounting and would change how reclaim works
too significantly.

> > > > + int pagecache_reclaimable;
> > > > +
> > > > + /*
> > > > + * Work out how many page cache pages we can reclaim in this mode.
> > > > + *
> > > > + * NOTE: Ideally, tmpfs pages would be accounted as if they were
> > > > + * NR_FILE_MAPPED as swap is required to discard those
> > > > + * pages even when they are clean. However, there is no
> > > > + * way of quickly identifying the number of tmpfs pages
> > > > + */
> > >
> > > So can you remove the note on NR_FILE_MAPPED?
> > >
> >
> > Why would I remove the note? I can alter the wording but the intention is
> > to show we cannot count the number of tmpfs pages quickly and it would be
> > nice if we could. Maybe this is clearer?
> >
> > Note: Ideally tmpfs pages would be accounted for as NR_FILE_TMPFS or
> > similar and treated similar to NR_FILE_MAPPED as both require
> > unmapping from page tables and potentially swap to reclaim.
> > However, no such counter exists.
>
> That's better. Thanks.
>
> > > > + pagecache_reclaimable = zone_page_state(zone, NR_FILE_PAGES);
> > > > + if (!(zone_reclaim_mode & RECLAIM_WRITE))
> > > > + pagecache_reclaimable -= zone_page_state(zone, NR_FILE_DIRTY);
> > >
> > > > + if (!(zone_reclaim_mode & RECLAIM_SWAP))
> > > > + pagecache_reclaimable -= zone_page_state(zone, NR_FILE_MAPPED);
> > >
> > > So the "if" can be removed because NR_FILE_MAPPED is not related to swapping?
> > >
> >
> > It's partially related with respect to what zone_reclaim() is doing.
> > Once something is mapped, we need RECLAIM_SWAP set on the
> > zone_reclaim_mode to do anything useful with them.
>
> You are referring to mapped anonymous/tmpfs pages? But I mean
> NR_FILE_MAPPED pages won't go to swap when unmapped.
>

Not all of them. But some of them, backed by real files, will be discarded
if clean at the next pass.

> Thanks,
> Fengguang
>
> > > > /*
> > > > * Zone reclaim reclaims unmapped file backed pages and
> > > > @@ -2391,8 +2406,7 @@ int zone_reclaim(struct zone *zone, gfp_t gfp_mask, unsigned int order)
> > > > * if less than a specified percentage of the zone is used by
> > > > * unmapped file backed pages.
> > > > */
> > > > - if (zone_page_state(zone, NR_FILE_PAGES) -
> > > > - zone_page_state(zone, NR_FILE_MAPPED) <= zone->min_unmapped_pages
> > > > + if (pagecache_reclaimable <= zone->min_unmapped_pages
> > > > && zone_page_state(zone, NR_SLAB_RECLAIMABLE)
> > > > <= zone->min_slab_pages)
> > > > return 0;
> > > > --
> > > > 1.5.6.5
> > >
> >
> > --
> > Mel Gorman
> > Part-time Phd Student Linux Technology Center
> > University of Limerick IBM Dublin Software Lab
>

--
Mel Gorman
Part-time Phd Student Linux Technology Center
University of Limerick IBM Dublin Software Lab

2009-06-09 10:50:41

by KOSAKI Motohiro

[permalink] [raw]
Subject: Re: [PATCH 1/3] Reintroduce zone_reclaim_interval for when zone_reclaim() scans and fails to avoid CPU spinning at 100% on NUMA

> On Tue, Jun 09, 2009 at 06:59:03PM +0900, KOSAKI Motohiro wrote:
> > > > > > Here it is just recording the jiffies value. The real smarts with the counter
> > > > > > use time_before() which I assumed could handle jiffie wrap-arounds. Even
> > > > > > if it doesn't, the consequence is that one scan will occur that could have
> > > > > > been avoided around the time of the jiffie wraparound. The value will then
> > > > > > be reset and it will be fine.
> > > > >
> > > > > time_before() assume two argument are enough nearly time.
> > > > > if we use 32bit cpu and HZ=1000, about jiffies wraparound about one month.
> > > > >
> > > > > Then,
> > > > >
> > > > > 1. zone reclaim failure occur
> > > > > 2. system works fine for one month
> > > > > 3. jiffies wrap and time_before() makes mis-calculation.
> > > > >
> > > >
> > > > And the scan occurs uselessly and zone_reclaim_failure gets set again.
> > > > I believe the one useless scan is not significant enough to warrent dealing
> > > > with jiffie wraparound.
> > >
> > > Thank you for kindful explanation.
> > > I fully agreed.
> >
> > Bah, no, not agreed.
> > simple last failure recording makes following scenario.
> >
> >
> > 1. zone reclaim failure occur. update zone_reclaim_failure.
> > ^
> > | time_before() return 1, and zone_reclaim() return immediately.
> > v
> > 2. after 32 second.
> > ^
> > | time_before() return 0, and zone_reclaim() works normally
> > v
> > 3. after one month
> > ^
> > | time_before() return 1, and zone_reclaim() return immediately.
> > | although recent zone_reclaim() never failed.
> > v
> > 4. after more one month
> >
>
> Pants.
>
> /me slaps self
>
> + /* Watch for jiffie wraparound */
> + if (unlikely(jiffies < zone->zone_reclaim_failure))
> + zone->zone_reclaim_failure = jiffies;
> +
> + /* Do not attempt a scan if scanning failed recently */
> + if (time_before(jiffies,
> + zone->zone_reclaim_failure + zone_reclaim_interval))
> + return 0;
> +
>
> ?

looks good.



2009-06-09 12:05:36

by KOSAKI Motohiro

[permalink] [raw]
Subject: Re: [PATCH 3/3] Do not unconditionally treat zones that fail zone_reclaim() as full

> > hmmm
> > I haven't caught your meaning yet, sorry.
> > Could you please explain more?
> >
> > My points of confusion are:
> >
> > 1.
> > ----
> > I think your patch almost reverts the essence of Paul's 9276b1bc96a132f4068fdee00983c532f43d3a26.
> > After your patch is applied, zlc_mark_zone_full() is called only when zone_is_all_unreclaimable()==1
> > or memory stealed after zone_watermark_ok() rechecking.
> >
>
> It's true that the zone is only being marked full when it's .... full due
> to all pages being unreclaimable. Maybe this is too aggressive.
>
> > but zone_is_all_unreclaimable() is very rare on large NUMA machines. Thus
> > your patch makes the zlc_zone_worth_trying() check worthless.
> > So, I would prefer simply reverting 9276b1bc rather than introducing more mess if necessary.
> >
> > but necessary? why?
> >
>
> Allegedly the ZLC cache reduces scanning on large NUMA machines but I have no figures
> proving or disproving that so I'm wary of a full revert.
>
> The danger as I see it is that zones get skipped when there was no need
> simply because the previous caller failed to scan with the case of the GFP
> flags causing the zone to be marked full of particular concern.
>
> I was also concerned that once it was marked full, the zone was unconditionally
> skipped even though the next caller might be using a different watermark
> level like ALLOC_WMARK_LOW or ALLOC_NO_WATERMARKS.

Right.


> How about the following.
>
> o If the zone is fully unreclaimable - mark full
> o If the zone_reclaim() avoids the scan because of the number of pages
> and the current setting of reclaim_mode - mark full
> o If the scan occurs but enough pages were not reclaimed to meet the
> watermarks - mark full

Looks good.


>
> This is the important part
>
> o Push the zlc_zone_worth_trying() check further down to take place after
> the watermark check has failed but before zone_reclaim() is considered
>
> The last part in particular is important because it might mean the
> zone_reclaim_interval can be later dropped because the zlc does the necessary
> scan avoidance for a period of time. It also means that a check of a bitmap
> is happening outside of a fast path.

hmmm...
I guess the intention of zlc_zone_worth_trying() is to reduce the number of
zone_watermark_ok() calls, because zone_watermark_ok() is a fairly heavyweight function.

I also strongly hope to improve the fast path of the page allocator, but I'm afraid
this change breaks the value of the ZLC completely.

What do you think of this? I think this is the key point of this change.



> > 2.
> > -----
> > Why simple following switch-case is wrong?
> >
> > case ZONE_RECLAIM_NOSCAN:
> > goto try_next_zone;
> > case ZONE_RECLAIM_FULL:
> > case ZONE_RECLAIM_SOME:
> > goto this_zone_full;
> > case ZONE_RECLAIM_SUCCESS
> > ; /* do nothing */
> >
> > I mean,
> > (1) ZONE_RECLAIM_SOME and zone_watermark_ok()==1
> > are rare.
>
> How rare? In the event the zone is under pressure, we could be just on the
> watermark. If we're within 32 pages of that watermark, then reclaiming some
> pages might just be enough to meet the watermark so why consider it full?

I mean, typically zone-reclaim can find 32 reclaimable clean pages easily.
That is because
- in the current kernel, the dirty ratio works perfectly;
an all-pages-dirty scenario never happens.
- now we have the split LRU; plenty of anon pages don't prevent
reclaiming file-backed pages.


> > Is rechecking really worth?
>
> If we don't recheck and we reclaimed just 1 page, we allow a caller
> to go below watermarks. This could have an impact on GFP_ATOMIC
> allocations.

Does reclaiming just 1 page really happen?


> > In my experience, zone_watermark_ok() is not so fast function.
> >
>
> It's not, but watermarks can't be ignored just because the function is not
> fast. For what it's worth, we are already in a horrible slow path by the
> time we're reclaiming pages and the cost of zone_watermark_ok() is less
> of a concern?

for clarification,

reclaim bail-out (commit a79311c1) changed zone-reclaim behavior too.

distro zone reclaim is horribly slow because ZONE_RECLAIM_PRIORITY==4,
but the mainline kernel's zone reclaim isn't so slow. It has bail-out and
effective split-LRU based reclaim.

but unfortunately bail-out causes zone-reclaim to be called frequently, because
one zone-reclaim pass only reclaims 32 pages.

in the distro kernel, zone_watermark_ok() x number-of-zone-reclaim-calls is not
heavy at all, but that premise has changed.



> > And,
> >
> > (2) ZONE_RECLAIM_SUCCESS and zone_watermark_ok()==0
> >
> > is also rare.
>
> Again, how rare? I don't actually know myself.

It only happens when reclaim succeeds and another thread steals the memory.


>
> > What do you afraid bad thing?
>
> Because watermarks are important.

Yes.



> > I guess you mean the high-order allocation and ZONE_RECLAIM_SUCCESS and
> > zone_watermark_ok()==0 case, right?
> >
> > If so, why does your system make high-order allocations so frequently?
>
> This is not about high-order allocations.

ok.


> > 3.
> > ------
> > your patch do:
> >
> > 1. call zone_reclaim() and return ZONE_RECLAIM_SUCCESS
> > 2. another thread steal memory
> > 3. call zone_watermark_ok() and return 0
> >
> > but
> >
> > 1. call zone_reclaim() and return ZONE_RECLAIM_SUCCESS
> > 2. call zone_watermark_ok() and return 1
> > 3. another thread steal memory
> > 4. call buffered_rmqueue() and return NULL
> >
> > Then, it call zlc_mark_zone_full().
> >
> > it seems a bit inconsistency.
> >
>
> There is a relatively harmless race in there when memory is extremely
> tight and there are multiple threads contending. Potentially, we go one
> page below the watermark per thread contending on the one zone because
> we are not locking in this path and the allocation could be satisfied
> from the per-cpu allocator.
>
> However, I do not see this issue as being serious enough to warrant
> fixing because it would require a lock just to very strictly adhere to
> the watermarks. It's different to the case above where if we did not check
> watermarks, a thread can go below the watermark without any other thread
> contending.

I agree this is not so important. OK, I withdraw this claim.



2009-06-09 12:08:21

by Fengguang Wu

[permalink] [raw]
Subject: Re: [PATCH 2/3] Properly account for the number of page cache pages zone_reclaim() can reclaim

On Tue, Jun 09, 2009 at 06:48:10PM +0800, Mel Gorman wrote:
> On Tue, Jun 09, 2009 at 04:45:50PM +0800, Wu Fengguang wrote:
> > On Tue, Jun 09, 2009 at 04:27:29PM +0800, Mel Gorman wrote:
> > > On Tue, Jun 09, 2009 at 10:25:49AM +0800, Wu Fengguang wrote:
> > > > On Mon, Jun 08, 2009 at 09:01:29PM +0800, Mel Gorman wrote:
> > > > > On NUMA machines, the administrator can configure zone_reclaim_mode, which
> > > > > is a more targeted form of direct reclaim. On machines with large NUMA
> > > > > distances for example, zone_reclaim_mode defaults to 1, meaning that clean
> > > > > unmapped pages will be reclaimed if the zone watermarks are not being met.
> > > > >
> > > > > There is a heuristic that determines if the scan is worthwhile but the
> > > > > problem is that the heuristic is not being properly applied and is basically
> > > > > assuming zone_reclaim_mode is 1 if it is enabled.
> > > > >
> > > > > This patch makes zone_reclaim() make a better attempt at working out how
> > > > > many pages it might be able to reclaim given the current reclaim_mode. If it
> > > > > cannot clean pages, then NR_FILE_DIRTY number of pages are not candidates. If
> > > > > it cannot swap, then NR_FILE_MAPPED are not. This indirectly addresses tmpfs
> > > > > as those pages tend to be dirty as they are not cleaned by pdflush or sync.
> > > >
> > > > No, tmpfs pages are not accounted in NR_FILE_DIRTY because of the
> > > > BDI_CAP_NO_ACCT_AND_WRITEBACK bits.
> > > >
> > >
> > > Ok, that explains why the dirty page count was not as high as I was
> > > expecting. Thanks.
> > >
> > > > > The ideal would be that the number of tmpfs pages would also be known
> > > > > and account for like NR_FILE_MAPPED as swap is required to discard them.
> > > > > A means of working this out quickly was not obvious but a comment is added
> > > > > noting the problem.
> > > >
> > > > I'd rather prefer it be accounted separately than to muck up NR_FILE_MAPPED :)
> > > >
> > >
> > > Maybe I used a poor choice of words. What I meant was that the ideal would
> > > be we had a separate count for tmpfs pages. As tmpfs pages and mapped pages
> > > both have to be unmapped and potentially, they are "like" each other with
> > > respect to the zone_reclaim_mode and how it behaves. We would end up
> > > with something like
> > >
> > > pagecache_reclaimable -= zone_page_state(zone, NR_FILE_MAPPED);
> > > pagecache_reclaimable -= zone_page_state(zone, NR_FILE_TMPFS);
> >
> > OK. But tmpfs pages may be mapped, so there will be double counting.
> > We must at least make sure pagecache_reclaimable won't get underflowed.
>
> True. What vmscan-change-the-number-of-the-unmapped-files-in-zone-reclaim.patch
> does might be better overall.

Yup.

> > (Or make another LRU list for tmpfs pages?)
> >
>
> Another LRU won't help the accounting and will changes too significantly
> how reclaim works.

OK.

> > > > > + int pagecache_reclaimable;
> > > > > +
> > > > > + /*
> > > > > + * Work out how many page cache pages we can reclaim in this mode.
> > > > > + *
> > > > > + * NOTE: Ideally, tmpfs pages would be accounted as if they were
> > > > > + * NR_FILE_MAPPED as swap is required to discard those
> > > > > + * pages even when they are clean. However, there is no
> > > > > + * way of quickly identifying the number of tmpfs pages
> > > > > + */
> > > >
> > > > So can you remove the note on NR_FILE_MAPPED?
> > > >
> > >
> > > Why would I remove the note? I can alter the wording but the intention is
> > > to show we cannot count the number of tmpfs pages quickly and it would be
> > > nice if we could. Maybe this is clearer?
> > >
> > > Note: Ideally tmpfs pages would be accounted for as NR_FILE_TMPFS or
> > > similar and treated similar to NR_FILE_MAPPED as both require
> > > unmapping from page tables and potentially swap to reclaim.
> > > However, no such counter exists.
> >
> > That's better. Thanks.
> >
> > > > > + pagecache_reclaimable = zone_page_state(zone, NR_FILE_PAGES);
> > > > > + if (!(zone_reclaim_mode & RECLAIM_WRITE))
> > > > > + pagecache_reclaimable -= zone_page_state(zone, NR_FILE_DIRTY);
> > > >
> > > > > + if (!(zone_reclaim_mode & RECLAIM_SWAP))
> > > > > + pagecache_reclaimable -= zone_page_state(zone, NR_FILE_MAPPED);
> > > >
> > > > So the "if" can be removed because NR_FILE_MAPPED is not related to swapping?
> > > >
> > >
> > > It's partially related with respect to what zone_reclaim() is doing.
> > > Once something is mapped, we need RECLAIM_SWAP set on the
> > > zone_reclaim_mode to do anything useful with them.
> >
> > You are referring to mapped anonymous/tmpfs pages? But I mean
> > NR_FILE_MAPPED pages won't goto swap when unmapped.
> >
>
> Not all of them. But some of them backed by real files will be discarded
> if clean at the next pass

Right.

Thanks,
Fengguang

> > > > > /*
> > > > > * Zone reclaim reclaims unmapped file backed pages and
> > > > > @@ -2391,8 +2406,7 @@ int zone_reclaim(struct zone *zone, gfp_t gfp_mask, unsigned int order)
> > > > > * if less than a specified percentage of the zone is used by
> > > > > * unmapped file backed pages.
> > > > > */
> > > > > - if (zone_page_state(zone, NR_FILE_PAGES) -
> > > > > - zone_page_state(zone, NR_FILE_MAPPED) <= zone->min_unmapped_pages
> > > > > + if (pagecache_reclaimable <= zone->min_unmapped_pages
> > > > > && zone_page_state(zone, NR_SLAB_RECLAIMABLE)
> > > > > <= zone->min_slab_pages)
> > > > > return 0;
> > > > > --
> > > > > 1.5.6.5
> > > >
> > >
> > > --
> > > Mel Gorman
> > > Part-time Phd Student Linux Technology Center
> > > University of Limerick IBM Dublin Software Lab
> >
>
> --
> Mel Gorman
> Part-time Phd Student Linux Technology Center
> University of Limerick IBM Dublin Software Lab

2009-06-09 13:29:09

by Mel Gorman

[permalink] [raw]
Subject: Re: [PATCH 3/3] Do not unconditionally treat zones that fail zone_reclaim() as full

On Tue, Jun 09, 2009 at 09:05:19PM +0900, KOSAKI Motohiro wrote:
> > > hmmm
> > > I haven't catch your mention yet. sorry.
> > > Could you please explain more?
> > >
> > > My confuseness are:
> > >
> > > 1.
> > > ----
> > > I think your patch almost revert Paul's 9276b1bc96a132f4068fdee00983c532f43d3a26 essence.
> > > after your patch applied, zlc_mark_zone_full() is called only when zone_is_all_unreclaimable()==1
> > > or memory stealed after zone_watermark_ok() rechecking.
> > >
> >
> > It's true that the zone is only being marked full when it's .... full due
> > to all pages being unreclaimable. Maybe this is too aggressive.
> >
> > > but zone_is_all_unreclaimable() is very rare on large NUMA machine. Thus
> > > your patch makes zlc_zone_worth_trying() check to worthless.
> > > So, I like simple reverting 9276b1bc rather than introduce more messy if necessary.
> > >
> > > but necessary? why?
> > >
> >
> > Allegedly the ZLC cache reduces on large NUMA machines but I have no figures
> > proving or disproving that so I'm wary of a full revert.
> >
> > The danger as I see it is that zones get skipped when there was no need
> > simply because the previous caller failed to scan with the case of the GFP
> > flags causing the zone to be marked full of particular concern.
> >
> > I was also concerned that once it was marked full, the zone was unconditionally
> > skipped even though the next caller might be using a different watermark
> > level like ALLOC_WMARK_LOW or ALLOC_NO_WATERMARKS.
>
> Right.
>
> > How about the following.
> >
> > o If the zone is fully unreclaimable - mark full
> > o If the zone_reclaim() avoids the scan because of the number of pages
> > and the current setting of reclaim_mode - mark full
> > o If the scan occurs but enough pages were not reclaimed to meet the
> > watermarks - mark full
>
> Looks good.
>

Ok, I've made those changes.

>
> >
> > This is the important part
> >
> > o Push down the zlc_zone_worth_trying() further down to take place after
> > the watermark check has failed but before reclaim_zone() is considered
> >
> > The last part in particular is important because it might mean the
> > zone_reclaim_interval can be later dropped because the zlc does the necessary
> > scan avoidance for a period of time. It also means that a check of a bitmap
> > is happening outside of a fast path.
>
> hmmm...
> I guess the intension of zlc_zone_worth_trying() is for reduce zone_watermark_ok() calling.
> it's because zone_watermark_ok() is a bit heavy weight function.
>

It's possible, and according to the commit that added this, the ignoring
of watermarks was deliberate, per this note:

- I pay no attention to the various watermarks and such in this performance
hint. A node could be marked full for one watermark, and then skipped
over when searching for a page using a different watermark. I think
that's actually quite ok, as it will tend to slightly increase the
spreading of memory over other nodes, away from a memory stressed node.

> I also strongly hope to improve fast-path of page allocator. but I'm afraid
> this change break ZLC worth perfectly.
>
> What do you think this? I think this is key point of this change.
>

I'll leave the zlc check where it is so. The other changes as to when
the zone is considered full still make sense.

> > > 2.
> > > -----
> > > Why simple following switch-case is wrong?
> > >
> > > case ZONE_RECLAIM_NOSCAN:
> > > goto try_next_zone;
> > > case ZONE_RECLAIM_FULL:
> > > case ZONE_RECLAIM_SOME:
> > > goto this_zone_full;
> > > case ZONE_RECLAIM_SUCCESS
> > > ; /* do nothing */
> > >
> > > I mean,
> > > (1) ZONE_RECLAIM_SOME and zone_watermark_ok()==1
> > > are rare.
> >
> > How rare? In the event the zone is under pressure, we could be just on the
> > watermark. If we're within 32 pages of that watermark, then reclaiming some
> > pages might just be enough to meet the watermark so why consider it full?
>
> I mean, typically zone-reclaim can found reclaimable clean 32 pages easily.
> it mean
> - in current kernel, dirty-ratio works perfectly.
> all pages dirty scenario never happend.
> - now, we have split lru. plenty anon pages don't prevent
> reclaim file-backed page.
>

Assuming zone_reclaim() easily reclaimed 32 pages does not mean that we
are above the watermark for this allocation. For example, let's say another
request stream was ignoring watermarks or using the minimum watermark and
we're contending. If the current request must be over the normal watermark,
32 pages may not be enough to get over that watermark due to the requests
ignoring watermarks, and it should still fail and move on to the next zone.

In the case of ZONE_RECLAIM_SOME, we have two choices. We can check the
watermark, hope we are not meeting it and if so allocate a page or we can
just give up and go off-node. I believe that one last check of the watermarks
is potentially cheaper than definitely going off-node.
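
To make that concrete, the handling I have in mind is roughly the
following (a sketch based on your switch statement, not the final patch;
the watermark arguments are abbreviated):

	switch (zone_reclaim(zone, gfp_mask, order)) {
	case ZONE_RECLAIM_NOSCAN:
		/* The scan was avoided up front; the zone was not scanned */
		goto try_next_zone;
	case ZONE_RECLAIM_FULL:
		/* The scan ran but made no progress */
		goto this_zone_full;
	default:
		/*
		 * ZONE_RECLAIM_SOME or ZONE_RECLAIM_SUCCESS: recheck
		 * the watermark before treating the zone as full
		 */
		if (!zone_watermark_ok(zone, order, mark,
					classzone_idx, alloc_flags))
			goto this_zone_full;
	}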

>
> > > Is rechecking really worth?
> >
> > If we don't recheck and we reclaimed just 1 page, we allow a caller
> > to go below watermarks. This could have an impact on GFP_ATOMIC
> > allocations.
>
> Is jsut 1 page reclaimed really happen?
>

Probably not, but the cost of the zone_watermark check after doing the
LRU scan shouldn't be of major concern.

>
> > > In my experience, zone_watermark_ok() is not so fast function.
> > >
> >
> > It's not, but watermarks can't be ignored just because the function is not
> > fast. For what it's worth, we are already in a horrible slow path by the
> > time we're reclaiming pages and the cost of zone_watermark_ok() is less
> > of a concern?
>
> for clarification,
>
> reclaim bail out (commit a79311c1) changed zone-reclaim behavior too.
>
> distro zone reclaim is horrible slow. it's because ZONE_RECLAIM_PRIORITY==4.
> but mainline kernel's zone reclaim isn't so slow. it have bail-out and
> effective split-lru based reclaim.
>
> but unfortunately bail-out cause frequently zone-reclaim calling, because
> one time zone-reclaim only reclaim 32 pages.
>
> in distro kernel, zone_watermark_ok() x number-of-called-zone-reclaim is not
> heavy at all. but its premise was changed.
>

Even though it is bailing out faster, I find it difficult to believe that
the cost of zone_watermark_ok() is anywhere near the cost of an LRU scan no
matter how fast it bails out.

>
>
> > > And,
> > >
> > > (2) ZONE_RECLAIM_SUCCESS and zone_watermark_ok()==0
> > >
> > > is also rare.
> >
> > Again, how rare? I don't actually know myself.
>
> it only happen reclaim success and another thread steal it.
>
>
> >
> > > What do you afraid bad thing?
> >
> > Because watermarks are important.
>
> Yes.
>
>
>
> > > I guess, high-order allocation and ZONE_RECLAIM_SUCCESS and
> > > zone_watermark_ok()==0 case, right?
> > >
> > > if so, Why your system makes high order allocation so freqently?
> >
> > This is not about high-order allocations.
>
> ok.
>
>
> > > 3.
> > > ------
> > > your patch do:
> > >
> > > 1. call zone_reclaim() and return ZONE_RECLAIM_SUCCESS
> > > 2. another thread steal memory
> > > 3. call zone_watermark_ok() and return 0
> > >
> > > but
> > >
> > > 1. call zone_reclaim() and return ZONE_RECLAIM_SUCCESS
> > > 2. call zone_watermark_ok() and return 1
> > > 3. another thread steal memory
> > > 4. call buffered_rmqueue() and return NULL
> > >
> > > Then, it call zlc_mark_zone_full().
> > >
> > > it seems a bit inconsistency.
> > >
> >
> > There is a relatively harmless race in there when memory is extremely
> > tight and there are multiple threads contending. Potentially, we go one
> > page below the watermark per thread contending on the one zone because
> > we are not locking in this path and the allocation could be satisified
> > from the per-cpu allocator.
> >
> > However, I do not see this issue as being serious enough to warrent
> > fixing because it would require a lock just to very strictly adhere to
> > the watermarks. It's different to the case above where if we did not check
> > watermarks, a thread can go below the watermark without any other thread
> > contending.
>
> I agree with this is not so important. ok, I get rid of this claim.
>

Ok.

--
Mel Gorman
Part-time Phd Student Linux Technology Center
University of Limerick IBM Dublin Software Lab

2009-06-09 13:38:31

by Fengguang Wu

[permalink] [raw]
Subject: Re: [PATCH 1/3] Reintroduce zone_reclaim_interval for when zone_reclaim() scans and fails to avoid CPU spinning at 100% on NUMA

On Tue, Jun 09, 2009 at 05:40:50PM +0800, Mel Gorman wrote:
> On Tue, Jun 09, 2009 at 05:07:35PM +0800, Wu Fengguang wrote:
> > On Tue, Jun 09, 2009 at 04:31:54PM +0800, Mel Gorman wrote:
> > > On Tue, Jun 09, 2009 at 04:25:39PM +0800, Wu Fengguang wrote:
> > > > On Tue, Jun 09, 2009 at 04:14:25PM +0800, Mel Gorman wrote:
> > > > > On Tue, Jun 09, 2009 at 09:58:22AM +0800, Wu Fengguang wrote:
> > > > > > On Mon, Jun 08, 2009 at 09:01:28PM +0800, Mel Gorman wrote:
> > > > > > > On NUMA machines, the administrator can configure zone_reclaim_mode that is a
> > > > > > > more targeted form of direct reclaim. On machines with large NUMA distances,
> > > > > > > zone_reclaim_mode defaults to 1 meaning that clean unmapped pages will be
> > > > > > > reclaimed if the zone watermarks are not being met. The problem is that
> > > > > > > zone_reclaim() can be in a situation where it scans excessively without
> > > > > > > making progress.
> > > > > > >
> > > > > > > One such situation is where a large tmpfs mount is occupying a large
> > > > > > > percentage of memory overall. The pages do not get cleaned or reclaimed by
> > > > > > > zone_reclaim(), but the lists are uselessly scanned frequently, making the
> > > > > > > CPU spin at 100%. The scanning occurs because zone_reclaim() cannot tell
> > > > > > > in advance that the scan is pointless because the counters do not distinguish
> > > > > > > between pagecache pages backed by disk and by RAM. The observation in
> > > > > > > the field is that malloc() stalls for a long time (minutes in some cases)
> > > > > > > when this situation occurs.
> > > > > > >
> > > > > > > Accounting for ram-backed file pages was considered but not implemented on
> > > > > > > the grounds it would be introducing new branches and expensive checks into
> > > > > > > the page cache add/remove paths and increase the number of statistics
> > > > > > > needed in the zone. As zone_reclaim() failing is currently considered a
> > > > > > > corner case, this seemed like overkill. Note, if there are a large number
> > > > > > > of reports about CPU spinning at 100% on NUMA that is fixed by disabling
> > > > > > > zone_reclaim, then this assumption is false and zone_reclaim() scanning
> > > > > > > and failing is not a corner case but a common occurrence.
> > > > > > >
> > > > > > > This patch reintroduces zone_reclaim_interval which was removed by commit
> > > > > > > 34aa1330f9b3c5783d269851d467326525207422 [zoned vm counters: zone_reclaim:
> > > > > > > remove /proc/sys/vm/zone_reclaim_interval] because the zone counters were
> > > > > > > considered sufficient to determine in advance if the scan would succeed.
> > > > > > > As unsuccessful scans can still occur, zone_reclaim_interval is still
> > > > > > > required.
> > > > > >
> > > > > > Can we avoid the user visible parameter zone_reclaim_interval?
> > > > > >
> > > > >
> > > > > You could, but then there is no way of disabling it by setting it to 0
> > > > > either. I can't imagine why but the desired behaviour might really be to
> > > > > spin and never go off-node unless there is no other option. They might
> > > > > want to set it to 0 for example when determining what the right value for
> > > > > zone_reclaim_mode is for their workloads.
> > > > >
> > > > > > That means to introduce some heuristics for it.
> > > > >
> > > > > > I suspect the vast majority of users will ignore it unless they are running
> > > > > > zone_reclaim_mode at the same time, and even then will probably just leave
> > > > > > it at 30, as an LRU scan every 30 seconds worst case is not going to show up
> > > > > > on many profiles.
> > > > >
> > > > > > Since the whole point
> > > > > > is to avoid 100% CPU usage, we can take down the time used for this
> > > > > > failed zone reclaim (T) and forbid zone reclaim until (NOW + 100*T).
> > > > > >
> > > > >
> > > > > i.e. just fix it internally at 100 seconds? How is that better than
> > > > > having an obscure tunable? I think if this heuristic exists at all, it's
> > > > > > important that an administrator be able to turn it off if absolutely
> > > > > necessary and so something must be user-visible.
> > > >
> > > > > That 100*T doesn't mean 100 seconds. It means keeping CPU usage under 1%:
> > > > > after busy scanning for time T, let's relax for 100*T.
> > > >
> > >
> > > Do I have a means of calculating what my CPU usage is as a result of
> > > scanning the LRU list?
> > >
> > > If I don't and the machine is busy, would I not avoid scanning even in
> > > situations where it should have been scanned?
> >
> > I guess we don't really care about the exact number for the ratio 100.
> > If the box is busy, it automatically scales the effective ratio to 200
> > or more, which I think is reasonable behavior.
> >
> > Something like this.
> >
> > Thanks,
> > Fengguang
> >
> > ---
> > include/linux/mmzone.h | 2 ++
> > mm/vmscan.c | 11 +++++++++++
> > 2 files changed, 13 insertions(+)
> >
> > --- linux.orig/include/linux/mmzone.h
> > +++ linux/include/linux/mmzone.h
> > @@ -334,6 +334,8 @@ struct zone {
> > /* Zone statistics */
> > atomic_long_t vm_stat[NR_VM_ZONE_STAT_ITEMS];
> >
> > + unsigned long zone_reclaim_relax;
> > +
> > /*
> > * prev_priority holds the scanning priority for this zone. It is
> > * defined as the scanning priority at which we achieved our reclaim
> > --- linux.orig/mm/vmscan.c
> > +++ linux/mm/vmscan.c
> > @@ -2453,6 +2453,7 @@ int zone_reclaim(struct zone *zone, gfp_
> > int ret;
> > long nr_unmapped_file_pages;
> > long nr_slab_reclaimable;
> > + unsigned long t;
> >
> > /*
> > * Zone reclaim reclaims unmapped file backed pages and
> > @@ -2475,6 +2476,11 @@ int zone_reclaim(struct zone *zone, gfp_
> > if (zone_is_all_unreclaimable(zone))
> > return 0;
> >
> > + if (time_in_range(zone->zone_reclaim_relax - 10000 * HZ,
> > + jiffies,
> > + zone->zone_reclaim_relax))
> > + return 0;
> > +
>
> So. zone_reclaim_relax is some value between now and 100 times the approximate
> time it takes to scan the LRU list. This check ensures that we do not scan
> multiple times within the same interval. Is that right?

Yes and no: zone_reclaim_relax is the *absolute* time for that.
This check ensures that if we wasted T seconds doing a fruitless
zone reclaim, zone reclaim won't be repeated in the following 100*T
seconds - which is a coarse relax period.

Its simpler form is: time_before(jiffies, zone_reclaim_relax),
if not considering wraparound issues.

> > /*
> > * Do not scan if the allocation should not be delayed.
> > */
> > @@ -2493,7 +2499,12 @@ int zone_reclaim(struct zone *zone, gfp_
> >
> > if (zone_test_and_set_flag(zone, ZONE_RECLAIM_LOCKED))
> > return 0;
> > + t = jiffies;
> > ret = __zone_reclaim(zone, gfp_mask, order);
> > + if (sc.nr_reclaimed == 0) {
> > + t = min_t(unsigned long, 10000 * HZ, 100 * (jiffies - t));
> > + zone->zone_reclaim_relax = jiffies + t;
> > + }
>
> This appears to be a way of automatically selecting a value for
> zone_reclaim_interval but is 100 times the length of time it takes to scan the
> LRU list enough to avoid excessive scanning of the LRU lists by zone_reclaim?

Exactly.

> I don't know and unlike zone_reclaim_interval, we have no way for the
> administrator to intervene in the event we get the calculation wrong.
>
> Conceivably though, zone_reclaim_interval could automatically tune
> itself based on a heuristic like this if the administrator does not give
> a specific value. I think that would be an interesting follow on once
> we've brought back zone_reclaim_interval and get a feeling for how often
> it is actually used.

Well I don't think that's good practice. There are heuristic
calculations all over the kernel. Shall we export parameters to
user space just because we are not absolutely sure? Or shall we ship
the heuristics and do adjustments based on feedback and only export
parameters when we find _known cases_ that cannot be covered by pure
heuristics?

Thanks,
Fengguang

> > zone_clear_flag(zone, ZONE_RECLAIM_LOCKED);
> >
> > return ret;
> >
>
> --
> Mel Gorman
> Part-time Phd Student Linux Technology Center
> University of Limerick IBM Dublin Software Lab

2009-06-09 15:06:28

by Mel Gorman

[permalink] [raw]
Subject: Re: [PATCH 1/3] Reintroduce zone_reclaim_interval for when zone_reclaim() scans and fails to avoid CPU spinning at 100% on NUMA

On Tue, Jun 09, 2009 at 09:38:04PM +0800, Wu Fengguang wrote:
> On Tue, Jun 09, 2009 at 05:40:50PM +0800, Mel Gorman wrote:
> > On Tue, Jun 09, 2009 at 05:07:35PM +0800, Wu Fengguang wrote:
> > > On Tue, Jun 09, 2009 at 04:31:54PM +0800, Mel Gorman wrote:
> > > > On Tue, Jun 09, 2009 at 04:25:39PM +0800, Wu Fengguang wrote:
> > > > > On Tue, Jun 09, 2009 at 04:14:25PM +0800, Mel Gorman wrote:
> > > > > > On Tue, Jun 09, 2009 at 09:58:22AM +0800, Wu Fengguang wrote:
> > > > > > > On Mon, Jun 08, 2009 at 09:01:28PM +0800, Mel Gorman wrote:
> > > > > > > > On NUMA machines, the administrator can configure zone_reclaim_mode that is a
> > > > > > > > more targeted form of direct reclaim. On machines with large NUMA distances,
> > > > > > > > zone_reclaim_mode defaults to 1 meaning that clean unmapped pages will be
> > > > > > > > reclaimed if the zone watermarks are not being met. The problem is that
> > > > > > > > zone_reclaim() can be in a situation where it scans excessively without
> > > > > > > > making progress.
> > > > > > > >
> > > > > > > > One such situation is where a large tmpfs mount is occupying a large
> > > > > > > > percentage of memory overall. The pages do not get cleaned or reclaimed by
> > > > > > > > zone_reclaim(), but the lists are uselessly scanned frequently, making the
> > > > > > > > CPU spin at 100%. The scanning occurs because zone_reclaim() cannot tell
> > > > > > > > in advance that the scan is pointless because the counters do not distinguish
> > > > > > > > between pagecache pages backed by disk and by RAM. The observation in
> > > > > > > > the field is that malloc() stalls for a long time (minutes in some cases)
> > > > > > > > when this situation occurs.
> > > > > > > >
> > > > > > > > Accounting for ram-backed file pages was considered but not implemented on
> > > > > > > > the grounds it would be introducing new branches and expensive checks into
> > > > > > > > the page cache add/remove paths and increase the number of statistics
> > > > > > > > needed in the zone. As zone_reclaim() failing is currently considered a
> > > > > > > > corner case, this seemed like overkill. Note, if there are a large number
> > > > > > > > of reports about CPU spinning at 100% on NUMA that is fixed by disabling
> > > > > > > > zone_reclaim, then this assumption is false and zone_reclaim() scanning
> > > > > > > > and failing is not a corner case but a common occurrence.
> > > > > > > >
> > > > > > > > This patch reintroduces zone_reclaim_interval which was removed by commit
> > > > > > > > 34aa1330f9b3c5783d269851d467326525207422 [zoned vm counters: zone_reclaim:
> > > > > > > > remove /proc/sys/vm/zone_reclaim_interval] because the zone counters were
> > > > > > > > considered sufficient to determine in advance if the scan would succeed.
> > > > > > > > As unsuccessful scans can still occur, zone_reclaim_interval is still
> > > > > > > > required.
> > > > > > >
> > > > > > > Can we avoid the user visible parameter zone_reclaim_interval?
> > > > > > >
> > > > > >
> > > > > > You could, but then there is no way of disabling it by setting it to 0
> > > > > > either. I can't imagine why but the desired behaviour might really be to
> > > > > > spin and never go off-node unless there is no other option. They might
> > > > > > want to set it to 0 for example when determining what the right value for
> > > > > > zone_reclaim_mode is for their workloads.
> > > > > >
> > > > > > > That means to introduce some heuristics for it.
> > > > > >
> > > > > > > I suspect the vast majority of users will ignore it unless they are running
> > > > > > > zone_reclaim_mode at the same time, and even then will probably just leave
> > > > > > > it at 30, as an LRU scan every 30 seconds worst case is not going to show up
> > > > > > > on many profiles.
> > > > > >
> > > > > > > Since the whole point
> > > > > > > is to avoid 100% CPU usage, we can take down the time used for this
> > > > > > > failed zone reclaim (T) and forbid zone reclaim until (NOW + 100*T).
> > > > > > >
> > > > > >
> > > > > > i.e. just fix it internally at 100 seconds? How is that better than
> > > > > > having an obscure tunable? I think if this heuristic exists at all, it's
> > > > > > > important that an administrator be able to turn it off if absolutely
> > > > > > necessary and so something must be user-visible.
> > > > >
> > > > > > That 100*T doesn't mean 100 seconds. It means keeping CPU usage under 1%:
> > > > > > after busy scanning for time T, let's relax for 100*T.
> > > > >
> > > >
> > > > Do I have a means of calculating what my CPU usage is as a result of
> > > > scanning the LRU list?
> > > >
> > > > If I don't and the machine is busy, would I not avoid scanning even in
> > > > situations where it should have been scanned?
> > >
> > > I guess we don't really care about the exact number for the ratio 100.
> > > If the box is busy, it automatically scales the effective ratio to 200
> > > or more, which I think is reasonable behavior.
> > >
> > > Something like this.
> > >
> > > Thanks,
> > > Fengguang
> > >
> > > ---
> > > include/linux/mmzone.h | 2 ++
> > > mm/vmscan.c | 11 +++++++++++
> > > 2 files changed, 13 insertions(+)
> > >
> > > --- linux.orig/include/linux/mmzone.h
> > > +++ linux/include/linux/mmzone.h
> > > @@ -334,6 +334,8 @@ struct zone {
> > > /* Zone statistics */
> > > atomic_long_t vm_stat[NR_VM_ZONE_STAT_ITEMS];
> > >
> > > + unsigned long zone_reclaim_relax;
> > > +
> > > /*
> > > * prev_priority holds the scanning priority for this zone. It is
> > > * defined as the scanning priority at which we achieved our reclaim
> > > --- linux.orig/mm/vmscan.c
> > > +++ linux/mm/vmscan.c
> > > @@ -2453,6 +2453,7 @@ int zone_reclaim(struct zone *zone, gfp_
> > > int ret;
> > > long nr_unmapped_file_pages;
> > > long nr_slab_reclaimable;
> > > + unsigned long t;
> > >
> > > /*
> > > * Zone reclaim reclaims unmapped file backed pages and
> > > @@ -2475,6 +2476,11 @@ int zone_reclaim(struct zone *zone, gfp_
> > > if (zone_is_all_unreclaimable(zone))
> > > return 0;
> > >
> > > + if (time_in_range(zone->zone_reclaim_relax - 10000 * HZ,
> > > + jiffies,
> > > + zone->zone_reclaim_relax))
> > > + return 0;
> > > +
> >
> > So. zone_reclaim_relax is some value between now and 100 times the approximate
> > time it takes to scan the LRU list. This check ensures that we do not scan
> > multiple times within the same interval. Is that right?
>
> Yes and no: zone_reclaim_relax is the *absolute* time for that.
> This check ensures that if we wasted T seconds doing a fruitless
> zone reclaim, zone reclaim won't be repeated in the following 100*T
> seconds - which is a coarse relax period.
>
> Its simpler form is: time_before(jiffies, zone_reclaim_relax),
> if not considering wraparound issues.
>

Ok

> > > /*
> > > * Do not scan if the allocation should not be delayed.
> > > */
> > > @@ -2493,7 +2499,12 @@ int zone_reclaim(struct zone *zone, gfp_
> > >
> > > if (zone_test_and_set_flag(zone, ZONE_RECLAIM_LOCKED))
> > > return 0;
> > > + t = jiffies;
> > > ret = __zone_reclaim(zone, gfp_mask, order);
> > > + if (sc.nr_reclaimed == 0) {
> > > + t = min_t(unsigned long, 10000 * HZ, 100 * (jiffies - t));
> > > + zone->zone_reclaim_relax = jiffies + t;
> > > + }
> >
> > This appears to be a way of automatically selecting a value for
> > zone_reclaim_interval but is 100 times the length of time it takes to scan the
> > LRU list enough to avoid excessive scanning of the LRU lists by zone_reclaim?
>
> Exactly.
>
> > I don't know and unlike zone_reclaim_interval, we have no way for the
> > administrator to intervene in the event we get the calculation wrong.
> >
> > Conceivably though, zone_reclaim_interval could automatically tune
> > itself based on a heuristic like this if the administrator does not give
> > a specific value. I think that would be an interesting follow on once
> > we've brought back zone_reclaim_interval and get a feeling for how often
> > it is actually used.
>
> Well I don't think that's good practice. There are heuristic
> calculations all over the kernel. Shall we export parameters to
> user space just because we are not absolutely sure? Or shall we ship
> the heuristics and do adjustments based on feedback and only export
> parameters when we find _known cases_ that cannot be covered by pure
> heuristics?
>

Good question - I don't have a satisfactory answer but I intuitively find
the zone_reclaim_interval easier to deal with than the heuristic. That said,
I would prefer if neither was required.

In the patchset, I've added a counter for the number of times that the
scan-avoidance heuristic fails. If the tmpfs problem has been resolved
(patch with bug reporter, am awaiting test), I'll drop zone_reclaim_interval
altogether and we'll use the counter to detect if/when this situation
occurs again.
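
Roughly like the following (a sketch; the event name is illustrative):

	/*
	 * After a zone_reclaim() scan that reclaimed nothing, account
	 * the heuristic failure so it is visible in /proc/vmstat
	 */
	if (!ret)
		count_vm_event(PGSCAN_ZONE_RECLAIM_FAILED);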

> > > zone_clear_flag(zone, ZONE_RECLAIM_LOCKED);
> > >
> > > return ret;
> > >
> >
> > --
> > Mel Gorman
> > Part-time Phd Student Linux Technology Center
> > University of Limerick IBM Dublin Software Lab
>

--
Mel Gorman
Part-time Phd Student Linux Technology Center
University of Limerick IBM Dublin Software Lab

2009-06-10 02:14:49

by Fengguang Wu

[permalink] [raw]
Subject: Re: [PATCH 1/3] Reintroduce zone_reclaim_interval for when zone_reclaim() scans and fails to avoid CPU spinning at 100% on NUMA

On Tue, Jun 09, 2009 at 11:06:19PM +0800, Mel Gorman wrote:
> On Tue, Jun 09, 2009 at 09:38:04PM +0800, Wu Fengguang wrote:
> > On Tue, Jun 09, 2009 at 05:40:50PM +0800, Mel Gorman wrote:
> > >
> > > Conceivably though, zone_reclaim_interval could automatically tune
> > > itself based on a heuristic like this if the administrator does not give
> > > a specific value. I think that would be an interesting follow on once
> > > we've brought back zone_reclaim_interval and get a feeling for how often
> > > it is actually used.
> >
> > Well I don't think that's good practice. There are heuristic
> > calculations all over the kernel. Shall we export parameters to
> > user space just because we are not absolutely sure? Or shall we ship
> > the heuristics, adjust based on feedback, and only export parameters
> > when we find _known cases_ that cannot be covered by pure heuristics?
> >
>
> Good question - I don't have a satisfactory answer but I intuitively find
> the zone_reclaim_interval easier to deal with than the heuristic. That said,
> I would prefer if neither was required.

Yes - can we rely on the (improved) accounting to make our "failure feedback"
patches unnecessary? :)

Thanks,
Fengguang

> In the patchset, I've added a counter for the number of times that the
> scan-avoidance heuristic fails. If the tmpfs problem has been resolved
> (patch with bug reporter, am awaiting test), I'll drop zone_reclaim_interval
> altogether and we'll use the counter to detect if/when this situation
> occurs again.

2009-06-10 05:23:59

by Andrew Morton

[permalink] [raw]
Subject: Re: [PATCH 1/3] Reintroduce zone_reclaim_interval for when zone_reclaim() scans and fails to avoid CPU spinning at 100% on NUMA

On Mon, 8 Jun 2009 16:11:51 +0100 Mel Gorman <[email protected]> wrote:

> On Mon, Jun 08, 2009 at 10:55:55AM -0400, Christoph Lameter wrote:
> > On Mon, 8 Jun 2009, Mel Gorman wrote:
> >
> > > > The tmpfs pages are unreclaimable and therefore should not be on the anon
> > > > lru.
> > > >
> > >
> > > tmpfs pages can be swap-backed so can be reclaimable. Regardless of what
> > > list they are on, we still need to know how many of them there are if
> > > this patch is to be avoided.
> >
> > If they are reclaimable then why does it matter? They can be pushed out if
> > you configure zone reclaim to be that aggressive.
> >
>
> Because they are reclaimable by kswapd or normal direct reclaim but *not*
> reclaimable by zone_reclaim() if the zone_reclaim_mode is not configured
> appropriately.

Ah. (zone_reclaim_mode & RECLAIM_SWAP) == 0. That was important info.

Couldn't the lack of RECLAIM_WRITE cause a similar problem?

> I briefly considered setting zone_reclaim_mode to 7 instead of
> 1 by default for large NUMA distances but that has other serious consequences
> such as paging in preference to going off-node as a default out-of-box
> behaviour.

Maybe we should consider that a bit harder. At what stage does
zone_reclaim decide to give up and try a different node? Perhaps it's
presently too reluctant to do that?

> The point of the patch is that the heuristics that avoid the scan are not
> perfect. In the event they are wrong and a useless scan occurs, the response
> of the kernel after a useless scan should not be to uselessly scan a load
> more times around the LRU lists making no progress.

It would be sad to bring back a jiffies-based thing into page reclaim.
Wall time has little correlation with the rate of page allocation and
reclaim activity.

2009-06-10 06:44:17

by KOSAKI Motohiro

[permalink] [raw]
Subject: Re: [PATCH 1/3] Reintroduce zone_reclaim_interval for when zone_reclaim() scans and fails to avoid CPU spinning at 100% on NUMA

Hi

> On Mon, 8 Jun 2009 16:11:51 +0100 Mel Gorman <[email protected]> wrote:
>
> > On Mon, Jun 08, 2009 at 10:55:55AM -0400, Christoph Lameter wrote:
> > > On Mon, 8 Jun 2009, Mel Gorman wrote:
> > >
> > > > > The tmpfs pages are unreclaimable and therefore should not be on the anon
> > > > > lru.
> > > > >
> > > >
> > > > tmpfs pages can be swap-backed so can be reclaimable. Regardless of what
> > > > list they are on, we still need to know how many of them there are if
> > > > this patch is to be avoided.
> > >
> > > If they are reclaimable then why does it matter? They can be pushed out if
> > > you configure zone reclaim to be that aggressive.
> > >
> >
> > Because they are reclaimable by kswapd or normal direct reclaim but *not*
> > reclaimable by zone_reclaim() if the zone_reclaim_mode is not configured
> > appropriately.
>
> Ah. (zone_reclaim_mode & RECLAIM_SWAP) == 0. That was important info.
>
> Couldn't the lack of RECLAIM_WRITE cause a similar problem?

Old kernels could get into that state easily, but we now have a proper
dirty page limit. Thus not all pages can become dirty, and zone reclaim
can still find clean pages.

On the other hand, plenty of tmpfs pages can still be made easily.


2009-06-10 09:54:19

by Mel Gorman

[permalink] [raw]
Subject: Re: [PATCH 1/3] Reintroduce zone_reclaim_interval for when zone_reclaim() scans and fails to avoid CPU spinning at 100% on NUMA

On Wed, Jun 10, 2009 at 10:14:40AM +0800, Wu Fengguang wrote:
> On Tue, Jun 09, 2009 at 11:06:19PM +0800, Mel Gorman wrote:
> > On Tue, Jun 09, 2009 at 09:38:04PM +0800, Wu Fengguang wrote:
> > > On Tue, Jun 09, 2009 at 05:40:50PM +0800, Mel Gorman wrote:
> > > >
> > > > Conceivably though, zone_reclaim_interval could automatically tune
> > > > itself based on a heuristic like this if the administrator does not give
> > > > a specific value. I think that would be an interesting follow on once
> > > > we've brought back zone_reclaim_interval and get a feeling for how often
> > > > it is actually used.
> > >
> > > Well I don't think that's good practice. There are heuristic
> > > calculations all over the kernel. Shall we export parameters to
> > > user space just because we are not absolutely sure? Or shall we ship
> > > the heuristics, adjust based on feedback, and only export parameters
> > > when we find _known cases_ that cannot be covered by pure heuristics?
> > >
> >
> > Good question - I don't have a satisfactory answer but I intuitively find
> > the zone_reclaim_interval easier to deal with than the heuristic. That said,
> > I would prefer if neither was required.
>
> Yes - can we rely on the (improved) accounting to make our "failure feedback"
> patches unnecessary? :)
>

Am awaiting test results to answer that question :)

> Thanks,
> Fengguang
>
> > In the patchset, I've added a counter for the number of times that the
> > scan-avoidance heuristic fails. If the tmpfs problem has been resolved
> > (patch with bug reporter, am awaiting test), I'll drop zone_reclaim_interval
> > altogether and we'll use the counter to detect if/when this situation
> > occurs again.
>

--
Mel Gorman
Part-time Phd Student Linux Technology Center
University of Limerick IBM Dublin Software Lab

2009-06-10 10:00:49

by Mel Gorman

[permalink] [raw]
Subject: Re: [PATCH 1/3] Reintroduce zone_reclaim_interval for when zone_reclaim() scans and fails to avoid CPU spinning at 100% on NUMA

On Tue, Jun 09, 2009 at 10:23:01PM -0700, Andrew Morton wrote:
> On Mon, 8 Jun 2009 16:11:51 +0100 Mel Gorman <[email protected]> wrote:
>
> > On Mon, Jun 08, 2009 at 10:55:55AM -0400, Christoph Lameter wrote:
> > > On Mon, 8 Jun 2009, Mel Gorman wrote:
> > >
> > > > > The tmpfs pages are unreclaimable and therefore should not be on the anon
> > > > > lru.
> > > > >
> > > >
> > > > tmpfs pages can be swap-backed so can be reclaimable. Regardless of what
> > > > list they are on, we still need to know how many of them there are if
> > > > this patch is to be avoided.
> > >
> > > If they are reclaimable then why does it matter? They can be pushed out if
> > > you configure zone reclaim to be that aggressive.
> > >
> >
> > Because they are reclaimable by kswapd or normal direct reclaim but *not*
> > reclaimable by zone_reclaim() if the zone_reclaim_mode is not configured
> > appropriately.
>
> Ah. (zone_reclaim_mode & RECLAIM_SWAP) == 0. That was important info.
>

Yes, zone_reclaim() is a different beast to kswapd or traditional direct
reclaim.

> Couldn't the lack of RECLAIM_WRITE cause a similar problem?
>

Potentially, yes.
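
To spell out the filtering the mode bits imply, an illustrative helper
(not code from the patchset; RECLAIM_WRITE and RECLAIM_SWAP are the
existing flags in mm/vmscan.c, the helper itself is made up):

	static bool zone_reclaim_page_candidate(struct page *page)
	{
		/* without RECLAIM_WRITE, dirty file-backed pages cannot
		 * be written back, so they are not candidates */
		if (PageDirty(page) && !(zone_reclaim_mode & RECLAIM_WRITE))
			return false;

		/* without RECLAIM_SWAP, mapped or swap-backed pages
		 * (tmpfs included) cannot be pushed out to swap */
		if ((page_mapped(page) || PageSwapBacked(page)) &&
		    !(zone_reclaim_mode & RECLAIM_SWAP))
			return false;

		return true;
	}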

> > I briefly considered setting zone_reclaim_mode to 7 instead of
> > 1 by default for large NUMA distances but that has other serious consequences
> > such as paging in preference to going off-node as a default out-of-box
> > behaviour.
>
> Maybe we should consider that a bit harder. At what stage does
> zone_reclaim decide to give up and try a different node? Perhaps it's
> presently too reluctant to do that?
>

It decides to give up if it can't reclaim a number of pages
(SWAP_CLUSTER_MAX usually) with the current reclaim_mode. In practice,
that means it will go off-node if there are not enough clean unmapped
pages on the LRU list for that node.

That is a relatively short delay. If the request had to clean filesystem-backed
pages or unmap+swap pages, the cost would likely exceed the sum of all
remote-node accesses for that page.
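
The fallback itself is roughly the following shape (an illustrative
sketch, not the real get_page_from_freelist(); alloc_from() is a
made-up stand-in):

	for_each_zone_zonelist(zone, z, zonelist, high_zoneidx) {
		if (zone_watermark_ok(zone, order, mark, classzone_idx, 0))
			return alloc_from(zone);

		/*
		 * Watermark missed: try local reclaim once.  A return of
		 * 0 means zone_reclaim() could not free SWAP_CLUSTER_MAX
		 * pages under the current reclaim_mode, so move on to the
		 * next -- possibly remote -- zone rather than retrying.
		 */
		if (zone_reclaim(zone, gfp_mask, order) == 0)
			continue;

		if (zone_watermark_ok(zone, order, mark, classzone_idx, 0))
			return alloc_from(zone);
	}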

I think that, in principle, the zone_reclaim_mode default of 1 is sensible
and the biggest thing this patchset needs to get right is the scan-avoidance
heuristic.

> > The point of the patch is that the heuristics that avoid the scan are not
> > perfect. In the event they are wrong and a useless scan occurs, the response
> > of the kernel after a useless scan should not be to uselessly scan a load
> > more times around the LRU lists making no progress.
>
> It would be sad to bring back a jiffies-based thing into page reclaim.
> Wall time has little correlation with the rate of page allocation and
> reclaim activity.
>

Agreed. If it turns out a patch like this is needed, I'm going to build
on Wu's suggestion of auto-selecting the zone_reclaim_interval based on
scan frequency and how long the scan takes. I'm still hoping that
neither is necessary because we'll be able to guess the number of tmpfs
pages in advance.
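
In other words, something along the lines of the sketch below, which is
just the auto-tuning quoted earlier in the thread restated (all names
assumed from that code, nothing final):

	t = jiffies;
	ret = __zone_reclaim(zone, gfp_mask, order);
	if (ret == 0) {
		/* a fruitless scan that took (jiffies - t) ticks holds
		 * off the next scan for 100 times as long, capped at
		 * 10000 * HZ */
		zone->zone_reclaim_relax = jiffies +
			min_t(unsigned long, 10000 * HZ, 100 * (jiffies - t));
	}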

--
Mel Gorman
Part-time Phd Student Linux Technology Center
University of Limerick IBM Dublin Software Lab