2010-08-31 17:37:41

by Mel Gorman

Subject: [PATCH 0/3] Reduce watermark-related problems with the per-cpu allocator V3

Changelog since V2
o Minor clarifications
o Rebase to 2.6.36-rc3

Changelog since V1
o Fix for !CONFIG_SMP
o Correct spelling mistakes
o Clarify a ChangeLog
o Only check for counter drift on machines large enough for the counter
  drift to breach the min watermark while NR_FREE_PAGES reports that the
  low watermark is fine

Internal IBM test teams beta testing distribution kernels have reported
problems on machines with a large number of CPUs whereby page allocator
failure messages show huge differences between the nr_free_pages vmstat
counter and what is actually available on the buddy lists. In an extreme
example, nr_free_pages was above the min watermark while zero pages were on
the buddy lists, leaving the system liable to livelock, unable to make
forward progress until an allocation succeeded. There is no reason why the
problems would not affect mainline, so the following series mitigates the
problems in the page allocator related to per-cpu counter drift and lists.

The first patch ensures that counters are updated after pages are added to
free lists.
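
As an illustration, a batch free of, say, 64 pages currently bumps
NR_FREE_PAGES by 64 before any of those pages is actually on a buddy list;
a watermark check racing with that window sees free pages that cannot yet
be allocated.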

The second patch notes that the counter drift between nr_free_pages and what
is actually on the per-cpu lists can be very high. When memory is low and
kswapd is awake, the per-cpu deltas are read in addition to the value of
NR_FREE_PAGES. This slows the page allocator while memory is low and kswapd
is awake, but it becomes much harder to breach the min watermark and
potentially livelock the system.
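
To give a feel for the scale of the drift (the numbers are purely
illustrative): with 64 online CPUs and a per-cpu stat_threshold of 125
pages, the deltas can hide up to 64 * 125 = 8000 pages, roughly 31MB with
4K pages, which can easily exceed the gap between the low and min
watermarks on a smaller zone.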

The third patch notes that after direct-reclaim an allocation can
fail because the necessary pages are on the per-cpu lists. After a
direct-reclaim-and-allocation-failure, the per-cpu lists are drained and
a second attempt is made.

Performance tests against 2.6.36-rc1 did not turn up anything interesting. A
version of this series that continually called vmstat_update() when memory
was low was tested internally and found to help the counter drift problem. I
described this at the LSF/MM Summit and the potential for IPI storms was
frowned upon. An alternative fix is in patch two, which uses
for_each_online_cpu() to read the vmstat deltas while memory is low and
kswapd is awake. This should be functionally similar.

Christoph Lameter made two suggestions that I did not act on. The first
was to create a generic helper that could be used to get a semi-accurate
reading of any vmstat counter. However, there is no evidence this is
necessary, and it would be better to get a clear understanding of which
counters other than NR_FREE_PAGES need special treatment by making it
obvious when such a helper is introduced. The second suggestion was to
shrink the threshold at which vmstat deltas are folded back, which would
affect all counters. It was also unclear whether this was sufficient or
necessary as, again, only NR_FREE_PAGES is the problem counter, so why
affect every other counter? Also, shrinking the threshold just shrinks the
window in which the race can occur. Hence, I'm reposting the series as-is
to see if there are any remaining objections to deal with or whether we can
close up this problem now.

This series should be merged after the patch "vmstat: update
zone stat threshold at onlining a cpu" which is in mmotm as
vmstat-update-zone-stat-threshold-when-onlining-a-cpu.patch. If we can
agree on it, it's a stable candidate.

 include/linux/mmzone.h |   13 +++++++++++++
 mm/mmzone.c            |   29 +++++++++++++++++++++++++++++
 mm/page_alloc.c        |   29 +++++++++++++++++++++--------
 mm/vmstat.c            |   15 ++++++++++++++-
 4 files changed, 77 insertions(+), 9 deletions(-)


2010-08-31 17:37:42

by Mel Gorman

Subject: [PATCH 1/3] mm: page allocator: Update free page counters after pages are placed on the free list

When allocating a page, the system uses NR_FREE_PAGES counters to determine
if watermarks would remain intact after the allocation was made. This
check is made without interrupts disabled or the zone lock held and so is
race-prone by nature. Unfortunately, when pages are being freed in batch,
the counters are updated before the pages are added to the list. During this
window, the counters are misleading as the pages do not exist yet. When
under significant pressure on systems with large numbers of CPUs, it's
possible for processes to make progress even though they should have been
stalled. This is particularly problematic if a number of the processes are
using GFP_ATOMIC as the min watermark can be accidentally breached and in
extreme cases, the system can livelock.

This patch updates the counters after the pages have been added to the
list. This makes the allocator more cautious with respect to preserving
the watermarks and mitigates livelock possibilities.

Signed-off-by: Mel Gorman <[email protected]>
Reviewed-by: Rik van Riel <[email protected]>
Reviewed-by: Minchan Kim <[email protected]>
Reviewed-by: KAMEZAWA Hiroyuki <[email protected]>
Acked-by: Johannes Weiner <[email protected]>
---
mm/page_alloc.c | 5 +++--
1 files changed, 3 insertions(+), 2 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index a9649f4..97d74a0 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -588,12 +588,12 @@ static void free_pcppages_bulk(struct zone *zone, int count,
 {
 	int migratetype = 0;
 	int batch_free = 0;
+	int freed = count;

 	spin_lock(&zone->lock);
 	zone->all_unreclaimable = 0;
 	zone->pages_scanned = 0;

-	__mod_zone_page_state(zone, NR_FREE_PAGES, count);
 	while (count) {
 		struct page *page;
 		struct list_head *list;
@@ -621,6 +621,7 @@ static void free_pcppages_bulk(struct zone *zone, int count,
 			trace_mm_page_pcpu_drain(page, 0, page_private(page));
 		} while (--count && --batch_free && !list_empty(list));
 	}
+	__mod_zone_page_state(zone, NR_FREE_PAGES, freed);
 	spin_unlock(&zone->lock);
 }

@@ -631,8 +632,8 @@ static void free_one_page(struct zone *zone, struct page *page, int order,
 	zone->all_unreclaimable = 0;
 	zone->pages_scanned = 0;

-	__mod_zone_page_state(zone, NR_FREE_PAGES, 1 << order);
 	__free_one_page(page, zone, order, migratetype);
+	__mod_zone_page_state(zone, NR_FREE_PAGES, 1 << order);
 	spin_unlock(&zone->lock);
 }

--
1.7.1

2010-08-31 17:37:57

by Mel Gorman

Subject: [PATCH 3/3] mm: page allocator: Drain per-cpu lists after direct reclaim allocation fails

When under significant memory pressure, a process enters direct reclaim
and immediately afterwards tries to allocate a page. If it fails and no
further progress is made, it's possible the system will go OOM. However,
on systems with large amounts of memory, it's possible that a significant
number of pages are on per-cpu lists and inaccessible to the calling
process. This leads to a process entering direct reclaim more often than
it should, increasing the pressure on the system and compounding the problem.

This patch notes that if direct reclaim is making progress but
allocations are still failing, the system is already under heavy
pressure. In this case, it drains the per-cpu lists and tries the
allocation a second time before continuing.

Signed-off-by: Mel Gorman <[email protected]>
Reviewed-by: Minchan Kim <[email protected]>
Reviewed-by: KAMEZAWA Hiroyuki <[email protected]>
Reviewed-by: KOSAKI Motohiro <[email protected]>
---
mm/page_alloc.c | 20 ++++++++++++++++----
1 files changed, 16 insertions(+), 4 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index bbaa959..750e1dc 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1847,6 +1847,7 @@ __alloc_pages_direct_reclaim(gfp_t gfp_mask, unsigned int order,
 	struct page *page = NULL;
 	struct reclaim_state reclaim_state;
 	struct task_struct *p = current;
+	bool drained = false;

 	cond_resched();

@@ -1865,14 +1866,25 @@ __alloc_pages_direct_reclaim(gfp_t gfp_mask, unsigned int order,

 	cond_resched();

-	if (order != 0)
-		drain_all_pages();
+	if (unlikely(!(*did_some_progress)))
+		return NULL;

-	if (likely(*did_some_progress))
-		page = get_page_from_freelist(gfp_mask, nodemask, order,
+retry:
+	page = get_page_from_freelist(gfp_mask, nodemask, order,
 					zonelist, high_zoneidx,
 					alloc_flags, preferred_zone,
 					migratetype);
+
+	/*
+	 * If an allocation failed after direct reclaim, it could be because
+	 * pages are pinned on the per-cpu lists. Drain them and try again
+	 */
+	if (!page && !drained) {
+		drain_all_pages();
+		drained = true;
+		goto retry;
+	}
+
 	return page;
 }

--
1.7.1

2010-08-31 17:37:58

by Mel Gorman

Subject: [PATCH 2/3] mm: page allocator: Calculate a better estimate of NR_FREE_PAGES when memory is low and kswapd is awake

Ordinarily watermark checks are based on the vmstat NR_FREE_PAGES as
it is cheaper than scanning a number of lists. To avoid synchronization
overhead, counter deltas are maintained on a per-cpu basis and drained both
periodically and when the delta is above a threshold. On large-CPU systems,
the difference between the estimated and real value of NR_FREE_PAGES can be
very high. If NR_FREE_PAGES is much higher than the number of free pages
really on the buddy lists, the VM can allocate pages below the min watermark,
at worst reducing the real number of free pages to zero. Even if the OOM
killer kills some victim to free memory, it may not free memory if the exit
path requires a new page, resulting in livelock.

This patch introduces zone_nr_free_pages() to provide a slightly more
accurate estimate of NR_FREE_PAGES while kswapd is awake. The estimate is
not perfect and may result in cache line bounces, but it is expected to be
lighter than the IPI calls that would be necessary to continually drain the
per-cpu counters while kswapd is awake.
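
As a worked example (the numbers are assumptions for illustration only): on
a zone with min=1000, low=1250 and high=1500 pages, 16 online CPUs and a
per-cpu stat_threshold of 32, the maximum drift is 16 * 32 = 512 pages. As
512 exceeds low - min = 250, percpu_drift_mark is set to high + 512 = 2012
pages, and the per-cpu deltas are only summed while the NR_FREE_PAGES
estimate is below that mark and kswapd is awake.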

Signed-off-by: Mel Gorman <[email protected]>
Reviewed-by: Minchan Kim <[email protected]>
Reviewed-by: KAMEZAWA Hiroyuki <[email protected]>
Acked-by: Johannes Weiner <[email protected]>
---
 include/linux/mmzone.h |   13 +++++++++++++
 mm/mmzone.c            |   29 +++++++++++++++++++++++++++++
 mm/page_alloc.c        |    4 ++--
 mm/vmstat.c            |   15 ++++++++++++++-
 4 files changed, 58 insertions(+), 3 deletions(-)

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 6e6e626..3984c4e 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -284,6 +284,13 @@ struct zone {
 	unsigned long watermark[NR_WMARK];

 	/*
+	 * When free pages are below this point, additional steps are taken
+	 * when reading the number of free pages to avoid per-cpu counter
+	 * drift allowing watermarks to be breached
+	 */
+	unsigned long percpu_drift_mark;
+
+	/*
 	 * We don't know if the memory that we're going to allocate will be freeable
 	 * or/and it will be released eventually, so to avoid totally wasting several
 	 * GB of ram we must reserve some of the lower zone memory (otherwise we risk
@@ -441,6 +448,12 @@ static inline int zone_is_oom_locked(const struct zone *zone)
 	return test_bit(ZONE_OOM_LOCKED, &zone->flags);
 }

+#ifdef CONFIG_SMP
+unsigned long zone_nr_free_pages(struct zone *zone);
+#else
+#define zone_nr_free_pages(zone) zone_page_state(zone, NR_FREE_PAGES)
+#endif /* CONFIG_SMP */
+
 /*
  * The "priority" of VM scanning is how much of the queues we will scan in one
  * go. A value of 12 for DEF_PRIORITY implies that we will scan 1/4096th of the
diff --git a/mm/mmzone.c b/mm/mmzone.c
index f5b7d17..69ecbe9 100644
--- a/mm/mmzone.c
+++ b/mm/mmzone.c
@@ -87,3 +87,32 @@ int memmap_valid_within(unsigned long pfn,
 	return 1;
 }
 #endif /* CONFIG_ARCH_HAS_HOLES_MEMORYMODEL */
+
+#ifdef CONFIG_SMP
+/* Called when a more accurate view of NR_FREE_PAGES is needed */
+unsigned long zone_nr_free_pages(struct zone *zone)
+{
+	unsigned long nr_free_pages = zone_page_state(zone, NR_FREE_PAGES);
+
+	/*
+	 * While kswapd is awake, it is considered the zone is under some
+	 * memory pressure. Under pressure, there is a risk that
+	 * per-cpu-counter-drift will allow the min watermark to be breached
+	 * potentially causing a live-lock. While kswapd is awake and
+	 * free pages are low, get a better estimate for free pages
+	 */
+	if (nr_free_pages < zone->percpu_drift_mark &&
+			!waitqueue_active(&zone->zone_pgdat->kswapd_wait)) {
+		int cpu;
+
+		for_each_online_cpu(cpu) {
+			struct per_cpu_pageset *pset;
+
+			pset = per_cpu_ptr(zone->pageset, cpu);
+			nr_free_pages += pset->vm_stat_diff[NR_FREE_PAGES];
+		}
+	}
+
+	return nr_free_pages;
+}
+#endif /* CONFIG_SMP */
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 97d74a0..bbaa959 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1462,7 +1462,7 @@ int zone_watermark_ok(struct zone *z, int order, unsigned long mark,
 {
 	/* free_pages my go negative - that's OK */
 	long min = mark;
-	long free_pages = zone_page_state(z, NR_FREE_PAGES) - (1 << order) + 1;
+	long free_pages = zone_nr_free_pages(z) - (1 << order) + 1;
 	int o;

 	if (alloc_flags & ALLOC_HIGH)
@@ -2424,7 +2424,7 @@ void show_free_areas(void)
 			" all_unreclaimable? %s"
 			"\n",
 			zone->name,
-			K(zone_page_state(zone, NR_FREE_PAGES)),
+			K(zone_nr_free_pages(zone)),
 			K(min_wmark_pages(zone)),
 			K(low_wmark_pages(zone)),
 			K(high_wmark_pages(zone)),
diff --git a/mm/vmstat.c b/mm/vmstat.c
index f389168..696cab2 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -138,11 +138,24 @@ static void refresh_zone_stat_thresholds(void)
 	int threshold;

 	for_each_populated_zone(zone) {
+		unsigned long max_drift, tolerate_drift;
+
 		threshold = calculate_threshold(zone);

 		for_each_online_cpu(cpu)
 			per_cpu_ptr(zone->pageset, cpu)->stat_threshold
							= threshold;
+
+		/*
+		 * Only set percpu_drift_mark if there is a danger that
+		 * NR_FREE_PAGES reports the low watermark is ok when in fact
+		 * the min watermark could be breached by an allocation
+		 */
+		tolerate_drift = low_wmark_pages(zone) - min_wmark_pages(zone);
+		max_drift = num_online_cpus() * threshold;
+		if (max_drift > tolerate_drift)
+			zone->percpu_drift_mark = high_wmark_pages(zone) +
+					max_drift;
 	}
 }

@@ -813,7 +826,7 @@ static void zoneinfo_show_print(struct seq_file *m, pg_data_t *pgdat,
 		   "\n scanned %lu"
 		   "\n spanned %lu"
 		   "\n present %lu",
-		   zone_page_state(zone, NR_FREE_PAGES),
+		   zone_nr_free_pages(zone),
 		   min_wmark_pages(zone),
 		   low_wmark_pages(zone),
 		   high_wmark_pages(zone),
--
1.7.1

2010-08-31 23:27:09

by KOSAKI Motohiro

Subject: Re: [PATCH 1/3] mm: page allocator: Update free page counters after pages are placed on the free list

> When allocating a page, the system uses NR_FREE_PAGES counters to determine
> if watermarks would remain intact after the allocation was made. This
> check is made without interrupts disabled or the zone lock held and so is
> race-prone by nature. Unfortunately, when pages are being freed in batch,
> the counters are updated before the pages are added to the list. During this
> window, the counters are misleading as the pages do not exist yet. When
> under significant pressure on systems with large numbers of CPUs, it's
> possible for processes to make progress even though they should have been
> stalled. This is particularly problematic if a number of the processes are
> using GFP_ATOMIC as the min watermark can be accidentally breached and in
> extreme cases, the system can livelock.
>
> This patch updates the counters after the pages have been added to the
> list. This makes the allocator more cautious with respect to preserving
> the watermarks and mitigates livelock possibilities.
>
> Signed-off-by: Mel Gorman <[email protected]>
> Reviewed-by: Rik van Riel <[email protected]>
> Reviewed-by: Minchan Kim <[email protected]>
> Reviewed-by: KAMEZAWA Hiroyuki <[email protected]>
> Acked-by: Johannes Weiner <[email protected]>

Reviewed-by: KOSAKI Motohiro <[email protected]>



2010-08-31 23:37:45

by KOSAKI Motohiro

Subject: Re: [PATCH 2/3] mm: page allocator: Calculate a better estimate of NR_FREE_PAGES when memory is low and kswapd is awake

> +#ifdef CONFIG_SMP
> +/* Called when a more accurate view of NR_FREE_PAGES is needed */
> +unsigned long zone_nr_free_pages(struct zone *zone)
> +{
> +	unsigned long nr_free_pages = zone_page_state(zone, NR_FREE_PAGES);
> +
> +	/*
> +	 * While kswapd is awake, it is considered the zone is under some
> +	 * memory pressure. Under pressure, there is a risk that
> +	 * per-cpu-counter-drift will allow the min watermark to be breached
> +	 * potentially causing a live-lock. While kswapd is awake and
> +	 * free pages are low, get a better estimate for free pages
> +	 */
> +	if (nr_free_pages < zone->percpu_drift_mark &&
> +			!waitqueue_active(&zone->zone_pgdat->kswapd_wait)) {
> +		int cpu;
> +
> +		for_each_online_cpu(cpu) {
> +			struct per_cpu_pageset *pset;
> +
> +			pset = per_cpu_ptr(zone->pageset, cpu);
> +			nr_free_pages += pset->vm_stat_diff[NR_FREE_PAGES];

If my understanding is correct, we have no lock when reading pset->vm_stat_diff.
That means nr_free_pages can reach a negative value in a very rare race. Is a
boundary check necessary?
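
E.g. something like the following, only as an untested sketch of the idea
(accumulate in a signed local and clamp at zero before returning):

unsigned long zone_nr_free_pages(struct zone *zone)
{
	/* use a signed local so per-cpu deltas can take it below zero */
	long nr_free_pages = zone_page_state(zone, NR_FREE_PAGES);

	if (nr_free_pages < zone->percpu_drift_mark &&
			!waitqueue_active(&zone->zone_pgdat->kswapd_wait)) {
		int cpu;

		for_each_online_cpu(cpu) {
			struct per_cpu_pageset *pset;

			pset = per_cpu_ptr(zone->pageset, cpu);
			nr_free_pages += pset->vm_stat_diff[NR_FREE_PAGES];
		}
	}

	/* clamp in case the unsynchronised deltas drove the sum negative */
	return nr_free_pages > 0 ? nr_free_pages : 0;
}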