2008-10-01 12:32:27

by Andy Whitcroft

Subject: [PATCH 0/4] Reclaim page capture v4

For some time we have been looking at mechanisms for improving the availability
of larger allocations under load. One of the options we have explored is
the capturing of pages freed under direct reclaim in order to increase the
chances of free pages coalescing before they are subject to reallocation
by racing allocators.

Following this email is a patch stack implementing page capture during
direct reclaim. It consists of four patches. The first two simply pull
out existing code into helpers for reuse. The third makes buddy's use
of struct page explicit. The fourth contains the meat of the changes,
and its leader contains a much fuller description of the feature.

This update represents a rebase to -mm and incorporates feedback from
KOSAKI Motohiro. It also incorporates an accounting fix which was
preventing some captures.

I have done a lot of comparative testing with and without this patch
set and, in broad brush, I am seeing improvements in hugepage allocation
success (at the worst case size) on all of my test systems. These tests consist
of placing a constant stream of high order allocations on the system,
at varying rates. The results for these various runs are then averaged
to give an overall improvement.

              Absolute    Effective
   x86-64       2.48%        4.58%
   powerpc      5.55%       25.22%

x86-64 has a relatively small huge page size and so is always much more
effective at allocating huge pages. Even there we get a measurable
improvement. On powerpc the huge pages are much larger and much harder
to recover. Here we see a full 25% increase in page recovery.

It should be noted that this is worst case testing, and very aggressive,
taking every possible page in the system. It would be helpful to get
wider testing in -mm.

Against: 2.6.27-rc1-mm1

Andrew, please consider for -mm.

-apw

Changes since V3:
- Incorporates an anon vma fix pointed out by MinChan Kim
- Switch to using a pagevec for page capture collection

Changes since V2:
- Incorporates review feedback from Christoph Lameter,
- Incorporates review feedback from Peter Zijlstra, and
- Checkpatch fixes.

Changes since V1:
- Incorporates review feedback from KOSAKI Motohiro,
- Fixes up accounting when checking watermarks for captured pages,
- Rebase to 2.6.27-rc1-mm1,
- Incorporates review feedback from Mel.


Andy Whitcroft (4):
pull out the page pre-release and sanity check logic for reuse
pull out zone cpuset and watermark checks for reuse
buddy: explicitly identify buddy field use in struct page
capture pages freed during direct reclaim for allocation by the
reclaimer

include/linux/mm_types.h | 4 +
include/linux/page-flags.h | 4 +
include/linux/pagevec.h | 1 +
mm/internal.h | 7 +-
mm/page_alloc.c | 265 ++++++++++++++++++++++++++++++++++++++------
mm/vmscan.c | 118 ++++++++++++++++----
6 files changed, 343 insertions(+), 56 deletions(-)


2008-10-01 12:31:18

by Andy Whitcroft

Subject: [PATCH 3/4] buddy: explicitly identify buddy field use in struct page

Explicitly define the struct page fields which buddy uses when it owns
pages. Define a new anonymous struct to allow additional fields to
be added in a later patch.
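
As a minimal illustration of the idiom relied on here (not kernel code),
the following compilable userspace sketch shows how an anonymous struct
inside a union gives the overlaid storage a readable, owner-specific name;
the toy_page type and its member names are invented for the example.

#include <stdio.h>

/*
 * Toy analogue of the struct page change: the union's storage is reused
 * depending on which subsystem currently owns the page, and the anonymous
 * struct gives the buddy-owned word a descriptive name.  Further
 * buddy-only fields can later be added inside the braces without
 * disturbing the other union members' names.
 */
struct toy_page {
    union {
        unsigned long private;          /* generic owner-defined word */
        void *slab;                     /* slab's view of the storage */
        struct {
            unsigned long buddy_order;  /* buddy: free page order */
        };
    };
};

int main(void)
{
    struct toy_page page;

    page.buddy_order = 3;               /* same storage as page.private */
    printf("buddy_order=%lu private=%lu size=%zu\n",
           page.buddy_order, page.private, sizeof(struct toy_page));
    return 0;
}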

Signed-off-by: Andy Whitcroft <[email protected]>
Acked-by: Peter Zijlstra <[email protected]>
Acked-by: KOSAKI Motohiro <[email protected]>
Reviewed-by: Rik van Riel <[email protected]>
Reviewed-by: Christoph Lameter <[email protected]>
---
include/linux/mm_types.h | 3 +++
mm/internal.h | 2 +-
mm/page_alloc.c | 4 ++--
3 files changed, 6 insertions(+), 3 deletions(-)

diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 995c588..906d8e0 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -70,6 +70,9 @@ struct page {
#endif
struct kmem_cache *slab; /* SLUB: Pointer to slab */
struct page *first_page; /* Compound tail pages */
+ struct {
+ unsigned long buddy_order; /* buddy: free page order */
+ };
};
union {
pgoff_t index; /* Our offset within mapping. */
diff --git a/mm/internal.h b/mm/internal.h
index c0e4859..fcedcd0 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -58,7 +58,7 @@ extern void __free_pages_bootmem(struct page *page, unsigned int order);
static inline unsigned long page_order(struct page *page)
{
VM_BUG_ON(!PageBuddy(page));
- return page_private(page);
+ return page->buddy_order;
}

extern int mlock_vma_pages_range(struct vm_area_struct *vma,
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 921c435..3a646e3 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -331,7 +331,7 @@ static inline void prep_zero_page(struct page *page, int order, gfp_t gfp_flags)

static inline void set_page_order(struct page *page, int order)
{
- set_page_private(page, order);
+ page->buddy_order = order;
__SetPageBuddy(page);
#ifdef CONFIG_PAGE_OWNER
page->order = -1;
@@ -341,7 +341,7 @@ static inline void set_page_order(struct page *page, int order)
static inline void rmv_page_order(struct page *page)
{
__ClearPageBuddy(page);
- set_page_private(page, 0);
+ page->buddy_order = 0;
}

/*
--
1.6.0.1.451.gc8d31

2008-10-01 12:31:31

by Andy Whitcroft

Subject: [PATCH 2/4] pull out zone cpuset and watermark checks for reuse

When allocating we need to confirm that the zone we are about to allocate
from is acceptable to the CPUSET we are in, and that it does not violate
the zone watermarks. Pull these checks out so we can reuse them in a
later patch.

Signed-off-by: Andy Whitcroft <[email protected]>
Acked-by: Peter Zijlstra <[email protected]>
Acked-by: KOSAKI Motohiro <[email protected]>
Reviewed-by: Rik van Riel <[email protected]>
---
mm/page_alloc.c | 62 ++++++++++++++++++++++++++++++++++++++----------------
1 files changed, 43 insertions(+), 19 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 55d8d9b..921c435 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1271,6 +1271,44 @@ int zone_watermark_ok(struct zone *z, int order, unsigned long mark,
return 1;
}

+/*
+ * Return 1 if this zone is an acceptable source given the cpuset
+ * constraints.
+ */
+static inline int zone_cpuset_permits(struct zone *zone,
+ int alloc_flags, gfp_t gfp_mask)
+{
+ if ((alloc_flags & ALLOC_CPUSET) &&
+ !cpuset_zone_allowed_softwall(zone, gfp_mask))
+ return 0;
+ return 1;
+}
+
+/*
+ * Return 1 if this zone is within the watermarks specified by the
+ * allocation flags.
+ */
+static inline int zone_watermark_permits(struct zone *zone, int order,
+ int classzone_idx, int alloc_flags, gfp_t gfp_mask)
+{
+ if (!(alloc_flags & ALLOC_NO_WATERMARKS)) {
+ unsigned long mark;
+ if (alloc_flags & ALLOC_WMARK_MIN)
+ mark = zone->pages_min;
+ else if (alloc_flags & ALLOC_WMARK_LOW)
+ mark = zone->pages_low;
+ else
+ mark = zone->pages_high;
+ if (!zone_watermark_ok(zone, order, mark,
+ classzone_idx, alloc_flags)) {
+ if (!zone_reclaim_mode ||
+ !zone_reclaim(zone, gfp_mask, order))
+ return 0;
+ }
+ }
+ return 1;
+}
+
#ifdef CONFIG_NUMA
/*
* zlc_setup - Setup for "zonelist cache". Uses cached zone data to
@@ -1424,25 +1462,11 @@ zonelist_scan:
if (NUMA_BUILD && zlc_active &&
!zlc_zone_worth_trying(zonelist, z, allowednodes))
continue;
- if ((alloc_flags & ALLOC_CPUSET) &&
- !cpuset_zone_allowed_softwall(zone, gfp_mask))
- goto try_next_zone;
-
- if (!(alloc_flags & ALLOC_NO_WATERMARKS)) {
- unsigned long mark;
- if (alloc_flags & ALLOC_WMARK_MIN)
- mark = zone->pages_min;
- else if (alloc_flags & ALLOC_WMARK_LOW)
- mark = zone->pages_low;
- else
- mark = zone->pages_high;
- if (!zone_watermark_ok(zone, order, mark,
- classzone_idx, alloc_flags)) {
- if (!zone_reclaim_mode ||
- !zone_reclaim(zone, gfp_mask, order))
- goto this_zone_full;
- }
- }
+ if (!zone_cpuset_permits(zone, alloc_flags, gfp_mask))
+ goto try_next_zone;
+ if (!zone_watermark_permits(zone, order, classzone_idx,
+ alloc_flags, gfp_mask))
+ goto this_zone_full;

page = buffered_rmqueue(preferred_zone, zone, order, gfp_mask);
if (page)
--
1.6.0.1.451.gc8d31

2008-10-01 12:31:47

by Andy Whitcroft

Subject: [PATCH 1/4] pull out the page pre-release and sanity check logic for reuse

When we are about to release a page we perform a number of actions
on that page. We clear down any anonymous mappings, confirm that
the page is safe to release, check that no held locks live in the
area being freed, and unmap the page should that be required. Pull
this processing out into a helper function for reuse in a later patch.

Note that we do not convert the similar cleardown in free_hot_cold_page()
as the optimiser is unable to squash the loops when this helper is inlined.

Signed-off-by: Andy Whitcroft <[email protected]>
Acked-by: Peter Zijlstra <[email protected]>
Acked-by: KOSAKI Motohiro <[email protected]>
Reviewed-by: Rik van Riel <[email protected]>
---
mm/page_alloc.c | 40 +++++++++++++++++++++++++++-------------
1 files changed, 27 insertions(+), 13 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index f52fcf1..55d8d9b 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -489,6 +489,32 @@ static inline int free_pages_check(struct page *page)
}

/*
+ * Prepare this page for release to the buddy. Sanity check the page.
+ * Returns 1 if the page is safe to free.
+ */
+static inline int free_page_prepare(struct page *page, int order)
+{
+ int i;
+ int reserved = 0;
+
+ for (i = 0 ; i < (1 << order) ; ++i)
+ reserved += free_pages_check(page + i);
+ if (reserved)
+ return 0;
+
+ if (!PageHighMem(page)) {
+ debug_check_no_locks_freed(page_address(page),
+ PAGE_SIZE << order);
+ debug_check_no_obj_freed(page_address(page),
+ PAGE_SIZE << order);
+ }
+ arch_free_page(page, order);
+ kernel_map_pages(page, 1 << order, 0);
+
+ return 1;
+}
+
+/*
* Frees a list of pages.
* Assumes all pages on list are in same zone, and of same order.
* count is the number of pages to free.
@@ -529,22 +555,10 @@ static void free_one_page(struct zone *zone, struct page *page, int order)
static void __free_pages_ok(struct page *page, unsigned int order)
{
unsigned long flags;
- int i;
- int reserved = 0;

- for (i = 0 ; i < (1 << order) ; ++i)
- reserved += free_pages_check(page + i);
- if (reserved)
+ if (!free_page_prepare(page, order))
return;

- if (!PageHighMem(page)) {
- debug_check_no_locks_freed(page_address(page),PAGE_SIZE<<order);
- debug_check_no_obj_freed(page_address(page),
- PAGE_SIZE << order);
- }
- arch_free_page(page, order);
- kernel_map_pages(page, 1 << order, 0);
-
local_irq_save(flags);
__count_vm_events(PGFREE, 1 << order);
free_one_page(page_zone(page), page, order);
--
1.6.0.1.451.gc8d31

2008-10-01 12:32:07

by Andy Whitcroft

Subject: [PATCH 4/4] capture pages freed during direct reclaim for allocation by the reclaimer

When a process enters direct reclaim it will expend effort identifying
and releasing pages in the hope of obtaining a page. However as these
pages are released asynchronously there is every possibility that the
pages will have been consumed by other allocators before the reclaimer
gets a look in. This is particularly problematic where the reclaimer is
attempting to allocate a higher order page. It is highly likely that
a parallel allocation will consume lower order constituent pages as we
release them, preventing them from coalescing into the higher order page the
reclaimer desires.

This patch set attempts to address this for allocations above
PAGE_ALLOC_COSTLY_ORDER by temporarily collecting the pages we are releasing
onto a local free list. Instead of freeing them to the main buddy lists,
pages are collected and coalesced on this per direct reclaimer free list.
Pages which are freed by other processes are also considered; where they
coalesce with a page already under capture they will be moved to the
capture list. When pressure has been applied to a zone we then consult
the capture list and if there is an appropriately sized page available
it is taken immediately and the remainder returned to the free pool.
Capture is only enabled when the reclaimer's allocation order exceeds
PAGE_ALLOC_COSTLY_ORDER as free pages below this order should naturally occur
in large numbers following regular reclaim.
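
The flow described above can be sketched in plain userspace C: freed
blocks coalesce onto a private capture list rather than the global buddy
lists, and at the end the reclaimer keeps the first captured block large
enough for its request while everything else is released. All names below
are invented for the illustration; zone locking, watermark checks,
accounting and migratetype handling are deliberately omitted.

#include <stdio.h>
#include <stdlib.h>

/* A captured free block: start pfn and buddy order (covers 1 << order pages). */
struct block {
    unsigned long pfn;
    int order;
    struct block *next;
};

static struct block *capture_list;      /* per-reclaimer private free list */

/*
 * Free one order-0 page onto the capture list, merging it with any buddy
 * that is already captured (roughly the shape of __capture_one_page(),
 * minus locking, zone accounting and page flag handling).
 */
static void capture_free(unsigned long pfn)
{
    int order = 0;

    for (;;) {
        unsigned long buddy = pfn ^ (1UL << order);
        struct block **p = &capture_list;

        while (*p && !((*p)->pfn == buddy && (*p)->order == order))
            p = &(*p)->next;
        if (!*p)
            break;                      /* buddy not captured: stop merging */

        struct block *b = *p;
        *p = b->next;                   /* unlink the captured buddy ... */
        free(b);
        pfn &= ~(1UL << order);         /* ... and form the combined block */
        order++;
    }

    struct block *b = malloc(sizeof(*b));
    b->pfn = pfn;
    b->order = order;
    b->next = capture_list;
    capture_list = b;
}

/*
 * Take the first captured block of at least the requested order and release
 * the rest (the rough shape of capture_alloc_or_return(), without splitting
 * an oversized block back down to the requested order).
 */
static struct block *capture_take(int order)
{
    struct block *found = NULL;

    while (capture_list) {
        struct block *b = capture_list;
        capture_list = b->next;
        if (!found && b->order >= order)
            found = b;                  /* keep this one for the reclaimer */
        else
            free(b);                    /* would go back to the buddy lists */
    }
    return found;
}

int main(void)
{
    /* Reclaim "frees" pages 8..15; they coalesce into a single order-3 block. */
    for (unsigned long pfn = 8; pfn < 16; pfn++)
        capture_free(pfn);

    struct block *b = capture_take(2);
    if (b)
        printf("captured block: pfn %lu, order %d\n", b->pfn, b->order);
    free(b);
    return 0;
}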

Thanks go to Mel Gorman for numerous discussions during the development
of this patch and for his repeated reviews.

Signed-off-by: Andy Whitcroft <[email protected]>
Acked-by: Peter Zijlstra <[email protected]>
Acked-by: KOSAKI Motohiro <[email protected]>
---
include/linux/mm_types.h | 1 +
include/linux/page-flags.h | 4 +
include/linux/pagevec.h | 1 +
mm/internal.h | 5 ++
mm/page_alloc.c | 159 +++++++++++++++++++++++++++++++++++++++++++-
mm/vmscan.c | 118 +++++++++++++++++++++++++++------
6 files changed, 267 insertions(+), 21 deletions(-)

diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 906d8e0..cd2b549 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -72,6 +72,7 @@ struct page {
struct page *first_page; /* Compound tail pages */
struct {
unsigned long buddy_order; /* buddy: free page order */
+ struct list_head *buddy_free; /* buddy: free list pointer */
};
};
union {
diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
index c0ac9e0..016e604 100644
--- a/include/linux/page-flags.h
+++ b/include/linux/page-flags.h
@@ -117,6 +117,9 @@ enum pageflags {
/* SLUB */
PG_slub_frozen = PG_active,
PG_slub_debug = PG_error,
+
+ /* BUDDY overlays. */
+ PG_buddy_capture = PG_owner_priv_1,
};

#ifndef __GENERATING_BOUNDS_H
@@ -208,6 +211,7 @@ __PAGEFLAG(SlubDebug, slub_debug)
*/
TESTPAGEFLAG(Writeback, writeback) TESTSCFLAG(Writeback, writeback)
__PAGEFLAG(Buddy, buddy)
+__PAGEFLAG(BuddyCapture, buddy_capture) /* A buddy page, but reserved. */
PAGEFLAG(MappedToDisk, mappedtodisk)

/* PG_readahead is only used for file reads; PG_reclaim is only for writes */
diff --git a/include/linux/pagevec.h b/include/linux/pagevec.h
index e90a2cb..49eb9ae 100644
--- a/include/linux/pagevec.h
+++ b/include/linux/pagevec.h
@@ -23,6 +23,7 @@ struct pagevec {
void __pagevec_release(struct pagevec *pvec);
void __pagevec_release_nonlru(struct pagevec *pvec);
void __pagevec_free(struct pagevec *pvec);
+void __pagevec_capture(struct pagevec *pvec, struct list_head *free_list);
void ____pagevec_lru_add(struct pagevec *pvec, enum lru_list lru);
void pagevec_strip(struct pagevec *pvec);
void pagevec_swap_free(struct pagevec *pvec);
diff --git a/mm/internal.h b/mm/internal.h
index fcedcd0..9f0218b 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -251,4 +251,9 @@ int __get_user_pages(struct task_struct *tsk, struct mm_struct *mm,
unsigned long start, int len, int flags,
struct page **pages, struct vm_area_struct **vmas);

+extern struct page *capture_alloc_or_return(struct zone *, struct zone *,
+ struct list_head *, int, int, gfp_t);
+unsigned long try_to_free_pages_capture(struct page **, struct zonelist *,
+ nodemask_t *, int, gfp_t, int);
+
#endif
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 3a646e3..8d1baf1 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -428,6 +428,51 @@ static inline int page_is_buddy(struct page *page, struct page *buddy,
* -- wli
*/

+static inline void __capture_one_page(struct list_head *capture_list,
+ struct page *page, struct zone *zone, unsigned int order)
+{
+ unsigned long page_idx;
+ unsigned long order_size = 1UL << order;
+
+ if (unlikely(PageCompound(page)))
+ destroy_compound_page(page, order);
+
+ page_idx = page_to_pfn(page) & ((1 << MAX_ORDER) - 1);
+
+ VM_BUG_ON(page_idx & (order_size - 1));
+ VM_BUG_ON(bad_range(zone, page));
+
+ while (order < MAX_ORDER-1) {
+ unsigned long combined_idx;
+ struct page *buddy;
+
+ buddy = __page_find_buddy(page, page_idx, order);
+ if (!page_is_buddy(page, buddy, order))
+ break;
+
+ /* Our buddy is free, merge with it and move up one order. */
+ list_del(&buddy->lru);
+ if (PageBuddyCapture(buddy)) {
+ buddy->buddy_free = 0;
+ __ClearPageBuddyCapture(buddy);
+ } else {
+ zone->free_area[order].nr_free--;
+ __mod_zone_page_state(zone,
+ NR_FREE_PAGES, -(1UL << order));
+ }
+ rmv_page_order(buddy);
+ combined_idx = __find_combined_index(page_idx, order);
+ page = page + (combined_idx - page_idx);
+ page_idx = combined_idx;
+ order++;
+ }
+ set_page_order(page, order);
+ __SetPageBuddyCapture(page);
+ page->buddy_free = capture_list;
+
+ list_add(&page->lru, capture_list);
+}
+
static inline void __free_one_page(struct page *page,
struct zone *zone, unsigned int order)
{
@@ -451,6 +496,12 @@ static inline void __free_one_page(struct page *page,
buddy = __page_find_buddy(page, page_idx, order);
if (!page_is_buddy(page, buddy, order))
break;
+ if (PageBuddyCapture(buddy)) {
+ __mod_zone_page_state(zone,
+ NR_FREE_PAGES, -(1UL << order));
+ return __capture_one_page(buddy->buddy_free,
+ page, zone, order);
+ }

/* Our buddy is free, merge with it and move up one order. */
list_del(&buddy->lru);
@@ -626,6 +677,23 @@ static inline void expand(struct zone *zone, struct page *page,
}

/*
+ * Convert the passed page from actual_order to desired_order.
+ * Given a page of actual_order release all but the desired_order sized
+ * buddy at its start.
+ */
+void __carve_off(struct page *page, unsigned long actual_order,
+ unsigned long desired_order)
+{
+ int migratetype = get_pageblock_migratetype(page);
+ struct zone *zone = page_zone(page);
+ struct free_area *area = &(zone->free_area[actual_order]);
+
+ __mod_zone_page_state(zone, NR_FREE_PAGES,
+ (1UL << actual_order) - (1UL << desired_order));
+ expand(zone, page, desired_order, actual_order, area, migratetype);
+}
+
+/*
* This page is about to be returned from the page allocator
*/
static int prep_new_page(struct page *page, int order, gfp_t gfp_flags)
@@ -1664,11 +1732,15 @@ nofail_alloc:
reclaim_state.reclaimed_slab = 0;
p->reclaim_state = &reclaim_state;

- did_some_progress = try_to_free_pages(zonelist, order, gfp_mask);
+ did_some_progress = try_to_free_pages_capture(&page, zonelist, nodemask,
+ order, gfp_mask, alloc_flags);

p->reclaim_state = NULL;
p->flags &= ~PF_MEMALLOC;

+ if (page)
+ goto got_pg;
+
cond_resched();

if (order != 0)
@@ -1799,6 +1871,24 @@ void __pagevec_free(struct pagevec *pvec)
free_hot_cold_page(pvec->pages[i], pvec->cold);
}

+void __pagevec_capture(struct pagevec *pvec, struct list_head *free_list)
+{
+ int i = pagevec_count(pvec);
+ struct zone *zone = page_zone(pvec->pages[0]);
+ unsigned long flags;
+
+ spin_lock_irqsave(&zone->lock, flags);
+ while (--i >= 0) {
+ struct page *page = pvec->pages[i];
+ if (PageAnon(page))
+ page->mapping = NULL;
+ if (!free_page_prepare(page, 0))
+ continue;
+ __capture_one_page(free_list, page, zone, 0);
+ }
+ spin_unlock_irqrestore(&zone->lock, flags);
+}
+
void __free_pages(struct page *page, unsigned int order)
{
if (put_page_testzero(page)) {
@@ -4812,6 +4902,73 @@ out:
spin_unlock_irqrestore(&zone->lock, flags);
}

+#define lru_to_page(_head) (list_entry((_head)->prev, struct page, lru))
+
+/*
+ * Run through the accumulated list of captured pages and take the first
+ * which is big enough to satisfy the original allocation. Trim the selected
+ * page to size, and free the remainder.
+ */
+struct page *capture_alloc_or_return(struct zone *zone,
+ struct zone *preferred_zone, struct list_head *capture_list,
+ int order, int alloc_flags, gfp_t gfp_mask)
+{
+ struct page *capture_page = 0;
+ unsigned long flags;
+ int classzone_idx = zone_idx(preferred_zone);
+
+ spin_lock_irqsave(&zone->lock, flags);
+
+ while (!list_empty(capture_list)) {
+ struct page *page;
+ int pg_order;
+
+ page = lru_to_page(capture_list);
+ list_del(&page->lru);
+ pg_order = page_order(page);
+
+ /*
+ * Clear out our buddy size and list information before
+ * releasing or allocating the page.
+ */
+ rmv_page_order(page);
+ page->buddy_free = 0;
+ __ClearPageBuddyCapture(page);
+
+ if (!capture_page && pg_order >= order) {
+ __carve_off(page, pg_order, order);
+ capture_page = page;
+ } else
+ __free_one_page(page, zone, pg_order);
+ }
+
+ /*
+ * Ensure that this capture would not violate the watermarks.
+ * Subtle, we actually already have the page outside the watermarks
+ * so check if we can allocate an order 0 page.
+ */
+ if (capture_page &&
+ (!zone_cpuset_permits(zone, alloc_flags, gfp_mask) ||
+ !zone_watermark_permits(zone, 0, classzone_idx,
+ alloc_flags, gfp_mask))) {
+ __free_one_page(capture_page, zone, order);
+ capture_page = NULL;
+ }
+
+ if (capture_page)
+ __count_zone_vm_events(PGALLOC, zone, 1 << order);
+
+ zone_clear_flag(zone, ZONE_ALL_UNRECLAIMABLE);
+ zone->pages_scanned = 0;
+
+ spin_unlock_irqrestore(&zone->lock, flags);
+
+ if (capture_page)
+ prep_new_page(capture_page, order, gfp_mask);
+
+ return capture_page;
+}
+
#ifdef CONFIG_MEMORY_HOTREMOVE
/*
* All pages in the range must be isolated before calling this.
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 85ce427..69bb29c 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -56,6 +56,8 @@ struct scan_control {
/* This context's GFP mask */
gfp_t gfp_mask;

+ int alloc_flags;
+
int may_writepage;

/* Can pages be swapped as part of reclaim? */
@@ -81,6 +83,12 @@ struct scan_control {
unsigned long *scanned, int order, int mode,
struct zone *z, struct mem_cgroup *mem_cont,
int active, int file);
+
+ /* Captured page. */
+ struct page **capture;
+
+ /* Nodemask for acceptable allocations. */
+ nodemask_t *nodemask;
};

#define lru_to_page(_head) (list_entry((_head)->prev, struct page, lru))
@@ -560,7 +568,8 @@ void putback_lru_page(struct page *page)
/*
* shrink_page_list() returns the number of reclaimed pages
*/
-static unsigned long shrink_page_list(struct list_head *page_list,
+static unsigned long shrink_page_list(struct list_head *free_list,
+ struct list_head *page_list,
struct scan_control *sc,
enum pageout_io sync_writeback)
{
@@ -743,7 +752,10 @@ static unsigned long shrink_page_list(struct list_head *page_list,
free_it:
nr_reclaimed++;
if (!pagevec_add(&freed_pvec, page)) {
- __pagevec_free(&freed_pvec);
+ if (free_list)
+ __pagevec_capture(&freed_pvec, free_list);
+ else
+ __pagevec_free(&freed_pvec);
pagevec_reinit(&freed_pvec);
}
continue;
@@ -767,8 +779,12 @@ keep:
VM_BUG_ON(PageLRU(page) || PageUnevictable(page));
}
list_splice(&ret_pages, page_list);
- if (pagevec_count(&freed_pvec))
- __pagevec_free(&freed_pvec);
+ if (pagevec_count(&freed_pvec)) {
+ if (free_list)
+ __pagevec_capture(&freed_pvec, free_list);
+ else
+ __pagevec_free(&freed_pvec);
+ }
count_vm_events(PGACTIVATE, pgactivate);
return nr_reclaimed;
}
@@ -1024,7 +1040,8 @@ int isolate_lru_page(struct page *page)
* shrink_inactive_list() is a helper for shrink_zone(). It returns the number
* of reclaimed pages
*/
-static unsigned long shrink_inactive_list(unsigned long max_scan,
+static unsigned long shrink_inactive_list(struct list_head *free_list,
+ unsigned long max_scan,
struct zone *zone, struct scan_control *sc,
int priority, int file)
{
@@ -1083,7 +1100,8 @@ static unsigned long shrink_inactive_list(unsigned long max_scan,
spin_unlock_irq(&zone->lru_lock);

nr_scanned += nr_scan;
- nr_freed = shrink_page_list(&page_list, sc, PAGEOUT_IO_ASYNC);
+ nr_freed = shrink_page_list(free_list, &page_list,
+ sc, PAGEOUT_IO_ASYNC);

/*
* If we are direct reclaiming for contiguous pages and we do
@@ -1102,8 +1120,8 @@ static unsigned long shrink_inactive_list(unsigned long max_scan,
nr_active = clear_active_flags(&page_list, count);
count_vm_events(PGDEACTIVATE, nr_active);

- nr_freed += shrink_page_list(&page_list, sc,
- PAGEOUT_IO_SYNC);
+ nr_freed += shrink_page_list(free_list, &page_list,
+ sc, PAGEOUT_IO_SYNC);
}

nr_reclaimed += nr_freed;
@@ -1337,7 +1355,8 @@ static void shrink_active_list(unsigned long nr_pages, struct zone *zone,
pagevec_release(&pvec);
}

-static unsigned long shrink_list(enum lru_list lru, unsigned long nr_to_scan,
+static unsigned long shrink_list(struct list_head *free_list,
+ enum lru_list lru, unsigned long nr_to_scan,
struct zone *zone, struct scan_control *sc, int priority)
{
int file = is_file_lru(lru);
@@ -1352,7 +1371,8 @@ static unsigned long shrink_list(enum lru_list lru, unsigned long nr_to_scan,
shrink_active_list(nr_to_scan, zone, sc, priority, file);
return 0;
}
- return shrink_inactive_list(nr_to_scan, zone, sc, priority, file);
+ return shrink_inactive_list(free_list, nr_to_scan, zone,
+ sc, priority, file);
}

/*
@@ -1444,7 +1464,7 @@ static void get_scan_ratio(struct zone *zone, struct scan_control * sc,
* This is a basic per-zone page freer. Used by both kswapd and direct reclaim.
*/
static unsigned long shrink_zone(int priority, struct zone *zone,
- struct scan_control *sc)
+ struct zone *preferred_zone, struct scan_control *sc)
{
unsigned long nr[NR_LRU_LISTS];
unsigned long nr_to_scan;
@@ -1452,6 +1472,23 @@ static unsigned long shrink_zone(int priority, struct zone *zone,
unsigned long percent[2]; /* anon @ 0; file @ 1 */
enum lru_list l;

+ struct list_head __capture_list;
+ struct list_head *capture_list = NULL;
+ struct page *capture_page;
+
+ /*
+ * When direct reclaimers are asking for larger orders
+ * capture pages for them. There is no point if we already
+ * have an acceptable page or if this zone is not within the
+ * nodemask.
+ */
+ if (sc->order > PAGE_ALLOC_COSTLY_ORDER &&
+ sc->capture && !*(sc->capture) && (sc->nodemask == NULL ||
+ node_isset(zone_to_nid(zone), *sc->nodemask))) {
+ capture_list = &__capture_list;
+ INIT_LIST_HEAD(capture_list);
+ }
+
get_scan_ratio(zone, sc, percent);

for_each_evictable_lru(l) {
@@ -1481,6 +1518,8 @@ static unsigned long shrink_zone(int priority, struct zone *zone,
}
}

+ capture_page = NULL;
+
while (nr[LRU_INACTIVE_ANON] || nr[LRU_ACTIVE_FILE] ||
nr[LRU_INACTIVE_FILE]) {
for_each_evictable_lru(l) {
@@ -1489,10 +1528,18 @@ static unsigned long shrink_zone(int priority, struct zone *zone,
(unsigned long)sc->swap_cluster_max);
nr[l] -= nr_to_scan;

- nr_reclaimed += shrink_list(l, nr_to_scan,
+ nr_reclaimed += shrink_list(capture_list,
+ l, nr_to_scan,
zone, sc, priority);
}
}
+ if (capture_list) {
+ capture_page = capture_alloc_or_return(zone,
+ preferred_zone, capture_list, sc->order,
+ sc->alloc_flags, sc->gfp_mask);
+ if (capture_page)
+ capture_list = NULL;
+ }
}

/*
@@ -1504,6 +1551,9 @@ static unsigned long shrink_zone(int priority, struct zone *zone,
else if (!scan_global_lru(sc))
shrink_active_list(SWAP_CLUSTER_MAX, zone, sc, priority, 0);

+ if (capture_page)
+ *(sc->capture) = capture_page;
+
throttle_vm_writeout(sc->gfp_mask);
return nr_reclaimed;
}
@@ -1525,7 +1575,7 @@ static unsigned long shrink_zone(int priority, struct zone *zone,
* scan then give up on it.
*/
static unsigned long shrink_zones(int priority, struct zonelist *zonelist,
- struct scan_control *sc)
+ struct zone *preferred_zone, struct scan_control *sc)
{
enum zone_type high_zoneidx = gfp_zone(sc->gfp_mask);
unsigned long nr_reclaimed = 0;
@@ -1559,7 +1609,7 @@ static unsigned long shrink_zones(int priority, struct zonelist *zonelist,
priority);
}

- nr_reclaimed += shrink_zone(priority, zone, sc);
+ nr_reclaimed += shrink_zone(priority, zone, preferred_zone, sc);
}

return nr_reclaimed;
@@ -1592,8 +1642,14 @@ static unsigned long do_try_to_free_pages(struct zonelist *zonelist,
unsigned long lru_pages = 0;
struct zoneref *z;
struct zone *zone;
+ struct zone *preferred_zone;
enum zone_type high_zoneidx = gfp_zone(sc->gfp_mask);

+ /* This should never fail as we should be scanning a real zonelist. */
+ (void)first_zones_zonelist(zonelist, high_zoneidx, sc->nodemask,
+ &preferred_zone);
+ BUG_ON(!preferred_zone);
+
delayacct_freepages_start();

if (scan_global_lru(sc))
@@ -1615,7 +1671,8 @@ static unsigned long do_try_to_free_pages(struct zonelist *zonelist,
sc->nr_scanned = 0;
if (!priority)
disable_swap_token();
- nr_reclaimed += shrink_zones(priority, zonelist, sc);
+ nr_reclaimed += shrink_zones(priority, zonelist,
+ preferred_zone, sc);
/*
* Don't shrink slabs when reclaiming memory from
* over limit cgroups
@@ -1680,11 +1737,13 @@ out:
return ret;
}

-unsigned long try_to_free_pages(struct zonelist *zonelist, int order,
- gfp_t gfp_mask)
+unsigned long try_to_free_pages_capture(struct page **capture_pagep,
+ struct zonelist *zonelist, nodemask_t *nodemask,
+ int order, gfp_t gfp_mask, int alloc_flags)
{
struct scan_control sc = {
.gfp_mask = gfp_mask,
+ .alloc_flags = alloc_flags,
.may_writepage = !laptop_mode,
.swap_cluster_max = SWAP_CLUSTER_MAX,
.may_swap = 1,
@@ -1692,17 +1751,28 @@ unsigned long try_to_free_pages(struct zonelist *zonelist, int order,
.order = order,
.mem_cgroup = NULL,
.isolate_pages = isolate_pages_global,
+ .capture = capture_pagep,
+ .nodemask = nodemask,
};

return do_try_to_free_pages(zonelist, &sc);
}

+unsigned long try_to_free_pages(struct zonelist *zonelist,
+ int order, gfp_t gfp_mask)
+{
+ return try_to_free_pages_capture(NULL, zonelist, NULL,
+ order, gfp_mask, 0);
+}
+
#ifdef CONFIG_CGROUP_MEM_RES_CTLR

unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *mem_cont,
gfp_t gfp_mask)
{
struct scan_control sc = {
+ .gfp_mask = gfp_mask,
+ .alloc_flags = 0,
.may_writepage = !laptop_mode,
.may_swap = 1,
.swap_cluster_max = SWAP_CLUSTER_MAX,
@@ -1710,6 +1780,8 @@ unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *mem_cont,
.order = 0,
.mem_cgroup = mem_cont,
.isolate_pages = mem_cgroup_isolate_pages,
+ .capture = NULL,
+ .nodemask = NULL,
};
struct zonelist *zonelist;

@@ -1751,12 +1823,15 @@ static unsigned long balance_pgdat(pg_data_t *pgdat, int order)
struct reclaim_state *reclaim_state = current->reclaim_state;
struct scan_control sc = {
.gfp_mask = GFP_KERNEL,
+ .alloc_flags = 0,
.may_swap = 1,
.swap_cluster_max = SWAP_CLUSTER_MAX,
.swappiness = vm_swappiness,
.order = order,
.mem_cgroup = NULL,
.isolate_pages = isolate_pages_global,
+ .capture = NULL,
+ .nodemask = NULL,
};
/*
* temp_priority is used to remember the scanning priority at which
@@ -1852,7 +1927,8 @@ loop_again:
*/
if (!zone_watermark_ok(zone, order, 8*zone->pages_high,
end_zone, 0))
- nr_reclaimed += shrink_zone(priority, zone, &sc);
+ nr_reclaimed += shrink_zone(priority,
+ zone, zone, &sc);
reclaim_state->reclaimed_slab = 0;
nr_slab = shrink_slab(sc.nr_scanned, GFP_KERNEL,
lru_pages);
@@ -2054,7 +2130,7 @@ static unsigned long shrink_all_zones(unsigned long nr_pages, int prio,
nr_to_scan = min(nr_pages,
zone_page_state(zone,
NR_LRU_BASE + l));
- ret += shrink_list(l, nr_to_scan, zone,
+ ret += shrink_list(NULL, l, nr_to_scan, zone,
sc, prio);
if (ret >= nr_pages)
return ret;
@@ -2081,6 +2157,7 @@ unsigned long shrink_all_memory(unsigned long nr_pages)
struct reclaim_state reclaim_state;
struct scan_control sc = {
.gfp_mask = GFP_KERNEL,
+ .alloc_flags = 0,
.may_swap = 0,
.swap_cluster_max = nr_pages,
.may_writepage = 1,
@@ -2269,6 +2346,7 @@ static int __zone_reclaim(struct zone *zone, gfp_t gfp_mask, unsigned int order)
.swap_cluster_max = max_t(unsigned long, nr_pages,
SWAP_CLUSTER_MAX),
.gfp_mask = gfp_mask,
+ .alloc_flags = 0,
.swappiness = vm_swappiness,
.isolate_pages = isolate_pages_global,
};
@@ -2295,7 +2373,7 @@ static int __zone_reclaim(struct zone *zone, gfp_t gfp_mask, unsigned int order)
priority = ZONE_RECLAIM_PRIORITY;
do {
note_zone_scanning_priority(zone, priority);
- nr_reclaimed += shrink_zone(priority, zone, &sc);
+ nr_reclaimed += shrink_zone(priority, zone, zone, &sc);
priority--;
} while (priority >= 0 && nr_reclaimed < nr_pages);
}
--
1.6.0.1.451.gc8d31

2008-10-01 15:02:45

by Christoph Lameter

Subject: Re: [PATCH 4/4] capture pages freed during direct reclaim for allocation by the reclaimer

Andy Whitcroft wrote:
> When a process enters direct reclaim it will expend effort identifying
> and releasing pages in the hope of obtaining a page. However as these
> pages are released asynchronously there is every possibility that the
> pages will have been consumed by other allocators before the reclaimer
> gets a look in. This is particularly problematic where the reclaimer is
> attempting to allocate a higher order page. It is highly likely that
> a parallel allocation will consume lower order constituent pages as we
> release them preventing them coelescing into the higher order page the
> reclaimer desires.

The reclaim problem is due to the pcp queueing, right? Could we disable pcp
queueing during reclaim? pcp processing is not necessarily a gain, so
temporarily disabling it should not be a problem.

At the beginning of reclaim just flush all pcp pages and then do not allow pcp
refills again until reclaim is finished?
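
As a rough userspace-only sketch of the behaviour being suggested here (a
per-cpu list that is drained when reclaim starts and not refilled until it
finishes); every name below is invented for the illustration and none of
this is the actual pcp code:

#include <stdio.h>
#include <stdbool.h>

#define PCP_BATCH 4

/* Toy per-cpu free list plus a flag that suppresses refills during reclaim. */
static unsigned long pcp_list[PCP_BATCH];
static int pcp_count;
static bool reclaim_in_progress;

/* Return everything queued on the per-cpu list to the (imaginary) buddy lists. */
static void pcp_drain(void)
{
    while (pcp_count > 0)
        printf("drain pfn %lu back to the buddy lists\n",
               pcp_list[--pcp_count]);
}

/*
 * Free fast path: normally order-0 pages are batched on the per-cpu list,
 * but while reclaim is running (or the list is full) they bypass it so
 * they can coalesce in the buddy lists immediately.
 */
static void free_page_fast(unsigned long pfn)
{
    if (reclaim_in_progress || pcp_count == PCP_BATCH) {
        printf("free pfn %lu straight to the buddy lists\n", pfn);
        return;
    }
    pcp_list[pcp_count++] = pfn;
}

int main(void)
{
    free_page_fast(1);
    free_page_fast(2);

    reclaim_in_progress = true;     /* entering direct reclaim */
    pcp_drain();                    /* flush what is already queued */
    free_page_fast(3);              /* refill suppressed: goes to buddy */
    reclaim_in_progress = false;    /* reclaim finished */

    free_page_fast(4);              /* normal batching resumes */
    return 0;
}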

2008-10-02 02:46:29

by Minchan Kim

Subject: Re: [PATCH 0/4] Reclaim page capture v4

Hi, Andy.

I tested your patch on my desktop.
The test is just a kernel compile with a single thread.
My system environment is as follows.

model name : Intel(R) Core(TM)2 Quad CPU Q6600 @ 2.40GHz
MemTotal: 2065856 kB

When I tested vanilla, the compile time was as follows.

2433.53user 187.96system 42:05.99elapsed 103%CPU (0avgtext+0avgdata
0maxresident)k
588752inputs+4503408outputs (127major+55456246minor)pagefaults 0swaps

When I tested with your patch, it was as follows.

2489.63user 202.41system 44:47.71elapsed 100%CPU (0avgtext+0avgdata
0maxresident)k
538608inputs+4503928outputs (130major+55531561minor)pagefaults 0swaps

The regression is over 2 minutes.
Do you think it is trivial?

I know your patch is good for allocating hugepages.
But I think many users, including embedded system and desktop users,
don't need it yet.

So I suggest you make it optional.

On Wed, Oct 1, 2008 at 9:30 PM, Andy Whitcroft <[email protected]> wrote:
> For sometime we have been looking at mechanisms for improving the availability
> of larger allocations under load. One of the options we have explored is
> the capturing of pages freed under direct reclaim in order to increase the
> chances of free pages coelescing before they are subject to reallocation
> by racing allocators.
>
> Following this email is a patch stack implementing page capture during
> direct reclaim. It consits of four patches. The first two simply pull
> out existing code into helpers for reuse. The third makes buddy's use
> of struct page explicit. The fourth contains the meat of the changes,
> and its leader contains a much fuller description of the feature.
>
> This update represents a rebase to -mm and incorporates feedback from
> KOSAKI Motohiro. It also incorporates an accounting fix which was
> preventing some captures.
>
> I have done a lot of comparitive testing with and without this patch
> set and in broad brush I am seeing improvements in hugepage allocations
> (worst case size) success on all of my test systems. These tests consist
> of placing a constant stream of high order allocations on the system,
> at varying rates. The results for these various runs are then averaged
> to give an overall improvement.
>
> Absolute Effective
> x86-64 2.48% 4.58%
> powerpc 5.55% 25.22%
>
> x86-64 has a relatively small huge page size and so is always much more
> effective at allocating huge pages. Even there we get a measurable
> improvement. On powerpc the huge pages are much larger and much harder
> to recover. Here we see a full 25% increase in page recovery.
>
> It should be noted that these are worst case testing, and very agressive
> taking every possible page in the system. It would be helpful to get
> wider testing in -mm.
>
> Against: 2.6.27-rc1-mm1
>
> Andrew, please consider for -mm.
>
> -apw
>
> Changes since V3:
> - Incorporates an anon vma fix pointed out by MinChan Kim
> - switch to using a pagevec for page capture collection
>
> Changes since V2:
> - Incorporates review feedback from Christoph Lameter,
> - Incorporates review feedback from Peter Zijlstra, and
> - Checkpatch fixes.
>
> Changes since V1:
> - Incorporates review feedback from KOSAKI Motohiro,
> - fixes up accounting when checking watermarks for captured pages,
> - rebase 2.6.27-rc1-mm1,
> - Incorporates review feedback from Mel.
>
>
> Andy Whitcroft (4):
> pull out the page pre-release and sanity check logic for reuse
> pull out zone cpuset and watermark checks for reuse
> buddy: explicitly identify buddy field use in struct page
> capture pages freed during direct reclaim for allocation by the
> reclaimer
>
> include/linux/mm_types.h | 4 +
> include/linux/page-flags.h | 4 +
> include/linux/pagevec.h | 1 +
> mm/internal.h | 7 +-
> mm/page_alloc.c | 265 ++++++++++++++++++++++++++++++++++++++------
> mm/vmscan.c | 118 ++++++++++++++++----
> 6 files changed, 343 insertions(+), 56 deletions(-)
>



--
Kind regards,
MinChan Kim

2008-10-02 06:39:10

by Kamezawa Hiroyuki

Subject: Re: [PATCH 0/4] Reclaim page capture v4

On Wed, 1 Oct 2008 13:30:57 +0100
Andy Whitcroft <[email protected]> wrote:

> For sometime we have been looking at mechanisms for improving the availability
> of larger allocations under load. One of the options we have explored is
> the capturing of pages freed under direct reclaim in order to increase the
> chances of free pages coelescing before they are subject to reallocation
> by racing allocators.
>
> Following this email is a patch stack implementing page capture during
> direct reclaim. It consits of four patches. The first two simply pull
> out existing code into helpers for reuse. The third makes buddy's use
> of struct page explicit. The fourth contains the meat of the changes,
> and its leader contains a much fuller description of the feature.
>
> This update represents a rebase to -mm and incorporates feedback from
> KOSAKI Motohiro. It also incorporates an accounting fix which was
> preventing some captures.
>
> I have done a lot of comparitive testing with and without this patch
> set and in broad brush I am seeing improvements in hugepage allocations
> (worst case size) success on all of my test systems. These tests consist
> of placing a constant stream of high order allocations on the system,
> at varying rates. The results for these various runs are then averaged
> to give an overall improvement.
>
> Absolute Effective
> x86-64 2.48% 4.58%
> powerpc 5.55% 25.22%
>
> x86-64 has a relatively small huge page size and so is always much more
> effective at allocating huge pages. Even there we get a measurable
> improvement. On powerpc the huge pages are much larger and much harder
> to recover. Here we see a full 25% increase in page recovery.
>
> It should be noted that these are worst case testing, and very agressive
> taking every possible page in the system. It would be helpful to get
> wider testing in -mm.
>
> Against: 2.6.27-rc1-mm1
>
> Andrew, please consider for -mm.
>

Hmm, can't we use the "MIGRATE_ISOLATE" pageblock type for this purpose?
The page allocator skips pageblocks marked as MIGRATE_ISOLATE at allocation.
(The pageblock size is equal to the huge page size in general.)

Of course, where to isolate is a problem.
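
A toy model of the behaviour described: each pageblock carries a
migratetype and the allocation path simply skips blocks marked isolated,
so their pages remain free for a later contiguous grab. The names are
invented for the illustration and are not the real allocator's:

#include <stdio.h>

/* Toy migratetypes: one per pageblock, as in the allocator's bitmap. */
enum toy_migratetype { TOY_MOVABLE, TOY_ISOLATE };

#define NR_BLOCKS 4

static enum toy_migratetype blocktype[NR_BLOCKS] = {
    TOY_MOVABLE, TOY_ISOLATE, TOY_MOVABLE, TOY_MOVABLE,
};

/*
 * Pick the first pageblock an allocation may use: isolated blocks are
 * skipped, so their pages drain free over time and can later be taken
 * as one contiguous, huge-page sized chunk.
 */
static int pick_block(void)
{
    for (int i = 0; i < NR_BLOCKS; i++)
        if (blocktype[i] != TOY_ISOLATE)
            return i;
    return -1;
}

int main(void)
{
    printf("pageblock 1 is isolated; allocations come from pageblock %d\n",
           pick_block());
    return 0;
}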

Thanks,
-Kame

2008-10-02 07:00:19

by Kamezawa Hiroyuki

Subject: Re: [PATCH 1/4] pull out the page pre-release and sanity check logic for reuse

On Wed, 1 Oct 2008 13:30:58 +0100
Andy Whitcroft <[email protected]> wrote:

> When we are about to release a page we perform a number of actions
> on that page. We clear down any anonymous mappings, confirm that
> the page is safe to release, check for freeing locks, before mapping
> the page should that be required. Pull this processing out into a
> helper function for reuse in a later patch.
>
> Note that we do not convert the similar cleardown in free_hot_cold_page()
> as the optimiser is unable to squash the loops during the inline.
>
> Signed-off-by: Andy Whitcroft <[email protected]>
> Acked-by: Peter Zijlstra <[email protected]>
> Acked-by: KOSAKI Motohiro <[email protected]>
> Reviewed-by: Rik van Riel <[email protected]>

Reviewed-by: KAMEZAWA Hiroyuki <[email protected]>

2008-10-02 07:00:39

by Kamezawa Hiroyuki

Subject: Re: [PATCH 2/4] pull out zone cpuset and watermark checks for reuse

On Wed, 1 Oct 2008 13:30:59 +0100
Andy Whitcroft <[email protected]> wrote:

> When allocating we need to confirm that the zone we are about to allocate
> from is acceptable to the CPUSET we are in, and that it does not violate
> the zone watermarks. Pull these checks out so we can reuse them in a
> later patch.
>
> Signed-off-by: Andy Whitcroft <[email protected]>
> Acked-by: Peter Zijlstra <[email protected]>
> Acked-by: KOSAKI Motohiro <[email protected]>
> Reviewed-by: Rik van Riel <[email protected]>

Reviewed-by: KAMEZAWA Hiroyuki <[email protected]>

2008-10-02 07:01:39

by Kamezawa Hiroyuki

Subject: Re: [PATCH 3/4] buddy: explicitly identify buddy field use in struct page

On Wed, 1 Oct 2008 13:31:00 +0100
Andy Whitcroft <[email protected]> wrote:

> Explicitly define the struct page fields which buddy uses when it owns
> pages. Defines a new anonymous struct to allow additional fields to
> be defined in a later patch.
>
> Signed-off-by: Andy Whitcroft <[email protected]>
> Acked-by: Peter Zijlstra <[email protected]>
> Acked-by: KOSAKI Motohiro <[email protected]>
> Reviewed-by: Rik van Riel <[email protected]>
> Reviewed-by: Christoph Lameter <[email protected]>

Reviewed-by: KAMEZAWA Hiroyuki <[email protected]>

2008-10-02 07:18:41

by Kamezawa Hiroyuki

Subject: Re: [PATCH 4/4] capture pages freed during direct reclaim for allocation by the reclaimer

On Wed, 1 Oct 2008 13:31:01 +0100
Andy Whitcroft <[email protected]> wrote:

> When a process enters direct reclaim it will expend effort identifying
> and releasing pages in the hope of obtaining a page. However as these
> pages are released asynchronously there is every possibility that the
> pages will have been consumed by other allocators before the reclaimer
> gets a look in. This is particularly problematic where the reclaimer is
> attempting to allocate a higher order page. It is highly likely that
> a parallel allocation will consume lower order constituent pages as we
> release them preventing them coelescing into the higher order page the
> reclaimer desires.
>
> This patch set attempts to address this for allocations above
> ALLOC_COSTLY_ORDER by temporarily collecting the pages we are releasing
> onto a local free list. Instead of freeing them to the main buddy lists,
> pages are collected and coelesced on this per direct reclaimer free list.
> Pages which are freed by other processes are also considered, where they
> coelesce with a page already under capture they will be moved to the
> capture list. When pressure has been applied to a zone we then consult
> the capture list and if there is an appropriatly sized page available
> it is taken immediatly and the remainder returned to the free pool.
> Capture is only enabled when the reclaimer's allocation order exceeds
> ALLOC_COSTLY_ORDER as free pages below this order should naturally occur
> in large numbers following regular reclaim.
>
> Thanks go to Mel Gorman for numerous discussions during the development
> of this patch and for his repeated reviews.
>

Hmm.. is this routine better than
mm/memory_hotplug.c::do_migrate_range(start_pfn, end_pfn)?

Thanks,
-Kame

2008-10-02 14:35:54

by Andy Whitcroft

Subject: Re: [PATCH 4/4] capture pages freed during direct reclaim for allocation by the reclaimer

On Wed, Oct 01, 2008 at 10:01:46AM -0500, Christoph Lameter wrote:
> Andy Whitcroft wrote:
> > When a process enters direct reclaim it will expend effort identifying
> > and releasing pages in the hope of obtaining a page. However as these
> > pages are released asynchronously there is every possibility that the
> > pages will have been consumed by other allocators before the reclaimer
> > gets a look in. This is particularly problematic where the reclaimer is
> > attempting to allocate a higher order page. It is highly likely that
> > a parallel allocation will consume lower order constituent pages as we
> > release them preventing them coelescing into the higher order page the
> > reclaimer desires.
>
> The reclaim problem is due to the pcp queueing right? Could we disable pcp
> queueing during reclaim? pcp processing is not necessarily a gain, so
> temporarily disabling it should not be a problem.
>
> At the beginning of reclaim just flush all pcp pages and then do not allow pcp
> refills again until reclaim is finished?

Not entirely; some pages could get trapped there for sure. But it is
parallel allocations that we are trying to guard against. Plus we already flush
the pcp during reclaim for higher orders.

-apw

2008-10-02 15:02:49

by Andy Whitcroft

Subject: Re: [PATCH 4/4] capture pages freed during direct reclaim for allocation by the reclaimer

On Thu, Oct 02, 2008 at 04:24:14PM +0900, KAMEZAWA Hiroyuki wrote:
> On Wed, 1 Oct 2008 13:31:01 +0100
> Andy Whitcroft <[email protected]> wrote:
>
> > When a process enters direct reclaim it will expend effort identifying
> > and releasing pages in the hope of obtaining a page. However as these
> > pages are released asynchronously there is every possibility that the
> > pages will have been consumed by other allocators before the reclaimer
> > gets a look in. This is particularly problematic where the reclaimer is
> > attempting to allocate a higher order page. It is highly likely that
> > a parallel allocation will consume lower order constituent pages as we
> > release them preventing them coelescing into the higher order page the
> > reclaimer desires.
> >
> > This patch set attempts to address this for allocations above
> > ALLOC_COSTLY_ORDER by temporarily collecting the pages we are releasing
> > onto a local free list. Instead of freeing them to the main buddy lists,
> > pages are collected and coelesced on this per direct reclaimer free list.
> > Pages which are freed by other processes are also considered, where they
> > coelesce with a page already under capture they will be moved to the
> > capture list. When pressure has been applied to a zone we then consult
> > the capture list and if there is an appropriatly sized page available
> > it is taken immediatly and the remainder returned to the free pool.
> > Capture is only enabled when the reclaimer's allocation order exceeds
> > ALLOC_COSTLY_ORDER as free pages below this order should naturally occur
> > in large numbers following regular reclaim.
> >
> > Thanks go to Mel Gorman for numerous discussions during the development
> > of this patch and for his repeated reviews.
> >
>
> Hmm.. is this routine better than
> mm/memory_hotplug.c::do_migrate_range(start_pfn, end_pfn) ?

Are you suggesting that it might be more advantageous to try and migrate
things out of this area as part of reclaim? If so then I tend to agree,
though that would be a good idea generally with or without capture.

/me adds it to his todo list to test that out.

-apw

2008-10-02 15:04:50

by Andy Whitcroft

Subject: Re: [PATCH 0/4] Reclaim page capture v4

On Thu, Oct 02, 2008 at 11:46:12AM +0900, MinChan Kim wrote:
> Hi, Andy.
>
> I tested your patch in my desktop.
> The test is just kernel compile with single thread.
> My system environment is as follows.
>
> model name : Intel(R) Core(TM)2 Quad CPU Q6600 @ 2.40GHz
> MemTotal: 2065856 kB
>
> When I tested vanilla, compile time is as follows.
>
> 2433.53user 187.96system 42:05.99elapsed 103%CPU (0avgtext+0avgdata
> 0maxresident)k
> 588752inputs+4503408outputs (127major+55456246minor)pagefaults 0swaps
>
> When I tested your patch, as follows.
>
> 2489.63user 202.41system 44:47.71elapsed 100%CPU (0avgtext+0avgdata
> 0maxresident)k
> 538608inputs+4503928outputs (130major+55531561minor)pagefaults 0swaps
>
> Regresstion almost is above 2 minutes.
> Do you think It is a trivial?
>
> I know your patch is good to allocate hugepage.
> But, I think many users don't need it, including embedded system and
> desktop users yet.
>
> So I suggest you made it enable optionally.

Hmmm. I would not expect to see any significant impact for this kind of
workload as we should not be triggering capture for the lower order
allocations at all. Something screwy must be occurring. I will go and
reproduce this here and see if I can figure out just what causes this.

-apw

2008-10-02 15:27:11

by Kamezawa Hiroyuki

Subject: Re: Re: [PATCH 4/4] capture pages freed during direct reclaim for allocation by the reclaimer

----- Original Message -----

>On Thu, Oct 02, 2008 at 04:24:14PM +0900, KAMEZAWA Hiroyuki wrote:
>> On Wed, 1 Oct 2008 13:31:01 +0100
>> Andy Whitcroft <[email protected]> wrote:
>>
>> > When a process enters direct reclaim it will expend effort identifying
>> > and releasing pages in the hope of obtaining a page. However as these
>> > pages are released asynchronously there is every possibility that the
>> > pages will have been consumed by other allocators before the reclaimer
>> > gets a look in. This is particularly problematic where the reclaimer is
>> > attempting to allocate a higher order page. It is highly likely that
>> > a parallel allocation will consume lower order constituent pages as we
>> > release them preventing them coelescing into the higher order page the
>> > reclaimer desires.
>> >
>> > This patch set attempts to address this for allocations above
>> > ALLOC_COSTLY_ORDER by temporarily collecting the pages we are releasing
>> > onto a local free list. Instead of freeing them to the main buddy lists,
>> > pages are collected and coelesced on this per direct reclaimer free list.
>> > Pages which are freed by other processes are also considered, where they
>> > coelesce with a page already under capture they will be moved to the
>> > capture list. When pressure has been applied to a zone we then consult
>> > the capture list and if there is an appropriatly sized page available
>> > it is taken immediatly and the remainder returned to the free pool.
>> > Capture is only enabled when the reclaimer's allocation order exceeds
>> > ALLOC_COSTLY_ORDER as free pages below this order should naturally occur
>> > in large numbers following regular reclaim.
>> >
>> > Thanks go to Mel Gorman for numerous discussions during the development
>> > of this patch and for his repeated reviews.
>> >
>>
>> Hmm.. is this routine better than
>> mm/memory_hotplug.c::do_migrate_range(start_pfn, end_pfn) ?
>
>Are you suggesting that it might be more adventageous to try and migrate
>things out of this area as part of reclaim? If so then I tend to agree,
>though that would be a good idea generally with or without capture.
>
>/me adds it to his todo list to test that out.
>
I just remembered I did the same kind of work to offline pages.
Sorry for the noise.

I have an idea to support the following kind of interface via memory hotplug.
It makes all pages in the section into huge pages.

#echo huge > /sys/device/system/memory/memoryXXX/state
(The memory hotplug interface supports online/offline here.)

But no patches yet...

Thanks,
-Kame

2008-10-02 16:31:11

by Christoph Lameter

Subject: Re: [PATCH 4/4] capture pages freed during direct reclaim for allocation by the reclaimer

Andy Whitcroft wrote:

>> At the beginning of reclaim just flush all pcp pages and then do not allow pcp
>> refills again until reclaim is finished?
>
> Not entirely, some pages could get trapped there for sure. But it is
> parallel allocations we are trying to guard against. Plus we already flush
> the pcp during reclaim for higher orders.

So we would just need to forbid refilling the pcp.

Parallel allocations are less of a problem if the freed order-0 pages get merged
immediately into the order-1 free list. Of course that will only work 50% of
the time, but it will have a similar effect to this patch.
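
Whether an order-0 free can merge upwards depends entirely on one
specific neighbour: the buddy whose index differs from it in a single bit.
A few lines of C make that relation concrete (buddy_pfn() is an
illustrative stand-in for the index arithmetic __page_find_buddy()
performs):

#include <stdio.h>

/*
 * The buddy of the block starting at 'pfn' at a given order differs from
 * it only in bit 'order' of the index, so each block has exactly one
 * candidate it can merge with at each order.
 */
static unsigned long buddy_pfn(unsigned long pfn, unsigned int order)
{
    return pfn ^ (1UL << order);
}

int main(void)
{
    unsigned long pfn = 8;

    /* The order-0 page at pfn 8 can only merge if pfn 9 is also free;
     * the resulting order-1 block only if pfns 10-11 are free, and so on. */
    for (unsigned int order = 0; order < 3; order++)
        printf("block at pfn %lu, order %u -> buddy starts at pfn %lu\n",
               pfn, order, buddy_pfn(pfn, order));
    return 0;
}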

2008-10-03 03:26:16

by KOSAKI Motohiro

Subject: Re: [PATCH 0/4] Reclaim page capture v4

> > I tested your patch in my desktop.
> > The test is just kernel compile with single thread.
> > My system environment is as follows.
> >
> > model name : Intel(R) Core(TM)2 Quad CPU Q6600 @ 2.40GHz
> > MemTotal: 2065856 kB
> >
> > When I tested vanilla, compile time is as follows.
> >
> > 2433.53user 187.96system 42:05.99elapsed 103%CPU (0avgtext+0avgdata
> > 0maxresident)k
> > 588752inputs+4503408outputs (127major+55456246minor)pagefaults 0swaps
> >
> > When I tested your patch, as follows.
> >
> > 2489.63user 202.41system 44:47.71elapsed 100%CPU (0avgtext+0avgdata
> > 0maxresident)k
> > 538608inputs+4503928outputs (130major+55531561minor)pagefaults 0swaps
> >
> > Regresstion almost is above 2 minutes.
> > Do you think It is a trivial?
> >
> > I know your patch is good to allocate hugepage.
> > But, I think many users don't need it, including embedded system and
> > desktop users yet.
> >
> > So I suggest you made it enable optionally.
>
> Hmmm. I would not expect to see any significant impact for this kind of
> workload as we should not be triggering capture for the lower order
> allocations at all. Something screwey must be occuring. I will go and
> reproduce this here and see if I can figure out just what causes this.

Yup.
I also think this is a significant regression.

If this is reproduced, the patch shouldn't be merged to -mm, IMHO.



2008-10-03 03:41:53

by KOSAKI Motohiro

Subject: Re: [PATCH 4/4] capture pages freed during direct reclaim for allocation by the reclaimer

Hi Christoph,

> >> At the beginning of reclaim just flush all pcp pages and then do not allow pcp
> >> refills again until reclaim is finished?
> >
> > Not entirely, some pages could get trapped there for sure. But it is
> > parallel allocations we are trying to guard against. Plus we already flush
> > the pcp during reclaim for higher orders.
>
> So we just would need to forbid refilling the pcp.
>
> Parallel allocations are less a problem if the freed order 0 pages get merged
> immediately into the order 1 freelist. Of course that will only work 50% of
> the time but it will have a similar effect to this patch.

Ah, right.
Could we hear why you prefer pcp disabling over Andy's patch?

Honestly, I think the pcp has some problems.
But I have avoided changing the pcp because I don't understand its design.

Maybe we should discuss the current pcp behavior?


2008-10-03 06:48:44

by KOSAKI Motohiro

Subject: Re: [PATCH 0/4] Reclaim page capture v4

> Hi, Andy.
>
> I tested your patch in my desktop.
> The test is just kernel compile with single thread.
> My system environment is as follows.
>
> model name : Intel(R) Core(TM)2 Quad CPU Q6600 @ 2.40GHz
> MemTotal: 2065856 kB
>
> When I tested vanilla, compile time is as follows.
>
> 2433.53user 187.96system 42:05.99elapsed 103%CPU (0avgtext+0avgdata
> 0maxresident)k
> 588752inputs+4503408outputs (127major+55456246minor)pagefaults 0swaps
>
> When I tested your patch, as follows.
>
> 2489.63user 202.41system 44:47.71elapsed 100%CPU (0avgtext+0avgdata
> 0maxresident)k
> 538608inputs+4503928outputs (130major+55531561minor)pagefaults 0swaps
>
> Regresstion almost is above 2 minutes.
> Do you think It is a trivial?

Oops.
This is definitely a significant regression.


> I know your patch is good to allocate hugepage.
> But, I think many users don't need it, including embedded system and
> desktop users yet.
>
> So I suggest you made it enable optionally.

No.
If the patch has such a significant regression,
nobody will turn on its option.

We should fix that.

2008-10-03 12:39:34

by Christoph Lameter

Subject: Re: [PATCH 4/4] capture pages freed during direct reclaim for allocation by the reclaimer

KOSAKI Motohiro wrote:

>> Parallel allocations are less a problem if the freed order 0 pages get merged
>> immediately into the order 1 freelist. Of course that will only work 50% of
>> the time but it will have a similar effect to this patch.
>
> Ah, Right.
> Could we hear why you like pcp disabling than Andy's patch?

It's simpler code-wise.


> Honestly, I think pcp has some problem.

pcps are a particular problem on NUMA because the lists are replicated per
zone and per processor.

> But I avoid to change pcp because I don't understand its design.

In the worst case we see that pcps cause a 5% performance drop (sequential
alloc w/o free followed by sequential free w/o allocs). See my page allocator
tests in my git tree.

> Maybe, we should discuss currect pcp behavior?

pcps need improvement. The performance issues with the page allocator fastpath
are likely due to bloating of the fastpaths (antifrag did not do much good on
that level). Plus the current crop of processors is sensitive to cache footprint
issues (it seems that the tbench regression in the network stack is also due to
the same effect). Doubly linked lists are not good today because they touch
multiple cachelines.



2008-10-07 04:29:21

by Minchan Kim

Subject: Re: [PATCH 0/4] Reclaim page capture v4

On Fri, Oct 3, 2008 at 3:48 PM, KOSAKI Motohiro
<[email protected]> wrote:
>> Hi, Andy.
>>
>> I tested your patch in my desktop.
>> The test is just kernel compile with single thread.
>> My system environment is as follows.
>>
>> model name : Intel(R) Core(TM)2 Quad CPU Q6600 @ 2.40GHz
>> MemTotal: 2065856 kB
>>
>> When I tested vanilla, compile time is as follows.
>>
>> 2433.53user 187.96system 42:05.99elapsed 103%CPU (0avgtext+0avgdata
>> 0maxresident)k
>> 588752inputs+4503408outputs (127major+55456246minor)pagefaults 0swaps
>>
>> When I tested your patch, as follows.
>>
>> 2489.63user 202.41system 44:47.71elapsed 100%CPU (0avgtext+0avgdata
>> 0maxresident)k
>> 538608inputs+4503928outputs (130major+55531561minor)pagefaults 0swaps
>>
>> Regresstion almost is above 2 minutes.
>> Do you think It is a trivial?
>
> Ooops.
> this is definitly significant regression.
>
>
>> I know your patch is good to allocate hugepage.
>> But, I think many users don't need it, including embedded system and
>> desktop users yet.
>>
>> So I suggest you made it enable optionally.
>
> No.
> if the patch has this significant regression,
> nobody turn on its option.
>
> We should fix that.

I have been testing it.
But I can't reproduce such a regression.
I don't know why the regression happened at that time.

Sorry for the confusion.
Please ignore my earlier test result.

This is the new test result.

before

2346.24user 191.44system 42:07.28elapsed 100%CPU (0avgtext+0avgdata
0maxresident)k
458624inputs+4262728outputs (183major+52299730minor)pagefaults 0swaps


after

2349.75user 195.72system 42:16.36elapsed 100%CPU (0avgtext+0avgdata
0maxresident)k
475632inputs+4265208outputs (183major+52308969minor)pagefaults 0swaps

I think we can ignore the small time gap.
Sometimes, after is faster than before.

I conclude it doesn't have any regression on my desktop machine.

Tested-by: MinChan Kim <[email protected]>

--
Kind regards,
MinChan Kim