2006-11-21 22:50:27

by Mel Gorman

Subject: [PATCH 0/11] Avoiding fragmentation with page clustering v27

This is another post of the patches aimed at reducing external
fragmentation. Based on feedback from Christoph Lameter, it has been
reworked in two important respects. First, allocations should be grouped by
their ability to migrate, not just to be reclaimed. This lays the foundation
for using page migration as a defragmentation solution later. Second, the
per-cpu structures were larger in earlier versions, which is a problem on
machines with many CPUs. They remain the same size in this version.

Tests show that the kernel is far better at servicing high-order allocations
with these patches applied. kernbench figures show performance differences
of between -0.1% and +0.5% on three test machines (two ppc64 and one x86_64).

Changelog Since V27

o Renamed anti-fragmentation to Page Clustering. Anti-fragmentation gave
the mistaken impression that it was a 100% solution for high-order
allocations. Instead, it greatly increases the chances high-order
allocations will succeed and lays the foundation for defragmentation and
memory hot-remove to work properly
o Redefine page groupings based on the ability to migrate or reclaim instead
of on reclaimability alone
o Get rid of spurious inits
o Per-cpu lists are no longer split up per-type. Instead the per-cpu list is
searched for a page of the appropriate type
o Added more explanatory commentary
o Fix up bug in pageblock code where the bitmap was used before being initialised

Changelog Since V26
o Fix double init of lists in setup_pageset

Changelog Since V25
o Fix loop order of for_each_rclmtype_order so that order of loop matches args
o gfpflags_to_rclmtype uses gfp_t instead of unsigned long
o Rename get_pageblock_type() to get_page_rclmtype()
o Fix alignment problem in move_freepages()
o Add mechanism for assigning flags to blocks of pages instead of page->flags
o On fallback, do not examine the preferred list of free pages a second time

The purpose of these patches is to reduce external fragmentation by grouping
pages of related types together. When pages are migrated (or reclaimed under
memory pressure), large contiguous pages will be freed.

These patches work by categorising allocations by their ability to migrate:

Movable - The pages may be moved with the page migration mechanism. These are
generally userspace pages.

Reclaimable - These are allocations for some kernel caches that are
reclaimable or allocations that are known to be very short-lived.

Unmovable - These are pages allocated by the kernel that are not trivially
reclaimed. For example, the memory allocated for a loaded module would be
in this category. By default, allocations are considered to be of this
type.
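
As a brief illustration of the split (the calls below are lifted from the
callers updated in patch 1 and from standard allocator usage, so this is a
sketch of intent rather than new code in the series):

	/* Userspace data is movable: it can be migrated or paged out */
	page = alloc_page_vma(GFP_HIGHUSER|__GFP_MOVABLE, vma, address);

	/* A plain kernel allocation sets no flag and is grouped as unmovable */
	page = alloc_page(GFP_KERNEL);

Only callers that know their pages can be moved or reclaimed pass the new
flag; everything else falls into the unmovable group by default.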

Instead of having one MAX_ORDER-sized array of free lists in a zone, there
is one for each migrate type. Once a 2^MAX_ORDER block of pages is split for
an allocation of a given type, the remainder is added to the free lists for
that type, in effect reserving the block. Hence, over time, pages of the
different types tend to be clustered together.
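
For reference, the core data-structure change (introduced by patch 2 below)
is simply one free list per migrate type in each free area:

	struct free_area {
		struct list_head	free_list[MIGRATE_TYPES];
		unsigned long		nr_free;
	};

__rmqueue() then takes a migratetype argument and searches only the matching
list before falling back to the other type.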

When the preferred freelists are exhausted, the largest possible block is
taken from an alternative list. Buddies that are split from that large block
are placed on the preferred allocation-type freelists to mitigate
fragmentation.

This implementation gives a best effort for low fragmentation in all zones.
To be effective, min_free_kbytes needs to be set to a value of about 10% of
physical memory (10% was found by experimentation and may be workload
dependent). To get that value lower, more invasive changes are required.
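As a rough worked example, a machine with 4GiB of RAM has 4194304kB of
memory, so the 10% guideline corresponds to writing a value of about 419430
to /proc/sys/vm/min_free_kbytes.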

Our tests show that about 60-70% of physical memory can be allocated on
a desktop after a few days uptime. In benchmarks and stress tests, we are
finding that 80% of memory is available as contiguous blocks at the end of
the test. To compare, a standard kernel was getting < 1% of memory as large
pages on a desktop and about 8-12% of memory as large pages at the end of
stress tests.

Following this email are 8 patches that implement page clustering with an
additional 3 patches that provide an alternative to using page->flags. The
early patches introduce the split between movable and all other allocations.
Later we introduce a further split for reclaimable allocations. Note that
although in early patches an additional page flag is consumed, later patches
reuse the suspend bits, releasing this bit again. The last three patches
remove the restriction on suspend by introducing an alternative solution
for tracking page blocks which removes the need for any page bits.

--
Mel Gorman
Part-time Phd Student Linux Technology Center
University of Limerick IBM Dublin Software Lab


2006-11-21 22:50:45

by Mel Gorman

Subject: [PATCH 1/11] Add __GFP_MOVABLE flag and update callers


This patch adds a flag, __GFP_MOVABLE. Allocations using __GFP_MOVABLE in
this patch can either be migrated using the page migration mechanism or
reclaimed by syncing with backing storage (be it a file or swap) and
discarding.


Signed-off-by: Mel Gorman <[email protected]>
---

fs/block_dev.c | 2 +-
fs/buffer.c | 5 +++--
fs/compat.c | 2 +-
fs/exec.c | 2 +-
fs/inode.c | 2 +-
include/asm-i386/page.h | 3 ++-
include/asm-ia64/page.h | 4 +++-
include/asm-x86_64/page.h | 3 ++-
include/linux/gfp.h | 4 +++-
include/linux/highmem.h | 3 ++-
mm/hugetlb.c | 5 +++--
mm/memory.c | 6 ++++--
mm/swap_state.c | 3 ++-
13 files changed, 28 insertions(+), 16 deletions(-)

diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.19-rc5-mm2-clean/fs/block_dev.c linux-2.6.19-rc5-mm2-001_clustering_flags/fs/block_dev.c
--- linux-2.6.19-rc5-mm2-clean/fs/block_dev.c 2006-11-14 14:01:37.000000000 +0000
+++ linux-2.6.19-rc5-mm2-001_clustering_flags/fs/block_dev.c 2006-11-21 10:47:11.000000000 +0000
@@ -380,7 +380,7 @@ struct block_device *bdget(dev_t dev)
inode->i_rdev = dev;
inode->i_bdev = bdev;
inode->i_data.a_ops = &def_blk_aops;
- mapping_set_gfp_mask(&inode->i_data, GFP_USER);
+ mapping_set_gfp_mask(&inode->i_data, GFP_USER|__GFP_MOVABLE);
inode->i_data.backing_dev_info = &default_backing_dev_info;
spin_lock(&bdev_lock);
list_add(&bdev->bd_list, &all_bdevs);
diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.19-rc5-mm2-clean/fs/buffer.c linux-2.6.19-rc5-mm2-001_clustering_flags/fs/buffer.c
--- linux-2.6.19-rc5-mm2-clean/fs/buffer.c 2006-11-14 14:01:37.000000000 +0000
+++ linux-2.6.19-rc5-mm2-001_clustering_flags/fs/buffer.c 2006-11-21 10:47:11.000000000 +0000
@@ -1048,7 +1048,8 @@ grow_dev_page(struct block_device *bdev,
struct page *page;
struct buffer_head *bh;

- page = find_or_create_page(inode->i_mapping, index, GFP_NOFS);
+ page = find_or_create_page(inode->i_mapping, index,
+ GFP_NOFS|__GFP_MOVABLE);
if (!page)
return NULL;

@@ -2723,7 +2724,7 @@ int submit_bh(int rw, struct buffer_head
* from here on down, it's all bio -- do the initial mapping,
* submit_bio -> generic_make_request may further map this bio around
*/
- bio = bio_alloc(GFP_NOIO, 1);
+ bio = bio_alloc(GFP_NOIO|__GFP_MOVABLE, 1);

bio->bi_sector = bh->b_blocknr * (bh->b_size >> 9);
bio->bi_bdev = bh->b_bdev;
diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.19-rc5-mm2-clean/fs/compat.c linux-2.6.19-rc5-mm2-001_clustering_flags/fs/compat.c
--- linux-2.6.19-rc5-mm2-clean/fs/compat.c 2006-11-14 14:01:37.000000000 +0000
+++ linux-2.6.19-rc5-mm2-001_clustering_flags/fs/compat.c 2006-11-21 10:47:11.000000000 +0000
@@ -1419,7 +1419,7 @@ static int compat_copy_strings(int argc,
page = bprm->page[i];
new = 0;
if (!page) {
- page = alloc_page(GFP_HIGHUSER);
+ page = alloc_page(GFP_HIGHUSER|__GFP_MOVABLE);
bprm->page[i] = page;
if (!page) {
ret = -ENOMEM;
diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.19-rc5-mm2-clean/fs/exec.c linux-2.6.19-rc5-mm2-001_clustering_flags/fs/exec.c
--- linux-2.6.19-rc5-mm2-clean/fs/exec.c 2006-11-14 14:01:37.000000000 +0000
+++ linux-2.6.19-rc5-mm2-001_clustering_flags/fs/exec.c 2006-11-21 10:47:11.000000000 +0000
@@ -239,7 +239,7 @@ static int copy_strings(int argc, char _
page = bprm->page[i];
new = 0;
if (!page) {
- page = alloc_page(GFP_HIGHUSER);
+ page = alloc_page(GFP_HIGHUSER|__GFP_MOVABLE);
bprm->page[i] = page;
if (!page) {
ret = -ENOMEM;
diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.19-rc5-mm2-clean/fs/inode.c linux-2.6.19-rc5-mm2-001_clustering_flags/fs/inode.c
--- linux-2.6.19-rc5-mm2-clean/fs/inode.c 2006-11-14 14:01:37.000000000 +0000
+++ linux-2.6.19-rc5-mm2-001_clustering_flags/fs/inode.c 2006-11-21 10:47:11.000000000 +0000
@@ -146,7 +146,7 @@ static struct inode *alloc_inode(struct
mapping->a_ops = &empty_aops;
mapping->host = inode;
mapping->flags = 0;
- mapping_set_gfp_mask(mapping, GFP_HIGHUSER);
+ mapping_set_gfp_mask(mapping, GFP_HIGHUSER|__GFP_MOVABLE);
mapping->assoc_mapping = NULL;
mapping->backing_dev_info = &default_backing_dev_info;

diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.19-rc5-mm2-clean/include/asm-i386/page.h linux-2.6.19-rc5-mm2-001_clustering_flags/include/asm-i386/page.h
--- linux-2.6.19-rc5-mm2-clean/include/asm-i386/page.h 2006-11-14 14:01:37.000000000 +0000
+++ linux-2.6.19-rc5-mm2-001_clustering_flags/include/asm-i386/page.h 2006-11-21 10:47:11.000000000 +0000
@@ -35,7 +35,8 @@
#define clear_user_page(page, vaddr, pg) clear_page(page)
#define copy_user_page(to, from, vaddr, pg) copy_page(to, from)

-#define alloc_zeroed_user_highpage(vma, vaddr) alloc_page_vma(GFP_HIGHUSER | __GFP_ZERO, vma, vaddr)
+#define alloc_zeroed_user_highpage(vma, vaddr) \
+ alloc_page_vma(GFP_HIGHUSER|__GFP_ZERO|__GFP_MOVABLE, vma, vaddr)
#define __HAVE_ARCH_ALLOC_ZEROED_USER_HIGHPAGE

/*
diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.19-rc5-mm2-clean/include/asm-ia64/page.h linux-2.6.19-rc5-mm2-001_clustering_flags/include/asm-ia64/page.h
--- linux-2.6.19-rc5-mm2-clean/include/asm-ia64/page.h 2006-11-08 02:24:20.000000000 +0000
+++ linux-2.6.19-rc5-mm2-001_clustering_flags/include/asm-ia64/page.h 2006-11-21 10:47:11.000000000 +0000
@@ -89,7 +89,9 @@ do { \

#define alloc_zeroed_user_highpage(vma, vaddr) \
({ \
- struct page *page = alloc_page_vma(GFP_HIGHUSER | __GFP_ZERO, vma, vaddr); \
+ struct page *page = alloc_page_vma( \
+ GFP_HIGHUSER | __GFP_ZERO | __GFP_MOVABLE, \
+ vma, vaddr); \
if (page) \
flush_dcache_page(page); \
page; \
diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.19-rc5-mm2-clean/include/asm-x86_64/page.h linux-2.6.19-rc5-mm2-001_clustering_flags/include/asm-x86_64/page.h
--- linux-2.6.19-rc5-mm2-clean/include/asm-x86_64/page.h 2006-11-08 02:24:20.000000000 +0000
+++ linux-2.6.19-rc5-mm2-001_clustering_flags/include/asm-x86_64/page.h 2006-11-21 10:47:11.000000000 +0000
@@ -51,7 +51,8 @@ void copy_page(void *, void *);
#define clear_user_page(page, vaddr, pg) clear_page(page)
#define copy_user_page(to, from, vaddr, pg) copy_page(to, from)

-#define alloc_zeroed_user_highpage(vma, vaddr) alloc_page_vma(GFP_HIGHUSER | __GFP_ZERO, vma, vaddr)
+#define alloc_zeroed_user_highpage(vma, vaddr) \
+ alloc_page_vma(GFP_HIGHUSER|__GFP_ZERO|__GFP_MOVABLE, vma, vaddr)
#define __HAVE_ARCH_ALLOC_ZEROED_USER_HIGHPAGE
/*
* These are used to make use of C type-checking..
diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.19-rc5-mm2-clean/include/linux/gfp.h linux-2.6.19-rc5-mm2-001_clustering_flags/include/linux/gfp.h
--- linux-2.6.19-rc5-mm2-clean/include/linux/gfp.h 2006-11-14 14:01:37.000000000 +0000
+++ linux-2.6.19-rc5-mm2-001_clustering_flags/include/linux/gfp.h 2006-11-21 10:47:11.000000000 +0000
@@ -46,6 +46,7 @@ struct vm_area_struct;
#define __GFP_NOMEMALLOC ((__force gfp_t)0x10000u) /* Don't use emergency reserves */
#define __GFP_HARDWALL ((__force gfp_t)0x20000u) /* Enforce hardwall cpuset memory allocs */
#define __GFP_THISNODE ((__force gfp_t)0x40000u)/* No fallback, no policies */
+#define __GFP_MOVABLE ((__force gfp_t)0x80000u) /* Page is movable */

#define __GFP_BITS_SHIFT 20 /* Room for 20 __GFP_FOO bits */
#define __GFP_BITS_MASK ((__force gfp_t)((1 << __GFP_BITS_SHIFT) - 1))
@@ -54,7 +55,8 @@ struct vm_area_struct;
#define GFP_LEVEL_MASK (__GFP_WAIT|__GFP_HIGH|__GFP_IO|__GFP_FS| \
__GFP_COLD|__GFP_NOWARN|__GFP_REPEAT| \
__GFP_NOFAIL|__GFP_NORETRY|__GFP_NO_GROW|__GFP_COMP| \
- __GFP_NOMEMALLOC|__GFP_HARDWALL|__GFP_THISNODE)
+ __GFP_NOMEMALLOC|__GFP_HARDWALL|__GFP_THISNODE|\
+ __GFP_MOVABLE)

/* This equals 0, but use constants in case they ever change */
#define GFP_NOWAIT (GFP_ATOMIC & ~__GFP_HIGH)
diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.19-rc5-mm2-clean/include/linux/highmem.h linux-2.6.19-rc5-mm2-001_clustering_flags/include/linux/highmem.h
--- linux-2.6.19-rc5-mm2-clean/include/linux/highmem.h 2006-11-14 14:01:37.000000000 +0000
+++ linux-2.6.19-rc5-mm2-001_clustering_flags/include/linux/highmem.h 2006-11-21 10:47:11.000000000 +0000
@@ -65,7 +65,8 @@ static inline void clear_user_highpage(s
static inline struct page *
alloc_zeroed_user_highpage(struct vm_area_struct *vma, unsigned long vaddr)
{
- struct page *page = alloc_page_vma(GFP_HIGHUSER, vma, vaddr);
+ struct page *page = alloc_page_vma(GFP_HIGHUSER|__GFP_MOVABLE,
+ vma, vaddr);

if (page)
clear_user_highpage(page, vaddr);
diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.19-rc5-mm2-clean/mm/hugetlb.c linux-2.6.19-rc5-mm2-001_clustering_flags/mm/hugetlb.c
--- linux-2.6.19-rc5-mm2-clean/mm/hugetlb.c 2006-11-14 14:01:37.000000000 +0000
+++ linux-2.6.19-rc5-mm2-001_clustering_flags/mm/hugetlb.c 2006-11-21 10:47:11.000000000 +0000
@@ -103,8 +103,9 @@ static int alloc_fresh_huge_page(void)
{
static int nid = 0;
struct page *page;
- page = alloc_pages_node(nid, GFP_HIGHUSER|__GFP_COMP|__GFP_NOWARN,
- HUGETLB_PAGE_ORDER);
+ page = alloc_pages_node(nid,
+ GFP_HIGHUSER|__GFP_COMP|__GFP_NOWARN|__GFP_MOVABLE,
+ HUGETLB_PAGE_ORDER);
nid = next_node(nid, node_online_map);
if (nid == MAX_NUMNODES)
nid = first_node(node_online_map);
diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.19-rc5-mm2-clean/mm/memory.c linux-2.6.19-rc5-mm2-001_clustering_flags/mm/memory.c
--- linux-2.6.19-rc5-mm2-clean/mm/memory.c 2006-11-14 14:01:37.000000000 +0000
+++ linux-2.6.19-rc5-mm2-001_clustering_flags/mm/memory.c 2006-11-21 10:47:11.000000000 +0000
@@ -1564,7 +1564,8 @@ gotten:
if (!new_page)
goto oom;
} else {
- new_page = alloc_page_vma(GFP_HIGHUSER, vma, address);
+ new_page = alloc_page_vma(GFP_HIGHUSER|__GFP_MOVABLE,
+ vma, address);
if (!new_page)
goto oom;
cow_user_page(new_page, old_page, address);
@@ -2188,7 +2189,8 @@ retry:

if (unlikely(anon_vma_prepare(vma)))
goto oom;
- page = alloc_page_vma(GFP_HIGHUSER, vma, address);
+ page = alloc_page_vma(GFP_HIGHUSER|__GFP_MOVABLE,
+ vma, address);
if (!page)
goto oom;
copy_user_highpage(page, new_page, address);
diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.19-rc5-mm2-clean/mm/swap_state.c linux-2.6.19-rc5-mm2-001_clustering_flags/mm/swap_state.c
--- linux-2.6.19-rc5-mm2-clean/mm/swap_state.c 2006-11-14 14:01:37.000000000 +0000
+++ linux-2.6.19-rc5-mm2-001_clustering_flags/mm/swap_state.c 2006-11-21 10:47:11.000000000 +0000
@@ -343,7 +343,8 @@ struct page *read_swap_cache_async(swp_e
* Get a new page to read into from swap.
*/
if (!new_page) {
- new_page = alloc_page_vma(GFP_HIGHUSER, vma, addr);
+ new_page = alloc_page_vma(GFP_HIGHUSER|__GFP_MOVABLE,
+ vma, addr);
if (!new_page)
break; /* Out of memory */
}

2006-11-21 22:51:25

by Mel Gorman

Subject: [PATCH 3/11] Choose pages from the per-cpu list based on migration type


The freelists for each migrate type can slowly become polluted due to the
per-cpu list. Consider the following sequence of events:

1. A 2^(MAX_ORDER-1) list is reserved for __GFP_MOVABLE pages
2. An order-0 page is allocated from the newly reserved block
3. The page is freed and placed on the per-cpu list
4. alloc_page() is called with GFP_KERNEL as the gfp_mask
5. The per-cpu list is used to satisfy the allocation

This results in a kernel page sitting in the middle of a migratable region.
This patch prevents the leak from occurring by storing the migrate type of
the page in page->private. On allocation, only a page of the desired type is
returned; otherwise, more pages are allocated to the per-cpu list. This may
temporarily allow a per-cpu list to exceed the pcp->high limit, but it is
corrected on the next free. Care is taken to preserve the hotness of
recently freed pages.


Signed-off-by: Mel Gorman <[email protected]>
---

page_alloc.c | 34 ++++++++++++++++++++++++++++------
1 files changed, 28 insertions(+), 6 deletions(-)

diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.19-rc5-mm2-003_clustering_core/mm/page_alloc.c linux-2.6.19-rc5-mm2-004_percpu/mm/page_alloc.c
--- linux-2.6.19-rc5-mm2-003_clustering_core/mm/page_alloc.c 2006-11-21 10:48:55.000000000 +0000
+++ linux-2.6.19-rc5-mm2-004_percpu/mm/page_alloc.c 2006-11-21 10:50:40.000000000 +0000
@@ -415,6 +415,7 @@ static inline void __free_one_page(struc
{
unsigned long page_idx;
int order_size = 1 << order;
+ int migratetype = get_page_migratetype(page);

if (unlikely(PageCompound(page)))
destroy_compound_page(page, order);
@@ -449,8 +450,7 @@ static inline void __free_one_page(struc
order++;
}
set_page_order(page, order);
- list_add(&page->lru,
- &zone->free_area[order].free_list[get_page_migratetype(page)]);
+ list_add(&page->lru, &zone->free_area[order].free_list[migratetype]);
zone->free_area[order].nr_free++;
}

@@ -738,7 +738,8 @@ static int rmqueue_bulk(struct zone *zon
struct page *page = __rmqueue(zone, order, migratetype);
if (unlikely(page == NULL))
break;
- list_add_tail(&page->lru, list);
+ list_add(&page->lru, list);
+ set_page_private(page, migratetype);
}
spin_unlock(&zone->lock);
return i;
@@ -876,6 +877,7 @@ static void fastcall free_hot_cold_page(
local_irq_save(flags);
__count_vm_event(PGFREE);
list_add(&page->lru, &pcp->list);
+ set_page_private(page, get_page_migratetype(page));
pcp->count++;
if (pcp->count >= pcp->high) {
free_pages_bulk(zone, pcp->batch, &pcp->list, 0);
@@ -940,9 +942,29 @@ again:
if (unlikely(!pcp->count))
goto failed;
}
- page = list_entry(pcp->list.next, struct page, lru);
- list_del(&page->lru);
- pcp->count--;
+
+ /* Find a page of the appropriate migrate type */
+ list_for_each_entry(page, &pcp->list, lru) {
+ if (page_private(page) == migratetype) {
+ list_del(&page->lru);
+ pcp->count--;
+ break;
+ }
+ }
+
+ /*
+ * Check if a page of the appropriate migrate type
+ * was found. If not, allocate more to the pcp list
+ */
+ if (&page->lru == &pcp->list) {
+ pcp->count += rmqueue_bulk(zone, 0,
+ pcp->batch, &pcp->list, migratetype);
+ page = list_entry(pcp->list.next, struct page, lru);
+ VM_BUG_ON(page_private(page) != migratetype);
+ list_del(&page->lru);
+ pcp->count--;
+ }
+
} else {
spin_lock_irqsave(&zone->lock, flags);
page = __rmqueue(zone, order, migratetype);

2006-11-21 22:51:46

by Mel Gorman

Subject: [PATCH 4/11] Add a configure option for page clustering


Page clustering has some memory overhead and a more complex allocation
path. This patch allows the strategy to be disabled on small-memory systems
or when a workload is known to suffer because of it. It also serves to show
where the page clustering strategy interacts with the standard buddy
allocator.


Signed-off-by: Mel Gorman <[email protected]>
Signed-off-by: Joel Schopp <[email protected]>
---

include/linux/mmzone.h | 6 ++++++
init/Kconfig | 14 ++++++++++++++
mm/page_alloc.c | 27 ++++++++++++++++++++++++++-
3 files changed, 46 insertions(+), 1 deletion(-)

diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.19-rc5-mm2-004_percpu/include/linux/mmzone.h linux-2.6.19-rc5-mm2-005_configurable/include/linux/mmzone.h
--- linux-2.6.19-rc5-mm2-004_percpu/include/linux/mmzone.h 2006-11-21 10:48:55.000000000 +0000
+++ linux-2.6.19-rc5-mm2-005_configurable/include/linux/mmzone.h 2006-11-21 10:52:26.000000000 +0000
@@ -24,9 +24,15 @@
#endif
#define MAX_ORDER_NR_PAGES (1 << (MAX_ORDER - 1))

+#ifdef CONFIG_PAGE_CLUSTERING
#define MIGRATE_UNMOVABLE 0
#define MIGRATE_MOVABLE 1
#define MIGRATE_TYPES 2
+#else
+#define MIGRATE_UNMOVABLE 0
+#define MIGRATE_MOVABLE 0
+#define MIGRATE_TYPES 1
+#endif

#define for_each_migratetype_order(order, type) \
for (order = 0; order < MAX_ORDER; order++) \
diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.19-rc5-mm2-004_percpu/init/Kconfig linux-2.6.19-rc5-mm2-005_configurable/init/Kconfig
--- linux-2.6.19-rc5-mm2-004_percpu/init/Kconfig 2006-11-14 14:01:37.000000000 +0000
+++ linux-2.6.19-rc5-mm2-005_configurable/init/Kconfig 2006-11-21 10:52:26.000000000 +0000
@@ -500,6 +500,20 @@ config SLOB
default !SLAB
bool

+config PAGE_CLUSTERING
+ bool "Cluster movable pages together in the page allocator"
+ def_bool n
+ help
+ The standard allocator will fragment memory over time which means
+ that high order allocations will fail even if kswapd is running. If
+ this option is set, the allocator will try and group page types
+ based on their ability to migrate or reclaim. This is a best effort
+ attempt at lowering fragmentation which a few workloads care about.
+ The loss is a more complex allocator that may perform slower. If
+ you are interested in working with large pages, say Y and set
+ /proc/sys/vm/min_free_kbytes to be 10% of physical memory. Otherwise
+ say N
+
menu "Loadable module support"

config MODULES
diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.19-rc5-mm2-004_percpu/mm/page_alloc.c linux-2.6.19-rc5-mm2-005_configurable/mm/page_alloc.c
--- linux-2.6.19-rc5-mm2-004_percpu/mm/page_alloc.c 2006-11-21 10:50:40.000000000 +0000
+++ linux-2.6.19-rc5-mm2-005_configurable/mm/page_alloc.c 2006-11-21 10:52:26.000000000 +0000
@@ -136,6 +136,7 @@ static unsigned long __initdata dma_rese
#endif /* CONFIG_MEMORY_HOTPLUG_RESERVE */
#endif /* CONFIG_ARCH_POPULATES_NODE_MAP */

+#ifdef CONFIG_PAGE_CLUSTERING
static inline int get_page_migratetype(struct page *page)
{
return (PageMovable(page) != 0);
@@ -145,6 +146,17 @@ static inline int gfpflags_to_migratetyp
{
return ((gfp_flags & __GFP_MOVABLE) != 0);
}
+#else
+static inline int get_page_migratetype(struct page *page)
+{
+ return MIGRATE_UNMOVABLE;
+}
+
+static inline int gfpflags_to_migratetype(gfp_t gfp_flags)
+{
+ return MIGRATE_UNMOVABLE;
+}
+#endif /* CONFIG_PAGE_CLUSTERING */

#ifdef CONFIG_DEBUG_VM
static int page_outside_zone_boundaries(struct zone *zone, struct page *page)
@@ -648,6 +660,7 @@ static int prep_new_page(struct page *pa
return 0;
}

+#ifdef CONFIG_PAGE_CLUSTERING
/* Remove an element from the buddy allocator from the fallback list */
static struct page *__rmqueue_fallback(struct zone *zone, int order,
int start_migratetype)
@@ -685,6 +698,13 @@ static struct page *__rmqueue_fallback(s

return NULL;
}
+#else
+static struct page *__rmqueue_fallback(struct zone *zone, unsigned int order,
+ int migratetype)
+{
+ return NULL;
+}
+#endif /* CONFIG_PAGE_CLUSTERING */

/*
* Do the hard work of removing an element from the buddy allocator.
@@ -877,7 +897,6 @@ static void fastcall free_hot_cold_page(
local_irq_save(flags);
__count_vm_event(PGFREE);
list_add(&page->lru, &pcp->list);
- set_page_private(page, get_page_migratetype(page));
pcp->count++;
if (pcp->count >= pcp->high) {
free_pages_bulk(zone, pcp->batch, &pcp->list, 0);
@@ -943,6 +962,7 @@ again:
goto failed;
}

+#ifdef CONFIG_PAGE_CLUSTERING
/* Find a page of the appropriate migrate type */
list_for_each_entry(page, &pcp->list, lru) {
if (page_private(page) == migratetype) {
@@ -964,6 +984,11 @@ again:
list_del(&page->lru);
pcp->count--;
}
+#else
+ page = list_entry(pcp->list.next, struct page, lru);
+ list_del(&page->lru);
+ pcp->count--;
+#endif /* CONFIG_PAGE_CLUSTERING */

} else {
spin_lock_irqsave(&zone->lock, flags);

2006-11-21 22:51:10

by Mel Gorman

Subject: [PATCH 2/11] Split the free lists for movable and unmovable allocations


This patch adds the core of the page clustering strategy. It works by
grouping pages together based on their ability to be migrated or reclaimed.
The single free list in each zone->free_area is broken into MIGRATE_TYPES
lists.


Signed-off-by: Mel Gorman <[email protected]>
---

include/linux/mmzone.h | 10 ++-
include/linux/page-flags.h | 7 ++
mm/page_alloc.c | 123 ++++++++++++++++++++++++++++++++--------
3 files changed, 116 insertions(+), 24 deletions(-)

diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.19-rc5-mm2-001_clustering_flags/include/linux/mmzone.h linux-2.6.19-rc5-mm2-003_clustering_core/include/linux/mmzone.h
--- linux-2.6.19-rc5-mm2-001_clustering_flags/include/linux/mmzone.h 2006-11-14 14:01:37.000000000 +0000
+++ linux-2.6.19-rc5-mm2-003_clustering_core/include/linux/mmzone.h 2006-11-21 10:48:55.000000000 +0000
@@ -24,8 +24,16 @@
#endif
#define MAX_ORDER_NR_PAGES (1 << (MAX_ORDER - 1))

+#define MIGRATE_UNMOVABLE 0
+#define MIGRATE_MOVABLE 1
+#define MIGRATE_TYPES 2
+
+#define for_each_migratetype_order(order, type) \
+ for (order = 0; order < MAX_ORDER; order++) \
+ for (type = 0; type < MIGRATE_TYPES; type++)
+
struct free_area {
- struct list_head free_list;
+ struct list_head free_list[MIGRATE_TYPES];
unsigned long nr_free;
};

diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.19-rc5-mm2-001_clustering_flags/include/linux/page-flags.h linux-2.6.19-rc5-mm2-003_clustering_core/include/linux/page-flags.h
--- linux-2.6.19-rc5-mm2-001_clustering_flags/include/linux/page-flags.h 2006-11-14 14:01:37.000000000 +0000
+++ linux-2.6.19-rc5-mm2-003_clustering_core/include/linux/page-flags.h 2006-11-21 10:48:55.000000000 +0000
@@ -93,6 +93,7 @@

#define PG_readahead 20 /* Reminder to do readahead */

+#define PG_movable 21 /* Page may be moved */

#if (BITS_PER_LONG > 32)
/*
@@ -253,6 +254,12 @@ static inline void SetPageUptodate(struc
#define SetPageReadahead(page) set_bit(PG_readahead, &(page)->flags)
#define TestClearPageReadahead(page) test_and_clear_bit(PG_readahead, &(page)->flags)

+#define PageMovable(page) test_bit(PG_movable, &(page)->flags)
+#define SetPageMovable(page) set_bit(PG_movable, &(page)->flags)
+#define ClearPageMovable(page) clear_bit(PG_movable, &(page)->flags)
+#define __SetPageMovable(page) __set_bit(PG_movable, &(page)->flags)
+#define __ClearPageMovable(page) __clear_bit(PG_movable, &(page)->flags)
+
struct page; /* forward declaration */

int test_clear_page_dirty(struct page *page);
diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.19-rc5-mm2-001_clustering_flags/mm/page_alloc.c linux-2.6.19-rc5-mm2-003_clustering_core/mm/page_alloc.c
--- linux-2.6.19-rc5-mm2-001_clustering_flags/mm/page_alloc.c 2006-11-14 14:01:37.000000000 +0000
+++ linux-2.6.19-rc5-mm2-003_clustering_core/mm/page_alloc.c 2006-11-21 10:48:55.000000000 +0000
@@ -136,6 +136,16 @@ static unsigned long __initdata dma_rese
#endif /* CONFIG_MEMORY_HOTPLUG_RESERVE */
#endif /* CONFIG_ARCH_POPULATES_NODE_MAP */

+static inline int get_page_migratetype(struct page *page)
+{
+ return (PageMovable(page) != 0);
+}
+
+static inline int gfpflags_to_migratetype(gfp_t gfp_flags)
+{
+ return ((gfp_flags & __GFP_MOVABLE) != 0);
+}
+
#ifdef CONFIG_DEBUG_VM
static int page_outside_zone_boundaries(struct zone *zone, struct page *page)
{
@@ -411,13 +421,19 @@ static inline void __free_one_page(struc

page_idx = page_to_pfn(page) & ((1 << MAX_ORDER) - 1);

+ /*
+ * Free pages are always marked movable so the bits are in a known
+ * state on alloc. As movable allocations are the most common, this
+ * will result in less bit manipulations
+ */
+ __SetPageMovable(page);
+
VM_BUG_ON(page_idx & (order_size - 1));
VM_BUG_ON(bad_range(zone, page));

zone->free_pages += order_size;
while (order < MAX_ORDER-1) {
unsigned long combined_idx;
- struct free_area *area;
struct page *buddy;

buddy = __page_find_buddy(page, page_idx, order);
@@ -425,8 +441,7 @@ static inline void __free_one_page(struc
break; /* Move the buddy up one level. */

list_del(&buddy->lru);
- area = zone->free_area + order;
- area->nr_free--;
+ zone->free_area[order].nr_free--;
rmv_page_order(buddy);
combined_idx = __find_combined_index(page_idx, order);
page = page + (combined_idx - page_idx);
@@ -434,7 +449,8 @@ static inline void __free_one_page(struc
order++;
}
set_page_order(page, order);
- list_add(&page->lru, &zone->free_area[order].free_list);
+ list_add(&page->lru,
+ &zone->free_area[order].free_list[get_page_migratetype(page)]);
zone->free_area[order].nr_free++;
}

@@ -569,7 +585,8 @@ void fastcall __init __free_pages_bootme
* -- wli
*/
static inline void expand(struct zone *zone, struct page *page,
- int low, int high, struct free_area *area)
+ int low, int high, struct free_area *area,
+ int migratetype)
{
unsigned long size = 1 << high;

@@ -578,7 +595,7 @@ static inline void expand(struct zone *z
high--;
size >>= 1;
VM_BUG_ON(bad_range(zone, &page[size]));
- list_add(&page[size].lru, &area->free_list);
+ list_add(&page[size].lru, &area->free_list[migratetype]);
area->nr_free++;
set_page_order(&page[size], high);
}
@@ -631,31 +648,78 @@ static int prep_new_page(struct page *pa
return 0;
}

+/* Remove an element from the buddy allocator from the fallback list */
+static struct page *__rmqueue_fallback(struct zone *zone, int order,
+ int start_migratetype)
+{
+ struct free_area * area;
+ int current_order;
+ struct page *page;
+ int migratetype = !start_migratetype;
+
+ /* Find the largest possible block of pages in the other list */
+ for (current_order = MAX_ORDER-1; current_order >= order;
+ --current_order) {
+ area = &(zone->free_area[current_order]);
+ if (list_empty(&area->free_list[migratetype]))
+ continue;
+
+ page = list_entry(area->free_list[migratetype].next,
+ struct page, lru);
+ area->nr_free--;
+
+ /*
+ * If breaking a large block of pages, place the buddies
+ * on the preferred allocation list
+ */
+ if (unlikely(current_order >= MAX_ORDER / 2))
+ migratetype = !migratetype;
+
+ /* Remove the page from the freelists */
+ list_del(&page->lru);
+ rmv_page_order(page);
+ zone->free_pages -= 1UL << order;
+ expand(zone, page, order, current_order, area, migratetype);
+ return page;
+ }
+
+ return NULL;
+}
+
/*
* Do the hard work of removing an element from the buddy allocator.
* Call me with the zone->lock already held.
*/
-static struct page *__rmqueue(struct zone *zone, unsigned int order)
+static struct page *__rmqueue(struct zone *zone, unsigned int order,
+ int migratetype)
{
struct free_area * area;
unsigned int current_order;
struct page *page;

+ /* Find a page of the appropriate size in the preferred list */
for (current_order = order; current_order < MAX_ORDER; ++current_order) {
- area = zone->free_area + current_order;
- if (list_empty(&area->free_list))
+ area = &(zone->free_area[current_order]);
+ if (list_empty(&area->free_list[migratetype]))
continue;

- page = list_entry(area->free_list.next, struct page, lru);
+ page = list_entry(area->free_list[migratetype].next,
+ struct page, lru);
list_del(&page->lru);
rmv_page_order(page);
area->nr_free--;
zone->free_pages -= 1UL << order;
- expand(zone, page, order, current_order, area);
- return page;
+ expand(zone, page, order, current_order, area, migratetype);
+ goto got_page;
}

- return NULL;
+ page = __rmqueue_fallback(zone, order, migratetype);
+
+got_page:
+ if (unlikely(migratetype == MIGRATE_UNMOVABLE) && page)
+ __ClearPageMovable(page);
+
+ return page;
}

/*
@@ -664,13 +728,14 @@ static struct page *__rmqueue(struct zon
* Returns the number of new pages which were placed at *list.
*/
static int rmqueue_bulk(struct zone *zone, unsigned int order,
- unsigned long count, struct list_head *list)
+ unsigned long count, struct list_head *list,
+ int migratetype)
{
int i;

spin_lock(&zone->lock);
for (i = 0; i < count; ++i) {
- struct page *page = __rmqueue(zone, order);
+ struct page *page = __rmqueue(zone, order, migratetype);
if (unlikely(page == NULL))
break;
list_add_tail(&page->lru, list);
@@ -745,7 +810,7 @@ void mark_free_pages(struct zone *zone)
{
unsigned long pfn, max_zone_pfn;
unsigned long flags;
- int order;
+ int order, t;
struct list_head *curr;

if (!zone->spanned_pages)
@@ -762,14 +827,15 @@ void mark_free_pages(struct zone *zone)
ClearPageNosaveFree(page);
}

- for (order = MAX_ORDER - 1; order >= 0; --order)
- list_for_each(curr, &zone->free_area[order].free_list) {
+ for_each_migratetype_order(order, t) {
+ list_for_each(curr, &zone->free_area[order].free_list[t]) {
unsigned long i;

pfn = page_to_pfn(list_entry(curr, struct page, lru));
for (i = 0; i < (1UL << order); i++)
SetPageNosaveFree(pfn_to_page(pfn + i));
}
+ }

spin_unlock_irqrestore(&zone->lock, flags);
}
@@ -859,6 +925,7 @@ static struct page *buffered_rmqueue(str
struct page *page;
int cold = !!(gfp_flags & __GFP_COLD);
int cpu;
+ int migratetype = gfpflags_to_migratetype(gfp_flags);

again:
cpu = get_cpu();
@@ -869,7 +936,7 @@ again:
local_irq_save(flags);
if (!pcp->count) {
pcp->count = rmqueue_bulk(zone, 0,
- pcp->batch, &pcp->list);
+ pcp->batch, &pcp->list, migratetype);
if (unlikely(!pcp->count))
goto failed;
}
@@ -878,7 +945,7 @@ again:
pcp->count--;
} else {
spin_lock_irqsave(&zone->lock, flags);
- page = __rmqueue(zone, order);
+ page = __rmqueue(zone, order, migratetype);
spin_unlock(&zone->lock);
if (!page)
goto failed;
@@ -2046,6 +2113,16 @@ void __meminit memmap_init_zone(unsigned
init_page_count(page);
reset_page_mapcount(page);
SetPageReserved(page);
+
+ /*
+ * Mark the page movable so that blocks are reserved for
+ * movable at startup. This will force kernel allocations
+ * to reserve their blocks rather than leaking throughout
+ * the address space during boot when many long-lived
+ * kernel allocations are made
+ */
+ SetPageMovable(page);
+
INIT_LIST_HEAD(&page->lru);
#ifdef WANT_PAGE_VIRTUAL
/* The shift won't overflow because ZONE_NORMAL is below 4G. */
@@ -2061,9 +2138,9 @@ void __meminit memmap_init_zone(unsigned
void zone_init_free_lists(struct pglist_data *pgdat, struct zone *zone,
unsigned long size)
{
- int order;
- for (order = 0; order < MAX_ORDER ; order++) {
- INIT_LIST_HEAD(&zone->free_area[order].free_list);
+ int order, t;
+ for_each_migratetype_order(order, t) {
+ INIT_LIST_HEAD(&zone->free_area[order].free_list[t]);
zone->free_area[order].nr_free = 0;
}
}

2006-11-21 22:52:17

by Mel Gorman

Subject: [PATCH 5/11] Drain per-cpu lists when high-order allocations fail


Per-cpu pages can accidentally cause fragmentation because they are free
but pinned in an otherwise contiguous block. When this patch is applied,
the per-cpu caches are drained after direct reclaim is entered if the
requested order is greater than 0. It simply reuses the code used by
suspend and CPU hotplug.

Signed-off-by: Mel Gorman <[email protected]>
---

Kconfig | 4 ++++
page_alloc.c | 33 ++++++++++++++++++++++++++++++---
2 files changed, 34 insertions(+), 3 deletions(-)

diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.19-rc5-mm2-005_configurable/mm/Kconfig linux-2.6.19-rc5-mm2-006_drainpercpu/mm/Kconfig
--- linux-2.6.19-rc5-mm2-005_configurable/mm/Kconfig 2006-11-14 14:01:37.000000000 +0000
+++ linux-2.6.19-rc5-mm2-006_drainpercpu/mm/Kconfig 2006-11-21 10:54:11.000000000 +0000
@@ -247,3 +247,7 @@ config READAHEAD_SMOOTH_AGING
- have the danger of readahead thrashing(i.e. memory tight)

This feature is only available on non-NUMA systems.
+
+config NEED_DRAIN_PERCPU_PAGES
+ def_bool y
+ depends on PM || HOTPLUG_CPU || PAGE_CLUSTERING
diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.19-rc5-mm2-005_configurable/mm/page_alloc.c linux-2.6.19-rc5-mm2-006_drainpercpu/mm/page_alloc.c
--- linux-2.6.19-rc5-mm2-005_configurable/mm/page_alloc.c 2006-11-21 10:52:26.000000000 +0000
+++ linux-2.6.19-rc5-mm2-006_drainpercpu/mm/page_alloc.c 2006-11-21 10:54:11.000000000 +0000
@@ -801,7 +801,7 @@ void drain_node_pages(int nodeid)
}
#endif

-#if defined(CONFIG_PM) || defined(CONFIG_HOTPLUG_CPU)
+#ifdef CONFIG_NEED_DRAIN_PERCPU_PAGES
static void __drain_pages(unsigned int cpu)
{
unsigned long flags;
@@ -823,7 +823,7 @@ static void __drain_pages(unsigned int c
}
}
}
-#endif /* CONFIG_PM || CONFIG_HOTPLUG_CPU */
+#endif /* CONFIG_NEED_DRAIN_PERCPU_PAGES */

#ifdef CONFIG_PM

@@ -860,7 +860,9 @@ void mark_free_pages(struct zone *zone)

spin_unlock_irqrestore(&zone->lock, flags);
}
+#endif /* CONFIG_PM */

+#if defined(CONFIG_PM) || defined(CONFIG_PAGE_CLUSTERING)
/*
* Spill all of this CPU's per-cpu pages back into the buddy allocator.
*/
@@ -872,7 +874,28 @@ void drain_local_pages(void)
__drain_pages(smp_processor_id());
local_irq_restore(flags);
}
-#endif /* CONFIG_PM */
+
+void smp_drain_local_pages(void *arg)
+{
+ drain_local_pages();
+}
+
+/*
+ * Spill all the per-cpu pages from all CPUs back into the buddy allocator
+ */
+void drain_all_local_pages(void)
+{
+ unsigned long flags;
+
+ local_irq_save(flags);
+ __drain_pages(smp_processor_id());
+ local_irq_restore(flags);
+
+ smp_call_function(smp_drain_local_pages, NULL, 0, 1);
+}
+#else
+void drain_all_local_pages(void) {}
+#endif /* CONFIG_PM || CONFIG_PAGE_CLUSTERING */

/*
* Free a 0-order page
@@ -897,6 +920,7 @@ static void fastcall free_hot_cold_page(
local_irq_save(flags);
__count_vm_event(PGFREE);
list_add(&page->lru, &pcp->list);
+ set_page_private(page, get_page_migratetype(page));
pcp->count++;
if (pcp->count >= pcp->high) {
free_pages_bulk(zone, pcp->batch, &pcp->list, 0);
@@ -1489,6 +1513,9 @@ nofail_alloc:

cond_resched();

+ if (order != 0)
+ drain_all_local_pages();
+
if (likely(did_some_progress)) {
page = get_page_from_freelist(gfp_mask, order,
zonelist, alloc_flags);

2006-11-21 22:52:39

by Mel Gorman

Subject: [PATCH 6/11] Move free pages between lists on steal


When a fallback occurs, there will be free pages for one allocation type
stored on the list for another. When a large block is stolen, this patch
moves all the free pages within that block to the lists of the stealing
allocation type.

Signed-off-by: Mel Gorman <[email protected]>
---

page_alloc.c | 80 +++++++++++++++++++++++++++++++++++++++++++++++++-----
1 files changed, 73 insertions(+), 7 deletions(-)

diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.19-rc5-mm2-006_drainpercpu/mm/page_alloc.c linux-2.6.19-rc5-mm2-007_movefree/mm/page_alloc.c
--- linux-2.6.19-rc5-mm2-006_drainpercpu/mm/page_alloc.c 2006-11-21 10:54:11.000000000 +0000
+++ linux-2.6.19-rc5-mm2-007_movefree/mm/page_alloc.c 2006-11-21 10:56:06.000000000 +0000
@@ -661,6 +661,62 @@ static int prep_new_page(struct page *pa
}

#ifdef CONFIG_PAGE_CLUSTERING
+/*
+ * Move the free pages in a range to the free lists of the requested type.
+ * Note that start_page and end_page are not aligned on a MAX_ORDER_NR_PAGES
+ * boundary. If alignment is required, use move_freepages_block()
+ */
+int move_freepages(struct zone *zone,
+ struct page *start_page, struct page *end_page,
+ int migratetype)
+{
+ struct page *page;
+ unsigned long order;
+ int blocks_moved = 0;
+
+ BUG_ON(page_zone(start_page) != page_zone(end_page));
+
+ for (page = start_page; page < end_page;) {
+ if (!PageBuddy(page)) {
+ page++;
+ continue;
+ }
+#ifdef CONFIG_HOLES_IN_ZONE
+ if (!pfn_valid(page_to_pfn(page))) {
+ page++;
+ continue;
+ }
+#endif
+
+ order = page_order(page);
+ list_del(&page->lru);
+ list_add(&page->lru,
+ &zone->free_area[order].free_list[migratetype]);
+ page += 1 << order;
+ blocks_moved++;
+ }
+
+ return blocks_moved;
+}
+
+int move_freepages_block(struct zone *zone, struct page *page, int migratetype)
+{
+ unsigned long start_pfn;
+ struct page *start_page, *end_page;
+
+ start_pfn = page_to_pfn(page);
+ start_pfn = start_pfn & ~(MAX_ORDER_NR_PAGES-1);
+ start_page = pfn_to_page(start_pfn);
+ end_page = start_page + MAX_ORDER_NR_PAGES;
+
+ if (page_zone(page) != page_zone(start_page))
+ start_page = page;
+ if (page_zone(page) != page_zone(end_page))
+ return 0;
+
+ return move_freepages(zone, start_page, end_page, migratetype);
+}
+
/* Remove an element from the buddy allocator from the fallback list */
static struct page *__rmqueue_fallback(struct zone *zone, int order,
int start_migratetype)
@@ -681,24 +737,34 @@ static struct page *__rmqueue_fallback(s
struct page, lru);
area->nr_free--;

- /*
- * If breaking a large block of pages, place the buddies
- * on the preferred allocation list
- */
- if (unlikely(current_order >= MAX_ORDER / 2))
- migratetype = !migratetype;
-
/* Remove the page from the freelists */
list_del(&page->lru);
rmv_page_order(page);
zone->free_pages -= 1UL << order;
expand(zone, page, order, current_order, area, migratetype);
+
+ /* Move free pages between lists if stealing a large block */
+ if (current_order > MAX_ORDER / 2)
+ move_freepages_block(zone, page, start_migratetype);
+
return page;
}

return NULL;
}
#else
+int move_freepages(struct zone *zone,
+ struct page *start_page, struct page *end_page,
+ int migratetype)
+{
+ return 0;
+}
+
+int move_freepages_block(struct zone *zone, struct page *page, int migratetype)
+{
+ return 0;
+}
+
static struct page *__rmqueue_fallback(struct zone *zone, unsigned int order,
int migratetype)
{

2006-11-21 22:53:56

by Mel Gorman

Subject: [PATCH 10/11] Remove dependency on page->flag bits


The page clustering implementation uses page flags to track page usage.
In preparation for their replacement with corresponding pageblock flags,
this patch removes the page->flags manipulation.

After this patch, page clustering is broken until the next patch in the set
is applied.

Signed-off-by: Mel Gorman <[email protected]>
---

arch/x86_64/kernel/e820.c | 8 -------
include/linux/page-flags.h | 44 +---------------------------------------
init/Kconfig | 1
mm/page_alloc.c | 16 --------------
4 files changed, 2 insertions(+), 67 deletions(-)

diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.19-rc5-mm2-101_pageblock_bits/arch/x86_64/kernel/e820.c linux-2.6.19-rc5-mm2-102_remove_clustering_flags/arch/x86_64/kernel/e820.c
--- linux-2.6.19-rc5-mm2-101_pageblock_bits/arch/x86_64/kernel/e820.c 2006-11-21 10:57:46.000000000 +0000
+++ linux-2.6.19-rc5-mm2-102_remove_clustering_flags/arch/x86_64/kernel/e820.c 2006-11-21 11:25:08.000000000 +0000
@@ -217,13 +217,6 @@ void __init e820_reserve_resources(void)
}
}

-#ifdef CONFIG_PAGE_CLUSTERING
-static void __init
-e820_mark_nosave_range(unsigned long start, unsigned long end)
-{
- printk("Nosave not set when anti-frag is enabled");
-}
-#else
/* Mark pages corresponding to given address range as nosave */
static void __init
e820_mark_nosave_range(unsigned long start, unsigned long end)
@@ -239,7 +232,6 @@ e820_mark_nosave_range(unsigned long sta
if (pfn_valid(pfn))
SetPageNosave(pfn_to_page(pfn));
}
-#endif

/*
* Find the ranges of physical addresses that do not correspond to
diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.19-rc5-mm2-101_pageblock_bits/include/linux/page-flags.h linux-2.6.19-rc5-mm2-102_remove_clustering_flags/include/linux/page-flags.h
--- linux-2.6.19-rc5-mm2-101_pageblock_bits/include/linux/page-flags.h 2006-11-21 10:57:46.000000000 +0000
+++ linux-2.6.19-rc5-mm2-102_remove_clustering_flags/include/linux/page-flags.h 2006-11-21 11:25:08.000000000 +0000
@@ -82,29 +82,17 @@
#define PG_private 11 /* If pagecache, has fs-private data */

#define PG_writeback 12 /* Page is under writeback */
+#define PG_nosave 13 /* Used for system suspend/resume */
#define PG_compound 14 /* Part of a compound page */
#define PG_swapcache 15 /* Swap page: swp_entry_t in private */

#define PG_mappedtodisk 16 /* Has blocks allocated on-disk */
#define PG_reclaim 17 /* To be reclaimed asap */
+#define PG_nosave_free 18 /* Free, should not be written */
#define PG_buddy 19 /* Page is free, on buddy lists */

#define PG_readahead 20 /* Reminder to do readahead */

-/*
- * As page clustering requires two flags, it was best to reuse the suspend
- * flags and make page clustering depend on !SOFTWARE_SUSPEND. This works
- * on the assumption that machines being suspended do not really care about
- * large contiguous allocations.
- */
-#ifndef CONFIG_PAGE_CLUSTERING
-#define PG_nosave 13 /* Used for system suspend/resume */
-#define PG_nosave_free 18 /* Free, should not be written */
-#else
-#define PG_reclaimable 13 /* Page is reclaimable */
-#define PG_movable 18 /* Page is movable */
-#endif
-
#if (BITS_PER_LONG > 32)
/*
* 64-bit-only flags build down from bit 31
@@ -221,7 +209,6 @@ static inline void SetPageUptodate(struc
ret; \
})

-#ifndef CONFIG_PAGE_CLUSTERING
#define PageNosave(page) test_bit(PG_nosave, &(page)->flags)
#define SetPageNosave(page) set_bit(PG_nosave, &(page)->flags)
#define TestSetPageNosave(page) test_and_set_bit(PG_nosave, &(page)->flags)
@@ -232,33 +219,6 @@ static inline void SetPageUptodate(struc
#define SetPageNosaveFree(page) set_bit(PG_nosave_free, &(page)->flags)
#define ClearPageNosaveFree(page) clear_bit(PG_nosave_free, &(page)->flags)

-#define PageReclaimable(page) (0)
-#define SetPageReclaimable(page) do {} while (0)
-#define ClearPageReclaimable(page) do {} while (0)
-#define __SetPageReclaimable(page) do {} while (0)
-#define __ClearPageReclaimable(page) do {} while (0)
-
-#define PageMovable(page) (0)
-#define SetPageMovable(page) do {} while (0)
-#define ClearPageMovable(page) do {} while (0)
-#define __SetPageMovable(page) do {} while (0)
-#define __ClearPageMovable(page) do {} while (0)
-
-#else
-
-#define PageReclaimable(page) test_bit(PG_reclaimable, &(page)->flags)
-#define SetPageReclaimable(page) set_bit(PG_reclaimable, &(page)->flags)
-#define ClearPageReclaimable(page) clear_bit(PG_reclaimable, &(page)->flags)
-#define __SetPageReclaimable(page) __set_bit(PG_reclaimable, &(page)->flags)
-#define __ClearPageReclaimable(page) __clear_bit(PG_reclaimable, &(page)->flags)
-
-#define PageMovable(page) test_bit(PG_movable, &(page)->flags)
-#define SetPageMovable(page) set_bit(PG_movable, &(page)->flags)
-#define ClearPageMovable(page) clear_bit(PG_movable, &(page)->flags)
-#define __SetPageMovable(page) __set_bit(PG_movable, &(page)->flags)
-#define __ClearPageMovable(page) __clear_bit(PG_movable, &(page)->flags)
-#endif /* CONFIG_PAGE_CLUSTERING */
-
#define PageBuddy(page) test_bit(PG_buddy, &(page)->flags)
#define __SetPageBuddy(page) __set_bit(PG_buddy, &(page)->flags)
#define __ClearPageBuddy(page) __clear_bit(PG_buddy, &(page)->flags)
diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.19-rc5-mm2-101_pageblock_bits/init/Kconfig linux-2.6.19-rc5-mm2-102_remove_clustering_flags/init/Kconfig
--- linux-2.6.19-rc5-mm2-101_pageblock_bits/init/Kconfig 2006-11-21 10:57:46.000000000 +0000
+++ linux-2.6.19-rc5-mm2-102_remove_clustering_flags/init/Kconfig 2006-11-21 11:25:08.000000000 +0000
@@ -502,7 +502,6 @@ config SLOB

config PAGE_CLUSTERING
bool "Cluster movable pages together in the page allocator"
- depends on !SOFTWARE_SUSPEND
def_bool n
help
The standard allocator will fragment memory over time which means
diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.19-rc5-mm2-101_pageblock_bits/mm/page_alloc.c linux-2.6.19-rc5-mm2-102_remove_clustering_flags/mm/page_alloc.c
--- linux-2.6.19-rc5-mm2-101_pageblock_bits/mm/page_alloc.c 2006-11-21 11:52:45.000000000 +0000
+++ linux-2.6.19-rc5-mm2-102_remove_clustering_flags/mm/page_alloc.c 2006-11-21 11:52:50.000000000 +0000
@@ -145,7 +145,6 @@ static unsigned long __initdata dma_rese
#ifdef CONFIG_PAGE_CLUSTERING
static inline int get_page_migratetype(struct page *page)
{
- return ((PageMovable(page) != 0) << 1) | (PageReclaimable(page) != 0);
}

static inline int gfpflags_to_migratetype(gfp_t gfp_flags)
@@ -443,14 +442,6 @@ static inline void __free_one_page(struc

page_idx = page_to_pfn(page) & ((1 << MAX_ORDER) - 1);

- /*
- * Free pages are always marked movable so the bits are in a known
- * state on alloc. As movable allocations are the most common, this
- * will result in less bit manipulations
- */
- __SetPageMovable(page);
- __ClearPageReclaimable(page);
-
VM_BUG_ON(page_idx & (order_size - 1));
VM_BUG_ON(bad_range(zone, page));

@@ -850,12 +841,6 @@ static struct page *__rmqueue(struct zon
page = __rmqueue_fallback(zone, order, migratetype);

got_page:
- if (unlikely(migratetype != MIGRATE_MOVABLE) && page)
- __ClearPageMovable(page);
-
- if (migratetype == MIGRATE_RECLAIMABLE && page)
- __SetPageReclaimable(page);
-
return page;
}

@@ -2312,7 +2297,6 @@ void __meminit memmap_init_zone(unsigned
* the address space during boot when many long-lived
* kernel allocations are made
*/
- SetPageMovable(page);

INIT_LIST_HEAD(&page->lru);
#ifdef WANT_PAGE_VIRTUAL

2006-11-21 22:53:53

by Mel Gorman

Subject: [PATCH 9/11] Add a bitmap that is used to track flags affecting a block of pages


Page clustering uses two bits per page to track if the page can be moved
or reclaimed. However, what is of real interest is what the whole block
of pages is being used for. This patch adds a bitmap that is used for
flags affecting a whole MAX_ORDER block of pages. Later patches drop the
requirement to use page->flags and this bitmap is used instead.

In non-SPARSEMEM configurations, the bitmap is stored in the struct zone
and allocated during initialisation. SPARSEMEM statically allocates the
bitmap in a struct mem_section so that bitmaps do not have to be resized
during memory hotadd. This wastes a small amount of memory per unused section
(usually sizeof(unsigned long)) but the complexity of dynamically allocating
the memory is quite high.

This mechanism is a proof of concept, so it favours an obviously correct
implementation over an optimal one.

Additional credit to Andy Whitcroft, who reviewed an earlier implementation
of the mechanism and suggested how to make it a *lot* cleaner.
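
To illustrate how a later patch in the series might consume this interface,
a minimal sketch follows. The PB_migrate/PB_migrate_end bit indices and the
wrapper names are hypothetical here; only get_pageblock_flags_group() and
set_pageblock_flags_group() are provided by this patch:

	/* Sketch only: hypothetical wrappers storing a block's migrate type */
	static inline int get_pageblock_migratetype(struct page *page)
	{
		return get_pageblock_flags_group(page, PB_migrate,
							PB_migrate_end);
	}

	static inline void set_pageblock_migratetype(struct page *page,
							int migratetype)
	{
		set_pageblock_flags_group(page, (unsigned long)migratetype,
						PB_migrate, PB_migrate_end);
	}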

Signed-off-by: Mel Gorman <[email protected]>
---

include/linux/mmzone.h | 13 +++
include/linux/pageblock-flags.h | 48 ++++++++++++++
mm/page_alloc.c | 115 +++++++++++++++++++++++++++++++++++
3 files changed, 176 insertions(+)

diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.19-rc5-mm2-009_stats/include/linux/mmzone.h linux-2.6.19-rc5-mm2-101_pageblock_bits/include/linux/mmzone.h
--- linux-2.6.19-rc5-mm2-009_stats/include/linux/mmzone.h 2006-11-21 10:57:46.000000000 +0000
+++ linux-2.6.19-rc5-mm2-101_pageblock_bits/include/linux/mmzone.h 2006-11-21 11:23:20.000000000 +0000
@@ -13,6 +13,7 @@
#include <linux/init.h>
#include <linux/seqlock.h>
#include <linux/nodemask.h>
+#include <linux/pageblock-flags.h>
#include <asm/atomic.h>
#include <asm/page.h>

@@ -222,6 +223,14 @@ struct zone {
#endif
struct free_area free_area[MAX_ORDER];

+#ifndef CONFIG_SPARSEMEM
+ /*
+ * Flags for a MAX_ORDER_NR_PAGES block. See pageblock-flags.h.
+ * In SPARSEMEM, this map is stored in struct mem_section
+ */
+ unsigned long *pageblock_flags;
+#endif /* CONFIG_SPARSEMEM */
+

ZONE_PADDING(_pad1_)

@@ -677,6 +686,9 @@ extern struct zone *next_zone(struct zon
#define PAGES_PER_SECTION (1UL << PFN_SECTION_SHIFT)
#define PAGE_SECTION_MASK (~(PAGES_PER_SECTION-1))

+#define SECTION_BLOCKFLAGS_BITS \
+ ((SECTION_SIZE_BITS - (MAX_ORDER-1)) * NR_PAGEBLOCK_BITS)
+
#if (MAX_ORDER - 1 + PAGE_SHIFT) > SECTION_SIZE_BITS
#error Allocator MAX_ORDER exceeds SECTION_SIZE
#endif
@@ -696,6 +708,7 @@ struct mem_section {
* before using it wrong.
*/
unsigned long section_mem_map;
+ DECLARE_BITMAP(pageblock_flags, SECTION_BLOCKFLAGS_BITS);
};

#ifdef CONFIG_SPARSEMEM_EXTREME
diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.19-rc5-mm2-009_stats/include/linux/pageblock-flags.h linux-2.6.19-rc5-mm2-101_pageblock_bits/include/linux/pageblock-flags.h
--- linux-2.6.19-rc5-mm2-009_stats/include/linux/pageblock-flags.h 2006-11-21 11:33:29.000000000 +0000
+++ linux-2.6.19-rc5-mm2-101_pageblock_bits/include/linux/pageblock-flags.h 2006-11-21 11:23:20.000000000 +0000
@@ -0,0 +1,48 @@
+/*
+ * Macros for manipulating and testing flags related to a
+ * MAX_ORDER_NR_PAGES block of pages.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation version 2 of the License
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software
+ * Foundation, Inc., 59 Temple Place - Suite 330, Boston, MA 02111-1307, USA.
+ *
+ * Copyright (C) IBM Corporation, 2006
+ *
+ * Original author, Mel Gorman
+ * Major cleanups and reduction of bit operations, Andy Whitcroft
+ */
+#ifndef PAGEBLOCK_FLAGS_H
+#define PAGEBLOCK_FLAGS_H
+
+#include <linux/types.h>
+
+/* Bit indices that affect a whole block of pages */
+enum pageblock_bits {
+ NR_PAGEBLOCK_BITS
+};
+
+/* Forward declaration */
+struct page;
+
+/* Declarations for getting and setting flags. See mm/page_alloc.c */
+unsigned long get_pageblock_flags_group(struct page *page,
+ int start_bitidx, int end_bitidx);
+void set_pageblock_flags_group(struct page *page, unsigned long flags,
+ int start_bitidx, int end_bitidx);
+
+#define get_pageblock_flags(page) \
+ get_pageblock_flags_group(page, 0, NR_PAGEBLOCK_BITS-1)
+#define set_pageblock_flags(page) \
+ set_pageblock_flags_group(page, 0, NR_PAGEBLOCK_BITS-1)
+
+#endif /* PAGEBLOCK_FLAGS_H */
+
diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.19-rc5-mm2-009_stats/mm/page_alloc.c linux-2.6.19-rc5-mm2-101_pageblock_bits/mm/page_alloc.c
--- linux-2.6.19-rc5-mm2-009_stats/mm/page_alloc.c 2006-11-21 11:52:40.000000000 +0000
+++ linux-2.6.19-rc5-mm2-101_pageblock_bits/mm/page_alloc.c 2006-11-21 11:52:45.000000000 +0000
@@ -2945,6 +2945,41 @@ static void __init calculate_node_totalp
realtotalpages);
}

+#ifndef CONFIG_SPARSEMEM
+/*
+ * Calculate the size of the zone->blockflags rounded to an unsigned long
+ * Start by making sure zonesize is a multiple of MAX_ORDER-1 by rounding up
+ * Then figure 1 NR_PAGEBLOCK_BITS worth of bits per MAX_ORDER-1, finally
+ * round what is now in bits to nearest long in bits, then return it in
+ * bytes.
+ */
+static unsigned long __init usemap_size(unsigned long zonesize)
+{
+ unsigned long usemapsize;
+
+ usemapsize = roundup(zonesize, MAX_ORDER_NR_PAGES);
+ usemapsize = usemapsize >> (MAX_ORDER-1);
+ usemapsize *= NR_PAGEBLOCK_BITS;
+ usemapsize = roundup(usemapsize, 8 * sizeof(unsigned long));
+
+ return usemapsize / 8;
+}
+
+static void __init setup_usemap(struct pglist_data *pgdat,
+ struct zone *zone, unsigned long zonesize)
+{
+ unsigned long usemapsize = usemap_size(zonesize);
+ zone->pageblock_flags = NULL;
+ if (usemapsize) {
+ zone->pageblock_flags = alloc_bootmem_node(pgdat, usemapsize);
+ memset(zone->pageblock_flags, 0, usemapsize);
+ }
+}
+#else
+static void inline setup_usemap(struct pglist_data *pgdat,
+ struct zone *zone, unsigned long zonesize) {}
+#endif /* CONFIG_SPARSEMEM */
+
/*
* Set up the zone data structures:
* - mark all pages reserved
@@ -3028,6 +3063,7 @@ static void __meminit free_area_init_cor
if (!size)
continue;

+ setup_usemap(pgdat, zone, size);
ret = init_currently_empty_zone(zone, zone_start_pfn, size);
BUG_ON(ret);
zone_start_pfn += size;
@@ -3744,3 +3780,82 @@ int highest_possible_node_id(void)
}
EXPORT_SYMBOL(highest_possible_node_id);
#endif
+
+/* Return a pointer to the bitmap storing bits affecting a block of pages */
+static inline unsigned long *get_pageblock_bitmap(struct zone *zone,
+ unsigned long pfn)
+{
+#ifdef CONFIG_SPARSEMEM
+ unsigned long blockpfn;
+ blockpfn = pfn & ~(MAX_ORDER_NR_PAGES - 1);
+ return __pfn_to_section(blockpfn)->pageblock_flags;
+#else
+ return zone->pageblock_flags;
+#endif /* CONFIG_SPARSEMEM */
+}
+
+static inline int pfn_to_bitidx(struct zone *zone, unsigned long pfn)
+{
+#ifdef CONFIG_SPARSEMEM
+ pfn &= (PAGES_PER_SECTION-1);
+ return (pfn >> (MAX_ORDER-1)) * NR_PAGEBLOCK_BITS;
+#else
+ pfn = pfn - zone->zone_start_pfn;
+ return (pfn >> (MAX_ORDER-1)) * NR_PAGEBLOCK_BITS;
+#endif /* CONFIG_SPARSEMEM */
+}
+
+/**
+ * get_pageblock_flags_group - Return the requested group of flags for the MAX_ORDER_NR_PAGES block of pages
+ * @page: The page within the block of interest
+ * @start_bitidx: The first bit of interest to retrieve
+ * @end_bitidx: The last bit of interest
+ * returns pageblock_bits flags
+ */
+unsigned long get_pageblock_flags_group(struct page *page,
+ int start_bitidx, int end_bitidx)
+{
+ struct zone *zone;
+ unsigned long *bitmap;
+ unsigned long pfn, bitidx;
+ unsigned long flags = 0;
+ unsigned long value = 1;
+
+ zone = page_zone(page);
+ pfn = page_to_pfn(page);
+ bitmap = get_pageblock_bitmap(zone, pfn);
+ bitidx = pfn_to_bitidx(zone, pfn);
+
+ for (; start_bitidx <= end_bitidx; start_bitidx++, value <<= 1)
+ if (test_bit(bitidx + start_bitidx, bitmap))
+ flags |= value;
+
+ return flags;
+}
+
+/**
+ * set_pageblock_flags_group - Set the requested group of flags for a MAX_ORDER_NR_PAGES block of pages
+ * @page: The page within the block of interest
+ * @start_bitidx: The first bit of interest
+ * @end_bitidx: The last bit of interest
+ * @flags: The flags to set
+ */
+void set_pageblock_flags_group(struct page *page, unsigned long flags,
+ int start_bitidx, int end_bitidx)
+{
+ struct zone *zone;
+ unsigned long *bitmap;
+ unsigned long pfn, bitidx;
+ unsigned long value = 1;
+
+ zone = page_zone(page);
+ pfn = page_to_pfn(page);
+ bitmap = get_pageblock_bitmap(zone, pfn);
+ bitidx = pfn_to_bitidx(zone, pfn);
+
+ for (; start_bitidx <= end_bitidx; start_bitidx++, value <<= 1)
+ if (flags & value)
+ __set_bit(bitidx + start_bitidx, bitmap);
+ else
+ __clear_bit(bitidx + start_bitidx, bitmap);
+}

2006-11-21 22:53:37

by Mel Gorman

[permalink] [raw]
Subject: [PATCH 8/11] [DEBUG] Add statistics


This patch is strictly debug-only. It outputs some extra information to
/proc/buddyinfo that may help explain what went wrong if page clustering
breaks down completely, and it prints a stack trace when fallbacks occur to
help determine whether allocation flagging is incomplete.
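
The extra output appended to /proc/buddyinfo has the following shape (the
counter values shown here are placeholders, not measured figures):

  Fallback counts
  Unmovable:        N
  Reclaim:          N
  Movable:          N

  Split counts
  Unmovable:        N
  Reclaim:          N
  Movable:          N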


Signed-off-by: Mel Gorman <[email protected]>
---

page_alloc.c | 32 ++++++++++++++++++++++++++++++++
vmstat.c | 16 ++++++++++++++++
2 files changed, 48 insertions(+)

diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.19-rc5-mm2-008_reclaimable/mm/page_alloc.c linux-2.6.19-rc5-mm2-009_stats/mm/page_alloc.c
--- linux-2.6.19-rc5-mm2-008_reclaimable/mm/page_alloc.c 2006-11-21 10:57:46.000000000 +0000
+++ linux-2.6.19-rc5-mm2-009_stats/mm/page_alloc.c 2006-11-21 11:52:40.000000000 +0000
@@ -58,6 +58,10 @@ unsigned long totalram_pages __read_most
unsigned long totalreserve_pages __read_mostly;
long nr_swap_pages;
int percpu_pagelist_fraction;
+#ifdef CONFIG_PAGE_CLUSTERING
+int split_count[MIGRATE_TYPES];
+int fallback_counts[MIGRATE_TYPES];
+#endif /* CONFIG_PAGE_CLUSTERING */

static void __free_pages_ok(struct page *page, unsigned int order);

@@ -84,6 +88,8 @@ int sysctl_lowmem_reserve_ratio[MAX_NR_Z
#endif
};

+static int printfallback_count;
+
EXPORT_SYMBOL(totalram_pages);

static char *zone_names[MAX_NR_ZONES] = {
@@ -750,6 +756,27 @@ static struct page *__rmqueue_fallback(s
struct page, lru);
area->nr_free--;

+ /* Account for a MAX_ORDER block being split */
+ if (current_order == MAX_ORDER - 1 &&
+ order < MAX_ORDER - 1) {
+ split_count[start_migratetype]++;
+ }
+
+ /* Account for fallbacks of interest */
+ if (order < HUGETLB_PAGE_ORDER &&
+ current_order != MAX_ORDER - 1) {
+ fallback_counts[start_migratetype]++;
+ if (printfallback_count < 500 && start_migratetype != MIGRATE_MOVABLE) {
+ printfallback_count++;
+ printk("ALLOC FALLBACK %d TYPE %d TO %d ZONE %s\n", printfallback_count, start_migratetype, migratetype, zone->name);
+ printk("===========================\n");
+ dump_stack();
+ printk("===========================\n");
+ }
+
+
+ }
+
/* Remove the page from the freelists */
list_del(&page->lru);
rmv_page_order(page);
@@ -805,6 +832,11 @@ static struct page *__rmqueue(struct zon
if (list_empty(&area->free_list[migratetype]))
continue;

+#ifdef CONFIG_PAGE_CLUSTERING
+ if (current_order == MAX_ORDER - 1 && order < MAX_ORDER - 1)
+ split_count[migratetype]++;
+#endif /* CONFIG_PAGE_CLUSTERING */
+
page = list_entry(area->free_list[migratetype].next,
struct page, lru);
list_del(&page->lru);
diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.19-rc5-mm2-008_reclaimable/mm/vmstat.c linux-2.6.19-rc5-mm2-009_stats/mm/vmstat.c
--- linux-2.6.19-rc5-mm2-008_reclaimable/mm/vmstat.c 2006-11-14 14:01:37.000000000 +0000
+++ linux-2.6.19-rc5-mm2-009_stats/mm/vmstat.c 2006-11-21 10:59:36.000000000 +0000
@@ -13,6 +13,11 @@
#include <linux/module.h>
#include <linux/cpu.h>

+#ifdef CONFIG_PAGE_CLUSTERING
+extern int split_count[MIGRATE_TYPES];
+extern int fallback_counts[MIGRATE_TYPES];
+#endif /* CONFIG_PAGE_CLUSTERING */
+
void __get_zone_counts(unsigned long *active, unsigned long *inactive,
unsigned long *free, struct pglist_data *pgdat)
{
@@ -403,6 +408,17 @@ static void *frag_next(struct seq_file *

static void frag_stop(struct seq_file *m, void *arg)
{
+#ifdef CONFIG_PAGE_CLUSTERING
+ seq_printf(m, "Fallback counts\n");
+ seq_printf(m, "Unmovable: %8d\n", fallback_counts[MIGRATE_UNMOVABLE]);
+ seq_printf(m, "Reclaim: %8d\n", fallback_counts[MIGRATE_RECLAIMABLE]);
+ seq_printf(m, "Movable: %8d\n", fallback_counts[MIGRATE_MOVABLE]);
+
+ seq_printf(m, "\nSplit counts\n");
+ seq_printf(m, "Unmovable: %8d\n", split_count[MIGRATE_UNMOVABLE]);
+ seq_printf(m, "Reclaim: %8d\n", split_count[MIGRATE_RECLAIMABLE]);
+ seq_printf(m, "Movable: %8d\n", split_count[MIGRATE_MOVABLE]);
+#endif /* CONFIG_PAGE_CLUSTERING */
}

/*

2006-11-21 22:52:57

by Mel Gorman

[permalink] [raw]
Subject: [PATCH 7/11] Mark short-lived and reclaimable kernel allocations


The kernel makes a number of allocations that are either short-lived, such as
network buffers, or reclaimable, such as inode allocations. When something
like updatedb is run, long-lived and unmovable kernel allocations tend to be
spread throughout the address space, which increases fragmentation.

This patch clusters these allocations together as much as possible. As this
requires an additional page flag, the two suspend bits are reused instead.
Three patches at the end of this set introduce an alternative to using page
flags, allowing suspend to be used again.
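
As an illustration of the conversion pattern repeated throughout the hunks
below (this simply restates the fs/jbd change, it is not additional code):

  /* before */
  ret = kmem_cache_alloc(journal_head_cache, GFP_NOFS);

  /* after: the same allocation, tagged so the page allocator groups it
   * with other reclaimable pages */
  ret = kmem_cache_alloc(journal_head_cache,
			set_migrateflags(GFP_NOFS, __GFP_RECLAIMABLE));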

Signed-off-by: Mel Gorman <[email protected]>
---

arch/x86_64/kernel/e820.c | 8 +++++
fs/buffer.c | 3 +-
fs/dcache.c | 2 -
fs/ext2/super.c | 3 +-
fs/ext3/super.c | 2 -
fs/jbd/journal.c | 6 ++--
fs/jbd/revoke.c | 6 ++--
fs/ntfs/inode.c | 6 ++--
fs/proc/base.c | 13 ++++----
fs/proc/generic.c | 2 -
fs/reiserfs/super.c | 3 +-
include/linux/gfp.h | 16 ++++++++--
include/linux/mmzone.h | 14 +++++----
include/linux/page-flags.h | 50 +++++++++++++++++++++++++++------
init/Kconfig | 1
lib/radix-tree.c | 6 ++--
mm/page_alloc.c | 59 ++++++++++++++++++++++++++--------------
mm/shmem.c | 10 ++++--
net/core/skbuff.c | 1
19 files changed, 150 insertions(+), 61 deletions(-)

diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.19-rc5-mm2-007_movefree/arch/x86_64/kernel/e820.c linux-2.6.19-rc5-mm2-008_reclaimable/arch/x86_64/kernel/e820.c
--- linux-2.6.19-rc5-mm2-007_movefree/arch/x86_64/kernel/e820.c 2006-11-14 14:01:35.000000000 +0000
+++ linux-2.6.19-rc5-mm2-008_reclaimable/arch/x86_64/kernel/e820.c 2006-11-21 10:57:46.000000000 +0000
@@ -217,6 +217,13 @@ void __init e820_reserve_resources(void)
}
}

+#ifdef CONFIG_PAGE_CLUSTERING
+static void __init
+e820_mark_nosave_range(unsigned long start, unsigned long end)
+{
+ printk("Nosave not set when anti-frag is enabled");
+}
+#else
/* Mark pages corresponding to given address range as nosave */
static void __init
e820_mark_nosave_range(unsigned long start, unsigned long end)
@@ -232,6 +239,7 @@ e820_mark_nosave_range(unsigned long sta
if (pfn_valid(pfn))
SetPageNosave(pfn_to_page(pfn));
}
+#endif

/*
* Find the ranges of physical addresses that do not correspond to
diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.19-rc5-mm2-007_movefree/fs/buffer.c linux-2.6.19-rc5-mm2-008_reclaimable/fs/buffer.c
--- linux-2.6.19-rc5-mm2-007_movefree/fs/buffer.c 2006-11-21 10:47:11.000000000 +0000
+++ linux-2.6.19-rc5-mm2-008_reclaimable/fs/buffer.c 2006-11-21 10:57:46.000000000 +0000
@@ -3004,7 +3004,8 @@ static void recalc_bh_state(void)

struct buffer_head *alloc_buffer_head(gfp_t gfp_flags)
{
- struct buffer_head *ret = kmem_cache_alloc(bh_cachep, gfp_flags);
+ struct buffer_head *ret = kmem_cache_alloc(bh_cachep,
+ set_migrateflags(gfp_flags, __GFP_RECLAIMABLE));
if (ret) {
get_cpu_var(bh_accounting).nr++;
recalc_bh_state();
diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.19-rc5-mm2-007_movefree/fs/dcache.c linux-2.6.19-rc5-mm2-008_reclaimable/fs/dcache.c
--- linux-2.6.19-rc5-mm2-007_movefree/fs/dcache.c 2006-11-14 14:01:37.000000000 +0000
+++ linux-2.6.19-rc5-mm2-008_reclaimable/fs/dcache.c 2006-11-21 10:57:46.000000000 +0000
@@ -861,7 +861,7 @@ struct dentry *d_alloc(struct dentry * p
struct dentry *dentry;
char *dname;

- dentry = kmem_cache_alloc(dentry_cache, GFP_KERNEL);
+ dentry = kmem_cache_alloc(dentry_cache, GFP_KERNEL|__GFP_RECLAIMABLE);
if (!dentry)
return NULL;

diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.19-rc5-mm2-007_movefree/fs/ext2/super.c linux-2.6.19-rc5-mm2-008_reclaimable/fs/ext2/super.c
--- linux-2.6.19-rc5-mm2-007_movefree/fs/ext2/super.c 2006-11-14 14:01:37.000000000 +0000
+++ linux-2.6.19-rc5-mm2-008_reclaimable/fs/ext2/super.c 2006-11-21 10:57:46.000000000 +0000
@@ -140,7 +140,8 @@ static kmem_cache_t * ext2_inode_cachep;
static struct inode *ext2_alloc_inode(struct super_block *sb)
{
struct ext2_inode_info *ei;
- ei = (struct ext2_inode_info *)kmem_cache_alloc(ext2_inode_cachep, SLAB_KERNEL);
+ ei = (struct ext2_inode_info *)kmem_cache_alloc(ext2_inode_cachep,
+ SLAB_KERNEL|__GFP_RECLAIMABLE);
if (!ei)
return NULL;
#ifdef CONFIG_EXT2_FS_POSIX_ACL
diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.19-rc5-mm2-007_movefree/fs/ext3/super.c linux-2.6.19-rc5-mm2-008_reclaimable/fs/ext3/super.c
--- linux-2.6.19-rc5-mm2-007_movefree/fs/ext3/super.c 2006-11-14 14:01:37.000000000 +0000
+++ linux-2.6.19-rc5-mm2-008_reclaimable/fs/ext3/super.c 2006-11-21 10:57:46.000000000 +0000
@@ -445,7 +445,7 @@ static struct inode *ext3_alloc_inode(st
{
struct ext3_inode_info *ei;

- ei = kmem_cache_alloc(ext3_inode_cachep, SLAB_NOFS);
+ ei = kmem_cache_alloc(ext3_inode_cachep, SLAB_NOFS|__GFP_RECLAIMABLE);
if (!ei)
return NULL;
#ifdef CONFIG_EXT3_FS_POSIX_ACL
diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.19-rc5-mm2-007_movefree/fs/jbd/journal.c linux-2.6.19-rc5-mm2-008_reclaimable/fs/jbd/journal.c
--- linux-2.6.19-rc5-mm2-007_movefree/fs/jbd/journal.c 2006-11-14 14:01:37.000000000 +0000
+++ linux-2.6.19-rc5-mm2-008_reclaimable/fs/jbd/journal.c 2006-11-21 10:57:46.000000000 +0000
@@ -1735,7 +1735,8 @@ static struct journal_head *journal_allo
#ifdef CONFIG_JBD_DEBUG
atomic_inc(&nr_journal_heads);
#endif
- ret = kmem_cache_alloc(journal_head_cache, GFP_NOFS);
+ ret = kmem_cache_alloc(journal_head_cache,
+ set_migrateflags(GFP_NOFS, __GFP_RECLAIMABLE));
if (ret == 0) {
jbd_debug(1, "out of memory for journal_head\n");
if (time_after(jiffies, last_warning + 5*HZ)) {
@@ -1745,7 +1746,8 @@ static struct journal_head *journal_allo
}
while (ret == 0) {
yield();
- ret = kmem_cache_alloc(journal_head_cache, GFP_NOFS);
+ ret = kmem_cache_alloc(journal_head_cache,
+ GFP_NOFS|__GFP_RECLAIMABLE);
}
}
return ret;
diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.19-rc5-mm2-007_movefree/fs/jbd/revoke.c linux-2.6.19-rc5-mm2-008_reclaimable/fs/jbd/revoke.c
--- linux-2.6.19-rc5-mm2-007_movefree/fs/jbd/revoke.c 2006-11-08 02:24:20.000000000 +0000
+++ linux-2.6.19-rc5-mm2-008_reclaimable/fs/jbd/revoke.c 2006-11-21 10:57:46.000000000 +0000
@@ -206,7 +206,8 @@ int journal_init_revoke(journal_t *journ
while((tmp >>= 1UL) != 0UL)
shift++;

- journal->j_revoke_table[0] = kmem_cache_alloc(revoke_table_cache, GFP_KERNEL);
+ journal->j_revoke_table[0] = kmem_cache_alloc(revoke_table_cache,
+ GFP_KERNEL|__GFP_RECLAIMABLE);
if (!journal->j_revoke_table[0])
return -ENOMEM;
journal->j_revoke = journal->j_revoke_table[0];
@@ -229,7 +230,8 @@ int journal_init_revoke(journal_t *journ
for (tmp = 0; tmp < hash_size; tmp++)
INIT_LIST_HEAD(&journal->j_revoke->hash_table[tmp]);

- journal->j_revoke_table[1] = kmem_cache_alloc(revoke_table_cache, GFP_KERNEL);
+ journal->j_revoke_table[1] = kmem_cache_alloc(revoke_table_cache,
+ GFP_KERNEL|__GFP_RECLAIMABLE);
if (!journal->j_revoke_table[1]) {
kfree(journal->j_revoke_table[0]->hash_table);
kmem_cache_free(revoke_table_cache, journal->j_revoke_table[0]);
diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.19-rc5-mm2-007_movefree/fs/ntfs/inode.c linux-2.6.19-rc5-mm2-008_reclaimable/fs/ntfs/inode.c
--- linux-2.6.19-rc5-mm2-007_movefree/fs/ntfs/inode.c 2006-11-08 02:24:20.000000000 +0000
+++ linux-2.6.19-rc5-mm2-008_reclaimable/fs/ntfs/inode.c 2006-11-21 10:57:46.000000000 +0000
@@ -324,7 +324,8 @@ struct inode *ntfs_alloc_big_inode(struc
ntfs_inode *ni;

ntfs_debug("Entering.");
- ni = kmem_cache_alloc(ntfs_big_inode_cache, SLAB_NOFS);
+ ni = kmem_cache_alloc(ntfs_big_inode_cache,
+ SLAB_NOFS|__GFP_RECLAIMABLE);
if (likely(ni != NULL)) {
ni->state = 0;
return VFS_I(ni);
@@ -349,7 +350,8 @@ static inline ntfs_inode *ntfs_alloc_ext
ntfs_inode *ni;

ntfs_debug("Entering.");
- ni = kmem_cache_alloc(ntfs_inode_cache, SLAB_NOFS);
+ ni = kmem_cache_alloc(ntfs_inode_cache,
+ SLAB_NOFS|__GFP_RECLAIMABLE);
if (likely(ni != NULL)) {
ni->state = 0;
return ni;
diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.19-rc5-mm2-007_movefree/fs/proc/base.c linux-2.6.19-rc5-mm2-008_reclaimable/fs/proc/base.c
--- linux-2.6.19-rc5-mm2-007_movefree/fs/proc/base.c 2006-11-14 14:01:37.000000000 +0000
+++ linux-2.6.19-rc5-mm2-008_reclaimable/fs/proc/base.c 2006-11-21 10:57:46.000000000 +0000
@@ -484,7 +484,7 @@ static ssize_t proc_info_read(struct fil
count = PROC_BLOCK_SIZE;

length = -ENOMEM;
- if (!(page = __get_free_page(GFP_KERNEL)))
+ if (!(page = __get_free_page(GFP_KERNEL|__GFP_RECLAIMABLE)))
goto out;

length = PROC_I(inode)->op.proc_read(task, (char*)page);
@@ -594,7 +594,7 @@ static ssize_t mem_write(struct file * f
goto out;

copied = -ENOMEM;
- page = (char *)__get_free_page(GFP_USER);
+ page = (char *)__get_free_page(GFP_USER|__GFP_RECLAIMABLE);
if (!page)
goto out;

@@ -751,7 +751,7 @@ static ssize_t proc_loginuid_write(struc
/* No partial writes. */
return -EINVAL;
}
- page = (char*)__get_free_page(GFP_USER);
+ page = (char*)__get_free_page(GFP_USER|__GFP_RECLAIMABLE);
if (!page)
return -ENOMEM;
length = -EFAULT;
@@ -933,7 +933,8 @@ static int do_proc_readlink(struct dentr
char __user *buffer, int buflen)
{
struct inode * inode;
- char *tmp = (char*)__get_free_page(GFP_KERNEL), *path;
+ char *tmp = (char*)__get_free_page(GFP_KERNEL|__GFP_RECLAIMABLE);
+ char *path;
int len;

if (!tmp)
@@ -1566,7 +1567,7 @@ static ssize_t proc_pid_attr_read(struct
if (count > PAGE_SIZE)
count = PAGE_SIZE;
length = -ENOMEM;
- if (!(page = __get_free_page(GFP_KERNEL)))
+ if (!(page = __get_free_page(GFP_KERNEL|__GFP_RECLAIMABLE)))
goto out;

length = security_getprocattr(task,
@@ -1601,7 +1602,7 @@ static ssize_t proc_pid_attr_write(struc
goto out;

length = -ENOMEM;
- page = (char*)__get_free_page(GFP_USER);
+ page = (char*)__get_free_page(GFP_USER|__GFP_RECLAIMABLE);
if (!page)
goto out;

diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.19-rc5-mm2-007_movefree/fs/proc/generic.c linux-2.6.19-rc5-mm2-008_reclaimable/fs/proc/generic.c
--- linux-2.6.19-rc5-mm2-007_movefree/fs/proc/generic.c 2006-11-14 14:01:37.000000000 +0000
+++ linux-2.6.19-rc5-mm2-008_reclaimable/fs/proc/generic.c 2006-11-21 10:57:46.000000000 +0000
@@ -73,7 +73,7 @@ proc_file_read(struct file *file, char _
nbytes = MAX_NON_LFS - pos;

dp = PDE(inode);
- if (!(page = (char*) __get_free_page(GFP_KERNEL)))
+ if (!(page = (char*) __get_free_page(GFP_KERNEL|__GFP_RECLAIMABLE)))
return -ENOMEM;

while ((nbytes > 0) && !eof) {
diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.19-rc5-mm2-007_movefree/fs/reiserfs/super.c linux-2.6.19-rc5-mm2-008_reclaimable/fs/reiserfs/super.c
--- linux-2.6.19-rc5-mm2-007_movefree/fs/reiserfs/super.c 2006-11-14 14:01:37.000000000 +0000
+++ linux-2.6.19-rc5-mm2-008_reclaimable/fs/reiserfs/super.c 2006-11-21 10:57:46.000000000 +0000
@@ -496,7 +496,8 @@ static struct inode *reiserfs_alloc_inod
{
struct reiserfs_inode_info *ei;
ei = (struct reiserfs_inode_info *)
- kmem_cache_alloc(reiserfs_inode_cachep, SLAB_KERNEL);
+ kmem_cache_alloc(reiserfs_inode_cachep,
+ SLAB_KERNEL|__GFP_RECLAIMABLE);
if (!ei)
return NULL;
return &ei->vfs_inode;
diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.19-rc5-mm2-007_movefree/include/linux/gfp.h linux-2.6.19-rc5-mm2-008_reclaimable/include/linux/gfp.h
--- linux-2.6.19-rc5-mm2-007_movefree/include/linux/gfp.h 2006-11-21 10:47:11.000000000 +0000
+++ linux-2.6.19-rc5-mm2-008_reclaimable/include/linux/gfp.h 2006-11-21 10:57:46.000000000 +0000
@@ -46,9 +46,10 @@ struct vm_area_struct;
#define __GFP_NOMEMALLOC ((__force gfp_t)0x10000u) /* Don't use emergency reserves */
#define __GFP_HARDWALL ((__force gfp_t)0x20000u) /* Enforce hardwall cpuset memory allocs */
#define __GFP_THISNODE ((__force gfp_t)0x40000u)/* No fallback, no policies */
-#define __GFP_MOVABLE ((__force gfp_t)0x80000u) /* Page is movable */
+#define __GFP_RECLAIMABLE ((__force gfp_t)0x80000u) /* Page is reclaimable */
+#define __GFP_MOVABLE ((__force gfp_t)0x100000u) /* Page is movable */

-#define __GFP_BITS_SHIFT 20 /* Room for 20 __GFP_FOO bits */
+#define __GFP_BITS_SHIFT 21 /* Room for 21 __GFP_FOO bits */
#define __GFP_BITS_MASK ((__force gfp_t)((1 << __GFP_BITS_SHIFT) - 1))

/* if you forget to add the bitmask here kernel will crash, period */
@@ -56,7 +57,10 @@ struct vm_area_struct;
__GFP_COLD|__GFP_NOWARN|__GFP_REPEAT| \
__GFP_NOFAIL|__GFP_NORETRY|__GFP_NO_GROW|__GFP_COMP| \
__GFP_NOMEMALLOC|__GFP_HARDWALL|__GFP_THISNODE|\
- __GFP_MOVABLE)
+ __GFP_RECLAIMABLE|__GFP_MOVABLE)
+
+/* This mask makes up all the page movable related flags */
+#define GFP_MOVABLE_MASK (__GFP_RECLAIMABLE|__GFP_MOVABLE)

/* This equals 0, but use constants in case they ever change */
#define GFP_NOWAIT (GFP_ATOMIC & ~__GFP_HIGH)
@@ -102,6 +106,12 @@ static inline enum zone_type gfp_zone(gf
return ZONE_NORMAL;
}

+static inline gfp_t set_migrateflags(gfp_t gfp, gfp_t migrate_flags)
+{
+ BUG_ON((gfp & GFP_MOVABLE_MASK) == GFP_MOVABLE_MASK);
+ return (gfp & ~(GFP_MOVABLE_MASK)) | migrate_flags;
+}
+
/*
* There is only one page-allocator function, and two main namespaces to
* it. The alloc_page*() variants return 'struct page *' and as such
diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.19-rc5-mm2-007_movefree/include/linux/mmzone.h linux-2.6.19-rc5-mm2-008_reclaimable/include/linux/mmzone.h
--- linux-2.6.19-rc5-mm2-007_movefree/include/linux/mmzone.h 2006-11-21 10:52:26.000000000 +0000
+++ linux-2.6.19-rc5-mm2-008_reclaimable/include/linux/mmzone.h 2006-11-21 10:57:46.000000000 +0000
@@ -25,12 +25,14 @@
#define MAX_ORDER_NR_PAGES (1 << (MAX_ORDER - 1))

#ifdef CONFIG_PAGE_CLUSTERING
-#define MIGRATE_UNMOVABLE 0
-#define MIGRATE_MOVABLE 1
-#define MIGRATE_TYPES 2
-#else
-#define MIGRATE_UNMOVABLE 0
-#define MIGRATE_MOVABLE 0
+#define MIGRATE_UNMOVABLE 0
+#define MIGRATE_RECLAIMABLE 1
+#define MIGRATE_MOVABLE 2
+#define MIGRATE_TYPES 3
+#else
+#define MIGRATE_UNMOVABLE 0
+#define MIGRATE_RECLAIMABLE 0
+#define MIGRATE_MOVABLE 0
#define MIGRATE_TYPES 1
#endif

diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.19-rc5-mm2-007_movefree/include/linux/page-flags.h linux-2.6.19-rc5-mm2-008_reclaimable/include/linux/page-flags.h
--- linux-2.6.19-rc5-mm2-007_movefree/include/linux/page-flags.h 2006-11-21 10:48:55.000000000 +0000
+++ linux-2.6.19-rc5-mm2-008_reclaimable/include/linux/page-flags.h 2006-11-21 10:57:46.000000000 +0000
@@ -82,18 +82,28 @@
#define PG_private 11 /* If pagecache, has fs-private data */

#define PG_writeback 12 /* Page is under writeback */
-#define PG_nosave 13 /* Used for system suspend/resume */
#define PG_compound 14 /* Part of a compound page */
#define PG_swapcache 15 /* Swap page: swp_entry_t in private */

#define PG_mappedtodisk 16 /* Has blocks allocated on-disk */
#define PG_reclaim 17 /* To be reclaimed asap */
-#define PG_nosave_free 18 /* Used for system suspend/resume */
#define PG_buddy 19 /* Page is free, on buddy lists */

#define PG_readahead 20 /* Reminder to do readahead */

-#define PG_movable 21 /* Page may be moved */
+/*
+ * As page clustering requires two flags, it was best to reuse the suspend
+ * flags and make page clustering depend on !SOFTWARE_SUSPEND. This works
+ * on the assumption that machines being suspended do not really care about
+ * large contiguous allocations.
+ */
+#ifndef CONFIG_PAGE_CLUSTERING
+#define PG_nosave 13 /* Used for system suspend/resume */
+#define PG_nosave_free 18 /* Free, should not be written */
+#else
+#define PG_reclaimable 13 /* Page is reclaimable */
+#define PG_movable 18 /* Page is movable */
+#endif

#if (BITS_PER_LONG > 32)
/*
@@ -211,6 +221,7 @@ static inline void SetPageUptodate(struc
ret; \
})

+#ifndef CONFIG_PAGE_CLUSTERING
#define PageNosave(page) test_bit(PG_nosave, &(page)->flags)
#define SetPageNosave(page) set_bit(PG_nosave, &(page)->flags)
#define TestSetPageNosave(page) test_and_set_bit(PG_nosave, &(page)->flags)
@@ -221,6 +232,33 @@ static inline void SetPageUptodate(struc
#define SetPageNosaveFree(page) set_bit(PG_nosave_free, &(page)->flags)
#define ClearPageNosaveFree(page) clear_bit(PG_nosave_free, &(page)->flags)

+#define PageReclaimable(page) (0)
+#define SetPageReclaimable(page) do {} while (0)
+#define ClearPageReclaimable(page) do {} while (0)
+#define __SetPageReclaimable(page) do {} while (0)
+#define __ClearPageReclaimable(page) do {} while (0)
+
+#define PageMovable(page) (0)
+#define SetPageMovable(page) do {} while (0)
+#define ClearPageMovable(page) do {} while (0)
+#define __SetPageMovable(page) do {} while (0)
+#define __ClearPageMovable(page) do {} while (0)
+
+#else
+
+#define PageReclaimable(page) test_bit(PG_reclaimable, &(page)->flags)
+#define SetPageReclaimable(page) set_bit(PG_reclaimable, &(page)->flags)
+#define ClearPageReclaimable(page) clear_bit(PG_reclaimable, &(page)->flags)
+#define __SetPageReclaimable(page) __set_bit(PG_reclaimable, &(page)->flags)
+#define __ClearPageReclaimable(page) __clear_bit(PG_reclaimable, &(page)->flags)
+
+#define PageMovable(page) test_bit(PG_movable, &(page)->flags)
+#define SetPageMovable(page) set_bit(PG_movable, &(page)->flags)
+#define ClearPageMovable(page) clear_bit(PG_movable, &(page)->flags)
+#define __SetPageMovable(page) __set_bit(PG_movable, &(page)->flags)
+#define __ClearPageMovable(page) __clear_bit(PG_movable, &(page)->flags)
+#endif /* CONFIG_PAGE_CLUSTERING */
+
#define PageBuddy(page) test_bit(PG_buddy, &(page)->flags)
#define __SetPageBuddy(page) __set_bit(PG_buddy, &(page)->flags)
#define __ClearPageBuddy(page) __clear_bit(PG_buddy, &(page)->flags)
@@ -254,12 +292,6 @@ static inline void SetPageUptodate(struc
#define SetPageReadahead(page) set_bit(PG_readahead, &(page)->flags)
#define TestClearPageReadahead(page) test_and_clear_bit(PG_readahead, &(page)->flags)

-#define PageMovable(page) test_bit(PG_movable, &(page)->flags)
-#define SetPageMovable(page) set_bit(PG_movable, &(page)->flags)
-#define ClearPageMovable(page) clear_bit(PG_movable, &(page)->flags)
-#define __SetPageMovable(page) __set_bit(PG_movable, &(page)->flags)
-#define __ClearPageMovable(page) __clear_bit(PG_movable, &(page)->flags)
-
struct page; /* forward declaration */

int test_clear_page_dirty(struct page *page);
diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.19-rc5-mm2-007_movefree/init/Kconfig linux-2.6.19-rc5-mm2-008_reclaimable/init/Kconfig
--- linux-2.6.19-rc5-mm2-007_movefree/init/Kconfig 2006-11-21 10:52:26.000000000 +0000
+++ linux-2.6.19-rc5-mm2-008_reclaimable/init/Kconfig 2006-11-21 10:57:46.000000000 +0000
@@ -502,6 +502,7 @@ config SLOB

config PAGE_CLUSTERING
bool "Cluster movable pages together in the page allocator"
+ depends on !SOFTWARE_SUSPEND
def_bool n
help
The standard allocator will fragment memory over time which means
diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.19-rc5-mm2-007_movefree/lib/radix-tree.c linux-2.6.19-rc5-mm2-008_reclaimable/lib/radix-tree.c
--- linux-2.6.19-rc5-mm2-007_movefree/lib/radix-tree.c 2006-11-14 14:01:37.000000000 +0000
+++ linux-2.6.19-rc5-mm2-008_reclaimable/lib/radix-tree.c 2006-11-21 10:57:46.000000000 +0000
@@ -93,7 +93,8 @@ radix_tree_node_alloc(struct radix_tree_
struct radix_tree_node *ret;
gfp_t gfp_mask = root_gfp_mask(root);

- ret = kmem_cache_alloc(radix_tree_node_cachep, gfp_mask);
+ ret = kmem_cache_alloc(radix_tree_node_cachep,
+ set_migrateflags(gfp_mask, __GFP_RECLAIMABLE));
if (ret == NULL && !(gfp_mask & __GFP_WAIT)) {
struct radix_tree_preload *rtp;

@@ -137,7 +138,8 @@ int radix_tree_preload(gfp_t gfp_mask)
rtp = &__get_cpu_var(radix_tree_preloads);
while (rtp->nr < ARRAY_SIZE(rtp->nodes)) {
preempt_enable();
- node = kmem_cache_alloc(radix_tree_node_cachep, gfp_mask);
+ node = kmem_cache_alloc(radix_tree_node_cachep,
+ set_migrateflags(gfp_mask, __GFP_RECLAIMABLE));
if (node == NULL)
goto out;
preempt_disable();
diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.19-rc5-mm2-007_movefree/mm/page_alloc.c linux-2.6.19-rc5-mm2-008_reclaimable/mm/page_alloc.c
--- linux-2.6.19-rc5-mm2-007_movefree/mm/page_alloc.c 2006-11-21 10:56:06.000000000 +0000
+++ linux-2.6.19-rc5-mm2-008_reclaimable/mm/page_alloc.c 2006-11-21 10:57:46.000000000 +0000
@@ -139,12 +139,15 @@ static unsigned long __initdata dma_rese
#ifdef CONFIG_PAGE_CLUSTERING
static inline int get_page_migratetype(struct page *page)
{
- return (PageMovable(page) != 0);
+ return ((PageMovable(page) != 0) << 1) | (PageReclaimable(page) != 0);
}

static inline int gfpflags_to_migratetype(gfp_t gfp_flags)
{
- return ((gfp_flags & __GFP_MOVABLE) != 0);
+ WARN_ON((gfp_flags & GFP_MOVABLE_MASK) == GFP_MOVABLE_MASK);
+
+ return (((gfp_flags & __GFP_MOVABLE) != 0) << 1) |
+ ((gfp_flags & __GFP_RECLAIMABLE) != 0);
}
#else
static inline int get_page_migratetype(struct page *page)
@@ -440,6 +443,7 @@ static inline void __free_one_page(struc
* will result in less bit manipulations
*/
__SetPageMovable(page);
+ __ClearPageReclaimable(page);

VM_BUG_ON(page_idx & (order_size - 1));
VM_BUG_ON(bad_range(zone, page));
@@ -717,6 +721,12 @@ int move_freepages_block(struct zone *zo
return move_freepages(zone, start_page, end_page, migratetype);
}

+static int fallbacks[MIGRATE_TYPES][MIGRATE_TYPES] = {
+ { MIGRATE_RECLAIMABLE, MIGRATE_MOVABLE }, /* UNMOVABLE Fallback */
+ { MIGRATE_UNMOVABLE, MIGRATE_MOVABLE }, /* RECLAIMABLE Fallback */
+ { MIGRATE_RECLAIMABLE, MIGRATE_UNMOVABLE} /* MOVABLE Fallback */
+};
+
/* Remove an element from the buddy allocator from the fallback list */
static struct page *__rmqueue_fallback(struct zone *zone, int order,
int start_migratetype)
@@ -724,30 +734,36 @@ static struct page *__rmqueue_fallback(s
struct free_area * area;
int current_order;
struct page *page;
- int migratetype = !start_migratetype;
+ int migratetype, i;

/* Find the largest possible block of pages in the other list */
for (current_order = MAX_ORDER-1; current_order >= order;
--current_order) {
- area = &(zone->free_area[current_order]);
- if (list_empty(&area->free_list[migratetype]))
- continue;
+ for (i = 0; i < MIGRATE_TYPES - 1; i++) {
+ migratetype = fallbacks[start_migratetype][i];

- page = list_entry(area->free_list[migratetype].next,
- struct page, lru);
- area->nr_free--;
+ area = &(zone->free_area[current_order]);
+ if (list_empty(&area->free_list[migratetype]))
+ continue;

- /* Remove the page from the freelists */
- list_del(&page->lru);
- rmv_page_order(page);
- zone->free_pages -= 1UL << order;
- expand(zone, page, order, current_order, area, migratetype);
+ page = list_entry(area->free_list[migratetype].next,
+ struct page, lru);
+ area->nr_free--;

- /* Move free pages between lists if stealing a large block */
- if (current_order > MAX_ORDER / 2)
- move_freepages_block(zone, page, start_migratetype);
+ /* Remove the page from the freelists */
+ list_del(&page->lru);
+ rmv_page_order(page);
+ zone->free_pages -= 1UL << order;
+ expand(zone, page, order, current_order, area,
+ start_migratetype);
+
+ /* Move free pages between lists for large blocks */
+ if (current_order >= MAX_ORDER / 2)
+ move_freepages_block(zone, page,
+ start_migratetype);

- return page;
+ return page;
+ }
}

return NULL;
@@ -802,9 +818,12 @@ static struct page *__rmqueue(struct zon
page = __rmqueue_fallback(zone, order, migratetype);

got_page:
- if (unlikely(migratetype == MIGRATE_UNMOVABLE) && page)
+ if (unlikely(migratetype != MIGRATE_MOVABLE) && page)
__ClearPageMovable(page);

+ if (migratetype == MIGRATE_RECLAIMABLE && page)
+ __SetPageReclaimable(page);
+
return page;
}

@@ -891,7 +910,7 @@ static void __drain_pages(unsigned int c
}
#endif /* CONFIG_DRAIN_PERCPU_PAGES */

-#ifdef CONFIG_PM
+#ifdef CONFIG_SOFTWARE_SUSPEND

void mark_free_pages(struct zone *zone)
{
diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.19-rc5-mm2-007_movefree/mm/shmem.c linux-2.6.19-rc5-mm2-008_reclaimable/mm/shmem.c
--- linux-2.6.19-rc5-mm2-007_movefree/mm/shmem.c 2006-11-14 14:01:37.000000000 +0000
+++ linux-2.6.19-rc5-mm2-008_reclaimable/mm/shmem.c 2006-11-21 10:57:46.000000000 +0000
@@ -94,7 +94,8 @@ static inline struct page *shmem_dir_all
* BLOCKS_PER_PAGE on indirect pages, assume PAGE_CACHE_SIZE:
* might be reconsidered if it ever diverges from PAGE_SIZE.
*/
- return alloc_pages(gfp_mask, PAGE_CACHE_SHIFT-PAGE_SHIFT);
+ return alloc_pages(set_migrateflags(gfp_mask, __GFP_RECLAIMABLE),
+ PAGE_CACHE_SHIFT-PAGE_SHIFT);
}

static inline void shmem_dir_free(struct page *page)
@@ -976,7 +977,9 @@ shmem_alloc_page(gfp_t gfp, struct shmem
pvma.vm_policy = mpol_shared_policy_lookup(&info->policy, idx);
pvma.vm_pgoff = idx;
pvma.vm_end = PAGE_SIZE;
- page = alloc_page_vma(gfp | __GFP_ZERO, &pvma, 0);
+ page = alloc_page_vma(
+ set_migrateflags(gfp | __GFP_ZERO, __GFP_RECLAIMABLE),
+ &pvma, 0);
mpol_free(pvma.vm_policy);
return page;
}
@@ -996,7 +999,8 @@ shmem_swapin(struct shmem_inode_info *in
static inline struct page *
shmem_alloc_page(gfp_t gfp,struct shmem_inode_info *info, unsigned long idx)
{
- return alloc_page(gfp | __GFP_ZERO);
+ return alloc_page(
+ set_migrateflags(gfp | __GFP_ZERO, __GFP_RECLAIMABLE));
}
#endif

diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.19-rc5-mm2-007_movefree/net/core/skbuff.c linux-2.6.19-rc5-mm2-008_reclaimable/net/core/skbuff.c
--- linux-2.6.19-rc5-mm2-007_movefree/net/core/skbuff.c 2006-11-14 14:01:37.000000000 +0000
+++ linux-2.6.19-rc5-mm2-008_reclaimable/net/core/skbuff.c 2006-11-21 10:57:46.000000000 +0000
@@ -169,6 +169,7 @@ struct sk_buff *__alloc_skb(unsigned int
u8 *data;

cache = fclone ? skbuff_fclone_cache : skbuff_head_cache;
+ gfp_mask = set_migrateflags(gfp_mask, __GFP_RECLAIMABLE);

/* Get the HEAD */
skb = kmem_cache_alloc(cache, gfp_mask & ~__GFP_DMA);

2006-11-21 22:54:46

by Mel Gorman

[permalink] [raw]
Subject: [PATCH 11/11] Use pageblock flags for page clustering


This patch alters page clustering to use the pageblock bits to track how
movable a block of pages is.
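
In other words, with the MIGRATE_* values from the earlier mmzone.h change
and the two PB_migrate bits added below, the migrate type of a block is
stored as a small integer in the pageblock bitmap. A rough usage sketch,
equivalent to the helpers introduced in the hunk below:

  set_pageblock_flags_group(page, MIGRATE_MOVABLE, PB_migrate, PB_migrate_end);
  ...
  migratetype = get_pageblock_flags_group(page, PB_migrate, PB_migrate_end);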

Signed-off-by: Mel Gorman <[email protected]>
---

include/linux/pageblock-flags.h | 4 ++++
mm/page_alloc.c | 27 ++++++++++++++++++++++-----
2 files changed, 26 insertions(+), 5 deletions(-)

diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.19-rc5-mm2-102_remove_clustering_flags/include/linux/pageblock-flags.h linux-2.6.19-rc5-mm2-103_clustering_pageblock/include/linux/pageblock-flags.h
--- linux-2.6.19-rc5-mm2-102_remove_clustering_flags/include/linux/pageblock-flags.h 2006-11-21 11:23:20.000000000 +0000
+++ linux-2.6.19-rc5-mm2-103_clustering_pageblock/include/linux/pageblock-flags.h 2006-11-21 11:27:10.000000000 +0000
@@ -27,6 +27,10 @@

/* Bit indices that affect a whole block of pages */
enum pageblock_bits {
+#ifdef CONFIG_PAGE_CLUSTERING
+ PB_migrate,
+ PB_migrate_end = (PB_migrate + 2) - 1, /* 2 bits for migrate types */
+#endif /* CONFIG_PAGE_CLUSTERING */
NR_PAGEBLOCK_BITS
};

diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.19-rc5-mm2-102_remove_clustering_flags/mm/page_alloc.c linux-2.6.19-rc5-mm2-103_clustering_pageblock/mm/page_alloc.c
--- linux-2.6.19-rc5-mm2-102_remove_clustering_flags/mm/page_alloc.c 2006-11-21 11:52:50.000000000 +0000
+++ linux-2.6.19-rc5-mm2-103_clustering_pageblock/mm/page_alloc.c 2006-11-21 11:52:53.000000000 +0000
@@ -143,8 +143,15 @@ static unsigned long __initdata dma_rese
#endif /* CONFIG_ARCH_POPULATES_NODE_MAP */

#ifdef CONFIG_PAGE_CLUSTERING
-static inline int get_page_migratetype(struct page *page)
+static inline int get_pageblock_migratetype(struct page *page)
{
+ return get_pageblock_flags_group(page, PB_migrate, PB_migrate_end);
+}
+
+static void set_pageblock_migratetype(struct page *page, int migratetype)
+{
+ set_pageblock_flags_group(page, (unsigned long)migratetype,
+ PB_migrate, PB_migrate_end);
}

static inline int gfpflags_to_migratetype(gfp_t gfp_flags)
@@ -155,11 +162,15 @@ static inline int gfpflags_to_migratetyp
((gfp_flags & __GFP_RECLAIMABLE) != 0);
}
#else
-static inline int get_page_migratetype(struct page *page)
+static inline int get_pageblock_migratetype(struct page *page)
{
return MIGRATE_UNMOVABLE;
}

+static inline void set_pageblock_migratetype(struct page *page, int rclmtype)
+{
+}
+
static inline int gfpflags_to_migratetype(gfp_t gfp_flags)
{
return MIGRATE_UNMOVABLE;
@@ -435,7 +446,7 @@ static inline void __free_one_page(struc
{
unsigned long page_idx;
int order_size = 1 << order;
- int migratetype = get_page_migratetype(page);
+ int migratetype = get_pageblock_migratetype(page);

if (unlikely(PageCompound(page)))
destroy_compound_page(page, order);
@@ -715,6 +726,7 @@ int move_freepages_block(struct zone *zo
if (page_zone(page) != page_zone(end_page))
return 0;

+ set_pageblock_migratetype(start_page, migratetype);
return move_freepages(zone, start_page, end_page, migratetype);
}

@@ -834,6 +846,10 @@ static struct page *__rmqueue(struct zon
rmv_page_order(page);
area->nr_free--;
zone->free_pages -= 1UL << order;
+
+ if (current_order == MAX_ORDER - 1)
+ set_pageblock_migratetype(page, migratetype);
+
expand(zone, page, order, current_order, area, migratetype);
goto got_page;
}
@@ -1022,7 +1038,7 @@ static void fastcall free_hot_cold_page(
local_irq_save(flags);
__count_vm_event(PGFREE);
list_add(&page->lru, &pcp->list);
- set_page_private(page, get_page_migratetype(page));
+ set_page_private(page, get_pageblock_migratetype(page));
pcp->count++;
if (pcp->count >= pcp->high) {
free_pages_bulk(zone, pcp->batch, &pcp->list, 0);
@@ -2291,12 +2307,13 @@ void __meminit memmap_init_zone(unsigned
SetPageReserved(page);

/*
- * Mark the page movable so that blocks are reserved for
+ * Mark the block movable so that blocks are reserved for
* movable at startup. This will force kernel allocations
* to reserve their blocks rather than leaking throughout
* the address space during boot when many long-lived
* kernel allocations are made
*/
+ set_pageblock_migratetype(page, MIGRATE_MOVABLE);

INIT_LIST_HEAD(&page->lru);
#ifdef WANT_PAGE_VIRTUAL

2006-11-21 23:30:46

by Christoph Lameter

[permalink] [raw]
Subject: Re: [PATCH 1/11] Add __GFP_MOVABLE flag and update callers

Are GFP_HIGHUSER allocations always movable? It would reduce the size of
the patch if this would be added to GFP_HIGHUSER.


2006-11-21 23:43:54

by Mel Gorman

[permalink] [raw]
Subject: Re: [PATCH 1/11] Add __GFP_MOVABLE flag and update callers

On Tue, 21 Nov 2006, Christoph Lameter wrote:

> Are GFP_HIGHUSER allocations always movable? It would reduce the size of
> the patch if this would be added to GFP_HIGHUSER.
>

No, they aren't. Page tables allocated with HIGHPTE are currently not
movable for example. A number of drivers (infiniband for example) also use
__GFP_HIGHMEM that are not movable.

--
Mel Gorman
Part-time Phd Student Linux Technology Center
University of Limerick IBM Dublin Software Lab

2006-11-21 23:51:44

by Dave Hansen

[permalink] [raw]
Subject: Re: [PATCH 1/11] Add __GFP_MOVABLE flag and update callers

On Tue, 2006-11-21 at 23:43 +0000, Mel Gorman wrote:
> On Tue, 21 Nov 2006, Christoph Lameter wrote:
> > Are GFP_HIGHUSER allocations always movable? It would reduce the size of
> > the patch if this would be added to GFP_HIGHUSER.
>
> No, they aren't. Page tables allocated with HIGHPTE are currently not
> movable for example. A number of drivers (infiniband for example) also use
> __GFP_HIGHMEM that are not movable.

I think Christoph was saying that it might reduce the size of the patch
to include it by _default_. You could always go to the
weird^Wspecialized users and mask the bits back off.

We probably also need to start getting a nice list of those users which
are HIGH but not MOVABLE. This would provide that by default, I think.

-- Dave

2006-11-22 00:44:45

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH 1/11] Add __GFP_MOVABLE flag and update callers



On Tue, 21 Nov 2006, Mel Gorman wrote:
>
> On Tue, 21 Nov 2006, Christoph Lameter wrote:
>
> > Are GFP_HIGHUSER allocations always movable? It would reduce the size of
> > the patch if this would be added to GFP_HIGHUSER.
> >
>
> No, they aren't. Page tables allocated with HIGHPTE are currently not movable
> for example. A number of drivers (infiniband for example) also use
> __GFP_HIGHMEM that are not movable.

It might make sense to just use another GFP_HIGHxyzzy #define for the
non-movable HIGHMEM users. There's probably much fewer of those, and their
behaviour obviously is very different from the traditional GFP_HIGHUSER
pages (ie page cache and anonymous user mappings).

So you could literally use "GFP_HIGHPTE" for the PTE mappings, and that
would in fact even simplify some of the users (ie it would allow moving
the #ifdef CONFIG_HIGHPTE check from the code to <linux/gfp.h>). Similarly
for any other non-movable things, no?

So then we'd just make GFP_HIGHUSER implicitly mean "movable". It could be
nice if GFP_USER would do the same, but I guess we have too many of those
around to verify (although _most_ of those are probably kmalloc, and
kmalloc would obviously better strip away the __GFP_MOVABLE bit anyway).

Linus

2006-11-22 02:25:24

by Christoph Lameter

[permalink] [raw]
Subject: Re: [PATCH 1/11] Add __GFP_MOVABLE flag and update callers

On Tue, 21 Nov 2006, Mel Gorman wrote:

> On Tue, 21 Nov 2006, Christoph Lameter wrote:
>
> > Are GFP_HIGHUSER allocations always movable? It would reduce the size of
> > the patch if this would be added to GFP_HIGHUSER.
> No, they aren't. Page tables allocated with HIGHPTE are currently not movable
> for example. A number of drivers (infiniband for example) also use
> __GFP_HIGHMEM that are not movable.

HIGHPTE with __GFP_USER set? This is a page table page right?
pte_alloc_one does currently not set GFP_USER:

struct page *pte_alloc_one(struct mm_struct *mm, unsigned long address)
{
struct page *pte;

#ifdef CONFIG_HIGHPTE
pte =
alloc_pages(GFP_KERNEL|__GFP_HIGHMEM|__GFP_REPEAT|__GFP_ZERO, 0);
#else
pte = alloc_pages(GFP_KERNEL|__GFP_REPEAT|__GFP_ZERO, 0);
#endif
return pte;
}

How does infiniband insure that page migration does not move those pages?

2006-11-23 15:00:19

by Mel Gorman

[permalink] [raw]
Subject: Re: [PATCH 1/11] Add __GFP_MOVABLE flag and update callers

On Tue, 21 Nov 2006, Christoph Lameter wrote:

> On Tue, 21 Nov 2006, Mel Gorman wrote:
>
>> On Tue, 21 Nov 2006, Christoph Lameter wrote:
>>
>>> Are GFP_HIGHUSER allocations always movable? It would reduce the size of
>>> the patch if this would be added to GFP_HIGHUSER.
>> No, they aren't. Page tables allocated with HIGHPTE are currently not movable
>> for example. A number of drivers (infiniband for example) also use
>> __GFP_HIGHMEM that are not movable.
>
> HIGHPTE with __GFP_USER set? This is a page table page right?
> pte_alloc_one does currently not set GFP_USER:
>

What is __GFP_USER? The difference between GFP_USER and GFP_KERNEL is only
in the use of __GFP_HARDWALL. But HARDWALL on its own is not enough to
distinguish movable and non-movable.
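
For reference, the definitions in question (unchanged context from the gfp.h
hunk quoted later in this thread):

  #define GFP_KERNEL (__GFP_WAIT | __GFP_IO | __GFP_FS)
  #define GFP_USER (__GFP_WAIT | __GFP_IO | __GFP_FS | __GFP_HARDWALL)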

> struct page *pte_alloc_one(struct mm_struct *mm, unsigned long address)
> {
> struct page *pte;
>
> #ifdef CONFIG_HIGHPTE
> pte =
> alloc_pages(GFP_KERNEL|__GFP_HIGHMEM|__GFP_REPEAT|__GFP_ZERO, 0);
> #else
> pte = alloc_pages(GFP_KERNEL|__GFP_REPEAT|__GFP_ZERO, 0);
> #endif
> return pte;
> }
>
> How does infiniband insure that page migration does not move those pages?
>

I have not looked closely at infiniband and how it uses its pages.

--
Mel Gorman
Part-time Phd Student Linux Technology Center
University of Limerick IBM Dublin Software Lab

2006-11-23 16:36:17

by mel

[permalink] [raw]
Subject: Re: [PATCH 1/11] Add __GFP_MOVABLE flag and update callers

On (21/11/06 16:44), Linus Torvalds didst pronounce:
>
>
> On Tue, 21 Nov 2006, Mel Gorman wrote:
> >
> > On Tue, 21 Nov 2006, Christoph Lameter wrote:
> >
> > > Are GFP_HIGHUSER allocations always movable? It would reduce the size of
> > > the patch if this would be added to GFP_HIGHUSER.
> > >
> >
> > No, they aren't. Page tables allocated with HIGHPTE are currently not movable
> > for example. A number of drivers (infiniband for example) also use
> > __GFP_HIGHMEM that are not movable.
>
> It might make sense to just use another GFP_HIGHxyzzy #define for the
> non-movable HIGHMEM users. There's probably much fewer of those, and their
> behaviour obviously is very different from the traditional GFP_HIGHUSER
> pages (ie page cache and anonymous user mappings).
>

There are a surprising number of GFP_HIGHUSER users. I've included an
untested patch below to give an idea of what the reworked patch would
look like. What is most surprising about this is that there are driver and
filesystem allocations obeying cpuset limits. I wonder whether that was
really the intention, or whether GFP_HIGHUSER was seen as an easy (but
mistaken) way of identifying allocations for page cache and user mappings.

> So you could literally use "GFP_HIGHPTE" for the PTE mappings, and that
> would in fact even simplify some of the users (ie it would allow moving
> the #ifdef CONFIG_HIGHPTE check from the code to <linux/gfp.h>). Similarly
> for any other non-movable things, no?
>

As it turns out, HIGHPTE is just one of many cases where pinned allocations
use high memory. However, defining GFP_PTE and moving the CONFIG_HIGHPTE
check into <linux/gfp.h> is probably not a bad idea either way.
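
A GFP_PTE definition along those lines might look roughly like the following
(a sketch based on the pte_alloc_one() flags quoted above, not something in
this patch set):

  #ifdef CONFIG_HIGHPTE
  #define GFP_PTE (GFP_KERNEL | __GFP_HIGHMEM | __GFP_REPEAT | __GFP_ZERO)
  #else
  #define GFP_PTE (GFP_KERNEL | __GFP_REPEAT | __GFP_ZERO)
  #endif

  /* pte_alloc_one() then simply does */
  pte = alloc_pages(GFP_PTE, 0);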

> So then we'd just make GFP_HIGHUSER implicitly mean "movable". It could be
> nice if GFP_USER would do the same, but I guess we have too many of those
> around to verify (although _most_ of those are probably kmalloc, and
> kmalloc would obviously better strip away the __GFP_MOVABLE bit anyway).
>

Your hunch here is right. GFP_USER has about 30 callers that are not using
kmalloc(). As part of a cleanup, an audit could be made to identify GFP_USER
allocations that are movable and those that are not. A brief look shows that
most are not movable so the split is doable.

Here is an untested patch that creates a GFP_HIGHUNMOVABLE flag for
unmovable kernel allocations in high memory and leaves GFP_HIGHUSER for
movable allocations.

diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.19-rc5-mm2-clean/drivers/infiniband/hw/mthca/mthca_eq.c linux-2.6.19-rc5-mm2-001_clustering_flags/drivers/infiniband/hw/mthca/mthca_eq.c
--- linux-2.6.19-rc5-mm2-clean/drivers/infiniband/hw/mthca/mthca_eq.c 2006-11-08 02:24:20.000000000 +0000
+++ linux-2.6.19-rc5-mm2-001_clustering_flags/drivers/infiniband/hw/mthca/mthca_eq.c 2006-11-23 16:26:37.000000000 +0000
@@ -793,7 +793,7 @@ int __devinit mthca_map_eq_icm(struct mt
* memory, or 1 KB total.
*/
dev->eq_table.icm_virt = icm_virt;
- dev->eq_table.icm_page = alloc_page(GFP_HIGHUSER);
+ dev->eq_table.icm_page = alloc_page(GFP_HIGHUNMOVABLE);
if (!dev->eq_table.icm_page)
return -ENOMEM;
dev->eq_table.icm_dma = pci_map_page(dev->pdev, dev->eq_table.icm_page, 0,
diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.19-rc5-mm2-clean/drivers/infiniband/hw/mthca/mthca_main.c linux-2.6.19-rc5-mm2-001_clustering_flags/drivers/infiniband/hw/mthca/mthca_main.c
--- linux-2.6.19-rc5-mm2-clean/drivers/infiniband/hw/mthca/mthca_main.c 2006-11-08 02:24:20.000000000 +0000
+++ linux-2.6.19-rc5-mm2-001_clustering_flags/drivers/infiniband/hw/mthca/mthca_main.c 2006-11-23 16:26:37.000000000 +0000
@@ -342,7 +342,7 @@ static int __devinit mthca_load_fw(struc

mdev->fw.arbel.fw_icm =
mthca_alloc_icm(mdev, mdev->fw.arbel.fw_pages,
- GFP_HIGHUSER | __GFP_NOWARN);
+ GFP_HIGHUNMOVABLE | __GFP_NOWARN);
if (!mdev->fw.arbel.fw_icm) {
mthca_err(mdev, "Couldn't allocate FW area, aborting.\n");
return -ENOMEM;
@@ -404,7 +404,7 @@ static int __devinit mthca_init_icm(stru
(unsigned long long) aux_pages << 2);

mdev->fw.arbel.aux_icm = mthca_alloc_icm(mdev, aux_pages,
- GFP_HIGHUSER | __GFP_NOWARN);
+ GFP_HIGHUNMOVABLE | __GFP_NOWARN);
if (!mdev->fw.arbel.aux_icm) {
mthca_err(mdev, "Couldn't allocate aux memory, aborting.\n");
return -ENOMEM;
diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.19-rc5-mm2-clean/drivers/infiniband/hw/mthca/mthca_memfree.c linux-2.6.19-rc5-mm2-001_clustering_flags/drivers/infiniband/hw/mthca/mthca_memfree.c
--- linux-2.6.19-rc5-mm2-clean/drivers/infiniband/hw/mthca/mthca_memfree.c 2006-11-08 02:24:20.000000000 +0000
+++ linux-2.6.19-rc5-mm2-001_clustering_flags/drivers/infiniband/hw/mthca/mthca_memfree.c 2006-11-23 16:26:37.000000000 +0000
@@ -166,8 +166,8 @@ int mthca_table_get(struct mthca_dev *de
}

table->icm[i] = mthca_alloc_icm(dev, MTHCA_TABLE_CHUNK_SIZE >> PAGE_SHIFT,
- (table->lowmem ? GFP_KERNEL : GFP_HIGHUSER) |
- __GFP_NOWARN);
+ (table->lowmem ? GFP_KERNEL : GFP_HIGHUNMOVABLE) |
+ __GFP_NOWARN);
if (!table->icm[i]) {
ret = -ENOMEM;
goto out;
@@ -313,8 +313,8 @@ struct mthca_icm_table *mthca_alloc_icm_
chunk_size = nobj * obj_size - i * MTHCA_TABLE_CHUNK_SIZE;

table->icm[i] = mthca_alloc_icm(dev, chunk_size >> PAGE_SHIFT,
- (use_lowmem ? GFP_KERNEL : GFP_HIGHUSER) |
- __GFP_NOWARN);
+ (use_lowmem ? GFP_KERNEL : GFP_HIGHUNMOVABLE) |
+ __GFP_NOWARN);
if (!table->icm[i])
goto err;
if (mthca_MAP_ICM(dev, table->icm[i], virt + i * MTHCA_TABLE_CHUNK_SIZE,
diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.19-rc5-mm2-clean/drivers/kvm/kvm_main.c linux-2.6.19-rc5-mm2-001_clustering_flags/drivers/kvm/kvm_main.c
--- linux-2.6.19-rc5-mm2-clean/drivers/kvm/kvm_main.c 2006-11-23 15:08:51.000000000 +0000
+++ linux-2.6.19-rc5-mm2-001_clustering_flags/drivers/kvm/kvm_main.c 2006-11-23 16:26:37.000000000 +0000
@@ -1535,7 +1535,7 @@ raced:

memset(new.phys_mem, 0, npages * sizeof(struct page *));
for (i = 0; i < npages; ++i) {
- new.phys_mem[i] = alloc_page(GFP_HIGHUSER);
+ new.phys_mem[i] = alloc_page(GFP_HIGHUNMOVABLE);
if (!new.phys_mem[i])
goto out_free;
}
diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.19-rc5-mm2-clean/fs/block_dev.c linux-2.6.19-rc5-mm2-001_clustering_flags/fs/block_dev.c
--- linux-2.6.19-rc5-mm2-clean/fs/block_dev.c 2006-11-23 15:08:57.000000000 +0000
+++ linux-2.6.19-rc5-mm2-001_clustering_flags/fs/block_dev.c 2006-11-23 16:26:37.000000000 +0000
@@ -380,7 +380,7 @@ struct block_device *bdget(dev_t dev)
inode->i_rdev = dev;
inode->i_bdev = bdev;
inode->i_data.a_ops = &def_blk_aops;
- mapping_set_gfp_mask(&inode->i_data, GFP_USER);
+ mapping_set_gfp_mask(&inode->i_data, GFP_USER|__GFP_MOVABLE);
inode->i_data.backing_dev_info = &default_backing_dev_info;
spin_lock(&bdev_lock);
list_add(&bdev->bd_list, &all_bdevs);
diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.19-rc5-mm2-clean/fs/buffer.c linux-2.6.19-rc5-mm2-001_clustering_flags/fs/buffer.c
--- linux-2.6.19-rc5-mm2-clean/fs/buffer.c 2006-11-23 15:08:57.000000000 +0000
+++ linux-2.6.19-rc5-mm2-001_clustering_flags/fs/buffer.c 2006-11-23 16:26:37.000000000 +0000
@@ -1048,7 +1048,8 @@ grow_dev_page(struct block_device *bdev,
struct page *page;
struct buffer_head *bh;

- page = find_or_create_page(inode->i_mapping, index, GFP_NOFS);
+ page = find_or_create_page(inode->i_mapping, index,
+ GFP_NOFS|__GFP_MOVABLE);
if (!page)
return NULL;

@@ -2723,7 +2724,7 @@ int submit_bh(int rw, struct buffer_head
* from here on down, it's all bio -- do the initial mapping,
* submit_bio -> generic_make_request may further map this bio around
*/
- bio = bio_alloc(GFP_NOIO, 1);
+ bio = bio_alloc(GFP_NOIO|__GFP_MOVABLE, 1);

bio->bi_sector = bh->b_blocknr * (bh->b_size >> 9);
bio->bi_bdev = bh->b_bdev;
diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.19-rc5-mm2-clean/fs/ncpfs/mmap.c linux-2.6.19-rc5-mm2-001_clustering_flags/fs/ncpfs/mmap.c
--- linux-2.6.19-rc5-mm2-clean/fs/ncpfs/mmap.c 2006-11-23 15:08:58.000000000 +0000
+++ linux-2.6.19-rc5-mm2-001_clustering_flags/fs/ncpfs/mmap.c 2006-11-23 16:26:37.000000000 +0000
@@ -38,8 +38,8 @@ static struct page* ncp_file_mmap_nopage
int bufsize;
int pos;

- page = alloc_page(GFP_HIGHUSER); /* ncpfs has nothing against high pages
- as long as recvmsg and memset works on it */
+ page = alloc_page(GFP_HIGHUNMOVABLE); /* ncpfs has nothing against high
+ pages as long as recvmsg and memset works on it */
if (!page)
return page;
pg_addr = kmap(page);
diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.19-rc5-mm2-clean/fs/nfs/dir.c linux-2.6.19-rc5-mm2-001_clustering_flags/fs/nfs/dir.c
--- linux-2.6.19-rc5-mm2-clean/fs/nfs/dir.c 2006-11-23 15:08:58.000000000 +0000
+++ linux-2.6.19-rc5-mm2-001_clustering_flags/fs/nfs/dir.c 2006-11-23 16:26:37.000000000 +0000
@@ -471,7 +471,7 @@ int uncached_readdir(nfs_readdir_descrip
dfprintk(DIRCACHE, "NFS: uncached_readdir() searching for cookie %Lu\n",
(unsigned long long)*desc->dir_cookie);

- page = alloc_page(GFP_HIGHUSER);
+ page = alloc_page(GFP_HIGHUNMOVABLE);
if (!page) {
status = -ENOMEM;
goto out;
diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.19-rc5-mm2-clean/fs/pipe.c linux-2.6.19-rc5-mm2-001_clustering_flags/fs/pipe.c
--- linux-2.6.19-rc5-mm2-clean/fs/pipe.c 2006-11-23 15:08:58.000000000 +0000
+++ linux-2.6.19-rc5-mm2-001_clustering_flags/fs/pipe.c 2006-11-23 16:26:37.000000000 +0000
@@ -417,7 +417,7 @@ redo1:
int error, atomic = 1;

if (!page) {
- page = alloc_page(GFP_HIGHUSER);
+ page = alloc_page(GFP_HIGHUNMOVABLE);
if (unlikely(!page)) {
ret = ret ? : -ENOMEM;
break;
diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.19-rc5-mm2-clean/include/linux/gfp.h linux-2.6.19-rc5-mm2-001_clustering_flags/include/linux/gfp.h
--- linux-2.6.19-rc5-mm2-clean/include/linux/gfp.h 2006-11-23 15:09:01.000000000 +0000
+++ linux-2.6.19-rc5-mm2-001_clustering_flags/include/linux/gfp.h 2006-11-23 16:27:49.000000000 +0000
@@ -46,6 +46,7 @@ struct vm_area_struct;
#define __GFP_NOMEMALLOC ((__force gfp_t)0x10000u) /* Don't use emergency reserves */
#define __GFP_HARDWALL ((__force gfp_t)0x20000u) /* Enforce hardwall cpuset memory allocs */
#define __GFP_THISNODE ((__force gfp_t)0x40000u)/* No fallback, no policies */
+#define __GFP_MOVABLE ((__force gfp_t)0x80000u) /* Page is movable */

#define __GFP_BITS_SHIFT 20 /* Room for 20 __GFP_FOO bits */
#define __GFP_BITS_MASK ((__force gfp_t)((1 << __GFP_BITS_SHIFT) - 1))
@@ -54,7 +55,8 @@ struct vm_area_struct;
#define GFP_LEVEL_MASK (__GFP_WAIT|__GFP_HIGH|__GFP_IO|__GFP_FS| \
__GFP_COLD|__GFP_NOWARN|__GFP_REPEAT| \
__GFP_NOFAIL|__GFP_NORETRY|__GFP_NO_GROW|__GFP_COMP| \
- __GFP_NOMEMALLOC|__GFP_HARDWALL|__GFP_THISNODE)
+ __GFP_NOMEMALLOC|__GFP_HARDWALL|__GFP_THISNODE|\
+ __GFP_MOVABLE)

/* This equals 0, but use constants in case they ever change */
#define GFP_NOWAIT (GFP_ATOMIC & ~__GFP_HIGH)
@@ -65,7 +67,9 @@ struct vm_area_struct;
#define GFP_KERNEL (__GFP_WAIT | __GFP_IO | __GFP_FS)
#define GFP_USER (__GFP_WAIT | __GFP_IO | __GFP_FS | __GFP_HARDWALL)
#define GFP_HIGHUSER (__GFP_WAIT | __GFP_IO | __GFP_FS | __GFP_HARDWALL | \
- __GFP_HIGHMEM)
+ __GFP_HIGHMEM | __GFP_MOVABLE)
+#define GFP_HIGHUNMOVABLE (__GFP_WAIT | __GFP_IO | __GFP_FS | \
+ __GFP_HARDWALL | __GFP_HIGHMEM)

#ifdef CONFIG_NUMA
#define GFP_THISNODE (__GFP_THISNODE | __GFP_NOWARN | __GFP_NORETRY)
diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.19-rc5-mm2-clean/include/linux/mempolicy.h linux-2.6.19-rc5-mm2-001_clustering_flags/include/linux/mempolicy.h
--- linux-2.6.19-rc5-mm2-clean/include/linux/mempolicy.h 2006-11-08 02:24:20.000000000 +0000
+++ linux-2.6.19-rc5-mm2-001_clustering_flags/include/linux/mempolicy.h 2006-11-23 16:26:37.000000000 +0000
@@ -258,7 +258,7 @@ static inline void mpol_fix_fork_child_f
static inline struct zonelist *huge_zonelist(struct vm_area_struct *vma,
unsigned long addr)
{
- return NODE_DATA(0)->node_zonelists + gfp_zone(GFP_HIGHUSER);
+ return NODE_DATA(0)->node_zonelists + gfp_zone(GFP_HIGHUNMOVABLE);
}

static inline int do_migrate_pages(struct mm_struct *mm,
diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.19-rc5-mm2-clean/include/linux/slab.h linux-2.6.19-rc5-mm2-001_clustering_flags/include/linux/slab.h
--- linux-2.6.19-rc5-mm2-clean/include/linux/slab.h 2006-11-23 15:09:02.000000000 +0000
+++ linux-2.6.19-rc5-mm2-001_clustering_flags/include/linux/slab.h 2006-11-23 16:26:37.000000000 +0000
@@ -99,7 +99,7 @@ extern void *__kmalloc(size_t, gfp_t);
* %GFP_ATOMIC - Allocation will not sleep.
* For example, use this inside interrupt handlers.
*
- * %GFP_HIGHUSER - Allocate pages from high memory.
+ * %GFP_HIGHUNMOVABLE - Allocate pages from high memory.
*
* %GFP_NOIO - Do not do any I/O at all while trying to get memory.
*
diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.19-rc5-mm2-clean/kernel/kexec.c linux-2.6.19-rc5-mm2-001_clustering_flags/kernel/kexec.c
--- linux-2.6.19-rc5-mm2-clean/kernel/kexec.c 2006-11-23 15:09:03.000000000 +0000
+++ linux-2.6.19-rc5-mm2-001_clustering_flags/kernel/kexec.c 2006-11-23 16:26:37.000000000 +0000
@@ -774,7 +774,7 @@ static int kimage_load_normal_segment(st
char *ptr;
size_t uchunk, mchunk;

- page = kimage_alloc_page(image, GFP_HIGHUSER, maddr);
+ page = kimage_alloc_page(image, GFP_HIGHUNMOVABLE, maddr);
if (page == 0) {
result = -ENOMEM;
goto out;
diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.19-rc5-mm2-clean/mm/hugetlb.c linux-2.6.19-rc5-mm2-001_clustering_flags/mm/hugetlb.c
--- linux-2.6.19-rc5-mm2-clean/mm/hugetlb.c 2006-11-23 15:09:03.000000000 +0000
+++ linux-2.6.19-rc5-mm2-001_clustering_flags/mm/hugetlb.c 2006-11-23 16:26:37.000000000 +0000
@@ -73,7 +73,7 @@ static struct page *dequeue_huge_page(st

for (z = zonelist->zones; *z; z++) {
nid = zone_to_nid(*z);
- if (cpuset_zone_allowed(*z, GFP_HIGHUSER) &&
+ if (cpuset_zone_allowed(*z, GFP_HIGHUNMOVABLE) &&
!list_empty(&hugepage_freelists[nid]))
break;
}
@@ -103,7 +103,7 @@ static int alloc_fresh_huge_page(void)
{
static int nid = 0;
struct page *page;
- page = alloc_pages_node(nid, GFP_HIGHUSER|__GFP_COMP|__GFP_NOWARN,
+ page = alloc_pages_node(nid, GFP_HIGHUNMOVABLE|__GFP_COMP|__GFP_NOWARN,
HUGETLB_PAGE_ORDER);
nid = next_node(nid, node_online_map);
if (nid == MAX_NUMNODES)
diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.19-rc5-mm2-clean/mm/mempolicy.c linux-2.6.19-rc5-mm2-001_clustering_flags/mm/mempolicy.c
--- linux-2.6.19-rc5-mm2-clean/mm/mempolicy.c 2006-11-23 15:09:03.000000000 +0000
+++ linux-2.6.19-rc5-mm2-001_clustering_flags/mm/mempolicy.c 2006-11-23 16:26:37.000000000 +0000
@@ -1210,9 +1210,10 @@ struct zonelist *huge_zonelist(struct vm
unsigned nid;

nid = interleave_nid(pol, vma, addr, HPAGE_SHIFT);
- return NODE_DATA(nid)->node_zonelists + gfp_zone(GFP_HIGHUSER);
+ return NODE_DATA(nid)->node_zonelists +
+ gfp_zone(GFP_HIGHUNMOVABLE);
}
- return zonelist_policy(GFP_HIGHUSER, pol);
+ return zonelist_policy(GFP_HIGHUNMOVABLE, pol);
}
#endif

2006-11-23 17:12:34

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH 1/11] Add __GFP_MOVABLE flag and update callers



On Thu, 23 Nov 2006, Mel Gorman wrote:
>
> There are a suprising number of GFP_HIGHUSER users. I've included an
> untested patch below to give an idea of what the reworked patch would
> look like.

Thanks. Seeing the patch actually was useful, because I think this is a
good idea quite regardless of anything else: it adds a certain amount of
"inherent documentation" when you see a line like

page = alloc_page(GFP_HIGHUNMOVABLE);

because it makes it very obvious that something is going on.

At the same time, I do get the feeling that maybe we should simply go the
other way: talk about allocating MOVABLE pages instead of talking about
allocating pages that are NOT movable.

Because usually it's really the way you think about it: when you allocate
a _movable_ page, you need to add support for moving it some way (ie you
need to put it on the proper page-cache lists etc), while a page that you
don't think about is generally _not_ movable.

So: I think this is the right direction, but I would actually prefer to
see

page = alloc_page(GFP_[HIGH_]MOVABLE);

instead, and then just teach the routines that create movable pages
(whether they are movable because they are in the page cache, or for some
other reason) to use that flag instead of GFP_[HIGH]USER.

And the assumption would be that if it's MOVABLE, then it's obviously a
USER allocation (ie it can fail much more eagerly - that's really what the
whole USER bit ends up meaning internally).

Linus

2006-11-24 10:44:28

by mel

[permalink] [raw]
Subject: Re: [PATCH 1/11] Add __GFP_MOVABLE flag and update callers

On (23/11/06 09:11), Linus Torvalds didst pronounce:
>
>
> On Thu, 23 Nov 2006, Mel Gorman wrote:
> >
> > There are a surprising number of GFP_HIGHUSER users. I've included an
> > untested patch below to give an idea of what the reworked patch would
> > look like.
>
> Thanks. Seeing the patch actually was useful, because I think this is a
> good idea quite regardless of anything else: it adds a certain amount of
> "inherent documentation" when you see a line like
>
> page = alloc_page(GFP_HIGHUNMOVABLE);
>
> because it makes it very obvious that something is going on.
>
> At the same time, I do get the feeling that maybe we should simply go the
> other way: talk about allocating MOVABLE pages instead of talking about
> allocating pages that are NOT movable.
>

I tend to agree. If GFP_HIGHUSER remained the movable set of flags, a number
of out-of-tree and new drivers would continue to use it instead of
GFP_HIGHUNMOVABLE, so it would need to be periodically audited.

> Because usually it's really that way you think about it: when you allocate
> a _movable_ page, you need to add support for moving it some way (ie you
> need to put it on the proper page-cache lists etc), while a page that you
> don't think about is generally _not_ movable.
>

Good point.

> So: I think this is the right direction, but I would actually prefer to
> see
>
> page = alloc_page(GFP_[HIGH_]MOVABLE);
>
> instead, and then just teach the routines that create movable pages
> (whether they are movable because they are in the page cache, or for some
> other reason) to use that flag instead of GFP_[HIGH]USER.
>
> And the assumption would be that if it's MOVABLE, then it's obviously a
> USER allocation (ie it can fail much more eagerly - that's really what the
> whole USER bit ends up meaning internally).
>

This is what the (compile-tested-only on x86) patch looks like for
GFP_HIGH_MOVABLE. The remaining in-tree GFP_HIGHUSER users are infiniband,
kvm, ncpfs, nfs, pipes (possibly the most frequent user), m68knommu, hugepages
and kexec.

Signed-off-by: Mel Gorman <[email protected]>
diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.19-rc5-mm2-clean/fs/compat.c linux-2.6.19-rc5-mm2-mark_HIGH_MOVABLE/fs/compat.c
--- linux-2.6.19-rc5-mm2-clean/fs/compat.c 2006-11-14 14:01:37.000000000 +0000
+++ linux-2.6.19-rc5-mm2-mark_HIGH_MOVABLE/fs/compat.c 2006-11-24 10:17:48.000000000 +0000
@@ -1419,7 +1419,7 @@ static int compat_copy_strings(int argc,
page = bprm->page[i];
new = 0;
if (!page) {
- page = alloc_page(GFP_HIGHUSER);
+ page = alloc_page(GFP_HIGH_MOVABLE);
bprm->page[i] = page;
if (!page) {
ret = -ENOMEM;
diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.19-rc5-mm2-clean/fs/exec.c linux-2.6.19-rc5-mm2-mark_HIGH_MOVABLE/fs/exec.c
--- linux-2.6.19-rc5-mm2-clean/fs/exec.c 2006-11-14 14:01:37.000000000 +0000
+++ linux-2.6.19-rc5-mm2-mark_HIGH_MOVABLE/fs/exec.c 2006-11-24 10:17:48.000000000 +0000
@@ -239,7 +239,7 @@ static int copy_strings(int argc, char _
page = bprm->page[i];
new = 0;
if (!page) {
- page = alloc_page(GFP_HIGHUSER);
+ page = alloc_page(GFP_HIGH_MOVABLE);
bprm->page[i] = page;
if (!page) {
ret = -ENOMEM;
diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.19-rc5-mm2-clean/fs/inode.c linux-2.6.19-rc5-mm2-mark_HIGH_MOVABLE/fs/inode.c
--- linux-2.6.19-rc5-mm2-clean/fs/inode.c 2006-11-14 14:01:37.000000000 +0000
+++ linux-2.6.19-rc5-mm2-mark_HIGH_MOVABLE/fs/inode.c 2006-11-24 10:17:48.000000000 +0000
@@ -146,7 +146,7 @@ static struct inode *alloc_inode(struct
mapping->a_ops = &empty_aops;
mapping->host = inode;
mapping->flags = 0;
- mapping_set_gfp_mask(mapping, GFP_HIGHUSER);
+ mapping_set_gfp_mask(mapping, GFP_HIGH_MOVABLE);
mapping->assoc_mapping = NULL;
mapping->backing_dev_info = &default_backing_dev_info;

diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.19-rc5-mm2-clean/include/asm-alpha/page.h linux-2.6.19-rc5-mm2-mark_HIGH_MOVABLE/include/asm-alpha/page.h
--- linux-2.6.19-rc5-mm2-clean/include/asm-alpha/page.h 2006-11-08 02:24:20.000000000 +0000
+++ linux-2.6.19-rc5-mm2-mark_HIGH_MOVABLE/include/asm-alpha/page.h 2006-11-24 10:17:48.000000000 +0000
@@ -17,7 +17,7 @@
extern void clear_page(void *page);
#define clear_user_page(page, vaddr, pg) clear_page(page)

-#define alloc_zeroed_user_highpage(vma, vaddr) alloc_page_vma(GFP_HIGHUSER | __GFP_ZERO, vma, vmaddr)
+#define alloc_zeroed_user_highpage(vma, vaddr) alloc_page_vma(GFP_HIGH_MOVABLE | __GFP_ZERO, vma, vmaddr)
#define __HAVE_ARCH_ALLOC_ZEROED_USER_HIGHPAGE

extern void copy_page(void * _to, void * _from);
diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.19-rc5-mm2-clean/include/asm-cris/page.h linux-2.6.19-rc5-mm2-mark_HIGH_MOVABLE/include/asm-cris/page.h
--- linux-2.6.19-rc5-mm2-clean/include/asm-cris/page.h 2006-11-08 02:24:20.000000000 +0000
+++ linux-2.6.19-rc5-mm2-mark_HIGH_MOVABLE/include/asm-cris/page.h 2006-11-24 10:17:48.000000000 +0000
@@ -20,7 +20,7 @@
#define clear_user_page(page, vaddr, pg) clear_page(page)
#define copy_user_page(to, from, vaddr, pg) copy_page(to, from)

-#define alloc_zeroed_user_highpage(vma, vaddr) alloc_page_vma(GFP_HIGHUSER | __GFP_ZERO, vma, vaddr)
+#define alloc_zeroed_user_highpage(vma, vaddr) alloc_page_vma(GFP_HIGH_MOVABLE | __GFP_ZERO, vma, vaddr)
#define __HAVE_ARCH_ALLOC_ZEROED_USER_HIGHPAGE

/*
diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.19-rc5-mm2-clean/include/asm-h8300/page.h linux-2.6.19-rc5-mm2-mark_HIGH_MOVABLE/include/asm-h8300/page.h
--- linux-2.6.19-rc5-mm2-clean/include/asm-h8300/page.h 2006-11-08 02:24:20.000000000 +0000
+++ linux-2.6.19-rc5-mm2-mark_HIGH_MOVABLE/include/asm-h8300/page.h 2006-11-24 10:17:48.000000000 +0000
@@ -22,7 +22,7 @@
#define clear_user_page(page, vaddr, pg) clear_page(page)
#define copy_user_page(to, from, vaddr, pg) copy_page(to, from)

-#define alloc_zeroed_user_highpage(vma, vaddr) alloc_page_vma(GFP_HIGHUSER | __GFP_ZERO, vma, vaddr)
+#define alloc_zeroed_user_highpage(vma, vaddr) alloc_page_vma(GFP_HIGH_MOVABLE | __GFP_ZERO, vma, vaddr)
#define __HAVE_ARCH_ALLOC_ZEROED_USER_HIGHPAGE

/*
diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.19-rc5-mm2-clean/include/asm-i386/page.h linux-2.6.19-rc5-mm2-mark_HIGH_MOVABLE/include/asm-i386/page.h
--- linux-2.6.19-rc5-mm2-clean/include/asm-i386/page.h 2006-11-14 14:01:37.000000000 +0000
+++ linux-2.6.19-rc5-mm2-mark_HIGH_MOVABLE/include/asm-i386/page.h 2006-11-24 10:17:48.000000000 +0000
@@ -35,7 +35,7 @@
#define clear_user_page(page, vaddr, pg) clear_page(page)
#define copy_user_page(to, from, vaddr, pg) copy_page(to, from)

-#define alloc_zeroed_user_highpage(vma, vaddr) alloc_page_vma(GFP_HIGHUSER | __GFP_ZERO, vma, vaddr)
+#define alloc_zeroed_user_highpage(vma, vaddr) alloc_page_vma(GFP_HIGH_MOVABLE|__GFP_ZERO, vma, vaddr)
#define __HAVE_ARCH_ALLOC_ZEROED_USER_HIGHPAGE

/*
diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.19-rc5-mm2-clean/include/asm-ia64/page.h linux-2.6.19-rc5-mm2-mark_HIGH_MOVABLE/include/asm-ia64/page.h
--- linux-2.6.19-rc5-mm2-clean/include/asm-ia64/page.h 2006-11-08 02:24:20.000000000 +0000
+++ linux-2.6.19-rc5-mm2-mark_HIGH_MOVABLE/include/asm-ia64/page.h 2006-11-24 10:17:48.000000000 +0000
@@ -89,7 +89,7 @@ do { \

#define alloc_zeroed_user_highpage(vma, vaddr) \
({ \
- struct page *page = alloc_page_vma(GFP_HIGHUSER | __GFP_ZERO, vma, vaddr); \
+ struct page *page = alloc_page_vma(GFP_HIGH_MOVABLE | __GFP_ZERO, vma, vaddr); \
if (page) \
flush_dcache_page(page); \
page; \
diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.19-rc5-mm2-clean/include/asm-m32r/page.h linux-2.6.19-rc5-mm2-mark_HIGH_MOVABLE/include/asm-m32r/page.h
--- linux-2.6.19-rc5-mm2-clean/include/asm-m32r/page.h 2006-11-08 02:24:20.000000000 +0000
+++ linux-2.6.19-rc5-mm2-mark_HIGH_MOVABLE/include/asm-m32r/page.h 2006-11-24 10:17:48.000000000 +0000
@@ -16,7 +16,7 @@ extern void copy_page(void *to, void *fr
#define clear_user_page(page, vaddr, pg) clear_page(page)
#define copy_user_page(to, from, vaddr, pg) copy_page(to, from)

-#define alloc_zeroed_user_highpage(vma, vaddr) alloc_page_vma(GFP_HIGHUSER | __GFP_ZERO, vma, vaddr)
+#define alloc_zeroed_user_highpage(vma, vaddr) alloc_page_vma(GFP_HIGH_MOVABLE | __GFP_ZERO, vma, vaddr)
#define __HAVE_ARCH_ALLOC_ZEROED_USER_HIGHPAGE

/*
diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.19-rc5-mm2-clean/include/asm-s390/page.h linux-2.6.19-rc5-mm2-mark_HIGH_MOVABLE/include/asm-s390/page.h
--- linux-2.6.19-rc5-mm2-clean/include/asm-s390/page.h 2006-11-08 02:24:20.000000000 +0000
+++ linux-2.6.19-rc5-mm2-mark_HIGH_MOVABLE/include/asm-s390/page.h 2006-11-24 10:17:48.000000000 +0000
@@ -64,7 +64,7 @@ static inline void copy_page(void *to, v
#define clear_user_page(page, vaddr, pg) clear_page(page)
#define copy_user_page(to, from, vaddr, pg) copy_page(to, from)

-#define alloc_zeroed_user_highpage(vma, vaddr) alloc_page_vma(GFP_HIGHUSER | __GFP_ZERO, vma, vaddr)
+#define alloc_zeroed_user_highpage(vma, vaddr) alloc_page_vma(GFP_HIGH_MOVABLE | __GFP_ZERO, vma, vaddr)
#define __HAVE_ARCH_ALLOC_ZEROED_USER_HIGHPAGE

/*
diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.19-rc5-mm2-clean/include/asm-x86_64/page.h linux-2.6.19-rc5-mm2-mark_HIGH_MOVABLE/include/asm-x86_64/page.h
--- linux-2.6.19-rc5-mm2-clean/include/asm-x86_64/page.h 2006-11-08 02:24:20.000000000 +0000
+++ linux-2.6.19-rc5-mm2-mark_HIGH_MOVABLE/include/asm-x86_64/page.h 2006-11-24 10:17:48.000000000 +0000
@@ -51,7 +51,7 @@ void copy_page(void *, void *);
#define clear_user_page(page, vaddr, pg) clear_page(page)
#define copy_user_page(to, from, vaddr, pg) copy_page(to, from)

-#define alloc_zeroed_user_highpage(vma, vaddr) alloc_page_vma(GFP_HIGHUSER | __GFP_ZERO, vma, vaddr)
+#define alloc_zeroed_user_highpage(vma, vaddr) alloc_page_vma(GFP_HIGH_MOVABLE|__GFP_ZERO, vma, vaddr)
#define __HAVE_ARCH_ALLOC_ZEROED_USER_HIGHPAGE
/*
* These are used to make use of C type-checking..
diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.19-rc5-mm2-clean/include/linux/gfp.h linux-2.6.19-rc5-mm2-mark_HIGH_MOVABLE/include/linux/gfp.h
--- linux-2.6.19-rc5-mm2-clean/include/linux/gfp.h 2006-11-14 14:01:37.000000000 +0000
+++ linux-2.6.19-rc5-mm2-mark_HIGH_MOVABLE/include/linux/gfp.h 2006-11-24 10:32:26.000000000 +0000
@@ -30,6 +30,9 @@ struct vm_area_struct;
* cannot handle allocation failures.
*
* __GFP_NORETRY: The VM implementation must not retry indefinitely.
+ *
+ * __GFP_MOVABLE: Flag that this page will be movable by the page migration
+ * mechanism
*/
#define __GFP_WAIT ((__force gfp_t)0x10u) /* Can wait and reschedule? */
#define __GFP_HIGH ((__force gfp_t)0x20u) /* Should access emergency pools? */
@@ -46,6 +49,7 @@ struct vm_area_struct;
#define __GFP_NOMEMALLOC ((__force gfp_t)0x10000u) /* Don't use emergency reserves */
#define __GFP_HARDWALL ((__force gfp_t)0x20000u) /* Enforce hardwall cpuset memory allocs */
#define __GFP_THISNODE ((__force gfp_t)0x40000u)/* No fallback, no policies */
+#define __GFP_MOVABLE ((__force gfp_t)0x80000u) /* Page is movable */

#define __GFP_BITS_SHIFT 20 /* Room for 20 __GFP_FOO bits */
#define __GFP_BITS_MASK ((__force gfp_t)((1 << __GFP_BITS_SHIFT) - 1))
@@ -54,7 +58,8 @@ struct vm_area_struct;
#define GFP_LEVEL_MASK (__GFP_WAIT|__GFP_HIGH|__GFP_IO|__GFP_FS| \
__GFP_COLD|__GFP_NOWARN|__GFP_REPEAT| \
__GFP_NOFAIL|__GFP_NORETRY|__GFP_NO_GROW|__GFP_COMP| \
- __GFP_NOMEMALLOC|__GFP_HARDWALL|__GFP_THISNODE)
+ __GFP_NOMEMALLOC|__GFP_HARDWALL|__GFP_THISNODE|\
+ __GFP_MOVABLE)

/* This equals 0, but use constants in case they ever change */
#define GFP_NOWAIT (GFP_ATOMIC & ~__GFP_HIGH)
@@ -66,6 +71,9 @@ struct vm_area_struct;
#define GFP_USER (__GFP_WAIT | __GFP_IO | __GFP_FS | __GFP_HARDWALL)
#define GFP_HIGHUSER (__GFP_WAIT | __GFP_IO | __GFP_FS | __GFP_HARDWALL | \
__GFP_HIGHMEM)
+#define GFP_HIGH_MOVABLE (__GFP_WAIT | __GFP_IO | __GFP_FS | \
+ __GFP_HARDWALL | __GFP_HIGHMEM | \
+ __GFP_MOVABLE)

#ifdef CONFIG_NUMA
#define GFP_THISNODE (__GFP_THISNODE | __GFP_NOWARN | __GFP_NORETRY)
diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.19-rc5-mm2-clean/include/linux/highmem.h linux-2.6.19-rc5-mm2-mark_HIGH_MOVABLE/include/linux/highmem.h
--- linux-2.6.19-rc5-mm2-clean/include/linux/highmem.h 2006-11-14 14:01:37.000000000 +0000
+++ linux-2.6.19-rc5-mm2-mark_HIGH_MOVABLE/include/linux/highmem.h 2006-11-24 10:17:48.000000000 +0000
@@ -65,7 +65,7 @@ static inline void clear_user_highpage(s
static inline struct page *
alloc_zeroed_user_highpage(struct vm_area_struct *vma, unsigned long vaddr)
{
- struct page *page = alloc_page_vma(GFP_HIGHUSER, vma, vaddr);
+ struct page *page = alloc_page_vma(GFP_HIGH_MOVABLE, vma, vaddr);

if (page)
clear_user_highpage(page, vaddr);
diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.19-rc5-mm2-clean/mm/memory.c linux-2.6.19-rc5-mm2-mark_HIGH_MOVABLE/mm/memory.c
--- linux-2.6.19-rc5-mm2-clean/mm/memory.c 2006-11-14 14:01:37.000000000 +0000
+++ linux-2.6.19-rc5-mm2-mark_HIGH_MOVABLE/mm/memory.c 2006-11-24 10:17:48.000000000 +0000
@@ -1564,7 +1564,7 @@ gotten:
if (!new_page)
goto oom;
} else {
- new_page = alloc_page_vma(GFP_HIGHUSER, vma, address);
+ new_page = alloc_page_vma(GFP_HIGH_MOVABLE, vma, address);
if (!new_page)
goto oom;
cow_user_page(new_page, old_page, address);
@@ -2188,7 +2188,7 @@ retry:

if (unlikely(anon_vma_prepare(vma)))
goto oom;
- page = alloc_page_vma(GFP_HIGHUSER, vma, address);
+ page = alloc_page_vma(GFP_HIGH_MOVABLE, vma, address);
if (!page)
goto oom;
copy_user_highpage(page, new_page, address);
diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.19-rc5-mm2-clean/mm/mempolicy.c linux-2.6.19-rc5-mm2-mark_HIGH_MOVABLE/mm/mempolicy.c
--- linux-2.6.19-rc5-mm2-clean/mm/mempolicy.c 2006-11-14 14:01:37.000000000 +0000
+++ linux-2.6.19-rc5-mm2-mark_HIGH_MOVABLE/mm/mempolicy.c 2006-11-24 10:33:04.000000000 +0000
@@ -598,7 +598,7 @@ static void migrate_page_add(struct page

static struct page *new_node_page(struct page *page, unsigned long node, int **x)
{
- return alloc_pages_node(node, GFP_HIGHUSER, 0);
+ return alloc_pages_node(node, GFP_HIGH_MOVABLE, 0);
}

/*
@@ -714,7 +714,7 @@ static struct page *new_vma_page(struct
{
struct vm_area_struct *vma = (struct vm_area_struct *)private;

- return alloc_page_vma(GFP_HIGHUSER, vma, page_address_in_vma(page, vma));
+ return alloc_page_vma(GFP_HIGH_MOVABLE, vma, page_address_in_vma(page, vma));
}
#else

diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.19-rc5-mm2-clean/mm/migrate.c linux-2.6.19-rc5-mm2-mark_HIGH_MOVABLE/mm/migrate.c
--- linux-2.6.19-rc5-mm2-clean/mm/migrate.c 2006-11-14 14:01:37.000000000 +0000
+++ linux-2.6.19-rc5-mm2-mark_HIGH_MOVABLE/mm/migrate.c 2006-11-24 10:17:48.000000000 +0000
@@ -748,7 +748,7 @@ static struct page *new_page_node(struct

*result = &pm->status;

- return alloc_pages_node(pm->node, GFP_HIGHUSER | GFP_THISNODE, 0);
+ return alloc_pages_node(pm->node, GFP_HIGH_MOVABLE | GFP_THISNODE, 0);
}

/*
diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.19-rc5-mm2-clean/mm/swap_prefetch.c linux-2.6.19-rc5-mm2-mark_HIGH_MOVABLE/mm/swap_prefetch.c
--- linux-2.6.19-rc5-mm2-clean/mm/swap_prefetch.c 2006-11-14 14:01:37.000000000 +0000
+++ linux-2.6.19-rc5-mm2-mark_HIGH_MOVABLE/mm/swap_prefetch.c 2006-11-24 10:17:48.000000000 +0000
@@ -204,7 +204,7 @@ static enum trickle_return trickle_swap_
* Get a new page to read from swap. We have already checked the
* watermarks so __alloc_pages will not call on reclaim.
*/
- page = alloc_pages_node(node, GFP_HIGHUSER & ~__GFP_WAIT, 0);
+ page = alloc_pages_node(node, GFP_HIGH_MOVABLE & ~__GFP_WAIT, 0);
if (unlikely(!page)) {
ret = TRICKLE_DELAY;
goto out;
diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.19-rc5-mm2-clean/mm/swap_state.c linux-2.6.19-rc5-mm2-mark_HIGH_MOVABLE/mm/swap_state.c
--- linux-2.6.19-rc5-mm2-clean/mm/swap_state.c 2006-11-14 14:01:37.000000000 +0000
+++ linux-2.6.19-rc5-mm2-mark_HIGH_MOVABLE/mm/swap_state.c 2006-11-24 10:17:48.000000000 +0000
@@ -343,7 +343,7 @@ struct page *read_swap_cache_async(swp_e
* Get a new page to read into from swap.
*/
if (!new_page) {
- new_page = alloc_page_vma(GFP_HIGHUSER, vma, addr);
+ new_page = alloc_page_vma(GFP_HIGH_MOVABLE, vma, addr);
if (!new_page)
break; /* Out of memory */
}

2006-11-24 17:59:32

by Christoph Lameter

[permalink] [raw]
Subject: Re: [PATCH 1/11] Add __GFP_MOVABLE flag and update callers

On Thu, 23 Nov 2006, Linus Torvalds wrote:

> And the assumption would be that if it's MOVABLE, then it's obviously a
> USER allocation (ie it can fail much more eagerly - that's really what the
> whole USER bit ends up meaning internally).

We can probably make several types of kernel allocations movable if there
would be some benefit from it.

Mel already has a problem with mlocked user pages in the movable section.
If this is fixed by using page migration to move the mlocked pages then we
can likely make additional classes of kernel pages also movable and reduce
the amount of memory that is unmovable. If we have more movable pages then
the defrag can work more efficiently. Having most pages movable will also
help to make memory unplug a reality.

So please do not require movable pages to be user allocations.

2006-11-24 18:12:17

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH 1/11] Add __GFP_MOVABLE flag and update callers



On Fri, 24 Nov 2006, Christoph Lameter wrote:
>
> So please do not require movable pages to be user allocations.

I don't think you read the whole sentence I wrote.

Go back: USER just means "it can fail much more eagerly". It really has
nothing to do with user-mode per se. It's just not so core that the kernel
cannot handle allocation failures, so it doesn't get to retry the
allocation so eagerly.

THAT is why "movable" is almost guaranteed to also imply USER. Not because
it's not a "kernel" allocation. After all, _all_ page allocations are
kernel allocations, it's just that some are more likely to be associated
with direct user requests, and some are more internal.

Linus

2006-11-24 19:57:38

by Hugh Dickins

[permalink] [raw]
Subject: Re: [PATCH 1/11] Add __GFP_MOVABLE flag and update callers

On Fri, 24 Nov 2006, Mel Gorman wrote:
>
> This is what the (compile-tested-only on x86) patch looks like for
> GFP_HIGH_MOVABLE. The remaining in-tree GFP_HIGHUSER users are infiniband,
> kvm, ncpfs, nfs, pipes (possibly the most frequent user), m68knommu, hugepages
> and kexec.
>
> Signed-off-by: Mel Gorman <[email protected]>

You need to add in something like the patch below (mutatis mutandis
for whichever approach you end up taking): tmpfs uses highmem pages
for its swap vector blocks, noting where on swap the data pages are,
and allocates them with mapping_gfp_mask(inode->i_mapping); but we
don't have any mechanism in place for reclaiming or migrating those.

(We could add that; but there might be better things to do instead.
I've often wanted to remove that whole layer from tmpfs, and note
swap entries in the pagecache's radix-tree slots instead - but that
does then lock them into low memory. Hum, haw, never decided.)

You can certainly be forgiven for missing that, and may well wonder
why it doesn't just use GFP_HIGHUSER explicitly: because the loop
driver may be on top of that tmpfs file, masking off __GFP_IO and
__GFP_FS: the swap vector blocks should be allocated with the same
restrictions as the data pages.
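
(For context, this is roughly what the loop driver does when it is bound to
a backing file; a from-memory sketch, not quoted from the patch series:

	/* loop_set_fd(): writes through the loop device must not recurse
	 * back into the backing filesystem, so IO/FS are masked off the
	 * backing mapping's allocation mask */
	lo->old_gfp_mask = mapping_gfp_mask(mapping);
	mapping_set_gfp_mask(mapping, lo->old_gfp_mask & ~(__GFP_IO | __GFP_FS));

so the swap vector blocks should inherit the same restrictions as the
data pages.)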

Excuse me for moving the __GFP_ZERO too: I think it's tidier to
do them both within the little helper function.

Hugh

--- 2.6.19-rc5-mm2/mm/shmem.c 2006-11-14 09:58:21.000000000 +0000
+++ linux/mm/shmem.c 2006-11-24 19:22:30.000000000 +0000
@@ -94,7 +94,8 @@ static inline struct page *shmem_dir_all
* BLOCKS_PER_PAGE on indirect pages, assume PAGE_CACHE_SIZE:
* might be reconsidered if it ever diverges from PAGE_SIZE.
*/
- return alloc_pages(gfp_mask, PAGE_CACHE_SHIFT-PAGE_SHIFT);
+ return alloc_pages((gfp_mask & ~__GFP_MOVABLE) | __GFP_ZERO,
+ PAGE_CACHE_SHIFT-PAGE_SHIFT);
}

static inline void shmem_dir_free(struct page *page)
@@ -372,7 +373,7 @@ static swp_entry_t *shmem_swp_alloc(stru
}

spin_unlock(&info->lock);
- page = shmem_dir_alloc(mapping_gfp_mask(inode->i_mapping) | __GFP_ZERO);
+ page = shmem_dir_alloc(mapping_gfp_mask(inode->i_mapping));
if (page)
set_page_private(page, 0);
spin_lock(&info->lock);

2006-11-24 20:04:53

by Mel Gorman

[permalink] [raw]
Subject: Re: [PATCH 1/11] Add __GFP_MOVABLE flag and update callers

On Fri, 24 Nov 2006, Christoph Lameter wrote:

> On Thu, 23 Nov 2006, Linus Torvalds wrote:
>
>> And the assumption would be that if it's MOVABLE, then it's obviously a
>> USER allocation (ie it can fail much more eagerly - that's really what the
>> whole USER bit ends up meaning internally).
>
> We can probably make several types of kernel allocations movable if there
> would be some benefit from it.
>

Page tables are the major type of allocation that comes to mind. From what
I've seen, they are the most common long-lived unmovable and unreclaimable
allocation.

> Mel already has a problem with mlocked user pages in the movable section.
> If this is fixed by using page migration to move the mlocked pages

That is the long-term plan.

> then we
> can likely make additional classes of kernel pages also movable and reduce
> the amount of memory that is unmovable. If we have more movable pages then
> the defrag can work more efficiently.

Indeed, although some sort of placement is still needed to keep these
movable allocations together.

> Having most pages movable will also
> help to make memory unplug a reality.
>
> So please do not require movable pages to be user allocations.
>

That is not the intention. It just happens that allocations that are
directly accessible by userspace are also the ones that are currently
movable.

--
Mel Gorman
Part-time Phd Student Linux Technology Center
University of Limerick IBM Dublin Software Lab

2006-11-24 20:13:49

by Mel Gorman

[permalink] [raw]
Subject: Re: [PATCH 1/11] Add __GFP_MOVABLE flag and update callers

On Fri, 24 Nov 2006, Hugh Dickins wrote:

> On Fri, 24 Nov 2006, Mel Gorman wrote:
>>
>> This is what the (compile-tested-only on x86) patch looks like for
>> GFP_HIGH_MOVABLE. The remaining in-tree GFP_HIGHUSER users are infiniband,
>> kvm, ncpfs, nfs, pipes (possibly the most frequent user), m68knommu, hugepages
>> and kexec.
>>
>> Signed-off-by: Mel Gorman <[email protected]>
>
> You need to add in something like the patch below (mutatis mutandis
> for whichever approach you end up taking): tmpfs uses highmem pages
> for its swap vector blocks, noting where on swap the data pages are,
> and allocates them with mapping_gfp_mask(inode->i_mapping); but we
> don't have any mechanism in place for reclaiming or migrating those.
>

Good catch. In the page clustering patches I work on, I am doing this;

- page = alloc_page_vma(gfp | __GFP_ZERO, &pvma, 0);
+ page = alloc_page_vma(
+ set_migrateflags(gfp | __GFP_ZERO, __GFP_RECLAIMABLE),
+ &pvma, 0);

to get rid of the MOVABLE flag and replace it with __GFP_RECLAIMABLE. This
clustered the allocations together with allocations like inode cache. In
retrospect, this was not a good idea because it assumes that tmpfs and
shmem pages are short-lived. That may not be the case at all.
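
For reference, set_migrateflags() is a helper from the clustering series
rather than the patch quoted above; a plausible sketch of it (assuming a
__GFP_RECLAIMABLE bit exists alongside __GFP_MOVABLE) would be:

	/* sketch only: clear any mobility hint already present in the
	 * gfp mask, then apply the one the caller asked for */
	static inline gfp_t set_migrateflags(gfp_t gfp, gfp_t migrate_flags)
	{
		return (gfp & ~(__GFP_MOVABLE | __GFP_RECLAIMABLE)) | migrate_flags;
	}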

> (We could add that; but there might be better things to do instead.

Indeed. I believe that making page tables movable would gain more.

> I've often wanted to remove that whole layer from tmpfs, and note
> swap entries in the pagecache's radix-tree slots instead - but that
> does then lock them into low memory. Hum, haw, never decided.)
>
> You can certainly be forgiven for missing that, and may well wonder
> why it doesn't just use GFP_HIGHUSER explicitly: because the loop
> driver may be on top of that tmpfs file, masking off __GFP_IO and
> __GFP_FS: the swap vector blocks should be allocated with the same
> restrictions as the data pages.
>

Thanks for that clarification. I suspected that something like this was
the case when I removed the MOVABLE flag and used RECLAIMABLE but I wasn't
100% certain. In the tests I was running, tmpfs pages weren't a major
problem so I didn't chase it down.

> Excuse me for moving the __GFP_ZERO too: I think it's tidier to
> do them both within the little helper function.
>

Agreed.

> Hugh
>
> --- 2.6.19-rc5-mm2/mm/shmem.c 2006-11-14 09:58:21.000000000 +0000
> +++ linux/mm/shmem.c 2006-11-24 19:22:30.000000000 +0000
> @@ -94,7 +94,8 @@ static inline struct page *shmem_dir_all
> * BLOCKS_PER_PAGE on indirect pages, assume PAGE_CACHE_SIZE:
> * might be reconsidered if it ever diverges from PAGE_SIZE.
> */
> - return alloc_pages(gfp_mask, PAGE_CACHE_SHIFT-PAGE_SHIFT);
> + return alloc_pages((gfp_mask & ~__GFP_MOVABLE) | __GFP_ZERO,
> + PAGE_CACHE_SHIFT-PAGE_SHIFT);
> }
>
> static inline void shmem_dir_free(struct page *page)
> @@ -372,7 +373,7 @@ static swp_entry_t *shmem_swp_alloc(stru
> }
>
> spin_unlock(&info->lock);
> - page = shmem_dir_alloc(mapping_gfp_mask(inode->i_mapping) | __GFP_ZERO);
> + page = shmem_dir_alloc(mapping_gfp_mask(inode->i_mapping));
> if (page)
> set_page_private(page, 0);
> spin_lock(&info->lock);
>

I'll roll this into the movable patch, run a proper test and post a new
patch Monday.

Thanks

--
Mel Gorman
Part-time Phd Student Linux Technology Center
University of Limerick IBM Dublin Software Lab

2006-11-24 21:05:49

by Hugh Dickins

[permalink] [raw]
Subject: Re: [PATCH 1/11] Add __GFP_MOVABLE flag and update callers

On Fri, 24 Nov 2006, Mel Gorman wrote:
>
> Good catch. In the page clustering patches I work on, I am doing this;
>
> - page = alloc_page_vma(gfp | __GFP_ZERO, &pvma, 0);
> + page = alloc_page_vma(
> + set_migrateflags(gfp | __GFP_ZERO, __GFP_RECLAIMABLE),
> + &pvma, 0);
>
> to get rid of the MOVABLE flag and replace it with __GFP_RECLAIMABLE. This
> clustered the allocations together with allocations like inode cache. In
> retrospect, this was not a good idea because it assumes that tmpfs and shmem
> pages are short-lived. That may not be the case at all.
>...
> Thanks for that clarification. I suspected that something like this was the
> case when I removed the MOVABLE flag and used RECLAIMABLE but I wasn't 100%
> certain. In the tests I was running, tmpfs pages weren't a major problem so I
> didn't chase it down.

I'm fairly confused as to what MOVABLE versus RECLAIMABLE is supposed to
mean, and understand it's in flux, so haven't tried too hard. Just
so long as you understand that tmpfs data pages go out to swap under memory
pressure, whereas ramfs pages do not, and tmpfs swap vector pages do not.

Hugh

2006-11-25 11:47:20

by Mel Gorman

[permalink] [raw]
Subject: Re: [PATCH 1/11] Add __GFP_MOVABLE flag and update callers

On Fri, 24 Nov 2006, Hugh Dickins wrote:

> On Fri, 24 Nov 2006, Mel Gorman wrote:
>>
>> Good catch. In the page clustering patches I work on, I am doing this;
>>
>> - page = alloc_page_vma(gfp | __GFP_ZERO, &pvma, 0);
>> + page = alloc_page_vma(
>> + set_migrateflags(gfp | __GFP_ZERO, __GFP_RECLAIMABLE),
>> + &pvma, 0);
>>
>> to get rid of the MOVABLE flag and replace it with __GFP_RECLAIMABLE. This
>> clustered the allocations together with allocations like inode cache. In
>> retrospect, this was not a good idea because it assumes that tmpfs and shmem
>> pages are short-lived. That may not be the case at all.
>> ...
>> Thanks for that clarification. I suspected that something like this was the
>> case when I removed the MOVABLE flag and used RECLAIMABLE but I wasn't 100%
>> certain. In the tests I was running, tmpfs pages weren't a major problem so I
>> didn't chase it down.
>
> I'm fairly confused as to what MOVABLE versus RECLAIMABLE is supposed to
> mean, and understand it's in flux, so haven't tried too hard.

A MOVABLE allocation may be moved with page migration or paged out by
kswapd.

RECLAIMABLE on the other hand applies to short-lived allocations (like a
socket buffer) or allocations for slab caches that may be reaped such as
inode caches or dcache.
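
As a rough illustration, not taken from any posted patch (set_migrateflags()
and __GFP_RECLAIMABLE are assumed from the clustering series), the two
classes might be requested like this:

	/* user-visible data: can be migrated or paged out, so group as MOVABLE */
	page = alloc_page_vma(GFP_HIGH_MOVABLE, vma, address);

	/* short-lived buffer or shrinkable cache page: group as RECLAIMABLE */
	buf_page = alloc_page(set_migrateflags(GFP_KERNEL, __GFP_RECLAIMABLE));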

> Just
> so long as you understand that tmpfs data pages go out to swap under memory
> pressure, whereas ramfs pages do not, and tmpfs swap vector pages do not.
>

Right, I'll take a much closer look with this in mind and make the
distinction. Thanks

--
Mel Gorman
Part-time Phd Student Linux Technology Center
University of Limerick IBM Dublin Software Lab

2006-11-25 19:02:28

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH 1/11] Add __GFP_MOVABLE flag and update callers



On Fri, 24 Nov 2006, Hugh Dickins wrote:
>
> You need to add in something like the patch below (mutatis mutandis
> for whichever approach you end up taking): tmpfs uses highmem pages
> for its swap vector blocks, noting where on swap the data pages are,
> and allocates them with mapping_gfp_mask(inode->i_mapping); but we
> don't have any mechanism in place for reclaiming or migrating those.

I think this really just points out that you should _not_ put MOVABLE into
the "mapping_gfp_mask()" at all.

The mapping_gfp_mask() should really just contain the "constraints" on
the allocation, not the "how the allocation is used". So things like "I
need all my pages to be in the 32bit DMA'able region" is a constraint on
the allocator, as is something like "I need the allocation to be atomic".

But MOVABLE is really not a constraint on the allocator, it's a guarantee
by the code _calling_ the allocator that it will then make sure that it
_uses_ the allocation in a way that means that it is movable.

So it shouldn't be a property of the mapping itself, it should always be a
property of the code that actually does the allocation.

Hmm?

Linus

2006-11-26 00:50:51

by Hugh Dickins

[permalink] [raw]
Subject: Re: [PATCH 1/11] Add __GFP_MOVABLE flag and update callers

On Sat, 25 Nov 2006, Linus Torvalds wrote:
> On Fri, 24 Nov 2006, Hugh Dickins wrote:
> >
> > You need to add in something like the patch below (mutatis mutandis
> > for whichever approach you end up taking): tmpfs uses highmem pages
> > for its swap vector blocks, noting where on swap the data pages are,
> > and allocates them with mapping_gfp_mask(inode->i_mapping); but we
> > don't have any mechanism in place for reclaiming or migrating those.
>
> I think this really just points out that you should _not_ put MOVABLE into
> the "mapping_gfp_mask()" at all.
>
> The mapping_gfp_mask() should really just contain the "constraints" on
> the allocation, not the "how the allocation is used". So things like "I
> need all my pages to be in the 32bit DMA'able region" is a constraint on
> the allocator, as is something like "I need the allocation to be atomic".
>
> But MOVABLE is really not a constraint on the allocator, it's a guarantee
> by the code _calling_ the allocator that it will then make sure that it
> _uses_ the allocation in a way that means that it is movable.
>
> So it shouldn't be a property of the mapping itself, it should always be a
> property of the code that actually does the allocation.
>
> Hmm?

Not anything I feel strongly about, but I don't see it that way.

mapping_gfp_mask() seems to me nothing more than a pragmatic way
of getting the appropriate gfp_mask down to page_cache_alloc().

alloc_inode() initializes it to whatever suits most filesystems
(currently GFP_HIGHUSER), and those who differ adjust it (e.g.
block_dev has good reason to avoid highmem so sets it to GFP_USER
instead). It used to be the case that several filesystems lacked
kmap() where needed, and those too would set GFP_USER: what you call
a constraint seems to me equally a property of the surrounding code.

If __GFP_MOVABLE is coming in, and most fs's are indeed allocating
movable pages, then I don't see why MOVABLE shouldn't be in the
mapping_gfp_mask. Specifying MOVABLE constrains both the caller's
use of the pages, and the way they are allocated; as does HIGHMEM.

And we shouldn't be guided by the way tmpfs (ab?)uses that gfp_mask
for its metadata allocations as well as its page_cache_alloc()s:
that's just a special case. Though the ramfs case is more telling
(its pagecache pages being not at present movable).

Hugh

2006-11-27 16:32:23

by Mel Gorman

[permalink] [raw]
Subject: Re: [PATCH 1/11] Add __GFP_MOVABLE flag and update callers

On Sun, 26 Nov 2006, Hugh Dickins wrote:

> On Sat, 25 Nov 2006, Linus Torvalds wrote:
>> On Fri, 24 Nov 2006, Hugh Dickins wrote:
>>>
>>> You need to add in something like the patch below (mutatis mutandis
>>> for whichever approach you end up taking): tmpfs uses highmem pages
>>> for its swap vector blocks, noting where on swap the data pages are,
>>> and allocates them with mapping_gfp_mask(inode->i_mapping); but we
>>> don't have any mechanism in place for reclaiming or migrating those.
>>
>> I think this really just points out that you should _not_ put MOVABLE into
>> the "mapping_gfp_mask()" at all.
>>
>> The mapping_gfp_mask() should really just contain the "constraints" on
>> the allocation, not the "how the allocation is used". So things like "I
>> need all my pages to be in the 32bit DMA'able region" is a constraint on
>> the allocator, as is something like "I need the allocation to be atomic".
>>
>> But MOVABLE is really not a constraint on the allocator, it's a guarantee
>> by the code _calling_ the allocator that it will then make sure that it
>> _uses_ the allocation in a way that means that it is movable.
>>

Later, MOVABLE might be a constraint. For example, hotpluggable nodes may
only allow MOVABLE allocations.

>> So it shouldn't be a property of the mapping itself, it should always be a
>> property of the code that actually does the allocation.
>>
>> Hmm?
>
> Not anything I feel strongly about, but I don't see it that way.
>
> mapping_gfp_mask() seems to me nothing more than a pragmatic way
> of getting the appropriate gfp_mask down to page_cache_alloc().
>

And that is important for any filesystem that uses generic_file_read(). As
page_cache_alloc() has no knowledge of the filesystem, it depends on the
mapping_gfp_mask() to determine if the pages are movable or not.
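
Paraphrasing the helper (and ignoring the cpuset memory-spread case),
page_cache_alloc() simply forwards the mapping's mask to the allocator:

	/* rough paraphrase of include/linux/pagemap.h, not an exact quote */
	static inline struct page *page_cache_alloc(struct address_space *x)
	{
		return alloc_pages(mapping_gfp_mask(x), 0);
	}

so whatever mobility hint the mapping carries is the one the page cache
allocation ends up with.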

> alloc_inode() initializes it to whatever suits most filesystems
> (currently GFP_HIGHUSER), and those who differ adjust it (e.g.
> block_dev has good reason to avoid highmem so sets it to GFP_USER
> instead). It used to be the case that several filesystems lacked
> kmap() where needed, and those too would set GFP_USER: what you call
> a constraint seems to me equally a property of the surrounding code.
>
> If __GFP_MOVABLE is coming in, and most fs's are indeed allocating
> movable pages, then I don't see why MOVABLE shouldn't be in the
> mapping_gfp_mask. Specifying MOVABLE constrains both the caller's
> use of the pages, and the way they are allocated; as does HIGHMEM.
>

From what I've seen, the majority of filesystems are suitable for using
__GFP_MOVABLE and it would be clearer to have GFP_HIGH_MOVABLE as the
default and setting GFP_HIGHUSER in filesystems like ramfs.

> And we shouldn't be guided by the way tmpfs (ab?)uses that gfp_mask
> for its metadata allocations as well as its page_cache_alloc()s:
> that's just a special case. Though the ramfs case is more telling
> (its pagecache pages being not at present movable).
>
> Hugh
>

--
Mel Gorman
Part-time Phd Student Linux Technology Center
University of Limerick IBM Dublin Software Lab

2006-11-27 17:29:31

by Christoph Lameter

[permalink] [raw]
Subject: Re: [PATCH 1/11] Add __GFP_MOVABLE flag and update callers

On Mon, 27 Nov 2006, Mel Gorman wrote:

> Later, MOVABLE might be a constraint. For example, hotpluggable nodes may only
> allow MOVABLE allocations.

Also, (a much more immediate need since we have memory hotplug already) if
a zone has hotpluggable memory then MOVABLE/unmovable allocations need to
be restricted to a portion of the zone. The plugin/plugout memory area of
a zone will not be able to tolerate unmovable allocations.

2006-11-27 19:48:14

by mel

[permalink] [raw]
Subject: Add __GFP_MOVABLE for callers to flag allocations that may be migrated

It is often known at allocation time when a page may be migrated or not. This
patch adds a flag called __GFP_MOVABLE. Allocations using __GFP_MOVABLE
can either be migrated using the page migration mechanism or reclaimed by
syncing with backing storage and discarding.

Additional credit goes to Hugh Dickins for catching issues with shmem swap
vector and ramfs allocations.

Signed-off-by: Mel Gorman <[email protected]>

diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.19-rc5-mm2-clean/fs/compat.c linux-2.6.19-rc5-mm2-mark_highmovable/fs/compat.c
--- linux-2.6.19-rc5-mm2-clean/fs/compat.c 2006-11-14 14:01:37.000000000 +0000
+++ linux-2.6.19-rc5-mm2-mark_highmovable/fs/compat.c 2006-11-27 15:09:20.000000000 +0000
@@ -1419,7 +1419,7 @@ static int compat_copy_strings(int argc,
page = bprm->page[i];
new = 0;
if (!page) {
- page = alloc_page(GFP_HIGHUSER);
+ page = alloc_page(GFP_HIGH_MOVABLE);
bprm->page[i] = page;
if (!page) {
ret = -ENOMEM;
diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.19-rc5-mm2-clean/fs/exec.c linux-2.6.19-rc5-mm2-mark_highmovable/fs/exec.c
--- linux-2.6.19-rc5-mm2-clean/fs/exec.c 2006-11-14 14:01:37.000000000 +0000
+++ linux-2.6.19-rc5-mm2-mark_highmovable/fs/exec.c 2006-11-27 15:09:20.000000000 +0000
@@ -239,7 +239,7 @@ static int copy_strings(int argc, char _
page = bprm->page[i];
new = 0;
if (!page) {
- page = alloc_page(GFP_HIGHUSER);
+ page = alloc_page(GFP_HIGH_MOVABLE);
bprm->page[i] = page;
if (!page) {
ret = -ENOMEM;
diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.19-rc5-mm2-clean/fs/inode.c linux-2.6.19-rc5-mm2-mark_highmovable/fs/inode.c
--- linux-2.6.19-rc5-mm2-clean/fs/inode.c 2006-11-14 14:01:37.000000000 +0000
+++ linux-2.6.19-rc5-mm2-mark_highmovable/fs/inode.c 2006-11-27 15:09:20.000000000 +0000
@@ -146,7 +146,7 @@ static struct inode *alloc_inode(struct
mapping->a_ops = &empty_aops;
mapping->host = inode;
mapping->flags = 0;
- mapping_set_gfp_mask(mapping, GFP_HIGHUSER);
+ mapping_set_gfp_mask(mapping, GFP_HIGH_MOVABLE);
mapping->assoc_mapping = NULL;
mapping->backing_dev_info = &default_backing_dev_info;

diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.19-rc5-mm2-clean/fs/ramfs/inode.c linux-2.6.19-rc5-mm2-mark_highmovable/fs/ramfs/inode.c
--- linux-2.6.19-rc5-mm2-clean/fs/ramfs/inode.c 2006-11-08 02:24:20.000000000 +0000
+++ linux-2.6.19-rc5-mm2-mark_highmovable/fs/ramfs/inode.c 2006-11-27 16:08:19.000000000 +0000
@@ -61,6 +61,7 @@ struct inode *ramfs_get_inode(struct sup
inode->i_blocks = 0;
inode->i_mapping->a_ops = &ramfs_aops;
inode->i_mapping->backing_dev_info = &ramfs_backing_dev_info;
+ mapping_set_gfp_mask(inode->i_mapping, GFP_HIGHUSER);
inode->i_atime = inode->i_mtime = inode->i_ctime = CURRENT_TIME;
switch (mode & S_IFMT) {
default:
diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.19-rc5-mm2-clean/include/asm-alpha/page.h linux-2.6.19-rc5-mm2-mark_highmovable/include/asm-alpha/page.h
--- linux-2.6.19-rc5-mm2-clean/include/asm-alpha/page.h 2006-11-08 02:24:20.000000000 +0000
+++ linux-2.6.19-rc5-mm2-mark_highmovable/include/asm-alpha/page.h 2006-11-27 15:09:20.000000000 +0000
@@ -17,7 +17,7 @@
extern void clear_page(void *page);
#define clear_user_page(page, vaddr, pg) clear_page(page)

-#define alloc_zeroed_user_highpage(vma, vaddr) alloc_page_vma(GFP_HIGHUSER | __GFP_ZERO, vma, vmaddr)
+#define alloc_zeroed_user_highpage(vma, vaddr) alloc_page_vma(GFP_HIGH_MOVABLE | __GFP_ZERO, vma, vmaddr)
#define __HAVE_ARCH_ALLOC_ZEROED_USER_HIGHPAGE

extern void copy_page(void * _to, void * _from);
diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.19-rc5-mm2-clean/include/asm-cris/page.h linux-2.6.19-rc5-mm2-mark_highmovable/include/asm-cris/page.h
--- linux-2.6.19-rc5-mm2-clean/include/asm-cris/page.h 2006-11-08 02:24:20.000000000 +0000
+++ linux-2.6.19-rc5-mm2-mark_highmovable/include/asm-cris/page.h 2006-11-27 15:09:20.000000000 +0000
@@ -20,7 +20,7 @@
#define clear_user_page(page, vaddr, pg) clear_page(page)
#define copy_user_page(to, from, vaddr, pg) copy_page(to, from)

-#define alloc_zeroed_user_highpage(vma, vaddr) alloc_page_vma(GFP_HIGHUSER | __GFP_ZERO, vma, vaddr)
+#define alloc_zeroed_user_highpage(vma, vaddr) alloc_page_vma(GFP_HIGH_MOVABLE | __GFP_ZERO, vma, vaddr)
#define __HAVE_ARCH_ALLOC_ZEROED_USER_HIGHPAGE

/*
diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.19-rc5-mm2-clean/include/asm-h8300/page.h linux-2.6.19-rc5-mm2-mark_highmovable/include/asm-h8300/page.h
--- linux-2.6.19-rc5-mm2-clean/include/asm-h8300/page.h 2006-11-08 02:24:20.000000000 +0000
+++ linux-2.6.19-rc5-mm2-mark_highmovable/include/asm-h8300/page.h 2006-11-27 15:09:20.000000000 +0000
@@ -22,7 +22,7 @@
#define clear_user_page(page, vaddr, pg) clear_page(page)
#define copy_user_page(to, from, vaddr, pg) copy_page(to, from)

-#define alloc_zeroed_user_highpage(vma, vaddr) alloc_page_vma(GFP_HIGHUSER | __GFP_ZERO, vma, vaddr)
+#define alloc_zeroed_user_highpage(vma, vaddr) alloc_page_vma(GFP_HIGH_MOVABLE | __GFP_ZERO, vma, vaddr)
#define __HAVE_ARCH_ALLOC_ZEROED_USER_HIGHPAGE

/*
diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.19-rc5-mm2-clean/include/asm-i386/page.h linux-2.6.19-rc5-mm2-mark_highmovable/include/asm-i386/page.h
--- linux-2.6.19-rc5-mm2-clean/include/asm-i386/page.h 2006-11-14 14:01:37.000000000 +0000
+++ linux-2.6.19-rc5-mm2-mark_highmovable/include/asm-i386/page.h 2006-11-27 15:09:20.000000000 +0000
@@ -35,7 +35,7 @@
#define clear_user_page(page, vaddr, pg) clear_page(page)
#define copy_user_page(to, from, vaddr, pg) copy_page(to, from)

-#define alloc_zeroed_user_highpage(vma, vaddr) alloc_page_vma(GFP_HIGHUSER | __GFP_ZERO, vma, vaddr)
+#define alloc_zeroed_user_highpage(vma, vaddr) alloc_page_vma(GFP_HIGH_MOVABLE|__GFP_ZERO, vma, vaddr)
#define __HAVE_ARCH_ALLOC_ZEROED_USER_HIGHPAGE

/*
diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.19-rc5-mm2-clean/include/asm-ia64/page.h linux-2.6.19-rc5-mm2-mark_highmovable/include/asm-ia64/page.h
--- linux-2.6.19-rc5-mm2-clean/include/asm-ia64/page.h 2006-11-08 02:24:20.000000000 +0000
+++ linux-2.6.19-rc5-mm2-mark_highmovable/include/asm-ia64/page.h 2006-11-27 15:09:20.000000000 +0000
@@ -89,7 +89,7 @@ do { \

#define alloc_zeroed_user_highpage(vma, vaddr) \
({ \
- struct page *page = alloc_page_vma(GFP_HIGHUSER | __GFP_ZERO, vma, vaddr); \
+ struct page *page = alloc_page_vma(GFP_HIGH_MOVABLE | __GFP_ZERO, vma, vaddr); \
if (page) \
flush_dcache_page(page); \
page; \
diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.19-rc5-mm2-clean/include/asm-m32r/page.h linux-2.6.19-rc5-mm2-mark_highmovable/include/asm-m32r/page.h
--- linux-2.6.19-rc5-mm2-clean/include/asm-m32r/page.h 2006-11-08 02:24:20.000000000 +0000
+++ linux-2.6.19-rc5-mm2-mark_highmovable/include/asm-m32r/page.h 2006-11-27 15:09:20.000000000 +0000
@@ -16,7 +16,7 @@ extern void copy_page(void *to, void *fr
#define clear_user_page(page, vaddr, pg) clear_page(page)
#define copy_user_page(to, from, vaddr, pg) copy_page(to, from)

-#define alloc_zeroed_user_highpage(vma, vaddr) alloc_page_vma(GFP_HIGHUSER | __GFP_ZERO, vma, vaddr)
+#define alloc_zeroed_user_highpage(vma, vaddr) alloc_page_vma(GFP_HIGH_MOVABLE | __GFP_ZERO, vma, vaddr)
#define __HAVE_ARCH_ALLOC_ZEROED_USER_HIGHPAGE

/*
diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.19-rc5-mm2-clean/include/asm-s390/page.h linux-2.6.19-rc5-mm2-mark_highmovable/include/asm-s390/page.h
--- linux-2.6.19-rc5-mm2-clean/include/asm-s390/page.h 2006-11-08 02:24:20.000000000 +0000
+++ linux-2.6.19-rc5-mm2-mark_highmovable/include/asm-s390/page.h 2006-11-27 15:09:20.000000000 +0000
@@ -64,7 +64,7 @@ static inline void copy_page(void *to, v
#define clear_user_page(page, vaddr, pg) clear_page(page)
#define copy_user_page(to, from, vaddr, pg) copy_page(to, from)

-#define alloc_zeroed_user_highpage(vma, vaddr) alloc_page_vma(GFP_HIGHUSER | __GFP_ZERO, vma, vaddr)
+#define alloc_zeroed_user_highpage(vma, vaddr) alloc_page_vma(GFP_HIGH_MOVABLE | __GFP_ZERO, vma, vaddr)
#define __HAVE_ARCH_ALLOC_ZEROED_USER_HIGHPAGE

/*
diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.19-rc5-mm2-clean/include/asm-x86_64/page.h linux-2.6.19-rc5-mm2-mark_highmovable/include/asm-x86_64/page.h
--- linux-2.6.19-rc5-mm2-clean/include/asm-x86_64/page.h 2006-11-08 02:24:20.000000000 +0000
+++ linux-2.6.19-rc5-mm2-mark_highmovable/include/asm-x86_64/page.h 2006-11-27 15:09:20.000000000 +0000
@@ -51,7 +51,7 @@ void copy_page(void *, void *);
#define clear_user_page(page, vaddr, pg) clear_page(page)
#define copy_user_page(to, from, vaddr, pg) copy_page(to, from)

-#define alloc_zeroed_user_highpage(vma, vaddr) alloc_page_vma(GFP_HIGHUSER | __GFP_ZERO, vma, vaddr)
+#define alloc_zeroed_user_highpage(vma, vaddr) alloc_page_vma(GFP_HIGH_MOVABLE|__GFP_ZERO, vma, vaddr)
#define __HAVE_ARCH_ALLOC_ZEROED_USER_HIGHPAGE
/*
* These are used to make use of C type-checking..
diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.19-rc5-mm2-clean/include/linux/gfp.h linux-2.6.19-rc5-mm2-mark_highmovable/include/linux/gfp.h
--- linux-2.6.19-rc5-mm2-clean/include/linux/gfp.h 2006-11-14 14:01:37.000000000 +0000
+++ linux-2.6.19-rc5-mm2-mark_highmovable/include/linux/gfp.h 2006-11-27 15:09:20.000000000 +0000
@@ -30,6 +30,9 @@ struct vm_area_struct;
* cannot handle allocation failures.
*
* __GFP_NORETRY: The VM implementation must not retry indefinitely.
+ *
+ * __GFP_MOVABLE: Flag that this page will be movable by the page migration
+ * mechanism
*/
#define __GFP_WAIT ((__force gfp_t)0x10u) /* Can wait and reschedule? */
#define __GFP_HIGH ((__force gfp_t)0x20u) /* Should access emergency pools? */
@@ -46,6 +49,7 @@ struct vm_area_struct;
#define __GFP_NOMEMALLOC ((__force gfp_t)0x10000u) /* Don't use emergency reserves */
#define __GFP_HARDWALL ((__force gfp_t)0x20000u) /* Enforce hardwall cpuset memory allocs */
#define __GFP_THISNODE ((__force gfp_t)0x40000u)/* No fallback, no policies */
+#define __GFP_MOVABLE ((__force gfp_t)0x80000u) /* Page is movable */

#define __GFP_BITS_SHIFT 20 /* Room for 20 __GFP_FOO bits */
#define __GFP_BITS_MASK ((__force gfp_t)((1 << __GFP_BITS_SHIFT) - 1))
@@ -54,7 +58,8 @@ struct vm_area_struct;
#define GFP_LEVEL_MASK (__GFP_WAIT|__GFP_HIGH|__GFP_IO|__GFP_FS| \
__GFP_COLD|__GFP_NOWARN|__GFP_REPEAT| \
__GFP_NOFAIL|__GFP_NORETRY|__GFP_NO_GROW|__GFP_COMP| \
- __GFP_NOMEMALLOC|__GFP_HARDWALL|__GFP_THISNODE)
+ __GFP_NOMEMALLOC|__GFP_HARDWALL|__GFP_THISNODE|\
+ __GFP_MOVABLE)

/* This equals 0, but use constants in case they ever change */
#define GFP_NOWAIT (GFP_ATOMIC & ~__GFP_HIGH)
@@ -66,6 +71,9 @@ struct vm_area_struct;
#define GFP_USER (__GFP_WAIT | __GFP_IO | __GFP_FS | __GFP_HARDWALL)
#define GFP_HIGHUSER (__GFP_WAIT | __GFP_IO | __GFP_FS | __GFP_HARDWALL | \
__GFP_HIGHMEM)
+#define GFP_HIGH_MOVABLE (__GFP_WAIT | __GFP_IO | __GFP_FS | \
+ __GFP_HARDWALL | __GFP_HIGHMEM | \
+ __GFP_MOVABLE)

#ifdef CONFIG_NUMA
#define GFP_THISNODE (__GFP_THISNODE | __GFP_NOWARN | __GFP_NORETRY)
diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.19-rc5-mm2-clean/include/linux/highmem.h linux-2.6.19-rc5-mm2-mark_highmovable/include/linux/highmem.h
--- linux-2.6.19-rc5-mm2-clean/include/linux/highmem.h 2006-11-14 14:01:37.000000000 +0000
+++ linux-2.6.19-rc5-mm2-mark_highmovable/include/linux/highmem.h 2006-11-27 15:09:20.000000000 +0000
@@ -65,7 +65,7 @@ static inline void clear_user_highpage(s
static inline struct page *
alloc_zeroed_user_highpage(struct vm_area_struct *vma, unsigned long vaddr)
{
- struct page *page = alloc_page_vma(GFP_HIGHUSER, vma, vaddr);
+ struct page *page = alloc_page_vma(GFP_HIGH_MOVABLE, vma, vaddr);

if (page)
clear_user_highpage(page, vaddr);
diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.19-rc5-mm2-clean/mm/memory.c linux-2.6.19-rc5-mm2-mark_highmovable/mm/memory.c
--- linux-2.6.19-rc5-mm2-clean/mm/memory.c 2006-11-14 14:01:37.000000000 +0000
+++ linux-2.6.19-rc5-mm2-mark_highmovable/mm/memory.c 2006-11-27 15:09:20.000000000 +0000
@@ -1564,7 +1564,7 @@ gotten:
if (!new_page)
goto oom;
} else {
- new_page = alloc_page_vma(GFP_HIGHUSER, vma, address);
+ new_page = alloc_page_vma(GFP_HIGH_MOVABLE, vma, address);
if (!new_page)
goto oom;
cow_user_page(new_page, old_page, address);
@@ -2188,7 +2188,7 @@ retry:

if (unlikely(anon_vma_prepare(vma)))
goto oom;
- page = alloc_page_vma(GFP_HIGHUSER, vma, address);
+ page = alloc_page_vma(GFP_HIGH_MOVABLE, vma, address);
if (!page)
goto oom;
copy_user_highpage(page, new_page, address);
diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.19-rc5-mm2-clean/mm/mempolicy.c linux-2.6.19-rc5-mm2-mark_highmovable/mm/mempolicy.c
--- linux-2.6.19-rc5-mm2-clean/mm/mempolicy.c 2006-11-14 14:01:37.000000000 +0000
+++ linux-2.6.19-rc5-mm2-mark_highmovable/mm/mempolicy.c 2006-11-27 15:09:20.000000000 +0000
@@ -598,7 +598,7 @@ static void migrate_page_add(struct page

static struct page *new_node_page(struct page *page, unsigned long node, int **x)
{
- return alloc_pages_node(node, GFP_HIGHUSER, 0);
+ return alloc_pages_node(node, GFP_HIGH_MOVABLE, 0);
}

/*
@@ -714,7 +714,7 @@ static struct page *new_vma_page(struct
{
struct vm_area_struct *vma = (struct vm_area_struct *)private;

- return alloc_page_vma(GFP_HIGHUSER, vma, page_address_in_vma(page, vma));
+ return alloc_page_vma(GFP_HIGH_MOVABLE, vma, page_address_in_vma(page, vma));
}
#else

diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.19-rc5-mm2-clean/mm/migrate.c linux-2.6.19-rc5-mm2-mark_highmovable/mm/migrate.c
--- linux-2.6.19-rc5-mm2-clean/mm/migrate.c 2006-11-14 14:01:37.000000000 +0000
+++ linux-2.6.19-rc5-mm2-mark_highmovable/mm/migrate.c 2006-11-27 15:09:20.000000000 +0000
@@ -748,7 +748,7 @@ static struct page *new_page_node(struct

*result = &pm->status;

- return alloc_pages_node(pm->node, GFP_HIGHUSER | GFP_THISNODE, 0);
+ return alloc_pages_node(pm->node, GFP_HIGH_MOVABLE | GFP_THISNODE, 0);
}

/*
diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.19-rc5-mm2-clean/mm/shmem.c linux-2.6.19-rc5-mm2-mark_highmovable/mm/shmem.c
--- linux-2.6.19-rc5-mm2-clean/mm/shmem.c 2006-11-14 14:01:37.000000000 +0000
+++ linux-2.6.19-rc5-mm2-mark_highmovable/mm/shmem.c 2006-11-27 15:45:19.000000000 +0000
@@ -93,8 +93,11 @@ static inline struct page *shmem_dir_all
* The above definition of ENTRIES_PER_PAGE, and the use of
* BLOCKS_PER_PAGE on indirect pages, assume PAGE_CACHE_SIZE:
* might be reconsidered if it ever diverges from PAGE_SIZE.
+ *
+ * __GFP_MOVABLE is masked out as swap vectors cannot move
*/
- return alloc_pages(gfp_mask, PAGE_CACHE_SHIFT-PAGE_SHIFT);
+ return alloc_pages((gfp_mask & ~__GFP_MOVABLE) | __GFP_ZERO,
+ PAGE_CACHE_SHIFT-PAGE_SHIFT);
}

static inline void shmem_dir_free(struct page *page)
@@ -372,7 +376,7 @@ static swp_entry_t *shmem_swp_alloc(stru
}

spin_unlock(&info->lock);
- page = shmem_dir_alloc(mapping_gfp_mask(inode->i_mapping) | __GFP_ZERO);
+ page = shmem_dir_alloc(mapping_gfp_mask(inode->i_mapping));
if (page)
set_page_private(page, 0);
spin_lock(&info->lock);
diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.19-rc5-mm2-clean/mm/swap_prefetch.c linux-2.6.19-rc5-mm2-mark_highmovable/mm/swap_prefetch.c
--- linux-2.6.19-rc5-mm2-clean/mm/swap_prefetch.c 2006-11-14 14:01:37.000000000 +0000
+++ linux-2.6.19-rc5-mm2-mark_highmovable/mm/swap_prefetch.c 2006-11-27 15:09:20.000000000 +0000
@@ -204,7 +204,7 @@ static enum trickle_return trickle_swap_
* Get a new page to read from swap. We have already checked the
* watermarks so __alloc_pages will not call on reclaim.
*/
- page = alloc_pages_node(node, GFP_HIGHUSER & ~__GFP_WAIT, 0);
+ page = alloc_pages_node(node, GFP_HIGH_MOVABLE & ~__GFP_WAIT, 0);
if (unlikely(!page)) {
ret = TRICKLE_DELAY;
goto out;
diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.19-rc5-mm2-clean/mm/swap_state.c linux-2.6.19-rc5-mm2-mark_highmovable/mm/swap_state.c
--- linux-2.6.19-rc5-mm2-clean/mm/swap_state.c 2006-11-14 14:01:37.000000000 +0000
+++ linux-2.6.19-rc5-mm2-mark_highmovable/mm/swap_state.c 2006-11-27 15:09:20.000000000 +0000
@@ -343,7 +343,7 @@ struct page *read_swap_cache_async(swp_e
* Get a new page to read into from swap.
*/
if (!new_page) {
- new_page = alloc_page_vma(GFP_HIGHUSER, vma, addr);
+ new_page = alloc_page_vma(GFP_HIGH_MOVABLE, vma, addr);
if (!new_page)
break; /* Out of memory */
}