LinuxLists.cc - [PATCH 0/8] Create ZONE_MOVABLE to partition memory between movable and non-movable pages

[permalink] [raw]

Subject: [PATCH 5/8] ppc and powerpc - Specify amount of kernel memory at boot time

This patch adds the kernelcore= parameter for ppc and powerpc.

Signed-off-by: Mel Gorman <[email protected]>
---

powerpc/kernel/prom.c | 1 +
ppc/mm/init.c | 2 ++
2 files changed, 3 insertions(+)

diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.20-rc4-mm1-004_x86_set_kernelcore/arch/powerpc/kernel/prom.c linux-2.6.20-rc4-mm1-005_ppc64_set_kernelcore/arch/powerpc/kernel/prom.c
--- linux-2.6.20-rc4-mm1-004_x86_set_kernelcore/arch/powerpc/kernel/prom.c 2007-01-07 05:45:51.000000000 +0000
+++ linux-2.6.20-rc4-mm1-005_ppc64_set_kernelcore/arch/powerpc/kernel/prom.c 2007-01-25 17:38:17.000000000 +0000
@@ -431,6 +431,7 @@ static int __init early_parse_mem(char *
return 0;
}
early_param("mem", early_parse_mem);
+early_param("kernelcore", cmdline_parse_kernelcore);

/*
* The device tree may be allocated below our memory limit, or inside the
diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.20-rc4-mm1-004_x86_set_kernelcore/arch/ppc/mm/init.c linux-2.6.20-rc4-mm1-005_ppc64_set_kernelcore/arch/ppc/mm/init.c
--- linux-2.6.20-rc4-mm1-004_x86_set_kernelcore/arch/ppc/mm/init.c 2007-01-07 05:45:51.000000000 +0000
+++ linux-2.6.20-rc4-mm1-005_ppc64_set_kernelcore/arch/ppc/mm/init.c 2007-01-25 17:38:17.000000000 +0000
@@ -214,6 +214,8 @@ void MMU_setup(void)
}
}

+early_param("kernelcore", cmdline_parse_kernelcore);
+
/*
* MMU_init sets up the basic memory mappings for the kernel,
* including both RAM and possibly some I/O regions,

2007-01-26 00:20:25

[permalink] [raw]

Subject: [PATCH 4/8] x86 - Specify amount of kernel memory at boot time

This patch adds the kernelcore= parameter for x86.

Signed-off-by: Mel Gorman <[email protected]>
---

setup.c | 1 +
1 files changed, 1 insertion(+)

diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.20-rc4-mm1-003_mark_hugepages_movable/arch/i386/kernel/setup.c linux-2.6.20-rc4-mm1-004_x86_set_kernelcore/arch/i386/kernel/setup.c
--- linux-2.6.20-rc4-mm1-003_mark_hugepages_movable/arch/i386/kernel/setup.c 2007-01-17 17:07:57.000000000 +0000
+++ linux-2.6.20-rc4-mm1-004_x86_set_kernelcore/arch/i386/kernel/setup.c 2007-01-25 17:36:17.000000000 +0000
@@ -196,6 +196,7 @@ static int __init parse_mem(char *arg)
return 0;
}
early_param("mem", parse_mem);
+early_param("kernelcore", cmdline_parse_kernelcore);

#ifdef CONFIG_PROC_VMCORE
/* elfcorehdr= specifies the location of elf core header

2007-01-26 00:20:46

[permalink] [raw]

Subject: [PATCH 8/8] Add documentation for additional boot parameter and sysctl

Once all patches are applied, a new command-line parameter exist and a new
sysctl. This patch adds the necessary documentation.

Signed-off-by: Mel Gorman <[email protected]>
---

filesystems/proc.txt | 15 +++++++++++++++
kernel-parameters.txt | 16 ++++++++++++++++
sysctl/vm.txt | 3 ++-
3 files changed, 33 insertions(+), 1 deletion(-)

diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.20-rc4-mm1-007_ia64_set_kernelcore/Documentation/filesystems/proc.txt linux-2.6.20-rc4-mm1-008_documentation/Documentation/filesystems/proc.txt
--- linux-2.6.20-rc4-mm1-007_ia64_set_kernelcore/Documentation/filesystems/proc.txt 2007-01-07 05:45:51.000000000 +0000
+++ linux-2.6.20-rc4-mm1-008_documentation/Documentation/filesystems/proc.txt 2007-01-25 18:27:28.000000000 +0000
@@ -1288,6 +1288,21 @@ nr_hugepages configures number of hugetl
hugetlb_shm_group contains group id that is allowed to create SysV shared
memory segment using hugetlb page.

+hugepages_treat_as_movable
+--------------------------
+
+This paramter is only useful when kernelcore= is specified at boot time to
+create ZONE_MOVABLE for pages that may be reclaimed or migrated. Huge pages
+are not movable so are not normally allocated from ZONE_MOVABLE. A non-zero
+value written to hugepages_treat_as_movable allows huge pages to be allocated
+from ZONE_MOVABLE.
+
+Once enabled, the ZONE_MOVABLE is treated as an area of memory the huge
+pages pool can easily grow or shrink within. Assuming that applications are
+not running that mlock() a lot of memory, it is likely the huge pages pool
+can grow to the size of ZONE_MOVABLE by repeatly entering the desired value
+into nr_hugepages and triggering page reclaim.
+
laptop_mode
-----------

diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.20-rc4-mm1-007_ia64_set_kernelcore/Documentation/kernel-parameters.txt linux-2.6.20-rc4-mm1-008_documentation/Documentation/kernel-parameters.txt
--- linux-2.6.20-rc4-mm1-007_ia64_set_kernelcore/Documentation/kernel-parameters.txt 2007-01-17 17:07:54.000000000 +0000
+++ linux-2.6.20-rc4-mm1-008_documentation/Documentation/kernel-parameters.txt 2007-01-25 18:27:28.000000000 +0000
@@ -762,6 +762,22 @@ and is between 256 and 4096 characters.
js= [HW,JOY] Analog joystick
See Documentation/input/joystick.txt.

+ kernelcore=nn[KMG] [KNL,IA-32,IA-64,PPC,X86-64] This parameter
+ specifies the amount of memory usable by the kernel
+ for non-movable allocations. The requested amount is
+ spread evenly throughout all nodes in the system. The
+ remaining memory in each node is used for Movable
+ pages. In the event, a node is too small to have both
+ kernelcore and Movable pages, kernelcore pages will
+ take priority and other nodes will have a larger number
+ of kernelcore pages. The Movable zone is used for the
+ allocation of pages that may be reclaimed or moved
+ by the page migration sybsystem. This means that
+ HugeTLB pages may not be allocated from this zone.
+ Note that allocations like PTEs-from-HighMem still
+ use the HighMem zone if it exists, and the Normal
+ zone if it does not.
+
keepinitrd [HW,ARM]

kstack=N [IA-32,X86-64] Print N words from the kernel stack
diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.20-rc4-mm1-007_ia64_set_kernelcore/Documentation/sysctl/vm.txt linux-2.6.20-rc4-mm1-008_documentation/Documentation/sysctl/vm.txt
--- linux-2.6.20-rc4-mm1-007_ia64_set_kernelcore/Documentation/sysctl/vm.txt 2007-01-17 17:07:54.000000000 +0000
+++ linux-2.6.20-rc4-mm1-008_documentation/Documentation/sysctl/vm.txt 2007-01-25 18:27:28.000000000 +0000
@@ -39,7 +39,8 @@ Currently, these files are in /proc/sys/

dirty_ratio, dirty_background_ratio, dirty_expire_centisecs,
dirty_writeback_centisecs, vfs_cache_pressure, laptop_mode,
-block_dump, swap_token_timeout, drop-caches:
+block_dump, swap_token_timeout, drop-caches,
+hugepages_treat_as_movable:

See Documentation/filesystems/proc.txt

2007-01-26 00:20:46

[permalink] [raw]

Subject: [PATCH 6/8] x86_64 - Specify amount of kernel memory at boot time

This patch adds the kernelcore= parameter for x86_64.

Signed-off-by: Mel Gorman <[email protected]>
---

e820.c | 1 +
1 files changed, 1 insertion(+)

diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.20-rc4-mm1-005_ppc64_set_kernelcore/arch/x86_64/kernel/e820.c linux-2.6.20-rc4-mm1-006_x8664_set_kernelcore/arch/x86_64/kernel/e820.c
--- linux-2.6.20-rc4-mm1-005_ppc64_set_kernelcore/arch/x86_64/kernel/e820.c 2007-01-17 17:08:01.000000000 +0000
+++ linux-2.6.20-rc4-mm1-006_x8664_set_kernelcore/arch/x86_64/kernel/e820.c 2007-01-25 17:40:16.000000000 +0000
@@ -617,6 +617,7 @@ static int __init parse_memopt(char *p)
return 0;
}
early_param("mem", parse_memopt);
+early_param("kernelcore", cmdline_parse_kernelcore);

static int userdef __initdata;

2007-01-26 00:20:46

[permalink] [raw]

Subject: [PATCH 7/8] ia64 - Specify amount of kernel memory at boot time

This patch adds the kernelcore= parameter for ia64.

Signed-off-by: Mel Gorman <[email protected]>
---

efi.c | 3 +++
1 files changed, 3 insertions(+)

diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.20-rc4-mm1-006_x8664_set_kernelcore/arch/ia64/kernel/efi.c linux-2.6.20-rc4-mm1-007_ia64_set_kernelcore/arch/ia64/kernel/efi.c
--- linux-2.6.20-rc4-mm1-006_x8664_set_kernelcore/arch/ia64/kernel/efi.c 2007-01-07 05:45:51.000000000 +0000
+++ linux-2.6.20-rc4-mm1-007_ia64_set_kernelcore/arch/ia64/kernel/efi.c 2007-01-25 17:42:15.000000000 +0000
@@ -27,6 +27,7 @@
#include <linux/time.h>
#include <linux/efi.h>
#include <linux/kexec.h>
+#include <linux/mm.h>

#include <asm/io.h>
#include <asm/kregs.h>
@@ -422,6 +423,8 @@ efi_init (void)
mem_limit = memparse(cp + 4, &cp);
} else if (memcmp(cp, "max_addr=", 9) == 0) {
max_addr = GRANULEROUNDDOWN(memparse(cp + 9, &cp));
+ } else if (memcmp(cp, "kernelcore=",11) == 0) {
+ cmdline_parse_kernelcore(cp+11);
} else if (memcmp(cp, "min_addr=", 9) == 0) {
min_addr = GRANULEROUNDDOWN(memparse(cp + 9, &cp));
} else {

2007-01-26 00:37:08

[permalink] [raw]

Subject: [PATCH 1/8] Add __GFP_MOVABLE for callers to flag allocations that may be migrated

It is often known at allocation time when a page may be migrated or
not. This patch adds a flag called __GFP_MOVABLE and a new mask called
GFP_HIGH_MOVABLE. Allocations using the __GFP_MOVABLE can be either migrated
using the page migration mechanism or reclaimed by syncing with backing
storage and discarding.

An API function very similar to alloc_zeroed_user_highpage() is added for
__GFP_MOVABLE allocations called alloc_zeroed_user_highpage_movable(). The
flags used by alloc_zeroed_user_highpage() are not changed because it changes
the semantics of an existing API. After this patch is applied there are no
in-kernel users of alloc_zeroed_user_highpage() so it probably should be
marked deprecated if this patch is merged.

Note that this patch includes a minor cleanup to the use of __GFP_ZERO
in shmem.c to keep all flag modifications to inode->mapping in the
shmem_dir_alloc() helper function. This clean-up suggestion is courtesy of
Hugh Dickens.

Additional credit goes to Christoph Lameter and Linus Torvalds for shaping
the concept. Credit to Hugh Dickens for catching issues with shmem swap
vector and ramfs allocations.

Signed-off-by: Mel Gorman <[email protected]>
---

fs/inode.c | 10 ++++++--
fs/ramfs/inode.c | 1
include/asm-alpha/page.h | 3 +-
include/asm-cris/page.h | 3 +-
include/asm-h8300/page.h | 3 +-
include/asm-i386/page.h | 3 +-
include/asm-ia64/page.h | 5 ++--
include/asm-m32r/page.h | 3 +-
include/asm-s390/page.h | 3 +-
include/asm-x86_64/page.h | 3 +-
include/linux/gfp.h | 10 +++++++-
include/linux/highmem.h | 51 +++++++++++++++++++++++++++++++++++++++--
mm/memory.c | 8 +++---
mm/mempolicy.c | 4 +--
mm/migrate.c | 2 -
mm/shmem.c | 7 ++++-
mm/swap_prefetch.c | 2 -
mm/swap_state.c | 2 -
18 files changed, 98 insertions(+), 25 deletions(-)

diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.20-rc4-mm1-clean/fs/inode.c linux-2.6.20-rc4-mm1-001_mark_highmovable/fs/inode.c
--- linux-2.6.20-rc4-mm1-clean/fs/inode.c 2007-01-17 17:08:26.000000000 +0000
+++ linux-2.6.20-rc4-mm1-001_mark_highmovable/fs/inode.c 2007-01-25 17:30:30.000000000 +0000
@@ -145,7 +145,7 @@ static struct inode *alloc_inode(struct
mapping->a_ops = &empty_aops;
mapping->host = inode;
mapping->flags = 0;
- mapping_set_gfp_mask(mapping, GFP_HIGHUSER);
+ mapping_set_gfp_mask(mapping, GFP_HIGH_MOVABLE);
mapping->assoc_mapping = NULL;
mapping->backing_dev_info = &default_backing_dev_info;

@@ -521,7 +521,13 @@ repeat:
* new_inode - obtain an inode
* @sb: superblock
*
- * Allocates a new inode for given superblock.
+ * Allocates a new inode for given superblock. The default gfp_mask
+ * for allocations related to inode->i_mapping is GFP_HIGH_MOVABLE. If
+ * HIGHMEM pages are unsuitable or it is known that pages allocated
+ * for the page cache are not reclaimable or migratable,
+ * mapping_set_gfp_mask() must be called with suitable flags on the
+ * newly created inode's mapping
+ *
*/
struct inode *new_inode(struct super_block *sb)
{
diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.20-rc4-mm1-clean/fs/ramfs/inode.c linux-2.6.20-rc4-mm1-001_mark_highmovable/fs/ramfs/inode.c
--- linux-2.6.20-rc4-mm1-clean/fs/ramfs/inode.c 2007-01-07 05:45:51.000000000 +0000
+++ linux-2.6.20-rc4-mm1-001_mark_highmovable/fs/ramfs/inode.c 2007-01-25 17:30:30.000000000 +0000
@@ -61,6 +61,7 @@ struct inode *ramfs_get_inode(struct sup
inode->i_blocks = 0;
inode->i_mapping->a_ops = &ramfs_aops;
inode->i_mapping->backing_dev_info = &ramfs_backing_dev_info;
+ mapping_set_gfp_mask(inode->i_mapping, GFP_HIGHUSER);
inode->i_atime = inode->i_mtime = inode->i_ctime = CURRENT_TIME;
switch (mode & S_IFMT) {
default:
diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.20-rc4-mm1-clean/include/asm-alpha/page.h linux-2.6.20-rc4-mm1-001_mark_highmovable/include/asm-alpha/page.h
--- linux-2.6.20-rc4-mm1-clean/include/asm-alpha/page.h 2007-01-07 05:45:51.000000000 +0000
+++ linux-2.6.20-rc4-mm1-001_mark_highmovable/include/asm-alpha/page.h 2007-01-25 17:30:30.000000000 +0000
@@ -17,7 +17,8 @@
extern void clear_page(void *page);
#define clear_user_page(page, vaddr, pg) clear_page(page)

-#define alloc_zeroed_user_highpage(vma, vaddr) alloc_page_vma(GFP_HIGHUSER | __GFP_ZERO, vma, vmaddr)
+#define __alloc_zeroed_user_highpage(movableflags, vma, vaddr) \
+ alloc_page_vma(GFP_HIGHUSER | __GFP_ZERO | movableflags, vma, vmaddr)
#define __HAVE_ARCH_ALLOC_ZEROED_USER_HIGHPAGE

extern void copy_page(void * _to, void * _from);
diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.20-rc4-mm1-clean/include/asm-cris/page.h linux-2.6.20-rc4-mm1-001_mark_highmovable/include/asm-cris/page.h
--- linux-2.6.20-rc4-mm1-clean/include/asm-cris/page.h 2007-01-07 05:45:51.000000000 +0000
+++ linux-2.6.20-rc4-mm1-001_mark_highmovable/include/asm-cris/page.h 2007-01-25 17:30:30.000000000 +0000
@@ -20,7 +20,8 @@
#define clear_user_page(page, vaddr, pg) clear_page(page)
#define copy_user_page(to, from, vaddr, pg) copy_page(to, from)

-#define alloc_zeroed_user_highpage(vma, vaddr) alloc_page_vma(GFP_HIGHUSER | __GFP_ZERO, vma, vaddr)
+#define __alloc_zeroed_user_highpage(movableflags, vma, vaddr) \
+ alloc_page_vma(GFP_HIGHUSER | __GFP_ZERO | movableflags, vma, vaddr)
#define __HAVE_ARCH_ALLOC_ZEROED_USER_HIGHPAGE

/*
diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.20-rc4-mm1-clean/include/asm-h8300/page.h linux-2.6.20-rc4-mm1-001_mark_highmovable/include/asm-h8300/page.h
--- linux-2.6.20-rc4-mm1-clean/include/asm-h8300/page.h 2007-01-07 05:45:51.000000000 +0000
+++ linux-2.6.20-rc4-mm1-001_mark_highmovable/include/asm-h8300/page.h 2007-01-25 17:30:30.000000000 +0000
@@ -22,7 +22,8 @@
#define clear_user_page(page, vaddr, pg) clear_page(page)
#define copy_user_page(to, from, vaddr, pg) copy_page(to, from)

-#define alloc_zeroed_user_highpage(vma, vaddr) alloc_page_vma(GFP_HIGHUSER | __GFP_ZERO, vma, vaddr)
+#define __alloc_zeroed_user_highpage(movableflags, vma, vaddr) \
+ alloc_page_vma(GFP_HIGHUSER | __GFP_ZERO | movableflags, vma, vaddr)
#define __HAVE_ARCH_ALLOC_ZEROED_USER_HIGHPAGE

/*
diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.20-rc4-mm1-clean/include/asm-i386/page.h linux-2.6.20-rc4-mm1-001_mark_highmovable/include/asm-i386/page.h
--- linux-2.6.20-rc4-mm1-clean/include/asm-i386/page.h 2007-01-07 05:45:51.000000000 +0000
+++ linux-2.6.20-rc4-mm1-001_mark_highmovable/include/asm-i386/page.h 2007-01-25 17:30:30.000000000 +0000
@@ -35,7 +35,8 @@
#define clear_user_page(page, vaddr, pg) clear_page(page)
#define copy_user_page(to, from, vaddr, pg) copy_page(to, from)

-#define alloc_zeroed_user_highpage(vma, vaddr) alloc_page_vma(GFP_HIGHUSER | __GFP_ZERO, vma, vaddr)
+#define __alloc_zeroed_user_highpage(movableflags, vma, vaddr) \
+ alloc_page_vma(GFP_HIGHUSER | __GFP_ZERO | movableflags, vma, vaddr)
#define __HAVE_ARCH_ALLOC_ZEROED_USER_HIGHPAGE

/*
diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.20-rc4-mm1-clean/include/asm-ia64/page.h linux-2.6.20-rc4-mm1-001_mark_highmovable/include/asm-ia64/page.h
--- linux-2.6.20-rc4-mm1-clean/include/asm-ia64/page.h 2007-01-07 05:45:51.000000000 +0000
+++ linux-2.6.20-rc4-mm1-001_mark_highmovable/include/asm-ia64/page.h 2007-01-25 17:30:30.000000000 +0000
@@ -87,9 +87,10 @@ do { \
} while (0)

-#define alloc_zeroed_user_highpage(vma, vaddr) \
+#define __alloc_zeroed_user_highpage(movableflags, vma, vaddr) \
({ \
- struct page *page = alloc_page_vma(GFP_HIGHUSER | __GFP_ZERO, vma, vaddr); \
+ struct page *page = alloc_page_vma(
+ GFP_HIGHUSER | __GFP_ZERO | movableflags, vma, vaddr); \
if (page) \
flush_dcache_page(page); \
page; \
diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.20-rc4-mm1-clean/include/asm-m32r/page.h linux-2.6.20-rc4-mm1-001_mark_highmovable/include/asm-m32r/page.h
--- linux-2.6.20-rc4-mm1-clean/include/asm-m32r/page.h 2007-01-17 17:08:31.000000000 +0000
+++ linux-2.6.20-rc4-mm1-001_mark_highmovable/include/asm-m32r/page.h 2007-01-25 17:30:30.000000000 +0000
@@ -15,7 +15,8 @@ extern void copy_page(void *to, void *fr
#define clear_user_page(page, vaddr, pg) clear_page(page)
#define copy_user_page(to, from, vaddr, pg) copy_page(to, from)

-#define alloc_zeroed_user_highpage(vma, vaddr) alloc_page_vma(GFP_HIGHUSER | __GFP_ZERO, vma, vaddr)
+#define __alloc_zeroed_user_highpage(movableflags, vma, vaddr) \
+ alloc_page_vma(GFP_HIGHUSER | __GFP_ZERO | movableflags, vma, vaddr)
#define __HAVE_ARCH_ALLOC_ZEROED_USER_HIGHPAGE

/*
diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.20-rc4-mm1-clean/include/asm-s390/page.h linux-2.6.20-rc4-mm1-001_mark_highmovable/include/asm-s390/page.h
--- linux-2.6.20-rc4-mm1-clean/include/asm-s390/page.h 2007-01-07 05:45:51.000000000 +0000
+++ linux-2.6.20-rc4-mm1-001_mark_highmovable/include/asm-s390/page.h 2007-01-25 17:30:30.000000000 +0000
@@ -64,7 +64,8 @@ static inline void copy_page(void *to, v
#define clear_user_page(page, vaddr, pg) clear_page(page)
#define copy_user_page(to, from, vaddr, pg) copy_page(to, from)

-#define alloc_zeroed_user_highpage(vma, vaddr) alloc_page_vma(GFP_HIGHUSER | __GFP_ZERO, vma, vaddr)
+#define __alloc_zeroed_user_highpage(movableflags, vma, vaddr) \
+ alloc_page_vma(GFP_HIGHUSER | __GFP_ZERO | movableflags, vma, vaddr)
#define __HAVE_ARCH_ALLOC_ZEROED_USER_HIGHPAGE

/*
diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.20-rc4-mm1-clean/include/asm-x86_64/page.h linux-2.6.20-rc4-mm1-001_mark_highmovable/include/asm-x86_64/page.h
--- linux-2.6.20-rc4-mm1-clean/include/asm-x86_64/page.h 2007-01-07 05:45:51.000000000 +0000
+++ linux-2.6.20-rc4-mm1-001_mark_highmovable/include/asm-x86_64/page.h 2007-01-25 17:30:30.000000000 +0000
@@ -51,7 +51,8 @@ void copy_page(void *, void *);
#define clear_user_page(page, vaddr, pg) clear_page(page)
#define copy_user_page(to, from, vaddr, pg) copy_page(to, from)

-#define alloc_zeroed_user_highpage(vma, vaddr) alloc_page_vma(GFP_HIGHUSER | __GFP_ZERO, vma, vaddr)
+#define __alloc_zeroed_user_highpage(movableflags, vma, vaddr) \
+ alloc_page_vma(GFP_HIGHUSER | __GFP_ZERO | movableflags, vma, vaddr)
#define __HAVE_ARCH_ALLOC_ZEROED_USER_HIGHPAGE
/*
* These are used to make use of C type-checking..
diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.20-rc4-mm1-clean/include/linux/gfp.h linux-2.6.20-rc4-mm1-001_mark_highmovable/include/linux/gfp.h
--- linux-2.6.20-rc4-mm1-clean/include/linux/gfp.h 2007-01-17 17:08:35.000000000 +0000
+++ linux-2.6.20-rc4-mm1-001_mark_highmovable/include/linux/gfp.h 2007-01-25 17:30:30.000000000 +0000
@@ -30,6 +30,9 @@ struct vm_area_struct;
* cannot handle allocation failures.
*
* __GFP_NORETRY: The VM implementation must not retry indefinitely.
+ *
+ * __GFP_MOVABLE: Flag that this page will be movable by the page migration
+ * mechanism or reclaimed
*/
#define __GFP_WAIT ((__force gfp_t)0x10u) /* Can wait and reschedule? */
#define __GFP_HIGH ((__force gfp_t)0x20u) /* Should access emergency pools? */
@@ -46,6 +49,7 @@ struct vm_area_struct;
#define __GFP_NOMEMALLOC ((__force gfp_t)0x10000u) /* Don't use emergency reserves */
#define __GFP_HARDWALL ((__force gfp_t)0x20000u) /* Enforce hardwall cpuset memory allocs */
#define __GFP_THISNODE ((__force gfp_t)0x40000u)/* No fallback, no policies */
+#define __GFP_MOVABLE ((__force gfp_t)0x80000u) /* Page is movable */

#define __GFP_BITS_SHIFT 20 /* Room for 20 __GFP_FOO bits */
#define __GFP_BITS_MASK ((__force gfp_t)((1 << __GFP_BITS_SHIFT) - 1))
@@ -54,7 +58,8 @@ struct vm_area_struct;
#define GFP_LEVEL_MASK (__GFP_WAIT|__GFP_HIGH|__GFP_IO|__GFP_FS| \
__GFP_COLD|__GFP_NOWARN|__GFP_REPEAT| \
__GFP_NOFAIL|__GFP_NORETRY|__GFP_NO_GROW|__GFP_COMP| \
- __GFP_NOMEMALLOC|__GFP_HARDWALL|__GFP_THISNODE)
+ __GFP_NOMEMALLOC|__GFP_HARDWALL|__GFP_THISNODE|\
+ __GFP_MOVABLE)

/* This equals 0, but use constants in case they ever change */
#define GFP_NOWAIT (GFP_ATOMIC & ~__GFP_HIGH)
@@ -66,6 +71,9 @@ struct vm_area_struct;
#define GFP_USER (__GFP_WAIT | __GFP_IO | __GFP_FS | __GFP_HARDWALL)
#define GFP_HIGHUSER (__GFP_WAIT | __GFP_IO | __GFP_FS | __GFP_HARDWALL | \
__GFP_HIGHMEM)
+#define GFP_HIGH_MOVABLE (__GFP_WAIT | __GFP_IO | __GFP_FS | \
+ __GFP_HARDWALL | __GFP_HIGHMEM | \
+ __GFP_MOVABLE)

#ifdef CONFIG_NUMA
#define GFP_THISNODE (__GFP_THISNODE | __GFP_NOWARN | __GFP_NORETRY)
diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.20-rc4-mm1-clean/include/linux/highmem.h linux-2.6.20-rc4-mm1-001_mark_highmovable/include/linux/highmem.h
--- linux-2.6.20-rc4-mm1-clean/include/linux/highmem.h 2007-01-17 17:08:35.000000000 +0000
+++ linux-2.6.20-rc4-mm1-001_mark_highmovable/include/linux/highmem.h 2007-01-25 17:30:30.000000000 +0000
@@ -62,10 +62,27 @@ static inline void clear_user_highpage(s
}

#ifndef __HAVE_ARCH_ALLOC_ZEROED_USER_HIGHPAGE
+/**
+ * __alloc_zeroed_user_highpage - Allocate a zeroed HIGHMEM page for a VMA with caller-specified movable GFP flags
+ * @movableflags: The GFP flags related to the pages future ability to move like __GFP_MOVABLE
+ * @vma: The VMA the page is to be allocated for
+ * @vaddr: The virtual address the page will be inserted into
+ *
+ * This function will allocate a page for a VMA but the caller is expected
+ * to specify via movableflags whether the page will be movable in the
+ * future or not
+ *
+ * An architecture may override this function by defining
+ * __HAVE_ARCH_ALLOC_ZEROED_USER_HIGHPAGE and providing their own
+ * implementation.
+ */
static inline struct page *
-alloc_zeroed_user_highpage(struct vm_area_struct *vma, unsigned long vaddr)
+__alloc_zeroed_user_highpage(gfp_t movableflags,
+ struct vm_area_struct *vma,
+ unsigned long vaddr)
{
- struct page *page = alloc_page_vma(GFP_HIGHUSER, vma, vaddr);
+ struct page *page = alloc_page_vma(GFP_HIGHUSER | movableflags,
+ vma, vaddr);

if (page)
clear_user_highpage(page, vaddr);
@@ -74,6 +91,36 @@ alloc_zeroed_user_highpage(struct vm_are
}
#endif

+/**
+ * alloc_zeroed_user_highpage - Allocate a zeroed HIGHMEM page for a VMA
+ * @vma: The VMA the page is to be allocated for
+ * @vaddr: The virtual address the page will be inserted into
+ *
+ * This function will allocate a page for a VMA that the caller knows will
+ * not be able to move in the future using move_pages() or reclaim. If it
+ * is known that the page can move, use alloc_zeroed_user_highpage_movable
+ */
+static inline struct page *
+alloc_zeroed_user_highpage(struct vm_area_struct *vma, unsigned long vaddr)
+{
+ return __alloc_zeroed_user_highpage(0, vma, vaddr);
+}
+
+/**
+ * alloc_zeroed_user_highpage_movable - Allocate a zeroed HIGHMEM page for a VMA that the caller knows can move
+ * @vma: The VMA the page is to be allocated for
+ * @vaddr: The virtual address the page will be inserted into
+ *
+ * This function will allocate a page for a VMA that the caller knows will
+ * be able to migrate in the future using move_pages() or reclaimed
+ */
+static inline struct page *
+alloc_zeroed_user_highpage_movable(struct vm_area_struct *vma,
+ unsigned long vaddr)
+{
+ return __alloc_zeroed_user_highpage(__GFP_MOVABLE, vma, vaddr);
+}
+
static inline void clear_highpage(struct page *page)
{
void *kaddr = kmap_atomic(page, KM_USER0);
diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.20-rc4-mm1-clean/mm/memory.c linux-2.6.20-rc4-mm1-001_mark_highmovable/mm/memory.c
--- linux-2.6.20-rc4-mm1-clean/mm/memory.c 2007-01-17 17:08:39.000000000 +0000
+++ linux-2.6.20-rc4-mm1-001_mark_highmovable/mm/memory.c 2007-01-25 17:30:30.000000000 +0000
@@ -1570,11 +1570,11 @@ gotten:
if (unlikely(anon_vma_prepare(vma)))
goto oom;
if (old_page == ZERO_PAGE(address)) {
- new_page = alloc_zeroed_user_highpage(vma, address);
+ new_page = alloc_zeroed_user_highpage_movable(vma, address);
if (!new_page)
goto oom;
} else {
- new_page = alloc_page_vma(GFP_HIGHUSER, vma, address);
+ new_page = alloc_page_vma(GFP_HIGH_MOVABLE, vma, address);
if (!new_page)
goto oom;
cow_user_page(new_page, old_page, address, vma);
@@ -2092,7 +2092,7 @@ static int do_anonymous_page(struct mm_s

if (unlikely(anon_vma_prepare(vma)))
goto oom;
- page = alloc_zeroed_user_highpage(vma, address);
+ page = alloc_zeroed_user_highpage_movable(vma, address);
if (!page)
goto oom;

@@ -2195,7 +2195,7 @@ retry:

if (unlikely(anon_vma_prepare(vma)))
goto oom;
- page = alloc_page_vma(GFP_HIGHUSER, vma, address);
+ page = alloc_page_vma(GFP_HIGH_MOVABLE, vma, address);
if (!page)
goto oom;
copy_user_highpage(page, new_page, address, vma);
diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.20-rc4-mm1-clean/mm/mempolicy.c linux-2.6.20-rc4-mm1-001_mark_highmovable/mm/mempolicy.c
--- linux-2.6.20-rc4-mm1-clean/mm/mempolicy.c 2007-01-17 17:08:39.000000000 +0000
+++ linux-2.6.20-rc4-mm1-001_mark_highmovable/mm/mempolicy.c 2007-01-25 17:30:30.000000000 +0000
@@ -598,7 +598,7 @@ static void migrate_page_add(struct page

static struct page *new_node_page(struct page *page, unsigned long node, int **x)
{
- return alloc_pages_node(node, GFP_HIGHUSER, 0);
+ return alloc_pages_node(node, GFP_HIGH_MOVABLE, 0);
}

/*
@@ -714,7 +714,7 @@ static struct page *new_vma_page(struct
{
struct vm_area_struct *vma = (struct vm_area_struct *)private;

- return alloc_page_vma(GFP_HIGHUSER, vma, page_address_in_vma(page, vma));
+ return alloc_page_vma(GFP_HIGH_MOVABLE, vma, page_address_in_vma(page, vma));
}
#else

diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.20-rc4-mm1-clean/mm/migrate.c linux-2.6.20-rc4-mm1-001_mark_highmovable/mm/migrate.c
--- linux-2.6.20-rc4-mm1-clean/mm/migrate.c 2007-01-07 05:45:51.000000000 +0000
+++ linux-2.6.20-rc4-mm1-001_mark_highmovable/mm/migrate.c 2007-01-25 17:30:30.000000000 +0000
@@ -748,7 +748,7 @@ static struct page *new_page_node(struct

*result = &pm->status;

- return alloc_pages_node(pm->node, GFP_HIGHUSER | GFP_THISNODE, 0);
+ return alloc_pages_node(pm->node, GFP_HIGH_MOVABLE | GFP_THISNODE, 0);
}

/*
diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.20-rc4-mm1-clean/mm/shmem.c linux-2.6.20-rc4-mm1-001_mark_highmovable/mm/shmem.c
--- linux-2.6.20-rc4-mm1-clean/mm/shmem.c 2007-01-17 17:08:39.000000000 +0000
+++ linux-2.6.20-rc4-mm1-001_mark_highmovable/mm/shmem.c 2007-01-25 17:30:30.000000000 +0000
@@ -93,8 +93,11 @@ static inline struct page *shmem_dir_all
* The above definition of ENTRIES_PER_PAGE, and the use of
* BLOCKS_PER_PAGE on indirect pages, assume PAGE_CACHE_SIZE:
* might be reconsidered if it ever diverges from PAGE_SIZE.
+ *
+ * __GFP_MOVABLE is masked out as swap vectors cannot move
*/
- return alloc_pages(gfp_mask, PAGE_CACHE_SHIFT-PAGE_SHIFT);
+ return alloc_pages((gfp_mask & ~__GFP_MOVABLE) | __GFP_ZERO,
+ PAGE_CACHE_SHIFT-PAGE_SHIFT);
}

static inline void shmem_dir_free(struct page *page)
@@ -371,7 +374,7 @@ static swp_entry_t *shmem_swp_alloc(stru
}

spin_unlock(&info->lock);
- page = shmem_dir_alloc(mapping_gfp_mask(inode->i_mapping) | __GFP_ZERO);
+ page = shmem_dir_alloc(mapping_gfp_mask(inode->i_mapping));
if (page)
set_page_private(page, 0);
spin_lock(&info->lock);
diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.20-rc4-mm1-clean/mm/swap_prefetch.c linux-2.6.20-rc4-mm1-001_mark_highmovable/mm/swap_prefetch.c
--- linux-2.6.20-rc4-mm1-clean/mm/swap_prefetch.c 2007-01-17 17:08:39.000000000 +0000
+++ linux-2.6.20-rc4-mm1-001_mark_highmovable/mm/swap_prefetch.c 2007-01-25 17:30:30.000000000 +0000
@@ -204,7 +204,7 @@ static enum trickle_return trickle_swap_
* Get a new page to read from swap. We have already checked the
* watermarks so __alloc_pages will not call on reclaim.
*/
- page = alloc_pages_node(node, GFP_HIGHUSER & ~__GFP_WAIT, 0);
+ page = alloc_pages_node(node, GFP_HIGH_MOVABLE & ~__GFP_WAIT, 0);
if (unlikely(!page)) {
ret = TRICKLE_DELAY;
goto out;
diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.20-rc4-mm1-clean/mm/swap_state.c linux-2.6.20-rc4-mm1-001_mark_highmovable/mm/swap_state.c
--- linux-2.6.20-rc4-mm1-clean/mm/swap_state.c 2007-01-17 17:08:39.000000000 +0000
+++ linux-2.6.20-rc4-mm1-001_mark_highmovable/mm/swap_state.c 2007-01-25 17:30:30.000000000 +0000
@@ -340,7 +340,7 @@ struct page *read_swap_cache_async(swp_e
* Get a new page to read into from swap.
*/
if (!new_page) {
- new_page = alloc_page_vma(GFP_HIGHUSER, vma, addr);
+ new_page = alloc_page_vma(GFP_HIGH_MOVABLE, vma, addr);
if (!new_page)
break; /* Out of memory */
}

2007-01-26 00:37:41

[permalink] [raw]

Subject: [PATCH 2/8] Create the ZONE_MOVABLE zone

This patch creates an additional zone, ZONE_MOVABLE. This zone is only
usable by allocations which specify both __GFP_HIGHMEM and __GFP_MOVABLE.
Hot-added memory continues to be placed in their existing destination as
there is no mechanism to redirect them to a specific zone.

Signed-off-by: Mel Gorman <[email protected]>
---

include/linux/gfp.h | 3
include/linux/mm.h | 1
include/linux/mmzone.h | 21 +++-
mm/highmem.c | 5
mm/page_alloc.c | 224 +++++++++++++++++++++++++++++++++++++++++++-
5 files changed, 247 insertions(+), 7 deletions(-)

diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.20-rc4-mm1-001_mark_highmovable/include/linux/gfp.h linux-2.6.20-rc4-mm1-002_create_movable_zone/include/linux/gfp.h
--- linux-2.6.20-rc4-mm1-001_mark_highmovable/include/linux/gfp.h 2007-01-25 17:30:30.000000000 +0000
+++ linux-2.6.20-rc4-mm1-002_create_movable_zone/include/linux/gfp.h 2007-01-25 17:32:18.000000000 +0000
@@ -101,6 +101,9 @@ static inline enum zone_type gfp_zone(gf
if (flags & __GFP_DMA32)
return ZONE_DMA32;
#endif
+ if ((flags & (__GFP_HIGHMEM | __GFP_MOVABLE)) ==
+ (__GFP_HIGHMEM | __GFP_MOVABLE))
+ return ZONE_MOVABLE;
#ifdef CONFIG_HIGHMEM
if (flags & __GFP_HIGHMEM)
return ZONE_HIGHMEM;
diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.20-rc4-mm1-001_mark_highmovable/include/linux/mm.h linux-2.6.20-rc4-mm1-002_create_movable_zone/include/linux/mm.h
--- linux-2.6.20-rc4-mm1-001_mark_highmovable/include/linux/mm.h 2007-01-17 17:08:35.000000000 +0000
+++ linux-2.6.20-rc4-mm1-002_create_movable_zone/include/linux/mm.h 2007-01-25 17:32:18.000000000 +0000
@@ -974,6 +974,7 @@ extern unsigned long find_max_pfn_with_a
extern void free_bootmem_with_active_regions(int nid,
unsigned long max_low_pfn);
extern void sparse_memory_present_with_active_regions(int nid);
+extern int cmdline_parse_kernelcore(char *p);
#ifndef CONFIG_HAVE_ARCH_EARLY_PFN_TO_NID
extern int early_pfn_to_nid(unsigned long pfn);
#endif /* CONFIG_HAVE_ARCH_EARLY_PFN_TO_NID */
diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.20-rc4-mm1-001_mark_highmovable/include/linux/mmzone.h linux-2.6.20-rc4-mm1-002_create_movable_zone/include/linux/mmzone.h
--- linux-2.6.20-rc4-mm1-001_mark_highmovable/include/linux/mmzone.h 2007-01-17 17:08:35.000000000 +0000
+++ linux-2.6.20-rc4-mm1-002_create_movable_zone/include/linux/mmzone.h 2007-01-25 17:32:18.000000000 +0000
@@ -138,6 +138,7 @@ enum zone_type {
*/
ZONE_HIGHMEM,
#endif
+ ZONE_MOVABLE,
MAX_NR_ZONES
};

@@ -159,6 +160,7 @@ enum zone_type {
+ defined(CONFIG_ZONE_DMA32) \
+ 1 \
+ defined(CONFIG_HIGHMEM) \
+ + 1 \
)
#if __ZONE_COUNT < 2
#define ZONES_SHIFT 0
@@ -166,6 +168,8 @@ enum zone_type {
#define ZONES_SHIFT 1
#elif __ZONE_COUNT <= 4
#define ZONES_SHIFT 2
+#elif __ZONE_COUNT <= 8
+#define ZONES_SHIFT 3
#else
#error ZONES_SHIFT -- too many zones configured adjust calculation
#endif
@@ -499,10 +503,21 @@ static inline int populated_zone(struct
return (!!zone->present_pages);
}

+extern int movable_zone;
+static inline int zone_movable_is_highmem(void)
+{
+#ifdef CONFIG_HIGHMEM
+ return movable_zone == ZONE_HIGHMEM;
+#else
+ return 0;
+#endif
+}
+
static inline int is_highmem_idx(enum zone_type idx)
{
#ifdef CONFIG_HIGHMEM
- return (idx == ZONE_HIGHMEM);
+ return (idx == ZONE_HIGHMEM ||
+ (idx == ZONE_MOVABLE && zone_movable_is_highmem()));
#else
return 0;
#endif
@@ -522,7 +537,9 @@ static inline int is_normal_idx(enum zon
static inline int is_highmem(struct zone *zone)
{
#ifdef CONFIG_HIGHMEM
- return zone == zone->zone_pgdat->node_zones + ZONE_HIGHMEM;
+ int zone_idx = zone - zone->zone_pgdat->node_zones;
+ return zone_idx == ZONE_HIGHMEM ||
+ (zone_idx == ZONE_MOVABLE && zone_movable_is_highmem());
#else
return 0;
#endif
diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.20-rc4-mm1-001_mark_highmovable/mm/highmem.c linux-2.6.20-rc4-mm1-002_create_movable_zone/mm/highmem.c
--- linux-2.6.20-rc4-mm1-001_mark_highmovable/mm/highmem.c 2007-01-07 05:45:51.000000000 +0000
+++ linux-2.6.20-rc4-mm1-002_create_movable_zone/mm/highmem.c 2007-01-25 17:32:18.000000000 +0000
@@ -46,8 +46,11 @@ unsigned int nr_free_highpages (void)
pg_data_t *pgdat;
unsigned int pages = 0;

- for_each_online_pgdat(pgdat)
+ for_each_online_pgdat(pgdat) {
pages += pgdat->node_zones[ZONE_HIGHMEM].free_pages;
+ if (zone_movable_is_highmem())
+ pages += pgdat->node_zones[ZONE_MOVABLE].free_pages;
+ }

return pages;
}
diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.20-rc4-mm1-001_mark_highmovable/mm/page_alloc.c linux-2.6.20-rc4-mm1-002_create_movable_zone/mm/page_alloc.c
--- linux-2.6.20-rc4-mm1-001_mark_highmovable/mm/page_alloc.c 2007-01-17 17:08:39.000000000 +0000
+++ linux-2.6.20-rc4-mm1-002_create_movable_zone/mm/page_alloc.c 2007-01-25 22:41:41.000000000 +0000
@@ -80,8 +80,9 @@ int sysctl_lowmem_reserve_ratio[MAX_NR_Z
256,
#endif
#ifdef CONFIG_HIGHMEM
- 32
+ 32,
#endif
+ 32,
};

EXPORT_SYMBOL(totalram_pages);
@@ -95,8 +96,9 @@ static char * const zone_names[MAX_NR_ZO
#endif
"Normal",
#ifdef CONFIG_HIGHMEM
- "HighMem"
+ "HighMem",
#endif
+ "Movable",
};

int min_free_kbytes = 1024;
@@ -134,6 +136,11 @@ static unsigned long __initdata dma_rese
unsigned long __initdata node_boundary_start_pfn[MAX_NUMNODES];
unsigned long __initdata node_boundary_end_pfn[MAX_NUMNODES];
#endif /* CONFIG_MEMORY_HOTPLUG_RESERVE */
+ unsigned long __initdata required_kernelcore;
+ unsigned long __initdata zone_movable_pfn[MAX_NUMNODES];
+
+ /* movable_zone is the "real" zone pages in ZONE_MOVABLE are taken from */
+ int movable_zone;
#endif /* CONFIG_ARCH_POPULATES_NODE_MAP */

#ifdef CONFIG_DEBUG_VM
@@ -1580,7 +1587,7 @@ unsigned int nr_free_buffer_pages(void)
*/
unsigned int nr_free_pagecache_pages(void)
{
- return nr_free_zone_pages(gfp_zone(GFP_HIGHUSER));
+ return nr_free_zone_pages(gfp_zone(GFP_HIGH_MOVABLE));
}

/*
@@ -2572,6 +2579,63 @@ void __init get_pfn_range_for_nid(unsign
}

/*
+ * This finds a zone that can be used for ZONE_MOVABLE pages. The
+ * assumption is made that zones within a node are ordered in monotonic
+ * increasing memory addresses so that the "highest" populated zone is used
+ */
+void __init find_usable_zone_for_movable(void)
+{
+ int zone_index;
+ for (zone_index = MAX_NR_ZONES - 1; zone_index >= 0; zone_index--) {
+ if (zone_index == ZONE_MOVABLE)
+ continue;
+
+ if (arch_zone_highest_possible_pfn[zone_index] >
+ arch_zone_lowest_possible_pfn[zone_index])
+ break;
+ }
+
+ VM_BUG_ON(zone_index == -1);
+ movable_zone = zone_index;
+}
+
+/*
+ * The zone ranges provided by the architecture do not include ZONE_MOVABLE
+ * because it is sized independant of architecture. Unlike the other zones,
+ * the starting point for ZONE_MOVABLE is not fixed. It may be different
+ * in each node depending on the size of each node and how evenly kernelcore
+ * is distributed. This helper function adjusts the zone ranges
+ * provided by the architecture for a given node by using the end of the
+ * highest usable zone for ZONE_MOVABLE. This preserves the assumption that
+ * zones within a node are in order of monotonic increases memory addresses
+ */
+void __init adjust_zone_range_for_zone_movable(int nid,
+ unsigned long zone_type,
+ unsigned long node_start_pfn,
+ unsigned long node_end_pfn,
+ unsigned long *zone_start_pfn,
+ unsigned long *zone_end_pfn)
+{
+ /* Only adjust if ZONE_MOVABLE is on this node */
+ if (zone_movable_pfn[nid]) {
+ /* Size ZONE_MOVABLE */
+ if (zone_type == ZONE_MOVABLE) {
+ *zone_start_pfn = zone_movable_pfn[nid];
+ *zone_end_pfn = min(node_end_pfn,
+ arch_zone_highest_possible_pfn[movable_zone]);
+
+ /* Adjust for ZONE_MOVABLE starting within this range */
+ } else if (*zone_start_pfn < zone_movable_pfn[nid] &&
+ *zone_end_pfn > zone_movable_pfn[nid]) {
+ *zone_end_pfn = zone_movable_pfn[nid];
+
+ /* Check if this whole range is within ZONE_MOVABLE */
+ } else if (*zone_start_pfn >= zone_movable_pfn[nid])
+ *zone_start_pfn = *zone_end_pfn;
+ }
+}
+
+/*
* Return the number of pages a zone spans in a node, including holes
* present_pages = zone_spanned_pages_in_node() - zone_absent_pages_in_node()
*/
@@ -2586,6 +2650,9 @@ unsigned long __init zone_spanned_pages_
get_pfn_range_for_nid(nid, &node_start_pfn, &node_end_pfn);
zone_start_pfn = arch_zone_lowest_possible_pfn[zone_type];
zone_end_pfn = arch_zone_highest_possible_pfn[zone_type];
+ adjust_zone_range_for_zone_movable(nid, zone_type,
+ node_start_pfn, node_end_pfn,
+ &zone_start_pfn, &zone_end_pfn);

/* Check that this node has pages within the zone's required range */
if (zone_end_pfn < node_start_pfn || zone_start_pfn > node_end_pfn)
@@ -2676,6 +2743,9 @@ unsigned long __init zone_absent_pages_i
zone_end_pfn = min(arch_zone_highest_possible_pfn[zone_type],
node_end_pfn);

+ adjust_zone_range_for_zone_movable(nid, zone_type,
+ node_start_pfn, node_end_pfn,
+ &zone_start_pfn, &zone_end_pfn);
return __absent_pages_in_range(nid, zone_start_pfn, zone_end_pfn);
}

@@ -3039,6 +3109,117 @@ unsigned long __init find_max_pfn_with_a
return max_pfn;
}

+/*
+ * Find the PFN the Movable zone begins in each node. Kernel memory
+ * is spread evenly between nodes as long as the nodes have enough
+ * memory. When they don't, some nodes will have more kernelcore than
+ * others
+ */
+void __init find_zone_movable_pfns_for_nodes(unsigned long *movable_pfn)
+{
+ int i, nid;
+ unsigned long usable_startpfn;
+ unsigned long kernelcore_node, kernelcore_remaining;
+ int usable_nodes = num_online_nodes();
+
+ /* If kernelcore was not specified, there is no ZONE_MOVABLE */
+ if (!required_kernelcore)
+ return;
+
+ /* usable_startpfn is the lowest possible pfn ZONE_MOVABLE can be at */
+ find_usable_zone_for_movable();
+ usable_startpfn = arch_zone_lowest_possible_pfn[movable_zone];
+
+restart:
+ /* Spread kernelcore memory as evenly as possible throughout nodes */
+ kernelcore_node = required_kernelcore / usable_nodes;
+ for_each_online_node(nid) {
+ /*
+ * Recalculate kernelcore_node if the division per node
+ * now exceeds what is necessary to satisfy the requested
+ * amount of memory for the kernel
+ */
+ if (required_kernelcore < kernelcore_node)
+ kernelcore_node = required_kernelcore / usable_nodes;
+
+ /*
+ * As the map is walked, we track how much memory is usable
+ * by the kernel using kernelcore_remaining. When it is
+ * 0, the rest of the node is usable by ZONE_MOVABLE
+ */
+ kernelcore_remaining = kernelcore_node;
+
+ /* Go through each range of PFNs within this node */
+ for_each_active_range_index_in_nid(i, nid) {
+ unsigned long start_pfn, end_pfn;
+ unsigned long size_pages;
+
+ start_pfn = max(early_node_map[i].start_pfn,
+ zone_movable_pfn[nid]);
+ end_pfn = early_node_map[i].end_pfn;
+ if (start_pfn >= end_pfn)
+ continue;
+
+ /* Account for what is only usable for kernelcore */
+ if (start_pfn < usable_startpfn) {
+ unsigned long kernel_pages;
+ kernel_pages = min(end_pfn, usable_startpfn)
+ - start_pfn;
+
+ kernelcore_remaining -= min(kernel_pages,
+ kernelcore_remaining);
+ required_kernelcore -= min(kernel_pages,
+ required_kernelcore);
+
+ /* Continue if range is now fully accounted */
+ if (end_pfn <= usable_startpfn) {
+
+ /*
+ * Push zone_movable_pfn to the end so
+ * that if we have to rebalance
+ * kernelcore across nodes, we will
+ * not double account here
+ */
+ zone_movable_pfn[nid] = end_pfn;
+ continue;
+ }
+ start_pfn = usable_startpfn;
+ }
+
+ /*
+ * The usable PFN range for ZONE_MOVABLE is from
+ * start_pfn->end_pfn. Calculate size_pages as the
+ * number of pages used as kernelcore
+ */
+ size_pages = end_pfn - start_pfn;
+ if (size_pages > kernelcore_remaining)
+ size_pages = kernelcore_remaining;
+ zone_movable_pfn[nid] = start_pfn + size_pages;
+
+ /*
+ * Some kernelcore has been met, update counts and
+ * break if the kernelcore for this node has been
+ * satisified
+ */
+ required_kernelcore -= min(required_kernelcore,
+ size_pages);
+ kernelcore_remaining -= size_pages;
+ if (!kernelcore_remaining)
+ break;
+ }
+ }
+
+ /*
+ * If there is still required_kernelcore, we do another pass with one
+ * less node in the count. This will push zone_movable_pfn[nid] further
+ * along on the nodes that still have memory until kernelcore is
+ * satisified
+ */
+ usable_nodes--;
+ if (usable_nodes && required_kernelcore > usable_nodes)
+ goto restart;
+}
+
/**
* free_area_init_nodes - Initialise all pg_data_t and zone data
* @max_zone_pfn: an array of max PFNs for each zone
@@ -3068,22 +3249,42 @@ void __init free_area_init_nodes(unsigne
arch_zone_lowest_possible_pfn[0] = find_min_pfn_with_active_regions();
arch_zone_highest_possible_pfn[0] = max_zone_pfn[0];
for (i = 1; i < MAX_NR_ZONES; i++) {
+ if (i == ZONE_MOVABLE)
+ continue;
+
arch_zone_lowest_possible_pfn[i] =
arch_zone_highest_possible_pfn[i-1];
arch_zone_highest_possible_pfn[i] =
max(max_zone_pfn[i], arch_zone_lowest_possible_pfn[i]);
}
+ arch_zone_lowest_possible_pfn[ZONE_MOVABLE] = 0;
+ arch_zone_highest_possible_pfn[ZONE_MOVABLE] = 0;

/* Print out the page size for debugging meminit problems */
printk(KERN_DEBUG "sizeof(struct page) = %zd\n", sizeof(struct page));

+ /* Find the PFNs that ZONE_MOVABLE begins at in each node */
+ memset(zone_movable_pfn, 0, sizeof(zone_movable_pfn));
+ find_zone_movable_pfns_for_nodes(zone_movable_pfn);
+
/* Print out the zone ranges */
printk("Zone PFN ranges:\n");
- for (i = 0; i < MAX_NR_ZONES; i++)
+ for (i = 0; i < MAX_NR_ZONES; i++) {
+ if (i == ZONE_MOVABLE)
+ continue;
+
printk(" %-8s %8lu -> %8lu\n",
zone_names[i],
arch_zone_lowest_possible_pfn[i],
arch_zone_highest_possible_pfn[i]);
+ }
+
+ /* Print out the PFNs ZONE_MOVABLE begins at in each node */
+ printk("Movable zone start PFN for each node\n");
+ for (i = 0; i < MAX_NUMNODES; i++) {
+ if (zone_movable_pfn[i])
+ printk(" Node %d: %lu\n", i, zone_movable_pfn[i]);
+ }

/* Print out the early_node_map[] */
printk("early_node_map[%d] active PFN ranges\n", nr_nodemap_entries);
@@ -3099,6 +3300,21 @@ void __init free_area_init_nodes(unsigne
find_min_pfn_for_node(nid), NULL);
}
}
+
+/*
+ * kernelcore=size sets the amount of memory for use for allocations that
+ * cannot be reclaimed or migrated.
+ */
+int __init cmdline_parse_kernelcore(char *p)
+{
+ unsigned long long coremem;
+ if (!p)
+ return -EINVAL;
+
+ coremem = memparse(p, &p);
+ required_kernelcore = coremem >> PAGE_SHIFT;
+ return 0;
+}
#endif /* CONFIG_ARCH_POPULATES_NODE_MAP */

/**

2007-01-26 00:37:42

[permalink] [raw]

Subject: [PATCH 3/8] Allow huge page allocations to use GFP_HIGH_MOVABLE

Huge pages are not movable so are not allocated from ZONE_MOVABLE. However,
as ZONE_MOVABLE will always have pages that can be migrated or reclaimed,
it can be used to satisfy hugepage allocations even when the system has been
running a long time. This allows an administrator to resize the hugepage
pool at runtime depending on the size of ZONE_MOVABLE.

This patch adds a new sysctl called hugepages_treat_as_movable. When
a non-zero value is written to it, future allocations for the huge page
pool will use ZONE_MOVABLE. Despite huge pages being non-movable, we do not
introduce additional external fragmentation of note as huge pages are always
the largest contiguous block we care about.

Signed-off-by: Mel Gorman <[email protected]>
---

include/linux/hugetlb.h | 3 +++
include/linux/mempolicy.h | 6 +++---
include/linux/sysctl.h | 1 +
kernel/sysctl.c | 8 ++++++++
mm/hugetlb.c | 23 ++++++++++++++++++++---
mm/mempolicy.c | 5 +++--
6 files changed, 38 insertions(+), 8 deletions(-)

diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.20-rc4-mm1-002_create_movable_zone/include/linux/hugetlb.h linux-2.6.20-rc4-mm1-003_mark_hugepages_movable/include/linux/hugetlb.h
--- linux-2.6.20-rc4-mm1-002_create_movable_zone/include/linux/hugetlb.h 2007-01-07 05:45:51.000000000 +0000
+++ linux-2.6.20-rc4-mm1-003_mark_hugepages_movable/include/linux/hugetlb.h 2007-01-25 17:34:15.000000000 +0000
@@ -14,6 +14,7 @@ static inline int is_vm_hugetlb_page(str
}

int hugetlb_sysctl_handler(struct ctl_table *, int, struct file *, void __user *, size_t *, loff_t *);
+int hugetlb_treat_movable_handler(struct ctl_table *, int, struct file *, void __user *, size_t *, loff_t *);
int copy_hugetlb_page_range(struct mm_struct *, struct mm_struct *, struct vm_area_struct *);
int follow_hugetlb_page(struct mm_struct *, struct vm_area_struct *, struct page **, struct vm_area_struct **, unsigned long *, int *, int);
void unmap_hugepage_range(struct vm_area_struct *, unsigned long, unsigned long);
@@ -28,6 +29,8 @@ int hugetlb_reserve_pages(struct inode *
void hugetlb_unreserve_pages(struct inode *inode, long offset, long freed);

extern unsigned long max_huge_pages;
+extern unsigned long hugepages_treat_as_movable;
+extern gfp_t htlb_alloc_mask;
extern const unsigned long hugetlb_zero, hugetlb_infinity;
extern int sysctl_hugetlb_shm_group;

diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.20-rc4-mm1-002_create_movable_zone/include/linux/mempolicy.h linux-2.6.20-rc4-mm1-003_mark_hugepages_movable/include/linux/mempolicy.h
--- linux-2.6.20-rc4-mm1-002_create_movable_zone/include/linux/mempolicy.h 2007-01-07 05:45:51.000000000 +0000
+++ linux-2.6.20-rc4-mm1-003_mark_hugepages_movable/include/linux/mempolicy.h 2007-01-25 17:34:15.000000000 +0000
@@ -159,7 +159,7 @@ extern void mpol_fix_fork_child_flag(str

extern struct mempolicy default_policy;
extern struct zonelist *huge_zonelist(struct vm_area_struct *vma,
- unsigned long addr);
+ unsigned long addr, gfp_t gfp_flags);
extern unsigned slab_node(struct mempolicy *policy);

extern enum zone_type policy_zone;
@@ -256,9 +256,9 @@ static inline void mpol_fix_fork_child_f
#define set_cpuset_being_rebound(x) do {} while (0)

static inline struct zonelist *huge_zonelist(struct vm_area_struct *vma,
- unsigned long addr)
+ unsigned long addr, gfp_t gfp_flags)
{
- return NODE_DATA(0)->node_zonelists + gfp_zone(GFP_HIGHUSER);
+ return NODE_DATA(0)->node_zonelists + gfp_zone(gfp_flags);
}

static inline int do_migrate_pages(struct mm_struct *mm,
diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.20-rc4-mm1-002_create_movable_zone/include/linux/sysctl.h linux-2.6.20-rc4-mm1-003_mark_hugepages_movable/include/linux/sysctl.h
--- linux-2.6.20-rc4-mm1-002_create_movable_zone/include/linux/sysctl.h 2007-01-07 05:45:51.000000000 +0000
+++ linux-2.6.20-rc4-mm1-003_mark_hugepages_movable/include/linux/sysctl.h 2007-01-25 17:34:15.000000000 +0000
@@ -202,6 +202,7 @@ enum
VM_PANIC_ON_OOM=33, /* panic at out-of-memory */
VM_VDSO_ENABLED=34, /* map VDSO into new processes? */
VM_MIN_SLAB=35, /* Percent pages ignored by zone reclaim */
+ VM_HUGETLB_TREAT_MOVABLE=36, /* Allocate hugepages from ZONE_MOVABLE */
};

diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.20-rc4-mm1-002_create_movable_zone/kernel/sysctl.c linux-2.6.20-rc4-mm1-003_mark_hugepages_movable/kernel/sysctl.c
--- linux-2.6.20-rc4-mm1-002_create_movable_zone/kernel/sysctl.c 2007-01-17 17:08:38.000000000 +0000
+++ linux-2.6.20-rc4-mm1-003_mark_hugepages_movable/kernel/sysctl.c 2007-01-25 17:34:15.000000000 +0000
@@ -919,6 +919,14 @@ static ctl_table vm_table[] = {
.mode = 0644,
.proc_handler = &proc_dointvec,
},
+ {
+ .ctl_name = VM_HUGETLB_TREAT_MOVABLE,
+ .procname = "hugepages_treat_as_movable",
+ .data = &hugepages_treat_as_movable,
+ .maxlen = sizeof(int),
+ .mode = 0644,
+ .proc_handler = &hugetlb_treat_movable_handler,
+ },
#endif
{
.ctl_name = VM_LOWMEM_RESERVE_RATIO,
diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.20-rc4-mm1-002_create_movable_zone/mm/hugetlb.c linux-2.6.20-rc4-mm1-003_mark_hugepages_movable/mm/hugetlb.c
--- linux-2.6.20-rc4-mm1-002_create_movable_zone/mm/hugetlb.c 2007-01-07 05:45:51.000000000 +0000
+++ linux-2.6.20-rc4-mm1-003_mark_hugepages_movable/mm/hugetlb.c 2007-01-25 17:34:15.000000000 +0000
@@ -27,6 +27,9 @@ unsigned long max_huge_pages;
static struct list_head hugepage_freelists[MAX_NUMNODES];
static unsigned int nr_huge_pages_node[MAX_NUMNODES];
static unsigned int free_huge_pages_node[MAX_NUMNODES];
+gfp_t htlb_alloc_mask = GFP_HIGHUSER;
+unsigned long hugepages_treat_as_movable;
+
/*
* Protects updates to hugepage_freelists, nr_huge_pages, and free_huge_pages
*/
@@ -68,12 +71,13 @@ static struct page *dequeue_huge_page(st
{
int nid = numa_node_id();
struct page *page = NULL;
- struct zonelist *zonelist = huge_zonelist(vma, address);
+ struct zonelist *zonelist = huge_zonelist(vma, address,
+ htlb_alloc_mask);
struct zone **z;

for (z = zonelist->zones; *z; z++) {
nid = zone_to_nid(*z);
- if (cpuset_zone_allowed_softwall(*z, GFP_HIGHUSER) &&
+ if (cpuset_zone_allowed_softwall(*z, htlb_alloc_mask) &&
!list_empty(&hugepage_freelists[nid]))
break;
}
@@ -103,7 +107,7 @@ static int alloc_fresh_huge_page(void)
{
static int nid = 0;
struct page *page;
- page = alloc_pages_node(nid, GFP_HIGHUSER|__GFP_COMP|__GFP_NOWARN,
+ page = alloc_pages_node(nid, htlb_alloc_mask|__GFP_COMP|__GFP_NOWARN,
HUGETLB_PAGE_ORDER);
nid = next_node(nid, node_online_map);
if (nid == MAX_NUMNODES)
@@ -243,6 +247,19 @@ int hugetlb_sysctl_handler(struct ctl_ta
max_huge_pages = set_max_huge_pages(max_huge_pages);
return 0;
}
+
+int hugetlb_treat_movable_handler(struct ctl_table *table, int write,
+ struct file *file, void __user *buffer,
+ size_t *length, loff_t *ppos)
+{
+ proc_dointvec(table, write, file, buffer, length, ppos);
+ if (hugepages_treat_as_movable)
+ htlb_alloc_mask = GFP_HIGH_MOVABLE;
+ else
+ htlb_alloc_mask = GFP_HIGHUSER;
+ return 0;
+}
+
#endif /* CONFIG_SYSCTL */

int hugetlb_report_meminfo(char *buf)
diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.20-rc4-mm1-002_create_movable_zone/mm/mempolicy.c linux-2.6.20-rc4-mm1-003_mark_hugepages_movable/mm/mempolicy.c
--- linux-2.6.20-rc4-mm1-002_create_movable_zone/mm/mempolicy.c 2007-01-25 17:30:30.000000000 +0000
+++ linux-2.6.20-rc4-mm1-003_mark_hugepages_movable/mm/mempolicy.c 2007-01-25 17:34:15.000000000 +0000
@@ -1203,7 +1203,8 @@ static inline unsigned interleave_nid(st

#ifdef CONFIG_HUGETLBFS
/* Return a zonelist suitable for a huge page allocation. */
-struct zonelist *huge_zonelist(struct vm_area_struct *vma, unsigned long addr)
+struct zonelist *huge_zonelist(struct vm_area_struct *vma, unsigned long addr,
+ gfp_t gfp_flags)
{
struct mempolicy *pol = get_vma_policy(current, vma, addr);

@@ -1211,7 +1212,7 @@ struct zonelist *huge_zonelist(struct vm
unsigned nid;

nid = interleave_nid(pol, vma, addr, HPAGE_SHIFT);
- return NODE_DATA(nid)->node_zonelists + gfp_zone(GFP_HIGHUSER);
+ return NODE_DATA(nid)->node_zonelists + gfp_zone(gfp_flags);
}
return zonelist_policy(GFP_HIGHUSER, pol);
}

2007-01-26 11:08:14

[permalink] [raw]

Subject: Re: [PATCH 0/8] Create ZONE_MOVABLE to partition memory between movable and non-movable pages

On Thu, 25 Jan 2007 23:44:58 +0000 (GMT)
Mel Gorman <[email protected]> wrote:

> The following 8 patches against 2.6.20-rc4-mm1 create a zone called
> ZONE_MOVABLE

Argh. These surely get all tangled up with the
make-zones-optional-by-adding-zillions-of-ifdef patches:

deal-with-cases-of-zone_dma-meaning-the-first-zone.patch
introduce-config_zone_dma.patch
optional-zone_dma-in-the-vm.patch
optional-zone_dma-in-the-vm-no-gfp_dma-check-in-the-slab-if-no-config_zone_dma-is-set.patch
optional-zone_dma-in-the-vm-no-gfp_dma-check-in-the-slab-if-no-config_zone_dma-is-set-reduce-config_zone_dma-ifdefs.patch
optional-zone_dma-for-ia64.patch
remove-zone_dma-remains-from-parisc.patch
remove-zone_dma-remains-from-sh-sh64.patch
set-config_zone_dma-for-arches-with-generic_isa_dma.patch
zoneid-fix-up-calculations-for-zoneid_pgshift.patch

My objections to those patches:

- They add zillions of ifdefs

- They make the VM's behaviour diverge between different platforms and
between differen configs on the same platforms, and hence degrade
maintainability and increase complexity.

- We kicked around some quite different ways of implementing the same
things, but nothing came of it. iirc, one was to remove the hard-coded
zones altogether and rework all the MM to operate in terms of

for (idx = 0; idx < NUMBER_OF_ZONES; idx++)
...

- I haven't seen any hard numbers to justify the change.

So I want to drop them all.

2007-01-26 12:27:30

by Nick Piggin

[permalink] [raw]

Subject: Re: [PATCH 1/8] Add __GFP_MOVABLE for callers to flag allocations that may be migrated

Mel Gorman wrote:
> It is often known at allocation time when a page may be migrated or
> not. This patch adds a flag called __GFP_MOVABLE and a new mask called
> GFP_HIGH_MOVABLE.

Shouldn't that be HIGHUSER_MOVABLE?

--
SUSE Labs, Novell Inc.
Send instant messages to your online friends http://au.messenger.yahoo.com

2007-01-26 13:25:18

[permalink] [raw]

Subject: Re: [PATCH 1/8] Add __GFP_MOVABLE for callers to flag allocations that may be migrated

On Fri, 26 Jan 2007, Nick Piggin wrote:

> Mel Gorman wrote:
>> It is often known at allocation time when a page may be migrated or
>> not. This patch adds a flag called __GFP_MOVABLE and a new mask called
>> GFP_HIGH_MOVABLE.
>
> Shouldn't that be HIGHUSER_MOVABLE?
>

I suppose, but it's a bit verbose. I don't feel very strongly about the
name and the choice of name was taken from here -
http://lkml.org/lkml/2006/11/23/157 . I can make it GFP_HIGHUSER_MOVABLE
in the next revision

--
Mel Gorman
Part-time Phd Student Linux Technology Center
University of Limerick IBM Dublin Software Lab

2007-01-26 14:29:48

[permalink] [raw]

Subject: Re: [PATCH 0/8] Create ZONE_MOVABLE to partition memory between movable and non-movable pages

On Fri, 26 Jan 2007, Andrew Morton wrote:

> On Thu, 25 Jan 2007 23:44:58 +0000 (GMT)
> Mel Gorman <[email protected]> wrote:
>
>> The following 8 patches against 2.6.20-rc4-mm1 create a zone called
>> ZONE_MOVABLE
>
> Argh. These surely get all tangled up with the
> make-zones-optional-by-adding-zillions-of-ifdef patches:
>

There may be some entertainment there all right. I didn't see any obvious
way of avoiding collisions with those patches but for what it's worth,
ZONE_MOVABLE could also be made optional.

In this patchset, I made no assumptions about the number of zones other
than the value of MAX_NR_ZONES. There should be no critical collisions but
I'll look through this patch list and see what I can spot.

> deal-with-cases-of-zone_dma-meaning-the-first-zone.patch

This patch looks ok and looks like it stands on it's own.

> introduce-config_zone_dma.patch

ok, no collisions here but obviously this patch does not stand on it's
own.

> optional-zone_dma-in-the-vm.patch

There are collisions here with the __ZONE_COUNT stuff but it's not
difficult to work around.

> optional-zone_dma-in-the-vm-no-gfp_dma-check-in-the-slab-if-no-config_zone_dma-is-set.patch
> optional-zone_dma-in-the-vm-no-gfp_dma-check-in-the-slab-if-no-config_zone_dma-is-set-reduce-config_zone_dma-ifdefs.patch

There is no cross-over here with the ZONE_MOVABLE patches. They are
messing around with slab

> optional-zone_dma-for-ia64.patch

No collision here

> remove-zone_dma-remains-from-parisc.patch
> remove-zone_dma-remains-from-sh-sh64.patch

No collisions here either. I see that there were discussions about Power
potentially doing something similar.

> set-config_zone_dma-for-arches-with-generic_isa_dma.patch

No collisions

> zoneid-fix-up-calculations-for-zoneid_pgshift.patch
>

Fun, but no collisions.

To my suprise, I only spotted one major conflict point with
optional-zone_dma-in-the-vm.patch and that should be easy enough to
resolve. What I could do is break up one of my patches into
most-of-the-patch and the-part-that-may-conflict-with-optional-dma-zone .
The smaller part would then change depending on whether the optional DMA
zone work is present. Would that be any help?

> My objections to those patches:
>
> - They add zillions of ifdefs
>
> - They make the VM's behaviour diverge between different platforms and
> between differen configs on the same platforms, and hence degrade
> maintainability and increase complexity.
>

I haven't thought about it much so I probably am missing something. The
major difference I see is when only one zone is present. In that case, a
number of loops presumably get optimised away and the behavior is very
different (presumably better although you point out no figures exist to
prove it). Where there are two or more zones, the code paths should be
similar whether there are 2, 3 or 4 zones present.

As the common platforms will always have more than one zone, it'll be
heavily tested and I'm guessing that distros are always going to have to
ship kernels with ZONE_DMA for the devices that require it. The only
platform I see that may have problems at the moment is IA64 which looks
like the only platform that can have one and only one zone. I am guessing
that Christoph will catch problems here fairly quickly although a
non-optional ZONE_MOVABLE would throw a spanner into the works somewhat.

> - We kicked around some quite different ways of implementing the same
> things, but nothing came of it. iirc, one was to remove the hard-coded
> zones altogether and rework all the MM to operate in terms of
>
> for (idx = 0; idx < NUMBER_OF_ZONES; idx++)
> ...
>

hmm. Assuming the aim is to have a situation where all zone-related loops
are optimised away at compile-time, it's hard to see an alternative that
works. Any dynamic way of creating zone at boot time will not have the
compile-time optimizations and any API that is page-range aware will
eventually hit the problems zones were made to solve (i.e. unmovable pages
locked in the lower address ranges).

> - I haven't seen any hard numbers to justify the change.
>
> So I want to drop them all.
>

--
Mel Gorman
Part-time Phd Student Linux Technology Center
University of Limerick IBM Dublin Software Lab

2007-01-26 15:56:26

[permalink] [raw]

Subject: Re: [PATCH 0/8] Create ZONE_MOVABLE to partition memory between movable and non-movable pages

On Fri, 26 Jan 2007, Andrew Morton wrote:

> - They add zillions of ifdefs

They just add a few for ZONE_DMA where we alreaday have similar ifdefs for
ZONE_DMA32 and ZONE_HIGHMEM.

> - They make the VM's behaviour diverge between different platforms and
> between differen configs on the same platforms, and hence degrade
> maintainability and increase complexity.

They avoid unecessary complexity on platforms. They could be made to work
on more platforms with measures to deal with what ZONE_DMA
provides in different ways. There are 6 or so platforms that do not need
ZONE_DMA at all.

> - We kicked around some quite different ways of implementing the same
> things, but nothing came of it. iirc, one was to remove the hard-coded
> zones altogether and rework all the MM to operate in terms of
>
> for (idx = 0; idx < NUMBER_OF_ZONES; idx++)
> ...

Hmmm.. How would that be simpler?

> - I haven't seen any hard numbers to justify the change.

I have send you numbers showing significant reductions in code size.

2007-01-26 16:01:21

[permalink] [raw]

Subject: Re: [PATCH 0/8] Create ZONE_MOVABLE to partition memory between movable and non-movable pages

On Fri, 26 Jan 2007, Mel Gorman wrote:

> I haven't thought about it much so I probably am missing something. The major
> difference I see is when only one zone is present. In that case, a number of
> loops presumably get optimised away and the behavior is very different
> (presumably better although you point out no figures exist to prove it). Where
> there are two or more zones, the code paths should be similar whether there
> are 2, 3 or 4 zones present.

The balancing of allocations between zones is becoming unnecessary. Also
in a NUMA system we then have zone == node which allows for a series of
simplifications.

> As the common platforms will always have more than one zone, it'll be heavily
> tested and I'm guessing that distros are always going to have to ship kernels
> with ZONE_DMA for the devices that require it. The only platform I see that
> may have problems at the moment is IA64 which looks like the only platform
> that can have one and only one zone. I am guessing that Christoph will catch
> problems here fairly quickly although a non-optional ZONE_MOVABLE would throw
> a spanner into the works somewhat.

There are 6 platforms that have only one zone. These are not major
platforms. In order for major platforms to go to a single zone in general
we would have to implement a generic mechanism to do an allocation where
one can specify the memory boundaries. Many DMA engines have different
limitations from what ZONE_DMA and ZONE_DMA32 can provide. If such a
scheme would be implemented then those would be able to utilize memory
better and the amount of bounce buffers would be reduced.

2007-01-26 16:22:04

[permalink] [raw]

Subject: Re: [PATCH 0/8] Create ZONE_MOVABLE to partition memory between movable and non-movable pages

On Thu, 25 Jan 2007, Mel Gorman wrote:

> The following 8 patches against 2.6.20-rc4-mm1 create a zone called
> ZONE_MOVABLE that is only usable by allocations that specify both __GFP_HIGHMEM
> and __GFP_MOVABLE. This has the effect of keeping all non-movable pages
> within a single memory partition while allowing movable allocations to be
> satisified from either partition.

For arches that do not have HIGHMEM other zones would be okay too it
seems.

> The size of the zone is determined by a kernelcore= parameter specified at
> boot-time. This specifies how much memory is usable by non-movable allocations
> and the remainder is used for ZONE_MOVABLE. Any range of pages within
> ZONE_MOVABLE can be released by migrating the pages or by reclaiming.

The user has to manually fiddle around with the size of the unmovable
partition until it works?

> When selecting a zone to take pages from for ZONE_MOVABLE, there are two
> things to consider. First, only memory from the highest populated zone is
> used for ZONE_MOVABLE. On the x86, this is probably going to be ZONE_HIGHMEM
> but it would be ZONE_DMA on ppc64 or possibly ZONE_DMA32 on x86_64. Second,
> the amount of memory usable by the kernel will be spreadly evenly throughout
> NUMA nodes where possible. If the nodes are not of equal size, the amount
> of memory usable by the kernel on some nodes may be greater than others.

So how is the amount of movable memory on a node calculated? Evenly
distributed? There are some NUMA architectures that are not that
symmetric.

> By default, the zone is not as useful for hugetlb allocations because they
> are pinned and non-migratable (currently at least). A sysctl is provided that
> allows huge pages to be allocated from that zone. This means that the huge
> page pool can be resized to the size of ZONE_MOVABLE during the lifetime of
> the system assuming that pages are not mlocked. Despite huge pages being
> non-movable, we do not introduce additional external fragmentation of note
> as huge pages are always the largest contiguous block we care about.

The user already has to specify the partitioning of the system at bootup
and could take the huge page sizes into account.

Also huge pages may have variable sizes that can be specified on bootup
for IA64. The assumption that a huge page is always the largest
contiguous block is *not true*.

The huge page sizes on i386 and x86_64 platforms are contigent on
their page table structure. This can be completely different on other
platforms.

2007-01-26 16:28:44

[permalink] [raw]

Subject: Re: [PATCH 2/8] Create the ZONE_MOVABLE zone

On Thu, 25 Jan 2007, Mel Gorman wrote:

> @@ -166,6 +168,8 @@ enum zone_type {
> #define ZONES_SHIFT 1
> #elif __ZONE_COUNT <= 4
> #define ZONES_SHIFT 2
> +#elif __ZONE_COUNT <= 8
> +#define ZONES_SHIFT 3
> #else

You do not need a shift of 3. Even with ZONE_MOVABLE the maximum
number of zones is still 4.

x86_64 has DMA, DMA32, NORMAL, MOVABLE
i386 has DMA, NORMAL, HIGHMEM, MOVABLE

x86_64 is the only platform that has DMA32.

2007-01-26 16:33:55

[permalink] [raw]

Subject: Re: [PATCH 3/8] Allow huge page allocations to use GFP_HIGH_MOVABLE

Unmovable allocations in the movable zone. Yuck. Why dont you abandon the
whole concept of statically sized movable zone and go back to the nice
earlier idea of dynamically assigning MAX_ORDER chunks to be movable or not?

2007-01-26 16:48:14

[permalink] [raw]

Subject: Re: [PATCH 0/8] Create ZONE_MOVABLE to partition memory between movable and non-movable pages

On Fri, 26 Jan 2007, Christoph Lameter wrote:

> On Thu, 25 Jan 2007, Mel Gorman wrote:
>
>> The following 8 patches against 2.6.20-rc4-mm1 create a zone called
>> ZONE_MOVABLE that is only usable by allocations that specify both __GFP_HIGHMEM
>> and __GFP_MOVABLE. This has the effect of keeping all non-movable pages
>> within a single memory partition while allowing movable allocations to be
>> satisified from either partition.
>
> For arches that do not have HIGHMEM other zones would be okay too it
> seems.
>

It would, but it'd obscure the code to take advantage of that.

>> The size of the zone is determined by a kernelcore= parameter specified at
>> boot-time. This specifies how much memory is usable by non-movable allocations
>> and the remainder is used for ZONE_MOVABLE. Any range of pages within
>> ZONE_MOVABLE can be released by migrating the pages or by reclaiming.
>
> The user has to manually fiddle around with the size of the unmovable
> partition until it works?
>

They have to fiddle with the size of the unmovable partition if their
workload uses more unmovable kernel allocations than expected. This was
always going to be the restriction with using zones for partitioning
memory. Resizing zones on the fly is not really an option because the
resizing would only work reliably in one direction.

The anti-fragmentation code could potentially be used to have subzone
groups that kept movable and unmovable allocations as far apart as
possible and at opposite ends of a zone. That approach has been kicked a
few times because of complexity.

>> When selecting a zone to take pages from for ZONE_MOVABLE, there are two
>> things to consider. First, only memory from the highest populated zone is
>> used for ZONE_MOVABLE. On the x86, this is probably going to be ZONE_HIGHMEM
>> but it would be ZONE_DMA on ppc64 or possibly ZONE_DMA32 on x86_64. Second,
>> the amount of memory usable by the kernel will be spreadly evenly throughout
>> NUMA nodes where possible. If the nodes are not of equal size, the amount
>> of memory usable by the kernel on some nodes may be greater than others.
>
> So how is the amount of movable memory on a node calculated?

Subtle difference. The amount of unmovable memory is calculated per node.

> Evenly
> distributed?

As evenly as possible.

> There are some NUMA architectures that are not that
> symmetric.
>

I know, it's why find_zone_movable_pfns_for_nodes() is as complex as it
is. The mechanism spreads the unmovable memory evenly throughout all
nodes. In the event some nodes are too small to hold their share, the
remaining unmovable memory is divided between the nodes that are larger.

>> By default, the zone is not as useful for hugetlb allocations because they
>> are pinned and non-migratable (currently at least). A sysctl is provided that
>> allows huge pages to be allocated from that zone. This means that the huge
>> page pool can be resized to the size of ZONE_MOVABLE during the lifetime of
>> the system assuming that pages are not mlocked. Despite huge pages being
>> non-movable, we do not introduce additional external fragmentation of note
>> as huge pages are always the largest contiguous block we care about.
>
> The user already has to specify the partitioning of the system at bootup
> and could take the huge page sizes into account.
>

Not in all cases. Some systems will not know how many huge pages they need
in advance because it is used as a batch system running jobs as requested.
The zone allows an amount of memory to be set aside that can be
*optionally* used for hugepages if desired or base pages if not. Between
jobs, the hugepage pool can be resized up to the size of ZONE_MOVABLE.

The other case is ever supporting memory hot-remove. Any memory within
ZONE_MOVABLE can potentially be removed by migrating pages and off-lined.

> Also huge pages may have variable sizes that can be specified on bootup
> for IA64. The assumption that a huge page is always the largest
> contiguous block is *not true*.
>

I didn't say they were the largest supported contiguous block, I said they
were the largest contiguous block we *care* about. Right now, it is
assumed that variable pages are not supported at runtime. If they were,
some smarts would be needed to keep huge pages of the same size together
to control external fragmentation but that's about it.

> The huge page sizes on i386 and x86_64 platforms are contigent on
> their page table structure. This can be completely different on other
> platforms.
>

The size doesn't really make much difference to the mechanism.

--
Mel Gorman
Part-time Phd Student Linux Technology Center
University of Limerick IBM Dublin Software Lab

2007-01-26 16:49:06

[permalink] [raw]

Subject: Re: [PATCH 2/8] Create the ZONE_MOVABLE zone

On Fri, 26 Jan 2007, Christoph Lameter wrote:

> On Thu, 25 Jan 2007, Mel Gorman wrote:
>
>> @@ -166,6 +168,8 @@ enum zone_type {
>> #define ZONES_SHIFT 1
>> #elif __ZONE_COUNT <= 4
>> #define ZONES_SHIFT 2
>> +#elif __ZONE_COUNT <= 8
>> +#define ZONES_SHIFT 3
>> #else
>
> You do not need a shift of 3. Even with ZONE_MOVABLE the maximum
> number of zones is still 4.
>
> x86_64 has DMA, DMA32, NORMAL, MOVABLE
> i386 has DMA, NORMAL, HIGHMEM, MOVABLE
>
> x86_64 is the only platform that has DMA32.
>

Good point. I'll recheck this to be sure but if it's true, it means that
the only major collision point between these patches and the optional
ZONE_DMA patches goes away.

--
Mel Gorman
Part-time Phd Student Linux Technology Center
University of Limerick IBM Dublin Software Lab

2007-01-26 16:59:01

[permalink] [raw]

Subject: Re: [PATCH 3/8] Allow huge page allocations to use GFP_HIGH_MOVABLE

On Fri, 26 Jan 2007, Christoph Lameter wrote:

> Unmovable allocations in the movable zone. Yuck.

I know, but my objective at this time is to allow the hugepage pool to be
resized at runtime for situations where the number of required hugepages
is not known in advance. Having a zone for movable pages allows that to
happen. Also, it's possible that migration of hugepages will be supported
at some time in the future. That's a more reasonable possibility than
moving kernel memory.

> Why dont you abandon the
> whole concept of statically sized movable zone and go back to the nice
> earlier idea of dynamically assigning MAX_ORDER chunks to be movable or not?
>

Because Andrew has made it pretty clear he will not take those patches on
the grounds of complexity - at least until it can be shown that they fix
the e1000 problem. Any improvement on the behavior of those patches such
as address biasing to allow memory hot-remove of the higher addresses
makes them even more complex.

Also, almost every time the anti-frag patches are posted, someone suggests
that zones be used instead. I wanted to show what those patches look like.
(of course, every time I post the zone approach, someone suggests I go
back the other way)

--
Mel Gorman
Part-time Phd Student Linux Technology Center
University of Limerick IBM Dublin Software Lab

2007-01-26 17:02:26

[permalink] [raw]

Subject: Re: [PATCH 0/8] Create ZONE_MOVABLE to partition memory between movable and non-movable pages

On Fri, 26 Jan 2007, Mel Gorman wrote:

> > For arches that do not have HIGHMEM other zones would be okay too it
> > seems.
> It would, but it'd obscure the code to take advantage of that.

No MOVABLE memory for 64 bit platforms that do not have HIGHMEM right now?

> The anti-fragmentation code could potentially be used to have subzone groups
> that kept movable and unmovable allocations as far apart as possible and at
> opposite ends of a zone. That approach has been kicked a few times because of
> complexity.

Hmm... But his patch also introduces additional complexity plus its
difficult to handle for the end user.

> > There are some NUMA architectures that are not that
> > symmetric.
> I know, it's why find_zone_movable_pfns_for_nodes() is as complex as it is.
> The mechanism spreads the unmovable memory evenly throughout all nodes. In the
> event some nodes are too small to hold their share, the remaining unmovable
> memory is divided between the nodes that are larger.

I would have expected a percentage of a node. If equal amounts of
unmovable memory are assigned to all nodes at first then there will be
large disparities in the amount of movable memories f.e. between a node
with 8G memory compared to a node with 1GB memory.

How do you handle headless nodes? I.e. memory nodes with no processors?
Those may be particularly large compared to the rest but these are mainly
used for movable pages since unmovable things like device drivers buffers
have to be kept near the processors that take the interrupt.

2007-01-26 17:04:17

[permalink] [raw]

Subject: Re: [PATCH 3/8] Allow huge page allocations to use GFP_HIGH_MOVABLE

On Fri, 26 Jan 2007, Mel Gorman wrote:

> Because Andrew has made it pretty clear he will not take those patches on the
> grounds of complexity - at least until it can be shown that they fix the e1000
> problem. Any improvement on the behavior of those patches such as address
> biasing to allow memory hot-remove of the higher addresses makes them even
> more complex.

What is the e1000 problem? Jumbo packet allocation via GFP_KERNEL?

2007-01-26 17:17:00

[permalink] [raw]

Subject: Re: [PATCH 2/8] Create the ZONE_MOVABLE zone

I do not see any updates of vmstat.c and vmstat.h. This
means that VM statistics are not kept / considered for ZONE_MOVABLE.

2007-01-26 17:20:11

[permalink] [raw]

Subject: Re: [PATCH 0/8] Create ZONE_MOVABLE to partition memory between movable and non-movable pages

On Fri, 26 Jan 2007, Christoph Lameter wrote:

> On Fri, 26 Jan 2007, Mel Gorman wrote:
>
>>> For arches that do not have HIGHMEM other zones would be okay too it
>>> seems.
>> It would, but it'd obscure the code to take advantage of that.
>
> No MOVABLE memory for 64 bit platforms that do not have HIGHMEM right now?
>

err, no, I misinterpreted what you meant by "other zones would be ok..". I
though you were suggesting the reuse of zone names for some reason.

The zone used to for ZONE_MOVABLE is the highest populated zone on the
architecture. On some architectures, that will be ZONE_HIGHMEM. On others,
it will be ZONE_DMA. See the function find_usable_zone_for_movable()

ZONE_MOVABLE never spans zones. For example, it will not use some
ZONE_HIGHMEM and some ZONE_NORMAL memory.

>> The anti-fragmentation code could potentially be used to have subzone groups
>> that kept movable and unmovable allocations as far apart as possible and at
>> opposite ends of a zone. That approach has been kicked a few times because of
>> complexity.
>
> Hmm... But his patch also introduces additional complexity plus its
> difficult to handle for the end user.
>

It's harder for the user to setup all right. But it works within limits
that are known well in advance and doesn't add additional code to the main
allocator path. Once it's setup, it acts like any other zone and zone
behavior is better understood than anti-fragmentations behavior.

>>> There are some NUMA architectures that are not that
>>> symmetric.
>> I know, it's why find_zone_movable_pfns_for_nodes() is as complex as it is.
>> The mechanism spreads the unmovable memory evenly throughout all nodes. In the
>> event some nodes are too small to hold their share, the remaining unmovable
>> memory is divided between the nodes that are larger.
>
> I would have expected a percentage of a node. If equal amounts of
> unmovable memory are assigned to all nodes at first then there will be
> large disparities in the amount of movable memories f.e. between a node
> with 8G memory compared to a node with 1GB memory.
>

On the other hand, percentages make it harder for the administrator to
know in advance how much unmovable memory will be available when the
system starts even if the machine changes configuration. The absolute
figure is easier to understand. If there was a requirement, an alternative
configuration option could be made available that takes a fixed percentage
of each node with memory.

> How do you handle headless nodes? I.e. memory nodes with no processors?

The code only cares about memory, not processors.

> Those may be particularly large compared to the rest but these are mainly
> used for movable pages since unmovable things like device drivers buffers
> have to be kept near the processors that take the interrupt.
>

Then what I'd do is specify kernelcore to be

(number_of_nodes_with_processors * largest_amount_of_memory_on_node_with_processors)

That would have all memory near processors available as unmovable memory
(that movable allocations will still use so they don't always go remote)
while keeping a large amount of memory on the headless nodes for movable
allocations only.

If requirements demanded, a configuration option could be made that allows
the administrator to specify exactly how much unmovable memory he wants on
a specific node.

--
Mel Gorman
Part-time Phd Student Linux Technology Center
University of Limerick IBM Dublin Software Lab

2007-01-26 17:20:50

[permalink] [raw]

Subject: Re: [PATCH 3/8] Allow huge page allocations to use GFP_HIGH_MOVABLE

On Fri, 26 Jan 2007, Christoph Lameter wrote:

> On Fri, 26 Jan 2007, Mel Gorman wrote:
>
>> Because Andrew has made it pretty clear he will not take those patches on the
>> grounds of complexity - at least until it can be shown that they fix the e1000
>> problem. Any improvement on the behavior of those patches such as address
>> biasing to allow memory hot-remove of the higher addresses makes them even
>> more complex.
>
> What is the e1000 problem? Jumbo packet allocation via GFP_KERNEL?
>

Yes. Potentially the anti-fragmentation patches could address this by
clustering atomic allocations together as much as possible.

--
Mel Gorman
Part-time Phd Student Linux Technology Center
University of Limerick IBM Dublin Software Lab

2007-01-26 17:22:55

[permalink] [raw]

Subject: Re: [PATCH 3/8] Allow huge page allocations to use GFP_HIGH_MOVABLE

On Fri, 26 Jan 2007, Mel Gorman wrote:

> > What is the e1000 problem? Jumbo packet allocation via GFP_KERNEL?
> Yes. Potentially the anti-fragmentation patches could address this by
> clustering atomic allocations together as much as possible.

GFP_ATOMIC allocs? Do you have a reference to the thread where this was
discussed?

2007-01-26 17:24:30

[permalink] [raw]

Subject: Re: [PATCH 2/8] Create the ZONE_MOVABLE zone

On Fri, 26 Jan 2007, Christoph Lameter wrote:

> I do not see any updates of vmstat.c and vmstat.h. This
> means that VM statistics are not kept / considered for ZONE_MOVABLE.
>

hmm, dirt.

Other than adding some TEXT_FOR_MOVABLE, an addition to TEXTS_FOR_ZONES()
and similar updates for FOR_ALL_ZONES(), what code in there uses special
awareness of the zone?

--
Mel Gorman
Part-time Phd Student Linux Technology Center
University of Limerick IBM Dublin Software Lab

2007-01-26 17:26:10

[permalink] [raw]

Subject: Re: [PATCH 2/8] Create the ZONE_MOVABLE zone

On Fri, 26 Jan 2007, Mel Gorman wrote:

> Other than adding some TEXT_FOR_MOVABLE, an addition to TEXTS_FOR_ZONES() and
> similar updates for FOR_ALL_ZONES(), what code in there uses special awareness
> of the zone?

Look for special handling of ZONE_DMA32 and you will find what you are
looking for. In particular ZONE_MOVABLE needs to be considered for
node_page_state calculations.

2007-01-26 17:37:30

[permalink] [raw]

Subject: Re: [PATCH 3/8] Allow huge page allocations to use GFP_HIGH_MOVABLE

On Fri, 26 Jan 2007, Christoph Lameter wrote:

> On Fri, 26 Jan 2007, Mel Gorman wrote:
>
>>> What is the e1000 problem? Jumbo packet allocation via GFP_KERNEL?
>> Yes. Potentially the anti-fragmentation patches could address this by
>> clustering atomic allocations together as much as possible.
>
> GFP_ATOMIC allocs?

Yes

> Do you have a reference to the thread where this was
> discussed?
>

It's come up a few times and the converation is always fairly similar
although the thread http://lkml.org/lkml/2006/9/22/44 has interesting
information on the topic. There has been no serious discussion on whether
anti-fragmentation would help it or not. I think it would if atomic
allocations were clustered together because then jumbo frame allocations
would cluster together in the same MAX_ORDER blocks and tend to keep other
allocations away.

--
Mel Gorman
Part-time Phd Student Linux Technology Center
University of Limerick IBM Dublin Software Lab

2007-01-26 17:38:27

[permalink] [raw]

Subject: Re: [PATCH 2/8] Create the ZONE_MOVABLE zone

On Fri, 26 Jan 2007, Christoph Lameter wrote:

> On Fri, 26 Jan 2007, Mel Gorman wrote:
>
>> Other than adding some TEXT_FOR_MOVABLE, an addition to TEXTS_FOR_ZONES() and
>> similar updates for FOR_ALL_ZONES(), what code in there uses special awareness
>> of the zone?
>
> Look for special handling of ZONE_DMA32 and you will find what you are
> looking for. In particular ZONE_MOVABLE needs to be considered for
> node_page_state calculations.
>

Ok, pretty clear. I've some additional work to do there. Thanks.

--
Mel Gorman
Part-time Phd Student Linux Technology Center
University of Limerick IBM Dublin Software Lab

2007-01-26 17:45:55

[permalink] [raw]

Subject: Re: [PATCH 3/8] Allow huge page allocations to use GFP_HIGH_MOVABLE

On Fri, 26 Jan 2007, Mel Gorman wrote:

> It's come up a few times and the converation is always fairly similar although
> the thread http://lkml.org/lkml/2006/9/22/44 has interesting information on
> the topic. There has been no serious discussion on whether anti-fragmentation
> would help it or not. I think it would if atomic allocations were clustered
> together because then jumbo frame allocations would cluster together in the
> same MAX_ORDER blocks and tend to keep other allocations away.

They are clustered in both schemes together with other non movable allocs
right? The problem is to defrag while atomic? How is the zone based
concept different in that area from the max order block based one?

2007-01-26 17:53:27

[permalink] [raw]

Subject: Re: [PATCH 3/8] Allow huge page allocations to use GFP_HIGH_MOVABLE

On Fri, 26 Jan 2007, Christoph Lameter wrote:

> On Fri, 26 Jan 2007, Mel Gorman wrote:
>
>> It's come up a few times and the converation is always fairly similar although
>> the thread http://lkml.org/lkml/2006/9/22/44 has interesting information on
>> the topic. There has been no serious discussion on whether anti-fragmentation
>> would help it or not. I think it would if atomic allocations were clustered
>> together because then jumbo frame allocations would cluster together in the
>> same MAX_ORDER blocks and tend to keep other allocations away.
>
> They are clustered in both schemes together with other non movable allocs
> right?

For the jumbo frame problem, only the antifragmentation approach of
clustering types of pages together in MAX_ORDER blocks has any chance of
helping.

> The problem is to defrag while atomic?

Worse, the problem is to have high order contiguous blocks free at the
time of allocation without reclaim or migration. If the allocations were
not atomic, anti-fragmentation as it is today would be enough.

By clustering atomic allocations together though, I would expect the jumbo
frames to be allocated and freed within the same area without interference
from other allocation types as long as min_free_kbytes was also set higher
than default. I lack the hardware to prove/disprove the idea though.

> How is the zone based
> concept different in that area from the max order block based one?

The zone-based approach does nothing to help jumbo frame allocations. It
only helps hugepage allocations at runtime and potentially memory
hot-remove.

--
Mel Gorman
Part-time Phd Student Linux Technology Center
University of Limerick IBM Dublin Software Lab

2007-01-26 18:20:22

[permalink] [raw]

Subject: Re: [PATCH 3/8] Allow huge page allocations to use GFP_HIGH_MOVABLE

On Fri, 26 Jan 2007, Mel Gorman wrote:

> The zone-based approach does nothing to help jumbo frame allocations. It only
> helps hugepage allocations at runtime and potentially memory hot-remove.

Sounds like the max order based approach is better in many ways. Also
avoids modifications to vmstat.c/.h ;-)

2007-01-26 18:56:47

by Chris Friesen

[permalink] [raw]

Subject: Re: [PATCH 3/8] Allow huge page allocations to use GFP_HIGH_MOVABLE

Mel Gorman wrote:

> Worse, the problem is to have high order contiguous blocks free at the
> time of allocation without reclaim or migration. If the allocations were
> not atomic, anti-fragmentation as it is today would be enough.

Has anyone looked at marking the buffers as "needs refilling" then kick
off a kernel thread or something to do the allocations under GFP_KERNEL?
That way we avoid having to allocate the buffers with GFP_ATOMIC.

I seem to recall that the tulip driver used to do this. Is it just too
complicated from a race condition standpoint?

We currently see this issue on our systems, as we have older e1000
hardware with 9KB jumbo frames. After a while we just fail to allocate
buffers and the system goes belly-up.

Chris

2007-01-26 19:46:30

[permalink] [raw]

Subject: Re: [PATCH 0/8] Create ZONE_MOVABLE to partition memory between movable and non-movable pages

On Fri, 26 Jan 2007 07:56:09 -0800 (PST)
Christoph Lameter <[email protected]> wrote:

> On Fri, 26 Jan 2007, Andrew Morton wrote:
>
> > - They add zillions of ifdefs
>
> They just add a few for ZONE_DMA where we alreaday have similar ifdefs for
> ZONE_DMA32 and ZONE_HIGHMEM.

I refreshed my memory. It remains awful.

> > - They make the VM's behaviour diverge between different platforms and
> > between differen configs on the same platforms, and hence degrade
> > maintainability and increase complexity.
>
> They avoid unecessary complexity on platforms. They could be made to work
> on more platforms with measures to deal with what ZONE_DMA
> provides in different ways. There are 6 or so platforms that do not need
> ZONE_DMA at all.

As Mel points out, distros will ship with CONFIG_ZONE_DMA=y, so the number
of machines which will actually benefit from this change is really small.
And the benefit to those few machines will also, I suspect, be small.

> > - We kicked around some quite different ways of implementing the same
> > things, but nothing came of it. iirc, one was to remove the hard-coded
> > zones altogether and rework all the MM to operate in terms of
> >
> > for (idx = 0; idx < NUMBER_OF_ZONES; idx++)
> > ...
>
> Hmmm.. How would that be simpler?

Replace a sprinkle of open-coded ifdefs with a regular code sequence which
everyone uses. Pretty obvious, I'd thought.

Plus it becoems straightforward to extend this from the present four zones
to a complete 12 zones, which gives use the full set of
ZONE_DMA20,ZONE_DMA21,...,ZONE_DMA32 for those funny devices.

> > - I haven't seen any hard numbers to justify the change.
>
> I have send you numbers showing significant reductions in code size.

If it isn't in the changelog it doesn't exist. I guess I didn't copy it
into the changelog.

If the only demonstrable benefit is a saving of a few k of text on a small
number of machines then things are looking very grim, IMO.

2007-01-26 19:58:31

[permalink] [raw]

Subject: Re: [PATCH 0/8] Create ZONE_MOVABLE to partition memory between movable and non-movable pages

On Fri, 26 Jan 2007, Andrew Morton wrote:

> As Mel points out, distros will ship with CONFIG_ZONE_DMA=y, so the number
> of machines which will actually benefit from this change is really small.
> And the benefit to those few machines will also, I suspect, be small.
>
> > > - We kicked around some quite different ways of implementing the same
> > > things, but nothing came of it. iirc, one was to remove the hard-coded
> > > zones altogether and rework all the MM to operate in terms of
> > >
> > > for (idx = 0; idx < NUMBER_OF_ZONES; idx++)
> > > ...
> >
> > Hmmm.. How would that be simpler?
>
> Replace a sprinkle of open-coded ifdefs with a regular code sequence which
> everyone uses. Pretty obvious, I'd thought.

We do use such loops in many places. However, stuff like array
initialization and special casing cannot use a loop. I am not sure what we
could change there. The hard coding is necessary because each zone
currently has these invariant characteristics that we need to consider.
Reducing the number of zones reduces the amount of special casing in the
VM that needs to be considered at run time and that is a potential issue
for trouble.

> Plus it becoems straightforward to extend this from the present four zones
> to a complete 12 zones, which gives use the full set of
> ZONE_DMA20,ZONE_DMA21,...,ZONE_DMA32 for those funny devices.

I just hope we can handle the VM complexity of load balancing etc etc that
this will introduce. Also each zone has management overhead and will cause
the touching of additional cachelines on many VM operations. Much of that
management overhead becomes unnecessary if we reduce zones.

> If the only demonstrable benefit is a saving of a few k of text on a small
> number of machines then things are looking very grim, IMO.

The main benefit is a significant simplification of the VM, leading to
robust and reliable operations and a reduction of the maintenance
headaches coming with the additional zones.

If we would introduce the ability of allocating from a range of
physical addresses then the need for DMA zones would go away allowing
flexibility for device driver DMA allocations and at the same time we get
rid of special casing in the VM.

2007-01-26 20:28:01

[permalink] [raw]

Subject: Re: [PATCH 0/8] Create ZONE_MOVABLE to partition memory between movable and non-movable pages

On Fri, 26 Jan 2007 11:58:18 -0800 (PST)
Christoph Lameter <[email protected]> wrote:

> > If the only demonstrable benefit is a saving of a few k of text on a small
> > number of machines then things are looking very grim, IMO.
>
> The main benefit is a significant simplification of the VM, leading to
> robust and reliable operations and a reduction of the maintenance
> headaches coming with the additional zones.
>
> If we would introduce the ability of allocating from a range of
> physical addresses then the need for DMA zones would go away allowing
> flexibility for device driver DMA allocations and at the same time we get
> rid of special casing in the VM.

None of this is valid. The great majority of machines out there will
continue to have the same number of zones. Nothing changes.

What will happen is that a small number of machines will have different
runtime behaviour. So they don't benefit from the majority's testing and
they don't contrinute to it and they potentially have unique-to-them
problems which we need to worry about.

That's all a real cost, so we need to see *good* benefits to outweigh that
cost. Thus far I don't think we've seen that.

2007-01-26 20:38:03

[permalink] [raw]

Subject: Re: [PATCH 3/8] Allow huge page allocations to use GFP_HIGH_MOVABLE

On Fri, 26 Jan 2007, Christoph Lameter wrote:

> On Fri, 26 Jan 2007, Mel Gorman wrote:
>
>> The zone-based approach does nothing to help jumbo frame allocations. It only
>> helps hugepage allocations at runtime and potentially memory hot-remove.
>
> Sounds like the max order based approach is better in many ways.

I agree but too many people are not pleased with the main allocator path
being affected and wanted to see zones, so here we are :)

> Also avoids modifications to vmstat.c/.h ;-)
>

--
Mel Gorman
Part-time Phd Student Linux Technology Center
University of Limerick IBM Dublin Software Lab

2007-01-26 20:44:47

[permalink] [raw]

Subject: Re: [PATCH 3/8] Allow huge page allocations to use GFP_HIGH_MOVABLE

On Fri, 26 Jan 2007, Chris Friesen wrote:

> Mel Gorman wrote:
>
>> Worse, the problem is to have high order contiguous blocks free at the time
>> of allocation without reclaim or migration. If the allocations were not
>> atomic, anti-fragmentation as it is today would be enough.
>
> Has anyone looked at marking the buffers as "needs refilling" then kick off a
> kernel thread or something to do the allocations under GFP_KERNEL?

I haven't seen it being discussed although it's probably doable as an
addition to the existing mempool mechanism. Anti-fragmentation would mean
that the non-atomic GFP_KERNEL allocation had a chance of succeeding.

> That way we avoid having to allocate the buffers with GFP_ATOMIC.
>

Unless the load was so high that the pool was getting depleted and memory
under so much pressure that reclaim could not keep up. But yes, it's
possible that GFP_ATOMIC allocations could be avoided the majority of
times.

> I seem to recall that the tulip driver used to do this. Is it just too
> complicated from a race condition standpoint?
>

It shouldn't be that complicated.

> We currently see this issue on our systems, as we have older e1000 hardware
> with 9KB jumbo frames. After a while we just fail to allocate buffers and
> the system goes belly-up.
>

Can you describe a reliable way of triggering this problem? At best, I
hear "on our undescribed workload, we sometimes see this problem" but not
much in the way of details.

--
Mel Gorman
Part-time Phd Student Linux Technology Center
University of Limerick IBM Dublin Software Lab

2007-01-26 21:37:13

by Chris Friesen

[permalink] [raw]

Subject: Re: [PATCH 3/8] Allow huge page allocations to use GFP_HIGH_MOVABLE

Mel Gorman wrote:
> On Fri, 26 Jan 2007, Chris Friesen wrote:

>> We currently see this issue on our systems, as we have older e1000
>> hardware with 9KB jumbo frames. After a while we just fail to
>> allocate buffers and the system goes belly-up.

> Can you describe a reliable way of triggering this problem? At best, I
> hear "on our undescribed workload, we sometimes see this problem" but
> not much in the way of details.

I work on embedded server applications. One of our blades is a
dual-Xeon with 8GB of RAM and 6 e1000 cards. The hardware is 32-bit
only, so we're using the i386 kernel with HIGHMEM64G enabled.

This blade acts essentially as storage for other blades in the shelf.
Basically all disk and network I/O. After being up for a month or two
it starts getting e1000 allocation failures. In some of the cases at
least it appears that the page cache has hundreds of megs of freeable
memory, but it can't get at that memory to fulfill an atomic allocation.

I should point out that we haven't yet tried tuning
/proc/sys/vm/min_free_kbytes. The default value on this system is 3831.

Chris

2007-01-29 17:28:41

[permalink] [raw]

Subject: Re: [PATCH 2/8] Create the ZONE_MOVABLE zone

On Fri, 26 Jan 2007, Christoph Lameter wrote:

> On Thu, 25 Jan 2007, Mel Gorman wrote:
>
>> @@ -166,6 +168,8 @@ enum zone_type {
>> #define ZONES_SHIFT 1
>> #elif __ZONE_COUNT <= 4
>> #define ZONES_SHIFT 2
>> +#elif __ZONE_COUNT <= 8
>> +#define ZONES_SHIFT 3
>> #else
>
> You do not need a shift of 3. Even with ZONE_MOVABLE the maximum
> number of zones is still 4.
>

Yep, this is correct. If it's ever wrong, there is an additional check for
__ZONE_COUNT that will print out the appropriate warning.

Thanks

--
Mel Gorman
Part-time Phd Student Linux Technology Center
University of Limerick IBM Dublin Software Lab

2007-01-29 18:03:45

by mel

[permalink] [raw]

Subject: Re: [PATCH 2/8] Create the ZONE_MOVABLE zone

On (26/01/07 09:16), Christoph Lameter didst pronounce:
> I do not see any updates of vmstat.c and vmstat.h. This
> means that VM statistics are not kept / considered for ZONE_MOVABLE.

Based on searching around for ZONE_DMA32, the following patch appears to be
all that is required;

diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.20-rc4-mm1-009_backout_zonecount/include/linux/vmstat.h linux-2.6.20-rc4-mm1-010_update_zonecounters/include/linux/vmstat.h
--- linux-2.6.20-rc4-mm1-009_backout_zonecount/include/linux/vmstat.h 2007-01-17 17:08:36.000000000 +0000
+++ linux-2.6.20-rc4-mm1-010_update_zonecounters/include/linux/vmstat.h 2007-01-29 16:52:42.000000000 +0000
@@ -24,7 +24,7 @@
#define HIGHMEM_ZONE(xx)
#endif

-#define FOR_ALL_ZONES(xx) DMA_ZONE(xx) DMA32_ZONE(xx) xx##_NORMAL HIGHMEM_ZONE(xx)
+#define FOR_ALL_ZONES(xx) DMA_ZONE(xx) DMA32_ZONE(xx) xx##_NORMAL HIGHMEM_ZONE(xx) , xx##_MOVABLE

enum vm_event_item { PGPGIN, PGPGOUT, PSWPIN, PSWPOUT,
FOR_ALL_ZONES(PGALLOC),
@@ -171,7 +171,8 @@ static inline unsigned long node_page_st
#ifdef CONFIG_HIGHMEM
zone_page_state(&zones[ZONE_HIGHMEM], item) +
#endif
- zone_page_state(&zones[ZONE_NORMAL], item);
+ zone_page_state(&zones[ZONE_NORMAL], item) +
+ zone_page_state(&zones[ZONE_MOVABLE], item);
}

extern void zone_statistics(struct zonelist *, struct zone *);
diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.20-rc4-mm1-009_backout_zonecount/mm/vmstat.c linux-2.6.20-rc4-mm1-010_update_zonecounters/mm/vmstat.c
--- linux-2.6.20-rc4-mm1-009_backout_zonecount/mm/vmstat.c 2007-01-17 17:08:39.000000000 +0000
+++ linux-2.6.20-rc4-mm1-010_update_zonecounters/mm/vmstat.c 2007-01-29 16:52:42.000000000 +0000
@@ -456,7 +456,7 @@ const struct seq_operations fragmentatio
#endif

#define TEXTS_FOR_ZONES(xx) TEXT_FOR_DMA(xx) TEXT_FOR_DMA32(xx) xx "_normal", \
- TEXT_FOR_HIGHMEM(xx)
+ TEXT_FOR_HIGHMEM(xx) xx "_movable",

static const char * const vmstat_text[] = {
/* Zoned VM counters */

2007-01-29 21:54:51

[permalink] [raw]

Subject: Re: [PATCH 0/8] Create ZONE_MOVABLE to partition memory between movable and non-movable pages

On Fri, 26 Jan 2007, Andrew Morton wrote:

> > The main benefit is a significant simplification of the VM, leading to
> > robust and reliable operations and a reduction of the maintenance
> > headaches coming with the additional zones.
> >
> > If we would introduce the ability of allocating from a range of
> > physical addresses then the need for DMA zones would go away allowing
> > flexibility for device driver DMA allocations and at the same time we get
> > rid of special casing in the VM.
>
> None of this is valid. The great majority of machines out there will
> continue to have the same number of zones. Nothing changes.

All 64 bit machine will only have a single zone if we have such a range
alloc mechanism. The 32bit ones with HIGHMEM wont be able to avoid it,
true. But all arches that do not need gymnastics to access their memory
will be able run with a single zone.

> That's all a real cost, so we need to see *good* benefits to outweigh that
> cost. Thus far I don't think we've seen that.

The real savings is the simplicity of VM design, robustness and
efficiency. We loose on all these fronts if we keep or add useless zones.

The main reason for the recent problems with dirty handling seem to be due
to exactly such a multizone balancing issues involving ZONE_NORMAL and
HIGHMEM. Those problems cannot occur on single ZONE arches (this means
right now on a series of embedded arches, UML and IA64).

Multiple ZONES are a recipie for VM fragility and result in complexity
that is difficult to manage.

2007-01-29 22:37:09

[permalink] [raw]

Subject: Re: [PATCH 0/8] Create ZONE_MOVABLE to partition memory between movable and non-movable pages

On Mon, 29 Jan 2007 13:54:38 -0800 (PST)
Christoph Lameter <[email protected]> wrote:

> On Fri, 26 Jan 2007, Andrew Morton wrote:
>
> > > The main benefit is a significant simplification of the VM, leading to
> > > robust and reliable operations and a reduction of the maintenance
> > > headaches coming with the additional zones.
> > >
> > > If we would introduce the ability of allocating from a range of
> > > physical addresses then the need for DMA zones would go away allowing
> > > flexibility for device driver DMA allocations and at the same time we get
> > > rid of special casing in the VM.
> >
> > None of this is valid. The great majority of machines out there will
> > continue to have the same number of zones. Nothing changes.
>
> All 64 bit machine will only have a single zone if we have such a range
> alloc mechanism. The 32bit ones with HIGHMEM wont be able to avoid it,
> true. But all arches that do not need gymnastics to access their memory
> will be able run with a single zone.

What is "such a range alloc mechanism"?

> > That's all a real cost, so we need to see *good* benefits to outweigh that
> > cost. Thus far I don't think we've seen that.
>
> The real savings is the simplicity of VM design, robustness and
> efficiency. We loose on all these fronts if we keep or add useless zones.
>
> The main reason for the recent problems with dirty handling seem to be due
> to exactly such a multizone balancing issues involving ZONE_NORMAL and
> HIGHMEM. Those problems cannot occur on single ZONE arches (this means
> right now on a series of embedded arches, UML and IA64).
>
> Multiple ZONES are a recipie for VM fragility and result in complexity
> that is difficult to manage.

Why do I have to keep repeating myself? 90% of known FC6-running machines
are x86-32. 90% of vendor-shipped kernels need all three zones. And the
remaining 10% ship with multiple nodes as well.

So please stop telling me what a wonderful world it is to not have multiple
zones. It just isn't going to happen for a long long time. The
multiple-zone kernel is the case we need to care about most by a very large
margin indeed. Single-zone is an infinitesimal corner-case.

2007-01-29 22:45:45

[permalink] [raw]

Subject: Re: [PATCH 0/8] Create ZONE_MOVABLE to partition memory between movable and non-movable pages

On Mon, 29 Jan 2007, Andrew Morton wrote:

> > All 64 bit machine will only have a single zone if we have such a range
> > alloc mechanism. The 32bit ones with HIGHMEM wont be able to avoid it,
> > true. But all arches that do not need gymnastics to access their memory
> > will be able run with a single zone.
>
> What is "such a range alloc mechanism"?

As I mentioned above: A function that allows an allocation to specify
which physical memory ranges are permitted.

> So please stop telling me what a wonderful world it is to not have multiple
> zones. It just isn't going to happen for a long long time. The
> multiple-zone kernel is the case we need to care about most by a very large
> margin indeed. Single-zone is an infinitesimal corner-case.

We can still reduce the number of zones for those that require highmem to
two which may allows us to avoid ZONE_DMA/DMA32 issues and allow dma
devices to avoid bunce buffers that can do I/O to memory ranges not
compatible with the current boundaries of DMA/DMA32. And I am also
repeating myself.

2007-01-29 22:50:17

by Russell King

[permalink] [raw]

Subject: Re: [PATCH 0/8] Create ZONE_MOVABLE to partition memory between movable and non-movable pages

On Mon, Jan 29, 2007 at 02:45:06PM -0800, Christoph Lameter wrote:
> On Mon, 29 Jan 2007, Andrew Morton wrote:
>
> > > All 64 bit machine will only have a single zone if we have such a range
> > > alloc mechanism. The 32bit ones with HIGHMEM wont be able to avoid it,
> > > true. But all arches that do not need gymnastics to access their memory
> > > will be able run with a single zone.
> >
> > What is "such a range alloc mechanism"?
>
> As I mentioned above: A function that allows an allocation to specify
> which physical memory ranges are permitted.
>
> > So please stop telling me what a wonderful world it is to not have multiple
> > zones. It just isn't going to happen for a long long time. The
> > multiple-zone kernel is the case we need to care about most by a very large
> > margin indeed. Single-zone is an infinitesimal corner-case.
>
> We can still reduce the number of zones for those that require highmem to
> two which may allows us to avoid ZONE_DMA/DMA32 issues and allow dma
> devices to avoid bunce buffers that can do I/O to memory ranges not
> compatible with the current boundaries of DMA/DMA32. And I am also
> repeating myself.

This sounds like it could help ARM where we have some weird DMA areas.

What will help even more is if the block layer can also be persuaded that
a device dma mask is precisely that - a mask - and not a set of leading
ones followed by a set of zeros, then we could eliminate the really ugly
dmabounce code.

--
Russell King
Linux kernel 2.6 ARM Linux - http://www.arm.linux.org.uk/
maintainer of:

2007-01-29 23:38:09

[permalink] [raw]

Subject: Re: [PATCH 0/8] Create ZONE_MOVABLE to partition memory between movable and non-movable pages

On Mon, 29 Jan 2007, Russell King wrote:

> This sounds like it could help ARM where we have some weird DMA areas.

Some ARM platforms have no need for a ZONE_DMA. The code in mm allows you
to not compile ZONE_DMA support into these kernels.

> What will help even more is if the block layer can also be persuaded that
> a device dma mask is precisely that - a mask - and not a set of leading
> ones followed by a set of zeros, then we could eliminate the really ugly
> dmabounce code.

With a alloc_pages_range() one would be able to specify upper and lower
boundaries. The device dma mask can be translated to a fitting boundary.
Maybe we can then also get rid of the device mask and specify a boundary
there. There is a lot of ugly code all around that circumvents the
existing issues with dma masks. That would all go away.

2007-01-30 00:09:41

[permalink] [raw]

Subject: Re: [PATCH 0/8] Create ZONE_MOVABLE to partition memory between movable and non-movable pages

On Mon, 29 Jan 2007 15:37:29 -0800 (PST)
Christoph Lameter <[email protected]> wrote:

> With a alloc_pages_range() one would be able to specify upper and lower
> boundaries.

Is there a proposal anywhere regarding how this would be implemented?

2007-01-30 09:55:39

by Peter Zijlstra

[permalink] [raw]

Subject: Re: [PATCH 0/8] Create ZONE_MOVABLE to partition memory between movable and non-movable pages

On Mon, 2007-01-29 at 16:09 -0800, Andrew Morton wrote:
> On Mon, 29 Jan 2007 15:37:29 -0800 (PST)
> Christoph Lameter <[email protected]> wrote:
>
> > With a alloc_pages_range() one would be able to specify upper and lower
> > boundaries.
>
> Is there a proposal anywhere regarding how this would be implemented?

I'm guessing this will involve page migration.

Still, would we need to place bounds on non movable pages, or will it be
a best effort? It seems the current zone approach is a best effort too,
although it does try to keep allocations away from the lower zones as
much as possible.

But I guess we could make a single zone allocator prefer high addresses
too.

So then we'd end up with a single zone, and each allocation would give a
range. Try and pick a free page with as high an address as possible in
the given range. If no pages available in the given range try and move
some movable pages out of it.

This does of course involve finding free pages in a given range, and
identifying pages as movable.

And a gazillion trivial but tedious things I've forgotten. Christoph, is
this what you were getting at?

2007-02-02 05:22:55

[permalink] [raw]

Subject: Re: [PATCH 0/8] Create ZONE_MOVABLE to partition memory between movable and non-movable pages

On Mon, 29 Jan 2007, Andrew Morton wrote:

> On Mon, 29 Jan 2007 15:37:29 -0800 (PST)
> Christoph Lameter <[email protected]> wrote:
>
> > With a alloc_pages_range() one would be able to specify upper and lower
> > boundaries.
>
> Is there a proposal anywhere regarding how this would be implemented?

Yes it was discussed a while back in August. Look for alloc_pages_range.
Sadly I have not been able to do work on it since there are too many
other issues.

2007-02-02 05:28:08