2006-02-17 14:17:16

by Mel Gorman

[permalink] [raw]
Subject: [PATCH 0/7] Reducing fragmentation using zones v5

Changelog since v4
o Move x86_64 from one patch to another
o Fix for oops bug on ppc64

Changelog since v3
o Minor bugs
o ppc64 can specify kernelcore
o Ability to disable use of ZONE_EASYRCLM at boot time
o HugeTLB uses ZONE_EASYRCLM
o Add drain-percpu caches for testing
o boot-parameter documentation added

This is a zone-based approach to anti-fragmentation. It is posted in light
of the discussions of the list-based (sometimes dubbed sub-zones) approach,
where the prevailing opinion was that zones were the answer. The patches
have been tested on linux-2.6.16-rc3-mm1 with x86 and two different ppc64
machines. They have also been boot-tested on x86_64. If there are no
objections, I would like these patches to spend some time in -mm. The
patches also apply with offsets to 2.6.16-rc3.

The usage scenario I set up to test the patches is:

1. Test machine 1: 4-way x86 machine with 1.5GiB physical RAM
Test machine 2: 4-way PPC64 machine with 4GiB physical RAM
2. Boot with kernelcore=512MB. This gives the kernel 512MB to work with and
the rest is placed in ZONE_EASYRCLM (see patch 3 for more comments about
the value of kernelcore)
3. Benchmark kbuild, aim9 and high order allocations

An alternative scenario has been tested and produces similar figures. The
scenario is:

1. Test machine: 4-way x86 machine with 1.5GiB physical RAM
2. Boot with mem=512MB
3. Hot-add the remaining memory
4. Benchmark kbuild, aim9 and high order allocations

The alternative scenario requires two more patches related to hot-adding
on the x86. I can post them if people want to take a look or experiment
with hot-add instead of using kernelcore=.

Benchmark comparison between the -mm+NoOOM tree and the new zones on the
x86. The tests are run in order with no reboots between them so that the
system is in a used state when we try to allocate large pages.

KBuild
                               2.6.16-rc3-mm1-clean  2.6.16-rc3-mm1-zbuddy-v5
Time taken to extract kernel:            16                    14
Time taken to build kernel:            1110                  1110

Comparable performance there. On ppc64, zbuddy was slightly faster as well.

Aim9
2.6.16-rc3-mm1-clean 2.6.16-rc3-mm1-zbuddy-v5
1 creat-clo 13381.10 13276.69 -104.41 -0.78% File Creations and Closes/second
2 page_test 121821.06 123350.55 1529.49 1.26% System Allocations & Pages/second
3 brk_test 597817.76 589136.95 -8680.81 -1.45% System Memory Allocations/second
4 jmp_test 4372237.96 4377790.74 5552.78 0.13% Non-local gotos/second
5 signal_test 84038.65 82755.75 -1282.90 -1.53% Signal Traps/second
6 exec_test 61.99 62.21 0.22 0.35% Program Loads/second
7 fork_test 1178.82 1162.36 -16.46 -1.40% Task Creations/second
8 link_test 4801.10 4799.00 -2.10 -0.04% Link/Unlink Pairs/second

Comparable performance again. On ppc64, almost all the microbenchmarks were
within 0.2% of each other. The one exception was jmp_test, but jmp_test
results vary widely between runs on the same kernel on ppc64, so I ignored
it.

HugeTLB Allocation Capability under load
This test differs from older reports, which used a kernel module to really
try to allocate HugeTLB-sized pages; that approach is artificial. This test
uses only the proc interface to adjust nr_hugepages. The test is:

1. Build 7 kernels at the same time
2. Try to get HugeTLB pages
3. Stop the compiles, delete the trees and try to get HugeTLB pages
4. dd a file the size of physical memory, cat it to /dev/null, delete it,
then try to get HugeTLB pages.

The results for x86 were:

                                   2.6.16-rc3-mm1-clean  2.6.16-rc3-mm1-zbuddy-v5
During compile:                              7                     7
At rest before dd of large file:            56                    60
At rest after dd of large file:             79                    85

These seem comparable, but only because the HighMem zone on the clean
kernel is a similar size to the EasyRclm zone with zbuddy-v5. With PPC64,
where there was a much larger difference, the results were:

                                   2.6.16-rc3-mm1-clean  2.6.16-rc3-mm1-zbuddy-v5
During compile:                              0                     0
At rest before dd of large file:             0                   103
At rest after dd of large file:              6                   154

So, the zone can potentially make a massive difference to the availability
of HugeTLB pages on some systems. I expect a similar result on an x86 with
more physical RAM than my current test box.

Using the kernel module to really stress the allocation of pages, there
were even wider differences between the results, with the zone-based
anti-fragmentation kernel having a clear advantage. On ppc64, 191 huge
pages were available at rest with zone-based anti-fragmentation in
comparison to 33 huge pages without it. On x86, the zone-based
anti-fragmentation kernel had only slightly more huge pages, but again this
is because the HighMem zone on the clean kernel is similar in size to the
EasyRclm zone on the anti-fragmentation kernel.

Bottom line: both kernels perform similarly on two different architectures
but, when configured correctly, there can be a massive difference in the
availability of huge pages and in the regions that can be hot-removed. When
kernelcore is not configured at boot time, there are no measurable
differences between the kernels.

The final diffstat for the patches is:
Documentation/kernel-parameters.txt | 20 ++++++++++++
arch/i386/kernel/setup.c | 49 ++++++++++++++++++++++++++++++-
arch/i386/mm/init.c | 2 -
arch/powerpc/mm/mem.c | 2 -
arch/powerpc/mm/numa.c | 56 ++++++++++++++++++++++++++++++++++--
arch/x86_64/mm/init.c | 2 -
fs/compat.c | 2 -
fs/exec.c | 2 -
fs/inode.c | 2 -
include/asm-i386/page.h | 3 +
include/linux/gfp.h | 3 +
include/linux/highmem.h | 2 -
include/linux/memory_hotplug.h | 1
include/linux/mmzone.h | 12 ++++---
mm/hugetlb.c | 4 +-
mm/memory.c | 4 +-
mm/mempolicy.c | 2 -
mm/page_alloc.c | 20 +++++++++---
mm/shmem.c | 4 ++
mm/swap_state.c | 2 -
20 files changed, 165 insertions(+), 29 deletions(-)

--
Mel Gorman
Part-time PhD Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab


2006-02-17 14:17:45

by Mel Gorman

[permalink] [raw]
Subject: [PATCH 2/7] Create the ZONE_EASYRCLM zone


This patch adds the ZONE_EASYRCLM zone and updates the relevant constants
and helper functions. After this patch is applied, memory that is hot-added
on the x86 will be placed in ZONE_EASYRCLM. Memory hot-added on the ppc64
still goes to ZONE_DMA.

Signed-off-by: Mel Gorman <[email protected]>
diff -rup -X /usr/src/patchset-0.5/bin//dontdiff linux-2.6.16-rc3-mm1-101_antifrag_flags/arch/x86_64/mm/init.c linux-2.6.16-rc3-mm1-102_addzone/arch/x86_64/mm/init.c
--- linux-2.6.16-rc3-mm1-101_antifrag_flags/arch/x86_64/mm/init.c 2006-02-16 09:50:42.000000000 +0000
+++ linux-2.6.16-rc3-mm1-102_addzone/arch/x86_64/mm/init.c 2006-02-17 09:42:01.000000000 +0000
@@ -495,7 +495,7 @@ void online_page(struct page *page)
int add_memory(u64 start, u64 size)
{
struct pglist_data *pgdat = NODE_DATA(0);
- struct zone *zone = pgdat->node_zones + MAX_NR_ZONES-2;
+ struct zone *zone = pgdat->node_zones + MAX_NR_ZONES-3;
unsigned long start_pfn = start >> PAGE_SHIFT;
unsigned long nr_pages = size >> PAGE_SHIFT;
int ret;
diff -rup -X /usr/src/patchset-0.5/bin//dontdiff linux-2.6.16-rc3-mm1-101_antifrag_flags/include/linux/mmzone.h linux-2.6.16-rc3-mm1-102_addzone/include/linux/mmzone.h
--- linux-2.6.16-rc3-mm1-101_antifrag_flags/include/linux/mmzone.h 2006-02-13 00:27:25.000000000 +0000
+++ linux-2.6.16-rc3-mm1-102_addzone/include/linux/mmzone.h 2006-02-17 09:42:01.000000000 +0000
@@ -73,9 +73,10 @@ struct per_cpu_pageset {
#define ZONE_DMA32 1
#define ZONE_NORMAL 2
#define ZONE_HIGHMEM 3
+#define ZONE_EASYRCLM 4

-#define MAX_NR_ZONES 4 /* Sync this with ZONES_SHIFT */
-#define ZONES_SHIFT 2 /* ceil(log2(MAX_NR_ZONES)) */
+#define MAX_NR_ZONES 5 /* Sync this with ZONES_SHIFT */
+#define ZONES_SHIFT 3 /* ceil(log2(MAX_NR_ZONES)) */


/*
@@ -103,7 +104,7 @@ struct per_cpu_pageset {
*
* NOTE! Make sure this matches the zones in <linux/gfp.h>
*/
-#define GFP_ZONEMASK 0x07
+#define GFP_ZONEMASK 0x0f
/* #define GFP_ZONETYPES (GFP_ZONEMASK + 1) */ /* Non-loner */
#define GFP_ZONETYPES ((GFP_ZONEMASK + 1) / 2 + 1) /* Loner */

@@ -408,7 +409,7 @@ static inline int populated_zone(struct

static inline int is_highmem_idx(int idx)
{
- return (idx == ZONE_HIGHMEM);
+ return (idx == ZONE_HIGHMEM || idx == ZONE_EASYRCLM);
}

static inline int is_normal_idx(int idx)
@@ -424,7 +425,8 @@ static inline int is_normal_idx(int idx)
*/
static inline int is_highmem(struct zone *zone)
{
- return zone == zone->zone_pgdat->node_zones + ZONE_HIGHMEM;
+ return zone == zone->zone_pgdat->node_zones + ZONE_HIGHMEM ||
+ zone == zone->zone_pgdat->node_zones + ZONE_EASYRCLM;
}

static inline int is_normal(struct zone *zone)
diff -rup -X /usr/src/patchset-0.5/bin//dontdiff linux-2.6.16-rc3-mm1-101_antifrag_flags/mm/page_alloc.c linux-2.6.16-rc3-mm1-102_addzone/mm/page_alloc.c
--- linux-2.6.16-rc3-mm1-101_antifrag_flags/mm/page_alloc.c 2006-02-16 09:50:44.000000000 +0000
+++ linux-2.6.16-rc3-mm1-102_addzone/mm/page_alloc.c 2006-02-17 09:42:01.000000000 +0000
@@ -68,7 +68,7 @@ static void __free_pages_ok(struct page
* TBD: should special case ZONE_DMA32 machines here - in those we normally
* don't need any ZONE_NORMAL reservation
*/
-int sysctl_lowmem_reserve_ratio[MAX_NR_ZONES-1] = { 256, 256, 32 };
+int sysctl_lowmem_reserve_ratio[MAX_NR_ZONES-1] = { 256, 256, 32, 32 };

EXPORT_SYMBOL(totalram_pages);

@@ -79,7 +79,8 @@ EXPORT_SYMBOL(totalram_pages);
struct zone *zone_table[1 << ZONETABLE_SHIFT] __read_mostly;
EXPORT_SYMBOL(zone_table);

-static char *zone_names[MAX_NR_ZONES] = { "DMA", "DMA32", "Normal", "HighMem" };
+static char *zone_names[MAX_NR_ZONES] = { "DMA", "DMA32", "Normal",
+ "HighMem", "EasyRclm" };
int min_free_kbytes = 1024;

unsigned long __initdata nr_kernel_pages;
@@ -758,6 +759,7 @@ static inline void prep_zero_page(struct
int i;

BUG_ON((gfp_flags & (__GFP_WAIT | __GFP_HIGHMEM)) == __GFP_HIGHMEM);
+ BUG_ON((gfp_flags & (__GFP_WAIT | __GFP_EASYRCLM)) == __GFP_EASYRCLM);
for(i = 0; i < (1 << order); i++)
clear_highpage(page + i);
}
@@ -1266,7 +1268,7 @@ unsigned int nr_free_buffer_pages(void)
*/
unsigned int nr_free_pagecache_pages(void)
{
- return nr_free_zone_pages(gfp_zone(GFP_HIGHUSER));
+ return nr_free_zone_pages(gfp_zone(GFP_RCLMUSER));
}

#ifdef CONFIG_HIGHMEM
@@ -1276,7 +1278,7 @@ unsigned int nr_free_highpages (void)
unsigned int pages = 0;

for_each_pgdat(pgdat)
- pages += pgdat->node_zones[ZONE_HIGHMEM].free_pages;
+ pages += pgdat->node_zones[ZONE_EASYRCLM].free_pages;

return pages;
}
@@ -1584,7 +1586,7 @@ static int __init build_zonelists_node(p
{
struct zone *zone;

- BUG_ON(zone_type > ZONE_HIGHMEM);
+ BUG_ON(zone_type > ZONE_EASYRCLM);

do {
zone = pgdat->node_zones + zone_type;
@@ -1604,6 +1606,8 @@ static int __init build_zonelists_node(p
static inline int highest_zone(int zone_bits)
{
int res = ZONE_NORMAL;
+ if (zone_bits & (__force int)__GFP_EASYRCLM)
+ res = ZONE_EASYRCLM;
if (zone_bits & (__force int)__GFP_HIGHMEM)
res = ZONE_HIGHMEM;
if (zone_bits & (__force int)__GFP_DMA32)

2006-02-17 14:17:20

by Mel Gorman

[permalink] [raw]
Subject: [PATCH 1/7] Add __GFP_EASYRCLM flag and update callers


This creates a zone modifier __GFP_EASYRCLM and a set of GFP flags called
GFP_RCLMUSER. The only difference between GFP_HIGHUSER and GFP_RCLMUSER is
the zone that is used. Callers for which ZONE_EASYRCLM is appropriate are
changed to use it.

Signed-off-by: Mel Gorman <[email protected]>
diff -rup -X /usr/src/patchset-0.5/bin//dontdiff linux-2.6.16-rc3-mm1-clean/fs/compat.c linux-2.6.16-rc3-mm1-101_antifrag_flags/fs/compat.c
--- linux-2.6.16-rc3-mm1-clean/fs/compat.c 2006-02-13 00:27:25.000000000 +0000
+++ linux-2.6.16-rc3-mm1-101_antifrag_flags/fs/compat.c 2006-02-17 09:41:14.000000000 +0000
@@ -1397,7 +1397,7 @@ static int compat_copy_strings(int argc,
page = bprm->page[i];
new = 0;
if (!page) {
- page = alloc_page(GFP_HIGHUSER);
+ page = alloc_page(GFP_RCLMUSER);
bprm->page[i] = page;
if (!page) {
ret = -ENOMEM;
diff -rup -X /usr/src/patchset-0.5/bin//dontdiff linux-2.6.16-rc3-mm1-clean/fs/exec.c linux-2.6.16-rc3-mm1-101_antifrag_flags/fs/exec.c
--- linux-2.6.16-rc3-mm1-clean/fs/exec.c 2006-02-16 09:50:43.000000000 +0000
+++ linux-2.6.16-rc3-mm1-101_antifrag_flags/fs/exec.c 2006-02-17 09:41:14.000000000 +0000
@@ -238,7 +238,7 @@ static int copy_strings(int argc, char _
page = bprm->page[i];
new = 0;
if (!page) {
- page = alloc_page(GFP_HIGHUSER);
+ page = alloc_page(GFP_RCLMUSER);
bprm->page[i] = page;
if (!page) {
ret = -ENOMEM;
diff -rup -X /usr/src/patchset-0.5/bin//dontdiff linux-2.6.16-rc3-mm1-clean/fs/inode.c linux-2.6.16-rc3-mm1-101_antifrag_flags/fs/inode.c
--- linux-2.6.16-rc3-mm1-clean/fs/inode.c 2006-02-16 09:50:43.000000000 +0000
+++ linux-2.6.16-rc3-mm1-101_antifrag_flags/fs/inode.c 2006-02-17 09:41:14.000000000 +0000
@@ -147,7 +147,7 @@ static struct inode *alloc_inode(struct
mapping->a_ops = &empty_aops;
mapping->host = inode;
mapping->flags = 0;
- mapping_set_gfp_mask(mapping, GFP_HIGHUSER);
+ mapping_set_gfp_mask(mapping, GFP_RCLMUSER);
mapping->assoc_mapping = NULL;
mapping->backing_dev_info = &default_backing_dev_info;

diff -rup -X /usr/src/patchset-0.5/bin//dontdiff linux-2.6.16-rc3-mm1-clean/include/asm-i386/page.h linux-2.6.16-rc3-mm1-101_antifrag_flags/include/asm-i386/page.h
--- linux-2.6.16-rc3-mm1-clean/include/asm-i386/page.h 2006-02-13 00:27:25.000000000 +0000
+++ linux-2.6.16-rc3-mm1-101_antifrag_flags/include/asm-i386/page.h 2006-02-17 09:41:14.000000000 +0000
@@ -36,7 +36,8 @@
#define clear_user_page(page, vaddr, pg) clear_page(page)
#define copy_user_page(to, from, vaddr, pg) copy_page(to, from)

-#define alloc_zeroed_user_highpage(vma, vaddr) alloc_page_vma(GFP_HIGHUSER | __GFP_ZERO, vma, vaddr)
+#define alloc_zeroed_user_highpage(vma, vaddr) \
+ alloc_page_vma(GFP_RCLMUSER | __GFP_ZERO, vma, vaddr)
#define __HAVE_ARCH_ALLOC_ZEROED_USER_HIGHPAGE

/*
diff -rup -X /usr/src/patchset-0.5/bin//dontdiff linux-2.6.16-rc3-mm1-clean/include/linux/gfp.h linux-2.6.16-rc3-mm1-101_antifrag_flags/include/linux/gfp.h
--- linux-2.6.16-rc3-mm1-clean/include/linux/gfp.h 2006-02-13 00:27:25.000000000 +0000
+++ linux-2.6.16-rc3-mm1-101_antifrag_flags/include/linux/gfp.h 2006-02-17 09:41:14.000000000 +0000
@@ -21,6 +21,7 @@ struct vm_area_struct;
#else
#define __GFP_DMA32 ((__force gfp_t)0x04) /* Has own ZONE_DMA32 */
#endif
+#define __GFP_EASYRCLM ((__force gfp_t)0x08u)

/*
* Action modifiers - doesn't change the zoning
@@ -65,6 +66,8 @@ struct vm_area_struct;
#define GFP_USER (__GFP_WAIT | __GFP_IO | __GFP_FS | __GFP_HARDWALL)
#define GFP_HIGHUSER (__GFP_WAIT | __GFP_IO | __GFP_FS | __GFP_HARDWALL | \
__GFP_HIGHMEM)
+#define GFP_RCLMUSER (__GFP_WAIT | __GFP_IO | __GFP_FS | __GFP_HARDWALL | \
+ __GFP_EASYRCLM)

/* Flag - indicates that the buffer will be suitable for DMA. Ignored on some
platforms, used as appropriate on others */
diff -rup -X /usr/src/patchset-0.5/bin//dontdiff linux-2.6.16-rc3-mm1-clean/include/linux/highmem.h linux-2.6.16-rc3-mm1-101_antifrag_flags/include/linux/highmem.h
--- linux-2.6.16-rc3-mm1-clean/include/linux/highmem.h 2006-02-13 00:27:25.000000000 +0000
+++ linux-2.6.16-rc3-mm1-101_antifrag_flags/include/linux/highmem.h 2006-02-17 09:41:14.000000000 +0000
@@ -47,7 +47,7 @@ static inline void clear_user_highpage(s
static inline struct page *
alloc_zeroed_user_highpage(struct vm_area_struct *vma, unsigned long vaddr)
{
- struct page *page = alloc_page_vma(GFP_HIGHUSER, vma, vaddr);
+ struct page *page = alloc_page_vma(GFP_RCLMUSER, vma, vaddr);

if (page)
clear_user_highpage(page, vaddr);
diff -rup -X /usr/src/patchset-0.5/bin//dontdiff linux-2.6.16-rc3-mm1-clean/mm/memory.c linux-2.6.16-rc3-mm1-101_antifrag_flags/mm/memory.c
--- linux-2.6.16-rc3-mm1-clean/mm/memory.c 2006-02-16 09:50:44.000000000 +0000
+++ linux-2.6.16-rc3-mm1-101_antifrag_flags/mm/memory.c 2006-02-17 09:41:14.000000000 +0000
@@ -1480,7 +1480,7 @@ gotten:
if (!new_page)
goto oom;
} else {
- new_page = alloc_page_vma(GFP_HIGHUSER, vma, address);
+ new_page = alloc_page_vma(GFP_RCLMUSER, vma, address);
if (!new_page)
goto oom;
cow_user_page(new_page, old_page, address);
@@ -2079,7 +2079,7 @@ retry:

if (unlikely(anon_vma_prepare(vma)))
goto oom;
- page = alloc_page_vma(GFP_HIGHUSER, vma, address);
+ page = alloc_page_vma(GFP_RCLMUSER, vma, address);
if (!page)
goto oom;
copy_user_highpage(page, new_page, address);
diff -rup -X /usr/src/patchset-0.5/bin//dontdiff linux-2.6.16-rc3-mm1-clean/mm/shmem.c linux-2.6.16-rc3-mm1-101_antifrag_flags/mm/shmem.c
--- linux-2.6.16-rc3-mm1-clean/mm/shmem.c 2006-02-16 09:50:44.000000000 +0000
+++ linux-2.6.16-rc3-mm1-101_antifrag_flags/mm/shmem.c 2006-02-17 09:41:14.000000000 +0000
@@ -921,6 +921,8 @@ shmem_alloc_page(gfp_t gfp, struct shmem
pvma.vm_policy = mpol_shared_policy_lookup(&info->policy, idx);
pvma.vm_pgoff = idx;
pvma.vm_end = PAGE_SIZE;
+ if (gfp & __GFP_HIGHMEM)
+ gfp = (gfp & ~__GFP_HIGHMEM) | __GFP_EASYRCLM;
page = alloc_page_vma(gfp | __GFP_ZERO, &pvma, 0);
mpol_free(pvma.vm_policy);
return page;
@@ -936,6 +938,8 @@ shmem_swapin(struct shmem_inode_info *in
static inline struct page *
shmem_alloc_page(gfp_t gfp,struct shmem_inode_info *info, unsigned long idx)
{
+ if (gfp & __GFP_HIGHMEM)
+ gfp = (gfp & ~__GFP_HIGHMEM) | __GFP_EASYRCLM;
return alloc_page(gfp | __GFP_ZERO);
}
#endif
diff -rup -X /usr/src/patchset-0.5/bin//dontdiff linux-2.6.16-rc3-mm1-clean/mm/swap_state.c linux-2.6.16-rc3-mm1-101_antifrag_flags/mm/swap_state.c
--- linux-2.6.16-rc3-mm1-clean/mm/swap_state.c 2006-02-13 00:27:25.000000000 +0000
+++ linux-2.6.16-rc3-mm1-101_antifrag_flags/mm/swap_state.c 2006-02-17 09:41:14.000000000 +0000
@@ -334,7 +334,7 @@ struct page *read_swap_cache_async(swp_e
* Get a new page to read into from swap.
*/
if (!new_page) {
- new_page = alloc_page_vma(GFP_HIGHUSER, vma, addr);
+ new_page = alloc_page_vma(GFP_RCLMUSER, vma, addr);
if (!new_page)
break; /* Out of memory */
}

2006-02-17 14:18:00

by Mel Gorman

[permalink] [raw]
Subject: [PATCH 3/7] x86 - Specify amount of kernel memory at boot time


This patch was originally written by Kamezawa Hiroyuki.

It should be possible for the administrator to specify at boot-time how much
memory should be used for the kernel and how much should go to ZONE_EASYRCLM.
After this patch is applied, the boot option kernelcore= can be used to
specify how much memory should be used by the kernel.

(Note that Kamezawa called this parameter coremem=. It was renamed because
of the way ppc64 parses command-line arguments: coremem= would be confused
with mem=. A name was chosen that could be used across architectures.)

The value of kernelcore is important. If it is too small, there will be
more pressure on ZONE_NORMAL and a potential loss of performance. If it is
about 896MB, ZONE_HIGHMEM will have a size of zero. Any differences in
tests will depend on whether CONFIG_HIGHPTE is set in the standard kernel
or not. With lots of memory, the ideal is to specify a kernelcore that
gives ZONE_NORMAL its full size and leaves a ZONE_HIGHMEM for PTEs. The
right value depends, like any tunable, on the workload.

It is also important to note that if kernelcore is less than the maximum
size of ZONE_NORMAL, GFP_HIGHMEM allocations will use ZONE_NORMAL, not the
reachable portion of ZONE_EASYRCLM.

Signed-off-by: Mel Gorman <[email protected]>
diff -rup -X /usr/src/patchset-0.5/bin//dontdiff linux-2.6.16-rc3-mm1-102_addzone/arch/i386/kernel/setup.c linux-2.6.16-rc3-mm1-103_x86coremem/arch/i386/kernel/setup.c
--- linux-2.6.16-rc3-mm1-102_addzone/arch/i386/kernel/setup.c 2006-02-16 09:50:41.000000000 +0000
+++ linux-2.6.16-rc3-mm1-103_x86coremem/arch/i386/kernel/setup.c 2006-02-17 09:42:45.000000000 +0000
@@ -112,6 +112,9 @@ int bootloader_type;
/* user-defined highmem size */
static unsigned int highmem_pages = -1;

+/* user-defined easy-reclaim-size */
+static unsigned int core_mem_pages = -1;
+static unsigned int easyrclm_pages = 0;
/*
* Setup options
*/
@@ -912,6 +915,15 @@ static void __init parse_cmdline_early (
*/
else if (!memcmp(from, "vmalloc=", 8))
__VMALLOC_RESERVE = memparse(from+8, &from);
+ /*
+ * kernelcore=size sets the amount of memory for use for
+ * kernel allocations that cannot be reclaimed easily.
+ * The remaining memory is set aside for easy reclaim
+ * for features like memory remove or huge page allocations
+ */
+ else if (!memcmp(from, "kernelcore=",11)) {
+ core_mem_pages = memparse(from+11, &from) >> PAGE_SHIFT;
+ }

next_char:
c = *(from++);
@@ -981,6 +993,17 @@ void __init find_max_pfn(void)
}
}

+unsigned long __init calculate_core_memory(unsigned long max_low_pfn)
+{
+ if (max_low_pfn < core_mem_pages) {
+ highmem_pages -= (core_mem_pages - max_low_pfn);
+ } else {
+ max_low_pfn = core_mem_pages;
+ highmem_pages = 0;
+ }
+ easyrclm_pages = max_pfn - core_mem_pages;
+ return max_low_pfn;
+}
/*
* Determine low and high memory ranges:
*/
@@ -1037,6 +1060,8 @@ unsigned long __init find_max_low_pfn(vo
printk(KERN_ERR "ignoring highmem size on non-highmem kernel!\n");
#endif
}
+ if (core_mem_pages != -1)
+ max_low_pfn = calculate_core_memory(max_low_pfn);
return max_low_pfn;
}

@@ -1157,7 +1182,8 @@ void __init zone_sizes_init(void)
zones_size[ZONE_DMA] = max_dma;
zones_size[ZONE_NORMAL] = low - max_dma;
#ifdef CONFIG_HIGHMEM
- zones_size[ZONE_HIGHMEM] = highend_pfn - low;
+ zones_size[ZONE_HIGHMEM] = highend_pfn - low - easyrclm_pages;
+ zones_size[ZONE_EASYRCLM] = easyrclm_pages;
#endif
}
free_area_init(zones_size);

2006-02-17 14:18:24

by Mel Gorman

[permalink] [raw]
Subject: [PATCH 4/7] ppc64 - Specify amount of kernel memory at boot time


This patch adds the kernelcore= parameter for ppc64.

The requested amount of memory is not reserved across all nodes. The first
node found that can accommodate the requested amount of memory and still
has memory remaining for ZONE_EASYRCLM is used. A node with memory holes
will also not be used.

Signed-off-by: Mel Gorman <[email protected]>
diff -rup -X /usr/src/patchset-0.5/bin//dontdiff linux-2.6.16-rc3-mm1-103_x86coremem/arch/powerpc/mm/numa.c linux-2.6.16-rc3-mm1-104_ppc64coremem/arch/powerpc/mm/numa.c
--- linux-2.6.16-rc3-mm1-103_x86coremem/arch/powerpc/mm/numa.c 2006-02-16 09:50:42.000000000 +0000
+++ linux-2.6.16-rc3-mm1-104_ppc64coremem/arch/powerpc/mm/numa.c 2006-02-17 09:43:30.000000000 +0000
@@ -21,6 +21,7 @@
#include <asm/lmb.h>
#include <asm/system.h>
#include <asm/smp.h>
+#include <asm/machdep.h>

static int numa_enabled = 1;

@@ -722,20 +723,54 @@ void __init paging_init(void)
unsigned long zones_size[MAX_NR_ZONES];
unsigned long zholes_size[MAX_NR_ZONES];
int nid;
+ unsigned long core_mem_size = 0;
+ unsigned long core_mem_pfn = 0;
+ char *opt;

memset(zones_size, 0, sizeof(zones_size));
memset(zholes_size, 0, sizeof(zholes_size));

+ /* Check if ZONE_EASYRCLM should be populated */
+ opt = strstr(cmd_line, "kernelcore=");
+ if (opt) {
+ opt += 11;
+ core_mem_size = memparse(opt, &opt);
+ core_mem_pfn = core_mem_size >> PAGE_SHIFT;
+ }
+
for_each_online_node(nid) {
unsigned long start_pfn, end_pfn, pages_present;

get_region(nid, &start_pfn, &end_pfn, &pages_present);

- zones_size[ZONE_DMA] = end_pfn - start_pfn;
- zholes_size[ZONE_DMA] = zones_size[ZONE_DMA] - pages_present;
+ /*
+ * Set up a zone for EASYRCLM as long as this node is large
+ * enough to accomodate the requested size and that there
+ * are no memory holes
+ */
+ if (core_mem_pfn == 0 ||
+ end_pfn - start_pfn < core_mem_pfn ||
+ end_pfn - start_pfn != pages_present) {
+ zones_size[ZONE_DMA] = end_pfn - start_pfn;
+ zones_size[ZONE_EASYRCLM] = 0;
+ zholes_size[ZONE_DMA] =
+ zones_size[ZONE_DMA] - pages_present;
+ zholes_size[ZONE_EASYRCLM] = 0;
+ if (core_mem_pfn >= pages_present)
+ core_mem_pfn -= pages_present;
+ } else {
+ zones_size[ZONE_DMA] = core_mem_pfn;
+ zones_size[ZONE_EASYRCLM] = end_pfn - core_mem_pfn;
+ zholes_size[ZONE_DMA] = 0;
+ zholes_size[ZONE_EASYRCLM] = 0;
+ core_mem_pfn = 0;
+ }

- dbg("free_area_init node %d %lx %lx (hole: %lx)\n", nid,
+ dbg("free_area_init DMA node %d %lx %lx (hole: %lx)\n", nid,
zones_size[ZONE_DMA], start_pfn, zholes_size[ZONE_DMA]);
+ dbg("free_area_init EASYRCLM node %d %lx %lx (hole: %lx)\n",
+ nid, zones_size[ZONE_EASYRCLM], zones_size[ZONE_DMA],
+ zholes_size[ZONE_EASYRCLM]);

free_area_init_node(nid, NODE_DATA(nid), zones_size, start_pfn,
zholes_size);
diff -rup -X /usr/src/patchset-0.5/bin//dontdiff linux-2.6.16-rc3-mm1-103_x86coremem/mm/page_alloc.c linux-2.6.16-rc3-mm1-104_ppc64coremem/mm/page_alloc.c
--- linux-2.6.16-rc3-mm1-103_x86coremem/mm/page_alloc.c 2006-02-17 09:42:01.000000000 +0000
+++ linux-2.6.16-rc3-mm1-104_ppc64coremem/mm/page_alloc.c 2006-02-17 09:43:30.000000000 +0000
@@ -1592,7 +1592,11 @@ static int __init build_zonelists_node(p
zone = pgdat->node_zones + zone_type;
if (populated_zone(zone)) {
#ifndef CONFIG_HIGHMEM
- BUG_ON(zone_type > ZONE_NORMAL);
+ /*
+ * On architectures with only ZONE_DMA, it is still
+ * valid to have a ZONE_EASYRCLM
+ */
+ BUG_ON(zone_type == ZONE_HIGHMEM);
#endif
zonelist->zones[nr_zones++] = zone;
check_highest_zone(zone_type);

2006-02-17 14:18:40

by Mel Gorman

[permalink] [raw]
Subject: [PATCH 5/7] At boot, determine what zone memory will hot-add to


Once ZONE_EASYRCLM is added, hot-added memory on the x86 goes to
ZONE_EASYRCLM by default while ppc64 uses ZONE_DMA. This patch changes the
behavior slightly on x86 and ppc64.

o By default, ZONE_DMA is used on ppc64 and ZONE_HIGHMEM is used on x86
o If kernelcore is specified at boot time, x86 and ppc64 hotadd to ZONE_EASYRCLM
o If kernelcore and noeasyrclm are both used, ppc64 will use ZONE_DMA and
x86 will use ZONE_HIGHMEM

This is a list of scenarios and what happens with different options on an
x86 with 1.5GiB of physical RAM. ./activate is a script that tries to online
all inactive physical memory.

Boot with no special parameters
- ./activate does nothing
- All high memory in HIGHMEM

Boot with mem=512MB
- Machine boots with 512MB active RAM
- ./activate adds memory to ZONE_HIGHMEM
- No memory in ZONE_EASYRCLM

Boot with kernelcore=512MB
- Machine boots with 1.5GiB RAM
- ./activate does nothing
- No memory in HIGHMEM
- Some of what would be NORMAL and all of HIGHMEM is in EASYRCLM

Boot with kernelcore=512MB mem=512MB
- Machine boots with 512MB RAM
- ./activate adds memory to ZONE_EASYRCLM
- No memory in HIGHMEM

Boot with kernelcore=512MB mem=512MB noeasyrclm
- Machine boots with 512MB RAM
- ./activate adds memory to ZONE_HIGHMEM
- No memory in ZONE_EASYRCLM
- With noeasyrclm, this is identical to booting with just mem=512MB

Boot with kernelcore=1024MB mem=1024MB
- Machine boots with 1024MB RAM
- Some memory already in ZONE_HIGHMEM
- ./activate adds memory to ZONE_EASYRCLM

Boot with kernelcore=1024MB mem=1024MB noeasyrclm
- Machine boots with 1024MB RAM
- Some memory already in ZONE_EASYRCLM
- ./activate adds memory to ZONE_HIGHMEM

Signed-off-by: Mel Gorman <[email protected]>
diff -rup -X /usr/src/patchset-0.5/bin//dontdiff linux-2.6.16-rc3-mm1-104_ppc64coremem/arch/i386/kernel/setup.c linux-2.6.16-rc3-mm1-106_zonechoose/arch/i386/kernel/setup.c
--- linux-2.6.16-rc3-mm1-104_ppc64coremem/arch/i386/kernel/setup.c 2006-02-17 09:42:45.000000000 +0000
+++ linux-2.6.16-rc3-mm1-106_zonechoose/arch/i386/kernel/setup.c 2006-02-17 09:44:16.000000000 +0000
@@ -115,6 +115,8 @@ static unsigned int highmem_pages = -1;
/* user-defined easy-reclaim-size */
static unsigned int core_mem_pages = -1;
static unsigned int easyrclm_pages = 0;
+static int hotadd_zone_offset = -1;
+
/*
* Setup options
*/
@@ -923,6 +925,18 @@ static void __init parse_cmdline_early (
*/
else if (!memcmp(from, "kernelcore=",11)) {
core_mem_pages = memparse(from+11, &from) >> PAGE_SHIFT;
+
+ if (hotadd_zone_offset == -1)
+ hotadd_zone_offset = ZONE_EASYRCLM;
+ }
+
+ /*
+ * Once kernelcore= is specified, the default zone to add to
+ * is ZONE_EASYRCLM. This parameter allows an administrator
+ * to override that
+ */
+ else if (!memcmp(from, "noeasyrclm", 10)) {
+ hotadd_zone_offset = ZONE_HIGHMEM;
}

next_char:
@@ -1550,6 +1564,13 @@ void __init setup_arch(char **cmdline_p)
tsc_init();
}

+struct zone *get_zone_for_hotadd(struct pglist_data *pgdata) {
+ if (unlikely(hotadd_zone_offset == -1))
+ hotadd_zone_offset = ZONE_HIGHMEM;
+
+ return &pgdata->node_zones[hotadd_zone_offset];
+}
+
#include "setup_arch_post.h"
/*
* Local Variables:
diff -rup -X /usr/src/patchset-0.5/bin//dontdiff linux-2.6.16-rc3-mm1-104_ppc64coremem/arch/i386/mm/init.c linux-2.6.16-rc3-mm1-106_zonechoose/arch/i386/mm/init.c
--- linux-2.6.16-rc3-mm1-104_ppc64coremem/arch/i386/mm/init.c 2006-02-16 09:50:41.000000000 +0000
+++ linux-2.6.16-rc3-mm1-106_zonechoose/arch/i386/mm/init.c 2006-02-17 09:44:16.000000000 +0000
@@ -655,7 +655,7 @@ void __init mem_init(void)
int add_memory(u64 start, u64 size)
{
struct pglist_data *pgdata = &contig_page_data;
- struct zone *zone = pgdata->node_zones + MAX_NR_ZONES-1;
+ struct zone *zone = get_zone_for_hotadd(pgdata);
unsigned long start_pfn = start >> PAGE_SHIFT;
unsigned long nr_pages = size >> PAGE_SHIFT;

diff -rup -X /usr/src/patchset-0.5/bin//dontdiff linux-2.6.16-rc3-mm1-104_ppc64coremem/arch/powerpc/mm/mem.c linux-2.6.16-rc3-mm1-106_zonechoose/arch/powerpc/mm/mem.c
--- linux-2.6.16-rc3-mm1-104_ppc64coremem/arch/powerpc/mm/mem.c 2006-02-16 09:50:42.000000000 +0000
+++ linux-2.6.16-rc3-mm1-106_zonechoose/arch/powerpc/mm/mem.c 2006-02-17 09:44:16.000000000 +0000
@@ -129,7 +129,7 @@ int __devinit add_memory(u64 start, u64
create_section_mapping(start, start + size);

/* this should work for most non-highmem platforms */
- zone = pgdata->node_zones;
+ zone = get_zone_for_hotadd(pgdata);

return __add_pages(zone, start_pfn, nr_pages);

diff -rup -X /usr/src/patchset-0.5/bin//dontdiff linux-2.6.16-rc3-mm1-104_ppc64coremem/arch/powerpc/mm/numa.c linux-2.6.16-rc3-mm1-106_zonechoose/arch/powerpc/mm/numa.c
--- linux-2.6.16-rc3-mm1-104_ppc64coremem/arch/powerpc/mm/numa.c 2006-02-17 09:43:30.000000000 +0000
+++ linux-2.6.16-rc3-mm1-106_zonechoose/arch/powerpc/mm/numa.c 2006-02-17 09:44:16.000000000 +0000
@@ -39,6 +39,7 @@ EXPORT_SYMBOL(node_data);
static bootmem_data_t __initdata plat_node_bdata[MAX_NUMNODES];
static int min_common_depth;
static int n_mem_addr_cells, n_mem_size_cells;
+static int hotadd_zone_offset = -1;

/*
* We need somewhere to store start/end/node for each region until we have
@@ -736,6 +737,13 @@ void __init paging_init(void)
opt += 11;
core_mem_size = memparse(opt, &opt);
core_mem_pfn = core_mem_size >> PAGE_SHIFT;
+ hotadd_zone_offset = ZONE_EASYRCLM;
+ }
+
+ /* Check if the administrator requests only ZONE_DMA be used */
+ opt = strstr(cmd_line, "noeasyrclm");
+ if (opt) {
+ hotadd_zone_offset = ZONE_DMA;
}

for_each_online_node(nid) {
@@ -847,4 +855,12 @@ got_numa_domain:
}
return numa_domain;
}
+
+struct zone *get_zone_for_hotadd(struct pglist_data *pgdata)
+{
+ if (unlikely(hotadd_zone_offset == -1))
+ hotadd_zone_offset = ZONE_DMA;
+
+ return &pgdata->node_zones[hotadd_zone_offset];
+}
#endif /* CONFIG_MEMORY_HOTPLUG */
diff -rup -X /usr/src/patchset-0.5/bin//dontdiff linux-2.6.16-rc3-mm1-104_ppc64coremem/include/linux/memory_hotplug.h linux-2.6.16-rc3-mm1-106_zonechoose/include/linux/memory_hotplug.h
--- linux-2.6.16-rc3-mm1-104_ppc64coremem/include/linux/memory_hotplug.h 2006-02-13 00:27:25.000000000 +0000
+++ linux-2.6.16-rc3-mm1-106_zonechoose/include/linux/memory_hotplug.h 2006-02-17 09:44:16.000000000 +0000
@@ -57,6 +57,7 @@ extern void online_page(struct page *pag
extern int add_memory(u64 start, u64 size);
extern int remove_memory(u64 start, u64 size);
extern int online_pages(unsigned long, unsigned long);
+extern struct zone *get_zone_for_hotadd(struct pglist_data *);

/* reasonably generic interface to expand the physical pages in a zone */
extern int __add_pages(struct zone *zone, unsigned long start_pfn,

2006-02-17 14:19:31

by Mel Gorman

[permalink] [raw]
Subject: [PATCH 7/7] Add documentation for extra boot parameters


Once all patches are applied, two new command-line parameters exist -
kernelcore and noeasyrclm. This patch adds the necessary documentation.
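
As an illustration (device paths and sizes below are made up, not recommendations), the two parameters would simply be appended to the kernel command line:

```text
# Illustrative kernel command lines:
root=/dev/sda1 ro kernelcore=512M
root=/dev/sda1 ro kernelcore=512M noeasyrclm
```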

Signed-off-by: Mel Gorman <[email protected]>
diff -rup -X /usr/src/patchset-0.5/bin//dontdiff linux-2.6.16-rc3-mm1-107_hugetlb_use_easyrclm/Documentation/kernel-parameters.txt linux-2.6.16-rc3-mm1-108_docs/Documentation/kernel-parameters.txt
--- linux-2.6.16-rc3-mm1-107_hugetlb_use_easyrclm/Documentation/kernel-parameters.txt 2006-02-16 09:50:42.000000000 +0000
+++ linux-2.6.16-rc3-mm1-108_docs/Documentation/kernel-parameters.txt 2006-02-17 09:45:49.000000000 +0000
@@ -704,6 +704,16 @@ running once the system is up.
js= [HW,JOY] Analog joystick
See Documentation/input/joystick.txt.

+ kernelcore=nn[KMG] [KNL,IA-32,PPC] On the x86 and ppc64, this
+ parameter specifies the amount of memory usable
+ by the kernel and places the rest in an EasyRclm
+ zone. The EasyRclm zone is used for the allocation
+ of pages on behalf of a process and for HugeTLB
+ pages. On ppc64, it is likely that memory sections
+ in this zone can be offlined. Note that allocations
+ like PTEs-from-HighMem still use the HighMem zone
+ if it exists, and the Normal zone if it does not.
+
keepinitrd [HW,ARM]

kstack=N [IA-32,X86-64] Print N words from the kernel stack
@@ -1006,6 +1016,16 @@ running once the system is up.

nodisconnect [HW,SCSI,M68K] Disables SCSI disconnects.

+ noeasyrclm [IA-32,PPC] If kernelcore= is specified, the default
+ zone to add memory to for IA-32 and PPC is EasyRclm. If
+ this is undesirable, noeasyrclm can be specified to
+ force the adding of memory on IA-32 to ZONE_HIGHMEM
+ and to ZONE_DMA on PPC. This is desirable when the
+ EasyRclm zone is set up as a "soft" area for HugeTLB
+ pages to be allocated from to give the chance for
+ administrators to grow the reserved number of Huge
+ pages when the system has been running for some time.
+
noexec [IA-64]

noexec [IA-32,X86-64]

2006-02-17 14:19:09

by Mel Gorman

[permalink] [raw]
Subject: [PATCH 6/7] Allow HugeTLB allocations to use ZONE_EASYRCLM


On ppc64 at least, a HugeTLB page is the same size as a memory section. Hence,
it causes no fragmentation worth caring about because a section can still
be offlined.

Once HugeTLB is allowed to use ZONE_EASYRCLM, the size of the zone becomes a
"soft" area where HugeTLB allocations may be satisfied. For example, take
a situation where a system administrator is not willing to reserve HugeTLB
pages at boot time. In this case, he can use kernelcore to size the EasyRclm
zone, which is still usable by normal processes. If a job starts that needs
HugeTLB pages, one could dd a file the size of physical memory, delete it
and have a good chance of getting a number of HugeTLB pages. To get all of
EasyRclm as HugeTLB pages, the ability to drain per-cpu pages is required.
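
The dd trick described above can be sketched as a small script. This is a
hedged sketch only: fill_mb is set to a tiny value here so it is safe to
run, whereas on a real system it would be sized to physical memory, and the
final hugepage-growing step is shown as a comment because it needs root and
a kernel carrying these patches.

```shell
# Fill and delete a file to turn anonymous/page-cache pages in EasyRclm
# into clean, easily reclaimed pages (tiny size here for safety).
fill_mb=1
dd if=/dev/zero of=/tmp/easyrclm-fill bs=1M count=$fill_mb 2>/dev/null
rm -f /tmp/easyrclm-fill
# On a real system, then attempt to grow the HugeTLB pool:
#   echo 32 > /proc/sys/vm/nr_hugepages
echo "done"
```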

Signed-off-by: Mel Gorman <[email protected]>
diff -rup -X /usr/src/patchset-0.5/bin//dontdiff linux-2.6.16-rc3-mm1-106_zonechoose/mm/hugetlb.c linux-2.6.16-rc3-mm1-107_hugetlb_use_easyrclm/mm/hugetlb.c
--- linux-2.6.16-rc3-mm1-106_zonechoose/mm/hugetlb.c 2006-02-16 09:50:44.000000000 +0000
+++ linux-2.6.16-rc3-mm1-107_hugetlb_use_easyrclm/mm/hugetlb.c 2006-02-17 09:44:58.000000000 +0000
@@ -49,7 +49,7 @@ static struct page *dequeue_huge_page(st

for (z = zonelist->zones; *z; z++) {
nid = (*z)->zone_pgdat->node_id;
- if (cpuset_zone_allowed(*z, GFP_HIGHUSER) &&
+ if (cpuset_zone_allowed(*z, GFP_RCLMUSER) &&
!list_empty(&hugepage_freelists[nid]))
break;
}
@@ -68,7 +68,7 @@ static int alloc_fresh_huge_page(void)
{
static int nid = 0;
struct page *page;
- page = alloc_pages_node(nid, GFP_HIGHUSER|__GFP_COMP|__GFP_NOWARN,
+ page = alloc_pages_node(nid, GFP_RCLMUSER|__GFP_COMP|__GFP_NOWARN,
HUGETLB_PAGE_ORDER);
nid = (nid + 1) % num_online_nodes();
if (page) {
diff -rup -X /usr/src/patchset-0.5/bin//dontdiff linux-2.6.16-rc3-mm1-106_zonechoose/mm/mempolicy.c linux-2.6.16-rc3-mm1-107_hugetlb_use_easyrclm/mm/mempolicy.c
--- linux-2.6.16-rc3-mm1-106_zonechoose/mm/mempolicy.c 2006-02-16 09:50:44.000000000 +0000
+++ linux-2.6.16-rc3-mm1-107_hugetlb_use_easyrclm/mm/mempolicy.c 2006-02-17 09:44:58.000000000 +0000
@@ -1201,7 +1201,7 @@ struct zonelist *huge_zonelist(struct vm
unsigned nid;

nid = interleave_nid(pol, vma, addr, HPAGE_SHIFT);
- return NODE_DATA(nid)->node_zonelists + gfp_zone(GFP_HIGHUSER);
+ return NODE_DATA(nid)->node_zonelists + gfp_zone(GFP_RCLMUSER);
}
return zonelist_policy(GFP_HIGHUSER, pol);
}

2006-02-17 17:17:42

by Dave Hansen

[permalink] [raw]
Subject: Re: [PATCH 4/7] ppc64 - Specify amount of kernel memory at boot time

On Fri, 2006-02-17 at 14:17 +0000, Mel Gorman wrote:
> This patch adds the kernelcore= parameter for ppc64.
>
> The amount of memory requested will not be reserved in all nodes. The
> first node that is found that can accommodate the requested amount of memory
> and still have memory remaining for ZONE_EASYRCLM is used. If a node has memory holes,
> it also will not be used.

One thing I think we really need to see before these go into mainline is
the ability to shrink the ZONE_EASYRCLM at runtime, and give the memory
back to NORMAL/DMA.

Otherwise, any system starting off sufficiently small will end up having
lowmem starvation issues. Allowing resizing at least gives the admin a
chance to avoid those issues.

> + if (core_mem_pfn == 0 ||
> + end_pfn - start_pfn < core_mem_pfn ||
> + end_pfn - start_pfn != pages_present) {
> + zones_size[ZONE_DMA] = end_pfn - start_pfn;
> + zones_size[ZONE_EASYRCLM] = 0;
> + zholes_size[ZONE_DMA] =
> + zones_size[ZONE_DMA] - pages_present;
> + zholes_size[ZONE_EASYRCLM] = 0;
> + if (core_mem_pfn >= pages_present)
> + core_mem_pfn -= pages_present;
> + } else {
> + zones_size[ZONE_DMA] = core_mem_pfn;
> + zones_size[ZONE_EASYRCLM] = end_pfn - core_mem_pfn;
> + zholes_size[ZONE_DMA] = 0;
> + zholes_size[ZONE_EASYRCLM] = 0;
> + core_mem_pfn = 0;
> + }

I'm finding this bit of code really hard to parse.

First of all, please give "core_mem_size" and "core_mem_pfn" some better
names. "core_mem_size" in _what_? Bytes? Pages? g0ats? ;)

The "pfn" in "core_mem_pfn" is usually used to denote a physical address
>> PAGE_SHIFT. However, yours is actually a _number_ of pages, not an
address, right? Actually, as I look at it closer, it appears to be a
pfn in the else{} and a nr_page in the if{} block.

core_mem_nr_pages or nr_core_mem_pages might be more appropriate.

Users will _not_ care about memory holes. They'll just want to specify
a number of pages. I think this:

> + zones_size[ZONE_DMA] = core_mem_pfn;
> + zones_size[ZONE_EASYRCLM] = end_pfn - core_mem_pfn;

is probably bogus because it doesn't deal with holes at all.

Walking those init_node_data() structures in get_region() is probably
pretty darn fast, and we don't need to be careful about how many times
we do it. I think I'd probably separate out the problem a bit.

1. make get_region() not care about holes. Have it just return the
range of the node's pages.
2. make a new function (get_region_holes()??) that, given a pfn range,
walks the init_node_data[] just like get_region() (have them share
code) and return the present_pages in that pfn range.
3. go back to paging init, and try to properly size ZONE_DMA. Find
holes with your new function, and increase its size proportionately,
set zholes_size[ZONE_DMA] at this time. Make sure the user size is
in nr_page, _NOT_ max_pfns.
4. give the rest of the space to ZONE_EASYRCLM. Call your new function
to properly size its zone hole(s).
5. Profit!

This may all belong broken out in a new function.
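
Steps 1 and 2 of the plan above can be sketched in a few lines of Python.
This is a model only: a flat list of (nid, start_pfn, end_pfn) tuples
stands in for init_node_data[], whose real layout differs, and the
function names are taken from the proposal.

```python
# Model of init_node_data[]: (nid, start_pfn, end_pfn) segments; holes are
# implied by gaps between segments belonging to the same node.
SEGMENTS = [(0, 0, 100), (0, 150, 300), (1, 300, 400)]

def get_region(nid):
    """Step 1: return the node's full pfn span, ignoring holes."""
    spans = [(s, e) for n, s, e in SEGMENTS if n == nid]
    return min(s for s, e in spans), max(e for s, e in spans)

def get_region_holes(nid, start_pfn, end_pfn):
    """Step 2: walk the same segment list and count the pages actually
    present inside [start_pfn, end_pfn)."""
    present = 0
    for n, s, e in SEGMENTS:
        if n != nid:
            continue
        lo, hi = max(s, start_pfn), min(e, end_pfn)
        if lo < hi:
            present += hi - lo
    return present

start, end = get_region(0)                 # (0, 300): span includes the hole
present = get_region_holes(0, start, end)  # 250 present pages
hole = (end - start) - present             # 50-page hole at pfns 100..150
```

zholes_size then falls out naturally as (end - start) - present for
whatever pfn range each zone is given.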

-- Dave

2006-02-17 19:04:18

by Mel Gorman

[permalink] [raw]
Subject: Re: [PATCH 4/7] ppc64 - Specify amount of kernel memory at boot time

On Fri, 17 Feb 2006, Dave Hansen wrote:

> On Fri, 2006-02-17 at 14:17 +0000, Mel Gorman wrote:
> > This patch adds the kernelcore= parameter for ppc64.
> >
> > The amount of memory requested will not be reserved in all nodes. The
> > first node that is found that can accommodate the requested amount of memory
> > and still have memory remaining for ZONE_EASYRCLM is used. If a node has memory holes,
> > it also will not be used.
>
> One thing I think we really need to see before these go into mainline is
> the ability to shrink the ZONE_EASYRCLM at runtime, and give the memory
> back to NORMAL/DMA.
>

I consider this to be highly desirable, but I am not convinced it is a
prerequisite for zone-based-anti-frag because it is a problem that will
only affect admins specifying kernelcore= - i.e. a limited number of
people that care about getting more HugeTLB pages at runtime or removing
memory.

> Otherwise, any system starting off sufficiently small will end up having
> lowmem starvation issues. Allowing resizing at least gives the admin a
> chance to avoid those issues.
>

The lowmem starvation issue is an existing problem for 32 bit x86 systems.
The pain is that an admin of a ppc64 machine that previously did not care
about lowmem starvation will have to start caring when kernelcore= is
used. Again, this is a new problem for ppc64 and only exists if you care
about hotplug-remove or having a soft area for HugeTLB pages.

If I think that zone-based anti-frag has a chance of getting into
mainline, I can start tackling the lowmem starvation issue as two separate
problems

1. There needs to be an ability to measure pressure on lowmem at runtime to
help an admin decide if memory needs to be moved around or not. This would
have a secondary benefit for existing 32 bit x86 machines that need to
know that it is lowmem starvation, not lack of memory, that is affecting
their loads and that an upgrade to a 64 bit machine may be a good plan.

2. The ability to hot-add to a specified zone. When pressure is detected,
an admin would have the option to hot-remove from the EasyRclm area and
add the same memory back to the DMA/Normal zone. This will only work at a
pretty coarse granularity but it would be enough.

I think that problem 1 already exists and is only tackled on the ppc64
with the introduction of EasyRclm. Problem 2 is new, but only exists when
the admin wants to remove memory. They need to be solved, but not
necessarily before EasyRclm is used.
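
As a rough userspace illustration of point 1 — an assumption-laden sketch,
not an existing interface: the field names mirror the /proc/zoneinfo layout
of this era, and the free/high ratio is just one possible pressure metric:

```python
import re

# Abbreviated /proc/zoneinfo-style sample, trimmed to the fields used here.
SAMPLE = """\
Node 0, zone      DMA
  pages free     1234
        min      32
        low      40
        high     48
Node 0, zone   Normal
  pages free     300
        min      255
        low      318
        high     382
"""

def lowmem_pressure(zoneinfo_text):
    """Return {zone_name: free/high ratio}.  A ratio near or below 1.0
    means the zone is close to its reclaim watermarks, i.e. under
    pressure; well above 1.0 means it has plenty of free pages."""
    pressure = {}
    for block in re.split(r"Node \d+, zone\s+", zoneinfo_text)[1:]:
        zone = block.splitlines()[0].strip()
        fields = dict(re.findall(r"(free|min|low|high)\s+(\d+)", block))
        pressure[zone] = int(fields["free"]) / float(fields["high"])
    return pressure
```

On a live system the text would come from reading /proc/zoneinfo; a ratio
trending toward 1.0 in DMA/Normal while EasyRclm stays high is the kind of
signal that would prompt the memory-moving step in point 2.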

> > + if (core_mem_pfn == 0 ||
> > + end_pfn - start_pfn < core_mem_pfn ||
> > + end_pfn - start_pfn != pages_present) {
> > + zones_size[ZONE_DMA] = end_pfn - start_pfn;
> > + zones_size[ZONE_EASYRCLM] = 0;
> > + zholes_size[ZONE_DMA] =
> > + zones_size[ZONE_DMA] - pages_present;
> > + zholes_size[ZONE_EASYRCLM] = 0;
> > + if (core_mem_pfn >= pages_present)
> > + core_mem_pfn -= pages_present;
> > + } else {
> > + zones_size[ZONE_DMA] = core_mem_pfn;
> > + zones_size[ZONE_EASYRCLM] = end_pfn - core_mem_pfn;
> > + zholes_size[ZONE_DMA] = 0;
> > + zholes_size[ZONE_EASYRCLM] = 0;
> > + core_mem_pfn = 0;
> > + }
>
> I'm finding this bit of code really hard to parse.
>
> First of all, please give "core_mem_size" and "core_mem_pfn" some better
> names. "core_mem_size" in _what_? Bytes? Pages? g0ats? ;)
>

bytes, but I take your point. This could be clearer.

> The "pfn" in "core_mem_pfn" is usually used to denote a physical address
> >> PAGE_SHIFT. However, yours is actually a _number_ of pages, not an
> address, right? Actually, as I look at it closer, it appears to be a
> pfn in the else{} and a nr_page in the if{} block.
>

It works out as a number of pages but it really should be the "pfn after
start_pfn at which kernel memory ends and easyrclm begins". I'll rework
the code.

> core_mem_nr_pages or nr_core_mem_pages might be more appropriate.
>

It would.

> Users will _not_ care about memory holes. They'll just want to specify
> a number of pages. I think this:
>
> > + zones_size[ZONE_DMA] = core_mem_pfn;
> > + zones_size[ZONE_EASYRCLM] = end_pfn - core_mem_pfn;
>
> is probably bogus because it doesn't deal with holes at all.
>

In this patch, if a region has holes in it, kernelcore is ignored because
holes would not be dealt with correctly. The check is made above with
end_pfn - start_pfn != pages_present

> Walking those init_node_data() structures in get_region() is probably
> pretty darn fast, and we don't need to be careful about how many times
> we do it. I think I'd probably separate out the problem a bit.
>
> 1. make get_region() not care about holes. Have it just return the
> range of the node's pages.
> 2. make a new function (get_region_holes()??) that, given a pfn range,
> walks the init_node_data[] just like get_region() (have them share
> code) and return the present_pages in that pfn range.
> 3. go back to paging init, and try to properly size ZONE_DMA. Find
> holes with your new function, and increase its size proportionately,
> set zholes_size[ZONE_DMA] at this time. Make sure the user size is
> in nr_page, _NOT_ max_pfns.
> 4. give the rest of the space to ZONE_EASYRCLM. Call your new function
> to properly size its zone hole(s).
> 5. Profit!
>
> This may all belong broken out in a new function.
>

Sounds reasonable enough although, as you say, it'll need a new function
at the very least. I'll start hitting it and see what I come up with.

--
Mel Gorman
Part-time Phd Student Linux Technology Center
University of Limerick IBM Dublin Software Lab

2006-02-17 19:18:29

by Dave Hansen

[permalink] [raw]
Subject: Re: [PATCH 4/7] ppc64 - Specify amount of kernel memory at boot time

On Fri, 2006-02-17 at 19:03 +0000, Mel Gorman wrote:
> On Fri, 17 Feb 2006, Dave Hansen wrote:
> > One thing I think we really need to see before these go into mainline is
> > the ability to shrink the ZONE_EASYRCLM at runtime, and give the memory
> > back to NORMAL/DMA.
>
> I consider this to be highly desirable, but I am not convinced it is a
> prerequisite for zone-based-anti-frag because it is a problem that will
> only affect admins specifying kernelcore= - i.e. a limited number of
> people that care about getting more HugeTLB pages at runtime or removing
> memory.

Fair enough. I guess it certainly shrinks the set. But, I _really_
think we need it before it gets into the hands of too many customers. I
hate to tell people to reboot boxes, and I hate adding boot options.

If people start using kernelcore=, we're stuck with it for a long time.
Why not just make the system flexible from the beginning? I hate
introducing hacky boot options only to have them removed 2 kernel
revisions later when we fix it properly.

Maybe the kernelcore= patch is a good candidate to stay in -mm during
the transition time, but not ever follow along into mainline. Dunno.

> If I think that zone-based anti-frag has a chance of getting into
> mainline, I can start tackling the lowmem starvation issue as two separate
> problems
>
> 1. There needs to be an ability to measure pressure on lowmem at runtime to
> help an admin decide if memory needs to be moved around or not. This would
> have a secondary benefit for existing 32 bit x86 machines that need to
> know that it is lowmem starvation, not lack of memory, that is affecting
> their loads and that an upgrade to a 64 bit machine may be a good plan.
>
> 2. The ability to hot-add to a specified zone. When pressure is detected,
> an admin would have the option to hot-remove from the EasyRclm area and
> add the same memory back to the DMA/Normal zone. This will only work at a
> pretty coarse granularity but it would be enough.

But, if you get the EasyRclm to DMA transition working properly, this
problem goes away. So, if you solve a problem in the kernel, the user
gets fewer ways to screw up, _and_ you have less code to deal with that
part of the user interface.

> > Users will _not_ care about memory holes. They'll just want to specify
> > a number of pages. I think this:
> >
> > > + zones_size[ZONE_DMA] = core_mem_pfn;
> > > + zones_size[ZONE_EASYRCLM] = end_pfn - core_mem_pfn;
> >
> > is probably bogus because it doesn't deal with holes at all.
> >
>
> In this patch, if a region has holes in it, kernelcore is ignored because
> holes would not be dealt with correctly. The check is made above with
> end_pfn - start_pfn != pages_present

I missed that. Is that my fault for not looking closely enough, or the
patch's fault for being a bit obtuse? ;)

I think it is pretty bogus to just punt when it sees a memory hole.
They really need to be dealt with properly. Otherwise, you'll never
hear about it until a customer complains that kernelcore= isn't working
and they can't remove memory or use hugetlb pages on two "identical"
systems.

-- Dave

2006-02-17 19:37:50

by Mel Gorman

[permalink] [raw]
Subject: Re: [PATCH 4/7] ppc64 - Specify amount of kernel memory at boot time

On Fri, 17 Feb 2006, Dave Hansen wrote:

> On Fri, 2006-02-17 at 19:03 +0000, Mel Gorman wrote:
> > On Fri, 17 Feb 2006, Dave Hansen wrote:
> > > One thing I think we really need to see before these go into mainline is
> > > the ability to shrink the ZONE_EASYRCLM at runtime, and give the memory
> > > back to NORMAL/DMA.
> >
> > I consider this to be highly desirable, but I am not convinced it is a
> > prerequisite for zone-based-anti-frag because it is a problem that will
> > only affect admins specifying kernelcore= - i.e. a limited number of
> > people that care about getting more HugeTLB pages at runtime or removing
> > memory.
>
> Fair enough. I guess it certainly shrinks the set. But, I _really_
> think we need it before it gets into the hands of too many customers. I
> hate to tell people to reboot boxes, and I hate adding boot options.
>

Ok, hard to argue with that logic but we knew we were going to be dealing
with some sort of tunable when zone-based-anti-frag was first discussed.

> If people start using kernelcore=, we're stuck with it for a long time.
> Why not just make the system flexible from the beginning? I hate
> introducing hacky boot options only to have them removed 2 kernel
> revisions later when we fix it properly.
>
> Maybe the kernelcore= patch is a good candidate to stay in -mm during
> the transition time, but not ever follow along into mainline. Dunno.
>

I would like to see this in -mm getting a trial while I work on the moving
pages from EasyRclm to DMA/HighMem problem. I feel that there is a logical
progression from kernelcore= boot option to a sysctl (or something
similar) that is adjustable at runtime. When the code exists to size the
zone at runtime, it can be built on top of the majority of this patchset.
They are not mutually exclusive.

Before even trying to get into -mm though, I should fix the
dealing-with-memory-holes-on-ppc first (more below).

> > If I think that zone-based anti-frag has a chance of getting into
> > mainline, I can start tackling the lowmem starvation issue as two separate
> > problems
> >
> > 1. There needs to be an ability to measure pressure on lowmem at runtime to
> > help an admin decide if memory needs to be moved around or not. This would
> > have a secondary benefit for existing 32 bit x86 machines that need to
> > know that it is lowmem starvation, not lack of memory, that is affecting
> > their loads and that an upgrade to a 64 bit machine may be a good plan.
> >
> > 2. The ability to hot-add to a specified zone. When pressure is detected,
> > an admin would have the option to hot-remove from the EasyRclm area and
> > add the same memory back to the DMA/Normal zone. This will only work at a
> > pretty coarse granularity but it would be enough.
>
> But, if you get the EasyRclm to DMA transition working properly, this
> problem goes away. So, if you solve a problem in the kernel, the user
> gets fewer ways to screw up, _and_ you have less code to deal with that
> part of the user interface.
>

Well, they'll have different ways to screw up. Once the EasyRclm zone can
be resized, they might decide to migrate it all to DMA and then cannot
hotplug anything, putting them back at square one. I need to think about
this a bit more to see how it'll work in practice.

> > > Users will _not_ care about memory holes. They'll just want to specify
> > > a number of pages. I think this:
> > >
> > > > + zones_size[ZONE_DMA] = core_mem_pfn;
> > > > + zones_size[ZONE_EASYRCLM] = end_pfn - core_mem_pfn;
> > >
> > > is probably bogus because it doesn't deal with holes at all.
> > >
> >
> > In this patch, if a region has holes in it, kernelcore is ignored because
> > holes would not be dealt with correctly. The check is made above with
> > end_pfn - start_pfn != pages_present
>
> I missed that. Is that my fault for not looking closely enough, or the
> patch's fault for being a bit obtuse? ;)
>

c) all of the above. This section of the patch is not winning any prizes
for clarity.

> I think it is pretty bogus to just punt when it sees a memory hole.
> They really need to be dealt with properly.

Your previous mail about having a function similar to get_region() that
returns the number of present pages in a pfn range feels like the right
approach. I'll work on this before touching zone resizing and post another
version.

> Otherwise, you'll never
> hear about it until a customer complains that kernelcore= isn't working
> and they can't remove memory or use hugetlb pages on two "identical"
> systems.
>

Bah, you're right, I'm just not too keen on admitting it on a Friday
evening :).

--
Mel Gorman
Part-time Phd Student Linux Technology Center
University of Limerick IBM Dublin Software Lab

2006-02-17 21:32:41

by Joel Schopp

[permalink] [raw]
Subject: Re: [PATCH 4/7] ppc64 - Specify amount of kernel memory at boot time

>> This patch adds the kernelcore= parameter for ppc64.
>>
>> The amount of memory will requested will not be reserved in all nodes. The
>> first node that is found that can accomodate the requested amount of memory
>> and have remaining more for ZONE_EASYRCLM is used. If a node has memory holes,
>> it also will not be used.
>
> One thing I think we really need to see before these go into mainline is
> the ability to shrink the ZONE_EASYRCLM at runtime, and give the memory
> back to NORMAL/DMA.
>
> Otherwise, any system starting off sufficiently small will end up having
> lowmem starvation issues. Allowing resizing at least gives the admin a
> chance to avoid those issues.
>

I'm not too keen on calling it resizing, because that term is
misleading. The resizing is one way. You can't later resize back. It's
like a window that you can only close but never reopen. We should call
it "runtime incremental disabling", or RID.

I don't think we need RID in order to merge these patches. RID can be
merged later if people decide they want a special easy reclaim zone that
could disappear at any moment. I personally fall in the camp of wanting
my zones I explicitly enabled to stay put and am opposed to RID.

If only somebody had presented a solution that was flexible enough to
dynamically resize reclaimable and non-reclaimable both ways.
http://sourceforge.net/mailarchive/message.php?msg_id=13864331




2006-02-21 14:52:08

by Mel Gorman

[permalink] [raw]
Subject: Re: [PATCH 4/7] ppc64 - Specify amount of kernel memory at boot time

On Fri, 17 Feb 2006, Dave Hansen wrote:

> On Fri, 2006-02-17 at 14:17 +0000, Mel Gorman wrote:
>> This patch adds the kernelcore= parameter for ppc64.
>>
>> The amount of memory requested will not be reserved in all nodes. The
>> first node that is found that can accommodate the requested amount of memory
>> and still have memory remaining for ZONE_EASYRCLM is used. If a node has memory holes,
>> it also will not be used.
>
> One thing I think we really need to see before these go into mainline is
> the ability to shrink the ZONE_EASYRCLM at runtime, and give the memory
> back to NORMAL/DMA.
>
> Otherwise, any system starting off sufficiently small will end up having
> lowmem starvation issues. Allowing resizing at least gives the admin a
> chance to avoid those issues.
>
>> + if (core_mem_pfn == 0 ||
>> + end_pfn - start_pfn < core_mem_pfn ||
>> + end_pfn - start_pfn != pages_present) {
>> + zones_size[ZONE_DMA] = end_pfn - start_pfn;
>> + zones_size[ZONE_EASYRCLM] = 0;
>> + zholes_size[ZONE_DMA] =
>> + zones_size[ZONE_DMA] - pages_present;
>> + zholes_size[ZONE_EASYRCLM] = 0;
>> + if (core_mem_pfn >= pages_present)
>> + core_mem_pfn -= pages_present;
>> + } else {
>> + zones_size[ZONE_DMA] = core_mem_pfn;
>> + zones_size[ZONE_EASYRCLM] = end_pfn - core_mem_pfn;
>> + zholes_size[ZONE_DMA] = 0;
>> + zholes_size[ZONE_EASYRCLM] = 0;
>> + core_mem_pfn = 0;
>> + }
>
> I'm finding this bit of code really hard to parse.
>
> First of all, please give "core_mem_size" and "core_mem_pfn" some better
> names. "core_mem_size" in _what_? Bytes? Pages? g0ats? ;)
>
> The "pfn" in "core_mem_pfn" is usually used to denote a physical address
>>> PAGE_SHIFT. However, yours is actually a _number_ of pages, not an
> address, right? Actually, as I look at it closer, it appears to be a
> pfn in the else{} and a nr_page in the if{} block.
>
> core_mem_nr_pages or nr_core_mem_pages might be more appropriate.
>
> Users will _not_ care about memory holes. They'll just want to specify
> a number of pages. I think this:
>
>> + zones_size[ZONE_DMA] = core_mem_pfn;
>> + zones_size[ZONE_EASYRCLM] = end_pfn - core_mem_pfn;
>
> is probably bogus because it doesn't deal with holes at all.
>
> Walking those init_node_data() structures in get_region() is probably
> pretty darn fast, and we don't need to be careful about how many times
> we do it. I think I'd probably separate out the problem a bit.
>
> 1. make get_region() not care about holes. Have it just return the
> range of the node's pages.
> 2. make a new function (get_region_holes()??) that, given a pfn range,
> walks the init_node_data[] just like get_region() (have them share
> code) and return the present_pages in that pfn range.
> 3. go back to paging init, and try to properly size ZONE_DMA. Find
> holes with your new function, and increase its size proportionately,
> set zholes_size[ZONE_DMA] at this time. Make sure the user size is
> in nr_page, _NOT_ max_pfns.
> 4. give the rest of the space to ZONE_EASYRCLM. Call your new function
> to properly size its zone hole(s).
> 5. Profit!
>
> This may all belong broken out in a new function.
>

A new release of patches is a long time away but here is an early draft of
what the above currently looks like. Is this more or less what you were
thinking?

diff -rup -X /usr/src/patchset-0.5/bin//dontdiff linux-2.6.16-rc3-mm1-103_x86coremem/arch/powerpc/mm/numa.c linux-2.6.16-rc3-mm1-104_ppc64coremem/arch/powerpc/mm/numa.c
--- linux-2.6.16-rc3-mm1-103_x86coremem/arch/powerpc/mm/numa.c 2006-02-16 09:50:42.000000000 +0000
+++ linux-2.6.16-rc3-mm1-104_ppc64coremem/arch/powerpc/mm/numa.c 2006-02-21 11:45:38.000000000 +0000
@@ -17,10 +17,12 @@
#include <linux/nodemask.h>
#include <linux/cpu.h>
#include <linux/notifier.h>
+#include <linux/sort.h>
#include <asm/sparsemem.h>
#include <asm/lmb.h>
#include <asm/system.h>
#include <asm/smp.h>
+#include <asm/machdep.h>

static int numa_enabled = 1;

@@ -101,22 +103,41 @@ void __init add_region(unsigned int nid,
init_node_data[i].nid = nid;
}

+/* Compare two elements in init_node_data. Assume start_pfn is at start */
+static int cmp_init_node_data(const void *a, const void *b)
+{
+ return *(unsigned long *)a - *(unsigned long *)b;
+}
+
+/*
+ * init_node_data is not necessarily in pfn order, making it difficult to
+ * determine where the EasyRclm should begin if it is requested. This sorts
+ * init_node_data by start_pfn
+ */
+void __init sort_regions(void)
+{
+ size_t num = 0;
+ size_t size_element = sizeof(init_node_data) / MAX_REGIONS;
+
+ while (init_node_data[num].end_pfn)
+ num++;
+
+ sort(init_node_data, num, size_element, cmp_init_node_data, NULL);
+}
+
/* We assume init_node_data has no overlapping regions */
void __init get_region(unsigned int nid, unsigned long *start_pfn,
- unsigned long *end_pfn, unsigned long *pages_present)
+ unsigned long *end_pfn)
{
unsigned int i;

*start_pfn = -1UL;
- *end_pfn = *pages_present = 0;
+ *end_pfn = 0;

for (i = 0; init_node_data[i].end_pfn; i++) {
if (init_node_data[i].nid != nid)
continue;

- *pages_present += init_node_data[i].end_pfn -
- init_node_data[i].start_pfn;

if (init_node_data[i].start_pfn < *start_pfn)
*start_pfn = init_node_data[i].start_pfn;

@@ -129,6 +150,108 @@ void __init get_region(unsigned int nid,
*start_pfn = 0;
}

+/*
+ * Given a pfn range and the required number of pages for the kernel, return
+ * the zone of the DMA and EASYRCLM zones and any holes that are contained
+ */
+void __init get_zones_info(unsigned int nid,
+ unsigned long start_pfn,
+ unsigned long end_region_pfn,
+ unsigned long kernelcore_pages,
+ unsigned long *zones_size,
+ unsigned long *zholes_size)
+{
+ unsigned long end_kernelcore_pfn = 0;
+ unsigned long pages_present_kernel = 0;
+ unsigned long pages_present = 0;
+ unsigned int i;
+
+ dbg("get_zones_info: nid: %u kernelcore_pages: %lu range %lu -> %lu\n",
+ nid,
+ kernelcore_pages,
+ start_pfn,
+ end_region_pfn);
+
+ /* Handle the case where this region is totally empty */
+ if (end_region_pfn == start_pfn) {
+ dbg("get_zones_info: region is empty\n");
+ zones_size[ZONE_DMA] = 0;
+ zones_size[ZONE_EASYRCLM] = 0;
+ zholes_size[ZONE_DMA] = 0;
+ zholes_size[ZONE_EASYRCLM] = 0;
+ return;
+ }
+
+ /*
+ * If kernelcore_pages were not required, make sure that this whole
+ * region will be used for ZONE_DMA
+ */
+ dbg("get_zones_info: kernelcore_pages = %lu\n", kernelcore_pages);
+ if (kernelcore_pages == 0) {
+ dbg("get_zones_info: kernel will use whole region\n");
+ kernelcore_pages = end_region_pfn - start_pfn;
+ end_kernelcore_pfn = end_region_pfn;
+ }
+
+ /*
+ * Walk through node data to find a pfn where kernelcore_pages will
+ * be in the range start_pfn -> kernelcore_pfn .
+ */
+ for (i = 0; init_node_data[i].end_pfn; i++) {
+ unsigned long segment_start_pfn, segment_end_pfn;
+ unsigned long local_pages_present;
+
+ /*
+ * If this segment does not contain pages in the required
+ * pfn range, skip it
+ */
+ if (init_node_data[i].nid != nid)
+ continue;
+ if (start_pfn >= init_node_data[i].end_pfn)
+ continue;
+ if (end_region_pfn <= init_node_data[i].start_pfn)
+ continue;
+
+ /* Find what portion of the segment is of interest */
+ segment_start_pfn = init_node_data[i].start_pfn;
+ segment_end_pfn = init_node_data[i].end_pfn;
+ if (segment_start_pfn < start_pfn)
+ segment_start_pfn = start_pfn;
+ if (segment_end_pfn > end_region_pfn)
+ segment_end_pfn = end_region_pfn;
+
+ /*
+ * Update pages_present and see if we have reached the end
+ * of the kernelcore portion of the region
+ */
+ local_pages_present = segment_end_pfn - segment_start_pfn;
+ if (local_pages_present > kernelcore_pages) {
+ end_kernelcore_pfn =
+ segment_start_pfn + kernelcore_pages;
+ pages_present_kernel =
+ pages_present + kernelcore_pages;
+ kernelcore_pages = 0;
+ }
+
+ pages_present += local_pages_present;
+ if (kernelcore_pages) {
+ kernelcore_pages -= pages_present;
+ pages_present_kernel += pages_present;
+ }
+ }
+
+ /* Set the zones value */
+ dbg("get_zones_info: nid %d ZONE_DMA: %lu -> %lu\n", nid,
+ start_pfn, end_kernelcore_pfn);
+ dbg("get_zones_info: nid %d ZONE_EASYRCLM: %lu -> %lu\n", nid,
+ end_kernelcore_pfn, end_region_pfn);
+ zones_size[ZONE_DMA] = end_kernelcore_pfn - start_pfn;
+ zholes_size[ZONE_DMA] = zones_size[ZONE_DMA] - pages_present_kernel;
+ zones_size[ZONE_EASYRCLM] = end_region_pfn - end_kernelcore_pfn;
+ zholes_size[ZONE_EASYRCLM] = zones_size[ZONE_EASYRCLM] -
+ (pages_present - pages_present_kernel);
+}
+
static inline void map_cpu_to_node(int cpu, int node)
{
numa_cpu_lookup_table[cpu] = node;
@@ -622,11 +745,11 @@ void __init do_init_bootmem(void)
register_cpu_notifier(&ppc64_numa_nb);

for_each_online_node(nid) {
- unsigned long start_pfn, end_pfn, pages_present;
+ unsigned long start_pfn, end_pfn;
unsigned long bootmem_paddr;
unsigned long bootmap_pages;

- get_region(nid, &start_pfn, &end_pfn, &pages_present);
+ get_region(nid, &start_pfn, &end_pfn);

/* Allocate the node structure node local if possible */
NODE_DATA(nid) = careful_allocation(nid,
@@ -721,21 +844,39 @@ void __init paging_init(void)
{
unsigned long zones_size[MAX_NR_ZONES];
unsigned long zholes_size[MAX_NR_ZONES];
+ unsigned long kernelcore_pages = 0;
int nid;
+ char *opt;

memset(zones_size, 0, sizeof(zones_size));
memset(zholes_size, 0, sizeof(zholes_size));

- for_each_online_node(nid) {
- unsigned long start_pfn, end_pfn, pages_present;
+ /* Check if ZONE_EASYRCLM should be populated */
+ opt = strstr(cmd_line, "kernelcore=");
+ if (opt) {
+ opt += 11;
+ kernelcore_pages =
+ memparse(opt, &opt) >> PAGE_SHIFT;
+ }

- get_region(nid, &start_pfn, &end_pfn, &pages_present);
+ sort_regions();
+ for_each_online_node(nid) {
+ unsigned long start_pfn, end_region_pfn;

- zones_size[ZONE_DMA] = end_pfn - start_pfn;
- zholes_size[ZONE_DMA] = zones_size[ZONE_DMA] - pages_present;
+ get_region(nid, &start_pfn, &end_region_pfn);

- dbg("free_area_init node %d %lx %lx (hole: %lx)\n", nid,
- zones_size[ZONE_DMA], start_pfn, zholes_size[ZONE_DMA]);
+ get_zones_info(nid, start_pfn, end_region_pfn,
+ kernelcore_pages,
+ &zones_size[0],
+ &zholes_size[0]);
+
+ dbg("free_area_init DMA node %d %lx %lx (hole: %lx)\n",
+ nid, zones_size[ZONE_DMA],
+ start_pfn, zholes_size[ZONE_DMA]);
+
+ dbg("free_area_init EasyRclm node %d %lx %lx (hole: %lx)\n",
+ nid, zones_size[ZONE_EASYRCLM],
+ start_pfn, zholes_size[ZONE_EASYRCLM]);

free_area_init_node(nid, NODE_DATA(nid), zones_size, start_pfn,
zholes_size);

--
Mel Gorman
Part-time Phd Student Linux Technology Center
University of Limerick IBM Dublin Software Lab

2006-02-21 17:36:05

by Dave Hansen

[permalink] [raw]
Subject: Re: [PATCH 4/7] ppc64 - Specify amount of kernel memory at boot time

On Tue, 2006-02-21 at 14:51 +0000, Mel Gorman wrote:
> A new release of patches is a long time away but here is an early draft of
> what the above currently looks like. Is this more or less what you were
> thinking?

I think it may be a bit harder to understand than even the other
one. :(

In a nutshell, get_zones_info() tries to do too much. Six function
arguments should be a big, red, warning light that something is really
wrong. Calling a function _info() is another bad sign. It means that
you can't discretely describe what it does.

Remember, there are 3 distinct tasks here:

1. size the node information from init_node_data[]
2. size the easy reclaim zone based on the boot parameters
3. take holes into account when doing the reclaim zone sizing

You don't have to do all of those tasks in one pass. This is not a
performance-critical path, so try to be as clear as possible, even if it
means an extra run or two through the data.

Maybe something like this?

ZONE_DMA = all memory in node
if (kernelcore was set)
while (zone size with holes of ZONE_DMA > kernelcore)
move memory into EASYRCLM
zholes[DMA] = ...
zholes[EASYRCLM] = ...
free_area_init_node()

Those are all operations I can understand.

-- Dave

2006-02-22 16:44:17

by Mel Gorman

[permalink] [raw]
Subject: Re: [PATCH 4/7] ppc64 - Specify amount of kernel memory at boot time

On Tue, 21 Feb 2006, Dave Hansen wrote:

> On Tue, 2006-02-21 at 14:51 +0000, Mel Gorman wrote:
>> A new release of patches is a long time away but here is an early draft of
>> what the above currently looks like. Is this more or less what you were
>> thinking?
>
> I think it may be a bit harder to understand than even the other
> one. :(
>

:/

> In a nutshell, get_zones_info() tries to do too much. Six function
> arguments should be a big, red, warning light that something is really
> wrong. Calling a function _info() is another bad sign. It means that
> you can't discretely describe what it does.
>
> Remember, there are 3 distinct tasks here:
>
> 1. size the node information from init_node_data[]
> 2. size the easy reclaim zone based on the boot parameters
> 3. take holes into account when doing the reclaim zone sizing
>

right

*some pounding on the keyboard*

Is this a bit clearer? It's built and boot tested on one ppc64 machine. I
am having trouble finding a ppc64 machine that *has* memory holes to be
100% sure it's ok.

diff -rup -X /usr/src/patchset-0.5/bin//dontdiff linux-2.6.16-rc3-mm1-103_x86coremem/arch/powerpc/mm/numa.c linux-2.6.16-rc3-mm1-104_ppc64coremem/arch/powerpc/mm/numa.c
--- linux-2.6.16-rc3-mm1-103_x86coremem/arch/powerpc/mm/numa.c 2006-02-16 09:50:42.000000000 +0000
+++ linux-2.6.16-rc3-mm1-104_ppc64coremem/arch/powerpc/mm/numa.c 2006-02-22 16:07:35.000000000 +0000
@@ -17,10 +17,12 @@
#include <linux/nodemask.h>
#include <linux/cpu.h>
#include <linux/notifier.h>
+#include <linux/sort.h>
#include <asm/sparsemem.h>
#include <asm/lmb.h>
#include <asm/system.h>
#include <asm/smp.h>
+#include <asm/machdep.h>

static int numa_enabled = 1;

@@ -101,22 +103,41 @@ void __init add_region(unsigned int nid,
init_node_data[i].nid = nid;
}

+/* Compare two elements in init_node_data. Assume start_pfn is at start */
+static int cmp_init_node_data(const void *a, const void *b)
+{
+ return (*(unsigned long *)a > *(unsigned long *)b) - (*(unsigned long *)a < *(unsigned long *)b);
+}
+
+/*
+ * init_node_data is not necessarily in pfn order, making it difficult to
+ * determine where the EasyRclm should begin if it is requested. This sorts
+ * init_node_data by start_pfn
+ */
+void __init sort_regions(void)
+{
+ size_t num = 0;
+ size_t size_element = sizeof(init_node_data) / MAX_REGIONS;
+
+ while (init_node_data[num].end_pfn)
+ num++;
+
+ sort(init_node_data, num, size_element, cmp_init_node_data, NULL);
+}
+
/* We assume init_node_data has no overlapping regions */
void __init get_region(unsigned int nid, unsigned long *start_pfn,
- unsigned long *end_pfn, unsigned long *pages_present)
+ unsigned long *end_pfn)
{
unsigned int i;

*start_pfn = -1UL;
- *end_pfn = *pages_present = 0;
+ *end_pfn = 0;

for (i = 0; init_node_data[i].end_pfn; i++) {
if (init_node_data[i].nid != nid)
continue;

- *pages_present += init_node_data[i].end_pfn -
- init_node_data[i].start_pfn;
-
if (init_node_data[i].start_pfn < *start_pfn)
*start_pfn = init_node_data[i].start_pfn;

@@ -129,6 +150,88 @@ void __init get_region(unsigned int nid,
*start_pfn = 0;
}

+/* Initialise the size of each zone in a node */
+void __init zone_sizes_init(unsigned int nid,
+ unsigned long kernelcore_pages,
+ unsigned long *zones_size)
+{
+ unsigned int i;
+ unsigned long pages_present = 0;
+
+ /* Get the number of present pages in the node */
+ for (i = 0; init_node_data[i].end_pfn; i++) {
+ if (init_node_data[i].nid != nid)
+ continue;
+
+ pages_present += init_node_data[i].end_pfn -
+ init_node_data[i].start_pfn;
+ }
+
+ if (kernelcore_pages && kernelcore_pages < pages_present) {
+ zones_size[ZONE_DMA] = kernelcore_pages;
+ zones_size[ZONE_EASYRCLM] = pages_present - kernelcore_pages;
+ } else {
+ zones_size[ZONE_DMA] = pages_present;
+ zones_size[ZONE_EASYRCLM] = 0;
+ }
+}
+
+void __init get_zholes_size(unsigned int nid, unsigned long *zones_size,
+ unsigned long *zholes_size) {
+ unsigned int i = 0;
+ unsigned int start_easyrclm_pfn;
+ unsigned long last_end_pfn, first;
+
+ /* Find where the PFN of the end of DMA is */
+ unsigned long pages_count = zones_size[ZONE_DMA];
+ for (i = 0; init_node_data[i].end_pfn; i++) {
+ unsigned long segment_size;
+ if (init_node_data[i].nid != nid)
+ continue;
+
+ /*
+ * Check if the end of ZONE_DMA is in this segment of the
+ * init_node_data
+ */
+ segment_size = init_node_data[i].end_pfn -
+ init_node_data[i].start_pfn;
+ if (pages_count > segment_size) {
+ /* End of ZONE_DMA is not here, move on */
+ pages_count -= segment_size;
+ continue;
+ }
+
+ /* End of ZONE_DMA is here */
+ start_easyrclm_pfn = init_node_data[i].start_pfn + pages_count;
+ break;
+ }
+
+ /* Walk the map again and get the size of the holes */
+ first = 1;
+ zholes_size[ZONE_DMA] = 0;
+ zholes_size[ZONE_EASYRCLM] = 0;
+ for (i = 1; init_node_data[i].end_pfn; i++) {
+ unsigned long hole_size;
+ if (init_node_data[i].nid != nid)
+ continue;
+
+ if (first) {
+ last_end_pfn = init_node_data[i].end_pfn;
+ first = 0;
+ continue;
+ }
+
+ /* Hole found */
+ hole_size = init_node_data[i].start_pfn - last_end_pfn;
+ if (init_node_data[i].start_pfn < start_easyrclm_pfn) {
+ zholes_size[ZONE_DMA] += hole_size;
+ } else {
+ zholes_size[ZONE_EASYRCLM] += hole_size;
+ }
+ last_end_pfn = init_node_data[i].end_pfn;
+ }
+}
+
static inline void map_cpu_to_node(int cpu, int node)
{
numa_cpu_lookup_table[cpu] = node;
@@ -622,11 +725,11 @@ void __init do_init_bootmem(void)
register_cpu_notifier(&ppc64_numa_nb);

for_each_online_node(nid) {
- unsigned long start_pfn, end_pfn, pages_present;
+ unsigned long start_pfn, end_pfn;
unsigned long bootmem_paddr;
unsigned long bootmap_pages;

- get_region(nid, &start_pfn, &end_pfn, &pages_present);
+ get_region(nid, &start_pfn, &end_pfn);

/* Allocate the node structure node local if possible */
NODE_DATA(nid) = careful_allocation(nid,
@@ -721,21 +824,36 @@ void __init paging_init(void)
{
unsigned long zones_size[MAX_NR_ZONES];
unsigned long zholes_size[MAX_NR_ZONES];
+ unsigned long kernelcore_pages = 0;
int nid;
+ char *opt;

memset(zones_size, 0, sizeof(zones_size));
memset(zholes_size, 0, sizeof(zholes_size));

- for_each_online_node(nid) {
- unsigned long start_pfn, end_pfn, pages_present;
-
- get_region(nid, &start_pfn, &end_pfn, &pages_present);
+ /* Check if ZONE_EASYRCLM should be populated */
+ opt = strstr(cmd_line, "kernelcore=");
+ if (opt) {
+ opt += 11;
+ kernelcore_pages =
+ memparse(opt, &opt) >> PAGE_SHIFT;
+ }

- zones_size[ZONE_DMA] = end_pfn - start_pfn;
- zholes_size[ZONE_DMA] = zones_size[ZONE_DMA] - pages_present;
+ sort_regions();
+ for_each_online_node(nid) {
+ unsigned long start_pfn, end_region_pfn;

- dbg("free_area_init node %d %lx %lx (hole: %lx)\n", nid,
- zones_size[ZONE_DMA], start_pfn, zholes_size[ZONE_DMA]);
+ get_region(nid, &start_pfn, &end_region_pfn);
+ zone_sizes_init(nid, kernelcore_pages, &zones_size[0]);
+ get_zholes_size(nid, &zones_size[0], &zholes_size[0]);
+
+ dbg("free_area_init DMA node %d %lx %lx (hole: %lx)\n",
+ nid, zones_size[ZONE_DMA],
+ start_pfn, zholes_size[ZONE_DMA]);
+
+ dbg("free_area_init EasyRclm node %d %lx %lx (hole: %lx)\n",
+ nid, zones_size[ZONE_EASYRCLM],
+ start_pfn, zholes_size[ZONE_EASYRCLM]);

free_area_init_node(nid, NODE_DATA(nid), zones_size, start_pfn,
zholes_size);

2006-02-23 16:43:05

by Dave Hansen

[permalink] [raw]
Subject: Re: [PATCH 4/7] ppc64 - Specify amount of kernel memory at boot time

On Wed, 2006-02-22 at 16:43 +0000, Mel Gorman wrote:
> Is this a bit clearer? It's built and boot tested on one ppc64 machine. I
> am having trouble finding a ppc64 machine that *has* memory holes to be
> 100% sure it's ok.

Yeah, it looks that way. If you need a machine, see Mike Kravetz. I
think he was working on a way to automate creating memory holes.

> diff -rup -X /usr/src/patchset-0.5/bin//dontdiff linux-2.6.16-rc3-mm1-103_x86coremem/arch/powerpc/mm/numa.c linux-2.6.16-rc3-mm1-104_ppc64coremem/arch/powerpc/mm/numa.c
> --- linux-2.6.16-rc3-mm1-103_x86coremem/arch/powerpc/mm/numa.c 2006-02-16 09:50:42.000000000 +0000
> +++ linux-2.6.16-rc3-mm1-104_ppc64coremem/arch/powerpc/mm/numa.c 2006-02-22 16:07:35.000000000 +0000
> @@ -17,10 +17,12 @@
> #include <linux/nodemask.h>
> #include <linux/cpu.h>
> #include <linux/notifier.h>
> +#include <linux/sort.h>
> #include <asm/sparsemem.h>
> #include <asm/lmb.h>
> #include <asm/system.h>
> #include <asm/smp.h>
> +#include <asm/machdep.h>

Is the email spacing getting screwed up here?

> +/* Initialise the size of each zone in a node */
> +void __init zone_sizes_init(unsigned int nid,
> + unsigned long kernelcore_pages,
> + unsigned long *zones_size)

Minor nit territory: set_zone_sizes(), maybe?

> +{
> + unsigned int i;
> + unsigned long pages_present = 0;

pages_present_in_node?

> + /* Get the number of present pages in the node */
> + for (i = 0; init_node_data[i].end_pfn; i++) {
> + if (init_node_data[i].nid != nid)
> + continue;
> +
> + pages_present += init_node_data[i].end_pfn -
> + init_node_data[i].start_pfn;
> + }
> +
> + if (kernelcore_pages && kernelcore_pages < pages_present) {
> + zones_size[ZONE_DMA] = kernelcore_pages;
> + zones_size[ZONE_EASYRCLM] = pages_present - kernelcore_pages;
> + } else {
> + zones_size[ZONE_DMA] = pages_present;
> + zones_size[ZONE_EASYRCLM] = 0;
> + }
> +}

I think there are a couple of buglets here. I think the
kernelcore_pages is going to be applied per-zone, right?

Also, how do we want to distribute kernelcore memory over each node?
The way it is coded up for now, it will all be sliced out of the first
node. I'm not sure that's a good thing.

My inclination would be to completely separate out the ZONE_EASYRCLM
into separate code. It makes it easier to set whatever policy you want
in one place. Just a suggestion.

> +void __init get_zholes_size(unsigned int nid, unsigned long *zones_size,
> + unsigned long *zholes_size) {

nid_zholes_size()? I'm not too sure about this one. Just promise me
you'll think about it a bit more. ;)

> + unsigned int i = 0;
> + unsigned int start_easyrclm_pfn;
> + unsigned long last_end_pfn, first;
> +
> + /* Find where the PFN of the end of DMA is */
> + unsigned long pages_count = zones_size[ZONE_DMA];

<tangent> This (virtually) proves that zones_size[] needs to get a
different name. Perhaps we need to make it more like the zone structure
itself and go to spanned and present pages? </tangent>

> + for (i = 0; init_node_data[i].end_pfn; i++) {
> + unsigned long segment_size;
> + if (init_node_data[i].nid != nid)
> + continue;
> +
> + /*
> + * Check if the end of ZONE_DMA is in this segment of the
> + * init_node_data
> + */
> + segment_size = init_node_data[i].end_pfn -
> + init_node_data[i].start_pfn;

"segment" is probably a bad term to use here, especially on ppc.

One other thing, I want to _know_ that variables being compared are in
the same units. When one is called "pages_" something and the other is
something "_size", I don't _know_.

> +
> + /* Walk the map again and get the size of the holes */
> + first = 1;
> + zholes_size[ZONE_DMA] = 0;
> + zholes_size[ZONE_EASYRCLM] = 0;
> + for (i = 1; init_node_data[i].end_pfn; i++) {
> + unsigned long hole_size;
> + if (init_node_data[i].nid != nid)
> + continue;
> +
> + if (first) {
> + last_end_pfn = init_node_data[i].end_pfn;
> + first = 0;
> + continue;
> + }
> +
> + /* Hole found */
> + hole_size = init_node_data[i].start_pfn - last_end_pfn;
> + if (init_node_data[i].start_pfn < start_easyrclm_pfn) {
> + zholes_size[ZONE_DMA] += hole_size;
> + } else {
> + zholes_size[ZONE_EASYRCLM] += hole_size;
> + }
> + last_end_pfn = init_node_data[i].end_pfn;
> + }
> +}

I'd probably put this loop in another function. It is pretty
self-contained, no?

-- Dave

2006-02-23 17:19:40

by Mel Gorman

[permalink] [raw]
Subject: Re: [PATCH 4/7] ppc64 - Specify amount of kernel memory at boot time

On Thu, 23 Feb 2006, Dave Hansen wrote:

> On Wed, 2006-02-22 at 16:43 +0000, Mel Gorman wrote:
>> Is this a bit clearer? It's built and boot tested on one ppc64 machine. I
>> am having trouble finding a ppc64 machine that *has* memory holes to be
>> 100% sure it's ok.
>
> Yeah, it looks that way. If you need a machine, see Mike Kravetz. I
> think he was working on a way to automate creating memory holes.
>

Will do. If there is an automatic way of creating holes, I'll write it
into the current "compare two running kernels" testing script.

>> diff -rup -X /usr/src/patchset-0.5/bin//dontdiff linux-2.6.16-rc3-mm1-103_x86coremem/arch/powerpc/mm/numa.c linux-2.6.16-rc3-mm1-104_ppc64coremem/arch/powerpc/mm/numa.c
>> --- linux-2.6.16-rc3-mm1-103_x86coremem/arch/powerpc/mm/numa.c 2006-02-16 09:50:42.000000000 +0000
>> +++ linux-2.6.16-rc3-mm1-104_ppc64coremem/arch/powerpc/mm/numa.c 2006-02-22 16:07:35.000000000 +0000
>> @@ -17,10 +17,12 @@
>> #include <linux/nodemask.h>
>> #include <linux/cpu.h>
>> #include <linux/notifier.h>
>> +#include <linux/sort.h>
>> #include <asm/sparsemem.h>
>> #include <asm/lmb.h>
>> #include <asm/system.h>
>> #include <asm/smp.h>
>> +#include <asm/machdep.h>
>
> Is the email spacing getting screwed up here?
>

Yes, mail client issue. The "real" patch is fine.

>> +/* Initialise the size of each zone in a node */
>> +void __init zone_sizes_init(unsigned int nid,
>> + unsigned long kernelcore_pages,
>> + unsigned long *zones_size)
>
> Minor nit territory: set_zone_sizes(), maybe?
>

In this case, the choice of name is to match an x86 function that does
something very similar. If one had read through the x86 code and then saw
this function, it would set their expectations of what the code is
intended to do.

>> +{
>> + unsigned int i;
>> + unsigned long pages_present = 0;
>
> pages_present_in_node?
>

That would cause >80-character violations without a lot of breaking up of
lines. I think it would end up looking worse.

>> + /* Get the number of present pages in the node */
>> + for (i = 0; init_node_data[i].end_pfn; i++) {
>> + if (init_node_data[i].nid != nid)
>> + continue;
>> +
>> + pages_present += init_node_data[i].end_pfn -
>> + init_node_data[i].start_pfn;
>> + }
>> +
>> + if (kernelcore_pages && kernelcore_pages < pages_present) {
>> + zones_size[ZONE_DMA] = kernelcore_pages;
>> + zones_size[ZONE_EASYRCLM] = pages_present - kernelcore_pages;
>> + } else {
>> + zones_size[ZONE_DMA] = pages_present;
>> + zones_size[ZONE_EASYRCLM] = 0;
>> + }
>> +}
>
> I think there are a couple of buglets here. I think the
> kernelcore_pages is going to be applied per-zone, right?
>

per-node. A node gets no ZONE_EASYRCLM pages if it is not large enough to
contain kernelcore_pages. That means that on a system with 2 nodes,
kernelcore=512MB will result in 1024MB of ZONE_DMA in total.

> Also, how do we want to distribute kernelcore memory over each node?
> The way it is coded up for now, it will all be sliced out of the first
> node. I'm not sure that's a good thing.
>

It gets set in every node.

> My inclination would be to completely separate out the ZONE_EASYRCLM
> into separate code. It makes it easier to set whatever policy you want
> in one place. Just a suggestion.
>

It's a possibility. My feeling is that it would be easier to understand
overall if all the zone-sizing code was in one place.

>> +void __init get_zholes_size(unsigned int nid, unsigned long *zones_size,
>> + unsigned long *zholes_size) {
>
> nid_zholes_size()? I'm not too sure about this one. Just promise me
> you'll think about it a bit more. ;)
>

The choice of names is again to match the name of the equivalent code from
the x86. I'm not saying it's a great name :)

>> + unsigned int i = 0;
>> + unsigned int start_easyrclm_pfn;
>> + unsigned long last_end_pfn, first;
>> +
>> + /* Find where the PFN of the end of DMA is */
>> + unsigned long pages_count = zones_size[ZONE_DMA];
>
> <tangent> This (virtually) proves that zones_size[] needs to get a
> different name. Perhaps we need to make it more like the zone structure
> itself and go to spanned and present pages? </tangent>
>

zones_size[] is what free_area_init() expects to receive so there is not a
lot of room to fiddle with its meaning without causing more trouble.

>> + for (i = 0; init_node_data[i].end_pfn; i++) {
>> + unsigned long segment_size;
>> + if (init_node_data[i].nid != nid)
>> + continue;
>> +
>> + /*
>> + * Check if the end of ZONE_DMA is in this segment of the
>> + * init_node_data
>> + */
>> + segment_size = init_node_data[i].end_pfn -
>> + init_node_data[i].start_pfn;
>
> "segment" is probably a bad term to use here, especially on ppc.
>

Good point, I forgot that segment has a very different meaning on ppc.

> One other thing, I want to _know_ that variables being compared are in
> the same units. When one is called "pages_" something and the other is
> something "_size", I don't _know_.
>

chunk_num_pages ?

>> +
>> + /* Walk the map again and get the size of the holes */
>> + first = 1;
>> + zholes_size[ZONE_DMA] = 0;
>> + zholes_size[ZONE_EASYRCLM] = 0;
>> + for (i = 1; init_node_data[i].end_pfn; i++) {
>> + unsigned long hole_size;
>> + if (init_node_data[i].nid != nid)
>> + continue;
>> +
>> + if (first) {
>> + last_end_pfn = init_node_data[i].end_pfn;
>> + first = 0;
>> + continue;
>> + }
>> +
>> + /* Hole found */
>> + hole_size = init_node_data[i].start_pfn - last_end_pfn;
>> + if (init_node_data[i].start_pfn < start_easyrclm_pfn) {
>> + zholes_size[ZONE_DMA] += hole_size;
>> + } else {
>> + zholes_size[ZONE_EASYRCLM] += hole_size;
>> + }
>> + last_end_pfn = init_node_data[i].end_pfn;
>> + }
>> +}
>
> I'd probably put this loop in another function. It is pretty
> self-contained, no?
>

yep. get_zholes_size() could be split into two functions
find_start_easyrclm_pfn() and get_nid_zholes_size(). Would that be pretty
clear-cut?

--
Mel Gorman
Part-time Phd Student Linux Technology Center
University of Limerick IBM Dublin Software Lab

2006-02-23 17:38:39

by Dave Hansen

[permalink] [raw]
Subject: Re: [PATCH 4/7] ppc64 - Specify amount of kernel memory at boot time

On Thu, 2006-02-23 at 17:19 +0000, Mel Gorman wrote:
> On Thu, 23 Feb 2006, Dave Hansen wrote:
> >> +/* Initialise the size of each zone in a node */
> >> +void __init zone_sizes_init(unsigned int nid,
> >> + unsigned long kernelcore_pages,
> >> + unsigned long *zones_size)
> >
> > Minor nit territory: set_zone_sizes(), maybe?
> >
>
> In this case, the choice of name is to match an x86 function that does
> something very similar. If one had read through the x86 code and then saw
> this function, it would set their expectations of what the code is
> intended to do.

x86 is bad. Try to do better. :)

> per-node. A node gets no ZONE_EASYRCLM pages if it is not large enough to
> contain kernelcore_pages. That means that on a system with 2 nodes,
> kernelcore=512MB will result in 1024MB of ZONE_DMA in total.
>
> > Also, how do we want to distribute kernelcore memory over each node?
> > The way it is coded up for now, it will all be sliced out of the first
> > node. I'm not sure that's a good thing.
>
> It gets set in every node.

The hypervisor has a memory allocator which it uses to piece out memory
to various LPARs. When things like partition memory resizing or reboots
occur, that memory gets allocated and freed and so forth.

These machines are _also_ NUMA. Memory can effectively get lumped so
that you can't always get memory on a single NUMA node. Because of all
of the other actions of other partitions, these conditions are always
changing. The end result is that the amount of memory which each NUMA
node has *in* *a* *single* *partition* can theoretically change between
reboots.

OK, back to the hapless system admin using kernelcore. They have a
4-node system with 2GB of RAM in each node for 8GB total. They use
kernelcore=1GB. They end up with 4x1GB ZONE_DMA and 4x1GB
ZONE_EASYRCLM. Perfect. You can safely remove 4GB of RAM.

Now, imagine that the machine has been heavily used for a while, there
is only 1 node's memory available, but CPUs are available in the same
places as before. So, you start up your partition again and have 8GB of
memory in one node. Same kernelcore=1GB option. You get 1x7GB ZONE_DMA
and 1x1GB ZONE_EASYRCLM. I'd argue this is going to be a bit of a
surprise to the poor admin.

> zones_size[] is what free_area_init() expects to receive so there is not a
> lot of room to fiddle with its meaning without causing more trouble.

It is just passed in there as an argument. If you can think of a way to
make it more understandable, change it in your architecture, and send
the patch for the main one.

> > One other thing, I want to _know_ that variables being compared are in
> > the same units. When one is called "pages_" something and the other is
> > something "_size", I don't _know_.
>
> chunk_num_pages ?

No. :) The words "chunks" and "clumps" have a bit of a stigma, just
like the number "3". Ask Matt Dobson.

num_pages_WHAT? node_num_pages? silly_num_pages?

Chunk is pretty meaningless.

> yep. get_zholes_size() could be split into two functions
> find_start_easyrclm_pfn() and get_nid_zholes_size(). Would that be pretty
> clear-cut?

I think so.

-- Dave

2006-02-23 17:40:45

by Mike Kravetz

[permalink] [raw]
Subject: Re: [PATCH 4/7] ppc64 - Specify amount of kernel memory at boot time

On Thu, Feb 23, 2006 at 05:19:19PM +0000, Mel Gorman wrote:
> On Thu, 23 Feb 2006, Dave Hansen wrote:
>
> >On Wed, 2006-02-22 at 16:43 +0000, Mel Gorman wrote:
> >>Is this a bit clearer? It's built and boot tested on one ppc64 machine. I
> >>am having trouble finding a ppc64 machine that *has* memory holes to be
> >>100% sure it's ok.
> >
> >Yeah, it looks that way. If you need a machine, see Mike Kravetz. I
> >think he was working on a way to automate creating memory holes.
> >
>
> Will do. If there is an automatic way of creating holes, I'll write it
> into the current "compare two running kernels" testing script.

I don't really have an automatic way to create holes. It just turns out that
the system I was working with was good at creating them itself.

I've sliced and diced (made lots of partitioning changes) the system
recently and am still working on getting everything working right.
When I get everything working again, I'll give the patch set a try.

--
Mike

2006-02-23 18:01:13

by Mel Gorman

[permalink] [raw]
Subject: Re: [PATCH 4/7] ppc64 - Specify amount of kernel memory at boot time

On Thu, 23 Feb 2006, Dave Hansen wrote:

> On Thu, 2006-02-23 at 17:19 +0000, Mel Gorman wrote:
>> On Thu, 23 Feb 2006, Dave Hansen wrote:
>>>> +/* Initialise the size of each zone in a node */
>>>> +void __init zone_sizes_init(unsigned int nid,
>>>> + unsigned long kernelcore_pages,
>>>> + unsigned long *zones_size)
>>>
>>> Minor nit territory: set_zone_sizes(), maybe?
>>>
>>
>> In this case, the choice of name is to match an x86 function that does
>> something very similar. If one had read through the x86 code and then saw
>> this function, it would set their expectations of what the code is
>> intended to do.
>
> x86 is bad. Try to do better. :)
>

set_zone_sizes() it is.

>> per-node. A node gets no ZONE_EASYRCLM pages if it is not large enough to
>> contain kernelcore_pages. That means that on a system with 2 nodes,
>> kernelcore=512MB will result in 1024MB of ZONE_DMA in total.
>>
>>> Also, how do we want to distribute kernelcore memory over each node?
>>> The way it is coded up for now, it will all be sliced out of the first
>>> node. I'm not sure that's a good thing.
>>
>> It gets set in every node.
>
> The hypervisor has a memory allocator which it uses to piece out memory
> to various LPARs. When things like partition memory resizing or reboots
> occur, that memory gets allocated and freed and so forth.
>
> These machines are _also_ NUMA. Memory can effectively get lumped so
> that you can't always get memory on a single NUMA node. Because of all
> of the other actions of other partitions, these conditions are always
> changing. The end result is that the amount of memory which each NUMA
> node has *in* *a* *single* *partition* can theoretically change between
> reboots.
>

Wow. OK, I didn't know that. It totally rules out any future option where
an admin can say how much kernelcore they want in each node. They won't
know for sure the size of the node in advance or the number of nodes for
that matter.

> OK, back to the hapless system admin using kernelcore. They have a
> 4-node system with 2GB of RAM in each node for 8GB total. They use
> kernelcore=1GB. They end up with 4x1GB ZONE_DMA and 4x1GB
> ZONE_EASYRCLM. Perfect. You can safely remove 4GB of RAM.
>
> Now, imagine that the machine has been heavily used for a while, there
> is only 1 node's memory available, but CPUs are available in the same
> places as before. So, you start up your partition again and have 8GB of
> memory in one node. Same kernelcore=1GB option. You get 1x7GB ZONE_DMA
> and 1x1GB ZONE_EASYRCLM. I'd argue this is going to be a bit of a
> surprise to the poor admin.
>

That sort of surprise is totally unacceptable but the behaviour of
kernelcore needs to be consistent on both the x86 and the ppc (and any
other arch). How about:

1. kernelcore=X determines the total amount of memory for !ZONE_EASYRCLM
(be it ZONE_DMA, ZONE_NORMAL or ZONE_HIGHMEM)
2. For every node that can have ZONE_EASYRCLM, split the kernelcore across
the nodes as a percentage of the node size

Example: 4 nodes, 1 GiB each, kernelcore=512MB
node 0 ZONE_DMA = 128MB
node 1 ZONE_DMA = 128MB
node 2 ZONE_DMA = 128MB
node 3 ZONE_DMA = 128MB

2 nodes, 3GiB and 1GiB, kernelcore=512MB
node 0 ZONE_DMA = 384MB
node 1 ZONE_DMA = 128MB

It gets a bit more complex on NUMA for x86 because ZONE_NORMAL is
involved but the idea would essentially be the same.

>> zones_size[] is what free_area_init() expects to receive so there is not a
>> lot of room to fiddle with its meaning without causing more trouble.
>
> It is just passed in there as an argument. If you can think of a way to
> make it more understandable, change it in your architecture, and send
> the patch for the main one.
>

I'll think about it when my head finishes crunching through the sizing of
kernelcore.

>>> One other thing, I want to _know_ that variables being compared are in
>>> the same units. When one is called "pages_" something and the other is
>>> something "_size", I don't _know_.
>>
>> chunk_num_pages ?
>
> No. :) The words "chunks" and "clumps" have a bit of a stigma, just
> like the number "3". Ask Matt Dobson.
>

It sounds like a touchy subject :)

> num_pages_WHAT? node_num_pages? silly_num_pages?
>
> Chunk is pretty meaningless.
>

thingie, bit, bob, piece? :)

I'll think of something.

>> yep. get_zholes_size() could be split into two functions
>> find_start_easyrclm_pfn() and get_nid_zholes_size(). Would that be pretty
>> clear-cut?
>
> I think so.
>

--
Mel Gorman
Part-time Phd Student Linux Technology Center
University of Limerick IBM Dublin Software Lab

2006-02-23 18:16:05

by Dave Hansen

[permalink] [raw]
Subject: Re: [PATCH 4/7] ppc64 - Specify amount of kernel memory at boot time

On Thu, 2006-02-23 at 18:01 +0000, Mel Gorman wrote:
> On Thu, 23 Feb 2006, Dave Hansen wrote:
> > OK, back to the hapless system admin using kernelcore. They have a
> > 4-node system with 2GB of RAM in each node for 8GB total. They use
> > kernelcore=1GB. They end up with 4x1GB ZONE_DMA and 4x1GB
> > ZONE_EASYRCLM. Perfect. You can safely remove 4GB of RAM.
> >
> > Now, imagine that the machine has been heavily used for a while, there
> > is only 1 node's memory available, but CPUs are available in the same
> > places as before. So, you start up your partition again and have 8GB of
> > memory in one node. Same kernelcore=1GB option. You get 1x7GB ZONE_DMA
> > and 1x1GB ZONE_EASYRCLM. I'd argue this is going to be a bit of a
> > surprise to the poor admin.
> >
>
> That sort of surprise is totally unacceptable but the behaviour of
> kernelcore needs to be consistent on both the x86 and the ppc (and any
> other arch). How about:
>
> 1. kernelcore=X determines the total amount of memory for !ZONE_EASYRCLM
> (be it ZONE_DMA, ZONE_NORMAL or ZONE_HIGHMEM)

Sounds reasonable. But, if you're going to do that, should we just make
it the opposite and explicitly be easy_reclaim_mem=? Do we want the
limit to be set as "I need this much kernel memory", or "I want this
much removable memory". I dunno.

> 2. For every node that can have ZONE_EASYRCLM, split the kernelcore across
> the nodes as a percentage of the node size
>
> Example: 4 nodes, 1 GiB each, kernelcore=512MB
> node 0 ZONE_DMA = 128MB
> node 1 ZONE_DMA = 128MB
> node 2 ZONE_DMA = 128MB
> node 3 ZONE_DMA = 128MB
>
> 2 nodes, 3GiB and 1GiB, kernelcore=512MB
> node 0 ZONE_DMA = 384MB
> node 1 ZONE_DMA = 128MB
>
> It gets a bit more complex on NUMA for x86 because ZONE_NORMAL is
> involved but the idea would essentially be the same.

Yes, chopping it up seems like the right thing (or as close as we can
get) to me.

-- Dave

2006-02-24 00:14:29

by Kamezawa Hiroyuki

[permalink] [raw]
Subject: Re: [PATCH 4/7] ppc64 - Specify amount of kernel memory at boot time

Dave Hansen wrote:
>> That sort of surprise is totally unacceptable, but the behaviour of
>> kernelcore needs to be consistent on both x86 and ppc (and any other
>> arch). How about;
>>
>> 1. kernelcore=X determines the total amount of memory for !ZONE_EASYRCLM
>> (be it ZONE_DMA, ZONE_NORMAL or ZONE_HIGHMEM)
>
> Sounds reasonable. But, if you're going to do that, should we just make
> it the opposite and explicitly use easy_reclaim_mem=? Do we want the
> limit to be set as "I need this much kernel memory", or "I want this
> much removable memory"? I dunno.

The amount of EASYRCLM memory can change at run time, but kernelcore
memory cannot. So, I like the kernelcore= option. I think it is a clear
setting for an admin.

-- Kame

2006-02-24 09:04:50

by Mel Gorman

[permalink] [raw]
Subject: Re: [PATCH 4/7] ppc64 - Specify amount of kernel memory at boot time

On Thu, 23 Feb 2006, Dave Hansen wrote:

> On Thu, 2006-02-23 at 18:01 +0000, Mel Gorman wrote:
>> On Thu, 23 Feb 2006, Dave Hansen wrote:
>>> OK, back to the hapless system admin using kernelcore. They have a
>>> 4-node system with 2GB of RAM in each node for 8GB total. They use
>>> kernelcore=1GB. They end up with 4x1GB ZONE_DMA and 4x1GB
>>> ZONE_EASYRCLM. Perfect. You can safely remove 4GB of RAM.
>>>
>>> Now, imagine that the machine has been heavily used for a while, there
>>> is only 1 node's memory available, but CPUs are available in the same
>>> places as before. So, you start up your partition again and have 8GB of
>>> memory in one node. Same kernelcore=1GB option. You get 1x7GB ZONE_DMA
>>> and 1x1GB ZONE_EASYRCLM. I'd argue this is going to be a bit of a
>>> surprise to the poor admin.
>>>
>>
>> That sort of surprise is totally unacceptable, but the behaviour of
>> kernelcore needs to be consistent on both x86 and ppc (and any other
>> arch). How about;
>>
>> 1. kernelcore=X determines the total amount of memory for !ZONE_EASYRCLM
>> (be it ZONE_DMA, ZONE_NORMAL or ZONE_HIGHMEM)
>
> Sounds reasonable. But, if you're going to do that, should we just make
> it the opposite and explicitly use easy_reclaim_mem=? Do we want the
> limit to be set as "I need this much kernel memory", or "I want this
> much removable memory"? I dunno.
>

I think we should keep it as kernelcore=. If you have too little easyrclm
memory, then hot-remove and hugetlb availability are impaired. If you
have too little kernel memory, you have a really bad day.

>> 2. For every node that can have ZONE_EASYRCLM, split the kernelcore across
>> the nodes as a percentage of the node size
>>
>> Example: 4 nodes, 1 GiB each, kernelcore=512MB
>> node 0 ZONE_DMA = 128MB
>> node 1 ZONE_DMA = 128MB
>> node 2 ZONE_DMA = 128MB
>> node 3 ZONE_DMA = 128MB
>>
>> 2 nodes, 3GiB and 1GiB, kernelcore=512MB
>> node 0 ZONE_DMA = 384MB
>> node 1 ZONE_DMA = 128MB
>>
>> It gets a bit more complex on NUMA for x86 because ZONE_NORMAL is
>> involved but the idea would essentially be the same.
>
> Yes, chopping it up seems like the right thing (or as close as we can
> get) to me.
>

Ok, will rework the code to make it happen.

--
Mel Gorman
Part-time Phd Student Linux Technology Center
University of Limerick IBM Dublin Software Lab