2007-08-02 17:21:29

by mel

Subject: [PATCH] Apply memory policies to top two highest zones when highest zone is ZONE_MOVABLE

The NUMA layer only supports NUMA policies for the highest zone. When
ZONE_MOVABLE is configured with kernelcore=, the highest zone becomes
ZONE_MOVABLE. The result is that policies are only applied to allocations
like anonymous pages and page cache allocated from ZONE_MOVABLE when the
zone is used.

This patch applies policies to the two highest zones when the highest zone
is ZONE_MOVABLE. As ZONE_MOVABLE consists of pages from the highest "real"
zone, it's always functionally equivalent.

The patch has been tested on a variety of machines both NUMA and non-NUMA
covering x86, x86_64 and ppc64. No abnormal results were seen in kernbench,
tbench, dbench or hackbench. It passes regression tests from the numactl
package with and without kernelcore= once numactl tests are patched to
wait for vmstat counters to update.

Signed-off-by: Mel Gorman <[email protected]>

---
include/linux/mempolicy.h | 2 +-
include/linux/mmzone.h | 18 ++++++++++++++++++
mm/mempolicy.c | 2 +-
mm/page_alloc.c | 13 +++++++++++++
4 files changed, 33 insertions(+), 2 deletions(-)

diff --git a/include/linux/mempolicy.h b/include/linux/mempolicy.h
index e147cf5..5bdd656 100644
--- a/include/linux/mempolicy.h
+++ b/include/linux/mempolicy.h
@@ -166,7 +166,7 @@ extern enum zone_type policy_zone;

static inline void check_highest_zone(enum zone_type k)
{
- if (k > policy_zone)
+ if (k > policy_zone && k != ZONE_MOVABLE)
policy_zone = k;
}

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 3ea68cd..4e56273 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -410,6 +410,24 @@ struct zonelist {
#endif
};

+#ifdef CONFIG_NUMA
+/*
+ * Only custom zonelists like MPOL_BIND need to be filtered as part of
+ * policies. As described in the comment for struct zonelist_cache, these
+ * zonelists will not have a zlcache so zlcache_ptr will not be set. Use
+ * that to determine if the zonelist needs to be filtered or not.
+ */
+static inline int alloc_should_filter_zonelist(struct zonelist *zonelist)
+{
+ return !zonelist->zlcache_ptr;
+}
+#else
+static inline int alloc_should_filter_zonelist(struct zonelist *zonelist)
+{
+ return 0;
+}
+#endif /* CONFIG_NUMA */
+
#ifdef CONFIG_ARCH_POPULATES_NODE_MAP
struct node_active_region {
unsigned long start_pfn;
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index 71b84b4..172abff 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -149,7 +149,7 @@ static struct zonelist *bind_zonelist(nodemask_t *nodes)
lower zones etc. Avoid empty zones because the memory allocator
doesn't like them. If you implement node hot removal you
have to fix that. */
- k = policy_zone;
+ k = MAX_NR_ZONES - 1;
while (1) {
for_each_node_mask(nd, *nodes) {
struct zone *z = &NODE_DATA(nd)->node_zones[k];
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 3da85b8..6427653 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1157,6 +1157,7 @@ get_page_from_freelist(gfp_t gfp_mask, unsigned int order,
nodemask_t *allowednodes = NULL;/* zonelist_cache approximation */
int zlc_active = 0; /* set if using zonelist_cache */
int did_zlc_setup = 0; /* just call zlc_setup() one time */
+ enum zone_type highest_zoneidx = -1; /* Gets set for policy zonelists */

zonelist_scan:
/*
@@ -1166,6 +1167,18 @@ zonelist_scan:
z = zonelist->zones;

do {
+ /*
+ * In NUMA, this could be a policy zonelist which contains
+ * zones that may not be allowed by the current gfp_mask.
+ * Check the zone is allowed by the current flags
+ */
+ if (unlikely(alloc_should_filter_zonelist(zonelist))) {
+ if (highest_zoneidx == -1)
+ highest_zoneidx = gfp_zone(gfp_mask);
+ if (zone_idx(*z) > highest_zoneidx)
+ continue;
+ }
+
if (NUMA_BUILD && zlc_active &&
!zlc_zone_worth_trying(zonelist, z, allowednodes))
continue;

--
Mel Gorman
Part-time Phd Student Linux Technology Center
University of Limerick IBM Dublin Software Lab


2007-08-02 19:47:48

by Lee Schermerhorn

Subject: Re: [PATCH] Apply memory policies to top two highest zones when highest zone is ZONE_MOVABLE

On Thu, 2007-08-02 at 18:21 +0100, Mel Gorman wrote:
> The NUMA layer only supports NUMA policies for the highest zone. When
> ZONE_MOVABLE is configured with kernelcore=, the highest zone becomes
> ZONE_MOVABLE. The result is that policies are only applied to allocations
> like anonymous pages and page cache allocated from ZONE_MOVABLE when the
> zone is used.
>
> This patch applies policies to the two highest zones when the highest zone
> is ZONE_MOVABLE. As ZONE_MOVABLE consists of pages from the highest "real"
> zone, it's always functionally equivalent.
>
> The patch has been tested on a variety of machines both NUMA and non-NUMA
> covering x86, x86_64 and ppc64. No abnormal results were seen in kernbench,
> tbench, dbench or hackbench. It passes regression tests from the numactl
> package with and without kernelcore= once numactl tests are patched to
> wait for vmstat counters to update.
>
> Signed-off-by: Mel Gorman <[email protected]>


And on an ia64 NUMA platform, with some ad hoc interactive functional
testing with memtoy and an overnight run of a usex job mix. Job mix
included a 32-way kernel build, several povray tracing apps, IO tests,
sequential and random vm fault tests and a number of 'bin' tests that
simulate half a dozen crazed monkeys pounding away at keyboards
entering surprisingly error-free commands. All of these loop until
stopped or the system hangs/crashes--which it didn't...


Acked-by: Lee Schermerhorn <[email protected]>
Tested-by: Lee Schermerhorn <[email protected]>



2007-08-02 20:45:32

by Christoph Lameter

Subject: Re: [PATCH] Apply memory policies to top two highest zones when highest zone is ZONE_MOVABLE

On Thu, 2 Aug 2007, Mel Gorman wrote:

> +#ifdef CONFIG_NUMA
> +/*
> + * Only custom zonelists like MPOL_BIND need to be filtered as part of
> + * policies. As described in the comment for struct zonelist_cache, these
> + * zonelists will not have a zlcache so zlcache_ptr will not be set. Use
> + * that to determine if the zonelist needs to be filtered or not.
> + */
> +static inline int alloc_should_filter_zonelist(struct zonelist *zonelist)
> +{
> + return !zonelist->zlcache_ptr;
> +}

I guess Paul needs to have a look at this one.

Otherwise

Acked-by: Christoph Lameter <[email protected]>

> @@ -1166,6 +1167,18 @@ zonelist_scan:
> z = zonelist->zones;
>
> do {
> + /*
> + * In NUMA, this could be a policy zonelist which contains
> + * zones that may not be allowed by the current gfp_mask.
> + * Check the zone is allowed by the current flags
> + */
> + if (unlikely(alloc_should_filter_zonelist(zonelist))) {
> + if (highest_zoneidx == -1)
> + highest_zoneidx = gfp_zone(gfp_mask);
> + if (zone_idx(*z) > highest_zoneidx)
> + continue;
> + }
> +
> if (NUMA_BUILD && zlc_active &&

Hotpath. Sigh.

2007-08-03 22:02:40

by Andi Kleen

Subject: Re: [PATCH] Apply memory policies to top two highest zones when highest zone is ZONE_MOVABLE

On Thursday 02 August 2007 19:21:18 Mel Gorman wrote:
> The NUMA layer only supports NUMA policies for the highest zone. When
> ZONE_MOVABLE is configured with kernelcore=, the highest zone becomes
> ZONE_MOVABLE. The result is that policies are only applied to allocations
> like anonymous pages and page cache allocated from ZONE_MOVABLE when the
> zone is used.
>
> This patch applies policies to the two highest zones when the highest zone
> is ZONE_MOVABLE. As ZONE_MOVABLE consists of pages from the highest "real"
> zone, it's always functionally equivalent.
>
> The patch has been tested on a variety of machines both NUMA and non-NUMA
> covering x86, x86_64 and ppc64. No abnormal results were seen in kernbench,
> tbench, dbench or hackbench. It passes regression tests from the numactl
> package with and without kernelcore= once numactl tests are patched to
> wait for vmstat counters to update.

I must honestly say I really hate the patch. It's a horrible hack and makes fast paths
slower. When I designed mempolicies I especially tried to avoid things
like that, please don't add them through the backdoor now.

-Andi

2007-08-04 00:24:08

by mel

Subject: Re: [PATCH] Apply memory policies to top two highest zones when highest zone is ZONE_MOVABLE

On (04/08/07 00:02), Andi Kleen didst pronounce:
> On Thursday 02 August 2007 19:21:18 Mel Gorman wrote:
> > The NUMA layer only supports NUMA policies for the highest zone. When
> > ZONE_MOVABLE is configured with kernelcore=, the highest zone becomes
> > ZONE_MOVABLE. The result is that policies are only applied to allocations
> > like anonymous pages and page cache allocated from ZONE_MOVABLE when the
> > zone is used.
> >
> > This patch applies policies to the two highest zones when the highest zone
> > is ZONE_MOVABLE. As ZONE_MOVABLE consists of pages from the highest "real"
> > zone, it's always functionally equivalent.
> >
> > The patch has been tested on a variety of machines both NUMA and non-NUMA
> > covering x86, x86_64 and ppc64. No abnormal results were seen in kernbench,
> > tbench, dbench or hackbench. It passes regression tests from the numactl
> > package with and without kernelcore= once numactl tests are patched to
> > wait for vmstat counters to update.
>
> I must honestly say I really hate the patch. It's a horrible hack and makes fast paths
> slower. When I designed mempolicies I especially tried to avoid things
> like that, please don't add them through the backdoor now.
>

It only affects hot paths in the NUMA case so non-NUMA users will not care. For
NUMA users, I have posted patches that eliminate multiple zonelists altogether
which will reduce cache footprint (something like 7K per node on x86_64)
and make things like MPOL_BIND behave in a consistent manner. That would
cost on CPU but save on cache which would (hopefully) result in a net gain
in most cases.

I would like to go with this patch for now just for policies but for 2.6.23,
we could leave it as "policies only apply to ZONE_MOVABLE when it is used" if
you really insisted on it. It's less than ideal though for sure.

--
Mel Gorman
Part-time Phd Student Linux Technology Center
University of Limerick IBM Dublin Software Lab

2007-08-04 08:53:03

by Andi Kleen

Subject: Re: [PATCH] Apply memory policies to top two highest zones when highest zone is ZONE_MOVABLE


> It only affects hot paths in the NUMA case so non-NUMA users will not care.

For x86-64 most distribution kernels are NUMA these days.

> For NUMA users, I have posted patches that eliminate multiple zonelists
> altogether which will reduce cache footprint (something like 7K per node on
> x86_64)

How do you get to 7k? We got worst case 3 zones per node (normally less);
that's three pointers per GFP level.

> and make things like MPOL_BIND behave in a consistent manner. That
> would cost on CPU but save on cache which would (hopefully) result in a net
> gain in most cases.

That might be a good tradeoff, but without seeing the patch
the 7k number sounds very dubious.

> I would like to go with this patch for now just for policies but for
> 2.6.23, we could leave it as "policies only apply to ZONE_MOVABLE when it
> is used" if you really insisted on it. It's less than ideal though for
> sure.

Or disable ZONE_MOVABLE. It seems to be clearly not well thought
out yet. Perhaps make it dependent on !CONFIG_NUMA.

-Andi

2007-08-04 16:39:32

by mel

Subject: Re: [PATCH] Apply memory policies to top two highest zones when highest zone is ZONE_MOVABLE

On (04/08/07 10:51), Andi Kleen didst pronounce:
>
> > It only affects hot paths in the NUMA case so non-NUMA users will not care.
>
> For x86-64 most distribution kernels are NUMA these days.
>
> > For NUMA users, I have posted patches that eliminate multiple zonelists
> > altogether which will reduce cache footprint (something like 7K per node on
> > x86_64)
>
> How do you get to 7k? We got worst case 3 zones per node (normally less);
> that's three pointers per GFP level.
>

The zonelists are pretty big. On a 4 node x86_64 machine (elm3b6 from tko),
the size of pg_data_t goes from 13632 bytes to 5824 (almost 8k in fact)
when only one zonelist is used.

> > and make things like MPOL_BIND behave in a consistent manner. That
> > would cost on CPU but save on cache which would (hopefully) result in a net
> > gain in most cases.
>
> That might be a good tradeoff, but without seeing the patch
> the 7k number sounds very dubious.
>

Proof-of-concept patch is below. It's not suitable for merging; I wanted to
get the policy issue resolved first before spending more time on it. The
patch was a bit too heavy to call a fix for a bug.

> > I would like to go with this patch for now just for policies but for
> > 2.6.23, we could leave it as "policies only apply to ZONE_MOVABLE when it
> > is used" if you really insisted on it. It's less than ideal though for
> > sure.
>
> Or disable ZONE_MOVABLE. It seems to be clearly not well thought
> out yet.

The zone is disabled by default. When it is enabled, policies are only applied
to it; that is expected, but not desirable, which is why I wanted to apply
policies to the two highest zones when the highest is ZONE_MOVABLE.

> Perhaps make it dependent on !CONFIG_NUMA.
>

That would make no sense. The systems that will be using hugepages and
looking to resize their pool will often be NUMA machines and you state
that most x86_64 distros will have NUMA enabled.

This is the prototype patch for removing multiple zonelists altogether.
It would also act as a fix for the
policies-only-applying-to-ZONE_MOVABLE problem. You may note that the
filtering in __alloc_pages() takes place in the same place as in
the patch to fix policies, so there is a logical progression from
bug fix now to something with wider usefulness later.

diff --git a/arch/parisc/mm/init.c b/arch/parisc/mm/init.c
index e724b36..4d417c4 100644
--- a/arch/parisc/mm/init.c
+++ b/arch/parisc/mm/init.c
@@ -602,12 +602,15 @@ void show_mem(void)
int i, j, k;

for (i = 0; i < npmem_ranges; i++) {
+ zl = &NODE_DATA(i)->node_zonelist;
for (j = 0; j < MAX_NR_ZONES; j++) {
- zl = NODE_DATA(i)->node_zonelists + j;

printk("Zone list for zone %d on node %d: ", j, i);
- for (k = 0; zl->zones[k] != NULL; k++)
+ for (k = 0; zl->zones[k] != NULL; k++) {
+ if (should_filter_zone(zl->zones[k], j))
+ continue;
printk("[%ld/%s] ", zone_to_nid(zl->zones[k]), zl->zones[k]->name);
+ }
printk("\n");
}
}
diff --git a/drivers/char/sysrq.c b/drivers/char/sysrq.c
index 39cc318..b56d17f 100644
--- a/drivers/char/sysrq.c
+++ b/drivers/char/sysrq.c
@@ -270,7 +270,7 @@ static struct sysrq_key_op sysrq_term_op = {

static void moom_callback(struct work_struct *ignored)
{
- out_of_memory(&NODE_DATA(0)->node_zonelists[ZONE_NORMAL],
+ out_of_memory(&NODE_DATA(0)->node_zonelist,
GFP_KERNEL, 0);
}

diff --git a/fs/buffer.c b/fs/buffer.c
index 0e5ec37..8e9bbef 100644
--- a/fs/buffer.c
+++ b/fs/buffer.c
@@ -354,7 +354,7 @@ static void free_more_memory(void)
yield();

for_each_online_pgdat(pgdat) {
- zones = pgdat->node_zonelists[gfp_zone(GFP_NOFS)].zones;
+ zones = pgdat->node_zonelist.zones;
if (*zones)
try_to_free_pages(zones, 0, GFP_NOFS);
}
diff --git a/include/linux/gfp.h b/include/linux/gfp.h
index bc68dd9..f2a597e 100644
--- a/include/linux/gfp.h
+++ b/include/linux/gfp.h
@@ -116,6 +116,13 @@ static inline enum zone_type gfp_zone(gfp_t flags)
return ZONE_NORMAL;
}

+static inline int should_filter_zone(struct zone *zone, int highest_zoneidx)
+{
+ if (zone_idx(zone) > highest_zoneidx)
+ return 1;
+ return 0;
+}
+
/*
* There is only one page-allocator function, and two main namespaces to
* it. The alloc_page*() variants return 'struct page *' and as such
@@ -151,8 +158,7 @@ static inline struct page *alloc_pages_node(int nid, gfp_t gfp_mask,
if (nid < 0)
nid = numa_node_id();

- return __alloc_pages(gfp_mask, order,
- NODE_DATA(nid)->node_zonelists + gfp_zone(gfp_mask));
+ return __alloc_pages(gfp_mask, order, &NODE_DATA(nid)->node_zonelist);
}

#ifdef CONFIG_NUMA
diff --git a/include/linux/mempolicy.h b/include/linux/mempolicy.h
index e147cf5..83e5256 100644
--- a/include/linux/mempolicy.h
+++ b/include/linux/mempolicy.h
@@ -166,7 +166,7 @@ extern enum zone_type policy_zone;

static inline void check_highest_zone(enum zone_type k)
{
- if (k > policy_zone)
+ if (k > policy_zone && k != ZONE_MOVABLE)
policy_zone = k;
}

@@ -258,7 +258,7 @@ static inline void mpol_fix_fork_child_flag(struct task_struct *p)
static inline struct zonelist *huge_zonelist(struct vm_area_struct *vma,
unsigned long addr, gfp_t gfp_flags)
{
- return NODE_DATA(0)->node_zonelists + gfp_zone(gfp_flags);
+ return &NODE_DATA(0)->node_zonelist;
}

static inline int do_migrate_pages(struct mm_struct *mm,
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 3ea68cd..d2fe32e 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -201,6 +201,7 @@ struct zone {
*/
unsigned long lowmem_reserve[MAX_NR_ZONES];

+ int zone_idx;
#ifdef CONFIG_NUMA
int node;
/*
@@ -437,7 +438,7 @@ extern struct page *mem_map;
struct bootmem_data;
typedef struct pglist_data {
struct zone node_zones[MAX_NR_ZONES];
- struct zonelist node_zonelists[MAX_NR_ZONES];
+ struct zonelist node_zonelist;
int nr_zones;
#ifdef CONFIG_FLAT_NODE_MEM_MAP
struct page *node_mem_map;
@@ -501,7 +502,7 @@ unsigned long __init node_memmap_size_bytes(int, unsigned long, unsigned long);
/*
* zone_idx() returns 0 for the ZONE_DMA zone, 1 for the ZONE_NORMAL zone, etc.
*/
-#define zone_idx(zone) ((zone) - (zone)->zone_pgdat->node_zones)
+#define zone_idx(zone) ((zone)->zone_idx)

static inline int populated_zone(struct zone *zone)
{
@@ -543,7 +544,7 @@ static inline int is_normal_idx(enum zone_type idx)
static inline int is_highmem(struct zone *zone)
{
#ifdef CONFIG_HIGHMEM
- int zone_idx = zone - zone->zone_pgdat->node_zones;
+ int zone_idx = zone_idx(zone);
return zone_idx == ZONE_HIGHMEM ||
(zone_idx == ZONE_MOVABLE && zone_movable_is_highmem());
#else
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index 71b84b4..8b16ca3 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -149,7 +149,7 @@ static struct zonelist *bind_zonelist(nodemask_t *nodes)
lower zones etc. Avoid empty zones because the memory allocator
doesn't like them. If you implement node hot removal you
have to fix that. */
- k = policy_zone;
+ k = MAX_NR_ZONES - 1;
while (1) {
for_each_node_mask(nd, *nodes) {
struct zone *z = &NODE_DATA(nd)->node_zones[k];
@@ -1116,7 +1116,7 @@ static struct zonelist *zonelist_policy(gfp_t gfp, struct mempolicy *policy)
nd = 0;
BUG();
}
- return NODE_DATA(nd)->node_zonelists + gfp_zone(gfp);
+ return &NODE_DATA(nd)->node_zonelist;
}

/* Do dynamic interleaving for a process */
@@ -1212,7 +1212,7 @@ struct zonelist *huge_zonelist(struct vm_area_struct *vma, unsigned long addr,
unsigned nid;

nid = interleave_nid(pol, vma, addr, HPAGE_SHIFT);
- return NODE_DATA(nid)->node_zonelists + gfp_zone(gfp_flags);
+ return &NODE_DATA(nid)->node_zonelist;
}
return zonelist_policy(GFP_HIGHUSER, pol);
}
@@ -1226,7 +1226,7 @@ static struct page *alloc_page_interleave(gfp_t gfp, unsigned order,
struct zonelist *zl;
struct page *page;

- zl = NODE_DATA(nid)->node_zonelists + gfp_zone(gfp);
+ zl = &NODE_DATA(nid)->node_zonelist;
page = __alloc_pages(gfp, order, zl);
if (page && page_zone(page) == zl->zones[0])
inc_zone_page_state(page, NUMA_INTERLEAVE_HIT);
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index f9b82ad..1cca18e 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -179,6 +179,7 @@ static inline int constrained_alloc(struct zonelist *zonelist, gfp_t gfp_mask)
struct zone **z;
nodemask_t nodes;
int node;
+ enum zone_type highest_zoneidx = gfp_zone(gfp_mask);

nodes_clear(nodes);
/* node has memory ? */
@@ -186,11 +187,15 @@ static inline int constrained_alloc(struct zonelist *zonelist, gfp_t gfp_mask)
if (NODE_DATA(node)->node_present_pages)
node_set(node, nodes);

- for (z = zonelist->zones; *z; z++)
+ for (z = zonelist->zones; *z; z++) {
+
+ if (should_filter_zone(*z, highest_zoneidx))
+ continue;
if (cpuset_zone_allowed_softwall(*z, gfp_mask))
node_clear(zone_to_nid(*z), nodes);
else
return CONSTRAINT_CPUSET;
+ }

if (!nodes_empty(nodes))
return CONSTRAINT_MEMORY_POLICY;
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 3da85b8..190994d 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1157,6 +1157,7 @@ get_page_from_freelist(gfp_t gfp_mask, unsigned int order,
nodemask_t *allowednodes = NULL;/* zonelist_cache approximation */
int zlc_active = 0; /* set if using zonelist_cache */
int did_zlc_setup = 0; /* just call zlc_setup() one time */
+ enum zone_type highest_zoneidx = gfp_zone(gfp_mask);

zonelist_scan:
/*
@@ -1166,6 +1167,9 @@ zonelist_scan:
z = zonelist->zones;

do {
+ if (should_filter_zone(*z, highest_zoneidx))
+ continue;
+
if (NUMA_BUILD && zlc_active &&
!zlc_zone_worth_trying(zonelist, z, allowednodes))
continue;
@@ -1460,11 +1464,11 @@ static unsigned int nr_free_zone_pages(int offset)
pg_data_t *pgdat = NODE_DATA(numa_node_id());
unsigned int sum = 0;

- struct zonelist *zonelist = pgdat->node_zonelists + offset;
- struct zone **zonep = zonelist->zones;
- struct zone *zone;
+ struct zone **zonep = pgdat->node_zonelist.zones;
+ struct zone *zone = *zonep;

- for (zone = *zonep++; zone; zone = *zonep++) {
+ for (zone = *zonep++; zone && zone_idx(zone) > offset; zone = *zonep++);
+ for (; zone; zone = *zonep++) {
unsigned long size = zone->present_pages;
unsigned long high = zone->pages_high;
if (size > high)
@@ -1823,17 +1827,14 @@ static int find_next_best_node(int node, nodemask_t *used_node_mask)
*/
static void build_zonelists_in_node_order(pg_data_t *pgdat, int node)
{
- enum zone_type i;
int j;
struct zonelist *zonelist;

- for (i = 0; i < MAX_NR_ZONES; i++) {
- zonelist = pgdat->node_zonelists + i;
- for (j = 0; zonelist->zones[j] != NULL; j++)
- ;
- j = build_zonelists_node(NODE_DATA(node), zonelist, j, i);
- zonelist->zones[j] = NULL;
- }
+ zonelist = &pgdat->node_zonelist;
+ for (j = 0; zonelist->zones[j] != NULL; j++)
+ ;
+ j = build_zonelists_node(NODE_DATA(node), zonelist, j, MAX_NR_ZONES-1);
+ zonelist->zones[j] = NULL;
}

/*
@@ -1846,27 +1847,24 @@ static int node_order[MAX_NUMNODES];

static void build_zonelists_in_zone_order(pg_data_t *pgdat, int nr_nodes)
{
- enum zone_type i;
int pos, j, node;
int zone_type; /* needs to be signed */
struct zone *z;
struct zonelist *zonelist;

- for (i = 0; i < MAX_NR_ZONES; i++) {
- zonelist = pgdat->node_zonelists + i;
- pos = 0;
- for (zone_type = i; zone_type >= 0; zone_type--) {
- for (j = 0; j < nr_nodes; j++) {
- node = node_order[j];
- z = &NODE_DATA(node)->node_zones[zone_type];
- if (populated_zone(z)) {
- zonelist->zones[pos++] = z;
- check_highest_zone(zone_type);
- }
+ zonelist = &pgdat->node_zonelist;
+ pos = 0;
+ for (zone_type = MAX_NR_ZONES-1; zone_type >= 0; zone_type--) {
+ for (j = 0; j < nr_nodes; j++) {
+ node = node_order[j];
+ z = &NODE_DATA(node)->node_zones[zone_type];
+ if (populated_zone(z)) {
+ zonelist->zones[pos++] = z;
+ check_highest_zone(zone_type);
}
}
- zonelist->zones[pos] = NULL;
}
+ zonelist->zones[pos] = NULL;
}

static int default_zonelist_order(void)
@@ -1933,17 +1931,14 @@ static void set_zonelist_order(void)
static void build_zonelists(pg_data_t *pgdat)
{
int j, node, load;
- enum zone_type i;
nodemask_t used_mask;
int local_node, prev_node;
struct zonelist *zonelist;
int order = current_zonelist_order;

- /* initialize zonelists */
- for (i = 0; i < MAX_NR_ZONES; i++) {
- zonelist = pgdat->node_zonelists + i;
- zonelist->zones[0] = NULL;
- }
+ /* initialize zonelist */
+ zonelist = &pgdat->node_zonelist;
+ zonelist->zones[0] = NULL;

/* NUMA-aware ordering of nodes */
local_node = pgdat->node_id;
@@ -1997,7 +1992,7 @@ static void build_zonelist_cache(pg_data_t *pgdat)
struct zonelist_cache *zlc;
struct zone **z;

- zonelist = pgdat->node_zonelists + i;
+ zonelist = &pgdat->node_zonelist;
zonelist->zlcache_ptr = zlc = &zonelist->zlcache;
bitmap_zero(zlc->fullzones, MAX_ZONES_PER_ZONELIST);
for (z = zonelist->zones; *z; z++)
@@ -2016,36 +2011,36 @@ static void set_zonelist_order(void)
static void build_zonelists(pg_data_t *pgdat)
{
int node, local_node;
- enum zone_type i,j;
+ enum zone_type j;
+ struct zonelist *zonelist;

local_node = pgdat->node_id;
- for (i = 0; i < MAX_NR_ZONES; i++) {
- struct zonelist *zonelist;

- zonelist = pgdat->node_zonelists + i;
-
- j = build_zonelists_node(pgdat, zonelist, 0, i);
- /*
- * Now we build the zonelist so that it contains the zones
- * of all the other nodes.
- * We don't want to pressure a particular node, so when
- * building the zones for node N, we make sure that the
- * zones coming right after the local ones are those from
- * node N+1 (modulo N)
- */
- for (node = local_node + 1; node < MAX_NUMNODES; node++) {
- if (!node_online(node))
- continue;
- j = build_zonelists_node(NODE_DATA(node), zonelist, j, i);
- }
- for (node = 0; node < local_node; node++) {
- if (!node_online(node))
- continue;
- j = build_zonelists_node(NODE_DATA(node), zonelist, j, i);
- }
+ zonelist = &pgdat->node_zonelist;
+ j = build_zonelists_node(pgdat, zonelist, 0, MAX_NR_ZONES-1);

- zonelist->zones[j] = NULL;
+ /*
+ * Now we build the zonelist so that it contains the zones
+ * of all the other nodes.
+ * We don't want to pressure a particular node, so when
+ * building the zones for node N, we make sure that the
+ * zones coming right after the local ones are those from
+ * node N+1 (modulo N)
+ */
+ for (node = local_node + 1; node < MAX_NUMNODES; node++) {
+ if (!node_online(node))
+ continue;
+ j = build_zonelists_node(NODE_DATA(node), zonelist, j,
+ MAX_NR_ZONES-1);
}
+ for (node = 0; node < local_node; node++) {
+ if (!node_online(node))
+ continue;
+ j = build_zonelists_node(NODE_DATA(node), zonelist, j,
+ MAX_NR_ZONES-1);
+ }
+
+ zonelist->zones[j] = NULL;
}

/* non-NUMA variant of zonelist performance cache - just NULL zlcache_ptr */
@@ -2054,7 +2049,7 @@ static void build_zonelist_cache(pg_data_t *pgdat)
int i;

for (i = 0; i < MAX_NR_ZONES; i++)
- pgdat->node_zonelists[i].zlcache_ptr = NULL;
+ pgdat->node_zonelist.zlcache_ptr = NULL;
}

#endif /* CONFIG_NUMA */
@@ -2940,6 +2935,7 @@ static void __meminit free_area_init_core(struct pglist_data *pgdat,
nr_kernel_pages += realsize;
nr_all_pages += realsize;

+ zone->zone_idx = j;
zone->spanned_pages = size;
zone->present_pages = realsize;
#ifdef CONFIG_NUMA
diff --git a/mm/slab.c b/mm/slab.c
index a684778..558cf96 100644
--- a/mm/slab.c
+++ b/mm/slab.c
@@ -3216,12 +3216,12 @@ static void *fallback_alloc(struct kmem_cache *cache, gfp_t flags)
struct zone **z;
void *obj = NULL;
int nid;
+ enum zone_type highest_zoneidx = gfp_zone(flags);

if (flags & __GFP_THISNODE)
return NULL;

- zonelist = &NODE_DATA(slab_node(current->mempolicy))
- ->node_zonelists[gfp_zone(flags)];
+ zonelist = &NODE_DATA(slab_node(current->mempolicy))->node_zonelist;
local_flags = (flags & GFP_LEVEL_MASK);

retry:
@@ -3230,6 +3230,9 @@ retry:
* from existing per node queues.
*/
for (z = zonelist->zones; *z && !obj; z++) {
+ if (should_filter_zone(*z, highest_zoneidx))
+ continue;
+
nid = zone_to_nid(*z);

if (cpuset_zone_allowed_hardwall(*z, flags) &&
diff --git a/mm/slub.c b/mm/slub.c
index 6c6d74f..eea184b 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -1276,6 +1276,7 @@ static struct page *get_any_partial(struct kmem_cache *s, gfp_t flags)
struct zonelist *zonelist;
struct zone **z;
struct page *page;
+ enum zone_type highest_zoneidx = gfp_zone(flags);

/*
* The defrag ratio allows a configuration of the tradeoffs between
@@ -1298,11 +1299,13 @@ static struct page *get_any_partial(struct kmem_cache *s, gfp_t flags)
if (!s->defrag_ratio || get_cycles() % 1024 > s->defrag_ratio)
return NULL;

- zonelist = &NODE_DATA(slab_node(current->mempolicy))
- ->node_zonelists[gfp_zone(flags)];
+ zonelist = &NODE_DATA(slab_node(current->mempolicy))->node_zonelist;
for (z = zonelist->zones; *z; z++) {
struct kmem_cache_node *n;

+ if (should_filter_zone(*z, highest_zoneidx))
+ continue;
+
n = get_node(s, zone_to_nid(*z));

if (n && cpuset_zone_allowed_hardwall(*z, flags) &&
diff --git a/mm/vmscan.c b/mm/vmscan.c
index d419e10..8672d61 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1124,6 +1124,7 @@ unsigned long try_to_free_pages(struct zone **zones, int order, gfp_t gfp_mask)
unsigned long nr_reclaimed = 0;
struct reclaim_state *reclaim_state = current->reclaim_state;
unsigned long lru_pages = 0;
+ enum zone_type highest_zoneidx;
int i;
struct scan_control sc = {
.gfp_mask = gfp_mask,
@@ -1136,9 +1137,14 @@ unsigned long try_to_free_pages(struct zone **zones, int order, gfp_t gfp_mask)

count_vm_event(ALLOCSTALL);

+ highest_zoneidx = gfp_zone(gfp_mask);
+
for (i = 0; zones[i] != NULL; i++) {
struct zone *zone = zones[i];

+ if (should_filter_zone(zone, highest_zoneidx))
+ continue;
+
if (!cpuset_zone_allowed_hardwall(zone, GFP_KERNEL))
continue;

--
Mel Gorman
Part-time Phd Student Linux Technology Center
University of Limerick IBM Dublin Software Lab

2007-08-06 19:17:16

by Andrew Morton

Subject: Re: [PATCH] Apply memory policies to top two highest zones when highest zone is ZONE_MOVABLE

On Sat, 4 Aug 2007 00:02:17 +0200 Andi Kleen <[email protected]> wrote:

> On Thursday 02 August 2007 19:21:18 Mel Gorman wrote:
> > The NUMA layer only supports NUMA policies for the highest zone. When
> > ZONE_MOVABLE is configured with kernelcore=, the highest zone becomes
> > ZONE_MOVABLE. The result is that policies are only applied to allocations
> > like anonymous pages and page cache allocated from ZONE_MOVABLE when the
> > zone is used.
> >
> > This patch applies policies to the two highest zones when the highest zone
> > is ZONE_MOVABLE. As ZONE_MOVABLE consists of pages from the highest "real"
> > zone, it's always functionally equivalent.
> >
> > The patch has been tested on a variety of machines both NUMA and non-NUMA
> > covering x86, x86_64 and ppc64. No abnormal results were seen in kernbench,
> > tbench, dbench or hackbench. It passes regression tests from the numactl
> > package with and without kernelcore= once numactl tests are patched to
> > wait for vmstat counters to update.
>
> I must honestly say I really hate the patch. It's a horrible hack and makes fast paths
> slower. When I designed mempolicies I especially tried to avoid things
> like that, please don't add them through the backdoor now.
>

We don't want to be adding horrible hacks and slowness to the core of
__alloc_pages().

So where do we stand on this? We made a mess of NUMA policies, and merging
"grouping pages by mobility" would fix that mess, only we're not sure that
we want to merge those and it's too late for 2.6.23 anyway?

If correct, I would suggest merging the horrible hack for .23 then taking
it out when we merge "grouping pages by mobility". But what if we don't do
that merge?

2007-08-06 19:18:29

by Christoph Lameter

Subject: Re: [PATCH] Apply memory policies to top two highest zones when highest zone is ZONE_MOVABLE

On Mon, 6 Aug 2007, Andrew Morton wrote:

> If correct, I would suggest merging the horrible hack for .23 then taking
> it out when we merge "grouping pages by mobility". But what if we don't do
> that merge?

Take the horrible hack for .23.

For .24 either merge the mobility or get the other solution that Mel is
working on. That solution would only use a single zonelist per node and
filter on the fly. That may help performance and also help to make memory
policies work better.

2007-08-06 19:44:48

by Andrew Morton

Subject: Re: [PATCH] Apply memory policies to top two highest zones when highest zone is ZONE_MOVABLE

On Thu, 2 Aug 2007 13:45:23 -0700 (PDT) Christoph Lameter <[email protected]> wrote:

> On Thu, 2 Aug 2007, Mel Gorman wrote:
>
> > +#ifdef CONFIG_NUMA
> > +/*
> > + * Only custom zonelists like MPOL_BIND need to be filtered as part of
> > + * policies. As described in the comment for struct zonelist_cache, these
> > + * zonelists will not have a zlcache so zlcache_ptr will not be set. Use
> > + * that to determine if the zonelist needs to be filtered or not.
> > + */
> > +static inline int alloc_should_filter_zonelist(struct zonelist *zonelist)
> > +{
> > + return !zonelist->zlcache_ptr;
> > +}
>
> I guess Paul needs to have a look at this one.

Which Paul?

> Otherwise
>
> Acked-by: Christoph Lameter <[email protected]>
>
> > @@ -1166,6 +1167,18 @@ zonelist_scan:
> > z = zonelist->zones;
> >
> > do {
> > + /*
> > + * In NUMA, this could be a policy zonelist which contains
> > + * zones that may not be allowed by the current gfp_mask.
> > + * Check the zone is allowed by the current flags
> > + */
> > + if (unlikely(alloc_should_filter_zonelist(zonelist))) {
> > + if (highest_zoneidx == -1)
> > + highest_zoneidx = gfp_zone(gfp_mask);
> > + if (zone_idx(*z) > highest_zoneidx)
> > + continue;
> > + }
> > +
> > if (NUMA_BUILD && zlc_active &&
>
> Hotpath. Sigh.

2007-08-06 20:13:56

by Christoph Lameter

Subject: Re: [PATCH] Apply memory policies to top two highest zones when highest zone is ZONE_MOVABLE

On Mon, 6 Aug 2007, Andrew Morton wrote:

> Which Paul?

Paul Jackson.

2007-08-06 20:32:04

by Andi Kleen

Subject: Re: [PATCH] Apply memory policies to top two highest zones when highest zone is ZONE_MOVABLE


> If correct, I would suggest merging the horrible hack for .23 then taking
> it out when we merge "grouping pages by mobility". But what if we don't do
> that merge?

Or disable ZONE_MOVABLE until it is usable? I don't think we have the
infrastructure to really use it anyways, so it shouldn't make too much difference
in terms of features. And it's not that there is some sort of deadline
around for it.

Or mark it CONFIG_EXPERIMENTAL with a warning that it'll break NUMA. But disabling
is probably better.

Then for .24 or .25 a better solution can be developed.

I would prefer that instead of merging bandaid horrible hacks -- they have
a tendency to stay around.

-Andi

2007-08-06 21:48:27

by mel

Subject: Re: [PATCH] Apply memory policies to top two highest zones when highest zone is ZONE_MOVABLE

On (06/08/07 12:15), Andrew Morton didst pronounce:
> On Sat, 4 Aug 2007 00:02:17 +0200 Andi Kleen <[email protected]> wrote:
>
> > On Thursday 02 August 2007 19:21:18 Mel Gorman wrote:
> > > The NUMA layer only supports NUMA policies for the highest zone. When
> > > ZONE_MOVABLE is configured with kernelcore=, the highest zone becomes
> > > ZONE_MOVABLE. The result is that policies are only applied to allocations
> > > like anonymous pages and page cache allocated from ZONE_MOVABLE when the
> > > zone is used.
> > >
> > > This patch applies policies to the two highest zones when the highest zone
> > > is ZONE_MOVABLE. As ZONE_MOVABLE consists of pages from the highest "real"
> > > zone, it's always functionally equivalent.
> > >
> > > The patch has been tested on a variety of machines both NUMA and non-NUMA
> > > covering x86, x86_64 and ppc64. No abnormal results were seen in kernbench,
> > > tbench, dbench or hackbench. It passes regression tests from the numactl
> > > package with and without kernelcore= once numactl tests are patched to
> > > wait for vmstat counters to update.
> >
> > I must honestly say I really hate the patch. It's a horrible hack and makes fast paths
> > slower. When I designed mempolicies I especially tried to avoid things
> > like that, please don't add them through the backdoor now.
> >
>
> We don't want to be adding horrible hacks and slowness to the core of
> __alloc_pages().
>
> So where do we stand on this? We made a mess of NUMA policies, and merging
> "grouping pages by mobility" would fix that mess, only we're not sure that
> we want to merge those and it's too late for 2.6.23 anyway?
>

Grouping pages by mobility would still apply policies only to
ZONE_MOVABLE when it is configured. What grouping pages by mobility
would relieve is much of the motivation to configure ZONE_MOVABLE at all
for hugepages. The zone has such attributes as being useful for
hot-remove as well as guaranteeing how much memory can be allocated as
hugepages at runtime, which are not necessarily provided by grouping pages
by mobility.

> If correct, I would suggest merging the horrible hack for .23 then taking
> it out when we merge "grouping pages by mobility". But what if we don't do
> that merge?

The hack aspect of the fix is that it alters the hot path in the
allocator. However, there is a logical way forward.

There are patches in the works that change nodes from having multiple
zonelists to having only one zonelist per node that is filtered based
on the allocation flags. The place this filtering happens is the same as what
the "hack" is currently doing. The cost of filtering should be offset by the
reduced size of the node structure and tests with kernbench, hackbench and
tbench seem to confirm that. This will bring the hack into line with
what we wanted with policies in the first place because things like MPOL_BIND
will try nodes in node-local order instead of node-numeric order as it does
currently.
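
To make the idea concrete, here is a minimal standalone model of a single
filtered zonelist. The types, ordering and helper names are illustrative
only, not the actual patch:

#include <stdio.h>
#include <stddef.h>

enum zone_type { ZONE_DMA, ZONE_NORMAL, ZONE_MOVABLE, MAX_NR_ZONES };

struct zone {
	enum zone_type idx;	/* cached zone index */
	int node;
};

/* Two mock nodes, each with all three zones populated */
static struct zone zones[2][MAX_NR_ZONES] = {
	{ { ZONE_DMA, 0 }, { ZONE_NORMAL, 0 }, { ZONE_MOVABLE, 0 } },
	{ { ZONE_DMA, 1 }, { ZONE_NORMAL, 1 }, { ZONE_MOVABLE, 1 } },
};

/* The single per-node list: highest zones first, local node first */
static struct zone *zonelist[] = {
	&zones[0][ZONE_MOVABLE], &zones[1][ZONE_MOVABLE],
	&zones[0][ZONE_NORMAL],  &zones[1][ZONE_NORMAL],
	&zones[0][ZONE_DMA],     &zones[1][ZONE_DMA],
	NULL,
};

/* One scan, skipping entries above what the allocation allows */
static struct zone *first_allowed(struct zone **zl, enum zone_type highest)
{
	for (; *zl; zl++) {
		if ((*zl)->idx > highest)
			continue;	/* the should_filter_zone() test */
		return *zl;		/* watermark/cpuset checks would follow */
	}
	return NULL;
}

int main(void)
{
	/* A GFP_KERNEL-style request: nothing above ZONE_NORMAL */
	struct zone *z = first_allowed(zonelist, ZONE_NORMAL);

	if (z)
		printf("first candidate: node %d, zone %d\n", z->node, z->idx);
	return 0;
}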

From there, we can eliminate policy_zone altogether by applying policies
to all zones but forcing a situation where MPOL_BIND will always contain
one node that GFP_KERNEL allocations can be satisfied from. For example,
if I have a NUMAQ that only has ZONE_NORMAL on node 0 and a user tries to
bind to nodes 2+3, they will really bind to nodes 0,2,3 so that GFP_KERNEL
allocations on that process will not return NULL. Alternatively, we could
have mbind return a failure if it doesn't include a node that can satisfy
GFP_KERNEL allocations. Either of these options seems more sensible than
sometimes applying policies and other times not applying them.
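
Both options could be modelled in standalone C along these lines. The
capability table and helper names are made up for illustration; this is
a sketch of the semantics, not an implementation:

#include <stdio.h>
#include <errno.h>

#define MAX_NODES 4

/* Mock: on a NUMAQ-like box only node 0 can satisfy GFP_KERNEL */
static const int node_has_normal[MAX_NODES] = { 1, 0, 0, 0 };

/* Option 1: silently extend the mask with a capable node */
static unsigned long mbind_fixup(unsigned long nodes)
{
	int n;

	for (n = 0; n < MAX_NODES; n++)
		if ((nodes & (1UL << n)) && node_has_normal[n])
			return nodes;	/* already contains a capable node */
	for (n = 0; n < MAX_NODES; n++)
		if (node_has_normal[n])
			return nodes | (1UL << n);	/* 2+3 becomes 0,2,3 */
	return nodes;
}

/* Option 2: refuse the bind instead of quietly adding a node */
static int mbind_check(unsigned long nodes)
{
	return mbind_fixup(nodes) == nodes ? 0 : -EINVAL;
}

int main(void)
{
	unsigned long req = (1UL << 2) | (1UL << 3);	/* bind to nodes 2+3 */

	printf("fixed-up mask: 0x%lx, strict check: %d\n",
	       mbind_fixup(req), mbind_check(req));
	return 0;
}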

The worst aspect is that filtering requires many lookups of zone_idx().
However, we may be able to optimise this by keeping the zone index in the
pointer within the zonelist, as suggested by Christoph Lameter, though
I haven't tried implementing it yet to see what it looks like in practice.
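
Whether the index is encoded in the low bits of the pointer or stored
alongside it, the effect is the same: the filter reads only the zonelist
entries. A sketch of the stored-alongside variant, with illustrative names:

#include <stdio.h>

enum zone_type { ZONE_DMA, ZONE_NORMAL, ZONE_MOVABLE, MAX_NR_ZONES };

struct zone { int node; /* ... real zone fields ... */ };

/*
 * Pair each entry with its zone index so the filtering test never has to
 * dereference the zone or recompute zone_idx() during the scan.
 */
struct zonelist_entry {
	struct zone *zone;
	enum zone_type idx;
};

static struct zone node0_normal = { 0 }, node0_movable = { 0 };

static struct zonelist_entry zonelist[] = {
	{ &node0_movable, ZONE_MOVABLE },
	{ &node0_normal,  ZONE_NORMAL },
	{ NULL, ZONE_DMA },
};

int main(void)
{
	struct zonelist_entry *e;

	/* The filter touches only the small, contiguous entry array */
	for (e = zonelist; e->zone; e++) {
		if (e->idx > ZONE_NORMAL)
			continue;
		printf("usable: node %d, zone index %d\n",
		       e->zone->node, e->idx);
	}
	return 0;
}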

I'm for merging the hack for 2.6.23 and having one-zonelist-per-node
ready for 2.6.24. If there is much fear that the hack will persist for too
long, I'm ok with applying policies only to ZONE_MOVABLE when kernelcore=
is specified on the command line as one-zonelist-per-node can fix the same
problem. Ultimately if we agree on patches to eliminate policy_zone altogether,
the problem becomes moot as it no longer exists.

--
Mel Gorman
Part-time Phd Student Linux Technology Center
University of Limerick IBM Dublin Software Lab

2007-08-06 21:55:53

by mel

Subject: Re: [PATCH] Apply memory policies to top two highest zones when highest zone is ZONE_MOVABLE

On (06/08/07 22:31), Andi Kleen didst pronounce:
>
> > If correct, I would suggest merging the horrible hack for .23 then taking
> > it out when we merge "grouping pages by mobility". But what if we don't do
> > that merge?
>
> Or disable ZONE_MOVABLE until it is usable?

It's usable now. The issue with policies only occurs if the user specifies
kernelcore= or movablecore= on the command-line. Your language suggests
that you believe policies are not applied when ZONE_MOVABLE is configured
at build-time.

> I don't think we have the
> infrastructure to really use it anyways, so it shouldn't make too much difference
> in terms of features. And it's not that there is some sort of deadline
> around for it.
>
> Or mark it CONFIG_EXPERIMENTAL with a warning that it'll break NUMA. But disabling
> is probably better.
>

Saying it breaks NUMA is excessively strong language. It doesn't break
policies in that they still get applied to the highest zone. If kernelcore=
or movablecore= is not specified, the behaviour doesn't change.

> Then for .24 or .25 a better solution can be developed.
>

The better solution in my mind is to always filter the zonelist instead
of filtering only the MPOL_BIND zonelists as the hack does.

> I would prefer that instead of merging bandaid horrible hacks -- they have
> a tendency to stay around.

--
Mel Gorman
Part-time Phd Student Linux Technology Center
University of Limerick IBM Dublin Software Lab

2007-08-06 21:56:23

by Paul Jackson

Subject: Re: [PATCH] Apply memory policies to top two highest zones when highest zone is ZONE_MOVABLE

Christoph wrote, responding to Mel and Andrew:
> > +static inline int alloc_should_filter_zonelist(struct zonelist *zonelist)
> > +{
> > + return !zonelist->zlcache_ptr;
> > +}
>
> I guess Paul needs to have a look at this one.

...

> > Which Paul?
>
> Paul Jackson.


I'll ack that the above snippet of code, alloc_should_filter_zonelist(),
does, as its comment explains, return true iff it's a custom zonelist such
as from MPOL_BIND.

As to the more interesting issues that this patch raises ... I have no clue.

--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <[email protected]> 1.925.600.0401

2007-08-06 22:32:06

by Christoph Lameter

Subject: Re: [PATCH] Apply memory policies to top two highest zones when highest zone is ZONE_MOVABLE

On Mon, 6 Aug 2007, Mel Gorman wrote:

> > So where do we stand on this? We made a mess of NUMA policies, and merging
> > "grouping pages by mobility" would fix that mess, only we're not sure that
> > we want to merge those and it's too late for 2.6.23 anyway?
> >
>
> Grouping pages by mobility would still apply policies only to
> ZONE_MOVABLE when it is configured. What grouping pages by mobility
> would relieve is much of the motivation to configure ZONE_MOVABLE at all
> for hugepages. The zone has such attributes as being useful to

Ultimately ZONE_MOVABLE can be removed. AFAIK ZONE_MOVABLE is a temporary
stepping stone to address concerns about defrag reliability. Somehow
the stepping stone got into .23 without the real thing.

An additional issue with the current ZONE_MOVABLE in .23 is that the
tentative association of ZONE_MOVABLE with HIGHMEM also makes use of large
pages by SLUB not possible.

> There are patches in the works that change nodes from having multiple
> zonelists to having only one zonelist per node that is filtered based
> on the allocation flags. The place this filtering happens is the same as what
> the "hack" is currently doing. The cost of filtering should be offset by the
> reduced size of the node structure and tests with kernbench, hackbench and
> tbench seem to confirm that. This will bring the hack into line with
> what we wanted with policies in the first place because things like MPOL_BIND
> will try nodes in node-local order instead of node-numeric order as it does
> currently.

I'd like to see that patch.

> From there, we can eliminate policy_zone altogether by applying policies
> to all zones but forcing a situation where MPOL_BIND will always contain
> one node that GFP_KERNEL allocations can be satisfied from. For example,
> if I have a NUMAQ that only has ZONE_NORMAL on node 0 and a user tries to
> bind to nodes 2+3, they will really bind to nodes 0,2,3 so that GFP_KERNEL
> allocations on that process will not return NULL. Alternatively, we could
> have mbind return a failure if it doesn't include a node that can satisfy
> GFP_KERNEL allocations. Either of these options seems more sensible than
> sometimes applying policies and other times not applying them.

We would still need to check which zones are available on which nodes.
Zones that are not available on all nodes would need to be exempt from
policies. Maybe one could define an upper boundary of zones that are
policed? On NUMAQ zones up to ZONE_NORMAL would be under policy. On x86_64
this may only include ZONE_DMA. A similar thing would occur on ia64 with
the 4G DMA zone. Maybe policy_zone could become configurable?

> I'm for merging the hack for 2.6.23 and having one-zonelist-per-node
> ready for 2.6.24. If there is much fear that the hack will persist for too

Why not for .23? It does not seem to be too much code?

> long, I'm ok with applying policies only to ZONE_MOVABLE when kernelcore=
> is specified on the command line as one-zonelist-per-node can fix the same
> problem. Ultimately if we agree on patches to eliminate policy_zone altogether,
> the problem becomes moot as it no longer exists.

We cannot have a kernel release with broken mempolicy. We either need the
patch here or the one-zonelist patch for .23.

2007-08-06 22:57:32

by mel

Subject: Re: [PATCH] Apply memory policies to top two highest zones when highest zone is ZONE_MOVABLE

On (06/08/07 15:31), Christoph Lameter didst pronounce:
> On Mon, 6 Aug 2007, Mel Gorman wrote:
>
> > > So where do we stand on this? We made a mess of NUMA policies, and merging
> > > "grouping pages by mobility" would fix that mess, only we're not sure that
> > > we want to merge those and it's too late for 2.6.23 anyway?
> > >
> >
> > Grouping pages by mobility would still apply policies only to
> > ZONE_MOVABLE when it is configured. What grouping pages by mobility
> > would relieve is much of the motivation to configure ZONE_MOVABLE at all
> > for hugepages. The zone has such attributes as being useful to
>
> Ultimately ZONE_MOVABLE can be removed. AFAIK ZONE_MOVABLE is a temporary
> stepping stone to address concerns about defrag reliability. Somehow
> the stepping stone got into .23 without the real thing.
>
> An additional issue with the current ZONE_MOVABLE in .23 is that the
> tentative association of ZONE_MOVABLE with HIGHMEM also makes use of large
> pages by SLUB not possible.
>

Pretty much. The use of ZONE_MOVABLE really only applies to hugepages
and potentially memory hot-remove right now.

> > There are patches in the works that change nodes from having multiple
> > zonelists to having only one zonelist per node that is filtered based
> > on the allocation flags. The place this filtering happens is the same as what
> > the "hack" is currently doing. The cost of filtering should be offset by the
> > reduced size of the node structure and tests with kernbench, hackbench and
> > tbench seem to confirm that. This will bring the hack into line with
> > what we wanted with policies in the first place because things like MPOL_BIND
> > will try nodes in node-local order instead of node-numeric order as it does
> > currently.
>
> I'd like to see that patch.
>

I'll find the time to get it implemented this week. I've been
prioritising anything that looked like a bug recently so it languished
on the TODO pile.

> > From there, we can eliminate policy_zone altogether by applying policies
> > to all zones but forcing a situation where MPOL_BIND will always contain
> > one node that GFP_KERNEL allocations can be satisfied from. For example,
> > if I have a NUMAQ that only has ZONE_NORMAL on node 0 and a user tries to
> > bind to nodes 2+3, they will really bind to nodes 0,2,3 so that GFP_KERNEL
> > allocations on that process will not return NULL. Alternatively, we could
> > have mbind return a failure if it doesn't include a node that can satisfy
> > GFP_KERNEL allocations. Either of these options seems more sensible than
> > sometimes applying policies and other times not applying them.
>
> We would still need to check which zones are available on which nodes.
> Zones that are not available on all nodes would need to be exempt from
> policies. Maybe one could define an upper boundary of zones that are
> policed? On NUMAQ zones up to ZONE_NORMAL would be under policy. On x86_64
> this may only include ZONE_DMA. A similar thing would occur on ia64 with
> the 4G DMA zone. Maybe policy_zone could become configurable?
>

A sensible upper boundary would be whether GFP_KERNEL can be used on zones
within that node or not. gfp_zone(GFP_KERNEL) provides that sort of information
and would resolve to ZONE_NORMAL on NUMAQ, ZONE_DMA on ia64 etc. Perhaps
that would not work out for GFP_DMA32; I'm not 100% sure at the moment
(it's late and this is meant to be a holiday, sue me :) )
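
For reference, a much-simplified model of what gfp_zone() computes; the
real flag values and the config dependencies differ, this only shows the
shape of the mapping being relied on:

#include <stdio.h>

enum zone_type { ZONE_DMA, ZONE_NORMAL, ZONE_HIGHMEM, ZONE_MOVABLE };

/* Illustrative stand-ins for the real gfp flags */
#define __GFP_DMA	0x01u
#define __GFP_HIGHMEM	0x02u
#define GFP_KERNEL	0x10u		/* no zone modifier bits set */
#define GFP_HIGHUSER	(0x10u | __GFP_HIGHMEM)

/* Zone modifiers pick the zone; no modifier means "up to ZONE_NORMAL" */
static enum zone_type gfp_zone(unsigned int flags)
{
	if (flags & __GFP_DMA)
		return ZONE_DMA;
	if (flags & __GFP_HIGHMEM)
		return ZONE_HIGHMEM;
	return ZONE_NORMAL;
}

int main(void)
{
	/* The proposed boundary: police zones up to the highest zone a
	 * GFP_KERNEL allocation may use on a given configuration. */
	printf("policy boundary: %d (GFP_HIGHUSER reaches %d)\n",
	       gfp_zone(GFP_KERNEL), gfp_zone(GFP_HIGHUSER));
	return 0;
}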

This discussion is independent of one-zonelist-per-node which is the
stepping stone between where we are now and getting rid of policy_zone
altogether.

> > I'm for merging the hack for 2.6.23 and having one-zonelist-per-node
> > ready for 2.6.24. If there is much fear that the hack will persist for too
>
> Why not for .23? It does not seem to be too much code?
>

I'm working under the assumption that if it's not a bug-fix, you can't
get it in after the merge window closes. I've seen complaints before where
"bug-fixes" were adding features which one-zonelist-per-node may be preceived
by some people to be. Perhaps the rules will flex for this patch when it comes
out, perhaps not. I made the assumption that the least invasive bug-fix was
sensible outside of the merge window and that's what this hack is.

> > long, I'm ok with applying policies only to ZONE_MOVABLE when kernelcore=
> > is specified on the command line as one-zonelist-per-node can fix the same
> > problem. Ultimately if we agree on patches to eliminate policy_zone altogether,
> > the problem becomes moot as it no longer exists.
>
> We cannot have a kernel release with broken mempolicy. We either need the
> patch here or the one-zonelist patch for .23.

I'll get a sensible version of one-zonelist ASAP. Perhaps we'll end up
just going with that altogether if no performance issues are evident in
testing.

--
Mel Gorman
Part-time Phd Student Linux Technology Center
University of Limerick IBM Dublin Software Lab

2007-08-07 05:13:54

by Andrew Morton

Subject: Re: [PATCH] Apply memory policies to top two highest zones when highest zone is ZONE_MOVABLE

On Mon, 6 Aug 2007 22:55:41 +0100 [email protected] (Mel Gorman) wrote:

> On (06/08/07 22:31), Andi Kleen didst pronounce:
> >
> > > If correct, I would suggest merging the horrible hack for .23 then taking
> > > it out when we merge "grouping pages by mobility". But what if we don't do
> > > that merge?
> >
> > Or disable ZONE_MOVABLE until it is usable?
>
> It's usable now. The issue with policies only occurs if the user specifies
> kernelcore= or movablecore= on the command-line. Your language suggests
> that you believe policies are not applied when ZONE_MOVABLE is configured
> at build-time.

So.. the problem which we're fixing here is only present when someone
uses kernelcore=. This is in fact an argument for _not_ merging the
horrible-hack.

How commonly do we expect people to specify kernelcore=? If "not much" then
it isn't worth adding the __alloc_pages() overhead?

(It's a pretty darn small overhead, I must say)

2007-08-07 16:55:57

by mel

Subject: Re: [PATCH] Apply memory policies to top two highest zones when highest zone is ZONE_MOVABLE

On (06/08/07 22:12), Andrew Morton didst pronounce:
> On Mon, 6 Aug 2007 22:55:41 +0100 [email protected] (Mel Gorman) wrote:
>
> > On (06/08/07 22:31), Andi Kleen didst pronounce:
> > >
> > > > If correct, I would suggest merging the horrible hack for .23 then taking
> > > > it out when we merge "grouping pages by mobility". But what if we don't do
> > > > that merge?
> > >
> > > Or disable ZONE_MOVABLE until it is usable?
> >
> > It's usable now. The issue with policies only occurs if the user specifies
> > kernelcore= or movablecore= on the command-line. Your language suggests
> > that you believe policies are not applied when ZONE_MOVABLE is configured
> > at build-time.
>
> So.. the problem which we're fixing here is only present when someone
> uses kernelcore=. This is in fact an argument for _not_ merging the
> horrible-hack.
>

It's even more constrained than that. It only applies to the MPOL_BIND
policy when kernelcore= is specified. The other policies work the same
as they ever did.

> How commonly do we expect people to specify kernelcore=? If "not much" then
> it isn't worth adding the __alloc_pages() overhead?
>

For 2.6.23 at least, it'll be "not much". While I'm not keen on leaving
MPOL_BIND as it is for 2.6.23, we can postpone the final decision until
we've bashed the one-zonelist-per-node patches a bit and seen whether we want
to do that instead.

> (It's a pretty darn small overhead, I must say)

And it's simpler than the one-zonelist-per-node patches. The
current draft of the patch I'm working on looks something like:

arch/parisc/mm/init.c | 10 ++-
drivers/char/sysrq.c | 2
fs/buffer.c | 2
include/linux/gfp.h | 3 -
include/linux/mempolicy.h | 2
include/linux/mmzone.h | 42 +++++++++++++++
include/linux/swap.h | 2
mm/mempolicy.c | 6 +-
mm/mmzone.c | 28 ++++++++++
mm/oom_kill.c | 8 +--
mm/page_alloc.c | 122 +++++++++++++++++++++-------------------------
mm/slab.c | 11 ++--
mm/slub.c | 11 ++--
mm/vmscan.c | 16 +++---
14 files changed, 164 insertions(+), 101 deletions(-)

--
Mel Gorman
Part-time Phd Student Linux Technology Center
University of Limerick IBM Dublin Software Lab

2007-08-07 18:15:11

by Andrew Morton

Subject: Re: [PATCH] Apply memory policies to top two highest zones when highest zone is ZONE_MOVABLE

On Tue, 7 Aug 2007 17:55:47 +0100 [email protected] (Mel Gorman) wrote:

> On (06/08/07 22:12), Andrew Morton didst pronounce:
> > On Mon, 6 Aug 2007 22:55:41 +0100 [email protected] (Mel Gorman) wrote:
> >
> > > On (06/08/07 22:31), Andi Kleen didst pronounce:
> > > >
> > > > > If correct, I would suggest merging the horrible hack for .23 then taking
> > > > > it out when we merge "grouping pages by mobility". But what if we don't do
> > > > > that merge?
> > > >
> > > > Or disable ZONE_MOVABLE until it is usable?
> > >
> > > It's usable now. The issue with policies only occurs if the user specifies
> > > kernelcore= or movablecore= on the command-line. Your language suggests
> > > that you believe policies are not applied when ZONE_MOVABLE is configured
> > > at build-time.
> >
> > So.. the problem which we're fixing here is only present when someone
> > uses kernelcore=. This is in fact an argument for _not_ merging the
> > horrible-hack.
> >
>
> It's even more constrained than that. It only applies to the MPOL_BIND
> policy when kernelcore= is specified. The other policies work the same
> as they ever did.

so.. should we forget about merging the horrible-hack?

2007-08-07 20:37:41

by Christoph Lameter

Subject: Re: [PATCH] Apply memory policies to top two highest zones when highest zone is ZONE_MOVABLE

On Tue, 7 Aug 2007, Andrew Morton wrote:

> > It's even more constrained than that. It only applies to the MPOL_BIND
> > policy when kernelcore= is specified. The other policies work the same
> > as they ever did.
>
> so.. should we forget about merging the horrible-hack?

Support for MPOL_BIND is a basic NUMA feature. This is going to make .23
unusable for us if kernelcore= is used. If we cannot use kernelcore then
NUMA systems cannot use the features that depend on kernelcore.



2007-08-08 16:49:56

by mel

Subject: Re: [PATCH] Apply memory policies to top two highest zones when highest zone is ZONE_MOVABLE

On (07/08/07 11:14), Andrew Morton didst pronounce:
> On Tue, 7 Aug 2007 17:55:47 +0100 [email protected] (Mel Gorman) wrote:
>
> > On (06/08/07 22:12), Andrew Morton didst pronounce:
> > > On Mon, 6 Aug 2007 22:55:41 +0100 [email protected] (Mel Gorman) wrote:
> > >
> > > > On (06/08/07 22:31), Andi Kleen didst pronounce:
> > > > >
> > > > > > If correct, I would suggest merging the horrible hack for .23 then taking
> > > > > > it out when we merge "grouping pages by mobility". But what if we don't do
> > > > > > that merge?
> > > > >
> > > > > Or disable ZONE_MOVABLE until it is usable?
> > > >
> > > > It's usable now. The issue with policies only occurs if the user specifies
> > > > kernelcore= or movablecore= on the command-line. Your language suggests
> > > > that you believe policies are not applied when ZONE_MOVABLE is configured
> > > > at build-time.
> > >
> > > So.. the problem which we're fixing here is only present when someone
> > > uses kernelcore=. This is in fact an argument for _not_ merging the
> > > horrible-hack.
> > >
> >
> > It's even more constrained than that. It only applies to the MPOL_BIND
> > policy when kernelcore= is specified. The other policies work the same
> > as they ever did.
>
> so.. should we forget about merging the horrible-hack?

Despite a fairly specific case, I'd still like to get the problem fixed
for 2.6.23. I've posted up an alternative fix under the subject "Use one
zonelist per node instead of multiple zonelists v2". It's a more invasive
fix although arguably it's better overall than the hack because it's not
dealing with a specific special case and has a sensible path forward that
makes policies saner ultimately. However, unlike the hack it affects all
callers of the page allocator, so let's see what the reaction from reviewers
is before forgetting about the hack altogether.

--
Mel Gorman
Part-time Phd Student Linux Technology Center
University of Limerick IBM Dublin Software Lab

2007-08-08 17:03:55

by Christoph Lameter

Subject: Re: [PATCH] Apply memory policies to top two highest zones when highest zone is ZONE_MOVABLE

On Wed, 8 Aug 2007, Mel Gorman wrote:

> Despite a fairly specific case, I'd still like to get the problem fixed
> for 2.6.23. I've posted up an alternative fix under the subject "Use one
> zonelist per node instead of multiple zonelists v2". It's a more invasive
> fix although arguably it's better overall than the hack because it's not
> dealing with a specific special case and has a sensible path forward that
> makes policies saner ultimately. However, unlike the hack it affects all
> callers of the page allocator, so let's see what the reaction from reviewers
> is before forgetting about the hack altogether.

Well, it's more a precursor to things to come than a hack. If the tests go
well for the single zonelist scheme then we may end up doing what the
patch did for the mpol_bind zonelists for all allocations.