2007-09-11 21:30:26

by Mel Gorman

Subject: [PATCH 0/6] Use one zonelist per node instead of multiple zonelists v5 (resend)

(Sorry for the resend, I mucked up the TO: line in the earlier sending)

This is the latest version of one-zonelist and it should be solid enough
for wider testing. To briefly summarise, the patchset replaces multiple
zonelists-per-node with one zonelist that is filtered based on nodemask and
GFP flags. I've dropped the patch that replaces inline functions with macros
from the end as it obscures the code for something that may or may not be a
performance benefit on older compilers. If we see performance regressions that
might have something to do with it, the patch is trivial to bring forward.

Andrew, please merge to -mm for wider testing and consideration for merging
to mainline. Minimally, it gets rid of the hack in relation to ZONE_MOVABLE
and MPOL_BIND.

Changelog since V5
o Rebase to 2.6.23-rc4-mm1
o Drop patch that replaces inline functions with macros

Changelog since V4
o Rebase to -mm kernel. Host of memoryless patches collisions dealt with
o Do not call wakeup_kswapd() for every zone in a zonelist
o Dropped the FASTCALL removal
o Have cursor in iterator advance earlier
o Use nodes_and in cpuset_nodes_valid_mems_allowed()
o Use defines instead of inlines, noticeably better performance on gcc-3.4
No difference on later compilers such as gcc 4.1
o Dropped gfp_skip patch until it is proven to be of benefit. Tests are
currently inconclusive but it definitely consumes at least one cache
line

Changelog since V3
o Fix compile error in the parisc change
o Calculate gfp_zone only once in __alloc_pages
o Calculate classzone_idx properly in get_page_from_freelist
o Alter check so that zone id embedded may still be used on UP
o Use Kamezawa-san's suggestion for skipping zones in zonelist
o Add __alloc_pages_nodemask() to filter zonelist based on a nodemask. This
removes the need for MPOL_BIND to have a custom zonelist
o Move zonelist iterators and helpers to mm.h
o Change _zones from struct zone * to unsigned long

Changelog since V2
o shrink_zones() uses zonelist instead of zonelist->zones
o hugetlb uses zonelist iterator
o zone_idx information is embedded in zonelist pointers
o replace NODE_DATA(nid)->node_zonelist with node_zonelist(nid)

Changelog since V1
o Break up the patch into 3 patches
o Introduce iterators for zonelists
o Performance regression test

The following patches replace multiple zonelists per node with one zonelist
that is filtered based on the GFP flags. The patches as a set fix a bug
with regard to the use of MPOL_BIND and ZONE_MOVABLE. With this patchset,
the MPOL_BIND will apply to the two highest zones when the highest zone
is ZONE_MOVABLE. This should be considered as an alternative fix for the
MPOL_BIND+ZONE_MOVABLE in 2.6.23 to the previously discussed hack that
filters only custom zonelists. As a bonus, the patchset reduces the cache
footprint of the kernel and should improve performance in a number of cases.

The first patch cleans up an inconsistency where direct reclaim uses
zonelist->zones where other places use zonelist. The second patch introduces
a helper function node_zonelist() for looking up the appropriate zonelist
for a GFP mask which simplifies patches later in the set.

The third patch replaces multiple zonelists with two zonelists that are
filtered. The two zonelists are due to the fact that the memoryless patchset
introduces a second set of zonelists for __GFP_THISNODE.
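
As a rough illustration of filtering by zone index, here is a minimal
userspace sketch. It is not the kernel code: the struct layouts, the idx
field and the names next_allowed() and count_allowed() are assumptions made
up for this example. The one property it relies on is that an allocation may
use any zone at or below the highest zone index its GFP flags allow.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical userspace model; these are not the kernel's definitions. */
enum zone_type { ZONE_DMA, ZONE_NORMAL, ZONE_HIGHMEM, ZONE_MOVABLE, MAX_NR_ZONES };

struct zone {
	enum zone_type idx;	/* stands in for zone_idx(zone) */
	int nid;		/* node the zone belongs to */
};

#define MAX_ZONES_PER_ZONELIST 16

struct zonelist {
	struct zone *zones[MAX_ZONES_PER_ZONELIST + 1];	/* NULL terminated */
};

/* Skip entries above highest_zoneidx; returns the first usable slot. */
static struct zone **next_allowed(struct zone **z, enum zone_type highest_zoneidx)
{
	while (*z && (*z)->idx > highest_zoneidx)
		z++;
	return z;
}

/* Count the zones that an allocation limited to highest_zoneidx may use. */
static int count_allowed(struct zonelist *zl, enum zone_type highest_zoneidx)
{
	struct zone **z;
	int n = 0;

	assert(zl != NULL);
	for (z = next_allowed(zl->zones, highest_zoneidx); *z;
	     z = next_allowed(z + 1, highest_zoneidx))
		n++;
	return n;
}
```

With a list ordered HIGHMEM, NORMAL, DMA, a request limited to ZONE_NORMAL
skips the first entry and walks the remaining two, so a single list can
serve every zone type.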

The fourth patch introduces filtering of the zonelists based on a nodemask.

The final patch replaces the two zonelists with one zonelist. A nodemask is
created when __GFP_THISNODE is specified to filter the list. The nodelists
could be pre-allocated with one-per-node but it's not clear that __GFP_THISNODE
is used often enough to be worth the effort.
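
The nodemask filtering can be sketched the same way. This is again a
userspace model under stated assumptions, not the patch itself: a plain
unsigned long stands in for nodemask_t, and thisnode_mask() and
first_zone_filtered() are hypothetical names. It only shows how a
single-node mask, built when __GFP_THISNODE is requested, could restrict a
walk over the one shared list.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical userspace model; a plain bitmask stands in for nodemask_t. */
typedef unsigned long nodemask_model_t;

struct zone {
	int idx;	/* zone index, higher means less restricted memory */
	int nid;	/* owning node */
};

/* Mask containing only one node, as a __GFP_THISNODE request would need. */
static nodemask_model_t thisnode_mask(int nid)
{
	return 1UL << nid;
}

/*
 * Return the first zone in a NULL-terminated list that is at or below
 * high_idx and whose node is set in nodes. A NULL nodes pointer means
 * "no node restriction".
 */
static struct zone *first_zone_filtered(struct zone **zones, int high_idx,
					const nodemask_model_t *nodes)
{
	assert(zones != NULL);
	for (; *zones; zones++) {
		if ((*zones)->idx > high_idx)
			continue;	/* zone type not allowed by the GFP flags */
		if (nodes && !(*nodes & (1UL << (*zones)->nid)))
			continue;	/* node not present in the nodemask */
		return *zones;
	}
	return NULL;
}
```

Passing NULL for the mask gives the ordinary fallback behaviour, while a
mask from thisnode_mask() limits the walk to zones on one node.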

Performance results varied depending on the machine configuration but were
usually small performance gains. In real workloads the gain/loss will depend
on how much the userspace portion of the benchmark benefits from having more
cache available due to reduced referencing of zonelists.

These are the ranges of performance losses/gains when running against
2.6.23-rc3-mm1. The set of test machines is a mix of i386, x86_64 and
ppc64, both NUMA and non-NUMA.

Total CPU time on Kernbench: -0.67% to 3.05%
Elapsed time on Kernbench: -0.25% to 2.96%
page_test from aim9: -6.98% to 5.60%
brk_test from aim9: -3.94% to 4.11%
fork_test from aim9: -5.72% to 4.14%
exec_test from aim9: -1.02% to 1.56%

The TBench figures were too variable between runs to draw conclusions from but
there didn't appear to be any regressions there. The hackbench results for both
sockets and pipes were within noise.

--
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab


2007-09-11 21:30:45

by Mel Gorman

Subject: [PATCH 1/6] Use zonelists instead of zones when direct reclaiming pages


The allocator deals with zonelists which indicate the order in which zones
should be targeted for an allocation. Similarly, direct reclaim of pages
iterates over an array of zones. For consistency, this patch converts direct
reclaim to use a zonelist. No functionality is changed by this patch. This
simplifies zonelist iterators in the next patch.

Signed-off-by: Mel Gorman <[email protected]>
Acked-by: Christoph Lameter <[email protected]>
---

include/linux/swap.h | 2 +-
mm/page_alloc.c | 2 +-
mm/vmscan.c | 13 ++++++++-----
3 files changed, 10 insertions(+), 7 deletions(-)

diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.23-rc4-mm1-fix-pcnet32/include/linux/swap.h linux-2.6.23-rc4-mm1-005_freepages_zonelist/include/linux/swap.h
--- linux-2.6.23-rc4-mm1-fix-pcnet32/include/linux/swap.h 2007-09-10 09:29:14.000000000 +0100
+++ linux-2.6.23-rc4-mm1-005_freepages_zonelist/include/linux/swap.h 2007-09-10 16:06:06.000000000 +0100
@@ -189,7 +189,7 @@ extern int rotate_reclaimable_page(struc
extern void swap_setup(void);

/* linux/mm/vmscan.c */
-extern unsigned long try_to_free_pages(struct zone **zones, int order,
+extern unsigned long try_to_free_pages(struct zonelist *zonelist, int order,
gfp_t gfp_mask);
extern unsigned long try_to_free_mem_container_pages(struct mem_container *mem);
extern int __isolate_lru_page(struct page *page, int mode);
diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.23-rc4-mm1-fix-pcnet32/mm/page_alloc.c linux-2.6.23-rc4-mm1-005_freepages_zonelist/mm/page_alloc.c
--- linux-2.6.23-rc4-mm1-fix-pcnet32/mm/page_alloc.c 2007-09-10 09:29:14.000000000 +0100
+++ linux-2.6.23-rc4-mm1-005_freepages_zonelist/mm/page_alloc.c 2007-09-10 16:06:06.000000000 +0100
@@ -1667,7 +1667,7 @@ nofail_alloc:
reclaim_state.reclaimed_slab = 0;
p->reclaim_state = &reclaim_state;

- did_some_progress = try_to_free_pages(zonelist->zones, order, gfp_mask);
+ did_some_progress = try_to_free_pages(zonelist, order, gfp_mask);

p->reclaim_state = NULL;
p->flags &= ~PF_MEMALLOC;
diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.23-rc4-mm1-fix-pcnet32/mm/vmscan.c linux-2.6.23-rc4-mm1-005_freepages_zonelist/mm/vmscan.c
--- linux-2.6.23-rc4-mm1-fix-pcnet32/mm/vmscan.c 2007-09-10 09:29:14.000000000 +0100
+++ linux-2.6.23-rc4-mm1-005_freepages_zonelist/mm/vmscan.c 2007-09-10 16:06:06.000000000 +0100
@@ -1207,10 +1207,11 @@ static unsigned long shrink_zone(int pri
* If a zone is deemed to be full of pinned pages then just give it a light
* scan then give up on it.
*/
-static unsigned long shrink_zones(int priority, struct zone **zones,
+static unsigned long shrink_zones(int priority, struct zonelist *zonelist,
struct scan_control *sc)
{
unsigned long nr_reclaimed = 0;
+ struct zone **zones = zonelist->zones;
int i;

sc->all_unreclaimable = 1;
@@ -1248,7 +1249,7 @@ static unsigned long shrink_zones(int pr
* holds filesystem locks which prevent writeout this might not work, and the
* allocation attempt will fail.
*/
-unsigned long do_try_to_free_pages(struct zone **zones, gfp_t gfp_mask,
+unsigned long do_try_to_free_pages(struct zonelist *zonelist, gfp_t gfp_mask,
struct scan_control *sc)
{
int priority;
@@ -1257,6 +1258,7 @@ unsigned long do_try_to_free_pages(struc
unsigned long nr_reclaimed = 0;
struct reclaim_state *reclaim_state = current->reclaim_state;
unsigned long lru_pages = 0;
+ struct zone **zones = zonelist->zones;
int i;

count_vm_event(ALLOCSTALL);
@@ -1275,7 +1277,7 @@ unsigned long do_try_to_free_pages(struc
sc->nr_scanned = 0;
if (!priority)
disable_swap_token();
- nr_reclaimed += shrink_zones(priority, zones, sc);
+ nr_reclaimed += shrink_zones(priority, zonelist, sc);
/*
* Don't shrink slabs when reclaiming memory from
* over limit containers
@@ -1333,7 +1335,8 @@ out:
return ret;
}

-unsigned long try_to_free_pages(struct zone **zones, int order, gfp_t gfp_mask)
+unsigned long try_to_free_pages(struct zonelist *zonelist, int order,
+ gfp_t gfp_mask)
{
struct scan_control sc = {
.gfp_mask = gfp_mask,
@@ -1346,7 +1349,7 @@ unsigned long try_to_free_pages(struct z
.isolate_pages = isolate_pages_global,
};

- return do_try_to_free_pages(zones, gfp_mask, &sc);
+ return do_try_to_free_pages(zonelist, gfp_mask, &sc);
}

#ifdef CONFIG_CONTAINER_MEM_CONT

2007-09-11 21:31:14

by Mel Gorman

Subject: [PATCH 2/6] Introduce node_zonelist() for accessing the zonelist for a GFP mask


This patch introduces a node_zonelist() helper function. It is used to look up
the appropriate zonelist given a node and a GFP mask. The patch on its own is
a cleanup but it helps clarify parts of the one-zonelist-per-node patchset. If
necessary, it can be merged with the next patch in this set without problems.

Signed-off-by: Mel Gorman <[email protected]>
---

drivers/char/sysrq.c | 3 +--
fs/buffer.c | 6 +++---
include/linux/gfp.h | 10 +++++++---
include/linux/mempolicy.h | 2 +-
mm/mempolicy.c | 6 +++---
mm/page_alloc.c | 3 +--
mm/slab.c | 3 +--
mm/slub.c | 3 +--
8 files changed, 18 insertions(+), 18 deletions(-)

diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.23-rc4-mm1-005_freepages_zonelist/drivers/char/sysrq.c linux-2.6.23-rc4-mm1-007_node_zonelist/drivers/char/sysrq.c
--- linux-2.6.23-rc4-mm1-005_freepages_zonelist/drivers/char/sysrq.c 2007-09-10 09:29:11.000000000 +0100
+++ linux-2.6.23-rc4-mm1-007_node_zonelist/drivers/char/sysrq.c 2007-09-10 16:06:13.000000000 +0100
@@ -270,8 +270,7 @@ static struct sysrq_key_op sysrq_term_op

static void moom_callback(struct work_struct *ignored)
{
- out_of_memory(&NODE_DATA(0)->node_zonelists[ZONE_NORMAL],
- GFP_KERNEL, 0);
+ out_of_memory(node_zonelist(0, GFP_KERNEL), GFP_KERNEL, 0);
}

static DECLARE_WORK(moom_work, moom_callback);
diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.23-rc4-mm1-005_freepages_zonelist/fs/buffer.c linux-2.6.23-rc4-mm1-007_node_zonelist/fs/buffer.c
--- linux-2.6.23-rc4-mm1-005_freepages_zonelist/fs/buffer.c 2007-09-10 09:29:13.000000000 +0100
+++ linux-2.6.23-rc4-mm1-007_node_zonelist/fs/buffer.c 2007-09-10 16:06:13.000000000 +0100
@@ -369,13 +369,13 @@ void invalidate_bdev(struct block_device
static void free_more_memory(void)
{
struct zone **zones;
- pg_data_t *pgdat;
+ int nid;

wakeup_pdflush(1024);
yield();

- for_each_online_pgdat(pgdat) {
- zones = pgdat->node_zonelists[gfp_zone(GFP_NOFS)].zones;
+ for_each_online_node(nid) {
+ zones = node_zonelist(nid, GFP_NOFS);
if (*zones)
try_to_free_pages(zones, 0, GFP_NOFS);
}
diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.23-rc4-mm1-005_freepages_zonelist/include/linux/gfp.h linux-2.6.23-rc4-mm1-007_node_zonelist/include/linux/gfp.h
--- linux-2.6.23-rc4-mm1-005_freepages_zonelist/include/linux/gfp.h 2007-09-10 09:29:13.000000000 +0100
+++ linux-2.6.23-rc4-mm1-007_node_zonelist/include/linux/gfp.h 2007-09-10 16:06:13.000000000 +0100
@@ -159,11 +159,16 @@ static inline gfp_t set_migrateflags(gfp

/*
* We get the zone list from the current node and the gfp_mask.
- * This zone list contains a maximum of MAXNODES*MAX_NR_ZONES zones.
+ * This zonelist contains two zonelists, one for all zones with memory and
+ * one containing just zones from the node the zonelist belongs to
*
* For the normal case of non-DISCONTIGMEM systems the NODE_DATA() gets
* optimized to &contig_page_data at compile-time.
*/
+static inline struct zonelist *node_zonelist(int nid, gfp_t flags)
+{
+ return NODE_DATA(nid)->node_zonelists + gfp_zonelist(flags);
+}

#ifndef HAVE_ARCH_FREE_PAGE
static inline void arch_free_page(struct page *page, int order) { }
@@ -185,8 +190,7 @@ static inline struct page *alloc_pages_n
if (nid < 0)
nid = numa_node_id();

- return __alloc_pages(gfp_mask, order,
- NODE_DATA(nid)->node_zonelists + gfp_zone(gfp_mask));
+ return __alloc_pages(gfp_mask, order, node_zonelist(nid, gfp_mask));
}

#ifdef CONFIG_NUMA
diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.23-rc4-mm1-005_freepages_zonelist/include/linux/mempolicy.h linux-2.6.23-rc4-mm1-007_node_zonelist/include/linux/mempolicy.h
--- linux-2.6.23-rc4-mm1-005_freepages_zonelist/include/linux/mempolicy.h 2007-09-10 09:29:13.000000000 +0100
+++ linux-2.6.23-rc4-mm1-007_node_zonelist/include/linux/mempolicy.h 2007-09-10 16:06:13.000000000 +0100
@@ -240,7 +240,7 @@ static inline void mpol_fix_fork_child_f
static inline struct zonelist *huge_zonelist(struct vm_area_struct *vma,
unsigned long addr, gfp_t gfp_flags)
{
- return NODE_DATA(0)->node_zonelists + gfp_zone(gfp_flags);
+ return node_zonelist(0, gfp_flags);
}

static inline int do_migrate_pages(struct mm_struct *mm,
diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.23-rc4-mm1-005_freepages_zonelist/mm/mempolicy.c linux-2.6.23-rc4-mm1-007_node_zonelist/mm/mempolicy.c
--- linux-2.6.23-rc4-mm1-005_freepages_zonelist/mm/mempolicy.c 2007-09-10 09:29:14.000000000 +0100
+++ linux-2.6.23-rc4-mm1-007_node_zonelist/mm/mempolicy.c 2007-09-10 16:06:13.000000000 +0100
@@ -1130,7 +1130,7 @@ static struct zonelist *zonelist_policy(
nd = 0;
BUG();
}
- return NODE_DATA(nd)->node_zonelists + gfp_zone(gfp);
+ return node_zonelist(nd, gfp);
}

/* Do dynamic interleaving for a process */
@@ -1226,7 +1226,7 @@ struct zonelist *huge_zonelist(struct vm
unsigned nid;

nid = interleave_nid(pol, vma, addr, HPAGE_SHIFT);
- return NODE_DATA(nid)->node_zonelists + gfp_zone(gfp_flags);
+ return node_zonelist(nid, gfp_flags);
}
return zonelist_policy(GFP_HIGHUSER, pol);
}
@@ -1240,7 +1240,7 @@ static struct page *alloc_page_interleav
struct zonelist *zl;
struct page *page;

- zl = NODE_DATA(nid)->node_zonelists + gfp_zone(gfp);
+ zl = node_zonelist(nid, gfp);
page = __alloc_pages(gfp, order, zl);
if (page && page_zone(page) == zl->zones[0])
inc_zone_page_state(page, NUMA_INTERLEAVE_HIT);
diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.23-rc4-mm1-005_freepages_zonelist/mm/page_alloc.c linux-2.6.23-rc4-mm1-007_node_zonelist/mm/page_alloc.c
--- linux-2.6.23-rc4-mm1-005_freepages_zonelist/mm/page_alloc.c 2007-09-10 16:06:06.000000000 +0100
+++ linux-2.6.23-rc4-mm1-007_node_zonelist/mm/page_alloc.c 2007-09-10 16:06:13.000000000 +0100
@@ -1805,10 +1805,9 @@ EXPORT_SYMBOL(free_pages);
static unsigned int nr_free_zone_pages(int offset)
{
/* Just pick one node, since fallback list is circular */
- pg_data_t *pgdat = NODE_DATA(numa_node_id());
unsigned int sum = 0;

- struct zonelist *zonelist = pgdat->node_zonelists + offset;
+ struct zonelist *zonelist = node_zonelist(numa_node_id(), GFP_KERNEL);
struct zone **zonep = zonelist->zones;
struct zone *zone;

diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.23-rc4-mm1-005_freepages_zonelist/mm/slab.c linux-2.6.23-rc4-mm1-007_node_zonelist/mm/slab.c
--- linux-2.6.23-rc4-mm1-005_freepages_zonelist/mm/slab.c 2007-09-10 09:29:14.000000000 +0100
+++ linux-2.6.23-rc4-mm1-007_node_zonelist/mm/slab.c 2007-09-10 16:06:13.000000000 +0100
@@ -3248,8 +3248,7 @@ static void *fallback_alloc(struct kmem_
if (flags & __GFP_THISNODE)
return NULL;

- zonelist = &NODE_DATA(slab_node(current->mempolicy))
- ->node_zonelists[gfp_zone(flags)];
+ zonelist = node_zonelist(slab_node(current->mempolicy), flags);
local_flags = flags & (GFP_CONSTRAINT_MASK|GFP_RECLAIM_MASK);

retry:
diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.23-rc4-mm1-005_freepages_zonelist/mm/slub.c linux-2.6.23-rc4-mm1-007_node_zonelist/mm/slub.c
--- linux-2.6.23-rc4-mm1-005_freepages_zonelist/mm/slub.c 2007-09-10 09:29:14.000000000 +0100
+++ linux-2.6.23-rc4-mm1-007_node_zonelist/mm/slub.c 2007-09-10 16:06:13.000000000 +0100
@@ -1281,8 +1281,7 @@ static struct page *get_any_partial(stru
if (!s->defrag_ratio || get_cycles() % 1024 > s->defrag_ratio)
return NULL;

- zonelist = &NODE_DATA(slab_node(current->mempolicy))
- ->node_zonelists[gfp_zone(flags)];
+ zonelist = node_zonelist(slab_node(current->mempolicy), flags);
for (z = zonelist->zones; *z; z++) {
struct kmem_cache_node *n;

2007-09-11 21:31:38

by Mel Gorman

Subject: [PATCH 3/6] Use two zonelist that are filtered by GFP mask


Currently a node has a number of zonelists, one for each zone type in the
system and a second set for THISNODE allocations. Based on the zones allowed
by a gfp mask, one of these zonelists is selected. All of these zonelists
occupy memory and consume cache lines.

This patch replaces the multiple zonelists per-node with two zonelists. The
first contains all populated zones in the system and the second contains all
populated zones in the node suitable for GFP_THISNODE allocations. An iterator
macro called for_each_zone_zonelist() is introduced that iterates through each
zone in the zonelist that is allowed by the GFP flags.

Signed-off-by: Mel Gorman <[email protected]>
Acked-by: Christoph Lameter <[email protected]>
---

arch/parisc/mm/init.c | 11 +-
fs/buffer.c | 6 +
include/linux/gfp.h | 29 ++++---
include/linux/mmzone.h | 65 +++++++++++-----
mm/hugetlb.c | 8 +-
mm/oom_kill.c | 8 +-
mm/page_alloc.c | 169 +++++++++++++++++++-------------------------
mm/slab.c | 8 +-
mm/slub.c | 8 +-
mm/vmscan.c | 20 ++---
10 files changed, 171 insertions(+), 161 deletions(-)

diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.23-rc4-mm1-007_node_zonelist/arch/parisc/mm/init.c linux-2.6.23-rc4-mm1-010_use_two_zonelists/arch/parisc/mm/init.c
--- linux-2.6.23-rc4-mm1-007_node_zonelist/arch/parisc/mm/init.c 2007-08-28 02:32:35.000000000 +0100
+++ linux-2.6.23-rc4-mm1-010_use_two_zonelists/arch/parisc/mm/init.c 2007-09-10 16:06:22.000000000 +0100
@@ -599,15 +599,18 @@ void show_mem(void)
#ifdef CONFIG_DISCONTIGMEM
{
struct zonelist *zl;
- int i, j, k;
+ int i, j;

for (i = 0; i < npmem_ranges; i++) {
+ zl = node_zonelist(i);
for (j = 0; j < MAX_NR_ZONES; j++) {
- zl = NODE_DATA(i)->node_zonelists + j;
+ struct zone **z;
+ struct zone *zone;

printk("Zone list for zone %d on node %d: ", j, i);
- for (k = 0; zl->zones[k] != NULL; k++)
- printk("[%ld/%s] ", zone_to_nid(zl->zones[k]), zl->zones[k]->name);
+ for_each_zone_zonelist(zone, z, zl, j)
+ printk("[%d/%s] ", zone_to_nid(zone),
+ zone->name);
printk("\n");
}
}
diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.23-rc4-mm1-007_node_zonelist/fs/buffer.c linux-2.6.23-rc4-mm1-010_use_two_zonelists/fs/buffer.c
--- linux-2.6.23-rc4-mm1-007_node_zonelist/fs/buffer.c 2007-09-10 16:06:13.000000000 +0100
+++ linux-2.6.23-rc4-mm1-010_use_two_zonelists/fs/buffer.c 2007-09-10 16:06:22.000000000 +0100
@@ -375,9 +375,11 @@ static void free_more_memory(void)
yield();

for_each_online_node(nid) {
- zones = node_zonelist(nid, GFP_NOFS);
+ zones = first_zones_zonelist(node_zonelist(nid, GFP_NOFS),
+ gfp_zone(GFP_NOFS));
if (*zones)
- try_to_free_pages(zones, 0, GFP_NOFS);
+ try_to_free_pages(node_zonelist(nid, GFP_NOFS), 0,
+ GFP_NOFS);
}
}

diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.23-rc4-mm1-007_node_zonelist/include/linux/gfp.h linux-2.6.23-rc4-mm1-010_use_two_zonelists/include/linux/gfp.h
--- linux-2.6.23-rc4-mm1-007_node_zonelist/include/linux/gfp.h 2007-09-10 16:06:13.000000000 +0100
+++ linux-2.6.23-rc4-mm1-010_use_two_zonelists/include/linux/gfp.h 2007-09-10 16:06:22.000000000 +0100
@@ -119,29 +119,22 @@ static inline int allocflags_to_migratet

static inline enum zone_type gfp_zone(gfp_t flags)
{
- int base = 0;
-
-#ifdef CONFIG_NUMA
- if (flags & __GFP_THISNODE)
- base = MAX_NR_ZONES;
-#endif
-
#ifdef CONFIG_ZONE_DMA
if (flags & __GFP_DMA)
- return base + ZONE_DMA;
+ return ZONE_DMA;
#endif
#ifdef CONFIG_ZONE_DMA32
if (flags & __GFP_DMA32)
- return base + ZONE_DMA32;
+ return ZONE_DMA32;
#endif
if ((flags & (__GFP_HIGHMEM | __GFP_MOVABLE)) ==
(__GFP_HIGHMEM | __GFP_MOVABLE))
- return base + ZONE_MOVABLE;
+ return ZONE_MOVABLE;
#ifdef CONFIG_HIGHMEM
if (flags & __GFP_HIGHMEM)
- return base + ZONE_HIGHMEM;
+ return ZONE_HIGHMEM;
#endif
- return base + ZONE_NORMAL;
+ return ZONE_NORMAL;
}

static inline gfp_t set_migrateflags(gfp_t gfp, gfp_t migrate_flags)
@@ -157,6 +150,18 @@ static inline gfp_t set_migrateflags(gfp
* virtual kernel addresses to the allocated page(s).
*/

+static inline enum zone_type gfp_zonelist(gfp_t flags)
+{
+ int base = 0;
+
+#ifdef CONFIG_NUMA
+ if (flags & __GFP_THISNODE)
+ base = 1;
+#endif
+
+ return base;
+}
+
/*
* We get the zone list from the current node and the gfp_mask.
* This zonelist contains two zonelists, one for all zones with memory and
diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.23-rc4-mm1-007_node_zonelist/include/linux/mmzone.h linux-2.6.23-rc4-mm1-010_use_two_zonelists/include/linux/mmzone.h
--- linux-2.6.23-rc4-mm1-007_node_zonelist/include/linux/mmzone.h 2007-09-10 09:29:13.000000000 +0100
+++ linux-2.6.23-rc4-mm1-010_use_two_zonelists/include/linux/mmzone.h 2007-09-10 16:06:22.000000000 +0100
@@ -361,10 +361,10 @@ struct zone {
* The NUMA zonelists are doubled becausse we need zonelists that restrict the
* allocations to a single node for GFP_THISNODE.
*
- * [0 .. MAX_NR_ZONES -1] : Zonelists with fallback
- * [MAZ_NR_ZONES ... MAZ_ZONELISTS -1] : No fallback (GFP_THISNODE)
+ * [0] : Zonelist with fallback
+ * [1] : No fallback (GFP_THISNODE)
*/
-#define MAX_ZONELISTS (2 * MAX_NR_ZONES)
+#define MAX_ZONELISTS 2


/*
@@ -432,7 +432,7 @@ struct zonelist_cache {
unsigned long last_full_zap; /* when last zap'd (jiffies) */
};
#else
-#define MAX_ZONELISTS MAX_NR_ZONES
+#define MAX_ZONELISTS 1
struct zonelist_cache;
#endif

@@ -454,24 +454,6 @@ struct zonelist {
#endif
};

-#ifdef CONFIG_NUMA
-/*
- * Only custom zonelists like MPOL_BIND need to be filtered as part of
- * policies. As described in the comment for struct zonelist_cache, these
- * zonelists will not have a zlcache so zlcache_ptr will not be set. Use
- * that to determine if the zonelists needs to be filtered or not.
- */
-static inline int alloc_should_filter_zonelist(struct zonelist *zonelist)
-{
- return !zonelist->zlcache_ptr;
-}
-#else
-static inline int alloc_should_filter_zonelist(struct zonelist *zonelist)
-{
- return 0;
-}
-#endif /* CONFIG_NUMA */
-
#ifdef CONFIG_ARCH_POPULATES_NODE_MAP
struct node_active_region {
unsigned long start_pfn;
@@ -700,6 +682,45 @@ extern struct zone *next_zone(struct zon
zone; \
zone = next_zone(zone))

+/* Returns the first zone at or below highest_zoneidx in a zonelist */
+static inline struct zone **first_zones_zonelist(struct zonelist *zonelist,
+ enum zone_type highest_zoneidx)
+{
+ struct zone **z;
+
+ for (z = zonelist->zones;
+ *z && zone_idx(*z) > highest_zoneidx;
+ z++)
+ ;
+
+ return z;
+}
+
+/* Returns the next zone at or below highest_zoneidx in a zonelist */
+static inline struct zone **next_zones_zonelist(struct zone **z,
+ enum zone_type highest_zoneidx)
+{
+ /* Find the next suitable zone to use for the allocation */
+ for (; *z && zone_idx(*z) > highest_zoneidx; z++)
+ ;
+
+ return z;
+}
+
+/**
+ * for_each_zone_zonelist - helper macro to iterate over valid zones in a zonelist at or below a given zone index
+ * @zone - The current zone in the iterator
+ * @z - The current pointer within zonelist->zones being iterated
+ * @zlist - The zonelist being iterated
+ * @highidx - The zone index of the highest zone to return
+ *
+ * This iterator iterates though all zones at or below a given zone index.
+ */
+#define for_each_zone_zonelist(zone, z, zlist, highidx) \
+ for (z = first_zones_zonelist(zlist, highidx), zone = *z++; \
+ zone; \
+ z = next_zones_zonelist(z, highidx), zone = *z++)
+
#ifdef CONFIG_SPARSEMEM
#include <asm/sparsemem.h>
#endif
diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.23-rc4-mm1-007_node_zonelist/mm/hugetlb.c linux-2.6.23-rc4-mm1-010_use_two_zonelists/mm/hugetlb.c
--- linux-2.6.23-rc4-mm1-007_node_zonelist/mm/hugetlb.c 2007-09-10 09:29:14.000000000 +0100
+++ linux-2.6.23-rc4-mm1-010_use_two_zonelists/mm/hugetlb.c 2007-09-10 16:06:22.000000000 +0100
@@ -73,11 +73,11 @@ static struct page *dequeue_huge_page(st
struct page *page = NULL;
struct zonelist *zonelist = huge_zonelist(vma, address,
htlb_alloc_mask);
- struct zone **z;
+ struct zone *zone, **z;

- for (z = zonelist->zones; *z; z++) {
- nid = zone_to_nid(*z);
- if (cpuset_zone_allowed_softwall(*z, htlb_alloc_mask) &&
+ for_each_zone_zonelist(zone, z, zonelist, MAX_NR_ZONES - 1) {
+ nid = zone_to_nid(zone);
+ if (cpuset_zone_allowed_softwall(zone, htlb_alloc_mask) &&
!list_empty(&hugepage_freelists[nid])) {
page = list_entry(hugepage_freelists[nid].next,
struct page, lru);
diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.23-rc4-mm1-007_node_zonelist/mm/oom_kill.c linux-2.6.23-rc4-mm1-010_use_two_zonelists/mm/oom_kill.c
--- linux-2.6.23-rc4-mm1-007_node_zonelist/mm/oom_kill.c 2007-09-10 09:29:14.000000000 +0100
+++ linux-2.6.23-rc4-mm1-010_use_two_zonelists/mm/oom_kill.c 2007-09-10 16:06:22.000000000 +0100
@@ -185,12 +185,14 @@ unsigned long badness(struct task_struct
static inline int constrained_alloc(struct zonelist *zonelist, gfp_t gfp_mask)
{
#ifdef CONFIG_NUMA
+ struct zone *zone;
struct zone **z;
+ enum zone_type high_zoneidx = gfp_zone(gfp_mask);
nodemask_t nodes = node_states[N_HIGH_MEMORY];

- for (z = zonelist->zones; *z; z++)
- if (cpuset_zone_allowed_softwall(*z, gfp_mask))
- node_clear(zone_to_nid(*z), nodes);
+ for_each_zone_zonelist(zone, z, zonelist, high_zoneidx)
+ if (cpuset_zone_allowed_softwall(zone, gfp_mask))
+ node_clear(zone_to_nid(zone), nodes);
else
return CONSTRAINT_CPUSET;

diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.23-rc4-mm1-007_node_zonelist/mm/page_alloc.c linux-2.6.23-rc4-mm1-010_use_two_zonelists/mm/page_alloc.c
--- linux-2.6.23-rc4-mm1-007_node_zonelist/mm/page_alloc.c 2007-09-10 16:06:13.000000000 +0100
+++ linux-2.6.23-rc4-mm1-010_use_two_zonelists/mm/page_alloc.c 2007-09-10 16:06:22.000000000 +0100
@@ -1420,41 +1420,28 @@ static void zlc_mark_zone_full(struct zo
*/
static struct page *
get_page_from_freelist(gfp_t gfp_mask, unsigned int order,
- struct zonelist *zonelist, int alloc_flags)
+ struct zonelist *zonelist, int high_zoneidx, int alloc_flags)
{
struct zone **z;
struct page *page = NULL;
- int classzone_idx = zone_idx(zonelist->zones[0]);
+ int classzone_idx;
struct zone *zone;
nodemask_t *allowednodes = NULL;/* zonelist_cache approximation */
int zlc_active = 0; /* set if using zonelist_cache */
int did_zlc_setup = 0; /* just call zlc_setup() one time */
- enum zone_type highest_zoneidx = -1; /* Gets set for policy zonelists */
+
+ z = first_zones_zonelist(zonelist, high_zoneidx);
+ classzone_idx = zone_idx(*z);

zonelist_scan:
/*
* Scan zonelist, looking for a zone with enough free.
* See also cpuset_zone_allowed() comment in kernel/cpuset.c.
*/
- z = zonelist->zones;
-
- do {
- /*
- * In NUMA, this could be a policy zonelist which contains
- * zones that may not be allowed by the current gfp_mask.
- * Check the zone is allowed by the current flags
- */
- if (unlikely(alloc_should_filter_zonelist(zonelist))) {
- if (highest_zoneidx == -1)
- highest_zoneidx = gfp_zone(gfp_mask);
- if (zone_idx(*z) > highest_zoneidx)
- continue;
- }
-
+ for_each_zone_zonelist(zone, z, zonelist, high_zoneidx) {
if (NUMA_BUILD && zlc_active &&
!zlc_zone_worth_trying(zonelist, z, allowednodes))
continue;
- zone = *z;
if ((alloc_flags & ALLOC_CPUSET) &&
!cpuset_zone_allowed_softwall(zone, gfp_mask))
goto try_next_zone;
@@ -1488,7 +1475,7 @@ try_next_zone:
zlc_active = 1;
did_zlc_setup = 1;
}
- } while (*(++z) != NULL);
+ }

if (unlikely(NUMA_BUILD && page == NULL && zlc_active)) {
/* Disable zlc cache for second zonelist scan */
@@ -1562,6 +1549,7 @@ __alloc_pages(gfp_t gfp_mask, unsigned i
struct zonelist *zonelist)
{
const gfp_t wait = gfp_mask & __GFP_WAIT;
+ enum zone_type high_zoneidx = gfp_zone(gfp_mask);
struct zone **z;
struct page *page;
struct reclaim_state reclaim_state;
@@ -1587,7 +1575,7 @@ restart:
}

page = get_page_from_freelist(gfp_mask|__GFP_HARDWALL, order,
- zonelist, ALLOC_WMARK_LOW|ALLOC_CPUSET);
+ zonelist, high_zoneidx, ALLOC_WMARK_LOW|ALLOC_CPUSET);
if (page)
goto got_pg;

@@ -1631,7 +1619,8 @@ restart:
* Ignore cpuset if GFP_ATOMIC (!wait) rather than fail alloc.
* See also cpuset_zone_allowed() comment in kernel/cpuset.c.
*/
- page = get_page_from_freelist(gfp_mask, order, zonelist, alloc_flags);
+ page = get_page_from_freelist(gfp_mask, order, zonelist,
+ high_zoneidx, alloc_flags);
if (page)
goto got_pg;

@@ -1644,7 +1633,7 @@ rebalance:
nofail_alloc:
/* go through the zonelist yet again, ignoring mins */
page = get_page_from_freelist(gfp_mask, order,
- zonelist, ALLOC_NO_WATERMARKS);
+ zonelist, high_zoneidx, ALLOC_NO_WATERMARKS);
if (page)
goto got_pg;
if (gfp_mask & __GFP_NOFAIL) {
@@ -1679,7 +1668,7 @@ nofail_alloc:

if (likely(did_some_progress)) {
page = get_page_from_freelist(gfp_mask, order,
- zonelist, alloc_flags);
+ zonelist, high_zoneidx, alloc_flags);
if (page)
goto got_pg;
} else if ((gfp_mask & __GFP_FS) && !(gfp_mask & __GFP_NORETRY)) {
@@ -1690,7 +1679,7 @@ nofail_alloc:
* under heavy pressure.
*/
page = get_page_from_freelist(gfp_mask|__GFP_HARDWALL, order,
- zonelist, ALLOC_WMARK_HIGH|ALLOC_CPUSET);
+ zonelist, high_zoneidx, ALLOC_WMARK_HIGH|ALLOC_CPUSET);
if (page)
goto got_pg;

@@ -1804,14 +1793,16 @@ EXPORT_SYMBOL(free_pages);

static unsigned int nr_free_zone_pages(int offset)
{
+ enum zone_type high_zoneidx = MAX_NR_ZONES - 1;
+ struct zone **z;
+ struct zone *zone;
+
/* Just pick one node, since fallback list is circular */
unsigned int sum = 0;

struct zonelist *zonelist = node_zonelist(numa_node_id(), GFP_KERNEL);
- struct zone **zonep = zonelist->zones;
- struct zone *zone;

- for (zone = *zonep++; zone; zone = *zonep++) {
+ for_each_zone_zonelist(zone, z, zonelist, high_zoneidx) {
unsigned long size = zone->present_pages;
unsigned long high = zone->pages_high;
if (size > high)
@@ -2171,17 +2162,15 @@ static int find_next_best_node(int node,
*/
static void build_zonelists_in_node_order(pg_data_t *pgdat, int node)
{
- enum zone_type i;
int j;
struct zonelist *zonelist;

- for (i = 0; i < MAX_NR_ZONES; i++) {
- zonelist = pgdat->node_zonelists + i;
- for (j = 0; zonelist->zones[j] != NULL; j++)
- ;
- j = build_zonelists_node(NODE_DATA(node), zonelist, j, i);
- zonelist->zones[j] = NULL;
- }
+ zonelist = &pgdat->node_zonelists[0];
+ for (j = 0; zonelist->zones[j] != NULL; j++)
+ ;
+ j = build_zonelists_node(NODE_DATA(node), zonelist, j,
+ MAX_NR_ZONES - 1);
+ zonelist->zones[j] = NULL;
}

/*
@@ -2189,15 +2178,12 @@ static void build_zonelists_in_node_orde
*/
static void build_thisnode_zonelists(pg_data_t *pgdat)
{
- enum zone_type i;
int j;
struct zonelist *zonelist;

- for (i = 0; i < MAX_NR_ZONES; i++) {
- zonelist = pgdat->node_zonelists + MAX_NR_ZONES + i;
- j = build_zonelists_node(pgdat, zonelist, 0, i);
- zonelist->zones[j] = NULL;
- }
+ zonelist = &pgdat->node_zonelists[1];
+ j = build_zonelists_node(pgdat, zonelist, 0, MAX_NR_ZONES - 1);
+ zonelist->zones[j] = NULL;
}

/*
@@ -2210,27 +2196,24 @@ static int node_order[MAX_NUMNODES];

static void build_zonelists_in_zone_order(pg_data_t *pgdat, int nr_nodes)
{
- enum zone_type i;
int pos, j, node;
int zone_type; /* needs to be signed */
struct zone *z;
struct zonelist *zonelist;

- for (i = 0; i < MAX_NR_ZONES; i++) {
- zonelist = pgdat->node_zonelists + i;
- pos = 0;
- for (zone_type = i; zone_type >= 0; zone_type--) {
- for (j = 0; j < nr_nodes; j++) {
- node = node_order[j];
- z = &NODE_DATA(node)->node_zones[zone_type];
- if (populated_zone(z)) {
- zonelist->zones[pos++] = z;
- check_highest_zone(zone_type);
- }
+ zonelist = &pgdat->node_zonelists[0];
+ pos = 0;
+ for (zone_type = MAX_NR_ZONES - 1; zone_type >= 0; zone_type--) {
+ for (j = 0; j < nr_nodes; j++) {
+ node = node_order[j];
+ z = &NODE_DATA(node)->node_zones[zone_type];
+ if (populated_zone(z)) {
+ zonelist->zones[pos++] = z;
+ check_highest_zone(zone_type);
}
}
- zonelist->zones[pos] = NULL;
}
+ zonelist->zones[pos] = NULL;
}

static int default_zonelist_order(void)
@@ -2357,19 +2340,15 @@ static void build_zonelists(pg_data_t *p
/* Construct the zonelist performance cache - see further mmzone.h */
static void build_zonelist_cache(pg_data_t *pgdat)
{
- int i;
-
- for (i = 0; i < MAX_NR_ZONES; i++) {
- struct zonelist *zonelist;
- struct zonelist_cache *zlc;
- struct zone **z;
+ struct zonelist *zonelist;
+ struct zonelist_cache *zlc;
+ struct zone **z;

- zonelist = pgdat->node_zonelists + i;
- zonelist->zlcache_ptr = zlc = &zonelist->zlcache;
- bitmap_zero(zlc->fullzones, MAX_ZONES_PER_ZONELIST);
- for (z = zonelist->zones; *z; z++)
- zlc->z_to_n[z - zonelist->zones] = zone_to_nid(*z);
- }
+ zonelist = &pgdat->node_zonelists[0];
+ zonelist->zlcache_ptr = zlc = &zonelist->zlcache;
+ bitmap_zero(zlc->fullzones, MAX_ZONES_PER_ZONELIST);
+ for (z = zonelist->zones; *z; z++)
+ zlc->z_to_n[z - zonelist->zones] = zone_to_nid(*z);
}


@@ -2383,45 +2362,43 @@ static void set_zonelist_order(void)
static void build_zonelists(pg_data_t *pgdat)
{
int node, local_node;
- enum zone_type i,j;
+ enum zone_type j;
+ struct zonelist *zonelist;

local_node = pgdat->node_id;
- for (i = 0; i < MAX_NR_ZONES; i++) {
- struct zonelist *zonelist;

- zonelist = pgdat->node_zonelists + i;
+ zonelist = &pgdat->node_zonelists[0];
+ j = build_zonelists_node(pgdat, zonelist, 0, MAX_NR_ZONES - 1);

- j = build_zonelists_node(pgdat, zonelist, 0, i);
- /*
- * Now we build the zonelist so that it contains the zones
- * of all the other nodes.
- * We don't want to pressure a particular node, so when
- * building the zones for node N, we make sure that the
- * zones coming right after the local ones are those from
- * node N+1 (modulo N)
- */
- for (node = local_node + 1; node < MAX_NUMNODES; node++) {
- if (!node_online(node))
- continue;
- j = build_zonelists_node(NODE_DATA(node), zonelist, j, i);
- }
- for (node = 0; node < local_node; node++) {
- if (!node_online(node))
- continue;
- j = build_zonelists_node(NODE_DATA(node), zonelist, j, i);
- }
-
- zonelist->zones[j] = NULL;
+ /*
+ * Now we build the zonelist so that it contains the zones
+ * of all the other nodes.
+ * We don't want to pressure a particular node, so when
+ * building the zones for node N, we make sure that the
+ * zones coming right after the local ones are those from
+ * node N+1 (modulo N)
+ */
+ for (node = local_node + 1; node < MAX_NUMNODES; node++) {
+ if (!node_online(node))
+ continue;
+ j = build_zonelists_node(NODE_DATA(node), zonelist, j,
+ MAX_NR_ZONES - 1);
}
+ for (node = 0; node < local_node; node++) {
+ if (!node_online(node))
+ continue;
+ j = build_zonelists_node(NODE_DATA(node), zonelist, j,
+ MAX_NR_ZONES - 1);
+ }
+
+ zonelist->zones[j] = NULL;
}

/* non-NUMA variant of zonelist performance cache - just NULL zlcache_ptr */
static void build_zonelist_cache(pg_data_t *pgdat)
{
- int i;
-
- for (i = 0; i < MAX_NR_ZONES; i++)
- pgdat->node_zonelists[i].zlcache_ptr = NULL;
+ pgdat->node_zonelists[0].zlcache_ptr = NULL;
+ pgdat->node_zonelists[1].zlcache_ptr = NULL;
}

#endif /* CONFIG_NUMA */
diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.23-rc4-mm1-007_node_zonelist/mm/slab.c linux-2.6.23-rc4-mm1-010_use_two_zonelists/mm/slab.c
--- linux-2.6.23-rc4-mm1-007_node_zonelist/mm/slab.c 2007-09-10 16:06:13.000000000 +0100
+++ linux-2.6.23-rc4-mm1-010_use_two_zonelists/mm/slab.c 2007-09-10 16:06:22.000000000 +0100
@@ -3242,6 +3242,8 @@ static void *fallback_alloc(struct kmem_
struct zonelist *zonelist;
gfp_t local_flags;
struct zone **z;
+ struct zone *zone;
+ enum zone_type high_zoneidx = gfp_zone(flags);
void *obj = NULL;
int nid;

@@ -3256,10 +3258,10 @@ retry:
* Look through allowed nodes for objects available
* from existing per node queues.
*/
- for (z = zonelist->zones; *z && !obj; z++) {
- nid = zone_to_nid(*z);
+ for_each_zone_zonelist(zone, z, zonelist, high_zoneidx) {
+ nid = zone_to_nid(zone);

- if (cpuset_zone_allowed_hardwall(*z, flags) &&
+ if (cpuset_zone_allowed_hardwall(zone, flags) &&
cache->nodelists[nid] &&
cache->nodelists[nid]->free_objects)
obj = ____cache_alloc_node(cache,
diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.23-rc4-mm1-007_node_zonelist/mm/slub.c linux-2.6.23-rc4-mm1-010_use_two_zonelists/mm/slub.c
--- linux-2.6.23-rc4-mm1-007_node_zonelist/mm/slub.c 2007-09-10 16:06:13.000000000 +0100
+++ linux-2.6.23-rc4-mm1-010_use_two_zonelists/mm/slub.c 2007-09-10 16:06:22.000000000 +0100
@@ -1258,6 +1258,8 @@ static struct page *get_any_partial(stru
#ifdef CONFIG_NUMA
struct zonelist *zonelist;
struct zone **z;
+ struct zone *zone;
+ enum zone_type high_zoneidx = gfp_zone(flags);
struct page *page;

/*
@@ -1282,12 +1284,12 @@ static struct page *get_any_partial(stru
return NULL;

zonelist = node_zonelist(slab_node(current->mempolicy), flags);
- for (z = zonelist->zones; *z; z++) {
+ for_each_zone_zonelist(zone, z, zonelist, high_zoneidx) {
struct kmem_cache_node *n;

- n = get_node(s, zone_to_nid(*z));
+ n = get_node(s, zone_to_nid(zone));

- if (n && cpuset_zone_allowed_hardwall(*z, flags) &&
+ if (n && cpuset_zone_allowed_hardwall(zone, flags) &&
n->nr_partial > MIN_PARTIAL) {
page = get_partial_node(n);
if (page)
diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.23-rc4-mm1-007_node_zonelist/mm/vmscan.c linux-2.6.23-rc4-mm1-010_use_two_zonelists/mm/vmscan.c
--- linux-2.6.23-rc4-mm1-007_node_zonelist/mm/vmscan.c 2007-09-10 16:06:06.000000000 +0100
+++ linux-2.6.23-rc4-mm1-010_use_two_zonelists/mm/vmscan.c 2007-09-10 16:06:22.000000000 +0100
@@ -1211,13 +1211,11 @@ static unsigned long shrink_zones(int pr
struct scan_control *sc)
{
unsigned long nr_reclaimed = 0;
- struct zone **zones = zonelist->zones;
- int i;
+ struct zone **z;
+ struct zone *zone;

sc->all_unreclaimable = 1;
- for (i = 0; zones[i] != NULL; i++) {
- struct zone *zone = zones[i];
-
+ for_each_zone_zonelist(zone, z, zonelist, MAX_NR_ZONES - 1) {
if (!populated_zone(zone))
continue;

@@ -1258,14 +1256,14 @@ unsigned long do_try_to_free_pages(struc
unsigned long nr_reclaimed = 0;
struct reclaim_state *reclaim_state = current->reclaim_state;
unsigned long lru_pages = 0;
- struct zone **zones = zonelist->zones;
+ struct zone **z;
+ struct zone *zone;
+ enum zone_type high_zoneidx = gfp_zone(gfp_mask);
int i;

count_vm_event(ALLOCSTALL);

- for (i = 0; zones[i] != NULL; i++) {
- struct zone *zone = zones[i];
-
+ for_each_zone_zonelist(zone, z, zonelist, high_zoneidx) {
if (!cpuset_zone_allowed_hardwall(zone, GFP_KERNEL))
continue;

@@ -1324,9 +1322,7 @@ out:
*/
if (priority < 0)
priority = 0;
- for (i = 0; zones[i] != 0; i++) {
- struct zone *zone = zones[i];
-
+ for_each_zone_zonelist(zone, z, zonelist, MAX_NR_ZONES - 1) {
if (!cpuset_zone_allowed_hardwall(zone, GFP_KERNEL))
continue;

2007-09-11 21:32:01

by Mel Gorman

Subject: [PATCH 4/6] Embed zone_id information within the zonelist->zones pointer


Using two zonelists per node requires very frequent use of zone_idx(). This
is costly as it involves a lookup of another structure and a subtraction
operation. As struct zone is always word aligned and normally cache-line
aligned, the least significant bits of every zone pointer are always zero
and are free to carry extra information.

This patch embeds the zone_id of a zone in the zonelist->zones pointers.
The real zone pointer is retrieved using the zonelist_zone() helper function.
The ID of the zone is found using zonelist_zone_idx(). To avoid accidental
references, the zones field is renamed to _zones and the type changed to
unsigned long.
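
As an illustration only, the encoding can be sketched in userspace C. This
is not the kernel code: struct demo_zone, the demo_* helpers, the 64-byte
alignment and the 2-bit DEMO_ZONES_SHIFT are all assumptions made for the
example. It also shows why a 0 terminator needs no separate NULL check in
the iterators: a 0 entry decodes to a NULL zone with index 0.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumption for this sketch: zone indexes fit in 2 bits */
#define DEMO_ZONES_SHIFT	2
#define DEMO_ZONEIDX_MASK	(((uintptr_t)1 << DEMO_ZONES_SHIFT) - 1)

/* Hypothetical stand-in for struct zone, aligned like a cache line */
struct demo_zone {
	_Alignas(64) int idx;
	const char *name;
};

/* Strip the low bits to recover the real pointer */
static inline struct demo_zone *demo_zonelist_zone(uintptr_t zone_addr)
{
	return (struct demo_zone *)(zone_addr & ~DEMO_ZONEIDX_MASK);
}

/* The zone index lives in the low bits, no dereference needed */
static inline int demo_zonelist_zone_idx(uintptr_t zone_addr)
{
	return (int)(zone_addr & DEMO_ZONEIDX_MASK);
}

/* Pack pointer and index into one word, checking that nothing is lost */
static inline uintptr_t demo_encode_zone_idx(struct demo_zone *zone)
{
	uintptr_t encoded;

	assert(zone->idx >= 0 && (uintptr_t)zone->idx <= DEMO_ZONEIDX_MASK);
	encoded = (uintptr_t)zone | (uintptr_t)zone->idx;
	assert(demo_zonelist_zone(encoded) == zone);
	return encoded;
}
```

A caller would store demo_encode_zone_idx(zone) in the list and use the two
decode helpers when walking it, much as the patch does with _zones.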

Signed-off-by: Mel Gorman <[email protected]>
Acked-by: Christoph Lameter <[email protected]>
---

arch/parisc/mm/init.c | 2 -
fs/buffer.c | 6 ++--
include/linux/mmzone.h | 58 ++++++++++++++++++++++++++++++++++++--------
kernel/cpuset.c | 4 +--
mm/hugetlb.c | 3 +-
mm/mempolicy.c | 37 +++++++++++++++++-----------
mm/oom_kill.c | 2 -
mm/page_alloc.c | 52 ++++++++++++++++++++-------------------
mm/slab.c | 2 -
mm/slub.c | 2 -
mm/vmscan.c | 5 +--
mm/vmstat.c | 5 ++-
12 files changed, 114 insertions(+), 64 deletions(-)

diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.23-rc4-mm1-010_use_two_zonelists/arch/parisc/mm/init.c linux-2.6.23-rc4-mm1-020_zoneid_zonelist/arch/parisc/mm/init.c
--- linux-2.6.23-rc4-mm1-010_use_two_zonelists/arch/parisc/mm/init.c 2007-09-10 16:06:22.000000000 +0100
+++ linux-2.6.23-rc4-mm1-020_zoneid_zonelist/arch/parisc/mm/init.c 2007-09-10 16:06:31.000000000 +0100
@@ -604,7 +604,7 @@ void show_mem(void)
for (i = 0; i < npmem_ranges; i++) {
zl = node_zonelist(i);
for (j = 0; j < MAX_NR_ZONES; j++) {
- struct zone **z;
+ unsigned long *z;
struct zone *zone;

printk("Zone list for zone %d on node %d: ", j, i);
diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.23-rc4-mm1-010_use_two_zonelists/fs/buffer.c linux-2.6.23-rc4-mm1-020_zoneid_zonelist/fs/buffer.c
--- linux-2.6.23-rc4-mm1-010_use_two_zonelists/fs/buffer.c 2007-09-10 16:06:22.000000000 +0100
+++ linux-2.6.23-rc4-mm1-020_zoneid_zonelist/fs/buffer.c 2007-09-10 16:06:31.000000000 +0100
@@ -368,7 +368,7 @@ void invalidate_bdev(struct block_device
*/
static void free_more_memory(void)
{
- struct zone **zones;
+ unsigned long *zones;
int nid;

wakeup_pdflush(1024);
@@ -376,10 +376,10 @@ static void free_more_memory(void)

for_each_online_node(nid) {
zones = first_zones_zonelist(node_zonelist(nid, GFP_NOFS),
- gfp_zone(GFP_NOFS));
+ gfp_zone(GFP_NOFS));
if (*zones)
try_to_free_pages(node_zonelist(nid, GFP_NOFS), 0,
- GFP_NOFS);
+ GFP_NOFS);
}
}

diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.23-rc4-mm1-010_use_two_zonelists/include/linux/mmzone.h linux-2.6.23-rc4-mm1-020_zoneid_zonelist/include/linux/mmzone.h
--- linux-2.6.23-rc4-mm1-010_use_two_zonelists/include/linux/mmzone.h 2007-09-10 16:06:22.000000000 +0100
+++ linux-2.6.23-rc4-mm1-020_zoneid_zonelist/include/linux/mmzone.h 2007-09-10 16:06:31.000000000 +0100
@@ -444,11 +444,18 @@ struct zonelist_cache;
*
* If zlcache_ptr is not NULL, then it is just the address of zlcache,
* as explained above. If zlcache_ptr is NULL, there is no zlcache.
+ * *
+ * To speed the reading of _zones, additional information is encoded in the
+ * least significant bits of zonelist->_zones. The following helpers are used
+ *
+ * zonelist_zone() - Return the struct zone * for an entry in _zones
+ * zonelist_zone_idx() - Return the index of the zone for an entry
*/
-
struct zonelist {
struct zonelist_cache *zlcache_ptr; // NULL or &zlcache
- struct zone *zones[MAX_ZONES_PER_ZONELIST + 1]; // NULL delimited
+ unsigned long _zones[MAX_ZONES_PER_ZONELIST + 1]; /* Encoded pointer,
+ * 0 delimited
+ */
#ifdef CONFIG_NUMA
struct zonelist_cache zlcache; // optional ...
#endif
@@ -682,14 +689,43 @@ extern struct zone *next_zone(struct zon
zone; \
zone = next_zone(zone))

+/*
+ * SMP will align zones to a large boundary so the zone ID will fit in the
+ * least significant bits. Otherwise, ZONES_SHIFT must be 2 or less
+ */
+#if (defined(CONFIG_SMP) && INTERNODE_CACHE_SHIFT < ZONES_SHIFT) || \
+ ZONES_SHIFT > 2
+#error There is not enough space to embed zone IDs in the zonelist
+#endif
+
+#define ZONELIST_ZONEIDX_MASK ((1UL << ZONES_SHIFT) - 1)
+static inline struct zone *zonelist_zone(unsigned long zone_addr)
+{
+ return (struct zone *)(zone_addr & ~ZONELIST_ZONEIDX_MASK);
+}
+
+static inline int zonelist_zone_idx(unsigned long zone_addr)
+{
+ return zone_addr & ZONELIST_ZONEIDX_MASK;
+}
+
+static inline unsigned long encode_zone_idx(struct zone *zone)
+{
+ unsigned long encoded;
+
+ encoded = (unsigned long)zone | zone_idx(zone);
+ BUG_ON(zonelist_zone(encoded) != zone);
+ return encoded;
+}
+
/* Returns the first zone at or below highest_zoneidx in a zonelist */
-static inline struct zone **first_zones_zonelist(struct zonelist *zonelist,
+static inline unsigned long *first_zones_zonelist(struct zonelist *zonelist,
enum zone_type highest_zoneidx)
{
- struct zone **z;
+ unsigned long *z;

- for (z = zonelist->zones;
- *z && zone_idx(*z) > highest_zoneidx;
+ for (z = zonelist->_zones;
+ zonelist_zone_idx(*z) > highest_zoneidx;
z++)
;

@@ -697,11 +733,11 @@ static inline struct zone **first_zones_
}

/* Returns the next zone at or below highest_zoneidx in a zonelist */
-static inline struct zone **next_zones_zonelist(struct zone **z,
+static inline unsigned long *next_zones_zonelist(unsigned long *z,
enum zone_type highest_zoneidx)
{
/* Find the next suitable zone to use for the allocation */
- for (; *z && zone_idx(*z) > highest_zoneidx; z++)
+ for (; zonelist_zone_idx(*z) > highest_zoneidx; z++)
;

return z;
@@ -717,9 +753,11 @@ static inline struct zone **next_zones_z
* This iterator iterates though all zones at or below a given zone index.
*/
#define for_each_zone_zonelist(zone, z, zlist, highidx) \
- for (z = first_zones_zonelist(zlist, highidx), zone = *z++; \
+ for (z = first_zones_zonelist(zlist, highidx), \
+ zone = zonelist_zone(*z++); \
zone; \
- z = next_zones_zonelist(z, highidx), zone = *z++)
+ z = next_zones_zonelist(z, highidx), \
+ zone = zonelist_zone(*z++))

#ifdef CONFIG_SPARSEMEM
#include <asm/sparsemem.h>
diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.23-rc4-mm1-010_use_two_zonelists/kernel/cpuset.c linux-2.6.23-rc4-mm1-020_zoneid_zonelist/kernel/cpuset.c
--- linux-2.6.23-rc4-mm1-010_use_two_zonelists/kernel/cpuset.c 2007-09-10 09:29:14.000000000 +0100
+++ linux-2.6.23-rc4-mm1-020_zoneid_zonelist/kernel/cpuset.c 2007-09-10 16:06:31.000000000 +0100
@@ -1525,8 +1525,8 @@ int cpuset_zonelist_valid_mems_allowed(s
{
int i;

- for (i = 0; zl->zones[i]; i++) {
- int nid = zone_to_nid(zl->zones[i]);
+ for (i = 0; zl->_zones[i]; i++) {
+ int nid = zone_to_nid(zonelist_zone(zl->_zones[i]));

if (node_isset(nid, current->mems_allowed))
return 1;
diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.23-rc4-mm1-010_use_two_zonelists/mm/hugetlb.c linux-2.6.23-rc4-mm1-020_zoneid_zonelist/mm/hugetlb.c
--- linux-2.6.23-rc4-mm1-010_use_two_zonelists/mm/hugetlb.c 2007-09-10 16:06:22.000000000 +0100
+++ linux-2.6.23-rc4-mm1-020_zoneid_zonelist/mm/hugetlb.c 2007-09-10 16:06:31.000000000 +0100
@@ -73,7 +73,8 @@ static struct page *dequeue_huge_page(st
struct page *page = NULL;
struct zonelist *zonelist = huge_zonelist(vma, address,
htlb_alloc_mask);
- struct zone *zone, **z;
+ struct zone *zone;
+ unsigned long *z;

for_each_zone_zonelist(zone, z, zonelist, MAX_NR_ZONES - 1) {
nid = zone_to_nid(zone);
diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.23-rc4-mm1-010_use_two_zonelists/mm/mempolicy.c linux-2.6.23-rc4-mm1-020_zoneid_zonelist/mm/mempolicy.c
--- linux-2.6.23-rc4-mm1-010_use_two_zonelists/mm/mempolicy.c 2007-09-10 16:06:13.000000000 +0100
+++ linux-2.6.23-rc4-mm1-020_zoneid_zonelist/mm/mempolicy.c 2007-09-10 16:06:31.000000000 +0100
@@ -157,7 +157,7 @@ static struct zonelist *bind_zonelist(no
for_each_node_mask(nd, *nodes) {
struct zone *z = &NODE_DATA(nd)->node_zones[k];
if (z->present_pages > 0)
- zl->zones[num++] = z;
+ zl->_zones[num++] = encode_zone_idx(z);
}
if (k == 0)
break;
@@ -167,7 +167,7 @@ static struct zonelist *bind_zonelist(no
kfree(zl);
return ERR_PTR(-EINVAL);
}
- zl->zones[num] = NULL;
+ zl->_zones[num] = 0;
return zl;
}

@@ -489,9 +489,11 @@ static void get_zonemask(struct mempolic
nodes_clear(*nodes);
switch (p->policy) {
case MPOL_BIND:
- for (i = 0; p->v.zonelist->zones[i]; i++)
- node_set(zone_to_nid(p->v.zonelist->zones[i]),
- *nodes);
+ for (i = 0; p->v.zonelist->_zones[i]; i++) {
+ struct zone *zone;
+ zone = zonelist_zone(p->v.zonelist->_zones[i]);
+ node_set(zone_to_nid(zone), *nodes);
+ }
break;
case MPOL_DEFAULT:
break;
@@ -1159,12 +1161,15 @@ unsigned slab_node(struct mempolicy *pol
case MPOL_INTERLEAVE:
return interleave_nodes(policy);

- case MPOL_BIND:
+ case MPOL_BIND: {
/*
* Follow bind policy behavior and start allocation at the
* first node.
*/
- return zone_to_nid(policy->v.zonelist->zones[0]);
+ struct zonelist *zonelist;
+ zonelist = policy->v.zonelist;
+ return zone_to_nid(zonelist_zone(zonelist->_zones[0]));
+ }

case MPOL_PREFERRED:
if (policy->v.preferred_node >= 0)
@@ -1242,7 +1247,7 @@ static struct page *alloc_page_interleav

zl = node_zonelist(nid, gfp);
page = __alloc_pages(gfp, order, zl);
- if (page && page_zone(page) == zl->zones[0])
+ if (page && page_zone(page) == zonelist_zone(zl->_zones[0]))
inc_zone_page_state(page, NUMA_INTERLEAVE_HIT);
return page;
}
@@ -1366,10 +1371,14 @@ int __mpol_equal(struct mempolicy *a, st
return a->v.preferred_node == b->v.preferred_node;
case MPOL_BIND: {
int i;
- for (i = 0; a->v.zonelist->zones[i]; i++)
- if (a->v.zonelist->zones[i] != b->v.zonelist->zones[i])
+ for (i = 0; a->v.zonelist->_zones[i]; i++) {
+ struct zone *za, *zb;
+ za = zonelist_zone(a->v.zonelist->_zones[i]);
+ zb = zonelist_zone(b->v.zonelist->_zones[i]);
+ if (za != zb)
return 0;
- return b->v.zonelist->zones[i] == NULL;
+ }
+ return b->v.zonelist->_zones[i] == 0;
}
default:
BUG();
@@ -1688,12 +1697,12 @@ static void mpol_rebind_policy(struct me
break;
case MPOL_BIND: {
nodemask_t nodes;
- struct zone **z;
+ unsigned long *z;
struct zonelist *zonelist;

nodes_clear(nodes);
- for (z = pol->v.zonelist->zones; *z; z++)
- node_set(zone_to_nid(*z), nodes);
+ for (z = pol->v.zonelist->_zones; *z; z++)
+ node_set(zone_to_nid(zonelist_zone(*z)), nodes);
nodes_remap(tmp, nodes, *mpolmask, *newmask);
nodes = tmp;

diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.23-rc4-mm1-010_use_two_zonelists/mm/oom_kill.c linux-2.6.23-rc4-mm1-020_zoneid_zonelist/mm/oom_kill.c
--- linux-2.6.23-rc4-mm1-010_use_two_zonelists/mm/oom_kill.c 2007-09-10 16:06:22.000000000 +0100
+++ linux-2.6.23-rc4-mm1-020_zoneid_zonelist/mm/oom_kill.c 2007-09-10 16:06:31.000000000 +0100
@@ -186,7 +186,7 @@ static inline int constrained_alloc(stru
{
#ifdef CONFIG_NUMA
struct zone *zone;
- struct zone **z;
+ unsigned long *z;
enum zone_type high_zoneidx = gfp_zone(gfp_mask);
nodemask_t nodes = node_states[N_HIGH_MEMORY];

diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.23-rc4-mm1-010_use_two_zonelists/mm/page_alloc.c linux-2.6.23-rc4-mm1-020_zoneid_zonelist/mm/page_alloc.c
--- linux-2.6.23-rc4-mm1-010_use_two_zonelists/mm/page_alloc.c 2007-09-10 16:06:22.000000000 +0100
+++ linux-2.6.23-rc4-mm1-020_zoneid_zonelist/mm/page_alloc.c 2007-09-10 16:06:31.000000000 +0100
@@ -1359,7 +1359,7 @@ static nodemask_t *zlc_setup(struct zone
* We are low on memory in the second scan, and should leave no stone
* unturned looking for a free page.
*/
-static int zlc_zone_worth_trying(struct zonelist *zonelist, struct zone **z,
+static int zlc_zone_worth_trying(struct zonelist *zonelist, unsigned long *z,
nodemask_t *allowednodes)
{
struct zonelist_cache *zlc; /* cached zonelist speedup info */
@@ -1370,7 +1370,7 @@ static int zlc_zone_worth_trying(struct
if (!zlc)
return 1;

- i = z - zonelist->zones;
+ i = z - zonelist->_zones;
n = zlc->z_to_n[i];

/* This zone is worth trying if it is allowed but not full */
@@ -1382,7 +1382,7 @@ static int zlc_zone_worth_trying(struct
* zlc->fullzones, so that subsequent attempts to allocate a page
* from that zone don't waste time re-examining it.
*/
-static void zlc_mark_zone_full(struct zonelist *zonelist, struct zone **z)
+static void zlc_mark_zone_full(struct zonelist *zonelist, unsigned long *z)
{
struct zonelist_cache *zlc; /* cached zonelist speedup info */
int i; /* index of *z in zonelist zones */
@@ -1391,7 +1391,7 @@ static void zlc_mark_zone_full(struct zo
if (!zlc)
return;

- i = z - zonelist->zones;
+ i = z - zonelist->_zones;

set_bit(i, zlc->fullzones);
}
@@ -1403,13 +1403,13 @@ static nodemask_t *zlc_setup(struct zone
return NULL;
}

-static int zlc_zone_worth_trying(struct zonelist *zonelist, struct zone **z,
+static int zlc_zone_worth_trying(struct zonelist *zonelist, unsigned long *z,
nodemask_t *allowednodes)
{
return 1;
}

-static void zlc_mark_zone_full(struct zonelist *zonelist, struct zone **z)
+static void zlc_mark_zone_full(struct zonelist *zonelist, unsigned long *z)
{
}
#endif /* CONFIG_NUMA */
@@ -1422,7 +1422,7 @@ static struct page *
get_page_from_freelist(gfp_t gfp_mask, unsigned int order,
struct zonelist *zonelist, int high_zoneidx, int alloc_flags)
{
- struct zone **z;
+ unsigned long *z;
struct page *page = NULL;
int classzone_idx;
struct zone *zone;
@@ -1431,7 +1431,7 @@ get_page_from_freelist(gfp_t gfp_mask, u
int did_zlc_setup = 0; /* just call zlc_setup() one time */

z = first_zones_zonelist(zonelist, high_zoneidx);
- classzone_idx = zone_idx(*z);
+ classzone_idx = zonelist_zone_idx(*z);

zonelist_scan:
/*
@@ -1550,7 +1550,8 @@ __alloc_pages(gfp_t gfp_mask, unsigned i
{
const gfp_t wait = gfp_mask & __GFP_WAIT;
enum zone_type high_zoneidx = gfp_zone(gfp_mask);
- struct zone **z;
+ unsigned long *z;
+ struct zone *zone;
struct page *page;
struct reclaim_state reclaim_state;
struct task_struct *p = current;
@@ -1564,9 +1565,9 @@ __alloc_pages(gfp_t gfp_mask, unsigned i
return NULL;

restart:
- z = zonelist->zones; /* the list of zones suitable for gfp_mask */
+ z = zonelist->_zones; /* the list of zones suitable for gfp_mask */

- if (unlikely(*z == NULL)) {
+ if (unlikely(*z == 0)) {
/*
* Happens if we have an empty zonelist as a result of
* GFP_THISNODE being used on a memoryless node
@@ -1590,8 +1591,8 @@ restart:
if (NUMA_BUILD && (gfp_mask & GFP_THISNODE) == GFP_THISNODE)
goto nopage;

- for (z = zonelist->zones; *z; z++)
- wakeup_kswapd(*z, order);
+ for_each_zone_zonelist(zone, z, zonelist, high_zoneidx)
+ wakeup_kswapd(zone, order);

/*
* OK, we're below the kswapd watermark and have kicked background
@@ -1794,7 +1795,7 @@ EXPORT_SYMBOL(free_pages);
static unsigned int nr_free_zone_pages(int offset)
{
enum zone_type high_zoneidx = MAX_NR_ZONES - 1;
- struct zone **z;
+ unsigned long *z;
struct zone *zone;

/* Just pick one node, since fallback list is circular */
@@ -1989,7 +1990,7 @@ static int build_zonelists_node(pg_data_
zone_type--;
zone = pgdat->node_zones + zone_type;
if (populated_zone(zone)) {
- zonelist->zones[nr_zones++] = zone;
+ zonelist->_zones[nr_zones++] = encode_zone_idx(zone);
check_highest_zone(zone_type);
}

@@ -2166,11 +2167,11 @@ static void build_zonelists_in_node_orde
struct zonelist *zonelist;

zonelist = &pgdat->node_zonelists[0];
- for (j = 0; zonelist->zones[j] != NULL; j++)
+ for (j = 0; zonelist->_zones[j] != 0; j++)
;
j = build_zonelists_node(NODE_DATA(node), zonelist, j,
MAX_NR_ZONES - 1);
- zonelist->zones[j] = NULL;
+ zonelist->_zones[j] = 0;
}

/*
@@ -2183,7 +2184,7 @@ static void build_thisnode_zonelists(pg_

zonelist = &pgdat->node_zonelists[1];
j = build_zonelists_node(pgdat, zonelist, 0, MAX_NR_ZONES - 1);
- zonelist->zones[j] = NULL;
+ zonelist->_zones[j] = 0;
}

/*
@@ -2208,12 +2209,12 @@ static void build_zonelists_in_zone_orde
node = node_order[j];
z = &NODE_DATA(node)->node_zones[zone_type];
if (populated_zone(z)) {
- zonelist->zones[pos++] = z;
+ zonelist->_zones[pos++] = encode_zone_idx(z);
check_highest_zone(zone_type);
}
}
}
- zonelist->zones[pos] = NULL;
+ zonelist->_zones[pos] = 0;
}

static int default_zonelist_order(void)
@@ -2290,7 +2291,7 @@ static void build_zonelists(pg_data_t *p
/* initialize zonelists */
for (i = 0; i < MAX_ZONELISTS; i++) {
zonelist = pgdat->node_zonelists + i;
- zonelist->zones[0] = NULL;
+ zonelist->_zones[0] = 0;
}

/* NUMA-aware ordering of nodes */
@@ -2342,13 +2343,14 @@ static void build_zonelist_cache(pg_data
{
struct zonelist *zonelist;
struct zonelist_cache *zlc;
- struct zone **z;
+ unsigned long *z;

zonelist = &pgdat->node_zonelists[0];
zonelist->zlcache_ptr = zlc = &zonelist->zlcache;
bitmap_zero(zlc->fullzones, MAX_ZONES_PER_ZONELIST);
- for (z = zonelist->zones; *z; z++)
- zlc->z_to_n[z - zonelist->zones] = zone_to_nid(*z);
+ for (z = zonelist->_zones; *z; z++)
+ zlc->z_to_n[z - zonelist->_zones] =
+ zone_to_nid(zonelist_zone(*z));
}


@@ -2391,7 +2393,7 @@ static void build_zonelists(pg_data_t *p
MAX_NR_ZONES - 1);
}

- zonelist->zones[j] = NULL;
+ zonelist->_zones[j] = 0;
}

/* non-NUMA variant of zonelist performance cache - just NULL zlcache_ptr */
diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.23-rc4-mm1-010_use_two_zonelists/mm/slab.c linux-2.6.23-rc4-mm1-020_zoneid_zonelist/mm/slab.c
--- linux-2.6.23-rc4-mm1-010_use_two_zonelists/mm/slab.c 2007-09-10 16:06:22.000000000 +0100
+++ linux-2.6.23-rc4-mm1-020_zoneid_zonelist/mm/slab.c 2007-09-10 16:06:31.000000000 +0100
@@ -3241,7 +3241,7 @@ static void *fallback_alloc(struct kmem_
{
struct zonelist *zonelist;
gfp_t local_flags;
- struct zone **z;
+ unsigned long *z;
struct zone *zone;
enum zone_type high_zoneidx = gfp_zone(flags);
void *obj = NULL;
diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.23-rc4-mm1-010_use_two_zonelists/mm/slub.c linux-2.6.23-rc4-mm1-020_zoneid_zonelist/mm/slub.c
--- linux-2.6.23-rc4-mm1-010_use_two_zonelists/mm/slub.c 2007-09-10 16:06:22.000000000 +0100
+++ linux-2.6.23-rc4-mm1-020_zoneid_zonelist/mm/slub.c 2007-09-10 16:06:31.000000000 +0100
@@ -1257,7 +1257,7 @@ static struct page *get_any_partial(stru
{
#ifdef CONFIG_NUMA
struct zonelist *zonelist;
- struct zone **z;
+ unsigned long *z;
struct zone *zone;
enum zone_type high_zoneidx = gfp_zone(flags);
struct page *page;
diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.23-rc4-mm1-010_use_two_zonelists/mm/vmscan.c linux-2.6.23-rc4-mm1-020_zoneid_zonelist/mm/vmscan.c
--- linux-2.6.23-rc4-mm1-010_use_two_zonelists/mm/vmscan.c 2007-09-10 16:06:22.000000000 +0100
+++ linux-2.6.23-rc4-mm1-020_zoneid_zonelist/mm/vmscan.c 2007-09-10 16:06:31.000000000 +0100
@@ -1211,7 +1211,7 @@ static unsigned long shrink_zones(int pr
struct scan_control *sc)
{
unsigned long nr_reclaimed = 0;
- struct zone **z;
+ unsigned long *z;
struct zone *zone;

sc->all_unreclaimable = 1;
@@ -1256,10 +1256,9 @@ unsigned long do_try_to_free_pages(struc
unsigned long nr_reclaimed = 0;
struct reclaim_state *reclaim_state = current->reclaim_state;
unsigned long lru_pages = 0;
- struct zone **z;
+ unsigned long *z;
struct zone *zone;
enum zone_type high_zoneidx = gfp_zone(gfp_mask);
- int i;

count_vm_event(ALLOCSTALL);

diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.23-rc4-mm1-010_use_two_zonelists/mm/vmstat.c linux-2.6.23-rc4-mm1-020_zoneid_zonelist/mm/vmstat.c
--- linux-2.6.23-rc4-mm1-010_use_two_zonelists/mm/vmstat.c 2007-09-10 09:29:14.000000000 +0100
+++ linux-2.6.23-rc4-mm1-020_zoneid_zonelist/mm/vmstat.c 2007-09-10 16:06:31.000000000 +0100
@@ -365,11 +365,12 @@ void refresh_cpu_vm_stats(int cpu)
*/
void zone_statistics(struct zonelist *zonelist, struct zone *z)
{
- if (z->zone_pgdat == zonelist->zones[0]->zone_pgdat) {
+ if (z->zone_pgdat == zonelist_zone(zonelist->_zones[0])->zone_pgdat) {
__inc_zone_state(z, NUMA_HIT);
} else {
__inc_zone_state(z, NUMA_MISS);
- __inc_zone_state(zonelist->zones[0], NUMA_FOREIGN);
+ __inc_zone_state(zonelist_zone(zonelist->_zones[0]),
+ NUMA_FOREIGN);
}
if (z->node == numa_node_id())
__inc_zone_state(z, NUMA_LOCAL);

2007-09-11 21:32:31

by Mel Gorman

Subject: [PATCH 5/6] Filter based on a nodemask as well as a gfp_mask


The MPOL_BIND policy creates a custom zonelist that is used for those
allocations of the thread that can use the policy_zone. As the per-node
zonelist is already being filtered based on a zone id, this patch adds a
version of __alloc_pages() that takes a nodemask for further filtering. This
eliminates the need for MPOL_BIND to create a custom zonelist. A further
benefit is that allocations using MPOL_BIND now use the local-node-ordered
zonelist instead of a custom node-id-ordered zonelist.
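
As a rough userspace sketch of the filtering idea (not the kernel
implementation): struct demo_entry, demo_next() and demo_count() are
hypothetical names, and the nodemask is modelled as a plain unsigned long
bitmask rather than a nodemask_t. A NULL mask means "no node filtering",
mirroring the nodes == NULL fast path in the patch.

```c
#include <stddef.h>

/* Hypothetical zonelist entry; valid == 0 marks the end of the list */
struct demo_entry {
	int valid;
	int node;
	int zone_idx;
};

/*
 * Return the first entry at or after e whose zone index is at or below
 * high_zoneidx and whose node is set in *nodes (if nodes is not NULL).
 * Returns the terminator if no such entry exists.
 */
static const struct demo_entry *demo_next(const struct demo_entry *e,
					  const unsigned long *nodes,
					  int high_zoneidx)
{
	for (; e->valid; e++) {
		if (e->zone_idx > high_zoneidx)
			continue;
		if (nodes && !(*nodes & (1UL << e->node)))
			continue;
		break;
	}
	return e;
}

/* Count the entries an allocation with these filters would consider */
static int demo_count(const struct demo_entry *list,
		      const unsigned long *nodes, int high_zoneidx)
{
	const struct demo_entry *e;
	int n = 0;

	for (e = demo_next(list, nodes, high_zoneidx); e->valid;
	     e = demo_next(e + 1, nodes, high_zoneidx))
		n++;
	return n;
}
```

Because the one shared list stays in local-node order, a bound allocation
simply skips the entries its mask excludes and keeps the ordering of the
rest, which is the behaviour change described above.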


Signed-off-by: Mel Gorman <[email protected]>
---

fs/buffer.c | 4 -
include/linux/cpuset.h | 4 -
include/linux/gfp.h | 4 +
include/linux/mempolicy.h | 3
include/linux/mmzone.h | 61 ++++++++++++++---
kernel/cpuset.c | 19 +----
mm/mempolicy.c | 144 +++++++++++------------------------------
mm/page_alloc.c | 40 +++++++----
8 files changed, 136 insertions(+), 143 deletions(-)

diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.23-rc4-mm1-020_zoneid_zonelist/fs/buffer.c linux-2.6.23-rc4-mm1-030_filter_nodemask/fs/buffer.c
--- linux-2.6.23-rc4-mm1-020_zoneid_zonelist/fs/buffer.c 2007-09-10 16:06:31.000000000 +0100
+++ linux-2.6.23-rc4-mm1-030_filter_nodemask/fs/buffer.c 2007-09-10 16:06:39.000000000 +0100
@@ -376,10 +376,10 @@ static void free_more_memory(void)

for_each_online_node(nid) {
zones = first_zones_zonelist(node_zonelist(nid, GFP_NOFS),
- gfp_zone(GFP_NOFS));
+ NULL, gfp_zone(GFP_NOFS));
if (*zones)
try_to_free_pages(node_zonelist(nid, GFP_NOFS), 0,
- GFP_NOFS);
+ GFP_NOFS);
}
}

diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.23-rc4-mm1-020_zoneid_zonelist/include/linux/cpuset.h linux-2.6.23-rc4-mm1-030_filter_nodemask/include/linux/cpuset.h
--- linux-2.6.23-rc4-mm1-020_zoneid_zonelist/include/linux/cpuset.h 2007-09-10 09:29:13.000000000 +0100
+++ linux-2.6.23-rc4-mm1-030_filter_nodemask/include/linux/cpuset.h 2007-09-10 16:06:39.000000000 +0100
@@ -28,7 +28,7 @@ void cpuset_init_current_mems_allowed(vo
void cpuset_update_task_memory_state(void);
#define cpuset_nodes_subset_current_mems_allowed(nodes) \
nodes_subset((nodes), current->mems_allowed)
-int cpuset_zonelist_valid_mems_allowed(struct zonelist *zl);
+int cpuset_nodemask_valid_mems_allowed(nodemask_t *nodemask);

extern int __cpuset_zone_allowed_softwall(struct zone *z, gfp_t gfp_mask);
extern int __cpuset_zone_allowed_hardwall(struct zone *z, gfp_t gfp_mask);
@@ -102,7 +102,7 @@ static inline void cpuset_init_current_m
static inline void cpuset_update_task_memory_state(void) {}
#define cpuset_nodes_subset_current_mems_allowed(nodes) (1)

-static inline int cpuset_zonelist_valid_mems_allowed(struct zonelist *zl)
+static inline int cpuset_nodemask_valid_mems_allowed(nodemask_t *nodemask)
{
return 1;
}
diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.23-rc4-mm1-020_zoneid_zonelist/include/linux/gfp.h linux-2.6.23-rc4-mm1-030_filter_nodemask/include/linux/gfp.h
--- linux-2.6.23-rc4-mm1-020_zoneid_zonelist/include/linux/gfp.h 2007-09-10 16:06:22.000000000 +0100
+++ linux-2.6.23-rc4-mm1-030_filter_nodemask/include/linux/gfp.h 2007-09-10 16:06:39.000000000 +0100
@@ -185,6 +185,10 @@ static inline void arch_alloc_page(struc
extern struct page *
FASTCALL(__alloc_pages(gfp_t, unsigned int, struct zonelist *));

+extern struct page *
+FASTCALL(__alloc_pages_nodemask(gfp_t, unsigned int,
+ struct zonelist *, nodemask_t *nodemask));
+
static inline struct page *alloc_pages_node(int nid, gfp_t gfp_mask,
unsigned int order)
{
diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.23-rc4-mm1-020_zoneid_zonelist/include/linux/mempolicy.h linux-2.6.23-rc4-mm1-030_filter_nodemask/include/linux/mempolicy.h
--- linux-2.6.23-rc4-mm1-020_zoneid_zonelist/include/linux/mempolicy.h 2007-09-10 16:06:13.000000000 +0100
+++ linux-2.6.23-rc4-mm1-030_filter_nodemask/include/linux/mempolicy.h 2007-09-10 16:06:39.000000000 +0100
@@ -63,9 +63,8 @@ struct mempolicy {
atomic_t refcnt;
short policy; /* See MPOL_* above */
union {
- struct zonelist *zonelist; /* bind */
short preferred_node; /* preferred */
- nodemask_t nodes; /* interleave */
+ nodemask_t nodes; /* interleave/bind */
/* undefined for default */
} v;
nodemask_t cpuset_mems_allowed; /* mempolicy relative to these nodes */
diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.23-rc4-mm1-020_zoneid_zonelist/include/linux/mmzone.h linux-2.6.23-rc4-mm1-030_filter_nodemask/include/linux/mmzone.h
--- linux-2.6.23-rc4-mm1-020_zoneid_zonelist/include/linux/mmzone.h 2007-09-10 16:06:31.000000000 +0100
+++ linux-2.6.23-rc4-mm1-030_filter_nodemask/include/linux/mmzone.h 2007-09-11 13:43:04.000000000 +0100
@@ -718,14 +718,29 @@ static inline unsigned long encode_zone_
return encoded;
}

+static inline int zone_in_nodemask(unsigned long zone_addr, nodemask_t *nodes)
+{
+#ifdef CONFIG_NUMA
+ return node_isset(zonelist_zone(zone_addr)->node, *nodes);
+#else
+ return 1;
+#endif /* CONFIG_NUMA */
+}
+
/* Returns the first zone at or below highest_zoneidx in a zonelist */
static inline unsigned long *first_zones_zonelist(struct zonelist *zonelist,
+ nodemask_t *nodes,
enum zone_type highest_zoneidx)
{
- unsigned long *z;
+ unsigned long *z = zonelist->_zones;

- for (z = zonelist->_zones;
- zonelist_zone_idx(*z) > highest_zoneidx;
+ if (likely(nodes == NULL))
+ for (; zonelist_zone_idx(*z) > highest_zoneidx;
+ z++)
+ ;
+ else
+ for (; zonelist_zone_idx(*z) > highest_zoneidx ||
+ (*z && !zone_in_nodemask(*z, nodes));
z++)
;

@@ -734,31 +749,55 @@ static inline unsigned long *first_zones

/* Returns the next zone at or below highest_zoneidx in a zonelist */
static inline unsigned long *next_zones_zonelist(unsigned long *z,
+ nodemask_t *nodes,
enum zone_type highest_zoneidx)
{
- /* Find the next suitable zone to use for the allocation */
- for (; zonelist_zone_idx(*z) > highest_zoneidx; z++)
- ;
+ /*
+ * Find the next suitable zone to use for the allocation.
+ * Only filter based on nodemask if it's set
+ */
+ if (likely(nodes == NULL))
+ for (; zonelist_zone_idx(*z) > highest_zoneidx; z++)
+ ;
+ else
+ for (; zonelist_zone_idx(*z) > highest_zoneidx ||
+ (*z && !zone_in_nodemask(*z, nodes));
+ z++)
+ ;

return z;
}

/**
- * for_each_zone_zonelist - helper macro to iterate over valid zones in a zonelist at or below a given zone index
+ * for_each_zone_zonelist_nodemask - helper macro to iterate over valid zones in a zonelist at or below a given zone index and within a nodemask
* @zone - The current zone in the iterator
* @z - The current pointer within zonelist->zones being iterated
* @zlist - The zonelist being iterated
* @highidx - The zone index of the highest zone to return
+ * @nodemask - Nodemask allowed by the allocator
*
- * This iterator iterates though all zones at or below a given zone index.
+ * This iterator iterates through all zones at or below a given zone index and
+ * within a given nodemask
*/
-#define for_each_zone_zonelist(zone, z, zlist, highidx) \
- for (z = first_zones_zonelist(zlist, highidx), \
+#define for_each_zone_zonelist_nodemask(zone, z, zlist, highidx, nodemask) \
+ for (z = first_zones_zonelist(zlist, nodemask, highidx), \
zone = zonelist_zone(*z++); \
zone; \
- z = next_zones_zonelist(z, highidx), \
+ z = next_zones_zonelist(z, nodemask, highidx), \
zone = zonelist_zone(*z++))

+/**
+ * for_each_zone_zonelist - helper macro to iterate over valid zones in a zonelist at or below a given zone index
+ * @zone - The current zone in the iterator
+ * @z - The current pointer within zonelist->zones being iterated
+ * @zlist - The zonelist being iterated
+ * @highidx - The zone index of the highest zone to return
+ *
+ * This iterator iterates through all zones at or below a given zone index.
+ */
+#define for_each_zone_zonelist(zone, z, zlist, highidx) \
+ for_each_zone_zonelist_nodemask(zone, z, zlist, highidx, NULL)
+
#ifdef CONFIG_SPARSEMEM
#include <asm/sparsemem.h>
#endif
diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.23-rc4-mm1-020_zoneid_zonelist/kernel/cpuset.c linux-2.6.23-rc4-mm1-030_filter_nodemask/kernel/cpuset.c
--- linux-2.6.23-rc4-mm1-020_zoneid_zonelist/kernel/cpuset.c 2007-09-10 16:06:31.000000000 +0100
+++ linux-2.6.23-rc4-mm1-030_filter_nodemask/kernel/cpuset.c 2007-09-10 16:06:39.000000000 +0100
@@ -1516,22 +1516,17 @@ nodemask_t cpuset_mems_allowed(struct ta
}

/**
- * cpuset_zonelist_valid_mems_allowed - check zonelist vs. curremt mems_allowed
- * @zl: the zonelist to be checked
+ * cpuset_nodemask_valid_mems_allowed - check nodemask vs. current mems_allowed
+ * @nodemask: the nodemask to be checked
*
- * Are any of the nodes on zonelist zl allowed in current->mems_allowed?
+ * Are any of the nodes in the nodemask allowed in current->mems_allowed?
*/
-int cpuset_zonelist_valid_mems_allowed(struct zonelist *zl)
+int cpuset_nodemask_valid_mems_allowed(nodemask_t *nodemask)
{
- int i;
+ int nid;
+ nodemask_t tmp;

- for (i = 0; zl->_zones[i]; i++) {
- int nid = zone_to_nid(zonelist_zone(zl->_zones[i]));
-
- if (node_isset(nid, current->mems_allowed))
- return 1;
- }
- return 0;
+ return nodes_intersects(*nodemask, current->mems_allowed);
}

/*
diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.23-rc4-mm1-020_zoneid_zonelist/mm/mempolicy.c linux-2.6.23-rc4-mm1-030_filter_nodemask/mm/mempolicy.c
--- linux-2.6.23-rc4-mm1-020_zoneid_zonelist/mm/mempolicy.c 2007-09-10 16:06:31.000000000 +0100
+++ linux-2.6.23-rc4-mm1-030_filter_nodemask/mm/mempolicy.c 2007-09-10 16:06:39.000000000 +0100
@@ -134,41 +134,21 @@ static int mpol_check_policy(int mode, n
return nodes_subset(*nodes, node_states[N_HIGH_MEMORY]) ? 0 : -EINVAL;
}

-/* Generate a custom zonelist for the BIND policy. */
-static struct zonelist *bind_zonelist(nodemask_t *nodes)
+/* Check that the nodemask contains at least one populated zone */
+static int is_valid_nodemask(nodemask_t *nodemask)
{
- struct zonelist *zl;
- int num, max, nd;
- enum zone_type k;
+ int nd, k;

- max = 1 + MAX_NR_ZONES * nodes_weight(*nodes);
- max++; /* space for zlcache_ptr (see mmzone.h) */
- zl = kmalloc(sizeof(struct zone *) * max, GFP_KERNEL);
- if (!zl)
- return ERR_PTR(-ENOMEM);
- zl->zlcache_ptr = NULL;
- num = 0;
- /* First put in the highest zones from all nodes, then all the next
- lower zones etc. Avoid empty zones because the memory allocator
- doesn't like them. If you implement node hot removal you
- have to fix that. */
- k = MAX_NR_ZONES - 1;
- while (1) {
- for_each_node_mask(nd, *nodes) {
- struct zone *z = &NODE_DATA(nd)->node_zones[k];
- if (z->present_pages > 0)
- zl->_zones[num++] = encode_zone_idx(z);
- }
- if (k == 0)
- break;
- k--;
- }
- if (num == 0) {
- kfree(zl);
- return ERR_PTR(-EINVAL);
+ /* Check that there is something useful in this mask */
+ k = policy_zone;
+
+ for_each_node_mask(nd, *nodemask) {
+ struct zone *z = &NODE_DATA(nd)->node_zones[k];
+ if (z->present_pages > 0)
+ return 1;
}
- zl->_zones[num] = 0;
- return zl;
+
+ return 0;
}

/* Create a new policy */
@@ -201,12 +181,11 @@ static struct mempolicy *mpol_new(int mo
policy->v.preferred_node = -1;
break;
case MPOL_BIND:
- policy->v.zonelist = bind_zonelist(nodes);
- if (IS_ERR(policy->v.zonelist)) {
- void *error_code = policy->v.zonelist;
+ if (!is_valid_nodemask(nodes)) {
kmem_cache_free(policy_cache, policy);
- return error_code;
+ return ERR_PTR(-EINVAL);
}
+ policy->v.nodes = *nodes;
break;
}
policy->policy = mode;
@@ -484,19 +463,13 @@ static long do_set_mempolicy(int mode, n
/* Fill a zone bitmap for a policy */
static void get_zonemask(struct mempolicy *p, nodemask_t *nodes)
{
- int i;

nodes_clear(*nodes);
switch (p->policy) {
- case MPOL_BIND:
- for (i = 0; p->v.zonelist->_zones[i]; i++) {
- struct zone *zone;
- zone = zonelist_zone(p->v.zonelist->_zones[i]);
- node_set(zone_to_nid(zone), *nodes);
- }
- break;
case MPOL_DEFAULT:
break;
+ case MPOL_BIND:
+ /* Fall through */
case MPOL_INTERLEAVE:
*nodes = p->v.nodes;
break;
@@ -1106,6 +1079,18 @@ static struct mempolicy * get_vma_policy
return pol;
}

+/* Return a nodemask representing a mempolicy */
+static inline nodemask_t *nodemask_policy(gfp_t gfp, struct mempolicy *policy)
+{
+ /* Lower zones don't get a nodemask applied for MPOL_BIND */
+ if (unlikely(policy->policy == MPOL_BIND &&
+ gfp_zone(gfp) >= policy_zone &&
+ cpuset_nodemask_valid_mems_allowed(&policy->v.nodes)))
+ return &policy->v.nodes;
+
+ return NULL;
+}
+
/* Return a zonelist representing a mempolicy */
static struct zonelist *zonelist_policy(gfp_t gfp, struct mempolicy *policy)
{
@@ -1118,11 +1103,6 @@ static struct zonelist *zonelist_policy(
nd = numa_node_id();
break;
case MPOL_BIND:
- /* Lower zones don't get a policy applied */
- /* Careful: current->mems_allowed might have moved */
- if (gfp_zone(gfp) >= policy_zone)
- if (cpuset_zonelist_valid_mems_allowed(policy->v.zonelist))
- return policy->v.zonelist;
/*FALL THROUGH*/
case MPOL_INTERLEAVE: /* should not happen */
case MPOL_DEFAULT:
@@ -1167,8 +1147,12 @@ unsigned slab_node(struct mempolicy *pol
* first node.
*/
struct zonelist *zonelist;
- zonelist = policy->v.zonelist;
- return zone_to_nid(zonelist_zone(zonelist->_zones[0]));
+ unsigned long *z;
+ enum zone_type highest_zoneidx = gfp_zone(GFP_KERNEL);
+ zonelist = &NODE_DATA(numa_node_id())->node_zonelists[0];
+ z = first_zones_zonelist(zonelist, &policy->v.nodes,
+ highest_zoneidx);
+ return zone_to_nid(zonelist_zone(*z));
}

case MPOL_PREFERRED:
@@ -1287,7 +1271,8 @@ alloc_page_vma(gfp_t gfp, struct vm_area
nid = interleave_nid(pol, vma, addr, PAGE_SHIFT);
return alloc_page_interleave(gfp, 0, nid);
}
- return __alloc_pages(gfp, 0, zonelist_policy(gfp, pol));
+ return __alloc_pages_nodemask(gfp, 0,
+ zonelist_policy(gfp, pol), nodemask_policy(gfp, pol));
}

/**
@@ -1344,14 +1329,6 @@ struct mempolicy *__mpol_copy(struct mem
}
*new = *old;
atomic_set(&new->refcnt, 1);
- if (new->policy == MPOL_BIND) {
- int sz = ksize(old->v.zonelist);
- new->v.zonelist = kmemdup(old->v.zonelist, sz, GFP_KERNEL);
- if (!new->v.zonelist) {
- kmem_cache_free(policy_cache, new);
- return ERR_PTR(-ENOMEM);
- }
- }
return new;
}

@@ -1365,21 +1342,12 @@ int __mpol_equal(struct mempolicy *a, st
switch (a->policy) {
case MPOL_DEFAULT:
return 1;
+ case MPOL_BIND:
+ /* Fall through */
case MPOL_INTERLEAVE:
return nodes_equal(a->v.nodes, b->v.nodes);
case MPOL_PREFERRED:
return a->v.preferred_node == b->v.preferred_node;
- case MPOL_BIND: {
- int i;
- for (i = 0; a->v.zonelist->_zones[i]; i++) {
- struct zone *za, *zb;
- za = zonelist_zone(a->v.zonelist->_zones[i]);
- zb = zonelist_zone(b->v.zonelist->_zones[i]);
- if (za != zb)
- return 0;
- }
- return b->v.zonelist->_zones[i] == 0;
- }
default:
BUG();
return 0;
@@ -1391,8 +1359,6 @@ void __mpol_free(struct mempolicy *p)
{
if (!atomic_dec_and_test(&p->refcnt))
return;
- if (p->policy == MPOL_BIND)
- kfree(p->v.zonelist);
p->policy = MPOL_DEFAULT;
kmem_cache_free(policy_cache, p);
}
@@ -1683,6 +1649,8 @@ static void mpol_rebind_policy(struct me
switch (pol->policy) {
case MPOL_DEFAULT:
break;
+ case MPOL_BIND:
+ /* Fall through */
case MPOL_INTERLEAVE:
nodes_remap(tmp, pol->v.nodes, *mpolmask, *newmask);
pol->v.nodes = tmp;
@@ -1695,32 +1663,6 @@ static void mpol_rebind_policy(struct me
*mpolmask, *newmask);
*mpolmask = *newmask;
break;
- case MPOL_BIND: {
- nodemask_t nodes;
- unsigned long *z;
- struct zonelist *zonelist;
-
- nodes_clear(nodes);
- for (z = pol->v.zonelist->_zones; *z; z++)
- node_set(zone_to_nid(zonelist_zone(*z)), nodes);
- nodes_remap(tmp, nodes, *mpolmask, *newmask);
- nodes = tmp;
-
- zonelist = bind_zonelist(&nodes);
-
- /* If no mem, then zonelist is NULL and we keep old zonelist.
- * If that old zonelist has no remaining mems_allowed nodes,
- * then zonelist_policy() will "FALL THROUGH" to MPOL_DEFAULT.
- */
-
- if (!IS_ERR(zonelist)) {
- /* Good - got mem - substitute new zonelist */
- kfree(pol->v.zonelist);
- pol->v.zonelist = zonelist;
- }
- *mpolmask = *newmask;
- break;
- }
default:
BUG();
break;
@@ -1783,9 +1725,7 @@ static inline int mpol_to_str(char *buff
break;

case MPOL_BIND:
- get_zonemask(pol, &nodes);
- break;
-
+ /* Fall through */
case MPOL_INTERLEAVE:
nodes = pol->v.nodes;
break;
diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.23-rc4-mm1-020_zoneid_zonelist/mm/page_alloc.c linux-2.6.23-rc4-mm1-030_filter_nodemask/mm/page_alloc.c
--- linux-2.6.23-rc4-mm1-020_zoneid_zonelist/mm/page_alloc.c 2007-09-10 16:06:31.000000000 +0100
+++ linux-2.6.23-rc4-mm1-030_filter_nodemask/mm/page_alloc.c 2007-09-10 16:06:39.000000000 +0100
@@ -1419,7 +1419,7 @@ static void zlc_mark_zone_full(struct zo
* a page.
*/
static struct page *
-get_page_from_freelist(gfp_t gfp_mask, unsigned int order,
+get_page_from_freelist(gfp_t gfp_mask, nodemask_t *nodemask, unsigned int order,
struct zonelist *zonelist, int high_zoneidx, int alloc_flags)
{
unsigned long *z;
@@ -1430,7 +1430,7 @@ get_page_from_freelist(gfp_t gfp_mask, u
int zlc_active = 0; /* set if using zonelist_cache */
int did_zlc_setup = 0; /* just call zlc_setup() one time */

- z = first_zones_zonelist(zonelist, high_zoneidx);
+ z = first_zones_zonelist(zonelist, nodemask, high_zoneidx);
classzone_idx = zonelist_zone_idx(*z);

zonelist_scan:
@@ -1438,7 +1438,8 @@ zonelist_scan:
* Scan zonelist, looking for a zone with enough free.
* See also cpuset_zone_allowed() comment in kernel/cpuset.c.
*/
- for_each_zone_zonelist(zone, z, zonelist, high_zoneidx) {
+ for_each_zone_zonelist_nodemask(zone, z, zonelist,
+ high_zoneidx, nodemask) {
if (NUMA_BUILD && zlc_active &&
!zlc_zone_worth_trying(zonelist, z, allowednodes))
continue;
@@ -1544,9 +1545,9 @@ static void set_page_owner(struct page *
/*
* This is the 'heart' of the zoned buddy allocator.
*/
-struct page * fastcall
-__alloc_pages(gfp_t gfp_mask, unsigned int order,
- struct zonelist *zonelist)
+static struct page *
+__alloc_pages_internal(gfp_t gfp_mask, unsigned int order,
+ struct zonelist *zonelist, nodemask_t *nodemask)
{
const gfp_t wait = gfp_mask & __GFP_WAIT;
enum zone_type high_zoneidx = gfp_zone(gfp_mask);
@@ -1575,7 +1576,7 @@ restart:
return NULL;
}

- page = get_page_from_freelist(gfp_mask|__GFP_HARDWALL, order,
+ page = get_page_from_freelist(gfp_mask|__GFP_HARDWALL, nodemask, order,
zonelist, high_zoneidx, ALLOC_WMARK_LOW|ALLOC_CPUSET);
if (page)
goto got_pg;
@@ -1620,7 +1621,7 @@ restart:
* Ignore cpuset if GFP_ATOMIC (!wait) rather than fail alloc.
* See also cpuset_zone_allowed() comment in kernel/cpuset.c.
*/
- page = get_page_from_freelist(gfp_mask, order, zonelist,
+ page = get_page_from_freelist(gfp_mask, nodemask, order, zonelist,
high_zoneidx, alloc_flags);
if (page)
goto got_pg;
@@ -1633,7 +1634,7 @@ rebalance:
if (!(gfp_mask & __GFP_NOMEMALLOC)) {
nofail_alloc:
/* go through the zonelist yet again, ignoring mins */
- page = get_page_from_freelist(gfp_mask, order,
+ page = get_page_from_freelist(gfp_mask, nodemask, order,
zonelist, high_zoneidx, ALLOC_NO_WATERMARKS);
if (page)
goto got_pg;
@@ -1668,7 +1669,7 @@ nofail_alloc:
drain_all_local_pages();

if (likely(did_some_progress)) {
- page = get_page_from_freelist(gfp_mask, order,
+ page = get_page_from_freelist(gfp_mask, nodemask, order,
zonelist, high_zoneidx, alloc_flags);
if (page)
goto got_pg;
@@ -1679,8 +1680,9 @@ nofail_alloc:
* a parallel oom killing, we must fail if we're still
* under heavy pressure.
*/
- page = get_page_from_freelist(gfp_mask|__GFP_HARDWALL, order,
- zonelist, high_zoneidx, ALLOC_WMARK_HIGH|ALLOC_CPUSET);
+ page = get_page_from_freelist(gfp_mask|__GFP_HARDWALL, nodemask,
+ order, zonelist, high_zoneidx,
+ ALLOC_WMARK_HIGH|ALLOC_CPUSET);
if (page)
goto got_pg;

@@ -1728,6 +1730,20 @@ got_pg:
return page;
}

+struct page * fastcall
+__alloc_pages(gfp_t gfp_mask, unsigned int order,
+ struct zonelist *zonelist)
+{
+ return __alloc_pages_internal(gfp_mask, order, zonelist, NULL);
+}
+
+struct page * fastcall
+__alloc_pages_nodemask(gfp_t gfp_mask, unsigned int order,
+ struct zonelist *zonelist, nodemask_t *nodemask)
+{
+ return __alloc_pages_internal(gfp_mask, order, zonelist, nodemask);
+}
+
EXPORT_SYMBOL(__alloc_pages);

/*
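
For readers following the series, the nodemask filtering that patch 5/6 adds to the zonelist iterators can be modelled in a few lines of userspace C. This is only a sketch under stated assumptions: struct toy_zone, toy_nodemask and the toy_* helpers are hypothetical stand-ins invented for illustration, the zonelist is a plain NULL-terminated pointer array rather than the encoded _zones array, and the nodemask is a single unsigned long (so node numbers must be smaller than the number of bits in a long).

```c
#include <stddef.h>

/* Hypothetical stand-in for struct zone; only the fields the walk needs. */
struct toy_zone {
	int node;	/* NUMA node that owns the zone */
	int zone_idx;	/* 0 = lowest zone type, larger = higher zone */
};

/* One bit per node. A NULL pointer means "do not filter by node". */
typedef unsigned long toy_nodemask;

static int toy_zone_in_nodemask(const struct toy_zone *zone,
				const toy_nodemask *nodes)
{
	return (int)((*nodes >> zone->node) & 1UL);
}

/*
 * Advance z to the first entry that is at or below highest_zoneidx and,
 * if a nodemask was given, sits on an allowed node. The list ends with NULL.
 */
static struct toy_zone **toy_next_zone(struct toy_zone **z,
				       const toy_nodemask *nodes,
				       int highest_zoneidx)
{
	while (*z && ((*z)->zone_idx > highest_zoneidx ||
		      (nodes && !toy_zone_in_nodemask(*z, nodes))))
		z++;
	return z;
}

/* Count the zones an allocation with these constraints could try. */
static int toy_count_usable(struct toy_zone **zonelist,
			    const toy_nodemask *nodes, int highest_zoneidx)
{
	struct toy_zone **z;
	int count = 0;

	for (z = toy_next_zone(zonelist, nodes, highest_zoneidx); *z;
	     z = toy_next_zone(z + 1, nodes, highest_zoneidx))
		count++;
	return count;
}
```

Passing NULL for the mask reproduces the unfiltered walk, which mirrors how for_each_zone_zonelist() is defined in terms of for_each_zone_zonelist_nodemask() in the patch.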

2007-09-11 21:32:49

by Mel Gorman

Subject: [PATCH 6/6] Use one zonelist that is filtered by nodemask


Two zonelists exist so that GFP_THISNODE allocations are guaranteed to use
memory only from a node local to the CPU. As we can now filter the zonelist
based on a nodemask, GFP_THISNODE can instead be handled by filtering the
single zonelist slightly differently.

When GFP_THISNODE is used, a temporary nodemask is created with only the
node local to the CPU set. This allows us to eliminate the second zonelist.

Signed-off-by: Mel Gorman <[email protected]>
---

drivers/char/sysrq.c | 2 -
fs/buffer.c | 5 +--
include/linux/gfp.h | 23 +++------------
include/linux/mempolicy.h | 2 -
include/linux/mmzone.h | 14 ---------
mm/mempolicy.c | 8 ++---
mm/page_alloc.c | 61 ++++++++++++++++++++++-------------------
mm/slab.c | 2 -
mm/slub.c | 2 -
9 files changed, 50 insertions(+), 69 deletions(-)

diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.23-rc4-mm1-030_filter_nodemask/drivers/char/sysrq.c linux-2.6.23-rc4-mm1-040_use_one_zonelist/drivers/char/sysrq.c
--- linux-2.6.23-rc4-mm1-030_filter_nodemask/drivers/char/sysrq.c 2007-09-10 16:06:13.000000000 +0100
+++ linux-2.6.23-rc4-mm1-040_use_one_zonelist/drivers/char/sysrq.c 2007-09-11 13:43:28.000000000 +0100
@@ -270,7 +270,7 @@ static struct sysrq_key_op sysrq_term_op

static void moom_callback(struct work_struct *ignored)
{
- out_of_memory(node_zonelist(0, GFP_KERNEL), GFP_KERNEL, 0);
+ out_of_memory(node_zonelist(0), GFP_KERNEL, 0);
}

static DECLARE_WORK(moom_work, moom_callback);
diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.23-rc4-mm1-030_filter_nodemask/fs/buffer.c linux-2.6.23-rc4-mm1-040_use_one_zonelist/fs/buffer.c
--- linux-2.6.23-rc4-mm1-030_filter_nodemask/fs/buffer.c 2007-09-10 16:06:39.000000000 +0100
+++ linux-2.6.23-rc4-mm1-040_use_one_zonelist/fs/buffer.c 2007-09-11 13:43:28.000000000 +0100
@@ -375,11 +375,10 @@ static void free_more_memory(void)
yield();

for_each_online_node(nid) {
- zones = first_zones_zonelist(node_zonelist(nid, GFP_NOFS),
+ zones = first_zones_zonelist(node_zonelist(nid),
NULL, gfp_zone(GFP_NOFS));
if (*zones)
- try_to_free_pages(node_zonelist(nid, GFP_NOFS), 0,
- GFP_NOFS);
+ try_to_free_pages(node_zonelist(nid), 0, GFP_NOFS);
}
}

diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.23-rc4-mm1-030_filter_nodemask/include/linux/gfp.h linux-2.6.23-rc4-mm1-040_use_one_zonelist/include/linux/gfp.h
--- linux-2.6.23-rc4-mm1-030_filter_nodemask/include/linux/gfp.h 2007-09-10 16:06:39.000000000 +0100
+++ linux-2.6.23-rc4-mm1-040_use_one_zonelist/include/linux/gfp.h 2007-09-11 13:43:28.000000000 +0100
@@ -150,29 +150,16 @@ static inline gfp_t set_migrateflags(gfp
* virtual kernel addresses to the allocated page(s).
*/

-static inline enum zone_type gfp_zonelist(gfp_t flags)
-{
- int base = 0;
-
-#ifdef CONFIG_NUMA
- if (flags & __GFP_THISNODE)
- base = 1;
-#endif
-
- return base;
-}
-
/*
- * We get the zone list from the current node and the gfp_mask.
- * This zonelist contains two zonelists, one for all zones with memory and
- * one containing just zones from the node the zonelist belongs to
+ * We get the zone list from the current node and the list of zones
+ * is filtered based on the GFP flags
*
* For the normal case of non-DISCONTIGMEM systems the NODE_DATA() gets
* optimized to &contig_page_data at compile-time.
*/
-static inline struct zonelist *node_zonelist(int nid, gfp_t flags)
+static inline struct zonelist *node_zonelist(int nid)
{
- return NODE_DATA(nid)->node_zonelists + gfp_zonelist(flags);
+ return &NODE_DATA(nid)->node_zonelist;
}

#ifndef HAVE_ARCH_FREE_PAGE
@@ -199,7 +186,7 @@ static inline struct page *alloc_pages_n
if (nid < 0)
nid = numa_node_id();

- return __alloc_pages(gfp_mask, order, node_zonelist(nid, gfp_mask));
+ return __alloc_pages(gfp_mask, order, node_zonelist(nid));
}

#ifdef CONFIG_NUMA
diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.23-rc4-mm1-030_filter_nodemask/include/linux/mempolicy.h linux-2.6.23-rc4-mm1-040_use_one_zonelist/include/linux/mempolicy.h
--- linux-2.6.23-rc4-mm1-030_filter_nodemask/include/linux/mempolicy.h 2007-09-10 16:06:39.000000000 +0100
+++ linux-2.6.23-rc4-mm1-040_use_one_zonelist/include/linux/mempolicy.h 2007-09-11 13:43:28.000000000 +0100
@@ -239,7 +239,7 @@ static inline void mpol_fix_fork_child_f
static inline struct zonelist *huge_zonelist(struct vm_area_struct *vma,
unsigned long addr, gfp_t gfp_flags)
{
- return node_zonelist(0, gfp_flags);
+ return node_zonelist(0);
}

static inline int do_migrate_pages(struct mm_struct *mm,
diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.23-rc4-mm1-030_filter_nodemask/include/linux/mmzone.h linux-2.6.23-rc4-mm1-040_use_one_zonelist/include/linux/mmzone.h
--- linux-2.6.23-rc4-mm1-030_filter_nodemask/include/linux/mmzone.h 2007-09-11 13:43:04.000000000 +0100
+++ linux-2.6.23-rc4-mm1-040_use_one_zonelist/include/linux/mmzone.h 2007-09-11 13:43:28.000000000 +0100
@@ -356,17 +356,6 @@ struct zone {
#define MAX_ZONES_PER_ZONELIST (MAX_NUMNODES * MAX_NR_ZONES)

#ifdef CONFIG_NUMA
-
-/*
- * The NUMA zonelists are doubled becausse we need zonelists that restrict the
- * allocations to a single node for GFP_THISNODE.
- *
- * [0] : Zonelist with fallback
- * [1] : No fallback (GFP_THISNODE)
- */
-#define MAX_ZONELISTS 2
-
-
/*
* We cache key information from each zonelist for smaller cache
* footprint when scanning for free pages in get_page_from_freelist().
@@ -432,7 +421,6 @@ struct zonelist_cache {
unsigned long last_full_zap; /* when last zap'd (jiffies) */
};
#else
-#define MAX_ZONELISTS 1
struct zonelist_cache;
#endif

@@ -488,7 +476,7 @@ extern struct page *mem_map;
struct bootmem_data;
typedef struct pglist_data {
struct zone node_zones[MAX_NR_ZONES];
- struct zonelist node_zonelists[MAX_ZONELISTS];
+ struct zonelist node_zonelist;
int nr_zones;
#ifdef CONFIG_FLAT_NODE_MEM_MAP
struct page *node_mem_map;
diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.23-rc4-mm1-030_filter_nodemask/mm/mempolicy.c linux-2.6.23-rc4-mm1-040_use_one_zonelist/mm/mempolicy.c
--- linux-2.6.23-rc4-mm1-030_filter_nodemask/mm/mempolicy.c 2007-09-10 16:06:39.000000000 +0100
+++ linux-2.6.23-rc4-mm1-040_use_one_zonelist/mm/mempolicy.c 2007-09-11 13:43:28.000000000 +0100
@@ -1112,7 +1112,7 @@ static struct zonelist *zonelist_policy(
nd = 0;
BUG();
}
- return node_zonelist(nd, gfp);
+ return node_zonelist(nd);
}

/* Do dynamic interleaving for a process */
@@ -1149,7 +1149,7 @@ unsigned slab_node(struct mempolicy *pol
struct zonelist *zonelist;
unsigned long *z;
enum zone_type highest_zoneidx = gfp_zone(GFP_KERNEL);
- zonelist = &NODE_DATA(numa_node_id())->node_zonelists[0];
+ zonelist = &NODE_DATA(numa_node_id())->node_zonelist;
z = first_zones_zonelist(zonelist, &policy->v.nodes,
highest_zoneidx);
return zone_to_nid(zonelist_zone(*z));
@@ -1215,7 +1215,7 @@ struct zonelist *huge_zonelist(struct vm
unsigned nid;

nid = interleave_nid(pol, vma, addr, HPAGE_SHIFT);
- return node_zonelist(nid, gfp_flags);
+ return node_zonelist(nid);
}
return zonelist_policy(GFP_HIGHUSER, pol);
}
@@ -1229,7 +1229,7 @@ static struct page *alloc_page_interleav
struct zonelist *zl;
struct page *page;

- zl = node_zonelist(nid, gfp);
+ zl = node_zonelist(nid);
page = __alloc_pages(gfp, order, zl);
if (page && page_zone(page) == zonelist_zone(zl->_zones[0]))
inc_zone_page_state(page, NUMA_INTERLEAVE_HIT);
diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.23-rc4-mm1-030_filter_nodemask/mm/page_alloc.c linux-2.6.23-rc4-mm1-040_use_one_zonelist/mm/page_alloc.c
--- linux-2.6.23-rc4-mm1-030_filter_nodemask/mm/page_alloc.c 2007-09-10 16:06:39.000000000 +0100
+++ linux-2.6.23-rc4-mm1-040_use_one_zonelist/mm/page_alloc.c 2007-09-11 13:43:28.000000000 +0100
@@ -1730,10 +1730,33 @@ got_pg:
return page;
}

+static nodemask_t *nodemask_thisnode(nodemask_t *nodemask)
+{
+ /* Build a nodemask for just this node */
+ int nid = numa_node_id();
+
+ nodes_clear(*nodemask);
+ node_set(nid, *nodemask);
+
+ return nodemask;
+}
+
struct page * fastcall
__alloc_pages(gfp_t gfp_mask, unsigned int order,
struct zonelist *zonelist)
{
+ /*
+ * Use a temporary nodemask for __GFP_THISNODE allocations. If the
+ * cost of allocating on the stack or the stack usage becomes
+ * noticeable, allocate the nodemasks per node at boot or compile time
+ */
+ if (unlikely(gfp_mask & __GFP_THISNODE)) {
+ nodemask_t nodemask;
+
+ return __alloc_pages_internal(gfp_mask, order,
+ zonelist, nodemask_thisnode(&nodemask));
+ }
+
return __alloc_pages_internal(gfp_mask, order, zonelist, NULL);
}

@@ -1741,6 +1764,9 @@ struct page * fastcall
__alloc_pages_nodemask(gfp_t gfp_mask, unsigned int order,
struct zonelist *zonelist, nodemask_t *nodemask)
{
+ /* Specifying both __GFP_THISNODE and nodemask is stupid. Warn user */
+ WARN_ON(gfp_mask & __GFP_THISNODE);
+
return __alloc_pages_internal(gfp_mask, order, zonelist, nodemask);
}

@@ -1817,7 +1843,7 @@ static unsigned int nr_free_zone_pages(i
/* Just pick one node, since fallback list is circular */
unsigned int sum = 0;

- struct zonelist *zonelist = node_zonelist(numa_node_id(), GFP_KERNEL);
+ struct zonelist *zonelist = node_zonelist(numa_node_id());

for_each_zone_zonelist(zone, z, zonelist, high_zoneidx) {
unsigned long size = zone->present_pages;
@@ -2182,7 +2208,7 @@ static void build_zonelists_in_node_orde
int j;
struct zonelist *zonelist;

- zonelist = &pgdat->node_zonelists[0];
+ zonelist = &pgdat->node_zonelist;
for (j = 0; zonelist->_zones[j] != 0; j++)
;
j = build_zonelists_node(NODE_DATA(node), zonelist, j,
@@ -2191,19 +2217,6 @@ static void build_zonelists_in_node_orde
}

/*
- * Build gfp_thisnode zonelists
- */
-static void build_thisnode_zonelists(pg_data_t *pgdat)
-{
- int j;
- struct zonelist *zonelist;
-
- zonelist = &pgdat->node_zonelists[1];
- j = build_zonelists_node(pgdat, zonelist, 0, MAX_NR_ZONES - 1);
- zonelist->_zones[j] = 0;
-}
-
-/*
* Build zonelists ordered by zone and nodes within zones.
* This results in conserving DMA zone[s] until all Normal memory is
* exhausted, but results in overflowing to remote node while memory
@@ -2218,7 +2231,7 @@ static void build_zonelists_in_zone_orde
struct zone *z;
struct zonelist *zonelist;

- zonelist = &pgdat->node_zonelists[0];
+ zonelist = &pgdat->node_zonelist;
pos = 0;
for (zone_type = MAX_NR_ZONES - 1; zone_type >= 0; zone_type--) {
for (j = 0; j < nr_nodes; j++) {
@@ -2298,17 +2311,14 @@ static void set_zonelist_order(void)
static void build_zonelists(pg_data_t *pgdat)
{
int j, node, load;
- enum zone_type i;
nodemask_t used_mask;
int local_node, prev_node;
struct zonelist *zonelist;
int order = current_zonelist_order;

/* initialize zonelists */
- for (i = 0; i < MAX_ZONELISTS; i++) {
- zonelist = pgdat->node_zonelists + i;
- zonelist->_zones[0] = 0;
- }
+ zonelist = &pgdat->node_zonelist;
+ zonelist->_zones[0] = 0;

/* NUMA-aware ordering of nodes */
local_node = pgdat->node_id;
@@ -2350,8 +2360,6 @@ static void build_zonelists(pg_data_t *p
/* calculate node order -- i.e., DMA last! */
build_zonelists_in_zone_order(pgdat, j);
}
-
- build_thisnode_zonelists(pgdat);
}

/* Construct the zonelist performance cache - see further mmzone.h */
@@ -2361,7 +2369,7 @@ static void build_zonelist_cache(pg_data
struct zonelist_cache *zlc;
unsigned long *z;

- zonelist = &pgdat->node_zonelists[0];
+ zonelist = &pgdat->node_zonelist;
zonelist->zlcache_ptr = zlc = &zonelist->zlcache;
bitmap_zero(zlc->fullzones, MAX_ZONES_PER_ZONELIST);
for (z = zonelist->_zones; *z; z++)
@@ -2385,7 +2393,7 @@ static void build_zonelists(pg_data_t *p

local_node = pgdat->node_id;

- zonelist = &pgdat->node_zonelists[0];
+ zonelist = &pgdat->node_zonelist;
j = build_zonelists_node(pgdat, zonelist, 0, MAX_NR_ZONES - 1);

/*
@@ -2415,8 +2423,7 @@ static void build_zonelists(pg_data_t *p
/* non-NUMA variant of zonelist performance cache - just NULL zlcache_ptr */
static void build_zonelist_cache(pg_data_t *pgdat)
{
- pgdat->node_zonelists[0].zlcache_ptr = NULL;
- pgdat->node_zonelists[1].zlcache_ptr = NULL;
+ pgdat->node_zonelist.zlcache_ptr = NULL;
}

#endif /* CONFIG_NUMA */
diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.23-rc4-mm1-030_filter_nodemask/mm/slab.c linux-2.6.23-rc4-mm1-040_use_one_zonelist/mm/slab.c
--- linux-2.6.23-rc4-mm1-030_filter_nodemask/mm/slab.c 2007-09-10 16:06:31.000000000 +0100
+++ linux-2.6.23-rc4-mm1-040_use_one_zonelist/mm/slab.c 2007-09-11 13:43:28.000000000 +0100
@@ -3250,7 +3250,7 @@ static void *fallback_alloc(struct kmem_
if (flags & __GFP_THISNODE)
return NULL;

- zonelist = node_zonelist(slab_node(current->mempolicy), flags);
+ zonelist = node_zonelist(slab_node(current->mempolicy));
local_flags = flags & (GFP_CONSTRAINT_MASK|GFP_RECLAIM_MASK);

retry:
diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.23-rc4-mm1-030_filter_nodemask/mm/slub.c linux-2.6.23-rc4-mm1-040_use_one_zonelist/mm/slub.c
--- linux-2.6.23-rc4-mm1-030_filter_nodemask/mm/slub.c 2007-09-10 16:06:31.000000000 +0100
+++ linux-2.6.23-rc4-mm1-040_use_one_zonelist/mm/slub.c 2007-09-11 13:43:28.000000000 +0100
@@ -1283,7 +1283,7 @@ static struct page *get_any_partial(stru
if (!s->defrag_ratio || get_cycles() % 1024 > s->defrag_ratio)
return NULL;

- zonelist = node_zonelist(slab_node(current->mempolicy), flags);
+ zonelist = node_zonelist(slab_node(current->mempolicy));
for_each_zone_zonelist(zone, z, zonelist, high_zoneidx) {
struct kmem_cache_node *n;
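
The effect of patch 6/6 can be illustrated with a small userspace model: instead of keeping a second, node-local zonelist, a THISNODE request walks the same list with a temporary one-node mask. Everything below is an assumption made for illustration, including the TOY_GFP_THISNODE value, the toy_* names, and the simplification that an "allocation" just returns the node it would be served from.

```c
#include <stddef.h>

#define TOY_GFP_THISNODE 0x1u	/* hypothetical flag, not the kernel's value */

/* One bit per node. A NULL pointer means "any node". */
typedef unsigned long toy_nodemask;

/* Hypothetical stand-in for a zonelist entry. */
struct toy_zone {
	int node;			/* node that owns the zone */
	unsigned long free_pages;	/* 0 means the zone is exhausted */
};

/* Build a mask with only the local node set. */
static toy_nodemask *toy_nodemask_thisnode(toy_nodemask *mask, int local_node)
{
	*mask = 1UL << local_node;
	return mask;
}

/* Return the node of the first allowed zone with free pages, or -1. */
static int toy_alloc_internal(const struct toy_zone *zonelist, size_t nr,
			      const toy_nodemask *mask)
{
	size_t i;

	for (i = 0; i < nr; i++) {
		if (mask && !((*mask >> zonelist[i].node) & 1UL))
			continue;
		if (zonelist[i].free_pages > 0)
			return zonelist[i].node;
	}
	return -1;
}

/* THISNODE requests get a temporary one-node mask; others are unfiltered. */
static int toy_alloc(unsigned int flags, int local_node,
		     const struct toy_zone *zonelist, size_t nr)
{
	if (flags & TOY_GFP_THISNODE) {
		toy_nodemask mask;

		return toy_alloc_internal(zonelist, nr,
					  toy_nodemask_thisnode(&mask, local_node));
	}
	return toy_alloc_internal(zonelist, nr, NULL);
}
```

With node 0 exhausted and node 1 holding free pages, an unfiltered request made on node 0 falls back to node 1, while a TOY_GFP_THISNODE request made on node 0 fails rather than falling back.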

2007-09-12 07:50:58

by Kamezawa Hiroyuki

Subject: Re: [PATCH 4/6] Embed zone_id information within the zonelist->zones pointer

On Tue, 11 Sep 2007 22:31:27 +0100 (IST)
Mel Gorman <[email protected]> wrote:

> Using two zonelists per node requires very frequent use of zone_idx(). This
> is costly as it involves a lookup of another structure and a subtraction
> operation. As struct zone is always word aligned and normally cache-line
> aligned, the pointer values have a number of 0's at the least significant
> bits of the address.
>
> This patch embeds the zone_id of a zone in the zonelist->zones pointers.
> The real zone pointer is retrieved using the zonelist_zone() helper function.
> The ID of the zone is found using zonelist_zone_idx(). To avoid accidental
> references, the zones field is renamed to _zones and the type changed to
> unsigned long.
>
First of all, I welcome this patch. Thanks!


One comment after reading it all: how about defining the zonelist as follows,
instead of encoding the index in the pointer?
==
struct zone_pointer {
struct zone *zone;
int node_id;
int zone_idx;
};

struct zonelist {
struct zone_pointer _zones[MAX_ZONES_PER_ZONELIST + 1];
};

#define zonelist_zone(zl) ((zl)->zone)
#define zonelist_zone_idx(zl) ((zl)->zone_idx)
#ifdef CONFIG_NUMA
#define zonelist_zone_nid(zl) ((zl)->node_id)
#else
#define zonelist_zone_nid(zl) (0)
#endif
==

If we really want to avoid unnecessary accesses to "zone" while walking the
zonelist, the above may help. The downside is that it makes the zonelist bigger.

Thanks,
-Kame
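
For comparison with the struct proposed above, the pointer-encoding approach described in the quoted patch text can be sketched in userspace C roughly as follows. The two-bit width, the 64-byte alignment and all toy_* names are assumptions made for this illustration; the actual helpers from patch 4/6 are not reproduced here.

```c
#include <stdint.h>

/* Assumed for this sketch: two low bits are enough to hold the zone index. */
#define TOY_ZONEID_MASK ((uintptr_t)0x3)

/*
 * Hypothetical stand-in for struct zone. The forced alignment guarantees
 * that the low bits of any pointer to it are zero.
 */
struct toy_zone {
	_Alignas(64) int node;
};

/* Pack the zone index into the unused low bits of the zone's address. */
static uintptr_t toy_encode_zone_idx(struct toy_zone *zone,
				     unsigned int zone_idx)
{
	return (uintptr_t)zone | ((uintptr_t)zone_idx & TOY_ZONEID_MASK);
}

/* Recover the real pointer by masking the index bits off again. */
static struct toy_zone *toy_zonelist_zone(uintptr_t encoded)
{
	return (struct toy_zone *)(encoded & ~TOY_ZONEID_MASK);
}

/* Read the index without touching the zone structure itself. */
static unsigned int toy_zonelist_zone_idx(uintptr_t encoded)
{
	return (unsigned int)(encoded & TOY_ZONEID_MASK);
}
```

The point of the encoding is that toy_zonelist_zone_idx() needs only the list entry, so a walk that skips zones by index never dereferences the zone.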

2007-09-12 08:32:18

by mel

Subject: Re: [PATCH 4/6] Embed zone_id information within the zonelist->zones pointer

On (12/09/07 16:51), KAMEZAWA Hiroyuki didst pronounce:
> On Tue, 11 Sep 2007 22:31:27 +0100 (IST)
> Mel Gorman <[email protected]> wrote:
>
> > Using two zonelists per node requires very frequent use of zone_idx(). This
> > is costly as it involves a lookup of another structure and a subtraction
> > operation. As struct zone is always word aligned and normally cache-line
> > aligned, the pointer values have a number of 0's at the least significant
> > bits of the address.
> >
> > This patch embeds the zone_id of a zone in the zonelist->zones pointers.
> > The real zone pointer is retrieved using the zonelist_zone() helper function.
> > The ID of the zone is found using zonelist_zone_idx(). To avoid accidental
> > references, the zones field is renamed to _zones and the type changed to
> > unsigned long.
> >
> First of all, I welcome this patch. Thanks!
>
>
> One comment after reading it all: how about defining the zonelist as follows,
> instead of encoding the index in the pointer?
> ==
> struct zone_pointer {
> 	struct zone *zone;
> 	int node_id;
> 	int zone_idx;
> };
>
> struct zonelist {
> 	struct zone_pointer _zones[MAX_ZONES_PER_ZONELIST + 1];
> };
>
> #define zonelist_zone(zl)	((zl)->zone)
> #define zonelist_zone_idx(zl)	((zl)->zone_idx)
> #ifdef CONFIG_NUMA
> #define zonelist_zone_nid(zl)	((zl)->node_id)
> #else
> #define zonelist_zone_nid(zl)	(0)
> #endif
> ==
>

That is an interesting idea. I'll try it out and see what happens.
Thanks

> If we really want to avoid unnecessary accesses to "zone" while walking the
> zonelist, the above may help. The downside is that it makes sizeof(struct
> zonelist) bigger.
>

While this is true, it might not affect the size of struct zone because
of padding. There is one way to find out.

--
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab

2007-09-12 17:07:27

by Christoph Lameter

Subject: Re: [PATCH 4/6] Embed zone_id information within the zonelist->zones pointer

On Wed, 12 Sep 2007, KAMEZAWA Hiroyuki wrote:

> If we really want to avoid unnecessary accesses to "zone" while walking the
> zonelist, the above may help. The downside is that it makes sizeof(struct
> zonelist) bigger.

The trouble is that the size of the zonelist would double with this
approach. We have long zonelists and doubling the size could double
the cachelines needed to be touched in order to scan the zonelists.
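
As a rough illustration of the size concern, the two layouts can be compared
in a small userspace sketch. The names zonelist_encoded, zonelist_struct and
cache_lines() are invented for this comparison, and the values chosen for
MAX_ZONES_PER_ZONELIST and the cache-line size are assumptions, not the
kernel's.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct zone;				/* opaque here, only pointers are used */

#define MAX_ZONES_PER_ZONELIST	64	/* illustrative value only */
#define CACHE_LINE_BYTES	64	/* assumed cache-line size */

/* Layout with the zone index encoded in the pointer: one word per entry. */
struct zonelist_encoded {
	uintptr_t _zones[MAX_ZONES_PER_ZONELIST + 1];
};

/* Layout proposed in the thread: pointer plus two ints per entry. */
struct zone_pointer {
	struct zone *zone;
	int node_id;
	int zone_idx;
};

struct zonelist_struct {
	struct zone_pointer _zones[MAX_ZONES_PER_ZONELIST + 1];
};

/* Each struct entry is at least twice as large as an encoded entry. */
static_assert(sizeof(struct zone_pointer) >= 2 * sizeof(uintptr_t),
	      "struct entry should be at least two words");

/* Number of cache lines a full scan of 'bytes' bytes would touch,
 * assuming the array starts on a cache-line boundary. */
static size_t cache_lines(size_t bytes)
{
	return (bytes + CACHE_LINE_BYTES - 1) / CACHE_LINE_BYTES;
}
```

On a typical 64-bit target with 8-byte pointers and 4-byte ints, each entry
grows from 8 to 16 bytes, so cache_lines(sizeof(struct zonelist_struct)) comes
out at roughly twice cache_lines(sizeof(struct zonelist_encoded)). A walk that
stops early touches fewer lines in either layout, so the real cost depends on
how far down the list allocations usually have to go.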

2007-09-12 20:28:38

by Lee Schermerhorn

Subject: Re: [PATCH 0/6] Use one zonelist per node instead of multiple zonelists v5 (resend)

On Tue, 2007-09-11 at 22:30 +0100, Mel Gorman wrote:
> (Sorry for the resend, I mucked up the TO: line in the earlier sending)
>
> This is the latest version of one-zonelist and it should be solid enough
> for wider testing. To briefly summarise, the patchset replaces multiple
> zonelists-per-node with one zonelist that is filtered based on nodemask and
> GFP flags. I've dropped the patch that replaces inline functions with macros
> from the end as it obscures the code for something that may or may not be a
> performance benefit on older compilers. If we see performance regressions that
> might have something to do with it, the patch is trivial to bring forward.
>
> Andrew, please merge to -mm for wider testing and consideration for merging
> to mainline. Minimally, it gets rid of the hack in relation to ZONE_MOVABLE
> and MPOL_BIND.


Mel:

I'm just getting to this after sorting out an issue with the memory
controller stuff in 23-rc4-mm1. I'm building all my kernels with the
memory controller enabled now, as it hits areas that I'm playing in. I
wanted to give you a heads up that vmscan.c doesn't build with
CONTAINER_MEM_CONT configured with your patches. I won't get to this
until tomorrow. Since you're a few hours ahead of me, you might want to
take a look. No worries, if you don't get a chance...

Later,
Lee

2007-09-13 10:11:54

by mel

Subject: Re: [PATCH 0/6] Use one zonelist per node instead of multiple zonelists v5 (resend)

On (12/09/07 16:27), Lee Schermerhorn didst pronounce:
> On Tue, 2007-09-11 at 22:30 +0100, Mel Gorman wrote:
> > (Sorry for the resend, I mucked up the TO: line in the earlier sending)
> >
> > This is the latest version of one-zonelist and it should be solid enough
> > for wider testing. To briefly summarise, the patchset replaces multiple
> > zonelists-per-node with one zonelist that is filtered based on nodemask and
> > GFP flags. I've dropped the patch that replaces inline functions with macros
> > from the end as it obscures the code for something that may or may not be a
> > performance benefit on older compilers. If we see performance regressions that
> > might have something to do with it, the patch is trivial to bring forward.
> >
> > Andrew, please merge to -mm for wider testing and consideration for merging
> > to mainline. Minimally, it gets rid of the hack in relation to ZONE_MOVABLE
> > and MPOL_BIND.
>
>
> Mel:
>
> I'm just getting to this after sorting out an issue with the memory
> controller stuff in 23-rc4-mm1. I'm building all my kernels with the
> memory controller enabled now, as it hits areas that I'm playing in. I
> wanted to give you a heads up that vmscan.c doesn't build with
> CONTAINER_MEM_CONT configured with your patches. I won't get to this
> until tomorrow. Since you're a few hours ahead of me, you might want to
> take a look. No worries, if you don't get a chance...
>

Thanks a lot. I took a look and you're right. Does the following patch
fix it for you?

====

Fix a compile bug with one-zonelist and the memory controller.

Signed-off-by: Mel Gorman <[email protected]>
---
diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.23-rc4-mm1-onezonelist_v5r21/mm/vmscan.c linux-2.6.23-rc4-mm1-onezonelist_v5r21-fix/mm/vmscan.c
--- linux-2.6.23-rc4-mm1-onezonelist_v5r21/mm/vmscan.c 2007-09-12 10:00:22.000000000 +0100
+++ linux-2.6.23-rc4-mm1-onezonelist_v5r21-fix/mm/vmscan.c 2007-09-13 11:09:49.000000000 +0100
@@ -1368,11 +1368,11 @@ unsigned long try_to_free_mem_container_
 		.isolate_pages = mem_container_isolate_pages,
 	};
 	int node;
-	struct zone **zones;
+	struct zonelist *zonelist;
 
 	for_each_online_node(node) {
-		zones = NODE_DATA(node)->node_zonelists[ZONE_USERPAGES].zones;
-		if (do_try_to_free_pages(zones, sc.gfp_mask, &sc))
+		zonelist = &NODE_DATA(node)->node_zonelist;
+		if (do_try_to_free_pages(zonelist, sc.gfp_mask, &sc))
 			return 1;
 	}
 	return 0;