2009-03-16 09:45:16

by Mel Gorman

Subject: [PATCH 00/35] Cleanup and optimise the page allocator V3

Here is V3 of an attempt to clean up and optimise the page allocator and it
should be ready for general testing. The page allocator is now faster (16%
reduced time overall for kernbench on one machine) and it has a smaller cache
footprint (16.5% fewer L1 cache misses and 19.5% fewer L2 cache misses for
kernbench on one machine). The text footprint has unfortunately increased,
largely due to the introduction of a form of lazy buddy merging mechanism
that avoids cache misses by postponing buddy merging until a high-order
allocation needs it.
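
For readers unfamiliar with the idea, below is a minimal C sketch of what
lazy buddy merging means in principle. It is purely illustrative; the names
are made up and it is not the code in this series: frees simply park pages
on a per-order free list, and the cost of coalescing buddies is only paid
when an allocation at that order finds the list empty.

#include <stddef.h>

#define MAX_ORDER 11

struct lpage {				/* stand-in for struct page */
	struct lpage *next;
	unsigned int order;
};

struct lazy_area {
	struct lpage *free_list[MAX_ORDER];
	unsigned long deferred;		/* frees since the last merge pass */
};

/* Free fast path: no buddy lookup, so no cache miss on the buddy's
 * struct page.  The page is just pushed onto its order's list. */
static void lazy_free(struct lazy_area *area, struct lpage *page,
			unsigned int order)
{
	page->order = order;
	page->next = area->free_list[order];
	area->free_list[order] = page;
	area->deferred++;
}

/* Allocation: coalescing is only attempted when a request finds its
 * order's list empty and deferred frees exist; the merge itself and the
 * fallback to splitting a larger block are elided. */
static struct lpage *lazy_alloc(struct lazy_area *area, unsigned int order)
{
	struct lpage *page;

	if (!area->free_list[order] && area->deferred) {
		/* merge_deferred(area); -- hypothetical merge pass */
		area->deferred = 0;
	}

	page = area->free_list[order];
	if (page)
		area->free_list[order] = page->next;
	return page;
}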

I tested the patchset with kernbench, hackbench, sysbench-postgres and netperf
UDP and TCP with a variety of sizes. Many machines and loads showed improved
performance *however* it was not universal. On some machines, one load would
be faster and another slower (perversely, sometimes netperf-UDP would be
faster with netperf-TCP slower). On different machines, the workloads
that gained or lost would differ. I haven't fully pinned down why this is
yet, but I have observed on at least one machine that lock contention is
higher and more time is spent in functions like rb_erase(), both of which
might imply some sort of scheduling artifact. I've also noted that while the
allocator incurs fewer cache misses, sometimes cache misses overall are
increased for the workload, but the increased lock contention might account
for this.

In some cases, more time is spent in copy_user_generic_string()[1], which
might imply that strings are getting the same colour now that greater effort
is spent giving back hot pages, but theories as to why this is not a
universal effect are welcome. I've also noted that machines with many CPUs
with different caches suffer because struct page is not cache-aligned, but
aligning it hurts other machines so I left it alone. Finally, the performance
characteristics vary depending on whether you use SLAB, SLUB or SLQB.

So, while the page allocator is faster in most cases, making all workloads
go universally faster now requires looking at other areas like the sl*b
allocator and the scheduler.

Here is the patchset as it stands and I think it's ready for wider testing
and to be considered for merging depending on the outcome of testing and
reviews.

[1] copy_user_generic_unrolled on one machine was slowed down by an extreme
amount. I did not check if there was a pattern of slowdowns versus which
version of copy_user_generic() was used.

Changes since V2
o Remove branches by treating watermark flags as array indices
o Remove branch by assuming __GFP_HIGH == ALLOC_HIGH
o Do not check for compound on every page free
o Remove branch by always ensuring the migratetype is known on free
o Simplify buffered_rmqueue further
o Reintroduce improved version of batched bulk free of pcp pages
o Use allocation flags as an index to zone watermarks (see the sketch after
this list)
o Work out __GFP_COLD only once
o Reduce the number of times zone stats are updated
o Do not dump reserve pages back into the allocator. Instead treat them
as MOVABLE so that MIGRATE_RESERVE gets used on the max-order-overlapped
boundaries without causing trouble
o Allow pages up to PAGE_ALLOC_COSTLY_ORDER to use the per-cpu allocator.
order-1 allocations in particular are frequent enough to justify this
o Rearrange inlining such that the hot-path is inlined but not in a way
that increases the text size of the page allocator
o Make the check for needing additional zonelist filtering due to NUMA
or cpusets as light as possible
o Do not destroy compound pages going to the PCP lists
o Delay the merging of buddies until a high-order allocation needs them
or anti-fragmentation is being forced to fallback
o Count high-order pages as 1
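
As an illustration of the "allocation flags as an index to zone watermarks"
item above, here is a standalone sketch. The values and field names are made
up for this example and do not match the ALLOC_* flags used by the patches in
this series; the point is that the watermark selector in the allocation flags
is used directly as an array index, removing the branch that previously
picked between pages_min, pages_low and pages_high:

#include <stdio.h>

/* Illustrative values only; they do not match the real ALLOC_* flags. */
#define WMARK_MIN	0
#define WMARK_LOW	1
#define WMARK_HIGH	2
#define WMARK_MASK	0x3

struct zone_marks {
	unsigned long pages_mark[3];	/* min, low and high watermarks */
};

/* Branch-free watermark selection: the low bits of alloc_flags name the
 * watermark, so a single array lookup replaces the old if/else chain.
 * Only WMARK_MIN/LOW/HIGH are valid selectors in this sketch. */
static unsigned long wmark_pages(const struct zone_marks *z, int alloc_flags)
{
	return z->pages_mark[alloc_flags & WMARK_MASK];
}

int main(void)
{
	struct zone_marks z = { { 128, 160, 192 } };

	printf("low watermark: %lu pages\n", wmark_pages(&z, WMARK_LOW));
	return 0;
}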

Changes since V1
o Remove the ifdef CONFIG_CPUSETS from inside get_page_from_freelist()
o Use non-lock bit operations for clearing the mlock flag
o Factor out alloc_flags calculation so it is only done once (Peter)
o Make gfp.h a bit prettier and clear-cut (Peter)
o Instead of deleting a debugging check, replace page_count() in the
free path with a version that does not check for compound pages (Nick)
o Drop the alteration for hot/cold page freeing until we know if it
helps or not


2009-03-16 09:44:35

by Mel Gorman

Subject: [PATCH 02/35] Do not sanity check order in the fast path

No user of the allocator API should be passing in an order >= MAX_ORDER
but we check for it on each and every allocation. Delete this check and
make it a VM_BUG_ON check further down the call path.

Signed-off-by: Mel Gorman <[email protected]>
---
include/linux/gfp.h | 6 ------
mm/page_alloc.c | 2 ++
2 files changed, 2 insertions(+), 6 deletions(-)

diff --git a/include/linux/gfp.h b/include/linux/gfp.h
index dcf0ab8..8736047 100644
--- a/include/linux/gfp.h
+++ b/include/linux/gfp.h
@@ -181,9 +181,6 @@ __alloc_pages(gfp_t gfp_mask, unsigned int order,
static inline struct page *alloc_pages_node(int nid, gfp_t gfp_mask,
unsigned int order)
{
- if (unlikely(order >= MAX_ORDER))
- return NULL;
-
/* Unknown node is current node */
if (nid < 0)
nid = numa_node_id();
@@ -197,9 +194,6 @@ extern struct page *alloc_pages_current(gfp_t gfp_mask, unsigned order);
static inline struct page *
alloc_pages(gfp_t gfp_mask, unsigned int order)
{
- if (unlikely(order >= MAX_ORDER))
- return NULL;
-
return alloc_pages_current(gfp_mask, order);
}
extern struct page *alloc_page_vma(gfp_t gfp_mask,
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 0671b3f..dd87dad 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1407,6 +1407,8 @@ get_page_from_freelist(gfp_t gfp_mask, nodemask_t *nodemask, unsigned int order,

classzone_idx = zone_idx(preferred_zone);

+ VM_BUG_ON(order >= MAX_ORDER);
+
zonelist_scan:
/*
* Scan zonelist, looking for a zone with enough free.
--
1.5.6.5

2009-03-16 09:44:54

by Mel Gorman

Subject: [PATCH 01/35] Replace __alloc_pages_internal() with __alloc_pages_nodemask()

__alloc_pages_internal is the core page allocator function but
essentially it is an alias of __alloc_pages_nodemask. Naming a publicly
available and exported function "internal" is also a bit ugly. This
patch renames __alloc_pages_internal() to __alloc_pages_nodemask() and
deletes the old nodemask function.

Warning - This patch renames an exported symbol. No kernel driver is
affected, but external drivers calling __alloc_pages_internal() should
change the call to __alloc_pages_nodemask() without any alteration of
parameters.

Signed-off-by: Mel Gorman <[email protected]>
---
include/linux/gfp.h | 12 ++----------
mm/page_alloc.c | 4 ++--
2 files changed, 4 insertions(+), 12 deletions(-)

diff --git a/include/linux/gfp.h b/include/linux/gfp.h
index dd20cd7..dcf0ab8 100644
--- a/include/linux/gfp.h
+++ b/include/linux/gfp.h
@@ -168,24 +168,16 @@ static inline void arch_alloc_page(struct page *page, int order) { }
#endif

struct page *
-__alloc_pages_internal(gfp_t gfp_mask, unsigned int order,
+__alloc_pages_nodemask(gfp_t gfp_mask, unsigned int order,
struct zonelist *zonelist, nodemask_t *nodemask);

static inline struct page *
__alloc_pages(gfp_t gfp_mask, unsigned int order,
struct zonelist *zonelist)
{
- return __alloc_pages_internal(gfp_mask, order, zonelist, NULL);
+ return __alloc_pages_nodemask(gfp_mask, order, zonelist, NULL);
}

-static inline struct page *
-__alloc_pages_nodemask(gfp_t gfp_mask, unsigned int order,
- struct zonelist *zonelist, nodemask_t *nodemask)
-{
- return __alloc_pages_internal(gfp_mask, order, zonelist, nodemask);
-}
-
-
static inline struct page *alloc_pages_node(int nid, gfp_t gfp_mask,
unsigned int order)
{
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 5c44ed4..0671b3f 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1464,7 +1464,7 @@ try_next_zone:
* This is the 'heart' of the zoned buddy allocator.
*/
struct page *
-__alloc_pages_internal(gfp_t gfp_mask, unsigned int order,
+__alloc_pages_nodemask(gfp_t gfp_mask, unsigned int order,
struct zonelist *zonelist, nodemask_t *nodemask)
{
const gfp_t wait = gfp_mask & __GFP_WAIT;
@@ -1670,7 +1670,7 @@ nopage:
got_pg:
return page;
}
-EXPORT_SYMBOL(__alloc_pages_internal);
+EXPORT_SYMBOL(__alloc_pages_nodemask);

/*
* Common helper functions.
--
1.5.6.5

2009-03-16 09:45:41

by Mel Gorman

Subject: [PATCH 03/35] Do not check NUMA node ID when the caller knows the node is valid

Callers of alloc_pages_node() can optionally specify -1 as a node to mean
"allocate from the current node". However, a number of the callers in fast
paths know for a fact their node is valid. To avoid a comparison and branch,
this patch adds alloc_pages_exact_node() that only checks the nid with
VM_BUG_ON(). Callers that know their node is valid are then converted.

Signed-off-by: Mel Gorman <[email protected]>
Reviewed-by: Christoph Lameter <[email protected]>
---
arch/ia64/hp/common/sba_iommu.c | 2 +-
arch/ia64/kernel/mca.c | 3 +--
arch/ia64/kernel/uncached.c | 3 ++-
arch/ia64/sn/pci/pci_dma.c | 3 ++-
arch/powerpc/platforms/cell/ras.c | 2 +-
arch/x86/kvm/vmx.c | 2 +-
drivers/misc/sgi-gru/grufile.c | 2 +-
drivers/misc/sgi-xp/xpc_uv.c | 2 +-
include/linux/gfp.h | 9 +++++++++
include/linux/mm.h | 1 -
kernel/profile.c | 8 ++++----
mm/filemap.c | 2 +-
mm/hugetlb.c | 4 ++--
mm/mempolicy.c | 2 +-
mm/migrate.c | 2 +-
mm/slab.c | 4 ++--
mm/slob.c | 4 ++--
17 files changed, 32 insertions(+), 23 deletions(-)

diff --git a/arch/ia64/hp/common/sba_iommu.c b/arch/ia64/hp/common/sba_iommu.c
index 6d5e6c5..66a3257 100644
--- a/arch/ia64/hp/common/sba_iommu.c
+++ b/arch/ia64/hp/common/sba_iommu.c
@@ -1116,7 +1116,7 @@ sba_alloc_coherent (struct device *dev, size_t size, dma_addr_t *dma_handle, gfp
#ifdef CONFIG_NUMA
{
struct page *page;
- page = alloc_pages_node(ioc->node == MAX_NUMNODES ?
+ page = alloc_pages_exact_node(ioc->node == MAX_NUMNODES ?
numa_node_id() : ioc->node, flags,
get_order(size));

diff --git a/arch/ia64/kernel/mca.c b/arch/ia64/kernel/mca.c
index bab1de2..2e614bd 100644
--- a/arch/ia64/kernel/mca.c
+++ b/arch/ia64/kernel/mca.c
@@ -1829,8 +1829,7 @@ ia64_mca_cpu_init(void *cpu_data)
data = mca_bootmem();
first_time = 0;
} else
- data = page_address(alloc_pages_node(numa_node_id(),
- GFP_KERNEL, get_order(sz)));
+ data = __get_free_pages(GFP_KERNEL, get_order(sz));
if (!data)
panic("Could not allocate MCA memory for cpu %d\n",
cpu);
diff --git a/arch/ia64/kernel/uncached.c b/arch/ia64/kernel/uncached.c
index 8eff8c1..6ba72ab 100644
--- a/arch/ia64/kernel/uncached.c
+++ b/arch/ia64/kernel/uncached.c
@@ -98,7 +98,8 @@ static int uncached_add_chunk(struct uncached_pool *uc_pool, int nid)

/* attempt to allocate a granule's worth of cached memory pages */

- page = alloc_pages_node(nid, GFP_KERNEL | __GFP_ZERO | GFP_THISNODE,
+ page = alloc_pages_exact_node(nid,
+ GFP_KERNEL | __GFP_ZERO | GFP_THISNODE,
IA64_GRANULE_SHIFT-PAGE_SHIFT);
if (!page) {
mutex_unlock(&uc_pool->add_chunk_mutex);
diff --git a/arch/ia64/sn/pci/pci_dma.c b/arch/ia64/sn/pci/pci_dma.c
index 863f501..2aa52de 100644
--- a/arch/ia64/sn/pci/pci_dma.c
+++ b/arch/ia64/sn/pci/pci_dma.c
@@ -91,7 +91,8 @@ void *sn_dma_alloc_coherent(struct device *dev, size_t size,
*/
node = pcibus_to_node(pdev->bus);
if (likely(node >=0)) {
- struct page *p = alloc_pages_node(node, flags, get_order(size));
+ struct page *p = alloc_pages_exact_node(node,
+ flags, get_order(size));

if (likely(p))
cpuaddr = page_address(p);
diff --git a/arch/powerpc/platforms/cell/ras.c b/arch/powerpc/platforms/cell/ras.c
index 5f961c4..16ba671 100644
--- a/arch/powerpc/platforms/cell/ras.c
+++ b/arch/powerpc/platforms/cell/ras.c
@@ -122,7 +122,7 @@ static int __init cbe_ptcal_enable_on_node(int nid, int order)

area->nid = nid;
area->order = order;
- area->pages = alloc_pages_node(area->nid, GFP_KERNEL, area->order);
+ area->pages = alloc_pages_exact_node(area->nid, GFP_KERNEL, area->order);

if (!area->pages)
goto out_free_area;
diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index 7611af5..cca119a 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -1244,7 +1244,7 @@ static struct vmcs *alloc_vmcs_cpu(int cpu)
struct page *pages;
struct vmcs *vmcs;

- pages = alloc_pages_node(node, GFP_KERNEL, vmcs_config.order);
+ pages = alloc_pages_exact_node(node, GFP_KERNEL, vmcs_config.order);
if (!pages)
return NULL;
vmcs = page_address(pages);
diff --git a/drivers/misc/sgi-gru/grufile.c b/drivers/misc/sgi-gru/grufile.c
index 6509838..52d4160 100644
--- a/drivers/misc/sgi-gru/grufile.c
+++ b/drivers/misc/sgi-gru/grufile.c
@@ -309,7 +309,7 @@ static int gru_init_tables(unsigned long gru_base_paddr, void *gru_base_vaddr)
pnode = uv_node_to_pnode(nid);
if (gru_base[bid])
continue;
- page = alloc_pages_node(nid, GFP_KERNEL, order);
+ page = alloc_pages_exact_node(nid, GFP_KERNEL, order);
if (!page)
goto fail;
gru_base[bid] = page_address(page);
diff --git a/drivers/misc/sgi-xp/xpc_uv.c b/drivers/misc/sgi-xp/xpc_uv.c
index 29c0502..0563350 100644
--- a/drivers/misc/sgi-xp/xpc_uv.c
+++ b/drivers/misc/sgi-xp/xpc_uv.c
@@ -184,7 +184,7 @@ xpc_create_gru_mq_uv(unsigned int mq_size, int cpu, char *irq_name,
mq->mmr_blade = uv_cpu_to_blade_id(cpu);

nid = cpu_to_node(cpu);
- page = alloc_pages_node(nid, GFP_KERNEL | __GFP_ZERO | GFP_THISNODE,
+ page = alloc_pages_exact_node(nid, GFP_KERNEL | __GFP_ZERO | GFP_THISNODE,
pg_order);
if (page == NULL) {
dev_err(xpc_part, "xpc_create_gru_mq_uv() failed to alloc %d "
diff --git a/include/linux/gfp.h b/include/linux/gfp.h
index 8736047..59eb093 100644
--- a/include/linux/gfp.h
+++ b/include/linux/gfp.h
@@ -4,6 +4,7 @@
#include <linux/mmzone.h>
#include <linux/stddef.h>
#include <linux/linkage.h>
+#include <linux/mmdebug.h>

struct vm_area_struct;

@@ -188,6 +189,14 @@ static inline struct page *alloc_pages_node(int nid, gfp_t gfp_mask,
return __alloc_pages(gfp_mask, order, node_zonelist(nid, gfp_mask));
}

+static inline struct page *alloc_pages_exact_node(int nid, gfp_t gfp_mask,
+ unsigned int order)
+{
+ VM_BUG_ON(nid < 0 || nid >= MAX_NUMNODES);
+
+ return __alloc_pages(gfp_mask, order, node_zonelist(nid, gfp_mask));
+}
+
#ifdef CONFIG_NUMA
extern struct page *alloc_pages_current(gfp_t gfp_mask, unsigned order);

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 065cdf8..565e7b2 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -7,7 +7,6 @@

#include <linux/gfp.h>
#include <linux/list.h>
-#include <linux/mmdebug.h>
#include <linux/mmzone.h>
#include <linux/rbtree.h>
#include <linux/prio_tree.h>
diff --git a/kernel/profile.c b/kernel/profile.c
index 7724e04..62e08db 100644
--- a/kernel/profile.c
+++ b/kernel/profile.c
@@ -371,7 +371,7 @@ static int __cpuinit profile_cpu_callback(struct notifier_block *info,
node = cpu_to_node(cpu);
per_cpu(cpu_profile_flip, cpu) = 0;
if (!per_cpu(cpu_profile_hits, cpu)[1]) {
- page = alloc_pages_node(node,
+ page = alloc_pages_exact_node(node,
GFP_KERNEL | __GFP_ZERO,
0);
if (!page)
@@ -379,7 +379,7 @@ static int __cpuinit profile_cpu_callback(struct notifier_block *info,
per_cpu(cpu_profile_hits, cpu)[1] = page_address(page);
}
if (!per_cpu(cpu_profile_hits, cpu)[0]) {
- page = alloc_pages_node(node,
+ page = alloc_pages_exact_node(node,
GFP_KERNEL | __GFP_ZERO,
0);
if (!page)
@@ -570,14 +570,14 @@ static int create_hash_tables(void)
int node = cpu_to_node(cpu);
struct page *page;

- page = alloc_pages_node(node,
+ page = alloc_pages_exact_node(node,
GFP_KERNEL | __GFP_ZERO | GFP_THISNODE,
0);
if (!page)
goto out_cleanup;
per_cpu(cpu_profile_hits, cpu)[1]
= (struct profile_hit *)page_address(page);
- page = alloc_pages_node(node,
+ page = alloc_pages_exact_node(node,
GFP_KERNEL | __GFP_ZERO | GFP_THISNODE,
0);
if (!page)
diff --git a/mm/filemap.c b/mm/filemap.c
index 23acefe..2523d95 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -519,7 +519,7 @@ struct page *__page_cache_alloc(gfp_t gfp)
{
if (cpuset_do_page_mem_spread()) {
int n = cpuset_mem_spread_node();
- return alloc_pages_node(n, gfp, 0);
+ return alloc_pages_exact_node(n, gfp, 0);
}
return alloc_pages(gfp, 0);
}
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 107da3d..1e99997 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -630,7 +630,7 @@ static struct page *alloc_fresh_huge_page_node(struct hstate *h, int nid)
if (h->order >= MAX_ORDER)
return NULL;

- page = alloc_pages_node(nid,
+ page = alloc_pages_exact_node(nid,
htlb_alloc_mask|__GFP_COMP|__GFP_THISNODE|
__GFP_REPEAT|__GFP_NOWARN,
huge_page_order(h));
@@ -649,7 +649,7 @@ static struct page *alloc_fresh_huge_page_node(struct hstate *h, int nid)
* Use a helper variable to find the next node and then
* copy it back to hugetlb_next_nid afterwards:
* otherwise there's a window in which a racer might
- * pass invalid nid MAX_NUMNODES to alloc_pages_node.
+ * pass invalid nid MAX_NUMNODES to alloc_pages_exact_node.
* But we don't need to use a spin_lock here: it really
* doesn't matter if occasionally a racer chooses the
* same nid as we do. Move nid forward in the mask even
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index 3eb4a6f..341fbca 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -767,7 +767,7 @@ static void migrate_page_add(struct page *page, struct list_head *pagelist,

static struct page *new_node_page(struct page *page, unsigned long node, int **x)
{
- return alloc_pages_node(node, GFP_HIGHUSER_MOVABLE, 0);
+ return alloc_pages_exact_node(node, GFP_HIGHUSER_MOVABLE, 0);
}

/*
diff --git a/mm/migrate.c b/mm/migrate.c
index a9eff3f..6bda9c2 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -802,7 +802,7 @@ static struct page *new_page_node(struct page *p, unsigned long private,

*result = &pm->status;

- return alloc_pages_node(pm->node,
+ return alloc_pages_exact_node(pm->node,
GFP_HIGHUSER_MOVABLE | GFP_THISNODE, 0);
}

diff --git a/mm/slab.c b/mm/slab.c
index 4d00855..e7f1ded 100644
--- a/mm/slab.c
+++ b/mm/slab.c
@@ -1680,7 +1680,7 @@ static void *kmem_getpages(struct kmem_cache *cachep, gfp_t flags, int nodeid)
if (cachep->flags & SLAB_RECLAIM_ACCOUNT)
flags |= __GFP_RECLAIMABLE;

- page = alloc_pages_node(nodeid, flags, cachep->gfporder);
+ page = alloc_pages_exact_node(nodeid, flags, cachep->gfporder);
if (!page)
return NULL;

@@ -3210,7 +3210,7 @@ retry:
if (local_flags & __GFP_WAIT)
local_irq_enable();
kmem_flagcheck(cache, flags);
- obj = kmem_getpages(cache, local_flags, -1);
+ obj = kmem_getpages(cache, local_flags, numa_node_id());
if (local_flags & __GFP_WAIT)
local_irq_disable();
if (obj) {
diff --git a/mm/slob.c b/mm/slob.c
index 52bc8a2..d646a4c 100644
--- a/mm/slob.c
+++ b/mm/slob.c
@@ -46,7 +46,7 @@
* NUMA support in SLOB is fairly simplistic, pushing most of the real
* logic down to the page allocator, and simply doing the node accounting
* on the upper levels. In the event that a node id is explicitly
- * provided, alloc_pages_node() with the specified node id is used
+ * provided, alloc_pages_exact_node() with the specified node id is used
* instead. The common case (or when the node id isn't explicitly provided)
* will default to the current node, as per numa_node_id().
*
@@ -236,7 +236,7 @@ static void *slob_new_page(gfp_t gfp, int order, int node)

#ifdef CONFIG_NUMA
if (node != -1)
- page = alloc_pages_node(node, gfp, order);
+ page = alloc_pages_exact_node(node, gfp, order);
else
#endif
page = alloc_pages(gfp, order);
--
1.5.6.5

2009-03-16 09:46:36

by Mel Gorman

Subject: [PATCH 06/35] Move check for disabled anti-fragmentation out of fastpath

On low-memory systems, anti-fragmentation gets disabled as there is nothing
it can do and it would just incur overhead shuffling pages between lists
constantly. Currently the check is made in the free page fast path for every
page. This patch moves it to a slow path. On machines with low memory,
there will be a small amount of additional overhead as pages get shuffled
between lists but it should quickly settle.

Signed-off-by: Mel Gorman <[email protected]>
---
include/linux/mmzone.h | 3 ---
mm/page_alloc.c | 4 ++++
2 files changed, 4 insertions(+), 3 deletions(-)

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 1aca6ce..ca000b8 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -50,9 +50,6 @@ extern int page_group_by_mobility_disabled;

static inline int get_pageblock_migratetype(struct page *page)
{
- if (unlikely(page_group_by_mobility_disabled))
- return MIGRATE_UNMOVABLE;
-
return get_pageblock_flags_group(page, PB_migrate, PB_migrate_end);
}

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 7ba7705..d815c8f 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -171,6 +171,10 @@ int page_group_by_mobility_disabled __read_mostly;

static void set_pageblock_migratetype(struct page *page, int migratetype)
{
+
+ if (unlikely(page_group_by_mobility_disabled))
+ migratetype = MIGRATE_UNMOVABLE;
+
set_pageblock_flags_group(page, (unsigned long)migratetype,
PB_migrate, PB_migrate_end);
}
--
1.5.6.5

2009-03-16 09:45:56

by Mel Gorman

Subject: [PATCH 04/35] Check only once if the zonelist is suitable for the allocation

It is possible with __GFP_THISNODE that no zones are suitable. This
patch makes sure the check is only made once.

Signed-off-by: Mel Gorman <[email protected]>
Reviewed-by: Christoph Lameter <[email protected]>
---
mm/page_alloc.c | 6 +++---
1 files changed, 3 insertions(+), 3 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index dd87dad..8024abc 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1486,9 +1486,8 @@ __alloc_pages_nodemask(gfp_t gfp_mask, unsigned int order,
if (should_fail_alloc_page(gfp_mask, order))
return NULL;

-restart:
- z = zonelist->_zonerefs; /* the list of zones suitable for gfp_mask */
-
+ /* the list of zones suitable for gfp_mask */
+ z = zonelist->_zonerefs;
if (unlikely(!z->zone)) {
/*
* Happens if we have an empty zonelist as a result of
@@ -1497,6 +1496,7 @@ restart:
return NULL;
}

+restart:
page = get_page_from_freelist(gfp_mask|__GFP_HARDWALL, nodemask, order,
zonelist, high_zoneidx, ALLOC_WMARK_LOW|ALLOC_CPUSET);
if (page)
--
1.5.6.5

2009-03-16 09:46:20

by Mel Gorman

Subject: [PATCH 05/35] Break up the allocator entry point into fast and slow paths

The core of the page allocator is one giant function which allocates memory
on the stack and makes calculations that may not be needed for every
allocation. This patch breaks up the allocator path into fast and slow
paths for clarity. Note the slow paths are still inlined but the entry is
marked unlikely. If they were not inlined, the text size would actually
increase as there is only one call site.

Signed-off-by: Mel Gorman <[email protected]>
---
mm/page_alloc.c | 348 ++++++++++++++++++++++++++++++++++---------------------
1 files changed, 218 insertions(+), 130 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 8024abc..7ba7705 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1462,45 +1462,171 @@ try_next_zone:
return page;
}

-/*
- * This is the 'heart' of the zoned buddy allocator.
- */
-struct page *
-__alloc_pages_nodemask(gfp_t gfp_mask, unsigned int order,
- struct zonelist *zonelist, nodemask_t *nodemask)
+static inline int
+should_alloc_retry(gfp_t gfp_mask, unsigned int order,
+ unsigned long pages_reclaimed)
{
- const gfp_t wait = gfp_mask & __GFP_WAIT;
- enum zone_type high_zoneidx = gfp_zone(gfp_mask);
- struct zoneref *z;
- struct zone *zone;
- struct page *page;
- struct reclaim_state reclaim_state;
- struct task_struct *p = current;
- int do_retry;
- int alloc_flags;
- unsigned long did_some_progress;
- unsigned long pages_reclaimed = 0;
+ /* Do not loop if specifically requested */
+ if (gfp_mask & __GFP_NORETRY)
+ return 0;

- might_sleep_if(wait);
+ /*
+ * In this implementation, order <= PAGE_ALLOC_COSTLY_ORDER
+ * means __GFP_NOFAIL, but that may not be true in other
+ * implementations.
+ */
+ if (order <= PAGE_ALLOC_COSTLY_ORDER)
+ return 1;

- if (should_fail_alloc_page(gfp_mask, order))
- return NULL;
+ /*
+ * For order > PAGE_ALLOC_COSTLY_ORDER, if __GFP_REPEAT is
+ * specified, then we retry until we no longer reclaim any pages
+ * (above), or we've reclaimed an order of pages at least as
+ * large as the allocation's order. In both cases, if the
+ * allocation still fails, we stop retrying.
+ */
+ if (gfp_mask & __GFP_REPEAT && pages_reclaimed < (1 << order))
+ return 1;

- /* the list of zones suitable for gfp_mask */
- z = zonelist->_zonerefs;
- if (unlikely(!z->zone)) {
- /*
- * Happens if we have an empty zonelist as a result of
- * GFP_THISNODE being used on a memoryless node
- */
+ /*
+ * Don't let big-order allocations loop unless the caller
+ * explicitly requests that.
+ */
+ if (gfp_mask & __GFP_NOFAIL)
+ return 1;
+
+ return 0;
+}
+
+static inline struct page *
+__alloc_pages_may_oom(gfp_t gfp_mask, unsigned int order,
+ struct zonelist *zonelist, enum zone_type high_zoneidx,
+ nodemask_t *nodemask)
+{
+ struct page *page;
+
+ /* Acquire the OOM killer lock for the zones in zonelist */
+ if (!try_set_zone_oom(zonelist, gfp_mask)) {
+ schedule_timeout_uninterruptible(1);
return NULL;
}

-restart:
- page = get_page_from_freelist(gfp_mask|__GFP_HARDWALL, nodemask, order,
- zonelist, high_zoneidx, ALLOC_WMARK_LOW|ALLOC_CPUSET);
+ /*
+ * Go through the zonelist yet one more time, keep very high watermark
+ * here, this is only to catch a parallel oom killing, we must fail if
+ * we're still under heavy pressure.
+ */
+ page = get_page_from_freelist(gfp_mask|__GFP_HARDWALL, nodemask,
+ order, zonelist, high_zoneidx,
+ ALLOC_WMARK_HIGH|ALLOC_CPUSET);
if (page)
- goto got_pg;
+ goto out;
+
+ /* The OOM killer will not help higher order allocs */
+ if (order > PAGE_ALLOC_COSTLY_ORDER)
+ goto out;
+
+ /* Exhausted what can be done so it's blamo time */
+ out_of_memory(zonelist, gfp_mask, order);
+
+out:
+ clear_zonelist_oom(zonelist, gfp_mask);
+ return page;
+}
+
+/* The really slow allocator path where we enter direct reclaim */
+static inline struct page *
+__alloc_pages_direct_reclaim(gfp_t gfp_mask, unsigned int order,
+ struct zonelist *zonelist, enum zone_type high_zoneidx,
+ nodemask_t *nodemask, int alloc_flags, unsigned long *did_some_progress)
+{
+ struct page *page = NULL;
+ struct reclaim_state reclaim_state;
+ struct task_struct *p = current;
+
+ cond_resched();
+
+ /* We now go into synchronous reclaim */
+ cpuset_memory_pressure_bump();
+
+ /*
+ * The task's cpuset might have expanded its set of allowable nodes
+ */
+ cpuset_update_task_memory_state();
+ p->flags |= PF_MEMALLOC;
+ reclaim_state.reclaimed_slab = 0;
+ p->reclaim_state = &reclaim_state;
+
+ *did_some_progress = try_to_free_pages(zonelist, order, gfp_mask);
+
+ p->reclaim_state = NULL;
+ p->flags &= ~PF_MEMALLOC;
+
+ cond_resched();
+
+ if (order != 0)
+ drain_all_pages();
+
+ if (likely(*did_some_progress))
+ page = get_page_from_freelist(gfp_mask, nodemask, order,
+ zonelist, high_zoneidx, alloc_flags);
+ return page;
+}
+
+static inline int
+is_allocation_high_priority(struct task_struct *p, gfp_t gfp_mask)
+{
+ if (((p->flags & PF_MEMALLOC) || unlikely(test_thread_flag(TIF_MEMDIE)))
+ && !in_interrupt())
+ if (!(gfp_mask & __GFP_NOMEMALLOC))
+ return 1;
+ return 0;
+}
+
+/*
+ * This is called in the allocator slow-path if the allocation request is of
+ * sufficient urgency to ignore watermarks and take other desperate measures
+ */
+static inline struct page *
+__alloc_pages_high_priority(gfp_t gfp_mask, unsigned int order,
+ struct zonelist *zonelist, enum zone_type high_zoneidx,
+ nodemask_t *nodemask)
+{
+ struct page *page;
+
+ do {
+ page = get_page_from_freelist(gfp_mask, nodemask, order,
+ zonelist, high_zoneidx, ALLOC_NO_WATERMARKS);
+
+ if (!page && gfp_mask & __GFP_NOFAIL)
+ congestion_wait(WRITE, HZ/50);
+ } while (!page && (gfp_mask & __GFP_NOFAIL));
+
+ return page;
+}
+
+static inline
+void wake_all_kswapd(unsigned int order, struct zonelist *zonelist,
+ enum zone_type high_zoneidx)
+{
+ struct zoneref *z;
+ struct zone *zone;
+
+ for_each_zone_zonelist(zone, z, zonelist, high_zoneidx)
+ wakeup_kswapd(zone, order);
+}
+
+static inline struct page *
+__alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
+ struct zonelist *zonelist, enum zone_type high_zoneidx,
+ nodemask_t *nodemask)
+{
+ const gfp_t wait = gfp_mask & __GFP_WAIT;
+ struct page *page = NULL;
+ int alloc_flags;
+ unsigned long pages_reclaimed = 0;
+ unsigned long did_some_progress;
+ struct task_struct *p = current;

/*
* GFP_THISNODE (meaning __GFP_THISNODE, __GFP_NORETRY and
@@ -1513,8 +1639,7 @@ restart:
if (NUMA_BUILD && (gfp_mask & GFP_THISNODE) == GFP_THISNODE)
goto nopage;

- for_each_zone_zonelist(zone, z, zonelist, high_zoneidx)
- wakeup_kswapd(zone, order);
+ wake_all_kswapd(order, zonelist, high_zoneidx);

/*
* OK, we're below the kswapd watermark and have kicked background
@@ -1534,6 +1659,7 @@ restart:
if (wait)
alloc_flags |= ALLOC_CPUSET;

+restart:
/*
* Go through the zonelist again. Let __GFP_HIGH and allocations
* coming from realtime tasks go deeper into reserves.
@@ -1547,118 +1673,47 @@ restart:
if (page)
goto got_pg;

- /* This allocation should allow future memory freeing. */
-
-rebalance:
- if (((p->flags & PF_MEMALLOC) || unlikely(test_thread_flag(TIF_MEMDIE)))
- && !in_interrupt()) {
- if (!(gfp_mask & __GFP_NOMEMALLOC)) {
-nofail_alloc:
- /* go through the zonelist yet again, ignoring mins */
- page = get_page_from_freelist(gfp_mask, nodemask, order,
- zonelist, high_zoneidx, ALLOC_NO_WATERMARKS);
- if (page)
- goto got_pg;
- if (gfp_mask & __GFP_NOFAIL) {
- congestion_wait(WRITE, HZ/50);
- goto nofail_alloc;
- }
- }
- goto nopage;
- }
+ /* Allocate without watermarks if the context allows */
+ if (is_allocation_high_priority(p, gfp_mask))
+ page = __alloc_pages_high_priority(gfp_mask, order,
+ zonelist, high_zoneidx, nodemask);
+ if (page)
+ goto got_pg;

/* Atomic allocations - we can't balance anything */
if (!wait)
goto nopage;

- cond_resched();
+ /* Try direct reclaim and then allocating */
+ page = __alloc_pages_direct_reclaim(gfp_mask, order,
+ zonelist, high_zoneidx,
+ nodemask,
+ alloc_flags, &did_some_progress);
+ if (page)
+ goto got_pg;

- /* We now go into synchronous reclaim */
- cpuset_memory_pressure_bump();
/*
- * The task's cpuset might have expanded its set of allowable nodes
+ * If we failed to make any progress reclaiming, then we are
+ * running out of options and have to consider going OOM
*/
- cpuset_update_task_memory_state();
- p->flags |= PF_MEMALLOC;
- reclaim_state.reclaimed_slab = 0;
- p->reclaim_state = &reclaim_state;
-
- did_some_progress = try_to_free_pages(zonelist, order, gfp_mask);
-
- p->reclaim_state = NULL;
- p->flags &= ~PF_MEMALLOC;
-
- cond_resched();
-
- if (order != 0)
- drain_all_pages();
+ if (!did_some_progress) {
+ if ((gfp_mask & __GFP_FS) && !(gfp_mask & __GFP_NORETRY)) {
+ page = __alloc_pages_may_oom(gfp_mask, order,
+ zonelist, high_zoneidx,
+ nodemask);
+ if (page)
+ goto got_pg;

- if (likely(did_some_progress)) {
- page = get_page_from_freelist(gfp_mask, nodemask, order,
- zonelist, high_zoneidx, alloc_flags);
- if (page)
- goto got_pg;
- } else if ((gfp_mask & __GFP_FS) && !(gfp_mask & __GFP_NORETRY)) {
- if (!try_set_zone_oom(zonelist, gfp_mask)) {
- schedule_timeout_uninterruptible(1);
goto restart;
}
-
- /*
- * Go through the zonelist yet one more time, keep
- * very high watermark here, this is only to catch
- * a parallel oom killing, we must fail if we're still
- * under heavy pressure.
- */
- page = get_page_from_freelist(gfp_mask|__GFP_HARDWALL, nodemask,
- order, zonelist, high_zoneidx,
- ALLOC_WMARK_HIGH|ALLOC_CPUSET);
- if (page) {
- clear_zonelist_oom(zonelist, gfp_mask);
- goto got_pg;
- }
-
- /* The OOM killer will not help higher order allocs so fail */
- if (order > PAGE_ALLOC_COSTLY_ORDER) {
- clear_zonelist_oom(zonelist, gfp_mask);
- goto nopage;
- }
-
- out_of_memory(zonelist, gfp_mask, order);
- clear_zonelist_oom(zonelist, gfp_mask);
- goto restart;
}

- /*
- * Don't let big-order allocations loop unless the caller explicitly
- * requests that. Wait for some write requests to complete then retry.
- *
- * In this implementation, order <= PAGE_ALLOC_COSTLY_ORDER
- * means __GFP_NOFAIL, but that may not be true in other
- * implementations.
- *
- * For order > PAGE_ALLOC_COSTLY_ORDER, if __GFP_REPEAT is
- * specified, then we retry until we no longer reclaim any pages
- * (above), or we've reclaimed an order of pages at least as
- * large as the allocation's order. In both cases, if the
- * allocation still fails, we stop retrying.
- */
+ /* Check if we should retry the allocation */
pages_reclaimed += did_some_progress;
- do_retry = 0;
- if (!(gfp_mask & __GFP_NORETRY)) {
- if (order <= PAGE_ALLOC_COSTLY_ORDER) {
- do_retry = 1;
- } else {
- if (gfp_mask & __GFP_REPEAT &&
- pages_reclaimed < (1 << order))
- do_retry = 1;
- }
- if (gfp_mask & __GFP_NOFAIL)
- do_retry = 1;
- }
- if (do_retry) {
+ if (should_alloc_retry(gfp_mask, order, pages_reclaimed)) {
+ /* Wait for some write requests to complete then retry */
congestion_wait(WRITE, HZ/50);
- goto rebalance;
+ goto restart;
}

nopage:
@@ -1671,6 +1726,39 @@ nopage:
}
got_pg:
return page;
+
+}
+
+/*
+ * This is the 'heart' of the zoned buddy allocator.
+ */
+struct page *
+__alloc_pages_nodemask(gfp_t gfp_mask, unsigned int order,
+ struct zonelist *zonelist, nodemask_t *nodemask)
+{
+ enum zone_type high_zoneidx = gfp_zone(gfp_mask);
+ struct page *page;
+
+ might_sleep_if(gfp_mask & __GFP_WAIT);
+
+ if (should_fail_alloc_page(gfp_mask, order))
+ return NULL;
+
+ /*
+ * Check the zones suitable for the gfp_mask contain at least one
+ * valid zone. It's possible to have an empty zonelist as a result
+ * of GFP_THISNODE and a memoryless node
+ */
+ if (unlikely(!zonelist->_zonerefs->zone))
+ return NULL;
+
+ page = get_page_from_freelist(gfp_mask|__GFP_HARDWALL, nodemask, order,
+ zonelist, high_zoneidx, ALLOC_WMARK_LOW|ALLOC_CPUSET);
+ if (unlikely(!page))
+ page = __alloc_pages_slowpath(gfp_mask, order,
+ zonelist, high_zoneidx, nodemask);
+
+ return page;
}
EXPORT_SYMBOL(__alloc_pages_nodemask);

--
1.5.6.5

2009-03-16 09:46:52

by Mel Gorman

Subject: [PATCH 07/35] Check in advance if the zonelist needs additional filtering

Zonelists are normally filtered based on nodemasks for memory policies.
They can additionally be filtered on cpusets if they exist, as well as
noting when zones are full. These simple checks are expensive enough to
be noticed in profiles. This patch checks in advance if zonelist
filtering will ever be needed. If not, then the bulk of the checks are
skipped.

Signed-off-by: Mel Gorman <[email protected]>
---
include/linux/cpuset.h | 2 ++
mm/page_alloc.c | 37 ++++++++++++++++++++++++++-----------
2 files changed, 28 insertions(+), 11 deletions(-)

diff --git a/include/linux/cpuset.h b/include/linux/cpuset.h
index 90c6074..6051082 100644
--- a/include/linux/cpuset.h
+++ b/include/linux/cpuset.h
@@ -83,6 +83,8 @@ extern void cpuset_print_task_mems_allowed(struct task_struct *p);

#else /* !CONFIG_CPUSETS */

+#define number_of_cpusets (0)
+
static inline int cpuset_init_early(void) { return 0; }
static inline int cpuset_init(void) { return 0; }
static inline void cpuset_init_smp(void) {}
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index d815c8f..fe71147 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1139,7 +1139,11 @@ failed:
#define ALLOC_WMARK_HIGH 0x08 /* use pages_high watermark */
#define ALLOC_HARDER 0x10 /* try to alloc harder */
#define ALLOC_HIGH 0x20 /* __GFP_HIGH set */
+#ifdef CONFIG_CPUSETS
#define ALLOC_CPUSET 0x40 /* check for correct cpuset */
+#else
+#define ALLOC_CPUSET 0x00
+#endif /* CONFIG_CPUSETS */

#ifdef CONFIG_FAIL_PAGE_ALLOC

@@ -1403,6 +1407,7 @@ get_page_from_freelist(gfp_t gfp_mask, nodemask_t *nodemask, unsigned int order,
nodemask_t *allowednodes = NULL;/* zonelist_cache approximation */
int zlc_active = 0; /* set if using zonelist_cache */
int did_zlc_setup = 0; /* just call zlc_setup() one time */
+ int zonelist_filter = 0;

(void)first_zones_zonelist(zonelist, high_zoneidx, nodemask,
&preferred_zone);
@@ -1413,6 +1418,10 @@ get_page_from_freelist(gfp_t gfp_mask, nodemask_t *nodemask, unsigned int order,

VM_BUG_ON(order >= MAX_ORDER);

+ /* Determine in advance if the zonelist needs filtering */
+ if ((alloc_flags & ALLOC_CPUSET) && unlikely(number_of_cpusets > 1))
+ zonelist_filter = 1;
+
zonelist_scan:
/*
* Scan zonelist, looking for a zone with enough free.
@@ -1420,12 +1429,16 @@ zonelist_scan:
*/
for_each_zone_zonelist_nodemask(zone, z, zonelist,
high_zoneidx, nodemask) {
- if (NUMA_BUILD && zlc_active &&
- !zlc_zone_worth_trying(zonelist, z, allowednodes))
- continue;
- if ((alloc_flags & ALLOC_CPUSET) &&
- !cpuset_zone_allowed_softwall(zone, gfp_mask))
- goto try_next_zone;
+
+ /* Ignore the additional zonelist filter checks if possible */
+ if (zonelist_filter) {
+ if (NUMA_BUILD && zlc_active &&
+ !zlc_zone_worth_trying(zonelist, z, allowednodes))
+ continue;
+ if ((alloc_flags & ALLOC_CPUSET) &&
+ !cpuset_zone_allowed_softwall(zone, gfp_mask))
+ goto try_next_zone;
+ }

if (!(alloc_flags & ALLOC_NO_WATERMARKS)) {
unsigned long mark;
@@ -1447,13 +1460,15 @@ zonelist_scan:
if (page)
break;
this_zone_full:
- if (NUMA_BUILD)
+ if (NUMA_BUILD && zonelist_filter)
zlc_mark_zone_full(zonelist, z);
try_next_zone:
- if (NUMA_BUILD && !did_zlc_setup) {
- /* we do zlc_setup after the first zone is tried */
- allowednodes = zlc_setup(zonelist, alloc_flags);
- zlc_active = 1;
+ if (NUMA_BUILD && zonelist_filter) {
+ if (!did_zlc_setup) {
+ /* do zlc_setup after the first zone is tried */
+ allowednodes = zlc_setup(zonelist, alloc_flags);
+ zlc_active = 1;
+ }
did_zlc_setup = 1;
}
}
--
1.5.6.5

2009-03-16 09:47:53

by Mel Gorman

Subject: [PATCH 10/35] Calculate the alloc_flags for allocation only once

Factor out the mapping between GFP flags and alloc_flags. Once factored
out, it only needs to be calculated once, but some care must be taken.

[[email protected] says]
As the test:

- if (((p->flags & PF_MEMALLOC) || unlikely(test_thread_flag(TIF_MEMDIE)))
- && !in_interrupt()) {
- if (!(gfp_mask & __GFP_NOMEMALLOC)) {

has been replaced with a slightly weaker one:

+ if (alloc_flags & ALLOC_NO_WATERMARKS) {

we need to ensure we don't recurse when PF_MEMALLOC is set.

From: Peter Zijlstra <[email protected]>
Signed-off-by: Peter Zijlstra <[email protected]>
Acked-by: Pekka Enberg <[email protected]>
---
mm/page_alloc.c | 88 +++++++++++++++++++++++++++++++-----------------------
1 files changed, 50 insertions(+), 38 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 8771de3..0558eb4 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1593,16 +1593,6 @@ __alloc_pages_direct_reclaim(gfp_t gfp_mask, unsigned int order,
return page;
}

-static inline int
-is_allocation_high_priority(struct task_struct *p, gfp_t gfp_mask)
-{
- if (((p->flags & PF_MEMALLOC) || unlikely(test_thread_flag(TIF_MEMDIE)))
- && !in_interrupt())
- if (!(gfp_mask & __GFP_NOMEMALLOC))
- return 1;
- return 0;
-}
-
/*
* This is called in the allocator slow-path if the allocation request is of
* sufficient urgency to ignore watermarks and take other desperate measures
@@ -1638,6 +1628,42 @@ void wake_all_kswapd(unsigned int order, struct zonelist *zonelist,
wakeup_kswapd(zone, order);
}

+static inline int
+gfp_to_alloc_flags(gfp_t gfp_mask)
+{
+ struct task_struct *p = current;
+ int alloc_flags = ALLOC_WMARK_MIN | ALLOC_CPUSET;
+ const gfp_t wait = gfp_mask & __GFP_WAIT;
+
+ /*
+ * The caller may dip into page reserves a bit more if the caller
+ * cannot run direct reclaim, or if the caller has realtime scheduling
+ * policy or is asking for __GFP_HIGH memory. GFP_ATOMIC requests will
+ * set both ALLOC_HARDER (!wait) and ALLOC_HIGH (__GFP_HIGH).
+ */
+ if (gfp_mask & __GFP_HIGH)
+ alloc_flags |= ALLOC_HIGH;
+
+ if (!wait) {
+ alloc_flags |= ALLOC_HARDER;
+ /*
+ * Ignore cpuset if GFP_ATOMIC (!wait) rather than fail alloc.
+ * See also cpuset_zone_allowed() comment in kernel/cpuset.c.
+ */
+ alloc_flags &= ~ALLOC_CPUSET;
+ } else if (unlikely(rt_task(p)) && !in_interrupt())
+ alloc_flags |= ALLOC_HARDER;
+
+ if (likely(!(gfp_mask & __GFP_NOMEMALLOC))) {
+ if (!in_interrupt() &&
+ ((p->flags & PF_MEMALLOC) ||
+ unlikely(test_thread_flag(TIF_MEMDIE))))
+ alloc_flags |= ALLOC_NO_WATERMARKS;
+ }
+
+ return alloc_flags;
+}
+
static inline struct page *
__alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
struct zonelist *zonelist, enum zone_type high_zoneidx,
@@ -1668,48 +1694,34 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
* OK, we're below the kswapd watermark and have kicked background
* reclaim. Now things get more complex, so set up alloc_flags according
* to how we want to proceed.
- *
- * The caller may dip into page reserves a bit more if the caller
- * cannot run direct reclaim, or if the caller has realtime scheduling
- * policy or is asking for __GFP_HIGH memory. GFP_ATOMIC requests will
- * set both ALLOC_HARDER (!wait) and ALLOC_HIGH (__GFP_HIGH).
*/
- alloc_flags = ALLOC_WMARK_MIN;
- if ((unlikely(rt_task(p)) && !in_interrupt()) || !wait)
- alloc_flags |= ALLOC_HARDER;
- if (gfp_mask & __GFP_HIGH)
- alloc_flags |= ALLOC_HIGH;
- if (wait)
- alloc_flags |= ALLOC_CPUSET;
+ alloc_flags = gfp_to_alloc_flags(gfp_mask);

restart:
- /*
- * Go through the zonelist again. Let __GFP_HIGH and allocations
- * coming from realtime tasks go deeper into reserves.
- *
- * This is the last chance, in general, before the goto nopage.
- * Ignore cpuset if GFP_ATOMIC (!wait) rather than fail alloc.
- * See also cpuset_zone_allowed() comment in kernel/cpuset.c.
- */
+ /* This is the last chance, in general, before the goto nopage. */
page = get_page_from_freelist(gfp_mask, nodemask, order, zonelist,
- high_zoneidx, alloc_flags,
- preferred_zone,
- migratetype);
+ high_zoneidx, alloc_flags & ~ALLOC_NO_WATERMARKS,
+ preferred_zone, migratetype);
if (page)
goto got_pg;

/* Allocate without watermarks if the context allows */
- if (is_allocation_high_priority(p, gfp_mask))
+ if (alloc_flags & ALLOC_NO_WATERMARKS) {
page = __alloc_pages_high_priority(gfp_mask, order,
- zonelist, high_zoneidx, nodemask, preferred_zone,
- migratetype);
- if (page)
- goto got_pg;
+ zonelist, high_zoneidx, nodemask,
+ preferred_zone, migratetype);
+ if (page)
+ goto got_pg;
+ }

/* Atomic allocations - we can't balance anything */
if (!wait)
goto nopage;

+ /* Avoid recursion of direct reclaim */
+ if (p->flags & PF_MEMALLOC)
+ goto nopage;
+
/* Try direct reclaim and then allocating */
page = __alloc_pages_direct_reclaim(gfp_mask, order,
zonelist, high_zoneidx,
--
1.5.6.5

2009-03-16 09:47:36

by Mel Gorman

Subject: [PATCH 09/35] Calculate the migratetype for allocation only once

The GFP mask is converted into a migratetype when deciding which pagelist
to take a page from. However, this conversion happens multiple times per
allocation, at least once per zone traversed. Calculate it once.

Signed-off-by: Mel Gorman <[email protected]>
---
mm/page_alloc.c | 43 ++++++++++++++++++++++++++-----------------
1 files changed, 26 insertions(+), 17 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 78e1d8e..8771de3 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1067,13 +1067,13 @@ void split_page(struct page *page, unsigned int order)
* or two.
*/
static struct page *buffered_rmqueue(struct zone *preferred_zone,
- struct zone *zone, int order, gfp_t gfp_flags)
+ struct zone *zone, int order, gfp_t gfp_flags,
+ int migratetype)
{
unsigned long flags;
struct page *page;
int cold = !!(gfp_flags & __GFP_COLD);
int cpu;
- int migratetype = allocflags_to_migratetype(gfp_flags);

again:
cpu = get_cpu();
@@ -1399,7 +1399,7 @@ static void zlc_mark_zone_full(struct zonelist *zonelist, struct zoneref *z)
static struct page *
get_page_from_freelist(gfp_t gfp_mask, nodemask_t *nodemask, unsigned int order,
struct zonelist *zonelist, int high_zoneidx, int alloc_flags,
- struct zone *preferred_zone)
+ struct zone *preferred_zone, int migratetype)
{
struct zoneref *z;
struct page *page = NULL;
@@ -1451,7 +1451,8 @@ zonelist_scan:
}
}

- page = buffered_rmqueue(preferred_zone, zone, order, gfp_mask);
+ page = buffered_rmqueue(preferred_zone, zone, order,
+ gfp_mask, migratetype);
if (page)
break;
this_zone_full:
@@ -1515,7 +1516,8 @@ should_alloc_retry(gfp_t gfp_mask, unsigned int order,
static inline struct page *
__alloc_pages_may_oom(gfp_t gfp_mask, unsigned int order,
struct zonelist *zonelist, enum zone_type high_zoneidx,
- nodemask_t *nodemask, struct zone *preferred_zone)
+ nodemask_t *nodemask, struct zone *preferred_zone,
+ int migratetype)
{
struct page *page;

@@ -1533,7 +1535,7 @@ __alloc_pages_may_oom(gfp_t gfp_mask, unsigned int order,
page = get_page_from_freelist(gfp_mask|__GFP_HARDWALL, nodemask,
order, zonelist, high_zoneidx,
ALLOC_WMARK_HIGH|ALLOC_CPUSET,
- preferred_zone);
+ preferred_zone, migratetype);
if (page)
goto out;

@@ -1554,7 +1556,7 @@ static inline struct page *
__alloc_pages_direct_reclaim(gfp_t gfp_mask, unsigned int order,
struct zonelist *zonelist, enum zone_type high_zoneidx,
nodemask_t *nodemask, int alloc_flags, struct zone *preferred_zone,
- unsigned long *did_some_progress)
+ int migratetype, unsigned long *did_some_progress)
{
struct page *page = NULL;
struct reclaim_state reclaim_state;
@@ -1586,7 +1588,8 @@ __alloc_pages_direct_reclaim(gfp_t gfp_mask, unsigned int order,
if (likely(*did_some_progress))
page = get_page_from_freelist(gfp_mask, nodemask, order,
zonelist, high_zoneidx,
- alloc_flags, preferred_zone);
+ alloc_flags, preferred_zone,
+ migratetype);
return page;
}

@@ -1607,14 +1610,15 @@ is_allocation_high_priority(struct task_struct *p, gfp_t gfp_mask)
static inline struct page *
__alloc_pages_high_priority(gfp_t gfp_mask, unsigned int order,
struct zonelist *zonelist, enum zone_type high_zoneidx,
- nodemask_t *nodemask, struct zone *preferred_zone)
+ nodemask_t *nodemask, struct zone *preferred_zone,
+ int migratetype)
{
struct page *page;

do {
page = get_page_from_freelist(gfp_mask, nodemask, order,
zonelist, high_zoneidx, ALLOC_NO_WATERMARKS,
- preferred_zone);
+ preferred_zone, migratetype);

if (!page && gfp_mask & __GFP_NOFAIL)
congestion_wait(WRITE, HZ/50);
@@ -1637,7 +1641,8 @@ void wake_all_kswapd(unsigned int order, struct zonelist *zonelist,
static inline struct page *
__alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
struct zonelist *zonelist, enum zone_type high_zoneidx,
- nodemask_t *nodemask, struct zone *preferred_zone)
+ nodemask_t *nodemask, struct zone *preferred_zone,
+ int migratetype)
{
const gfp_t wait = gfp_mask & __GFP_WAIT;
struct page *page = NULL;
@@ -1688,14 +1693,16 @@ restart:
*/
page = get_page_from_freelist(gfp_mask, nodemask, order, zonelist,
high_zoneidx, alloc_flags,
- preferred_zone);
+ preferred_zone,
+ migratetype);
if (page)
goto got_pg;

/* Allocate without watermarks if the context allows */
if (is_allocation_high_priority(p, gfp_mask))
page = __alloc_pages_high_priority(gfp_mask, order,
- zonelist, high_zoneidx, nodemask, preferred_zone);
+ zonelist, high_zoneidx, nodemask, preferred_zone,
+ migratetype);
if (page)
goto got_pg;

@@ -1708,7 +1715,7 @@ restart:
zonelist, high_zoneidx,
nodemask,
alloc_flags, preferred_zone,
- &did_some_progress);
+ migratetype, &did_some_progress);
if (page)
goto got_pg;

@@ -1720,7 +1727,8 @@ restart:
if ((gfp_mask & __GFP_FS) && !(gfp_mask & __GFP_NORETRY)) {
page = __alloc_pages_may_oom(gfp_mask, order,
zonelist, high_zoneidx,
- nodemask, preferred_zone);
+ nodemask, preferred_zone,
+ migratetype);
if (page)
goto got_pg;

@@ -1759,6 +1767,7 @@ __alloc_pages_nodemask(gfp_t gfp_mask, unsigned int order,
enum zone_type high_zoneidx = gfp_zone(gfp_mask);
struct zone *preferred_zone;
struct page *page;
+ int migratetype = allocflags_to_migratetype(gfp_mask);

might_sleep_if(gfp_mask & __GFP_WAIT);

@@ -1782,11 +1791,11 @@ __alloc_pages_nodemask(gfp_t gfp_mask, unsigned int order,
/* First allocation attempt */
page = get_page_from_freelist(gfp_mask|__GFP_HARDWALL, nodemask, order,
zonelist, high_zoneidx, ALLOC_WMARK_LOW|ALLOC_CPUSET,
- preferred_zone);
+ preferred_zone, migratetype);
if (unlikely(!page))
page = __alloc_pages_slowpath(gfp_mask, order,
zonelist, high_zoneidx, nodemask,
- preferred_zone);
+ preferred_zone, migratetype);

return page;
}
--
1.5.6.5

2009-03-16 09:47:20

by Mel Gorman

Subject: [PATCH 08/35] Calculate the preferred zone for allocation only once

get_page_from_freelist() can be called multiple times for an allocation.
Part of this calculates the preferred_zone which is the first usable
zone in the zonelist. This patch calculates preferred_zone once.

Signed-off-by: Mel Gorman <[email protected]>
---
mm/page_alloc.c | 53 ++++++++++++++++++++++++++++++++---------------------
1 files changed, 32 insertions(+), 21 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index fe71147..78e1d8e 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1398,24 +1398,19 @@ static void zlc_mark_zone_full(struct zonelist *zonelist, struct zoneref *z)
*/
static struct page *
get_page_from_freelist(gfp_t gfp_mask, nodemask_t *nodemask, unsigned int order,
- struct zonelist *zonelist, int high_zoneidx, int alloc_flags)
+ struct zonelist *zonelist, int high_zoneidx, int alloc_flags,
+ struct zone *preferred_zone)
{
struct zoneref *z;
struct page *page = NULL;
int classzone_idx;
- struct zone *zone, *preferred_zone;
+ struct zone *zone;
nodemask_t *allowednodes = NULL;/* zonelist_cache approximation */
int zlc_active = 0; /* set if using zonelist_cache */
int did_zlc_setup = 0; /* just call zlc_setup() one time */
int zonelist_filter = 0;

- (void)first_zones_zonelist(zonelist, high_zoneidx, nodemask,
- &preferred_zone);
- if (!preferred_zone)
- return NULL;
-
classzone_idx = zone_idx(preferred_zone);
-
VM_BUG_ON(order >= MAX_ORDER);

/* Determine in advance if the zonelist needs filtering */
@@ -1520,7 +1515,7 @@ should_alloc_retry(gfp_t gfp_mask, unsigned int order,
static inline struct page *
__alloc_pages_may_oom(gfp_t gfp_mask, unsigned int order,
struct zonelist *zonelist, enum zone_type high_zoneidx,
- nodemask_t *nodemask)
+ nodemask_t *nodemask, struct zone *preferred_zone)
{
struct page *page;

@@ -1537,7 +1532,8 @@ __alloc_pages_may_oom(gfp_t gfp_mask, unsigned int order,
*/
page = get_page_from_freelist(gfp_mask|__GFP_HARDWALL, nodemask,
order, zonelist, high_zoneidx,
- ALLOC_WMARK_HIGH|ALLOC_CPUSET);
+ ALLOC_WMARK_HIGH|ALLOC_CPUSET,
+ preferred_zone);
if (page)
goto out;

@@ -1557,7 +1553,8 @@ out:
static inline struct page *
__alloc_pages_direct_reclaim(gfp_t gfp_mask, unsigned int order,
struct zonelist *zonelist, enum zone_type high_zoneidx,
- nodemask_t *nodemask, int alloc_flags, unsigned long *did_some_progress)
+ nodemask_t *nodemask, int alloc_flags, struct zone *preferred_zone,
+ unsigned long *did_some_progress)
{
struct page *page = NULL;
struct reclaim_state reclaim_state;
@@ -1588,7 +1585,8 @@ __alloc_pages_direct_reclaim(gfp_t gfp_mask, unsigned int order,

if (likely(*did_some_progress))
page = get_page_from_freelist(gfp_mask, nodemask, order,
- zonelist, high_zoneidx, alloc_flags);
+ zonelist, high_zoneidx,
+ alloc_flags, preferred_zone);
return page;
}

@@ -1609,13 +1607,14 @@ is_allocation_high_priority(struct task_struct *p, gfp_t gfp_mask)
static inline struct page *
__alloc_pages_high_priority(gfp_t gfp_mask, unsigned int order,
struct zonelist *zonelist, enum zone_type high_zoneidx,
- nodemask_t *nodemask)
+ nodemask_t *nodemask, struct zone *preferred_zone)
{
struct page *page;

do {
page = get_page_from_freelist(gfp_mask, nodemask, order,
- zonelist, high_zoneidx, ALLOC_NO_WATERMARKS);
+ zonelist, high_zoneidx, ALLOC_NO_WATERMARKS,
+ preferred_zone);

if (!page && gfp_mask & __GFP_NOFAIL)
congestion_wait(WRITE, HZ/50);
@@ -1638,7 +1637,7 @@ void wake_all_kswapd(unsigned int order, struct zonelist *zonelist,
static inline struct page *
__alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
struct zonelist *zonelist, enum zone_type high_zoneidx,
- nodemask_t *nodemask)
+ nodemask_t *nodemask, struct zone *preferred_zone)
{
const gfp_t wait = gfp_mask & __GFP_WAIT;
struct page *page = NULL;
@@ -1688,14 +1687,15 @@ restart:
* See also cpuset_zone_allowed() comment in kernel/cpuset.c.
*/
page = get_page_from_freelist(gfp_mask, nodemask, order, zonelist,
- high_zoneidx, alloc_flags);
+ high_zoneidx, alloc_flags,
+ preferred_zone);
if (page)
goto got_pg;

/* Allocate without watermarks if the context allows */
if (is_allocation_high_priority(p, gfp_mask))
page = __alloc_pages_high_priority(gfp_mask, order,
- zonelist, high_zoneidx, nodemask);
+ zonelist, high_zoneidx, nodemask, preferred_zone);
if (page)
goto got_pg;

@@ -1707,7 +1707,8 @@ restart:
page = __alloc_pages_direct_reclaim(gfp_mask, order,
zonelist, high_zoneidx,
nodemask,
- alloc_flags, &did_some_progress);
+ alloc_flags, preferred_zone,
+ &did_some_progress);
if (page)
goto got_pg;

@@ -1719,7 +1720,7 @@ restart:
if ((gfp_mask & __GFP_FS) && !(gfp_mask & __GFP_NORETRY)) {
page = __alloc_pages_may_oom(gfp_mask, order,
zonelist, high_zoneidx,
- nodemask);
+ nodemask, preferred_zone);
if (page)
goto got_pg;

@@ -1756,6 +1757,7 @@ __alloc_pages_nodemask(gfp_t gfp_mask, unsigned int order,
struct zonelist *zonelist, nodemask_t *nodemask)
{
enum zone_type high_zoneidx = gfp_zone(gfp_mask);
+ struct zone *preferred_zone;
struct page *page;

might_sleep_if(gfp_mask & __GFP_WAIT);
@@ -1771,11 +1773,20 @@ __alloc_pages_nodemask(gfp_t gfp_mask, unsigned int order,
if (unlikely(!zonelist->_zonerefs->zone))
return NULL;

+ /* The preferred zone is used for statistics later */
+ (void)first_zones_zonelist(zonelist, high_zoneidx, nodemask,
+ &preferred_zone);
+ if (!preferred_zone)
+ return NULL;
+
+ /* First allocation attempt */
page = get_page_from_freelist(gfp_mask|__GFP_HARDWALL, nodemask, order,
- zonelist, high_zoneidx, ALLOC_WMARK_LOW|ALLOC_CPUSET);
+ zonelist, high_zoneidx, ALLOC_WMARK_LOW|ALLOC_CPUSET,
+ preferred_zone);
if (unlikely(!page))
page = __alloc_pages_slowpath(gfp_mask, order,
- zonelist, high_zoneidx, nodemask);
+ zonelist, high_zoneidx, nodemask,
+ preferred_zone);

return page;
}
--
1.5.6.5

2009-03-16 09:48:19

by Mel Gorman

Subject: [PATCH 11/35] Calculate the cold parameter for allocation only once

The GFP mask is checked for __GFP_COLD when deciding which end of the PCP
lists to use. However, this check happens multiple times per allocation,
at least once per zone traversed. Calculate it once.

Signed-off-by: Mel Gorman <[email protected]>
---
mm/page_alloc.c | 35 ++++++++++++++++++-----------------
1 files changed, 18 insertions(+), 17 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 0558eb4..ad26052 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1068,11 +1068,10 @@ void split_page(struct page *page, unsigned int order)
*/
static struct page *buffered_rmqueue(struct zone *preferred_zone,
struct zone *zone, int order, gfp_t gfp_flags,
- int migratetype)
+ int migratetype, int cold)
{
unsigned long flags;
struct page *page;
- int cold = !!(gfp_flags & __GFP_COLD);
int cpu;

again:
@@ -1399,7 +1398,7 @@ static void zlc_mark_zone_full(struct zonelist *zonelist, struct zoneref *z)
static struct page *
get_page_from_freelist(gfp_t gfp_mask, nodemask_t *nodemask, unsigned int order,
struct zonelist *zonelist, int high_zoneidx, int alloc_flags,
- struct zone *preferred_zone, int migratetype)
+ struct zone *preferred_zone, int migratetype, int cold)
{
struct zoneref *z;
struct page *page = NULL;
@@ -1452,7 +1451,7 @@ zonelist_scan:
}

page = buffered_rmqueue(preferred_zone, zone, order,
- gfp_mask, migratetype);
+ gfp_mask, migratetype, cold);
if (page)
break;
this_zone_full:
@@ -1517,7 +1516,7 @@ static inline struct page *
__alloc_pages_may_oom(gfp_t gfp_mask, unsigned int order,
struct zonelist *zonelist, enum zone_type high_zoneidx,
nodemask_t *nodemask, struct zone *preferred_zone,
- int migratetype)
+ int migratetype, int cold)
{
struct page *page;

@@ -1535,7 +1534,7 @@ __alloc_pages_may_oom(gfp_t gfp_mask, unsigned int order,
page = get_page_from_freelist(gfp_mask|__GFP_HARDWALL, nodemask,
order, zonelist, high_zoneidx,
ALLOC_WMARK_HIGH|ALLOC_CPUSET,
- preferred_zone, migratetype);
+ preferred_zone, migratetype, cold);
if (page)
goto out;

@@ -1556,7 +1555,7 @@ static inline struct page *
__alloc_pages_direct_reclaim(gfp_t gfp_mask, unsigned int order,
struct zonelist *zonelist, enum zone_type high_zoneidx,
nodemask_t *nodemask, int alloc_flags, struct zone *preferred_zone,
- int migratetype, unsigned long *did_some_progress)
+ int migratetype, int cold, unsigned long *did_some_progress)
{
struct page *page = NULL;
struct reclaim_state reclaim_state;
@@ -1589,7 +1588,7 @@ __alloc_pages_direct_reclaim(gfp_t gfp_mask, unsigned int order,
page = get_page_from_freelist(gfp_mask, nodemask, order,
zonelist, high_zoneidx,
alloc_flags, preferred_zone,
- migratetype);
+ migratetype, cold);
return page;
}

@@ -1601,14 +1600,14 @@ static inline struct page *
__alloc_pages_high_priority(gfp_t gfp_mask, unsigned int order,
struct zonelist *zonelist, enum zone_type high_zoneidx,
nodemask_t *nodemask, struct zone *preferred_zone,
- int migratetype)
+ int migratetype, int cold)
{
struct page *page;

do {
page = get_page_from_freelist(gfp_mask, nodemask, order,
zonelist, high_zoneidx, ALLOC_NO_WATERMARKS,
- preferred_zone, migratetype);
+ preferred_zone, migratetype, cold);

if (!page && gfp_mask & __GFP_NOFAIL)
congestion_wait(WRITE, HZ/50);
@@ -1668,7 +1667,7 @@ static inline struct page *
__alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
struct zonelist *zonelist, enum zone_type high_zoneidx,
nodemask_t *nodemask, struct zone *preferred_zone,
- int migratetype)
+ int migratetype, int cold)
{
const gfp_t wait = gfp_mask & __GFP_WAIT;
struct page *page = NULL;
@@ -1701,7 +1700,7 @@ restart:
/* This is the last chance, in general, before the goto nopage. */
page = get_page_from_freelist(gfp_mask, nodemask, order, zonelist,
high_zoneidx, alloc_flags & ~ALLOC_NO_WATERMARKS,
- preferred_zone, migratetype);
+ preferred_zone, migratetype, cold);
if (page)
goto got_pg;

@@ -1709,7 +1708,7 @@ restart:
if (alloc_flags & ALLOC_NO_WATERMARKS) {
page = __alloc_pages_high_priority(gfp_mask, order,
zonelist, high_zoneidx, nodemask,
- preferred_zone, migratetype);
+ preferred_zone, migratetype, cold);
if (page)
goto got_pg;
}
@@ -1727,7 +1726,8 @@ restart:
zonelist, high_zoneidx,
nodemask,
alloc_flags, preferred_zone,
- migratetype, &did_some_progress);
+ migratetype, cold,
+ &did_some_progress);
if (page)
goto got_pg;

@@ -1740,7 +1740,7 @@ restart:
page = __alloc_pages_may_oom(gfp_mask, order,
zonelist, high_zoneidx,
nodemask, preferred_zone,
- migratetype);
+ migratetype, cold);
if (page)
goto got_pg;

@@ -1780,6 +1780,7 @@ __alloc_pages_nodemask(gfp_t gfp_mask, unsigned int order,
struct zone *preferred_zone;
struct page *page;
int migratetype = allocflags_to_migratetype(gfp_mask);
+ int cold = gfp_mask & __GFP_COLD;

might_sleep_if(gfp_mask & __GFP_WAIT);

@@ -1803,11 +1804,11 @@ __alloc_pages_nodemask(gfp_t gfp_mask, unsigned int order,
/* First allocation attempt */
page = get_page_from_freelist(gfp_mask|__GFP_HARDWALL, nodemask, order,
zonelist, high_zoneidx, ALLOC_WMARK_LOW|ALLOC_CPUSET,
- preferred_zone, migratetype);
+ preferred_zone, migratetype, cold);
if (unlikely(!page))
page = __alloc_pages_slowpath(gfp_mask, order,
zonelist, high_zoneidx, nodemask,
- preferred_zone, migratetype);
+ preferred_zone, migratetype, cold);

return page;
}
--
1.5.6.5

2009-03-16 09:49:23

by Mel Gorman

[permalink] [raw]
Subject: [PATCH 14/35] Inline buffered_rmqueue()

buffered_rmqueue() is in the fast path so inline it. Because it has only
one call site, inlining it should reduce text bloat rather than increase it.

Signed-off-by: Mel Gorman <[email protected]>
---
mm/page_alloc.c | 3 ++-
1 files changed, 2 insertions(+), 1 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index a3ca80d..9f7631e 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1079,7 +1079,8 @@ void split_page(struct page *page, unsigned int order)
* we cheat by calling it from here, in the order > 0 path. Saves a branch
* or two.
*/
-static struct page *buffered_rmqueue(struct zone *preferred_zone,
+static inline
+struct page *buffered_rmqueue(struct zone *preferred_zone,
struct zone *zone, int order, gfp_t gfp_flags,
int migratetype, int cold)
{
--
1.5.6.5

2009-03-16 09:48:53

by Mel Gorman

[permalink] [raw]
Subject: [PATCH 13/35] Inline __rmqueue_smallest()

Inline __rmqueue_smallest() by altering the flow very slightly so that there
is only one call site. This allows the function to be inlined without
additional text bloat.

Signed-off-by: Mel Gorman <[email protected]>
---
mm/page_alloc.c | 23 ++++++++++++++++++-----
1 files changed, 18 insertions(+), 5 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 1e8b4b6..a3ca80d 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -664,7 +664,8 @@ static int prep_new_page(struct page *page, int order, gfp_t gfp_flags)
* Go through the free lists for the given migratetype and remove
* the smallest available page from the freelists
*/
-static struct page *__rmqueue_smallest(struct zone *zone, unsigned int order,
+static inline
+struct page *__rmqueue_smallest(struct zone *zone, unsigned int order,
int migratetype)
{
unsigned int current_order;
@@ -834,24 +835,36 @@ static struct page *__rmqueue_fallback(struct zone *zone, int order,
}
}

- /* Use MIGRATE_RESERVE rather than fail an allocation */
- return __rmqueue_smallest(zone, order, MIGRATE_RESERVE);
+ return NULL;
}

/*
* Do the hard work of removing an element from the buddy allocator.
* Call me with the zone->lock already held.
*/
-static struct page *__rmqueue(struct zone *zone, unsigned int order,
+static inline
+struct page *__rmqueue(struct zone *zone, unsigned int order,
int migratetype)
{
struct page *page;

+retry_reserve:
page = __rmqueue_smallest(zone, order, migratetype);

- if (unlikely(!page))
+ if (unlikely(!page) && migratetype != MIGRATE_RESERVE) {
page = __rmqueue_fallback(zone, order, migratetype);

+ /*
+ * Use MIGRATE_RESERVE rather than fail an allocation. goto
+ * is used because __rmqueue_smallest is an inline function
+ * and we want just one call site
+ */
+ if (!page) {
+ migratetype = MIGRATE_RESERVE;
+ goto retry_reserve;
+ }
+ }
+
return page;
}

--
1.5.6.5

2009-03-16 09:48:39

by Mel Gorman

[permalink] [raw]
Subject: [PATCH 12/35] Remove a branch by assuming __GFP_HIGH == ALLOC_HIGH

Allocations that specify __GFP_HIGH get the ALLOC_HIGH flag. If the two
flags have the same numeric value, the bit can simply be copied across and
a branch eliminated.
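
A standalone sketch of the trick, separate from the diff below; the flag
values are invented for the example and the runtime assert stands in for
the VM_BUG_ON() check:

/*
 * Sketch: when the GFP bit and the internal allocator flag share the
 * same numeric value, the translation becomes a mask instead of a
 * compare-and-branch.
 */
#include <assert.h>
#include <stdio.h>

#define DEMO_GFP_HIGH	0x20u
#define DEMO_ALLOC_HIGH	0x20u	/* must equal DEMO_GFP_HIGH for the trick */

static unsigned int gfp_to_alloc_flags(unsigned int gfp_mask)
{
	unsigned int alloc_flags = 0;

	assert(DEMO_GFP_HIGH == DEMO_ALLOC_HIGH);

	/* Branchless: the bit is copied straight across */
	alloc_flags |= (gfp_mask & DEMO_GFP_HIGH);
	return alloc_flags;
}

int main(void)
{
	printf("%#x\n", gfp_to_alloc_flags(DEMO_GFP_HIGH));	/* 0x20 */
	printf("%#x\n", gfp_to_alloc_flags(0));			/* 0 */
	return 0;
}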

[[email protected]: Suggested the hack]
Signed-off-by: Mel Gorman <[email protected]>
---
mm/page_alloc.c | 4 ++--
1 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index ad26052..1e8b4b6 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1640,8 +1640,8 @@ gfp_to_alloc_flags(gfp_t gfp_mask)
* policy or is asking for __GFP_HIGH memory. GFP_ATOMIC requests will
* set both ALLOC_HARDER (!wait) and ALLOC_HIGH (__GFP_HIGH).
*/
- if (gfp_mask & __GFP_HIGH)
- alloc_flags |= ALLOC_HIGH;
+ VM_BUG_ON(__GFP_HIGH != ALLOC_HIGH);
+ alloc_flags |= (gfp_mask & __GFP_HIGH);

if (!wait) {
alloc_flags |= ALLOC_HARDER;
--
1.5.6.5

2009-03-16 09:49:40

by Mel Gorman

[permalink] [raw]
Subject: [PATCH 15/35] Inline __rmqueue_fallback()

__rmqueue_fallback() is in the slow path but has only one call site.
Inlining it actually reduces text.

Signed-off-by: Mel Gorman <[email protected]>
---
mm/page_alloc.c | 4 ++--
1 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 9f7631e..0ba9e4f 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -774,8 +774,8 @@ static int move_freepages_block(struct zone *zone, struct page *page,
}

/* Remove an element from the buddy allocator from the fallback list */
-static struct page *__rmqueue_fallback(struct zone *zone, int order,
- int start_migratetype)
+static inline struct page *
+__rmqueue_fallback(struct zone *zone, int order, int start_migratetype)
{
struct free_area * area;
int current_order;
--
1.5.6.5

2009-03-16 09:50:47

by Mel Gorman

[permalink] [raw]
Subject: [PATCH 18/35] Do not disable interrupts in free_page_mlock()

free_page_mlock() tests and clears PG_mlocked using locked versions of the
bit operations. If the bit was set, it disables interrupts to update counters
and this happens on every page free, even though interrupts are disabled
again very shortly afterwards. This is wasteful.

This patch splits what free_page_mlock() does. The bit check is still
made. However, the counter update is delayed until interrupts are disabled
anyway and the non-locked version of the bit clear is used. One potential
weirdness with this split is that the counters do not get updated if the
bad_page() check is triggered, but a system showing bad pages is getting
screwed already.

Signed-off-by: Mel Gorman <[email protected]>
---
mm/internal.h | 11 +++--------
mm/page_alloc.c | 8 +++++++-
2 files changed, 10 insertions(+), 9 deletions(-)

diff --git a/mm/internal.h b/mm/internal.h
index 478223b..7f775a1 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -155,14 +155,9 @@ static inline void mlock_migrate_page(struct page *newpage, struct page *page)
*/
static inline void free_page_mlock(struct page *page)
{
- if (unlikely(TestClearPageMlocked(page))) {
- unsigned long flags;
-
- local_irq_save(flags);
- __dec_zone_page_state(page, NR_MLOCK);
- __count_vm_event(UNEVICTABLE_MLOCKFREED);
- local_irq_restore(flags);
- }
+ __ClearPageMlocked(page);
+ __dec_zone_page_state(page, NR_MLOCK);
+ __count_vm_event(UNEVICTABLE_MLOCKFREED);
}

#else /* CONFIG_UNEVICTABLE_LRU */
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 349c64d..a068589 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -498,7 +498,6 @@ static inline void __free_one_page(struct page *page,

static inline int free_pages_check(struct page *page)
{
- free_page_mlock(page);
if (unlikely(page_mapcount(page) |
(page->mapping != NULL) |
(page_count(page) != 0) |
@@ -555,6 +554,7 @@ static void __free_pages_ok(struct page *page, unsigned int order)
unsigned long flags;
int i;
int bad = 0;
+ int clearMlocked = PageMlocked(page);

for (i = 0 ; i < (1 << order) ; ++i)
bad += free_pages_check(page + i);
@@ -570,6 +570,8 @@ static void __free_pages_ok(struct page *page, unsigned int order)
kernel_map_pages(page, 1 << order, 0);

local_irq_save(flags);
+ if (clearMlocked)
+ free_page_mlock(page);
__count_vm_events(PGFREE, 1 << order);
free_one_page(page_zone(page), page, order,
get_pageblock_migratetype(page));
@@ -1020,6 +1022,7 @@ static void free_hot_cold_page(struct page *page, int cold)
struct zone *zone = page_zone(page);
struct per_cpu_pages *pcp;
unsigned long flags;
+ int clearMlocked = PageMlocked(page);

if (PageAnon(page))
page->mapping = NULL;
@@ -1036,6 +1039,9 @@ static void free_hot_cold_page(struct page *page, int cold)
pcp = &zone_pcp(zone, get_cpu())->pcp;
local_irq_save(flags);
__count_vm_event(PGFREE);
+ if (clearMlocked)
+ free_page_mlock(page);
+
if (cold)
list_add_tail(&page->lru, &pcp->list);
else
--
1.5.6.5

2009-03-16 09:50:29

by Mel Gorman

[permalink] [raw]
Subject: [PATCH 17/35] Do not call get_pageblock_migratetype() more than necessary

get_pageblock_migratetype() is potentially called twice for every page
free: once when the page is freed to the pcp lists and once when it is
freed back to the buddy allocator. When freeing from the pcp lists, the
pageblock type at the time of the original free is already known, so use
it rather than rechecking. In low memory situations under memory pressure,
this might skew anti-fragmentation slightly, but the interference is
minimal and decisions that fragment memory are being made anyway.

Signed-off-by: Mel Gorman <[email protected]>
---
mm/page_alloc.c | 16 ++++++++++------
1 files changed, 10 insertions(+), 6 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 795cfc5..349c64d 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -455,16 +455,18 @@ static inline int page_is_buddy(struct page *page, struct page *buddy,
*/

static inline void __free_one_page(struct page *page,
- struct zone *zone, unsigned int order)
+ struct zone *zone, unsigned int order,
+ int migratetype)
{
unsigned long page_idx;
int order_size = 1 << order;
- int migratetype = get_pageblock_migratetype(page);

if (unlikely(PageCompound(page)))
if (unlikely(destroy_compound_page(page, order)))
return;

+ VM_BUG_ON(migratetype == -1);
+
page_idx = page_to_pfn(page) & ((1 << MAX_ORDER) - 1);

VM_BUG_ON(page_idx & (order_size - 1));
@@ -533,17 +535,18 @@ static void free_pages_bulk(struct zone *zone, int count,
page = list_entry(list->prev, struct page, lru);
/* have to delete it as __free_one_page list manipulates */
list_del(&page->lru);
- __free_one_page(page, zone, order);
+ __free_one_page(page, zone, order, page_private(page));
}
spin_unlock(&zone->lock);
}

-static void free_one_page(struct zone *zone, struct page *page, int order)
+static void free_one_page(struct zone *zone, struct page *page, int order,
+ int migratetype)
{
spin_lock(&zone->lock);
zone_clear_flag(zone, ZONE_ALL_UNRECLAIMABLE);
zone->pages_scanned = 0;
- __free_one_page(page, zone, order);
+ __free_one_page(page, zone, order, migratetype);
spin_unlock(&zone->lock);
}

@@ -568,7 +571,8 @@ static void __free_pages_ok(struct page *page, unsigned int order)

local_irq_save(flags);
__count_vm_events(PGFREE, 1 << order);
- free_one_page(page_zone(page), page, order);
+ free_one_page(page_zone(page), page, order,
+ get_pageblock_migratetype(page));
local_irq_restore(flags);
}

--
1.5.6.5

2009-03-16 09:49:56

by Mel Gorman

[permalink] [raw]
Subject: [PATCH 16/35] Save text by reducing call sites of __rmqueue()

__rmqueue is inlined in the fast path but it has two call sites, the low
order and high order paths. However, a slight modification to the
high-order path reduces the call sites of __rmqueue. This reduces text
at the slight increase of complexity of the high-order allocation path.

Signed-off-by: Mel Gorman <[email protected]>
---
mm/page_alloc.c | 11 +++++++----
1 files changed, 7 insertions(+), 4 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 0ba9e4f..795cfc5 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1123,11 +1123,14 @@ again:
list_del(&page->lru);
pcp->count--;
} else {
- spin_lock_irqsave(&zone->lock, flags);
- page = __rmqueue(zone, order, migratetype);
- spin_unlock(&zone->lock);
- if (!page)
+ LIST_HEAD(list);
+ local_irq_save(flags);
+
+ /* Calling __rmqueue would bloat text, hence this */
+ if (!rmqueue_bulk(zone, order, 1, &list, migratetype))
goto failed;
+ page = list_entry(list.next, struct page, lru);
+ list_del(&page->lru);
}

__count_zone_vm_events(PGALLOC, zone, 1 << order);
--
1.5.6.5

2009-03-16 09:51:07

by Mel Gorman

[permalink] [raw]
Subject: [PATCH 19/35] Do not setup zonelist cache when there is only one node

There is a zonelist cache which is used to track zones that are not in
the allowed cpuset or were recently found to be full. This is to reduce the
cache footprint of the zonelist scan on large machines. On smaller machines,
it just incurs cost for no gain. This patch only uses the zonelist cache
when there is more than one NUMA node online.

Signed-off-by: Mel Gorman <[email protected]>
---
mm/page_alloc.c | 10 ++++++++--
1 files changed, 8 insertions(+), 2 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index a068589..e620c91 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1442,6 +1442,8 @@ get_page_from_freelist(gfp_t gfp_mask, nodemask_t *nodemask, unsigned int order,
/* Determine in advance if the zonelist needs filtering */
if ((alloc_flags & ALLOC_CPUSET) && unlikely(number_of_cpusets > 1))
zonelist_filter = 1;
+ if (num_online_nodes() > 1)
+ zonelist_filter = 1;

zonelist_scan:
/*
@@ -1486,8 +1488,12 @@ this_zone_full:
zlc_mark_zone_full(zonelist, z);
try_next_zone:
if (NUMA_BUILD && zonelist_filter) {
- if (!did_zlc_setup) {
- /* do zlc_setup after the first zone is tried */
+ if (!did_zlc_setup && num_online_nodes() > 1) {
+ /*
+ * do zlc_setup after the first zone is tried
+ * but only if there are multiple nodes to make
+ * it worthwhile
+ */
allowednodes = zlc_setup(zonelist, alloc_flags);
zlc_active = 1;
}
--
1.5.6.5

2009-03-16 09:51:27

by Mel Gorman

[permalink] [raw]
Subject: [PATCH 20/35] Use a pre-calculated value for num_online_nodes()

num_online_nodes() is called by the page allocator to decide whether the
zonelist needs to be filtered based on cpusets or the zonelist cache.
This is actually a heavy function that touches a number of cache lines.
This patch stores the number of online nodes at boot time and updates it
when nodes are onlined or offlined.
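
The idea in miniature, outside the kernel (not part of the diff below); the
bitmap and popcount are invented stand-ins for the node state machinery:

/*
 * Sketch: recalculate a cached node count on the rare online/offline
 * path so the allocation hot path only reads a single integer.
 */
#include <stdio.h>

static unsigned long node_online_map;	/* one bit per node */
static int demo_num_online_nodes;	/* cached, read-mostly value */

static void demo_node_set_online(int nid)
{
	node_online_map |= 1UL << nid;
	/* The expensive count happens here, not in the allocator */
	demo_num_online_nodes = __builtin_popcountl(node_online_map);
}

int main(void)
{
	demo_node_set_online(0);
	demo_node_set_online(3);
	/* Hot path: no bitmap walk, just the cached integer */
	printf("online nodes: %d\n", demo_num_online_nodes);	/* 2 */
	return 0;
}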

Signed-off-by: Mel Gorman <[email protected]>
---
include/linux/nodemask.h | 16 ++++++++++++++--
mm/page_alloc.c | 6 ++++--
2 files changed, 18 insertions(+), 4 deletions(-)

diff --git a/include/linux/nodemask.h b/include/linux/nodemask.h
index 848025c..4749e30 100644
--- a/include/linux/nodemask.h
+++ b/include/linux/nodemask.h
@@ -449,13 +449,25 @@ static inline int num_node_state(enum node_states state)
node; \
})

+/* Recorded value for num_online_nodes() */
+extern int static_num_online_nodes;
+
#define num_online_nodes() num_node_state(N_ONLINE)
#define num_possible_nodes() num_node_state(N_POSSIBLE)
#define node_online(node) node_state((node), N_ONLINE)
#define node_possible(node) node_state((node), N_POSSIBLE)

-#define node_set_online(node) node_set_state((node), N_ONLINE)
-#define node_set_offline(node) node_clear_state((node), N_ONLINE)
+static inline void node_set_online(int nid)
+{
+ node_set_state(nid, N_ONLINE);
+ static_num_online_nodes = num_node_state(N_ONLINE);
+}
+
+static inline void node_set_offline(int nid)
+{
+ node_clear_state(nid, N_ONLINE);
+ static_num_online_nodes = num_node_state(N_ONLINE);
+}

#define for_each_node(node) for_each_node_state(node, N_POSSIBLE)
#define for_each_online_node(node) for_each_node_state(node, N_ONLINE)
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index e620c91..d297780 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -70,6 +70,7 @@ EXPORT_SYMBOL(node_states);
unsigned long totalram_pages __read_mostly;
unsigned long totalreserve_pages __read_mostly;
unsigned long highest_memmap_pfn __read_mostly;
+int static_num_online_nodes __read_mostly;
int percpu_pagelist_fraction;

#ifdef CONFIG_HUGETLB_PAGE_SIZE_VARIABLE
@@ -1442,7 +1443,7 @@ get_page_from_freelist(gfp_t gfp_mask, nodemask_t *nodemask, unsigned int order,
/* Determine in advance if the zonelist needs filtering */
if ((alloc_flags & ALLOC_CPUSET) && unlikely(number_of_cpusets > 1))
zonelist_filter = 1;
- if (num_online_nodes() > 1)
+ if (static_num_online_nodes > 1)
zonelist_filter = 1;

zonelist_scan:
@@ -1488,7 +1489,7 @@ this_zone_full:
zlc_mark_zone_full(zonelist, z);
try_next_zone:
if (NUMA_BUILD && zonelist_filter) {
- if (!did_zlc_setup && num_online_nodes() > 1) {
+ if (!did_zlc_setup && static_num_online_nodes > 1) {
/*
* do zlc_setup after the first zone is tried
* but only if there are multiple nodes to make
@@ -2645,6 +2646,7 @@ void build_all_zonelists(void)
else
page_group_by_mobility_disabled = 0;

+ static_num_online_nodes = num_node_state(N_ONLINE);
printk("Built %i zonelists in %s order, mobility grouping %s. "
"Total pages: %ld\n",
num_online_nodes(),
--
1.5.6.5

2009-03-16 09:52:26

by Mel Gorman

[permalink] [raw]
Subject: [PATCH 23/35] Update NR_FREE_PAGES only as necessary

When pages are being freed to the buddy allocator, the zone
NR_FREE_PAGES counter must be updated. In the case of bulk per-cpu page
freeing, it is currently updated once per page. This touches cache lines
more often than necessary. Update the counter once per per-cpu bulk free
instead.

Signed-off-by: Mel Gorman <[email protected]>
---
mm/page_alloc.c | 12 ++++++------
1 files changed, 6 insertions(+), 6 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 446cefa..bc491fa 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -460,7 +460,6 @@ static inline void __free_one_page(struct page *page,
int migratetype)
{
unsigned long page_idx;
- int order_size = 1 << order;

if (unlikely(PageCompound(page)))
if (unlikely(destroy_compound_page(page, order)))
@@ -470,10 +469,9 @@ static inline void __free_one_page(struct page *page,

page_idx = page_to_pfn(page) & ((1 << MAX_ORDER) - 1);

- VM_BUG_ON(page_idx & (order_size - 1));
+ VM_BUG_ON(page_idx & ((1 << order) - 1));
VM_BUG_ON(bad_range(zone, page));

- __mod_zone_page_state(zone, NR_FREE_PAGES, order_size);
while (order < MAX_ORDER-1) {
unsigned long combined_idx;
struct page *buddy;
@@ -528,6 +526,8 @@ static void free_pages_bulk(struct zone *zone, int count,
spin_lock(&zone->lock);
zone_clear_flag(zone, ZONE_ALL_UNRECLAIMABLE);
zone->pages_scanned = 0;
+
+ __mod_zone_page_state(zone, NR_FREE_PAGES, count);
while (count--) {
struct page *page;

@@ -546,6 +546,8 @@ static void free_one_page(struct zone *zone, struct page *page, int order,
spin_lock(&zone->lock);
zone_clear_flag(zone, ZONE_ALL_UNRECLAIMABLE);
zone->pages_scanned = 0;
+
+ __mod_zone_page_state(zone, NR_FREE_PAGES, 1 << order);
__free_one_page(page, zone, order, migratetype);
spin_unlock(&zone->lock);
}
@@ -690,7 +692,6 @@ struct page *__rmqueue_smallest(struct zone *zone, unsigned int order,
list_del(&page->lru);
rmv_page_order(page);
area->nr_free--;
- __mod_zone_page_state(zone, NR_FREE_PAGES, - (1UL << order));
expand(zone, page, order, current_order, area, migratetype);
return page;
}
@@ -830,8 +831,6 @@ __rmqueue_fallback(struct zone *zone, int order, int start_migratetype)
/* Remove the page from the freelists */
list_del(&page->lru);
rmv_page_order(page);
- __mod_zone_page_state(zone, NR_FREE_PAGES,
- -(1UL << order));

if (current_order == pageblock_order)
set_pageblock_migratetype(page,
@@ -905,6 +904,7 @@ static int rmqueue_bulk(struct zone *zone, unsigned int order,
set_page_private(page, migratetype);
list = &page->lru;
}
+ __mod_zone_page_state(zone, NR_FREE_PAGES, -(1UL << order) * i);
spin_unlock(&zone->lock);
return i;
}
--
1.5.6.5

2009-03-16 09:52:03

by Mel Gorman

[permalink] [raw]
Subject: [PATCH 22/35] Use allocation flags as an index to the zone watermark

ALLOC_WMARK_MIN, ALLOC_WMARK_LOW and ALLOC_WMARK_HIGH determine whether
pages_min, pages_low or pages_high is used as the zone watermark when
allocating the pages. Two branches in the allocator hotpath determine
which watermark to use. This patch places the three watermarks in a union
with an array so the watermark flags can be used directly as an array
index, reducing the branches taken.
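
A minimal userspace sketch of the union trick follows, separate from the
diff below; the demo_zone structure, flag values and watermark numbers are
invented for the example (it relies on C11 anonymous members):

/*
 * Named watermark fields overlay an array so the low bits of the
 * allocation flags index it directly instead of being tested one by one.
 */
#include <stdio.h>

#define DEMO_WMARK_MIN	0x0
#define DEMO_WMARK_LOW	0x1
#define DEMO_WMARK_HIGH	0x2
#define DEMO_WMARK_MASK	0x7

struct demo_zone {
	union {
		struct {
			unsigned long pages_min, pages_low, pages_high;
		};
		unsigned long pages_mark[3];
	};
};

int main(void)
{
	struct demo_zone z;
	int alloc_flags = DEMO_WMARK_LOW;	/* would come from the caller */
	unsigned long mark;

	z.pages_min = 128;
	z.pages_low = 256;
	z.pages_high = 384;

	/* No branches: the flag bits select the watermark directly */
	mark = z.pages_mark[alloc_flags & DEMO_WMARK_MASK];
	printf("selected watermark: %lu\n", mark);	/* 256 */
	return 0;
}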

Signed-off-by: Mel Gorman <[email protected]>
---
include/linux/mmzone.h | 8 +++++++-
mm/page_alloc.c | 18 ++++++++----------
2 files changed, 15 insertions(+), 11 deletions(-)

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index ca000b8..c20c662 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -275,7 +275,13 @@ struct zone_reclaim_stat {

struct zone {
/* Fields commonly accessed by the page allocator */
- unsigned long pages_min, pages_low, pages_high;
+ union {
+ struct {
+ unsigned long pages_min, pages_low, pages_high;
+ };
+ unsigned long pages_mark[3];
+ };
+
/*
* We don't know if the memory that we're going to allocate will be freeable
* or/and it will be released eventually, so to avoid totally wasting several
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 2704092..446cefa 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1160,10 +1160,13 @@ failed:
return NULL;
}

-#define ALLOC_NO_WATERMARKS 0x01 /* don't check watermarks at all */
-#define ALLOC_WMARK_MIN 0x02 /* use pages_min watermark */
-#define ALLOC_WMARK_LOW 0x04 /* use pages_low watermark */
-#define ALLOC_WMARK_HIGH 0x08 /* use pages_high watermark */
+/* The WMARK bits are used as an index zone->pages_mark */
+#define ALLOC_WMARK_MIN 0x00 /* use pages_min watermark */
+#define ALLOC_WMARK_LOW 0x01 /* use pages_low watermark */
+#define ALLOC_WMARK_HIGH 0x02 /* use pages_high watermark */
+#define ALLOC_NO_WATERMARKS 0x08 /* don't check watermarks at all */
+#define ALLOC_WMARK_MASK 0x07 /* Mask to get the watermark bits */
+
#define ALLOC_HARDER 0x10 /* try to alloc harder */
#define ALLOC_HIGH 0x20 /* __GFP_HIGH set */
#ifdef CONFIG_CPUSETS
@@ -1466,12 +1469,7 @@ zonelist_scan:

if (!(alloc_flags & ALLOC_NO_WATERMARKS)) {
unsigned long mark;
- if (alloc_flags & ALLOC_WMARK_MIN)
- mark = zone->pages_min;
- else if (alloc_flags & ALLOC_WMARK_LOW)
- mark = zone->pages_low;
- else
- mark = zone->pages_high;
+ mark = zone->pages_mark[alloc_flags & ALLOC_WMARK_MASK];
if (!zone_watermark_ok(zone, order, mark,
classzone_idx, alloc_flags)) {
if (!zone_reclaim_mode ||
--
1.5.6.5

2009-03-16 09:51:46

by Mel Gorman

[permalink] [raw]
Subject: [PATCH 21/35] Do not check for compound pages during the page allocator sanity checks

A number of sanity checks are made on each page allocation and free,
including a check that the page count is zero. page_count() checks whether
the page is compound and, if so, checks the count of the head page instead.
However, in these paths we do not care whether the page is compound or not
as the count of each tail page should also be zero.

This patch makes two changes to the use of page_count() in the allocation
and free paths. It converts one check of page_count() to a VM_BUG_ON() as
the count should have been unconditionally checked earlier in the free
path. It also avoids checking for compound pages.

[[email protected]: Wrote changelog]
Signed-off-by: Nick Piggin <[email protected]>
---
mm/page_alloc.c | 6 +++---
1 files changed, 3 insertions(+), 3 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index d297780..2704092 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -425,7 +425,7 @@ static inline int page_is_buddy(struct page *page, struct page *buddy,
return 0;

if (PageBuddy(buddy) && page_order(buddy) == order) {
- BUG_ON(page_count(buddy) != 0);
+ VM_BUG_ON(page_count(buddy) != 0);
return 1;
}
return 0;
@@ -501,7 +501,7 @@ static inline int free_pages_check(struct page *page)
{
if (unlikely(page_mapcount(page) |
(page->mapping != NULL) |
- (page_count(page) != 0) |
+ (atomic_read(&page->_count) != 0) |
(page->flags & PAGE_FLAGS_CHECK_AT_FREE))) {
bad_page(page);
return 1;
@@ -646,7 +646,7 @@ static int prep_new_page(struct page *page, int order, gfp_t gfp_flags)
{
if (unlikely(page_mapcount(page) |
(page->mapping != NULL) |
- (page_count(page) != 0) |
+ (atomic_read(&page->_count) != 0) |
(page->flags & PAGE_FLAGS_CHECK_AT_PREP))) {
bad_page(page);
return 1;
--
1.5.6.5

2009-03-16 09:52:43

by Mel Gorman

[permalink] [raw]
Subject: [PATCH 24/35] Convert gfp_zone() to use a table of precalculated values

Every page allocation uses gfp_zone() to calculate the highest zone allowed
by a combination of GFP flags. This is a large number of branches to have
in a fast path. This patch replaces the branches with a lookup table that
is calculated at boot-time and stored in the read-mostly section so it can
be shared. This requires __GFP_MOVABLE to be redefined, but it is debatable
whether it should be considered a zone modifier or not.
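
A standalone sketch of the table approach, separate from the diff below;
the flag values and zone names are invented and the demo table is sized to
cover every index:

/*
 * The branch ladder runs once per flag combination at init time to fill
 * a lookup table; the hot path is a single indexed load.
 */
#include <stdio.h>

#define DEMO_DMA	0x1u
#define DEMO_HIGHMEM	0x2u
#define DEMO_DMA32	0x4u
#define DEMO_MOVABLE	0x8u
#define DEMO_ZONEMASK	(DEMO_DMA | DEMO_HIGHMEM | DEMO_DMA32 | DEMO_MOVABLE)

enum demo_zone_type { Z_DMA, Z_DMA32, Z_NORMAL, Z_HIGHMEM, Z_MOVABLE };

static int zone_table[DEMO_ZONEMASK + 1];	/* covers every index */

/* The branchy version, now only used to build the table */
static int flags_to_zone(unsigned int flags)
{
	if (flags & DEMO_DMA)
		return Z_DMA;
	if (flags & DEMO_DMA32)
		return Z_DMA32;
	if ((flags & (DEMO_HIGHMEM | DEMO_MOVABLE)) ==
			(DEMO_HIGHMEM | DEMO_MOVABLE))
		return Z_MOVABLE;
	if (flags & DEMO_HIGHMEM)
		return Z_HIGHMEM;
	return Z_NORMAL;
}

static void init_zone_table(void)
{
	unsigned int f;

	for (f = 0; f <= DEMO_ZONEMASK; f++)
		zone_table[f] = flags_to_zone(f);
}

/* Fast-path lookup: no branches at all */
static inline int demo_gfp_zone(unsigned int flags)
{
	return zone_table[flags & DEMO_ZONEMASK];
}

int main(void)
{
	init_zone_table();
	printf("%d\n", demo_gfp_zone(DEMO_HIGHMEM | DEMO_MOVABLE));	/* Z_MOVABLE */
	printf("%d\n", demo_gfp_zone(0));				/* Z_NORMAL */
	return 0;
}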

Signed-off-by: Mel Gorman <[email protected]>
---
include/linux/gfp.h | 28 +++++++++++-----------------
init/main.c | 1 +
mm/page_alloc.c | 36 +++++++++++++++++++++++++++++++++++-
3 files changed, 47 insertions(+), 18 deletions(-)

diff --git a/include/linux/gfp.h b/include/linux/gfp.h
index 59eb093..581f8a9 100644
--- a/include/linux/gfp.h
+++ b/include/linux/gfp.h
@@ -16,6 +16,10 @@ struct vm_area_struct;
* Do not put any conditional on these. If necessary modify the definitions
* without the underscores and use the consistently. The definitions here may
* be used in bit comparisons.
+ *
+ * Note that __GFP_MOVABLE uses the next available bit but it is not
+ * a zone modifier. It uses the fourth bit so that the calculation of
+ * gfp_zone() can use a table rather than a series of comparisons
*/
#define __GFP_DMA ((__force gfp_t)0x01u)
#define __GFP_HIGHMEM ((__force gfp_t)0x02u)
@@ -50,7 +54,7 @@ struct vm_area_struct;
#define __GFP_HARDWALL ((__force gfp_t)0x20000u) /* Enforce hardwall cpuset memory allocs */
#define __GFP_THISNODE ((__force gfp_t)0x40000u)/* No fallback, no policies */
#define __GFP_RECLAIMABLE ((__force gfp_t)0x80000u) /* Page is reclaimable */
-#define __GFP_MOVABLE ((__force gfp_t)0x100000u) /* Page is movable */
+#define __GFP_MOVABLE ((__force gfp_t)0x08u) /* Page is movable */

#define __GFP_BITS_SHIFT 21 /* Room for 21 __GFP_FOO bits */
#define __GFP_BITS_MASK ((__force gfp_t)((1 << __GFP_BITS_SHIFT) - 1))
@@ -77,6 +81,9 @@ struct vm_area_struct;
#define GFP_THISNODE ((__force gfp_t)0)
#endif

+/* This is a mask of all modifiers affecting gfp_zonemask() */
+#define GFP_ZONEMASK (__GFP_DMA | __GFP_HIGHMEM | __GFP_DMA32 | __GFP_MOVABLE)
+
/* This mask makes up all the page movable related flags */
#define GFP_MOVABLE_MASK (__GFP_RECLAIMABLE|__GFP_MOVABLE)

@@ -112,24 +119,11 @@ static inline int allocflags_to_migratetype(gfp_t gfp_flags)
((gfp_flags & __GFP_RECLAIMABLE) != 0);
}

+extern int gfp_zone_table[GFP_ZONEMASK];
+void init_gfp_zone_table(void);
static inline enum zone_type gfp_zone(gfp_t flags)
{
-#ifdef CONFIG_ZONE_DMA
- if (flags & __GFP_DMA)
- return ZONE_DMA;
-#endif
-#ifdef CONFIG_ZONE_DMA32
- if (flags & __GFP_DMA32)
- return ZONE_DMA32;
-#endif
- if ((flags & (__GFP_HIGHMEM | __GFP_MOVABLE)) ==
- (__GFP_HIGHMEM | __GFP_MOVABLE))
- return ZONE_MOVABLE;
-#ifdef CONFIG_HIGHMEM
- if (flags & __GFP_HIGHMEM)
- return ZONE_HIGHMEM;
-#endif
- return ZONE_NORMAL;
+ return gfp_zone_table[flags & GFP_ZONEMASK];
}

/*
diff --git a/init/main.c b/init/main.c
index 8442094..08a5663 100644
--- a/init/main.c
+++ b/init/main.c
@@ -573,6 +573,7 @@ asmlinkage void __init start_kernel(void)
* fragile until we cpu_idle() for the first time.
*/
preempt_disable();
+ init_gfp_zone_table();
build_all_zonelists();
page_alloc_init();
printk(KERN_NOTICE "Kernel command line: %s\n", boot_command_line);
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index bc491fa..d76f57d 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -70,6 +70,7 @@ EXPORT_SYMBOL(node_states);
unsigned long totalram_pages __read_mostly;
unsigned long totalreserve_pages __read_mostly;
unsigned long highest_memmap_pfn __read_mostly;
+int gfp_zone_table[GFP_ZONEMASK] __read_mostly;
int static_num_online_nodes __read_mostly;
int percpu_pagelist_fraction;

@@ -4569,7 +4570,7 @@ static void setup_per_zone_inactive_ratio(void)
* 8192MB: 11584k
* 16384MB: 16384k
*/
-static int __init init_per_zone_pages_min(void)
+static int init_per_zone_pages_min(void)
{
unsigned long lowmem_kbytes;

@@ -4587,6 +4588,39 @@ static int __init init_per_zone_pages_min(void)
}
module_init(init_per_zone_pages_min)

+static inline int __init gfp_flags_to_zone(gfp_t flags)
+{
+#ifdef CONFIG_ZONE_DMA
+ if (flags & __GFP_DMA)
+ return ZONE_DMA;
+#endif
+#ifdef CONFIG_ZONE_DMA32
+ if (flags & __GFP_DMA32)
+ return ZONE_DMA32;
+#endif
+ if ((flags & (__GFP_HIGHMEM | __GFP_MOVABLE)) ==
+ (__GFP_HIGHMEM | __GFP_MOVABLE))
+ return ZONE_MOVABLE;
+#ifdef CONFIG_HIGHMEM
+ if (flags & __GFP_HIGHMEM)
+ return ZONE_HIGHMEM;
+#endif
+ return ZONE_NORMAL;
+}
+
+/*
+ * For each possible combination of zone modifier flags, we calculate
+ * what zone it should be using. This consumes a cache line in most
+ * cases but avoids a number of branches in the allocator fast path
+ */
+void __init init_gfp_zone_table(void)
+{
+ gfp_t gfp_flags;
+
+ for (gfp_flags = 0; gfp_flags < GFP_ZONEMASK; gfp_flags++)
+ gfp_zone_table[gfp_flags] = gfp_flags_to_zone(gfp_flags);
+}
+
/*
* min_free_kbytes_sysctl_handler - just a wrapper around proc_dointvec() so
* that we can call two helper functions whenever min_free_kbytes
--
1.5.6.5

2009-03-16 09:53:01

by Mel Gorman

[permalink] [raw]
Subject: [PATCH 25/35] Re-sort GFP flags and fix whitespace alignment for easier reading.

Re-sort the GFP flags after __GFP_MOVABLE was redefined so that it is
clearer how the bits are used.

From: Peter Zijlstra <[email protected]>
Signed-off-by: Mel Gorman <[email protected]>
---
include/linux/gfp.h | 9 +++++----
1 files changed, 5 insertions(+), 4 deletions(-)

diff --git a/include/linux/gfp.h b/include/linux/gfp.h
index 581f8a9..8f7d176 100644
--- a/include/linux/gfp.h
+++ b/include/linux/gfp.h
@@ -25,6 +25,8 @@ struct vm_area_struct;
#define __GFP_HIGHMEM ((__force gfp_t)0x02u)
#define __GFP_DMA32 ((__force gfp_t)0x04u)

+#define __GFP_MOVABLE ((__force gfp_t)0x08u) /* Page is movable */
+
/*
* Action modifiers - doesn't change the zoning
*
@@ -50,11 +52,10 @@ struct vm_area_struct;
#define __GFP_NORETRY ((__force gfp_t)0x1000u)/* See above */
#define __GFP_COMP ((__force gfp_t)0x4000u)/* Add compound page metadata */
#define __GFP_ZERO ((__force gfp_t)0x8000u)/* Return zeroed page on success */
-#define __GFP_NOMEMALLOC ((__force gfp_t)0x10000u) /* Don't use emergency reserves */
-#define __GFP_HARDWALL ((__force gfp_t)0x20000u) /* Enforce hardwall cpuset memory allocs */
-#define __GFP_THISNODE ((__force gfp_t)0x40000u)/* No fallback, no policies */
+#define __GFP_NOMEMALLOC ((__force gfp_t)0x10000u) /* Don't use emergency reserves */
+#define __GFP_HARDWALL ((__force gfp_t)0x20000u) /* Enforce hardwall cpuset memory allocs */
+#define __GFP_THISNODE ((__force gfp_t)0x40000u) /* No fallback, no policies */
#define __GFP_RECLAIMABLE ((__force gfp_t)0x80000u) /* Page is reclaimable */
-#define __GFP_MOVABLE ((__force gfp_t)0x08u) /* Page is movable */

#define __GFP_BITS_SHIFT 21 /* Room for 21 __GFP_FOO bits */
#define __GFP_BITS_MASK ((__force gfp_t)((1 << __GFP_BITS_SHIFT) - 1))
--
1.5.6.5

2009-03-16 09:53:28

by Mel Gorman

[permalink] [raw]
Subject: [PATCH 26/35] Use the per-cpu allocator for orders up to PAGE_ALLOC_COSTLY_ORDER

The per-cpu allocator is used to store order-0 pages for fast allocation
without using the zone locks. There are a number of cases where order-1
allocations are frequent. Obviously there are 8K stacks, but signal handlers
and some networking-related structures can also be sufficiently large.

This patch allows orders up to PAGE_ALLOC_COSTLY_ORDER to be stored on the
per-cpu lists, which should help workloads making frequent order-1
allocations such as fork-heavy workloads on x86-64. It is somewhat simplified
in that high-order pages on the lists are not split, but care is taken to
account for the larger size of those pages so that the same amount of memory
can end up on the per-cpu lists. If a high-order allocation would fail, the
lists get drained, so it is not expected to cause any unusual OOM problems.
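
A minimal sketch of the accounting rule this relies on, separate from the
diff below; demo_pcp is an invented stand-in for the real per-cpu structure:

/*
 * The per-cpu count is kept in base pages, so a page of order N
 * contributes (1 << N) when it is added or removed.
 */
#include <stdio.h>

struct demo_pcp {
	int count;	/* base pages currently on the list */
};

static void demo_add_page(struct demo_pcp *pcp, int order)
{
	pcp->count += 1 << order;
}

static void demo_remove_page(struct demo_pcp *pcp, int order)
{
	pcp->count -= 1 << order;
}

int main(void)
{
	struct demo_pcp pcp = { 0 };

	demo_add_page(&pcp, 0);	/* one order-0 page -> +1 */
	demo_add_page(&pcp, 2);	/* one order-2 page -> +4 */
	printf("%d base pages on the list\n", pcp.count);	/* 5 */

	demo_remove_page(&pcp, 2);
	printf("%d base pages on the list\n", pcp.count);	/* 1 */
	return 0;
}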

Signed-off-by: Mel Gorman <[email protected]>
---
mm/page_alloc.c | 118 +++++++++++++++++++++++++++++++++++++------------------
1 files changed, 80 insertions(+), 38 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index d76f57d..42280c1 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -469,6 +469,7 @@ static inline void __free_one_page(struct page *page,
VM_BUG_ON(migratetype == -1);

page_idx = page_to_pfn(page) & ((1 << MAX_ORDER) - 1);
+ page->index = 0;

VM_BUG_ON(page_idx & ((1 << order) - 1));
VM_BUG_ON(bad_range(zone, page));
@@ -510,8 +511,31 @@ static inline int free_pages_check(struct page *page)
return 0;
}

+static inline void rmv_pcp_page(struct per_cpu_pages *pcp, struct page *page)
+{
+ list_del(&page->lru);
+ pcp->count -= 1 << page->index;
+}
+
+static inline void add_pcp_page(struct per_cpu_pages *pcp,
+ struct page *page,
+ int cold)
+{
+ if (cold)
+ list_add_tail(&page->lru, &pcp->list);
+ else
+ list_add(&page->lru, &pcp->list);
+ pcp->count += 1 << page->index;
+}
+
+static inline void bulk_add_pcp_page(struct per_cpu_pages *pcp,
+ int order, int count)
+{
+ pcp->count += count << order;
+}
+
/*
- * Frees a list of pages.
+ * Frees a number of pages from the PCP lists
* Assumes all pages on list are in same zone, and of same order.
* count is the number of pages to free.
*
@@ -522,23 +546,28 @@ static inline int free_pages_check(struct page *page)
* pinned" detection logic.
*/
static void free_pages_bulk(struct zone *zone, int count,
- struct list_head *list, int order)
+ struct per_cpu_pages *pcp)
{
+ unsigned int freed = 0;
+ struct list_head *list = &pcp->list;
+
spin_lock(&zone->lock);
zone_clear_flag(zone, ZONE_ALL_UNRECLAIMABLE);
zone->pages_scanned = 0;

- __mod_zone_page_state(zone, NR_FREE_PAGES, count);
- while (count--) {
+ while (freed < count) {
struct page *page;

VM_BUG_ON(list_empty(list));
page = list_entry(list->prev, struct page, lru);
- /* have to delete it as __free_one_page list manipulates */
- list_del(&page->lru);
- __free_one_page(page, zone, order, page_private(page));
+ rmv_pcp_page(pcp, page);
+
+ freed += 1 << page->index;
+ __free_one_page(page, zone, page->index, page_private(page));
}
spin_unlock(&zone->lock);
+
+ __mod_zone_page_state(zone, NR_FREE_PAGES, freed);
}

static void free_one_page(struct zone *zone, struct page *page, int order,
@@ -655,6 +684,7 @@ static int prep_new_page(struct page *page, int order, gfp_t gfp_flags)
return 1;
}

+ page->index = 0;
set_page_private(page, 0);
set_page_refcounted(page);

@@ -903,9 +933,10 @@ static int rmqueue_bulk(struct zone *zone, unsigned int order,
*/
list_add(&page->lru, list);
set_page_private(page, migratetype);
+ page->index = order;
list = &page->lru;
}
- __mod_zone_page_state(zone, NR_FREE_PAGES, -(1UL << order) * i);
+ __mod_zone_page_state(zone, NR_FREE_PAGES, -(i << order));
spin_unlock(&zone->lock);
return i;
}
@@ -929,8 +960,7 @@ void drain_zone_pages(struct zone *zone, struct per_cpu_pages *pcp)
to_drain = pcp->batch;
else
to_drain = pcp->count;
- free_pages_bulk(zone, to_drain, &pcp->list, 0);
- pcp->count -= to_drain;
+ free_pages_bulk(zone, to_drain, &pcp->list);
local_irq_restore(flags);
}
#endif
@@ -958,8 +988,8 @@ static void drain_pages(unsigned int cpu)

pcp = &pset->pcp;
local_irq_save(flags);
- free_pages_bulk(zone, pcp->count, &pcp->list, 0);
- pcp->count = 0;
+ free_pages_bulk(zone, pcp->count, pcp);
+ BUG_ON(pcp->count);
local_irq_restore(flags);
}
}
@@ -1019,13 +1049,18 @@ void mark_free_pages(struct zone *zone)
/*
* Free a 0-order page
*/
-static void free_hot_cold_page(struct page *page, int cold)
+static void free_hot_cold_page(struct page *page, int order, int cold)
{
struct zone *zone = page_zone(page);
struct per_cpu_pages *pcp;
unsigned long flags;
int clearMlocked = PageMlocked(page);

+ /* SLUB can return lowish-order compound pages that need handling */
+ if (order > 0 && unlikely(PageCompound(page)))
+ if (unlikely(destroy_compound_page(page, order)))
+ return;
+
if (PageAnon(page))
page->mapping = NULL;
if (free_pages_check(page))
@@ -1035,8 +1070,8 @@ static void free_hot_cold_page(struct page *page, int cold)
debug_check_no_locks_freed(page_address(page), PAGE_SIZE);
debug_check_no_obj_freed(page_address(page), PAGE_SIZE);
}
- arch_free_page(page, 0);
- kernel_map_pages(page, 1, 0);
+ arch_free_page(page, order);
+ kernel_map_pages(page, 1 << order, 0);

pcp = &zone_pcp(zone, get_cpu())->pcp;
local_irq_save(flags);
@@ -1044,28 +1079,24 @@ static void free_hot_cold_page(struct page *page, int cold)
if (clearMlocked)
free_page_mlock(page);

- if (cold)
- list_add_tail(&page->lru, &pcp->list);
- else
- list_add(&page->lru, &pcp->list);
set_page_private(page, get_pageblock_migratetype(page));
- pcp->count++;
- if (pcp->count >= pcp->high) {
- free_pages_bulk(zone, pcp->batch, &pcp->list, 0);
- pcp->count -= pcp->batch;
- }
+ page->index = order;
+ add_pcp_page(pcp, page, cold);
+
+ if (pcp->count >= pcp->high)
+ free_pages_bulk(zone, pcp->batch, pcp);
local_irq_restore(flags);
put_cpu();
}

void free_hot_page(struct page *page)
{
- free_hot_cold_page(page, 0);
+ free_hot_cold_page(page, 0, 0);
}

void free_cold_page(struct page *page)
{
- free_hot_cold_page(page, 1);
+ free_hot_cold_page(page, 0, 1);
}

/*
@@ -1086,6 +1117,11 @@ void split_page(struct page *page, unsigned int order)
set_page_refcounted(page + i);
}

+static inline int pcp_page_suit(struct page *page, int migratetype, int order)
+{
+ return page_private(page) == migratetype && page->index == order;
+}
+
/*
* Really, prep_compound_page() should be called from __rmqueue_bulk(). But
* we cheat by calling it from here, in the order > 0 path. Saves a branch
@@ -1102,14 +1138,18 @@ struct page *buffered_rmqueue(struct zone *preferred_zone,

again:
cpu = get_cpu();
- if (likely(order == 0)) {
+ if (likely(order <= PAGE_ALLOC_COSTLY_ORDER)) {
struct per_cpu_pages *pcp;
+ int batch;
+ int delta;

pcp = &zone_pcp(zone, cpu)->pcp;
+ batch = max(1, pcp->batch >> order);
local_irq_save(flags);
if (!pcp->count) {
- pcp->count = rmqueue_bulk(zone, 0,
- pcp->batch, &pcp->list, migratetype);
+ delta = rmqueue_bulk(zone, order, batch,
+ &pcp->list, migratetype);
+ bulk_add_pcp_page(pcp, order, delta);
if (unlikely(!pcp->count))
goto failed;
}
@@ -1117,23 +1157,25 @@ again:
/* Find a page of the appropriate migrate type */
if (cold) {
list_for_each_entry_reverse(page, &pcp->list, lru)
- if (page_private(page) == migratetype)
+ if (pcp_page_suit(page, migratetype, order))
break;
} else {
list_for_each_entry(page, &pcp->list, lru)
- if (page_private(page) == migratetype)
+ if (pcp_page_suit(page, migratetype, order))
break;
}

/* Allocate more to the pcp list if necessary */
if (unlikely(&page->lru == &pcp->list)) {
- pcp->count += rmqueue_bulk(zone, 0,
- pcp->batch, &pcp->list, migratetype);
+ delta = rmqueue_bulk(zone, order, batch,
+ &pcp->list, migratetype);
+ bulk_add_pcp_page(pcp, order, delta);
page = list_entry(pcp->list.next, struct page, lru);
+ if (!pcp_page_suit(page, migratetype, order))
+ goto failed;
}

- list_del(&page->lru);
- pcp->count--;
+ rmv_pcp_page(pcp, page);
} else {
LIST_HEAD(list);
local_irq_save(flags);
@@ -1884,14 +1926,14 @@ void __pagevec_free(struct pagevec *pvec)
int i = pagevec_count(pvec);

while (--i >= 0)
- free_hot_cold_page(pvec->pages[i], pvec->cold);
+ free_hot_cold_page(pvec->pages[i], 0, pvec->cold);
}

void __free_pages(struct page *page, unsigned int order)
{
if (put_page_testzero(page)) {
- if (order == 0)
- free_hot_page(page);
+ if (order <= PAGE_ALLOC_COSTLY_ORDER)
+ free_hot_cold_page(page, order, 0);
else
__free_pages_ok(page, order);
}
--
1.5.6.5

2009-03-16 09:53:45

by Mel Gorman

[permalink] [raw]
Subject: [PATCH 28/35] Batch free pages from migratetype per-cpu lists

When the PCP lists are too large, a number of pages are freed in bulk.
Currently the lists are examined in a round-robin fashion, but this touches
more cache lines than necessary. This patch frees pages from one list at a
time and uses the most recently used migratetype as the starting point.

Signed-off-by: Mel Gorman <[email protected]>
---
mm/page_alloc.c | 36 +++++++++++++++++++++++-------------
1 files changed, 23 insertions(+), 13 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 3516b87..edadab1 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -547,32 +547,42 @@ static inline void bulk_add_pcp_page(struct per_cpu_pages *pcp,
* pinned" detection logic.
*/
static void free_pcppages_bulk(struct zone *zone, int count,
- struct per_cpu_pages *pcp)
+ struct per_cpu_pages *pcp,
+ int migratetype)
{
- int migratetype = 0;
unsigned int freed = 0;
+ unsigned int bulkcount;
+ struct list_head *list;

spin_lock(&zone->lock);
zone_clear_flag(zone, ZONE_ALL_UNRECLAIMABLE);
zone->pages_scanned = 0;

+ list = &pcp->lists[migratetype];
+ bulkcount = 1 + (count / (MIGRATE_PCPTYPES * 2));
while (freed < count) {
struct page *page;
- struct list_head *list;
+ int thisfreed;

- /* Remove pages from lists in a round-robin fashion */
- do {
- if (migratetype == MIGRATE_PCPTYPES)
+ /*
+ * Move to another migratetype if this list is depleted or
+ * we've freed enough in this batch
+ */
+ while (list_empty(list) || bulkcount < 0) {
+ bulkcount = 1 + (count / (MIGRATE_PCPTYPES * 2));
+ if (++migratetype == MIGRATE_PCPTYPES)
migratetype = 0;
list = &pcp->lists[migratetype];
- migratetype++;
- } while (list_empty(list));
+ }

+ /* Remove from list and update counters */
page = list_entry(list->prev, struct page, lru);
rmv_pcp_page(pcp, page);
+ thisfreed = 1 << page->index;
+ freed += thisfreed;
+ bulkcount -= thisfreed;

- freed += 1 << page->index;
- __free_one_page(page, zone, page->index, page_private(page));
+ __free_one_page(page, zone, page->index, migratetype);
}
spin_unlock(&zone->lock);

@@ -969,7 +979,7 @@ void drain_zone_pages(struct zone *zone, struct per_cpu_pages *pcp)
to_drain = pcp->batch;
else
to_drain = pcp->count;
- free_pcppages_bulk(zone, to_drain, pcp);
+ free_pcppages_bulk(zone, to_drain, pcp, 0);
local_irq_restore(flags);
}
#endif
@@ -997,7 +1007,7 @@ static void drain_pages(unsigned int cpu)

pcp = &pset->pcp;
local_irq_save(flags);
- free_pcppages_bulk(zone, pcp->count, pcp);
+ free_pcppages_bulk(zone, pcp->count, pcp, 0);
BUG_ON(pcp->count);
local_irq_restore(flags);
}
@@ -1110,7 +1120,7 @@ static void free_hot_cold_page(struct page *page, int order, int cold)
page->index = order;
add_pcp_page(pcp, page, cold);
if (pcp->count >= pcp->high)
- free_pcppages_bulk(zone, pcp->batch, pcp);
+ free_pcppages_bulk(zone, pcp->batch, pcp, migratetype);

out:
local_irq_restore(flags);
--
1.5.6.5

2009-03-16 09:54:03

by Mel Gorman

[permalink] [raw]
Subject: [PATCH 27/35] Split per-cpu list into one-list-per-migrate-type

Currently the per-cpu page allocator searches the PCP list for pages of the
correct migrate-type to reduce the possibility of pages being inappropriately
placed from a fragmentation perspective. This search is potentially expensive
in a fast-path and undesirable. Splitting the per-cpu list into multiple
lists increases the size of the per-cpu structure, which was potentially a
major problem at the time the search was introduced. That problem has since
been mitigated as only the necessary number of structures is allocated for
the running system.

This patch replaces the list search in the per-cpu allocator with one list
per migrate type. The list is still searched, but only for a page of the
correct order, which is expected to be much less costly.
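
A data-structure sketch of the one-list-per-type idea, separate from the
diff below; it uses an invented singly-linked list rather than the kernel's
list_head:

/*
 * One list head per migrate type replaces a single mixed list, so the
 * allocator indexes straight into the right list instead of scanning
 * for a matching type.
 */
#include <stdio.h>
#include <stddef.h>

enum { DEMO_UNMOVABLE, DEMO_RECLAIMABLE, DEMO_MOVABLE, DEMO_PCPTYPES };

struct demo_page {
	int migratetype;
	struct demo_page *next;
};

struct demo_pcp {
	struct demo_page *lists[DEMO_PCPTYPES];	/* one list per type */
};

static void demo_free_to_pcp(struct demo_pcp *pcp, struct demo_page *page)
{
	page->next = pcp->lists[page->migratetype];
	pcp->lists[page->migratetype] = page;
}

static struct demo_page *demo_alloc_from_pcp(struct demo_pcp *pcp, int migratetype)
{
	/* No cross-type search: index straight into the right list */
	struct demo_page *page = pcp->lists[migratetype];

	if (page)
		pcp->lists[migratetype] = page->next;
	return page;
}

int main(void)
{
	struct demo_pcp pcp = { { NULL } };
	struct demo_page movable = { DEMO_MOVABLE, NULL };

	demo_free_to_pcp(&pcp, &movable);
	printf("%p\n", (void *)demo_alloc_from_pcp(&pcp, DEMO_MOVABLE));
	printf("%p\n", (void *)demo_alloc_from_pcp(&pcp, DEMO_UNMOVABLE));	/* (nil) */
	return 0;
}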

Signed-off-by: Mel Gorman <[email protected]>
---
include/linux/mmzone.h | 5 ++-
mm/page_alloc.c | 82 +++++++++++++++++++++++++++++++++--------------
2 files changed, 61 insertions(+), 26 deletions(-)

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index c20c662..eed6867 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -38,6 +38,7 @@
#define MIGRATE_UNMOVABLE 0
#define MIGRATE_RECLAIMABLE 1
#define MIGRATE_MOVABLE 2
+#define MIGRATE_PCPTYPES 3 /* the number of types on the pcp lists */
#define MIGRATE_RESERVE 3
#define MIGRATE_ISOLATE 4 /* can't allocate from here */
#define MIGRATE_TYPES 5
@@ -167,7 +168,9 @@ struct per_cpu_pages {
int count; /* number of pages in the list */
int high; /* high watermark, emptying needed */
int batch; /* chunk size for buddy add/remove */
- struct list_head list; /* the list of pages */
+
+ /* Lists of pages, one per migrate type stored on the pcp-lists */
+ struct list_head lists[MIGRATE_PCPTYPES];
};

struct per_cpu_pageset {
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 42280c1..3516b87 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -521,10 +521,11 @@ static inline void add_pcp_page(struct per_cpu_pages *pcp,
struct page *page,
int cold)
{
+ int migratetype = page_private(page);
if (cold)
- list_add_tail(&page->lru, &pcp->list);
+ list_add_tail(&page->lru, &pcp->lists[migratetype]);
else
- list_add(&page->lru, &pcp->list);
+ list_add(&page->lru, &pcp->lists[migratetype]);
pcp->count += 1 << page->index;
}

@@ -545,11 +546,11 @@ static inline void bulk_add_pcp_page(struct per_cpu_pages *pcp,
* And clear the zone's pages_scanned counter, to hold off the "all pages are
* pinned" detection logic.
*/
-static void free_pages_bulk(struct zone *zone, int count,
+static void free_pcppages_bulk(struct zone *zone, int count,
struct per_cpu_pages *pcp)
{
+ int migratetype = 0;
unsigned int freed = 0;
- struct list_head *list = &pcp->list;

spin_lock(&zone->lock);
zone_clear_flag(zone, ZONE_ALL_UNRECLAIMABLE);
@@ -557,8 +558,16 @@ static void free_pages_bulk(struct zone *zone, int count,

while (freed < count) {
struct page *page;
+ struct list_head *list;
+
+ /* Remove pages from lists in a round-robin fashion */
+ do {
+ if (migratetype == MIGRATE_PCPTYPES)
+ migratetype = 0;
+ list = &pcp->lists[migratetype];
+ migratetype++;
+ } while (list_empty(list));

- VM_BUG_ON(list_empty(list));
page = list_entry(list->prev, struct page, lru);
rmv_pcp_page(pcp, page);

@@ -960,7 +969,7 @@ void drain_zone_pages(struct zone *zone, struct per_cpu_pages *pcp)
to_drain = pcp->batch;
else
to_drain = pcp->count;
- free_pages_bulk(zone, to_drain, &pcp->list);
+ free_pcppages_bulk(zone, to_drain, pcp);
local_irq_restore(flags);
}
#endif
@@ -988,7 +997,7 @@ static void drain_pages(unsigned int cpu)

pcp = &pset->pcp;
local_irq_save(flags);
- free_pages_bulk(zone, pcp->count, pcp);
+ free_pcppages_bulk(zone, pcp->count, pcp);
BUG_ON(pcp->count);
local_irq_restore(flags);
}
@@ -1054,6 +1063,7 @@ static void free_hot_cold_page(struct page *page, int order, int cold)
struct zone *zone = page_zone(page);
struct per_cpu_pages *pcp;
unsigned long flags;
+ int migratetype;
int clearMlocked = PageMlocked(page);

/* SLUB can return lowish-order compound pages that need handling */
@@ -1073,18 +1083,36 @@ static void free_hot_cold_page(struct page *page, int order, int cold)
arch_free_page(page, order);
kernel_map_pages(page, 1 << order, 0);

+ migratetype = get_pageblock_migratetype(page);
+
pcp = &zone_pcp(zone, get_cpu())->pcp;
local_irq_save(flags);
__count_vm_event(PGFREE);
if (clearMlocked)
free_page_mlock(page);

- set_page_private(page, get_pageblock_migratetype(page));
+ /*
+ * We only track unreclaimable, reclaimable and movable on pcp lists.
+ * Free ISOLATE pages back to the allocator because they are being
+ * offlined but treat RESERVE as movable pages so we can get those
+ * areas back if necessary. Otherwise, we may have to free
+ * excessively into the page allocator
+ */
+ if (migratetype >= MIGRATE_PCPTYPES) {
+ if (unlikely(migratetype == MIGRATE_ISOLATE)) {
+ free_one_page(zone, page, order, migratetype);
+ goto out;
+ }
+ migratetype = MIGRATE_MOVABLE;
+ }
+
+ set_page_private(page, migratetype);
page->index = order;
add_pcp_page(pcp, page, cold);
-
if (pcp->count >= pcp->high)
- free_pages_bulk(zone, pcp->batch, pcp);
+ free_pcppages_bulk(zone, pcp->batch, pcp);
+
+out:
local_irq_restore(flags);
put_cpu();
}
@@ -1117,9 +1145,9 @@ void split_page(struct page *page, unsigned int order)
set_page_refcounted(page + i);
}

-static inline int pcp_page_suit(struct page *page, int migratetype, int order)
+static inline int pcp_page_suit(struct page *page, int order)
{
- return page_private(page) == migratetype && page->index == order;
+ return page->index == order;
}

/*
@@ -1142,36 +1170,38 @@ again:
struct per_cpu_pages *pcp;
int batch;
int delta;
+ struct list_head *list;

pcp = &zone_pcp(zone, cpu)->pcp;
+ list = &pcp->lists[migratetype];
batch = max(1, pcp->batch >> order);
local_irq_save(flags);
- if (!pcp->count) {
+ if (list_empty(list)) {
delta = rmqueue_bulk(zone, order, batch,
- &pcp->list, migratetype);
+ list, migratetype);
bulk_add_pcp_page(pcp, order, delta);
- if (unlikely(!pcp->count))
+ if (unlikely(list_empty(list)))
goto failed;
}

- /* Find a page of the appropriate migrate type */
+ /* Find a page of the appropriate order */
if (cold) {
- list_for_each_entry_reverse(page, &pcp->list, lru)
- if (pcp_page_suit(page, migratetype, order))
+ list_for_each_entry_reverse(page, list, lru)
+ if (pcp_page_suit(page, order))
break;
} else {
- list_for_each_entry(page, &pcp->list, lru)
- if (pcp_page_suit(page, migratetype, order))
+ list_for_each_entry(page, list, lru)
+ if (pcp_page_suit(page, order))
break;
}

/* Allocate more to the pcp list if necessary */
- if (unlikely(&page->lru == &pcp->list)) {
+ if (unlikely(&page->lru == list)) {
delta = rmqueue_bulk(zone, order, batch,
- &pcp->list, migratetype);
+ list, migratetype);
bulk_add_pcp_page(pcp, order, delta);
- page = list_entry(pcp->list.next, struct page, lru);
- if (!pcp_page_suit(page, migratetype, order))
+ page = list_entry(list->next, struct page, lru);
+ if (!pcp_page_suit(page, order))
goto failed;
}

@@ -2938,6 +2968,7 @@ static int zone_batchsize(struct zone *zone)
static void setup_pageset(struct per_cpu_pageset *p, unsigned long batch)
{
struct per_cpu_pages *pcp;
+ int migratetype;

memset(p, 0, sizeof(*p));

@@ -2945,7 +2976,8 @@ static void setup_pageset(struct per_cpu_pageset *p, unsigned long batch)
pcp->count = 0;
pcp->high = 6 * batch;
pcp->batch = max(1UL, 1 * batch);
- INIT_LIST_HEAD(&pcp->list);
+ for (migratetype = 0; migratetype < MIGRATE_TYPES; migratetype++)
+ INIT_LIST_HEAD(&pcp->lists[migratetype]);
}

/*
--
1.5.6.5

2009-03-16 09:54:27

by Mel Gorman

[permalink] [raw]
Subject: [PATCH 30/35] Skip the PCP list search by counting the order and type of pages on list

The PCP lists are searched for free pages of the right size but, with
multiple orders now stored, a list may be searched even though it contains
no page of the required order. This patch records how many pages of each
order and migratetype are on the lists so it can be determined in advance
whether a list search will succeed.
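
A sketch of the bookkeeping, separate from the diff below; the array sizes
and names are invented stand-ins for the real counters:

/*
 * A counter array indexed by (migratetype, order) tells the allocator,
 * before walking a list, whether a page of the required order is
 * present at all.
 */
#include <stdio.h>
#include <string.h>

#define DEMO_PCPTYPES	3
#define DEMO_NR_ORDERS	4	/* stands in for PAGE_ALLOC_COSTLY_ORDER + 1 */

struct demo_pcp {
	unsigned int mocount[DEMO_PCPTYPES][DEMO_NR_ORDERS];
};

static void demo_account_add(struct demo_pcp *pcp, int mt, int order)
{
	pcp->mocount[mt][order] += 1 << order;	/* counted in base pages */
}

static int demo_search_would_succeed(struct demo_pcp *pcp, int mt, int order)
{
	return pcp->mocount[mt][order] != 0;
}

int main(void)
{
	struct demo_pcp pcp;

	memset(&pcp, 0, sizeof(pcp));
	demo_account_add(&pcp, 2, 1);	/* one order-1 movable page */
	printf("%d\n", demo_search_would_succeed(&pcp, 2, 1));	/* 1: walk the list */
	printf("%d\n", demo_search_would_succeed(&pcp, 2, 0));	/* 0: skip the walk */
	return 0;
}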

Signed-off-by: Mel Gorman <[email protected]>
---
include/linux/mmzone.h | 6 +++++-
mm/page_alloc.c | 46 +++++++++++++++++++++++++---------------------
2 files changed, 30 insertions(+), 22 deletions(-)

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index b4fba09..5be2386 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -165,7 +165,11 @@ static inline int is_unevictable_lru(enum lru_list l)
}

struct per_cpu_pages {
- int count; /* number of pages in the list */
+ /* The total number of pages on the PCP lists */
+ int count;
+
+ /* Count of each migratetype and order */
+ u8 mocount[MIGRATE_PCPTYPES][PAGE_ALLOC_COSTLY_ORDER+1];

/* Lists of pages, one per migrate type stored on the pcp-lists */
struct list_head lists[MIGRATE_PCPTYPES];
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 77e9970..bb5bd5e 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -513,8 +513,12 @@ static inline int free_pages_check(struct page *page)

static inline void rmv_pcp_page(struct per_cpu_pages *pcp, struct page *page)
{
+ int migratetype = page_private(page);
+ int basepage_count = 1 << page->index;
+
list_del(&page->lru);
- pcp->count -= 1 << page->index;
+ pcp->count -= basepage_count;
+ pcp->mocount[migratetype][page->index] -= basepage_count;
}

static inline void add_pcp_page(struct per_cpu_pages *pcp,
@@ -522,17 +526,22 @@ static inline void add_pcp_page(struct per_cpu_pages *pcp,
int cold)
{
int migratetype = page_private(page);
+ int basepage_count = 1 << page->index;
+
if (cold)
list_add_tail(&page->lru, &pcp->lists[migratetype]);
else
list_add(&page->lru, &pcp->lists[migratetype]);
- pcp->count += 1 << page->index;
+ pcp->count += basepage_count;
+ pcp->mocount[migratetype][page->index] += basepage_count;
}

static inline void bulk_add_pcp_page(struct per_cpu_pages *pcp,
- int order, int count)
+ int migratetype, int order, int count)
{
- pcp->count += count << order;
+ int basepage_count = count << order;
+ pcp->count += basepage_count;
+ pcp->mocount[migratetype][order] += basepage_count;
}

/*
@@ -1178,20 +1187,23 @@ again:
cpu = get_cpu();
if (likely(order <= PAGE_ALLOC_COSTLY_ORDER)) {
struct per_cpu_pages *pcp;
- int batch;
- int delta;
struct list_head *list;

pcp = &zone_pcp(zone, cpu)->pcp;
list = &pcp->lists[migratetype];
- batch = max(1, zone->pcp_batch >> order);
local_irq_save(flags);
- if (list_empty(list)) {
- delta = rmqueue_bulk(zone, order, batch,
+
+ /* Allocate more if no suitable page is in list */
+ if (!pcp->mocount[migratetype][order]) {
+ int batch = max(1, zone->pcp_batch >> order);
+ int delta = rmqueue_bulk(zone, order, batch,
list, migratetype);
- bulk_add_pcp_page(pcp, order, delta);
- if (unlikely(list_empty(list)))
+ bulk_add_pcp_page(pcp, migratetype, order, delta);
+ if (unlikely(!pcp->mocount[migratetype][order]))
goto failed;
+
+ page = list_entry(list->next, struct page, lru);
+ goto found;
}

/* Find a page of the appropriate order */
@@ -1205,16 +1217,7 @@ again:
break;
}

- /* Allocate more to the pcp list if necessary */
- if (unlikely(&page->lru == list)) {
- delta = rmqueue_bulk(zone, order, batch,
- list, migratetype);
- bulk_add_pcp_page(pcp, order, delta);
- page = list_entry(list->next, struct page, lru);
- if (!pcp_page_suit(page, order))
- goto failed;
- }
-
+found:
rmv_pcp_page(pcp, page);
} else {
LIST_HEAD(list);
@@ -2985,6 +2988,7 @@ static void setup_pageset(struct zone *zone,

pcp = &p->pcp;
pcp->count = 0;
+ memset(pcp->mocount, 0, sizeof(pcp->mocount));
zone->pcp_high = 6 * batch;
zone->pcp_batch = max(1UL, 1 * batch);
for (migratetype = 0; migratetype < MIGRATE_TYPES; migratetype++)
--
1.5.6.5

2009-03-16 09:55:35

by Mel Gorman

[permalink] [raw]
Subject: [PATCH 32/35] Inline next_zones_zonelist() of the zonelist scan in the fastpath

The zonelist walkers call next_zones_zonelist() to find the next zone
allowed by the nodemask. It is not inlined because the number of call sites
would bloat the text, but calling a function is not free either. This patch
inlines next_zones_zonelist() only for the page allocator fastpath. All
other zonelist walkers use an uninlined version.

Signed-off-by: Mel Gorman <[email protected]>
---
include/linux/mmzone.h | 6 ++++++
mm/mmzone.c | 31 -------------------------------
mm/page_alloc.c | 40 +++++++++++++++++++++++++++++++++++++++-
3 files changed, 45 insertions(+), 32 deletions(-)

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 5be2386..9057bc1 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -895,6 +895,12 @@ static inline struct zoneref *first_zones_zonelist(struct zonelist *zonelist,
zone; \
z = next_zones_zonelist(++z, highidx, nodemask, &zone)) \

+/* Only available to the page allocator fast-path */
+#define fast_foreach_zone_zonelist_nodemask(zone, z, zlist, highidx, nodemask) \
+ for (z = first_zones_zonelist(zlist, highidx, nodemask, &zone); \
+ zone; \
+ z = __next_zones_zonelist(++z, highidx, nodemask, &zone)) \
+
/**
* for_each_zone_zonelist - helper macro to iterate over valid zones in a zonelist at or below a given zone index
* @zone - The current zone in the iterator
diff --git a/mm/mmzone.c b/mm/mmzone.c
index 16ce8b9..347951c 100644
--- a/mm/mmzone.c
+++ b/mm/mmzone.c
@@ -41,34 +41,3 @@ struct zone *next_zone(struct zone *zone)
}
return zone;
}
-
-static inline int zref_in_nodemask(struct zoneref *zref, nodemask_t *nodes)
-{
-#ifdef CONFIG_NUMA
- return node_isset(zonelist_node_idx(zref), *nodes);
-#else
- return 1;
-#endif /* CONFIG_NUMA */
-}
-
-/* Returns the next zone at or below highest_zoneidx in a zonelist */
-struct zoneref *next_zones_zonelist(struct zoneref *z,
- enum zone_type highest_zoneidx,
- nodemask_t *nodes,
- struct zone **zone)
-{
- /*
- * Find the next suitable zone to use for the allocation.
- * Only filter based on nodemask if it's set
- */
- if (likely(nodes == NULL))
- while (zonelist_zone_idx(z) > highest_zoneidx)
- z++;
- else
- while (zonelist_zone_idx(z) > highest_zoneidx ||
- (z->zone && !zref_in_nodemask(z, nodes)))
- z++;
-
- *zone = zonelist_zone(z);
- return z;
-}
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 8568284..33f39cf 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1514,6 +1514,44 @@ static void zlc_mark_zone_full(struct zonelist *zonelist, struct zoneref *z)
}
#endif /* CONFIG_NUMA */

+static inline int
+zref_in_nodemask(struct zoneref *zref, nodemask_t *nodes)
+{
+#ifdef CONFIG_NUMA
+ return node_isset(zonelist_node_idx(zref), *nodes);
+#else
+ return 1;
+#endif /* CONFIG_NUMA */
+}
+
+/* Returns the next zone at or below highest_zoneidx in a zonelist */
+static inline struct zoneref *
+__next_zones_zonelist(struct zoneref *z, enum zone_type highest_zoneidx,
+ nodemask_t *nodes, struct zone **zone)
+{
+ /*
+ * Find the next suitable zone to use for the allocation.
+ * Only filter based on nodemask if it's set
+ */
+ if (likely(nodes == NULL))
+ while (zonelist_zone_idx(z) > highest_zoneidx)
+ z++;
+ else
+ while (zonelist_zone_idx(z) > highest_zoneidx ||
+ (z->zone && !zref_in_nodemask(z, nodes)))
+ z++;
+
+ *zone = zonelist_zone(z);
+ return z;
+}
+
+struct zoneref *
+next_zones_zonelist(struct zoneref *z, enum zone_type highest_zoneidx,
+ nodemask_t *nodes, struct zone **zone)
+{
+ return __next_zones_zonelist(z, highest_zoneidx, nodes, zone);
+}
+
/*
* get_page_from_freelist goes through the zonelist trying to allocate
* a page.
@@ -1546,7 +1584,7 @@ zonelist_scan:
* Scan zonelist, looking for a zone with enough free.
* See also cpuset_zone_allowed() comment in kernel/cpuset.c.
*/
- for_each_zone_zonelist_nodemask(zone, z, zonelist,
+ fast_foreach_zone_zonelist_nodemask(zone, z, zonelist,
high_zoneidx, nodemask) {

/* Ignore the additional zonelist filter checks if possible */
--
1.5.6.5

2009-03-16 09:55:12

by Mel Gorman

[permalink] [raw]
Subject: [PATCH 31/35] Optimistically check the first page on the PCP free list is suitable

The PCP lists are searched for a page of a suitable order. However, the
majority of pages are still expected to be order-0 pages and the setup
for the search is a bit expensive. This patch optimistically checks
whether the first page on the list is suitable before starting the
search in the hot-page allocation path.

Signed-off-by: Mel Gorman <[email protected]>
---
mm/page_alloc.c | 6 ++++++
1 files changed, 6 insertions(+), 0 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index bb5bd5e..8568284 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1212,6 +1212,12 @@ again:
if (pcp_page_suit(page, order))
break;
} else {
+ /* Optimistic before we start a list search */
+ page = list_entry(list->next, struct page, lru);
+ if (pcp_page_suit(page, order))
+ goto found;
+
+ /* Do the search */
list_for_each_entry(page, list, lru)
if (pcp_page_suit(page, order))
break;
--
1.5.6.5

2009-03-16 09:54:47

by Mel Gorman

[permalink] [raw]
Subject: [PATCH 29/35] Do not store the PCP high and batch watermarks in the per-cpu structure

Currently, there are high and batch counters in the per-cpu structure.
This might have made sense when there was hot and cold per-cpu
structures but that is no longer the case. In practice, all the per-cpu
structures for a zone contain the same values and they are read-mostly.
This patch stores them in the zone with the watermarks which are also
read-mostly.

Signed-off-by: Mel Gorman <[email protected]>
---
include/linux/mmzone.h | 8 ++++++--
mm/page_alloc.c | 43 +++++++++++++++++++++++--------------------
mm/vmstat.c | 4 ++--
3 files changed, 31 insertions(+), 24 deletions(-)

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index eed6867..b4fba09 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -166,8 +166,6 @@ static inline int is_unevictable_lru(enum lru_list l)

struct per_cpu_pages {
int count; /* number of pages in the list */
- int high; /* high watermark, emptying needed */
- int batch; /* chunk size for buddy add/remove */

/* Lists of pages, one per migrate type stored on the pcp-lists */
struct list_head lists[MIGRATE_PCPTYPES];
@@ -285,6 +283,12 @@ struct zone {
unsigned long pages_mark[3];
};

+ /* high watermark for per-cpu lists, emptying needed */
+ u16 pcp_high;
+
+ /* chunk size for buddy add/remove to per-cpu lists*/
+ u16 pcp_batch;
+
/*
* We don't know if the memory that we're going to allocate will be freeable
* or/and it will be released eventually, so to avoid totally wasting several
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index edadab1..77e9970 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -975,8 +975,8 @@ void drain_zone_pages(struct zone *zone, struct per_cpu_pages *pcp)
int to_drain;

local_irq_save(flags);
- if (pcp->count >= pcp->batch)
- to_drain = pcp->batch;
+ if (pcp->count >= zone->pcp_batch)
+ to_drain = zone->pcp_batch;
else
to_drain = pcp->count;
free_pcppages_bulk(zone, to_drain, pcp, 0);
@@ -1119,8 +1119,8 @@ static void free_hot_cold_page(struct page *page, int order, int cold)
set_page_private(page, migratetype);
page->index = order;
add_pcp_page(pcp, page, cold);
- if (pcp->count >= pcp->high)
- free_pcppages_bulk(zone, pcp->batch, pcp, migratetype);
+ if (pcp->count >= zone->pcp_high)
+ free_pcppages_bulk(zone, zone->pcp_batch, pcp, migratetype);

out:
local_irq_restore(flags);
@@ -1184,7 +1184,7 @@ again:

pcp = &zone_pcp(zone, cpu)->pcp;
list = &pcp->lists[migratetype];
- batch = max(1, pcp->batch >> order);
+ batch = max(1, zone->pcp_batch >> order);
local_irq_save(flags);
if (list_empty(list)) {
delta = rmqueue_bulk(zone, order, batch,
@@ -2144,8 +2144,8 @@ void show_free_areas(void)
pageset = zone_pcp(zone, cpu);

printk("CPU %4d: hi:%5d, btch:%4d usd:%4d\n",
- cpu, pageset->pcp.high,
- pageset->pcp.batch, pageset->pcp.count);
+ cpu, zone->pcp_high,
+ zone->pcp_batch, pageset->pcp.count);
}
}

@@ -2975,7 +2975,8 @@ static int zone_batchsize(struct zone *zone)
return batch;
}

-static void setup_pageset(struct per_cpu_pageset *p, unsigned long batch)
+static void setup_pageset(struct zone *zone,
+ struct per_cpu_pageset *p, unsigned long batch)
{
struct per_cpu_pages *pcp;
int migratetype;
@@ -2984,8 +2985,8 @@ static void setup_pageset(struct per_cpu_pageset *p, unsigned long batch)

pcp = &p->pcp;
pcp->count = 0;
- pcp->high = 6 * batch;
- pcp->batch = max(1UL, 1 * batch);
+ zone->pcp_high = 6 * batch;
+ zone->pcp_batch = max(1UL, 1 * batch);
for (migratetype = 0; migratetype < MIGRATE_TYPES; migratetype++)
INIT_LIST_HEAD(&pcp->lists[migratetype]);
}
@@ -2995,16 +2996,17 @@ static void setup_pageset(struct per_cpu_pageset *p, unsigned long batch)
* to the value high for the pageset p.
*/

-static void setup_pagelist_highmark(struct per_cpu_pageset *p,
+static void setup_pagelist_highmark(struct zone *zone,
+ struct per_cpu_pageset *p,
unsigned long high)
{
struct per_cpu_pages *pcp;

pcp = &p->pcp;
- pcp->high = high;
- pcp->batch = max(1UL, high/4);
+ zone->pcp_high = high;
+ zone->pcp_batch = max(1UL, high/4);
if ((high/4) > (PAGE_SHIFT * 8))
- pcp->batch = PAGE_SHIFT * 8;
+ zone->pcp_batch = PAGE_SHIFT * 8;
}


@@ -3049,10 +3051,10 @@ static int __cpuinit process_zones(int cpu)
if (!zone_pcp(zone, cpu))
goto bad;

- setup_pageset(zone_pcp(zone, cpu), zone_batchsize(zone));
+ setup_pageset(zone, zone_pcp(zone, cpu), zone_batchsize(zone));

if (percpu_pagelist_fraction)
- setup_pagelist_highmark(zone_pcp(zone, cpu),
+ setup_pagelist_highmark(zone, zone_pcp(zone, cpu),
(zone->present_pages / percpu_pagelist_fraction));
}

@@ -3178,9 +3180,9 @@ static __meminit void zone_pcp_init(struct zone *zone)
#ifdef CONFIG_NUMA
/* Early boot. Slab allocator not functional yet */
zone_pcp(zone, cpu) = &boot_pageset[cpu];
- setup_pageset(&boot_pageset[cpu],0);
+ setup_pageset(zone, &boot_pageset[cpu], 0);
#else
- setup_pageset(zone_pcp(zone,cpu), batch);
+ setup_pageset(zone, zone_pcp(zone, cpu), batch);
#endif
}
if (zone->present_pages)
@@ -4771,7 +4773,7 @@ int lowmem_reserve_ratio_sysctl_handler(ctl_table *table, int write,
}

/*
- * percpu_pagelist_fraction - changes the pcp->high for each zone on each
+ * percpu_pagelist_fraction - changes the zone->pcp_high for each zone on each
* cpu. It is the fraction of total pages in each zone that a hot per cpu pagelist
* can have before it gets flushed back to buddy allocator.
*/
@@ -4790,7 +4792,8 @@ int percpu_pagelist_fraction_sysctl_handler(ctl_table *table, int write,
for_each_online_cpu(cpu) {
unsigned long high;
high = zone->present_pages / percpu_pagelist_fraction;
- setup_pagelist_highmark(zone_pcp(zone, cpu), high);
+ setup_pagelist_highmark(zone, zone_pcp(zone, cpu),
+ high);
}
}
return 0;
diff --git a/mm/vmstat.c b/mm/vmstat.c
index 9114974..3be59b1 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -766,8 +766,8 @@ static void zoneinfo_show_print(struct seq_file *m, pg_data_t *pgdat,
"\n batch: %i",
i,
pageset->pcp.count,
- pageset->pcp.high,
- pageset->pcp.batch);
+ zone->pcp_high,
+ zone->pcp_batch);
#ifdef CONFIG_SMP
seq_printf(m, "\n vm stats threshold: %d",
pageset->stat_threshold);
--
1.5.6.5

2009-03-16 09:55:54

by Mel Gorman

[permalink] [raw]
Subject: [PATCH 33/35] Do not merge buddies until they are needed by a high-order allocation or anti-fragmentation

Freeing and allocating pages from the buddy lists can incur a number of
cache misses as the struct pages are written to. This patch only merges
buddies up to PAGE_ALLOC_MERGE_ORDER. High-order allocations are then
required to do the remaining merging. This punishes high-order allocations
somewhat but they are expected to be relatively rare and should be
avoided in general.

Signed-off-by: Mel Gorman <[email protected]>
---
include/linux/mmzone.h | 7 ++++
mm/page_alloc.c | 91 +++++++++++++++++++++++++++++++++++++++++------
2 files changed, 86 insertions(+), 12 deletions(-)

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 9057bc1..8027163 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -35,6 +35,13 @@
*/
#define PAGE_ALLOC_COSTLY_ORDER 3

+/*
+ * PAGE_ALLOC_MERGE_ORDER is the order at which pages get merged together
+ * but not merged further unless explicitly needed by a high-order allocation.
+ * The value is chosen so that buddies merge to larger than the PCP batch refill size
+ */
+#define PAGE_ALLOC_MERGE_ORDER 5
+
#define MIGRATE_UNMOVABLE 0
#define MIGRATE_RECLAIMABLE 1
#define MIGRATE_MOVABLE 2
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 33f39cf..f1741a3 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -456,25 +456,18 @@ static inline int page_is_buddy(struct page *page, struct page *buddy,
* -- wli
*/

-static inline void __free_one_page(struct page *page,
- struct zone *zone, unsigned int order,
- int migratetype)
+static inline struct page *__merge_one_page(struct page *page,
+ struct zone *zone, unsigned int order, unsigned int maxorder)
{
unsigned long page_idx;

- if (unlikely(PageCompound(page)))
- if (unlikely(destroy_compound_page(page, order)))
- return;
-
- VM_BUG_ON(migratetype == -1);
-
page_idx = page_to_pfn(page) & ((1 << MAX_ORDER) - 1);
page->index = 0;

VM_BUG_ON(page_idx & ((1 << order) - 1));
VM_BUG_ON(bad_range(zone, page));

- while (order < MAX_ORDER-1) {
+ while (order < maxorder) {
unsigned long combined_idx;
struct page *buddy;

@@ -491,10 +484,77 @@ static inline void __free_one_page(struct page *page,
page_idx = combined_idx;
order++;
}
+
set_page_order(page, order);
+ return page;
+}
+
+/* Merge free pages up to MAX_ORDER-1 */
+static noinline void __merge_highorder_pages(struct zone *zone)
+{
+ struct page *page, *buddy;
+ struct free_area *area;
+ int migratetype;
+ unsigned int order;
+
+ for_each_migratetype_order(order, migratetype) {
+ struct list_head *list;
+ unsigned long page_idx;
+
+ if (order == MAX_ORDER-1)
+ break;
+
+ area = &(zone->free_area[order]);
+ list = &area->free_list[migratetype];
+
+pagemerged:
+ if (list_empty(list))
+ continue;
+ /*
+ * Each time we merge, we jump back here as even the _safe
+ * variants of list_for_each() cannot cope with the cursor
+ * page disappearing
+ */
+ list_for_each_entry(page, list, lru) {
+
+ page_idx = page_to_pfn(page) & ((1 << MAX_ORDER) - 1);
+ buddy = __page_find_buddy(page, page_idx, order);
+ if (!page_is_buddy(page, buddy, order))
+ continue;
+
+ /* Ok, remove the page, merge and re-add */
+ list_del(&page->lru);
+ rmv_page_order(page);
+ area->nr_free--;
+ page = __merge_one_page(page, zone,
+ order, MAX_ORDER-1);
+ list_add(&page->lru,
+ &zone->free_area[page_order(page)].free_list[migratetype]);
+ zone->free_area[page_order(page)].nr_free++;
+ goto pagemerged;
+ }
+ }
+}
+
+static inline void __free_one_page(struct page *page,
+ struct zone *zone, unsigned int order,
+ int migratetype)
+{
+ if (unlikely(PageCompound(page)))
+ if (unlikely(destroy_compound_page(page, order)))
+ return;
+
+ VM_BUG_ON(migratetype == -1);
+
+ /*
+ * We only lazily merge up to PAGE_ALLOC_MERGE_ORDER to avoid
+ * cache line bounces merging buddies. High order allocations
+ * take the hit of merging the buddies further
+ */
+ page = __merge_one_page(page, zone, order, PAGE_ALLOC_MERGE_ORDER);
list_add(&page->lru,
- &zone->free_area[order].free_list[migratetype]);
- zone->free_area[order].nr_free++;
+ &zone->free_area[page_order(page)].free_list[migratetype]);
+ zone->free_area[page_order(page)].nr_free++;
}

static inline int free_pages_check(struct page *page)
@@ -849,6 +909,9 @@ __rmqueue_fallback(struct zone *zone, int order, int start_migratetype)
struct page *page;
int migratetype, i;

+ /* Merge the buddies before stealing */
+ __merge_highorder_pages(zone);
+
/* Find the largest possible block of pages in the other list */
for (current_order = MAX_ORDER-1; current_order >= order;
--current_order) {
@@ -1608,6 +1671,10 @@ zonelist_scan:
}
}

+ /* Lazy merge buddies for high orders */
+ if (order > PAGE_ALLOC_MERGE_ORDER)
+ __merge_highorder_pages(zone);
+
page = buffered_rmqueue(preferred_zone, zone, order,
gfp_mask, migratetype, cold);
if (page)
--
1.5.6.5

2009-03-16 09:56:18

by Mel Gorman

[permalink] [raw]
Subject: [PATCH 34/35] Allow compound pages to be stored on the PCP lists

The SLUB allocator frees and allocates compound pages. The setup costs
for compound pages are noticeable in profiles and incur cache misses as
every struct page has to be checked and written. This patch allows
compound pages to be stored on the PCP list to save on teardown and
setup time.

Signed-off-by: Mel Gorman <[email protected]>
---
include/linux/page-flags.h | 4 ++-
mm/page_alloc.c | 50 +++++++++++++++++++++++++++++---------------
2 files changed, 36 insertions(+), 18 deletions(-)

diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
index 219a523..4177ec1 100644
--- a/include/linux/page-flags.h
+++ b/include/linux/page-flags.h
@@ -388,7 +388,9 @@ static inline void __ClearPageTail(struct page *page)
* Pages being prepped should not have any flags set. If they are set,
* there has been a kernel bug or struct page corruption.
*/
-#define PAGE_FLAGS_CHECK_AT_PREP ((1 << NR_PAGEFLAGS) - 1)
+#define PAGE_FLAGS_CHECK_AT_PREP_BUDDY ((1 << NR_PAGEFLAGS) - 1)
+#define PAGE_FLAGS_CHECK_AT_PREP (((1 << NR_PAGEFLAGS) - 1) & \
+ ~(1 << PG_head | 1 << PG_tail))

#endif /* !__GENERATING_BOUNDS_H */
#endif /* PAGE_FLAGS_H */
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index f1741a3..1ac4c3d 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -281,11 +281,7 @@ out:
* put_page() function. Its ->lru.prev holds the order of allocation.
* This usage means that zero-order pages may not be compound.
*/
-
-static void free_compound_page(struct page *page)
-{
- __free_pages_ok(page, compound_order(page));
-}
+static void free_compound_page(struct page *page);

void prep_compound_page(struct page *page, unsigned long order)
{
@@ -557,7 +553,9 @@ static inline void __free_one_page(struct page *page,
zone->free_area[page_order(page)].nr_free++;
}

-static inline int free_pages_check(struct page *page)
+/* Sanity check a free pages flags */
+static inline int check_freepage_flags(struct page *page,
+ unsigned long prepflags)
{
if (unlikely(page_mapcount(page) |
(page->mapping != NULL) |
@@ -566,8 +564,8 @@ static inline int free_pages_check(struct page *page)
bad_page(page);
return 1;
}
- if (page->flags & PAGE_FLAGS_CHECK_AT_PREP)
- page->flags &= ~PAGE_FLAGS_CHECK_AT_PREP;
+ if (page->flags & prepflags)
+ page->flags &= ~prepflags;
return 0;
}

@@ -678,7 +676,8 @@ static void __free_pages_ok(struct page *page, unsigned int order)
int clearMlocked = PageMlocked(page);

for (i = 0 ; i < (1 << order) ; ++i)
- bad += free_pages_check(page + i);
+ bad += check_freepage_flags(page + i,
+ PAGE_FLAGS_CHECK_AT_PREP_BUDDY);
if (bad)
return;

@@ -782,8 +781,20 @@ static int prep_new_page(struct page *page, int order, gfp_t gfp_flags)
if (gfp_flags & __GFP_ZERO)
prep_zero_page(page, order, gfp_flags);

- if (order && (gfp_flags & __GFP_COMP))
- prep_compound_page(page, order);
+ /*
+ * If a compound page is requested, we have to check the page being
+ * prepped. If it's already compound, we leave it alone. If a
+ * compound page is not requested but the page being prepped is
+ * compound, then it must be destroyed
+ */
+ if (order) {
+ if ((gfp_flags & __GFP_COMP) && !PageCompound(page))
+ prep_compound_page(page, order);
+
+ if (!(gfp_flags & __GFP_COMP) && PageCompound(page))
+ if (unlikely(destroy_compound_page(page, order)))
+ return 1;
+ }

return 0;
}
@@ -1148,14 +1159,9 @@ static void free_hot_cold_page(struct page *page, int order, int cold)
int migratetype;
int clearMlocked = PageMlocked(page);

- /* SLUB can return lowish-order compound pages that need handling */
- if (order > 0 && unlikely(PageCompound(page)))
- if (unlikely(destroy_compound_page(page, order)))
- return;
-
if (PageAnon(page))
page->mapping = NULL;
- if (free_pages_check(page))
+ if (check_freepage_flags(page, PAGE_FLAGS_CHECK_AT_PREP))
return;

if (!PageHighMem(page)) {
@@ -1199,6 +1205,16 @@ out:
put_cpu();
}

+static void free_compound_page(struct page *page)
+{
+ unsigned int order = compound_order(page);
+
+ if (order <= PAGE_ALLOC_COSTLY_ORDER)
+ free_hot_cold_page(page, order, 0);
+ else
+ __free_pages_ok(page, order);
+}
+
void free_hot_page(struct page *page)
{
free_hot_cold_page(page, 0, 0);
--
1.5.6.5

2009-03-16 09:56:37

by Mel Gorman

[permalink] [raw]
Subject: [PATCH 35/35] Allow up to 4MB PCP lists due to compound pages

Compound pages from SLUB on the free lists can occupy a fair percentage of
the 512K that is currently allowed on the PCP lists. This can push out cache
hot order-0 pages even though the compound page may be relatively sparsely
used in the short term. This patch changes pcp->count to count pages (1
per page regardless of order) instead of accounting for the number of base
pages on the list. This keeps cache hot pages on the list at the cost of
the PCP lists being up to 4MB in size instead of 512K.
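
(The factor of eight follows directly from PAGE_ALLOC_COSTLY_ORDER: each
list entry may now be a compound page of up to order 3, i.e. 2^3 = 8 base
pages, so the same pcp->count limit covers 8 * 512K = 4MB in the worst
case.)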

Signed-off-by: Mel Gorman <[email protected]>
---
mm/page_alloc.c | 23 +++++++++--------------
1 files changed, 9 insertions(+), 14 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 1ac4c3d..d5161cf 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -572,11 +572,10 @@ static inline int check_freepage_flags(struct page *page,
static inline void rmv_pcp_page(struct per_cpu_pages *pcp, struct page *page)
{
int migratetype = page_private(page);
- int basepage_count = 1 << page->index;

list_del(&page->lru);
- pcp->count -= basepage_count;
- pcp->mocount[migratetype][page->index] -= basepage_count;
+ pcp->count--;
+ pcp->mocount[migratetype][page->index]--;
}

static inline void add_pcp_page(struct per_cpu_pages *pcp,
@@ -584,22 +583,20 @@ static inline void add_pcp_page(struct per_cpu_pages *pcp,
int cold)
{
int migratetype = page_private(page);
- int basepage_count = 1 << page->index;

if (cold)
list_add_tail(&page->lru, &pcp->lists[migratetype]);
else
list_add(&page->lru, &pcp->lists[migratetype]);
- pcp->count += basepage_count;
- pcp->mocount[migratetype][page->index] += basepage_count;
+ pcp->count++;
+ pcp->mocount[migratetype][page->index]++;
}

static inline void bulk_add_pcp_page(struct per_cpu_pages *pcp,
int migratetype, int order, int count)
{
- int basepage_count = count << order;
- pcp->count += basepage_count;
- pcp->mocount[migratetype][order] += basepage_count;
+ pcp->count += count;
+ pcp->mocount[migratetype][order] += count;
}

/*
@@ -627,9 +624,8 @@ static void free_pcppages_bulk(struct zone *zone, int count,

list = &pcp->lists[migratetype];
bulkcount = 1 + (count / (MIGRATE_PCPTYPES * 2));
- while (freed < count) {
+ while (count--) {
struct page *page;
- int thisfreed;

/*
* Move to another migratetype if this list is depleted or
@@ -645,9 +641,8 @@ static void free_pcppages_bulk(struct zone *zone, int count,
/* Remove from list and update counters */
page = list_entry(list->prev, struct page, lru);
rmv_pcp_page(pcp, page);
- thisfreed = 1 << page->index;
- freed += thisfreed;
- bulkcount -= thisfreed;
+ freed += 1 << page->index;
+ bulkcount--;

__free_one_page(page, zone, page->index, migratetype);
}
--
1.5.6.5

2009-03-16 10:42:41

by Nick Piggin

[permalink] [raw]
Subject: Re: [PATCH 00/35] Cleanup and optimise the page allocator V3

On Mon, Mar 16, 2009 at 09:45:55AM +0000, Mel Gorman wrote:
> Here is V3 of an attempt to cleanup and optimise the page allocator and should
> be ready for general testing. The page allocator is now faster (16%
> reduced time overall for kernbench on one machine) and it has a smaller cache
> footprint (16.5% less L1 cache misses and 19.5% less L2 cache misses for
> kernbench on one machine). The text footprint has unfortunately increased,
> largely due to the introduction of a form of lazy buddy merging mechanism
> that avoids cache misses by postponing buddy merging until a high-order
> allocation needs it.

You!? You want to do lazy buddy? ;) That's wonderful, but it would
significantly increase the fragmentation problem, wouldn't it?
(although pcp lists are conceptually a form of lazy buddy already)

No objections from me of course, if it is making significant
speedups. I assume you mean overall time on kernbench is overall sys
time?

2009-03-16 11:19:26

by Mel Gorman

[permalink] [raw]
Subject: Re: [PATCH 00/35] Cleanup and optimise the page allocator V3

On Mon, Mar 16, 2009 at 11:40:54AM +0100, Nick Piggin wrote:
> On Mon, Mar 16, 2009 at 09:45:55AM +0000, Mel Gorman wrote:
> > Here is V3 of an attempt to cleanup and optimise the page allocator and should
> > be ready for general testing. The page allocator is now faster (16%
> > reduced time overall for kernbench on one machine) and it has a smaller cache
> > footprint (16.5% less L1 cache misses and 19.5% less L2 cache misses for
> > kernbench on one machine). The text footprint has unfortunately increased,
> > largely due to the introduction of a form of lazy buddy merging mechanism
> > that avoids cache misses by postponing buddy merging until a high-order
> > allocation needs it.
>
> You!? You want to do lazy buddy? ;)

I'm either a man of surprises, an idiot or just plain inconsistent :)

> That's wonderful, but it would
> significantly increase the fragmentation problem, wouldn't it?

Not necessarily, anti-fragmentation groups movable pages within a
hugepage-aligned block and high-order allocations will trigger a merge of
buddies from PAGE_ALLOC_MERGE_ORDER (defined in the relevant patch) up to
MAX_ORDER-1. Critically, a merge is also triggered when anti-fragmentation
wants to fall back to another migratetype to satisfy an allocation. As
long as the grouping works, it doesn't matter that buddies were only merged
up to PAGE_ALLOC_MERGE_ORDER as a full merge will still free up hugepages.
So two slow paths are made slower but the fast path should be faster and it
should cause fewer cache line bounces due to writes to struct page.
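
To make those two trigger points concrete, they boil down to the two calls
below (copied in simplified form from the lazy-merge patch, not new code):

    /* In the zonelist scan, before calling buffered_rmqueue() */
    if (order > PAGE_ALLOC_MERGE_ORDER)
            __merge_highorder_pages(zone);

    /* In __rmqueue_fallback(), before stealing from another migratetype */
    __merge_highorder_pages(zone);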

The success rate of high-order allocations should be roughly the same but
they will be slower, particularly as I remerge more often than required. This
slowdown is undesirable but the assumptions are that either a) it's the
static hugepage pool being resized in which case the delay is irrelevant or
b) the pool is being dynamically resized but the expected lifetime of the
page far exceeds the allocation cost of merging.

I fully agree with you that it's more important that order-0 allocations
are faster than order-9 allocations but I'm not totally off high-order
allocs either. You'll see the patchset allows higher-order pages (up to
PAGE_ALLOC_COSTLY_ORDER) onto the PCP lists for order-1 allocations used
by sig handlers, stacks and the like (important for fork-heavy loads I am
guessing) and because SLUB uses high-order allocations that are currently
bypassing the PCP lists altogether. I haven't measured it but SLUB-heavy
workloads must be contending on the zone->lock to some extent.

When I last checked (about 10 days ago), I hadn't damaged anti-fragmentation
but that was a lot of revisions ago. I'm redoing the tests to make sure
anti-fragmentation is still ok.

> (although pcp lists are conceptually a form of lazy buddy already)
>

Indeed.

> No objections from me of course, if it is making significant
> speedups. I assume you mean overall time on kernbench is overall sys
> time?

Both, I should be clearer. The number of oprofile samples measured in the
page allocator is reduced by a large amount but that does not always
translate into overall speedups, although it did for 8 out of 9 machines I
tested. On most machines the overall "System Time" for kernbench is reduced,
but on 2 out of 9 test machines the elapsed time increases due to some other
caching weirdness or due to a change in timing with respect to locking.

Pinning down the exact problem is tricky as profiling sometimes reverses
the performance effects, i.e. without profiling I'll see a slowdown and with
profiling I see significant speedups, so I can't measure what is going on.

--
Mel Gorman
Part-time Phd Student Linux Technology Center
University of Limerick IBM Dublin Software Lab

2009-03-16 11:34:19

by Nick Piggin

[permalink] [raw]
Subject: Re: [PATCH 00/35] Cleanup and optimise the page allocator V3

On Mon, Mar 16, 2009 at 11:19:06AM +0000, Mel Gorman wrote:
> On Mon, Mar 16, 2009 at 11:40:54AM +0100, Nick Piggin wrote:
> > That's wonderful, but it would
> > significantly increase the fragmentation problem, wouldn't it?
>
> Not necessarily, anti-fragmentation groups movable pages within a
> hugepage-aligned block and high-order allocations will trigger a merge of
> buddies from PAGE_ALLOC_MERGE_ORDER (defined in the relevant patch) up to
> MAX_ORDER-1. Critically, a merge is also triggered when anti-fragmentation
> wants to fallback to another migratetype to satisfy an allocation. As
> long as the grouping works, it doesn't matter if they were only merged up
> to PAGE_ALLOC_MERGE_ORDER as a full merge will still free up hugepages.
> So two slow paths are made slower but the fast path should be faster and it
> should be causing fewer cache line bounces due to writes to struct page.

Oh, but the anti-fragmentation stuff is orthogonal to this. Movable
groups should always be defragmentable (at some cost)... the bane of
anti-frag is fragmentation of the non-movable groups.

And one reason why buddy is so good at avoiding fragmentation is
because it will pick up _any_ pages that go past the allocator if
they have any free buddies. And it hands out ones that don't have
free buddies. So in that way it is naturally continually filtering
out pages which can be merged.

Whereas if you defer this until the point you need a higher order
page, the only thing you have to work with are the pages that are
free *right now*.

It will almost definitely increase fragmentation of non-movable zones,
and if you have a workload doing non-trivial, non-movable higher-order
allocations that are likely to cause fragmentation, it will result in
these allocations eating into movable groups sooner, I think.


> When I last checked (about 10 days) ago, I hadn't damaged anti-fragmentation
> but that was a lot of revisions ago. I'm redoing the tests to make sure
> anti-fragmentation is still ok.

Your anti-frag tests probably don't stress this long term fragmentation
problem.

Still, it's significant enough that I think it should be made
optional (and arguably default to on) even if it does harm higher
order allocations a bit.

2009-03-16 11:42:54

by Nick Piggin

[permalink] [raw]
Subject: Re: [PATCH 20/35] Use a pre-calculated value for num_online_nodes()

On Monday 16 March 2009 20:46:15 Mel Gorman wrote:
> num_online_nodes() is called by the page allocator to decide whether the
> zonelist needs to be filtered based on cpusets or the zonelist cache.
> This is actually a heavy function and touches a number of cache lines.
> This patch stores the number of online nodes at boot time and when
> nodes get onlined and offlined.
>
> Signed-off-by: Mel Gorman <[email protected]>
> ---
> include/linux/nodemask.h | 16 ++++++++++++++--
> mm/page_alloc.c | 6 ++++--
> 2 files changed, 18 insertions(+), 4 deletions(-)
>
> diff --git a/include/linux/nodemask.h b/include/linux/nodemask.h
> index 848025c..4749e30 100644
> --- a/include/linux/nodemask.h
> +++ b/include/linux/nodemask.h
> @@ -449,13 +449,25 @@ static inline int num_node_state(enum node_states
> state) node; \
> })
>
> +/* Recorded value for num_online_nodes() */
> +extern int static_num_online_nodes;

__read_mostly, please. Check this for any other place you've added
global cachelines that are referenced by the allocator.

2009-03-16 11:46:18

by Nick Piggin

[permalink] [raw]
Subject: Re: [PATCH 00/35] Cleanup and optimise the page allocator V3

On Mon, Mar 16, 2009 at 09:45:55AM +0000, Mel Gorman wrote:
> Here is V3 of an attempt to cleanup and optimise the page allocator and should
> be ready for general testing. The page allocator is now faster (16%
> reduced time overall for kernbench on one machine) and it has a smaller cache
> footprint (16.5% less L1 cache misses and 19.5% less L2 cache misses for
> kernbench on one machine). The text footprint has unfortunately increased,
> largely due to the introduction of a form of lazy buddy merging mechanism
> that avoids cache misses by postponing buddy merging until a high-order
> allocation needs it.

BTW. I would feel better about this if it gets merged in stages, with
functional changes split out, and also code optimisations and more
obvious performance improvements split out and preferably merged first.

At a very quick glance, the first 25 or so patches should go in first,
and that gives a much better base to compare subsequent functional
changes with. Patch 18 for example is really significant, and should
almost be 2.6.29/-stable material IMO.

2009-03-16 11:46:57

by Nick Piggin

[permalink] [raw]
Subject: Re: [PATCH 20/35] Use a pre-calculated value for num_online_nodes()

On Mon, Mar 16, 2009 at 10:42:34PM +1100, Nick Piggin wrote:
> On Monday 16 March 2009 20:46:15 Mel Gorman wrote:
> > num_online_nodes() is called by the page allocator to decide whether the
> > zonelist needs to be filtered based on cpusets or the zonelist cache.
> > This is actually a heavy function and touches a number of cache lines.
> > This patch stores the number of online nodes at boot time and when
> > nodes get onlined and offlined.
> >
> > Signed-off-by: Mel Gorman <[email protected]>
> > ---
> > include/linux/nodemask.h | 16 ++++++++++++++--
> > mm/page_alloc.c | 6 ++++--
> > 2 files changed, 18 insertions(+), 4 deletions(-)
> >
> > diff --git a/include/linux/nodemask.h b/include/linux/nodemask.h
> > index 848025c..4749e30 100644
> > --- a/include/linux/nodemask.h
> > +++ b/include/linux/nodemask.h
> > @@ -449,13 +449,25 @@ static inline int num_node_state(enum node_states
> > state) node; \
> > })
> >
> > +/* Recorded value for num_online_nodes() */
> > +extern int static_num_online_nodes;
>
> __read_mostly, please. Check this for any other place you've added
> global cachelines that are referenced by the allocator.

OK I'm blind, sorry ignore that :P

2009-03-16 12:02:30

by Mel Gorman

[permalink] [raw]
Subject: Re: [PATCH 00/35] Cleanup and optimise the page allocator V3

On Mon, Mar 16, 2009 at 12:33:58PM +0100, Nick Piggin wrote:
> On Mon, Mar 16, 2009 at 11:19:06AM +0000, Mel Gorman wrote:
> > On Mon, Mar 16, 2009 at 11:40:54AM +0100, Nick Piggin wrote:
> > > That's wonderful, but it would
> > > significantly increase the fragmentation problem, wouldn't it?
> >
> > Not necessarily, anti-fragmentation groups movable pages within a
> > hugepage-aligned block and high-order allocations will trigger a merge of
> > buddies from PAGE_ALLOC_MERGE_ORDER (defined in the relevant patch) up to
> > MAX_ORDER-1. Critically, a merge is also triggered when anti-fragmentation
> > wants to fallback to another migratetype to satisfy an allocation. As
> > long as the grouping works, it doesn't matter if they were only merged up
> > to PAGE_ALLOC_MERGE_ORDER as a full merge will still free up hugepages.
> > So two slow paths are made slower but the fast path should be faster and it
> > should be causing fewer cache line bounces due to writes to struct page.
>
> Oh, but the anti-fragmentation stuff is orthogonal to this. Movable
> groups should always be defragmentable (at some cost)... the bane of
> anti-frag is fragmentation of the non-movable groups.
>

True, the reclaimable area has varying degrees of success and the
non-movable groups are almost unworkable, depending on how much of them
is made up of pagetables.

> And one reason why buddy is so good at avoiding fragmentation is
> because it will pick up _any_ pages that go past the allocator if
> they have any free buddies. And it hands out ones that don't have
> free buddies. So in that way it is naturally continually filtering
> out pages which can be merged.
>
> Wheras if you defer this until the point you need a higher order
> page, the only thing you have to work with are the pages that are
> free *right now*.
>

Well, buddy always uses the smallest available page first. Even with
deferred coalescing, it will merge up to order-5 at least. Lets say they
could have merged up to order-10 in ordinary circumstances, they are
still avoided for as long as possible. Granted, it might mean that an
order-5 is split that could have been merged but it's hard to tell how
much of a difference that makes.
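
For reference, the smallest-available-first behaviour referred to here is
roughly the following (a simplified sketch of the mainline
__rmqueue_smallest(), not the patched code; expand() splits any unused
remainder back onto the lower-order free lists):

    static struct page *smallest_first(struct zone *zone, unsigned int order,
                                            int migratetype)
    {
            unsigned int current_order;

            for (current_order = order; current_order < MAX_ORDER;
                                            ++current_order) {
                    struct free_area *area = &zone->free_area[current_order];
                    struct page *page;

                    if (list_empty(&area->free_list[migratetype]))
                            continue;

                    /* Take the smallest block that satisfies the request */
                    page = list_entry(area->free_list[migratetype].next,
                                            struct page, lru);
                    list_del(&page->lru);
                    rmv_page_order(page);
                    area->nr_free--;
                    expand(zone, page, order, current_order, area, migratetype);
                    return page;
            }
            return NULL;
    }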

> It will almost definitely increase fragmentation of non movable zones,
> and if you have a workload doing non-trivial, non movable higher order
> allocations that are likely to cause fragmentation, it will result
> in these allocations eating movable groups sooner I think.
>

I think the effect will be same unless the non-movable high-order
allocations are order-5 or higher in which case we are likely going to
hit trouble anyway.

>
> > When I last checked (about 10 days) ago, I hadn't damaged anti-fragmentation
> > but that was a lot of revisions ago. I'm redoing the tests to make sure
> > anti-fragmentation is still ok.
>
> Your anti-frag tests probably don't stress this long term fragmentation
> problem.
>

Probably not, but we have little data on long-term fragmentation other than
anecdotal evidence that it's ok these days.

> Still, it's significant enough that I think it should be made
> optional (and arguably default to on) even if it does harm higher
> order allocations a bit.
>

I could make PAGE_ALLOC_MERGE_ORDER a proc tunable? If it's placed as a
read-mostly variable beside the gfp_zone table, it might even fit in the
same cache line.
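
A minimal sketch of what that could look like (purely illustrative, the
variable name is made up and this is not part of the posted series):

    /* Hypothetical: a __read_mostly tunable replacing the constant */
    int sysctl_page_alloc_merge_order __read_mostly = PAGE_ALLOC_MERGE_ORDER;

    /* ...which __free_one_page() would then consult instead */
    page = __merge_one_page(page, zone, order, sysctl_page_alloc_merge_order);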

--
Mel Gorman
Part-time Phd Student Linux Technology Center
University of Limerick IBM Dublin Software Lab

2009-03-16 12:11:34

by Mel Gorman

[permalink] [raw]
Subject: Re: [PATCH 00/35] Cleanup and optimise the page allocator V3

On Mon, Mar 16, 2009 at 12:45:55PM +0100, Nick Piggin wrote:
> On Mon, Mar 16, 2009 at 09:45:55AM +0000, Mel Gorman wrote:
> > Here is V3 of an attempt to cleanup and optimise the page allocator and should
> > be ready for general testing. The page allocator is now faster (16%
> > reduced time overall for kernbench on one machine) and it has a smaller cache
> > footprint (16.5% less L1 cache misses and 19.5% less L2 cache misses for
> > kernbench on one machine). The text footprint has unfortunately increased,
> > largely due to the introduction of a form of lazy buddy merging mechanism
> > that avoids cache misses by postponing buddy merging until a high-order
> > allocation needs it.
>
> BTW. I would feel better about this if it gets merged in stages, with
> functional changes split out, and also code optimisations and omore
> obvious performace improvements split out and preferably merged first.
>

The ordering of the patches is such that the least-controversial stuff is
at the start of the patchset. The intention was to be able to select a
cut-off point and say "that's enough for now".

> At a very quick glance, the first 25 or so patches should go in first,
> and that gives a much better base to compare subsequent functional
> changes with.

That's reasonable. I've requeued tests for the patchset up to 25 to see what
that looks like. There is also a part of a later patch that reduces how much
time is spent with interrupts disabled. I should split that out and move it
back to within the cut-off point as something that is "obviously good".

> Patch 18 for example is really significant, and should
> almost be 2.6.29/-stable material IMO.
>

My impression was that -stable was only for functional regressions whereas
this is really a performance thing.

--
Mel Gorman
Part-time Phd Student Linux Technology Center
University of Limerick IBM Dublin Software Lab

2009-03-16 12:25:29

by Nick Piggin

[permalink] [raw]
Subject: Re: [PATCH 00/35] Cleanup and optimise the page allocator V3

On Mon, Mar 16, 2009 at 12:02:17PM +0000, Mel Gorman wrote:
> On Mon, Mar 16, 2009 at 12:33:58PM +0100, Nick Piggin wrote:
> > Wheras if you defer this until the point you need a higher order
> > page, the only thing you have to work with are the pages that are
> > free *right now*.
> >
>
> Well, buddy always uses the smallest available page first. Even with
> deferred coalescing, it will merge up to order-5 at least. Lets say they
> could have merged up to order-10 in ordinary circumstances, they are
> still avoided for as long as possible. Granted, it might mean that an
> order-5 is split that could have been merged but it's hard to tell how
> much of a difference that makes.

But the kinds of pages *you* are interested in are order-10, right?


> > Your anti-frag tests probably don't stress this long term fragmentation
> > problem.
> >
>
> Probably not, but we have little data on long-term fragmentation other than
> anecdotal evidence that it's ok these days.

Well, I think before anti-frag there was lots of anecdotal evidence
that it's "ok", except for loads heavily using large higher order
allocations. I don't know if we'd have many systems running with
hundreds of days of uptime on such workloads post-anti-frag?

Google might? But I don't know how long their uptimes are. I expect
we'd have a better idea in a couple more years after the next
enterprise distro release cycles with anti-frag.


> > Still, it's significant enough that I think it should be made
> > optional (and arguably default to on) even if it does harm higher
> > order allocations a bit.
> >
>
> I could make PAGE_ORDER_MERGE_ORDER a proc tunable? If it's placed as a
> read-mostly variable beside the gfp_zone table, it might even fit in the
> same cache line.

Hmm, possibly. OTOH I don't like tunables. If you don't think it will
be a problem for hugepage allocations, then I would prefer just to
leave it on and 5 by default (or even less? COSTLY_ORDER?)

2009-03-16 12:28:35

by Nick Piggin

[permalink] [raw]
Subject: Re: [PATCH 00/35] Cleanup and optimise the page allocator V3

On Mon, Mar 16, 2009 at 12:11:22PM +0000, Mel Gorman wrote:
> On Mon, Mar 16, 2009 at 12:45:55PM +0100, Nick Piggin wrote:
> > On Mon, Mar 16, 2009 at 09:45:55AM +0000, Mel Gorman wrote:
> > > Here is V3 of an attempt to cleanup and optimise the page allocator and should
> > > be ready for general testing. The page allocator is now faster (16%
> > > reduced time overall for kernbench on one machine) and it has a smaller cache
> > > footprint (16.5% less L1 cache misses and 19.5% less L2 cache misses for
> > > kernbench on one machine). The text footprint has unfortunately increased,
> > > largely due to the introduction of a form of lazy buddy merging mechanism
> > > that avoids cache misses by postponing buddy merging until a high-order
> > > allocation needs it.
> >
> > BTW. I would feel better about this if it gets merged in stages, with
> > functional changes split out, and also code optimisations and omore
> > obvious performace improvements split out and preferably merged first.
> >
>
> The ordering of the patches was such that least-controversial stuff is
> at the start of the patchset. The intention was to be able to select a
> cut-off point and say "that's enough for now"
>
> > At a very quick glance, the first 25 or so patches should go in first,
> > and that gives a much better base to compare subsequent functional
> > changes with.
>
> That's reasonable. I've requeued tests for the patchset up to 25 to see what
> that looks like. There is also a part of a later patch that reduces how much
> time is spent with interrupts disabled. I should split that out and move it
> back to within the cut-off point as something that is "obviously good".

OK cool. It also means we can start getting benefit of some of them
sooner. I hope most of the obvious ones can be merged in 2.6.30.


> > Patch 18 for example is really significant, and should
> > almost be 2.6.29/-stable material IMO.
> >
>
> My impression was that -stable was only for functional regressions where
> as this is really a performance thing.

A performance regression like this in the core page allocator is a
pretty important problem. The fix is obvious. But maybe you're right.

2009-03-16 13:32:45

by Mel Gorman

[permalink] [raw]
Subject: Re: [PATCH 00/35] Cleanup and optimise the page allocator V3

On Mon, Mar 16, 2009 at 01:25:05PM +0100, Nick Piggin wrote:
> On Mon, Mar 16, 2009 at 12:02:17PM +0000, Mel Gorman wrote:
> > On Mon, Mar 16, 2009 at 12:33:58PM +0100, Nick Piggin wrote:
> > > Wheras if you defer this until the point you need a higher order
> > > page, the only thing you have to work with are the pages that are
> > > free *right now*.
> > >
> >
> > Well, buddy always uses the smallest available page first. Even with
> > deferred coalescing, it will merge up to order-5 at least. Lets say they
> > could have merged up to order-10 in ordinary circumstances, they are
> > still avoided for as long as possible. Granted, it might mean that an
> > order-5 is split that could have been merged but it's hard to tell how
> > much of a difference that makes.
>
> But the kinds of pages *you* are interested in are order-10, right?
>

Yes, but my expectation is that multiple free order-5 pages can be
merged to make up an order-10. If they can't, then lumpy reclaim kicks
in as normal. My expectation actually is that order-10 allocations often
end up using lumpy reclaim and the pages are not automatically
available.

As it is though, I have done something wrong and success rates have dropped
where they were ok 10 days ago. I need to investigate further but as the
first cut-off point at 25 patches is before the lazy buddy patch, it's not
an immediate problem.

>
> > > Your anti-frag tests probably don't stress this long term fragmentation
> > > problem.
> > >
> >
> > Probably not, but we have little data on long-term fragmentation other than
> > anecdotal evidence that it's ok these days.
>
> Well, I think before anti-frag there was lots of anecdotal evidence
> that it's "ok", except for loads heavily using large higher order
> allocations. I don't know if we'd have many systems running with
> hundreds of days of uptime on such workloads post-anti-frag?
>

I doubt it. I probably won't see proper reports on how it behaves until
it's part of a major distro release.

> Google might? But I don't know how long their uptimes are. I expect
> we'd have a better idea in a couple more years after the next
> enterprise distro release cycles with anti-frag.
>

Exactly.

>
> > > Still, it's significant enough that I think it should be made
> > > optional (and arguably default to on) even if it does harm higher
> > > order allocations a bit.
> > >
> >
> > I could make PAGE_ORDER_MERGE_ORDER a proc tunable? If it's placed as a
> > read-mostly variable beside the gfp_zone table, it might even fit in the
> > same cache line.
>
> Hmm, possibly. OTOH I don't like tunables.

Neither do I, but in this case it would make it easier to test where the
proper cut-off point is without requiring kernel recompiles and make a
final static decision later.

> If you don't think it will
> be a problem for hugepage allocations, then I would prefer just to
> leave it on and 5 by default (or even less? COSTLY_ORDER?)
>

I went with 5 because it means we merge up to at least the pcp->batch
size. As the page allocator gives back pages in contiguous order if a buddy
split occurred, it made sense that pcp batch refills are contiguous where
possible.

--
Mel Gorman
Part-time Phd Student Linux Technology Center
University of Limerick IBM Dublin Software Lab

2009-03-16 15:54:00

by Nick Piggin

[permalink] [raw]
Subject: Re: [PATCH 00/35] Cleanup and optimise the page allocator V3

On Mon, Mar 16, 2009 at 01:32:32PM +0000, Mel Gorman wrote:
> On Mon, Mar 16, 2009 at 01:25:05PM +0100, Nick Piggin wrote:
> > > Well, buddy always uses the smallest available page first. Even with
> > > deferred coalescing, it will merge up to order-5 at least. Lets say they
> > > could have merged up to order-10 in ordinary circumstances, they are
> > > still avoided for as long as possible. Granted, it might mean that an
> > > order-5 is split that could have been merged but it's hard to tell how
> > > much of a difference that makes.
> >
> > But the kinds of pages *you* are interested in are order-10, right?
> >
>
> Yes, but my expectation is that multiple free order-5 pages can be
> merged to make up an order-10.

Yes, but lazy buddy will give out part of an order-10 free area
to an order-5 request even when there are genuine order-5,6,7,8,9
free areas available.

Now it could be assumed that not too much else in the kernel
asks for anything over order-3, so you are unlikely to get these
kinds of requests. But it's worse than that actually, because
lazy buddy will also split half of an order-10 free area in order
to satisfy an order-0 allocation in cases where there are no free
orders smaller than 5 available.

So yes definitely I think there should be a very real impact on
higher order coalescing no matter what you do.


> If they can't, then lumpy reclaim kicks
> in as normal. My expectation actually is that order-10 allocations often
> end up using lumpy reclaim and the pages are not automatically
> available.

movable zone is less interesting, although it will make it harder
to allocate these guys from movable zone. But the pages are
movable so eventually they should be able to be reclaimed.

unmovable zone fragmentation is the more important point because it
can eventually destroy the movable zone.

2009-03-16 16:25:43

by Mel Gorman

[permalink] [raw]
Subject: Re: [PATCH 15/35] Inline __rmqueue_fallback()

On Mon, Mar 16, 2009 at 11:57:10AM -0400, Christoph Lameter wrote:
> On Mon, 16 Mar 2009, Mel Gorman wrote:
>
> > __rmqueue() is in the slow path but has only one call site. It actually
> > reduces text if it's inlined.
>
> This is modifying __rmqueue_fallback() not __rmqueue().
>

Fixed, thanks

--
Mel Gorman
Part-time Phd Student Linux Technology Center
University of Limerick IBM Dublin Software Lab

2009-03-16 16:29:48

by Mel Gorman

[permalink] [raw]
Subject: Re: [PATCH 18/35] Do not disable interrupts in free_page_mlock()

On Mon, Mar 16, 2009 at 12:05:53PM -0400, Christoph Lameter wrote:
> On Mon, 16 Mar 2009, Mel Gorman wrote:
>
> > @@ -570,6 +570,8 @@ static void __free_pages_ok(struct page *page, unsigned int order)
> > kernel_map_pages(page, 1 << order, 0);
> >
> > local_irq_save(flags);
> > + if (clearMlocked)
> > + free_page_mlock(page);
> > __count_vm_events(PGFREE, 1 << order);
> > free_one_page(page_zone(page), page, order,
> > get_pageblock_migratetype(page));
>
> Add an unlikely(clearMblocked) here?
>

I wasn't sure at the time of writing how likely the case really is, but
mlocked pages should rarely be freed. On reflection though, it makes sense
to mark this unlikely().
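
i.e. the hunk quoted above would become something like:

    local_irq_save(flags);
    if (unlikely(clearMlocked))
            free_page_mlock(page);
    __count_vm_events(PGFREE, 1 << order);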

> > @@ -1036,6 +1039,9 @@ static void free_hot_cold_page(struct page *page, int cold)
> > pcp = &zone_pcp(zone, get_cpu())->pcp;
> > local_irq_save(flags);
> > __count_vm_event(PGFREE);
> > + if (clearMlocked)
> > + free_page_mlock(page);
> > +
> > if (cold)
> > list_add_tail(&page->lru, &pcp->list);
> > else
> >
>
> Same here also make sure tha the __count_vm_events(PGFREE) comes after the
> free_pages_mlock() to preserve symmetry with __free_pages_ok() and maybe
> allow the compiler to do CSE between two invocations of
> __count_vm_events().
>

Whatever about the latter reasoning about CSE, the symmetry makes sense.
I've made the change. Thanks

--
Mel Gorman
Part-time Phd Student Linux Technology Center
University of Limerick IBM Dublin Software Lab

2009-03-16 16:36:43

by Mel Gorman

[permalink] [raw]
Subject: Re: [PATCH 20/35] Use a pre-calculated value for num_online_nodes()

On Mon, Mar 16, 2009 at 12:08:25PM -0400, Christoph Lameter wrote:
> On Mon, 16 Mar 2009, Mel Gorman wrote:
>
> > +extern int static_num_online_nodes;
>
> Strange name. Could we name this nr_online_nodes or so?
>

It's to match the function name. Arguably I could also have replaced the
implementation of num_online_nodes() with a version that uses the static
variable.

--
Mel Gorman
Part-time Phd Student Linux Technology Center
University of Limerick IBM Dublin Software Lab

2009-03-16 16:42:51

by Mel Gorman

[permalink] [raw]
Subject: Re: [PATCH 23/35] Update NR_FREE_PAGES only as necessary

On Mon, Mar 16, 2009 at 12:17:15PM -0400, Christoph Lameter wrote:
> On Mon, 16 Mar 2009, Mel Gorman wrote:
>
> > When pages are being freed to the buddy allocator, the zone
> > NR_FREE_PAGES counter must be updated. In the case of bulk per-cpu page
> > freeing, it's updated once per page. This retouches cache lines more
> > than necessary. Update the counters one per per-cpu bulk free.
>
> Not sure about the reasoning here since the individual updates are batched

Each update takes place between lots of other work on different cache
lines. With enough buddy merging, I believed the line holding the counters
could at least get pushed out of the L1 cache, although probably not the L2.
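
As a sketch of the difference (simplified; in the unpatched code the
per-page update actually sits a level down in __rmqueue_smallest()):

    /* Before: NR_FREE_PAGES touched once per allocated page */
    for (i = 0; i < count; ++i) {
            struct page *page = __rmqueue(zone, order, migratetype);
            if (page == NULL)
                    break;
            list_add(&page->lru, list);
            __mod_zone_page_state(zone, NR_FREE_PAGES, -(1UL << order));
    }

    /* After: one update covering the whole bulk refill */
    for (i = 0; i < count; ++i) {
            struct page *page = __rmqueue(zone, order, migratetype);
            if (page == NULL)
                    break;
            list_add(&page->lru, list);
    }
    __mod_zone_page_state(zone, NR_FREE_PAGES, -(i << order));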

> and you are touching the same cacheline as the pcp you are operating on
> and have to touch anyways.
>
> But if its frequent that __rmqueue_smallest() and free_pages_bulk() are
> called with multiple pages then its always a win.
>

It's frequent enough that it showed up in profiles

> Reviewed-by: Christoph Lameter <[email protected]>
>
> > + __mod_zone_page_state(zone, NR_FREE_PAGES, -(1UL << order) * i);
>
> A multiplication? Okay with contemporary cpus I guess.
>

Replaced with

__mod_zone_page_state(zone, NR_FREE_PAGES, -(i << order));

--
Mel Gorman
Part-time Phd Student Linux Technology Center
University of Limerick IBM Dublin Software Lab

2009-03-16 16:46:19

by Mel Gorman

[permalink] [raw]
Subject: Re: [PATCH 24/35] Convert gfp_zone() to use a table of precalculated values

On Mon, Mar 16, 2009 at 12:19:06PM -0400, Christoph Lameter wrote:
> On Mon, 16 Mar 2009, Mel Gorman wrote:
>
> > diff --git a/init/main.c b/init/main.c
> > index 8442094..08a5663 100644
> > --- a/init/main.c
> > +++ b/init/main.c
> > @@ -573,6 +573,7 @@ asmlinkage void __init start_kernel(void)
> > * fragile until we cpu_idle() for the first time.
> > */
> > preempt_disable();
> > + init_gfp_zone_table();
> > build_all_zonelists();
> > page_alloc_init();
> > printk(KERN_NOTICE "Kernel command line: %s\n", boot_command_line);
> > diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> > index bc491fa..d76f57d 100644
> > --- a/mm/page_alloc.c
> > +++ b/mm/page_alloc.c
> > @@ -70,6 +70,7 @@ EXPORT_SYMBOL(node_states);
> > unsigned long totalram_pages __read_mostly;
> > unsigned long totalreserve_pages __read_mostly;
> > unsigned long highest_memmap_pfn __read_mostly;
> > +int gfp_zone_table[GFP_ZONEMASK] __read_mostly;
> > int static_num_online_nodes __read_mostly;
> > int percpu_pagelist_fraction;
> >
> > @@ -4569,7 +4570,7 @@ static void setup_per_zone_inactive_ratio(void)
> > * 8192MB: 11584k
> > * 16384MB: 16384k
> > */
> > -static int __init init_per_zone_pages_min(void)
> > +static int init_per_zone_pages_min(void)
> > {
> > unsigned long lowmem_kbytes;
> >
> > @@ -4587,6 +4588,39 @@ static int __init init_per_zone_pages_min(void)
> > }
> > module_init(init_per_zone_pages_min)
> >
> > +static inline int __init gfp_flags_to_zone(gfp_t flags)
> > +{
> > +#ifdef CONFIG_ZONE_DMA
> > + if (flags & __GFP_DMA)
> > + return ZONE_DMA;
> > +#endif
> > +#ifdef CONFIG_ZONE_DMA32
> > + if (flags & __GFP_DMA32)
> > + return ZONE_DMA32;
> > +#endif
> > + if ((flags & (__GFP_HIGHMEM | __GFP_MOVABLE)) ==
> > + (__GFP_HIGHMEM | __GFP_MOVABLE))
> > + return ZONE_MOVABLE;
> > +#ifdef CONFIG_HIGHMEM
> > + if (flags & __GFP_HIGHMEM)
> > + return ZONE_HIGHMEM;
> > +#endif
> > + return ZONE_NORMAL;
> > +}
> > +
> > +/*
> > + * For each possible combination of zone modifier flags, we calculate
> > + * what zone it should be using. This consumes a cache line in most
> > + * cases but avoids a number of branches in the allocator fast path
> > + */
> > +void __init init_gfp_zone_table(void)
> > +{
> > + gfp_t gfp_flags;
> > +
> > + for (gfp_flags = 0; gfp_flags < GFP_ZONEMASK; gfp_flags++)
> > + gfp_zone_table[gfp_flags] = gfp_flags_to_zone(gfp_flags);
> > +}
> > +
>
> This is all known at compile time. The table can be calculated at compile
> time with some ifdefs and then we do not need the init_gfp_zone_table().
>

The last discussion didn't come up with something that could generate such
a table and still be correct in all cases. I guess someone could do it all
by hand but it was going to be an error prone exercise. I left the patch
as-is as a result.
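
For context, the win is on the lookup side, which becomes a simple array
index (a sketch of the idea; the exact gfp_zone() change is not quoted in
this mail):

    static inline enum zone_type gfp_zone(gfp_t flags)
    {
            return gfp_zone_table[flags & GFP_ZONEMASK];
    }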

--
Mel Gorman
Part-time Phd Student Linux Technology Center
University of Limerick IBM Dublin Software Lab

2009-03-16 16:47:48

by Mel Gorman

[permalink] [raw]
Subject: Re: [PATCH 26/35] Use the per-cpu allocator for orders up to PAGE_ALLOC_COSTLY_ORDER

On Mon, Mar 16, 2009 at 12:26:07PM -0400, Christoph Lameter wrote:
> On Mon, 16 Mar 2009, Mel Gorman wrote:
>
> > -static void free_hot_cold_page(struct page *page, int cold)
> > +static void free_hot_cold_page(struct page *page, int order, int cold)
> > {
> > struct zone *zone = page_zone(page);
> > struct per_cpu_pages *pcp;
> > unsigned long flags;
> > int clearMlocked = PageMlocked(page);
> >
> > + /* SLUB can return lowish-order compound pages that need handling */
> > + if (order > 0 && unlikely(PageCompound(page)))
> > + if (unlikely(destroy_compound_page(page, order)))
> > + return;
> > +
>
> Isn't that also true for stacks and generic network objects == 8k?
>

I think they are vanilla high-order pages, not compound pages.

> > again:
> > cpu = get_cpu();
> > - if (likely(order == 0)) {
> > + if (likely(order <= PAGE_ALLOC_COSTLY_ORDER)) {
> > struct per_cpu_pages *pcp;
> > + int batch;
> > + int delta;
> >
> > pcp = &zone_pcp(zone, cpu)->pcp;
> > + batch = max(1, pcp->batch >> order);
> > local_irq_save(flags);
> > if (!pcp->count) {
> > - pcp->count = rmqueue_bulk(zone, 0,
> > - pcp->batch, &pcp->list, migratetype);
> > + delta = rmqueue_bulk(zone, order, batch,
> > + &pcp->list, migratetype);
> > + bulk_add_pcp_page(pcp, order, delta);
> > if (unlikely(!pcp->count))
> > goto failed;
>
> The pcp adds a series of order N pages if an order N alloc occurs and the
> queue is empty?
>

Nope, that would be bad. The calculation of batch is made above and is

batch = max(1, pcp->batch >> order);

so order is taken into account when deciding how many pages to allocate.
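
To put numbers on it, taking pcp->batch == 16 purely as an example value
(the real value depends on the zone size):

	order 0: max(1, 16 >> 0) = 16 pages queued (16 base pages worth)
	order 1: max(1, 16 >> 1) =  8 pages queued (16 base pages worth)
	order 2: max(1, 16 >> 2) =  4 pages queued (16 base pages worth)
	order 3: max(1, 16 >> 3) =  2 pages queued (16 base pages worth)

so roughly the same amount of memory is pulled onto the list whatever the
order, with the max() only there to guarantee at least one page when
pcp->batch is small relative to the order.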

--
Mel Gorman
Part-time Phd Student Linux Technology Center
University of Limerick IBM Dublin Software Lab

2009-03-16 16:51:20

by Mel Gorman

[permalink] [raw]
Subject: Re: [PATCH 30/35] Skip the PCP list search by counting the order and type of pages on list

On Mon, Mar 16, 2009 at 12:31:44PM -0400, Christoph Lameter wrote:
> On Mon, 16 Mar 2009, Mel Gorman wrote:
>
> >
> > struct per_cpu_pages {
> > - int count; /* number of pages in the list */
> > + /* The total number of pages on the PCP lists */
> > + int count;
> > +
> > + /* Count of each migratetype and order */
> > + u8 mocount[MIGRATE_PCPTYPES][PAGE_ALLOC_COSTLY_ORDER+1];
>
> What about overflow? You could have more than 255 pages of a given type in
> a pcp.
>

Are you sure? The maximum I would expect is pcp->high, which is
6 * pcp->batch and is limited in size (6 * batch only fits in a u8 while
pcp->batch stays at or below 42). I might have missed something though.

--
Mel Gorman
Part-time Phd Student Linux Technology Center
University of Limerick IBM Dublin Software Lab

2009-03-16 16:52:32

by Mel Gorman

[permalink] [raw]
Subject: Re: [PATCH 31/35] Optimistically check the first page on the PCP free list is suitable

On Mon, Mar 16, 2009 at 12:33:44PM -0400, Christoph Lameter wrote:
> On Mon, 16 Mar 2009, Mel Gorman wrote:
>
> > diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> > index bb5bd5e..8568284 100644
> > --- a/mm/page_alloc.c
> > +++ b/mm/page_alloc.c
> > @@ -1212,6 +1212,12 @@ again:
> > if (pcp_page_suit(page, order))
> > break;
> > } else {
> > + /* Optimistic before we start a list search */
> > + page = list_entry(list->next, struct page, lru);
> > + if (pcp_page_suit(page, order))
> > + goto found;
> > +
> > + /* Do the search */
> > list_for_each_entry(page, list, lru)
> > if (pcp_page_suit(page, order))
> > break;
>
> I am not convinced that this is beneficial. If it were, the compiler
> would unroll the loop.
>

It hit a large number of times when I checked (although I don't have the
figures any more), and list_for_each_entry() does a prefetch, possibly
fetching in a cache line we don't need.

--
Mel Gorman
Part-time Phd Student Linux Technology Center
University of Limerick IBM Dublin Software Lab

2009-03-16 16:56:41

by Mel Gorman

[permalink] [raw]
Subject: Re: [PATCH 00/35] Cleanup and optimise the page allocator V3

On Mon, Mar 16, 2009 at 04:53:42PM +0100, Nick Piggin wrote:
> On Mon, Mar 16, 2009 at 01:32:32PM +0000, Mel Gorman wrote:
> > On Mon, Mar 16, 2009 at 01:25:05PM +0100, Nick Piggin wrote:
> > > > Well, buddy always uses the smallest available page first. Even with
> > > > deferred coalescing, it will merge up to order-5 at least. Lets say they
> > > > could have merged up to order-10 in ordinary circumstances, they are
> > > > still avoided for as long as possible. Granted, it might mean that an
> > > > order-5 is split that could have been merged but it's hard to tell how
> > > > much of a difference that makes.
> > >
> > > But the kinds of pages *you* are interested in are order-10, right?
> > >
> >
> > Yes, but my expectation is that multiple free order-5 pages can be
> > merged to make up an order-10.
>
> Yes, but lazy buddy will give out part of an order-10 free area
> to an order-5 request even when there are genuine order-5,6,7,8,9
> free areas available.
>

True.

> Now it could be assumed that not too much else in the kernel
> asks for anything over order-3, so you are unlikely to get these
> kinds of requests.

Which is an assumption I was working with.

> But it's worse than that actually, because
> lazy buddy will also split half of an order-10 free area in order
> to satisfy an order-0 allocation in cases that there are no smaller
> orders than 5 available.
>

Also true. In movable areas it probably makes no difference but it might
if large high-order unmovable allocations were common.

> So yes definitely I think there should be a very real impact on
> higher order coalescing no matter what you do.
>

Because this is not straight-forward at all, I'll put lazy buddy onto
the back-burner and exhaust all other possibilities before revisiting it
again.

>
> > If they can't, then lumpy reclaim kicks
> > in as normal. My expectation actually is that order-10 allocations often
> > end up using lumpy reclaim and the pages are not automatically
> > available.
>
> movable zone is less interesting, although it will make it harder
> to allocate these guys from movable zone. But the pages are
> movable so eventually they should be able to be reclaimed.
>

Exactly.

> unmovable zone fragmentation is more important point because it
> eventually can destroy the movable zone.
>

Which is why rmqueue_fallback() also merges up all buddies before making
any decisions but I accept your points. This is hard enough to
mind-experiment with that it should be considered last.

--
Mel Gorman
Part-time Phd Student Linux Technology Center
University of Limerick IBM Dublin Software Lab

2009-03-16 16:59:00

by Mel Gorman

[permalink] [raw]
Subject: Re: [PATCH 23/35] Update NR_FREE_PAGES only as necessary

On Mon, Mar 16, 2009 at 12:48:25PM -0400, Christoph Lameter wrote:
> On Mon, 16 Mar 2009, Mel Gorman wrote:
>
> > Replaced with
> >
> > __mod_zone_page_state(zone, NR_FREE_PAGES, -(i << order));
>
> A later patch does that also.
>

Silly of me. That later patch is in the "controversial" pile but the
change is now moved back here where it belongs.

--
Mel Gorman
Part-time Phd Student Linux Technology Center
University of Limerick IBM Dublin Software Lab

2009-03-16 17:06:41

by Nick Piggin

[permalink] [raw]
Subject: Re: [PATCH 00/35] Cleanup and optimise the page allocator V3

On Mon, Mar 16, 2009 at 04:56:28PM +0000, Mel Gorman wrote:
> On Mon, Mar 16, 2009 at 04:53:42PM +0100, Nick Piggin wrote:
> > So yes definitely I think there should be a very real impact on
> > higher order coalescing no matter what you do.
> >
>
> Because this is not straight-forward at all, I'll put lazy buddy onto
> the back-burner and exhaust all other possibilities before revisiting it
> again.

If it is such a big improvement, I expect *most* people will want
it and we probably should do it. But just that it will not play
nicely with fragmentation and so you'd need to look into it and
devise some way those users can tune it to be nicer.

> > unmovable zone fragmentation is more important point because it
> > eventually can destroy the movable zone.
> >
>
> Which is why rmqueue_fallback() also merges up all buddies before making
> any decisions but I accept your points.

Right, that merge of buddies will only be able to look at what is
currently free, whereas non-lazy buddy can pull out higher orders
before reallocating them.

2009-03-16 17:25:10

by Christoph Lameter

[permalink] [raw]
Subject: Re: [PATCH 23/35] Update NR_FREE_PAGES only as necessary

On Mon, 16 Mar 2009, Mel Gorman wrote:

> Replaced with
>
> __mod_zone_page_state(zone, NR_FREE_PAGES, -(i << order));

A later patch does that also.

2009-03-16 18:01:36

by Christoph Lameter

[permalink] [raw]
Subject: Re: [PATCH 29/35] Do not store the PCP high and batch watermarks in the per-cpu structure

On Mon, 16 Mar 2009, Mel Gorman wrote:

> --- a/include/linux/mmzone.h
> +++ b/include/linux/mmzone.h
> @@ -166,8 +166,6 @@ static inline int is_unevictable_lru(enum lru_list l)
>
> struct per_cpu_pages {
> int count; /* number of pages in the list */
> - int high; /* high watermark, emptying needed */
> - int batch; /* chunk size for buddy add/remove */


There is a hole here on 64 bit systems.

>
> /* Lists of pages, one per migrate type stored on the pcp-lists */
> struct list_head lists[MIGRATE_PCPTYPES];
> @@ -285,6 +283,12 @@ struct zone {
> unsigned long pages_mark[3];
> };
>
> + /* high watermark for per-cpu lists, emptying needed */
> + u16 pcp_high;
> +
> + /* chunk size for buddy add/remove to per-cpu lists*/
> + u16 pcp_batch;
> +

Move this up to fill the hole?
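
Roughly, for a 64-bit build (offsets for illustration only):

	struct per_cpu_pages {
		int              count;                   /* offset 0, size 4 */
		/* 4-byte hole: lists[] needs 8-byte alignment */
		struct list_head lists[MIGRATE_PCPTYPES]; /* offset 8, ...    */
	};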

2009-03-16 18:01:52

by Christoph Lameter

[permalink] [raw]
Subject: Re: [PATCH 20/35] Use a pre-calculated value for num_online_nodes()

On Mon, 16 Mar 2009, Mel Gorman wrote:

> On Mon, Mar 16, 2009 at 12:08:25PM -0400, Christoph Lameter wrote:
> > On Mon, 16 Mar 2009, Mel Gorman wrote:
> >
> > > +extern int static_num_online_nodes;
> >
> > Strange name. Could we name this nr_online_nodes or so?
> >
>
> It's to match the function name. Arguably I could also have replaced the
> implementation of num_online_nodes() with a version that uses the static
> variable.

We have nr_node_ids etc. It would be consistent with that naming.

2009-03-16 18:03:34

by Christoph Lameter

[permalink] [raw]
Subject: Re: [PATCH 30/35] Skip the PCP list search by counting the order and type of pages on list

On Mon, 16 Mar 2009, Mel Gorman wrote:

>
> struct per_cpu_pages {
> - int count; /* number of pages in the list */
> + /* The total number of pages on the PCP lists */
> + int count;
> +
> + /* Count of each migratetype and order */
> + u8 mocount[MIGRATE_PCPTYPES][PAGE_ALLOC_COSTLY_ORDER+1];

What about overflow? You could have more than 255 pages of a given type in
a pcp.

2009-03-16 18:02:49

by Christoph Lameter

[permalink] [raw]
Subject: Re: [PATCH 31/35] Optimistically check the first page on the PCP free list is suitable

On Mon, 16 Mar 2009, Mel Gorman wrote:

> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index bb5bd5e..8568284 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -1212,6 +1212,12 @@ again:
> if (pcp_page_suit(page, order))
> break;
> } else {
> + /* Optimistic before we start a list search */
> + page = list_entry(list->next, struct page, lru);
> + if (pcp_page_suit(page, order))
> + goto found;
> +
> + /* Do the search */
> list_for_each_entry(page, list, lru)
> if (pcp_page_suit(page, order))
> break;

I am not convinced that this is beneficial. If it were, the compiler
would unroll the loop.

2009-03-16 18:55:01

by Christoph Lameter

[permalink] [raw]
Subject: Re: [PATCH 26/35] Use the per-cpu allocator for orders up to PAGE_ALLOC_COSTLY_ORDER

On Mon, 16 Mar 2009, Mel Gorman wrote:

> -static void free_hot_cold_page(struct page *page, int cold)
> +static void free_hot_cold_page(struct page *page, int order, int cold)
> {
> struct zone *zone = page_zone(page);
> struct per_cpu_pages *pcp;
> unsigned long flags;
> int clearMlocked = PageMlocked(page);
>
> + /* SLUB can return lowish-order compound pages that need handling */
> + if (order > 0 && unlikely(PageCompound(page)))
> + if (unlikely(destroy_compound_page(page, order)))
> + return;
> +

Isn't that also true for stacks and generic network objects == 8k?

> again:
> cpu = get_cpu();
> - if (likely(order == 0)) {
> + if (likely(order <= PAGE_ALLOC_COSTLY_ORDER)) {
> struct per_cpu_pages *pcp;
> + int batch;
> + int delta;
>
> pcp = &zone_pcp(zone, cpu)->pcp;
> + batch = max(1, pcp->batch >> order);
> local_irq_save(flags);
> if (!pcp->count) {
> - pcp->count = rmqueue_bulk(zone, 0,
> - pcp->batch, &pcp->list, migratetype);
> + delta = rmqueue_bulk(zone, order, batch,
> + &pcp->list, migratetype);
> + bulk_add_pcp_page(pcp, order, delta);
> if (unlikely(!pcp->count))
> goto failed;

The pcp adds a series of order N pages if an order N alloc occurs and the
queue is empty?

2009-03-16 18:55:49

by Christoph Lameter

[permalink] [raw]
Subject: Re: [PATCH 24/35] Convert gfp_zone() to use a table of precalculated values

On Mon, 16 Mar 2009, Mel Gorman wrote:

> diff --git a/init/main.c b/init/main.c
> index 8442094..08a5663 100644
> --- a/init/main.c
> +++ b/init/main.c
> @@ -573,6 +573,7 @@ asmlinkage void __init start_kernel(void)
> * fragile until we cpu_idle() for the first time.
> */
> preempt_disable();
> + init_gfp_zone_table();
> build_all_zonelists();
> page_alloc_init();
> printk(KERN_NOTICE "Kernel command line: %s\n", boot_command_line);
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index bc491fa..d76f57d 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -70,6 +70,7 @@ EXPORT_SYMBOL(node_states);
> unsigned long totalram_pages __read_mostly;
> unsigned long totalreserve_pages __read_mostly;
> unsigned long highest_memmap_pfn __read_mostly;
> +int gfp_zone_table[GFP_ZONEMASK] __read_mostly;
> int static_num_online_nodes __read_mostly;
> int percpu_pagelist_fraction;
>
> @@ -4569,7 +4570,7 @@ static void setup_per_zone_inactive_ratio(void)
> * 8192MB: 11584k
> * 16384MB: 16384k
> */
> -static int __init init_per_zone_pages_min(void)
> +static int init_per_zone_pages_min(void)
> {
> unsigned long lowmem_kbytes;
>
> @@ -4587,6 +4588,39 @@ static int __init init_per_zone_pages_min(void)
> }
> module_init(init_per_zone_pages_min)
>
> +static inline int __init gfp_flags_to_zone(gfp_t flags)
> +{
> +#ifdef CONFIG_ZONE_DMA
> + if (flags & __GFP_DMA)
> + return ZONE_DMA;
> +#endif
> +#ifdef CONFIG_ZONE_DMA32
> + if (flags & __GFP_DMA32)
> + return ZONE_DMA32;
> +#endif
> + if ((flags & (__GFP_HIGHMEM | __GFP_MOVABLE)) ==
> + (__GFP_HIGHMEM | __GFP_MOVABLE))
> + return ZONE_MOVABLE;
> +#ifdef CONFIG_HIGHMEM
> + if (flags & __GFP_HIGHMEM)
> + return ZONE_HIGHMEM;
> +#endif
> + return ZONE_NORMAL;
> +}
> +
> +/*
> + * For each possible combination of zone modifier flags, we calculate
> + * what zone it should be using. This consumes a cache line in most
> + * cases but avoids a number of branches in the allocator fast path
> + */
> +void __init init_gfp_zone_table(void)
> +{
> + gfp_t gfp_flags;
> +
> + for (gfp_flags = 0; gfp_flags < GFP_ZONEMASK; gfp_flags++)
> + gfp_zone_table[gfp_flags] = gfp_flags_to_zone(gfp_flags);
> +}
> +

This is all known at compile time. The table can be calculated at compile
time with some ifdefs and then we do not need the init_gfp_zone_table().

2009-03-16 18:56:08

by Christoph Lameter

[permalink] [raw]
Subject: Re: [PATCH 23/35] Update NR_FREE_PAGES only as necessary

On Mon, 16 Mar 2009, Mel Gorman wrote:

> When pages are being freed to the buddy allocator, the zone
> NR_FREE_PAGES counter must be updated. In the case of bulk per-cpu page
> freeing, it's updated once per page. This retouches cache lines more
> than necessary. Update the counters once per per-cpu bulk free.

Not sure about the reasoning here, since the individual updates are batched
and you are touching the same cacheline as the pcp you are operating on
and have to touch anyway.

But if it's frequent that __rmqueue_smallest() and free_pages_bulk() are
called with multiple pages then it's always a win.

Reviewed-by: Christoph Lameter <[email protected]>

> + __mod_zone_page_state(zone, NR_FREE_PAGES, -(1UL << order) * i);

A multiplication? Okay with contemporary cpus I guess.
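
The shift form used elsewhere in the series would avoid it, e.g.

	__mod_zone_page_state(zone, NR_FREE_PAGES, -(i << order));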

2009-03-16 18:56:45

by Christoph Lameter

[permalink] [raw]
Subject: Re: [PATCH 20/35] Use a pre-calculated value for num_online_nodes()

On Mon, 16 Mar 2009, Mel Gorman wrote:

> +extern int static_num_online_nodes;

Strange name. Could we name this nr_online_nodes or so?

2009-03-16 18:57:01

by Christoph Lameter

[permalink] [raw]
Subject: Re: [PATCH 15/35] Inline __rmqueue_fallback()

On Mon, 16 Mar 2009, Mel Gorman wrote:

> __rmqueue() is in the slow path but has only one call site. It actually
> reduces text if it's inlined.

This is modifying __rmqueue_fallback() not __rmqueue().

2009-03-16 18:57:24

by Christoph Lameter

[permalink] [raw]
Subject: Re: [PATCH 18/35] Do not disable interrupts in free_page_mlock()

On Mon, 16 Mar 2009, Mel Gorman wrote:

> @@ -570,6 +570,8 @@ static void __free_pages_ok(struct page *page, unsigned int order)
> kernel_map_pages(page, 1 << order, 0);
>
> local_irq_save(flags);
> + if (clearMlocked)
> + free_page_mlock(page);
> __count_vm_events(PGFREE, 1 << order);
> free_one_page(page_zone(page), page, order,
> get_pageblock_migratetype(page));

Add an unlikely(clearMlocked) here?

> @@ -1036,6 +1039,9 @@ static void free_hot_cold_page(struct page *page, int cold)
> pcp = &zone_pcp(zone, get_cpu())->pcp;
> local_irq_save(flags);
> __count_vm_event(PGFREE);
> + if (clearMlocked)
> + free_page_mlock(page);
> +
> if (cold)
> list_add_tail(&page->lru, &pcp->list);
> else
>

Same here; also make sure that the __count_vm_events(PGFREE) comes after
the free_page_mlock() to preserve symmetry with __free_pages_ok() and
maybe allow the compiler to do CSE between the two invocations of
__count_vm_events().
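
I.e., something along these lines (sketch only):

	local_irq_save(flags);
	if (unlikely(clearMlocked))
		free_page_mlock(page);
	__count_vm_event(PGFREE);
	/* then the cold/hot list_add as before */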

2009-03-18 15:07:37

by Mel Gorman

[permalink] [raw]
Subject: Re: [PATCH 00/35] Cleanup and optimise the page allocator V3

On Mon, Mar 16, 2009 at 06:05:51PM +0100, Nick Piggin wrote:
> On Mon, Mar 16, 2009 at 04:56:28PM +0000, Mel Gorman wrote:
> > On Mon, Mar 16, 2009 at 04:53:42PM +0100, Nick Piggin wrote:
> > > So yes definitely I think there should be a very real impact on
> > > higher order coalescing no matter what you do.
> > >
> >
> > Because this is not straight-forward at all, I'll put lazy buddy onto
> > the back-burner and exhaust all other possibilities before revisiting it
> > again.
>
> If it is such a big improvement, I expect *most* people will want
> it and we probably should do it.

I'll be reinvestigating it in isolation. It's possible that high-order and
compound pages on the PCP lists is enough of a delayed buddy merging that
the benefit from lazy buddy is marginal.

> But just that it will not play
> nicely with fragmentation and so you'd need to look into it and
> devise some way those users can tune it to be nicer.
>

Which is why I'm going to postpone it for now.

> > > unmovable zone fragmentation is more important point because it
> > > eventually can destroy the movable zone.
> > >
> >
> > Which is why rmqueue_fallback() also merges up all buddies before making
> > any decisions but I accept your points.
>
> Right, that merge of buddies will only be able to look at what is
> currently free, whereas non-lazy buddy can pull out higher orders
> before reallocating them.
>

--
Mel Gorman
Part-time Phd Student Linux Technology Center
University of Limerick IBM Dublin Software Lab

2009-03-18 15:08:50

by Mel Gorman

[permalink] [raw]
Subject: Re: [PATCH 20/35] Use a pre-calculated value for num_online_nodes()

On Mon, Mar 16, 2009 at 12:47:35PM -0400, Christoph Lameter wrote:
> On Mon, 16 Mar 2009, Mel Gorman wrote:
>
> > On Mon, Mar 16, 2009 at 12:08:25PM -0400, Christoph Lameter wrote:
> > > On Mon, 16 Mar 2009, Mel Gorman wrote:
> > >
> > > > +extern int static_num_online_nodes;
> > >
> > > Strange name. Could we name this nr_online_nodes or so?
> > >
> >
> > It's to match the function name. Arguably I could also have replaced the
> > implementation of num_online_nodes() with a version that uses the static
> > variable.
>
> We have nr_node_ids etc. It would be consistent with that naming.
>

Naming has never been great, but in this case the static value is a
direct replacement of num_online_nodes(). I think having a
similarly-named-but-still-different name obscures more than it helps.

--
Mel Gorman
Part-time Phd Student Linux Technology Center
University of Limerick IBM Dublin Software Lab

2009-03-18 17:04:27

by Christoph Lameter

[permalink] [raw]
Subject: Re: [PATCH 20/35] Use a pre-calculated value for num_online_nodes()

On Wed, 18 Mar 2009, Mel Gorman wrote:

> Naming has never been great, but in this case the static value is a
> direct replacement of num_online_nodes(). I think having a
> similarly-named-but-still-different name obscures more than it helps.

Creates a weird new name. Please use nr_online_nodes. It's useful elsewhere
too.

2009-03-18 18:02:14

by Mel Gorman

[permalink] [raw]
Subject: Re: [PATCH 20/35] Use a pre-calculated value for num_online_nodes()

On Wed, Mar 18, 2009 at 12:58:02PM -0400, Christoph Lameter wrote:
> On Wed, 18 Mar 2009, Mel Gorman wrote:
>
> > Naming has never been great, but in this case the static value is a
> > direct replacement of num_online_nodes(). I think having a
> > similarly-named-but-still-different name obscures more than it helps.
>
> Creates a weird new name. Please use nr_online_nodes. It's useful elsewhere
> too.
>

Ok, I'm dropping this patch for the first pass altogether and will deal
with it later.

--
Mel Gorman
Part-time Phd Student Linux Technology Center
University of Limerick IBM Dublin Software Lab

2009-03-18 19:12:43

by Christoph Lameter

[permalink] [raw]
Subject: Re: [PATCH 20/35] Use a pre-calculated value for num_online_nodes()

On Wed, 18 Mar 2009, Mel Gorman wrote:

> On Wed, Mar 18, 2009 at 12:58:02PM -0400, Christoph Lameter wrote:
> > On Wed, 18 Mar 2009, Mel Gorman wrote:
> >
> > > Naming has never been great, but in this case the static value is a
> > > direct replacement of num_online_nodes(). I think having a
> > > similarly-named-but-still-different name obscures more than it helps.
> >
> > Creates a weird new name. Please use nr_online_nodes. It's useful elsewhere
> > too.
> >
>
> Ok, I'm dropping this patch for the first pass altogether and will deal
> with it later.

This is important infrastructure stuff. num_online_nodes is used in a
couple of other places where it could be replaced by nr_online_nodes.

7 grufile.c gru_get_config_info 181 if (num_online_nodes() > 1 &&
8 grufile.c gru_get_config_info 187 info.nodes = num_online_nodes();
9 hugetlb.c return_unused_surplus_pages 878 unsigned long remaining_iterations = num_online_nodes();
a hugetlb.c return_unused_surplus_pages 907 remaining_iterations = num_online_nodes();
b page_alloc.c MAX_NODE_LOAD 2115 #define MAX_NODE_LOAD (num_online_nodes())
c page_alloc.c build_zonelists 2324 load = num_online_nodes();
d page_alloc.c build_all_zonelists 2475 num_online_nodes(),
e slub.c list_locations 3651 if (num_online_nodes() > 1 && !nodes_empty(l->nodes) &&
f svc.c svc_pool_map_choose_mode 127 if (num_online_nodes() > 1) {

In other places it's avoided because it is deemed too expensive.

2009-03-19 20:47:26

by Christoph Lameter

[permalink] [raw]
Subject: Re: [PATCH 20/35] Use a pre-calculated value for num_online_nodes()

Trying to do the same in the style of nr_node_ids etc.


Subject: Provide nr_online_nodes and nr_possible_nodes

It seems that it's beneficial to have a less expensive way to check for the
number of currently active and possible nodes in a NUMA system. This will
simplify further kernel optimizations for the cases in which only a single
node is online or possible.

Signed-off-by: Christoph Lameter <[email protected]>

Index: linux-2.6/include/linux/nodemask.h
===================================================================
--- linux-2.6.orig/include/linux/nodemask.h 2009-03-19 15:04:28.000000000 -0500
+++ linux-2.6/include/linux/nodemask.h 2009-03-19 15:33:18.000000000 -0500
@@ -408,6 +408,20 @@
#define next_online_node(nid) next_node((nid), node_states[N_ONLINE])

extern int nr_node_ids;
+extern int nr_online_nodes;
+extern int nr_possible_nodes;
+
+static inline void node_set_online(int nid)
+{
+ node_set_state(nid, N_ONLINE);
+ nr_online_nodes = num_node_state(N_ONLINE);
+}
+
+static inline void node_set_offline(int nid)
+{
+ node_clear_state(nid, N_ONLINE);
+ nr_online_nodes = num_node_state(N_ONLINE);
+}
#else

static inline int node_state(int node, enum node_states state)
@@ -434,7 +448,8 @@
#define first_online_node 0
#define next_online_node(nid) (MAX_NUMNODES)
#define nr_node_ids 1
-
+#define nr_online_nodes 1
+#define nr_possible_nodes 1
#endif

#define node_online_map node_states[N_ONLINE]
@@ -454,8 +469,7 @@
#define node_online(node) node_state((node), N_ONLINE)
#define node_possible(node) node_state((node), N_POSSIBLE)

-#define node_set_online(node) node_set_state((node), N_ONLINE)
-#define node_set_offline(node) node_clear_state((node), N_ONLINE)
+

#define for_each_node(node) for_each_node_state(node, N_POSSIBLE)
#define for_each_online_node(node) for_each_node_state(node, N_ONLINE)
Index: linux-2.6/mm/hugetlb.c
===================================================================
--- linux-2.6.orig/mm/hugetlb.c 2009-03-19 15:11:49.000000000 -0500
+++ linux-2.6/mm/hugetlb.c 2009-03-19 15:12:13.000000000 -0500
@@ -875,7 +875,7 @@
* can no longer free unreserved surplus pages. This occurs when
* the nodes with surplus pages have no free pages.
*/
- unsigned long remaining_iterations = num_online_nodes();
+ unsigned long remaining_iterations = nr_online_nodes;

/* Uncommit the reservation */
h->resv_huge_pages -= unused_resv_pages;
@@ -904,7 +904,7 @@
h->surplus_huge_pages--;
h->surplus_huge_pages_node[nid]--;
nr_pages--;
- remaining_iterations = num_online_nodes();
+ remaining_iterations = nr_online_nodes;
}
}
}
Index: linux-2.6/mm/page_alloc.c
===================================================================
--- linux-2.6.orig/mm/page_alloc.c 2009-03-19 15:12:19.000000000 -0500
+++ linux-2.6/mm/page_alloc.c 2009-03-19 15:30:21.000000000 -0500
@@ -168,6 +168,10 @@
#if MAX_NUMNODES > 1
int nr_node_ids __read_mostly = MAX_NUMNODES;
EXPORT_SYMBOL(nr_node_ids);
+int nr_online_nodes __read_mostly = 1;
+EXPORT_SYMBOL(nr_online_nodes);
+int nr_possible_nodes __read_mostly = MAX_NUMNODES;
+EXPORT_SYMBOL(nr_possible_nodes);
#endif

int page_group_by_mobility_disabled __read_mostly;
@@ -2115,7 +2119,7 @@
}


-#define MAX_NODE_LOAD (num_online_nodes())
+#define MAX_NODE_LOAD nr_online_nodes
static int node_load[MAX_NUMNODES];

/**
Index: linux-2.6/mm/slab.c
===================================================================
--- linux-2.6.orig/mm/slab.c 2009-03-19 15:13:45.000000000 -0500
+++ linux-2.6/mm/slab.c 2009-03-19 15:15:28.000000000 -0500
@@ -881,7 +881,6 @@
*/

static int use_alien_caches __read_mostly = 1;
-static int numa_platform __read_mostly = 1;
static int __init noaliencache_setup(char *s)
{
use_alien_caches = 0;
@@ -1434,9 +1433,8 @@
int order;
int node;

- if (num_possible_nodes() == 1) {
+ if (nr_possible_nodes == 1) {
use_alien_caches = 0;
- numa_platform = 0;
}

for (i = 0; i < NUM_INIT_LISTS; i++) {
@@ -3526,7 +3524,7 @@
* variable to skip the call, which is mostly likely to be present in
* the cache.
*/
- if (numa_platform && cache_free_alien(cachep, objp))
+ if (nr_possible_nodes > 1 && cache_free_alien(cachep, objp))
return;

if (likely(ac->avail < ac->limit)) {
Index: linux-2.6/mm/slub.c
===================================================================
--- linux-2.6.orig/mm/slub.c 2009-03-19 15:13:15.000000000 -0500
+++ linux-2.6/mm/slub.c 2009-03-19 15:13:38.000000000 -0500
@@ -3648,7 +3648,7 @@
to_cpumask(l->cpus));
}

- if (num_online_nodes() > 1 && !nodes_empty(l->nodes) &&
+ if (nr_online_nodes > 1 && !nodes_empty(l->nodes) &&
len < PAGE_SIZE - 60) {
len += sprintf(buf + len, " nodes=");
len += nodelist_scnprintf(buf + len, PAGE_SIZE - len - 50,
Index: linux-2.6/net/sunrpc/svc.c
===================================================================
--- linux-2.6.orig/net/sunrpc/svc.c 2009-03-19 15:16:21.000000000 -0500
+++ linux-2.6/net/sunrpc/svc.c 2009-03-19 15:16:51.000000000 -0500
@@ -124,7 +124,7 @@
{
unsigned int node;

- if (num_online_nodes() > 1) {
+ if (nr_online_nodes > 1) {
/*
* Actually have multiple NUMA nodes,
* so split pools on NUMA node boundaries

2009-03-19 21:29:25

by Mel Gorman

[permalink] [raw]
Subject: Re: [PATCH 20/35] Use a pre-calculated value for num_online_nodes()

On Thu, Mar 19, 2009 at 04:43:55PM -0400, Christoph Lameter wrote:
> Trying to do the same in the style of nr_node_ids etc.
>
>
> Subject: Provide nr_online_nodes and nr_possible_nodes
>
> It seems that it's beneficial to have a less expensive way to check for the
> number of currently active and possible nodes in a NUMA system. This will
> simplify further kernel optimizations for the cases in which only a single
> node is online or possible.
>
> Signed-off-by: Christoph Lameter <[email protected]>
>

heh, funnily enough we ended up doing something similar because I
couldn't leave the cost of num_online_nodes() so high.

This patch actually alters the API. node_set_online() called when
MAX_NUMNODES == 1 will now fail to compile. That situation wouldn't make
any sense anyway but is it intentional?

For reference here is the patch I had for a similar goal which kept the
API as it was. I'll drop it if you prefer your own version.

================

commit d767ff28677178659e3260b04b6535e608af68b7
Author: Mel Gorman <[email protected]>
Date: Wed Mar 18 21:12:31 2009 +0000

Use a pre-calculated value for num_online_nodes()

num_online_nodes() is called in a number of places but most often by the
page allocator when deciding whether the zonelist needs to be filtered based
on cpusets or the zonelist cache. This is actually a heavy function and
touches a number of cache lines. This patch stores the number of online
nodes at boot time and updates the value when nodes get onlined and offlined.

Signed-off-by: Mel Gorman <[email protected]>

diff --git a/include/linux/nodemask.h b/include/linux/nodemask.h
index 848025c..48e8d4a 100644
--- a/include/linux/nodemask.h
+++ b/include/linux/nodemask.h
@@ -381,6 +381,10 @@ enum node_states {
extern nodemask_t node_states[NR_NODE_STATES];

#if MAX_NUMNODES > 1
+
+extern int nr_node_ids;
+extern int nr_online_nodes;
+
static inline int node_state(int node, enum node_states state)
{
return node_isset(node, node_states[state]);
@@ -389,11 +393,15 @@ static inline int node_state(int node, enum node_states state)
static inline void node_set_state(int node, enum node_states state)
{
__node_set(node, &node_states[state]);
+ if (state == N_ONLINE)
+ nr_online_nodes = num_node_state(N_ONLINE);
}

static inline void node_clear_state(int node, enum node_states state)
{
__node_clear(node, &node_states[state]);
+ if (state == N_ONLINE)
+ nr_online_nodes = num_node_state(N_ONLINE);
}

static inline int num_node_state(enum node_states state)
@@ -407,7 +415,6 @@ static inline int num_node_state(enum node_states state)
#define first_online_node first_node(node_states[N_ONLINE])
#define next_online_node(nid) next_node((nid), node_states[N_ONLINE])

-extern int nr_node_ids;
#else

static inline int node_state(int node, enum node_states state)
@@ -434,6 +441,7 @@ static inline int num_node_state(enum node_states state)
#define first_online_node 0
#define next_online_node(nid) (MAX_NUMNODES)
#define nr_node_ids 1
+#define nr_online_nodes 1

#endif

@@ -449,7 +457,8 @@ static inline int num_node_state(enum node_states state)
node; \
})

-#define num_online_nodes() num_node_state(N_ONLINE)
+
+#define num_online_nodes() (nr_online_nodes)
#define num_possible_nodes() num_node_state(N_POSSIBLE)
#define node_online(node) node_state((node), N_ONLINE)
#define node_possible(node) node_state((node), N_POSSIBLE)
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 18f0b56..67ac93a 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -164,7 +164,9 @@ static unsigned long __meminitdata dma_reserve;

#if MAX_NUMNODES > 1
int nr_node_ids __read_mostly = MAX_NUMNODES;
+int nr_online_nodes __read_mostly;
EXPORT_SYMBOL(nr_node_ids);
+EXPORT_SYMBOL(nr_online_nodes);
#endif

int page_group_by_mobility_disabled __read_mostly;

--
Mel Gorman
Part-time Phd Student Linux Technology Center
University of Limerick IBM Dublin Software Lab

2009-03-19 22:07:42

by Mel Gorman

[permalink] [raw]
Subject: Re: [PATCH 20/35] Use a pre-calculated value for num_online_nodes()

On Thu, Mar 19, 2009 at 04:43:55PM -0400, Christoph Lameter wrote:
>
> It seems that it's beneficial to have a less expensive way to check for the
> number of currently active and possible nodes in a NUMA system. This will
> simplify further kernel optimizations for the cases in which only a single
> node is online or possible.
>
> Signed-off-by: Christoph Lameter <[email protected]>
>
> Index: linux-2.6/include/linux/nodemask.h
> ===================================================================
> --- linux-2.6.orig/include/linux/nodemask.h 2009-03-19 15:04:28.000000000 -0500
> +++ linux-2.6/include/linux/nodemask.h 2009-03-19 15:33:18.000000000 -0500
> @@ -408,6 +408,20 @@
> #define next_online_node(nid) next_node((nid), node_states[N_ONLINE])
>
> extern int nr_node_ids;
> +extern int nr_online_nodes;
> +extern int nr_possible_nodes;

Have you tested the nr_possible_nodes aspects of this patch? I can see
where it gets initialised but nothing that updates it. It would appear that
nr_possible_nodes() and num_possible_nodes() can return different values.

> +
> +static inline void node_set_online(int nid)
> +{
> + node_set_state(nid, N_ONLINE);
> + nr_online_nodes = num_node_state(N_ONLINE);
> +}
> +
> +static inline void node_set_offline(int nid)
> +{
> + node_clear_state(nid, N_ONLINE);
> + nr_online_nodes = num_node_state(N_ONLINE);
> +}
> #else
>
> static inline int node_state(int node, enum node_states state)
> @@ -434,7 +448,8 @@
> #define first_online_node 0
> #define next_online_node(nid) (MAX_NUMNODES)
> #define nr_node_ids 1
> -
> +#define nr_online_nodes 1
> +#define nr_possible_nodes 1
> #endif
>
> #define node_online_map node_states[N_ONLINE]
> @@ -454,8 +469,7 @@
> #define node_online(node) node_state((node), N_ONLINE)
> #define node_possible(node) node_state((node), N_POSSIBLE)
>
> -#define node_set_online(node) node_set_state((node), N_ONLINE)
> -#define node_set_offline(node) node_clear_state((node), N_ONLINE)
> +
>
> #define for_each_node(node) for_each_node_state(node, N_POSSIBLE)
> #define for_each_online_node(node) for_each_node_state(node, N_ONLINE)
> Index: linux-2.6/mm/hugetlb.c
> ===================================================================
> --- linux-2.6.orig/mm/hugetlb.c 2009-03-19 15:11:49.000000000 -0500
> +++ linux-2.6/mm/hugetlb.c 2009-03-19 15:12:13.000000000 -0500
> @@ -875,7 +875,7 @@
> * can no longer free unreserved surplus pages. This occurs when
> * the nodes with surplus pages have no free pages.
> */
> - unsigned long remaining_iterations = num_online_nodes();
> + unsigned long remaining_iterations = nr_online_nodes;
>
> /* Uncommit the reservation */
> h->resv_huge_pages -= unused_resv_pages;
> @@ -904,7 +904,7 @@
> h->surplus_huge_pages--;
> h->surplus_huge_pages_node[nid]--;
> nr_pages--;
> - remaining_iterations = num_online_nodes();
> + remaining_iterations = nr_online_nodes;
> }
> }
> }
> Index: linux-2.6/mm/page_alloc.c
> ===================================================================
> --- linux-2.6.orig/mm/page_alloc.c 2009-03-19 15:12:19.000000000 -0500
> +++ linux-2.6/mm/page_alloc.c 2009-03-19 15:30:21.000000000 -0500
> @@ -168,6 +168,10 @@
> #if MAX_NUMNODES > 1
> int nr_node_ids __read_mostly = MAX_NUMNODES;
> EXPORT_SYMBOL(nr_node_ids);
> +int nr_online_nodes __read_mostly = 1;
> +EXPORT_SYMBOL(nr_online_nodes);
> +int nr_possible_nodes __read_mostly = MAX_NUMNODES;
> +EXPORT_SYMBOL(nr_possible_nodes);
> #endif
>
> int page_group_by_mobility_disabled __read_mostly;
> @@ -2115,7 +2119,7 @@
> }
>
>
> -#define MAX_NODE_LOAD (num_online_nodes())
> +#define MAX_NODE_LOAD nr_online_nodes
> static int node_load[MAX_NUMNODES];
>
> /**
> Index: linux-2.6/mm/slab.c
> ===================================================================
> --- linux-2.6.orig/mm/slab.c 2009-03-19 15:13:45.000000000 -0500
> +++ linux-2.6/mm/slab.c 2009-03-19 15:15:28.000000000 -0500
> @@ -881,7 +881,6 @@
> */
>
> static int use_alien_caches __read_mostly = 1;
> -static int numa_platform __read_mostly = 1;
> static int __init noaliencache_setup(char *s)
> {
> use_alien_caches = 0;
> @@ -1434,9 +1433,8 @@
> int order;
> int node;
>
> - if (num_possible_nodes() == 1) {
> + if (nr_possible_nodes == 1) {
> use_alien_caches = 0;
> - numa_platform = 0;
> }
>
> for (i = 0; i < NUM_INIT_LISTS; i++) {
> @@ -3526,7 +3524,7 @@
> * variable to skip the call, which is mostly likely to be present in
> * the cache.
> */
> - if (numa_platform && cache_free_alien(cachep, objp))
> + if (nr_possible_nodes > 1 && cache_free_alien(cachep, objp))
> return;
>
> if (likely(ac->avail < ac->limit)) {
> Index: linux-2.6/mm/slub.c
> ===================================================================
> --- linux-2.6.orig/mm/slub.c 2009-03-19 15:13:15.000000000 -0500
> +++ linux-2.6/mm/slub.c 2009-03-19 15:13:38.000000000 -0500
> @@ -3648,7 +3648,7 @@
> to_cpumask(l->cpus));
> }
>
> - if (num_online_nodes() > 1 && !nodes_empty(l->nodes) &&
> + if (nr_online_nodes > 1 && !nodes_empty(l->nodes) &&
> len < PAGE_SIZE - 60) {
> len += sprintf(buf + len, " nodes=");
> len += nodelist_scnprintf(buf + len, PAGE_SIZE - len - 50,
> Index: linux-2.6/net/sunrpc/svc.c
> ===================================================================
> --- linux-2.6.orig/net/sunrpc/svc.c 2009-03-19 15:16:21.000000000 -0500
> +++ linux-2.6/net/sunrpc/svc.c 2009-03-19 15:16:51.000000000 -0500
> @@ -124,7 +124,7 @@
> {
> unsigned int node;
>
> - if (num_online_nodes() > 1) {
> + if (nr_online_nodes > 1) {
> /*
> * Actually have multiple NUMA nodes,
> * so split pools on NUMA node boundaries
>

--
Mel Gorman
Part-time Phd Student Linux Technology Center
University of Limerick IBM Dublin Software Lab

2009-03-19 22:21:20

by Mel Gorman

[permalink] [raw]
Subject: Re: [PATCH 20/35] Use a pre-calculated value for num_online_nodes()

On Thu, Mar 19, 2009 at 04:43:55PM -0400, Christoph Lameter wrote:
> Trying to do the same in the style of nr_node_ids etc.
>

Because of some issues with the patch and what it does for possible
nodes, I reworked the patch slightly into the following, which is what I'm
actually testing.

commit 58652d22798591d4df5421a90d65cbd7e60b9048
Author: Christoph Lameter <[email protected]>
Date: Thu Mar 19 21:34:17 2009 +0000

Use a pre-calculated value for num_online_nodes()

num_online_nodes() is called in a number of places but most often by the
page allocator when deciding whether the zonelist needs to be filtered based
on cpusets or the zonelist cache. This is actually a heavy function and
touches a number of cache lines. This patch stores the number of online
nodes at boot time and updates the value when nodes get onlined and offlined.

Signed-off-by: Christoph Lameter <[email protected]>
Signed-off-by: Mel Gorman <[email protected]>

diff --git a/include/linux/nodemask.h b/include/linux/nodemask.h
index 848025c..474e73e 100644
--- a/include/linux/nodemask.h
+++ b/include/linux/nodemask.h
@@ -408,6 +408,19 @@ static inline int num_node_state(enum node_states state)
#define next_online_node(nid) next_node((nid), node_states[N_ONLINE])

extern int nr_node_ids;
+extern int nr_online_nodes;
+
+static inline void node_set_online(int nid)
+{
+ node_set_state(nid, N_ONLINE);
+ nr_online_nodes = num_node_state(N_ONLINE);
+}
+
+static inline void node_set_offline(int nid)
+{
+ node_clear_state(nid, N_ONLINE);
+ nr_online_nodes = num_node_state(N_ONLINE);
+}
#else

static inline int node_state(int node, enum node_states state)
@@ -434,7 +447,7 @@ static inline int num_node_state(enum node_states state)
#define first_online_node 0
#define next_online_node(nid) (MAX_NUMNODES)
#define nr_node_ids 1
-
+#define nr_online_nodes 1
#endif

#define node_online_map node_states[N_ONLINE]
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 1e99997..210e28c 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -875,7 +875,7 @@ static void return_unused_surplus_pages(struct hstate *h,
* can no longer free unreserved surplus pages. This occurs when
* the nodes with surplus pages have no free pages.
*/
- unsigned long remaining_iterations = num_online_nodes();
+ unsigned long remaining_iterations = nr_online_nodes;

/* Uncommit the reservation */
h->resv_huge_pages -= unused_resv_pages;
@@ -904,7 +904,7 @@ static void return_unused_surplus_pages(struct hstate *h,
h->surplus_huge_pages--;
h->surplus_huge_pages_node[nid]--;
nr_pages--;
- remaining_iterations = num_online_nodes();
+ remaining_iterations = nr_online_nodes;
}
}
}
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 18f0b56..4131ff9 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -164,7 +164,9 @@ static unsigned long __meminitdata dma_reserve;

#if MAX_NUMNODES > 1
int nr_node_ids __read_mostly = MAX_NUMNODES;
+int nr_online_nodes __read_mostly = 1;
EXPORT_SYMBOL(nr_node_ids);
+EXPORT_SYMBOL(nr_online_nodes);
#endif

int page_group_by_mobility_disabled __read_mostly;
@@ -1445,7 +1447,7 @@ get_page_from_freelist(gfp_t gfp_mask, nodemask_t *nodemask, unsigned int order,
/* Determine in advance if the zonelist needs filtering */
if ((alloc_flags & ALLOC_CPUSET) && unlikely(number_of_cpusets > 1))
zonelist_filter = 1;
- if (num_online_nodes() > 1)
+ if (nr_online_nodes > 1)
zonelist_filter = 1;

zonelist_scan:
@@ -1486,7 +1488,7 @@ this_zone_full:
zlc_mark_zone_full(zonelist, z);
try_next_zone:
if (NUMA_BUILD && zonelist_filter) {
- if (!did_zlc_setup && num_online_nodes() > 1) {
+ if (!did_zlc_setup && nr_online_nodes > 1) {
/*
* do zlc_setup after the first zone is tried
* but only if there are multiple nodes to make
@@ -2285,7 +2287,7 @@ int numa_zonelist_order_handler(ctl_table *table, int write,
}


-#define MAX_NODE_LOAD (num_online_nodes())
+#define MAX_NODE_LOAD (nr_online_nodes)
static int node_load[MAX_NUMNODES];

/**
@@ -2494,7 +2496,7 @@ static void build_zonelists(pg_data_t *pgdat)

/* NUMA-aware ordering of nodes */
local_node = pgdat->node_id;
- load = num_online_nodes();
+ load = nr_online_nodes;
prev_node = local_node;
nodes_clear(used_mask);

@@ -2645,7 +2647,7 @@ void build_all_zonelists(void)

printk("Built %i zonelists in %s order, mobility grouping %s. "
"Total pages: %ld\n",
- num_online_nodes(),
+ nr_online_nodes,
zonelist_order_name[current_zonelist_order],
page_group_by_mobility_disabled ? "off" : "on",
vm_total_pages);
diff --git a/mm/slab.c b/mm/slab.c
index e7f1ded..e6157a0 100644
--- a/mm/slab.c
+++ b/mm/slab.c
@@ -3526,7 +3526,7 @@ static inline void __cache_free(struct kmem_cache *cachep, void *objp)
* variable to skip the call, which is mostly likely to be present in
* the cache.
*/
- if (numa_platform && cache_free_alien(cachep, objp))
+ if (numa_platform > 1 && cache_free_alien(cachep, objp))
return;

if (likely(ac->avail < ac->limit)) {
diff --git a/mm/slub.c b/mm/slub.c
index 0280eee..8ce6be8 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -3648,7 +3648,7 @@ static int list_locations(struct kmem_cache *s, char *buf,
to_cpumask(l->cpus));
}

- if (num_online_nodes() > 1 && !nodes_empty(l->nodes) &&
+ if (nr_online_nodes > 1 && !nodes_empty(l->nodes) &&
len < PAGE_SIZE - 60) {
len += sprintf(buf + len, " nodes=");
len += nodelist_scnprintf(buf + len, PAGE_SIZE - len - 50,
diff --git a/net/sunrpc/svc.c b/net/sunrpc/svc.c
index c51fed4..ba9eb8f 100644
--- a/net/sunrpc/svc.c
+++ b/net/sunrpc/svc.c
@@ -124,7 +124,7 @@ svc_pool_map_choose_mode(void)
{
unsigned int node;

- if (num_online_nodes() > 1) {
+ if (nr_online_nodes > 1) {
/*
* Actually have multiple NUMA nodes,
* so split pools on NUMA node boundaries

2009-03-19 22:26:16

by Christoph Lameter

[permalink] [raw]
Subject: Re: [PATCH 20/35] Use a pre-calculated value for num_online_nodes()

On Thu, 19 Mar 2009, Mel Gorman wrote:

> This patch actually alters the API. node_set_online() called when
> MAX_NUMNODES == 1 will now fail to compile. That situation wouldn't make
> any sense anyway but is it intentional?

Yes, MAX_NUMNODES == 1 means that this is not a NUMA configuration. Setting
a node online would make no sense. Node 0 is always online.

> For reference here is the patch I had for a similar goal which kept the
> API as it was. I'll drop it if you prefer your own version.

Lets look through it and get the best pieces from both.

> static inline void node_set_state(int node, enum node_states state)
> {
> __node_set(node, &node_states[state]);
> + if (state == N_ONLINE)
> + nr_online_nodes = num_node_state(N_ONLINE);
> }

That assumes uses of node_set_state N_ONLINE. Are there such users or are
all using node_set_online()?

> @@ -449,7 +457,8 @@ static inline int num_node_state(enum node_states state)
> node; \
> })
>
> -#define num_online_nodes() num_node_state(N_ONLINE)
> +
> +#define num_online_nodes() (nr_online_nodes)
> #define num_possible_nodes() num_node_state(N_POSSIBLE)
> #define node_online(node) node_state((node), N_ONLINE)
> #define node_possible(node) node_state((node), N_POSSIBLE)

Hmmmm... Yes we could get rid of those.

I'd also like to see nr_possible_nodes(). nr_possible_nodes is important
if you want to check if the system could ever bring up a second node
(which would make the current optimization not viable) whereas
nr_online_nodes is the check for how many nodes are currently online.

2009-03-19 22:34:13

by Mel Gorman

[permalink] [raw]
Subject: Re: [PATCH 20/35] Use a pre-calculated value for num_online_nodes()

On Thu, Mar 19, 2009 at 06:22:38PM -0400, Christoph Lameter wrote:
> On Thu, 19 Mar 2009, Mel Gorman wrote:
>
> > This patch actually alters the API. node_set_online() called when
> > MAX_NUMNODES == 1 will now fail to compile. That situation wouldn't make
> > any sense anyway but is it intentional?
>
> Yes, MAX_NUMNODES == 1 means that this is not a NUMA configuration. Setting
> a node online would make no sense. Node 0 is always online.
>

Right.

> > For reference here is the patch I had for a similar goal which kept the
> > API as it was. I'll drop it if you prefer your own version.
>
> Lets look through it and get the best pieces from both.
>

I posted an amalgamation. Sorry for the cross-over mails but I wanted to
get tests going before I fell asleep. They take a few hours to complete.

> > static inline void node_set_state(int node, enum node_states state)
> > {
> > __node_set(node, &node_states[state]);
> > + if (state == N_ONLINE)
> > + nr_online_nodes = num_node_state(N_ONLINE);
> > }
>
> That assumes uses of node_set_state N_ONLINE. Are there such users or are
> all using node_set_online()?
>

node_set_online() calls node_set_state(node, N_ONLINE) so it should have
worked out.

> > @@ -449,7 +457,8 @@ static inline int num_node_state(enum node_states state)
> > node; \
> > })
> >
> > -#define num_online_nodes() num_node_state(N_ONLINE)
> > +
> > +#define num_online_nodes() (nr_online_nodes)
> > #define num_possible_nodes() num_node_state(N_POSSIBLE)
> > #define node_online(node) node_state((node), N_ONLINE)
> > #define node_possible(node) node_state((node), N_POSSIBLE)
>
> Hmmmm... Yes we could get rid of those.
>
> I'd also like to see nr_possible_nodes(). nr_possible_nodes is important
> if you want to check if the system could ever bring up a second node
> (which would make the current optimization not viable) whereas
> nr_online_nodes is the check for how many nodes are currently online.
>

I redid your patch to drop the nr_possible_nodes() because I couldn't convince
myself it was correct in all cases and it isn't as important as avoiding
num_online_nodes() in fast paths.

--
Mel Gorman
Part-time Phd Student Linux Technology Center
University of Limerick IBM Dublin Software Lab

2009-03-19 22:42:17

by Christoph Lameter

[permalink] [raw]
Subject: Re: [PATCH 20/35] Use a pre-calculated value for num_online_nodes()

On Thu, 19 Mar 2009, Mel Gorman wrote:

> >
> > extern int nr_node_ids;
> > +extern int nr_online_nodes;
> > +extern int nr_possible_nodes;
>
> Have you tested the nr_possible_nodes aspects of this patch? I can see
> where it gets initialised but nothing that updates it. It would appear that
> nr_possible_nodes() and num_possible_nodes() can return different values.

Right now we bypass the helper functions.... The only places where the
possible map is modified are:

./arch/x86/mm/numa_64.c: node_set(0, node_possible_map);
./arch/x86/mm/k8topology_64.c: node_set(nodeid, node_possible_map);
./arch/x86/mm/srat_64.c: node_possible_map = nodes_parsed;

2009-03-19 22:53:13

by Mel Gorman

[permalink] [raw]
Subject: Re: [PATCH 20/35] Use a pre-calculated value for num_online_nodes()

On Thu, Mar 19, 2009 at 06:42:39PM -0400, Christoph Lameter wrote:
> On Thu, 19 Mar 2009, Mel Gorman wrote:
>
> > I posted an amalgamation. Sorry for the cross-over mails but I wanted to
> > get tests going before I fell asleep. They take a few hours to complete.
> >
> > > > static inline void node_set_state(int node, enum node_states state)
> > > > {
> > > > __node_set(node, &node_states[state]);
> > > > + if (state == N_ONLINE)
> > > > + nr_online_nodes = num_node_state(N_ONLINE);
> > > > }
> > >
> > > That assumes uses of node_set_state N_ONLINE. Are there such users or are
> > > all using node_set_online()?
> > >
> >
> > node_set_online() calls node_set_state(node, N_ONLINE) so it should have
> > worked out.
>
> But this adds a surprising side effect to all uses of node_set_state.
> Node_set_state is generating more code now.
>

Fair point.

> > > if you want to check if the system could ever bring up a second node
> > > (which would make the current optimization not viable) whereas
> > > nr_online_nodes is the check for how many nodes are currently online.
> > >
> >
> > I redid your patch to drop the nr_possible_nodes() because I couldn't convince
> > myself it was correct in all cases and it isn't as important as avoiding
> > num_online_nodes() in fast paths.
>
> I was more thinking about getting the infrastructure right so that we can
> avoid future hacks like the one in slab.
>

Which is fair enough and you're right that it's worth fixing. One horribly
large patchset and associated thread at a time though, so I'll be putting
it on the long finger rather than adding this to the pile right now :).

--
Mel Gorman
Part-time Phd Student Linux Technology Center
University of Limerick IBM Dublin Software Lab

2009-03-19 23:05:00

by Mel Gorman

[permalink] [raw]
Subject: Re: [PATCH 20/35] Use a pre-calculated value for num_online_nodes()

On Thu, Mar 19, 2009 at 06:24:02PM -0400, Christoph Lameter wrote:
> On Thu, 19 Mar 2009, Mel Gorman wrote:
>
> > On Thu, Mar 19, 2009 at 04:43:55PM -0400, Christoph Lameter wrote:
> > > Trying to do the same in the style of nr_node_ids etc.
> > >
> >
> > Because of some issues with the patch and what it does for possible
> > nodes, I reworked the patch slightly into the following, which is what I'm
> > actually testing.
>
> Ok. It also removes the slab bits etc.
>

Well ... yes.

One of the slab changes removed a variable called numa_platform. From your
patch, this appears to have some relation to nr_possible_nodes but it's not
obvious at all if that is true or not.

The second change replaced num_possible_nodes() with nr_possible_nodes()
but it wasn't clear this was equivalent because nr_possible_nodes()
doesn't get updated from the call sites affecting the "possible" bitmap.

Both of those changes belong in a different patch and need explanation.
The bits left alter just nr_online_nodes and use it where it's
important.

--
Mel Gorman
Part-time Phd Student Linux Technology Center
University of Limerick IBM Dublin Software Lab

2009-03-19 23:25:33

by Christoph Lameter

[permalink] [raw]
Subject: Re: [PATCH 20/35] Use a pre-calculated value for num_online_nodes()

On Thu, 19 Mar 2009, Mel Gorman wrote:

> On Thu, Mar 19, 2009 at 04:43:55PM -0400, Christoph Lameter wrote:
> > Trying to do the same in the style of nr_node_ids etc.
> >
>
> Because of some issues with the patch and what it does for possible
> nodes, I reworked the patch slightly into the following, which is what I'm
> actually testing.

Ok. It also removes the slab bits etc.

2009-03-19 23:55:09

by Christoph Lameter

[permalink] [raw]
Subject: Re: [PATCH 20/35] Use a pre-calculated value for num_online_nodes()

On Thu, 19 Mar 2009, Mel Gorman wrote:

> I posted an amalgamation. Sorry for the cross-over mails but I wanted to
> get tests going before I fell asleep. They take a few hours to complete.
>
> > > static inline void node_set_state(int node, enum node_states state)
> > > {
> > > __node_set(node, &node_states[state]);
> > > + if (state == N_ONLINE)
> > > + nr_online_nodes = num_node_state(N_ONLINE);
> > > }
> >
> > That assumes uses of node_set_state N_ONLINE. Are there such users or are
> > all using node_set_online()?
> >
>
> node_set_online() calls node_set_state(node, N_ONLINE) so it should have
> worked out.

But this adds a surprising side effect to all uses of node_set_state.
Node_set_state is generating more code now.

> > if you want to check if the system could ever bring up a second node
> > (which would make the current optimization not viable) whereas
> > nr_online_nodes is the check for how many nodes are currently online.
> >
>
> I redid your patch to drop the nr_possible_nodes() because I couldn't convince
> myself it was correct in all cases and it isn't as important as avoiding
> num_online_nodes() in fast paths.

I was more thinking about getting the infrastructure right so that we can
avoid future hacks like the one in slab.