2009-04-22 13:54:32

by Mel Gorman

Subject: [PATCH 00/22] Cleanup and optimise the page allocator V7

Here is V7 of the cleanup and optimisation of the page allocator, and it
should be ready for wider testing. Please consider merging it as Pass 1 of
making the page allocator faster. Other passes will occur later once this
one has had a bit of exercise. This patchset is based on mmotm-2009-04-17
and I've tested it successfully on a small number of machines.

The performance improvements cover a wide range depending on the exact
machine, but the results I've seen so far are approximately:

kernbench:   0% to 0.12% (elapsed time)
             0.49% to 3.20% (sys time)
aim9:        -4% to 30% (for page_test and brk_test)
tbench:      -1% to 4%
hackbench:   -2.5% to 3.45% (mostly within the noise though)
netperf-udp: -1.34% to 4.06% (varies between machines a bit)
netperf-tcp: -0.44% to 5.22% (varies between machines a bit)

I don't have sysbench figures at hand, but previously they were within the
-0.5% to 2% range.

On netperf, the client and server were bound to opposite-numbered CPUs to
maximise the problems with cache line bouncing of the struct pages, so I
expect different people to report different results for netperf depending
on their exact machine and how they ran the test (different machines, same
CPUs for client/server, shared cache but two threads for client/server,
different sockets for client/server, etc.).

I also measured the vmlinux sizes for a single x86-based config with
CONFIG_DEBUG_INFO enabled but not CONFIG_DEBUG_VM. The core of the .config
is based on the Debian Lenny kernel config so I expect it to be reasonably
typical.

text kernel
3355726 mmotm-20090417
3355718 0001-Replace-__alloc_pages_internal-with-__alloc_pages_.patch
3355622 0002-Do-not-sanity-check-order-in-the-fast-path.patch
3355574 0003-Do-not-check-NUMA-node-ID-when-the-caller-knows-the.patch
3355574 0004-Check-only-once-if-the-zonelist-is-suitable-for-the.patch
3355526 0005-Break-up-the-allocator-entry-point-into-fast-and-slo.patch
3355420 0006-Move-check-for-disabled-anti-fragmentation-out-of-fa.patch
3355452 0007-Calculate-the-preferred-zone-for-allocation-only-onc.patch
3355452 0008-Calculate-the-migratetype-for-allocation-only-once.patch
3355436 0009-Calculate-the-alloc_flags-for-allocation-only-once.patch
3355436 0010-Remove-a-branch-by-assuming-__GFP_HIGH-ALLOC_HIGH.patch
3355420 0011-Inline-__rmqueue_smallest.patch
3355420 0012-Inline-buffered_rmqueue.patch
3355420 0013-Inline-__rmqueue_fallback.patch
3355404 0014-Do-not-call-get_pageblock_migratetype-more-than-ne.patch
3355300 0015-Do-not-disable-interrupts-in-free_page_mlock.patch
3355300 0016-Do-not-setup-zonelist-cache-when-there-is-only-one-n.patch
3355188 0017-Do-not-check-for-compound-pages-during-the-page-allo.patch
3355161 0018-Use-allocation-flags-as-an-index-to-the-zone-waterma.patch
3355129 0019-Update-NR_FREE_PAGES-only-as-necessary.patch
3355129 0020-Get-the-pageblock-migratetype-without-disabling-inte.patch
3355129 0021-Use-a-pre-calculated-value-instead-of-num_online_nod.patch

Some patches were dropped in this revision because, while I believe they
improved performance, they also increased the text size, so they need to be
revisited in isolation to show whether they actually help and by how much.
Other than that, the biggest changes were cleaning up accidental functional
changes identified by KOSAKI Motohiro. Massive credit to him for a very
detailed review of V6, to Christoph Lameter who reviewed earlier versions
quite heavily, and to Pekka Enberg who went through V6 in quite a lot of
detail.

arch/ia64/hp/common/sba_iommu.c | 2
arch/ia64/kernel/mca.c | 3
arch/ia64/kernel/uncached.c | 3
arch/ia64/sn/pci/pci_dma.c | 3
arch/powerpc/platforms/cell/ras.c | 2
arch/x86/kvm/vmx.c | 2
drivers/misc/sgi-gru/grufile.c | 2
drivers/misc/sgi-xp/xpc_uv.c | 2
include/linux/gfp.h | 27 -
include/linux/mm.h | 1
include/linux/mmzone.h | 11
include/linux/nodemask.h | 15 -
kernel/profile.c | 8
mm/filemap.c | 2
mm/hugetlb.c | 8
mm/internal.h | 11
mm/mempolicy.c | 2
mm/migrate.c | 2
mm/page_alloc.c | 555 ++++++++++++++++++++++++--------------
mm/slab.c | 11
mm/slob.c | 4
mm/slub.c | 2
net/sunrpc/svc.c | 2
23 files changed, 424 insertions(+), 256 deletions(-)

Changes since V6
o Remove unintentional functional changes when splitting into fast and slow paths
o Drop patch 7 for zonelist filtering as it modified when zlc_setup() is called
for the wrong reasons. The patch that avoids calling it on non-NUMA machines,
which has the bulk of the saving, is still there; the cpusets saving is
relatively small
o Drop an unnecessary check for in_interrupt() in gfp_to_alloc_flags()
o Clarify comment on __GFP_HIGH == ALLOC_HIGH
o Redefine the watermark mask to be expressed in terms of ALLOC_MARK_NOWATERMARK
o Use BUILD_BUG_ON for checking __GFP_HIGH == ALLOC_HIGH
o Drop some patches that were not reducing text sizes as expected
o Remove numa_platform from slab

Changes since V5
o Rebase to mmotm-2009-04-17

Changes since V4
o Drop the more controversial patches for now and focus on the "obvious win"
material.
o Add reviewed-by notes
o Fix changelog entry to say __rmqueue_fallback instead of __rmqueue
o Add unlikely() for the clearMlocked check
o Change where PGFREE is accounted in free_hot_cold_page() to have symmetry
with __free_pages_ok()
o Convert num_online_nodes() to use a static value so that callers do
not have to be individually updated
o Rebase to mmotm-2009-03-13

Changes since V3
o Drop the more controversial patches for now and focus on the "obvious win"
material
o Add reviewed-by notes
o Fix changelog entry to say __rmqueue_fallback instead of __rmqueue
o Add unlikely() for the clearMlocked check
o Change where PGFREE is accounted in free_hot_cold_page() to have symmetry
with __free_pages_ok()

Changes since V2
o Remove branches by treating watermark flags as array indices
o Remove branch by assuming __GFP_HIGH == ALLOC_HIGH
o Do not check for compound on every page free
o Remove branch by always ensuring the migratetype is known on free
o Simplify buffered_rmqueue further
o Reintroduce improved version of batched bulk free of pcp pages
o Use allocation flags as an index to zone watermarks
o Work out __GFP_COLD only once
o Reduce the number of times zone stats are updated
o Do not dump reserve pages back into the allocator. Instead treat them
as MOVABLE so that MIGRATE_RESERVE gets used on the max-order-overlapped
boundaries without causing trouble
o Allow pages up to PAGE_ALLOC_COSTLY_ORDER to use the per-cpu allocator.
order-1 allocations are frequent enough in particular to justify this
o Rearrange inlining such that the hot-path is inlined but not in a way
that increases the text size of the page allocator
o Make the check for needing additional zonelist filtering due to NUMA
or cpusets as light as possible
o Do not destroy compound pages going to the PCP lists
o Delay the merging of buddies until a high-order allocation needs them
or anti-fragmentation is being forced to fallback

Changes since V1
o Remove the ifdef CONFIG_CPUSETS from inside get_page_from_freelist()
o Use non-lock bit operations for clearing the mlock flag
o Factor out alloc_flags calculation so it is only done once (Peter)
o Make gfp.h a bit prettier and clear-cut (Peter)
o Instead of deleting a debugging check, replace page_count() in the
free path with a version that does not check for compound pages (Nick)
o Drop the alteration for hot/cold page freeing until we know if it
helps or not


2009-04-22 13:53:49

by Mel Gorman

Subject: [PATCH 02/22] Do not sanity check order in the fast path

No user of the allocator API should be passing in an order >= MAX_ORDER
but we check for it on each and every allocation. Delete this check and
make it a VM_BUG_ON check further down the call path.
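
As a reminder of why the replacement check is essentially free in production
builds, VM_BUG_ON() only expands to a real test when CONFIG_DEBUG_VM is set.
A simplified sketch of the definition (not the verbatim mmdebug.h text):

	#ifdef CONFIG_DEBUG_VM
	#define VM_BUG_ON(cond)		BUG_ON(cond)
	#else
	#define VM_BUG_ON(cond)		do { } while (0)
	#endif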

Signed-off-by: Mel Gorman <[email protected]>
Reviewed-by: Christoph Lameter <[email protected]>
Reviewed-by: KOSAKI Motohiro <[email protected]>
Reviewed-by: Pekka Enberg <[email protected]>
---
include/linux/gfp.h | 6 ------
mm/page_alloc.c | 2 ++
2 files changed, 2 insertions(+), 6 deletions(-)

diff --git a/include/linux/gfp.h b/include/linux/gfp.h
index 556c840..760f6c0 100644
--- a/include/linux/gfp.h
+++ b/include/linux/gfp.h
@@ -182,9 +182,6 @@ __alloc_pages(gfp_t gfp_mask, unsigned int order,
static inline struct page *alloc_pages_node(int nid, gfp_t gfp_mask,
unsigned int order)
{
- if (unlikely(order >= MAX_ORDER))
- return NULL;
-
/* Unknown node is current node */
if (nid < 0)
nid = numa_node_id();
@@ -198,9 +195,6 @@ extern struct page *alloc_pages_current(gfp_t gfp_mask, unsigned order);
static inline struct page *
alloc_pages(gfp_t gfp_mask, unsigned int order)
{
- if (unlikely(order >= MAX_ORDER))
- return NULL;
-
return alloc_pages_current(gfp_mask, order);
}
extern struct page *alloc_page_vma(gfp_t gfp_mask,
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index dcc4f05..5028f40 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1405,6 +1405,8 @@ get_page_from_freelist(gfp_t gfp_mask, nodemask_t *nodemask, unsigned int order,

classzone_idx = zone_idx(preferred_zone);

+ VM_BUG_ON(order >= MAX_ORDER);
+
zonelist_scan:
/*
* Scan zonelist, looking for a zone with enough free.
--
1.5.6.5

2009-04-22 13:54:10

by Mel Gorman

Subject: [PATCH 01/22] Replace __alloc_pages_internal() with __alloc_pages_nodemask()

__alloc_pages_internal is the core page allocator function but
essentially it is an alias of __alloc_pages_nodemask. Naming a publicly
available and exported function "internal" is also a bit ugly. This
patch renames __alloc_pages_internal() to __alloc_pages_nodemask() and
deletes the old nodemask function.

Warning - This patch renames an exported symbol. No kernel driver is
affected, but external drivers calling __alloc_pages_internal() should
change the call to __alloc_pages_nodemask() without any alteration of
parameters.
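
For an out-of-tree caller, the conversion is a mechanical rename
(illustrative example, not taken from the patch):

	/* before */
	page = __alloc_pages_internal(gfp_mask, order, zonelist, nodemask);
	/* after */
	page = __alloc_pages_nodemask(gfp_mask, order, zonelist, nodemask);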

Signed-off-by: Mel Gorman <[email protected]>
Reviewed-by: Christoph Lameter <[email protected]>
Reviewed-by: KOSAKI Motohiro <[email protected]>
Reviewed-by: Pekka Enberg <[email protected]>
---
include/linux/gfp.h | 12 ++----------
mm/page_alloc.c | 4 ++--
2 files changed, 4 insertions(+), 12 deletions(-)

diff --git a/include/linux/gfp.h b/include/linux/gfp.h
index 0bbc15f..556c840 100644
--- a/include/linux/gfp.h
+++ b/include/linux/gfp.h
@@ -169,24 +169,16 @@ static inline void arch_alloc_page(struct page *page, int order) { }
#endif

struct page *
-__alloc_pages_internal(gfp_t gfp_mask, unsigned int order,
+__alloc_pages_nodemask(gfp_t gfp_mask, unsigned int order,
struct zonelist *zonelist, nodemask_t *nodemask);

static inline struct page *
__alloc_pages(gfp_t gfp_mask, unsigned int order,
struct zonelist *zonelist)
{
- return __alloc_pages_internal(gfp_mask, order, zonelist, NULL);
+ return __alloc_pages_nodemask(gfp_mask, order, zonelist, NULL);
}

-static inline struct page *
-__alloc_pages_nodemask(gfp_t gfp_mask, unsigned int order,
- struct zonelist *zonelist, nodemask_t *nodemask)
-{
- return __alloc_pages_internal(gfp_mask, order, zonelist, nodemask);
-}
-
-
static inline struct page *alloc_pages_node(int nid, gfp_t gfp_mask,
unsigned int order)
{
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index e4ea469..dcc4f05 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1462,7 +1462,7 @@ try_next_zone:
* This is the 'heart' of the zoned buddy allocator.
*/
struct page *
-__alloc_pages_internal(gfp_t gfp_mask, unsigned int order,
+__alloc_pages_nodemask(gfp_t gfp_mask, unsigned int order,
struct zonelist *zonelist, nodemask_t *nodemask)
{
const gfp_t wait = gfp_mask & __GFP_WAIT;
@@ -1671,7 +1671,7 @@ nopage:
got_pg:
return page;
}
-EXPORT_SYMBOL(__alloc_pages_internal);
+EXPORT_SYMBOL(__alloc_pages_nodemask);

/*
* Common helper functions.
--
1.5.6.5

2009-04-22 13:54:50

by Mel Gorman

Subject: [PATCH 04/22] Check only once if the zonelist is suitable for the allocation

It is possible with __GFP_THISNODE that no zones are suitable. This
patch makes sure the check is only made once.

Signed-off-by: Mel Gorman <[email protected]>
Reviewed-by: Christoph Lameter <[email protected]>
Reviewed-by: KOSAKI Motohiro <[email protected]>
Reviewed-by: Pekka Enberg <[email protected]>
---
mm/page_alloc.c | 6 +++---
1 files changed, 3 insertions(+), 3 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 5028f40..3bed856 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1486,9 +1486,8 @@ __alloc_pages_nodemask(gfp_t gfp_mask, unsigned int order,
if (should_fail_alloc_page(gfp_mask, order))
return NULL;

-restart:
- z = zonelist->_zonerefs; /* the list of zones suitable for gfp_mask */
-
+ /* the list of zones suitable for gfp_mask */
+ z = zonelist->_zonerefs;
if (unlikely(!z->zone)) {
/*
* Happens if we have an empty zonelist as a result of
@@ -1497,6 +1496,7 @@ restart:
return NULL;
}

+restart:
page = get_page_from_freelist(gfp_mask|__GFP_HARDWALL, nodemask, order,
zonelist, high_zoneidx, ALLOC_WMARK_LOW|ALLOC_CPUSET);
if (page)
--
1.5.6.5

2009-04-22 13:55:11

by Mel Gorman

Subject: [PATCH 03/22] Do not check NUMA node ID when the caller knows the node is valid

Callers of alloc_pages_node() can optionally specify -1 as a node to mean
"allocate from the current node". However, a number of the callers in fast
paths know for a fact their node is valid. To avoid a comparison and branch,
this patch adds alloc_pages_exact_node() that only checks the nid with
VM_BUG_ON(). Callers that know their node is valid are then converted.
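
As an illustration of the intended usage (hypothetical call site, not part
of the patch):

	/* Old interface: nid may be -1, so every call pays a compare and branch */
	page = alloc_pages_node(nid, GFP_KERNEL, order);

	/* New interface: the caller guarantees 0 <= nid < MAX_NUMNODES, leaving
	 * only a VM_BUG_ON() that disappears without CONFIG_DEBUG_VM */
	page = alloc_pages_exact_node(nid, GFP_KERNEL, order);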

Signed-off-by: Mel Gorman <[email protected]>
Reviewed-by: Christoph Lameter <[email protected]>
Reviewed-by: KOSAKI Motohiro <[email protected]>
Reviewed-by: Pekka Enberg <[email protected]>

(for the SLOB NUMA bits)
Acked-by: Paul Mundt <[email protected]>
---
arch/ia64/hp/common/sba_iommu.c | 2 +-
arch/ia64/kernel/mca.c | 3 +--
arch/ia64/kernel/uncached.c | 3 ++-
arch/ia64/sn/pci/pci_dma.c | 3 ++-
arch/powerpc/platforms/cell/ras.c | 2 +-
arch/x86/kvm/vmx.c | 2 +-
drivers/misc/sgi-gru/grufile.c | 2 +-
drivers/misc/sgi-xp/xpc_uv.c | 2 +-
include/linux/gfp.h | 9 +++++++++
include/linux/mm.h | 1 -
kernel/profile.c | 8 ++++----
mm/filemap.c | 2 +-
mm/hugetlb.c | 4 ++--
mm/mempolicy.c | 2 +-
mm/migrate.c | 2 +-
mm/slab.c | 4 ++--
mm/slob.c | 4 ++--
17 files changed, 32 insertions(+), 23 deletions(-)

diff --git a/arch/ia64/hp/common/sba_iommu.c b/arch/ia64/hp/common/sba_iommu.c
index 56ceb68..fe63b2d 100644
--- a/arch/ia64/hp/common/sba_iommu.c
+++ b/arch/ia64/hp/common/sba_iommu.c
@@ -1131,7 +1131,7 @@ sba_alloc_coherent (struct device *dev, size_t size, dma_addr_t *dma_handle, gfp
#ifdef CONFIG_NUMA
{
struct page *page;
- page = alloc_pages_node(ioc->node == MAX_NUMNODES ?
+ page = alloc_pages_exact_node(ioc->node == MAX_NUMNODES ?
numa_node_id() : ioc->node, flags,
get_order(size));

diff --git a/arch/ia64/kernel/mca.c b/arch/ia64/kernel/mca.c
index 8f33a88..5b17bd4 100644
--- a/arch/ia64/kernel/mca.c
+++ b/arch/ia64/kernel/mca.c
@@ -1829,8 +1829,7 @@ ia64_mca_cpu_init(void *cpu_data)
data = mca_bootmem();
first_time = 0;
} else
- data = page_address(alloc_pages_node(numa_node_id(),
- GFP_KERNEL, get_order(sz)));
+ data = __get_free_pages(GFP_KERNEL, get_order(sz));
if (!data)
panic("Could not allocate MCA memory for cpu %d\n",
cpu);
diff --git a/arch/ia64/kernel/uncached.c b/arch/ia64/kernel/uncached.c
index 8eff8c1..6ba72ab 100644
--- a/arch/ia64/kernel/uncached.c
+++ b/arch/ia64/kernel/uncached.c
@@ -98,7 +98,8 @@ static int uncached_add_chunk(struct uncached_pool *uc_pool, int nid)

/* attempt to allocate a granule's worth of cached memory pages */

- page = alloc_pages_node(nid, GFP_KERNEL | __GFP_ZERO | GFP_THISNODE,
+ page = alloc_pages_exact_node(nid,
+ GFP_KERNEL | __GFP_ZERO | GFP_THISNODE,
IA64_GRANULE_SHIFT-PAGE_SHIFT);
if (!page) {
mutex_unlock(&uc_pool->add_chunk_mutex);
diff --git a/arch/ia64/sn/pci/pci_dma.c b/arch/ia64/sn/pci/pci_dma.c
index d876423..98b6849 100644
--- a/arch/ia64/sn/pci/pci_dma.c
+++ b/arch/ia64/sn/pci/pci_dma.c
@@ -90,7 +90,8 @@ static void *sn_dma_alloc_coherent(struct device *dev, size_t size,
*/
node = pcibus_to_node(pdev->bus);
if (likely(node >=0)) {
- struct page *p = alloc_pages_node(node, flags, get_order(size));
+ struct page *p = alloc_pages_exact_node(node,
+ flags, get_order(size));

if (likely(p))
cpuaddr = page_address(p);
diff --git a/arch/powerpc/platforms/cell/ras.c b/arch/powerpc/platforms/cell/ras.c
index 5f961c4..16ba671 100644
--- a/arch/powerpc/platforms/cell/ras.c
+++ b/arch/powerpc/platforms/cell/ras.c
@@ -122,7 +122,7 @@ static int __init cbe_ptcal_enable_on_node(int nid, int order)

area->nid = nid;
area->order = order;
- area->pages = alloc_pages_node(area->nid, GFP_KERNEL, area->order);
+ area->pages = alloc_pages_exact_node(area->nid, GFP_KERNEL, area->order);

if (!area->pages)
goto out_free_area;
diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index c6997c0..eaa149f 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -1258,7 +1258,7 @@ static struct vmcs *alloc_vmcs_cpu(int cpu)
struct page *pages;
struct vmcs *vmcs;

- pages = alloc_pages_node(node, GFP_KERNEL, vmcs_config.order);
+ pages = alloc_pages_exact_node(node, GFP_KERNEL, vmcs_config.order);
if (!pages)
return NULL;
vmcs = page_address(pages);
diff --git a/drivers/misc/sgi-gru/grufile.c b/drivers/misc/sgi-gru/grufile.c
index 9ba90f3..796ac70 100644
--- a/drivers/misc/sgi-gru/grufile.c
+++ b/drivers/misc/sgi-gru/grufile.c
@@ -306,7 +306,7 @@ static int gru_init_tables(unsigned long gru_base_paddr, void *gru_base_vaddr)
pnode = uv_node_to_pnode(nid);
if (bid < 0 || gru_base[bid])
continue;
- page = alloc_pages_node(nid, GFP_KERNEL, order);
+ page = alloc_pages_exact_node(nid, GFP_KERNEL, order);
if (!page)
goto fail;
gru_base[bid] = page_address(page);
diff --git a/drivers/misc/sgi-xp/xpc_uv.c b/drivers/misc/sgi-xp/xpc_uv.c
index 9172fcd..c76677a 100644
--- a/drivers/misc/sgi-xp/xpc_uv.c
+++ b/drivers/misc/sgi-xp/xpc_uv.c
@@ -232,7 +232,7 @@ xpc_create_gru_mq_uv(unsigned int mq_size, int cpu, char *irq_name,
mq->mmr_blade = uv_cpu_to_blade_id(cpu);

nid = cpu_to_node(cpu);
- page = alloc_pages_node(nid, GFP_KERNEL | __GFP_ZERO | GFP_THISNODE,
+ page = alloc_pages_exact_node(nid, GFP_KERNEL | __GFP_ZERO | GFP_THISNODE,
pg_order);
if (page == NULL) {
dev_err(xpc_part, "xpc_create_gru_mq_uv() failed to alloc %d "
diff --git a/include/linux/gfp.h b/include/linux/gfp.h
index 760f6c0..c7429b8 100644
--- a/include/linux/gfp.h
+++ b/include/linux/gfp.h
@@ -5,6 +5,7 @@
#include <linux/stddef.h>
#include <linux/linkage.h>
#include <linux/topology.h>
+#include <linux/mmdebug.h>

struct vm_area_struct;

@@ -189,6 +190,14 @@ static inline struct page *alloc_pages_node(int nid, gfp_t gfp_mask,
return __alloc_pages(gfp_mask, order, node_zonelist(nid, gfp_mask));
}

+static inline struct page *alloc_pages_exact_node(int nid, gfp_t gfp_mask,
+ unsigned int order)
+{
+ VM_BUG_ON(nid < 0 || nid >= MAX_NUMNODES);
+
+ return __alloc_pages(gfp_mask, order, node_zonelist(nid, gfp_mask));
+}
+
#ifdef CONFIG_NUMA
extern struct page *alloc_pages_current(gfp_t gfp_mask, unsigned order);

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 9c916e4..2d86bc8 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -7,7 +7,6 @@

#include <linux/gfp.h>
#include <linux/list.h>
-#include <linux/mmdebug.h>
#include <linux/mmzone.h>
#include <linux/rbtree.h>
#include <linux/prio_tree.h>
diff --git a/kernel/profile.c b/kernel/profile.c
index 7724e04..62e08db 100644
--- a/kernel/profile.c
+++ b/kernel/profile.c
@@ -371,7 +371,7 @@ static int __cpuinit profile_cpu_callback(struct notifier_block *info,
node = cpu_to_node(cpu);
per_cpu(cpu_profile_flip, cpu) = 0;
if (!per_cpu(cpu_profile_hits, cpu)[1]) {
- page = alloc_pages_node(node,
+ page = alloc_pages_exact_node(node,
GFP_KERNEL | __GFP_ZERO,
0);
if (!page)
@@ -379,7 +379,7 @@ static int __cpuinit profile_cpu_callback(struct notifier_block *info,
per_cpu(cpu_profile_hits, cpu)[1] = page_address(page);
}
if (!per_cpu(cpu_profile_hits, cpu)[0]) {
- page = alloc_pages_node(node,
+ page = alloc_pages_exact_node(node,
GFP_KERNEL | __GFP_ZERO,
0);
if (!page)
@@ -570,14 +570,14 @@ static int create_hash_tables(void)
int node = cpu_to_node(cpu);
struct page *page;

- page = alloc_pages_node(node,
+ page = alloc_pages_exact_node(node,
GFP_KERNEL | __GFP_ZERO | GFP_THISNODE,
0);
if (!page)
goto out_cleanup;
per_cpu(cpu_profile_hits, cpu)[1]
= (struct profile_hit *)page_address(page);
- page = alloc_pages_node(node,
+ page = alloc_pages_exact_node(node,
GFP_KERNEL | __GFP_ZERO | GFP_THISNODE,
0);
if (!page)
diff --git a/mm/filemap.c b/mm/filemap.c
index 5256582..0ad1eb1 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -521,7 +521,7 @@ struct page *__page_cache_alloc(gfp_t gfp)
{
if (cpuset_do_page_mem_spread()) {
int n = cpuset_mem_spread_node();
- return alloc_pages_node(n, gfp, 0);
+ return alloc_pages_exact_node(n, gfp, 0);
}
return alloc_pages(gfp, 0);
}
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 28c655b..1234486 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -630,7 +630,7 @@ static struct page *alloc_fresh_huge_page_node(struct hstate *h, int nid)
if (h->order >= MAX_ORDER)
return NULL;

- page = alloc_pages_node(nid,
+ page = alloc_pages_exact_node(nid,
htlb_alloc_mask|__GFP_COMP|__GFP_THISNODE|
__GFP_REPEAT|__GFP_NOWARN,
huge_page_order(h));
@@ -649,7 +649,7 @@ static struct page *alloc_fresh_huge_page_node(struct hstate *h, int nid)
* Use a helper variable to find the next node and then
* copy it back to hugetlb_next_nid afterwards:
* otherwise there's a window in which a racer might
- * pass invalid nid MAX_NUMNODES to alloc_pages_node.
+ * pass invalid nid MAX_NUMNODES to alloc_pages_exact_node.
* But we don't need to use a spin_lock here: it really
* doesn't matter if occasionally a racer chooses the
* same nid as we do. Move nid forward in the mask even
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index 8a5d2b8..c32bc15 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -796,7 +796,7 @@ static void migrate_page_add(struct page *page, struct list_head *pagelist,

static struct page *new_node_page(struct page *page, unsigned long node, int **x)
{
- return alloc_pages_node(node, GFP_HIGHUSER_MOVABLE, 0);
+ return alloc_pages_exact_node(node, GFP_HIGHUSER_MOVABLE, 0);
}

/*
diff --git a/mm/migrate.c b/mm/migrate.c
index 068655d..5a24923 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -802,7 +802,7 @@ static struct page *new_page_node(struct page *p, unsigned long private,

*result = &pm->status;

- return alloc_pages_node(pm->node,
+ return alloc_pages_exact_node(pm->node,
GFP_HIGHUSER_MOVABLE | GFP_THISNODE, 0);
}

diff --git a/mm/slab.c b/mm/slab.c
index 3da2640..1c680e8 100644
--- a/mm/slab.c
+++ b/mm/slab.c
@@ -1699,7 +1699,7 @@ static void *kmem_getpages(struct kmem_cache *cachep, gfp_t flags, int nodeid)
if (cachep->flags & SLAB_RECLAIM_ACCOUNT)
flags |= __GFP_RECLAIMABLE;

- page = alloc_pages_node(nodeid, flags, cachep->gfporder);
+ page = alloc_pages_exact_node(nodeid, flags, cachep->gfporder);
if (!page)
return NULL;

@@ -3254,7 +3254,7 @@ retry:
if (local_flags & __GFP_WAIT)
local_irq_enable();
kmem_flagcheck(cache, flags);
- obj = kmem_getpages(cache, local_flags, -1);
+ obj = kmem_getpages(cache, local_flags, numa_node_id());
if (local_flags & __GFP_WAIT)
local_irq_disable();
if (obj) {
diff --git a/mm/slob.c b/mm/slob.c
index 3e7acbc..ffeb218 100644
--- a/mm/slob.c
+++ b/mm/slob.c
@@ -46,7 +46,7 @@
* NUMA support in SLOB is fairly simplistic, pushing most of the real
* logic down to the page allocator, and simply doing the node accounting
* on the upper levels. In the event that a node id is explicitly
- * provided, alloc_pages_node() with the specified node id is used
+ * provided, alloc_pages_exact_node() with the specified node id is used
* instead. The common case (or when the node id isn't explicitly provided)
* will default to the current node, as per numa_node_id().
*
@@ -243,7 +243,7 @@ static void *slob_new_pages(gfp_t gfp, int order, int node)

#ifdef CONFIG_NUMA
if (node != -1)
- page = alloc_pages_node(node, gfp, order);
+ page = alloc_pages_exact_node(node, gfp, order);
else
#endif
page = alloc_pages(gfp, order);
--
1.5.6.5

2009-04-22 13:55:56

by Mel Gorman

Subject: [PATCH 05/22] Break up the allocator entry point into fast and slow paths

The core of the page allocator is one giant function which allocates memory
on the stack and makes calculations that may not be needed for every
allocation. This patch breaks up the allocator path into fast and slow paths
for clarity. Note that the slow paths are still inlined, but the entry to
them is marked unlikely. If they were not inlined, generating them out of
line would actually increase text size because there is only one call site.
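
The resulting shape of the entry point is roughly the following (condensed
sketch of the code added below, not a verbatim copy):

	struct page *
	__alloc_pages_nodemask(gfp_t gfp_mask, unsigned int order,
			struct zonelist *zonelist, nodemask_t *nodemask)
	{
		enum zone_type high_zoneidx = gfp_zone(gfp_mask);
		struct page *page;

		/* Fast path: a single watermark-constrained attempt */
		page = get_page_from_freelist(gfp_mask|__GFP_HARDWALL, nodemask,
				order, zonelist, high_zoneidx,
				ALLOC_WMARK_LOW|ALLOC_CPUSET);
		if (unlikely(!page))
			/* Slow path: kswapd wakeup, reclaim, OOM and retries */
			page = __alloc_pages_slowpath(gfp_mask, order,
					zonelist, high_zoneidx, nodemask);

		return page;
	}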

Signed-off-by: Mel Gorman <[email protected]>
Reviewed-by: Christoph Lameter <[email protected]>
---
mm/page_alloc.c | 353 +++++++++++++++++++++++++++++++++++--------------------
1 files changed, 228 insertions(+), 125 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 3bed856..1e09f26 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1460,47 +1460,171 @@ try_next_zone:
return page;
}

-/*
- * This is the 'heart' of the zoned buddy allocator.
- */
-struct page *
-__alloc_pages_nodemask(gfp_t gfp_mask, unsigned int order,
- struct zonelist *zonelist, nodemask_t *nodemask)
+static inline int
+should_alloc_retry(gfp_t gfp_mask, unsigned int order,
+ unsigned long pages_reclaimed)
{
- const gfp_t wait = gfp_mask & __GFP_WAIT;
- enum zone_type high_zoneidx = gfp_zone(gfp_mask);
- struct zoneref *z;
- struct zone *zone;
- struct page *page;
- struct reclaim_state reclaim_state;
- struct task_struct *p = current;
- int do_retry;
- int alloc_flags;
- unsigned long did_some_progress;
- unsigned long pages_reclaimed = 0;
+ /* Do not loop if specifically requested */
+ if (gfp_mask & __GFP_NORETRY)
+ return 0;

- lockdep_trace_alloc(gfp_mask);
+ /*
+ * In this implementation, order <= PAGE_ALLOC_COSTLY_ORDER
+ * means __GFP_NOFAIL, but that may not be true in other
+ * implementations.
+ */
+ if (order <= PAGE_ALLOC_COSTLY_ORDER)
+ return 1;

- might_sleep_if(wait);
+ /*
+ * For order > PAGE_ALLOC_COSTLY_ORDER, if __GFP_REPEAT is
+ * specified, then we retry until we no longer reclaim any pages
+ * (above), or we've reclaimed an order of pages at least as
+ * large as the allocation's order. In both cases, if the
+ * allocation still fails, we stop retrying.
+ */
+ if (gfp_mask & __GFP_REPEAT && pages_reclaimed < (1 << order))
+ return 1;

- if (should_fail_alloc_page(gfp_mask, order))
- return NULL;
+ /*
+ * Don't let big-order allocations loop unless the caller
+ * explicitly requests that.
+ */
+ if (gfp_mask & __GFP_NOFAIL)
+ return 1;

- /* the list of zones suitable for gfp_mask */
- z = zonelist->_zonerefs;
- if (unlikely(!z->zone)) {
- /*
- * Happens if we have an empty zonelist as a result of
- * GFP_THISNODE being used on a memoryless node
- */
+ return 0;
+}
+
+static inline struct page *
+__alloc_pages_may_oom(gfp_t gfp_mask, unsigned int order,
+ struct zonelist *zonelist, enum zone_type high_zoneidx,
+ nodemask_t *nodemask)
+{
+ struct page *page;
+
+ /* Acquire the OOM killer lock for the zones in zonelist */
+ if (!try_set_zone_oom(zonelist, gfp_mask)) {
+ schedule_timeout_uninterruptible(1);
return NULL;
}

-restart:
- page = get_page_from_freelist(gfp_mask|__GFP_HARDWALL, nodemask, order,
- zonelist, high_zoneidx, ALLOC_WMARK_LOW|ALLOC_CPUSET);
+ /*
+ * Go through the zonelist yet one more time, keep very high watermark
+ * here, this is only to catch a parallel oom killing, we must fail if
+ * we're still under heavy pressure.
+ */
+ page = get_page_from_freelist(gfp_mask|__GFP_HARDWALL, nodemask,
+ order, zonelist, high_zoneidx,
+ ALLOC_WMARK_HIGH|ALLOC_CPUSET);
if (page)
- goto got_pg;
+ goto out;
+
+ /* The OOM killer will not help higher order allocs */
+ if (order > PAGE_ALLOC_COSTLY_ORDER)
+ goto out;
+
+ /* Exhausted what can be done so it's blamo time */
+ out_of_memory(zonelist, gfp_mask, order);
+
+out:
+ clear_zonelist_oom(zonelist, gfp_mask);
+ return page;
+}
+
+/* The really slow allocator path where we enter direct reclaim */
+static inline struct page *
+__alloc_pages_direct_reclaim(gfp_t gfp_mask, unsigned int order,
+ struct zonelist *zonelist, enum zone_type high_zoneidx,
+ nodemask_t *nodemask, int alloc_flags, unsigned long *did_some_progress)
+{
+ struct page *page = NULL;
+ struct reclaim_state reclaim_state;
+ struct task_struct *p = current;
+
+ cond_resched();
+
+ /* We now go into synchronous reclaim */
+ cpuset_memory_pressure_bump();
+
+ /*
+ * The task's cpuset might have expanded its set of allowable nodes
+ */
+ p->flags |= PF_MEMALLOC;
+ lockdep_set_current_reclaim_state(gfp_mask);
+ reclaim_state.reclaimed_slab = 0;
+ p->reclaim_state = &reclaim_state;
+
+ *did_some_progress = try_to_free_pages(zonelist, order, gfp_mask, nodemask);
+
+ p->reclaim_state = NULL;
+ lockdep_clear_current_reclaim_state();
+ p->flags &= ~PF_MEMALLOC;
+
+ cond_resched();
+
+ if (order != 0)
+ drain_all_pages();
+
+ if (likely(*did_some_progress))
+ page = get_page_from_freelist(gfp_mask, nodemask, order,
+ zonelist, high_zoneidx, alloc_flags);
+ return page;
+}
+
+static inline int
+is_allocation_high_priority(struct task_struct *p, gfp_t gfp_mask)
+{
+ if (((p->flags & PF_MEMALLOC) || unlikely(test_thread_flag(TIF_MEMDIE)))
+ && !in_interrupt())
+ return 1;
+ return 0;
+}
+
+/*
+ * This is called in the allocator slow-path if the allocation request is of
+ * sufficient urgency to ignore watermarks and take other desperate measures
+ */
+static inline struct page *
+__alloc_pages_high_priority(gfp_t gfp_mask, unsigned int order,
+ struct zonelist *zonelist, enum zone_type high_zoneidx,
+ nodemask_t *nodemask)
+{
+ struct page *page;
+
+ do {
+ page = get_page_from_freelist(gfp_mask, nodemask, order,
+ zonelist, high_zoneidx, ALLOC_NO_WATERMARKS);
+
+ if (!page && gfp_mask & __GFP_NOFAIL)
+ congestion_wait(WRITE, HZ/50);
+ } while (!page && (gfp_mask & __GFP_NOFAIL));
+
+ return page;
+}
+
+static inline
+void wake_all_kswapd(unsigned int order, struct zonelist *zonelist,
+ enum zone_type high_zoneidx)
+{
+ struct zoneref *z;
+ struct zone *zone;
+
+ for_each_zone_zonelist(zone, z, zonelist, high_zoneidx)
+ wakeup_kswapd(zone, order);
+}
+
+static inline struct page *
+__alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
+ struct zonelist *zonelist, enum zone_type high_zoneidx,
+ nodemask_t *nodemask)
+{
+ const gfp_t wait = gfp_mask & __GFP_WAIT;
+ struct page *page = NULL;
+ int alloc_flags;
+ unsigned long pages_reclaimed = 0;
+ unsigned long did_some_progress;
+ struct task_struct *p = current;

/*
* GFP_THISNODE (meaning __GFP_THISNODE, __GFP_NORETRY and
@@ -1513,8 +1637,7 @@ restart:
if (NUMA_BUILD && (gfp_mask & GFP_THISNODE) == GFP_THISNODE)
goto nopage;

- for_each_zone_zonelist(zone, z, zonelist, high_zoneidx)
- wakeup_kswapd(zone, order);
+ wake_all_kswapd(order, zonelist, high_zoneidx);

/*
* OK, we're below the kswapd watermark and have kicked background
@@ -1534,6 +1657,7 @@ restart:
if (wait)
alloc_flags |= ALLOC_CPUSET;

+restart:
/*
* Go through the zonelist again. Let __GFP_HIGH and allocations
* coming from realtime tasks go deeper into reserves.
@@ -1547,23 +1671,18 @@ restart:
if (page)
goto got_pg;

- /* This allocation should allow future memory freeing. */
-
rebalance:
- if (((p->flags & PF_MEMALLOC) || unlikely(test_thread_flag(TIF_MEMDIE)))
- && !in_interrupt()) {
+ /* Allocate without watermarks if the context allows */
+ if (is_allocation_high_priority(p, gfp_mask)) {
+ /* Do not dip into emergency reserves if specified */
if (!(gfp_mask & __GFP_NOMEMALLOC)) {
-nofail_alloc:
- /* go through the zonelist yet again, ignoring mins */
- page = get_page_from_freelist(gfp_mask, nodemask, order,
- zonelist, high_zoneidx, ALLOC_NO_WATERMARKS);
+ page = __alloc_pages_high_priority(gfp_mask, order,
+ zonelist, high_zoneidx, nodemask);
if (page)
goto got_pg;
- if (gfp_mask & __GFP_NOFAIL) {
- congestion_wait(WRITE, HZ/50);
- goto nofail_alloc;
- }
}
+
+ /* Ensure no recursion into the allocator */
goto nopage;
}

@@ -1571,93 +1690,42 @@ nofail_alloc:
if (!wait)
goto nopage;

- cond_resched();
-
- /* We now go into synchronous reclaim */
- cpuset_memory_pressure_bump();
-
- p->flags |= PF_MEMALLOC;
-
- lockdep_set_current_reclaim_state(gfp_mask);
- reclaim_state.reclaimed_slab = 0;
- p->reclaim_state = &reclaim_state;
-
- did_some_progress = try_to_free_pages(zonelist, order,
- gfp_mask, nodemask);
-
- p->reclaim_state = NULL;
- lockdep_clear_current_reclaim_state();
- p->flags &= ~PF_MEMALLOC;
+ /* Try direct reclaim and then allocating */
+ page = __alloc_pages_direct_reclaim(gfp_mask, order,
+ zonelist, high_zoneidx,
+ nodemask,
+ alloc_flags, &did_some_progress);
+ if (page)
+ goto got_pg;

- cond_resched();
+ /*
+ * If we failed to make any progress reclaiming, then we are
+ * running out of options and have to consider going OOM
+ */
+ if (!did_some_progress) {
+ if ((gfp_mask & __GFP_FS) && !(gfp_mask & __GFP_NORETRY)) {
+ page = __alloc_pages_may_oom(gfp_mask, order,
+ zonelist, high_zoneidx,
+ nodemask);
+ if (page)
+ goto got_pg;

- if (order != 0)
- drain_all_pages();
+ /*
+ * The OOM killer does not trigger for high-order allocations
+ * but if no progress is being made, there are no other
+ * options and retrying is unlikely to help
+ */
+ if (order > PAGE_ALLOC_COSTLY_ORDER)
+ goto nopage;

- if (likely(did_some_progress)) {
- page = get_page_from_freelist(gfp_mask, nodemask, order,
- zonelist, high_zoneidx, alloc_flags);
- if (page)
- goto got_pg;
- } else if ((gfp_mask & __GFP_FS) && !(gfp_mask & __GFP_NORETRY)) {
- if (!try_set_zone_oom(zonelist, gfp_mask)) {
- schedule_timeout_uninterruptible(1);
goto restart;
}
-
- /*
- * Go through the zonelist yet one more time, keep
- * very high watermark here, this is only to catch
- * a parallel oom killing, we must fail if we're still
- * under heavy pressure.
- */
- page = get_page_from_freelist(gfp_mask|__GFP_HARDWALL, nodemask,
- order, zonelist, high_zoneidx,
- ALLOC_WMARK_HIGH|ALLOC_CPUSET);
- if (page) {
- clear_zonelist_oom(zonelist, gfp_mask);
- goto got_pg;
- }
-
- /* The OOM killer will not help higher order allocs so fail */
- if (order > PAGE_ALLOC_COSTLY_ORDER) {
- clear_zonelist_oom(zonelist, gfp_mask);
- goto nopage;
- }
-
- out_of_memory(zonelist, gfp_mask, order);
- clear_zonelist_oom(zonelist, gfp_mask);
- goto restart;
}

- /*
- * Don't let big-order allocations loop unless the caller explicitly
- * requests that. Wait for some write requests to complete then retry.
- *
- * In this implementation, order <= PAGE_ALLOC_COSTLY_ORDER
- * means __GFP_NOFAIL, but that may not be true in other
- * implementations.
- *
- * For order > PAGE_ALLOC_COSTLY_ORDER, if __GFP_REPEAT is
- * specified, then we retry until we no longer reclaim any pages
- * (above), or we've reclaimed an order of pages at least as
- * large as the allocation's order. In both cases, if the
- * allocation still fails, we stop retrying.
- */
+ /* Check if we should retry the allocation */
pages_reclaimed += did_some_progress;
- do_retry = 0;
- if (!(gfp_mask & __GFP_NORETRY)) {
- if (order <= PAGE_ALLOC_COSTLY_ORDER) {
- do_retry = 1;
- } else {
- if (gfp_mask & __GFP_REPEAT &&
- pages_reclaimed < (1 << order))
- do_retry = 1;
- }
- if (gfp_mask & __GFP_NOFAIL)
- do_retry = 1;
- }
- if (do_retry) {
+ if (should_alloc_retry(gfp_mask, order, pages_reclaimed)) {
+ /* Wait for some write requests to complete then retry */
congestion_wait(WRITE, HZ/50);
goto rebalance;
}
@@ -1672,6 +1740,41 @@ nopage:
}
got_pg:
return page;
+
+}
+
+/*
+ * This is the 'heart' of the zoned buddy allocator.
+ */
+struct page *
+__alloc_pages_nodemask(gfp_t gfp_mask, unsigned int order,
+ struct zonelist *zonelist, nodemask_t *nodemask)
+{
+ enum zone_type high_zoneidx = gfp_zone(gfp_mask);
+ struct page *page;
+
+ lockdep_trace_alloc(gfp_mask);
+
+ might_sleep_if(gfp_mask & __GFP_WAIT);
+
+ if (should_fail_alloc_page(gfp_mask, order))
+ return NULL;
+
+ /*
+ * Check the zones suitable for the gfp_mask contain at least one
+ * valid zone. It's possible to have an empty zonelist as a result
+ * of GFP_THISNODE and a memoryless node
+ */
+ if (unlikely(!zonelist->_zonerefs->zone))
+ return NULL;
+
+ page = get_page_from_freelist(gfp_mask|__GFP_HARDWALL, nodemask, order,
+ zonelist, high_zoneidx, ALLOC_WMARK_LOW|ALLOC_CPUSET);
+ if (unlikely(!page))
+ page = __alloc_pages_slowpath(gfp_mask, order,
+ zonelist, high_zoneidx, nodemask);
+
+ return page;
}
EXPORT_SYMBOL(__alloc_pages_nodemask);

--
1.5.6.5

2009-04-22 13:55:39

by Mel Gorman

Subject: [PATCH 06/22] Move check for disabled anti-fragmentation out of fastpath

On low-memory systems, anti-fragmentation gets disabled as there is nothing
it can do and it would just incur overhead shuffling pages between lists
constantly. Currently the check is made in the free page fast path for every
page. This patch moves it to a slow path. On machines with low memory,
there will be a small amount of additional overhead as pages get shuffled
between lists, but it should quickly settle.

Signed-off-by: Mel Gorman <[email protected]>
Reviewed-by: Christoph Lameter <[email protected]>
Reviewed-by: KOSAKI Motohiro <[email protected]>
---
include/linux/mmzone.h | 3 ---
mm/page_alloc.c | 4 ++++
2 files changed, 4 insertions(+), 3 deletions(-)

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 186ec6a..f82bdba 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -50,9 +50,6 @@ extern int page_group_by_mobility_disabled;

static inline int get_pageblock_migratetype(struct page *page)
{
- if (unlikely(page_group_by_mobility_disabled))
- return MIGRATE_UNMOVABLE;
-
return get_pageblock_flags_group(page, PB_migrate, PB_migrate_end);
}

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 1e09f26..b1ae435 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -172,6 +172,10 @@ int page_group_by_mobility_disabled __read_mostly;

static void set_pageblock_migratetype(struct page *page, int migratetype)
{
+
+ if (unlikely(page_group_by_mobility_disabled))
+ migratetype = MIGRATE_UNMOVABLE;
+
set_pageblock_flags_group(page, (unsigned long)migratetype,
PB_migrate, PB_migrate_end);
}
--
1.5.6.5

2009-04-22 13:56:21

by Mel Gorman

Subject: [PATCH 07/22] Calculate the preferred zone for allocation only once

get_page_from_freelist() can be called multiple times for an allocation.
Part of this work calculates preferred_zone, the first usable zone in the
zonelist, and that zone depends only on the GFP flags specified at the
beginning of the allocation call. This patch calculates preferred_zone once.
It's safe to do this because if preferred_zone is NULL at the start of the
call, no amount of direct reclaim or other action will change the fact that
the allocation will fail.

Signed-off-by: Mel Gorman <[email protected]>
Reviewed-by: KOSAKI Motohiro <[email protected]>
Reviewed-by: Pekka Enberg <[email protected]>
---
mm/page_alloc.c | 53 ++++++++++++++++++++++++++++++++---------------------
1 files changed, 32 insertions(+), 21 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index b1ae435..e073fa3 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1392,23 +1392,18 @@ static void zlc_mark_zone_full(struct zonelist *zonelist, struct zoneref *z)
*/
static struct page *
get_page_from_freelist(gfp_t gfp_mask, nodemask_t *nodemask, unsigned int order,
- struct zonelist *zonelist, int high_zoneidx, int alloc_flags)
+ struct zonelist *zonelist, int high_zoneidx, int alloc_flags,
+ struct zone *preferred_zone)
{
struct zoneref *z;
struct page *page = NULL;
int classzone_idx;
- struct zone *zone, *preferred_zone;
+ struct zone *zone;
nodemask_t *allowednodes = NULL;/* zonelist_cache approximation */
int zlc_active = 0; /* set if using zonelist_cache */
int did_zlc_setup = 0; /* just call zlc_setup() one time */

- (void)first_zones_zonelist(zonelist, high_zoneidx, nodemask,
- &preferred_zone);
- if (!preferred_zone)
- return NULL;
-
classzone_idx = zone_idx(preferred_zone);
-
VM_BUG_ON(order >= MAX_ORDER);

zonelist_scan:
@@ -1503,7 +1498,7 @@ should_alloc_retry(gfp_t gfp_mask, unsigned int order,
static inline struct page *
__alloc_pages_may_oom(gfp_t gfp_mask, unsigned int order,
struct zonelist *zonelist, enum zone_type high_zoneidx,
- nodemask_t *nodemask)
+ nodemask_t *nodemask, struct zone *preferred_zone)
{
struct page *page;

@@ -1520,7 +1515,8 @@ __alloc_pages_may_oom(gfp_t gfp_mask, unsigned int order,
*/
page = get_page_from_freelist(gfp_mask|__GFP_HARDWALL, nodemask,
order, zonelist, high_zoneidx,
- ALLOC_WMARK_HIGH|ALLOC_CPUSET);
+ ALLOC_WMARK_HIGH|ALLOC_CPUSET,
+ preferred_zone);
if (page)
goto out;

@@ -1540,7 +1536,8 @@ out:
static inline struct page *
__alloc_pages_direct_reclaim(gfp_t gfp_mask, unsigned int order,
struct zonelist *zonelist, enum zone_type high_zoneidx,
- nodemask_t *nodemask, int alloc_flags, unsigned long *did_some_progress)
+ nodemask_t *nodemask, int alloc_flags, struct zone *preferred_zone,
+ unsigned long *did_some_progress)
{
struct page *page = NULL;
struct reclaim_state reclaim_state;
@@ -1572,7 +1569,8 @@ __alloc_pages_direct_reclaim(gfp_t gfp_mask, unsigned int order,

if (likely(*did_some_progress))
page = get_page_from_freelist(gfp_mask, nodemask, order,
- zonelist, high_zoneidx, alloc_flags);
+ zonelist, high_zoneidx,
+ alloc_flags, preferred_zone);
return page;
}

@@ -1592,13 +1590,14 @@ is_allocation_high_priority(struct task_struct *p, gfp_t gfp_mask)
static inline struct page *
__alloc_pages_high_priority(gfp_t gfp_mask, unsigned int order,
struct zonelist *zonelist, enum zone_type high_zoneidx,
- nodemask_t *nodemask)
+ nodemask_t *nodemask, struct zone *preferred_zone)
{
struct page *page;

do {
page = get_page_from_freelist(gfp_mask, nodemask, order,
- zonelist, high_zoneidx, ALLOC_NO_WATERMARKS);
+ zonelist, high_zoneidx, ALLOC_NO_WATERMARKS,
+ preferred_zone);

if (!page && gfp_mask & __GFP_NOFAIL)
congestion_wait(WRITE, HZ/50);
@@ -1621,7 +1620,7 @@ void wake_all_kswapd(unsigned int order, struct zonelist *zonelist,
static inline struct page *
__alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
struct zonelist *zonelist, enum zone_type high_zoneidx,
- nodemask_t *nodemask)
+ nodemask_t *nodemask, struct zone *preferred_zone)
{
const gfp_t wait = gfp_mask & __GFP_WAIT;
struct page *page = NULL;
@@ -1671,7 +1670,8 @@ restart:
* See also cpuset_zone_allowed() comment in kernel/cpuset.c.
*/
page = get_page_from_freelist(gfp_mask, nodemask, order, zonelist,
- high_zoneidx, alloc_flags);
+ high_zoneidx, alloc_flags,
+ preferred_zone);
if (page)
goto got_pg;

@@ -1681,7 +1681,7 @@ rebalance:
/* Do not dip into emergency reserves if specified */
if (!(gfp_mask & __GFP_NOMEMALLOC)) {
page = __alloc_pages_high_priority(gfp_mask, order,
- zonelist, high_zoneidx, nodemask);
+ zonelist, high_zoneidx, nodemask, preferred_zone);
if (page)
goto got_pg;
}
@@ -1698,7 +1698,8 @@ rebalance:
page = __alloc_pages_direct_reclaim(gfp_mask, order,
zonelist, high_zoneidx,
nodemask,
- alloc_flags, &did_some_progress);
+ alloc_flags, preferred_zone,
+ &did_some_progress);
if (page)
goto got_pg;

@@ -1710,7 +1711,7 @@ rebalance:
if ((gfp_mask & __GFP_FS) && !(gfp_mask & __GFP_NORETRY)) {
page = __alloc_pages_may_oom(gfp_mask, order,
zonelist, high_zoneidx,
- nodemask);
+ nodemask, preferred_zone);
if (page)
goto got_pg;

@@ -1755,6 +1756,7 @@ __alloc_pages_nodemask(gfp_t gfp_mask, unsigned int order,
struct zonelist *zonelist, nodemask_t *nodemask)
{
enum zone_type high_zoneidx = gfp_zone(gfp_mask);
+ struct zone *preferred_zone;
struct page *page;

lockdep_trace_alloc(gfp_mask);
@@ -1772,11 +1774,20 @@ __alloc_pages_nodemask(gfp_t gfp_mask, unsigned int order,
if (unlikely(!zonelist->_zonerefs->zone))
return NULL;

+ /* The preferred zone is used for statistics later */
+ (void)first_zones_zonelist(zonelist, high_zoneidx, nodemask,
+ &preferred_zone);
+ if (!preferred_zone)
+ return NULL;
+
+ /* First allocation attempt */
page = get_page_from_freelist(gfp_mask|__GFP_HARDWALL, nodemask, order,
- zonelist, high_zoneidx, ALLOC_WMARK_LOW|ALLOC_CPUSET);
+ zonelist, high_zoneidx, ALLOC_WMARK_LOW|ALLOC_CPUSET,
+ preferred_zone);
if (unlikely(!page))
page = __alloc_pages_slowpath(gfp_mask, order,
- zonelist, high_zoneidx, nodemask);
+ zonelist, high_zoneidx, nodemask,
+ preferred_zone);

return page;
}
--
1.5.6.5

2009-04-22 13:57:22

by Mel Gorman

Subject: [PATCH 09/22] Calculate the alloc_flags for allocation only once

Factor out the mapping between the GFP flags and alloc_flags into its own
function. Once factored out, it only needs to be calculated once per
allocation, but some care must be taken.

[[email protected] says]
As the test:

- if (((p->flags & PF_MEMALLOC) || unlikely(test_thread_flag(TIF_MEMDIE)))
- && !in_interrupt()) {
- if (!(gfp_mask & __GFP_NOMEMALLOC)) {

has been replaced with a slightly weaker one:

+ if (alloc_flags & ALLOC_NO_WATERMARKS) {

Without care, this would allow recursion into the allocator via direct
reclaim. This patch ensures we do not recurse when PF_MEMALLOC is set
but TF_MEMDIE callers are now allowed to directly reclaim where they
would have been prevented in the past.

From: Peter Zijlstra <[email protected]>
Signed-off-by: Peter Zijlstra <[email protected]>
Acked-by: Pekka Enberg <[email protected]>
---
mm/page_alloc.c | 94 +++++++++++++++++++++++++++++-------------------------
1 files changed, 50 insertions(+), 44 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index eb1548c..0d23795 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1577,15 +1577,6 @@ __alloc_pages_direct_reclaim(gfp_t gfp_mask, unsigned int order,
return page;
}

-static inline int
-is_allocation_high_priority(struct task_struct *p, gfp_t gfp_mask)
-{
- if (((p->flags & PF_MEMALLOC) || unlikely(test_thread_flag(TIF_MEMDIE)))
- && !in_interrupt())
- return 1;
- return 0;
-}
-
/*
* This is called in the allocator slow-path if the allocation request is of
* sufficient urgency to ignore watermarks and take other desperate measures
@@ -1621,6 +1612,42 @@ void wake_all_kswapd(unsigned int order, struct zonelist *zonelist,
wakeup_kswapd(zone, order);
}

+static inline int
+gfp_to_alloc_flags(gfp_t gfp_mask)
+{
+ struct task_struct *p = current;
+ int alloc_flags = ALLOC_WMARK_MIN | ALLOC_CPUSET;
+ const gfp_t wait = gfp_mask & __GFP_WAIT;
+
+ /*
+ * The caller may dip into page reserves a bit more if the caller
+ * cannot run direct reclaim, or if the caller has realtime scheduling
+ * policy or is asking for __GFP_HIGH memory. GFP_ATOMIC requests will
+ * set both ALLOC_HARDER (!wait) and ALLOC_HIGH (__GFP_HIGH).
+ */
+ if (gfp_mask & __GFP_HIGH)
+ alloc_flags |= ALLOC_HIGH;
+
+ if (!wait) {
+ alloc_flags |= ALLOC_HARDER;
+ /*
+ * Ignore cpuset if GFP_ATOMIC (!wait) rather than fail alloc.
+ * See also cpuset_zone_allowed() comment in kernel/cpuset.c.
+ */
+ alloc_flags &= ~ALLOC_CPUSET;
+ } else if (unlikely(rt_task(p)))
+ alloc_flags |= ALLOC_HARDER;
+
+ if (likely(!(gfp_mask & __GFP_NOMEMALLOC))) {
+ if (!in_interrupt() &&
+ ((p->flags & PF_MEMALLOC) ||
+ unlikely(test_thread_flag(TIF_MEMDIE))))
+ alloc_flags |= ALLOC_NO_WATERMARKS;
+ }
+
+ return alloc_flags;
+}
+
static inline struct page *
__alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
struct zonelist *zonelist, enum zone_type high_zoneidx,
@@ -1651,56 +1678,35 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
* OK, we're below the kswapd watermark and have kicked background
* reclaim. Now things get more complex, so set up alloc_flags according
* to how we want to proceed.
- *
- * The caller may dip into page reserves a bit more if the caller
- * cannot run direct reclaim, or if the caller has realtime scheduling
- * policy or is asking for __GFP_HIGH memory. GFP_ATOMIC requests will
- * set both ALLOC_HARDER (!wait) and ALLOC_HIGH (__GFP_HIGH).
*/
- alloc_flags = ALLOC_WMARK_MIN;
- if ((unlikely(rt_task(p)) && !in_interrupt()) || !wait)
- alloc_flags |= ALLOC_HARDER;
- if (gfp_mask & __GFP_HIGH)
- alloc_flags |= ALLOC_HIGH;
- if (wait)
- alloc_flags |= ALLOC_CPUSET;
+ alloc_flags = gfp_to_alloc_flags(gfp_mask);

restart:
- /*
- * Go through the zonelist again. Let __GFP_HIGH and allocations
- * coming from realtime tasks go deeper into reserves.
- *
- * This is the last chance, in general, before the goto nopage.
- * Ignore cpuset if GFP_ATOMIC (!wait) rather than fail alloc.
- * See also cpuset_zone_allowed() comment in kernel/cpuset.c.
- */
+ /* This is the last chance, in general, before the goto nopage. */
page = get_page_from_freelist(gfp_mask, nodemask, order, zonelist,
- high_zoneidx, alloc_flags,
- preferred_zone,
- migratetype);
+ high_zoneidx, alloc_flags & ~ALLOC_NO_WATERMARKS,
+ preferred_zone, migratetype);
if (page)
goto got_pg;

rebalance:
/* Allocate without watermarks if the context allows */
- if (is_allocation_high_priority(p, gfp_mask)) {
- /* Do not dip into emergency reserves if specified */
- if (!(gfp_mask & __GFP_NOMEMALLOC)) {
- page = __alloc_pages_high_priority(gfp_mask, order,
- zonelist, high_zoneidx, nodemask, preferred_zone,
- migratetype);
- if (page)
- goto got_pg;
- }
-
- /* Ensure no recursion into the allocator */
- goto nopage;
+ if (alloc_flags & ALLOC_NO_WATERMARKS) {
+ page = __alloc_pages_high_priority(gfp_mask, order,
+ zonelist, high_zoneidx, nodemask,
+ preferred_zone, migratetype);
+ if (page)
+ goto got_pg;
}

/* Atomic allocations - we can't balance anything */
if (!wait)
goto nopage;

+ /* Avoid recursion of direct reclaim */
+ if (p->flags & PF_MEMALLOC)
+ goto nopage;
+
/* Try direct reclaim and then allocating */
page = __alloc_pages_direct_reclaim(gfp_mask, order,
zonelist, high_zoneidx,
--
1.5.6.5

2009-04-22 13:56:55

by Mel Gorman

Subject: [PATCH 10/22] Remove a branch by assuming __GFP_HIGH == ALLOC_HIGH

Allocations that specify __GFP_HIGH get the ALLOC_HIGH flag. If these
flags are equal to each other, we can eliminate a branch.
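
In other words, the conditional

	if (gfp_mask & __GFP_HIGH)
		alloc_flags |= ALLOC_HIGH;

becomes a plain bitwise OR, with BUILD_BUG_ON() catching at compile time any
future change that breaks the equality assumption (a summary of the change
below, not additional code):

	BUILD_BUG_ON(__GFP_HIGH != ALLOC_HIGH);
	alloc_flags |= (gfp_mask & __GFP_HIGH);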

[[email protected]: Suggested the hack]
Signed-off-by: Mel Gorman <[email protected]>
Reviewed-by: Pekka Enberg <[email protected]>
Reviewed-by: KOSAKI Motohiro <[email protected]>
---
mm/page_alloc.c | 6 ++++--
1 files changed, 4 insertions(+), 2 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 0d23795..4f9cdaa 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1619,14 +1619,16 @@ gfp_to_alloc_flags(gfp_t gfp_mask)
int alloc_flags = ALLOC_WMARK_MIN | ALLOC_CPUSET;
const gfp_t wait = gfp_mask & __GFP_WAIT;

+ /* __GFP_HIGH is assumed to be the same as ALLOC_HIGH to save a branch. */
+ BUILD_BUG_ON(__GFP_HIGH != ALLOC_HIGH);
+
/*
* The caller may dip into page reserves a bit more if the caller
* cannot run direct reclaim, or if the caller has realtime scheduling
* policy or is asking for __GFP_HIGH memory. GFP_ATOMIC requests will
* set both ALLOC_HARDER (!wait) and ALLOC_HIGH (__GFP_HIGH).
*/
- if (gfp_mask & __GFP_HIGH)
- alloc_flags |= ALLOC_HIGH;
+ alloc_flags |= (gfp_mask & __GFP_HIGH);

if (!wait) {
alloc_flags |= ALLOC_HARDER;
--
1.5.6.5

2009-04-22 13:56:38

by Mel Gorman

Subject: [PATCH 08/22] Calculate the migratetype for allocation only once

The GFP mask is converted into a migratetype when deciding which pagelist to
take a page from. However, this conversion happens multiple times per
allocation, at least once per zone traversed. Calculate it once.

Signed-off-by: Mel Gorman <[email protected]>
---
mm/page_alloc.c | 43 ++++++++++++++++++++++++++-----------------
1 files changed, 26 insertions(+), 17 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index e073fa3..eb1548c 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1065,13 +1065,13 @@ void split_page(struct page *page, unsigned int order)
* or two.
*/
static struct page *buffered_rmqueue(struct zone *preferred_zone,
- struct zone *zone, int order, gfp_t gfp_flags)
+ struct zone *zone, int order, gfp_t gfp_flags,
+ int migratetype)
{
unsigned long flags;
struct page *page;
int cold = !!(gfp_flags & __GFP_COLD);
int cpu;
- int migratetype = allocflags_to_migratetype(gfp_flags);

again:
cpu = get_cpu();
@@ -1393,7 +1393,7 @@ static void zlc_mark_zone_full(struct zonelist *zonelist, struct zoneref *z)
static struct page *
get_page_from_freelist(gfp_t gfp_mask, nodemask_t *nodemask, unsigned int order,
struct zonelist *zonelist, int high_zoneidx, int alloc_flags,
- struct zone *preferred_zone)
+ struct zone *preferred_zone, int migratetype)
{
struct zoneref *z;
struct page *page = NULL;
@@ -1436,7 +1436,8 @@ zonelist_scan:
}
}

- page = buffered_rmqueue(preferred_zone, zone, order, gfp_mask);
+ page = buffered_rmqueue(preferred_zone, zone, order,
+ gfp_mask, migratetype);
if (page)
break;
this_zone_full:
@@ -1498,7 +1499,8 @@ should_alloc_retry(gfp_t gfp_mask, unsigned int order,
static inline struct page *
__alloc_pages_may_oom(gfp_t gfp_mask, unsigned int order,
struct zonelist *zonelist, enum zone_type high_zoneidx,
- nodemask_t *nodemask, struct zone *preferred_zone)
+ nodemask_t *nodemask, struct zone *preferred_zone,
+ int migratetype)
{
struct page *page;

@@ -1516,7 +1518,7 @@ __alloc_pages_may_oom(gfp_t gfp_mask, unsigned int order,
page = get_page_from_freelist(gfp_mask|__GFP_HARDWALL, nodemask,
order, zonelist, high_zoneidx,
ALLOC_WMARK_HIGH|ALLOC_CPUSET,
- preferred_zone);
+ preferred_zone, migratetype);
if (page)
goto out;

@@ -1537,7 +1539,7 @@ static inline struct page *
__alloc_pages_direct_reclaim(gfp_t gfp_mask, unsigned int order,
struct zonelist *zonelist, enum zone_type high_zoneidx,
nodemask_t *nodemask, int alloc_flags, struct zone *preferred_zone,
- unsigned long *did_some_progress)
+ int migratetype, unsigned long *did_some_progress)
{
struct page *page = NULL;
struct reclaim_state reclaim_state;
@@ -1570,7 +1572,8 @@ __alloc_pages_direct_reclaim(gfp_t gfp_mask, unsigned int order,
if (likely(*did_some_progress))
page = get_page_from_freelist(gfp_mask, nodemask, order,
zonelist, high_zoneidx,
- alloc_flags, preferred_zone);
+ alloc_flags, preferred_zone,
+ migratetype);
return page;
}

@@ -1590,14 +1593,15 @@ is_allocation_high_priority(struct task_struct *p, gfp_t gfp_mask)
static inline struct page *
__alloc_pages_high_priority(gfp_t gfp_mask, unsigned int order,
struct zonelist *zonelist, enum zone_type high_zoneidx,
- nodemask_t *nodemask, struct zone *preferred_zone)
+ nodemask_t *nodemask, struct zone *preferred_zone,
+ int migratetype)
{
struct page *page;

do {
page = get_page_from_freelist(gfp_mask, nodemask, order,
zonelist, high_zoneidx, ALLOC_NO_WATERMARKS,
- preferred_zone);
+ preferred_zone, migratetype);

if (!page && gfp_mask & __GFP_NOFAIL)
congestion_wait(WRITE, HZ/50);
@@ -1620,7 +1624,8 @@ void wake_all_kswapd(unsigned int order, struct zonelist *zonelist,
static inline struct page *
__alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
struct zonelist *zonelist, enum zone_type high_zoneidx,
- nodemask_t *nodemask, struct zone *preferred_zone)
+ nodemask_t *nodemask, struct zone *preferred_zone,
+ int migratetype)
{
const gfp_t wait = gfp_mask & __GFP_WAIT;
struct page *page = NULL;
@@ -1671,7 +1676,8 @@ restart:
*/
page = get_page_from_freelist(gfp_mask, nodemask, order, zonelist,
high_zoneidx, alloc_flags,
- preferred_zone);
+ preferred_zone,
+ migratetype);
if (page)
goto got_pg;

@@ -1681,7 +1687,8 @@ rebalance:
/* Do not dip into emergency reserves if specified */
if (!(gfp_mask & __GFP_NOMEMALLOC)) {
page = __alloc_pages_high_priority(gfp_mask, order,
- zonelist, high_zoneidx, nodemask, preferred_zone);
+ zonelist, high_zoneidx, nodemask, preferred_zone,
+ migratetype);
if (page)
goto got_pg;
}
@@ -1699,7 +1706,7 @@ rebalance:
zonelist, high_zoneidx,
nodemask,
alloc_flags, preferred_zone,
- &did_some_progress);
+ migratetype, &did_some_progress);
if (page)
goto got_pg;

@@ -1711,7 +1718,8 @@ rebalance:
if ((gfp_mask & __GFP_FS) && !(gfp_mask & __GFP_NORETRY)) {
page = __alloc_pages_may_oom(gfp_mask, order,
zonelist, high_zoneidx,
- nodemask, preferred_zone);
+ nodemask, preferred_zone,
+ migratetype);
if (page)
goto got_pg;

@@ -1758,6 +1766,7 @@ __alloc_pages_nodemask(gfp_t gfp_mask, unsigned int order,
enum zone_type high_zoneidx = gfp_zone(gfp_mask);
struct zone *preferred_zone;
struct page *page;
+ int migratetype = allocflags_to_migratetype(gfp_mask);

lockdep_trace_alloc(gfp_mask);

@@ -1783,11 +1792,11 @@ __alloc_pages_nodemask(gfp_t gfp_mask, unsigned int order,
/* First allocation attempt */
page = get_page_from_freelist(gfp_mask|__GFP_HARDWALL, nodemask, order,
zonelist, high_zoneidx, ALLOC_WMARK_LOW|ALLOC_CPUSET,
- preferred_zone);
+ preferred_zone, migratetype);
if (unlikely(!page))
page = __alloc_pages_slowpath(gfp_mask, order,
zonelist, high_zoneidx, nodemask,
- preferred_zone);
+ preferred_zone, migratetype);

return page;
}
--
1.5.6.5

2009-04-22 13:57:40

by Mel Gorman

[permalink] [raw]
Subject: [PATCH 11/22] Inline __rmqueue_smallest()

Inline __rmqueue_smallest() by altering the flow very slightly so that there
is only one call site. With a single call site, the function can be inlined
without causing text bloat. On an x86-based config, this patch reduces text
by 16 bytes.

Signed-off-by: Mel Gorman <[email protected]>
Reviewed-by: Christoph Lameter <[email protected]>
Reviewed-by: KOSAKI Motohiro <[email protected]>
---
mm/page_alloc.c | 20 ++++++++++++++++----
1 files changed, 16 insertions(+), 4 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 4f9cdaa..8bfced9 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -665,7 +665,8 @@ static int prep_new_page(struct page *page, int order, gfp_t gfp_flags)
* Go through the free lists for the given migratetype and remove
* the smallest available page from the freelists
*/
-static struct page *__rmqueue_smallest(struct zone *zone, unsigned int order,
+static inline
+struct page *__rmqueue_smallest(struct zone *zone, unsigned int order,
int migratetype)
{
unsigned int current_order;
@@ -835,8 +836,7 @@ static struct page *__rmqueue_fallback(struct zone *zone, int order,
}
}

- /* Use MIGRATE_RESERVE rather than fail an allocation */
- return __rmqueue_smallest(zone, order, MIGRATE_RESERVE);
+ return NULL;
}

/*
@@ -848,11 +848,23 @@ static struct page *__rmqueue(struct zone *zone, unsigned int order,
{
struct page *page;

+retry_reserve:
page = __rmqueue_smallest(zone, order, migratetype);

- if (unlikely(!page))
+ if (unlikely(!page) && migratetype != MIGRATE_RESERVE) {
page = __rmqueue_fallback(zone, order, migratetype);

+ /*
+ * Use MIGRATE_RESERVE rather than fail an allocation. goto
+ * is used because __rmqueue_smallest is an inline function
+ * and we want just one call site
+ */
+ if (!page) {
+ migratetype = MIGRATE_RESERVE;
+ goto retry_reserve;
+ }
+ }
+
return page;
}

--
1.5.6.5

2009-04-22 13:57:57

by Mel Gorman

[permalink] [raw]
Subject: [PATCH 12/22] Inline buffered_rmqueue()

buffered_rmqueue() is in the fast path so inline it. Because it has only one
call site, it can be inlined without causing text bloat. On an x86-based
config, it made no difference as the savings were padded out by NOP
instructions. Mileage varies, but text will either decrease in size or
remain static.

Signed-off-by: Mel Gorman <[email protected]>
Reviewed-by: KOSAKI Motohiro <[email protected]>
---
mm/page_alloc.c | 3 ++-
1 files changed, 2 insertions(+), 1 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 8bfced9..cb57382 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1076,7 +1076,8 @@ void split_page(struct page *page, unsigned int order)
* we cheat by calling it from here, in the order > 0 path. Saves a branch
* or two.
*/
-static struct page *buffered_rmqueue(struct zone *preferred_zone,
+static inline
+struct page *buffered_rmqueue(struct zone *preferred_zone,
struct zone *zone, int order, gfp_t gfp_flags,
int migratetype)
{
--
1.5.6.5

2009-04-22 13:58:20

by Mel Gorman

[permalink] [raw]
Subject: [PATCH 13/22] Inline __rmqueue_fallback()

__rmqueue_fallback() is in the slow path but has only one call site, so it
can be inlined without causing text bloat. On an x86-based config, it made
no difference as the savings were padded out by NOP instructions. Mileage
varies, but text will either decrease in size or remain static.

Signed-off-by: Mel Gorman <[email protected]>
---
mm/page_alloc.c | 4 ++--
1 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index cb57382..e0e9e67 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -775,8 +775,8 @@ static int move_freepages_block(struct zone *zone, struct page *page,
}

/* Remove an element from the buddy allocator from the fallback list */
-static struct page *__rmqueue_fallback(struct zone *zone, int order,
- int start_migratetype)
+static inline struct page *
+__rmqueue_fallback(struct zone *zone, int order, int start_migratetype)
{
struct free_area * area;
int current_order;
--
1.5.6.5

2009-04-22 13:58:41

by Mel Gorman

[permalink] [raw]
Subject: [PATCH 14/22] Do not call get_pageblock_migratetype() more than necessary

get_pageblock_migratetype() is potentially called twice for every page
free: once when the page is freed to the pcp lists and once when it is
freed back to the buddy lists. When freeing from the pcp lists, the
pageblock type at the time of the free is already known, so use it rather
than rechecking. Under memory pressure, this might skew anti-fragmentation
slightly, but the interference is minimal and decisions that fragment
memory are being made anyway.

Signed-off-by: Mel Gorman <[email protected]>
Reviewed-by: Christoph Lameter <[email protected]>
Reviewed-by: KOSAKI Motohiro <[email protected]>
---
mm/page_alloc.c | 16 ++++++++++------
1 files changed, 10 insertions(+), 6 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index e0e9e67..67cafd0 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -456,16 +456,18 @@ static inline int page_is_buddy(struct page *page, struct page *buddy,
*/

static inline void __free_one_page(struct page *page,
- struct zone *zone, unsigned int order)
+ struct zone *zone, unsigned int order,
+ int migratetype)
{
unsigned long page_idx;
int order_size = 1 << order;
- int migratetype = get_pageblock_migratetype(page);

if (unlikely(PageCompound(page)))
if (unlikely(destroy_compound_page(page, order)))
return;

+ VM_BUG_ON(migratetype == -1);
+
page_idx = page_to_pfn(page) & ((1 << MAX_ORDER) - 1);

VM_BUG_ON(page_idx & (order_size - 1));
@@ -534,17 +536,18 @@ static void free_pages_bulk(struct zone *zone, int count,
page = list_entry(list->prev, struct page, lru);
/* have to delete it as __free_one_page list manipulates */
list_del(&page->lru);
- __free_one_page(page, zone, order);
+ __free_one_page(page, zone, order, page_private(page));
}
spin_unlock(&zone->lock);
}

-static void free_one_page(struct zone *zone, struct page *page, int order)
+static void free_one_page(struct zone *zone, struct page *page, int order,
+ int migratetype)
{
spin_lock(&zone->lock);
zone_clear_flag(zone, ZONE_ALL_UNRECLAIMABLE);
zone->pages_scanned = 0;
- __free_one_page(page, zone, order);
+ __free_one_page(page, zone, order, migratetype);
spin_unlock(&zone->lock);
}

@@ -569,7 +572,8 @@ static void __free_pages_ok(struct page *page, unsigned int order)

local_irq_save(flags);
__count_vm_events(PGFREE, 1 << order);
- free_one_page(page_zone(page), page, order);
+ free_one_page(page_zone(page), page, order,
+ get_pageblock_migratetype(page));
local_irq_restore(flags);
}

--
1.5.6.5
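
For illustration, the round trip this patch relies on condenses to the
sketch below. The helpers (set_page_private(), page_private(),
get_pageblock_migratetype(), __free_one_page()) are the real ones, but the
two wrapper functions are made-up names and the snippet is not part of the
series; it is just a reduced view of the call sites in mm/page_alloc.c.

/* When a page enters the per-cpu list, look the pageblock type up once
 * and stash it in page->private (see also patch 20, which moves this
 * lookup outside the irq-disabled region). */
static void pcp_queue_sketch(struct page *page)
{
	set_page_private(page, get_pageblock_migratetype(page));
}

/* When the per-cpu list is drained back to the buddy lists, reuse the
 * stashed value instead of calling get_pageblock_migratetype() again. */
static void pcp_drain_sketch(struct zone *zone, struct page *page, int order)
{
	__free_one_page(page, zone, order, page_private(page));
}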

2009-04-22 13:58:58

by Mel Gorman

[permalink] [raw]
Subject: [PATCH 15/22] Do not disable interrupts in free_page_mlock()

free_page_mlock() tests and clears PG_mlocked using locked versions of the
bit operations. If set, it disables interrupts to update counters and this
happens on every page free even though interrupts are disabled again very
shortly afterwards anyway. This is wasteful.

This patch splits what free_page_mlock() does. The bit check is still
made. However, the update of the counters is delayed until interrupts are
already disabled, and the non-locked version of the bit clear is used. One potential
weirdness with this split is that the counters do not get updated if the
bad_page() check is triggered but a system showing bad pages is getting
screwed already.

Signed-off-by: Mel Gorman <[email protected]>
Reviewed-by: Christoph Lameter <[email protected]>
Reviewed-by: Pekka Enberg <[email protected]>
Reviewed-by: KOSAKI Motohiro <[email protected]>
---
mm/internal.h | 11 +++--------
mm/page_alloc.c | 8 +++++++-
2 files changed, 10 insertions(+), 9 deletions(-)

diff --git a/mm/internal.h b/mm/internal.h
index 987bb03..58ec1bc 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -157,14 +157,9 @@ static inline void mlock_migrate_page(struct page *newpage, struct page *page)
*/
static inline void free_page_mlock(struct page *page)
{
- if (unlikely(TestClearPageMlocked(page))) {
- unsigned long flags;
-
- local_irq_save(flags);
- __dec_zone_page_state(page, NR_MLOCK);
- __count_vm_event(UNEVICTABLE_MLOCKFREED);
- local_irq_restore(flags);
- }
+ __ClearPageMlocked(page);
+ __dec_zone_page_state(page, NR_MLOCK);
+ __count_vm_event(UNEVICTABLE_MLOCKFREED);
}

#else /* CONFIG_HAVE_MLOCKED_PAGE_BIT */
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 67cafd0..7f45de1 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -499,7 +499,6 @@ static inline void __free_one_page(struct page *page,

static inline int free_pages_check(struct page *page)
{
- free_page_mlock(page);
if (unlikely(page_mapcount(page) |
(page->mapping != NULL) |
(page_count(page) != 0) |
@@ -556,6 +555,7 @@ static void __free_pages_ok(struct page *page, unsigned int order)
unsigned long flags;
int i;
int bad = 0;
+ int clearMlocked = PageMlocked(page);

for (i = 0 ; i < (1 << order) ; ++i)
bad += free_pages_check(page + i);
@@ -571,6 +571,8 @@ static void __free_pages_ok(struct page *page, unsigned int order)
kernel_map_pages(page, 1 << order, 0);

local_irq_save(flags);
+ if (unlikely(clearMlocked))
+ free_page_mlock(page);
__count_vm_events(PGFREE, 1 << order);
free_one_page(page_zone(page), page, order,
get_pageblock_migratetype(page));
@@ -1017,6 +1019,7 @@ static void free_hot_cold_page(struct page *page, int cold)
struct zone *zone = page_zone(page);
struct per_cpu_pages *pcp;
unsigned long flags;
+ int clearMlocked = PageMlocked(page);

if (PageAnon(page))
page->mapping = NULL;
@@ -1032,7 +1035,10 @@ static void free_hot_cold_page(struct page *page, int cold)

pcp = &zone_pcp(zone, get_cpu())->pcp;
local_irq_save(flags);
+ if (unlikely(clearMlocked))
+ free_page_mlock(page);
__count_vm_event(PGFREE);
+
if (cold)
list_add_tail(&page->lru, &pcp->list);
else
--
1.5.6.5

2009-04-22 13:59:28

by Mel Gorman

[permalink] [raw]
Subject: [PATCH 16/22] Do not setup zonelist cache when there is only one node

There is a zonelist cache which is used to track zones that are not in
the allowed cpuset or found to be recently full. This is to reduce cache
footprint on large machines. On smaller machines, it just incurs cost
for no gain. This patch only sets up the zonelist cache when there is more
than one online node.

Signed-off-by: Mel Gorman <[email protected]>
Reviewed-by: Christoph Lameter <[email protected]>
---
mm/page_alloc.c | 7 +++++--
1 files changed, 5 insertions(+), 2 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 7f45de1..e59bb80 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1467,8 +1467,11 @@ this_zone_full:
if (NUMA_BUILD)
zlc_mark_zone_full(zonelist, z);
try_next_zone:
- if (NUMA_BUILD && !did_zlc_setup) {
- /* we do zlc_setup after the first zone is tried */
+ if (NUMA_BUILD && !did_zlc_setup && num_online_nodes() > 1) {
+ /*
+ * we do zlc_setup after the first zone is tried but only
+ * if there are multiple nodes make it worthwhile
+ */
allowednodes = zlc_setup(zonelist, alloc_flags);
zlc_active = 1;
did_zlc_setup = 1;
--
1.5.6.5

2009-04-22 13:59:46

by Mel Gorman

[permalink] [raw]
Subject: [PATCH 17/22] Do not check for compound pages during the page allocator sanity checks

A number of sanity checks are made on each page allocation and free
including a check that the page count is zero. page_count() checks whether
the page is compound and, if so, reads the count of the head page. However,
in these paths, we do not care if the page is compound or not as the
count of each tail page should also be zero.

This patch makes two changes to the use of page_count() in the free path. It
converts one check of page_count() to a VM_BUG_ON() as the count should
have been unconditionally checked earlier in the free path. It also avoids
checking for compound pages.

[[email protected]: Wrote changelog]
Signed-off-by: Nick Piggin <[email protected]>
Reviewed-by: Christoph Lameter <[email protected]>
---
mm/page_alloc.c | 6 +++---
1 files changed, 3 insertions(+), 3 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index e59bb80..b174f2c 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -425,7 +425,7 @@ static inline int page_is_buddy(struct page *page, struct page *buddy,
return 0;

if (PageBuddy(buddy) && page_order(buddy) == order) {
- BUG_ON(page_count(buddy) != 0);
+ VM_BUG_ON(page_count(buddy) != 0);
return 1;
}
return 0;
@@ -501,7 +501,7 @@ static inline int free_pages_check(struct page *page)
{
if (unlikely(page_mapcount(page) |
(page->mapping != NULL) |
- (page_count(page) != 0) |
+ (atomic_read(&page->_count) != 0) |
(page->flags & PAGE_FLAGS_CHECK_AT_FREE))) {
bad_page(page);
return 1;
@@ -646,7 +646,7 @@ static int prep_new_page(struct page *page, int order, gfp_t gfp_flags)
{
if (unlikely(page_mapcount(page) |
(page->mapping != NULL) |
- (page_count(page) != 0) |
+ (atomic_read(&page->_count) != 0) |
(page->flags & PAGE_FLAGS_CHECK_AT_PREP))) {
bad_page(page);
return 1;
--
1.5.6.5

2009-04-22 14:00:10

by Mel Gorman

[permalink] [raw]
Subject: [PATCH 18/22] Use allocation flags as an index to the zone watermark

ALLOC_WMARK_MIN, ALLOC_WMARK_LOW and ALLOC_WMARK_HIGH determine whether
pages_min, pages_low or pages_high is used as the zone watermark when
allocating the pages. Two branches in the allocator hotpath determine which
watermark to use. This patch places the three watermarks in a union with an
array so that the watermark bits of the allocation flags can be used
directly as an array index, removing those branches.

Signed-off-by: Mel Gorman <[email protected]>
Reviewed-by: Christoph Lameter <[email protected]>
---
include/linux/mmzone.h | 8 +++++++-
mm/page_alloc.c | 20 ++++++++++----------
2 files changed, 17 insertions(+), 11 deletions(-)

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index f82bdba..c1fa208 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -275,7 +275,13 @@ struct zone_reclaim_stat {

struct zone {
/* Fields commonly accessed by the page allocator */
- unsigned long pages_min, pages_low, pages_high;
+ union {
+ struct {
+ unsigned long pages_min, pages_low, pages_high;
+ };
+ unsigned long pages_mark[3];
+ };
+
/*
* We don't know if the memory that we're going to allocate will be freeable
* or/and it will be released eventually, so to avoid totally wasting several
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index b174f2c..6030f49 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1154,10 +1154,15 @@ failed:
return NULL;
}

-#define ALLOC_NO_WATERMARKS 0x01 /* don't check watermarks at all */
-#define ALLOC_WMARK_MIN 0x02 /* use pages_min watermark */
-#define ALLOC_WMARK_LOW 0x04 /* use pages_low watermark */
-#define ALLOC_WMARK_HIGH 0x08 /* use pages_high watermark */
+/* The WMARK bits are used as an index zone->pages_mark */
+#define ALLOC_WMARK_MIN 0x00 /* use pages_min watermark */
+#define ALLOC_WMARK_LOW 0x01 /* use pages_low watermark */
+#define ALLOC_WMARK_HIGH 0x02 /* use pages_high watermark */
+#define ALLOC_NO_WATERMARKS 0x04 /* don't check watermarks at all */
+
+/* Mask to get the watermark bits */
+#define ALLOC_WMARK_MASK (ALLOC_NO_WATERMARKS-1)
+
#define ALLOC_HARDER 0x10 /* try to alloc harder */
#define ALLOC_HIGH 0x20 /* __GFP_HIGH set */
#define ALLOC_CPUSET 0x40 /* check for correct cpuset */
@@ -1445,12 +1450,7 @@ zonelist_scan:

if (!(alloc_flags & ALLOC_NO_WATERMARKS)) {
unsigned long mark;
- if (alloc_flags & ALLOC_WMARK_MIN)
- mark = zone->pages_min;
- else if (alloc_flags & ALLOC_WMARK_LOW)
- mark = zone->pages_low;
- else
- mark = zone->pages_high;
+ mark = zone->pages_mark[alloc_flags & ALLOC_WMARK_MASK];
if (!zone_watermark_ok(zone, order, mark,
classzone_idx, alloc_flags)) {
if (!zone_reclaim_mode ||
--
1.5.6.5
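
The indexing trick also stands alone outside the kernel; a minimal userspace
demo (illustration only, with the flag values copied from the patch and a
made-up struct name) shows how one array lookup replaces the branch chain:

#include <assert.h>
#include <stdio.h>

#define ALLOC_WMARK_MIN		0x00
#define ALLOC_WMARK_LOW		0x01
#define ALLOC_WMARK_HIGH	0x02
#define ALLOC_NO_WATERMARKS	0x04
#define ALLOC_WMARK_MASK	(ALLOC_NO_WATERMARKS-1)

struct demo_zone {
	union {
		struct {
			unsigned long pages_min, pages_low, pages_high;
		};
		unsigned long pages_mark[3];
	};
};

int main(void)
{
	struct demo_zone zone = { .pages_mark = { 10, 20, 30 } };
	int alloc_flags = ALLOC_WMARK_LOW | 0x40;	/* other flag bits ride along */

	/* one lookup instead of if/else if/else on the WMARK flags */
	unsigned long mark = zone.pages_mark[alloc_flags & ALLOC_WMARK_MASK];

	assert(mark == zone.pages_low);
	printf("watermark = %lu\n", mark);
	return 0;
}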

2009-04-22 14:00:41

by Mel Gorman

[permalink] [raw]
Subject: [PATCH 19/22] Update NR_FREE_PAGES only as necessary

When pages are being freed to the buddy allocator, the zone
NR_FREE_PAGES counter must be updated. In the case of bulk per-cpu page
freeing, it's updated once per page. This retouches cache lines more
than necessary. Update the counter once per per-cpu bulk free instead.

Signed-off-by: Mel Gorman <[email protected]>
Reviewed-by: Christoph Lameter <[email protected]>
Reviewed-by: KOSAKI Motohiro <[email protected]>
---
mm/page_alloc.c | 12 ++++++------
1 files changed, 6 insertions(+), 6 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 6030f49..6494e13 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -460,7 +460,6 @@ static inline void __free_one_page(struct page *page,
int migratetype)
{
unsigned long page_idx;
- int order_size = 1 << order;

if (unlikely(PageCompound(page)))
if (unlikely(destroy_compound_page(page, order)))
@@ -470,10 +469,9 @@ static inline void __free_one_page(struct page *page,

page_idx = page_to_pfn(page) & ((1 << MAX_ORDER) - 1);

- VM_BUG_ON(page_idx & (order_size - 1));
+ VM_BUG_ON(page_idx & ((1 << order) - 1));
VM_BUG_ON(bad_range(zone, page));

- __mod_zone_page_state(zone, NR_FREE_PAGES, order_size);
while (order < MAX_ORDER-1) {
unsigned long combined_idx;
struct page *buddy;
@@ -528,6 +526,8 @@ static void free_pages_bulk(struct zone *zone, int count,
spin_lock(&zone->lock);
zone_clear_flag(zone, ZONE_ALL_UNRECLAIMABLE);
zone->pages_scanned = 0;
+
+ __mod_zone_page_state(zone, NR_FREE_PAGES, count);
while (count--) {
struct page *page;

@@ -546,6 +546,8 @@ static void free_one_page(struct zone *zone, struct page *page, int order,
spin_lock(&zone->lock);
zone_clear_flag(zone, ZONE_ALL_UNRECLAIMABLE);
zone->pages_scanned = 0;
+
+ __mod_zone_page_state(zone, NR_FREE_PAGES, 1 << order);
__free_one_page(page, zone, order, migratetype);
spin_unlock(&zone->lock);
}
@@ -690,7 +692,6 @@ struct page *__rmqueue_smallest(struct zone *zone, unsigned int order,
list_del(&page->lru);
rmv_page_order(page);
area->nr_free--;
- __mod_zone_page_state(zone, NR_FREE_PAGES, - (1UL << order));
expand(zone, page, order, current_order, area, migratetype);
return page;
}
@@ -830,8 +831,6 @@ __rmqueue_fallback(struct zone *zone, int order, int start_migratetype)
/* Remove the page from the freelists */
list_del(&page->lru);
rmv_page_order(page);
- __mod_zone_page_state(zone, NR_FREE_PAGES,
- -(1UL << order));

if (current_order == pageblock_order)
set_pageblock_migratetype(page,
@@ -904,6 +903,7 @@ static int rmqueue_bulk(struct zone *zone, unsigned int order,
set_page_private(page, migratetype);
list = &page->lru;
}
+ __mod_zone_page_state(zone, NR_FREE_PAGES, -(i << order));
spin_unlock(&zone->lock);
return i;
}
--
1.5.6.5

2009-04-22 14:00:57

by Mel Gorman

[permalink] [raw]
Subject: [PATCH 20/22] Get the pageblock migratetype without disabling interrupts

Local interrupts are disabled when freeing pages to the PCP list. Part of
that free checks the migratetype of the pageblock the page is in, but it
does so with interrupts disabled, and interrupts should never be disabled
longer than necessary. This patch looks up the pageblock type with
interrupts enabled, with the result that it is possible for a page to be
freed to the wrong list when a pageblock changes type. As that block is by
then already considered mixed from an anti-fragmentation perspective, it's
not of vital importance.

Signed-off-by: Mel Gorman <[email protected]>
---
mm/page_alloc.c | 2 +-
1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 6494e13..ba41551 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1034,6 +1034,7 @@ static void free_hot_cold_page(struct page *page, int cold)
kernel_map_pages(page, 1, 0);

pcp = &zone_pcp(zone, get_cpu())->pcp;
+ set_page_private(page, get_pageblock_migratetype(page));
local_irq_save(flags);
if (unlikely(clearMlocked))
free_page_mlock(page);
@@ -1043,7 +1044,6 @@ static void free_hot_cold_page(struct page *page, int cold)
list_add_tail(&page->lru, &pcp->list);
else
list_add(&page->lru, &pcp->list);
- set_page_private(page, get_pageblock_migratetype(page));
pcp->count++;
if (pcp->count >= pcp->high) {
free_pages_bulk(zone, pcp->batch, &pcp->list, 0);
--
1.5.6.5

2009-04-22 14:01:40

by Mel Gorman

[permalink] [raw]
Subject: [PATCH 22/22] slab: Use nr_online_nodes to check for a NUMA platform

SLAB currently avoids checking a bitmap repeatedly by checking once and
storing a flag. With the addition of nr_online_nodes as a cheaper version
of num_online_nodes(), this check can be replaced by a check of nr_online_nodes.

(Christoph did a patch that this is lifted almost verbatim from, hence the
first Signed-off-by. Christoph, can you confirm you're ok with that?)

Signed-off-by: Christoph Lameter <[email protected]>
Signed-off-by: Mel Gorman <[email protected]>
---
mm/slab.c | 7 ++-----
1 files changed, 2 insertions(+), 5 deletions(-)

diff --git a/mm/slab.c b/mm/slab.c
index 1c680e8..4d38dc2 100644
--- a/mm/slab.c
+++ b/mm/slab.c
@@ -892,7 +892,6 @@ static void __slab_error(const char *function, struct kmem_cache *cachep,
*/

static int use_alien_caches __read_mostly = 1;
-static int numa_platform __read_mostly = 1;
static int __init noaliencache_setup(char *s)
{
use_alien_caches = 0;
@@ -1453,10 +1452,8 @@ void __init kmem_cache_init(void)
int order;
int node;

- if (num_possible_nodes() == 1) {
+ if (num_possible_nodes() == 1)
use_alien_caches = 0;
- numa_platform = 0;
- }

for (i = 0; i < NUM_INIT_LISTS; i++) {
kmem_list3_init(&initkmem_list3[i]);
@@ -3579,7 +3576,7 @@ static inline void __cache_free(struct kmem_cache *cachep, void *objp)
* variable to skip the call, which is mostly likely to be present in
* the cache.
*/
- if (numa_platform && cache_free_alien(cachep, objp))
+ if (nr_online_nodes > 1 && cache_free_alien(cachep, objp))
return;

if (likely(ac->avail < ac->limit)) {
--
1.5.6.5

2009-04-22 14:01:24

by Mel Gorman

[permalink] [raw]
Subject: [PATCH 21/22] Use a pre-calculated value instead of num_online_nodes() in fast paths

From: Christoph Lameter <[email protected]>

num_online_nodes() is called in a number of places but most often by the
page allocator when deciding whether the zonelist needs to be filtered based
on cpusets or the zonelist cache. This is actually a heavy function and
touches a number of cache lines.

This patch stores the number of online nodes at boot time and updates the
value when nodes get onlined and offlined. The value is then used in a
number of important paths in place of num_online_nodes().

Signed-off-by: Christoph Lameter <[email protected]>
Signed-off-by: Mel Gorman <[email protected]>
---
include/linux/nodemask.h | 15 ++++++++++++++-
mm/hugetlb.c | 4 ++--
mm/page_alloc.c | 10 ++++++----
mm/slub.c | 2 +-
net/sunrpc/svc.c | 2 +-
5 files changed, 24 insertions(+), 9 deletions(-)

diff --git a/include/linux/nodemask.h b/include/linux/nodemask.h
index 848025c..474e73e 100644
--- a/include/linux/nodemask.h
+++ b/include/linux/nodemask.h
@@ -408,6 +408,19 @@ static inline int num_node_state(enum node_states state)
#define next_online_node(nid) next_node((nid), node_states[N_ONLINE])

extern int nr_node_ids;
+extern int nr_online_nodes;
+
+static inline void node_set_online(int nid)
+{
+ node_set_state(nid, N_ONLINE);
+ nr_online_nodes = num_node_state(N_ONLINE);
+}
+
+static inline void node_set_offline(int nid)
+{
+ node_clear_state(nid, N_ONLINE);
+ nr_online_nodes = num_node_state(N_ONLINE);
+}
#else

static inline int node_state(int node, enum node_states state)
@@ -434,7 +447,7 @@ static inline int num_node_state(enum node_states state)
#define first_online_node 0
#define next_online_node(nid) (MAX_NUMNODES)
#define nr_node_ids 1
-
+#define nr_online_nodes 1
#endif

#define node_online_map node_states[N_ONLINE]
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 1234486..c9a404c 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -875,7 +875,7 @@ static void return_unused_surplus_pages(struct hstate *h,
* can no longer free unreserved surplus pages. This occurs when
* the nodes with surplus pages have no free pages.
*/
- unsigned long remaining_iterations = num_online_nodes();
+ unsigned long remaining_iterations = nr_online_nodes;

/* Uncommit the reservation */
h->resv_huge_pages -= unused_resv_pages;
@@ -904,7 +904,7 @@ static void return_unused_surplus_pages(struct hstate *h,
h->surplus_huge_pages--;
h->surplus_huge_pages_node[nid]--;
nr_pages--;
- remaining_iterations = num_online_nodes();
+ remaining_iterations = nr_online_nodes;
}
}
}
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index ba41551..1464aca 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -165,7 +165,9 @@ static unsigned long __meminitdata dma_reserve;

#if MAX_NUMNODES > 1
int nr_node_ids __read_mostly = MAX_NUMNODES;
+int nr_online_nodes __read_mostly = 1;
EXPORT_SYMBOL(nr_node_ids);
+EXPORT_SYMBOL(nr_online_nodes);
#endif

int page_group_by_mobility_disabled __read_mostly;
@@ -1467,7 +1469,7 @@ this_zone_full:
if (NUMA_BUILD)
zlc_mark_zone_full(zonelist, z);
try_next_zone:
- if (NUMA_BUILD && !did_zlc_setup && num_online_nodes() > 1) {
+ if (NUMA_BUILD && !did_zlc_setup && nr_online_nodes > 1) {
/*
* we do zlc_setup after the first zone is tried but only
* if there are multiple nodes make it worthwhile
@@ -2267,7 +2269,7 @@ int numa_zonelist_order_handler(ctl_table *table, int write,
}


-#define MAX_NODE_LOAD (num_online_nodes())
+#define MAX_NODE_LOAD (nr_online_nodes)
static int node_load[MAX_NUMNODES];

/**
@@ -2476,7 +2478,7 @@ static void build_zonelists(pg_data_t *pgdat)

/* NUMA-aware ordering of nodes */
local_node = pgdat->node_id;
- load = num_online_nodes();
+ load = nr_online_nodes;
prev_node = local_node;
nodes_clear(used_mask);

@@ -2627,7 +2629,7 @@ void build_all_zonelists(void)

printk("Built %i zonelists in %s order, mobility grouping %s. "
"Total pages: %ld\n",
- num_online_nodes(),
+ nr_online_nodes,
zonelist_order_name[current_zonelist_order],
page_group_by_mobility_disabled ? "off" : "on",
vm_total_pages);
diff --git a/mm/slub.c b/mm/slub.c
index 93f5fb0..8e181e2 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -3714,7 +3714,7 @@ static int list_locations(struct kmem_cache *s, char *buf,
to_cpumask(l->cpus));
}

- if (num_online_nodes() > 1 && !nodes_empty(l->nodes) &&
+ if (nr_online_nodes > 1 && !nodes_empty(l->nodes) &&
len < PAGE_SIZE - 60) {
len += sprintf(buf + len, " nodes=");
len += nodelist_scnprintf(buf + len, PAGE_SIZE - len - 50,
diff --git a/net/sunrpc/svc.c b/net/sunrpc/svc.c
index 8847add..5ed8931 100644
--- a/net/sunrpc/svc.c
+++ b/net/sunrpc/svc.c
@@ -124,7 +124,7 @@ svc_pool_map_choose_mode(void)
{
unsigned int node;

- if (num_online_nodes() > 1) {
+ if (nr_online_nodes > 1) {
/*
* Actually have multiple NUMA nodes,
* so split pools on NUMA node boundaries
--
1.5.6.5

2009-04-22 14:37:50

by Pekka Enberg

[permalink] [raw]
Subject: Re: [PATCH 22/22] slab: Use nr_online_nodes to check for a NUMA platform

On Wed, Apr 22, 2009 at 4:53 PM, Mel Gorman <[email protected]> wrote:
> SLAB currently avoids checking a bitmap repeatedly by checking once and
> storing a flag. When the addition of nr_online_nodes as a cheaper version
> of num_online_nodes(), this check can be replaced by nr_online_nodes.
>
> (Christoph did a patch that this is lifted almost verbatim from, hence the
> first Signed-off-by. Christoph, can you confirm you're ok with that?)
>
> Signed-off-by: Christoph Lameter <[email protected]>
> Signed-off-by: Mel Gorman <[email protected]>

Reviewed-by: Pekka Enberg <[email protected]>

2009-04-22 16:13:40

by Dave Hansen

[permalink] [raw]
Subject: Re: [PATCH 02/22] Do not sanity check order in the fast path

On Wed, 2009-04-22 at 14:53 +0100, Mel Gorman wrote:
> No user of the allocator API should be passing in an order >= MAX_ORDER
> but we check for it on each and every allocation. Delete this check and
> make it a VM_BUG_ON check further down the call path.

Should we get the check re-added to some of the upper-level functions,
then? Perhaps __get_free_pages() or things like alloc_pages_exact()?

I'm selfishly thinking of what I did in profile_init(). Can I slab
alloc it? Nope. Page allocator? Nope. Oh, well, try vmalloc():

prof_buffer = kzalloc(buffer_bytes, GFP_KERNEL);
if (prof_buffer)
return 0;

prof_buffer = alloc_pages_exact(buffer_bytes, GFP_KERNEL|__GFP_ZERO);
if (prof_buffer)
return 0;

prof_buffer = vmalloc(buffer_bytes);
if (prof_buffer)
return 0;

free_cpumask_var(prof_cpu_mask);
return -ENOMEM;

Same thing in __kmalloc_section_memmap():

page = alloc_pages(GFP_KERNEL|__GFP_NOWARN, get_order(memmap_size));
if (page)
goto got_map_page;

ret = vmalloc(memmap_size);
if (ret)
goto got_map_ptr;

I depend on the allocator to tell me when I've fed it too high of an
order. If we really need this, perhaps we should do an audit and then
add a WARN_ON() for a few releases to catch the stragglers.

-- Dave
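
One possible shape of the transitional check suggested above, as a sketch
only (the wrapper name is hypothetical and nothing like it is in the
series): warn loudly once and fail the allocation, rather than relying on a
VM_BUG_ON() further down the call path.

static inline struct page *alloc_pages_checked(gfp_t gfp_mask,
						unsigned int order)
{
	/* catch callers asking for something that can never succeed */
	if (WARN_ON_ONCE(order >= MAX_ORDER))
		return NULL;
	return alloc_pages(gfp_mask, order);
}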

2009-04-22 17:12:11

by Mel Gorman

[permalink] [raw]
Subject: Re: [PATCH 02/22] Do not sanity check order in the fast path

On Wed, Apr 22, 2009 at 09:13:11AM -0700, Dave Hansen wrote:
> On Wed, 2009-04-22 at 14:53 +0100, Mel Gorman wrote:
> > No user of the allocator API should be passing in an order >= MAX_ORDER
> > but we check for it on each and every allocation. Delete this check and
> > make it a VM_BUG_ON check further down the call path.
>
> Should we get the check re-added to some of the upper-level functions,
> then? Perhaps __get_free_pages() or things like alloc_pages_exact()?
>

I don't think so, no. It just moves the source of the text bloat around,
and all for the few callers that are asking for something that will never
succeed.

> I'm selfishly thinking of what I did in profile_init(). Can I slab
> alloc it? Nope. Page allocator? Nope. Oh, well, try vmalloc():
>
> prof_buffer = kzalloc(buffer_bytes, GFP_KERNEL);
> if (prof_buffer)
> return 0;
>
> prof_buffer = alloc_pages_exact(buffer_bytes, GFP_KERNEL|__GFP_ZERO);
> if (prof_buffer)
> return 0;
>
> prof_buffer = vmalloc(buffer_bytes);
> if (prof_buffer)
> return 0;
>
> free_cpumask_var(prof_cpu_mask);
> return -ENOMEM;
>

Can this ever actually be asking for an order larger than MAX_ORDER
though? If so, you're condemning it to always behave poorly.

> Same thing in __kmalloc_section_memmap():
>
> page = alloc_pages(GFP_KERNEL|__GFP_NOWARN, get_order(memmap_size));
> if (page)
> goto got_map_page;
>
> ret = vmalloc(memmap_size);
> if (ret)
> goto got_map_ptr;
>

If I'm reading that right, the order will never be a stupid order. It can fail
for higher orders, in which case it falls back to vmalloc(). For example,
to hit that limit on a kernel with 4K pages and a maximum usable order of
10, the section size would need to be 256MB (assuming a struct page size
of 64 bytes). I don't think it's ever that size and, if it were, it would
always be sub-optimal, which is a poor choice to make.

> I depend on the allocator to tell me when I've fed it too high of an
> order. If we really need this, perhaps we should do an audit and then
> add a WARN_ON() for a few releases to catch the stragglers.
>

I consider it buggy to ask for something so large that you always end up
with the worst option - vmalloc(). How about leaving it as a VM_BUG_ON
to get as many reports as possible on who is depending on this odd
behaviour?

If there are users with good reasons, then we could convert this to WARN_ON
to fix up the callers. I suspect that the allocator can already cope with
receiving a stupid order silently but slowly. It should go all the way to the
bottom and just never find anything useful and return NULL. zone_watermark_ok
is the most dangerous looking part but even it should never get to MAX_ORDER
because it should always find there are not enough free pages and return
before it overruns.

--
Mel Gorman
Part-time Phd Student Linux Technology Center
University of Limerick IBM Dublin Software Lab

2009-04-22 17:12:27

by Dave Hansen

[permalink] [raw]
Subject: Re: [PATCH 18/22] Use allocation flags as an index to the zone watermark

On Wed, 2009-04-22 at 14:53 +0100, Mel Gorman wrote:
> struct zone {
> /* Fields commonly accessed by the page allocator */
> - unsigned long pages_min, pages_low, pages_high;
> + union {
> + struct {
> + unsigned long pages_min, pages_low, pages_high;
> + };
> + unsigned long pages_mark[3];
> + };

Why the union? It's a bit obfuscated for me. Why not just have a
couple of these:

static inline unsigned long zone_pages_min(struct zone *zone)
{
return zone->pages_mark[ALLOC_WMARK_MIN];
}

and s/zone->pages_min/zone_pages_min(zone)/

?

-- Dave

2009-04-22 17:15:09

by Mel Gorman

[permalink] [raw]
Subject: Re: [PATCH 18/22] Use allocation flags as an index to the zone watermark

On Wed, Apr 22, 2009 at 10:11:53AM -0700, Dave Hansen wrote:
> On Wed, 2009-04-22 at 14:53 +0100, Mel Gorman wrote:
> > struct zone {
> > /* Fields commonly accessed by the page allocator */
> > - unsigned long pages_min, pages_low, pages_high;
> > + union {
> > + struct {
> > + unsigned long pages_min, pages_low, pages_high;
> > + };
> > + unsigned long pages_mark[3];
> > + };
>
> Why the union? It's a bit obfuscated for me. Why not just have a
> couple of these:
>
> static inline unsigned long zone_pages_min(struct zone *zone)
> {
> return zone->pages_mark[ALLOC_WMARK_MIN];
> }
>
> and s/zone->pages_min/zone_pages_min(zone)/
>
> ?
>

Preference of taste, really. When I started a conversion to accessors, it
changed something recognised into something new that looked uglier to me.
Only one place cares about the union enough to access it via the array, so
why spread it everywhere?

--
Mel Gorman
Part-time Phd Student Linux Technology Center
University of Limerick IBM Dublin Software Lab

2009-04-22 17:30:48

by Dave Hansen

[permalink] [raw]
Subject: Re: [PATCH 02/22] Do not sanity check order in the fast path

On Wed, 2009-04-22 at 18:11 +0100, Mel Gorman wrote:
> On Wed, Apr 22, 2009 at 09:13:11AM -0700, Dave Hansen wrote:
> > On Wed, 2009-04-22 at 14:53 +0100, Mel Gorman wrote:
> > > No user of the allocator API should be passing in an order >= MAX_ORDER
> > > but we check for it on each and every allocation. Delete this check and
> > > make it a VM_BUG_ON check further down the call path.
> >
> > Should we get the check re-added to some of the upper-level functions,
> > then? Perhaps __get_free_pages() or things like alloc_pages_exact()?
>
> I don't think so, no. It just moves the source of the text bloat and
> for the few callers that are asking for something that will never
> succeed.

Well, it's a matter of figuring out when it can succeed. Some of this
stuff, we can figure out at compile-time. Others are a bit harder.

> > I'm selfishly thinking of what I did in profile_init(). Can I slab
> > alloc it? Nope. Page allocator? Nope. Oh, well, try vmalloc():
> >
> > prof_buffer = kzalloc(buffer_bytes, GFP_KERNEL);
> > if (prof_buffer)
> > return 0;
> >
> > prof_buffer = alloc_pages_exact(buffer_bytes, GFP_KERNEL|__GFP_ZERO);
> > if (prof_buffer)
> > return 0;
> >
> > prof_buffer = vmalloc(buffer_bytes);
> > if (prof_buffer)
> > return 0;
> >
> > free_cpumask_var(prof_cpu_mask);
> > return -ENOMEM;
> >
>
> Can this ever actually be asking for an order larger than MAX_ORDER
> though? If so, you're condemning it to always behave poorly.

Yeah. It is based on text size. Smaller kernels with trimmed configs
and no modules have no problem fitting under MAX_ORDER, as do kernels
with larger base page sizes.

> > Same thing in __kmalloc_section_memmap():
> >
> > page = alloc_pages(GFP_KERNEL|__GFP_NOWARN, get_order(memmap_size));
> > if (page)
> > goto got_map_page;
> >
> > ret = vmalloc(memmap_size);
> > if (ret)
> > goto got_map_ptr;
> >
>
> If I'm reading that right, the order will never be a stupid order. It can fail
> for higher orders in which case it falls back to vmalloc() . For example,
> to hit that limit, the section size for a 4K kernel, maximum usable order
> of 10, the section size would need to be 256MB (assuming struct page size
> of 64 bytes). I don't think it's ever that size and if so, it'll always be
> sub-optimal which is a poor choice to make.

I think the section size default used to be 512M on x86 because we
concentrate on removing whole DIMMs.

> > I depend on the allocator to tell me when I've fed it too high of an
> > order. If we really need this, perhaps we should do an audit and then
> > add a WARN_ON() for a few releases to catch the stragglers.
>
> I consider it buggy to ask for something so large that you always end up
> with the worst option - vmalloc(). How about leaving it as a VM_BUG_ON
> to get as many reports as possible on who is depending on this odd
> behaviour?
>
> If there are users with good reasons, then we could convert this to WARN_ON
> to fix up the callers. I suspect that the allocator can already cope with
> recieving a stupid order silently but slowly. It should go all the way to the
> bottom and just never find anything useful and return NULL. zone_watermark_ok
> is the most dangerous looking part but even it should never get to MAX_ORDER
> because it should always find there are not enough free pages and return
> before it overruns.

Whatever we do, I'd agree that it's fine that this is a degenerate case
that gets handled very slowly and as far out of hot paths as possible.
Anybody who can fall back to a vmalloc is not doing these things very
often.

-- Dave

2009-04-22 17:47:21

by Dave Hansen

[permalink] [raw]
Subject: Re: [PATCH 18/22] Use allocation flags as an index to the zone watermark

On Wed, 2009-04-22 at 18:14 +0100, Mel Gorman wrote:
> Preference of taste really. When I started a conversion to accessors, it
> changed something recognised to something new that looked uglier to me.
> Only one place cares about the union enough to access is via an array so
> why spread it everywhere.

Personally, I'd say for consistency. Someone looking at both forms
wouldn't necessarily know that they refer to the same variables unless
they know about the union.

2009-04-22 20:06:28

by David Rientjes

[permalink] [raw]
Subject: Re: [PATCH 18/22] Use allocation flags as an index to the zone watermark

On Wed, 22 Apr 2009, Mel Gorman wrote:

> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index b174f2c..6030f49 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -1154,10 +1154,15 @@ failed:
> return NULL;
> }
>
> -#define ALLOC_NO_WATERMARKS 0x01 /* don't check watermarks at all */
> -#define ALLOC_WMARK_MIN 0x02 /* use pages_min watermark */
> -#define ALLOC_WMARK_LOW 0x04 /* use pages_low watermark */
> -#define ALLOC_WMARK_HIGH 0x08 /* use pages_high watermark */
> +/* The WMARK bits are used as an index zone->pages_mark */
> +#define ALLOC_WMARK_MIN 0x00 /* use pages_min watermark */
> +#define ALLOC_WMARK_LOW 0x01 /* use pages_low watermark */
> +#define ALLOC_WMARK_HIGH 0x02 /* use pages_high watermark */
> +#define ALLOC_NO_WATERMARKS 0x04 /* don't check watermarks at all */
> +
> +/* Mask to get the watermark bits */
> +#define ALLOC_WMARK_MASK (ALLOC_NO_WATERMARKS-1)
> +
> #define ALLOC_HARDER 0x10 /* try to alloc harder */
> #define ALLOC_HIGH 0x20 /* __GFP_HIGH set */
> #define ALLOC_CPUSET 0x40 /* check for correct cpuset */

The watermark flags should probably be members of an anonymous enum since
they're being used as an index into an array. If another watermark were
ever to be added it would require a value of 0x03, for instance.

enum {
ALLOC_WMARK_MIN,
ALLOC_WMARK_LOW,
ALLOC_WMARK_HIGH,

ALLOC_WMARK_MASK = 0xf /* no more than 16 possible watermarks */
};

This eliminates ALLOC_NO_WATERMARKS and the caller that uses it would
simply pass 0.

> @@ -1445,12 +1450,7 @@ zonelist_scan:
>
> if (!(alloc_flags & ALLOC_NO_WATERMARKS)) {

This would become

if (alloc_flags & ALLOC_WMARK_MASK)

2009-04-22 20:12:09

by David Rientjes

[permalink] [raw]
Subject: Re: [PATCH 02/22] Do not sanity check order in the fast path

On Wed, 22 Apr 2009, Mel Gorman wrote:

> If there are users with good reasons, then we could convert this to WARN_ON
> to fix up the callers. I suspect that the allocator can already cope with
> recieving a stupid order silently but slowly. It should go all the way to the
> bottom and just never find anything useful and return NULL. zone_watermark_ok
> is the most dangerous looking part but even it should never get to MAX_ORDER
> because it should always find there are not enough free pages and return
> before it overruns.
>

slub: enforce MAX_ORDER

slub_max_order may not be equal to or greater than MAX_ORDER.

Additionally, if a single object cannot be placed in a slab of
slub_max_order, it still must allocate slabs below MAX_ORDER.

Cc: Christoph Lameter <[email protected]>
Signed-off-by: David Rientjes <[email protected]>
---
mm/slub.c | 3 ++-
1 files changed, 2 insertions(+), 1 deletions(-)

diff --git a/mm/slub.c b/mm/slub.c
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -1909,7 +1909,7 @@ static inline int calculate_order(int size)
* Doh this slab cannot be placed using slub_max_order.
*/
order = slab_order(size, 1, MAX_ORDER, 1);
- if (order <= MAX_ORDER)
+ if (order < MAX_ORDER)
return order;
return -ENOSYS;
}
@@ -2522,6 +2522,7 @@ __setup("slub_min_order=", setup_slub_min_order);
static int __init setup_slub_max_order(char *str)
{
get_option(&str, &slub_max_order);
+ slub_max_order = min(slub_max_order, MAX_ORDER - 1);

return 1;
}

2009-04-22 20:24:43

by David Rientjes

[permalink] [raw]
Subject: Re: [PATCH 16/22] Do not setup zonelist cache when there is only one node

On Wed, 22 Apr 2009, Mel Gorman wrote:

> There is a zonelist cache which is used to track zones that are not in
> the allowed cpuset or found to be recently full. This is to reduce cache
> footprint on large machines. On smaller machines, it just incurs cost
> for no gain. This patch only uses the zonelist cache when there are NUMA
> nodes.
>
> Signed-off-by: Mel Gorman <[email protected]>
> Reviewed-by: Christoph Lameter <[email protected]>
> ---
> mm/page_alloc.c | 7 +++++--
> 1 files changed, 5 insertions(+), 2 deletions(-)
>
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 7f45de1..e59bb80 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -1467,8 +1467,11 @@ this_zone_full:
> if (NUMA_BUILD)
> zlc_mark_zone_full(zonelist, z);

If zonelist caching is never used for UMA machines, why should they ever
call zlc_mark_zone_full()? It will always dereference
zonelist->zlcache_ptr and immediately return without doing anything.

Wouldn't it be better to just add

if (num_online_nodes() == 1)
continue;

right before this call to zlc_mark_zone_full()? This should compile out
the remainder of the loop for !CONFIG_NUMA kernels anyway.

> try_next_zone:
> - if (NUMA_BUILD && !did_zlc_setup) {
> - /* we do zlc_setup after the first zone is tried */
> + if (NUMA_BUILD && !did_zlc_setup && num_online_nodes() > 1) {
> + /*
> + * we do zlc_setup after the first zone is tried but only
> + * if there are multiple nodes make it worthwhile
> + */
> allowednodes = zlc_setup(zonelist, alloc_flags);
> zlc_active = 1;
> did_zlc_setup = 1;
> --
> 1.5.6.5

2009-04-22 20:32:36

by Lee Schermerhorn

[permalink] [raw]
Subject: Re: [PATCH 16/22] Do not setup zonelist cache when there is only one node

On Wed, 2009-04-22 at 13:24 -0700, David Rientjes wrote:
> On Wed, 22 Apr 2009, Mel Gorman wrote:
>
> > There is a zonelist cache which is used to track zones that are not in
> > the allowed cpuset or found to be recently full. This is to reduce cache
> > footprint on large machines. On smaller machines, it just incurs cost
> > for no gain. This patch only uses the zonelist cache when there are NUMA
> > nodes.
> >
> > Signed-off-by: Mel Gorman <[email protected]>
> > Reviewed-by: Christoph Lameter <[email protected]>
> > ---
> > mm/page_alloc.c | 7 +++++--
> > 1 files changed, 5 insertions(+), 2 deletions(-)
> >
> > diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> > index 7f45de1..e59bb80 100644
> > --- a/mm/page_alloc.c
> > +++ b/mm/page_alloc.c
> > @@ -1467,8 +1467,11 @@ this_zone_full:
> > if (NUMA_BUILD)
> > zlc_mark_zone_full(zonelist, z);
>
> If zonelist caching is never used for UMA machines, why should they ever
> call zlc_mark_zone_full()? It will always dereference
> zonelist->zlcache_ptr and immediately return without doing anything.
>
> Wouldn't it better to just add
>
> if (num_online_nodes() == 1)
> continue;
>
> right before this call to zlc_mark_zone_full()? This should compile out
> the remainder of the loop for !CONFIG_NUMA kernels anyway.

Shouldn't it already do that? NUMA_BUILD is defined as 0 when
!CONFIG_NUMA to avoid #ifdef's in the code while still allowing compiler
error checking in the dead code.

Lee

>
> > try_next_zone:
> > - if (NUMA_BUILD && !did_zlc_setup) {
> > - /* we do zlc_setup after the first zone is tried */
> > + if (NUMA_BUILD && !did_zlc_setup && num_online_nodes() > 1) {
> > + /*
> > + * we do zlc_setup after the first zone is tried but only
> > + * if there are multiple nodes make it worthwhile
> > + */
> > allowednodes = zlc_setup(zonelist, alloc_flags);
> > zlc_active = 1;
> > did_zlc_setup = 1;
> > --
> > 1.5.6.5
>

2009-04-22 20:34:46

by David Rientjes

[permalink] [raw]
Subject: Re: [PATCH 16/22] Do not setup zonelist cache when there is only one node

On Wed, 22 Apr 2009, Lee Schermerhorn wrote:

> > > diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> > > index 7f45de1..e59bb80 100644
> > > --- a/mm/page_alloc.c
> > > +++ b/mm/page_alloc.c
> > > @@ -1467,8 +1467,11 @@ this_zone_full:
> > > if (NUMA_BUILD)
> > > zlc_mark_zone_full(zonelist, z);
> >
> > If zonelist caching is never used for UMA machines, why should they ever
> > call zlc_mark_zone_full()? It will always dereference
> > zonelist->zlcache_ptr and immediately return without doing anything.
> >
> > Wouldn't it better to just add
> >
> > if (num_online_nodes() == 1)
> > continue;
> >
> > right before this call to zlc_mark_zone_full()? This should compile out
> > the remainder of the loop for !CONFIG_NUMA kernels anyway.
>
> Shouldn't it already do that? NUMA_BUILD is defined as 0 when
> !CONFIG_NUMA to avoid #ifdef's in the code while still allowing compiler
> error checking in the dead code.
>

Yeah, but adding the check on num_online_nodes() also prevents needlessly
calling zlc_mark_zone_full() on CONFIG_NUMA kernels when running on an UMA
machine.

2009-04-22 23:05:10

by David Rientjes

[permalink] [raw]
Subject: Re: [PATCH 21/22] Use a pre-calculated value instead of num_online_nodes() in fast paths

On Wed, 22 Apr 2009, Mel Gorman wrote:

> diff --git a/include/linux/nodemask.h b/include/linux/nodemask.h
> index 848025c..474e73e 100644
> --- a/include/linux/nodemask.h
> +++ b/include/linux/nodemask.h
> @@ -408,6 +408,19 @@ static inline int num_node_state(enum node_states state)
> #define next_online_node(nid) next_node((nid), node_states[N_ONLINE])
>
> extern int nr_node_ids;
> +extern int nr_online_nodes;
> +
> +static inline void node_set_online(int nid)
> +{
> + node_set_state(nid, N_ONLINE);
> + nr_online_nodes = num_node_state(N_ONLINE);
> +}
> +
> +static inline void node_set_offline(int nid)
> +{
> + node_clear_state(nid, N_ONLINE);
> + nr_online_nodes = num_node_state(N_ONLINE);
> +}
> #else
>
> static inline int node_state(int node, enum node_states state)

The later #define's of node_set_online() and node_set_offline() in
include/linux/nodemask.h should probably be removed now.

2009-04-23 00:11:32

by KOSAKI Motohiro

[permalink] [raw]
Subject: Re: [PATCH 16/22] Do not setup zonelist cache when there is only one node

> On Wed, 22 Apr 2009, Lee Schermerhorn wrote:
>
> > > > diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> > > > index 7f45de1..e59bb80 100644
> > > > --- a/mm/page_alloc.c
> > > > +++ b/mm/page_alloc.c
> > > > @@ -1467,8 +1467,11 @@ this_zone_full:
> > > > if (NUMA_BUILD)
> > > > zlc_mark_zone_full(zonelist, z);
> > >
> > > If zonelist caching is never used for UMA machines, why should they ever
> > > call zlc_mark_zone_full()? It will always dereference
> > > zonelist->zlcache_ptr and immediately return without doing anything.
> > >
> > > Wouldn't it better to just add
> > >
> > > if (num_online_nodes() == 1)
> > > continue;
> > >
> > > right before this call to zlc_mark_zone_full()? This should compile out
> > > the remainder of the loop for !CONFIG_NUMA kernels anyway.
> >
> > Shouldn't it already do that? NUMA_BUILD is defined as 0 when
> > !CONFIG_NUMA to avoid #ifdef's in the code while still allowing compiler
> > error checking in the dead code.
> >
>
> Yeah, but adding the check on num_online_nodes() also prevents needlessly
> calling zlc_mark_zone_full() on CONFIG_NUMA kernels when running on an UMA
> machine.

I don't like this idea...

On a UMA system, zlc_mark_zone_full() isn't so expensive, but on a large
system adding one more branch is often costly.


2009-04-23 00:13:40

by Mel Gorman

[permalink] [raw]
Subject: Re: [PATCH 02/22] Do not sanity check order in the fast path

On Wed, Apr 22, 2009 at 10:30:15AM -0700, Dave Hansen wrote:
> On Wed, 2009-04-22 at 18:11 +0100, Mel Gorman wrote:
> > On Wed, Apr 22, 2009 at 09:13:11AM -0700, Dave Hansen wrote:
> > > On Wed, 2009-04-22 at 14:53 +0100, Mel Gorman wrote:
> > > > No user of the allocator API should be passing in an order >= MAX_ORDER
> > > > but we check for it on each and every allocation. Delete this check and
> > > > make it a VM_BUG_ON check further down the call path.
> > >
> > > Should we get the check re-added to some of the upper-level functions,
> > > then? Perhaps __get_free_pages() or things like alloc_pages_exact()?
> >
> > I don't think so, no. It just moves the source of the text bloat and
> > for the few callers that are asking for something that will never
> > succeed.
>
> Well, it's a matter of figuring out when it can succeed. Some of this
> stuff, we can figure out at compile-time. Others are a bit harder.
>

What do you suggest then? Some sort of constant that tells you the
maximum size you can ask for, for use by callers that think they might
ever request too much?

Shuffling the check around to other top-level helpers seems pointless to
me because as I said, it just moves text bloat from one place to the
next.

> > > I'm selfishly thinking of what I did in profile_init(). Can I slab
> > > alloc it? Nope. Page allocator? Nope. Oh, well, try vmalloc():
> > >
> > > prof_buffer = kzalloc(buffer_bytes, GFP_KERNEL);
> > > if (prof_buffer)
> > > return 0;
> > >
> > > prof_buffer = alloc_pages_exact(buffer_bytes, GFP_KERNEL|__GFP_ZERO);
> > > if (prof_buffer)
> > > return 0;
> > >
> > > prof_buffer = vmalloc(buffer_bytes);
> > > if (prof_buffer)
> > > return 0;
> > >
> > > free_cpumask_var(prof_cpu_mask);
> > > return -ENOMEM;
> > >
> >
> > Can this ever actually be asking for an order larger than MAX_ORDER
> > though? If so, you're condemning it to always behave poorly.
>
> Yeah. It is based on text size. Smaller kernels with trimmed configs
> and no modules have no problem fitting under MAX_ORDER, as do kernels
> with larger base page sizes.
>

It would seem that the right thing to have done here in the first place
then was

if (buffer_bytes > PAGE_SIZE << (MAX_ORDER-1))
return vmalloc(...)

kzalloc attempt

alloc_pages_exact attempt

> > > Same thing in __kmalloc_section_memmap():
> > >
> > > page = alloc_pages(GFP_KERNEL|__GFP_NOWARN, get_order(memmap_size));
> > > if (page)
> > > goto got_map_page;
> > >
> > > ret = vmalloc(memmap_size);
> > > if (ret)
> > > goto got_map_ptr;
> > >
> >
> > If I'm reading that right, the order will never be a stupid order. It can fail
> > for higher orders in which case it falls back to vmalloc() . For example,
> > to hit that limit, the section size for a 4K kernel, maximum usable order
> > of 10, the section size would need to be 256MB (assuming struct page size
> > of 64 bytes). I don't think it's ever that size and if so, it'll always be
> > sub-optimal which is a poor choice to make.
>
> I think the section size default used to be 512M on x86 because we
> concentrate on removing whole DIMMs.
>

It was a poor choice then, as their sections always ended up in
vmalloc(), or else they were using the bootmem allocator, in which case it
doesn't matter what the core page allocator was doing.

> > > I depend on the allocator to tell me when I've fed it too high of an
> > > order. If we really need this, perhaps we should do an audit and then
> > > add a WARN_ON() for a few releases to catch the stragglers.
> >
> > I consider it buggy to ask for something so large that you always end up
> > with the worst option - vmalloc(). How about leaving it as a VM_BUG_ON
> > to get as many reports as possible on who is depending on this odd
> > behaviour?
> >
> > If there are users with good reasons, then we could convert this to WARN_ON
> > to fix up the callers. I suspect that the allocator can already cope with
> > recieving a stupid order silently but slowly. It should go all the way to the
> > bottom and just never find anything useful and return NULL. zone_watermark_ok
> > is the most dangerous looking part but even it should never get to MAX_ORDER
> > because it should always find there are not enough free pages and return
> > before it overruns.
>
> Whatever we do, I'd agree that it's fine that this is a degenerate case
> that gets handled very slowly and as far out of hot paths as possible.
> Anybody who can fall back to a vmalloc is not doing these things very
> often.
>

If that's the case, the simplest course might be to just drop the VM_BUG_ON()
in a separate patch, after asserting it's safe to call into the page
allocator with too large an order, with the consequence that it becomes a
relatively expensive call considering it can never succeed.

--
Mel Gorman
Part-time Phd Student Linux Technology Center
University of Limerick IBM Dublin Software Lab

2009-04-23 00:19:35

by Mel Gorman

[permalink] [raw]
Subject: Re: [PATCH 16/22] Do not setup zonelist cache when there is only one node

On Wed, Apr 22, 2009 at 01:24:26PM -0700, David Rientjes wrote:
> On Wed, 22 Apr 2009, Mel Gorman wrote:
>
> > There is a zonelist cache which is used to track zones that are not in
> > the allowed cpuset or found to be recently full. This is to reduce cache
> > footprint on large machines. On smaller machines, it just incurs cost
> > for no gain. This patch only uses the zonelist cache when there are NUMA
> > nodes.
> >
> > Signed-off-by: Mel Gorman <[email protected]>
> > Reviewed-by: Christoph Lameter <[email protected]>
> > ---
> > mm/page_alloc.c | 7 +++++--
> > 1 files changed, 5 insertions(+), 2 deletions(-)
> >
> > diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> > index 7f45de1..e59bb80 100644
> > --- a/mm/page_alloc.c
> > +++ b/mm/page_alloc.c
> > @@ -1467,8 +1467,11 @@ this_zone_full:
> > if (NUMA_BUILD)
> > zlc_mark_zone_full(zonelist, z);
>
> If zonelist caching is never used for UMA machines, why should they ever
> call zlc_mark_zone_full()? It will always dereference
> zonelist->zlcache_ptr and immediately return without doing anything.
>
> Wouldn't it better to just add
>
> if (num_online_nodes() == 1)
> continue;
>

num_online_nodes() is actually a really heavy function. It calls hweight
on a bitmap, which is probably why this isn't happening already. There is
a nr_online_nodes later in the patchset though. With nr_online_nodes it's
a good idea to avoid the function call, so I've taken note to do that
patch in pass 2.

Thanks

> right before this call to zlc_mark_zone_full()? This should compile out
> the remainder of the loop for !CONFIG_NUMA kernels anyway.
>
> > try_next_zone:
> > - if (NUMA_BUILD && !did_zlc_setup) {
> > - /* we do zlc_setup after the first zone is tried */
> > + if (NUMA_BUILD && !did_zlc_setup && num_online_nodes() > 1) {
> > + /*
> > + * we do zlc_setup after the first zone is tried but only
> > + * if there are multiple nodes make it worthwhile
> > + */
> > allowednodes = zlc_setup(zonelist, alloc_flags);
> > zlc_active = 1;
> > did_zlc_setup = 1;
> > --
> > 1.5.6.5
>

--
Mel Gorman
Part-time Phd Student Linux Technology Center
University of Limerick IBM Dublin Software Lab
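
The cost difference Mel describes above amounts to the following contrast
(illustration only; the two wrapper names are made up, while
num_online_nodes(), num_node_state() and the nr_online_nodes counter from
patch 21 are the real symbols):

static inline bool multiple_nodes_slow(void)
{
	/* num_online_nodes() is num_node_state(N_ONLINE), i.e. a
	 * population count (hweight) over the online-node bitmap */
	return num_online_nodes() > 1;
}

static inline bool multiple_nodes_fast(void)
{
	/* nr_online_nodes is a plain int updated only when a node goes
	 * online or offline, so this is a single load */
	return nr_online_nodes > 1;
}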

2009-04-23 00:27:28

by KOSAKI Motohiro

[permalink] [raw]
Subject: Re: [PATCH 18/22] Use allocation flags as an index to the zone watermark

> On Wed, 2009-04-22 at 18:14 +0100, Mel Gorman wrote:
> > Preference of taste really. When I started a conversion to accessors, it
> > changed something recognised to something new that looked uglier to me.
> > Only one place cares about the union enough to access is via an array so
> > why spread it everywhere.
>
> Personally, I'd say for consistency. Someone looking at both forms
> wouldn't necessarily know that they refer to the same variables unless
> they know about the union.

Just for clarification...

AFAIK, the C language specification doesn't guarantee that they point to
the same values: a compiler may insert padding between struct members,
but it cannot insert padding into an array.

However, I don't think any gcc version actually does that, though perhaps
I missed some minor gcc release..


So I also like Dave's idea, but that is only a personal feeling.


2009-04-23 00:29:48

by Mel Gorman

[permalink] [raw]
Subject: Re: [PATCH 18/22] Use allocation flags as an index to the zone watermark

On Wed, Apr 22, 2009 at 01:06:07PM -0700, David Rientjes wrote:
> On Wed, 22 Apr 2009, Mel Gorman wrote:
>
> > diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> > index b174f2c..6030f49 100644
> > --- a/mm/page_alloc.c
> > +++ b/mm/page_alloc.c
> > @@ -1154,10 +1154,15 @@ failed:
> > return NULL;
> > }
> >
> > -#define ALLOC_NO_WATERMARKS 0x01 /* don't check watermarks at all */
> > -#define ALLOC_WMARK_MIN 0x02 /* use pages_min watermark */
> > -#define ALLOC_WMARK_LOW 0x04 /* use pages_low watermark */
> > -#define ALLOC_WMARK_HIGH 0x08 /* use pages_high watermark */
> > +/* The WMARK bits are used as an index zone->pages_mark */
> > +#define ALLOC_WMARK_MIN 0x00 /* use pages_min watermark */
> > +#define ALLOC_WMARK_LOW 0x01 /* use pages_low watermark */
> > +#define ALLOC_WMARK_HIGH 0x02 /* use pages_high watermark */
> > +#define ALLOC_NO_WATERMARKS 0x04 /* don't check watermarks at all */
> > +
> > +/* Mask to get the watermark bits */
> > +#define ALLOC_WMARK_MASK (ALLOC_NO_WATERMARKS-1)
> > +
> > #define ALLOC_HARDER 0x10 /* try to alloc harder */
> > #define ALLOC_HIGH 0x20 /* __GFP_HIGH set */
> > #define ALLOC_CPUSET 0x40 /* check for correct cpuset */
>
> The watermark flags should probably be members of an anonymous enum since
> they're being used as an index into an array. If another watermark were
> ever to be added it would require a value of 0x03, for instance.
>
> enum {
> ALLOC_WMARK_MIN,
> ALLOC_WMARK_LOW,
> ALLOC_WMARK_HIGH,
>
> ALLOC_WMARK_MASK = 0xf /* no more than 16 possible watermarks */
> };
>
> This eliminates ALLOC_NO_WATERMARKS and the caller that uses it would
> simply pass 0.
>

I'm missing something here. If ALLOC_NO_WATERMARKS were defined as zero
then things like this would break.

if (likely(!(gfp_mask & __GFP_NOMEMALLOC))) {
if (!in_interrupt() &&
((p->flags & PF_MEMALLOC) ||
unlikely(test_thread_flag(TIF_MEMDIE))))
alloc_flags |= ALLOC_NO_WATERMARKS;
}

Also, the ALLOC_HARDER and other alloc flags need to be redefined for
ALLOC_WMARK_MASK == 0xf. I know what you are getting at but it's a bit more
involved than you're making out and I'm not seeing an advantage.
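
For reference, a minimal sketch of how the flags above are meant to be used
(the pages_mark name follows the comment in the quoted hunk; this is not a
verbatim copy of the patch):

if (!(alloc_flags & ALLOC_NO_WATERMARKS)) {
	unsigned long mark;

	/* ALLOC_WMARK_MIN/LOW/HIGH are 0, 1 and 2, so the masked value
	 * indexes the pages_mark union directly. */
	mark = zone->pages_mark[alloc_flags & ALLOC_WMARK_MASK];
	if (!zone_watermark_ok(zone, order, mark,
			       classzone_idx, alloc_flags))
		goto try_next_zone;
}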

> > @@ -1445,12 +1450,7 @@ zonelist_scan:
> >
> > if (!(alloc_flags & ALLOC_NO_WATERMARKS)) {
>
> This would become
>
> if (alloc_flags & ALLOC_WMARK_MASK)
>

--
Mel Gorman
Part-time Phd Student Linux Technology Center
University of Limerick IBM Dublin Software Lab

2009-04-23 00:44:39

by Mel Gorman

[permalink] [raw]
Subject: Re: [PATCH 21/22] Use a pre-calculated value instead of num_online_nodes() in fast paths

On Wed, Apr 22, 2009 at 04:04:47PM -0700, David Rientjes wrote:
> On Wed, 22 Apr 2009, Mel Gorman wrote:
>
> > diff --git a/include/linux/nodemask.h b/include/linux/nodemask.h
> > index 848025c..474e73e 100644
> > --- a/include/linux/nodemask.h
> > +++ b/include/linux/nodemask.h
> > @@ -408,6 +408,19 @@ static inline int num_node_state(enum node_states state)
> > #define next_online_node(nid) next_node((nid), node_states[N_ONLINE])
> >
> > extern int nr_node_ids;
> > +extern int nr_online_nodes;
> > +
> > +static inline void node_set_online(int nid)
> > +{
> > + node_set_state(nid, N_ONLINE);
> > + nr_online_nodes = num_node_state(N_ONLINE);
> > +}
> > +
> > +static inline void node_set_offline(int nid)
> > +{
> > + node_clear_state(nid, N_ONLINE);
> > + nr_online_nodes = num_node_state(N_ONLINE);
> > +}
> > #else
> >
> > static inline int node_state(int node, enum node_states state)
>
> The later #define's of node_set_online() and node_set_offline() in
> include/linux/nodemask.h should probably be removed now.
>

You'd think, but you can enable memory hotplug without NUMA and
node_set_online() is called when adding memory. Even though those
functions are nops on !NUMA, they're necessary.

--
Mel Gorman
Part-time Phd Student Linux Technology Center
University of Limerick IBM Dublin Software Lab

2009-04-23 01:34:22

by Dave Hansen

[permalink] [raw]
Subject: Re: [PATCH 02/22] Do not sanity check order in the fast path

On Thu, 2009-04-23 at 01:13 +0100, Mel Gorman wrote:
> On Wed, Apr 22, 2009 at 10:30:15AM -0700, Dave Hansen wrote:
> > On Wed, 2009-04-22 at 18:11 +0100, Mel Gorman wrote:
> > > On Wed, Apr 22, 2009 at 09:13:11AM -0700, Dave Hansen wrote:
> > > > On Wed, 2009-04-22 at 14:53 +0100, Mel Gorman wrote:
> > > > > No user of the allocator API should be passing in an order >= MAX_ORDER
> > > > > but we check for it on each and every allocation. Delete this check and
> > > > > make it a VM_BUG_ON check further down the call path.
> > > >
> > > > Should we get the check re-added to some of the upper-level functions,
> > > > then? Perhaps __get_free_pages() or things like alloc_pages_exact()?
> > >
> > > I don't think so, no. It just moves the source of the text bloat and
> > > for the few callers that are asking for something that will never
> > > succeed.
> >
> > Well, it's a matter of figuring out when it can succeed. Some of this
> > stuff, we can figure out at compile-time. Others are a bit harder.
> >
>
> What do you suggest then? Some sort of constant that tells you the
> maximum size you can call for callers that think they might ever request
> too much?
>
> Shuffling the check around to other top-level helpers seems pointless to
> me because as I said, it just moves text bloat from one place to the
> next.

Do any of the actual fast paths pass 'order' in as a real variable? If
not, the compiler should be able to just take care of it. From a quick
scan, it does appear that at least a third of the direct alloc_pages()
users pass an explicit '0'. That should get optimized away
*immediately*.
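
As an illustration only (alloc_pages_checked is a made-up name, not an
existing API), a compile-time-constant order lets the compiler discard the
comparison entirely:

static inline struct page *alloc_pages_checked(gfp_t gfp_mask,
					       unsigned int order)
{
	/* For a literal order such as 0 this test folds away at compile
	 * time; only callers passing a variable order pay for the branch. */
	if (order >= MAX_ORDER)
		return NULL;
	return alloc_pages(gfp_mask, order);
}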

> > > > I'm selfishly thinking of what I did in profile_init(). Can I slab
> > > > alloc it? Nope. Page allocator? Nope. Oh, well, try vmalloc():
> > > >
> > > > prof_buffer = kzalloc(buffer_bytes, GFP_KERNEL);
> > > > if (prof_buffer)
> > > > return 0;
> > > >
> > > > prof_buffer = alloc_pages_exact(buffer_bytes, GFP_KERNEL|__GFP_ZERO);
> > > > if (prof_buffer)
> > > > return 0;
> > > >
> > > > prof_buffer = vmalloc(buffer_bytes);
> > > > if (prof_buffer)
> > > > return 0;
> > > >
> > > > free_cpumask_var(prof_cpu_mask);
> > > > return -ENOMEM;
> > > >
> > >
> > > Can this ever actually be asking for an order larger than MAX_ORDER
> > > though? If so, you're condemning it to always behave poorly.
> >
> > Yeah. It is based on text size. Smaller kernels with trimmed configs
> > and no modules have no problem fitting under MAX_ORDER, as do kernels
> > with larger base page sizes.
> >
>
> It would seem that the right thing to have done here in the first place
> then was
>
> if (buffer_bytes > PAGE_SIZE << (MAX_ORDER-1)
> return vmalloc(...)
>
> kzalloc attempt
>
> alloc_pages_exact attempt

Yeah, but honestly, I don't expect most users to get that
"(buffer_bytes > PAGE_SIZE << (MAX_ORDER-1)" right. It seems like
*exactly* the kind of thing we should be wrapping up in common code.
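
Something like the following hypothetical helper (the name is invented here)
is what that wrapping could look like:

/* True if a buffer of this size can, even in principle, be satisfied by
 * the page allocator on this configuration. */
static inline bool alloc_size_within_max_order(size_t bytes)
{
	return bytes <= ((size_t)PAGE_SIZE << (MAX_ORDER - 1));
}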

Perhaps we do need an alloc_pages_nocheck() for the users that do have a
true non-compile-time-constant 'order' and still know they don't need
the check.

> > > > Same thing in __kmalloc_section_memmap():
> > > >
> > > > page = alloc_pages(GFP_KERNEL|__GFP_NOWARN, get_order(memmap_size));
> > > > if (page)
> > > > goto got_map_page;
> > > >
> > > > ret = vmalloc(memmap_size);
> > > > if (ret)
> > > > goto got_map_ptr;
> > > >
> > >
> > > If I'm reading that right, the order will never be a stupid order. It can fail
> > > for higher orders in which case it falls back to vmalloc() . For example,
> > > to hit that limit, the section size for a 4K kernel, maximum usable order
> > > of 10, the section size would need to be 256MB (assuming struct page size
> > > of 64 bytes). I don't think it's ever that size and if so, it'll always be
> > > sub-optimal which is a poor choice to make.
> >
> > I think the section size default used to be 512M on x86 because we
> > concentrate on removing whole DIMMs.
> >
>
> It was a poor choice then as their sections always ended up in
> vmalloc() or else it was using the bootmem allocator in which case it
> doesn't matter that the core page allocator was doing.

True, but we tried to code that sucker to work anywhere and to be as
optimal as possible (which vmalloc() is not) when we could.

> > > > I depend on the allocator to tell me when I've fed it too high of an
> > > > order. If we really need this, perhaps we should do an audit and then
> > > > add a WARN_ON() for a few releases to catch the stragglers.
> > >
> > > I consider it buggy to ask for something so large that you always end up
> > > with the worst option - vmalloc(). How about leaving it as a VM_BUG_ON
> > > to get as many reports as possible on who is depending on this odd
> > > behaviour?
> > >
> > > If there are users with good reasons, then we could convert this to WARN_ON
> > > to fix up the callers. I suspect that the allocator can already cope with
> > > recieving a stupid order silently but slowly. It should go all the way to the
> > > bottom and just never find anything useful and return NULL. zone_watermark_ok
> > > is the most dangerous looking part but even it should never get to MAX_ORDER
> > > because it should always find there are not enough free pages and return
> > > before it overruns.
> >
> > Whatever we do, I'd agree that it's fine that this is a degenerate case
> > that gets handled very slowly and as far out of hot paths as possible.
> > Anybody who can fall back to a vmalloc is not doing these things very
> > often.
>
> If that's the case, the simpliest course might be to just drop the VM_BUG_ON()
> as a separate patch after asserting it's safe to call into the page
> allocator with too large an order with the consequence of it being a
> relatively expensive call considering it can never succeed.

__rmqueue_smallest() seems to do the right thing and it is awfully deep
in the allocator.
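
Roughly the shape of the loop in question, paraphrased rather than quoted:

/* With order >= MAX_ORDER the loop condition fails on the first pass,
 * so the search finds nothing and the caller simply gets NULL back. */
for (current_order = order; current_order < MAX_ORDER; ++current_order) {
	area = &zone->free_area[current_order];
	if (list_empty(&area->free_list[migratetype]))
		continue;
	/* ... take a page off the list, expand() the remainder, return it ... */
}
return NULL;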

How about this: I'll go and audit the use of order in page_alloc.c to
make sure that having an order>MAX_ORDER-1 floating around is OK and
won't break anything. I'll also go and see what the actual .text size
changes are from this patch both for alloc_pages() and
alloc_pages_node() separately to make sure what we're dealing with here.
Does this check even *exist* in the optimized code very often?

Deal? :)

-- Dave

2009-04-23 08:04:40

by Pekka Enberg

[permalink] [raw]
Subject: Re: [PATCH 02/22] Do not sanity check order in the fast path

On Wed, 2009-04-22 at 13:11 -0700, David Rientjes wrote:
> On Wed, 22 Apr 2009, Mel Gorman wrote:
>
> > If there are users with good reasons, then we could convert this to WARN_ON
> > to fix up the callers. I suspect that the allocator can already cope with
> > recieving a stupid order silently but slowly. It should go all the way to the
> > bottom and just never find anything useful and return NULL. zone_watermark_ok
> > is the most dangerous looking part but even it should never get to MAX_ORDER
> > because it should always find there are not enough free pages and return
> > before it overruns.

> slub: enforce MAX_ORDER
>
> slub_max_order may not be equal to or greater than MAX_ORDER.
>
> Additionally, if a single object cannot be placed in a slab of
> slub_max_order, it still must allocate slabs below MAX_ORDER.
>
> Cc: Christoph Lameter <[email protected]>
> Signed-off-by: David Rientjes <[email protected]>

Applied, thanks!

2009-04-23 09:58:37

by Mel Gorman

[permalink] [raw]
Subject: Re: [PATCH 02/22] Do not sanity check order in the fast path

On Wed, Apr 22, 2009 at 06:34:07PM -0700, Dave Hansen wrote:
> On Thu, 2009-04-23 at 01:13 +0100, Mel Gorman wrote:
> > On Wed, Apr 22, 2009 at 10:30:15AM -0700, Dave Hansen wrote:
> > > On Wed, 2009-04-22 at 18:11 +0100, Mel Gorman wrote:
> > > > On Wed, Apr 22, 2009 at 09:13:11AM -0700, Dave Hansen wrote:
> > > > > On Wed, 2009-04-22 at 14:53 +0100, Mel Gorman wrote:
> > > > > > No user of the allocator API should be passing in an order >= MAX_ORDER
> > > > > > but we check for it on each and every allocation. Delete this check and
> > > > > > make it a VM_BUG_ON check further down the call path.
> > > > >
> > > > > Should we get the check re-added to some of the upper-level functions,
> > > > > then? Perhaps __get_free_pages() or things like alloc_pages_exact()?
> > > >
> > > > I don't think so, no. It just moves the source of the text bloat and
> > > > for the few callers that are asking for something that will never
> > > > succeed.
> > >
> > > Well, it's a matter of figuring out when it can succeed. Some of this
> > > stuff, we can figure out at compile-time. Others are a bit harder.
> > >
> >
> > What do you suggest then? Some sort of constant that tells you the
> > maximum size you can call for callers that think they might ever request
> > too much?
> >
> > Shuffling the check around to other top-level helpers seems pointless to
> > me because as I said, it just moves text bloat from one place to the
> > next.
>
> Do any of the actual fast paths pass 'order' in as a real variable? If
> not, the compiler should be able to just take care of it. From a quick
> scan, it does appear that at least a third of the direct alloc_pages()
> users pass an explicit '0'. That should get optimized away
> *immediately*.
>

You'd think so, but here are the vmlinux figures

3355718 0001-Replace-__alloc_pages_internal-with-__alloc_pages_.patch
3355622 0002-Do-not-sanity-check-order-in-the-fast-path.patch

That's a mostly-modules kernel configuration and it's still making a
difference to the text size. That's enough to care about, and one of the
things I was trying to keep in mind was branches in the fast paths.

> > > > > I'm selfishly thinking of what I did in profile_init(). Can I slab
> > > > > alloc it? Nope. Page allocator? Nope. Oh, well, try vmalloc():
> > > > >
> > > > > prof_buffer = kzalloc(buffer_bytes, GFP_KERNEL);
> > > > > if (prof_buffer)
> > > > > return 0;
> > > > >
> > > > > prof_buffer = alloc_pages_exact(buffer_bytes, GFP_KERNEL|__GFP_ZERO);
> > > > > if (prof_buffer)
> > > > > return 0;
> > > > >
> > > > > prof_buffer = vmalloc(buffer_bytes);
> > > > > if (prof_buffer)
> > > > > return 0;
> > > > >
> > > > > free_cpumask_var(prof_cpu_mask);
> > > > > return -ENOMEM;
> > > > >
> > > >
> > > > Can this ever actually be asking for an order larger than MAX_ORDER
> > > > though? If so, you're condemning it to always behave poorly.
> > >
> > > Yeah. It is based on text size. Smaller kernels with trimmed configs
> > > and no modules have no problem fitting under MAX_ORDER, as do kernels
> > > with larger base page sizes.
> > >
> >
> > It would seem that the right thing to have done here in the first place
> > then was
> >
> > if (buffer_bytes > PAGE_SIZE << (MAX_ORDER-1)
> > return vmalloc(...)
> >
> > kzalloc attempt
> >
> > alloc_pages_exact attempt
>
> Yeah, but honestly, I don't expect most users to get that "(buffer_bytes
> > PAGE_SIZE << (MAX_ORDER-1)" right. It seems like *exactly* the kind
> of thing we should be wrapping up in common code.
>

No, I wouldn't expect them to get it right. I'd expect some sort of
helper to exist all right if there were enough of these callers to be
fixed up.

> Perhaps we do need an alloc_pages_nocheck() for the users that do have a
> true non-compile-time-constant 'order' and still know they don't need
> the check.
>

I'd do the opposite if we were doing this - alloc_pages_checksize() because
there are far more callers that are allocating sensible sizes than the
opposite. That said, making sure the allocator can handle a busted order
(even if we call all the way down to __rmqueue_smallest) seems the best
compromise between not having checks high in the fast path and callers that
want to chance their arm calling in and falling back to vmalloc() if it
doesn't work out.

> > > > > Same thing in __kmalloc_section_memmap():
> > > > >
> > > > > page = alloc_pages(GFP_KERNEL|__GFP_NOWARN, get_order(memmap_size));
> > > > > if (page)
> > > > > goto got_map_page;
> > > > >
> > > > > ret = vmalloc(memmap_size);
> > > > > if (ret)
> > > > > goto got_map_ptr;
> > > > >
> > > >
> > > > If I'm reading that right, the order will never be a stupid order. It can fail
> > > > for higher orders in which case it falls back to vmalloc() . For example,
> > > > to hit that limit, the section size for a 4K kernel, maximum usable order
> > > > of 10, the section size would need to be 256MB (assuming struct page size
> > > > of 64 bytes). I don't think it's ever that size and if so, it'll always be
> > > > sub-optimal which is a poor choice to make.
> > >
> > > I think the section size default used to be 512M on x86 because we
> > > concentrate on removing whole DIMMs.
> > >
> >
> > It was a poor choice then as their sections always ended up in
> > vmalloc() or else it was using the bootmem allocator in which case it
> > doesn't matter that the core page allocator was doing.
>
> True, but we tried to code that sucker to work anywhere and to be as
> optimal as possible (which vmalloc() is not) when we could.
>

Ok, seems fair.

> > > > > I depend on the allocator to tell me when I've fed it too high of an
> > > > > order. If we really need this, perhaps we should do an audit and then
> > > > > add a WARN_ON() for a few releases to catch the stragglers.
> > > >
> > > > I consider it buggy to ask for something so large that you always end up
> > > > with the worst option - vmalloc(). How about leaving it as a VM_BUG_ON
> > > > to get as many reports as possible on who is depending on this odd
> > > > behaviour?
> > > >
> > > > If there are users with good reasons, then we could convert this to WARN_ON
> > > > to fix up the callers. I suspect that the allocator can already cope with
> > > > recieving a stupid order silently but slowly. It should go all the way to the
> > > > bottom and just never find anything useful and return NULL. zone_watermark_ok
> > > > is the most dangerous looking part but even it should never get to MAX_ORDER
> > > > because it should always find there are not enough free pages and return
> > > > before it overruns.
> > >
> > > Whatever we do, I'd agree that it's fine that this is a degenerate case
> > > that gets handled very slowly and as far out of hot paths as possible.
> > > Anybody who can fall back to a vmalloc is not doing these things very
> > > often.
> >
> > If that's the case, the simpliest course might be to just drop the VM_BUG_ON()
> > as a separate patch after asserting it's safe to call into the page
> > allocator with too large an order with the consequence of it being a
> > relatively expensive call considering it can never succeed.
>
> __rmqueue_smallest() seems to do the right thing and it is awfully deep
> in the allocator.
>

And I believe zone_watermark_ok() does the right thing as well by
deciding that there are not enough free pages before it overflows the
buffer.
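
Paraphrasing the relevant part of zone_watermark_ok() (not a verbatim quote):

long free_pages = zone_page_state(z, NR_FREE_PAGES) - (1 << order) + 1;
int o;

for (o = 0; o < order; o++) {
	/* Pages of this order are unusable for a larger request */
	free_pages -= z->free_area[o].nr_free << o;
	/* Require fewer higher-order pages to be free */
	min >>= 1;
	/* For a bogus order, free_pages goes negative and this trips long
	 * before 'o' could index past free_area[MAX_ORDER-1]. */
	if (free_pages <= min)
		return 0;
}
return 1;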

> How about this: I'll go and audit the use of order in page_alloc.c to
> make sure that having an order>MAX_ORDER-1 floating around is OK and
> won't break anything.

Great. Right now, I think it's ok but I haven't audited for this
explicitly, and a second set of eyes never hurts.

> I'll also go and see what the actual .text size
> changes are from this patch both for alloc_pages() and
> alloc_pages_node() separately to make sure what we're dealing with here.

It's .config specific of course, but check the leader where I posted vmlinux
sizes for each patch. scripts/bloat-o-meter by Matt Mackall also rules most
mightily for identifying where text increased or decreased between patches
in case you're not aware of it already.

> Does this check even *exist* in the optimized code very often?
>

Enough that it shrunk text size on my .config anyway.

> Deal? :)
>

Deal

--
Mel Gorman
Part-time Phd Student Linux Technology Center
University of Limerick IBM Dublin Software Lab

2009-04-23 10:04:11

by Mel Gorman

[permalink] [raw]
Subject: Re: [PATCH 18/22] Use allocation flags as an index to the zone watermark

On Thu, Apr 23, 2009 at 09:27:15AM +0900, KOSAKI Motohiro wrote:
> > On Wed, 2009-04-22 at 18:14 +0100, Mel Gorman wrote:
> > > Preference of taste really. When I started a conversion to accessors, it
> > > changed something recognised to something new that looked uglier to me.
> > > Only one place cares about the union enough to access is via an array so
> > > why spread it everywhere.
> >
> > Personally, I'd say for consistency. Someone looking at both forms
> > wouldn't necessarily know that they refer to the same variables unless
> > they know about the union.
>
> for just clalification...
>
> AFAIK, C language specification don't gurantee point same value.
> compiler can insert pad between struct-member and member, but not insert
> into array.
>

Considering that they are the same type for elements and arrays, I
didn't think padding would ever be a problem.
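
For anyone following along, the construct under discussion has this general
shape (member names are illustrative):

union {
	struct {
		unsigned long pages_min;
		unsigned long pages_low;
		unsigned long pages_high;
	};
	unsigned long pages_mark[3];	/* indexed by ALLOC_WMARK_* */
};

/* The array side cannot contain internal padding; the question raised is
 * whether the struct side is guaranteed an identical layout, which the
 * standard does not strictly promise. */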

> However, all gcc version don't do that. I think. but perhaps I missed
> some minor gcc release..
>
> So, I also like Dave's idea. but it only personal feeling.
>

The tide is against me on this one :).

How about I roll a patch on top of this set that replaces the union,
converting all the sites that use it? I figure that patch will go through a
few revisions before people are happy with the final API. However, as the
patch wouldn't change functionality, I'd like to see this series getting
wider testing if possible. The replace-union-with-single-array patch can
then be easily folded in when it settles.

Sound like a plan?

--
Mel Gorman
Part-time Phd Student Linux Technology Center
University of Limerick IBM Dublin Software Lab

2009-04-23 17:37:17

by Dave Hansen

[permalink] [raw]
Subject: Re: [PATCH 02/22] Do not sanity check order in the fast path

On Thu, 2009-04-23 at 10:58 +0100, Mel Gorman wrote:
> > How about this: I'll go and audit the use of order in page_alloc.c to
> > make sure that having an order>MAX_ORDER-1 floating around is OK and
> > won't break anything.
>
> Great. Right now, I think it's ok but I haven't audited for this
> explicily and a second set of eyes never hurts.

OK, after looking through this, I have a couple of ideas. One is that
we do the MAX_ORDER check in __alloc_pages_internal(), but *after* the
first call to get_page_from_freelist(). That's because I'm worried about
what would happen if we ever got into the reclaim code with a >MAX_ORDER
'order'. Such as:

void wakeup_kswapd(struct zone *zone, int order)
{
...
if (pgdat->kswapd_max_order < order)
pgdat->kswapd_max_order = order;
if (!cpuset_zone_allowed_hardwall(zone, GFP_KERNEL))
return;
if (!waitqueue_active(&pgdat->kswapd_wait))
return;
wake_up_interruptible(&pgdat->kswapd_wait);
}

unsigned long try_to_free_pages(struct zonelist *zonelist, int order,
gfp_t gfp_mask, nodemask_t *nodemask)
{
struct scan_control sc = {
...
.order = order,
.mem_cgroup = NULL,
.isolate_pages = isolate_pages_global,
.nodemask = nodemask,
};

return do_try_to_free_pages(zonelist, &sc);
}

This will keep us only checking 'order' once for each
alloc_pages_internal() call. It is an extra branch, but it is out of
the really, really hot path since we're about to start reclaim here
anyway.

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index e2f2699..1e3a01e 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1498,6 +1498,13 @@ restart:
zonelist, high_zoneidx, ALLOC_WMARK_LOW|ALLOC_CPUSET);
if (page)
goto got_pg;
+ /*
+ * We're out of the rocket-hot area above, so do a quick sanity
+ * check. We do this here to avoid ever trying to do any reclaim
+ * of >=MAX_ORDER areas which can never succeed, of course.
+ */
+ if (order >= MAX_ORDER)
+ goto nopage;

/*
* GFP_THISNODE (meaning __GFP_THISNODE, __GFP_NORETRY and


-- Dave

2009-04-23 19:26:43

by Dave Hansen

[permalink] [raw]
Subject: Re: [PATCH 02/22] Do not sanity check order in the fast path

On Wed, 2009-04-22 at 18:34 -0700, Dave Hansen wrote:
> I'll also go and see what the actual .text size
> changes are from this patch both for alloc_pages() and
> alloc_pages_node() separately to make sure what we're dealing with
> here. Does this check even *exist* in the optimized code very
> often?

While this isn't definitive by any means, I did get some interesting
results. Pulling the check out of alloc_pages() had no effect at *all*
on text size because I'm testing with CONFIG_NUMA=n.

$ size i386-T41-laptop.{0,1}/vmlinux
text data bss dec hex filename
4348625 286560 860160 5495345 53da31 i386-T41-laptop.0/vmlinux
4348625 286560 860160 5495345 53da31 i386-T41-laptop.1/vmlinux

We get a slightly different when pulling the check out of
alloc_pages_node():

$ size i386-T41-laptop.{1,2}/vmlinux
text data bss dec hex filename
4348625 286560 860160 5495345 53da31 i386-T41-laptop.1/vmlinux
4348601 286560 860160 5495321 53da19 i386-T41-laptop.2/vmlinux

$ bloat-o-meter i386-T41-laptop.1/vmlinux i386-T41-laptop.2/vmlinux
add/remove: 0/0 grow/shrink: 9/7 up/down: 78/-107 (-29)
function old new delta
__get_user_pages 717 751 +34
st_read 1936 1944 +8
shmem_truncate_range 1660 1667 +7
pci_create_slot 410 417 +7
sg_build_indirect 449 455 +6
n_tty_read 1336 1342 +6
find_vma_prepare 103 108 +5
as_update_iohist 617 621 +4
ntfs_readdir 3426 3427 +1
enlarge_buffer 343 341 -2
__get_free_pages 36 33 -3
dma_generic_alloc_coherent 207 202 -5
mempool_alloc_pages 33 17 -16
futex_lock_pi 2120 2104 -16
kallsyms_lookup_name 102 82 -20
cache_alloc_refill 1171 1126 -45

I'm going to retry this with a NUMA config.

-- Dave

2009-04-23 19:29:46

by David Rientjes

[permalink] [raw]
Subject: Re: [PATCH 21/22] Use a pre-calculated value instead of num_online_nodes() in fast paths

On Thu, 23 Apr 2009, Mel Gorman wrote:

> > > diff --git a/include/linux/nodemask.h b/include/linux/nodemask.h
> > > index 848025c..474e73e 100644
> > > --- a/include/linux/nodemask.h
> > > +++ b/include/linux/nodemask.h
> > > @@ -408,6 +408,19 @@ static inline int num_node_state(enum node_states state)
> > > #define next_online_node(nid) next_node((nid), node_states[N_ONLINE])
> > >
> > > extern int nr_node_ids;
> > > +extern int nr_online_nodes;
> > > +
> > > +static inline void node_set_online(int nid)
> > > +{
> > > + node_set_state(nid, N_ONLINE);
> > > + nr_online_nodes = num_node_state(N_ONLINE);
> > > +}
> > > +
> > > +static inline void node_set_offline(int nid)
> > > +{
> > > + node_clear_state(nid, N_ONLINE);
> > > + nr_online_nodes = num_node_state(N_ONLINE);
> > > +}
> > > #else
> > >
> > > static inline int node_state(int node, enum node_states state)
> >
> > The later #define's of node_set_online() and node_set_offline() in
> > include/linux/nodemask.h should probably be removed now.
> >
>
> You'd think, but you can enable memory hotplug without NUMA and
> node_set_online() is called when adding memory. Even though those
> functions are nops on !NUMA, they're necessary.
>

The problem is that your new functions above are never used because
node_set_online and node_set_offline are macro defined later in this
header for all cases, not just !CONFIG_NUMA.

You need this.
---
diff --git a/include/linux/nodemask.h b/include/linux/nodemask.h
--- a/include/linux/nodemask.h
+++ b/include/linux/nodemask.h
@@ -448,6 +448,9 @@ static inline int num_node_state(enum node_states state)
#define next_online_node(nid) (MAX_NUMNODES)
#define nr_node_ids 1
#define nr_online_nodes 1
+
+#define node_set_online(node) node_set_state((node), N_ONLINE)
+#define node_set_offline(node) node_clear_state((node), N_ONLINE)
#endif

#define node_online_map node_states[N_ONLINE]
@@ -467,9 +470,6 @@ static inline int num_node_state(enum node_states state)
#define node_online(node) node_state((node), N_ONLINE)
#define node_possible(node) node_state((node), N_POSSIBLE)

-#define node_set_online(node) node_set_state((node), N_ONLINE)
-#define node_set_offline(node) node_clear_state((node), N_ONLINE)
-
#define for_each_node(node) for_each_node_state(node, N_POSSIBLE)
#define for_each_online_node(node) for_each_node_state(node, N_ONLINE)

2009-04-23 19:46:00

by Dave Hansen

[permalink] [raw]
Subject: Re: [PATCH 02/22] Do not sanity check order in the fast path

On Thu, 2009-04-23 at 12:26 -0700, Dave Hansen wrote:
> I'm going to retry this with a NUMA config.

Oddly enough, removing the alloc_pages() check didn't do anything to
text on a NUMA=y config, either:

dave@kernel:~/work/mhp-build$ size i386-numaq-sparse.?/vmlinux
text data bss dec hex filename
3460792 449336 1626112 5536240 5479f0 i386-numaq-sparse.0/vmlinux
3460792 449336 1626112 5536240 5479f0 i386-numaq-sparse.1/vmlinux
3460776 449336 1626112 5536224 5479e0 i386-numaq-sparse.2/vmlinux

Here's bloatmeter from removing the alloc_pages_node() check:

dave@kernel:~/work/mhp-build$ ../linux-2.6.git/scripts/bloat-o-meter
i386-numaq-sparse.{1,2}/vmlinux
add/remove: 0/0 grow/shrink: 9/16 up/down: 81/-99 (-18)
function old new delta
st_int_ioctl 2600 2624 +24
tcp_sendmsg 2153 2169 +16
diskstats_show 739 753 +14
iov_shorten 49 58 +9
unmap_vmas 1653 1661 +8
sg_build_indirect 449 455 +6
ahc_linux_biosparam 251 253 +2
nlmclnt_call 557 558 +1
do_mount 1533 1534 +1
skb_icv_walk 420 419 -1
nfs_readpage_truncate_uninitialised_page 288 287 -1
find_first_zero_bit 67 66 -1
enlarge_buffer 339 337 -2
ahc_parse_msg 2340 2338 -2
__get_free_pages 36 33 -3
flock_lock_file_wait 484 480 -4
find_vma_prepare 108 103 -5
arp_ignore 104 99 -5
__udp4_lib_err 312 307 -5
alloc_buddy_huge_page 170 163 -7
pbus_size_mem 907 899 -8
dma_generic_alloc_coherent 245 236 -9
cache_alloc_refill 1168 1158 -10
mempool_alloc_pages 33 17 -16
do_mremap 1208 1188 -20

2009-04-23 22:50:44

by Andrew Morton

[permalink] [raw]
Subject: Re: [PATCH 02/22] Do not sanity check order in the fast path

On Wed, 22 Apr 2009 18:11:51 +0100
Mel Gorman <[email protected]> wrote:

> > I depend on the allocator to tell me when I've fed it too high of an
> > order. If we really need this, perhaps we should do an audit and then
> > add a WARN_ON() for a few releases to catch the stragglers.
> >
>
> I consider it buggy to ask for something so large that you always end up
> with the worst option - vmalloc().

Nevertheless, it's a pretty common pattern for initialisation code all
over the kernel to do

while (allocate(huge_amount) == NULL)
huge_amount /= 2;

and the proposed change will convert that from "works" to "either goes
BUG or mysteriously overindexes zone->free_area[] in
__rmqueue_smallest()", the latter of which is really nasty.

> How about leaving it as a VM_BUG_ON
> to get as many reports as possible on who is depending on this odd
> behaviour?

That would be quite disruptive. Even emitting a trace for each call
would be irritating. How's about this:

--- a/mm/page_alloc.c~page-allocator-do-not-sanity-check-order-in-the-fast-path-fix
+++ a/mm/page_alloc.c
@@ -1405,7 +1405,8 @@ get_page_from_freelist(gfp_t gfp_mask, n

classzone_idx = zone_idx(preferred_zone);

- VM_BUG_ON(order >= MAX_ORDER);
+ if (WARN_ON_ONCE(order >= MAX_ORDER))
+ return NULL;

zonelist_scan:
/*
_


and then we revisit later?

2009-04-23 22:54:40

by Andrew Morton

[permalink] [raw]
Subject: Re: [PATCH 07/22] Calculate the preferred zone for allocation only once

On Wed, 22 Apr 2009 14:53:12 +0100
Mel Gorman <[email protected]> wrote:

> get_page_from_freelist() can be called multiple times for an allocation.
> Part of this calculates the preferred_zone which is the first usable zone
> in the zonelist but the zone depends on the GFP flags specified at the
> beginning of the allocation call. This patch calculates preferred_zone
> once. It's safe to do this because if preferred_zone is NULL at the start
> of the call, no amount of direct reclaim or other actions will change the
> fact the allocation will fail.
>
>
> ...
>
> - (void)first_zones_zonelist(zonelist, high_zoneidx, nodemask,
> -
> &preferred_zone);
> ...
>
> + /* The preferred zone is used for statistics later */
> + (void)first_zones_zonelist(zonelist, high_zoneidx, nodemask,

Let's quietly zap that dopey cast.

--- a/mm/page_alloc.c~page-allocator-calculate-the-preferred-zone-for-allocation-only-once-fix
+++ a/mm/page_alloc.c
@@ -1775,8 +1775,7 @@ __alloc_pages_nodemask(gfp_t gfp_mask, u
return NULL;

/* The preferred zone is used for statistics later */
- (void)first_zones_zonelist(zonelist, high_zoneidx, nodemask,
- &preferred_zone);
+ first_zones_zonelist(zonelist, high_zoneidx, nodemask, &preferred_zone);
if (!preferred_zone)
return NULL;

_

2009-04-23 22:58:42

by Andrew Morton

[permalink] [raw]
Subject: Re: [PATCH 09/22] Calculate the alloc_flags for allocation only once

On Wed, 22 Apr 2009 14:53:14 +0100
Mel Gorman <[email protected]> wrote:

> Factor out the mapping between GFP and alloc_flags only once. Once factored
> out, it only needs to be calculated once but some care must be taken.
>
> [[email protected] says]
> As the test:
>
> - if (((p->flags & PF_MEMALLOC) || unlikely(test_thread_flag(TIF_MEMDIE)))
> - && !in_interrupt()) {
> - if (!(gfp_mask & __GFP_NOMEMALLOC)) {
>
> has been replaced with a slightly weaker one:
>
> + if (alloc_flags & ALLOC_NO_WATERMARKS) {
>
> Without care, this would allow recursion into the allocator via direct
> reclaim. This patch ensures we do not recurse when PF_MEMALLOC is set
> but TF_MEMDIE callers are now allowed to directly reclaim where they
> would have been prevented in the past.
>
> ...
>
> +static inline int
> +gfp_to_alloc_flags(gfp_t gfp_mask)
> +{
> + struct task_struct *p = current;
> + int alloc_flags = ALLOC_WMARK_MIN | ALLOC_CPUSET;
> + const gfp_t wait = gfp_mask & __GFP_WAIT;
> +
> + /*
> + * The caller may dip into page reserves a bit more if the caller
> + * cannot run direct reclaim, or if the caller has realtime scheduling
> + * policy or is asking for __GFP_HIGH memory. GFP_ATOMIC requests will
> + * set both ALLOC_HARDER (!wait) and ALLOC_HIGH (__GFP_HIGH).
> + */
> + if (gfp_mask & __GFP_HIGH)
> + alloc_flags |= ALLOC_HIGH;
> +
> + if (!wait) {
> + alloc_flags |= ALLOC_HARDER;
> + /*
> + * Ignore cpuset if GFP_ATOMIC (!wait) rather than fail alloc.
> + * See also cpuset_zone_allowed() comment in kernel/cpuset.c.
> + */
> + alloc_flags &= ~ALLOC_CPUSET;
> + } else if (unlikely(rt_task(p)))
> + alloc_flags |= ALLOC_HARDER;
> +
> + if (likely(!(gfp_mask & __GFP_NOMEMALLOC))) {
> + if (!in_interrupt() &&
> + ((p->flags & PF_MEMALLOC) ||
> + unlikely(test_thread_flag(TIF_MEMDIE))))
> + alloc_flags |= ALLOC_NO_WATERMARKS;
> + }
> +
> + return alloc_flags;
> +}

hm. Was there a particular reason for the explicit inline?

It's OK as it stands, but might become suboptimal if we later add a
second caller?

2009-04-23 23:06:07

by Andrew Morton

[permalink] [raw]
Subject: Re: [PATCH 15/22] Do not disable interrupts in free_page_mlock()

On Wed, 22 Apr 2009 14:53:20 +0100
Mel Gorman <[email protected]> wrote:

> free_page_mlock() tests and clears PG_mlocked using locked versions of the
> bit operations. If set, it disables interrupts to update counters and this
> happens on every page free even though interrupts are disabled very shortly
> afterwards a second time. This is wasteful.

Well. It's only wasteful if the page was mlocked, which is rare.

> This patch splits what free_page_mlock() does. The bit check is still
> made. However, the update of counters is delayed until the interrupts are
> disabled and the non-lock version for clearing the bit is used. One potential
> weirdness with this split is that the counters do not get updated if the
> bad_page() check is triggered but a system showing bad pages is getting
> screwed already.
>
> Signed-off-by: Mel Gorman <[email protected]>
> Reviewed-by: Christoph Lameter <[email protected]>
> Reviewed-by: Pekka Enberg <[email protected]>
> Reviewed-by: KOSAKI Motohiro <[email protected]>
> ---
> mm/internal.h | 11 +++--------
> mm/page_alloc.c | 8 +++++++-
> 2 files changed, 10 insertions(+), 9 deletions(-)
>
> diff --git a/mm/internal.h b/mm/internal.h
> index 987bb03..58ec1bc 100644
> --- a/mm/internal.h
> +++ b/mm/internal.h
> @@ -157,14 +157,9 @@ static inline void mlock_migrate_page(struct page *newpage, struct page *page)
> */
> static inline void free_page_mlock(struct page *page)
> {
> - if (unlikely(TestClearPageMlocked(page))) {
> - unsigned long flags;
> -
> - local_irq_save(flags);
> - __dec_zone_page_state(page, NR_MLOCK);
> - __count_vm_event(UNEVICTABLE_MLOCKFREED);
> - local_irq_restore(flags);
> - }
> + __ClearPageMlocked(page);
> + __dec_zone_page_state(page, NR_MLOCK);
> + __count_vm_event(UNEVICTABLE_MLOCKFREED);
> }

The conscientious reviewer runs around and checks for free_page_mlock()
callers in other .c files which might be affected.

Only there are no such callers.

The reviewer's job would be reduced if free_page_mlock() wasn't
needlessly placed in a header file!

> #else /* CONFIG_HAVE_MLOCKED_PAGE_BIT */
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 67cafd0..7f45de1 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -499,7 +499,6 @@ static inline void __free_one_page(struct page *page,
>
> static inline int free_pages_check(struct page *page)
> {
> - free_page_mlock(page);
> if (unlikely(page_mapcount(page) |
> (page->mapping != NULL) |
> (page_count(page) != 0) |
> @@ -556,6 +555,7 @@ static void __free_pages_ok(struct page *page, unsigned int order)
> unsigned long flags;
> int i;
> int bad = 0;
> + int clearMlocked = PageMlocked(page);
>
> for (i = 0 ; i < (1 << order) ; ++i)
> bad += free_pages_check(page + i);
> @@ -571,6 +571,8 @@ static void __free_pages_ok(struct page *page, unsigned int order)
> kernel_map_pages(page, 1 << order, 0);
>
> local_irq_save(flags);
> + if (unlikely(clearMlocked))
> + free_page_mlock(page);

I wonder what the compiler does in the case
CONFIG_HAVE_MLOCKED_PAGE_BIT=n. If it is dumb, this patch would cause
additional code generation.

2009-04-23 23:13:26

by Andrew Morton

[permalink] [raw]
Subject: Re: [PATCH 19/22] Update NR_FREE_PAGES only as necessary

On Wed, 22 Apr 2009 14:53:24 +0100
Mel Gorman <[email protected]> wrote:

> When pages are being freed to the buddy allocator, the zone
> NR_FREE_PAGES counter must be updated. In the case of bulk per-cpu page
> freeing, it's updated once per page. This retouches cache lines more
> than necessary. Update the counters one per per-cpu bulk free.
>
> ...
>
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -460,7 +460,6 @@ static inline void __free_one_page(struct page *page,
> int migratetype)
> {
> unsigned long page_idx;
> - int order_size = 1 << order;
>
> if (unlikely(PageCompound(page)))
> if (unlikely(destroy_compound_page(page, order)))
> @@ -470,10 +469,9 @@ static inline void __free_one_page(struct page *page,
>
> page_idx = page_to_pfn(page) & ((1 << MAX_ORDER) - 1);
>
> - VM_BUG_ON(page_idx & (order_size - 1));
> + VM_BUG_ON(page_idx & ((1 << order) - 1));
> VM_BUG_ON(bad_range(zone, page));
>

<head spins>

Is this all a slow and obscure way of doing

VM_BUG_ON(order > MAX_ORDER);

?

If not, what _is_ it asserting?

Subject: Re: [PATCH 19/22] Update NR_FREE_PAGES only as necessary

On Thu, 23 Apr 2009, Andrew Morton wrote:

> If not, what _is_ it asserting?

That the alignment of an order N page in the max order block is proper?
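
A concrete example of what it catches (values picked for illustration):

/* order = 3: a buddy block must start on an 8-page boundary within its
 * MAX_ORDER block, so the low three bits of page_idx must be clear.
 *
 *   page_idx = 40: 40 & ((1 << 3) - 1) == 40 & 7 == 0  -> aligned
 *   page_idx = 42: 42 & 7 == 2                         -> VM_BUG_ON fires
 */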

2009-04-24 00:07:29

by KOSAKI Motohiro

[permalink] [raw]
Subject: Re: [PATCH 15/22] Do not disable interrupts in free_page_mlock()

> > @@ -556,6 +555,7 @@ static void __free_pages_ok(struct page *page, unsigned int order)
> > unsigned long flags;
> > int i;
> > int bad = 0;
> > + int clearMlocked = PageMlocked(page);
> >
> > for (i = 0 ; i < (1 << order) ; ++i)
> > bad += free_pages_check(page + i);
> > @@ -571,6 +571,8 @@ static void __free_pages_ok(struct page *page, unsigned int order)
> > kernel_map_pages(page, 1 << order, 0);
> >
> > local_irq_save(flags);
> > + if (unlikely(clearMlocked))
> > + free_page_mlock(page);
>
> I wonder what the compiler does in the case
> CONFIG_HAVE_MLOCKED_PAGE_BIT=n. If it is dumb, this patch would cause
> additional code generation.

If CONFIG_HAVE_MLOCKED_PAGE_BIT=n, PageMlocked() is just { return 0; }, so
gcc can remove the following code entirely, I think:

	if (0)
		free_page_mlock(page)
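
In other words, a sketch of the =n case (not a verbatim quote of the page
flag headers):

/* Without the page bit, the test helper is a constant... */
static inline int PageMlocked(struct page *page) { return 0; }

/* ...so in __free_pages_ok() the new code collapses to if (0) and the
 * call is eliminated: */
int clearMlocked = PageMlocked(page);

if (unlikely(clearMlocked))
	free_page_mlock(page);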


2009-04-24 00:34:08

by KOSAKI Motohiro

[permalink] [raw]
Subject: Re: [PATCH 15/22] Do not disable interrupts in free_page_mlock()

> > @@ -157,14 +157,9 @@ static inline void mlock_migrate_page(struct page *newpage, struct page *page)
> > */
> > static inline void free_page_mlock(struct page *page)
> > {
> > - if (unlikely(TestClearPageMlocked(page))) {
> > - unsigned long flags;
> > -
> > - local_irq_save(flags);
> > - __dec_zone_page_state(page, NR_MLOCK);
> > - __count_vm_event(UNEVICTABLE_MLOCKFREED);
> > - local_irq_restore(flags);
> > - }
> > + __ClearPageMlocked(page);
> > + __dec_zone_page_state(page, NR_MLOCK);
> > + __count_vm_event(UNEVICTABLE_MLOCKFREED);
> > }
>
> The conscientuous reviewer runs around and checks for free_page_mlock()
> callers in other .c files which might be affected.
>
> Only there are no such callers.
>
> The reviewer's job would be reduced if free_page_mlock() wasn't
> needlessly placed in a header file!

very sorry.

How about this?

=============================================
Subject: [PATCH] move free_page_mlock() to page_alloc.c

Currently, free_page_mlock() is only called from page_alloc.c.
Thus, we can move it to page_alloc.c.

Cc: Lee Schermerhorn <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Christoph Lameter <[email protected]>
Cc: Pekka Enberg <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: Andrew Morton <[email protected]>
Signed-off-by: KOSAKI Motohiro <[email protected]>
---
mm/internal.h | 18 ------------------
mm/page_alloc.c | 21 +++++++++++++++++++++
2 files changed, 21 insertions(+), 18 deletions(-)

Index: b/mm/internal.h
===================================================================
--- a/mm/internal.h 2009-04-24 09:12:03.000000000 +0900
+++ b/mm/internal.h 2009-04-24 09:12:10.000000000 +0900
@@ -150,23 +150,6 @@ static inline void mlock_migrate_page(st
}
}

-/*
- * free_page_mlock() -- clean up attempts to free and mlocked() page.
- * Page should not be on lru, so no need to fix that up.
- * free_pages_check() will verify...
- */
-static inline void free_page_mlock(struct page *page)
-{
- if (unlikely(TestClearPageMlocked(page))) {
- unsigned long flags;
-
- local_irq_save(flags);
- __dec_zone_page_state(page, NR_MLOCK);
- __count_vm_event(UNEVICTABLE_MLOCKFREED);
- local_irq_restore(flags);
- }
-}
-
#else /* CONFIG_HAVE_MLOCKED_PAGE_BIT */
static inline int is_mlocked_vma(struct vm_area_struct *v, struct page *p)
{
@@ -175,7 +158,6 @@ static inline int is_mlocked_vma(struct
static inline void clear_page_mlock(struct page *page) { }
static inline void mlock_vma_page(struct page *page) { }
static inline void mlock_migrate_page(struct page *new, struct page *old) { }
-static inline void free_page_mlock(struct page *page) { }

#endif /* CONFIG_HAVE_MLOCKED_PAGE_BIT */

Index: b/mm/page_alloc.c
===================================================================
--- a/mm/page_alloc.c 2009-04-24 09:12:03.000000000 +0900
+++ b/mm/page_alloc.c 2009-04-24 09:13:25.000000000 +0900
@@ -491,6 +491,27 @@ static inline void __free_one_page(struc
zone->free_area[order].nr_free++;
}

+#ifdef CONFIG_HAVE_MLOCKED_PAGE_BIT
+/*
+ * free_page_mlock() -- clean up attempts to free and mlocked() page.
+ * Page should not be on lru, so no need to fix that up.
+ * free_pages_check() will verify...
+ */
+static void free_page_mlock(struct page *page)
+{
+ if (unlikely(TestClearPageMlocked(page))) {
+ unsigned long flags;
+
+ local_irq_save(flags);
+ __dec_zone_page_state(page, NR_MLOCK);
+ __count_vm_event(UNEVICTABLE_MLOCKFREED);
+ local_irq_restore(flags);
+ }
+}
+#else
+static void free_page_mlock(struct page *page) { }
+#endif
+
static inline int free_pages_check(struct page *page)
{
free_page_mlock(page);

2009-04-24 02:57:19

by KOSAKI Motohiro

[permalink] [raw]
Subject: Re: [PATCH 02/22] Do not sanity check order in the fast path

> On Thu, 2009-04-23 at 10:58 +0100, Mel Gorman wrote:
> > > How about this: I'll go and audit the use of order in page_alloc.c to
> > > make sure that having an order>MAX_ORDER-1 floating around is OK and
> > > won't break anything.
> >
> > Great. Right now, I think it's ok but I haven't audited for this
> > explicily and a second set of eyes never hurts.
>
> OK, after looking through this, I have a couple of ideas. One is that
> we do the MAX_ORDER check in __alloc_pages_internal(), but *after* the
> first call to get_page_from_freelist(). That's because I'm worried if
> we ever got into the reclaim code with a >MAX_ORDER 'order'. Such as:
>
> void wakeup_kswapd(struct zone *zone, int order)
> {
> ...
> if (pgdat->kswapd_max_order < order)
> pgdat->kswapd_max_order = order;
> if (!cpuset_zone_allowed_hardwall(zone, GFP_KERNEL))
> return;
> if (!waitqueue_active(&pgdat->kswapd_wait))
> return;
> wake_up_interruptible(&pgdat->kswapd_wait);
> }
>
> unsigned long try_to_free_pages(struct zonelist *zonelist, int order,
> gfp_t gfp_mask, nodemask_t *nodemask)
> {
> struct scan_control sc = {
> ...
> .order = order,
> .mem_cgroup = NULL,
> .isolate_pages = isolate_pages_global,
> .nodemask = nodemask,
> };
>
> return do_try_to_free_pages(zonelist, &sc);
> }
>
> This will keep us only checking 'order' once for each
> alloc_pages_internal() call. It is an extra branch, but it is out of
> the really, really hot path since we're about to start reclaim here
> anyway.
>
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index e2f2699..1e3a01e 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -1498,6 +1498,13 @@ restart:
> zonelist, high_zoneidx, ALLOC_WMARK_LOW|ALLOC_CPUSET);
> if (page)
> goto got_pg;
> + /*
> + * We're out of the rocket-hot area above, so do a quick sanity
> + * check. We do this here to avoid ever trying to do any reclaim
> + * of >=MAX_ORDER areas which can never succeed, of course.
> + */
> + if (order >= MAX_ORDER)
> + goto nopage;
>
> /*
> * GFP_THISNODE (meaning __GFP_THISNODE, __GFP_NORETRY and

Good point.
if (WARN_ON_ONCE(order >= MAX_ORDER)) is better?


2009-04-24 06:41:33

by KOSAKI Motohiro

[permalink] [raw]
Subject: Re: [PATCH 18/22] Use allocation flags as an index to the zone watermark

> Considering that they are the same type for elements and arrays, I
> didn't think padding would ever be a problem.
>
> > However, all gcc version don't do that. I think. but perhaps I missed
> > some minor gcc release..
> >
> > So, I also like Dave's idea. but it only personal feeling.
> >
>
> The tide is against me on this one :).
>
> How about I roll a patch on top of this set that replaces the union by
> calling all sites? I figure that patch will go through a few revisions before
> people are happy with the final API. However, as the patch wouldn't change
> functionality, I'd like to see this series getting wider testing if possible. The
> replace-union-with-single-array patch can be easily folded in then when
> it settles.
>
> Sound like a plan?

Yeah, I agree testing is more important than the ugliness discussion :)


2009-04-24 09:22:11

by Mel Gorman

[permalink] [raw]
Subject: Re: [PATCH 02/22] Do not sanity check order in the fast path

On Thu, Apr 23, 2009 at 12:45:30PM -0700, Dave Hansen wrote:
> On Thu, 2009-04-23 at 12:26 -0700, Dave Hansen wrote:
> > I'm going to retry this with a NUMA config.
>
> Oddly enough, removing the alloc_pages() check didn't do anything to
> text on a NUMA=y config, either:
>
> dave@kernel:~/work/mhp-build$ size i386-numaq-sparse.?/vmlinux
> text data bss dec hex filename
> 3460792 449336 1626112 5536240 5479f0 i386-numaq-sparse.0/vmlinux
> 3460792 449336 1626112 5536240 5479f0 i386-numaq-sparse.1/vmlinux
> 3460776 449336 1626112 5536224 5479e0 i386-numaq-sparse.2/vmlinux
>
> Here's bloatmeter from removing the alloc_pages_node() check:
>
> dave@kernel:~/work/mhp-build$ ../linux-2.6.git/scripts/bloat-o-meter
> i386-numaq-sparse.{1,2}/vmlinux
> add/remove: 0/0 grow/shrink: 9/16 up/down: 81/-99 (-18)
> function old new delta
> st_int_ioctl 2600 2624 +24
> tcp_sendmsg 2153 2169 +16
> diskstats_show 739 753 +14
> iov_shorten 49 58 +9
> unmap_vmas 1653 1661 +8
> sg_build_indirect 449 455 +6
> ahc_linux_biosparam 251 253 +2
> nlmclnt_call 557 558 +1
> do_mount 1533 1534 +1

It doesn't make sense at all that text increased in size. Did you make clean
between each .config change and patch application? Are you using distcc
or anything else that might cause confusion? I found I had to eliminate
distcc and clean after each patch application because net/ipv4/udp.c
would sometimes generate different assembly when accessing struct zone.
Not really sure what was going on there.

--
Mel Gorman
Part-time Phd Student Linux Technology Center
University of Limerick IBM Dublin Software Lab

2009-04-24 10:34:20

by Mel Gorman

[permalink] [raw]
Subject: Re: [PATCH 02/22] Do not sanity check order in the fast path

On Thu, Apr 23, 2009 at 10:36:50AM -0700, Dave Hansen wrote:
> On Thu, 2009-04-23 at 10:58 +0100, Mel Gorman wrote:
> > > How about this: I'll go and audit the use of order in page_alloc.c to
> > > make sure that having an order>MAX_ORDER-1 floating around is OK and
> > > won't break anything.
> >
> > Great. Right now, I think it's ok but I haven't audited for this
> > explicily and a second set of eyes never hurts.
>
> OK, after looking through this, I have a couple of ideas. One is that
> we do the MAX_ORDER check in __alloc_pages_internal(), but *after* the
> first call to get_page_from_freelist(). That's because I'm worried if
> we ever got into the reclaim code with a >MAX_ORDER 'order'. Such as:
>
> void wakeup_kswapd(struct zone *zone, int order)
> {
> ...
> if (pgdat->kswapd_max_order < order)
> pgdat->kswapd_max_order = order;
> if (!cpuset_zone_allowed_hardwall(zone, GFP_KERNEL))
> return;
> if (!waitqueue_active(&pgdat->kswapd_wait))
> return;
> wake_up_interruptible(&pgdat->kswapd_wait);
> }
>
> unsigned long try_to_free_pages(struct zonelist *zonelist, int order,
> gfp_t gfp_mask, nodemask_t *nodemask)
> {
> struct scan_control sc = {
> ...
> .order = order,
> .mem_cgroup = NULL,
> .isolate_pages = isolate_pages_global,
> .nodemask = nodemask,
> };
>
> return do_try_to_free_pages(zonelist, &sc);
> }
>

That is perfectly rational.

> This will keep us only checking 'order' once for each
> alloc_pages_internal() call. It is an extra branch, but it is out of
> the really, really hot path since we're about to start reclaim here
> anyway.
>

I combined yours and Andrew's suggestions into a patch that applies on
top of the series. Dave, as it's basically your patch it needs your
sign-off if you agree it's ok.

I tested this with a bodge that allocates with increasing orders up to a
very large number. As you'd expect, it worked until it hit order-11 on an
x86 machine and failed for every higher order by returning NULL.

It reports a warning once. We'll probably drop the warning in time but this
will be a chance to check if there are callers that really are being stupid
and are not just callers that are trying to get the best buffer for the job.

=====
Sanity check order in the page allocator slow path

Callers may speculatively call different allocators in order of preference
trying to allocate a buffer of a given size. The order needed to allocate
this may be larger than what the page allocator can normally handle. While
the allocator mostly does the right thing, it should not enter direct reclaim
or wake up kswapd with a bogus order. This patch sanity-checks the order in
the slow path and returns NULL if it is too large.

Needs-signed-off-by from Dave Hansen here before merging. Based on his
not-signed-off-by patch.
Signed-off-by: Mel Gorman <[email protected]>
---
mm/page_alloc.c | 10 +++++++++-
1 file changed, 9 insertions(+), 1 deletion(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 1464aca..1c60141 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1434,7 +1434,6 @@ get_page_from_freelist(gfp_t gfp_mask, nodemask_t *nodemask, unsigned int order,
int did_zlc_setup = 0; /* just call zlc_setup() one time */

classzone_idx = zone_idx(preferred_zone);
- VM_BUG_ON(order >= MAX_ORDER);

zonelist_scan:
/*
@@ -1692,6 +1691,15 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
struct task_struct *p = current;

/*
+ * In the slowpath, we sanity check order to avoid ever trying to
+ * reclaim >= MAX_ORDER areas which will never succeed. Callers may
+ * be using allocators in order of preference for an area that is
+ * too large.
+ */
+ if (WARN_ON_ONCE(order >= MAX_ORDER))
+ return NULL;
+
+ /*
* GFP_THISNODE (meaning __GFP_THISNODE, __GFP_NORETRY and
* __GFP_NOWARN set) should not cause reclaim since the subsystem
* (f.e. slab) using GFP_THISNODE may choose to trigger reclaim

2009-04-24 10:47:30

by Mel Gorman

[permalink] [raw]
Subject: Re: [PATCH 09/22] Calculate the alloc_flags for allocation only once

On Thu, Apr 23, 2009 at 03:52:16PM -0700, Andrew Morton wrote:
> On Wed, 22 Apr 2009 14:53:14 +0100
> Mel Gorman <[email protected]> wrote:
>
> > Factor out the mapping between GFP and alloc_flags only once. Once factored
> > out, it only needs to be calculated once but some care must be taken.
> >
> > [[email protected] says]
> > As the test:
> >
> > - if (((p->flags & PF_MEMALLOC) || unlikely(test_thread_flag(TIF_MEMDIE)))
> > - && !in_interrupt()) {
> > - if (!(gfp_mask & __GFP_NOMEMALLOC)) {
> >
> > has been replaced with a slightly weaker one:
> >
> > + if (alloc_flags & ALLOC_NO_WATERMARKS) {
> >
> > Without care, this would allow recursion into the allocator via direct
> > reclaim. This patch ensures we do not recurse when PF_MEMALLOC is set
> > but TF_MEMDIE callers are now allowed to directly reclaim where they
> > would have been prevented in the past.
> >
> > ...
> >
> > +static inline int
> > +gfp_to_alloc_flags(gfp_t gfp_mask)
> > +{
> > + struct task_struct *p = current;
> > + int alloc_flags = ALLOC_WMARK_MIN | ALLOC_CPUSET;
> > + const gfp_t wait = gfp_mask & __GFP_WAIT;
> > +
> > + /*
> > + * The caller may dip into page reserves a bit more if the caller
> > + * cannot run direct reclaim, or if the caller has realtime scheduling
> > + * policy or is asking for __GFP_HIGH memory. GFP_ATOMIC requests will
> > + * set both ALLOC_HARDER (!wait) and ALLOC_HIGH (__GFP_HIGH).
> > + */
> > + if (gfp_mask & __GFP_HIGH)
> > + alloc_flags |= ALLOC_HIGH;
> > +
> > + if (!wait) {
> > + alloc_flags |= ALLOC_HARDER;
> > + /*
> > + * Ignore cpuset if GFP_ATOMIC (!wait) rather than fail alloc.
> > + * See also cpuset_zone_allowed() comment in kernel/cpuset.c.
> > + */
> > + alloc_flags &= ~ALLOC_CPUSET;
> > + } else if (unlikely(rt_task(p)))
> > + alloc_flags |= ALLOC_HARDER;
> > +
> > + if (likely(!(gfp_mask & __GFP_NOMEMALLOC))) {
> > + if (!in_interrupt() &&
> > + ((p->flags & PF_MEMALLOC) ||
> > + unlikely(test_thread_flag(TIF_MEMDIE))))
> > + alloc_flags |= ALLOC_NO_WATERMARKS;
> > + }
> > +
> > + return alloc_flags;
> > +}
>
> hm. Was there a particular reason for the explicit inline?
>

Only because it was known there was only one caller.

> It's OK as it stands, but might become suboptimal if we later add a
> second caller?
>

This is true. As it is also in the slowpath and only called once, the
following patch should make no difference to performance but potentially
avoid a mistake later.

=======
Uninline gfp_to_alloc_flags() in the page allocator slow path

gfp_to_alloc_flags() is in the slowpath but inlined. While there is only one
caller now, a future second caller would add surprising text bloat. Uninline
it now to avoid surprises later, as it should have no performance impact.

Signed-off-by: Mel Gorman <[email protected]>
---
mm/page_alloc.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 1c60141..f08b4cb 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1639,7 +1639,7 @@ void wake_all_kswapd(unsigned int order, struct zonelist *zonelist,
wakeup_kswapd(zone, order);
}

-static inline int
+static int
gfp_to_alloc_flags(gfp_t gfp_mask)
{
struct task_struct *p = current;

2009-04-24 11:19:08

by Mel Gorman

[permalink] [raw]
Subject: Re: [PATCH 15/22] Do not disable interrupts in free_page_mlock()

On Thu, Apr 23, 2009 at 03:59:51PM -0700, Andrew Morton wrote:
> On Wed, 22 Apr 2009 14:53:20 +0100
> Mel Gorman <[email protected]> wrote:
>
> > free_page_mlock() tests and clears PG_mlocked using locked versions of the
> > bit operations. If set, it disables interrupts to update counters and this
> > happens on every page free even though interrupts are disabled very shortly
> > afterwards a second time. This is wasteful.
>
> Well. It's only wasteful if the page was mlocked, which is rare.
>

True. mlocked pages are only going to be torn down during process exit or
munmap, both of which you'd expect to be rare when mlock is involved.

s/This is wasteful/While rare, this is unnecessary./

?

> > This patch splits what free_page_mlock() does. The bit check is still
> > made. However, the update of counters is delayed until the interrupts are
> > disabled and the non-lock version for clearing the bit is used. One potential
> > weirdness with this split is that the counters do not get updated if the
> > bad_page() check is triggered but a system showing bad pages is getting
> > screwed already.
> >
> > Signed-off-by: Mel Gorman <[email protected]>
> > Reviewed-by: Christoph Lameter <[email protected]>
> > Reviewed-by: Pekka Enberg <[email protected]>
> > Reviewed-by: KOSAKI Motohiro <[email protected]>
> > ---
> > mm/internal.h | 11 +++--------
> > mm/page_alloc.c | 8 +++++++-
> > 2 files changed, 10 insertions(+), 9 deletions(-)
> >
> > diff --git a/mm/internal.h b/mm/internal.h
> > index 987bb03..58ec1bc 100644
> > --- a/mm/internal.h
> > +++ b/mm/internal.h
> > @@ -157,14 +157,9 @@ static inline void mlock_migrate_page(struct page *newpage, struct page *page)
> > */
> > static inline void free_page_mlock(struct page *page)
> > {
> > - if (unlikely(TestClearPageMlocked(page))) {
> > - unsigned long flags;
> > -
> > - local_irq_save(flags);
> > - __dec_zone_page_state(page, NR_MLOCK);
> > - __count_vm_event(UNEVICTABLE_MLOCKFREED);
> > - local_irq_restore(flags);
> > - }
> > + __ClearPageMlocked(page);
> > + __dec_zone_page_state(page, NR_MLOCK);
> > + __count_vm_event(UNEVICTABLE_MLOCKFREED);
> > }
>
> The conscientuous reviewer runs around and checks for free_page_mlock()
> callers in other .c files which might be affected.
>

The patch author should have done the same thing :/

> Only there are no such callers.
>
> The reviewer's job would be reduced if free_page_mlock() wasn't
> needlessly placed in a header file!
>
> > #else /* CONFIG_HAVE_MLOCKED_PAGE_BIT */
> > diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> > index 67cafd0..7f45de1 100644
> > --- a/mm/page_alloc.c
> > +++ b/mm/page_alloc.c
> > @@ -499,7 +499,6 @@ static inline void __free_one_page(struct page *page,
> >
> > static inline int free_pages_check(struct page *page)
> > {
> > - free_page_mlock(page);
> > if (unlikely(page_mapcount(page) |
> > (page->mapping != NULL) |
> > (page_count(page) != 0) |
> > @@ -556,6 +555,7 @@ static void __free_pages_ok(struct page *page, unsigned int order)
> > unsigned long flags;
> > int i;
> > int bad = 0;
> > + int clearMlocked = PageMlocked(page);
> >
> > for (i = 0 ; i < (1 << order) ; ++i)
> > bad += free_pages_check(page + i);
> > @@ -571,6 +571,8 @@ static void __free_pages_ok(struct page *page, unsigned int order)
> > kernel_map_pages(page, 1 << order, 0);
> >
> > local_irq_save(flags);
> > + if (unlikely(clearMlocked))
> > + free_page_mlock(page);
>
> I wonder what the compiler does in the case
> CONFIG_HAVE_MLOCKED_PAGE_BIT=n. If it is dumb, this patch would cause
> additional code generation.
>

PageMlocked becomes

static inline int PageMlocked(page)
{ return 0; }

so the compiler should be able to spot that clearMlocked will always be 0. Even
if it didn't, it should at least have spotted that free_page_mlock(page)
is an empty inline function in this case.

Still, I double-checked and did not see the branch with
CONFIG_HAVE_MLOCKED_PAGE_BIT=n.
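
As a rough illustration (not kernel code, just the constant folding being
relied on), the CONFIG_HAVE_MLOCKED_PAGE_BIT=n case reduces to something
like:

struct page;

/* !CONFIG_HAVE_MLOCKED_PAGE_BIT stubs */
static inline int PageMlocked(struct page *page) { return 0; }
static inline void free_page_mlock(struct page *page) { }

static void example_free_path(struct page *page)
{
	int clearMlocked = PageMlocked(page);	/* constant-folds to 0 */

	if (clearMlocked)			/* always false, branch removed */
		free_page_mlock(page);		/* empty inline, emits nothing */
}

so gcc has everything it needs to drop both the branch and the call.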

--
Mel Gorman
Part-time Phd Student Linux Technology Center
University of Limerick IBM Dublin Software Lab

2009-04-24 11:33:27

by Mel Gorman

[permalink] [raw]
Subject: Re: [PATCH 15/22] Do not disable interrupts in free_page_mlock()

On Fri, Apr 24, 2009 at 09:33:50AM +0900, KOSAKI Motohiro wrote:
> > > @@ -157,14 +157,9 @@ static inline void mlock_migrate_page(struct page *newpage, struct page *page)
> > > */
> > > static inline void free_page_mlock(struct page *page)
> > > {
> > > - if (unlikely(TestClearPageMlocked(page))) {
> > > - unsigned long flags;
> > > -
> > > - local_irq_save(flags);
> > > - __dec_zone_page_state(page, NR_MLOCK);
> > > - __count_vm_event(UNEVICTABLE_MLOCKFREED);
> > > - local_irq_restore(flags);
> > > - }
> > > + __ClearPageMlocked(page);
> > > + __dec_zone_page_state(page, NR_MLOCK);
> > > + __count_vm_event(UNEVICTABLE_MLOCKFREED);
> > > }
> >
> > The conscientuous reviewer runs around and checks for free_page_mlock()
> > callers in other .c files which might be affected.
> >
> > Only there are no such callers.
> >
> > The reviewer's job would be reduced if free_page_mlock() wasn't
> > needlessly placed in a header file!
>
> very sorry.
>
> How about this?
>
> =============================================
> Subject: [PATCH] move free_page_mlock() to page_alloc.c
>
> Currently, free_page_mlock() is only called from page_alloc.c.
> Thus, we can move it to page_alloc.c.
>

Looks good, but here is a version rebased on top of the patch series
where it would be easier to merge with "Do not disable interrupts in
free_page_mlock()".

I do note why it might be in the header though - it keeps all the
CONFIG_HAVE_MLOCKED_PAGE_BIT-related helper functions together making it
easier to find them. Lee, was that the intention?

=======
From: KOSAKI Motohiro <[email protected]>

Move free_page_mlock() from mm/internal.h to mm/page_alloc.c

Currently, free_page_mlock() is only called from page_alloc.c. This patch
moves it from a header to page_alloc.c.

[[email protected]: Rebase on top of page allocator patches]
Cc: Lee Schermerhorn <[email protected]>
Cc: Christoph Lameter <[email protected]>
Cc: Pekka Enberg <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: Andrew Morton <[email protected]>
---
mm/internal.h | 13 -------------
mm/page_alloc.c | 16 ++++++++++++++++
2 files changed, 16 insertions(+), 13 deletions(-)

diff --git a/mm/internal.h b/mm/internal.h
index 58ec1bc..4b1672a 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -150,18 +150,6 @@ static inline void mlock_migrate_page(struct page *newpage, struct page *page)
}
}

-/*
- * free_page_mlock() -- clean up attempts to free and mlocked() page.
- * Page should not be on lru, so no need to fix that up.
- * free_pages_check() will verify...
- */
-static inline void free_page_mlock(struct page *page)
-{
- __ClearPageMlocked(page);
- __dec_zone_page_state(page, NR_MLOCK);
- __count_vm_event(UNEVICTABLE_MLOCKFREED);
-}
-
#else /* CONFIG_HAVE_MLOCKED_PAGE_BIT */
static inline int is_mlocked_vma(struct vm_area_struct *v, struct page *p)
{
@@ -170,7 +158,6 @@ static inline int is_mlocked_vma(struct vm_area_struct *v, struct page *p)
static inline void clear_page_mlock(struct page *page) { }
static inline void mlock_vma_page(struct page *page) { }
static inline void mlock_migrate_page(struct page *new, struct page *old) { }
-static inline void free_page_mlock(struct page *page) { }

#endif /* CONFIG_HAVE_MLOCKED_PAGE_BIT */

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index f08b4cb..3db5f57 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -433,6 +433,22 @@ static inline int page_is_buddy(struct page *page, struct page *buddy,
return 0;
}

+#ifdef CONFIG_HAVE_MLOCKED_PAGE_BIT
+/*
+ * free_page_mlock() -- clean up attempts to free and mlocked() page.
+ * Page should not be on lru, so no need to fix that up.
+ * free_pages_check() will verify...
+ */
+static inline void free_page_mlock(struct page *page)
+{
+ __ClearPageMlocked(page);
+ __dec_zone_page_state(page, NR_MLOCK);
+ __count_vm_event(UNEVICTABLE_MLOCKFREED);
+}
+#else
+static inline void free_page_mlock(struct page *page) { }
+#endif /* CONFIG_HAVE_MLOCKED_PAGE_BIT */
+
/*
* Freeing function for a buddy system allocator.
*

--
Mel Gorman
Part-time Phd Student Linux Technology Center
University of Limerick IBM Dublin Software Lab

2009-04-24 11:52:29

by Lee Schermerhorn

[permalink] [raw]
Subject: Re: [PATCH 15/22] Do not disable interrupts in free_page_mlock()

On Fri, 2009-04-24 at 12:33 +0100, Mel Gorman wrote:
> On Fri, Apr 24, 2009 at 09:33:50AM +0900, KOSAKI Motohiro wrote:
> > > > @@ -157,14 +157,9 @@ static inline void mlock_migrate_page(struct page *newpage, struct page *page)
> > > > */
> > > > static inline void free_page_mlock(struct page *page)
> > > > {
> > > > - if (unlikely(TestClearPageMlocked(page))) {
> > > > - unsigned long flags;
> > > > -
> > > > - local_irq_save(flags);
> > > > - __dec_zone_page_state(page, NR_MLOCK);
> > > > - __count_vm_event(UNEVICTABLE_MLOCKFREED);
> > > > - local_irq_restore(flags);
> > > > - }
> > > > + __ClearPageMlocked(page);
> > > > + __dec_zone_page_state(page, NR_MLOCK);
> > > > + __count_vm_event(UNEVICTABLE_MLOCKFREED);
> > > > }
> > >
> > > The conscientuous reviewer runs around and checks for free_page_mlock()
> > > callers in other .c files which might be affected.
> > >
> > > Only there are no such callers.
> > >
> > > The reviewer's job would be reduced if free_page_mlock() wasn't
> > > needlessly placed in a header file!
> >
> > very sorry.
> >
> > How about this?
> >
> > =============================================
> > Subject: [PATCH] move free_page_mlock() to page_alloc.c
> >
> > Currently, free_page_mlock() is only called from page_alloc.c.
> > Thus, we can move it to page_alloc.c.
> >
>
> Looks good, but here is a version rebased on top of the patch series
> where it would be easier to merge with "Do not disable interrupts in
> free_page_mlock()".
>
> I do note why it might be in the header though - it keeps all the
> CONFIG_HAVE_MLOCKED_PAGE_BIT-related helper functions together making it
> easier to find them. Lee, was that the intention?

Yes. Having been dinged one too many times for adding extraneous
#ifdef's to .c's, I try to avoid that...

Note that in page-flags.h, we define MLOCK_PAGES as 0 or 1 based on
CONFIG_HAVE_MLOCKED_PAGE_BIT, so you could test that in
free_page_mlock() to eliminate the #ifdef in page_alloc.c as we do in
try_to_unmap_*() over in rmap.c. Guess I could have done that when I
added MLOCK_PAGES. Got lost in the heat of the battle.
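
A minimal sketch of that suggestion (hypothetical, untested; it relies on
MLOCK_PAGES being defined to 0 or 1 in page-flags.h as described above):

static inline void free_page_mlock(struct page *page)
{
	if (!MLOCK_PAGES)
		return;		/* folds away when the mlock page bit is unavailable */
	__ClearPageMlocked(page);
	__dec_zone_page_state(page, NR_MLOCK);
	__count_vm_event(UNEVICTABLE_MLOCKFREED);
}

With MLOCK_PAGES == 0 the whole body is dead code, so the #ifdef around the
function in page_alloc.c would no longer be needed.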

Lee


>
> =======
> From: KOSAKI Motohiro <[email protected]>
>
> Move free_page_mlock() from mm/internal.h to mm/page_alloc.c
>
> Currently, free_page_mlock() is only called from page_alloc.c. This patch
> moves it from a header to page_alloc.c.
>
> [[email protected]: Rebase on top of page allocator patches]
> Cc: Lee Schermerhorn <[email protected]>
> Cc: Christoph Lameter <[email protected]>
> Cc: Pekka Enberg <[email protected]>
> Cc: Dave Hansen <[email protected]>
> Cc: Andrew Morton <[email protected]>
> ---
> mm/internal.h | 13 -------------
> mm/page_alloc.c | 16 ++++++++++++++++
> 2 files changed, 16 insertions(+), 13 deletions(-)
>
> diff --git a/mm/internal.h b/mm/internal.h
> index 58ec1bc..4b1672a 100644
> --- a/mm/internal.h
> +++ b/mm/internal.h
> @@ -150,18 +150,6 @@ static inline void mlock_migrate_page(struct page *newpage, struct page *page)
> }
> }
>
> -/*
> - * free_page_mlock() -- clean up attempts to free and mlocked() page.
> - * Page should not be on lru, so no need to fix that up.
> - * free_pages_check() will verify...
> - */
> -static inline void free_page_mlock(struct page *page)
> -{
> - __ClearPageMlocked(page);
> - __dec_zone_page_state(page, NR_MLOCK);
> - __count_vm_event(UNEVICTABLE_MLOCKFREED);
> -}
> -
> #else /* CONFIG_HAVE_MLOCKED_PAGE_BIT */
> static inline int is_mlocked_vma(struct vm_area_struct *v, struct page *p)
> {
> @@ -170,7 +158,6 @@ static inline int is_mlocked_vma(struct vm_area_struct *v, struct page *p)
> static inline void clear_page_mlock(struct page *page) { }
> static inline void mlock_vma_page(struct page *page) { }
> static inline void mlock_migrate_page(struct page *new, struct page *old) { }
> -static inline void free_page_mlock(struct page *page) { }
>
> #endif /* CONFIG_HAVE_MLOCKED_PAGE_BIT */
>
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index f08b4cb..3db5f57 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -433,6 +433,22 @@ static inline int page_is_buddy(struct page *page, struct page *buddy,
> return 0;
> }
>
> +#ifdef CONFIG_HAVE_MLOCKED_PAGE_BIT
> +/*
> + * free_page_mlock() -- clean up attempts to free and mlocked() page.
> + * Page should not be on lru, so no need to fix that up.
> + * free_pages_check() will verify...
> + */
> +static inline void free_page_mlock(struct page *page)
> +{
> + __ClearPageMlocked(page);
> + __dec_zone_page_state(page, NR_MLOCK);
> + __count_vm_event(UNEVICTABLE_MLOCKFREED);
> +}
> +#else
> +static inline void free_page_mlock(struct page *page) { }
> +#endif /* CONFIG_HAVE_MLOCKED_PAGE_BIT */
> +
> /*
> * Freeing function for a buddy system allocator.
> *
>

2009-04-24 13:07:00

by Mel Gorman

[permalink] [raw]
Subject: Re: [PATCH 19/22] Update NR_FREE_PAGES only as necessary

On Thu, Apr 23, 2009 at 04:06:10PM -0700, Andrew Morton wrote:
> On Wed, 22 Apr 2009 14:53:24 +0100
> Mel Gorman <[email protected]> wrote:
>
> > When pages are being freed to the buddy allocator, the zone
> > NR_FREE_PAGES counter must be updated. In the case of bulk per-cpu page
> > freeing, it's updated once per page. This retouches cache lines more
> > than necessary. Update the counters once per per-cpu bulk free.
> >
> > ...
> >
> > --- a/mm/page_alloc.c
> > +++ b/mm/page_alloc.c
> > @@ -460,7 +460,6 @@ static inline void __free_one_page(struct page *page,
> > int migratetype)
> > {
> > unsigned long page_idx;
> > - int order_size = 1 << order;
> >
> > if (unlikely(PageCompound(page)))
> > if (unlikely(destroy_compound_page(page, order)))
> > @@ -470,10 +469,9 @@ static inline void __free_one_page(struct page *page,
> >
> > page_idx = page_to_pfn(page) & ((1 << MAX_ORDER) - 1);
> >
> > - VM_BUG_ON(page_idx & (order_size - 1));
> > + VM_BUG_ON(page_idx & ((1 << order) - 1));
> > VM_BUG_ON(bad_range(zone, page));
> >
>
> <head spins>
>
> Is this all a slow and obscure way of doing
>
> VM_BUG_ON(order > MAX_ORDER);
>
> ?

Nope.

>
> If not, what _is_ it asserting?
>

That the page is properly placed. The start of the block being freed has to
be aligned to its own size (1 << order pages). If it isn't, the calculation
that works out where the buddy is will break.
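
To make the connection explicit, the buddy lookup is essentially (a
simplified sketch of __page_find_buddy()/__find_combined_index() in
mm/page_alloc.c, not a verbatim copy):

	/* the buddy of a 2^order block is found by flipping the order bit */
	buddy_idx    = page_idx ^ (1UL << order);
	/* index of the block the pair merges into */
	combined_idx = page_idx & ~(1UL << order);

Both calculations assume page_idx is a multiple of 1 << order, which is
exactly what the VM_BUG_ON() checks.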

--
Mel Gorman
Part-time Phd Student Linux Technology Center
University of Limerick IBM Dublin Software Lab

2009-04-24 13:31:39

by Mel Gorman

[permalink] [raw]
Subject: [PATCH] Do not override definition of node_set_online() with macro

On Thu, Apr 23, 2009 at 12:29:27PM -0700, David Rientjes wrote:
> On Thu, 23 Apr 2009, Mel Gorman wrote:
>
> > > > diff --git a/include/linux/nodemask.h b/include/linux/nodemask.h
> > > > index 848025c..474e73e 100644
> > > > --- a/include/linux/nodemask.h
> > > > +++ b/include/linux/nodemask.h
> > > > @@ -408,6 +408,19 @@ static inline int num_node_state(enum node_states state)
> > > > #define next_online_node(nid) next_node((nid), node_states[N_ONLINE])
> > > >
> > > > extern int nr_node_ids;
> > > > +extern int nr_online_nodes;
> > > > +
> > > > +static inline void node_set_online(int nid)
> > > > +{
> > > > + node_set_state(nid, N_ONLINE);
> > > > + nr_online_nodes = num_node_state(N_ONLINE);
> > > > +}
> > > > +
> > > > +static inline void node_set_offline(int nid)
> > > > +{
> > > > + node_clear_state(nid, N_ONLINE);
> > > > + nr_online_nodes = num_node_state(N_ONLINE);
> > > > +}
> > > > #else
> > > >
> > > > static inline int node_state(int node, enum node_states state)
> > >
> > > The later #define's of node_set_online() and node_set_offline() in
> > > include/linux/nodemask.h should probably be removed now.
> > >
> >
> > You'd think, but you can enable memory hotplug without NUMA and
> > node_set_online() is called when adding memory. Even though those
> > functions are nops on !NUMA, they're necessary.
> >
>
> The problem is that your new functions above are never used because
> node_set_online and node_set_offline are macro defined later in this
> header for all cases, not just !CONFIG_NUMA.
>
> You need this.

You're absolutely correct, well spotted. I mistook what the #endif was
closing. Here is the patch again with a changelog. Thanks very much
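
For clarity, a stripped-down illustration (not the nodemask.h code itself)
of why the inline never got used:

/* earlier in the header */
static inline void node_set_online(int nid)
{
	node_set_state(nid, N_ONLINE);
	nr_online_nodes = num_node_state(N_ONLINE);
}

/* later in the header, for all configurations */
#define node_set_online(node)	node_set_state((node), N_ONLINE)

Every call site after the #define expands the macro, so the inline above is
dead code and nr_online_nodes is never updated. Moving the #define into the
!CONFIG_NUMA block, as the patch below does, restores the intended behaviour.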

==== CUT HERE ====
From: David Rientjes <[email protected]>

Do not override definition of node_set_online() with macro

node_set_online() updates node_states[] and updates the value of
nr_online_nodes. However, its definition is being accidentally overridden
by a macro definition intended for use in the !CONFIG_NUMA case. This patch
fixes the problem by moving the !CONFIG_NUMA macro definition.

This should be considered a fix to the patch
page-allocator-use-a-pre-calculated-value-instead-of-num_online_nodes-in-fast-paths.patch

Signed-off-by: Mel Gorman <[email protected]>
---
include/linux/nodemask.h | 6 +++---
1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/include/linux/nodemask.h b/include/linux/nodemask.h
index 474e73e..829b94b 100644
--- a/include/linux/nodemask.h
+++ b/include/linux/nodemask.h
@@ -448,6 +448,9 @@ static inline int num_node_state(enum node_states state)
#define next_online_node(nid) (MAX_NUMNODES)
#define nr_node_ids 1
#define nr_online_nodes 1
+
+#define node_set_online(node) node_set_state((node), N_ONLINE)
+#define node_set_offline(node) node_clear_state((node), N_ONLINE)
#endif

#define node_online_map node_states[N_ONLINE]
@@ -467,9 +470,6 @@ static inline int num_node_state(enum node_states state)
#define node_online(node) node_state((node), N_ONLINE)
#define node_possible(node) node_state((node), N_POSSIBLE)

-#define node_set_online(node) node_set_state((node), N_ONLINE)
-#define node_set_offline(node) node_clear_state((node), N_ONLINE)
-
#define for_each_node(node) for_each_node_state(node, N_POSSIBLE)
#define for_each_online_node(node) for_each_node_state(node, N_ONLINE)

2009-04-24 14:17:19

by Dave Hansen

[permalink] [raw]
Subject: Re: [PATCH 02/22] Do not sanity check order in the fast path

On Fri, 2009-04-24 at 11:34 +0100, Mel Gorman wrote:
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 1464aca..1c60141 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -1434,7 +1434,6 @@ get_page_from_freelist(gfp_t gfp_mask, nodemask_t *nodemask, unsigned int order,
> int did_zlc_setup = 0; /* just call zlc_setup() one time */
>
> classzone_idx = zone_idx(preferred_zone);
> - VM_BUG_ON(order >= MAX_ORDER);
>
> zonelist_scan:
> /*
> @@ -1692,6 +1691,15 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
> struct task_struct *p = current;
>
> /*
> + * In the slowpath, we sanity check order to avoid ever trying to
> + * reclaim >= MAX_ORDER areas which will never succeed. Callers may
> + * be using allocators in order of preference for an area that is
> + * too large.
> + */
> + if (WARN_ON_ONCE(order >= MAX_ORDER))
> + return NULL;
> +
> + /*
> * GFP_THISNODE (meaning __GFP_THISNODE, __GFP_NORETRY and
> * __GFP_NOWARN set) should not cause reclaim since the subsystem
> * (f.e. slab) using GFP_THISNODE may choose to trigger reclaim



Signed-off-by: Dave Hansen <[email protected]>

-- Dave

2009-04-24 14:25:26

by Dave Hansen

[permalink] [raw]
Subject: Re: [PATCH 02/22] Do not sanity check order in the fast path

On Fri, 2009-04-24 at 10:21 +0100, Mel Gorman wrote:
> > dave@kernel:~/work/mhp-build$ ../linux-2.6.git/scripts/bloat-o-meter
> > i386-numaq-sparse.{1,2}/vmlinux
> > add/remove: 0/0 grow/shrink: 9/16 up/down: 81/-99 (-18)
> > function old new delta
> > st_int_ioctl 2600 2624 +24
> > tcp_sendmsg 2153 2169 +16
> > diskstats_show 739 753 +14
> > iov_shorten 49 58 +9
> > unmap_vmas 1653 1661 +8
> > sg_build_indirect 449 455 +6
> > ahc_linux_biosparam 251 253 +2
> > nlmclnt_call 557 558 +1
> > do_mount 1533 1534 +1
>
> It doesn't make sense at all that text increased in size. Did you make clean
> between each .config change and patch application? Are you using distcc
> or anything else that might cause confusion? I found I had to eliminate
> distcc and clean after each patch application because net/ipv4/udp.c
> would sometimes generate different assembly when accessing struct zone.
> Not really sure what was going on there.

There's a ccache in there but no distcc.

I'll redo with clean builds.

-- Dave

2009-04-24 17:57:17

by Andrew Morton

[permalink] [raw]
Subject: Re: [PATCH 09/22] Calculate the alloc_flags for allocation only once

On Fri, 24 Apr 2009 11:47:17 +0100 Mel Gorman <[email protected]> wrote:

> Uninline gfp_to_alloc_flags() in the page allocator slow path

Well, there are <boggle> 37 inlines in page_alloc.c, so uninlining a
single function (and leaving its layout mucked up ;)) is a bit random.

Perhaps sometime you could take a look at "[patch] page allocator:
rationalise inlining"?

I'm kind of in two minds about it. Do we really know that all approved
versions of gcc will do the most desirable thing in all circumstances
on all architectures?

2009-04-27 07:58:26

by Yanmin Zhang

[permalink] [raw]
Subject: Re: [PATCH 00/22] Cleanup and optimise the page allocator V7

On Wed, 2009-04-22 at 14:53 +0100, Mel Gorman wrote:
> Here is V7 of the cleanup and optimisation of the page allocator and
> it should be ready for wider testing. Please consider a possibility for
> merging as a Pass 1 at making the page allocator faster. Other passes will
> occur later when this one has had a bit of exercise. This patchset is based
> on mmotm-2009-04-17 and I've tested it successfully on a small number of
> machines.
We ran some performance benchmarks against the V7 patch on top of 2.6.30-rc3.
It seems some counters in the kernel are incorrect after we run ffsb (a disk I/O
benchmark) and swap-cp (a simple swap memory test using cp on tmpfs). Free memory
is reported as bigger than total memory.

[ymzhang@lkp-st02-x8664 ~]$ uname -a
Linux lkp-st02-x8664 2.6.30-rc3-mgpage #1 SMP Thu Apr 23 16:09:43 CST 2009 x86_64 x86_64 x86_64 GNU/Linux
[ymzhang@lkp-st02-x8664 ~]$ free
total used free shared buffers cached
Mem: 8166564 18014398497625640 20022908 0 2364424 247224
-/+ buffers/cache: 18014398495013992 22634556
Swap: 0 0 0
[ymzhang@lkp-st02-x8664 ~]$ cat /proc/meminfo
MemTotal: 8166564 kB
MemFree: 20022916 kB
Buffers: 2364424 kB
Cached: 247224 kB
SwapCached: 0 kB
Active: 2414520 kB
Inactive: 206168 kB
Active(anon): 4316 kB
Inactive(anon): 4932 kB
Active(file): 2410204 kB
Inactive(file): 201236 kB



[ymzhang@lkp-ne01 ~]$ uname -a
Linux lkp-ne01 2.6.30-rc3-mgpage #1 SMP Thu Apr 23 15:04:27 CST 2009 x86_64 x86_64 x86_64 GNU/Linux
[ymzhang@lkp-ne01 ~]$ free
total used free shared buffers cached
Mem: 6116356 18014398509340432 6257908 0 609804 1053512
-/+ buffers/cache: 18014398507677116 7921224
Swap: 15631204 0 15631204
[ymzhang@lkp-ne01 ~]$ cat /proc/meminfo
MemTotal: 6116356 kB
MemFree: 6257948 kB
Buffers: 609804 kB
Cached: 1053512 kB
SwapCached: 0 kB
Active: 723152 kB



Or a simple kernel source cp/rm/cp:
[ymzhang@lkp-ne01 linux-2.6.30-rc3_melgorman]$ uname -a
Linux lkp-ne01 2.6.30-rc3-mgpage #1 SMP Thu Apr 23 15:04:27 CST 2009 x86_64 x86_64 x86_64 GNU/Linux
[ymzhang@lkp-ne01 linux-2.6.30-rc3_melgorman]$ free
total used free shared buffers cached
Mem: 6116356 1309940 4806416 0 82184 1259072
-/+ buffers/cache: 18014398509450668 6147672
Swap: 15631204 0 15631204
[ymzhang@lkp-ne01 linux-2.6.30-rc3_melgorman]$ cat /proc/meminfo
MemTotal: 6116356 kB
MemFree: 4806724 kB
Buffers: 82184 kB
Cached: 1259072 kB
SwapCached: 0 kB
Active: 477396 kB
Inactive: 872388 kB
Active(anon): 8704 kB
Inactive(anon): 0 kB
Active(file): 468692 kB
Inactive(file): 872388 kB
Unevictable: 0 kB
Mlocked: 0 kB
SwapTotal: 15631204 kB
SwapFree: 15631204 kB
Dirty: 0 kB
Writeback: 0 kB
AnonPages: 8632 kB
Mapped: 4140 kB
Slab: 174504 kB
SReclaimable: 154976 kB


>
> The performance improvements are in a wide range depending on the exact
> machine but the results I've seen so fair are approximately;
>
> kernbench: 0 to 0.12% (elapsed time)
> 0.49% to 3.20% (sys time)
> aim9: -4% to 30% (for page_test and brk_test)
> tbench: -1% to 4%
> hackbench: -2.5% to 3.45% (mostly within the noise though)
> netperf-udp -1.34% to 4.06% (varies between machines a bit)
> netperf-tcp -0.44% to 5.22% (varies between machines a bit)
>
> I haven't sysbench figures at hand, but previously they were within the -0.5%
> to 2% range.
>
> On netperf, the client and server were bound to opposite number CPUs to
> maximise the problems with cache line bouncing of the struct pages so I
> expect different people to report different results for netperf depending
> on their exact machine and how they ran the test (different machines, same
> cpus client/server, shared cache but two threads client/server, different
> socket client/server etc).
>
> I also measured the vmlinux sizes for a single x86-based config with
> CONFIG_DEBUG_INFO enabled but not CONFIG_DEBUG_VM. The core of the .config
> is based on the Debian Lenny kernel config so I expect it to be reasonably
> typical.
>
> text kernel
> 3355726 mmotm-20090417
> 3355718 0001-Replace-__alloc_pages_internal-with-__alloc_pages_.patch
> 3355622 0002-Do-not-sanity-check-order-in-the-fast-path.patch
> 3355574 0003-Do-not-check-NUMA-node-ID-when-the-caller-knows-the.patch
> 3355574 0004-Check-only-once-if-the-zonelist-is-suitable-for-the.patch
> 3355526 0005-Break-up-the-allocator-entry-point-into-fast-and-slo.patch
> 3355420 0006-Move-check-for-disabled-anti-fragmentation-out-of-fa.patch
> 3355452 0007-Calculate-the-preferred-zone-for-allocation-only-onc.patch
> 3355452 0008-Calculate-the-migratetype-for-allocation-only-once.patch
> 3355436 0009-Calculate-the-alloc_flags-for-allocation-only-once.patch
> 3355436 0010-Remove-a-branch-by-assuming-__GFP_HIGH-ALLOC_HIGH.patch
> 3355420 0011-Inline-__rmqueue_smallest.patch
> 3355420 0012-Inline-buffered_rmqueue.patch
> 3355420 0013-Inline-__rmqueue_fallback.patch
> 3355404 0014-Do-not-call-get_pageblock_migratetype-more-than-ne.patch
> 3355300 0015-Do-not-disable-interrupts-in-free_page_mlock.patch
> 3355300 0016-Do-not-setup-zonelist-cache-when-there-is-only-one-n.patch
> 3355188 0017-Do-not-check-for-compound-pages-during-the-page-allo.patch
> 3355161 0018-Use-allocation-flags-as-an-index-to-the-zone-waterma.patch
> 3355129 0019-Update-NR_FREE_PAGES-only-as-necessary.patch
> 3355129 0020-Get-the-pageblock-migratetype-without-disabling-inte.patch
> 3355129 0021-Use-a-pre-calculated-value-instead-of-num_online_nod.patch
>
> Some patches were dropped in this revision because while I believe they
> improved performance, they also increase the text size so they need to
> be revisited in isolation to show they actually help things and by how
> much. Other than that, the biggest changes were cleaning up accidental
> functional changes identified by Kosaki Motohiro. Massive credit to him for a
> very defailed review of V6, to Christoph Lameter who reviewed earlier versions
> quite heavily and Pekka who kicked through V6 in quite a lot of detail.
>
> arch/ia64/hp/common/sba_iommu.c | 2
> arch/ia64/kernel/mca.c | 3
> arch/ia64/kernel/uncached.c | 3
> arch/ia64/sn/pci/pci_dma.c | 3
> arch/powerpc/platforms/cell/ras.c | 2
> arch/x86/kvm/vmx.c | 2
> drivers/misc/sgi-gru/grufile.c | 2
> drivers/misc/sgi-xp/xpc_uv.c | 2
> include/linux/gfp.h | 27 -
> include/linux/mm.h | 1
> include/linux/mmzone.h | 11
> include/linux/nodemask.h | 15 -
> kernel/profile.c | 8
> mm/filemap.c | 2
> mm/hugetlb.c | 8
> mm/internal.h | 11
> mm/mempolicy.c | 2
> mm/migrate.c | 2
> mm/page_alloc.c | 555 ++++++++++++++++++++++++--------------
> mm/slab.c | 11
> mm/slob.c | 4
> mm/slub.c | 2
> net/sunrpc/svc.c | 2
> 23 files changed, 424 insertions(+), 256 deletions(-)
>
> Changes since V6
> o Remove unintentional functional changes when splitting into fast and slow paths
> o Drop patch 7 for zonelist filtering as it modified when zlc_setup() is called
> for the wrong reasons. The patch that avoids calling it for non-NUMA machines is
> still there which has the bulk of the saving. cpusets is relatively small
> o Drop an unnecessary check for in_interrupt() in gfp_to_alloc_flags()
> o Clarify comment on __GFP_HIGH == ALLOC_HIGH
> o Redefine the watermark mask to be expessed in terms of ALLOC_MARK_NOWATERMARK
> o Use BUILD_BUG_ON for checking __GFP_HIGH == ALLOC_HIGH
> o Drop some patches that were not reducing text sizes as expected
> o Remove numa_platform from slab
>
> Change since V5
> o Rebase to mmotm-2009-04-17
>
> Changes since V4
> o Drop the more controversial patches for now and focus on the "obvious win"
> material.
> o Add reviewed-by notes
> o Fix changelog entry to say __rmqueue_fallback instead __rmqueue
> o Add unlikely() for the clearMlocked check
> o Change where PGFREE is accounted in free_hot_cold_page() to have symmetry
> with __free_pages_ok()
> o Convert num_online_nodes() to use a static value so that callers do
> not have to be individually updated
> o Rebase to mmotm-2003-03-13
>
> Changes since V3
> o Drop the more controversial patches for now and focus on the "obvious win"
> material
> o Add reviewed-by notes
> o Fix changelog entry to say __rmqueue_fallback instead __rmqueue
> o Add unlikely() for the clearMlocked check
> o Change where PGFREE is accounted in free_hot_cold_page() to have symmetry
> with __free_pages_ok()
>
> Changes since V2
> o Remove brances by treating watermark flags as array indices
> o Remove branch by assuming __GFP_HIGH == ALLOC_HIGH
> o Do not check for compound on every page free
> o Remove branch by always ensuring the migratetype is known on free
> o Simplify buffered_rmqueue further
> o Reintroduce improved version of batched bulk free of pcp pages
> o Use allocation flags as an index to zone watermarks
> o Work out __GFP_COLD only once
> o Reduce the number of times zone stats are updated
> o Do not dump reserve pages back into the allocator. Instead treat them
> as MOVABLE so that MIGRATE_RESERVE gets used on the max-order-overlapped
> boundaries without causing trouble
> o Allow pages up to PAGE_ALLOC_COSTLY_ORDER to use the per-cpu allocator.
> order-1 allocations are frequently enough in particular to justify this
> o Rearrange inlining such that the hot-path is inlined but not in a way
> that increases the text size of the page allocator
> o Make the check for needing additional zonelist filtering due to NUMA
> or cpusets as light as possible
> o Do not destroy compound pages going to the PCP lists
> o Delay the merging of buddies until a high-order allocation needs them
> or anti-fragmentation is being forced to fallback
>
> Changes since V1
> o Remove the ifdef CONFIG_CPUSETS from inside get_page_from_freelist()
> o Use non-lock bit operations for clearing the mlock flag
> o Factor out alloc_flags calculation so it is only done once (Peter)
> o Make gfp.h a bit prettier and clear-cut (Peter)
> o Instead of deleting a debugging check, replace page_count() in the
> free path with a version that does not check for compound pages (Nick)
> o Drop the alteration for hot/cold page freeing until we know if it
> helps or not
>

2009-04-27 14:38:59

by Mel Gorman

[permalink] [raw]
Subject: Re: [PATCH 00/22] Cleanup and optimise the page allocator V7

On Mon, Apr 27, 2009 at 03:58:39PM +0800, Zhang, Yanmin wrote:
> On Wed, 2009-04-22 at 14:53 +0100, Mel Gorman wrote:
> > Here is V7 of the cleanup and optimisation of the page allocator and
> > it should be ready for wider testing. Please consider a possibility for
> > merging as a Pass 1 at making the page allocator faster. Other passes will
> > occur later when this one has had a bit of exercise. This patchset is based
> > on mmotm-2009-04-17 and I've tested it successfully on a small number of
> > machines.
> We ran some performance benchmarks against V7 patch on top of 2.6.30-rc3.
> It seems some counters in kernel are incorrect after we run some ffsb (disk I/O benchmark)
> and swap-cp (a simple swap memory testing by cp on tmpfs). Free memory is bigger than
> total memory.
>

oops. Can you try this patch please?

==== CUT HERE ====

Properly account for freed pages in free_pages_bulk() and when allocating high-order pages in buffered_rmqueue()

free_pages_bulk() updates the number of free pages in the zone but it is
assuming that the pages being freed are order-0. While this is currently
always true, it's wrong to assume the order is 0. This patch fixes the
problem.

buffered_rmqueue() is not updating NR_FREE_PAGES when allocating pages with
__rmqueue(). This means that any high-order allocation will appear to increase
the number of free pages leading to the situation where free pages appears to
exceed available RAM. This patch accounts for those allocated pages properly.

This is a candidate fix to the patch
page-allocator-update-nr_free_pages-only-as-necessary.patch. It has yet to be
verified as fixing a problem where the free pages count is getting corrupted.

Signed-off-by: Mel Gorman <[email protected]>
---
mm/page_alloc.c | 3 ++-
1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 3db5f57..dd69593 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -545,7 +545,7 @@ static void free_pages_bulk(struct zone *zone, int count,
zone_clear_flag(zone, ZONE_ALL_UNRECLAIMABLE);
zone->pages_scanned = 0;

- __mod_zone_page_state(zone, NR_FREE_PAGES, count);
+ __mod_zone_page_state(zone, NR_FREE_PAGES, count << order);
while (count--) {
struct page *page;

@@ -1151,6 +1151,7 @@ again:
} else {
spin_lock_irqsave(&zone->lock, flags);
page = __rmqueue(zone, order, migratetype);
+ __mod_zone_page_state(zone, NR_FREE_PAGES, -(i << order));
spin_unlock(&zone->lock);
if (!page)
goto failed;

2009-04-27 17:01:17

by Mel Gorman

[permalink] [raw]
Subject: [RFC] Replace the watermark-related union in struct zone with a watermark[] array

Just build and boot-tested. This is to illustrate what the patch to replace
the union with an array looks like, and to see whether people are happy with
it or think it should look like something else.

==== CUT HERE ====

Patch page-allocator-use-allocation-flags-as-an-index-to-the-zone-watermark
from -mm added a union to struct zone where the watermarks could be
accessed with either zone->pages_* or a pages_mark array. The concern
was that this aliasing caused more confusion than it helped.

This patch replaces the union with a watermark array that is indexed with
WMARK_* defines and updates all sites that talk about zone->pages_*.

Signed-off-by: Mel Gorman <[email protected]>
---
Documentation/sysctl/vm.txt | 11 ++++++-----
Documentation/vm/balance | 18 +++++++++---------
arch/m32r/mm/discontig.c | 6 +++---
include/linux/mmzone.h | 16 ++++++++++------
mm/page_alloc.c | 41 +++++++++++++++++++++--------------------
mm/vmscan.c | 41 +++++++++++++++++++++++------------------
mm/vmstat.c | 6 +++---
7 files changed, 75 insertions(+), 64 deletions(-)

diff --git a/Documentation/sysctl/vm.txt b/Documentation/sysctl/vm.txt
index 97c4b32..f79e12b 100644
--- a/Documentation/sysctl/vm.txt
+++ b/Documentation/sysctl/vm.txt
@@ -231,8 +231,8 @@ These protections are added to score to judge whether this zone should be used
for page allocation or should be reclaimed.

In this example, if normal pages (index=2) are required to this DMA zone and
-pages_high is used for watermark, the kernel judges this zone should not be
-used because pages_free(1355) is smaller than watermark + protection[2]
+watermark[WMARK_HIGH] is used for watermark, the kernel judges this zone should
+not be used because pages_free(1355) is smaller than watermark + protection[2]
(4 + 2004 = 2008). If this protection value is 0, this zone would be used for
normal page requirement. If requirement is DMA zone(index=0), protection[0]
(=0) is used.
@@ -278,9 +278,10 @@ The default value is 65536.
min_free_kbytes:

This is used to force the Linux VM to keep a minimum number
-of kilobytes free. The VM uses this number to compute a pages_min
-value for each lowmem zone in the system. Each lowmem zone gets
-a number of reserved free pages based proportionally on its size.
+of kilobytes free. The VM uses this number to compute a
+watermark[WMARK_MIN] value for each lowmem zone in the system.
+Each lowmem zone gets a number of reserved free pages based
+proportionally on its size.

Some minimal amount of memory is needed to satisfy PF_MEMALLOC
allocations; if you set this to lower than 1024KB, your system will
diff --git a/Documentation/vm/balance b/Documentation/vm/balance
index bd3d31b..c46e68c 100644
--- a/Documentation/vm/balance
+++ b/Documentation/vm/balance
@@ -75,15 +75,15 @@ Page stealing from process memory and shm is done if stealing the page would
alleviate memory pressure on any zone in the page's node that has fallen below
its watermark.

-pages_min/pages_low/pages_high/low_on_memory/zone_wake_kswapd: These are
-per-zone fields, used to determine when a zone needs to be balanced. When
-the number of pages falls below pages_min, the hysteric field low_on_memory
-gets set. This stays set till the number of free pages becomes pages_high.
-When low_on_memory is set, page allocation requests will try to free some
-pages in the zone (providing GFP_WAIT is set in the request). Orthogonal
-to this, is the decision to poke kswapd to free some zone pages. That
-decision is not hysteresis based, and is done when the number of free
-pages is below pages_low; in which case zone_wake_kswapd is also set.
+watemark[WMARK_MIN/WMARK_LOW/WMARK_HIGH]/low_on_memory/zone_wake_kswapd: These
+are per-zone fields, used to determine when a zone needs to be balanced. When
+the number of pages falls below watermark[WMARK_MIN], the hysteric field
+low_on_memory gets set. This stays set till the number of free pages becomes
+watermark[WMARK_HIGH]. When low_on_memory is set, page allocation requests will
+try to free some pages in the zone (providing GFP_WAIT is set in the request).
+Orthogonal to this, is the decision to poke kswapd to free some zone pages.
+That decision is not hysteresis based, and is done when the number of free
+pages is below watermark[WMARK_LOW]; in which case zone_wake_kswapd is also set.


(Good) Ideas that I have heard:
diff --git a/arch/m32r/mm/discontig.c b/arch/m32r/mm/discontig.c
index 7daf897..b7a78ad 100644
--- a/arch/m32r/mm/discontig.c
+++ b/arch/m32r/mm/discontig.c
@@ -154,9 +154,9 @@ unsigned long __init zone_sizes_init(void)
* Use all area of internal RAM.
* see __alloc_pages()
*/
- NODE_DATA(1)->node_zones->pages_min = 0;
- NODE_DATA(1)->node_zones->pages_low = 0;
- NODE_DATA(1)->node_zones->pages_high = 0;
+ NODE_DATA(1)->node_zones->watermark[WMARK_MIN] = 0;
+ NODE_DATA(1)->node_zones->watermark[WMARK_LOW] = 0;
+ NODE_DATA(1)->node_zones->watermark[WMARK_HIGH] = 0;

return holes;
}
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index c1fa208..1ff59fd 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -163,6 +163,13 @@ static inline int is_unevictable_lru(enum lru_list l)
#endif
}

+enum zone_watermarks {
+ WMARK_MIN,
+ WMARK_LOW,
+ WMARK_HIGH,
+ NR_WMARK
+};
+
struct per_cpu_pages {
int count; /* number of pages in the list */
int high; /* high watermark, emptying needed */
@@ -275,12 +282,9 @@ struct zone_reclaim_stat {

struct zone {
/* Fields commonly accessed by the page allocator */
- union {
- struct {
- unsigned long pages_min, pages_low, pages_high;
- };
- unsigned long pages_mark[3];
- };
+
+ /* zone watermarks, indexed with WMARK_LOW, WMARK_MIN and WMARK_HIGH */
+ unsigned long watermark[NR_WMARK];

/*
* We don't know if the memory that we're going to allocate will be freeable
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index f5d5a63..5dd2d59 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1172,10 +1172,10 @@ failed:
return NULL;
}

-/* The WMARK bits are used as an index zone->pages_mark */
-#define ALLOC_WMARK_MIN 0x00 /* use pages_min watermark */
-#define ALLOC_WMARK_LOW 0x01 /* use pages_low watermark */
-#define ALLOC_WMARK_HIGH 0x02 /* use pages_high watermark */
+/* The ALLOC_WMARK bits are used as an index to zone->watermark */
+#define ALLOC_WMARK_MIN WMARK_MIN
+#define ALLOC_WMARK_LOW WMARK_LOW
+#define ALLOC_WMARK_HIGH WMARK_HIGH
#define ALLOC_NO_WATERMARKS 0x04 /* don't check watermarks at all */

/* Mask to get the watermark bits */
@@ -1464,9 +1464,10 @@ zonelist_scan:
!cpuset_zone_allowed_softwall(zone, gfp_mask))
goto try_next_zone;

+ BUILD_BUG_ON(ALLOC_NO_WATERMARKS < NR_WMARK);
if (!(alloc_flags & ALLOC_NO_WATERMARKS)) {
unsigned long mark;
- mark = zone->pages_mark[alloc_flags & ALLOC_WMARK_MASK];
+ mark = zone->watermark[alloc_flags & ALLOC_WMARK_MASK];
if (!zone_watermark_ok(zone, order, mark,
classzone_idx, alloc_flags)) {
if (!zone_reclaim_mode ||
@@ -1987,7 +1988,7 @@ static unsigned int nr_free_zone_pages(int offset)

for_each_zone_zonelist(zone, z, zonelist, offset) {
unsigned long size = zone->present_pages;
- unsigned long high = zone->pages_high;
+ unsigned long high = zone->watermark[WMARK_HIGH];
if (size > high)
sum += size - high;
}
@@ -2124,9 +2125,9 @@ void show_free_areas(void)
"\n",
zone->name,
K(zone_page_state(zone, NR_FREE_PAGES)),
- K(zone->pages_min),
- K(zone->pages_low),
- K(zone->pages_high),
+ K(zone->watermark[WMARK_MIN]),
+ K(zone->watermark[WMARK_LOW]),
+ K(zone->watermark[WMARK_HIGH]),
K(zone_page_state(zone, NR_ACTIVE_ANON)),
K(zone_page_state(zone, NR_INACTIVE_ANON)),
K(zone_page_state(zone, NR_ACTIVE_FILE)),
@@ -2730,8 +2731,8 @@ static inline unsigned long wait_table_bits(unsigned long size)

/*
* Mark a number of pageblocks as MIGRATE_RESERVE. The number
- * of blocks reserved is based on zone->pages_min. The memory within the
- * reserve will tend to store contiguous free pages. Setting min_free_kbytes
+ * of blocks reserved is based on zone->watermark[WMARK_MIN]. The memory within
+ * the reserve will tend to store contiguous free pages. Setting min_free_kbytes
* higher will lead to a bigger reserve which will get freed as contiguous
* blocks as reclaim kicks in
*/
@@ -2744,7 +2745,7 @@ static void setup_zone_migrate_reserve(struct zone *zone)
/* Get the start pfn, end pfn and the number of blocks to reserve */
start_pfn = zone->zone_start_pfn;
end_pfn = start_pfn + zone->spanned_pages;
- reserve = roundup(zone->pages_min, pageblock_nr_pages) >>
+ reserve = roundup(zone->watermark[WMARK_MIN], pageblock_nr_pages) >>
pageblock_order;

for (pfn = start_pfn; pfn < end_pfn; pfn += pageblock_nr_pages) {
@@ -4400,8 +4401,8 @@ static void calculate_totalreserve_pages(void)
max = zone->lowmem_reserve[j];
}

- /* we treat pages_high as reserved pages. */
- max += zone->pages_high;
+ /* we treat WMARK_HIGH as reserved pages. */
+ max += zone->watermark[WMARK_HIGH];

if (max > zone->present_pages)
max = zone->present_pages;
@@ -4481,7 +4482,7 @@ void setup_per_zone_pages_min(void)
* need highmem pages, so cap pages_min to a small
* value here.
*
- * The (pages_high-pages_low) and (pages_low-pages_min)
+ * The WMARK_HIGH-WMARK_LOW and (WMARK_LOW-WMARK_MIN)
* deltas controls asynch page reclaim, and so should
* not be capped for highmem.
*/
@@ -4492,17 +4493,17 @@ void setup_per_zone_pages_min(void)
min_pages = SWAP_CLUSTER_MAX;
if (min_pages > 128)
min_pages = 128;
- zone->pages_min = min_pages;
+ zone->watermark[WMARK_MIN] = min_pages;
} else {
/*
* If it's a lowmem zone, reserve a number of pages
* proportionate to the zone's size.
*/
- zone->pages_min = tmp;
+ zone->watermark[WMARK_MIN] = tmp;
}

- zone->pages_low = zone->pages_min + (tmp >> 2);
- zone->pages_high = zone->pages_min + (tmp >> 1);
+ zone->watermark[WMARK_LOW] = zone->watermark[WMARK_MIN] + (tmp >> 2);
+ zone->watermark[WMARK_HIGH] = zone->watermark[WMARK_MIN] + (tmp >> 1);
setup_zone_migrate_reserve(zone);
spin_unlock_irqrestore(&zone->lock, flags);
}
@@ -4647,7 +4648,7 @@ int sysctl_min_slab_ratio_sysctl_handler(ctl_table *table, int write,
* whenever sysctl_lowmem_reserve_ratio changes.
*
* The reserve ratio obviously has absolutely no relation with the
- * pages_min watermarks. The lowmem reserve ratio can only make sense
+ * watermark[WMARK_MIN] watermarks. The lowmem reserve ratio can only make sense
* if in function of the boot time zone sizes.
*/
int lowmem_reserve_ratio_sysctl_handler(ctl_table *table, int write,
diff --git a/mm/vmscan.c b/mm/vmscan.c
index cef1801..371447c 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1406,7 +1406,7 @@ static void get_scan_ratio(struct zone *zone, struct scan_control *sc,
free = zone_page_state(zone, NR_FREE_PAGES);
/* If we have very few page cache pages,
force-scan anon pages. */
- if (unlikely(file + free <= zone->pages_high)) {
+ if (unlikely(file + free <= zone->watermark[WMARK_HIGH])) {
percent[0] = 100;
percent[1] = 0;
return;
@@ -1538,11 +1538,13 @@ static void shrink_zone(int priority, struct zone *zone,
* try to reclaim pages from zones which will satisfy the caller's allocation
* request.
*
- * We reclaim from a zone even if that zone is over pages_high. Because:
+ * We reclaim from a zone even if that zone is over watermark[WMARK_HIGH].
+ * Because:
* a) The caller may be trying to free *extra* pages to satisfy a higher-order
* allocation or
- * b) The zones may be over pages_high but they must go *over* pages_high to
- * satisfy the `incremental min' zone defense algorithm.
+ * b) The zones may be over watermark[WMARK_HIGH] but they must go *over*
+ * watermark[WMARK_HIGH] to satisfy the `incremental min' zone defense
+ * algorithm.
*
* If a zone is deemed to be full of pinned pages then just give it a light
* scan then give up on it.
@@ -1748,7 +1750,7 @@ unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *mem_cont,

/*
* For kswapd, balance_pgdat() will work across all this node's zones until
- * they are all at pages_high.
+ * they are all at watermark[WMARK_HIGH].
*
* Returns the number of pages which were actually freed.
*
@@ -1761,11 +1763,11 @@ unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *mem_cont,
* the zone for when the problem goes away.
*
* kswapd scans the zones in the highmem->normal->dma direction. It skips
- * zones which have free_pages > pages_high, but once a zone is found to have
- * free_pages <= pages_high, we scan that zone and the lower zones regardless
- * of the number of free pages in the lower zones. This interoperates with
- * the page allocator fallback scheme to ensure that aging of pages is balanced
- * across the zones.
+ * zones which have free_pages > watermark[WMARK_HIGH], but once a zone is
+ * found to have free_pages <= watermarkpWMARK_HIGH], we scan that zone and the
+ * lower zones regardless of the number of free pages in the lower zones. This
+ * interoperates with the page allocator fallback scheme to ensure that aging
+ * of pages is balanced across the zones.
*/
static unsigned long balance_pgdat(pg_data_t *pgdat, int order)
{
@@ -1786,7 +1788,8 @@ static unsigned long balance_pgdat(pg_data_t *pgdat, int order)
};
/*
* temp_priority is used to remember the scanning priority at which
- * this zone was successfully refilled to free_pages == pages_high.
+ * this zone was successfully refilled to
+ * free_pages == watermark[WMARK_HIGH].
*/
int temp_priority[MAX_NR_ZONES];

@@ -1831,8 +1834,8 @@ loop_again:
shrink_active_list(SWAP_CLUSTER_MAX, zone,
&sc, priority, 0);

- if (!zone_watermark_ok(zone, order, zone->pages_high,
- 0, 0)) {
+ if (!zone_watermark_ok(zone, order,
+ zone->watermark[WMARK_HIGH], 0, 0)) {
end_zone = i;
break;
}
@@ -1866,8 +1869,9 @@ loop_again:
priority != DEF_PRIORITY)
continue;

- if (!zone_watermark_ok(zone, order, zone->pages_high,
- end_zone, 0))
+ if (!zone_watermark_ok(zone, order,
+ zone->watermark[WMARK_HIGH],
+ end_zone, 0))
all_zones_ok = 0;
temp_priority[i] = priority;
sc.nr_scanned = 0;
@@ -1876,8 +1880,9 @@ loop_again:
* We put equal pressure on every zone, unless one
* zone has way too many pages free already.
*/
- if (!zone_watermark_ok(zone, order, 8*zone->pages_high,
- end_zone, 0))
+ if (!zone_watermark_ok(zone, order,
+ 8*zone->watermark[WMARK_HIGH],
+ end_zone, 0))
shrink_zone(priority, zone, &sc);
reclaim_state->reclaimed_slab = 0;
nr_slab = shrink_slab(sc.nr_scanned, GFP_KERNEL,
@@ -2043,7 +2048,7 @@ void wakeup_kswapd(struct zone *zone, int order)
return;

pgdat = zone->zone_pgdat;
- if (zone_watermark_ok(zone, order, zone->pages_low, 0, 0))
+ if (zone_watermark_ok(zone, order, zone->watermark[WMARK_LOW], 0, 0))
return;
if (pgdat->kswapd_max_order < order)
pgdat->kswapd_max_order = order;
diff --git a/mm/vmstat.c b/mm/vmstat.c
index 66f6130..17f2abb 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -725,9 +725,9 @@ static void zoneinfo_show_print(struct seq_file *m, pg_data_t *pgdat,
"\n spanned %lu"
"\n present %lu",
zone_page_state(zone, NR_FREE_PAGES),
- zone->pages_min,
- zone->pages_low,
- zone->pages_high,
+ zone->watermark[WMARK_MIN],
+ zone->watermark[WMARK_LOW],
+ zone->watermark[WMARK_HIGH],
zone->pages_scanned,
zone->lru[LRU_ACTIVE_ANON].nr_scan,
zone->lru[LRU_INACTIVE_ANON].nr_scan,

2009-04-27 20:49:14

by David Rientjes

[permalink] [raw]
Subject: Re: [RFC] Replace the watermark-related union in struct zone with a watermark[] array

On Mon, 27 Apr 2009, Mel Gorman wrote:

> diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
> index c1fa208..1ff59fd 100644
> --- a/include/linux/mmzone.h
> +++ b/include/linux/mmzone.h
> @@ -163,6 +163,13 @@ static inline int is_unevictable_lru(enum lru_list l)
> #endif
> }
>
> +enum zone_watermarks {
> + WMARK_MIN,
> + WMARK_LOW,
> + WMARK_HIGH,
> + NR_WMARK
> +};
> +
> struct per_cpu_pages {
> int count; /* number of pages in the list */
> int high; /* high watermark, emptying needed */
> @@ -275,12 +282,9 @@ struct zone_reclaim_stat {
>
> struct zone {
> /* Fields commonly accessed by the page allocator */
> - union {
> - struct {
> - unsigned long pages_min, pages_low, pages_high;
> - };
> - unsigned long pages_mark[3];
> - };
> +
> + /* zone watermarks, indexed with WMARK_LOW, WMARK_MIN and WMARK_HIGH */
> + unsigned long watermark[NR_WMARK];
>
> /*
> * We don't know if the memory that we're going to allocate will be freeable

I thought the suggestion was for something like

#define zone_wmark_min(z) (z->pages_mark[WMARK_MIN])
...

2009-04-27 20:54:20

by Mel Gorman

[permalink] [raw]
Subject: Re: [RFC] Replace the watermark-related union in struct zone with a watermark[] array

On Mon, Apr 27, 2009 at 01:48:47PM -0700, David Rientjes wrote:
> On Mon, 27 Apr 2009, Mel Gorman wrote:
>
> > diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
> > index c1fa208..1ff59fd 100644
> > --- a/include/linux/mmzone.h
> > +++ b/include/linux/mmzone.h
> > @@ -163,6 +163,13 @@ static inline int is_unevictable_lru(enum lru_list l)
> > #endif
> > }
> >
> > +enum zone_watermarks {
> > + WMARK_MIN,
> > + WMARK_LOW,
> > + WMARK_HIGH,
> > + NR_WMARK
> > +};
> > +
> > struct per_cpu_pages {
> > int count; /* number of pages in the list */
> > int high; /* high watermark, emptying needed */
> > @@ -275,12 +282,9 @@ struct zone_reclaim_stat {
> >
> > struct zone {
> > /* Fields commonly accessed by the page allocator */
> > - union {
> > - struct {
> > - unsigned long pages_min, pages_low, pages_high;
> > - };
> > - unsigned long pages_mark[3];
> > - };
> > +
> > + /* zone watermarks, indexed with WMARK_LOW, WMARK_MIN and WMARK_HIGH */
> > + unsigned long watermark[NR_WMARK];
> >
> > /*
> > * We don't know if the memory that we're going to allocate will be freeable
>
> I thought the suggestion was for something like
>
> #define zone_wmark_min(z) (z->pages_mark[WMARK_MIN])
> ...

Was it the only suggestion? I thought just replacing the union with an
array would be an option as well.

The #define approach also requires setter versions like

static inline void set_zone_wmark_min(struct zone *z, unsigned long val)
{
	z->pages_mark[WMARK_MIN] = val;
}

and you need one of those for each watermark if you are to avoid weirdness like

zone_wmark_min(z) = val;

which looks all wrong. I felt this approach would look neater and be closer
in appearance to the code that is already there and therefore less surprising.

Would people prefer a getter/setter version?

--
Mel Gorman
Part-time Phd Student Linux Technology Center
University of Limerick IBM Dublin Software Lab

Subject: Re: [RFC] Replace the watermark-related union in struct zone with a watermark[] array

On Mon, 27 Apr 2009, Mel Gorman wrote:

> Would people prefer a getter/setter version?

I'd say leave it as is. It unifies the usage of the watermark array.

2009-04-27 21:04:20

by David Rientjes

[permalink] [raw]
Subject: Re: [RFC] Replace the watermark-related union in struct zone with a watermark[] array

On Mon, 27 Apr 2009, Mel Gorman wrote:

> > I thought the suggestion was for something like
> >
> > #define zone_wmark_min(z) (z->pages_mark[WMARK_MIN])
> > ...
>
> Was it the only suggestion? I thought just replacing the union with an
> array would be an option as well.
>
> The #define approach also requires setter versions like
>
> static inline set_zone_wmark_min(struct zone *z, unsigned long val)
> {
> z->pages_mark[WMARK_MIN] = val;
> }
>
> and you need one of those for each watermark if you are to avoid weirdness like
>
> zone_wmark_min(z) = val;
>
> which looks all wrong.

Agreed, but we only set watermarks in a couple of different locations and
they really have no reason to change otherwise, so I don't think it's
necessary to care too much about how the setter looks.

Adding individual get/set functions for each watermark seems like
overkill.

I personally had no problem with the union struct aliasing the array, I
think ->pages_min, ->pages_low, etc. are already very familiar.

2009-04-28 01:59:04

by Yanmin Zhang

[permalink] [raw]
Subject: Re: [PATCH 00/22] Cleanup and optimise the page allocator V7

On Mon, 2009-04-27 at 15:38 +0100, Mel Gorman wrote:
> On Mon, Apr 27, 2009 at 03:58:39PM +0800, Zhang, Yanmin wrote:
> > On Wed, 2009-04-22 at 14:53 +0100, Mel Gorman wrote:
> > > Here is V7 of the cleanup and optimisation of the page allocator and
> > > it should be ready for wider testing. Please consider a possibility for
> > > merging as a Pass 1 at making the page allocator faster. Other passes will
> > > occur later when this one has had a bit of exercise. This patchset is based
> > > on mmotm-2009-04-17 and I've tested it successfully on a small number of
> > > machines.
> > We ran some performance benchmarks against V7 patch on top of 2.6.30-rc3.
> > It seems some counters in kernel are incorrect after we run some ffsb (disk I/O benchmark)
> > and swap-cp (a simple swap memory testing by cp on tmpfs). Free memory is bigger than
> > total memory.
> >
>
> oops. Can you try this patch please?
>
> ==== CUT HERE ====
>
> Properly account for freed pages in free_pages_bulk() and when allocating high-order pages in buffered_rmqueue()
>
> free_pages_bulk() updates the number of free pages in the zone but it is
> assuming that the pages being freed are order-0. While this is currently
> always true, it's wrong to assume the order is 0. This patch fixes the
> problem.
>
> buffered_rmqueue() is not updating NR_FREE_PAGES when allocating pages with
> __rmqueue(). This means that any high-order allocation will appear to increase
> the number of free pages leading to the situation where free pages appears to
> exceed available RAM. This patch accounts for those allocated pages properly.
>
> This is a candidate fix to the patch
> page-allocator-update-nr_free_pages-only-as-necessary.patch. It has yet to be
> verified as fixing a problem where the free pages count is getting corrupted.
>
> Signed-off-by: Mel Gorman <[email protected]>
> ---
> mm/page_alloc.c | 3 ++-
> 1 file changed, 2 insertions(+), 1 deletion(-)
>
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 3db5f57..dd69593 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -545,7 +545,7 @@ static void free_pages_bulk(struct zone *zone, int count,
> zone_clear_flag(zone, ZONE_ALL_UNRECLAIMABLE);
> zone->pages_scanned = 0;
>
> - __mod_zone_page_state(zone, NR_FREE_PAGES, count);
> + __mod_zone_page_state(zone, NR_FREE_PAGES, count << order);
> while (count--) {
> struct page *page;
>
> @@ -1151,6 +1151,7 @@ again:
> } else {
> spin_lock_irqsave(&zone->lock, flags);
> page = __rmqueue(zone, order, migratetype);
> + __mod_zone_page_state(zone, NR_FREE_PAGES, -(i << order));
Here 'i' should be 1?

> spin_unlock(&zone->lock);
> if (!page)
> goto failed;
I ran a kernel source cp and a swap-cp workload and didn't see the
bad counters now.

2009-04-28 10:27:34

by Mel Gorman

[permalink] [raw]
Subject: Re: [PATCH 00/22] Cleanup and optimise the page allocator V7

On Tue, Apr 28, 2009 at 09:59:17AM +0800, Zhang, Yanmin wrote:
> On Mon, 2009-04-27 at 15:38 +0100, Mel Gorman wrote:
> > On Mon, Apr 27, 2009 at 03:58:39PM +0800, Zhang, Yanmin wrote:
> > > On Wed, 2009-04-22 at 14:53 +0100, Mel Gorman wrote:
> > > > Here is V7 of the cleanup and optimisation of the page allocator and
> > > > it should be ready for wider testing. Please consider a possibility for
> > > > merging as a Pass 1 at making the page allocator faster. Other passes will
> > > > occur later when this one has had a bit of exercise. This patchset is based
> > > > on mmotm-2009-04-17 and I've tested it successfully on a small number of
> > > > machines.
> > > We ran some performance benchmarks against the V7 patch on top of 2.6.30-rc3.
> > > It seems some counters in the kernel are incorrect after we run ffsb (a disk I/O benchmark)
> > > and swap-cp (a simple swap-memory test using cp on tmpfs). Free memory ends up bigger than
> > > total memory.
> > >
> >
> > oops. Can you try this patch please?
> >
> > ==== CUT HERE ====
> >
> > Properly account for freed pages in free_pages_bulk() and when allocating high-order pages in buffered_rmqueue()
> >
> > free_pages_bulk() updates the number of free pages in the zone but it is
> > assuming that the pages being freed are order-0. While this is currently
> > always true, it's wrong to assume the order is 0. This patch fixes the
> > problem.
> >
> > buffered_rmqueue() is not updating NR_FREE_PAGES when allocating pages with
> > __rmqueue(). This means that any high-order allocation will appear to increase
> > the number of free pages, leading to the situation where the free page count
> > appears to exceed available RAM. This patch accounts for those allocated pages
> > properly.
> >
> > This is a candidate fix to the patch
> > page-allocator-update-nr_free_pages-only-as-necessary.patch. It has yet to be
> > verified as fixing a problem where the free pages count is getting corrupted.
> >
> > Signed-off-by: Mel Gorman <[email protected]>
> > ---
> > mm/page_alloc.c | 3 ++-
> > 1 file changed, 2 insertions(+), 1 deletion(-)
> >
> > diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> > index 3db5f57..dd69593 100644
> > --- a/mm/page_alloc.c
> > +++ b/mm/page_alloc.c
> > @@ -545,7 +545,7 @@ static void free_pages_bulk(struct zone *zone, int count,
> > zone_clear_flag(zone, ZONE_ALL_UNRECLAIMABLE);
> > zone->pages_scanned = 0;
> >
> > - __mod_zone_page_state(zone, NR_FREE_PAGES, count);
> > + __mod_zone_page_state(zone, NR_FREE_PAGES, count << order);
> > while (count--) {
> > struct page *page;
> >
> > @@ -1151,6 +1151,7 @@ again:
> > } else {
> > spin_lock_irqsave(&zone->lock, flags);
> > page = __rmqueue(zone, order, migratetype);
> > + __mod_zone_page_state(zone, NR_FREE_PAGES, -(i << order));
> Here 'i' should be 1?

1UL even. Not sure how I managed to send a version with 'i' after a build +
boot test.

>
> > spin_unlock(&zone->lock);
> > if (!page)
> > goto failed;
>
> I ran a workload copying the kernel source files plus the swap-cp test and
> didn't see any bad counters now.
>

I'm assuming you mean that it worked with s/i/1/. I'll send out an updated
version.

Thanks a lot.

--
Mel Gorman
Part-time Phd Student Linux Technology Center
University of Limerick IBM Dublin Software Lab

2009-04-28 10:32:18

by Mel Gorman

[permalink] [raw]
Subject: [PATCH] Properly account for freed pages in free_pages_bulk() and when allocating high-order pages in buffered_rmqueue()

This patch fixes two problems with one patch in the page allocator
optimisation patchset.

free_pages_bulk() updates the number of free pages in the zone but it is
assuming that the pages being freed are order-0. While this is currently
always true, it's wrong to assume the order is 0. This patch fixes the
problem.

buffered_rmqueue() is not updating NR_FREE_PAGES when allocating pages with
__rmqueue(). As a result, a high-order allocation will leave an elevated
free page count, leading to the situation where the free page count
exceeds available RAM. This patch accounts for those allocated pages properly.

This is a fix for page-allocator-update-nr_free_pages-only-as-necessary.patch.

Reported-by: Zhang, Yanmin <[email protected]>
Signed-off-by: Mel Gorman <[email protected]>
---
mm/page_alloc.c | 3 ++-
1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 5dd2d59..59eb2e1 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -545,7 +545,7 @@ static void free_pages_bulk(struct zone *zone, int count,
zone_clear_flag(zone, ZONE_ALL_UNRECLAIMABLE);
zone->pages_scanned = 0;

- __mod_zone_page_state(zone, NR_FREE_PAGES, count);
+ __mod_zone_page_state(zone, NR_FREE_PAGES, count << order);
while (count--) {
struct page *page;

@@ -1151,6 +1151,7 @@ again:
} else {
spin_lock_irqsave(&zone->lock, flags);
page = __rmqueue(zone, order, migratetype);
+ __mod_zone_page_state(zone, NR_FREE_PAGES, -(1UL << order));
spin_unlock(&zone->lock);
if (!page)
goto failed;

by Christoph Lameter

[permalink] [raw]
Subject: Re: [PATCH] Properly account for freed pages in free_pages_bulk() and when allocating high-order pages in buffered_rmqueue()

On Tue, 28 Apr 2009, Mel Gorman wrote:

> @@ -1151,6 +1151,7 @@ again:
> } else {
> spin_lock_irqsave(&zone->lock, flags);
> page = __rmqueue(zone, order, migratetype);
> + __mod_zone_page_state(zone, NR_FREE_PAGES, -(1UL << order));
> spin_unlock(&zone->lock);
> if (!page)
> goto failed;

__mod_zone_page_state takes a signed integer argument. Not sure what is
gained by the UL suffix here.

2009-04-28 16:51:47

by Mel Gorman

[permalink] [raw]
Subject: Re: [PATCH] Properly account for freed pages in free_pages_bulk() and when allocating high-order pages in buffered_rmqueue()

On Tue, Apr 28, 2009 at 12:37:22PM -0400, Christoph Lameter wrote:
> On Tue, 28 Apr 2009, Mel Gorman wrote:
>
> > @@ -1151,6 +1151,7 @@ again:
> > } else {
> > spin_lock_irqsave(&zone->lock, flags);
> > page = __rmqueue(zone, order, migratetype);
> > + __mod_zone_page_state(zone, NR_FREE_PAGES, -(1UL << order));
> > spin_unlock(&zone->lock);
> > if (!page)
> > goto failed;
>
> __mod_zone_page_state takes a signed integer argument. Not sure what is
> gained by the UL suffix here.
>

It matches other call sites such as __offline_isolated_pages(), it's habit when
using shifts like this, and it's a bit of paranoia; it doesn't hurt.

--
Mel Gorman
Part-time Phd Student Linux Technology Center
University of Limerick IBM Dublin Software Lab

2009-04-28 17:24:34

by Hugh Dickins

[permalink] [raw]
Subject: Re: [PATCH] Properly account for freed pages in free_pages_bulk() and when allocating high-order pages in buffered_rmqueue()

On Tue, 28 Apr 2009, Mel Gorman wrote:
> On Tue, Apr 28, 2009 at 12:37:22PM -0400, Christoph Lameter wrote:
> > On Tue, 28 Apr 2009, Mel Gorman wrote:
> >
> > > @@ -1151,6 +1151,7 @@ again:
> > > } else {
> > > spin_lock_irqsave(&zone->lock, flags);
> > > page = __rmqueue(zone, order, migratetype);
> > > + __mod_zone_page_state(zone, NR_FREE_PAGES, -(1UL << order));
> > > spin_unlock(&zone->lock);
> > > if (!page)
> > > goto failed;
> >
> > __mod_zone_page_state takes a signed integer argument. Not sure what is
> > gained by the UL suffix here.
> >
>
> > It matches other call sites such as __offline_isolated_pages(), it's habit when
> > using shifts like this, and it's a bit of paranoia; it doesn't hurt.

Well, not a big deal, but I'd say that it does hurt: by wasting people's
time (mine!), wondering what that "UL" is for - I'd have had to ask, if
Christoph hadn't cleared this up first (thank you). And elsewhere,
you're using an int << order to update NR_FREE_PAGES, so I don't see
the consistency argument.

Hugh
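
For anyone else pausing on the same point, a small standalone sketch (mod_state()
is just a stand-in with the same int parameter that __mod_zone_page_state() takes)
shows both spellings hand the callee the same negative delta on the usual
two's-complement targets, so the difference really is only readability:

#include <stdio.h>

/* Stand-in for __mod_zone_page_state()'s signed int delta parameter. */
static void mod_state(int delta)
{
        printf("delta = %d\n", delta);
}

int main(void)
{
        unsigned int order = 3;

        mod_state(-(1UL << order));     /* unsigned long wraps to 0xff...f8, truncates back to -8 */
        mod_state(-(1 << order));       /* plain int -8 */
        return 0;
}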

2009-04-28 18:08:18

by Mel Gorman

[permalink] [raw]
Subject: [PATCH] Properly account for freed pages in free_pages_bulk() and when allocating high-order pages in buffered_rmqueue() V2

free_pages_bulk() updates the number of free pages in the zone but it is
assuming that the pages being freed are order-0. While this is currently
always true, it's wrong to assume the order is 0. This patch fixes the problem.

buffered_rmqueue() is not updating NR_FREE_PAGES when allocating pages with
__rmqueue(). As a result, a high-order allocation will appear to increase
the number of free pages, leading to the situation where the free page count
exceeds available RAM. This patch accounts for those allocated pages properly.

This is a fix for page-allocator-update-nr_free_pages-only-as-necessary.patch.

Changelog since V1
o Change 1UL to 1 as it's unnecessary in this case to be unsigned long

Reported-by: Zhang, Yanmin <[email protected]>
Signed-off-by: Mel Gorman <[email protected]>
---
mm/page_alloc.c | 3 ++-
1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 5dd2d59..59eb2e1 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -545,7 +545,7 @@ static void free_pages_bulk(struct zone *zone, int count,
zone_clear_flag(zone, ZONE_ALL_UNRECLAIMABLE);
zone->pages_scanned = 0;

- __mod_zone_page_state(zone, NR_FREE_PAGES, count);
+ __mod_zone_page_state(zone, NR_FREE_PAGES, count << order);
while (count--) {
struct page *page;

@@ -1151,6 +1151,7 @@ again:
} else {
spin_lock_irqsave(&zone->lock, flags);
page = __rmqueue(zone, order, migratetype);
+ __mod_zone_page_state(zone, NR_FREE_PAGES, -(1 << order));
spin_unlock(&zone->lock);
if (!page)
goto failed;

2009-04-28 18:28:49

by Hugh Dickins

[permalink] [raw]
Subject: Re: [PATCH] Properly account for freed pages in free_pages_bulk() and when allocating high-order pages in buffered_rmqueue() V2

On Tue, 28 Apr 2009, Mel Gorman wrote:

> free_pages_bulk() updates the number of free pages in the zone but it is
> assuming that the pages being freed are order-0. While this is currently
> always true, it's wrong to assume the order is 0. This patch fixes the problem.
>
> buffered_rmqueue() is not updating NR_FREE_PAGES when allocating pages with
> __rmqueue(). As a result, a high-order allocation will appear to increase
> the number of free pages, leading to the situation where the free page count
> exceeds available RAM. This patch accounts for those allocated pages properly.
>
> This is a fix for page-allocator-update-nr_free_pages-only-as-necessary.patch.
>
> Changelog since V1
> o Change 1UL to 1 as it's unnecessary in this case to be unsigned long
>
> Reported-by: Zhang, Yanmin <[email protected]>
> Signed-off-by: Mel Gorman <[email protected]>

Thanks, Mel! (And I don't need to test it, since it's now
remarkably similar to a patch I was preparing yesterday.)

Acked-by: Hugh Dickins <[email protected]>

> ---
> mm/page_alloc.c | 3 ++-
> 1 file changed, 2 insertions(+), 1 deletion(-)
>
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 5dd2d59..59eb2e1 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -545,7 +545,7 @@ static void free_pages_bulk(struct zone *zone, int count,
> zone_clear_flag(zone, ZONE_ALL_UNRECLAIMABLE);
> zone->pages_scanned = 0;
>
> - __mod_zone_page_state(zone, NR_FREE_PAGES, count);
> + __mod_zone_page_state(zone, NR_FREE_PAGES, count << order);
> while (count--) {
> struct page *page;
>
> @@ -1151,6 +1151,7 @@ again:
> } else {
> spin_lock_irqsave(&zone->lock, flags);
> page = __rmqueue(zone, order, migratetype);
> + __mod_zone_page_state(zone, NR_FREE_PAGES, -(1 << order));
> spin_unlock(&zone->lock);
> if (!page)
> goto failed;

2009-04-28 18:36:20

by Mel Gorman

[permalink] [raw]
Subject: Re: [PATCH] Properly account for freed pages in free_pages_bulk() and when allocating high-order pages in buffered_rmqueue()

On Tue, Apr 28, 2009 at 06:15:30PM +0100, Hugh Dickins wrote:
> On Tue, 28 Apr 2009, Mel Gorman wrote:
> > On Tue, Apr 28, 2009 at 12:37:22PM -0400, Christoph Lameter wrote:
> > > On Tue, 28 Apr 2009, Mel Gorman wrote:
> > >
> > > > @@ -1151,6 +1151,7 @@ again:
> > > > } else {
> > > > spin_lock_irqsave(&zone->lock, flags);
> > > > page = __rmqueue(zone, order, migratetype);
> > > > + __mod_zone_page_state(zone, NR_FREE_PAGES, -(1UL << order));
> > > > spin_unlock(&zone->lock);
> > > > if (!page)
> > > > goto failed;
> > >
> > > __mod_zone_page_state takes a signed integer argument. Not sure what is
> > > gained by the UL suffix here.
> > >
> >
> > It matches other call sites such as __offline_isolated_pages(), it's habit when
> > using shifts like this, and it's a bit of paranoia; it doesn't hurt.
>
> Well, not a big deal, but I'd say that it does hurt: by wasting people's
> time (mine!), wondering what that "UL" is for - I'd have had to ask, if
> Christoph hadn't cleared this up first (thank you). And elsewhere,
> you're using an int << order to update NR_FREE_PAGES, so I don't see
> the consistency argument.
>

Ok, I'll concede things are not super-consistent. In this case I felt the
suffix "didn't hurt", but I didn't take into account that readers have to work
out what it is for each time, so I posted a second revision that should be
clearer.

As an aside, 'order' changes from being signed in some places to being
unsigned in others. There is probably a pass in there just for dealing with
this sort of inconsistency.

Thanks

--
Mel Gorman
Part-time Phd Student Linux Technology Center
University of Limerick IBM Dublin Software Lab

2009-04-30 13:35:55

by Mel Gorman

[permalink] [raw]
Subject: Re: [RFC] Replace the watermark-related union in struct zone with a watermark[] array

On Mon, Apr 27, 2009 at 02:04:03PM -0700, David Rientjes wrote:
> On Mon, 27 Apr 2009, Mel Gorman wrote:
>
> > > I thought the suggestion was for something like
> > >
> > > #define zone_wmark_min(z) (z->pages_mark[WMARK_MIN])
> > > ...
> >
> > Was it the only suggestion? I thought just replacing the union with an
> > array would be an option as well.
> >
> > The #define approach also requires setter versions like
> >
> > static inline void set_zone_wmark_min(struct zone *z, unsigned long val)
> > {
> > z->pages_mark[WMARK_MIN] = val;
> > }
> >
> > and you need one of those for each watermark if you are to avoid weirdness like
> >
> > zone_wmark_min(z) = val;
> >
> > which looks all wrong.
>
> Agreed, but we only set watermarks in a couple of different locations and
> they really have no reason to change otherwise, so I don't think it's
> necessary to care too much about how the setter looks.
>
> Adding individual get/set functions for each watermark seems like
> overkill.
>

I think what you're saying is that you'd be ok with

zone_wmark_min(z)
zone_wmark_low(z)
zone_wmark_high(z)

and z->pages_mark[WMARK_MIN] =
and z->pages_mark[WMARK_LOW] =
and z->pages_mark[WMARK_HIGH] =

?

Is that a significant improvement over what the patch currently does? To
me, it seems more verbose.

> I personally had no problem with the union struct aliasing the array; I
> think ->pages_min, ->pages_low, etc. are already very familiar.
>

Can the people who do have a problem with the union make some sort of
comment on how they think it should look?

Obviously, I'm pro-the-current-patch :/

--
Mel Gorman
Part-time Phd Student Linux Technology Center
University of Limerick IBM Dublin Software Lab

2009-04-30 13:51:15

by Dave Hansen

[permalink] [raw]
Subject: Re: [RFC] Replace the watermark-related union in struct zone with a watermark[] array

On Thu, 2009-04-30 at 14:35 +0100, Mel Gorman wrote:
> I think what you're saying is that you'd be ok with
>
> zone_wmark_min(z)
> zone_wmark_low(z)
> zone_wmark_high(z)
>
> and z->pages_mark[WMARK_MIN] =
> and z->pages_mark[WMARK_LOW] =
> and z->pages_mark[WMARK_HIGH] =
>
> ?
>
> Is that a significant improvement over what the patch currently does? To
> me, it seems more verbose.

Either way, there are _relatively_ few users. From a quick cscope, it
appears setup_per_zone_pages_min() is really the heaviest user assigning
them.

Personally, I do like having 'wmark' or something similar in the
function or structure member names, but I also like having the units
in there. There's probably not room for both, though. I'm fine
with the naming you have above. The only thing I might consider is
removing 'zone_' from the function names since it's implied by the
variable name:

min_wmark_pages(z)

The 'z->pages_mark[WMARK_*]' form is ugly, but it should be basically
restricted to use in setup_per_zone_pages_min(). I think that means we
don't need set_foo() functions because of a lack of use sites.

-- Dave
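
A rough sketch of the shape Dave is pointing at, with the array and index names
taken from the thread; the identifiers here are illustrative rather than a
committed interface:

/* Accessors over the watermark array, along the lines suggested above. */

enum zone_watermarks {
        WMARK_MIN,
        WMARK_LOW,
        WMARK_HIGH,
        NR_WMARK
};

struct zone {
        unsigned long pages_mark[NR_WMARK];
        /* ... rest of struct zone ... */
};

#define min_wmark_pages(z)      ((z)->pages_mark[WMARK_MIN])
#define low_wmark_pages(z)      ((z)->pages_mark[WMARK_LOW])
#define high_wmark_pages(z)     ((z)->pages_mark[WMARK_HIGH])

/*
 * Readers use the macros; the handful of writers (setup_per_zone_pages_min()
 * being the main one) assign z->pages_mark[WMARK_*] directly, which is why no
 * set_foo() helpers are needed.
 */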