2005-12-05 19:01:10

by Christoph Lameter

Subject: [PATCH 1/3] Arch specific zone reclaim framework

Generic framework for arch specific zone reclaim.

Zone reclaim allows pages to be reclaimed from a zone if the number of free
pages falls below the watermark, even if other zones still have enough pages
available.

Zone reclaim is of particular importance for NUMA machines. It can be more
beneficial to reclaim a page than to take the performance penalty that comes
with allocating a page in a remote zone. This may also be useful for
implementing reclaim of DMA zones on some architectures.

The penalty incurred by remote page accesses varies with the NUMA factor of
the architecture. If the NUMA factor is very low (architectures that have
multiple nodes on the same motherboard, such as Opteron multi-processor
boards), then no page reclaim may be needed, since access to another node's
memory is almost as fast as a local access.
On Itanium and other bus-based NUMA architectures, a remote access usually
has to go over some sort of NUMA interlink. There it is worth sacrificing
easily reclaimable pages in order to allow a local allocation. Typically a
large number of easily reclaimable pages are available if some files have
just been scanned or if an application that mmapped many files has just
terminated.

Other architectures (especially software NUMA such as VirtualIron) may have
even higher NUMA factors, and for those it may be beneficial to do even more
cleaning of the local zone before going off-node.

This patch adds a hook to the page allocator and defines a generic zone
reclaim function. Together these allow an arch to implement its own zone
reclaim, so that the off-node allocation behavior of the page allocator can
be controlled in an arch-specific way. The patch replaces Martin Hicks'
zone reclaim function (which never worked properly).
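
For an architecture with a very high NUMA factor, the hook could
hypothetically look something like the sketch below, which lets
zone_reclaim() also write back and swap pages before giving up on a local
allocation; the IA64 implementation in the next patch is more conservative
and passes 0 for both flags.

#include <linux/gfp.h>		/* gfp_t */
#include <linux/mmzone.h>	/* struct zone */
#include <linux/swap.h>		/* zone_reclaim() as declared by this patch */
#include <linux/topology.h>	/* numa_node_id() */

/*
 * Hypothetical arch_zone_reclaim() for an architecture with a very high
 * NUMA factor: let zone_reclaim() also write back dirty pages and swap
 * (last two arguments set to 1) before giving up on a local allocation.
 */
int arch_zone_reclaim(struct zone *z, gfp_t mask, unsigned int order)
{
	/* Only ever reclaim from zones on the local node */
	if (z->zone_pgdat->node_id != numa_node_id())
		return 0;

	/* may_writepage = 1, may_swap = 1: clean the local zone harder */
	return zone_reclaim(z, mask, 1, 1) > (1 << order);
}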

Signed-off-by: Christoph Lameter <[email protected]>

Index: linux-2.6.15-rc4/mm/page_alloc.c
===================================================================
--- linux-2.6.15-rc4.orig/mm/page_alloc.c 2005-11-30 22:25:15.000000000 -0800
+++ linux-2.6.15-rc4/mm/page_alloc.c 2005-12-05 10:21:36.000000000 -0800
@@ -842,7 +842,8 @@ get_page_from_freelist(gfp_t gfp_mask, u
mark = (*z)->pages_high;
if (!zone_watermark_ok(*z, order, mark,
classzone_idx, alloc_flags))
- continue;
+ if (!arch_zone_reclaim(*z, gfp_mask, order))
+ continue;
}

page = buffered_rmqueue(*z, order, gfp_mask);
Index: linux-2.6.15-rc4/include/linux/swap.h
===================================================================
--- linux-2.6.15-rc4.orig/include/linux/swap.h 2005-11-30 22:25:15.000000000 -0800
+++ linux-2.6.15-rc4/include/linux/swap.h 2005-12-05 10:21:36.000000000 -0800
@@ -172,7 +172,7 @@ extern void swap_setup(void);

/* linux/mm/vmscan.c */
extern int try_to_free_pages(struct zone **, gfp_t);
-extern int zone_reclaim(struct zone *, gfp_t, unsigned int);
+extern int zone_reclaim(struct zone *, gfp_t, int, int);
extern int shrink_all_memory(int);
extern int vm_swappiness;

Index: linux-2.6.15-rc4/mm/vmscan.c
===================================================================
--- linux-2.6.15-rc4.orig/mm/vmscan.c 2005-11-30 22:25:15.000000000 -0800
+++ linux-2.6.15-rc4/mm/vmscan.c 2005-12-05 10:21:36.000000000 -0800
@@ -1354,47 +1354,45 @@ static int __init kswapd_init(void)

module_init(kswapd_init)

-
/*
* Try to free up some pages from this zone through reclaim.
*/
-int zone_reclaim(struct zone *zone, gfp_t gfp_mask, unsigned int order)
+#ifdef CONFIG_ARCH_ZONE_RECLAIM
+int zone_reclaim(struct zone *z, gfp_t gfp_mask, int writepage, int swap)
{
+ struct task_struct *p = current;
struct scan_control sc;
- int nr_pages = 1 << order;
- int total_reclaimed = 0;
+ struct reclaim_state reclaim_state;

- /* The reclaim may sleep, so don't do it if sleep isn't allowed */
- if (!(gfp_mask & __GFP_WAIT))
- return 0;
- if (zone->all_unreclaimable)
- return 0;
-
- sc.gfp_mask = gfp_mask;
- sc.may_writepage = 0;
- sc.may_swap = 0;
- sc.nr_mapped = read_page_state(nr_mapped);
sc.nr_scanned = 0;
sc.nr_reclaimed = 0;
- /* scan at the highest priority */
+ sc.nr_mapped = read_page_state(nr_mapped);
sc.priority = 0;
- disable_swap_token();
-
- if (nr_pages > SWAP_CLUSTER_MAX)
- sc.swap_cluster_max = nr_pages;
- else
- sc.swap_cluster_max = SWAP_CLUSTER_MAX;
+ sc.gfp_mask = gfp_mask;
+ sc.may_writepage = writepage;
+ sc.may_swap = swap;
+ sc.swap_cluster_max = SWAP_CLUSTER_MAX;

+ /* The reclaim may sleep, so don't do it if sleep isn't allowed */
+ if (!(gfp_mask & __GFP_WAIT))
+ return 0;
+ if (z->all_unreclaimable)
+ return 0;
/* Don't reclaim the zone if there are other reclaimers active */
- if (atomic_read(&zone->reclaim_in_progress) > 0)
- goto out;
-
- shrink_zone(zone, &sc);
- total_reclaimed = sc.nr_reclaimed;
+ if (atomic_read(&z->reclaim_in_progress) > 0)
+ return 0;

- out:
- return total_reclaimed;
+ cond_resched();
+ p->flags |= PF_MEMALLOC;
+ reclaim_state.reclaimed_slab = 0;
+ p->reclaim_state = &reclaim_state;
+ shrink_zone(z, &sc);
+ p->reclaim_state = NULL;
+ current->flags &= ~PF_MEMALLOC;
+ cond_resched();
+ return sc.nr_reclaimed;
}
+#endif

asmlinkage long sys_set_zone_reclaim(unsigned int node, unsigned int zone,
unsigned int state)
Index: linux-2.6.15-rc4/include/linux/gfp.h
===================================================================
--- linux-2.6.15-rc4.orig/include/linux/gfp.h 2005-11-30 22:25:15.000000000 -0800
+++ linux-2.6.15-rc4/include/linux/gfp.h 2005-12-05 10:21:54.000000000 -0800
@@ -100,6 +100,13 @@ static inline int gfp_zone(gfp_t gfp)
static inline void arch_free_page(struct page *page, int order) { }
#endif

+#ifndef CONFIG_ARCH_ZONE_RECLAIM
+static inline int arch_zone_reclaim(struct zone *z, gfp_t mask, unsigned int order)
+{
+ return 0;
+}
+#endif
+
extern struct page *
FASTCALL(__alloc_pages(gfp_t, unsigned int, struct zonelist *));


2005-12-05 19:01:16

by Christoph Lameter

Subject: [PATCH 2/3] ia64 zone reclaim

IA64 zone reclaim

Set up a zone reclaim function for IA64. The zone reclaim function will
reclaim easily reclaimable pages. Off-node allocations will occur once no
easily reclaimable pages remain.

Signed-off-by: Christoph Lameter <[email protected]>

Index: linux-2.6.15-rc4/arch/ia64/mm/numa.c
===================================================================
--- linux-2.6.15-rc4.orig/arch/ia64/mm/numa.c 2005-11-30 22:25:15.000000000 -0800
+++ linux-2.6.15-rc4/arch/ia64/mm/numa.c 2005-12-05 10:12:14.000000000 -0800
@@ -17,6 +17,7 @@
#include <linux/node.h>
#include <linux/init.h>
#include <linux/bootmem.h>
+#include <linux/swap.h>
#include <asm/mmzone.h>
#include <asm/numa.h>

@@ -71,3 +72,17 @@ int early_pfn_to_nid(unsigned long pfn)
return 0;
}
#endif
+
+/*
+ * Remove easily reclaimable local pages if watermarks would prevent a
+ * local allocation.
+ */
+int arch_zone_reclaim(struct zone *z, gfp_t mask,
+ unsigned int order)
+{
+ if (z->zone_pgdat->node_id == numa_node_id()) {
+ if (zone_reclaim(z, mask, 0, 0) > (1 << order))
+ return 1;
+ }
+ return 0;
+}
Index: linux-2.6.15-rc4/arch/ia64/Kconfig
===================================================================
--- linux-2.6.15-rc4.orig/arch/ia64/Kconfig 2005-11-30 22:25:15.000000000 -0800
+++ linux-2.6.15-rc4/arch/ia64/Kconfig 2005-12-03 13:30:27.000000000 -0800
@@ -338,6 +338,10 @@ config HAVE_ARCH_EARLY_PFN_TO_NID
def_bool y
depends on NEED_MULTIPLE_NODES

+config ARCH_ZONE_RECLAIM
+ def_bool y
+ depends on NUMA
+
config IA32_SUPPORT
bool "Support for Linux/x86 binaries"
help

2005-12-05 19:01:46

by Christoph Lameter

Subject: [PATCH 3/3] Remove debris from old zone reclaim

Remove debris of old zone reclaim

Removes the leftovers from prior attempts to implement zone reclaim.

sys_set_zone_reclaim is not reachable in 2.6.14.

The reclaim_pages field in struct zone is only used by sys_set_zone_reclaim.

Signed-off-by: Christoph Lameter <[email protected]>

Index: linux-2.6.15-rc4/include/linux/mmzone.h
===================================================================
--- linux-2.6.15-rc4.orig/include/linux/mmzone.h 2005-11-30 22:25:15.000000000 -0800
+++ linux-2.6.15-rc4/include/linux/mmzone.h 2005-12-05 09:57:36.000000000 -0800
@@ -150,11 +150,6 @@ struct zone {
unsigned long pages_scanned; /* since last reclaim */
int all_unreclaimable; /* All pages pinned */

- /*
- * Does the allocator try to reclaim pages from the zone as soon
- * as it fails a watermark_ok() in __alloc_pages?
- */
- int reclaim_pages;
/* A count of how many reclaimers are scanning this zone */
atomic_t reclaim_in_progress;

Index: linux-2.6.15-rc4/mm/vmscan.c
===================================================================
--- linux-2.6.15-rc4.orig/mm/vmscan.c 2005-12-03 13:34:59.000000000 -0800
+++ linux-2.6.15-rc4/mm/vmscan.c 2005-12-05 09:57:36.000000000 -0800
@@ -1394,33 +1394,3 @@ int zone_reclaim(struct zone *z, gfp_t g
}
#endif

-asmlinkage long sys_set_zone_reclaim(unsigned int node, unsigned int zone,
- unsigned int state)
-{
- struct zone *z;
- int i;
-
- if (!capable(CAP_SYS_ADMIN))
- return -EACCES;
-
- if (node >= MAX_NUMNODES || !node_online(node))
- return -EINVAL;
-
- /* This will break if we ever add more zones */
- if (!(zone & (1<<ZONE_DMA|1<<ZONE_NORMAL|1<<ZONE_HIGHMEM)))
- return -EINVAL;
-
- for (i = 0; i < MAX_NR_ZONES; i++) {
- if (!(zone & 1<<i))
- continue;
-
- z = &NODE_DATA(node)->node_zones[i];
-
- if (state)
- z->reclaim_pages = 1;
- else
- z->reclaim_pages = 0;
- }
-
- return 0;
-}

2005-12-05 19:12:11

by Christoph Hellwig

Subject: Re: [PATCH 1/3] Arch specific zone reclaim framework

Nack. Arch control over VM reclaim logic will lead to a total mess, with VM
logic spread all over the arch code. Please introduce a framework that allows
individual machines to set control parameters, but procedural callouts are a
big no-no.

2005-12-05 19:25:12

by Christoph Lameter

Subject: Re: [PATCH 1/3] Arch specific zone reclaim framework

On Mon, 5 Dec 2005, Christoph Hellwig wrote:

> Nack. Arch control over VM reclaim logic will lead to a total mess, with VM
> logic spread all over the arch code. Please introduce a framework that allows
> individual machines to set control parameters, but procedural callouts are a
> big no-no.

The different penalties for off-node accesses on various architectures may
dictate different techniques in order to get the best performance. Parameter
control was tried before and it was not nice. IMHO this is the cleanest
possible solution.



2005-12-06 15:38:45

by Andi Kleen

Subject: Re: [PATCH 1/3] Arch specific zone reclaim framework

Christoph Lameter <[email protected]> writes:

> Generic framework for arch specific zone reclaim.
>
> Zone reclaim allows pages to be reclaimed from a zone if the number of free
> pages falls below the watermark, even if other zones still have enough pages
> available.
>
> Zone reclaim is of particular importance for NUMA machines. It can be more
> beneficial to reclaim a page than to take the performance penalty that comes
> with allocating a page in a remote zone. This may also be useful for
> implementing reclaim of DMA zones on some architectures.
>
> The penalty incurred by remote page accesses varies with the NUMA factor of
> the architecture. If the NUMA factor is very low (architectures that have
> multiple nodes on the same motherboard, such as Opteron multi-processor
> boards), then no page reclaim may be needed, since access to another node's
> memory is almost as fast as a local access.
> On Itanium and other bus-based NUMA architectures, a remote access usually
> has to go over some sort of NUMA interlink. There it is worth sacrificing
> easily reclaimable pages in order to allow a local allocation. Typically a
> large number of easily reclaimable pages are available if some files have
> just been scanned or if an application that mmapped many files has just
> terminated.
>
> Other architectures (especially software NUMA such as VirtualIron) may have
> even higher NUMA factors, and for those it may be beneficial to do even more
> cleaning of the local zone before going off-node.

I think it's a very, very bad idea to have architecture-specific
functions for such generic VM tasks. I'm all for fixing this
particular problem, but do it in generic code, possibly
with an ifdef and some arch-settable parameters. But no
architecture-specific VM code please. Going down that path
would cause long-term maintenance headaches.

I suppose this particular problem could be handled just by checking
node_distance().
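
A rough sketch of such a check (the distance threshold is an arbitrary
placeholder and the helper name is made up, purely for illustration):

#include <linux/mmzone.h>	/* struct zone, zone_pgdat */
#include <linux/topology.h>	/* node_distance(), numa_node_id() */

/* Arbitrary placeholder threshold, in SLIT distance units */
#define RECLAIM_DISTANCE_THRESHOLD	20

/*
 * Return 1 if the candidate fallback zone lives on a node that is so far
 * away that reclaiming pages on the local node first looks preferable.
 */
static inline int reclaim_before_falling_back(struct zone *fallback)
{
	return node_distance(numa_node_id(),
			     fallback->zone_pgdat->node_id)
				> RECLAIM_DISTANCE_THRESHOLD;
}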

-Andi