2005-12-08 20:39:28

by Christoph Lameter

Subject: [PATCH 1/3] Zone reclaim V3: main patch

Zone reclaim allows the reclaiming of pages from a zone if the number of free
pages falls below the watermark even if other zones still have enough pages
available. Zone reclaim is of particular importance for NUMA machines. It can
be more beneficial to reclaim a page than taking the performance penalties
that come with allocating a page on a remote zone.

The patch replaces Martin Hicks' zone reclaim function (which never
worked properly).

Zone reclaim is enabled if the maximum distance to another node is higher
than RECLAIM_DISTANCE, which may be defined by an arch. By default
RECLAIM_DISTANCE is 20 meaning the distance to another node in the
same component (enclosure or motherboard).
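
As a rough user-space illustration of the rule described above (not code from the patch), the sketch below scans a flattened distance table and reports whether any node lies farther from the local node than RECLAIM_DISTANCE. The helper name reclaim_mode_for and the flattened table layout are assumptions made for this example; in the patch the check runs inside build_zonelists() and sets the global zone_reclaim_mode.

```c
#include <assert.h>
#include <stddef.h>

#define LOCAL_DISTANCE   10
#define RECLAIM_DISTANCE 20	/* default from this patch; an arch may override it */

/*
 * Hypothetical helper, not part of the patch. `distance` is an
 * nr_nodes x nr_nodes table stored row by row. Returns 1 if any other
 * node is farther from local_node than RECLAIM_DISTANCE, mirroring the
 * "distance > RECLAIM_DISTANCE" test the patch adds to build_zonelists().
 */
static int reclaim_mode_for(const int *distance, size_t nr_nodes,
			    size_t local_node)
{
	size_t node;

	assert(local_node < nr_nodes);

	for (node = 0; node < nr_nodes; node++) {
		if (node == local_node)
			continue;
		if (distance[local_node * nr_nodes + node] > RECLAIM_DISTANCE)
			return 1;
	}
	return 0;
}
```

With a two-node table where the remote distance is exactly 20 the helper returns 0, because the comparison is strict; a node at distance 40 makes it return 1.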

V2->V3:
- At Andi Kleen's suggestion: Use distance information to determine zone
  reclaim behavior instead of using an arch specific function.
- Do not compile zone_reclaim logic if this is not a NUMA system
- Limit number of unsuccessful reclaim attempts

Signed-off-by: Christoph Lameter <[email protected]>

Index: linux-2.6.15-rc4/mm/page_alloc.c
===================================================================
--- linux-2.6.15-rc4.orig/mm/page_alloc.c 2005-11-30 22:25:15.000000000 -0800
+++ linux-2.6.15-rc4/mm/page_alloc.c 2005-12-08 12:30:29.000000000 -0800
@@ -842,7 +842,9 @@ get_page_from_freelist(gfp_t gfp_mask, u
mark = (*z)->pages_high;
if (!zone_watermark_ok(*z, order, mark,
classzone_idx, alloc_flags))
- continue;
+ if (!zone_reclaim_mode ||
+ !zone_reclaim(*z, gfp_mask, order))
+ continue;
}

page = buffered_rmqueue(*z, order, gfp_mask);
@@ -1559,13 +1561,22 @@ static void __init build_zonelists(pg_da
prev_node = local_node;
nodes_clear(used_mask);
while ((node = find_next_best_node(local_node, &used_mask)) >= 0) {
+ int distance = node_distance(local_node, node);
+
+ /*
+ * If another node is sufficiently far away then it is better
+ * to reclaim pages in a zone before going off node.
+ */
+ if (distance > RECLAIM_DISTANCE)
+ zone_reclaim_mode = 1;
+
/*
* We don't want to pressure a particular node.
* So adding penalty to the first node in same
* distance group to make it round-robin.
*/
- if (node_distance(local_node, node) !=
- node_distance(local_node, prev_node))
+
+ if (distance != node_distance(local_node, prev_node))
node_load[node] += load;
prev_node = node;
load--;
Index: linux-2.6.15-rc4/mm/vmscan.c
===================================================================
--- linux-2.6.15-rc4.orig/mm/vmscan.c 2005-11-30 22:25:15.000000000 -0800
+++ linux-2.6.15-rc4/mm/vmscan.c 2005-12-08 12:30:29.000000000 -0800
@@ -1354,6 +1354,14 @@ static int __init kswapd_init(void)

module_init(kswapd_init)

+#ifdef CONFIG_NUMA
+/*
+ * Zone reclaim mode
+ *
+ * If non-zero call zone_reclaim when the number of free pages falls below
+ * the watermarks.
+ */
+int zone_reclaim_mode __read_mostly;

/*
* Try to free up some pages from this zone through reclaim.
@@ -1362,12 +1370,13 @@ int zone_reclaim(struct zone *zone, gfp_
{
struct scan_control sc;
int nr_pages = 1 << order;
- int total_reclaimed = 0;
+ struct task_struct *p = current;
+ struct reclaim_state reclaim_state;

- /* The reclaim may sleep, so don't do it if sleep isn't allowed */
- if (!(gfp_mask & __GFP_WAIT))
- return 0;
- if (zone->all_unreclaimable)
+ if (!(gfp_mask & __GFP_WAIT) ||
+ zone->zone_pgdat->node_id != numa_node_id() ||
+ zone->all_unreclaimable ||
+ atomic_read(&zone->reclaim_in_progress) > 0)
return 0;

sc.gfp_mask = gfp_mask;
@@ -1376,25 +1385,22 @@ int zone_reclaim(struct zone *zone, gfp_
sc.nr_mapped = read_page_state(nr_mapped);
sc.nr_scanned = 0;
sc.nr_reclaimed = 0;
- /* scan at the highest priority */
sc.priority = 0;
disable_swap_token();

- if (nr_pages > SWAP_CLUSTER_MAX)
- sc.swap_cluster_max = nr_pages;
- else
- sc.swap_cluster_max = SWAP_CLUSTER_MAX;
-
- /* Don't reclaim the zone if there are other reclaimers active */
- if (atomic_read(&zone->reclaim_in_progress) > 0)
- goto out;
+ sc.swap_cluster_max = max(nr_pages, SWAP_CLUSTER_MAX);

+ cond_resched();
+ p->flags |= PF_MEMALLOC;
+ reclaim_state.reclaimed_slab = 0;
+ p->reclaim_state = &reclaim_state;
shrink_zone(zone, &sc);
- total_reclaimed = sc.nr_reclaimed;
-
- out:
- return total_reclaimed;
+ p->reclaim_state = NULL;
+ current->flags &= ~PF_MEMALLOC;
+ cond_resched();
+ return sc.nr_reclaimed >= (1 << order);
}
+#endif

asmlinkage long sys_set_zone_reclaim(unsigned int node, unsigned int zone,
unsigned int state)
Index: linux-2.6.15-rc4/include/linux/swap.h
===================================================================
--- linux-2.6.15-rc4.orig/include/linux/swap.h 2005-11-30 22:25:15.000000000 -0800
+++ linux-2.6.15-rc4/include/linux/swap.h 2005-12-08 12:30:57.000000000 -0800
@@ -172,7 +172,17 @@ extern void swap_setup(void);

/* linux/mm/vmscan.c */
extern int try_to_free_pages(struct zone **, gfp_t);
+#ifdef CONFIG_NUMA
+extern int zone_reclaim_mode;
extern int zone_reclaim(struct zone *, gfp_t, unsigned int);
+#else
+#define zone_reclaim_mode 0
+static inline int zone_reclaim(struct zone *z, gfp_t mask,
+ unsigned int order)
+{
+ return 0;
+}
+#endif
extern int shrink_all_memory(int);
extern int vm_swappiness;

Index: linux-2.6.15-rc4/include/linux/topology.h
===================================================================
--- linux-2.6.15-rc4.orig/include/linux/topology.h 2005-11-30 22:25:15.000000000 -0800
+++ linux-2.6.15-rc4/include/linux/topology.h 2005-12-08 12:30:29.000000000 -0800
@@ -56,6 +56,9 @@
#define REMOTE_DISTANCE 20
#define node_distance(from,to) ((from) == (to) ? LOCAL_DISTANCE : REMOTE_DISTANCE)
#endif
+#ifndef RECLAIM_DISTANCE
+#define RECLAIM_DISTANCE 20
+#endif
#ifndef PENALTY_FOR_NODE_WITH_CPUS
#define PENALTY_FOR_NODE_WITH_CPUS (1)
#endif


2005-12-08 20:37:36

by Christoph Lameter

Subject: [PATCH 2/3] Zone reclaim V3: Remove debris from old zone reclaim

Remove debris of old zone reclaim

Removes the leftovers from prior attempts to implement Zone reclaim.

sys_set_zone_reclaim is not reachable in 2.6.14.

The reclaim_pages field in struct zone is only used by sys_set_zone_reclaim.

Signed-off-by: Christoph Lameter <[email protected]>

Index: linux-2.6.15-rc4/include/linux/mmzone.h
===================================================================
--- linux-2.6.15-rc4.orig/include/linux/mmzone.h 2005-11-30 22:25:15.000000000 -0800
+++ linux-2.6.15-rc4/include/linux/mmzone.h 2005-12-08 09:35:29.000000000 -0800
@@ -150,11 +150,6 @@ struct zone {
unsigned long pages_scanned; /* since last reclaim */
int all_unreclaimable; /* All pages pinned */

- /*
- * Does the allocator try to reclaim pages from the zone as soon
- * as it fails a watermark_ok() in __alloc_pages?
- */
- int reclaim_pages;
/* A count of how many reclaimers are scanning this zone */
atomic_t reclaim_in_progress;

Index: linux-2.6.15-rc4/mm/vmscan.c
===================================================================
--- linux-2.6.15-rc4.orig/mm/vmscan.c 2005-12-08 09:23:59.000000000 -0800
+++ linux-2.6.15-rc4/mm/vmscan.c 2005-12-08 09:35:29.000000000 -0800
@@ -1402,33 +1402,3 @@ int zone_reclaim(struct zone *zone, gfp_
}
#endif

-asmlinkage long sys_set_zone_reclaim(unsigned int node, unsigned int zone,
- unsigned int state)
-{
- struct zone *z;
- int i;
-
- if (!capable(CAP_SYS_ADMIN))
- return -EACCES;
-
- if (node >= MAX_NUMNODES || !node_online(node))
- return -EINVAL;
-
- /* This will break if we ever add more zones */
- if (!(zone & (1<<ZONE_DMA|1<<ZONE_NORMAL|1<<ZONE_HIGHMEM)))
- return -EINVAL;
-
- for (i = 0; i < MAX_NR_ZONES; i++) {
- if (!(zone & 1<<i))
- continue;
-
- z = &NODE_DATA(node)->node_zones[i];
-
- if (state)
- z->reclaim_pages = 1;
- else
- z->reclaim_pages = 0;
- }
-
- return 0;
-}

2005-12-08 20:37:46

by Christoph Lameter

Subject: [PATCH 3/3] Zone reclaim V3: Frequency of failed reclaim attempts

Reduce frequency of unsuccessful zone reclaim attempts

It is unlikely that zone reclaim is successful once it has failed. The
performance of the page allocator will sink significantly for off-node
allocation if every page allocation attempt first requires a zone reclaim
scan to establish that no local memory is available.

This patch limits the number of unsuccessful zone reclaim attempts to one
per tick by remembering the last time a zone reclaim failed on a zone.
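
The throttle described above can be modelled in user space roughly as follows. This is only a sketch: struct zone_model, zone_reclaim_model, and the `now` and `reclaimable` parameters are stand-ins invented for the example, and the real function calls shrink_zone() instead of being handed a page count.

```c
#include <assert.h>

/* Stand-in for the relevant part of struct zone; names are illustrative. */
struct zone_model {
	unsigned long last_unsuccessful_zone_reclaim;	/* tick of last failed scan */
	unsigned long scans;				/* scans actually performed */
};

/*
 * Model of the throttled reclaim path. `now` plays the role of the tick
 * counter and `reclaimable` is the number of pages a scan would free.
 * Returns 1 if at least 2^order pages were reclaimed.
 */
static int zone_reclaim_model(struct zone_model *z, unsigned long now,
			      unsigned long reclaimable, unsigned int order)
{
	assert(order < 8 * sizeof(unsigned long));

	/* A scan already failed during this tick: skip the expensive scan. */
	if (z->last_unsuccessful_zone_reclaim == now)
		return 0;

	z->scans++;

	/* Remember the tick in which a scan freed nothing at all. */
	if (reclaimable == 0)
		z->last_unsuccessful_zone_reclaim = now;

	return reclaimable >= (1UL << order);
}
```

In this model a second call within the same tick returns 0 without scanning, even if pages have become reclaimable in the meantime; scanning resumes on the next tick.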

Note that this approach may be avoided once we have per zone statistics
on the number of unmapped (==easily reclaimable) pages. I am working on
a statistics patch that may allow keeping track of unmapped pages per
zone. A check of that number may then allow an easy determination if it
makes sense to run zone reclaim.

Signed-off-by: Christoph Lameter <[email protected]>

Index: linux-2.6.15-rc4/mm/vmscan.c
===================================================================
--- linux-2.6.15-rc4.orig/mm/vmscan.c 2005-12-08 11:10:14.000000000 -0800
+++ linux-2.6.15-rc4/mm/vmscan.c 2005-12-08 12:04:38.000000000 -0800
@@ -1379,6 +1379,16 @@ int zone_reclaim(struct zone *zone, gfp_
atomic_read(&zone->reclaim_in_progress) > 0)
return 0;

+ /*
+ * If an unsuccessful zone reclaim occurred in this tick then we
+ * already needed to go off node before. Our local purity is already
+ * tainted and it's likely that the scan for easily reclaimable pages
+ * will be a waste of time. Continue off node allocations for the
+ * duration of this tick.
+ */
+ if (zone->last_unsuccessful_zone_reclaim == get_jiffies_64())
+ return 0;
+
sc.gfp_mask = gfp_mask;
sc.may_writepage = 0;
sc.may_swap = 0;
@@ -1397,6 +1407,8 @@ int zone_reclaim(struct zone *zone, gfp_
shrink_zone(zone, &sc);
p->reclaim_state = NULL;
current->flags &= ~PF_MEMALLOC;
+ if (sc.nr_reclaimed == 0)
+ zone->last_unsuccessful_zone_reclaim = get_jiffies_64();
cond_resched();
return sc.nr_reclaimed >= (1 << order);
}
Index: linux-2.6.15-rc4/include/linux/mmzone.h
===================================================================
--- linux-2.6.15-rc4.orig/include/linux/mmzone.h 2005-12-08 11:10:14.000000000 -0800
+++ linux-2.6.15-rc4/include/linux/mmzone.h 2005-12-08 12:00:43.000000000 -0800
@@ -153,6 +153,8 @@ struct zone {
/* A count of how many reclaimers are scanning this zone */
atomic_t reclaim_in_progress;

+ unsigned long last_unsuccessful_zone_reclaim;
+
/*
* prev_priority holds the scanning priority for this zone. It is
* defined as the scanning priority at which we achieved our reclaim

2005-12-08 20:52:54

by Andi Kleen

Subject: Re: [PATCH 3/3] Zone reclaim V3: Frequency of failed reclaim attempts

> + if (zone->last_unsuccessful_zone_reclaim == get_jiffies_64())
> + return 0;


and

>
> + unsigned long last_unsuccessful_zone_reclaim;

For long you don't need get_jiffies_64. On 32bit it would be 32bit
anyways and on 64bit even normal jiffies is 64bit. So normal
jiffies would suffice.
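
To illustrate the width mismatch, here is a small user-space sketch that models a 32-bit `unsigned long` with uint32_t. The struct and function names are invented for the example. It shows that once a 64-bit tick count no longer fits in 32 bits, a truncated stored value never compares equal to it, while a counter of the same width as the field does.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Models the field as it would be on a 32-bit kernel (illustrative only). */
struct zone32_model {
	uint32_t last_unsuccessful_zone_reclaim;
};

/* Store a 64-bit tick count into the narrower field; high bits are lost. */
static void record_failure(struct zone32_model *z, uint64_t jiffies64)
{
	assert(z != NULL);
	z->last_unsuccessful_zone_reclaim = (uint32_t)jiffies64;
}

/* Compare the stored field against the full 64-bit tick count. */
static int failed_this_tick_64(const struct zone32_model *z, uint64_t jiffies64)
{
	return z->last_unsuccessful_zone_reclaim == jiffies64;
}

/* Compare against a counter of the same width as the field. */
static int failed_this_tick(const struct zone32_model *z, uint32_t jiffies32)
{
	return z->last_unsuccessful_zone_reclaim == jiffies32;
}
```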

But I suspect it would be better to just merge the proper patch
with the full accounting instead of this kludge.

-Andi

2005-12-08 21:08:52

by Andi Kleen

Subject: Re: [PATCH 1/3] Zone reclaim V3: main patch

> Zone reclaim is enabled if the maximum distance to another node is higher
> than RECLAIM_DISTANCE, which may be defined by an arch. By default
> RECLAIM_DISTANCE is 20 meaning the distance to another node in the
> same component (enclosure or motherboard).

Sorry I made a mistake here earlier. On checking the ACPI spec
again it's valid to have distances < 20 (e.g. for a 1.5 NUMA factor
it would be legally 15)

So better just check > LOCAL_DISTANCE, not >= 20.
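
A small sketch of the difference between the two thresholds, assuming (as the message above suggests) that a SLIT distance is roughly the NUMA factor scaled so that local access is 10. The helper names are hypothetical; the first predicate uses the strict comparison from the patch.

```c
#include <assert.h>

#define LOCAL_DISTANCE   10
#define RECLAIM_DISTANCE 20

/*
 * Hypothetical helper: the NUMA factor is given in tenths, so 15 means
 * remote access is 1.5 times as slow as local access.
 */
static int slit_distance(int numa_factor_tenths)
{
	assert(numa_factor_tenths >= 10);
	return LOCAL_DISTANCE * numa_factor_tenths / 10;
}

/* Check as written in the patch. */
static int reclaim_above_reclaim_distance(int distance)
{
	return distance > RECLAIM_DISTANCE;
}

/* Check suggested in this reply. */
static int reclaim_above_local_distance(int distance)
{
	return distance > LOCAL_DISTANCE;
}
```

For a 1.5 NUMA factor the distance is 15, so only the second predicate would turn zone reclaim on.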

Also a lot of Opteron BIOS get that wrong, but I'm adding some
sanity checking now so it should work in future.

-Andi

2005-12-08 21:09:05

by Christoph Lameter

Subject: Re: [PATCH 3/3] Zone reclaim V3: Frequency of failed reclaim attempts

On Thu, 8 Dec 2005, Andi Kleen wrote:

> For long you don't need get_jiffies_64. On 32bit it would be 32bit
> anyways and on 64bit even normal jiffies is 64bit. So normal
> jiffies would suffice.

Patch follows.

> But I suspect it would be better to just merge the proper patch
> with the full accounting instead of this kludge.

I would also like to see the full accounting patch to fix this in the
right way but on the other hand I would like to disentangle different
patchsets as much as possible. The accounting patch may touch many
critical code paths.

2005-12-08 21:09:13

by Christoph Lameter

Subject: Re: [PATCH 3/3] Zone reclaim V3: Frequency of failed reclaim attempts

Patch:

Index: linux-2.6.15-rc4/mm/vmscan.c
===================================================================
--- linux-2.6.15-rc4.orig/mm/vmscan.c 2005-12-08 12:31:56.000000000 -0800
+++ linux-2.6.15-rc4/mm/vmscan.c 2005-12-08 13:07:05.000000000 -0800
@@ -1386,7 +1386,7 @@ int zone_reclaim(struct zone *zone, gfp_
* will be a waste of time. Continue off node allocations for the
* duration of this tick.
*/
- if (zone->last_unsuccessful_zone_reclaim == get_jiffies_64())
+ if (zone->last_unsuccessful_zone_reclaim == jiffies)
return 0;

sc.gfp_mask = gfp_mask;
@@ -1408,7 +1408,7 @@ int zone_reclaim(struct zone *zone, gfp_
p->reclaim_state = NULL;
current->flags &= ~PF_MEMALLOC;
if (sc.nr_reclaimed == 0)
- zone->last_unsuccessful_zone_reclaim = get_jiffies_64();
+ zone->last_unsuccessful_zone_reclaim = jiffies;
cond_resched();
return sc.nr_reclaimed >= (1 << order);
}

2005-12-08 21:10:53

by Andi Kleen

Subject: Re: [PATCH 3/3] Zone reclaim V3: Frequency of failed reclaim attempts

On Thu, Dec 08, 2005 at 01:08:50PM -0800, Christoph Lameter wrote:
> Patch:

Looks good thanks.

I hope this will help the many Opteron users who have always been
complaining about this too.

-Andi

2005-12-08 21:24:09

by Christoph Lameter

Subject: Re: [PATCH 1/3] Zone reclaim V3: main patch

On Thu, 8 Dec 2005, Andi Kleen wrote:

> Sorry I made a mistake here earlier. On checking the ACPI spec
> again it's valid to have distances < 20 (e.g. for a 1.5 NUMA factor
> it would be legally 15)

Saw that too.

> So better just check > LOCAL_DISTANCE, not >= 20.

For Altix 20 means that the other node is remote but in the same
enclosure / motherboard. Latency is very low in these cases. I think in
these small configurations it is better to go off node rather than using
the reclaim logic.

Other small configurations may have the same issues.

RECLAIM_DISTANCE can be set per arch if the default is not okay.

> Also a lot of Opteron BIOS get that wrong, but I'm adding some
> sanity checking now so it should work in future.

Great!

2005-12-08 22:51:18

by Andi Kleen

Subject: Re: [PATCH 1/3] Zone reclaim V3: main patch

> For Altix 20 means that the other node is remote but in the same
> enclosure / motherboard. Latency is very low in these cases. I think in
> these small configurations it is better to go off node rather than using
> the reclaim logic.

On Opterons the NUMA factors are usually < 2, more towards 1, but people
definitely notice a difference between on node and off node.
So I don't think that's a good heuristic.

I would use > LOCAL_DISTANCE or perhaps if you really want
a new constant with value 12-15.

> RECLAIM_DISTANCE can be set per arch if the default is not okay.

Well if anything it would be per system - perhaps need to make
it a boot option or somesuch later.

-Andi

2005-12-08 23:20:05

by Christoph Lameter

Subject: Re: [PATCH 1/3] Zone reclaim V3: main patch

On Thu, 8 Dec 2005, Andi Kleen wrote:

> I would use > LOCAL_DISTANCE or perhaps if you really want
> a new constant with value 12-15.

One may define RECLAIM_DISTANCE to be 12 for x86_64 in topology.h
in order to get zone reclaim earlier for the opteron clusters. I would
think though that large opteron clusters also have distances > 20.

My experience is that at 20 systems do not need zone reclaim yet.

> > RECLAIM_DISTANCE can be set per arch if the default is not okay.
>
> Well if anything it would be per system - perhaps need to make
> it a boot option or somesuch later.

The idea here was to avoid any manual configuration. The numa distances
must relate in some real way to performance (at least per arch) in order
for the automatic determination of zone reclaim to make sense. We could
have a boot time override but then RECLAIM_DISTANCE needs to be a
variable not a macro.


2005-12-08 23:28:42

by Andi Kleen

Subject: Re: [discuss] Re: [PATCH 1/3] Zone reclaim V3: main patch

On Thu, Dec 08, 2005 at 03:19:36PM -0800, Christoph Lameter wrote:
> On Thu, 8 Dec 2005, Andi Kleen wrote:
>
> > I would use > LOCAL_DISTANCE or perhaps if you really want
> > a new constant with value 12-15.
>
> One may define RECLAIM_DISTANCE to be 12 for x86_64 in topology.h
> in order to get zone reclaim earlier for the opteron clusters. I would
> think though that large opteron clusters also have distances > 20.
>
> My experience is that at 20 systems do not need zone reclaim yet.

I really cannot confirm your experience here.

>
> > > RECLAIM_DISTANCE can be set per arch if the default is not okay.
> >
> > Well if anything it would be per system - perhaps need to make
> > it a boot option or somesuch later.
>
> The idea here was to avoid any manual configuration. The numa distances

Sure as a default this makes sense.

I'm just questioning your default values.

> must relate in some real way to performance (at least per arch) in order
> for the automatic determination of zone reclaim to make sense. We could
> have a boot time override but then RECLAIM_DISTANCE needs to be a
> variable not a macro.

The macro can always be defined to a variable later, no problem.
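
One way that could look, as a minimal sketch: the macro name stays, but it expands to a variable that something like a boot option handler could change. The variable reclaim_distance and the helpers set_reclaim_distance and wants_zone_reclaim are assumptions for illustration and do not appear in the patch.

```c
#include <assert.h>

/* Hypothetical variable holding the threshold; 20 is the patch default. */
static int reclaim_distance = 20;

/* Existing users of the macro keep compiling unchanged. */
#define RECLAIM_DISTANCE reclaim_distance

/* Hypothetical hook that a boot option parser could call. */
static void set_reclaim_distance(int distance)
{
	assert(distance > 0);
	reclaim_distance = distance;
}

/* Same comparison as in the patch, now against a runtime value. */
static int wants_zone_reclaim(int distance)
{
	return distance > RECLAIM_DISTANCE;
}
```

With the default, a node at distance 15 does not trigger zone reclaim; after set_reclaim_distance(12) it does.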

-Andi

2005-12-08 23:35:26

by Christoph Lameter

Subject: Re: [discuss] Re: [PATCH 1/3] Zone reclaim V3: main patch

On Fri, 9 Dec 2005, Andi Kleen wrote:

> > My experience is that at 20 systems do not need zone reclaim yet.
>
> I really cannot confirm your experience here.

Maybe the meaning of these numbers varies? I know that 10 is a local
access but the assumption in include/linux/topology.h that 20 is a remote
access is probably already a guess.

I know that our Altix machines seem to use 10 for a local and 20 for
nonlocal but same box. The distances then increase from there.

2005-12-08 23:41:01

by Andi Kleen

Subject: Re: [discuss] Re: [PATCH 1/3] Zone reclaim V3: main patch

On Thu, Dec 08, 2005 at 03:35:05PM -0800, Christoph Lameter wrote:
> On Fri, 9 Dec 2005, Andi Kleen wrote:
>
> > > My experience is that at 20 systems do not need zone reclaim yet.
> >
> > I really cannot confirm your experience here.
>
> Maybe the meaning of these numbers varies? I know that 10 is a local
> access but the assumption in include/linux/topology.h that 20 is a remote
> access is probably already a guess.

The spec seems to suggest it's roughly the NUMA factor scaled (so for 1.4
you would get 14). But I haven't actually seen an Opteron with correct
SLIT yet so I don't know what they use ...

> I know that our Altix machines seem to use 10 for a local and 20 for
> nonlocal but same box. The distances then increase from there.

Unless non local same box is 2 times as slow as the local I wouldn't
consider that correct. (I would expect the Altix to do better than that)

-Andi

2005-12-09 00:10:45

by Christoph Lameter

Subject: Re: [discuss] Re: [PATCH 1/3] Zone reclaim V3: main patch

On Fri, 9 Dec 2005, Andi Kleen wrote:

> Unless non local same box is 2 times as slow as the local I wouldn't
> consider that correct. (I would expect the Altix to do better than that)

Maybe Jack could give us a hint how these SLIT numbers relate to
reality?