2005-11-18 01:52:05

by Christoph Lameter

Subject: [PATCH] NUMA policies in the slab allocator V2

This patch fixes a regression in 2.6.14 against 2.6.13 that causes an
imbalance in memory allocation during bootup.

The slab allocator in 2.6.13 is not NUMA-aware and simply calls alloc_pages().
This means that memory policies control the behavior of alloc_pages(). During
bootup the memory policy is set to MPOL_INTERLEAVE, which spreads allocations
out over all available nodes. The slab allocator in 2.6.13 also has only a
single list of slab pages. As a result the per-cpu slab cache and the
spinlock-controlled page lists may contain slab entries from off-node memory;
the allocator makes no effort to discern the locality of an entry on its
lists.

The NUMA aware slab allocator in 2.6.14 controls locality of the slab pages
explicitly by calling alloc_pages_node(). The NUMA slab allocator
manages slab entries by having lists of available slab pages for each node.
The per cpu slab cache can only contain slab entries associated with the node
local to the processor. This guarantees that the default allocation mode
of the slab allocator always assigns local memory if available.
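
As a rough user-space sketch of this arrangement (per-node lists plus a
per-cpu cache that may only hold node-local objects), consider the following
model. All names here are hypothetical stand-ins, not the kernel's data
structures:

#include <stdio.h>

#define NR_NODES 4

/* Stand-in for the kernel's per-node slab lists: a simple count of
 * free objects replaces the real list heads. */
static int free_objects[NR_NODES];

/* Per-CPU cache: in 2.6.14 it may only hold objects from the CPU's
 * local node, so a node id and a count suffice here. */
struct cpu_cache {
	int node;
	int avail;
};

static int cache_alloc_model(struct cpu_cache *ac)
{
	if (ac->avail == 0) {
		if (free_objects[ac->node] == 0)
			free_objects[ac->node] = 8;	/* alloc_pages_node(local) */
		ac->avail = free_objects[ac->node];	/* refill from local lists */
		free_objects[ac->node] = 0;
	}
	ac->avail--;
	return ac->node;	/* default mode: always node-local memory */
}

int main(void)
{
	struct cpu_cache ac = { .node = 2, .avail = 0 };
	int i;

	for (i = 0; i < 3; i++)
		printf("allocation %d served from node %d\n", i,
		       cache_alloc_model(&ac));
	return 0;
}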

Setting MPOL_INTERLEAVE as a default policy during bootup therefore no longer
has any effect. In 2.6.14 all node-unspecific slab allocations are performed
on the boot processor, which means that most of the key data structures end
up allocated on a single node. Most processors will have to refer to these
structures, making the boot node a potential bottleneck. This may reduce
performance and cause unnecessary memory pressure on the boot node.

This patch implements NUMA policies in the slab layer. The slab allocator
must now apply NUMA memory policies explicitly itself, since the NUMA slab
allocator no longer lets the page allocator control locality.

The check for policies is made directly at the beginning of __cache_alloc
using current->mempolicy. The memory policy is already frequently checked by
the page allocator (alloc_page_vma() and alloc_pages_current()), so it is
highly likely that the cacheline is already present. For MPOL_INTERLEAVE,
kmalloc() will spread successive requests over one node after another so
that an equal distribution of allocations is obtained during bootup.
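
To illustrate the round-robin behavior, here is a small user-space model of
interleaved node selection; the cursor plays the role of the kernel's
per-task interleave state, and all names are merely illustrative:

#include <stdio.h>

#define NR_NODES 4

/* Nodes permitted by a hypothetical interleave policy; node 2 is
 * deliberately excluded to show that it gets skipped. */
static const int allowed[NR_NODES] = { 1, 1, 0, 1 };

/* Round-robin cursor; the kernel keeps this per task. */
static int il_next;

/* Return the current node and advance the cursor to the next allowed
 * node, one step per allocation request. */
static int pick_interleave_node(void)
{
	int nid = il_next;

	do {
		il_next = (il_next + 1) % NR_NODES;
	} while (!allowed[il_next]);
	return nid;
}

int main(void)
{
	int i;

	for (i = 0; i < 8; i++)
		printf("kmalloc request %d -> node %d\n", i,
		       pick_interleave_node());
	return 0;
}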

It is not possible to push the policy check to lower layers of the NUMA slab
allocator, since the per-cpu caches now contain only slab entries from the
current node. If the policy says that the local node is not preferred, or is
forbidden, then there is no point in checking the per-cpu cache or the local
lists of slab pages; the allocation is better directed immediately to the
lists containing slab entries for the allowed set of nodes.

This way of applying policy also fixes another strange behavior in 2.6.13.
alloc_pages() is controlled by the memory allocation policy of the current
process. It could therefore be that one process running with MPOL_INTERLEAVE
obtains a new page following that policy, e.g. because no slab entries are
left on the lists. A page can typically be used for multiple slab entries,
but let's say that the current process uses only one. The other entries are
then added to the slab lists. These are now non-local entries on the slab
lists, despite the possible availability of local pages that would provide
faster access and increase the performance of the application.

Another process without MPOL_INTERLEAVE may now run and expect a local slab
entry from kmalloc(). However, the free slab entries from the off-node page
obtained by the other process via MPOL_INTERLEAVE are still in the cache.
The process will then get an off-node slab entry even though slab entries
local to that process may be available. In other words, the policy of one
process may contaminate the locality of the slab caches seen by other
processes.
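
A small user-space simulation of this contamination effect under the 2.6.13
scheme (a single shared list, no locality check; all names are hypothetical):

#include <stdio.h>

#define LOCAL_NODE 0

/* 2.6.13 model: one shared list of free slab objects, each tagged with
 * the node its page came from; nobody checks locality on allocation. */
static int list_node[16];
static int list_len;

/* Process A (MPOL_INTERLEAVE) triggers a page allocation on a remote
 * node, uses one object and leaves the other seven on the shared list. */
static void proc_a_allocates(void)
{
	int remote = 1;	/* node chosen by A's interleave policy */
	int i;

	for (i = 0; i < 7; i++)
		list_node[list_len++] = remote;
}

/* Process B (default policy) simply takes the first free object. */
static int proc_b_allocates(void)
{
	return list_node[--list_len];
}

int main(void)
{
	proc_a_allocates();
	printf("process B on node %d got an object from node %d\n",
	       LOCAL_NODE, proc_b_allocates());
	return 0;
}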

This patch in effect ensures that a per-process policy is followed for the
allocation of slab entries and that the memory policy of one process cannot
influence another. A process with default policy will always get a local
slab entry if one is available, and a process using memory policies will get
its memory arranged as requested. Off-node slab allocation requires the use
of spinlocks and makes the per-cpu caches unusable, so a process using
memory policies to redirect allocations off-node will have to cope with
additional lock overhead on top of the latency added by accessing a remote
slab entry.

Changes V1->V2
- Remove an #ifdef CONFIG_NUMA block by moving the forward declaration
into the prior #ifdef CONFIG_NUMA section.

- Give the function that determines the node to allocate from a saner
name.

Signed-off-by: Christoph Lameter <[email protected]>

Index: linux-2.6.15-rc1-mm1/mm/slab.c
===================================================================
--- linux-2.6.15-rc1-mm1.orig/mm/slab.c 2005-11-17 15:17:32.000000000 -0800
+++ linux-2.6.15-rc1-mm1/mm/slab.c 2005-11-17 16:54:24.000000000 -0800
@@ -103,6 +103,7 @@
#include <linux/rcupdate.h>
#include <linux/string.h>
#include <linux/nodemask.h>
+#include <linux/mempolicy.h>

#include <asm/uaccess.h>
#include <asm/cacheflush.h>
@@ -775,6 +776,8 @@ static struct array_cache *alloc_arrayca
}

#ifdef CONFIG_NUMA
+static void *__cache_alloc_node(kmem_cache_t *, gfp_t, int);
+
static inline struct array_cache **alloc_alien_cache(int node, int limit)
{
struct array_cache **ac_ptr;
@@ -2538,6 +2541,15 @@ static inline void *____cache_alloc(kmem
void* objp;
struct array_cache *ac;

+#ifdef CONFIG_NUMA
+ if (current->mempolicy) {
+ int nid = slab_node(current->mempolicy);
+
+ if (nid != numa_node_id())
+ return __cache_alloc_node(cachep, flags, nid);
+ }
+#endif
+
check_irq_off();
ac = ac_data(cachep);
if (likely(ac->avail)) {
Index: linux-2.6.15-rc1-mm1/mm/mempolicy.c
===================================================================
--- linux-2.6.15-rc1-mm1.orig/mm/mempolicy.c 2005-11-17 15:17:32.000000000 -0800
+++ linux-2.6.15-rc1-mm1/mm/mempolicy.c 2005-11-17 16:54:24.000000000 -0800
@@ -975,6 +975,33 @@ static unsigned interleave_nodes(struct
return nid;
}

+/*
+ * Depending on the memory policy provide a node from which to allocate the
+ * next slab entry.
+ */
+unsigned slab_node(struct mempolicy *policy)
+{
+ switch (policy->policy) {
+ case MPOL_INTERLEAVE:
+ return interleave_nodes(policy);
+
+ case MPOL_BIND:
+ /*
+ * Follow bind policy behavior and start allocation at the
+ * first node.
+ */
+ return policy->v.zonelist->zones[0]->zone_pgdat->node_id;
+
+ case MPOL_PREFERRED:
+ if (policy->v.preferred_node >= 0)
+ return policy->v.preferred_node;
+ /* Fall through */
+
+ default:
+ return numa_node_id();
+ }
+}
+
/* Do static interleaving for a VMA with known offset. */
static unsigned offset_il_node(struct mempolicy *pol,
struct vm_area_struct *vma, unsigned long off)
Index: linux-2.6.15-rc1-mm1/include/linux/mempolicy.h
===================================================================
--- linux-2.6.15-rc1-mm1.orig/include/linux/mempolicy.h 2005-11-17 15:17:31.000000000 -0800
+++ linux-2.6.15-rc1-mm1/include/linux/mempolicy.h 2005-11-17 16:54:24.000000000 -0800
@@ -153,6 +153,7 @@ extern void numa_policy_rebind(const nod
extern struct mempolicy default_policy;
extern struct zonelist *huge_zonelist(struct vm_area_struct *vma,
unsigned long addr);
+extern unsigned slab_node(struct mempolicy *policy);

int do_migrate_pages(struct mm_struct *mm,
const nodemask_t *from_nodes, const nodemask_t *to_nodes, int flags);


2005-11-18 03:00:06

by Andi Kleen

Subject: Re: [PATCH] NUMA policies in the slab allocator V2

On Friday 18 November 2005 02:51, Christoph Lameter wrote:
> This patch fixes a regression in 2.6.14 against 2.6.13 that causes an
> imbalance in memory allocation during bootup.

I still think it's wrongly implemented. We shouldn't be slowing down the slab
fast path for this. Also, BTW, if anything your check would need to be
dependent on !in_interrupt(); otherwise the policy of slab allocations
in interrupt context will change randomly based on what the current
process is doing (that's wrong, interrupts should always be local).
But of course that would make the fast path even slower ...

-Andi

2005-11-18 03:38:30

by Christoph Lameter

Subject: Re: [PATCH] NUMA policies in the slab allocator V2

On Fri, 18 Nov 2005, Andi Kleen wrote:

> On Friday 18 November 2005 02:51, Christoph Lameter wrote:
> > This patch fixes a regression in 2.6.14 against 2.6.13 that causes an
> > imbalance in memory allocation during bootup.
>
> I still think it's wrongly implemented. We shouldn't be slowing down the slab
> fast path for this. Also, BTW, if anything your check would need to be
> dependent on !in_interrupt(); otherwise the policy of slab allocations
> in interrupt context will change randomly based on what the current
> process is doing (that's wrong, interrupts should always be local).
> But of course that would make the fast path even slower ...

We can add that check to slab_node() to avoid these issues and it will be
out of the fast path then. I would like to hear about alternatives to
this. You really want to run the useless fastpath? Examine lists etc for
the local node despite the policy telling you to get off node?

Hmm. Is a hugepage ever allocated from interrupt context? We may have the
same issues there.

Index: linux-2.6/mm/mempolicy.c
===================================================================
--- linux-2.6.orig/mm/mempolicy.c 2005-11-17 19:30:10.862617183 -0800
+++ linux-2.6/mm/mempolicy.c 2005-11-17 19:31:47.040578059 -0800
@@ -774,6 +774,9 @@
*/
unsigned slab_node(struct mempolicy *policy)
{
+ if (in_interrupt())
+ return numa_node_id();
+
switch (policy->policy) {
case MPOL_INTERLEAVE:
return interleave_nodes(policy);

2005-11-18 04:32:35

by Andi Kleen

Subject: Re: [PATCH] NUMA policies in the slab allocator V2

On Friday 18 November 2005 04:38, Christoph Lameter wrote:
> You really want to run the useless fastpath? Examine lists etc for
> the local node despite the policy telling you to get off node?

Yes.

> Hmm. Is a hugepage ever allocated from interrupt context?

They aren't.

-Andi

2005-11-18 17:20:23

by Christoph Lameter

Subject: Re: [PATCH] NUMA policies in the slab allocator V2

On Fri, 18 Nov 2005, Andi Kleen wrote:

> On Friday 18 November 2005 04:38, Christoph Lameter wrote:
> > You really want to run the useless fastpath? Examine lists etc for
> > the local node despite the policy telling you to get off node?
>
> Yes.

And this is only the beginning of the troubles with such an approach.

Let's say you run the fast path, find that there are no local slab
entries available anymore, and then consult policy. Policy tells you to
interleave, so you go to a different node and retrieve a slab entry from
that node's slab list (which most likely requires no page allocation at
all). Then this particular request has been fulfilled, but there are still
no local slab entries.

The interleave counter may thus have been incremented without
allocating a page.

The next request finds no local slab entries available again and
repeats the same dysfunctional behavior.

But let's say that there is another process running concurrently that uses
the default policy. It fills up the local slab entry cache. The task running
with interleave policy will now start to ignore its policy and use the slab
entries generated by the other task's page allocations.
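
A small model of the dysfunctional behavior described above, with the policy
consulted only after the fast path misses (all names hypothetical):

#include <stdio.h>

#define NR_NODES 4

/* Free slab objects already sitting on each node's lists; the local
 * node 0 is empty, which is what makes the fast path miss. */
static int node_free[NR_NODES] = { 0, 8, 8, 8 };
static int il_next = 1;		/* interleave cursor over nodes 1..3 */
static int pages_allocated;

/* The cursor advances on every request, but already-cached remote
 * objects satisfy them all, so no page is allocated and nothing is
 * actually spread out by the page allocator. */
static int alloc_after_fastpath_miss(void)
{
	int nid = il_next;

	il_next = (il_next % (NR_NODES - 1)) + 1;	/* next of 1, 2, 3 */
	if (node_free[nid] > 0)
		node_free[nid]--;	/* served from an existing slab page */
	else
		pages_allocated++;	/* only now would a page be allocated */
	return nid;
}

int main(void)
{
	int i;

	for (i = 0; i < 6; i++)
		printf("request %d -> node %d\n", i,
		       alloc_after_fastpath_miss());
	printf("cursor advanced 6 times, pages allocated: %d\n",
	       pages_allocated);
	return 0;
}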

Have a look at the slab allocator. I cannot imagine how you could make
the approach work.

> > Hmm. Is a hugepage ever allocated from interrupt context?
>
> They aren't.

Let's hope that it stays that way ...