2006-01-25 23:51:39

by Matthew Dobson

Subject: [patch 3/9] mempool - Make mempools NUMA aware

plain text document attachment (critical_mempools)
Add NUMA-awareness to the mempool code. This involves several changes:

1) Update mempool_alloc_t to include a node_id argument.
2) Change mempool_create_node() to pass its node_id argument on to the updated
pool->alloc() function in an attempt to allocate the memory pool elements on
the requested node.
3) Change mempool_create() to be a static inline calling mempool_create_node().
4) Add mempool_resize_node() to the mempool API. This function does the same
thing as the old mempool_resize() function, but attempts to allocate both
the internal storage array and the individual memory pool elements on the
specified node, just like mempool_create_node() does.
5) Change mempool_resize() to be a static inline calling mempool_resize_node().
6) Add mempool_alloc_node() to the mempool API. This function allows callers
to request memory from a particular node. This request is only guaranteed
to be fulfilled with memory from the specified node if the mempool was
originally created with mempool_create_node(). If not, this is only a hint
to the allocator, and the request may actually be fulfilled by memory from
another node, because we don't know from which node the memory pool elements
were allocated. (A short usage sketch of the new calls follows this list.)
7) Change mempool_alloc() to be a static inline calling mempool_alloc_node().
8) Update the two "builtin" mempool allocators, mempool_alloc_slab &
mempool_alloc_pages, to use the new mempool_alloc_t by adding a node_id argument
and changing them to call kmem_cache_alloc_node() & alloc_pages_node(),
respectively.
9) Cleanup some minor whitespace and comments.
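
For anyone new to the interface, a minimal usage sketch of the node-aware
calls (not part of the patch; the slab cache "my_cachep" and the node number
are made up for illustration):

	mempool_t *pool;
	void *element;

	/* Reserve 16 elements from an existing slab cache, keeping the pool's
	 * control structure, storage array and elements on node 1. */
	pool = mempool_create_node(16, mempool_alloc_slab, mempool_free_slab,
				   my_cachep, 1);
	if (!pool)
		return -ENOMEM;

	/* Prefer node 1; this is only a hint unless the pool itself was
	 * created on that node with mempool_create_node(). */
	element = mempool_alloc_node(pool, GFP_KERNEL, 1);
	/* ... use element ... */
	mempool_free(element, pool);

	/* A node_id of -1 keeps the old, node-agnostic behaviour. */
	mempool_resize_node(pool, 32, GFP_KERNEL, -1);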

Signed-off-by: Matthew Dobson <[email protected]>

include/linux/mempool.h |   31 ++++++++++++++++++-------
mm/mempool.c            |   58 ++++++++++++++++++++++++++----------------------
2 files changed, 54 insertions(+), 35 deletions(-)

Index: linux-2.6.16-rc1+critical_mempools/mm/mempool.c
===================================================================
--- linux-2.6.16-rc1+critical_mempools.orig/mm/mempool.c
+++ linux-2.6.16-rc1+critical_mempools/mm/mempool.c
@@ -38,12 +38,14 @@ static void free_pool(mempool_t *pool)
}

/**
- * mempool_create - create a memory pool
+ * mempool_create_node - create a memory pool
* @min_nr: the minimum number of elements guaranteed to be
* allocated for this pool.
* @alloc_fn: user-defined element-allocation function.
* @free_fn: user-defined element-freeing function.
* @pool_data: optional private data available to the user-defined functions.
+ * @node_id: node to allocate this memory pool's control structure, storage
+ * array and all of its reserved elements on.
*
* this function creates and allocates a guaranteed size, preallocated
* memory pool. The pool can be used from the mempool_alloc and mempool_free
@@ -51,15 +53,9 @@ static void free_pool(mempool_t *pool)
* functions might sleep - as long as the mempool_alloc function is not called
* from IRQ contexts.
*/
-mempool_t *mempool_create(int min_nr, mempool_alloc_t *alloc_fn,
- mempool_free_t *free_fn, void *pool_data)
-{
- return mempool_create_node(min_nr,alloc_fn,free_fn, pool_data,-1);
-}
-EXPORT_SYMBOL(mempool_create);
-
mempool_t *mempool_create_node(int min_nr, mempool_alloc_t *alloc_fn,
- mempool_free_t *free_fn, void *pool_data, int node_id)
+ mempool_free_t *free_fn, void *pool_data,
+ int node_id)
{
mempool_t *pool;
pool = kmalloc_node(sizeof(*pool), GFP_KERNEL, node_id);
@@ -85,7 +81,7 @@ mempool_t *mempool_create_node(int min_n
while (pool->curr_nr < pool->min_nr) {
void *element;

- element = pool->alloc(GFP_KERNEL, pool->pool_data);
+ element = pool->alloc(GFP_KERNEL, node_id, pool->pool_data);
if (unlikely(!element)) {
free_pool(pool);
return NULL;
@@ -97,12 +93,14 @@ mempool_t *mempool_create_node(int min_n
EXPORT_SYMBOL(mempool_create_node);

/**
- * mempool_resize - resize an existing memory pool
+ * mempool_resize_node - resize an existing memory pool
* @pool: pointer to the memory pool which was allocated via
* mempool_create().
* @new_min_nr: the new minimum number of elements guaranteed to be
* allocated for this pool.
* @gfp_mask: the usual allocation bitmask.
+ * @node_id: node to allocate this memory pool's new storage array
+ * and all of its reserved elements on.
*
* This function shrinks/grows the pool. In the case of growing,
* it cannot be guaranteed that the pool will be grown to the new
@@ -112,7 +110,8 @@ EXPORT_SYMBOL(mempool_create_node);
* while this function is running. mempool_alloc() & mempool_free()
* might be called (eg. from IRQ contexts) while this function executes.
*/
-int mempool_resize(mempool_t *pool, int new_min_nr, gfp_t gfp_mask)
+int mempool_resize_node(mempool_t *pool, int new_min_nr, gfp_t gfp_mask,
+ int node_id)
{
void *element;
void **new_elements;
@@ -134,7 +133,8 @@ int mempool_resize(mempool_t *pool, int
spin_unlock_irqrestore(&pool->lock, flags);

/* Grow the pool */
- new_elements = kmalloc(new_min_nr * sizeof(*new_elements), gfp_mask);
+ new_elements = kmalloc_node(new_min_nr * sizeof(*new_elements),
+ gfp_mask, node_id);
if (!new_elements)
return -ENOMEM;

@@ -153,7 +153,7 @@ int mempool_resize(mempool_t *pool, int

while (pool->curr_nr < pool->min_nr) {
spin_unlock_irqrestore(&pool->lock, flags);
- element = pool->alloc(gfp_mask, pool->pool_data);
+ element = pool->alloc(gfp_mask, node_id, pool->pool_data);
if (!element)
goto out;
spin_lock_irqsave(&pool->lock, flags);
@@ -170,14 +170,14 @@ out_unlock:
out:
return 0;
}
-EXPORT_SYMBOL(mempool_resize);
+EXPORT_SYMBOL(mempool_resize_node);

/**
* mempool_destroy - deallocate a memory pool
* @pool: pointer to the memory pool which was allocated via
* mempool_create().
*
- * this function only sleeps if the free_fn() function sleeps. The caller
+ * this function only sleeps if the ->free() function sleeps. The caller
* has to guarantee that all elements have been returned to the pool (ie:
* freed) prior to calling mempool_destroy().
*/
@@ -190,17 +190,22 @@ void mempool_destroy(mempool_t *pool)
EXPORT_SYMBOL(mempool_destroy);

/**
- * mempool_alloc - allocate an element from a specific memory pool
+ * mempool_alloc_node - allocate an element from a specific memory pool
* @pool: pointer to the memory pool which was allocated via
* mempool_create().
* @gfp_mask: the usual allocation bitmask.
+ * @node_id: node to _attempt_ to allocate from. This request is only
+ * guaranteed to be fulfilled with memory from the specified node if
+ * the mempool was originally created with mempool_create_node().
+ * If not, this is only a hint to the allocator, and the request
+ * may actually be fulfilled by memory from another node.
*
- * this function only sleeps if the alloc_fn function sleeps or
+ * this function only sleeps if the ->alloc() function sleeps or
* returns NULL. Note that due to preallocation, this function
* *never* fails when called from process contexts. (it might
* fail if called from an IRQ context.)
*/
-void * mempool_alloc(mempool_t *pool, gfp_t gfp_mask)
+void *mempool_alloc_node(mempool_t *pool, gfp_t gfp_mask, int node_id)
{
void *element;
unsigned long flags;
@@ -217,7 +222,7 @@ void * mempool_alloc(mempool_t *pool, gf

repeat_alloc:

- element = pool->alloc(gfp_temp, pool->pool_data);
+ element = pool->alloc(gfp_temp, node_id, pool->pool_data);
if (likely(element != NULL))
return element;

@@ -244,7 +249,7 @@ repeat_alloc:

goto repeat_alloc;
}
-EXPORT_SYMBOL(mempool_alloc);
+EXPORT_SYMBOL(mempool_alloc_node);

/**
* mempool_free - return an element to the pool.
@@ -252,7 +257,7 @@ EXPORT_SYMBOL(mempool_alloc);
* @pool: pointer to the memory pool which was allocated via
* mempool_create().
*
- * this function only sleeps if the free_fn() function sleeps.
+ * this function only sleeps if the ->free() function sleeps.
*/
void mempool_free(void *element, mempool_t *pool)
{
@@ -276,10 +281,10 @@ EXPORT_SYMBOL(mempool_free);
/*
* A commonly used alloc and free fn.
*/
-void *mempool_alloc_slab(gfp_t gfp_mask, void *pool_data)
+void *mempool_alloc_slab(gfp_t gfp_mask, int node_id, void *pool_data)
{
kmem_cache_t *mem = (kmem_cache_t *) pool_data;
- return kmem_cache_alloc(mem, gfp_mask);
+ return kmem_cache_alloc_node(mem, gfp_mask, node_id);
}
EXPORT_SYMBOL(mempool_alloc_slab);

@@ -293,10 +298,11 @@ EXPORT_SYMBOL(mempool_free_slab);
/*
* A simple mempool-backed page allocator
*/
-void *mempool_alloc_pages(gfp_t gfp_mask, void *pool_data)
+void *mempool_alloc_pages(gfp_t gfp_mask, int node_id, void *pool_data)
{
int order = (int)pool_data;
- return alloc_pages(gfp_mask, order);
+ return alloc_pages_node(node_id >= 0 ? node_id : numa_node_id(),
+ gfp_mask, order);
}
EXPORT_SYMBOL(mempool_alloc_pages);

Index: linux-2.6.16-rc1+critical_mempools/include/linux/mempool.h
===================================================================
--- linux-2.6.16-rc1+critical_mempools.orig/include/linux/mempool.h
+++ linux-2.6.16-rc1+critical_mempools/include/linux/mempool.h
@@ -6,7 +6,7 @@

#include <linux/wait.h>

-typedef void * (mempool_alloc_t)(gfp_t gfp_mask, void *pool_data);
+typedef void * (mempool_alloc_t)(gfp_t gfp_mask, int node_id, void *pool_data);
typedef void (mempool_free_t)(void *element, void *pool_data);

typedef struct mempool_s {
@@ -21,27 +21,40 @@ typedef struct mempool_s {
wait_queue_head_t wait;
} mempool_t;

-extern mempool_t *mempool_create(int min_nr, mempool_alloc_t *alloc_fn,
- mempool_free_t *free_fn, void *pool_data);
extern mempool_t *mempool_create_node(int min_nr, mempool_alloc_t *alloc_fn,
- mempool_free_t *free_fn, void *pool_data, int nid);
-
-extern int mempool_resize(mempool_t *pool, int new_min_nr, gfp_t gfp_mask);
+ mempool_free_t *free_fn, void *pool_data, int node_id);
+static inline mempool_t *mempool_create(int min_nr, mempool_alloc_t *alloc_fn,
+ mempool_free_t *free_fn, void *pool_data)
+{
+ return mempool_create_node(min_nr, alloc_fn, free_fn, pool_data, -1);
+}
extern void mempool_destroy(mempool_t *pool);
-extern void * mempool_alloc(mempool_t *pool, gfp_t gfp_mask);
+
+extern int mempool_resize_node(mempool_t *pool, int new_min_nr, gfp_t gfp_mask,
+ int node_id);
+static inline int mempool_resize(mempool_t *pool, int new_min_nr, gfp_t gfp_mask)
+{
+ return mempool_resize_node(pool, new_min_nr, gfp_mask, -1);
+}
+
+extern void *mempool_alloc_node(mempool_t *pool, gfp_t gfp_mask, int node_id);
+static inline void *mempool_alloc(mempool_t *pool, gfp_t gfp_mask)
+{
+ return mempool_alloc_node(pool, gfp_mask, -1);
+}
extern void mempool_free(void *element, mempool_t *pool);

/*
* A mempool_alloc_t and mempool_free_t that get the memory from
* a slab that is passed in through pool_data.
*/
-void *mempool_alloc_slab(gfp_t gfp_mask, void *pool_data);
+void *mempool_alloc_slab(gfp_t gfp_mask, int node_id, void *pool_data);
void mempool_free_slab(void *element, void *pool_data);

/*
* A mempool_alloc_t and mempool_free_t for a simple page allocator
*/
-void *mempool_alloc_pages(gfp_t gfp_mask, void *pool_data);
+void *mempool_alloc_pages(gfp_t gfp_mask, int node_id, void *pool_data);
void mempool_free_pages(void *element, void *pool_data);

#endif /* _LINUX_MEMPOOL_H */

--


2006-01-26 17:55:06

by Christoph Lameter

Subject: Re: [patch 3/9] mempool - Make mempools NUMA aware

On Wed, 25 Jan 2006, Matthew Dobson wrote:

> plain text document attachment (critical_mempools)
> Add NUMA-awareness to the mempool code. This involves several changes:

I am not quite sure why you would need numa awareness in an emergency
memory pool. Presumably the effectiveness of the accesses does not matter.
You only want to be sure that there is some memory available, right?

You do not need this....

2006-01-26 22:57:14

by Matthew Dobson

Subject: Re: [patch 3/9] mempool - Make mempools NUMA aware

Christoph Lameter wrote:
> On Wed, 25 Jan 2006, Matthew Dobson wrote:
>
>
>>plain text document attachment (critical_mempools)
>>Add NUMA-awareness to the mempool code. This involves several changes:
>
>
> I am not quite sure why you would need numa awareness in an emergency
> memory pool. Presumably the effectiveness of the accesses do not matter.
> You only want to be sure that there is some memory available right?

Not all requests for memory from a specific node are performance
enhancements; some are for correctness. With large machines, especially as
those large machines' workloads are more and more likely to be partitioned
with something like cpusets, you want to be able to specify where you want
your reserve pool to come from. As it was not incredibly difficult to
offer this option, I added it. I was unwilling to completely ignore
callers' NUMA requests, assuming that they are all purely performance
motivated.


> You do not need this....

I do not agree...


Thanks!

-Matt

2006-01-26 23:15:52

by Christoph Lameter

Subject: Re: [patch 3/9] mempool - Make mempools NUMA aware

On Thu, 26 Jan 2006, Matthew Dobson wrote:

> Not all requests for memory from a specific node are performance
> enhancements, some are for correctness. With large machines, especially as

alloc_pages_node and friends do not guarantee allocation on that specific
node. That argument for "correctness" is bogus.

> > You do not need this....
> I do not agree...

There is no way that you would need this patch.


2006-01-26 23:24:35

by Matthew Dobson

Subject: Re: [patch 3/9] mempool - Make mempools NUMA aware

Christoph Lameter wrote:
> On Thu, 26 Jan 2006, Matthew Dobson wrote:
>
>
>>Not all requests for memory from a specific node are performance
>>enhancements, some are for correctness. With large machines, especially as
>
>
> alloc_pages_node and friends do not guarantee allocation on that specific
> node. That argument for "correctness" is bogus.

alloc_pages_node() does not guarantee allocation on a specific node, but
calling __alloc_pages() with a specific nodelist would.


>>>You do not need this....
>>
>>I do not agree...
>
>
> There is no way that you would need this patch.

My goal was to not change the behavior of the slab allocator when inserting
a mempool-backed allocator "under" it. Without support for at least
*requesting* allocations from a specific node when allocating from a
mempool, this would change how the slab allocator works. That would be
bad. The slab allocator now does not guarantee that, for example, a
kmalloc_node() request is satisfied by memory from the requested node, but
it does at least TRY. Without adding mempool_alloc_node(), I would
never be able to even TRY to satisfy a mempool-backed kmalloc_node()
request from the correct node. I believe that would constitute an
unacceptable breakage from normal, documented behavior. So, I *do* need
this patch.

-Matt

2006-01-26 23:30:06

by Christoph Lameter

Subject: Re: [patch 3/9] mempool - Make mempools NUMA aware

On Thu, 26 Jan 2006, Matthew Dobson wrote:

> alloc_pages_node() does not guarantee allocation on a specific node, but
> calling __alloc_pages() with a specific nodelist would.

True, but you have emergency *_node functions that do not take nodelists.

> > There is no way that you would need this patch.
>
> My goal was to not change the behavior of the slab allocator when inserting
> a mempool-backed allocator "under" it. Without support for at least
> *requesting* allocations from a specific node when allocating from a
> mempool, this would change how the slab allocator works. That would be
> bad. The slab allocator now does not guarantee that, for example, a
> kmalloc_node() request is satisfied by memory from the requested node, but
> it does at least TRY. Without adding mempool_alloc_node() then I would
> never be able to even TRY to satisfy a mempool-backed kmalloc_node()
> request from the correct node. I believe that would constitute an
> unacceptable breakage from normal, documented behavior. So, I *do* need
> this patch.

If you get to the emergency lists then you are already in a tight memory
situation. In that situation it does not make sense to worry about the
node number the memory is coming from. kmalloc_node is just a kmalloc with
an indication of a preference of where the memory should be coming from.
The node locality only influences performance and not correctness.

There is no change to the way the slab allocator works. Just drop the
*_node variants.

2006-01-27 00:15:53

by Matthew Dobson

Subject: Re: [patch 3/9] mempool - Make mempools NUMA aware

Christoph Lameter wrote:
> On Thu, 26 Jan 2006, Matthew Dobson wrote:
>
>
>>alloc_pages_node() does not guarantee allocation on a specific node, but
>>calling __alloc_pages() with a specific nodelist would.
>
>
> True, but you have emergency *_node functions that do not take nodelists.

Agreed.


>>>There is no way that you would need this patch.
>>
>>My goal was to not change the behavior of the slab allocator when inserting
>>a mempool-backed allocator "under" it. Without support for at least
>>*requesting* allocations from a specific node when allocating from a
>>mempool, this would change how the slab allocator works. That would be
>>bad. The slab allocator now does not guarantee that, for example, a
>>kmalloc_node() request is satisfied by memory from the requested node, but
>>it does at least TRY. Without adding mempool_alloc_node() then I would
>>never be able to even TRY to satisfy a mempool-backed kmalloc_node()
>>request from the correct node. I believe that would constitute an
>>unacceptable breakage from normal, documented behavior. So, I *do* need
>>this patch.
>
>
> If you get to the emergency lists then you are already in a tight memory
> situation. In that situation it does not make sense to worry about the
> node number the memory is coming from. kmalloc_node is just a kmalloc with
> an indication of a preference of where the memory should be coming from.
> The node locality only influences performance and not correctness.
>
> There is no change to the way the slab allocator works. Just drop the
> *_node variants.

If you look more carefully at how the emergency mempools are used, I think
you'll better understand why I did this:

Look at patch 9/9, specifically the changes to kmem_getpages():

- page = alloc_pages_node(nodeid, flags, cachep->gfporder);
+ /*
+ * If this allocation request isn't backed by a memory pool, or if that
+ * memory pool's gfporder is not the same as the cache's gfporder, fall
+ * back to alloc_pages_node().
+ */
+ if (!pool || cachep->gfporder != (int)pool->pool_data)
+ page = alloc_pages_node(nodeid, flags, cachep->gfporder);
+ else
+ page = mempool_alloc_node(pool, flags, nodeid);

Allocations backed by a mempool must always be allocated via
mempool_alloc() (or mempool_alloc_node() in this case). What that means
is, without a mempool_alloc_node() function, NO mempool backed allocations
will be able to request a specific node, even when the system has PLENTY of
memory! This, IMO, is unacceptable. Adding more NUMA-awareness to the
mempool system allows us to keep the same slab behavior as before, as well
as leaving us free to ignore the node requests when memory is low.
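
To make that concrete, here is how a hypothetical node-aware caller looks
with and without the new entry point (these call sites are illustrative,
not lifted from the patches):

	/* Plain page allocator: the node preference is passed along. */
	page = alloc_pages_node(nodeid, flags, order);

	/* Backed by a mempool, with only mempool_alloc(): nodeid is simply
	 * dropped, even when memory is plentiful. */
	page = mempool_alloc(pool, flags);

	/* Backed by a mempool, with mempool_alloc_node(): the hint survives,
	 * and can still be ignored when memory is tight. */
	page = mempool_alloc_node(pool, flags, nodeid);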

-Matt

2006-01-27 00:22:03

by Christoph Lameter

Subject: Re: [patch 3/9] mempool - Make mempools NUMA aware

On Thu, 26 Jan 2006, Matthew Dobson wrote:

> Allocations backed by a mempool must always be allocated via
> mempool_alloc() (or mempool_alloc_node() in this case). What that means
> is, without a mempool_alloc_node() function, NO mempool backed allocations
> will be able to request a specific node, even when the system has PLENTY of
> memory! This, IMO, is unacceptable. Adding more NUMA-awareness to the
> mempool system allows us to keep the same slab behavior as before, as well
> as leaving us free to ignore the node requests when memory is low.

Ok. That makes sense. I thought the mempool_xxx functions were only for
emergencies. But nevertheless you still duplicate all memory allocation
functions. I already was a bit concerned when I added the _node stuff.

What may be better is to add some kind of "allocation policy" to an
allocation. That allocation policy could require the allocation on a node,
distribution over a series of nodes, require allocation on a particular
node, or allow the use of emergency pools etc.

Maybe unify all the different page allocations to one call and do the
same with the slab allocator.

2006-01-27 00:28:03

by Benjamin LaHaise

Subject: Re: [patch 3/9] mempool - Make mempools NUMA aware

On Wed, Jan 25, 2006 at 03:51:33PM -0800, Matthew Dobson wrote:
> plain text document attachment (critical_mempools)
> Add NUMA-awareness to the mempool code. This involves several changes:

This is horribly bloated. Mempools should really just be a flag and
reserve count on a slab, as then the code would not be in hot paths.

-ben

2006-01-27 00:34:33

by Matthew Dobson

Subject: Re: [patch 3/9] mempool - Make mempools NUMA aware

Christoph Lameter wrote:
> On Thu, 26 Jan 2006, Matthew Dobson wrote:
>
>
>>Allocations backed by a mempool must always be allocated via
>>mempool_alloc() (or mempool_alloc_node() in this case). What that means
>>is, without a mempool_alloc_node() function, NO mempool backed allocations
>>will be able to request a specific node, even when the system has PLENTY of
>>memory! This, IMO, is unacceptable. Adding more NUMA-awareness to the
>>mempool system allows us to keep the same slab behavior as before, as well
>>as leaving us free to ignore the node requests when memory is low.
>
>
> Ok. That makes sense. I thought the mempool_xxx functions were only for
> emergencies. But nevertheless you still duplicate all memory allocation
> functions. I already was a bit concerned when I added the _node stuff.

I'm glad we're on the same page now. :) And yes, adding four "duplicate"
*_mempool allocators was not my first choice, but I couldn't easily see a
better way.


> What may be better is to add some kind of "allocation policy" to an
> allocation. That allocation policy could require the allocation on a node,
> distribution over a series of nodes, require allocation on a particular
> node, or allow the use of emergency pools etc.
>
> Maybe unify all the different page allocations to one call and do the
> same with the slab allocator.

Hmmm... I kinda like that. Some sort of

struct allocation_policy {
	enum policy_type type;
	nodemask_t nodes;
	mempool_t *critical_pool;
};

that could be passed to __alloc_pages()?
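
Purely as an illustration (the POLICY_CRITICAL constant, the net_reserve_pool
pool and the __alloc_pages_policy() entry point below are invented for this
sketch, not existing kernel interfaces), one such policy could be allocated
statically and passed around by pointer:

	static struct allocation_policy net_emergency_policy = {
		.type		= POLICY_CRITICAL,
		.nodes		= NODE_MASK_ALL,
		.critical_pool	= &net_reserve_pool,
	};

	page = __alloc_pages_policy(gfp_mask, order, &net_emergency_policy);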

That seems a bit beyond the scope of what I'd hoped for this patch series,
but if an approach like this is believed to be generally useful, it's
something I'm more than willing to work on...

-Matt

2006-01-27 00:36:02

by Matthew Dobson

Subject: Re: [patch 3/9] mempool - Make mempools NUMA aware

Benjamin LaHaise wrote:
> On Wed, Jan 25, 2006 at 03:51:33PM -0800, Matthew Dobson wrote:
>
>>plain text document attachment (critical_mempools)
>>Add NUMA-awareness to the mempool code. This involves several changes:
>
>
> This is horribly bloated. Mempools should really just be a flag and
> reserve count on a slab, as then the code would not be in hot paths.
>
> -ben

Ummm... ok? But with only a simple flag, how do you know *which* mempool
you're trying to use? What if you want to use a mempool for a non-slab
allocation?

-Matt

2006-01-27 00:39:50

by Christoph Lameter

Subject: Re: [patch 3/9] mempool - Make mempools NUMA aware

On Thu, 26 Jan 2006, Matthew Dobson wrote:

> That seems a bit beyond the scope of what I'd hoped for this patch series,
> but if an approach like this is believed to be generally useful, it's
> something I'm more than willing to work on...

We need this for other issues as well. f.e. to establish memory allocation
policies for the page cache, tmpfs and various other needs. Look at
mempolicy.h which defines a subset of what we need. Currently there is no
way to specify a policy when invoking the page allocator or slab
allocator. The policy is implicitly fetched from the current task structure
which is not optimal.

2006-01-27 00:45:05

by Matthew Dobson

Subject: Re: [patch 3/9] mempool - Make mempools NUMA aware

Christoph Lameter wrote:
> On Thu, 26 Jan 2006, Matthew Dobson wrote:
>
>
>>That seems a bit beyond the scope of what I'd hoped for this patch series,
>>but if an approach like this is believed to be generally useful, it's
>>something I'm more than willing to work on...
>
>
> We need this for other issues as well. f.e. to establish memory allocation
> policies for the page cache, tmpfs and various other needs. Look at
> mempolicy.h which defines a subset of what we need. Currently there is no
> way to specify a policy when invoking the page allocator or slab
> allocator. The policy is implicitly fetched from the current task structure
> which is not optimal.

I agree that the current, task-based policies are suboptimal. Having to
allocate and fill in even a small structure for each allocation is going to
be a tough sell, though. I suppose most allocations could get by with a
small handful of static generic "policy structures"... This seems like it
will turn into a significant rework of all the kernel's allocation routines,
no small task. Certainly not something that I'd even start without
response from some other major players in the VM area... Anyone?


-Matt

2006-01-27 00:57:17

by Christoph Lameter

Subject: Re: [patch 3/9] mempool - Make mempools NUMA aware

On Thu, 26 Jan 2006, Matthew Dobson wrote:

> > We need this for other issues as well. f.e. to establish memory allocation
> > policies for the page cache, tmpfs and various other needs. Look at
> > mempolicy.h which defines a subset of what we need. Currently there is no
> > way to specify a policy when invoking the page allocator or slab
> > allocator. The policy is implicitly fetched from the current task structure
> > which is not optimal.
>
> I agree that the current, task-based policies are suboptimal. Having to
> allocate and fill in even a small structure for each allocation is going to
> be a tough sell, though. I suppose most allocations could get by with a
> small handful of static generic "policy structures"... This seems like it
> will turn into a significant rework of all the kernel's allocation routines,
> no small task. Certainly not something that I'd even start without
> response from some other major players in the VM area... Anyone?

No, you would have a set of policies and only pass a pointer to the
policies to the allocator. I.e. have one emergency policy allocated
somewhere in the IP stack and then pass that to the allocator.

I guess that Andi Kleen and Paul Jackson would likely be interested in
such an endeavor.

2006-01-27 01:13:23

by Andi Kleen

Subject: Re: [patch 3/9] mempool - Make mempools NUMA aware

On Friday 27 January 2006 01:57, Christoph Lameter wrote:
> On Thu, 26 Jan 2006, Matthew Dobson wrote:
>
> > > We need this for other issues as well. f.e. to establish memory allocation
> > > policies for the page cache, tmpfs and various other needs. Look at
> > > mempolicy.h which defines a subset of what we need. Currently there is no
> > > way to specify a policy when invoking the page allocator or slab
> > > allocator. The policy is implicitly fetched from the current task structure
> > > which is not optimal.
> >
> > I agree that the current, task-based policies are suboptimal. Having to
> > allocate and fill in even a small structure for each allocation is going to
> > be a tough sell, though. I suppose most allocations could get by with a
> > small handful of static generic "policy structures"... This seems like it
> > will turn into a significant rework of all the kernel's allocation routines,
> > no small task. Certainly not something that I'd even start without
> > response from some other major players in the VM area... Anyone?
>
> No you would have a set of policies and only pass a pointer to the
> policies to the allocator. I.e. have one emergency policy allocated
> somewhere in the IP stack and then pass that to the allocator.

What would that be needed for?

My goal for mempolicies was always to keep them as simple as possible
and keep the fast paths fast; there has to be a very good reason to add any
new complexity.

-Andi

2006-01-27 03:27:25

by Benjamin LaHaise

Subject: Re: [patch 3/9] mempool - Make mempools NUMA aware

On Thu, Jan 26, 2006 at 04:35:56PM -0800, Matthew Dobson wrote:
> Ummm... ok? But with only a simple flag, how do you know *which* mempool
> you're trying to use? What if you want to use a mempool for a non-slab
> allocation?

Are there any? A quick poke around has only found a couple of places
that use kzalloc(), which is still quite effectively a slab allocation.
There seems to be just one page user, the dm-crypt driver, which could
be served by a reservation scheme.

-ben
--
"Ladies and gentlemen, I'm sorry to interrupt, but the police are here
and they've asked us to stop the party." Don't Email: <[email protected]>.

2006-01-27 10:51:54

by Paul Jackson

Subject: Re: [patch 3/9] mempool - Make mempools NUMA aware

Matthew wrote:
> I'm glad we're on the same page now. :) And yes, adding four "duplicate"
> *_mempool allocators was not my first choice, but I couldn't easily see a
> better way.

I hope the following comments aren't too far off target.

I too am inclined to prefer the __GFP_CRITICAL approach over this.
That or Andrea's suggestion, which except for a free hook, was entirely
outside of the page_alloc.c code paths. Or Alan's suggested revival
of the old code to drop non-critical network patches in duress.

I am tempted to think you've taken an approach that raised some
substantial looking issues:

* how to tell the system when to use the emergency pool
* this doesn't really solve the problem (network can still starve)
* it wastes memory most of the time
* it doesn't really improve on GFP_ATOMIC

and just added another substantial looking issue:

* it entwines another thread of complexity and performance costs
into the important memory allocation code path.

Progress in the wrong direction ;).

> With large machines, especially as
> those large machines' workloads are more and more likely to be partitioned
> with something like cpusets, you want to be able to specify where you want
> your reserve pool to come from.

Cpusets is about performance, not correctness. Anytime I get cornered
in the cpuset code, I prefer violating the cpuset containment over
serious system failure.

--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <[email protected]> 1.925.600.0401

2006-01-28 01:00:24

by Matthew Dobson

Subject: Re: [patch 3/9] mempool - Make mempools NUMA aware

Paul Jackson wrote:
> Matthew wrote:
>
>>I'm glad we're on the same page now. :) And yes, adding four "duplicate"
>>*_mempool allocators was not my first choice, but I couldn't easily see a
>>better way.
>
>
> I hope the following comments aren't too far off target.
>
> I too am inclined to prefer the __GFP_CRITICAL approach over this.

OK. Chalk one more up for that solution...


> That or Andrea's suggestion, which except for a free hook, was entirely
> outside of the page_alloc.c code paths.

This is supposed to be an implementation of Andrea's suggestion. There are
no hooks in ANY page_alloc.c code paths. These patches touch mempool code
and some slab code, but not any page allocator code.


> Or Alan's suggested revival
> of the old code to drop non-critical network packets under duress.

Dropping non-critical packets is still in our plan, but I don't think that
is a FULL solution. As we mentioned before on that topic, you can't tell
if a packet is critical until AFTER you receive it, by which point it has
already had an skbuff (hopefully) allocated for it. If your network
traffic is coming in faster than you can receive, examine, and drop
non-critical packets you're hosed. I still think some sort of reserve pool
is necessary to give the networking stack a little breathing room when
under both memory pressure and network load.


> I am tempted to think you've taken an approach that raised some
> substantial looking issues:
>
> * how to tell the system when to use the emergency pool

We've dropped the whole "in_emergency" thing. The system uses the
emergency pool when the normal pool (ie: the buddy allocator) is out of pages.

> * this doesn't really solve the problem (network can still starve)

Only if the pool is not large enough. One can argue that sizing the pool
appropriately is impossible (theoretical incoming traffic over a GigE card
or two for a minute or two is extremely large), but then I guess we
shouldn't even try to fix the problem...?

> * it wastes memory most of the time

True. Any "real" reserve system will suffer from that problem. Ben
LaHaise suggested a reserve system that allows the reserve pages to be used
for trivially reclaimable allocation while not in active use. An
interesting idea. Regardless, the Linux VM sorta already wastes memory by
keeping min_free_kbytes around, no?

> * it doesn't really improve on GFP_ATOMIC

I disagree. It improves on GFP_ATOMIC by giving it a second chance. If
you've got a GFP_ATOMIC allocation that is particularly critical, using a
mempool to back it means that you can keep going for a while when the rest
of the system OOMs/goes into SWAP hell/etc.

> and just added another substantial looking issue:
>
> * it entwines another thread of complexity and performance costs
> into the important memory allocation code path.

I can't say that it doesn't add any complexity into an important memory
allocation path, but I don't think it is a significant amount of
complexity. It is just a pointer check in kmem_getpages()...


>>With large machines, especially as
>>those large machines' workloads are more and more likely to be partitioned
>>with something like cpusets, you want to be able to specify where you want
>>your reserve pool to come from.
>
>
> Cpusets is about performance, not correctness. Anytime I get cornered
> in the cpuset code, I prefer violating the cpuset containment, over
> serious system failure.

Fair enough. But if we can keep the same baseline performance and add this
new feature, I'd like to do that. Doing our best to allocate on a
particular node when requested to isn't too much to ask.

-Matt

2006-01-28 01:09:00

by Matthew Dobson

Subject: Re: [patch 3/9] mempool - Make mempools NUMA aware

Benjamin LaHaise wrote:
> On Thu, Jan 26, 2006 at 04:35:56PM -0800, Matthew Dobson wrote:
>
>>Ummm... ok? But with only a simple flag, how do you know *which* mempool
>>you're trying to use? What if you want to use a mempool for a non-slab
>>allocation?
>
>
> Are there any? A quick poke around has only found a couple of places
> that use kzalloc(), which is still quite effectively a slab allocation.
> There seems to be just one page user, the dm-crypt driver, which could
> be served by a reservation scheme.

A couple. If Andrew is willing to pick up the mempool patches I posted an
hour or so ago, there will be only 4 mempool users that aren't using a
common mempool allocator. Regardless of whether that happens, there are
only a few users that aren't slab based:
1) mm/highmem.c - page based allocator
2) drivers/scsi/scsi_transport_iscsi.c - calls alloc_skb(), which does
eventually end up making a slab allocation
3) drivers/md/raid1.c & raid10.c - easily the biggest mempool_alloc
functions in the kernel. Non-trivial.
4) drivers/md/dm-crypt.c - the driver you mentioned, also using a page
allocator

So we could possibly get away with a reservation scheme, but a couple users
would be non-trivial to fixup.

-Matt

2006-01-28 05:08:26

by Paul Jackson

Subject: Re: [patch 3/9] mempool - Make mempools NUMA aware

Matthew wrote:
> > I too am inclined to prefer the __GFP_CRITICAL approach over this.
>
> OK. Chalk one more up for that solution...

I don't think my vote should count for much. See below.

> This is supposed to be an implementation of Andrea's suggestion. There are
> no hooks in ANY page_alloc.c code paths. These patches touch mempool code
> and some slab code, but not any page allocator code.

Yeah - you're right. I misread your patch set. Sorry
for wasting your time.

--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <[email protected]> 1.925.600.0401

2006-01-28 08:17:06

by Pavel Machek

Subject: Re: [patch 3/9] mempool - Make mempools NUMA aware

Hi!

I'll probably regret getting into this discussion, but:

> > Or Alan's suggested revival
> > of the old code to drop non-critical network packets under duress.
>
> Dropping non-critical packets is still in our plan, but I don't think that
> is a FULL solution. As we mentioned before on that topic, you can't tell
> if a packet is critical until AFTER you receive it, by which point it has
> already had an skbuff (hopefully) allocated for it. If your network
> traffic is coming in faster than you can receive, examine, and drop
> non-critical packets you're hosed.

Why? You run out of atomic memory, start dropping the packets before
they even enter the kernel memory, and process backlog in the
meantime. Other hosts realize you are dropping packets and slow down,
or, if they are malicious, you just end up consistently dropping 70%
of packets. But that's okay.

> I still think some sort of reserve pool
> is necessary to give the networking stack a little breathing room when
> under both memory pressure and network load.

"Lets throw some memory there and hope it does some good?" Eek? What
about auditing/fixing the networking stack, instead?

> > * this doesn't really solve the problem (network can still starve)
>
> Only if the pool is not large enough. One can argue that sizing the pool
> appropriately is impossible (theoretical incoming traffic over a GigE card
> or two for a minute or two is extremely large), but then I guess we
> shouldn't even try to fix the problem...?

And what problem are you trying to fix, anyway? Last time I asked I
got a reply about some strange clustering solution that absolutely has
to survive two minutes. And no, your patches do not even solve that,
because sizing the pool is impossible.
Pavel
--
Thanks, Sharp!

2006-01-28 16:14:55

by Sridhar Samudrala

Subject: Re: [patch 3/9] mempool - Make mempools NUMA aware

Pavel Machek wrote:
> Hi!
>
> I'll probably regret getting into this discussion, but:
>
>
>>> Or Alan's suggested revival
>>> of the old code to drop non-critical network packets under duress.
>>>
>> Dropping non-critical packets is still in our plan, but I don't think that
>> is a FULL solution. As we mentioned before on that topic, you can't tell
>> if a packet is critical until AFTER you receive it, by which point it has
>> already had an skbuff (hopefully) allocated for it. If your network
>> traffic is coming in faster than you can receive, examine, and drop
>> non-critical packets you're hosed.
>>
>
> Why? You run out of atomic memory, start dropping the packets before
> they even enter the kernel memory, and process backlog in the
> meantime. Other hosts realize you are dropping packets and slow down,
> or, if they are malicious, you just end up consistently dropping 70%
> of packets. But that's okay.
>
>
>> I still think some sort of reserve pool
>> is necessary to give the networking stack a little breathing room when
>> under both memory pressure and network load.
>>
>
> "Lets throw some memory there and hope it does some good?" Eek? What
> about auditing/fixing the networking stack, instead?
>
The other reason we need a separate critical pool is to satisfy critical
GFP_KERNEL allocations when we are in an emergency. These are made on the
send side and we cannot block/sleep.
>
>>> * this doesn't really solve the problem (network can still starve)
>>>
>> Only if the pool is not large enough. One can argue that sizing the pool
>> appropriately is impossible (theoretical incoming traffic over a GigE card
>> or two for a minute or two is extremely large), but then I guess we
>> shouldn't even try to fix the problem...?
>>
>
> And what problem are you trying to fix, anyway? Last time I asked I
> got a reply about some strange clustering solution that absolutely has
> to survive two minutes. And no, your patches do not even solve that,
> because sizing the pool is impossible.
>
Yes, it is true that sizing the critical pool may be difficult if we use
it for all incoming allocations. Maybe, as an initial solution, we could
just depend on dropping non-critical incoming packets and use the critical
pool only for outgoing allocations. We could definitely size the pool if we
use it only for allocations for critical outgoing packets.

Thanks
Sridhar

2006-01-28 16:42:07

by Pavel Machek

Subject: Re: [patch 3/9] mempool - Make mempools NUMA aware

> >>I still think some sort of reserve pool
> >>is necessary to give the networking stack a little breathing room when
> >>under both memory pressure and network load.
> >>
> >
> >"Lets throw some memory there and hope it does some good?" Eek? What
> >about auditing/fixing the networking stack, instead?
> >
> The other reason we need a separate critical pool is to satisfy critical
> GFP_KERNEL allocations when we are in an emergency. These are made on the
> send side and we cannot block/sleep.

If sending routines can work with a constant amount of memory, why use
kmalloc at all? Anyway, I thought we were talking about the receiving side
earlier in the thread.

Ouch and wait a moment. You claim that GFP_KERNEL allocations can't
block/sleep? Of course they can, that's why they are GFP_KERNEL and
not GFP_ATOMIC.
Pavel
--
Thanks, Sharp!

2006-01-28 16:54:00

by Sridhar Samudrala

Subject: Re: [patch 3/9] mempool - Make mempools NUMA aware

Pavel Machek wrote:
>>>> I still think some sort of reserve pool
>>>> is necessary to give the networking stack a little breathing room when
>>>> under both memory pressure and network load.
>>>>
>>>>
>>> "Lets throw some memory there and hope it does some good?" Eek? What
>>> about auditing/fixing the networking stack, instead?
>>>
>>>
>> The other reason we need a separate critical pool is to satisfy critical
>> GFP_KERNEL allocations when we are in an emergency. These are made on the
>> send side and we cannot block/sleep.
>>
>
> If sending routines can work with a constant amount of memory, why use
> kmalloc at all? Anyway, I thought we were talking about the receiving side
> earlier in the thread.
>
> Ouch and wait a moment. You claim that GFP_KERNEL allocations can't
> block/sleep? Of course they can, that's why they are GFP_KERNEL and
> not GFP_ATOMIC.
>
I didn't mean GFP_KERNEL allocations cannot block/sleep. When in an
emergency, we want even the GFP_KERNEL allocations that are made by
critical sockets not to block/sleep. So my original critical sockets
patches change the gfp flag passed to these allocation requests to
GFP_KERNEL|GFP_CRITICAL.
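
At a send-side allocation site that amounts to something like the following
(GFP_CRITICAL comes from those earlier critical-sockets patches, not from
mainline, and the alloc_skb() call site is invented for illustration):

-	skb = alloc_skb(size, GFP_KERNEL);
+	skb = alloc_skb(size, GFP_KERNEL | GFP_CRITICAL);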

Thanks
Sridhar

2006-01-28 22:59:21

by Pavel Machek

Subject: Re: [patch 3/9] mempool - Make mempools NUMA aware

Hi!

> >If sending routines can work with a constant amount of memory, why use
> >kmalloc at all? Anyway, I thought we were talking about the receiving side
> >earlier in the thread.
> >
> >Ouch and wait a moment. You claim that GFP_KERNEL allocations can't
> >block/sleep? Of course they can, that's why they are GFP_KERNEL and
> >not GFP_ATOMIC.
> >
> I didn't mean GFP_KERNEL allocations cannot block/sleep. When in an
> emergency, we want even the GFP_KERNEL allocations that are made by
> critical sockets not to block/sleep. So my original critical sockets
> patches change the gfp flag passed to these allocation requests to
> GFP_KERNEL|GFP_CRITICAL.

Could we get a description of what you are really trying to achieve?
I don't know what a "critical socket" is, when you are "in emergency",
etc. When I am in emergency, I just dial 112...

[Having enough memory on the send side will not mean you'll be able to
send data at TCP layer.]

You seem to have some rather strange needs that are maybe best served
by s/GFP_KERNEL/GFP_ATOMIC/ in the network layer; but we can't / don't
want to do that in the vanilla kernel -- your case is too specialized for
that. (Ouch, and it does not work anyway without rewriting the network
stack...)
Pavel
--
Thanks, Sharp!

2006-01-28 23:11:21

by Pavel Machek

Subject: Let the flames begin... [was Re: [patch 3/9] mempool - Make mempools NUMA aware]

On So 28-01-06 23:59:07, Pavel Machek wrote:
> Hi!
>
> > >If sending routines can work with a constant amount of memory, why use
> > >kmalloc at all? Anyway, I thought we were talking about the receiving side
> > >earlier in the thread.
> > >
> > >Ouch and wait a moment. You claim that GFP_KERNEL allocations can't
> > >block/sleep? Of course they can, that's why they are GFP_KERNEL and
> > >not GFP_ATOMIC.
> > >
> > I didn't mean GFP_KERNEL allocations cannot block/sleep. When in an
> > emergency, we want even the GFP_KERNEL allocations that are made by
> > critical sockets not to block/sleep. So my original critical sockets
> > patches change the gfp flag passed to these allocation requests to
> > GFP_KERNEL|GFP_CRITICAL.

(I'd say Al Viro mode, but Al could take that personally)

IOW: You keep pushing a complex and known-broken solution for a problem
that does not exist.

(Now you should be angry enough, and please explain why I'm wrong in
easy to understand terms, so that even I will understand that we need
critical sockets for kernels in emergency).
Pavel
--
Thanks, Sharp!