2006-01-25 23:51:39

by Matthew Dobson

Subject: [patch 3/9] mempool - Make mempools NUMA aware

plain text document attachment (critical_mempools)
Add NUMA-awareness to the mempool code. This involves several changes:

1) Update mempool_alloc_t to include a node_id argument.
2) Change mempool_create_node() to pass its node_id argument on to the updated
pool->alloc() function in an attempt to allocate the memory pool elements on
the requested node.
3) Change mempool_create() to be a static inline calling mempool_create_node().
4) Add mempool_resize_node() to the mempool API. This function does the same
thing as the old mempool_resize() function, but attempts to allocate both
the internal storage array and the individual memory pool elements on the
specified node, just like mempool_create_node() does.
5) Change mempool_resize() to be a static inline calling mempool_resize_node().
6) Add mempool_alloc_node() to the mempool API. This function allows callers
to request memory from a particular node. This request is only guaranteed
to be fulfilled with memory from the specified node if the mempool was
originally created with mempool_create_node(). If not, this is only a hint
to the allocator, and the request may actually be fulfilled by memory from
another node, because we don't know from which node the memory pool elements
were allocated. (A short usage sketch of the new calls follows this list.)
7) Change mempool_alloc() to be a static inline calling mempool_alloc_node().
8) Update the two "builtin" mempool allocators, mempool_alloc_slab &
mempool_alloc_pages, to use the new mempool_alloc_t by adding a node_id argument
and changing them to call kmem_cache_alloc_node() & alloc_pages_node(),
respectively.
9) Cleanup some minor whitespace and comments.
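
For anyone new to the interface, a minimal usage sketch of the node-aware
calls (not part of the patch; the slab cache "my_cachep" and the node number
are made up for illustration):

	mempool_t *pool;
	void *element;

	/* Reserve 16 elements from an existing slab cache, keeping the pool's
	 * control structure, storage array and elements on node 1. */
	pool = mempool_create_node(16, mempool_alloc_slab, mempool_free_slab,
				   my_cachep, 1);
	if (!pool)
		return -ENOMEM;

	/* Prefer node 1; this is only a hint unless the pool itself was
	 * created on that node with mempool_create_node(). */
	element = mempool_alloc_node(pool, GFP_KERNEL, 1);
	/* ... use element ... */
	mempool_free(element, pool);

	/* A node_id of -1 keeps the old, node-agnostic behaviour. */
	mempool_resize_node(pool, 32, GFP_KERNEL, -1);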

Signed-off-by: Matthew Dobson <[email protected]>

include/linux/mempool.h |   31 ++++++++++++++++++-------
mm/mempool.c            |   58 ++++++++++++++++++++++++++----------------------
2 files changed, 54 insertions(+), 35 deletions(-)

Index: linux-2.6.16-rc1+critical_mempools/mm/mempool.c
===================================================================
--- linux-2.6.16-rc1+critical_mempools.orig/mm/mempool.c
+++ linux-2.6.16-rc1+critical_mempools/mm/mempool.c
@@ -38,12 +38,14 @@ static void free_pool(mempool_t *pool)
}

/**
- * mempool_create - create a memory pool
+ * mempool_create_node - create a memory pool
* @min_nr: the minimum number of elements guaranteed to be
* allocated for this pool.
* @alloc_fn: user-defined element-allocation function.
* @free_fn: user-defined element-freeing function.
* @pool_data: optional private data available to the user-defined functions.
+ * @node_id: node to allocate this memory pool's control structure, storage
+ * array and all of its reserved elements on.
*
* this function creates and allocates a guaranteed size, preallocated
* memory pool. The pool can be used from the mempool_alloc and mempool_free
@@ -51,15 +53,9 @@ static void free_pool(mempool_t *pool)
* functions might sleep - as long as the mempool_alloc function is not called
* from IRQ contexts.
*/
-mempool_t *mempool_create(int min_nr, mempool_alloc_t *alloc_fn,
- mempool_free_t *free_fn, void *pool_data)
-{
- return mempool_create_node(min_nr,alloc_fn,free_fn, pool_data,-1);
-}
-EXPORT_SYMBOL(mempool_create);
-
mempool_t *mempool_create_node(int min_nr, mempool_alloc_t *alloc_fn,
- mempool_free_t *free_fn, void *pool_data, int node_id)
+ mempool_free_t *free_fn, void *pool_data,
+ int node_id)
{
mempool_t *pool;
pool = kmalloc_node(sizeof(*pool), GFP_KERNEL, node_id);
@@ -85,7 +81,7 @@ mempool_t *mempool_create_node(int min_n
while (pool->curr_nr < pool->min_nr) {
void *element;

- element = pool->alloc(GFP_KERNEL, pool->pool_data);
+ element = pool->alloc(GFP_KERNEL, node_id, pool->pool_data);
if (unlikely(!element)) {
free_pool(pool);
return NULL;
@@ -97,12 +93,14 @@ mempool_t *mempool_create_node(int min_n
EXPORT_SYMBOL(mempool_create_node);

/**
- * mempool_resize - resize an existing memory pool
+ * mempool_resize_node - resize an existing memory pool
* @pool: pointer to the memory pool which was allocated via
* mempool_create().
* @new_min_nr: the new minimum number of elements guaranteed to be
* allocated for this pool.
* @gfp_mask: the usual allocation bitmask.
+ * @node_id: node to allocate this memory pool's new storage array
+ * and all of its reserved elements on.
*
* This function shrinks/grows the pool. In the case of growing,
* it cannot be guaranteed that the pool will be grown to the new
@@ -112,7 +110,8 @@ EXPORT_SYMBOL(mempool_create_node);
* while this function is running. mempool_alloc() & mempool_free()
* might be called (eg. from IRQ contexts) while this function executes.
*/
-int mempool_resize(mempool_t *pool, int new_min_nr, gfp_t gfp_mask)
+int mempool_resize_node(mempool_t *pool, int new_min_nr, gfp_t gfp_mask,
+ int node_id)
{
void *element;
void **new_elements;
@@ -134,7 +133,8 @@ int mempool_resize(mempool_t *pool, int
spin_unlock_irqrestore(&pool->lock, flags);

/* Grow the pool */
- new_elements = kmalloc(new_min_nr * sizeof(*new_elements), gfp_mask);
+ new_elements = kmalloc_node(new_min_nr * sizeof(*new_elements),
+ gfp_mask, node_id);
if (!new_elements)
return -ENOMEM;

@@ -153,7 +153,7 @@ int mempool_resize(mempool_t *pool, int

while (pool->curr_nr < pool->min_nr) {
spin_unlock_irqrestore(&pool->lock, flags);
- element = pool->alloc(gfp_mask, pool->pool_data);
+ element = pool->alloc(gfp_mask, node_id, pool->pool_data);
if (!element)
goto out;
spin_lock_irqsave(&pool->lock, flags);
@@ -170,14 +170,14 @@ out_unlock:
out:
return 0;
}
-EXPORT_SYMBOL(mempool_resize);
+EXPORT_SYMBOL(mempool_resize_node);

/**
* mempool_destroy - deallocate a memory pool
* @pool: pointer to the memory pool which was allocated via
* mempool_create().
*
- * this function only sleeps if the free_fn() function sleeps. The caller
+ * this function only sleeps if the ->free() function sleeps. The caller
* has to guarantee that all elements have been returned to the pool (ie:
* freed) prior to calling mempool_destroy().
*/
@@ -190,17 +190,22 @@ void mempool_destroy(mempool_t *pool)
EXPORT_SYMBOL(mempool_destroy);

/**
- * mempool_alloc - allocate an element from a specific memory pool
+ * mempool_alloc_node - allocate an element from a specific memory pool
* @pool: pointer to the memory pool which was allocated via
* mempool_create().
* @gfp_mask: the usual allocation bitmask.
+ * @node_id: node to _attempt_ to allocate from. This request is only
+ * guaranteed to be fulfilled with memory from the specified node if
+ * the mempool was originally created with mempool_create_node().
+ * If not, this is only a hint to the allocator, and the request
+ * may actually be fulfilled by memory from another node.
*
- * this function only sleeps if the alloc_fn function sleeps or
+ * this function only sleeps if the ->alloc() function sleeps or
* returns NULL. Note that due to preallocation, this function
* *never* fails when called from process contexts. (it might
* fail if called from an IRQ context.)
*/
-void * mempool_alloc(mempool_t *pool, gfp_t gfp_mask)
+void *mempool_alloc_node(mempool_t *pool, gfp_t gfp_mask, int node_id)
{
void *element;
unsigned long flags;
@@ -217,7 +222,7 @@ void * mempool_alloc(mempool_t *pool, gf

repeat_alloc:

- element = pool->alloc(gfp_temp, pool->pool_data);
+ element = pool->alloc(gfp_temp, node_id, pool->pool_data);
if (likely(element != NULL))
return element;

@@ -244,7 +249,7 @@ repeat_alloc:

goto repeat_alloc;
}
-EXPORT_SYMBOL(mempool_alloc);
+EXPORT_SYMBOL(mempool_alloc_node);

/**
* mempool_free - return an element to the pool.
@@ -252,7 +257,7 @@ EXPORT_SYMBOL(mempool_alloc);
* @pool: pointer to the memory pool which was allocated via
* mempool_create().
*
- * this function only sleeps if the free_fn() function sleeps.
+ * this function only sleeps if the ->free() function sleeps.
*/
void mempool_free(void *element, mempool_t *pool)
{
@@ -276,10 +281,10 @@ EXPORT_SYMBOL(mempool_free);
/*
* A commonly used alloc and free fn.
*/
-void *mempool_alloc_slab(gfp_t gfp_mask, void *pool_data)
+void *mempool_alloc_slab(gfp_t gfp_mask, int node_id, void *pool_data)
{
kmem_cache_t *mem = (kmem_cache_t *) pool_data;
- return kmem_cache_alloc(mem, gfp_mask);
+ return kmem_cache_alloc_node(mem, gfp_mask, node_id);
}
EXPORT_SYMBOL(mempool_alloc_slab);

@@ -293,10 +298,11 @@ EXPORT_SYMBOL(mempool_free_slab);
/*
* A simple mempool-backed page allocator
*/
-void *mempool_alloc_pages(gfp_t gfp_mask, void *pool_data)
+void *mempool_alloc_pages(gfp_t gfp_mask, int node_id, void *pool_data)
{
int order = (int)pool_data;
- return alloc_pages(gfp_mask, order);
+ return alloc_pages_node(node_id >= 0 ? node_id : numa_node_id(),
+ gfp_mask, order);
}
EXPORT_SYMBOL(mempool_alloc_pages);

Index: linux-2.6.16-rc1+critical_mempools/include/linux/mempool.h
===================================================================
--- linux-2.6.16-rc1+critical_mempools.orig/include/linux/mempool.h
+++ linux-2.6.16-rc1+critical_mempools/include/linux/mempool.h
@@ -6,7 +6,7 @@

#include <linux/wait.h>

-typedef void * (mempool_alloc_t)(gfp_t gfp_mask, void *pool_data);
+typedef void * (mempool_alloc_t)(gfp_t gfp_mask, int node_id, void *pool_data);
typedef void (mempool_free_t)(void *element, void *pool_data);

typedef struct mempool_s {
@@ -21,27 +21,40 @@ typedef struct mempool_s {
wait_queue_head_t wait;
} mempool_t;

-extern mempool_t *mempool_create(int min_nr, mempool_alloc_t *alloc_fn,
- mempool_free_t *free_fn, void *pool_data);
extern mempool_t *mempool_create_node(int min_nr, mempool_alloc_t *alloc_fn,
- mempool_free_t *free_fn, void *pool_data, int nid);
-
-extern int mempool_resize(mempool_t *pool, int new_min_nr, gfp_t gfp_mask);
+ mempool_free_t *free_fn, void *pool_data, int node_id);
+static inline mempool_t *mempool_create(int min_nr, mempool_alloc_t *alloc_fn,
+ mempool_free_t *free_fn, void *pool_data)
+{
+ return mempool_create_node(min_nr, alloc_fn, free_fn, pool_data, -1);
+}
extern void mempool_destroy(mempool_t *pool);
-extern void * mempool_alloc(mempool_t *pool, gfp_t gfp_mask);
+
+extern int mempool_resize_node(mempool_t *pool, int new_min_nr, gfp_t gfp_mask,
+ int node_id);
+static inline int mempool_resize(mempool_t *pool, int new_min_nr, gfp_t gfp_mask)
+{
+ return mempool_resize_node(pool, new_min_nr, gfp_mask, -1);
+}
+
+extern void *mempool_alloc_node(mempool_t *pool, gfp_t gfp_mask, int node_id);
+static inline void *mempool_alloc(mempool_t *pool, gfp_t gfp_mask)
+{
+ return mempool_alloc_node(pool, gfp_mask, -1);
+}
extern void mempool_free(void *element, mempool_t *pool);

/*
* A mempool_alloc_t and mempool_free_t that get the memory from
* a slab that is passed in through pool_data.
*/
-void *mempool_alloc_slab(gfp_t gfp_mask, void *pool_data);
+void *mempool_alloc_slab(gfp_t gfp_mask, int node_id, void *pool_data);
void mempool_free_slab(void *element, void *pool_data);

/*
* A mempool_alloc_t and mempool_free_t for a simple page allocator
*/
-void *mempool_alloc_pages(gfp_t gfp_mask, void *pool_data);
+void *mempool_alloc_pages(gfp_t gfp_mask, int node_id, void *pool_data);
void mempool_free_pages(void *element, void *pool_data);

#endif /* _LINUX_MEMPOOL_H */

--


2006-01-26 17:55:06

by Christoph Lameter

Subject: Re: [patch 3/9] mempool - Make mempools NUMA aware

On Wed, 25 Jan 2006, Matthew Dobson wrote:

> plain text document attachment (critical_mempools)
> Add NUMA-awareness to the mempool code. This involves several changes:

I am not quite sure why you would need numa awareness in an emergency
memory pool. Presumably the effectiveness of the accesses does not matter.
You only want to be sure that there is some memory available, right?

You do not need this....

2006-01-26 22:57:14

by Matthew Dobson

Subject: Re: [patch 3/9] mempool - Make mempools NUMA aware

Christoph Lameter wrote:
> On Wed, 25 Jan 2006, Matthew Dobson wrote:
>
>
>>plain text document attachment (critical_mempools)
>>Add NUMA-awareness to the mempool code. This involves several changes:
>
>
> I am not quite sure why you would need numa awareness in an emergency
> memory pool. Presumably the effectiveness of the accesses do not matter.
> You only want to be sure that there is some memory available right?

Not all requests for memory from a specific node are performance
enhancements; some are for correctness. With large machines, especially as
those large machines' workloads are more and more likely to be partitioned
with something like cpusets, you want to be able to specify where you want
your reserve pool to come from. As it was not incredibly difficult to
offer this option, I added it. I was unwilling to completely ignore
callers' NUMA requests, assuming that they are all purely performance
motivated.


> You do not need this....

I do not agree...


Thanks!

-Matt

2006-01-26 23:15:52

by Christoph Lameter

Subject: Re: [patch 3/9] mempool - Make mempools NUMA aware

On Thu, 26 Jan 2006, Matthew Dobson wrote:

> Not all requests for memory from a specific node are performance
> enhancements, some are for correctness. With large machines, especially as

alloc_pages_node and friends do not guarantee allocation on that specific
node. That argument for "correctness" is bogus.

> > You do not need this....
> I do not agree...

There is no way that you would need this patch.


2006-01-26 23:24:35

by Matthew Dobson

Subject: Re: [patch 3/9] mempool - Make mempools NUMA aware

Christoph Lameter wrote:
> On Thu, 26 Jan 2006, Matthew Dobson wrote:
>
>
>>Not all requests for memory from a specific node are performance
>>enhancements, some are for correctness. With large machines, especially as
>
>
> alloc_pages_node and friends do not guarantee allocation on that specific
> node. That argument for "correctness" is bogus.

alloc_pages_node() does not guarantee allocation on a specific node, but
calling __alloc_pages() with a specific nodelist would.


>>>You do not need this....
>>
>>I do not agree...
>
>
> There is no way that you would need this patch.

My goal was to not change the behavior of the slab allocator when inserting
a mempool-backed allocator "under" it. Without support for at least
*requesting* allocations from a specific node when allocating from a
mempool, this would change how the slab allocator works. That would be
bad. The slab allocator now does not guarantee that, for example, a
kmalloc_node() request is satisfied by memory from the requested node, but
it does at least TRY. Without adding mempool_alloc_node(), I would
never be able to even TRY to satisfy a mempool-backed kmalloc_node()
request from the correct node. I believe that would constitute an
unacceptable breakage from normal, documented behavior. So, I *do* need
this patch.

-Matt

2006-01-26 23:30:06

by Christoph Lameter

Subject: Re: [patch 3/9] mempool - Make mempools NUMA aware

On Thu, 26 Jan 2006, Matthew Dobson wrote:

> alloc_pages_node() does not guarantee allocation on a specific node, but
> calling __alloc_pages() with a specific nodelist would.

True, but you have emergency *_node functions that do not take nodelists.

> > There is no way that you would need this patch.
>
> My goal was to not change the behavior of the slab allocator when inserting
> a mempool-backed allocator "under" it. Without support for at least
> *requesting* allocations from a specific node when allocating from a
> mempool, this would change how the slab allocator works. That would be
> bad. The slab allocator now does not guarantee that, for example, a
> kmalloc_node() request is satisfied by memory from the requested node, but
> it does at least TRY. Without adding mempool_alloc_node() then I would
> never be able to even TRY to satisfy a mempool-backed kmalloc_node()
> request from the correct node. I believe that would constitute an
> unacceptable breakage from normal, documented behavior. So, I *do* need
> this patch.

If you get to the emergency lists then you are already in a tight memory
situation. In that situation it does not make sense to worry about the
node number the memory is coming from. kmalloc_node is just a kmalloc with
an indication of a preference of where the memory should be coming from.
The node locality only influences performance and not correctness.

There is no change to the way the slab allocator works. Just drop the
*_node variants.

2006-01-27 00:15:53

by Matthew Dobson

Subject: Re: [patch 3/9] mempool - Make mempools NUMA aware

Christoph Lameter wrote:
> On Thu, 26 Jan 2006, Matthew Dobson wrote:
>
>
>>alloc_pages_node() does not guarantee allocation on a specific node, but
>>calling __alloc_pages() with a specific nodelist would.
>
>
> True, but you have emergency *_node functions that do not take nodelists.

Agreed.


>>>There is no way that you would need this patch.
>>
>>My goal was to not change the behavior of the slab allocator when inserting
>>a mempool-backed allocator "under" it. Without support for at least
>>*requesting* allocations from a specific node when allocating from a
>>mempool, this would change how the slab allocator works. That would be
>>bad. The slab allocator now does not guarantee that, for example, a
>>kmalloc_node() request is satisfied by memory from the requested node, but
>>it does at least TRY. Without adding mempool_alloc_node() then I would
>>never be able to even TRY to satisfy a mempool-backed kmalloc_node()
>>request from the correct node. I believe that would constitute an
>>unacceptable breakage from normal, documented behavior. So, I *do* need
>>this patch.
>
>
> If you get to the emergency lists then you are already in a tight memory
> situation. In that situation it does not make sense to worry about the
> node number the memory is coming from. kmalloc_node is just a kmalloc with
> an indication of a preference of where the memory should be coming from.
> The node locality only influences performance and not correctness.
>
> There is no change to the way the slab allocator works. Just drop the
> *_node variants.

If you look more carefully at how the emergency mempools are used, I think
you'll better understand why I did this:

Look at patch 9/9, specifically the changes to kmem_getpages():

- page = alloc_pages_node(nodeid, flags, cachep->gfporder);
+ /*
+ * If this allocation request isn't backed by a memory pool, or if that
+ * memory pool's gfporder is not the same as the cache's gfporder, fall
+ * back to alloc_pages_node().
+ */
+ if (!pool || cachep->gfporder != (int)pool->pool_data)
+ page = alloc_pages_node(nodeid, flags, cachep->gfporder);
+ else
+ page = mempool_alloc_node(pool, flags, nodeid);

Allocations backed by a mempool must always be allocated via
mempool_alloc() (or mempool_alloc_node() in this case). What that means
is, without a mempool_alloc_node() function, NO mempool backed allocations
will be able to request a specific node, even when the system has PLENTY of
memory! This, IMO, is unacceptable. Adding more NUMA-awareness to the
mempool system allows us to keep the same slab behavior as before, as well
as leaving us free to ignore the node requests when memory is low.
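
To make that concrete, here is how a hypothetical node-aware caller looks
with and without the new entry point (these call sites are illustrative,
not lifted from the patches):

	/* Plain page allocator: the node preference is passed along. */
	page = alloc_pages_node(nodeid, flags, order);

	/* Backed by a mempool, with only mempool_alloc(): nodeid is simply
	 * dropped, even when memory is plentiful. */
	page = mempool_alloc(pool, flags);

	/* Backed by a mempool, with mempool_alloc_node(): the hint survives,
	 * and can still be ignored when memory is tight. */
	page = mempool_alloc_node(pool, flags, nodeid);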

-Matt

2006-01-27 00:22:03

by Christoph Lameter

Subject: Re: [patch 3/9] mempool - Make mempools NUMA aware

On Thu, 26 Jan 2006, Matthew Dobson wrote:

> Allocations backed by a mempool must always be allocated via
> mempool_alloc() (or mempool_alloc_node() in this case). What that means
> is, without a mempool_alloc_node() function, NO mempool backed allocations
> will be able to request a specific node, even when the system has PLENTY of
> memory! This, IMO, is unacceptable. Adding more NUMA-awareness to the
> mempool system allows us to keep the same slab behavior as before, as well
> as leaving us free to ignore the node requests when memory is low.

Ok. That makes sense. I thought the mempool_xxx functions were only for
emergencies. But nevertheless you still duplicate all memory allocation
functions. I already was a bit concerned when I added the _node stuff.

What may be better is to add some kind of "allocation policy" to an
allocation. That allocation policy could require the allocation on a node,
distribution over a series of nodes, require allocation on a particular
node, or allow the use of emergency pools etc.

Maybe unify all the different page allocations to one call and do the
same with the slab allocator.

2006-01-27 00:28:03

by Benjamin LaHaise

Subject: Re: [patch 3/9] mempool - Make mempools NUMA aware

On Wed, Jan 25, 2006 at 03:51:33PM -0800, Matthew Dobson wrote:
> plain text document attachment (critical_mempools)
> Add NUMA-awareness to the mempool code. This involves several changes:

This is horribly bloated. Mempools should really just be a flag and
reserve count on a slab, as then the code would not be in hot paths.

-ben

2006-01-27 00:34:33

by Matthew Dobson

Subject: Re: [patch 3/9] mempool - Make mempools NUMA aware

Christoph Lameter wrote:
> On Thu, 26 Jan 2006, Matthew Dobson wrote:
>
>
>>Allocations backed by a mempool must always be allocated via
>>mempool_alloc() (or mempool_alloc_node() in this case). What that means
>>is, without a mempool_alloc_node() function, NO mempool backed allocations
>>will be able to request a specific node, even when the system has PLENTY of
>>memory! This, IMO, is unacceptable. Adding more NUMA-awareness to the
>>mempool system allows us to keep the same slab behavior as before, as well
>>as leaving us free to ignore the node requests when memory is low.
>
>
> Ok. That makes sense. I thought the mempool_xxx functions were only for
> emergencies. But nevertheless you still duplicate all memory allocation
> functions. I already was a bit concerned when I added the _node stuff.

I'm glad we're on the same page now. :) And yes, adding four "duplicate"
*_mempool allocators was not my first choice, but I couldn't easily see a
better way.


> What may be better is to add some kind of "allocation policy" to an
> allocation. That allocation policy could require the allocation on a node,
> distribution over a series of nodes, require allocation on a particular
> node, or allow the use of emergency pools etc.
>
> Maybe unify all the different page allocations to one call and do the
> same with the slab allocator.

Hmmm... I kinda like that. Some sort of

struct allocation_policy {
	enum policy_type type;
	nodemask_t nodes;
	mempool_t *critical_pool;
};

that could be passed to __alloc_pages()?
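
Purely as an illustration (the POLICY_CRITICAL constant, the net_reserve_pool
pool and the __alloc_pages_policy() entry point below are invented for this
sketch, not existing kernel interfaces), one such policy could be allocated
statically and passed around by pointer:

	static struct allocation_policy net_emergency_policy = {
		.type		= POLICY_CRITICAL,
		.nodes		= NODE_MASK_ALL,
		.critical_pool	= &net_reserve_pool,
	};

	page = __alloc_pages_policy(gfp_mask, order, &net_emergency_policy);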

That seems a bit beyond the scope of what I'd hoped for this patch series,
but if an approach like this is believed to be generally useful, it's
something I'm more than willing to work on...

-Matt

2006-01-27 00:36:02

by Matthew Dobson

Subject: Re: [patch 3/9] mempool - Make mempools NUMA aware

Benjamin LaHaise wrote:
> On Wed, Jan 25, 2006 at 03:51:33PM -0800, Matthew Dobson wrote:
>
>>plain text document attachment (critical_mempools)
>>Add NUMA-awareness to the mempool code. This involves several changes:
>
>
> This is horribly bloated. Mempools should really just be a flag and
> reserve count on a slab, as then the code would not be in hot paths.
>
> -ben

Ummm... ok? But with only a simple flag, how do you know *which* mempool
you're trying to use? What if you want to use a mempool for a non-slab
allocation?

-Matt

2006-01-27 00:39:50

by Christoph Lameter

Subject: Re: [patch 3/9] mempool - Make mempools NUMA aware

On Thu, 26 Jan 2006, Matthew Dobson wrote:

> That seems a bit beyond the scope of what I'd hoped for this patch series,
> but if an approach like this is believed to be generally useful, it's
> something I'm more than willing to work on...

We need this for other issues as well. f.e. to establish memory allocation
policies for the page cache, tmpfs and various other needs. Look at
mempolicy.h which defines a subset of what we need. Currently there is no
way to specify a policy when invoking the page allocator or slab
allocator. The policy is implicitly fetched from the current task structure
which is not optimal.

2006-01-27 00:45:05

by Matthew Dobson

Subject: Re: [patch 3/9] mempool - Make mempools NUMA aware

Christoph Lameter wrote:
> On Thu, 26 Jan 2006, Matthew Dobson wrote:
>
>
>>That seems a bit beyond the scope of what I'd hoped for this patch series,
>>but if an approach like this is believed to be generally useful, it's
>>something I'm more than willing to work on...
>
>
> We need this for other issues as well. f.e. to establish memory allocation
> policies for the page cache, tmpfs and various other needs. Look at
> mempolicy.h which defines a subset of what we need. Currently there is no
> way to specify a policy when invoking the page allocator or slab
> allocator. The policy is implicitly fetched from the current task structure
> which is not optimal.

I agree that the current, task-based policies are suboptimal. Having to
allocate and fill in even a small structure for each allocation is going to
be a tough sell, though. I suppose most allocations could get by with a
small handful of static generic "policy structures"... This seems like it
will turn into a significant rework of all the kernel's allocation routines,
no small task. Certainly not something that I'd even start without
response from some other major players in the VM area... Anyone?


-Matt

2006-01-27 00:57:17

by Christoph Lameter

Subject: Re: [patch 3/9] mempool - Make mempools NUMA aware

On Thu, 26 Jan 2006, Matthew Dobson wrote:

> > We need this for other issues as well. f.e. to establish memory allocation
> > policies for the page cache, tmpfs and various other needs. Look at
> > mempolicy.h which defines a subset of what we need. Currently there is no
> > way to specify a policy when invoking the page allocator or slab
> > allocator. The policy is implicitly fetched from the current task structure
> > which is not optimal.
>
> I agree that the current, task-based policies are suboptimal. Having to
> allocate and fill in even a small structure for each allocation is going to
> be a tough sell, though. I suppose most allocations could get by with a
> small handful of static generic "policy structures"... This seems like it
> will turn into a significant rework of all the kernel's allocation routines,
> no small task. Certainly not something that I'd even start without
> response from some other major players in the VM area... Anyone?

No, you would have a set of policies and only pass a pointer to the
policies to the allocator. I.e. have one emergency policy allocated
somewhere in the IP stack and then pass that to the allocator.

I guess that Andi Kleen and Paul Jackson would likely be interested in
such an endeavor.

2006-01-27 01:13:23

by Andi Kleen

Subject: Re: [patch 3/9] mempool - Make mempools NUMA aware

On Friday 27 January 2006 01:57, Christoph Lameter wrote:
> On Thu, 26 Jan 2006, Matthew Dobson wrote:
>
> > > We need this for other issues as well. f.e. to establish memory allocation
> > > policies for the page cache, tmpfs and various other needs. Look at
> > > mempolicy.h which defines a subset of what we need. Currently there is no
> > > way to specify a policy when invoking the page allocator or slab
> > > allocator. The policy is implicitly fetched from the current task structure
> > > which is not optimal.
> >
> > I agree that the current, task-based policies are suboptimal. Having to
> > allocate and fill in even a small structure for each allocation is going to
> > be a tough sell, though. I suppose most allocations could get by with a
> > small handful of static generic "policy structures"... This seems like it
> > will turn into a significant rework of all the kernel's allocation routines,
> > no small task. Certainly not something that I'd even start without
> > response from some other major players in the VM area... Anyone?
>
> No you would have a set of policies and only pass a pointer to the
> policies to the allocator. I.e. have one emergency policy allocated
> somewhere in the IP stack and then pass that to the allocator.

What would that be needed for?

My goal for mempolicies was always to keep them as simple as possible
and keep the fast paths fast; there has to be a very good reason to add any
new complexity.

-Andi

2006-01-27 03:27:25

by Benjamin LaHaise

Subject: Re: [patch 3/9] mempool - Make mempools NUMA aware

On Thu, Jan 26, 2006 at 04:35:56PM -0800, Matthew Dobson wrote:
> Ummm... ok? But with only a simple flag, how do you know *which* mempool
> you're trying to use? What if you want to use a mempool for a non-slab
> allocation?

Are there any? A quick poke around has only found a couple of places
that use kzalloc(), which is still quite effectively a slab allocation.
There seems to be just one page user, the dm-crypt driver, which could
be served by a reservation scheme.

-ben
--
"Ladies and gentlemen, I'm sorry to interrupt, but the police are here
and they've asked us to stop the party." Don't Email: <[email protected]>.

2006-01-27 10:51:54

by Paul Jackson

Subject: Re: [patch 3/9] mempool - Make mempools NUMA aware

Matthew wrote:
> I'm glad we're on the same page now. :) And yes, adding four "duplicate"
> *_mempool allocators was not my first choice, but I couldn't easily see a
> better way.

I hope the following comments aren't too far off target.

I too am inclined to prefer the __GFP_CRITICAL approach over this.
That or Andrea's suggestion, which except for a free hook, was entirely
outside of the page_alloc.c code paths. Or Alan's suggested revival
of the old code to drop non-critical network patches in duress.

I am tempted to think you've taken an approach that raised some
substantial looking issues:

* how to tell the system when to use the emergency pool
* this doesn't really solve the problem (network can still starve)
* it wastes memory most of the time
* it doesn't really improve on GFP_ATOMIC

and just added another substantial looking issue:

* it entwines another thread of complexity and performance costs
into the important memory allocation code path.

Progress in the wrong direction ;).

> With large machines, especially as
> those large machines' workloads are more and more likely to be partitioned
> with something like cpusets, you want to be able to specify where you want
> your reserve pool to come from.

Cpusets is about performance, not correctness. Anytime I get cornered
in the cpuset code, I prefer violating the cpuset containment over
serious system failure.

--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <[email protected]> 1.925.600.0401

2006-01-28 01:00:24

by Matthew Dobson

Subject: Re: [patch 3/9] mempool - Make mempools NUMA aware

Paul Jackson wrote:
> Matthew wrote:
>
>>I'm glad we're on the same page now. :) And yes, adding four "duplicate"
>>*_mempool allocators was not my first choice, but I couldn't easily see a
>>better way.
>
>
> I hope the following comments aren't too far off target.
>
> I too am inclined to prefer the __GFP_CRITICAL approach over this.

OK. Chalk one more up for that solution...


> That or Andrea's suggestion, which except for a free hook, was entirely
> outside of the page_alloc.c code paths.

This is supposed to be an implementation of Andrea's suggestion. There are
no hooks in ANY page_alloc.c code paths. These patches touch mempool code
and some slab code, but not any page allocator code.


> Or Alan's suggested revival
> of the old code to drop non-critical network packets under duress.

Dropping non-critical packets is still in our plan, but I don't think that
is a FULL solution. As we mentioned before on that topic, you can't tell
if a packet is critical until AFTER you receive it, by which point it has
already had an skbuff (hopefully) allocated for it. If your network
traffic is coming in faster than you can receive, examine, and drop
non-critical packets you're hosed. I still think some sort of reserve pool
is necessary to give the networking stack a little breathing room when
under both memory pressure and network load.


> I am tempted to think you've taken an approach that raised some
> substantial looking issues:
>
> * how to tell the system when to use the emergency pool

We've dropped the whole "in_emergency" thing. The system uses the
emergency pool when the normal pool (ie: the buddy allocator) is out of pages.

> * this doesn't really solve the problem (network can still starve)

Only if the pool is not large enough. One can argue that sizing the pool
appropriately is impossible (theoretical incoming traffic over a GigE card
or two for a minute or two is extremely large), but then I guess we
shouldn't even try to fix the problem...?

> * it wastes memory most of the time

True. Any "real" reserve system will suffer from that problem. Ben
LaHaise suggested a reserve system that allows the reserve pages to be used
for trivially reclaimable allocation while not in active use. An
interesting idea. Regardless, the Linux VM sorta already wastes memory by
keeping min_free_kbytes around, no?

> * it doesn't really improve on GFP_ATOMIC

I disagree. It improves on GFP_ATOMIC by giving it a second chance. If
you've got a GFP_ATOMIC allocation that is particularly critical, using a
mempool to back it means that you can keep going for a while when the rest
of the system OOMs/goes into SWAP hell/etc.

> and just added another substantial looking issue:
>
> * it entwines another thread of complexity and performance costs
> into the important memory allocation code path.

I can't say that it doesn't add any complexity into an important memory
allocation path, but I don't think it is a significant amount of
complexity. It is just a pointer check in kmem_getpages()...


>>With large machines, especially as
>>those large machines' workloads are more and more likely to be partitioned
>>with something like cpusets, you want to be able to specify where you want
>>your reserve pool to come from.
>
>
> Cpusets is about performance, not correctness. Anytime I get cornered
> in the cpuset code, I prefer violating the cpuset containment, over
> serious system failure.

Fair enough. But if we can keep the same baseline performance and add this
new feature, I'd like to do that. Doing our best to allocate on a
particular node when requested to isn't too much to ask.

-Matt

2006-01-28 01:09:00

by Matthew Dobson

Subject: Re: [patch 3/9] mempool - Make mempools NUMA aware

Benjamin LaHaise wrote:
> On Thu, Jan 26, 2006 at 04:35:56PM -0800, Matthew Dobson wrote:
>
>>Ummm... ok? But with only a simple flag, how do you know *which* mempool
>>you're trying to use? What if you want to use a mempool for a non-slab
>>allocation?
>
>
> Are there any? A quick poke around has only found a couple of places
> that use kzalloc(), which is still quite effectively a slab allocation.
> There seems to be just one page user, the dm-crypt driver, which could
> be served by a reservation scheme.

A couple. If Andrew is willing to pick up the mempool patches I posted an
hour or so ago, there will be only 4 mempool users that aren't using a
common mempool allocator. Regardless of whether that happens, there are
only a few users that aren't slab based:
1) mm/highmem.c - page based allocator
2) drivers/scsi/scsi_transport_iscsi.c - calls alloc_skb(), which does
eventually end up making a slab allocation
3) drivers/md/raid1.c & raid10.c - easily the biggest mempool_alloc
functions in the kernel. Non-trivial.
4) drivers/md/dm-crypt.c - the driver you mentioned, also using a page
allocator

So we could possibly get away with a reservation scheme, but a couple users
would be non-trivial to fixup.

-Matt

2006-01-28 05:08:26

by Paul Jackson

Subject: Re: [patch 3/9] mempool - Make mempools NUMA aware

Matthew wrote:
> > I too am inclined to prefer the __GFP_CRITICAL approach over this.
>
> OK. Chalk one more up for that solution...

I don't think my vote should count for much. See below.

> This is supposed to be an implementation of Andrea's suggestion. There are
> no hooks in ANY page_alloc.c code paths. These patches touch mempool code
> and some slab code, but not any page allocator code.

Yeah - you're right. I misread your patch set. Sorry
for wasting your time.

--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <[email protected]> 1.925.600.0401

2006-01-28 08:17:06

by Pavel Machek

Subject: Re: [patch 3/9] mempool - Make mempools NUMA aware

Hi!

I'll probably regret getting into this discussion, but:

> > Or Alan's suggested revival
> > of the old code to drop non-critical network packets under duress.
>
> Dropping non-critical packets is still in our plan, but I don't think that
> is a FULL solution. As we mentioned before on that topic, you can't tell
> if a packet is critical until AFTER you receive it, by which point it has
> already had an skbuff (hopefully) allocated for it. If your network
> traffic is coming in faster than you can receive, examine, and drop
> non-critical packets you're hosed.

Why? You run out of atomic memory, start dropping the packets before
they even enter the kernel memory, and process backlog in the
meantime. Other hosts realize you are dropping packets and slow down,
or, if they are malicious, you just end up consistently dropping 70%
of packets. But that's okay.

> I still think some sort of reserve pool
> is necessary to give the networking stack a little breathing room when
> under both memory pressure and network load.

"Lets throw some memory there and hope it does some good?" Eek? What
about auditing/fixing the networking stack, instead?

> > * this doesn't really solve the problem (network can still starve)
>
> Only if the pool is not large enough. One can argue that sizing the pool
> appropriately is impossible (theoretical incoming traffic over a GigE card
> or two for a minute or two is extremely large), but then I guess we
> shouldn't even try to fix the problem...?

And what problem are you trying to fix, anyway? Last time I asked I
got a reply about some strange clustering solution that absolutely has
to survive two minutes. And no, your patches do not even solve that,
because sizing the pool is impossible.
Pavel
--
Thanks, Sharp!

2006-01-28 16:14:55

by Sridhar Samudrala

Subject: Re: [patch 3/9] mempool - Make mempools NUMA aware

Pavel Machek wrote:
> Hi!
>
> I'll probably regret getting into this discussion, but:
>
>
>>> Or Alan's suggested revival
>>> of the old code to drop non-critical network packets under duress.
>>>
>> Dropping non-critical packets is still in our plan, but I don't think that
>> is a FULL solution. As we mentioned before on that topic, you can't tell
>> if a packet is critical until AFTER you receive it, by which point it has
>> already had an skbuff (hopefully) allocated for it. If your network
>> traffic is coming in faster than you can receive, examine, and drop
>> non-critical packets you're hosed.
>>
>
> Why? You run out of atomic memory, start dropping the packets before
> they even enter the kernel memory, and process backlog in the
> meantime. Other hosts realize you are dropping packets and slow down,
> or, if they are malicious, you just end up consistently dropping 70%
> of packets. But that's okay.
>
>
>> I still think some sort of reserve pool
>> is necessary to give the networking stack a little breathing room when
>> under both memory pressure and network load.
>>
>
> "Lets throw some memory there and hope it does some good?" Eek? What
> about auditing/fixing the networking stack, instead?
>
The other reason we need a separate critical pool is to satisfy critical
GFP_KERNEL allocations when we are in an emergency. These are made on the
send side and we cannot block/sleep.
>
>>> * this doesn't really solve the problem (network can still starve)
>>>
>> Only if the pool is not large enough. One can argue that sizing the pool
>> appropriately is impossible (theoretical incoming traffic over a GigE card
>> or two for a minute or two is extremely large), but then I guess we
>> shouldn't even try to fix the problem...?
>>
>
> And what problem are you trying to fix, anyway? Last time I asked I
> got a reply about some strange clustering solution that absolutely has
> to survive two minutes. And no, your patches do not even solve that,
> because sizing the pool is impossible.
>
Yes, it is true that sizing the critical pool may be difficult if we use
it for all incoming allocations. Maybe, as an initial solution, we could
just depend on dropping non-critical incoming packets and use the critical
pool only for outgoing allocations. We could definitely size the pool if we
use it only for allocations for critical outgoing packets.

Thanks
Sridhar

2006-01-28 16:42:07

by Pavel Machek

Subject: Re: [patch 3/9] mempool - Make mempools NUMA aware

> >>I still think some sort of reserve pool
> >>is necessary to give the networking stack a little breathing room when
> >>under both memory pressure and network load.
> >>
> >
> >"Lets throw some memory there and hope it does some good?" Eek? What
> >about auditing/fixing the networking stack, instead?
> >
> The other reason we need a separate critical pool is to satisfy critical
> GFP_KERNEL allocations when we are in an emergency. These are made on the
> send side and we cannot block/sleep.

If sending routines can work with a constant amount of memory, why use
kmalloc at all? Anyway, I thought we were talking about the receiving side
earlier in the thread.

Ouch and wait a moment. You claim that GFP_KERNEL allocations can't
block/sleep? Of course they can, that's why they are GFP_KERNEL and
not GFP_ATOMIC.
Pavel
--
Thanks, Sharp!

2006-01-28 16:54:00

by Sridhar Samudrala

Subject: Re: [patch 3/9] mempool - Make mempools NUMA aware

Pavel Machek wrote:
>>>> I still think some sort of reserve pool
>>>> is necessary to give the networking stack a little breathing room when
>>>> under both memory pressure and network load.
>>>>
>>>>
>>> "Lets throw some memory there and hope it does some good?" Eek? What
>>> about auditing/fixing the networking stack, instead?
>>>
>>>
>> The other reason we need a separate critical pool is to satisfy critical
>> GFP_KERNEL allocations when we are in an emergency. These are made on the
>> send side and we cannot block/sleep.
>>
>
> If sending routines can work with a constant amount of memory, why use
> kmalloc at all? Anyway, I thought we were talking about the receiving side
> earlier in the thread.
>
> Ouch and wait a moment. You claim that GFP_KERNEL allocations can't
> block/sleep? Of course they can, that's why they are GFP_KERNEL and
> not GFP_ATOMIC.
>
I didn't mean GFP_KERNEL allocations cannot block/sleep. When in an
emergency, we want even the GFP_KERNEL allocations that are made by
critical sockets not to block/sleep. So my original critical sockets
patches change the gfp flag passed to these allocation requests to
GFP_KERNEL|GFP_CRITICAL.
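
At a send-side allocation site that amounts to something like the following
(GFP_CRITICAL comes from those earlier critical-sockets patches, not from
mainline, and the alloc_skb() call site is invented for illustration):

-	skb = alloc_skb(size, GFP_KERNEL);
+	skb = alloc_skb(size, GFP_KERNEL | GFP_CRITICAL);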

Thanks
Sridhar

2006-01-28 22:59:21

by Pavel Machek

Subject: Re: [patch 3/9] mempool - Make mempools NUMA aware

Hi!

> >If sending routines can work with a constant amount of memory, why use
> >kmalloc at all? Anyway, I thought we were talking about the receiving side
> >earlier in the thread.
> >
> >Ouch and wait a moment. You claim that GFP_KERNEL allocations can't
> >block/sleep? Of course they can, that's why they are GFP_KERNEL and
> >not GFP_ATOMIC.
> >
> I didn't mean GFP_KERNEL allocations cannot block/sleep. When in an
> emergency, we want even the GFP_KERNEL allocations that are made by
> critical sockets not to block/sleep. So my original critical sockets
> patches change the gfp flag passed to these allocation requests to
> GFP_KERNEL|GFP_CRITICAL.

Could we get a description of what you are really trying to achieve?
I don't know what a "critical socket" is, when you are "in emergency",
etc. When I am in emergency, I just dial 112...

[Having enough memory on the send side will not mean you'll be able to
send data at TCP layer.]

You seem to have some rather strange needs that are maybe best served
by s/GFP_KERNEL/GFP_ATOMIC/ in the network layer; but we can't / don't
want to do that in the vanilla kernel -- your case is too specialized for
that. (Ouch, and it does not work anyway without rewriting the network
stack...)
Pavel
--
Thanks, Sharp!

2006-01-28 23:11:21

by Pavel Machek

Subject: Let the flames begin... [was Re: [patch 3/9] mempool - Make mempools NUMA aware]

On So 28-01-06 23:59:07, Pavel Machek wrote:
> Hi!
>
> > >If sending routines can work with a constant amount of memory, why use
> > >kmalloc at all? Anyway, I thought we were talking about the receiving side
> > >earlier in the thread.
> > >
> > >Ouch and wait a moment. You claim that GFP_KERNEL allocations can't
> > >block/sleep? Of course they can, that's why they are GFP_KERNEL and
> > >not GFP_ATOMIC.
> > >
> > I didn't mean GFP_KERNEL allocations cannot block/sleep. When in an
> > emergency, we want even the GFP_KERNEL allocations that are made by
> > critical sockets not to block/sleep. So my original critical sockets
> > patches change the gfp flag passed to these allocation requests to
> > GFP_KERNEL|GFP_CRITICAL.

(I'd say Al Viro mode, but Al could take that personally)

IOW: You keep pushing a complex and known-broken solution for a problem
that does not exist.

(Now you should be angry enough, and please explain why I'm wrong in
easy to understand terms, so that even I will understand that we need
critical sockets for kernels in emergency).
Pavel
--
Thanks, Sharp!