2006-01-25 19:48:33

by Matthew Dobson

Subject: [patch 0/9] Critical Mempools

--
The following is a new patch series designed to solve the same problems as the
"Critical Page Pool" patches that were sent out in December. I've tried to
incorporate as much of the feedback that I received as possible into this new,
redesigned version.

Rather than inserting hooks directly into the page allocator, I've tried to
piggyback on the existing mempools infrastructure. What I've done is create
a new "common" mempool allocator for whole pages. I've also made some changes
to the mempool code to add more NUMA awareness. Lastly, I've made some
changes to the slab allocator to allow a single mempool to act as the critical
pool for an entire subsystem. All of these changes should be completely
transparent to existing users of mempools and the slab allocator.
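
For the curious, the whole-page allocator can sit on the stock mempool
callback interface; something along these lines (a sketch of the idea, not
necessarily the exact code in the patches):

#include <linux/gfp.h>
#include <linux/mempool.h>
#include <linux/init.h>
#include <linux/errno.h>

/* mempool callbacks that hand out whole pages instead of slab objects */
static void *mempool_alloc_page(gfp_t gfp_mask, void *pool_data)
{
        /* pool_data could carry a NUMA node id; ignored in this sketch */
        return alloc_pages(gfp_mask, 0);
}

static void mempool_free_page(void *element, void *pool_data)
{
        __free_pages(element, 0);
}

static mempool_t *critical_pool;

static int __init critical_pool_init(void)
{
        /* keep 64 pages in reserve for critical allocations */
        critical_pool = mempool_create(64, mempool_alloc_page,
                                       mempool_free_page, NULL);
        return critical_pool ? 0 : -ENOMEM;
}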

Using this new approach, a subsystem can create a mempool and then pass a
pointer to this mempool on to all its slab allocations. Anytime one of its
slab allocations needs to allocate memory, that memory will be allocated
through the specified mempool, rather than through alloc_pages_node() directly.
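
To make that concrete, the subsystem-side usage would look roughly like this
(illustrative only -- kmem_cache_set_critical_pool() below is a stand-in
name; the real interface is in the patches):

#include <linux/slab.h>
#include <linux/mempool.h>

static void *critical_alloc(kmem_cache_t *cachep, mempool_t *pool)
{
        /* hypothetical hook: point the cache at the subsystem's mempool */
        kmem_cache_set_critical_pool(cachep, pool);

        /* the cache's backing pages now come from 'pool' whenever the
         * normal page allocator path cannot satisfy the request */
        return kmem_cache_alloc(cachep, GFP_ATOMIC);
}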

Feedback on these patches (against 2.6.16-rc1) would be greatly appreciated.

Thanks!

-Matt


2006-01-26 17:57:53

by Christoph Lameter

Subject: Re: [patch 0/9] Critical Mempools

On Wed, 25 Jan 2006, Matthew Dobson wrote:

> Using this new approach, a subsystem can create a mempool and then pass a
> pointer to this mempool on to all its slab allocations. Anytime one of its
> slab allocations needs to allocate memory that memory will be allocated
> through the specified mempool, rather than through alloc_pages_node() directly.

All subsystems will now get more complicated by having to add this
emergency functionality?

> Feedback on these patches (against 2.6.16-rc1) would be greatly appreciated.

There surely must be a better way than revising all subsystems for
critical allocations.

2006-01-26 23:01:53

by Matthew Dobson

Subject: Re: [patch 0/9] Critical Mempools

Christoph Lameter wrote:
> On Wed, 25 Jan 2006, Matthew Dobson wrote:
>
>
>>Using this new approach, a subsystem can create a mempool and then pass a
>>pointer to this mempool on to all its slab allocations. Anytime one of its
>>slab allocations needs to allocate memory that memory will be allocated
>>through the specified mempool, rather than through alloc_pages_node() directly.
>
>
> All subsystems will now get more complicated by having to add this
> emergency functionality?

Certainly not. Only subsystems that want to use emergency pools will get
more complicated. If you have a suggestion as to how to implement a
similar feature that is completely transparent to its users, I would *love*
to hear it. I have tried to keep the changes needed to implement this
functionality to a minimum. As the patches currently stand, existing slab
allocator and mempool users can continue using these subsystems without
modification.


>>Feedback on these patches (against 2.6.16-rc1) would be greatly appreciated.
>
>
> There surely must be a better way than revising all subsystems for
> critical allocations.

Again, I could not find any way to implement this functionality without
forcing the users of the functionality to make some, albeit very minor,
changes. Specific suggestions are more than welcome! :)

Thanks!

-Matt

2006-01-26 23:18:34

by Christoph Lameter

Subject: Re: [patch 0/9] Critical Mempools

On Thu, 26 Jan 2006, Matthew Dobson wrote:

> > All subsystems will now get more complicated by having to add this
> > emergency functionality?
>
> Certainly not. Only subsystems that want to use emergency pools will get
> more complicated. If you have a suggestion as to how to implement a
> similar feature that is completely transparent to its users, I would *love*

I thought the earlier __GFP_CRITICAL was a good idea.

> to hear it. I have tried to keep the changes to implement this
> functionality to a minimum. As the patches currently stand, existing slab
> allocator and mempool users can continue using these subsystems without
> modification.

The patches are extensive and the required changes to subsystems in order
to use these pools are also extensive.

> > There surely must be a better way than revising all subsystems for
> > critical allocations.
> Again, I could not find any way to implement this functionality without
> forcing the users of the functionality to make some, albeit very minor,
> changes. Specific suggestions are more than welcome! :)

Gfp flag? Better memory reclaim functionality?

2006-01-26 23:32:19

by Matthew Dobson

Subject: Re: [patch 0/9] Critical Mempools

Christoph Lameter wrote:
> On Thu, 26 Jan 2006, Matthew Dobson wrote:
>
>
>>>All subsystems will now get more complicated by having to add this
>>>emergency functionality?
>>
>>Certainly not. Only subsystems that want to use emergency pools will get
>>more complicated. If you have a suggestion as to how to implement a
>>similar feature that is completely transparent to its users, I would *love*
>
>
> I thought the earlier __GFP_CRITICAL was a good idea.

Well, I certainly could have used that feedback a month ago! ;) The
general response to that patchset was overwhelmingly negative. Yours is
the first vote in favor of that approach, that I'm aware of.


>>to hear it. I have tried to keep the changes to implement this
>>functionality to a minimum. As the patches currently stand, existing slab
>>allocator and mempool users can continue using these subsystems without
>>modification.
>
>
> The patches are extensive and the required changes to subsystems in order
> to use these pools are also extensive.

I can't really argue with your first point, but the changes required to use
the pools should actually be quite small. Sridhar (cc'd on this thread) is
working on the changes required for the networking subsystem to use these
pools, and it looks like the patches will be no larger than the ones from
the last attempt.


>>>There surely must be a better way than revising all subsystems for
>>>critical allocations.
>>
>>Again, I could not find any way to implement this functionality without
>>forcing the users of the functionality to make some, albeit very minor,
>>changes. Specific suggestions are more than welcome! :)
>
>
> Gfp flag? Better memory reclaim functionality?

Well, I've got patches that implement the GFP flag approach, but as I
mentioned above, that was poorly received. Better memory reclaim is a
broad and general approach that I agree is useful, but will not necessarily
solve the same set of problems (though it would likely lessen the severity
somewhat).

-Matt

2006-01-27 00:07:20

by Benjamin LaHaise

Subject: Re: [patch 0/9] Critical Mempools

On Thu, Jan 26, 2006 at 03:32:14PM -0800, Matthew Dobson wrote:
> > I thought the earlier __GFP_CRITICAL was a good idea.
>
> Well, I certainly could have used that feedback a month ago! ;) The
> general response to that patchset was overwhelmingly negative. Yours is
> the first vote in favor of that approach, that I'm aware of.

Personally, I'm more in favour of a proper reservation system. mempools
are pretty inefficient. Reservations have useful properties, too -- one
could reserve memory for a critical process to use, but allow the system
to use that memory for easy to reclaim caches or to help with memory
defragmentation (more free pages really helps the buddy allocator).
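
A minimal sketch of the bookkeeping such a reservation implies (nothing like
this exists today, the names are made up): the pages themselves stay on the
normal free/clean lists, only the accounting is pinned.

#include <linux/spinlock.h>

struct mem_reservation {
        spinlock_t      lock;
        unsigned long   reserved;       /* pages promised to critical users */
        unsigned long   claimed;        /* pages currently handed out       */
};

static int reservation_claim(struct mem_reservation *r, unsigned long pages)
{
        int ok = 0;

        spin_lock(&r->lock);
        if (r->claimed + pages <= r->reserved) {
                r->claimed += pages;
                ok = 1;         /* caller may now take the pages, reclaiming
                                 * clean cache if the free lists are short */
        }
        spin_unlock(&r->lock);
        return ok;
}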

> > Gfp flag? Better memory reclaim functionality?
>
> Well, I've got patches that implement the GFP flag approach, but as I
> mentioned above, that was poorly received. Better memory reclaim is a
> broad and general approach that I agree is useful, but will not necessarily
> solve the same set of problems (though it would likely lessen the severity
> somewhat).

Which areas are the priorities for getting this functionality into?
Networking over particular sockets? A GFP_ flag would plug into the current
network stack trivially, as sockets already have a field to store the memory
allocation flags.
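
(The per-socket field is sk->sk_allocation. Assuming the unmerged
__GFP_CRITICAL bit from the December series, the send side really is just:)

#include <net/sock.h>

static void sock_mark_critical(struct sock *sk)
{
        /* __GFP_CRITICAL is from the earlier patch set, not mainline */
        sk->sk_allocation |= __GFP_CRITICAL;
}

Transmit-path allocations that already honour sk->sk_allocation (e.g.
alloc_skb(size, sk->sk_allocation)) would then pick the flag up with no
further changes.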

-ben
--
"Ladies and gentlemen, I'm sorry to interrupt, but the police are here
and they've asked us to stop the party." Don't Email: <[email protected]>.

2006-01-27 00:27:22

by Matthew Dobson

Subject: Re: [patch 0/9] Critical Mempools

Benjamin LaHaise wrote:
> On Thu, Jan 26, 2006 at 03:32:14PM -0800, Matthew Dobson wrote:
>
>>>I thought the earlier __GFP_CRITICAL was a good idea.
>>
>>Well, I certainly could have used that feedback a month ago! ;) The
>>general response to that patchset was overwhelmingly negative. Yours is
>>the first vote in favor of that approach, that I'm aware of.
>
>
> Personally, I'm more in favour of a proper reservation system. mempools
> are pretty inefficient. Reservations have useful properties, too -- one
> could reserve memory for a critical process to use, but allow the system
> to use that memory for easy to reclaim caches or to help with memory
> defragmentation (more free pages really helps the buddy allocator).

That's an interesting idea... Keep track of the number of pages "reserved"
but allow them to be used for something like read-only pagecache... Something
along those lines would most certainly be easier on the page allocator,
since it wouldn't have chunks of pages "missing" for long periods of time.


>>>Gfp flag? Better memory reclaim functionality?
>>
>>Well, I've got patches that implement the GFP flag approach, but as I
>>mentioned above, that was poorly received. Better memory reclaim is a
>>broad and general approach that I agree is useful, but will not necessarily
>>solve the same set of problems (though it would likely lessen the severity
>>somewhat).
>
>
> Which areas are the priorities for getting this functionality into?
> Networking over particular sockets? A GFP_ flag would plug into the current
> network stack trivially, as sockets already have a field to store the memory
> allocation flags.

The impetus for this work was getting this functionality into the
networking stack, to keep the network alive during periods of extreme VM
pressure. Keeping track of 'criticalness' on a per-socket basis is good,
but the problem is the receive side. Networking packets are received and
put into skbuffs before there is any concept of what socket they belong to.
So really handling incoming traffic under extreme memory pressure would
require something beyond just a per-socket flag.

I have to say I'm somewhat amused by how much support the old approach is
getting now that I've spent a few weeks going back to the drawing board and
coming up with what I thought was a more general solution! :\

-Matt

2006-01-27 07:35:52

by Pekka Enberg

Subject: Re: [patch 0/9] Critical Mempools

Hi,

Benjamin LaHaise wrote:
> > Personally, I'm more in favour of a proper reservation system. mempools
> > are pretty inefficient. Reservations have useful properties, too -- one
> > could reserve memory for a critical process to use, but allow the system
> > to use that memory for easy to reclaim caches or to help with memory
> > defragmentation (more free pages really helps the buddy allocator).

On 1/27/06, Matthew Dobson <[email protected]> wrote:
> That's an interesting idea... Keep track of the number of pages "reserved"
> but allow them to be used something like read-only pagecache... Something
> along those lines would most certainly be easier on the page allocator,
> since it wouldn't have chunks of pages "missing" for long periods of time.

Any thoughts on what kind of allocation patterns we have for those
critical callers? The worst case is of course that for just one 32-byte
critical allocation we steal away a complete page from the
reserves, which doesn't sound like a good idea under extreme VM
pressure. For a general solution, I don't think it's enough that you
simply flag an allocation GFP_CRITICAL and let the page allocator do
the allocation.

As a side note, we already have __GFP_NOFAIL. How is it different
from GFP_CRITICAL, and why aren't we improving that?

Pekka

2006-01-27 08:29:51

by Sridhar Samudrala

Subject: Re: [patch 0/9] Critical Mempools

Matthew Dobson wrote:
> Christoph Lameter wrote:
>
>> On Thu, 26 Jan 2006, Matthew Dobson wrote:
>>
>>
>>
>>>> All subsystems will now get more complicated by having to add this
>>>> emergency functionality?
>>>>
>>> Certainly not. Only subsystems that want to use emergency pools will get
>>> more complicated. If you have a suggestion as to how to implement a
>>> similar feature that is completely transparent to its users, I would *love*
>>>
>> I thought the earlier __GFP_CRITICAL was a good idea.
>>
>
> Well, I certainly could have used that feedback a month ago! ;) The
> general response to that patchset was overwhelmingly negative. Yours is
> the first vote in favor of that approach, that I'm aware of.
>
>
>
>>> to hear it. I have tried to keep the changes to implement this
>>> functionality to a minimum. As the patches currently stand, existing slab
>>> allocator and mempool users can continue using these subsystems without
>>> modification.
>>>
>> The patches are extensive and the required changes to subsystems in order
>> to use these pools are also extensive.
>>
>
> I can't really argue with your first point, but the changes required to use
> the pools should actually be quite small. Sridhar (cc'd on this thread) is
> working on the changes required for the networking subsystem to use these
> pools, and it looks like the patches will be no larger than the ones from
> the last attempt.
>
I would say that the patches to support critical sockets will be slightly
more complex with mempools than the earlier patches that used the global
critical page pool with a new GFP_CRITICAL flag.

Basically we need a facility to mark an allocation request as critical
and satisfy this request without any blocking in an emergency situation.

Thanks
Sridhar
>
>
>>>> There surely must be a better way than revising all subsystems for
>>>> critical allocations.
>>>>
>>> Again, I could not find any way to implement this functionality without
>>> forcing the users of the functionality to make some, albeit very minor,
>>> changes. Specific suggestions are more than welcome! :)
>>>
>> Gfp flag? Better memory reclaim functionality?
>>
>
> Well, I've got patches that implement the GFP flag approach, but as I
> mentioned above, that was poorly received. Better memory reclaim is a
> broad and general approach that I agree is useful, but will not necessarily
> solve the same set of problems (though it would likely lessen the severity
> somewhat).
>
> -Matt
>


2006-01-27 08:34:30

by Sridhar Samudrala

Subject: Re: [patch 0/9] Critical Mempools

Benjamin LaHaise wrote:
> On Thu, Jan 26, 2006 at 03:32:14PM -0800, Matthew Dobson wrote:
>
>>> I thought the earlier __GFP_CRITICAL was a good idea.
>>>
>> Well, I certainly could have used that feedback a month ago! ;) The
>> general response to that patchset was overwhelmingly negative. Yours is
>> the first vote in favor of that approach, that I'm aware of.
>>
>
> Personally, I'm more in favour of a proper reservation system. mempools
> are pretty inefficient. Reservations have useful properties, too -- one
> could reserve memory for a critical process to use, but allow the system
> to use that memory for easy to reclaim caches or to help with memory
> defragmentation (more free pages really helps the buddy allocator).
>
>
>>> Gfp flag? Better memory reclaim functionality?
>>>
>> Well, I've got patches that implement the GFP flag approach, but as I
>> mentioned above, that was poorly received. Better memory reclaim is a
>> broad and general approach that I agree is useful, but will not necessarily
>> solve the same set of problems (though it would likely lessen the severity
>> somewhat).
>>
>
> Which areas are the priorities for getting this functionality into?
> Networking over particular sockets? A GFP_ flag would plug into the current
> network stack trivially, as sockets already have a field to store the memory
> allocation flags.
>
Yes, I posted patches last month that use this exact approach: a critical
page pool with a GFP_CRITICAL flag.
http://lkml.org/lkml/2005/12/14/65
http://lkml.org/lkml/2005/12/14/66

Thanks
Sridhar

2006-01-27 10:11:17

by Paul Jackson

Subject: Re: [patch 0/9] Critical Mempools

Pekka wrote:
> As a side note, we already have __GFP_NOFAIL. How is it different
> from GFP_CRITICAL and why aren't we improving that?

Don't these two flags invoke two different mechanisms?
__GFP_NOFAIL can sleep for HZ/50 then retry, rather than return failure.
__GFP_CRITICAL can steal from the emergency pool rather than fail.

I would favor renaming at least the __GFP_CRITICAL to something
like __GFP_EMERGPOOL, to highlight the relevant distinction.
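
In page-allocator terms the distinction is roughly this (simplified
pseudo-code; try_to_allocate(), take_from_emergency_pool() and the
__GFP_EMERGPOOL bit itself are all placeholders, not mainline code):

static struct page *alloc_with_policy(gfp_t gfp_mask, unsigned int order)
{
        struct page *page;

retry:
        page = try_to_allocate(gfp_mask, order);  /* normal paths + reclaim */
        if (page)
                return page;

        if (gfp_mask & __GFP_NOFAIL) {
                /* never report failure: nap briefly, then try again */
                blk_congestion_wait(WRITE, HZ/50);
                goto retry;
        }

        if (gfp_mask & __GFP_EMERGPOOL)
                /* may still fail, but only once the reserve is empty too */
                page = take_from_emergency_pool(order);

        return page;
}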

--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <[email protected]> 1.925.600.0401

2006-01-27 11:07:56

by Pekka Enberg

Subject: Re: [patch 0/9] Critical Mempools

Hi,

Pekka wrote:
> > As a side note, we already have __GFP_NOFAIL. How is it different
> > from GFP_CRITICAL and why aren't we improving that?

On 1/27/06, Paul Jackson <[email protected]> wrote:
> Don't these two flags invoke two different mechanisms.
> __GFP_NOFAIL can sleep for HZ/50 then retry, rather than return failure.
> __GFP_CRITICAL can steal from the emergency pool rather than fail.
>
> I would favor renaming at least the __GFP_CRITICAL to something
> like __GFP_EMERGPOOL, to highlight the relevant distinction.

Yeah, you're right. __GFP_NOFAIL guarantees never to return failure, but it
doesn't guarantee the allocation will actually succeed either (it can simply
keep retrying). I think the suggested semantics for __GFP_EMERGPOOL are that
while it can fail, it tries to avoid that by dipping into page reserves.
However, I do still think it's a bad idea to allow the slab allocator to
steal whole pages for critical allocations, because under low-memory
conditions it should be fairly easy to exhaust the reserves and waste most
of that memory at the same time.

Pekka

2006-01-27 15:36:43

by Jan Kiszka

Subject: Re: [patch 0/9] Critical Mempools

2006/1/27, Matthew Dobson <[email protected]>:

> The impetus for this work was getting this functionality into the
> networking stack, to keep the network alive under periods of extreme VM
> pressure. Keeping track of 'criticalness' on a per-socket basis is good,
> but the problem is the receive side. Networking packets are received and
> put into skbuffs before there is any concept of what socket they belong to.
> So to really handle incoming traffic under extreme memory pressure would
> require something beyond just a per-socket flag.

Maybe as an interesting reference you may want to study how we handle this
in the deterministic network stack RTnet (http://www.rtnet.org): full
(rt-)skbs are exchanged for empty ones between per-user packet pools. Every
packet producer or consumer (socket, NIC, in-kernel networking service) has
its own pool of pre-allocated, fixed-size packets. Incoming packets are
first stored at the expense of the NIC. But as soon as the real receiver is
known, that receiver has to hand over an empty buffer in order to get the
full one. Otherwise, the packet is dropped. It's a rather hard policy, but
it prevents any local user from starving the system of skbs. Additionally,
for full determinism, remote users have to be controlled via bandwidth
management (to avoid exhausting the NIC's pool), in our case a TDMA
mechanism.
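
Very schematically (this is not the real RTnet code or API, just the
exchange rule written out in C):

#include <linux/list.h>
#include <linux/spinlock.h>

struct rt_pool {
        struct list_head free_bufs;     /* pre-allocated, fixed-size buffers */
        spinlock_t       lock;
};

struct rt_buf {
        struct list_head link;
        void            *data;
        size_t           len;
};

static struct rt_buf *pool_take(struct rt_pool *p)
{
        struct rt_buf *b = NULL;

        spin_lock(&p->lock);
        if (!list_empty(&p->free_bufs)) {
                b = list_entry(p->free_bufs.next, struct rt_buf, link);
                list_del(&b->link);
        }
        spin_unlock(&p->lock);
        return b;
}

static void pool_put(struct rt_pool *p, struct rt_buf *b)
{
        spin_lock(&p->lock);
        list_add(&b->link, &p->free_bufs);
        spin_unlock(&p->lock);
}

/* The receiver claims a full buffer (so far charged to the NIC's pool) by
 * swapping in one of its own empty buffers; no empty buffer, no packet. */
static struct rt_buf *receive_exchange(struct rt_pool *receiver,
                                       struct rt_pool *nic,
                                       struct rt_buf *full)
{
        struct rt_buf *empty = pool_take(receiver);

        if (!empty) {
                pool_put(nic, full);    /* drop: the NIC gets its buffer back */
                return NULL;
        }
        pool_put(nic, empty);           /* the NIC's pool keeps its size */
        return full;                    /* now accounted to the receiver */
}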

I'm not suggesting that this is something easy to adopt into a general
purpose networking stack (this is /one/ reason why we maintain a
separate project for it). But maybe the concept can inspire something
in this direction. Would be funny to have "native" RTnet in the kernel
one day :). Separate memory pools for critical allocations is an
interesting step that may help us as well.

Jan

2006-01-28 00:41:42

by Matthew Dobson

Subject: Re: [patch 0/9] Critical Mempools

Pekka Enberg wrote:
> Hi,
>
> Pekka wrote:
>
>>>As a side note, we already have __GFP_NOFAIL. How is it different
>>>from GFP_CRITICAL and why aren't we improving that?
>
>
> On 1/27/06, Paul Jackson <[email protected]> wrote:
>
>>Don't these two flags invoke two different mechanisms.
>> __GFP_NOFAIL can sleep for HZ/50 then retry, rather than return failure.
>> __GFP_CRITICAL can steal from the emergency pool rather than fail.
>>
>>I would favor renaming at least the __GFP_CRITICAL to something
>>like __GFP_EMERGPOOL, to highlight the relevant distinction.
>
>
> Yeah you're right. __GFP_NOFAIL guarantees to never fail but it
> doesn't guarantee to actually succeed either. I think the suggested
> semantics for __GFP_EMERGPOOL are that while it can fail, it tries to
> avoid that by dipping into page reserves. However, I do still think
> it's a bad idea to allow the slab allocator to steal whole pages for
> critical allocations because in low-memory condition, it should be
> fairly easy to exhaust the reserves and waste most of that memory at
> the same time.

The main pushback I got on my previous attempt at something like
__GFP_EMERGPOOL was that a single, system-wide pool was unacceptable.
Determining the appropriate size for such a pool would be next to
impossible, particularly as the number of users of __GFP_EMERGPOOL grows.
The general consensus was that per-subsystem or dynamically created pools
would be a more useful addition to the kernel. Do any of you who are now
requesting the single pool approach have any suggestions as to how to
appropriately size a pool with potentially dozens of users so as to offer
any kind of useful guarantee? The fewer users a single pool has, the easier
it obviously is to appropriately size that pool...

As far as allowing the slab allocator to steal a whole page from the
critical pool to satisfy a single slab request, I think that is ok. The
only other suggestion I've heard is to insert a SLOB layer between the
critical pool's page allocator and the slab allocator, and have this SLOB
layer chop pages up into pieces to handle slab requests that cannot be
satisfied through the normal slab/page allocator combo. This involves
adding a fair bit of code and complexity for the benefit of a few pages of
memory. Now, a few pages of memory could be incredibly crucial, since
we're discussing an emergency (presumably) low-mem situation, but if we're
going to be getting several requests for the same slab/kmalloc-size then
we're probably better off giving a whole page to the slab allocator. This
is pure speculation, of course... :)
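
For what it's worth, the core of such a SLOB-ish layer is small (the sketch
below is illustrative only, and ignores locking and freeing entirely):

#include <linux/kernel.h>       /* ALIGN() */
#include <asm/page.h>           /* PAGE_SIZE */

/* carve small critical objects out of one page taken from the critical
 * pool, instead of handing the whole page to a single slab */
struct emerg_page {
        void    *base;          /* one page's worth of memory from the pool */
        size_t   offset;        /* simple bump pointer */
};

static void *emerg_carve(struct emerg_page *ep, size_t size)
{
        void *obj;

        size = ALIGN(size, sizeof(void *));
        if (ep->offset + size > PAGE_SIZE)
                return NULL;    /* would need another page from the pool */
        obj = (char *)ep->base + ep->offset;
        ep->offset += size;
        return obj;
}

It's the locking, per-object freeing and returning pages to the pool that
would supply most of the "fair bit of code and complexity" mentioned above.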

-Matt

2006-01-28 10:21:54

by Pekka Enberg

Subject: Re: [patch 0/9] Critical Mempools

Hi,

On Fri, 2006-01-27 at 16:41 -0800, Matthew Dobson wrote:
> Now, a few pages of memory could be incredibly crucial, since
> we're discussing an emergency (presumably) low-mem situation, but if
> we're going to be getting several requests for the same
> slab/kmalloc-size then we're probably better off giving a whole page to
> the slab allocator. This is pure speculation, of course... :)

Yeah but even then there's no guarantee that the critical allocations
will be serviced first. The slab allocator can just as well give away
bits of the fresh page to non-critical allocations. For the exact same
reason, I don't think it's enough that you pass a subsystem-specific
page pool to the slab allocator.

Sorry if this has been explained before but why aren't mempools
sufficient for your purposes? Also one more alternative would be to
create a separate object cache for each subsystem-specific critical
allocation and implement an internal "page pool" for the slab allocator
so that you could specify the number of pages an object cache
guarantees to always hold on to.
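
Interface-wise, something along these lines (kmem_cache_reserve_pages() is
just a made-up name for the idea):

#include <linux/slab.h>
#include <linux/skbuff.h>
#include <linux/init.h>
#include <linux/errno.h>

static kmem_cache_t *critical_skb_cache;

static int __init critical_skb_cache_init(void)
{
        critical_skb_cache = kmem_cache_create("critical_skb_cache",
                                               sizeof(struct sk_buff), 0,
                                               SLAB_HWCACHE_ALIGN, NULL, NULL);
        if (!critical_skb_cache)
                return -ENOMEM;

        /* hypothetical: the slab allocator pins 16 pages that only this
         * cache may fall back on when the page allocator comes up empty */
        kmem_cache_reserve_pages(critical_skb_cache, 16);
        return 0;
}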

Pekka

2006-01-30 22:39:06

by Matthew Dobson

Subject: Re: [patch 0/9] Critical Mempools

Pekka Enberg wrote:
> Hi,
>
> On Fri, 2006-01-27 at 16:41 -0800, Matthew Dobson wrote:
>
>>Now, a few pages of memory could be incredibly crucial, since
>>we're discussing an emergency (presumably) low-mem situation, but if
>>we're going to be getting several requests for the same
>>slab/kmalloc-size then we're probably better off giving a whole page to
>>the slab allocator. This is pure speculation, of course... :)
>
>
> Yeah but even then there's no guarantee that the critical allocations
> will be serviced first. The slab allocator can as well be giving away
> bits of the fresh page to non-critical allocations. For the exact same
> reason, I don't think it's enough that you pass a subsystem-specific
> page pool to the slab allocator.

Well, it would give at least one object from the new slab to the critical
request, but you're right, the rest of the slab could be allocated to
non-critical users. I had planned on a small follow-on patch to add
exclusivity to mempool/critical slab pages, but going a different route
seems to be the consensus.


> Sorry if this has been explained before but why aren't mempools
> sufficient for your purposes? Also one more alternative would be to
> create a separate object cache for each subsystem-specific critical
> allocation and implement a internal "page pool" for the slab allocator
> so that you could specify for the number of pages an object cache
> guarantees to always hold on to.

Mempools aren't sufficient because in order to create a real critical pool
for the whole networking subsystem, we'd have to create dozens of mempools,
one for each of the different slabs & kmalloc sizes the networking stack
requires, plus another for whole pages. Not impossible, but U-G-L-Y. And
wasteful. Creating all those mempools is surely more wasteful than
creating one reasonably sized pool to back ALL the allocations. Or, at
least, such was my rationale... :)
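
Roughly, the "one mempool per size" version would look like this, which is
exactly the ugliness I'd like to avoid (sketch only; a real version would at
least give each cache its own name):

#include <linux/kernel.h>
#include <linux/slab.h>
#include <linux/mempool.h>
#include <linux/init.h>
#include <linux/errno.h>

static const size_t emerg_sizes[] = { 32, 64, 128, 256, 512, 1024, 2048 };
static kmem_cache_t *emerg_cache[ARRAY_SIZE(emerg_sizes)];
static mempool_t *emerg_pool[ARRAY_SIZE(emerg_sizes)];

static int __init net_emerg_pools_init(void)
{
        int i;

        for (i = 0; i < ARRAY_SIZE(emerg_sizes); i++) {
                emerg_cache[i] = kmem_cache_create("net_emerg",
                                                   emerg_sizes[i], 0, 0,
                                                   NULL, NULL);
                if (!emerg_cache[i])
                        return -ENOMEM;
                /* mempool_alloc_slab/mempool_free_slab are the stock
                 * helpers for slab-backed mempools */
                emerg_pool[i] = mempool_create(16, mempool_alloc_slab,
                                               mempool_free_slab,
                                               emerg_cache[i]);
                if (!emerg_pool[i])
                        return -ENOMEM;
        }
        return 0;
}

...and that still leaves one more pool for whole pages.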

-Matt