2005-11-18 19:33:07

by Matthew Dobson

Subject: [RFC][PATCH 0/8] Critical Page Pool

We have a clustering product that needs to be able to guarantee that the
networking system won't stop functioning in an OOM/low-memory condition.
The current mempool system is inadequate because, to keep the whole
networking stack functioning, we need more than 1 or 2 slab caches to be
guaranteed. We need to guarantee that any request made with a specific
flag will succeed, assuming, of course, that you've made your "critical
page pool" big enough.

The following patch series implements such a critical page pool. It
creates 2 userspace triggers:

/proc/sys/vm/critical_pages: write the number of pages you want to reserve
for the critical pool into this file

/proc/sys/vm/in_emergency: write a non-zero value to tell the kernel that
the system is in an emergency state and authorize the kernel to dip into
the critical pool to satisfy critical allocations.

We mark critical allocations with the __GFP_CRITICAL flag, and when the
system is in an emergency state, we are allowed to delve into this pool to
satisfy __GFP_CRITICAL allocations that cannot be satisfied through the
normal means.
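
To make the interface concrete, here is a minimal sketch of the
userspace side (the path is as above; the 1024-page figure and the
program around it are only illustrative):

#include <stdio.h>

int main(void)
{
	/* reserve 1024 pages for the critical pool; pick the real
	 * number via the sizing discussion later in this thread */
	FILE *f = fopen("/proc/sys/vm/critical_pages", "w");

	if (!f)
		return 1;
	fprintf(f, "%d\n", 1024);
	return fclose(f) ? 1 : 0;
}

A kernel call site then just ORs the new flag into its gfp mask, e.g.
alloc_skb(size, GFP_ATOMIC | __GFP_CRITICAL).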

Feedback on our approach would be appreciated.

Thanks!

-Matt


2005-11-18 19:36:10

by Matthew Dobson

Subject: [RFC][PATCH 1/8] Create Critical Page Pool

Create the critical page pool and its associated /proc/sys/vm/ file.

-Matt


Attachments:
critical_pool.patch (9.98 kB)

2005-11-18 19:37:05

by Matthew Dobson

Subject: [RFC][PATCH 2/8] Create emergency trigger

Create the in_emergency trigger.

-Matt


Attachments:
emergency_trigger.patch (4.78 kB)

2005-11-18 19:41:21

by Matthew Dobson

Subject: [RFC][PATCH 4/8] Fix a bug in scsi_get_command

Testing this patch series uncovered a small bug in scsi_get_command. This
patch fixes that bug.

-Matt


Attachments:
scsi_get_command-fix.patch (958.00 B)

2005-11-18 19:43:21

by Matthew Dobson

Subject: [RFC][PATCH 5/8] get_object/return_object

Move the code that gets an object from, and returns an object to, its
slab into its own functions.

-Matt


Attachments:
slab_prep-get_return_object.patch (3.76 kB)

2005-11-18 19:44:39

by Avi Kivity

Subject: Re: [RFC][PATCH 0/8] Critical Page Pool

Matthew Dobson wrote:

>We have a clustering product that needs to be able to guarantee that the
>networking system won't stop functioning in an OOM/low-memory condition.
>The current mempool system is inadequate because, to keep the whole
>networking stack functioning, we need more than 1 or 2 slab caches to be
>guaranteed. We need to guarantee that any request made with a specific
>flag will succeed, assuming, of course, that you've made your "critical
>page pool" big enough.
>
>The following patch series implements such a critical page pool. It
>creates 2 userspace triggers:
>
>/proc/sys/vm/critical_pages: write the number of pages you want to reserve
>for the critical pool into this file
>
>/proc/sys/vm/in_emergency: write a non-zero value to tell the kernel that
>the system is in an emergency state and authorize the kernel to dip into
>the critical pool to satisfy critical allocations.
>
>We mark critical allocations with the __GFP_CRITICAL flag, and when the
>system is in an emergency state, we are allowed to delve into this pool to
>satisfy __GFP_CRITICAL allocations that cannot be satisfied through the
>normal means.
>
>
>
1. If you have two subsystems which allocate critical pages, how do you
protect against the condition where one subsystem allocates all the
critical memory, causing the second to oom?

2. There already exists a critical pool: ordinary allocations fail if
free memory is below some limit, but special processes (kswapd) can
allocate that memory by setting PF_MEMALLOC. Perhaps this should be
extended, possibly with a per-process threshold.
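
For reference, the existing mechanism is just a per-task flag that the
page allocator honors; a memory-freeing thread brackets its allocations
roughly like this illustrative fragment:

	struct page *page;

	current->flags |= PF_MEMALLOC;	/* may now dip below the watermarks */
	page = alloc_page(GFP_KERNEL);	/* allowed to eat into the reserve */
	current->flags &= ~PF_MEMALLOC;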

--
Do not meddle in the internals of kernels, for they are subtle and quick to panic.

2005-11-18 19:44:36

by Matthew Dobson

Subject: [RFC][PATCH 6/8] slab_destruct

Break the current slab_destroy() into two functions: slab_destroy() and
slab_destruct(). slab_destruct() calls the destructor code and any
necessary debug code.

-Matt


Attachments:
slab_prep-slab_destruct.patch (2.03 kB)

2005-11-18 19:45:56

by Matthew Dobson

Subject: [RFC][PATCH 7/8] __cache_grow()

Create a helper for cache_grow() that handles the cache coloring and
allocates & initializes the struct slab.

-Matt


Attachments:
slab_prep-__cache_grow.patch (5.86 kB)

2005-11-18 19:47:16

by Matthew Dobson

Subject: [RFC][PATCH 8/8] Add critical pool support to the slab allocator

Finally, teach the slab allocator how to deal with critical pages and how
to keep them for use exclusively by __GFP_CRITICAL allocations.

-Matt


Attachments:
slab_support.patch (6.48 kB)

2005-11-18 19:51:09

by Matthew Dobson

Subject: Re: [RFC][PATCH 0/8] Critical Page Pool

Avi Kivity wrote:
> Matthew Dobson wrote:
>
>> We have a clustering product that needs to be able to guarantee that the
>> networking system won't stop functioning in an OOM/low-memory condition.
>> The current mempool system is inadequate because, to keep the whole
>> networking stack functioning, we need more than 1 or 2 slab caches to be
>> guaranteed. We need to guarantee that any request made with a specific
>> flag will succeed, assuming, of course, that you've made your "critical
>> page pool" big enough.
>>
>> The following patch series implements such a critical page pool. It
>> creates 2 userspace triggers:
>>
>> /proc/sys/vm/critical_pages: write the number of pages you want to
>> reserve for the critical pool into this file
>>
>> /proc/sys/vm/in_emergency: write a non-zero value to tell the kernel that
>> the system is in an emergency state and authorize the kernel to dip into
>> the critical pool to satisfy critical allocations.
>>
>> We mark critical allocations with the __GFP_CRITICAL flag, and when the
>> system is in an emergency state, we are allowed to delve into this pool
>> to satisfy __GFP_CRITICAL allocations that cannot be satisfied through
>> the normal means.
>>
>>
>>
> 1. If you have two subsystems which allocate critical pages, how do you
> protect against the condition where one subsystem allocates all the
> critical memory, causing the second to oom?

You don't. You make sure that you size the critical pool appropriately for
your workload.


> 2. There already exists a critical pool: ordinary allocations fail if
> free memory is below some limit, but special processes (kswapd) can
> allocate that memory by setting PF_MEMALLOC. Perhaps this should be
> extended, possibly with a per-process threshold.

The exception for threads with PF_MEMALLOC set is there because those
threads are essentially promising that if the kernel gives them memory,
they will use that memory to free up MORE memory. If we ignore that
promise, and (ab)use the PF_MEMALLOC flag to simply bypass the
zone_watermarks, we'll simply OOM faster, and potentially in situations
that could be avoided (ie: we steal memory that kswapd could have used to
free up more memory).

Thanks for your feedback!

-Matt

2005-11-18 19:57:05

by Chris Wright

Subject: Re: [RFC][PATCH 0/8] Critical Page Pool

* Matthew Dobson ([email protected]) wrote:
> /proc/sys/vm/critical_pages: write the number of pages you want to reserve
> for the critical pool into this file

How do you size this pool? Allocations are interrupt-driven, so how do
you ensure you're allocating for the cluster network traffic you care
about?

> /proc/sys/vm/in_emergency: write a non-zero value to tell the kernel that
> the system is in an emergency state and authorize the kernel to dip into
> the critical pool to satisfy critical allocations.

Seems odd to me. Why make this another knob? How do you manage to set
this flag if you're in an emergency and kswapd is going nuts?

thanks,
-chris

2005-11-18 20:42:48

by Avi Kivity

Subject: Re: [RFC][PATCH 0/8] Critical Page Pool

Matthew Dobson wrote:

>Avi Kivity wrote:
>
>
>>1. If you have two subsystems which allocate critical pages, how do you
>>protect against the condition where one subsystem allocates all the
>>critical memory, causing the second to oom?
>>
>>
>
>You don't. You make sure that you size the critical pool appropriately for
>your workload.
>
>
>
This may not be possible. What if subsystem A depends on subsystem B to
do its work, both are critical, and subsystem A allocated all the memory
reserve?
If A and B have different allocation thresholds, the deadlock is avoided.

At the very least you need a critical pool per subsystem.

>
>
>>2. There already exists a critical pool: ordinary allocations fail if
>>free memory is below some limit, but special processes (kswapd) can
>>allocate that memory by setting PF_MEMALLOC. Perhaps this should be
>>extended, possibly with a per-process threshold.
>>
>>
>
>The exception for threads with PF_MEMALLOC set is there because those
>threads are essentially promising that if the kernel gives them memory,
>they will use that memory to free up MORE memory. If we ignore that
>promise, and (ab)use the PF_MEMALLOC flag to simply bypass the
>zone_watermarks, we'll simply OOM faster, and potentially in situations
>that could be avoided (ie: we steal memory that kswapd could have used to
>free up more memory).
>
>
Sure, but that's just an example of a critical subsystem.

If we introduce yet another mechanism for critical memory allocation,
we'll have a hard time making different subsystems, which use different
critical allocation mechanisms, play well together.

I propose that instead of a single watermark, there should be a
watermark per critical subsystem. The watermarks would be arranged
according to the dependency graph, with the depended-on services allowed
to go the deepest into the reserves.

(instead of PF_MEMALLOC have a tsk->memory_allocation_threshold, or
similar. set it to 0 for kswapd, and for other systems according to taste)
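
To illustrate, here is a toy userspace model of that arrangement (all
names are invented; a kernel version would hang the threshold off the
task as sketched above):

#include <stdio.h>

struct subsys {
	const char *name;
	long threshold;	/* pages that must remain free after we allocate */
};

static long free_pages = 50;

static int try_alloc(const struct subsys *s, long pages)
{
	if (free_pages - pages < s->threshold)
		return 0;	/* would dip below this subsystem's mark */
	free_pages -= pages;
	return 1;
}

int main(void)
{
	struct subsys net = { "net", 20 };	/* depends on block */
	struct subsys blk = { "block", 10 };	/* depended on: digs deeper */
	struct subsys kswapd = { "kswapd", 0 };	/* may drain the reserve */

	printf("net wants 40:    %s\n", try_alloc(&net, 40) ? "ok" : "denied");
	printf("block wants 35:  %s\n", try_alloc(&blk, 35) ? "ok" : "denied");
	printf("kswapd wants 15: %s\n", try_alloc(&kswapd, 15) ? "ok" : "denied");
	return 0;
}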

--
Do not meddle in the internals of kernels, for they are subtle and quick to panic.

2005-11-19 00:09:05

by Paul Jackson

Subject: Re: [RFC][PATCH 1/8] Create Critical Page Pool

Total nit:

#define __GFP_HARDWALL ((__force gfp_t)0x40000u) /* Enforce hardwall cpuset memory allocs */
+#define __GFP_CRITICAL ((__force gfp_t)0x80000u) /* Critical allocation. MUST succeed! */

Looks like you used a space instead of a tab.

--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <[email protected]> 1.925.600.0401

2005-11-19 00:10:42

by Paul Jackson

Subject: Re: [RFC][PATCH 0/8] Critical Page Pool

Avi wrote:
> This may not be possible. What if subsystem A depends on subsystem B to
> do its work, both are critical, and subsystem A allocated all the memory
> reserve?

Apparently Matthew's subsystems have some knowable upper limits on
their critical memory needs, so that your scenario can be avoided.

--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <[email protected]> 1.925.600.0401

2005-11-19 00:21:39

by Paul Jackson

Subject: Re: [RFC][PATCH 2/8] Create emergency trigger

> @@ -876,6 +879,16 @@ __alloc_pages(gfp_t gfp_mask, unsigned i
> int can_try_harder;
> int did_some_progress;
>
> + if (is_emergency_alloc(gfp_mask)) {

Can this check for is_emergency_alloc be moved lower in __alloc_pages?

I don't see any reason why most __alloc_pages() calls, which succeed
easily in the first loop over the zonelist, have to make this check.
This would save one conditional test and jump on the most heavily
used code path in __alloc_pages().
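
I.e., something with this shape (illustrative; only is_emergency_alloc()
comes from the patch, the helpers are invented):

#include <stddef.h>

void *fast_path_alloc(unsigned int gfp_mask);
void *critical_pool_alloc(void);
int is_emergency_alloc(unsigned int gfp_mask);

void *alloc_pages_sketch(unsigned int gfp_mask)
{
	void *page = fast_path_alloc(gfp_mask);

	if (page)
		return page;	/* the overwhelmingly common case */

	/* only the slow path pays for the emergency test */
	if (is_emergency_alloc(gfp_mask))
		return critical_pool_alloc();
	return NULL;
}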

--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <[email protected]> 1.925.600.0401

2005-11-20 07:46:18

by Keith Owens

Subject: Re: [RFC][PATCH 0/8] Critical Page Pool

On Fri, 18 Nov 2005 11:32:57 -0800,
Matthew Dobson <[email protected]> wrote:
>We have a clustering product that needs to be able to guarantee that the
>networking system won't stop functioning in an OOM/low-memory condition.
>The current mempool system is inadequate because, to keep the whole
>networking stack functioning, we need more than 1 or 2 slab caches to be
>guaranteed. We need to guarantee that any request made with a specific
>flag will succeed, assuming, of course, that you've made your "critical
>page pool" big enough.
>
>The following patch series implements such a critical page pool. It
>creates 2 userspace triggers:
>
>/proc/sys/vm/critical_pages: write the number of pages you want to reserve
>for the critical pool into this file
>
>/proc/sys/vm/in_emergency: write a non-zero value to tell the kernel that
>the system is in an emergency state and authorize the kernel to dip into
>the critical pool to satisfy critical allocations.

FWIW, the Kernel Debugger (KDB) has similar problems, where the system
is dying because of lack of memory but KDB must call some functions
that use kmalloc. A related problem is that sometimes KDB is invoked
from a non-maskable interrupt, so I could not even trust the state of
the spinlocks and the chains in the slab code.

I worked around the problem by adding a last-ditch allocator. An
extract from the kdb patch follows.

---

/* Last-ditch allocator for debugging, so we can still debug even when the
 * GFP_ATOMIC pool has been exhausted. The algorithms are tuned for space
 * usage, not for speed. One smallish memory pool, the free chain is always
 * in ascending address order to allow coalescing, allocations are done in
 * brute force best fit.
 */

struct debug_alloc_header {
	u32 next;	/* offset of next header from start of pool */
	u32 size;
};
#define dah_align 8

static u64 debug_alloc_pool_aligned[64*1024/dah_align];	/* 64K pool */
static char *debug_alloc_pool = (char *)debug_alloc_pool_aligned;
static u32 dah_first;

/* Locking is awkward. The debug code is called from all contexts, including
 * non-maskable interrupts. A normal spinlock is not safe in NMI context. Try
 * to get the debug allocator lock; if it cannot be obtained after a second
 * then give up. If the lock could not be previously obtained on this cpu
 * then only try once.
 */
static DEFINE_SPINLOCK(dap_lock);

static inline int get_dap_lock(void)
{
	static int dap_locked = -1;
	int count;

	if (dap_locked == smp_processor_id())
		count = 1;
	else
		count = 1000;
	while (1) {
		if (spin_trylock(&dap_lock)) {
			dap_locked = -1;
			return 1;
		}
		if (!count--)
			break;
		udelay(1000);
	}
	dap_locked = smp_processor_id();
	return 0;
}

void *debug_kmalloc(size_t size, int flags)
{
	unsigned int rem, h_offset;
	struct debug_alloc_header *best, *bestprev, *prev, *h;
	void *p = NULL;

	if ((p = kmalloc(size, flags)))
		return p;
	if (!get_dap_lock())
		return NULL;
	h = (struct debug_alloc_header *)(debug_alloc_pool + dah_first);
	prev = best = bestprev = NULL;
	while (1) {
		if (h->size >= size && (!best || h->size < best->size)) {
			best = h;
			bestprev = prev;
		}
		if (!h->next)
			break;
		prev = h;
		h = (struct debug_alloc_header *)(debug_alloc_pool + h->next);
	}
	if (!best)
		goto out;
	rem = (best->size - size) & -dah_align;
	/* The pool must always contain at least one header */
	if (best->next == 0 && bestprev == NULL && rem < sizeof(*h))
		goto out;
	if (rem >= sizeof(*h)) {
		best->size = (size + dah_align - 1) & -dah_align;
		h_offset = (char *)best - debug_alloc_pool +
			sizeof(*best) + best->size;
		h = (struct debug_alloc_header *)(debug_alloc_pool + h_offset);
		h->size = rem - sizeof(*h);
		h->next = best->next;
	} else
		h_offset = best->next;
	if (bestprev)
		bestprev->next = h_offset;
	else
		dah_first = h_offset;
	p = best + 1;
out:
	spin_unlock(&dap_lock);
	return p;
}

void debug_kfree(const void *p)
{
	struct debug_alloc_header *h;
	unsigned int h_offset;

	if (!p)
		return;
	if ((char *)p < debug_alloc_pool ||
	    (char *)p >= debug_alloc_pool + sizeof(debug_alloc_pool_aligned)) {
		kfree(p);
		return;
	}
	if (!get_dap_lock())
		return;		/* memory leak, cannot be helped */
	h = (struct debug_alloc_header *)p - 1;
	h_offset = (char *)h - debug_alloc_pool;
	if (h_offset < dah_first) {
		h->next = dah_first;
		dah_first = h_offset;
	} else {
		struct debug_alloc_header *prev;
		prev = (struct debug_alloc_header *)(debug_alloc_pool +
						     dah_first);
		while (1) {
			if (!prev->next || prev->next > h_offset)
				break;
			prev = (struct debug_alloc_header *)(debug_alloc_pool +
							     prev->next);
		}
		if (sizeof(*prev) + prev->size == h_offset) {
			prev->size += sizeof(*h) + h->size;
			h = prev;
			h_offset = (char *)h - debug_alloc_pool;
		} else {
			h->next = prev->next;
			prev->next = h_offset;
		}
	}
	if (h_offset + sizeof(*h) + h->size == h->next) {
		struct debug_alloc_header *next;
		next = (struct debug_alloc_header *)(debug_alloc_pool +
						     h->next);
		h->size += sizeof(*next) + next->size;
		h->next = next->next;
	}
	spin_unlock(&dap_lock);
}

void kdb_initsupport(void)
{
	struct debug_alloc_header *h;

	h = (struct debug_alloc_header *)debug_alloc_pool;
	h->next = 0;
	h->size = sizeof(debug_alloc_pool_aligned) - sizeof(*h);
	dah_first = 0;
}
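
Usage is a drop-in replacement for kmalloc()/kfree(), e.g. (illustrative
fragment):

	char *p = debug_kmalloc(128, GFP_ATOMIC);
	if (p) {
		/* use the buffer */
		debug_kfree(p);
	}

debug_kmalloc() tries the real kmalloc() first, so the pool is only
touched once the normal allocator is already failing.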

2005-11-21 00:39:38

by Pavel Machek

Subject: Re: [RFC][PATCH 0/8] Critical Page Pool

On Fri 18-11-05 11:32:57, Matthew Dobson wrote:
> We have a clustering product that needs to be able to guarantee that the
> networking system won't stop functioning in an OOM/low-memory condition.
> The current mempool system is inadequate because, to keep the whole
> networking stack functioning, we need more than 1 or 2 slab caches to be
> guaranteed. We need to guarantee that any request made with a specific
> flag will succeed, assuming, of course, that you've made your "critical
> page pool" big enough.
>
> The following patch series implements such a critical page pool. It
> creates 2 userspace triggers:
>
> /proc/sys/vm/critical_pages: write the number of pages you want to reserve
> for the critical pool into this file
>
> /proc/sys/vm/in_emergency: write a non-zero value to tell the kernel that
> the system is in an emergency state and authorize the kernel to dip into
> the critical pool to satisfy critical allocations.
>
> We mark critical allocations with the __GFP_CRITICAL flag, and when the
> system is in an emergency state, we are allowed to delve into this pool to
> satisfy __GFP_CRITICAL allocations that cannot be satisfied through the
> normal means.

Ugh, relying on userspace to tell you that you need to dip into the
emergency pool seems racy and unreliable. How can you guarantee that
userspace is scheduled soon enough in case of OOM?
Pavel
--
64 bytes from 195.113.31.123: icmp_seq=28 ttl=51 time=448769.1 ms

2005-11-21 05:37:10

by Matthew Dobson

Subject: Re: [RFC][PATCH 0/8] Critical Page Pool

Apologies for the delay in responding to comments, but I have been en
route to the East Coast of the US to visit family.

Avi Kivity wrote:
> Matthew Dobson wrote:
>
>> Avi Kivity wrote:
>>
>>
>>> 1. If you have two subsystems which allocate critical pages, how do you
>>> protect against the condition where one subsystem allocates all the
>>> critical memory, causing the second to oom?
>>>
>>
>>
>> You don't. You make sure that you size the critical pool
>> appropriately for your workload.
>>
>>
>>
> This may not be possible. What if subsystem A depends on subsystem B to
> do its work, both are critical, and subsystem A allocated all the memory
> reserve?
> If A and B have different allocation thresholds, the deadlock is avoided.
>
> At the very least you need a critical pool per subsystem.

As Paul suggested in his follow-up to your mail, to even attempt a
"guarantee" that you won't still run out of memory, your subsystem does
need an upper bound on how much memory it could possibly need. If there
is NO upper limit, then the possibility of exhausting your critical pool
is very real.


>>> 2. There already exists a critical pool: ordinary allocations fail if
>>> free memory is below some limit, but special processes (kswapd) can
>>> allocate that memory by setting PF_MEMALLOC. Perhaps this should be
>>> extended, possibly with a per-process threshold.
>>>
>>
>>
>> The exception for threads with PF_MEMALLOC set is there because those
>> threads are essentially promising that if the kernel gives them memory,
>> they will use that memory to free up MORE memory. If we ignore that
>> promise, and (ab)use the PF_MEMALLOC flag to simply bypass the
>> zone_watermarks, we'll simply OOM faster, and potentially in situations
>> that could be avoided (ie: we steal memory that kswapd could have used to
>> free up more memory).
>>
>>
> Sure, but that's just an example of a critical subsystem.
>
> If we introduce yet another mechanism for critical memory allocation,
> we'll have a hard time making different subsystems, which use different
> critical allocation mechanisms, play well together.
>
> I propose that instead of a single watermark, there should be a
> watermark per critical subsystem. The watermarks would be arranged
> according to the dependency graph, with the depended-on services allowed
> to go the deepest into the reserves.
>
> (instead of PF_MEMALLOC have a tsk->memory_allocation_threshold, or
> similar. set it to 0 for kswapd, and for other systems according to taste)

Your idea is certainly an interesting approach to solving the problem. I'm
not sure it quite does what I'm looking for, but I'll have to think about
your idea some more to be sure. One problem is that networking doesn't
have a specific "task" associated with it, where we could set a
memory_allocation_threshold.

Thanks!

-Matt

2005-11-21 05:47:24

by Matthew Dobson

Subject: Re: [RFC][PATCH 0/8] Critical Page Pool

Chris Wright wrote:
> * Matthew Dobson ([email protected]) wrote:
>
>>/proc/sys/vm/critical_pages: write the number of pages you want to reserve
>>for the critical pool into this file
>
>
> How do you size this pool?

Trial and error. If you want networking to survive with no memory other
than the critical pool for 2 minutes, for example, you pick a random value,
block all other allocations (I have a test patch to do this), and send a
boatload of packets at the box. If it OOMs, you need a bigger pool.
Lather, rinse, repeat.


> Allocations are interrupt-driven, so how do you
> ensure you're allocating for the cluster network traffic you care about?

On the receive side, you can't. :( You *have* to allocate an skbuff for
the packet, and only a couple of levels up the networking 7-layer burrito
can you tell if you can toss the packet as non-critical or keep it. On
the send side, you can create a simple socket flag that tags all that
socket's SEND requests as critical.
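
For illustration, the send side could look something like this
(SO_CRITICAL is an invented option number; nothing in this series
defines it yet):

#include <sys/socket.h>

#define SO_CRITICAL 64	/* hypothetical socket option */

static int mark_socket_critical(int fd)
{
	int one = 1;

	/* tag every subsequent send on this socket as critical */
	return setsockopt(fd, SOL_SOCKET, SO_CRITICAL, &one, sizeof(one));
}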


>>/proc/sys/vm/in_emergency: write a non-zero value to tell the kernel that
>>the system is in an emergency state and authorize the kernel to dip into
>>the critical pool to satisfy critical allocations.
>
>
> Seems odd to me. Why make this another knob? How do you manage to set
> this flag if you're in an emergency and kswapd is going nuts?

We did this because we didn't want __GFP_CRITICAL allocations dipping into
the pool in the case of a transient low-memory situation. In those cases
we want to force the task to do writeback to get a page (as usual), so
that the critical pool will be full when the system REALLY goes critical.
We also open the in_emergency file when the app starts, so that we can
just write to it and don't need to try to open it when kswapd is going
nuts.
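
Concretely, the app does something like this (a sketch of the pattern
just described):

#include <fcntl.h>
#include <unistd.h>

static int emergency_fd = -1;

void emergency_init(void)	/* called at app startup */
{
	emergency_fd = open("/proc/sys/vm/in_emergency", O_WRONLY);
}

void declare_emergency(void)	/* no open() needed under memory pressure */
{
	if (emergency_fd >= 0)
		write(emergency_fd, "1", 1);
}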

-Matt

2005-11-21 05:51:29

by Matthew Dobson

Subject: Re: [RFC][PATCH 2/8] Create emergency trigger

Paul Jackson wrote:
>>@@ -876,6 +879,16 @@ __alloc_pages(gfp_t gfp_mask, unsigned i
>> int can_try_harder;
>> int did_some_progress;
>>
>>+ if (is_emergency_alloc(gfp_mask)) {
>
>
> Can this check for is_emergency_alloc be moved lower in __alloc_pages?
>
> I don't see any reason why most __alloc_pages() calls, which succeed
> easily in the first loop over the zonelist, have to make this check.
> This would save one conditional test and jump on the most heavily
> used code path in __alloc_pages().

Good point, Paul. Will make sure that gets moved.

Thanks!

-Matt

2005-11-21 05:50:51

by Matthew Dobson

Subject: Re: [RFC][PATCH 1/8] Create Critical Page Pool

Paul Jackson wrote:
> Total nit:
>
> #define __GFP_HARDWALL ((__force gfp_t)0x40000u) /* Enforce hardwall cpuset memory allocs */
> +#define __GFP_CRITICAL ((__force gfp_t)0x80000u) /* Critical allocation. MUST succeed! */
>
> Looks like you used a space instead of a tab.

It's a tab on my side... Maybe some whitespace munging by Thunderbird?
Will make sure it's definitely a tab on the next iteration.

-Matt

2005-11-21 05:53:56

by Matthew Dobson

Subject: Re: [RFC][PATCH 0/8] Critical Page Pool

Keith Owens wrote:
> On Fri, 18 Nov 2005 11:32:57 -0800,
> Matthew Dobson <[email protected]> wrote:
>
>>We have a clustering product that needs to be able to guarantee that the
>>networking system won't stop functioning in an OOM/low-memory condition.
>>The current mempool system is inadequate because, to keep the whole
>>networking stack functioning, we need more than 1 or 2 slab caches to be
>>guaranteed. We need to guarantee that any request made with a specific
>>flag will succeed, assuming, of course, that you've made your "critical
>>page pool" big enough.
>>
>>The following patch series implements such a critical page pool. It
>>creates 2 userspace triggers:
>>
>>/proc/sys/vm/critical_pages: write the number of pages you want to reserve
>>for the critical pool into this file
>>
>>/proc/sys/vm/in_emergency: write a non-zero value to tell the kernel that
>>the system is in an emergency state and authorize the kernel to dip into
>>the critical pool to satisfy critical allocations.
>
>
> FWIW, the Kernel Debugger (KDB) has similar problems where the system
> is dying because of lack of memory but KDB must call some functions
> that use kmalloc. A related problem is that sometimes KDB is invoked
> from a non maskable interrupt, so I could not even trust the state of
> the spinlocks and the chains in the slab code.
>
> I worked around the problem by adding a last ditch allocator. Extract
> from the kdb patch.

Ahh... very interesting. And, disappointingly, much smaller than mine. :(

Thanks for the patch and the feedback!

-Matt

2005-11-21 05:54:49

by Paul Jackson

Subject: Re: [RFC][PATCH 1/8] Create Critical Page Pool

Matthew wrote:
> It's a tab on my side.

Oh - ok.

--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <[email protected]> 1.925.600.0401

2005-11-21 05:58:31

by Matthew Dobson

Subject: Re: [RFC][PATCH 0/8] Critical Page Pool

Pavel Machek wrote:
> On Fri 18-11-05 11:32:57, Matthew Dobson wrote:
>
>>We have a clustering product that needs to be able to guarantee that the
>>networking system won't stop functioning in an OOM/low-memory condition.
>>The current mempool system is inadequate because, to keep the whole
>>networking stack functioning, we need more than 1 or 2 slab caches to be
>>guaranteed. We need to guarantee that any request made with a specific
>>flag will succeed, assuming, of course, that you've made your "critical
>>page pool" big enough.
>>
>>The following patch series implements such a critical page pool. It
>>creates 2 userspace triggers:
>>
>>/proc/sys/vm/critical_pages: write the number of pages you want to reserve
>>for the critical pool into this file
>>
>>/proc/sys/vm/in_emergency: write a non-zero value to tell the kernel that
>>the system is in an emergency state and authorize the kernel to dip into
>>the critical pool to satisfy critical allocations.
>>
>>We mark critical allocations with the __GFP_CRITICAL flag, and when the
>>system is in an emergency state, we are allowed to delve into this pool to
>>satisfy __GFP_CRITICAL allocations that cannot be satisfied through the
>>normal means.
>
>
> Ugh, relying on userspace to tell you that you need to dip into the
> emergency pool seems racy and unreliable. How can you guarantee that
> userspace is scheduled soon enough in case of OOM?
> Pavel

It's not really for userspace to tell us that we're about to OOM, as the
kernel is in a far better position to determine that. It is to let the
kernel know that *something* has gone wrong, and we've got to keep
networking (or any other user of __GFP_CRITICAL) up for a few minutes, *no
matter what*. We may not ever OOM, or even run terribly low on memory, but
the trigger allows the use of the pool IF that happens.

-Matt

2005-11-21 13:29:28

by Pavel Machek

Subject: Re: [RFC][PATCH 0/8] Critical Page Pool

Hi!

> > * Matthew Dobson ([email protected]) wrote:
> >
> >>/proc/sys/vm/critical_pages: write the number of pages you want to reserve
> >>for the critical pool into this file
> >
> >
> > How do you size this pool?
>
> Trial and error. If you want networking to survive with no memory other
> than the critical pool for 2 minutes, for example, you pick a random value,
> block all other allocations (I have a test patch to do this), and send a
> boatload of packets at the box. If it OOMs, you need a bigger pool.
> Lather, rinse, repeat.

...and then you find out that your test was not "bad enough" or that
it needs more memory on different machines. It may be a good enough hack
for your usage, but I do not think it belongs in mainline.
Pavel
--
Thanks, Sharp!

2005-12-06 22:54:53

by Matthew Dobson

Subject: Re: [RFC][PATCH 0/8] Critical Page Pool

Pavel Machek wrote:
> Hi!
>
>
>>>* Matthew Dobson ([email protected]) wrote:
>>>
>>>
>>>>/proc/sys/vm/critical_pages: write the number of pages you want to reserve
>>>>for the critical pool into this file
>>>
>>>
>>>How do you size this pool?
>>
>>Trial and error. If you want networking to survive with no memory other
>>than the critical pool for 2 minutes, for example, you pick a random value,
>>block all other allocations (I have a test patch to do this), and send a
>>boatload of packets at the box. If it OOMs, you need a bigger pool.
>>Lather, rinse, repeat.
>
>
> ...and then you find out that your test was not "bad enough" or that
> it needs more memory on different machines. It may be a good enough hack
> for your usage, but I do not think it belongs in mainline.
> Pavel

Way late in responding to this, but...

Appropriate sizing of this pool is a known issue. For example, we want to
use it to keep the networking stack alive during extreme memory pressure
situations. The only way to size the pool so as to *guarantee* that it
will not be exhausted during the 2-minute window we need would be to
ensure that the pool has at least (TOTAL_BANDWIDTH_OF_ALL_NICS * 120
seconds) bytes available. In the case of a simple system with a single
GigE adapter we'd need (1 gigabit/sec * 120 sec) = 120 gigabits = 15
gigabytes of reserve pool. That is obviously completely impractical,
considering many boxes have multiple GigE adapters or even 10 GigE
adapters. It is also incredibly unlikely that the NIC will be hit with a
continuous stream of packets at a level that would completely saturate
the link. Starting with an educated guess and some test runs with a
reasonable workload should give you a good idea of how much space you'd
*realistically* need to reserve. Given any reserve size less than the
theoretical maximum you obviously can't *guarantee* the pool won't be
exhausted, but you can be pretty confident.
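
To spell the arithmetic out (the 4x 10 GigE case is an added example):

#include <stdio.h>

/* gigabytes of reserve needed to absorb full line rate for a window */
static double worst_case_gb(double total_gbit_per_sec, double window_sec)
{
	return total_gbit_per_sec * window_sec / 8.0;
}

int main(void)
{
	printf("1 GigE, 120s:     %6.1f GB\n", worst_case_gb(1.0, 120.0));
	printf("4x 10 GigE, 120s: %6.1f GB\n", worst_case_gb(40.0, 120.0));
	return 0;
}

which prints 15.0 GB and 600.0 GB, respectively.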

-Matt

2005-12-10 08:39:26

by Pavel Machek

Subject: Re: [RFC][PATCH 0/8] Critical Page Pool

Hi!

> > ...and then you find out that your test was not "bad enough" or that
> > it needs more memory on different machines. It may be a good enough hack
> > for your usage, but I do not think it belongs in mainline.
> > Pavel
>
> Way late in responding to this, but...
>
> Appropriate sizing of this pool is a known issue. For example, we want to
> use it to keep the networking stack alive during extreme memory pressure
> situations. The only way to size the pool so as to *guarantee* that it
> will not be exhausted during the 2-minute window we need would be to
> ensure that the pool has at least (TOTAL_BANDWIDTH_OF_ALL_NICS * 120
> seconds) bytes available. In the case of a simple system with a single
> GigE adapter we'd need (1 gigabit/sec * 120 sec) = 120 gigabits = 15
> gigabytes of reserve
> pool. That is obviously completely impractical, considering many boxes

And it is not enough... If someone hits you with small packets,
allocation overhead is going to be high.
Pavel
--
Thanks, Sharp!