2007-08-06 10:46:10

by Peter Zijlstra

Subject: [PATCH 04/10] mm: slub: add knowledge of reserve pages

Restrict objects from reserve slabs (ALLOC_NO_WATERMARKS) to allocation
contexts that are entitled to it.

Care is taken to only touch the SLUB slow path.

Because the reserve threshold is system wide (by virtue of the previous patches)
we can make do with a single kmem_cache wide state.

Signed-off-by: Peter Zijlstra <[email protected]>
Cc: Christoph Lameter <[email protected]>
---
include/linux/slub_def.h | 2 +
mm/slub.c | 75 ++++++++++++++++++++++++++++++++++++++++++-----
2 files changed, 70 insertions(+), 7 deletions(-)

Index: linux-2.6-2/include/linux/slub_def.h
===================================================================
--- linux-2.6-2.orig/include/linux/slub_def.h
+++ linux-2.6-2/include/linux/slub_def.h
@@ -50,6 +50,8 @@ struct kmem_cache {
struct kobject kobj; /* For sysfs */
#endif

+ struct page *reserve_slab;
+
#ifdef CONFIG_NUMA
int defrag_ratio;
struct kmem_cache_node *node[MAX_NUMNODES];
Index: linux-2.6-2/mm/slub.c
===================================================================
--- linux-2.6-2.orig/mm/slub.c
+++ linux-2.6-2/mm/slub.c
@@ -20,11 +20,13 @@
#include <linux/mempolicy.h>
#include <linux/ctype.h>
#include <linux/kallsyms.h>
+#include "internal.h"

/*
* Lock order:
- * 1. slab_lock(page)
- * 2. slab->list_lock
+ * 1. reserve_lock
+ * 2. slab_lock(page)
+ * 3. node->list_lock
*
* The slab_lock protects operations on the object of a particular
* slab and its metadata in the page struct. If the slab lock
@@ -258,6 +260,8 @@ static inline int sysfs_slab_alias(struc
static inline void sysfs_slab_remove(struct kmem_cache *s) {}
#endif

+static DEFINE_SPINLOCK(reserve_lock);
+
/********************************************************************
* Core slab cache functions
*******************************************************************/
@@ -1069,7 +1073,7 @@ static void setup_object(struct kmem_cac
s->ctor(object, s, 0);
}

-static struct page *new_slab(struct kmem_cache *s, gfp_t flags, int node)
+static struct page *new_slab(struct kmem_cache *s, gfp_t flags, int node, int *reserve)
{
struct page *page;
struct kmem_cache_node *n;
@@ -1087,6 +1091,7 @@ static struct page *new_slab(struct kmem
if (!page)
goto out;

+ *reserve = page->reserve;
n = get_node(s, page_to_nid(page));
if (n)
atomic_long_inc(&n->nr_slabs);
@@ -1457,6 +1462,7 @@ static void *__slab_alloc(struct kmem_ca
{
void **object;
int cpu = smp_processor_id();
+ int reserve = 0;

if (!page)
goto new_slab;
@@ -1486,10 +1492,25 @@ new_slab:
if (page) {
s->cpu_slab[cpu] = page;
goto load_freelist;
- }
+ } else if (unlikely(gfp_to_alloc_flags(gfpflags) & ALLOC_NO_WATERMARKS))
+ goto try_reserve;

- page = new_slab(s, gfpflags, node);
- if (page) {
+alloc_slab:
+ page = new_slab(s, gfpflags, node, &reserve);
+ if (page && !reserve) {
+ if (unlikely(s->reserve_slab)) {
+ struct page *reserve;
+
+ spin_lock(&reserve_lock);
+ reserve = s->reserve_slab;
+ s->reserve_slab = NULL;
+ spin_unlock(&reserve_lock);
+
+ if (reserve) {
+ slab_lock(reserve);
+ unfreeze_slab(s, reserve);
+ }
+ }
cpu = smp_processor_id();
if (s->cpu_slab[cpu]) {
/*
@@ -1517,6 +1538,18 @@ new_slab:
SetSlabFrozen(page);
s->cpu_slab[cpu] = page;
goto load_freelist;
+ } else if (page) {
+ spin_lock(&reserve_lock);
+ if (s->reserve_slab) {
+ discard_slab(s, page);
+ page = s->reserve_slab;
+ goto got_reserve;
+ }
+ slab_lock(page);
+ SetSlabFrozen(page);
+ s->reserve_slab = page;
+ spin_unlock(&reserve_lock);
+ goto use_reserve;
}
return NULL;
debug:
@@ -1528,6 +1561,31 @@ debug:
page->freelist = object[page->offset];
slab_unlock(page);
return object;
+
+try_reserve:
+ spin_lock(&reserve_lock);
+ page = s->reserve_slab;
+ if (!page) {
+ spin_unlock(&reserve_lock);
+ goto alloc_slab;
+ }
+
+got_reserve:
+ slab_lock(page);
+ if (!page->freelist) {
+ s->reserve_slab = NULL;
+ spin_unlock(&reserve_lock);
+ unfreeze_slab(s, page);
+ goto alloc_slab;
+ }
+ spin_unlock(&reserve_lock);
+
+use_reserve:
+ object = page->freelist;
+ page->inuse++;
+ page->freelist = object[page->offset];
+ slab_unlock(page);
+ return object;
}

/*
@@ -1872,10 +1930,11 @@ static struct kmem_cache_node * __init e
{
struct page *page;
struct kmem_cache_node *n;
+ int reserve;

BUG_ON(kmalloc_caches->size < sizeof(struct kmem_cache_node));

- page = new_slab(kmalloc_caches, gfpflags | GFP_THISNODE, node);
+ page = new_slab(kmalloc_caches, gfpflags | GFP_THISNODE, node, &reserve);

BUG_ON(!page);
n = page->freelist;
@@ -2091,6 +2150,8 @@ static int kmem_cache_open(struct kmem_c
s->defrag_ratio = 100;
#endif

+ s->reserve_slab = NULL;
+
if (init_kmem_cache_nodes(s, gfpflags & ~SLUB_DMA))
return 1;
error:

--


2007-08-08 00:19:08

by Christoph Lameter

Subject: Re: [PATCH 04/10] mm: slub: add knowledge of reserve pages

On Mon, 6 Aug 2007, Peter Zijlstra wrote:

> Restrict objects from reserve slabs (ALLOC_NO_WATERMARKS) to allocation
> contexts that are entitled to it.

Is this patch actually necessary?

If you are in an atomic context and bound to a cpu then a per cpu slab is
assigned to you and no one else can take objects away from that process
since nothing else can run on the cpu. The point of the patch is to avoid
other processes draining objects, right? So you do not need the
modifications in that case.

If you are not in an atomic context and are preemptable or can switch
allocation context then you can create another context in which reclaim
could be run to remove some clean pages and get you more memory. Again no
need for the patch.

I guess you may be limited in not being able to call into reclaim again
because it is already running. Maybe that can be fixed? F.e. zone reclaim
does that for the NUMA case. It simply scans for easily reclaimable pages.

We could guarantee easily reclaimable pages to exist in much larger
numbers than the reserves of min_free_kbytes. So in a tight spot one could
reclaim from those.

2007-08-08 01:45:03

by Matt Mackall

Subject: Re: [PATCH 04/10] mm: slub: add knowledge of reserve pages

On Tue, Aug 07, 2007 at 05:13:52PM -0700, Christoph Lameter wrote:
> On Mon, 6 Aug 2007, Peter Zijlstra wrote:
>
> > Restrict objects from reserve slabs (ALLOC_NO_WATERMARKS) to allocation
> > contexts that are entitled to it.
>
> Is this patch actually necessary?
>
> If you are in an atomic context and bound to a cpu then a per cpu slab is
> assigned to you and no one else can take objects away from that process
> since nothing else can run on the cpu.

Servicing I/O over the network requires an allocation to send a buffer
and an allocation to later receive the acknowledgement. We can't free
our send buffer (or the memory it's supposed to clean) until the
relevant ack is received. We have to hold our reserves privately
throughout, even if an interrupt that wants to do GFP_ATOMIC
allocation shows up in-between.

> If you are not in an atomic context and are preemptable or can switch
> allocation context then you can create another context in which reclaim
> could be run to remove some clean pages and get you more memory. Again no
> need for the patch.

By the point that this patch is relevant, there are already no clean
pages. The only way to free up more memory is via I/O.

--
Mathematics is the supreme nostalgia of our time.

2007-08-08 17:14:30

by Christoph Lameter

Subject: Re: [PATCH 04/10] mm: slub: add knowledge of reserve pages

On Tue, 7 Aug 2007, Matt Mackall wrote:

> > If you are in an atomic context and bound to a cpu then a per cpu slab is
> > assigned to you and no one else can take objects away from that process
> > since nothing else can run on the cpu.
>
> Servicing I/O over the network requires an allocation to send a buffer
> and an allocation to later receive the acknowledgement. We can't free
> our send buffer (or the memory it's supposed to clean) until the
> relevant ack is received. We have to hold our reserves privately
> throughout, even if an interrupt that wants to do GFP_ATOMIC
> allocation shows up in-between.

If you can take an interrupt then you can move to a different allocation
context. This means reclaim could free up more pages if we tell reclaim
not to allocate any memory.

> > If you are not in an atomic context and are preemptable or can switch
> > allocation context then you can create another context in which reclaim
> > could be run to remove some clean pages and get you more memory. Again no
> > need for the patch.
>
> By the point that this patch is relevant, there are already no clean
> pages. The only way to free up more memory is via I/O.

That is never true. The dirty ratio limit caps the number of dirty pages
in memory. There is always a large percentage of memory that is kept
clean. Pages that are file backed and clean can be freed without any
additional memory allocation. This is true for the executable code that
you must have to execute any instructions. We could guarantee that the
number of pages reclaimable without memory allocs stays above certain
limits by checking VM counters.
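
For instance, something along these lines (only a sketch; the helper name
and the threshold are made up here, and the two counters only approximate
"clean and easily reclaimable"):

	static int enough_easily_reclaimable(unsigned long min_pages)
	{
		/* file backed pages minus the dirty ones: freeable without I/O */
		unsigned long clean = global_page_state(NR_FILE_PAGES) -
				      global_page_state(NR_FILE_DIRTY);

		return clean > min_pages;
	}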

I think there are two ways to address this in a simpler way:

1. Allow recursive calls into reclaim. If we are in a PF_MEMALLOC context
then we can still scan lru lists and free up memory of clean pages. Idea
patch follows.

2. Make pageout figure out if the write action requires actual I/O
submission. If so then the submission will *not* immediately free memory
and we have to wait for I/O to complete. In that case do not immediately
initiate I/O (which would not free up memory, and it's bad to initiate
I/O when we do not have enough free memory) but put all those pages on a
pageout list. When reclaim has reclaimed enough memory then go through the
pageout list and trigger I/O. That can be done without PF_MEMALLOC so that
additional reclaim could be triggered as needed. Maybe we can just get rid
of PF_MEMALLOC and some of the contorted code around it?
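
A very rough sketch of (2), as two fragments (locking and refcounting are
ignored; pageout_requires_io() and the deferred_pageout list are invented
for illustration):

	/* in shrink_page_list(): a dirty page whose writeout needs real I/O */
	if (PageDirty(page) && pageout_requires_io(page)) {
		/* defer the writeout instead of submitting it under PF_MEMALLOC */
		list_add(&page->lru, &sc->deferred_pageout);
		continue;
	}

	/*
	 * later, once reclaim has met its target and we are no longer
	 * under PF_MEMALLOC, submit the deferred I/O:
	 */
	while (!list_empty(&sc->deferred_pageout)) {
		page = lru_to_page(&sc->deferred_pageout);
		list_del(&page->lru);
		pageout(page, page_mapping(page));
	}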




Recursive reclaim concept patch:

---
include/linux/swap.h | 2 ++
mm/page_alloc.c | 11 +++++++++++
mm/vmscan.c | 27 +++++++++++++++++++++++++++
3 files changed, 40 insertions(+)

Index: linux-2.6/include/linux/swap.h
===================================================================
--- linux-2.6.orig/include/linux/swap.h 2007-08-08 04:31:06.000000000 -0700
+++ linux-2.6/include/linux/swap.h 2007-08-08 04:31:28.000000000 -0700
@@ -190,6 +190,8 @@ extern void swap_setup(void);
/* linux/mm/vmscan.c */
extern unsigned long try_to_free_pages(struct zone **zones, int order,
gfp_t gfp_mask);
+extern unsigned long emergency_free_pages(struct zone **zones, int order,
+ gfp_t gfp_mask);
extern unsigned long shrink_all_memory(unsigned long nr_pages);
extern int vm_swappiness;
extern int remove_mapping(struct address_space *mapping, struct page *page);
Index: linux-2.6/mm/page_alloc.c
===================================================================
--- linux-2.6.orig/mm/page_alloc.c 2007-08-08 04:17:33.000000000 -0700
+++ linux-2.6/mm/page_alloc.c 2007-08-08 04:39:26.000000000 -0700
@@ -1306,6 +1306,17 @@ nofail_alloc:
zonelist, ALLOC_NO_WATERMARKS);
if (page)
goto got_pg;
+
+ /*
+ * We cannot go into full synchronous reclaim
+ * but we can still scan for easily reclaimable
+ * pages.
+ */
+ if (p->flags & PF_MEMALLOC &&
+ emergency_free_pages(zonelist->zones, order,
+ gfp_mask))
+ goto nofail_alloc;
+
if (gfp_mask & __GFP_NOFAIL) {
congestion_wait(WRITE, HZ/50);
goto nofail_alloc;
Index: linux-2.6/mm/vmscan.c
===================================================================
--- linux-2.6.orig/mm/vmscan.c 2007-08-08 04:21:14.000000000 -0700
+++ linux-2.6/mm/vmscan.c 2007-08-08 04:42:24.000000000 -0700
@@ -1204,6 +1204,33 @@ out:
}

/*
+ * Emergency reclaim. We are already in the vm writeout path
+ * and we have exhausted all memory. We have to free memory without
+ * any additional allocations. So no writes and no swap. Get
+ * as bare bones as we can.
+ */
+unsigned long emergency_free_pages(struct zone **zones, int order, gfp_t gfp_mask)
+{
+ int priority;
+ unsigned long nr_reclaimed = 0;
+ struct scan_control sc = {
+ .gfp_mask = gfp_mask,
+ .swap_cluster_max = SWAP_CLUSTER_MAX,
+ .order = order,
+ };
+
+ for (priority = DEF_PRIORITY; priority >= 0; priority--) {
+ sc.nr_scanned = 0;
+ nr_reclaimed += shrink_zones(priority, zones, &sc);
+ if (nr_reclaimed >= sc.swap_cluster_max)
+ return 1;
+ }
+
+ /* top priority shrink_caches still had more to do? don't OOM, then */
+ return sc.all_unreclaimable;
+}
+
+/*
* For kswapd, balance_pgdat() will work across all this node's zones until
* they are all at pages_high.
*



2007-08-08 17:46:34

by Andrew Morton

Subject: Re: [PATCH 04/10] mm: slub: add knowledge of reserve pages

On Wed, 8 Aug 2007 10:13:05 -0700 (PDT)
Christoph Lameter <[email protected]> wrote:

> I think there are two ways to address this in a simpler way:
>
> 1. Allow recursive calls into reclaim. If we are in a PF_MEMALLOC context
> then we can still scan lru lists and free up memory of clean pages. Idea
> patch follows.
>
> 2. Make pageout figure out if the write action requires actual I/O
> submission. If so then the submission will *not* immediately free memory
> and we have to wait for I/O to complete. In that case do not immediately
> initiate I/O (which would not free up memory and its bad to initiate
> I/O when we have not enough free memory) but put all those pages on a
> pageout list. When reclaim has reclaimed enough memory then go through the
> pageout list and trigger I/O. That can be done without PF_MEMALLOC so that
> additional reclaim could be triggered as needed. Maybe we can just get rid
> of PF_MEMALLOC and some of the contorted code around it?

3. Perform page reclaim from hard IRQ context. Pretty simple to
implement, most of the work would be needed in the rmap code. It might be
better to make it opt-in via a new __GFP_flag.

2007-08-08 17:57:27

by Christoph Lameter

Subject: Re: [PATCH 04/10] mm: slub: add knowledge of reserve pages

On Wed, 8 Aug 2007, Andrew Morton wrote:

> 3. Perform page reclaim from hard IRQ context. Pretty simple to
> implement, most of the work would be needed in the rmap code. It might be
> better to make it opt-in via a new __GFP_flag.

In a hardirq context one is bound to a processor and through the slab
allocator to a slab from which one allocates objects. The slab is per cpu
and so the slab is reserved for the current context. Nothing can take
objects away from it. The modifications here would not be needed for that
context.

I think in general irq context reclaim is doable. Cannot see obvious
issues on a first superficial pass through rmap.c. The irq holdoff would
be pretty long though which may make it unacceptable.

2007-08-08 18:53:07

by Andrew Morton

Subject: Re: [PATCH 04/10] mm: slub: add knowledge of reserve pages

On Wed, 8 Aug 2007 10:57:13 -0700 (PDT)
Christoph Lameter <[email protected]> wrote:

> I think in general irq context reclaim is doable. Cannot see obvious
> issues on a first superficial pass through rmap.c. The irq holdoff would
> be pretty long though which may make it unacceptable.

The IRQ holdoff could be tremendous. But if it is sufficiently infrequent
and if the worst effect is merely a network rx ring overflow then the tradeoff
might be a good one.

2007-08-10 01:54:16

by Daniel Phillips

Subject: Re: [PATCH 04/10] mm: slub: add knowledge of reserve pages

On 8/8/07, Andrew Morton <[email protected]> wrote:
> On Wed, 8 Aug 2007 10:57:13 -0700 (PDT)
> Christoph Lameter <[email protected]> wrote:
>
> > I think in general irq context reclaim is doable. Cannot see obvious
> > issues on a first superficial pass through rmap.c. The irq holdoff would
> > be pretty long though which may make it unacceptable.
>
> The IRQ holdoff could be tremendous. But if it is sufficiently infrequent
> and if the worst effect is merely a network rx ring overflow then the tradeoff
> might be a good one.

Hi Andrew,

No matter how you look at this problem, you still need to have _some_
sort of reserve, and limit access to it. We extend existing methods,
you are proposing what seems like an entirely new reserve
management system. Great idea, maybe, but it does not solve the
deadlocks. You still need some organized way of being sure that your
reserve is as big as you need (hopefully not an awful lot bigger) and
you still have to make sure that nobody dips into that reserve further
than they are allowed to.

So translation: reclaim from "easily freeable" lists is an
optimization, maybe a great one. Probably great. Reclaim from atomic
context is also a great idea, probably. But you are talking about a
whole nuther patch set. Neither of those are in themselves a fix for
these deadlocks.

Regards,

Daniel

2007-08-10 02:01:55

by Christoph Lameter

Subject: Re: [PATCH 04/10] mm: slub: add knowledge of reserve pages

On Thu, 9 Aug 2007, Daniel Phillips wrote:

> No matter how you look at this problem, you still need to have _some_
> sort of reserve, and limit access to it. We extend existing methods,

The reserve is in the memory in the zone and reclaim can guarantee that
there are a sufficient number of easily reclaimable pages in it.

> you are proposing what seems like an entirely new reserve

The reserve always has been managed by per zone counters. Nothing new
there.

> management system. Great idea, maybe, but it does not solve the
> deadlocks. You still need some organized way of being sure that your
> reserve is as big as you need (hopefully not an awful lot bigger) and
> you still have to make sure that nobody dips into that reserve further
> than they are allowed to.

Nope, there is no need for additional reserves. You delay the writeout
until you are finished with reclaim. Then you do the writeout. During
writeout, reclaim may be called as needed. After the writeout is complete
you recheck the vm counters again to be sure that the dirty ratio /
easily reclaimable ratio and the mem low/high boundaries are still okay.
If not, go back to reclaim.

> So translation: reclaim from "easily freeable" lists is an
> optimization, maybe a great one. Probably great. Reclaim from atomic
> context is also a great idea, probably. But you are talking about a
> whole nuther patch set. Neither of those are in themselves a fix for
> these deadlocks.

Yes, they are a much better fix and may allow code cleanup by getting rid
of checks for PF_MEMALLOC. They integrate in a straightforward way
into the existing reclaim methods.

2007-08-20 07:38:48

by Peter Zijlstra

Subject: Re: [PATCH 04/10] mm: slub: add knowledge of reserve pages

Ok, so I got rid of the global stuff, this also obsoletes 3/10.


---
Subject: mm: slub: add knowledge of reserve pages

Restrict objects from reserve slabs (ALLOC_NO_WATERMARKS) to allocation
contexts that are entitled to it.

Care is taken to only touch the SLUB slow path.

Signed-off-by: Peter Zijlstra <[email protected]>
Cc: Christoph Lameter <[email protected]>
---
mm/slub.c | 87 +++++++++++++++++++++++++++++++++++++++++++++++++-------------
1 file changed, 69 insertions(+), 18 deletions(-)

Index: linux-2.6-2/mm/slub.c
===================================================================
--- linux-2.6-2.orig/mm/slub.c
+++ linux-2.6-2/mm/slub.c
@@ -20,11 +20,12 @@
#include <linux/mempolicy.h>
#include <linux/ctype.h>
#include <linux/kallsyms.h>
+#include "internal.h"

/*
* Lock order:
* 1. slab_lock(page)
- * 2. slab->list_lock
+ * 2. node->list_lock
*
* The slab_lock protects operations on the object of a particular
* slab and its metadata in the page struct. If the slab lock
@@ -1069,7 +1070,7 @@ static void setup_object(struct kmem_cac
s->ctor(object, s, 0);
}

-static struct page *new_slab(struct kmem_cache *s, gfp_t flags, int node)
+static struct page *new_slab(struct kmem_cache *s, gfp_t flags, int node, int *reserve)
{
struct page *page;
struct kmem_cache_node *n;
@@ -1087,6 +1088,7 @@ static struct page *new_slab(struct kmem
if (!page)
goto out;

+ *reserve = page->reserve;
n = get_node(s, page_to_nid(page));
if (n)
atomic_long_inc(&n->nr_slabs);
@@ -1403,12 +1405,36 @@ static inline void flush_slab(struct kme
}

/*
+ * cpu slab reserve magic
+ *
+ * we mark reserve status in the lsb of the ->cpu_slab[] pointer.
+ */
+static inline unsigned long cpu_slab_reserve(struct kmem_cache *s, int cpu)
+{
+ return unlikely((unsigned long)s->cpu_slab[cpu] & 1);
+}
+
+static inline void
+cpu_slab_set(struct kmem_cache *s, int cpu, struct page *page, int reserve)
+{
+ if (unlikely(reserve))
+ page = (struct page *)((unsigned long)page | 1);
+
+ s->cpu_slab[cpu] = page;
+}
+
+static inline struct page *cpu_slab(struct kmem_cache *s, int cpu)
+{
+ return (struct page *)((unsigned long)s->cpu_slab[cpu] & ~1UL);
+}
+
+/*
* Flush cpu slab.
* Called from IPI handler with interrupts disabled.
*/
static inline void __flush_cpu_slab(struct kmem_cache *s, int cpu)
{
- struct page *page = s->cpu_slab[cpu];
+ struct page *page = cpu_slab(s, cpu);

if (likely(page))
flush_slab(s, page, cpu);
@@ -1457,10 +1483,22 @@ static void *__slab_alloc(struct kmem_ca
{
void **object;
int cpu = smp_processor_id();
+ int reserve = 0;

if (!page)
goto new_slab;

+ if (cpu_slab_reserve(s, cpu)) {
+ /*
+ * If the current slab is a reserve slab and the current
+ * allocation context does not allow access to the reserves
+ * we must force an allocation to test the current levels.
+ */
+ if (!(gfp_to_alloc_flags(gfpflags) & ALLOC_NO_WATERMARKS))
+ goto alloc_slab;
+ reserve = 1;
+ }
+
slab_lock(page);
if (unlikely(node != -1 && page_to_nid(page) != node))
goto another_slab;
@@ -1468,10 +1506,9 @@ load_freelist:
object = page->freelist;
if (unlikely(!object))
goto another_slab;
- if (unlikely(SlabDebug(page)))
+ if (unlikely(SlabDebug(page) || reserve))
goto debug;

- object = page->freelist;
page->lockless_freelist = object[page->offset];
page->inuse = s->objects;
page->freelist = NULL;
@@ -1484,14 +1521,28 @@ another_slab:
new_slab:
page = get_partial(s, gfpflags, node);
if (page) {
- s->cpu_slab[cpu] = page;
+ cpu_slab_set(s, cpu, page, reserve);
goto load_freelist;
}

- page = new_slab(s, gfpflags, node);
+alloc_slab:
+ page = new_slab(s, gfpflags, node, &reserve);
if (page) {
+ struct page *slab;
+
cpu = smp_processor_id();
- if (s->cpu_slab[cpu]) {
+ slab = cpu_slab(s, cpu);
+
+ if (cpu_slab_reserve(s, cpu) && !reserve) {
+ /*
+ * If the current cpu_slab is a reserve slab but we
+ * managed to allocate a new slab the pressure is
+ * lifted and we can unmark the current one.
+ */
+ cpu_slab_set(s, cpu, slab, 0);
+ }
+
+ if (slab) {
/*
* Someone else populated the cpu_slab while we
* enabled interrupts, or we have gotten scheduled
@@ -1499,29 +1550,28 @@ new_slab:
* requested node even if __GFP_THISNODE was
* specified. So we need to recheck.
*/
- if (node == -1 ||
- page_to_nid(s->cpu_slab[cpu]) == node) {
+ if (node == -1 || page_to_nid(slab) == node) {
/*
* Current cpuslab is acceptable and we
* want the current one since its cache hot
*/
discard_slab(s, page);
- page = s->cpu_slab[cpu];
+ page = slab;
slab_lock(page);
goto load_freelist;
}
/* New slab does not fit our expectations */
- flush_slab(s, s->cpu_slab[cpu], cpu);
+ flush_slab(s, slab, cpu);
}
slab_lock(page);
SetSlabFrozen(page);
- s->cpu_slab[cpu] = page;
+ cpu_slab_set(s, cpu, page, reserve);
goto load_freelist;
}
return NULL;
debug:
- object = page->freelist;
- if (!alloc_debug_processing(s, page, object, addr))
+ if (SlabDebug(page) &&
+ !alloc_debug_processing(s, page, object, addr))
goto another_slab;

page->inuse++;
@@ -1548,7 +1598,7 @@ static void __always_inline *slab_alloc(
unsigned long flags;

local_irq_save(flags);
- page = s->cpu_slab[smp_processor_id()];
+ page = cpu_slab(s, smp_processor_id());
if (unlikely(!page || !page->lockless_freelist ||
(node != -1 && page_to_nid(page) != node)))

@@ -1873,10 +1923,11 @@ static struct kmem_cache_node * __init e
{
struct page *page;
struct kmem_cache_node *n;
+ int reserve;

BUG_ON(kmalloc_caches->size < sizeof(struct kmem_cache_node));

- page = new_slab(kmalloc_caches, gfpflags | GFP_THISNODE, node);
+ page = new_slab(kmalloc_caches, gfpflags | GFP_THISNODE, node, &reserve);

BUG_ON(!page);
n = page->freelist;
@@ -3189,7 +3240,7 @@ static unsigned long slab_objects(struct
per_cpu = nodes + nr_node_ids;

for_each_possible_cpu(cpu) {
- struct page *page = s->cpu_slab[cpu];
+ struct page *page = cpu_slab(s, cpu);
int node;

if (page) {



2007-08-20 07:43:26

by Peter Zijlstra

Subject: Re: [PATCH 04/10] mm: slub: add knowledge of reserve pages

On Mon, 2007-08-20 at 09:38 +0200, Peter Zijlstra wrote:
> Ok, so I got rid of the global stuff, this also obsoletes 3/10.

2/10 that is

2007-08-20 09:12:42

by Pekka Enberg

Subject: Re: [PATCH 04/10] mm: slub: add knowledge of reserve pages

Hi Peter,

On Mon, 20 Aug 2007, Peter Zijlstra wrote:
> -static struct page *new_slab(struct kmem_cache *s, gfp_t flags, int node)
> +static struct page *new_slab(struct kmem_cache *s, gfp_t flags, int node, int *reserve)
> {

[snip]

> + *reserve = page->reserve;

Any reason why the callers that are actually interested in this don't do
page->reserve on their own?

2007-08-20 09:17:53

by Peter Zijlstra

Subject: Re: [PATCH 04/10] mm: slub: add knowledge of reserve pages

On Mon, 2007-08-20 at 12:12 +0300, Pekka J Enberg wrote:
> Hi Peter,
>
> On Mon, 20 Aug 2007, Peter Zijlstra wrote:
> > -static struct page *new_slab(struct kmem_cache *s, gfp_t flags, int node)
> > +static struct page *new_slab(struct kmem_cache *s, gfp_t flags, int node, int *reserve)
> > {
>
> [snip]
>
> > + *reserve = page->reserve;
>
> Any reason why the callers that are actually interested in this don't do
> page->reserve on their own?

because new_slab() destroys the content?

struct page {
...
union {
pgoff_t index; /* Our offset within mapping. */
void *freelist; /* SLUB: freelist req. slab lock */
int reserve; /* page_alloc: page is a reserve page */
atomic_t frag_count; /* skb fragment use count */
};
...
};

static struct page *new_slab(struct kmem_cache *s, gfp_t flags, int node, int *reserve)
{
...
*reserve = page->reserve;
...
page->freelist = start;
...
}



2007-08-20 09:28:20

by Pekka Enberg

Subject: Re: [PATCH 04/10] mm: slub: add knowledge of reserve pages

Hi Peter,

On Mon, 2007-08-20 at 12:12 +0300, Pekka J Enberg wrote:
> > Any reason why the callers that are actually interested in this don't do
> > page->reserve on their own?

On 8/20/07, Peter Zijlstra <[email protected]> wrote:
> because new_slab() destroys the content?

Right. So maybe we could move the initialization parts of new_slab()
to __new_slab() so that the callers that are actually interested in
'reserve' could do allocate_slab(), store page->reserve and do the rest of
the initialization with it?
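
Something along these lines, perhaps (function names invented; __new_slab()
stands for whatever is left of today's new_slab() once the allocation is
split out, and the gfp mask handling is glossed over):

	static struct page *new_slab_reserve(struct kmem_cache *s, gfp_t flags,
						int node, int *reserve)
	{
		struct page *page = allocate_slab(s, flags, node);

		if (!page)
			return NULL;

		/* read it before the union member is reused for ->freelist */
		*reserve = page->reserve;

		/* the initialization that new_slab() does today */
		return __new_slab(s, page, node);
	}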

As for the __GFP_WAIT handling, I *think* we can move the interrupt
enable/disable to allocate_slab()... Christoph?

Pekka

2007-08-20 19:26:20

by Christoph Lameter

Subject: Re: [PATCH 04/10] mm: slub: add knowledge of reserve pages

On Mon, 20 Aug 2007, Pekka Enberg wrote:

> Hi Peter,
>
> On Mon, 2007-08-20 at 12:12 +0300, Pekka J Enberg wrote:
> > > Any reason why the callers that are actually interested in this don't do
> > > page->reserve on their own?
>
> On 8/20/07, Peter Zijlstra <[email protected]> wrote:
> > because new_slab() destroys the content?
>
> Right. So maybe we could move the initialization parts of new_slab()
> to __new_slab() so that the callers that are actually interested in
> 'reserve' could do allocate_slab(), store page->reserve and do the rest of
> the initialization with it?

I am still not convinced about this approach and there seems to be
agreement that this is not working on large NUMA. So #ifdef it out?
!CONFIG_NUMA? Some more general approach that does not rely on a single
slab being a reserve?

The objective is to check the alloc flags once we have allocated a reserve
slab, right? Adding another flag, SlabReserve, and keying off that one may
be the easiest solution.
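
Purely as a sketch, such a flag could follow the existing SlabFrozen()/
SlabDebug() helper pattern (the bit below is an arbitrary stand-in, not a
claim about which page flag is actually free):

	static inline int SlabReserve(struct page *page)
	{
		return test_bit(PG_owner_priv_1, &page->flags);
	}

	static inline void SetSlabReserve(struct page *page)
	{
		set_bit(PG_owner_priv_1, &page->flags);
	}

	static inline void ClearSlabReserve(struct page *page)
	{
		clear_bit(PG_owner_priv_1, &page->flags);
	}

__slab_alloc() would then SetSlabReserve() on a slab allocated from the
reserves and key off SlabReserve() instead of a kmem_cache wide pointer.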

I have pending patches here that add per cpu structures. Those will make
that job easier.

> As for the __GFP_WAIT handling, I *think* we can move the interrupt
> enable/disable to allocate_slab()... Christoph?

The reason the enable/disable is in new_slab() is to minimize interrupt
holdoff time. If we move it to allocate_slab() then the slab preparation is
done with interrupts disabled.
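
The structure in question is roughly this (a much simplified sketch, with
error handling and the slab initialization itself elided):

	static struct page *new_slab(struct kmem_cache *s, gfp_t flags, int node)
	{
		struct page *page;

		if (flags & __GFP_WAIT)
			local_irq_enable();

		page = allocate_slab(s, flags, node);

		/*
		 * setup_object() loop, freelist linking etc. happen here,
		 * still with interrupts enabled
		 */

		if (flags & __GFP_WAIT)
			local_irq_disable();

		return page;
	}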

2007-08-20 20:09:51

by Peter Zijlstra

Subject: Re: [PATCH 04/10] mm: slub: add knowledge of reserve pages

On Mon, 2007-08-20 at 12:26 -0700, Christoph Lameter wrote:
> On Mon, 20 Aug 2007, Pekka Enberg wrote:
>
> > Hi Peter,
> >
> > On Mon, 2007-08-20 at 12:12 +0300, Pekka J Enberg wrote:
> > > > Any reason why the callers that are actually interested in this don't do
> > > > page->reserve on their own?
> >
> > On 8/20/07, Peter Zijlstra <[email protected]> wrote:
> > > because new_slab() destroys the content?
> >
> > Right. So maybe we could move the initialization parts of new_slab()
> > to __new_slab() so that the callers that are actually interested in
> > 'reserve' could do allocate_slab(), store page->reserve and do the rest of
> > the initialization with it?
>
> I am still not convinced about this approach and there seems to be
> agreement that this is not working on large NUMA. So #ifdef it out?
> !CONFIG_NUMA? Some more general approach that does not rely on a single
> slab being a reserve?

See the patch I sent earlier today?

> The object is to check the alloc flags when having allocated a reserve
> slab right?

The initial idea was to make each slab allocation respect the watermarks
like page allocation does (the page rank thingies, if you remember).
That is, if the slab is allocated from below the ALLOC_MIN|ALLOC_HARDER
threshold, an ALLOC_MIN|ALLOC_HIGH allocation would get memory, but an
ALLOC_MIN would not.

Now, we only needed the ALLOC_MIN|ALLOC_HIGH|ALLOC_HARDER <->
ALLOC_NO_WATERMARKS transition and hence fell back to a binary system
that is not quite fair wrt all the other levels but suffices for the
problem at hand.

So we want to ensure that slab allocations that are _not_ entitled to
ALLOC_NO_WATERMARKS memory will not get objects when a page allocation
with the same right would fail, even if there is a slab present.
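
In code, the binary test boils down to something like this
(gfp_to_alloc_flags() and ALLOC_NO_WATERMARKS come from the earlier patches
in the series; the helper name is only illustrative, the patch itself
open-codes the check in __slab_alloc()):

	static inline int slab_alloc_entitled(gfp_t gfpflags)
	{
		/* may this allocation context ignore the watermarks? */
		return (gfp_to_alloc_flags(gfpflags) & ALLOC_NO_WATERMARKS) != 0;
	}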

> Adding another flag SlabReserve and keying off on that one may
> be the easiest solution.

Trouble with something like that is that page flags are persistent and
you'd need to clean them when the status flips -> O(n) -> unwanted.

> I have pending patches here that add per cpu structures. Those will make
> that job easier.

Yeah, I've seen earlier versions of those.