2010-08-04 02:52:14

by Christoph Lameter

Subject: [S+Q3 00/23] SLUB: The Unified slab allocator (V3)

The following is a first release of an allocator based on SLAB
and SLUB that integrates the best approaches from both allocators. The
per cpu queuing is like the two prior releases. The NUMA facilities
were much improved vs V2. Shared and alien cache support was added to
track the cache hot state of objects.

After these patches SLUB will track the cpu cache contents
like SLAB attempted to. There are a number of architectural differences:

1. SLUB accurately tracks cpu caches instead of assuming that there
is only a single cpu cache per node or system.

2. SLUB object expiration is tied into the page reclaim logic. There
is no periodic cache expiration.

3. SLUB caches are dynamically configurable via the sysfs filesystem.

4. There is no per slab page metadata structure to maintain (aside
from the object bitmap that usually fits into the page struct).

5. Keeps all the other good features of SLUB as well.

SLUB+Q is a merging of SLUB with some queuing concepts from SLAB and a
new way of managing objects in the slabs using bitmaps. It uses a percpu
queue so that free operations can be properly buffered and a bitmap for
managing the free/allocated state in the slabs. It is slightly less
efficient than SLUB (due to the need to place large bitmaps -- sized
a few words -- in some slab pages if there are more than BITS_PER_LONG
objects in a slab) but in general does not increase space use too much.

The SLAB scheme of not touching the object during management is adopted.
SLUB+Q can efficiently free and allocate cache cold objects without
causing cache misses.
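
To illustrate the bitmap scheme, here is a minimal sketch of bitmap-driven
allocation (identifiers and layout here are illustrative assumptions, not
the patch code): a set bit means the object is free, and allocation just
finds and clears a bit without ever touching the object itself.

/* Sketch only; needs <linux/bitops.h> for find_first_bit()/__clear_bit() */
static void *alloc_from_map(unsigned long *map, void *slab_base,
			unsigned int nr_objects, unsigned int size)
{
	unsigned int idx = find_first_bit(map, nr_objects);

	if (idx >= nr_objects)
		return NULL;			/* slab has no free object */

	__clear_bit(idx, map);			/* mark the object allocated */
	return slab_base + idx * size;		/* object memory is not touched */
}

Freeing is the mirror image: compute the index from the object address and
set the bit again, which is what lets cache cold objects stay cold.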

I have had limited time for benchmarking this release so far since I
was more focused on getting the SLAB features merged in and making it
work reliably with all the usual SLUB bells and whistles. The queueing
scheme from the SLUB+Q V1/V2 releases was not changed, so the basic
SMP performance is still the same. V1 and V2 did not have NUMA-clean
queues and therefore performance on NUMA systems was not great.

Since the basic queueing scheme was taken from SLAB we should be seeing
similar or better performance on NUMA. But then I am limited to two-node
systems at this point. For those systems the alien caches are allocated
with a size similar to the shared caches, meaning that, for now, more
optimizations will be geared toward small NUMA systems.



Patches against 2.6.35

1,2 Some percpu stuff that I hope will be merged independently in the 2.6.36
cycle.

3-13 Cleanup patches for SLUB that are general improvements. Some of those
are already in the slab tree for 2.6.36.

14-18 Minimal set that realizes per cpu queues without fancy shared or alien
queues. This should be enough to be competitive against SLAB on modern
SMP hardware, as the earlier measurements show.

19 NUMA policies applied at the object level. This will cause significantly
more processing in the allocator hotpath for the NUMA case on
particular slabs so that individual allocations can be redirected
to different nodes.

20 Shared caches per cache sibling group between processors.

21 Alien caches per cache sibling group. Just adds a couple of
shared caches and uses them for foreign nodes.

22 Cache expiration.

23 Expire caches from the page reclaim logic in mm/vmscan.c.


2010-08-04 04:39:15

by David Rientjes

Subject: Re: [S+Q3 00/23] SLUB: The Unified slab allocator (V3)

On Tue, 3 Aug 2010, Christoph Lameter wrote:

> The following is a first release of an allocator based on SLAB
> and SLUB that integrates the best approaches from both allocators. The
> per cpu queuing is like the two prior releases. The NUMA facilities
> were much improved vs V2. Shared and alien cache support was added to
> track the cache hot state of objects.
>

This insta-reboots on my netperf benchmarking servers (but works with
numa=off), so I'll have to wait until I can hook up a serial before
benchmarking this series.

2010-08-04 16:17:51

by Christoph Lameter

Subject: Re: [S+Q3 00/23] SLUB: The Unified slab allocator (V3)

On Tue, 3 Aug 2010, David Rientjes wrote:

> On Tue, 3 Aug 2010, Christoph Lameter wrote:
>
> > The following is a first release of an allocator based on SLAB
> > and SLUB that integrates the best approaches from both allocators. The
> > per cpu queuing is like the two prior releases. The NUMA facilities
> > were much improved vs V2. Shared and alien cache support was added to
> > track the cache hot state of objects.
> >
>
> This insta-reboots on my netperf benchmarking servers (but works with
> numa=off), so I'll have to wait until I can hook up a serial before
> benchmarking this series.

There are potential issues with

1. The size of per cpu reservation on bootup and the new percpu code that
allows allocations for per cpu areas during bootup. Sometimes I wonder if I
should just go back to static allocs for that.

2. The topology information provided by the machine for the cache setup.

3. My code of course.

Bootlog would be appreciated.

2010-08-05 08:38:37

by David Rientjes

Subject: Re: [S+Q3 00/23] SLUB: The Unified slab allocator (V3)

On Wed, 4 Aug 2010, Christoph Lameter wrote:

> > This insta-reboots on my netperf benchmarking servers (but works with
> > numa=off), so I'll have to wait until I can hook up a serial before
> > benchmarking this series.
>
> There are potential issues with
>
> 1. The size of per cpu reservation on bootup and the new percpu code that
> allows allocations for per cpu areas during bootup. Sometimes I wonder if I
> should just go back to static allocs for that.
>
> 2. The topology information provided by the machine for the cache setup.
>
> 3. My code of course.
>

I bisected this to patch 8 but still don't have a bootlog. I'm assuming
in the meantime that something is kmallocing DMA memory on this machine
prior to kmem_cache_init_late() and get_slab() is returning a NULL
pointer.
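
For reference, a sketch of the suspected window (this follows the general
shape of the 2.6.35-era SLUB code with this series applied; the exact
patched code may differ):

static struct kmem_cache *get_slab(size_t size, gfp_t flags)
{
	int index = kmalloc_index(size);

	if (unlikely(flags & SLUB_DMA))
		/* NULL until kmem_cache_init_late() has created the DMA caches */
		return kmalloc_dma_caches[index];

	return kmalloc_caches[index];
}

Any kmalloc(size, GFP_DMA) issued before kmem_cache_init_late() would then
dereference a NULL cache pointer.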

2010-08-05 17:33:27

by Christoph Lameter

Subject: Re: [S+Q3 00/23] SLUB: The Unified slab allocator (V3)

On Thu, 5 Aug 2010, David Rientjes wrote:

> I bisected this to patch 8 but still don't have a bootlog. I'm assuming
> in the meantime that something is kmallocing DMA memory on this machine
> prior to kmem_cache_init_late() and get_slab() is returning a NULL
> pointer.

There is a kernel option "earlyprintk=..." that allows you to see early
boot messages.
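
For example, to get messages out over a serial console (port and baud rate
depend on the machine):

	earlyprintk=serial,ttyS0,115200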

If this indeed is a problem with the DMA caches then try the following
patch:



Subject: slub: Move dma cache initialization up

Do dma kmalloc initialization in kmem_cache_init and not in kmem_cache_init_late()

Signed-off-by: Christoph Lameter <[email protected]>

---
mm/slub.c | 9 ++++-----
1 file changed, 4 insertions(+), 5 deletions(-)

Index: linux-2.6/mm/slub.c
===================================================================
--- linux-2.6.orig/mm/slub.c 2010-08-05 12:24:21.000000000 -0500
+++ linux-2.6/mm/slub.c 2010-08-05 12:28:58.000000000 -0500
@@ -3866,13 +3866,8 @@ void __init kmem_cache_init(void)
 #ifdef CONFIG_SMP
 	register_cpu_notifier(&slab_notifier);
 #endif
-}
 
-void __init kmem_cache_init_late(void)
-{
 #ifdef CONFIG_ZONE_DMA
-	int i;
-
 	/* Create the dma kmalloc array and make it operational */
 	for (i = 0; i < SLUB_PAGE_SHIFT; i++) {
 		struct kmem_cache *s = kmalloc_caches[i];
@@ -3891,6 +3886,10 @@ void __init kmem_cache_init_late(void)
 #endif
 }
 
+void __init kmem_cache_init_late(void)
+{
+}
+
 /*
  * Find a mergeable slab cache
 */

2010-08-17 04:56:45

by David Rientjes

Subject: Re: [S+Q3 00/23] SLUB: The Unified slab allocator (V3)

On Thu, 5 Aug 2010, Christoph Lameter wrote:

> > I bisected this to patch 8 but still don't have a bootlog. I'm assuming
> > in the meantime that something is kmallocing DMA memory on this machine
> > prior to kmem_cache_init_late() and get_slab() is returning a NULL
> > pointer.
>
> There is a kernel option "earlyprintk=..." that allows you to see early
> boot messages.
>

Ok, so this is panicking because of the error handling when trying to
create sysfs directories with the same name (in this case, :dt-0000064).
I'll look into why this isn't failing gracefully later, but I isolated
this to the new code that statically allocates the DMA caches in
kmem_cache_init_late().

The iteration runs from 0 to SLUB_PAGE_SHIFT; that's actually incorrect
since the kmem_cache_node cache occupies the first spot in the
kmalloc_caches array and has a size, 64 bytes, equal to a power of two
that is duplicated later. So this patch tries creating two DMA kmalloc
caches with 64 byte object size which triggers a BUG_ON() during
kmem_cache_release() in the error handling later.
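
To illustrate the collision (the 64-byte size assumes CONFIG_SLUB_DEBUG):

	kmalloc_caches[0] -> kmem_cache_node, object size 64
	kmalloc_caches[6] -> kmalloc-64,      object size 64 (2^6)

so iterating from 0 creates dma-kmalloc-64 twice, and the second sysfs
registration of :dt-0000064 fails.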

The fix is to start the iteration at 1 instead of 0 so that all other
caches have their equivalent DMA caches created and the special-case
kmem_cache_node cache is excluded (see below).

I'm really curious why nobody else ran into this problem before,
especially if they have CONFIG_SLUB_DEBUG enabled so
struct kmem_cache_node has the same size. Perhaps my early bug report
caused people not to test the series...

I'm adding Tejun Heo to the cc because of another thing that may be
problematic: alloc_percpu() allocates GFP_KERNEL memory, so when we try to
allocate kmem_cache_cpu for a DMA cache we may be returning memory from a
node that doesn't include lowmem so there will be no affinity between the
struct and the slab. I'm wondering if it would be better for the percpu
allocator to be extended for kzalloc_node(), or vmalloc_node(), when
allocating memory after the slab layer is up.
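
For context, the allocation in question is the per cpu one done for each
cache, along the lines of what mainline SLUB does:

	s->cpu_slab = alloc_percpu(struct kmem_cache_cpu);

and alloc_percpu() today takes neither a gfp mask nor a node parameter; it
implicitly allocates GFP_KERNEL memory.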

There are a couple more issues with the patch as well:

- the entire iteration in kmem_cache_init_late() needs to be protected by
slub_lock. The comment in create_kmalloc_cache() should be revised
since you're no longer calling it only with irqs disabled.
kmem_cache_init_late() has irqs enabled and, thus, slab_caches must be
protected.

- a BUG_ON(!name) needs to be added in kmem_cache_init_late() when
kasprintf() returns NULL. This isn't checked in kmem_cache_open() so
it'll only encounter a problem in the sysfs layer. Adding a BUG_ON()
will help track those down.

Otherwise, I didn't find any problem with removing the dynamic DMA cache
allocation on my machines.

Please fold this into patch 8.

Signed-off-by: David Rientjes <[email protected]>
---
diff --git a/mm/slub.c b/mm/slub.c
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -2552,13 +2552,12 @@ static int __init setup_slub_nomerge(char *str)
 
 __setup("slub_nomerge", setup_slub_nomerge);
 
+/*
+ * Requires slub_lock if called when irqs are enabled after early boot.
+ */
 static void create_kmalloc_cache(struct kmem_cache *s,
 		const char *name, int size, unsigned int flags)
 {
-	/*
-	 * This function is called with IRQs disabled during early-boot on
-	 * single CPU so there's no need to take slub_lock here.
-	 */
 	if (!kmem_cache_open(s, name, size, ARCH_KMALLOC_MINALIGN,
 								flags, NULL))
 		goto panic;
@@ -3063,17 +3062,20 @@ void __init kmem_cache_init_late(void)
 #ifdef CONFIG_ZONE_DMA
 	int i;
 
-	for (i = 0; i < SLUB_PAGE_SHIFT; i++) {
+	down_write(&slub_lock);
+	for (i = 1; i < SLUB_PAGE_SHIFT; i++) {
 		struct kmem_cache *s = &kmalloc_caches[i];
 
-		if (s && s->size) {
+		if (s->size) {
 			char *name = kasprintf(GFP_KERNEL,
 				"dma-kmalloc-%d", s->objsize);
 
+			BUG_ON(!name);
 			create_kmalloc_cache(&kmalloc_dma_caches[i],
 				name, s->objsize, SLAB_CACHE_DMA);
 		}
 	}
+	up_write(&slub_lock);
 #endif
 }
 

2010-08-17 08:00:04

by Tejun Heo

Subject: Re: [S+Q3 00/23] SLUB: The Unified slab allocator (V3)

Hello,

On 08/17/2010 06:56 AM, David Rientjes wrote:
> I'm adding Tejun Heo to the cc because of another thing that may be
> problematic: alloc_percpu() allocates GFP_KERNEL memory, so when we try to
> allocate kmem_cache_cpu for a DMA cache we may be returning memory from a
> node that doesn't include lowmem so there will be no affinity between the
> struct and the slab. I'm wondering if it would be better for the percpu
> allocator to be extended for kzalloc_node(), or vmalloc_node(), when
> allocating memory after the slab layer is up.

Hmmm... do you mean adding @gfp_mask to the percpu allocation function?
I've been thinking about adding it for atomic allocations (Christoph,
do you still want it?). I've been sort of against it because I
primarily don't really like atomic allocations (it often just pushes
error handling complexities elsewhere where it becomes more complex)
and it would also require making vmalloc code do atomic allocations.

Most percpu use cases seem pretty happy with GFP_KERNEL allocation,
so I'm still quite reluctant to change that. We could add a semi-internal
interface w/ @gfp_mask but w/o GFP_ATOMIC support, which is a
bit ugly. How important would this be?

Thanks.

--
tejun

2010-08-17 13:56:33

by Christoph Lameter

Subject: Re: [S+Q3 00/23] SLUB: The Unified slab allocator (V3)

On Tue, 17 Aug 2010, Tejun Heo wrote:

> Hello,
>
> On 08/17/2010 06:56 AM, David Rientjes wrote:
> > I'm adding Tejun Heo to the cc because of another thing that may be
> > problematic: alloc_percpu() allocates GFP_KERNEL memory, so when we try to
> > allocate kmem_cache_cpu for a DMA cache we may be returning memory from a
> > node that doesn't include lowmem so there will be no affinity between the
> > struct and the slab. I'm wondering if it would be better for the percpu
> > allocator to be extended for kzalloc_node(), or vmalloc_node(), when
> > allocating memory after the slab layer is up.
>
> Hmmm... do you mean adding @gfp_mask to the percpu allocation function?

DMA caches may only exist on certain nodes because others do not have a
DMA zone. Their role is quite limited these days. DMA caches allocated on
nodes without DMA zones would have their percpu area allocated on the node
but the DMA allocations would be redirected to the closest node with DMA
memory.

> I've been thinking about adding it for atomic allocations (Christoph,
> do you still want it?). I've been sort of against it because I
> primarily don't really like atomic allocations (it often just pushes
> error handling complexities elsewhere where it becomes more complex)
> and it would also require making vmalloc code do atomic allocations.

At this point I would think that we do not need that support.

2010-08-17 17:23:16

by Christoph Lameter

Subject: Re: [S+Q3 00/23] SLUB: The Unified slab allocator (V3)

On Mon, 16 Aug 2010, David Rientjes wrote:

> Ok, so this is panicking because of the error handling when trying to
> create sysfs directories with the same name (in this case, :dt-0000064).
> I'll look into why this isn't failing gracefully later, but I isolated
> this to the new code that statically allocates the DMA caches in
> kmem_cache_init_late().

Hmm.... Strange. The DMA caches should create a distinct pattern there.

> The iteration runs from 0 to SLUB_PAGE_SHIFT; that's actually incorrect
> since the kmem_cache_node cache occupies the first spot in the
> kmalloc_caches array and has a size, 64 bytes, equal to a power of two
> that is duplicated later. So this patch tries creating two DMA kmalloc
> caches with 64 byte object size which triggers a BUG_ON() during
> kmem_cache_release() in the error handling later.

The kmem_cache_node cache is no longer at position 0.
kmalloc_caches[0] should be NULL and therefore be skipped.

> The fix is to start the iteration at 1 instead of 0 so that all other
> caches have their equivalent DMA caches created and the special-case
> kmem_cache_node cache is excluded (see below).
>
> I'm really curious why nobody else ran into this problem before,
> especially if they have CONFIG_SLUB_DEBUG enabled so
> struct kmem_cache_node has the same size. Perhaps my early bug report
> caused people not to test the series...

Which patches were applied?

> - the entire iteration in kmem_cache_init_late() needs to be protected by
> slub_lock. The comment in create_kmalloc_cache() should be revised
> since you're no longer calling it only with irqs disabled.
> kmem_cache_init_late() has irqs enabled and, thus, slab_caches must be
> protected.

I moved it to kmem_cache_init() which is run when we only have one
execution thread. That takes care of the issue and ensures that the dma
caches are available as early as before.

> - a BUG_ON(!name) needs to be added in kmem_cache_init_late() when
> kasprintf() returns NULL. This isn't checked in kmem_cache_open() so
> it'll only encounter a problem in the sysfs layer. Adding a BUG_ON()
> will help track those down.

Ok.

2010-08-17 17:29:56

by Christoph Lameter

Subject: Re: [S+Q3 00/23] SLUB: The Unified slab allocator (V3)

On Tue, 17 Aug 2010, Christoph Lameter wrote:

> > I'm really curious why nobody else ran into this problem before,
> > especially if they have CONFIG_SLUB_DEBUG enabled so
> > struct kmem_cache_node has the same size. Perhaps my early bug report
> > caused people not to test the series...
>
> Which patches were applied?


If you do not apply all patches then you can be at a stage where
kmalloc_caches[0] is still used for kmem_cache_node. Then things break.

2010-08-17 18:02:23

by David Rientjes

Subject: Re: [S+Q3 00/23] SLUB: The Unified slab allocator (V3)

On Tue, 17 Aug 2010, Christoph Lameter wrote:

> > Ok, so this is panicking because of the error handling when trying to
> > create sysfs directories with the same name (in this case, :dt-0000064).
> > I'll look into why this isn't failing gracefully later, but I isolated
> > this to the new code that statically allocates the DMA caches in
> > kmem_cache_init_late().
>
> Hmm.... Strange. The DMA caches should create a distinct pattern there.
>

They do after patch 11, when you introduce dynamically sized kmalloc
caches, but not when only patches 1-8 are applied. Since this wasn't
booting on my system, I bisected the problem to patch 8, where
kmem_cache_init_late() would create two DMA caches of size 64 bytes: one
because of kmalloc_caches[0] (kmem_cache_node) and one because of
kmalloc_caches[6] (2^6 = 64). So my fixes are necessary for patch 8 but
obsoleted later, and then the shared cache support panics on memset().

> > - the entire iteration in kmem_cache_init_late() needs to be protected by
> > slub_lock. The comment in create_kmalloc_cache() should be revised
> > since you're no longer calling it only with irqs disabled.
> > kmem_cache_init_late() has irqs enabled and, thus, slab_caches must be
> > protected.
>
> I moved it to kmem_cache_init() which is run when we only have one
> execution thread. That takes care of the issue and ensures that the dma
> caches are available as early as before.
>

I didn't know if that was a debugging patch for me or if you wanted to
push that as part of your series. I'm not sure if you actually need to
move it to kmem_cache_init() now that slub_state is protected by
slub_lock. I'm not sure if we want to allocate DMA objects between
kmem_cache_init() and kmem_cache_init_late().

2010-08-17 18:47:27

by Christoph Lameter

Subject: Re: [S+Q3 00/23] SLUB: The Unified slab allocator (V3)

On Tue, 17 Aug 2010, David Rientjes wrote:

> I didn't know if that was a debugging patch for me or if you wanted to
> push that as part of your series. I'm not sure if you actually need to
> move it to kmem_cache_init() now that slub_state is protected by
> slub_lock. I'm not sure if we want to allocate DMA objects between
> kmem_cache_init() and kmem_cache_init_late().

Drivers may allocate dma buffers during initialization.
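
A hypothetical example (driver name and buffer size are made up); this is
exactly the kind of early allocation that needs the dma kmalloc caches to
already exist:

/* needs <linux/slab.h>; a module_init() callback runs after kmem_cache_init() */
static int __init foo_init(void)
{
	void *buf = kmalloc(512, GFP_KERNEL | GFP_DMA);

	if (!buf)
		return -ENOMEM;
	kfree(buf);	/* only demonstrating the allocation path */
	return 0;
}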

2010-08-17 18:54:44

by David Rientjes

Subject: Re: [S+Q3 00/23] SLUB: The Unified slab allocator (V3)

On Tue, 17 Aug 2010, Christoph Lameter wrote:

> > I didn't know if that was a debugging patch for me or if you wanted to
> > push that as part of your series. I'm not sure if you actually need to
> > move it to kmem_cache_init() now that slub_state is protected by
> > slub_lock. I'm not sure if we want to allocate DMA objects between
> > kmem_cache_init() and kmem_cache_init_late().
>
> Drivers may allocate dma buffers during initialization.
>

Ok, I moved the DMA cache creation from kmem_cache_init_late() to
kmem_cache_init(). Note: the kasprintf() will need to use GFP_NOWAIT and
not GFP_KERNEL now.
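
I.e., in the moved loop, something like:

	char *name = kasprintf(GFP_NOWAIT, "dma-kmalloc-%d", s->objsize);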

2010-08-17 19:34:24

by Christoph Lameter

Subject: Re: [S+Q3 00/23] SLUB: The Unified slab allocator (V3)

On Tue, 17 Aug 2010, David Rientjes wrote:

> On Tue, 17 Aug 2010, Christoph Lameter wrote:
>
> > > I didn't know if that was a debugging patch for me or if you wanted to
> > > push that as part of your series. I'm not sure if you actually need to
> > > move it to kmem_cache_init() now that slub_state is protected by
> > > slub_lock. I'm not sure if we want to allocate DMA objects between
> > > kmem_cache_init() and kmem_cache_init_late().
> >
> > Drivers may allocate dma buffers during initialization.
> >
>
> Ok, I moved the DMA cache creation from kmem_cache_init_late() to
> kmem_cache_init(). Note: the kasprintf() will need to use GFP_NOWAIT and
> not GFP_KERNEL now.

Ok. I have revised the patch, since there is also a problem with the
indirection on kmalloc_caches.