2007-05-02 19:36:04

by Tim Chen

Subject: Regression with SLUB on Netperf and Volanomark

Christoph,

We tested SLUB on a 2-socket Clovertown (Core 2 CPU with 2 cores/socket)
and a 2-socket Woodcrest (Core 2 CPU with 4 cores/socket).

We found that for Netperf's TCP streaming tests in loopback mode,
TCP streaming performance is about 7% worse when SLUB is enabled on
the 2.6.21-rc7-mm1 kernel (x86_64). This test does a lot of sk_buff
allocation/deallocation.

For Volanomark, the performance is 7% worse for Woodcrest and 12% worse
for Clovertown.

Regards,
Tim


2007-05-02 19:47:36

by Christoph Lameter

Subject: Re: Regression with SLUB on Netperf and Volanomark

On Wed, 2 May 2007, Tim Chen wrote:

> We tested SLUB on a 2-socket Clovertown (Core 2 CPU with 2 cores/socket)
> and a 2-socket Woodcrest (Core 2 CPU with 4 cores/socket).

Try to boot with

slub_max_order=4 slub_min_objects=8

If that does not help, increase slub_min_objects to 16.
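
For reference, these are kernel boot parameters, so they go on the kernel
command line; an illustrative GRUB entry (the kernel image name and root
device below are placeholders, not taken from this thread) would be:

kernel /boot/vmlinuz-2.6.21-rc7-mm1 ro root=/dev/sda1 slub_max_order=4 slub_min_objects=8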

> We found that for Netperf's TCP streaming tests in loopback mode,
> TCP streaming performance is about 7% worse when SLUB is enabled on
> the 2.6.21-rc7-mm1 kernel (x86_64). This test does a lot of sk_buff
> allocation/deallocation.

2.6.21-rc7-mm2 contains some performance fixes that may or may not be
useful to you.
>
> For Volanomark, the performance is 7% worse for Woodcrest and 12% worse
> for Clovertown.

SLUB's "queueing" is restricted to the number of objects that fit in a
page-order slab. SLAB can queue more objects since it has true queues.
Increasing the page size that SLUB uses may fix the problem, but then we
run into higher page order issues.

Check slabinfo output for the network slabs and see what order is used.
The number of objects per slab is important for performance.
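
For reference, a quick way to check this (the cache name and sysfs paths
below are assumptions based on the SLAB/SLUB interfaces of that era and may
differ on this tree):

grep skbuff /proc/slabinfo                      # SLAB kernel: objperslab and pagesperslab columns
cat /sys/slab/skbuff_head_cache/order           # SLUB kernel: page order used for this cache
cat /sys/slab/skbuff_head_cache/objs_per_slab   # SLUB kernel: objects per slab

If the slabinfo tool is present in the tree (Documentation/vm/slabinfo.c),
it reports the same data in a condensed form.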

2007-05-03 23:28:18

by Chen, Tim C

Subject: RE: Regression with SLUB on Netperf and Volanomark

Christoph Lameter wrote:
> Try to boot with
>
> slub_max_order=4 slub_min_objects=8
>
> If that does not help increase slub_min_objects to 16.
>

After setting slub_max_order=4 and increasing slub_min_objects to 16 on
the 2.6.21-rc7-mm2 kernel, we are still seeing a 5% regression on TCP
streaming and a 10% regression for Volanomark. Performance with
slub_min_objects=8 and slub_min_objects=16 is similar.

>> We found that for Netperf's TCP streaming tests in loopback mode,
>> TCP streaming performance is about 7% worse when SLUB is enabled on
>> the 2.6.21-rc7-mm1 kernel (x86_64). This test does a lot of sk_buff
>> allocation/deallocation.
>
> 2.6.21-rc7-mm2 contains some performance fixes that may or may not be
> useful to you.

We've switched to 2.6.21-rc7-mm2 in our tests now.

>>
>> For Volanomark, the performance is 7% worse for Woodcrest and 12%
>> worse for Clovertown.
>
> SLUB's "queueing" is restricted to the number of objects that fit in a
> page-order slab. SLAB can queue more objects since it has true queues.
> Increasing the page size that SLUB uses may fix the problem, but then
> we run into higher page order issues.
>
> Check slabinfo output for the network slabs and see what order is
> used. The number of objects per slab is important for performance.

The order used is 0 for the buffer_head, which is the most used object.

I think they are 104 bytes per object.

Tim

2007-05-04 00:44:22

by Christoph Lameter

Subject: RE: Regression with SLUB on Netperf and Volanomark

On Thu, 3 May 2007, Chen, Tim C wrote:

> After setting slub_max_order=4 and increasing slub_min_objects to 16 on
> the 2.6.21-rc7-mm2 kernel, we are still seeing a 5% regression on TCP
> streaming and a 10% regression for Volanomark. Performance with
> slub_min_objects=8 and slub_min_objects=16 is similar.

Ok. We then need to look at partial list management. It could be that the
sequence of partials is reversed. The problem is that I do not really
have time to concentrate on performance right now. Stability comes
first. We will likely end up putting some probes in there to find out
where the overhead comes from.

> > Check slabinfo output for the network slabs and see what order is
> > used. The number of objects per slab is important for performance.
>
> The order used is 0 for the buffer_head, which is the most used object.
>
> I think they are 104 bytes per object.

Hmmm.... Then it was not affected by slub_max_order? Try
slub_min_order=1 or 2 to increase that?

2007-05-04 01:45:51

by Christoph Lameter

Subject: RE: Regression with SLUB on Netperf and Volanomark

Hmmmm... One potential issue is the complicated way the slab is
handled. Could you try this patch and see what impact it has?

If it has any effect, then remove the cacheline alignment and see how
that influences things.


Remove constructor from buffer_head

Buffer head management uses a constructor which increases overhead
for object handling. Remove the constructor. That way SLUB can place
the freepointer in an optimal location instead of after the object
in potentially another cache line.

Also, having no constructor makes allocation and disposal of slabs
from the page allocator much easier, since no pass over the allocated
objects to call constructors is necessary. SLUB can directly begin by
serving the first object.

Plus it simplifies the code and removes a difficult-to-understand
element of buffer handling.

Align the buffer heads on cacheline boundaries for best performance.

Signed-off-by: Christoph Lameter <[email protected]>

---
fs/buffer.c | 22 ++++------------------
include/linux/buffer_head.h | 2 +-
2 files changed, 5 insertions(+), 19 deletions(-)

Index: slub/fs/buffer.c
===================================================================
--- slub.orig/fs/buffer.c 2007-04-30 22:03:21.000000000 -0700
+++ slub/fs/buffer.c 2007-05-03 18:37:47.000000000 -0700
@@ -2907,9 +2907,10 @@ static void recalc_bh_state(void)

struct buffer_head *alloc_buffer_head(gfp_t gfp_flags)
{
- struct buffer_head *ret = kmem_cache_alloc(bh_cachep,
+ struct buffer_head *ret = kmem_cache_zalloc(bh_cachep,
set_migrateflags(gfp_flags, __GFP_RECLAIMABLE));
if (ret) {
+ INIT_LIST_HEAD(&ret->b_assoc_buffers);
get_cpu_var(bh_accounting).nr++;
recalc_bh_state();
put_cpu_var(bh_accounting);
@@ -2928,17 +2929,6 @@ void free_buffer_head(struct buffer_head
}
EXPORT_SYMBOL(free_buffer_head);

-static void
-init_buffer_head(void *data, struct kmem_cache *cachep, unsigned long flags)
-{
- if (flags & SLAB_CTOR_CONSTRUCTOR) {
- struct buffer_head * bh = (struct buffer_head *)data;
-
- memset(bh, 0, sizeof(*bh));
- INIT_LIST_HEAD(&bh->b_assoc_buffers);
- }
-}
-
static void buffer_exit_cpu(int cpu)
{
int i;
@@ -2965,12 +2955,8 @@ void __init buffer_init(void)
{
int nrpages;

- bh_cachep = kmem_cache_create("buffer_head",
- sizeof(struct buffer_head), 0,
- (SLAB_RECLAIM_ACCOUNT|SLAB_PANIC|
- SLAB_MEM_SPREAD),
- init_buffer_head,
- NULL);
+ bh_cachep = KMEM_CACHE(buffer_head,
+ SLAB_RECLAIM_ACCOUNT|SLAB_PANIC|SLAB_MEM_SPREAD);

/*
* Limit the bh occupancy to 10% of ZONE_NORMAL
Index: slub/include/linux/buffer_head.h
===================================================================
--- slub.orig/include/linux/buffer_head.h 2007-05-03 18:40:51.000000000 -0700
+++ slub/include/linux/buffer_head.h 2007-05-03 18:41:07.000000000 -0700
@@ -73,7 +73,7 @@ struct buffer_head {
struct address_space *b_assoc_map; /* mapping this buffer is
associated with */
atomic_t b_count; /* users using this buffer_head */
-};
+} ____cacheline_aligned_in_smp;

/*
* macro tricks to expand the set_buffer_foo(), clear_buffer_foo()

2007-05-04 02:42:06

by Christoph Lameter

Subject: RE: Regression with SLUB on Netperf and Volanomark

Hmmmm... I do not see a regression (up-to-date SLUB with all outstanding
patches applied). This is without any options enabled (but the antifrag
patches are present, so slub_max_order=4 slub_min_objects=16). Could you
post a .config? Missing patches against 2.6.21-rc7-mm2 can be found at
http://ftp.kernel.org/pub/linux/kernel/people/christoph/slub-patches

slab

TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to localhost
(127.0.0.1) port 0 AF_INET
Recv   Send    Send
Socket Socket  Message  Elapsed
Size   Size    Size     Time     Throughput
bytes  bytes   bytes    secs.    10^6bits/sec

87380  16384   16384    10.01    6068.61
87380  16384   16384    10.01    5877.91
87380  16384   16384    10.01    5835.68
87380  16384   16384    10.01    5840.58

slub

TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to localhost (127.0.0.1) port 0 AF_INET
Recv   Send    Send
Socket Socket  Message  Elapsed
Size   Size    Size     Time     Throughput
bytes  bytes   bytes    secs.    10^6bits/sec

87380  16384   16384    10.53    5646.53
87380  16384   16384    10.01    6073.09
87380  16384   16384    10.01    6094.68
87380  16384   16384    10.01    6088.50


2007-05-04 18:07:18

by Tim Chen

Subject: RE: Regression with SLUB on Netperf and Volanomark

On Thu, 2007-05-03 at 19:42 -0700, Christoph Lameter wrote:
> Hmmmm... I do not see a regression (up-to-date SLUB with all outstanding
> patches applied). This is without any options enabled (but the antifrag
> patches are present, so slub_max_order=4 slub_min_objects=16). Could you
> post a .config?

Attached is the config file I used.

A side note: for my tests, I bound the netserver and the client to
separate cpu cores on different sockets, to make sure that the server
and client do not share the same cache.

Tim


Attachments:
config-2.6.21-rc7mm2 (36.59 kB)

2007-05-04 18:10:54

by Christoph Lameter

Subject: RE: Regression with SLUB on Netperf and Volanomark

On Fri, 4 May 2007, Tim Chen wrote:

> A side note: for my tests, I bound the netserver and the client to
> separate cpu cores on different sockets, to make sure that the server
> and client do not share the same cache.

Ahhh... You have some scripts that you run. Care to share?

This is not a NUMA system? Two processors in an SMP system?

So it's likely an issue of partial slabs shifting between multiple cpus:
if a partial slab is now used on the other cpu then it may be cache cold
there. Different sockets mean limited FSB bandwidth and bad caching
effects. I hope I can reproduce this somewhere.

2007-05-04 18:27:57

by Christoph Lameter

Subject: RE: Regression with SLUB on Netperf and Volanomark

If I now optimize for the case that we do not share the cpu cache between
different cpus, then performance may drop for the case in which we do
share the cache (hyperthreading).

If we do not share the cache, then processors essentially need to have
their own lists of partial slabs in which they keep cache-hot objects
(something mini-NUMA like). Any write to a shared object will cause a
cacheline eviction on the other cpu, which is not good.

If they do share the cpu cache then they need to have a shared list of
partial slabs.

Not sure where to go here. Increasing the per cpu slab size may hold off
the issue up to a certain cpu cache size. For that we would need to
identify which slabs create the performance issue.

One easy way to check that this is indeed the case: Enable fake NUMA. You
will then have separate queues for each processor since they are on
different "nodes". Create two fake nodes. Run one thread in each node and
see if this fixes it.

2007-05-04 18:31:25

by Tim Chen

Subject: RE: Regression with SLUB on Netperf and Volanomark

On Fri, 2007-05-04 at 11:10 -0700, Christoph Lameter wrote:
> On Fri, 4 May 2007, Tim Chen wrote:
>
> > A side note: for my tests, I bound the netserver and the client to
> > separate cpu cores on different sockets, to make sure that the server
> > and client do not share the same cache.
>
> Ahhh... You have some scripts that you run. Care to share?

I do

taskset -c 1 netserver

and

taskset -c 2 netperf -t TCP_STREAM -l 60 -H 127.0.0.1 -- -s 57344 -S 57344 -m 4096

>
> This is not a NUMA system? Two processors in an SMP system?

Yes, it is an SMP system with 2 sockets. Each socket has 4 cores.

Tim

2007-05-04 23:35:46

by Tim Chen

Subject: RE: Regression with SLUB on Netperf and Volanomark

On Fri, 2007-05-04 at 11:27 -0700, Christoph Lameter wrote:

>
> Not sure where to go here. Increasing the per cpu slab size may hold off
> the issue up to a certain cpu cache size. For that we would need to
> identify which slabs create the performance issue.
>
> One easy way to check that this is indeed the case: Enable fake NUMA. You
> will then have separate queues for each processor since they are on
> different "nodes". Create two fake nodes. Run one thread in each node and
> see if this fixes it.

I tried with fake NUMA (booted with numa=fake=2) and used

numactl --physcpubind=1 --membind=0 ./netserver
numactl --physcpubind=2 --membind=1 ./netperf -t TCP_STREAM -l 60 -H 127.0.0.1 -i 5,5 -I 99,5 -- -s 57344 -S 57344 -m 4096

to run the tests. The results are about the same as the non-NUMA case,
with slab about 5% better than slub.

So the difference is probably due to some reason other than the partial
slab handling. The kernel config file is attached.

Tim





Attachments:
config-numa-slub (24.83 kB)

2007-05-04 23:59:18

by Christoph Lameter

Subject: RE: Regression with SLUB on Netperf and Volanomark

On Fri, 4 May 2007, Tim Chen wrote:

> On Fri, 2007-05-04 at 11:27 -0700, Christoph Lameter wrote:
>
> >
> > Not sure where to go here. Increasing the per cpu slab size may hold off
> > the issue up to a certain cpu cache size. For that we would need to
> > identify which slabs create the performance issue.
> >
> > One easy way to check that this is indeed the case: Enable fake NUMA. You
> > will then have separate queues for each processor since they are on
> > different "nodes". Create two fake nodes. Run one thread in each node and
> > see if this fixes it.
>
> I tried with fake NUMA (booted with numa=fake=2) and used
>
> numactl --physcpubind=1 --membind=0 ./netserver
> numactl --physcpubind=2 --membind=1 ./netperf -t TCP_STREAM -l 60 -H 127.0.0.1 -i 5,5 -I 99,5 -- -s 57344 -S 57344 -m 4096
>
> to run the tests. The results are about the same as the non-NUMA case,
> with slab about 5% better than slub.

Hmmmm... both tests were run in the same context? NUMA has additional
overhead in other areas.

2007-05-05 00:33:36

by Tim Chen

Subject: RE: Regression with SLUB on Netperf and Volanomark

On Thu, 2007-05-03 at 18:45 -0700, Christoph Lameter wrote:
> Hmmmm... One potential issue is the complicated way the slab is
> handled. Could you try this patch and see what impact it has?
>
The patch boosts the throughput of the TCP_STREAM test by 5%, for both
slab and slub. But slab is still 5% better in my tests.

> If it has any effect, then remove the cacheline alignment and see how
> that influences things.

Removing the cacheline alignment didn't change the throughput.

Tim

2007-05-05 00:35:09

by Tim Chen

Subject: RE: Regression with SLUB on Netperf and Volanomark

On Fri, 2007-05-04 at 16:59 -0700, Christoph Lameter wrote:

> >
> > to run the tests. The results are about the same as the non-NUMA case,
> > with slab about 5% better than slub.
>
> Hmmmm... both tests were run in the same context? NUMA has additional
> overhead in other areas.

Both the slab and slub tests were run with the same NUMA options and
config.

Tim

2007-05-05 01:02:34

by Christoph Lameter

Subject: RE: Regression with SLUB on Netperf and Volanomark

On Fri, 4 May 2007, Tim Chen wrote:

> On Thu, 2007-05-03 at 18:45 -0700, Christoph Lameter wrote:
> > Hmmmm... One potential issue is the complicated way the slab is
> > handled. Could you try this patch and see what impact it has?
> >
> The patch boosts the throughput of the TCP_STREAM test by 5%, for both
> slab and slub. But slab is still 5% better in my tests.

Really? Buffer head handling improves TCP performance? I think you have
run-to-run variance. I need to look at this myself.


2007-05-05 01:41:08

by Christoph Lameter

Subject: RE: Regression with SLUB on Netperf and Volanomark

If you want to test some more: here is a patch that removes the atomic ops
from the allocation path. But I only see minor improvements on my amd64
box here.



Avoid the use of atomics in slab_alloc

This only increases netperf performance by 1%. Wonder why?

What we do is add the last free field in the page struct to set up
a separate per cpu freelist. From that one we can allocate without
taking the slab lock because we check out the complete list of free
objects when we first touch the slab.

This allows concurrent allocations and frees from the same slab using
two mutually exclusive freelists. If the allocator is running out of
its per cpu freelist then it will consult the per slab freelist and reload
if objects were freed in it.

Signed-off-by: Christoph Lameter <[email protected]>

---
include/linux/mm_types.h | 5 ++++-
mm/slub.c | 44 +++++++++++++++++++++++++++++++++++---------
2 files changed, 39 insertions(+), 10 deletions(-)

Index: slub/include/linux/mm_types.h
===================================================================
--- slub.orig/include/linux/mm_types.h 2007-05-04 17:39:33.000000000 -0700
+++ slub/include/linux/mm_types.h 2007-05-04 17:58:38.000000000 -0700
@@ -50,9 +50,12 @@ struct page {
spinlock_t ptl;
#endif
struct { /* SLUB uses */
- struct page *first_page; /* Compound pages */
+ void **active_freelist; /* Allocation freelist */
struct kmem_cache *slab; /* Pointer to slab */
};
+ struct {
+ struct page *first_page; /* Compound pages */
+ };
};
union {
pgoff_t index; /* Our offset within mapping. */
Index: slub/mm/slub.c
===================================================================
--- slub.orig/mm/slub.c 2007-05-04 17:40:50.000000000 -0700
+++ slub/mm/slub.c 2007-05-04 18:14:23.000000000 -0700
@@ -845,6 +845,7 @@ static struct page *new_slab(struct kmem
page->offset = s->offset / sizeof(void *);
page->slab = s;
page->inuse = 0;
+ page->active_freelist = NULL;
start = page_address(page);
end = start + s->objects * s->size;

@@ -1137,6 +1138,19 @@ static void putback_slab(struct kmem_cac
*/
static void deactivate_slab(struct kmem_cache *s, struct page *page, int cpu)
{
+ /* Two freelists that are now to be consolidated */
+ while (unlikely(page->active_freelist)) {
+ void **object;
+
+ /* Retrieve object from active_freelist */
+ object = page->active_freelist;
+ page->active_freelist = page->active_freelist[page->offset];
+
+ /* And put onto the regular freelist */
+ object[page->offset] = page->freelist;
+ page->freelist = object;
+ page->inuse--;
+ }
s->cpu_slab[cpu] = NULL;
ClearPageActive(page);

@@ -1206,25 +1220,32 @@ static void *slab_alloc(struct kmem_cach
local_irq_save(flags);
cpu = smp_processor_id();
page = s->cpu_slab[cpu];
- if (!page)
+ if (unlikely(!page))
goto new_slab;

+ if (likely(page->active_freelist)) {
+fast_object:
+ object = page->active_freelist;
+ page->active_freelist = object[page->offset];
+ local_irq_restore(flags);
+ return object;
+ }
+
slab_lock(page);
if (unlikely(node != -1 && page_to_nid(page) != node))
goto another_slab;
redo:
- object = page->freelist;
- if (unlikely(!object))
+ if (unlikely(!page->freelist))
goto another_slab;
if (unlikely(PageError(page)))
goto debug;

-have_object:
- page->inuse++;
- page->freelist = object[page->offset];
+ /* Reload the active freelist */
+ page->active_freelist = page->freelist;
+ page->freelist = NULL;
+ page->inuse = s->objects;
slab_unlock(page);
- local_irq_restore(flags);
- return object;
+ goto fast_object;

another_slab:
deactivate_slab(s, page, cpu);
@@ -1267,6 +1288,7 @@ have_slab:
local_irq_restore(flags);
return NULL;
debug:
+ object = page->freelist;
if (!alloc_object_checks(s, page, object))
goto another_slab;
if (s->flags & SLAB_STORE_USER)
@@ -1278,7 +1300,11 @@ debug:
dump_stack();
}
init_object(s, object, 1);
- goto have_object;
+ page->freelist = object[page->offset];
+ page->inuse++;
+ slab_unlock(page);
+ local_irq_restore(flags);
+ return object;
}

void *kmem_cache_alloc(struct kmem_cache *s, gfp_t gfpflags)

2007-05-05 02:05:06

by Christoph Lameter

Subject: RE: Regression with SLUB on Netperf and Volanomark

Got something.... If I remove the atomics from both alloc and free then I
get a performance jump. But maybe also a runtime variation????

Avoid the use of atomics in slab_alloc

About 5-7% performance gain. Or am I also seeing runtime variations?

What we do is add the last free field in the page struct to set up
a separate per cpu freelist. From that one we can allocate without
taking the slab lock because we check out the complete list of free
objects when we first touch the slab. If we have an active list
then we can also free to that list if we run on that processor
without taking the slab lock.

This allows even concurrent allocations and frees from the same slab using
two mutually exclusive freelists. If the allocator is running out of
its per cpu freelist then it will consult the per slab freelist and reload
if objects were freed in it.

Signed-off-by: Christoph Lameter <[email protected]>

---
include/linux/mm_types.h | 5 +++-
mm/slub.c | 54 +++++++++++++++++++++++++++++++++++++++--------
2 files changed, 49 insertions(+), 10 deletions(-)

Index: slub/include/linux/mm_types.h
===================================================================
--- slub.orig/include/linux/mm_types.h 2007-05-04 18:58:06.000000000 -0700
+++ slub/include/linux/mm_types.h 2007-05-04 18:59:42.000000000 -0700
@@ -50,9 +50,12 @@ struct page {
spinlock_t ptl;
#endif
struct { /* SLUB uses */
- struct page *first_page; /* Compound pages */
+ void **cpu_freelist; /* Per cpu freelist */
struct kmem_cache *slab; /* Pointer to slab */
};
+ struct {
+ struct page *first_page; /* Compound pages */
+ };
};
union {
pgoff_t index; /* Our offset within mapping. */
Index: slub/mm/slub.c
===================================================================
--- slub.orig/mm/slub.c 2007-05-04 18:58:06.000000000 -0700
+++ slub/mm/slub.c 2007-05-04 19:02:33.000000000 -0700
@@ -845,6 +845,7 @@ static struct page *new_slab(struct kmem
page->offset = s->offset / sizeof(void *);
page->slab = s;
page->inuse = 0;
+ page->cpu_freelist = NULL;
start = page_address(page);
end = start + s->objects * s->size;

@@ -1137,6 +1138,23 @@ static void putback_slab(struct kmem_cac
*/
static void deactivate_slab(struct kmem_cache *s, struct page *page, int cpu)
{
+ /*
+ * Merge cpu freelist into freelist. Typically we get here
+ * because both freelists are empty. So this is unlikely
+ * to occur.
+ */
+ while (unlikely(page->cpu_freelist)) {
+ void **object;
+
+ /* Retrieve object from cpu_freelist */
+ object = page->cpu_freelist;
+ page->cpu_freelist = page->cpu_freelist[page->offset];
+
+ /* And put onto the regular freelist */
+ object[page->offset] = page->freelist;
+ page->freelist = object;
+ page->inuse--;
+ }
s->cpu_slab[cpu] = NULL;
ClearPageActive(page);

@@ -1206,25 +1224,32 @@ static void *slab_alloc(struct kmem_cach
local_irq_save(flags);
cpu = smp_processor_id();
page = s->cpu_slab[cpu];
- if (!page)
+ if (unlikely(!page))
goto new_slab;

+ if (likely(page->cpu_freelist)) {
+fast_object:
+ object = page->cpu_freelist;
+ page->cpu_freelist = object[page->offset];
+ local_irq_restore(flags);
+ return object;
+ }
+
slab_lock(page);
if (unlikely(node != -1 && page_to_nid(page) != node))
goto another_slab;
redo:
- object = page->freelist;
- if (unlikely(!object))
+ if (unlikely(!page->freelist))
goto another_slab;
if (unlikely(PageError(page)))
goto debug;

-have_object:
- page->inuse++;
- page->freelist = object[page->offset];
+ /* Reload the cpu freelist */
+ page->cpu_freelist = page->freelist;
+ page->freelist = NULL;
+ page->inuse = s->objects;
slab_unlock(page);
- local_irq_restore(flags);
- return object;
+ goto fast_object;

another_slab:
deactivate_slab(s, page, cpu);
@@ -1267,6 +1292,7 @@ have_slab:
local_irq_restore(flags);
return NULL;
debug:
+ object = page->freelist;
if (!alloc_object_checks(s, page, object))
goto another_slab;
if (s->flags & SLAB_STORE_USER)
@@ -1278,7 +1304,11 @@ debug:
dump_stack();
}
init_object(s, object, 1);
- goto have_object;
+ page->freelist = object[page->offset];
+ page->inuse++;
+ slab_unlock(page);
+ local_irq_restore(flags);
+ return object;
}

void *kmem_cache_alloc(struct kmem_cache *s, gfp_t gfpflags)
@@ -1309,6 +1339,12 @@ static void slab_free(struct kmem_cache
unsigned long flags;

local_irq_save(flags);
+ if (!PageError(page) && page == s->cpu_slab[smp_processor_id()]) {
+ object[page->offset] = page->cpu_freelist;
+ page->cpu_freelist = object;
+ local_irq_restore(flags);
+ return;
+ }
slab_lock(page);

if (unlikely(PageError(page)))

2007-05-08 01:32:53

by Tim Chen

Subject: RE: Regression with SLUB on Netperf and Volanomark

On Fri, 2007-05-04 at 18:02 -0700, Christoph Lameter wrote:
> On Fri, 4 May 2007, Tim Chen wrote:
>
> > On Thu, 2007-05-03 at 18:45 -0700, Christoph Lameter wrote:
> > > Hmmmm... One potential issue is the complicated way the slab is
> > > handled. Could you try this patch and see what impact it has?
> > >
> > The patch boosts the throughput of the TCP_STREAM test by 5%, for both
> > slab and slub. But slab is still 5% better in my tests.
>
> Really? Buffer head handling improves TCP performance? I think you have
> run-to-run variance. I need to look at this myself.

I think the object of interest should be sk_buff, not buffer_head. I
made the mistake of accidentally using another config after applying your
buffer head patch. I compared the kernels again under the same config,
with and without the buffer_head patch. There is no boost to the
TCP_STREAM test from the patch. So things make sense again. My apologies
for the error.

However, the output from TCP_STREAM is quite stable.
I am still seeing a 4% difference between the SLAB and SLUB kernels.
Looking at the L2 cache miss rate with emon, I saw 6% more cache misses
on the client side with SLUB. The server side has the same amount of
cache misses. This test was run in SMP mode with the client and server
bound to different cores on separate packages.

Thanks.

Tim

2007-05-08 01:50:00

by Christoph Lameter

Subject: RE: Regression with SLUB on Netperf and Volanomark

On Mon, 7 May 2007, Tim Chen wrote:

> However, the output from TCP_STREAM is quite stable.
> I am still seeing a 4% difference between the SLAB and SLUB kernels.
> Looking at the L2 cache miss rate with emon, I saw 6% more cache misses
> on the client side with SLUB. The server side has the same amount of
> cache misses. This test was run in SMP mode with the client and server
> bound to different cores on separate packages.

Could you try the following patch on top of 2.6.21-mm1 with the patches
from http://ftp.kernel.org/pub/linux/kernel/people/christoph/slub-patches?

I sent it to you before. This one is an updated version.



Avoid atomic overhead in slab_alloc and slab_free

SLUB needs to use the slab_lock for the per cpu slabs to synchronize
with potential kfree operations. This patch avoids that need by moving
all free objects onto a lockless_freelist. The regular freelist
continues to exist and will be used to free objects. So while we consume
the lockless_freelist the regular freelist may build up objects.
If we are out of objects on the lockless_freelist then we may check
the regular freelist. If it has objects then we move those over to the
lockless_freelist and do this again. There is a significant savings
in terms of atomic operations that have to be performed.

We can even free directly to the lockless_freelist if we know that we
are running on the same processor. So this speeds up short-lived
objects. They may be allocated and freed without taking the slab_lock.
This is particularly good for netperf.

In order to maximize the effect of the new faster hotpath we extract the
hottest performance pieces into inlined functions. These are then inlined
into kmem_cache_alloc and kmem_cache_free. So the hotpath allocation and
freeing no longer requires a subroutine call within SLUB.

Signed-off-by: Christoph Lameter <[email protected]>

---
include/linux/mm_types.h | 7 +-
mm/slub.c | 154 ++++++++++++++++++++++++++++++++++++-----------
2 files changed, 123 insertions(+), 38 deletions(-)

Index: slub/include/linux/mm_types.h
===================================================================
--- slub.orig/include/linux/mm_types.h 2007-05-07 17:31:11.000000000 -0700
+++ slub/include/linux/mm_types.h 2007-05-07 17:33:54.000000000 -0700
@@ -50,13 +50,16 @@ struct page {
spinlock_t ptl;
#endif
struct { /* SLUB uses */
- struct page *first_page; /* Compound pages */
+ void **lockless_freelist;
struct kmem_cache *slab; /* Pointer to slab */
};
+ struct {
+ struct page *first_page; /* Compound pages */
+ };
};
union {
pgoff_t index; /* Our offset within mapping. */
- void *freelist; /* SLUB: pointer to free object */
+ void *freelist; /* SLUB: freelist req. slab lock */
};
struct list_head lru; /* Pageout list, eg. active_list
* protected by zone->lru_lock !
Index: slub/mm/slub.c
===================================================================
--- slub.orig/mm/slub.c 2007-05-07 17:31:11.000000000 -0700
+++ slub/mm/slub.c 2007-05-07 17:33:54.000000000 -0700
@@ -81,10 +81,14 @@
* PageActive The slab is used as a cpu cache. Allocations
* may be performed from the slab. The slab is not
* on any slab list and cannot be moved onto one.
+ * The cpu slab may be equipped with an additional
+ * lockless_freelist that allows lockless access to
+ * free objects in addition to the regular freelist
+ * that requires the slab lock.
*
* PageError Slab requires special handling due to debug
* options set. This moves slab handling out of
- * the fast path.
+ * the fast path and disables lockless freelists.
*/

static inline int SlabDebug(struct page *page)
@@ -1016,6 +1020,7 @@ static struct page *new_slab(struct kmem
set_freepointer(s, last, NULL);

page->freelist = start;
+ page->lockless_freelist = NULL;
page->inuse = 0;
out:
if (flags & __GFP_WAIT)
@@ -1278,6 +1283,23 @@ static void putback_slab(struct kmem_cac
*/
static void deactivate_slab(struct kmem_cache *s, struct page *page, int cpu)
{
+ /*
+ * Merge cpu freelist into freelist. Typically we get here
+ * because both freelists are empty. So this is unlikely
+ * to occur.
+ */
+ while (unlikely(page->lockless_freelist)) {
+ void **object;
+
+ /* Retrieve object from cpu_freelist */
+ object = page->lockless_freelist;
+ page->lockless_freelist = page->lockless_freelist[page->offset];
+
+ /* And put onto the regular freelist */
+ object[page->offset] = page->freelist;
+ page->freelist = object;
+ page->inuse--;
+ }
s->cpu_slab[cpu] = NULL;
ClearPageActive(page);

@@ -1324,47 +1346,46 @@ static void flush_all(struct kmem_cache
}

/*
- * slab_alloc is optimized to only modify two cachelines on the fast path
- * (aside from the stack):
+ * Slow path. The lockless freelist is empty or we need to perform
+ * debugging duties.
*
- * 1. The page struct
- * 2. The first cacheline of the object to be allocated.
+ * Interrupts are disabled.
*
- * The only other cache lines that are read (apart from code) is the
- * per cpu array in the kmem_cache struct.
+ * Processing is still very fast if new objects have been freed to the
+ * regular freelist. In that case we simply take over the regular freelist
+ * as the lockless freelist and zap the regular freelist.
*
- * Fastpath is not possible if we need to get a new slab or have
- * debugging enabled (which means all slabs are marked with SlabDebug)
+ * If that is not working then we fall back to the partial lists. We take the
+ * first element of the freelist as the object to allocate now and move the
+ * rest of the freelist to the lockless freelist.
+ *
+ * And if we were unable to get a new slab from the partial slab lists then
+ * we need to allocate a new slab. This is slowest path since we may sleep.
*/
-static void *slab_alloc(struct kmem_cache *s,
- gfp_t gfpflags, int node, void *addr)
+static void *__slab_alloc(struct kmem_cache *s,
+ gfp_t gfpflags, int node, void *addr, struct page *page)
{
- struct page *page;
void **object;
- unsigned long flags;
- int cpu;
+ int cpu = smp_processor_id();

- local_irq_save(flags);
- cpu = smp_processor_id();
- page = s->cpu_slab[cpu];
if (!page)
goto new_slab;

slab_lock(page);
if (unlikely(node != -1 && page_to_nid(page) != node))
goto another_slab;
-redo:
+load_freelist:
object = page->freelist;
if (unlikely(!object))
goto another_slab;
if (unlikely(SlabDebug(page)))
goto debug;

-have_object:
- page->inuse++;
- page->freelist = object[page->offset];
+ object = page->freelist;
+ page->lockless_freelist = object[page->offset];
+ page->inuse = s->objects;
+ page->freelist = NULL;
slab_unlock(page);
- local_irq_restore(flags);
return object;

another_slab:
@@ -1372,11 +1393,11 @@ another_slab:

new_slab:
page = get_partial(s, gfpflags, node);
- if (likely(page)) {
+ if (page) {
have_slab:
s->cpu_slab[cpu] = page;
SetPageActive(page);
- goto redo;
+ goto load_freelist;
}

page = new_slab(s, gfpflags, node);
@@ -1399,7 +1420,7 @@ have_slab:
discard_slab(s, page);
page = s->cpu_slab[cpu];
slab_lock(page);
- goto redo;
+ goto load_freelist;
}
/* New slab does not fit our expectations */
flush_slab(s, s->cpu_slab[cpu], cpu);
@@ -1407,16 +1428,52 @@ have_slab:
slab_lock(page);
goto have_slab;
}
- local_irq_restore(flags);
return NULL;
debug:
+ object = page->freelist;
if (!alloc_object_checks(s, page, object))
goto another_slab;
if (s->flags & SLAB_STORE_USER)
set_track(s, object, TRACK_ALLOC, addr);
trace(s, page, object, 1);
init_object(s, object, 1);
- goto have_object;
+
+ page->inuse++;
+ page->freelist = object[page->offset];
+ slab_unlock(page);
+ return object;
+}
+
+/*
+ * Inlined fastpath so that allocation functions (kmalloc, kmem_cache_alloc)
+ * have the fastpath folded into their functions. So no function call
+ * overhead for requests that can be satisfied on the fastpath.
+ *
+ * The fastpath works by first checking if the lockless freelist can be used.
+ * If not then __slab_alloc is called for slow processing.
+ *
+ * Otherwise we can simply pick the next object from the lockless free list.
+ */
+static void __always_inline *slab_alloc(struct kmem_cache *s,
+ gfp_t gfpflags, int node, void *addr)
+{
+ struct page *page;
+ void **object;
+ unsigned long flags;
+
+ local_irq_save(flags);
+ page = s->cpu_slab[smp_processor_id()];
+ if (unlikely(!page || !page->lockless_freelist ||
+ (node != -1 && page_to_nid(page) != node)))
+
+ object = __slab_alloc(s, gfpflags, node, addr, page);
+
+ else {
+ object = page->lockless_freelist;
+ page->lockless_freelist = object[page->offset];
+ }
+ local_irq_restore(flags);
+ return object;
}

void *kmem_cache_alloc(struct kmem_cache *s, gfp_t gfpflags)
@@ -1434,20 +1491,19 @@ EXPORT_SYMBOL(kmem_cache_alloc_node);
#endif

/*
- * The fastpath only writes the cacheline of the page struct and the first
- * cacheline of the object.
+ * Slow patch handling. This may still be called frequently since objects
+ * have a longer lifetime than the cpu slabs in most processing loads.
*
- * We read the cpu_slab cacheline to check if the slab is the per cpu
- * slab for this processor.
+ * So we still attempt to reduce cache line usage. Just take the slab
+ * lock and free the item. If there is no additional partial page
+ * handling required then we can return immediately.
*/
-static void slab_free(struct kmem_cache *s, struct page *page,
+static void __slab_free(struct kmem_cache *s, struct page *page,
void *x, void *addr)
{
void *prior;
void **object = (void *)x;
- unsigned long flags;

- local_irq_save(flags);
slab_lock(page);

if (unlikely(SlabDebug(page)))
@@ -1477,7 +1533,6 @@ checks_ok:

out_unlock:
slab_unlock(page);
- local_irq_restore(flags);
return;

slab_empty:
@@ -1489,7 +1544,6 @@ slab_empty:

slab_unlock(page);
discard_slab(s, page);
- local_irq_restore(flags);
return;

debug:
@@ -1504,6 +1558,34 @@ debug:
goto checks_ok;
}

+/*
+ * Fastpath with forced inlining to produce a kfree and kmem_cache_free that
+ * can perform fastpath freeing without additional function calls.
+ *
+ * The fastpath is only possible if we are freeing to the current cpu slab
+ * of this processor. This typically the case if we have just allocated
+ * the item before.
+ *
+ * If fastpath is not possible then fall back to __slab_free where we deal
+ * with all sorts of special processing.
+ */
+static void __always_inline slab_free(struct kmem_cache *s,
+ struct page *page, void *x, void *addr)
+{
+ void **object = (void *)x;
+ unsigned long flags;
+
+ local_irq_save(flags);
+ if (likely(page == s->cpu_slab[smp_processor_id()] &&
+ !SlabDebug(page))) {
+ object[page->offset] = page->lockless_freelist;
+ page->lockless_freelist = object;
+ } else
+ __slab_free(s, page, x, addr);
+
+ local_irq_restore(flags);
+}
+
void kmem_cache_free(struct kmem_cache *s, void *x)
{
struct page *page;

2007-05-08 04:57:36

by Christoph Lameter

Subject: RE: Regression with SLUB on Netperf and Volanomark

On Mon, 7 May 2007, Tim Chen wrote:

> However, the output from TCP_STREAM is quite stable.
> I am still seeing a 4% difference between the SLAB and SLUB kernels.
> Looking at the L2 cache miss rate with emon, I saw 6% more cache misses
> on the client side with SLUB. The server side has the same amount of
> cache misses. This test was run in SMP mode with the client and server
> bound to different cores on separate packages.

If this is cache miss related then a larger page order may take care
of this. Boot with (assuming you have at least 2.6.21-mm1...)

slub_min_order=6 slub_max_order=7

which will give you an allocation unit of 256k (2^6 pages of 4k each).
I just tried it. It actually works but has no effect here whatsoever on
UP netperf performance. netperf performance dropped from 6MB (slab) /
6.2MB (slub) on 2.6.21-rc7-mm1 to 4.5MB (both) on 2.6.21-mm1. So I guess
there is something also going on in the networking layer.

Still have not found a machine here where I could repeat your
results.

2007-05-08 21:54:23

by Tim Chen

Subject: RE: Regression with SLUB on Netperf and Volanomark

On Mon, 2007-05-07 at 18:49 -0700, Christoph Lameter wrote:
> On Mon, 7 May 2007, Tim Chen wrote:
>
> > However, the output from TCP_STREAM is quite stable.
> > I am still seeing a 4% difference between the SLAB and SLUB kernels.
> > Looking at the L2 cache miss rate with emon, I saw 6% more cache misses
> > on the client side with SLUB. The server side has the same amount of
> > cache misses. This test was run in SMP mode with the client and server
> > bound to different cores on separate packages.
>
> Could you try the following patch on top of 2.6.21-mm1 with the patches
> from http://ftp.kernel.org/pub/linux/kernel/people/christoph/slub-patches?
>
> I sent it to you before. This one is an updated version.
>
>
>
> Avoid atomic overhead in slab_alloc and slab_free
>

I tried the slub-patches and the avoid atomic overhead patch against
2.6.21-mm1. It brings the TCP_STREAM performance for SLUB to the SLAB
level. The patches not mentioned in the "series" file did not apply
cleanly to 2.6.21-mm1 and I skipped most of those.

Patches applied are:
http://ftp.kernel.org/pub/linux/kernel/people/christoph/slub-patches/series
+ dentry_target_reclaimed + kmem_cache_ops + slub_stats + skip_atomic_overhead

Without the skip_atomic_overhead patch, the throughput drops by 1 to 1.5%.

The change from slub_min_order=0 slub_max_order=4
to slub_min_order=6 slub_max_order=7 did not make much difference in
my tests.


Tim

2007-05-08 22:02:55

by Christoph Lameter

Subject: RE: Regression with SLUB on Netperf and Volanomark

On Tue, 8 May 2007, Tim Chen wrote:

> I tried the slub-patches and the avoid atomic overhead patch against
> 2.6.21-mm1. It brings the TCP_STREAM performance for SLUB to the SLAB
> level. The patches not mentioned in the "series" file did not apply
> cleanly to 2.6.21-mm1 and I skipped most of those.

Ahhh. Great. Correct, the patches not mentioned should not be applied.

> Without the skip_atomic_overhead patch, the throughput drops by 1 to 1.5%.
>
> The change from slub_min_order=0 slub_max_order=4
> to slub_min_order=6 slub_max_order=7 did not make much difference in
> my tests.

All right. I will then put that patch in.