2010-07-09 19:12:19

by Christoph Lameter

Subject: [S+Q2 00/19] SLUB with queueing (V2) beats SLAB netperf TCP_RR

The following patchset cleans some pieces up and then equips SLUB with
per-cpu queues that work similarly to SLAB's queues. With that approach
SLUB wins significantly in hackbench and also improves on TCP_RR.

Hackbench test script:

#!/bin/bash
uname -a
echo "./hackbench 100 process 200000"
./hackbench 100 process 200000
echo "./hackbench 100 process 20000"
./hackbench 100 process 20000
echo "./hackbench 100 process 20000"
./hackbench 100 process 20000
echo "./hackbench 100 process 20000"
./hackbench 100 process 20000
echo "./hackbench 10 process 20000"
./hackbench 10 process 20000
echo "./hackbench 10 process 20000"
./hackbench 10 process 20000
echo "./hackbench 10 process 20000"
./hackbench 10 process 20000
echo "./hackbench 1 process 20000"
./hackbench 1 process 20000
echo "./hackbench 1 process 20000"
./hackbench 1 process 20000
echo "./hackbench 1 process 20000"
./hackbench 1 process 20000

Dell Dual Quad Penryn on Linux 2.6.35-rc3
Time measurements: Smaller is better:

Procs  NR      SLAB    SLUB    SLUB+Queuing  %
-------------------------------------------------------------
100    200000  2741.3  2764.7  2231.9        -18
100    20000   279.3   270.3   219.0         -27
100    20000   278.0   273.1   219.2         -26
100    20000   279.0   271.7   218.8         -27
10     20000   34.0    35.6    28.8          -18
10     20000   30.3    35.2    28.4          -6
10     20000   32.9    34.6    28.4          -15
1      20000   6.4     6.7     6.5           +1
1      20000   6.3     6.8     6.5           +3
1      20000   6.4     6.9     6.4           0


SLUB+Q also wins against SLAB in netperf:

Script:

#!/bin/bash

TIME=60 # seconds
HOSTNAME=localhost # netserver

NR_CPUS=$(grep ^processor /proc/cpuinfo | wc -l)
echo NR_CPUS=$NR_CPUS

# Start $1 netperf TCP_RR clients in the background.
run_netperf() {
	for i in $(seq 1 $1); do
		netperf -H $HOSTNAME -t TCP_RR -l $TIME &
	done
}

ITERATIONS=0
while [ $ITERATIONS -lt 12 ]; do
	RATE=0
	ITERATIONS=$((ITERATIONS + 1))
	THREADS=$((NR_CPUS * ITERATIONS))
	# Keep only the numeric result lines and take the rate column.
	RESULTS=$(run_netperf $THREADS | grep -v '[a-zA-Z]' | awk '{ print $6 }')

	for j in $RESULTS; do
		# ${j/.*} drops the fractional part before summing.
		RATE=$((RATE + ${j/.*}))
	done
	echo threads=$THREADS rate=$RATE
done


Dell Dual Quad Penryn on Linux 2.6.35-rc4

Loop counts: Larger is better.

Threads  SLAB    SLUB+Q  %
8        690869  714788  +3.4
16       680295  711771  +4.6
24       672677  703014  +4.5
32       676780  703914  +4.0
40       668458  699806  +4.6
48       667017  698908  +4.7
56       671227  696034  +3.6
64       667956  696913  +4.3
72       668332  694931  +3.9
80       667073  695658  +4.2
88       682866  697077  +2.0
96       668089  694719  +3.9


SLUB+Q merges SLUB with some queueing concepts from SLAB and a
new way of managing objects in the slabs using bitmaps. It uses a per-cpu
queue so that free operations can be properly buffered and a bitmap for
managing the free/allocated state in the slabs. It is slightly less
space-efficient than SLUB (because large bitmaps, a few words in size,
must be placed in some slab pages if there are more than BITS_PER_LONG
objects in a slab) but in general does not increase space use too much.
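
To make the bitmap overhead concrete, here is a minimal userspace sketch (not
code from the patchset) of how many machine words a per-slab bitmap would
need. BITS_PER_LONG and BITS_TO_LONGS mirror the kernel macros of the same
names; bitmap_words() is a hypothetical helper introduced only for this
illustration.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>
#include <stdint.h>

/* Bits in one unsigned long, like the kernel's BITS_PER_LONG. */
#define BITS_PER_LONG (sizeof(unsigned long) * CHAR_BIT)

/* Round a bit count up to whole unsigned longs. */
#define BITS_TO_LONGS(nr) (((nr) + BITS_PER_LONG - 1) / BITS_PER_LONG)

/* Hypothetical helper: bitmap words needed to track `objects` slots. */
static size_t bitmap_words(size_t objects)
{
	/* Guard against overflow in the round-up addition. */
	assert(objects <= SIZE_MAX - (BITS_PER_LONG - 1));
	return BITS_TO_LONGS(objects);
}
```

Up to BITS_PER_LONG objects fit in a single word; one object more and the
bitmap grows to two words, which is the case the paragraph above refers to.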

The SLAB scheme of not touching the object during management is adopted.
SLUB+Q can efficiently free and allocate cache cold objects without
causing cache misses.
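
The idea of a queue in front of a bitmap-managed slab can be sketched in a
few lines of userspace C. This is a toy model under stated assumptions, not
the patchset's implementation: all toy_* names, the sizes, and the policy of
spilling to the slab only when the queue is full are hypothetical. The point
it illustrates is that neither the queue nor the bitmap ever reads or writes
the object's own memory.

```c
#include <assert.h>
#include <stddef.h>

#define OBJ_SIZE     64  /* hypothetical object size in bytes */
#define SLAB_OBJECTS 8   /* hypothetical objects per slab */
#define QUEUE_SIZE   4   /* hypothetical per-cpu queue depth */

struct toy_slab {
	unsigned long free_map;               /* bit i set => object i is free */
	char objects[SLAB_OBJECTS][OBJ_SIZE];
};

struct toy_queue {
	void *objs[QUEUE_SIZE];               /* pointers only, objects untouched */
	int count;
};

static void toy_slab_init(struct toy_slab *s)
{
	s->free_map = (1UL << SLAB_OBJECTS) - 1;  /* everything starts free */
}

/* Take a free object from the slab by clearing its bitmap bit. */
static void *toy_slab_get(struct toy_slab *s)
{
	for (int i = 0; i < SLAB_OBJECTS; i++) {
		if (s->free_map & (1UL << i)) {
			s->free_map &= ~(1UL << i);
			return s->objects[i];
		}
	}
	return NULL;
}

/* Return an object to the slab by setting its bitmap bit. */
static void toy_slab_put(struct toy_slab *s, void *obj)
{
	ptrdiff_t off = (char *)obj - &s->objects[0][0];

	assert(off >= 0 && off < (ptrdiff_t)sizeof(s->objects));
	assert(off % OBJ_SIZE == 0);
	s->free_map |= 1UL << (off / OBJ_SIZE);
}

/* Allocate: serve from the queue first, fall back to the slab. */
static void *toy_alloc(struct toy_queue *q, struct toy_slab *s)
{
	if (q->count > 0)
		return q->objs[--q->count];
	return toy_slab_get(s);
}

/* Free: buffer in the queue, spill to the slab bitmap when full. */
static void toy_free(struct toy_queue *q, struct toy_slab *s, void *obj)
{
	if (q->count < QUEUE_SIZE) {
		q->objs[q->count++] = obj;
		return;
	}
	toy_slab_put(s, obj);
}
```

In this model a freed object that sits in the queue is still marked as
allocated in the bitmap; whether the real patches account for queued objects
the same way is not stated in this thread.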

The queueing patches are likely still a bit rough around corner cases
and special features and need to see some more widespread testing.


2010-07-10 19:57:07

by Heinz Diehl

Subject: Re: [S+Q2 00/19] SLUB with queueing (V2) beats SLAB netperf TCP_RR

On 10.07.2010, Christoph Lameter wrote:

> The following patchset cleans some pieces up and then equips SLUB with
> per-cpu queues that work similarly to SLAB's queues. With that approach
> SLUB wins significantly in hackbench and also improves on TCP_RR.

The patchset applies cleanly, however compilation fails with

[....]
mm/slub.c: In function ‘alloc_kmem_cache_cpus’:
mm/slub.c:2093: error: negative width in bit-field ‘<anonymous>’
make[1]: *** [mm/slub.o] Error 1
make: *** [mm] Error 2
make: *** Waiting for unfinished jobs....
[....]

2010-07-12 15:15:33

by Christoph Lameter

Subject: Re: [S+Q2 00/19] SLUB with queueing (V2) beats SLAB netperf TCP_RR

On Sat, 10 Jul 2010, Heinz Diehl wrote:

> On 10.07.2010, Christoph Lameter wrote:
>
> > The following patchset cleans some pieces up and then equips SLUB with
> > per-cpu queues that work similarly to SLAB's queues. With that approach
> > SLUB wins significantly in hackbench and also improves on TCP_RR.
>
> The patchset applies cleanly, however compilation fails with
>
> [....]
> mm/slub.c: In function ‘alloc_kmem_cache_cpus’:
> mm/slub.c:2093: error: negative width in bit-field ‘<anonymous>’
> make[1]: *** [mm/slub.o] Error 1
> make: *** [mm] Error 2
> make: *** Waiting for unfinished jobs....
> [....]

You need a sufficient PERCPU_DYNAMIC_EARLY_SIZE to be configured. What
platform is this? Tejun: You suggested the BUILD_BUG_ON(). How can he
increase the early size?

2010-07-12 16:39:58

by Heinz Diehl

Subject: Re: [S+Q2 00/19] SLUB with queueing (V2) beats SLAB netperf TCP_RR

On 12.07.2010, Christoph Lameter wrote:

> You need a sufficient PERCPU_DYNAMIC_EARLY_SIZE to be configured. What
> platform is this?

This is an AMD Phenom II X4-905e with 8GB RAM and a (heavily modified)
opensuse 11.1 64-bit with kernel 2.6.35-rc4-git4 (vanilla from kernel.org, no
distribution kernel). Dmesg is attached.

Thanks,
Heinz.


Attachments:
dmesg.txt.bz2 (11.15 kB)

2010-07-12 17:04:31

by Christoph Lameter

Subject: Re: [S+Q2 00/19] SLUB with queueing (V2) beats SLAB netperf TCP_RR

On Mon, 12 Jul 2010, Heinz Diehl wrote:

> On 12.07.2010, Christoph Lameter wrote:
>
> > You need a sufficient PERCPU_DYNAMIC_EARLY_SIZE to be configured. What
> > platform is this?
>
> This is an AMD Phenom II X4-905e with 8GB RAM and a (heavily modified)
> opensuse 11.1 64-bit with kernel 2.6.35-rc4-git4 (vanilla from kernel.org, no
> distribution kernel). Dmesg is attached.

Can you get us the config file? What is the value of
PERCPU_DYNAMIC_EARLY_SIZE?

I have run this on x86 for a long time. Why does the percpu subsystem
have a lower PERCPU_DYNAMIC_EARLY_SIZE on Heinz's system?


Attachments:
dmesg.txt.bz2 (11.15 kB)

2010-07-13 13:57:06

by Heinz Diehl

Subject: Re: [S+Q2 00/19] SLUB with queueing (V2) beats SLAB netperf TCP_RR

On 13.07.2010, Christoph Lameter wrote:

> Can you get us the config file? What is the value of
> PERCPU_DYNAMIC_EARLY_SIZE?

My .config file is attached. I don't know how to find out what value
PERCPU_DYNAMIC_EARLY_SIZE is actually set to. How could I do that? There's
no such thing in my .config.


Attachments:
config.bz2 (18.71 kB)

2010-07-14 02:04:56

by Christoph Lameter

Subject: Re: [S+Q2 00/19] SLUB with queueing (V2) beats SLAB netperf TCP_RR

On Tue, 13 Jul 2010, Heinz Diehl wrote:

> On 13.07.2010, Christoph Lameter wrote:
>
> > Can you get us the config file? What is the value of
> > PERCPU_DYNAMIC_EARLY_SIZE?
>
> My .config file is attached. I don't know how to find out what value
> PERCPU_DYNAMIC_EARLY_SIZE is actually set to. How could I do that? There's
> no such thing in my .config.

I don't see anything in there at first glance that would cause slub to
increase its percpu usage. This is straight upstream?

Try to just comment out the BUILD_BUG_ON. I had it misfire before and
fixed the formulae to no longer give false positives. Maybe this is
another such case. Tejun wanted that check but was never able to give me
an exact formula to check for.

At the Ottawa Linux Symposium right now so responses may be delayed.
The hotel's Internet connection keeps getting clogged for some reason.

2010-07-14 11:47:09

by Tejun Heo

Subject: Re: [S+Q2 00/19] SLUB with queueing (V2) beats SLAB netperf TCP_RR

Hello,

On 07/12/2010 05:11 PM, Christoph Lameter wrote:
> You need a sufficient PERCPU_DYNAMIC_EARLY_SIZE to be configured. What
> platform is this? Tejun: You suggested the BUILD_BUG_ON(). How can he
> increase the early size?

The size is determined by PERCPU_DYNAMIC_EARLY_SIZE, so bumping it up
should do it but it would probably be wiser to bump
PERCPU_DYNAMIC_RESERVE too. PERCPU_DYNAMIC_EARLY_SIZE is currently
12k. How high should it be?

Thanks.

--
tejun

2010-07-14 11:51:59

by Tejun Heo

Subject: Re: [S+Q2 00/19] SLUB with queueing (V2) beats SLAB netperf TCP_RR

Hello,

On 07/14/2010 04:01 AM, Christoph Lameter wrote:
> I don't see anything in there at first glance that would cause slub to
> increase its percpu usage. This is straight upstream?

It's basically checking constant expressions there and
PERCPU_DYNAMIC_EARLY_SIZE is defined as 12k, so slub is thinking that
it's gonna use more memory on that build.
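
For readers wondering why a size check shows up as "negative width in
bit-field", here is a minimal sketch of the mechanism. The two macros are
modeled on the kernel's bit-field idiom for compile-time assertions and rely
on GCC/clang accepting a struct with only an unnamed bit-field; EARLY_SIZE
uses the 12k figure quoted above, while struct toy_percpu and
early_size_check() are hypothetical stand-ins, not names from mm/slub.c.

```c
#include <assert.h>
#include <stddef.h>

/*
 * A true condition yields a bit-field of width -1, which the compiler
 * rejects with "negative width in bit-field". A false condition yields
 * width 0, which compiles and generates no code.
 */
#define BUILD_BUG_ON_ZERO(e) (sizeof(struct { int : -!!(e); }))
#define BUILD_BUG_ON(cond)   ((void)BUILD_BUG_ON_ZERO(cond))

#define EARLY_SIZE (12 << 10)  /* 12k, the value quoted in this thread */

/* Hypothetical stand-in for the per-cpu structure being sized. */
struct toy_percpu {
	char data[4096];
};

static int early_size_check(void)
{
	/*
	 * Compiles because 4096 <= 12288. Growing data[] to 16384 bytes
	 * would turn this line into a build error.
	 */
	BUILD_BUG_ON(sizeof(struct toy_percpu) > EARLY_SIZE);

	/* Runtime equivalent, shown only for contrast. */
	assert(sizeof(struct toy_percpu) <= EARLY_SIZE);
	return 1;
}
```

Because both sides are constant expressions, the check depends only on the
build configuration, not on the machine the kernel later boots on.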

> Try to just comment out the BUILD_BUG_ON. I had it misfire before and
> fixed the formulae to no longer give false positives. Maybe this is
> another such case. Tejun wanted that check but was never able to give me
> an exact formula to check for.

Yeah, unfortunately, due to alignment requirements, it can't be
determined with accuracy. We'll just have to size it sufficiently.
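
A small sketch of why alignment defeats an exact formula: the space consumed
by a series of allocations depends on their order and alignment, not just on
the sum of their sizes. This is a generic illustration, not a model of how
the percpu allocator actually places chunks; align_up() and packed_total()
are hypothetical helpers.

```c
#include <assert.h>
#include <stddef.h>

/* Round x up to the next multiple of a power-of-two alignment. */
static size_t align_up(size_t x, size_t align)
{
	assert(align != 0 && (align & (align - 1)) == 0);
	return (x + align - 1) & ~(align - 1);
}

/* Bytes consumed when n allocations are placed one after another. */
static size_t packed_total(const size_t *sizes, const size_t *aligns, size_t n)
{
	size_t offset = 0;

	for (size_t i = 0; i < n; i++)
		offset = align_up(offset, aligns[i]) + sizes[i];
	return offset;
}
```

Placing a 1-byte object before an 8-byte-aligned 8-byte object consumes 16
bytes, while the reverse order consumes 9, even though the sizes sum to 9 in
both cases.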

> At the Ottawa Linux Symposium right now so responses may be delayed.
> The hotel's Internet connection keeps getting clogged for some reason.

I'm in suse labs conf until next week so I don't think I'll be doing
much till then either.

Thanks.

--
tejun

2010-07-14 14:25:43

by Heinz Diehl

Subject: Re: [S+Q2 00/19] SLUB with queueing (V2) beats SLAB netperf TCP_RR

On 14.07.2010, Christoph Lameter wrote:

> I don't see anything in there at first glance that would cause slub to
> increase its percpu usage. This is straight upstream?

Yes, it's plain vanilla 2.6.35-rc4/-rc5 from kernel.org.

> Try to just comment out the BUILD_BUG_ON.

I first bumped it up to 24k, but that was obviously not enough, so I
commented out the BUILD_BUG_ON which triggers the build error. Now it builds
fine, and I'll do some testing.

Thanks,
Heinz.

2010-07-14 20:22:27

by David Rientjes

Subject: Re: [S+Q2 00/19] SLUB with queueing (V2) beats SLAB netperf TCP_RR

On Tue, 13 Jul 2010, Christoph Lameter wrote:

> > > Can you get us the config file? What is the value of
> > > PERCPU_DYNAMIC_EARLY_SIZE?
> >
> > My .config file is attached. I don't know how to find out what value
> > PERCPU_DYNAMIC_EARLY_SIZE is actually set to. How could I do that? There's
> > no such thing in my .config.
>
> I don't see anything in there at first glance that would cause slub to
> increase its percpu usage. This is straight upstream?
>

The problem is that he has CONFIG_NODES_SHIFT=10 and struct kmem_cache has
an array of struct kmem_cache_node pointers with MAX_NUMNODES entries
which blows its size up to over 8K. That's probably overkill for his
quad-core 8GB AMD, so I'd recommend lowering CONFIG_NODES_SHIFT to 6.
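
The arithmetic behind that figure can be written out as a short sketch. It
assumes 8-byte pointers (as on x86-64) and that MAX_NUMNODES equals
1 << CONFIG_NODES_SHIFT; max_numnodes() and node_array_bytes() are
hypothetical helper names used only here.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

/* Number of possible nodes for a given CONFIG_NODES_SHIFT value. */
static size_t max_numnodes(unsigned int nodes_shift)
{
	assert(nodes_shift < sizeof(size_t) * CHAR_BIT);
	return (size_t)1 << nodes_shift;
}

/* Bytes taken by an array holding one pointer per possible node. */
static size_t node_array_bytes(unsigned int nodes_shift, size_t ptr_size)
{
	return max_numnodes(nodes_shift) * ptr_size;
}
```

With a shift of 10 the pointer array alone is 1024 * 8 = 8192 bytes, so the
enclosing structure ends up above 8K; with a shift of 6 the array shrinks to
64 * 8 = 512 bytes.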

2010-07-14 22:26:41

by David Rientjes

Subject: Re: [S+Q2 00/19] SLUB with queueing (V2) beats SLAB netperf TCP_RR

On Fri, 9 Jul 2010, Christoph Lameter wrote:

> SLUB+Q also wins against SLAB in netperf:
>
> Script:
>
> #!/bin/bash
>
> TIME=60 # seconds
> HOSTNAME=localhost # netserver
>
> NR_CPUS=$(grep ^processor /proc/cpuinfo | wc -l)
> echo NR_CPUS=$NR_CPUS
>
> run_netperf() {
> 	for i in $(seq 1 $1); do
> 		netperf -H $HOSTNAME -t TCP_RR -l $TIME &
> 	done
> }
>
> ITERATIONS=0
> while [ $ITERATIONS -lt 12 ]; do
> 	RATE=0
> 	ITERATIONS=$((ITERATIONS + 1))
> 	THREADS=$((NR_CPUS * ITERATIONS))
> 	RESULTS=$(run_netperf $THREADS | grep -v '[a-zA-Z]' | awk '{ print $6 }')
>
> 	for j in $RESULTS; do
> 		RATE=$((RATE + ${j/.*}))
> 	done
> 	echo threads=$THREADS rate=$RATE
> done
>
>
> Dell Dual Quad Penryn on Linux 2.6.35-rc4
>
> Loop counts: Larger is better.
>
> Threads  SLAB    SLUB+Q  %
> 8        690869  714788  +3.4
> 16       680295  711771  +4.6
> 24       672677  703014  +4.5
> 32       676780  703914  +4.0
> 40       668458  699806  +4.6
> 48       667017  698908  +4.7
> 56       671227  696034  +3.6
> 64       667956  696913  +4.3
> 72       668332  694931  +3.9
> 80       667073  695658  +4.2
> 88       682866  697077  +2.0
> 96       668089  694719  +3.9
>

I see you're using my script for collecting netperf TCP_RR benchmark data,
thanks very much for looking into this workload for slab allocator
performance!

There are a couple differences between how you're using it compared to how
I showed the initial regression between slab and slub, however: you're
using localhost for your netserver which isn't representative of a real
networking round-robin workload and you're using a smaller system with
eight cores. We never measured a _significant_ performance problem with
slub compared to slab with four or eight cores, the problem only emerges
on larger systems.

When running this patchset on two machines (client and server, both running
netperf-2.4.5), each with four 2.2GHz quad-core AMD processors and 64GB of
memory, here are the results:

threads  SLAB    SLUB+Q  diff
16       205580  179109  -12.9%
32       264024  215613  -18.3%
48       286175  237036  -17.2%
64       305309  253222  -17.1%
80       308248  243848  -20.9%
96       299845  243848  -18.7%
112      305560  259427  -15.1%
128      312668  263803  -15.6%
144      329671  271335  -17.7%
160      318737  280290  -12.1%
176      325295  287918  -11.5%
192      333356  287995  -13.6%
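
For anyone re-deriving the diff column, it appears to be the relative change
of SLUB+Q against SLAB. A minimal sketch of that calculation, with
pct_change() as a hypothetical helper name:

```c
#include <assert.h>

/* Relative change of `after` versus `before`, in percent. */
static double pct_change(double before, double after)
{
	assert(before != 0.0);
	return (after - before) / before * 100.0;
}
```

For the 16-thread row, pct_change(205580, 179109) comes out at roughly
-12.9, matching the table after rounding to one decimal place.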

If you'd like to add statistics to your patchset that are enabled with
CONFIG_SLUB_STATS, I'd be happy to run it on this setup and collect more
data for you.

2010-07-14 23:52:31

by David Rientjes

Subject: Re: [S+Q2 00/19] SLUB with queueing (V2) beats SLAB netperf TCP_RR

On Fri, 9 Jul 2010, Christoph Lameter wrote:

> The following patchset cleans some pieces up and then equips SLUB with
> per-cpu queues that work similarly to SLAB's queues.

Pekka, I think patches 4-8 could be applied to your tree now, they're
relatively unchanged from what's been posted before. (I didn't ack patch
9 because I think it makes slab_lock() -> slab_unlock() matching more
difficult with little win, but I don't feel strongly about it.)

I'd also consider patch 7 for 2.6.35-rc6 (and -stable).

2010-07-15 20:20:47

by Christoph Lameter

Subject: Re: [S+Q2 00/19] SLUB with queueing (V2) beats SLAB netperf TCP_RR

On Wed, 14 Jul 2010, David Rientjes wrote:

> There are a couple differences between how you're using it compared to how
> I showed the initial regression between slab and slub, however: you're
> using localhost for your netserver which isn't representative of a real
> networking round-robin workload and you're using a smaller system with
> eight cores. We never measured a _significant_ performance problem with
> slub compared to slab with four or eight cores, the problem only emerges
> on larger systems.

Larger systems would need more NUMA support than is present in the current
patches.

> When running this patchset on two machines (client and server, both running
> netperf-2.4.5), each with four 2.2GHz quad-core AMD processors and 64GB of
> memory, here are the results:

What is their NUMA topology? I don't have anything beyond two nodes here.

2010-07-15 20:31:13

by David Rientjes

Subject: Re: [S+Q2 00/19] SLUB with queueing (V2) beats SLAB netperf TCP_RR

On Thu, 15 Jul 2010, Christoph Lameter wrote:

> > When running this patchset on two machines (client and server, both running
> > netperf-2.4.5), each with four 2.2GHz quad-core AMD processors and 64GB of
> > memory, here are the results:
>
> What is their NUMA topology? I don't have anything beyond two nodes here.
>

These two machines happen to have four 16GB nodes with asymmetrical
distances:

# cat /sys/devices/system/node/node*/distance
10 20 20 30
20 10 20 20
20 20 10 20
30 20 20 10
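
The table can be checked mechanically with a small sketch. It assumes that
"asymmetrical" here means the remote nodes are not all equally far apart
(nodes 0 and 3 are at distance 30 while every other pair is at 20), since the
matrix itself is symmetric across the diagonal. The function names are
hypothetical; the data is copied from the output above.

```c
#include <assert.h>
#include <stddef.h>

#define NR_NODES 4

/* Distance table as printed in this thread (row = from-node). */
static const int thread_distances[NR_NODES][NR_NODES] = {
	{ 10, 20, 20, 30 },
	{ 20, 10, 20, 20 },
	{ 20, 20, 10, 20 },
	{ 30, 20, 20, 10 },
};

/* Returns 1 if d[i][j] == d[j][i] for every pair of nodes. */
static int distances_symmetric(const int (*d)[NR_NODES])
{
	assert(d != NULL);
	for (int i = 0; i < NR_NODES; i++)
		for (int j = i + 1; j < NR_NODES; j++)
			if (d[i][j] != d[j][i])
				return 0;
	return 1;
}

/* Returns 1 if all remote (i != j) distances share one value. */
static int distances_uniform(const int (*d)[NR_NODES])
{
	assert(d != NULL);
	for (int i = 0; i < NR_NODES; i++)
		for (int j = 0; j < NR_NODES; j++)
			if (i != j && d[i][j] != d[0][1])
				return 0;
	return 1;
}
```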

2010-07-16 08:23:38

by Pekka Enberg

Subject: Re: [S+Q2 00/19] SLUB with queueing (V2) beats SLAB netperf TCP_RR

David Rientjes wrote:
> On Fri, 9 Jul 2010, Christoph Lameter wrote:
>
>> The following patchset cleans some pieces up and then equips SLUB with
>> per-cpu queues that work similarly to SLAB's queues.
>
> Pekka, I think patches 4-8 could be applied to your tree now, they're
> relatively unchanged from what's been posted before. (I didn't ack patch
> 9 because I think it makes slab_lock() -> slab_unlock() matching more
> difficult with little win, but I don't feel strongly about it.)

Yup, I applied 4-8. Thanks guys!

> I'd also consider patch 7 for 2.6.35-rc6 (and -stable).

It's an obvious bug fix but is it triggered in practice? Is there a
bugzilla report for that?

2010-07-16 09:02:47

by David Rientjes

Subject: Re: [S+Q2 00/19] SLUB with queueing (V2) beats SLAB netperf TCP_RR

On Fri, 16 Jul 2010, Pekka Enberg wrote:

> > I'd also consider patch 7 for 2.6.35-rc6 (and -stable).
>
> It's an obvious bug fix but is it triggered in practice? Is there a bugzilla
> report for that?
>

Let's ask Benjamin who initially reported the problem with arch_initcall
whether or not this is rc (and stable) material.

For reference, we're talking about the sysfs_slab_remove() check on
slab_state to prevent the WARN in the kobject code you hit with its fix
below:


From: Christoph Lameter <[email protected]>

slub: Allow removal of slab caches during boot

If a slab cache is removed before we have setup sysfs then simply skip over
the sysfs handling.

Cc: Benjamin Herrenschmidt <[email protected]>
Cc: Roland Dreier <[email protected]>
Signed-off-by: Christoph Lameter <[email protected]>

---
mm/slub.c | 7 +++++++
1 file changed, 7 insertions(+)

Index: linux-2.6/mm/slub.c
===================================================================
--- linux-2.6.orig/mm/slub.c 2010-07-06 15:13:48.000000000 -0500
+++ linux-2.6/mm/slub.c 2010-07-06 15:15:27.000000000 -0500
@@ -4507,6 +4507,13 @@ static int sysfs_slab_add(struct kmem_ca

 static void sysfs_slab_remove(struct kmem_cache *s)
 {
+	if (slab_state < SYSFS)
+		/*
+		 * Sysfs has not been setup yet so no need to remove the
+		 * cache from sysfs.
+		 */
+		return;
+
 	kobject_uevent(&s->kobj, KOBJ_REMOVE);
 	kobject_del(&s->kobj);
 	kobject_put(&s->kobj);
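
The effect of the guard can be modeled in userspace as follows. This is a
sketch under stated assumptions, not kernel code: only SYSFS appears in the
patch, so the other stage names in the enum are assumptions, and the toy_*
functions and the `registered` counter are hypothetical stand-ins for the
kobject calls.

```c
#include <assert.h>

/* Boot stages in increasing order; only SYSFS is taken from the patch. */
enum toy_slab_state { DOWN, PARTIAL, UP, SYSFS };

static enum toy_slab_state slab_state = DOWN;
static int registered;  /* caches visible in the simulated sysfs */

/* Register a cache, but only once the simulated sysfs is up. */
static void toy_sysfs_slab_add(void)
{
	if (slab_state == SYSFS)
		registered++;
}

/* Remove a cache; skip the teardown if nothing was ever registered. */
static void toy_sysfs_slab_remove(void)
{
	if (slab_state < SYSFS)
		return;

	/* Stands in for the kobject WARN the patch avoids. */
	assert(registered > 0);
	registered--;
}
```

Without the early return, removing a cache while slab_state is still below
SYSFS would reach the teardown path for an object that was never registered.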

2010-07-19 00:19:06

by Benjamin Herrenschmidt

Subject: Re: [S+Q2 00/19] SLUB with queueing (V2) beats SLAB netperf TCP_RR

On Fri, 2010-07-16 at 02:02 -0700, David Rientjes wrote:
> On Fri, 16 Jul 2010, Pekka Enberg wrote:
>
> > > I'd also consider patch 7 for 2.6.35-rc6 (and -stable).
> >
> > It's an obvious bug fix but is it triggered in practice? Is there a bugzilla
> > report for that?
> >
>
> Let's ask Benjamin who initially reported the problem with arch_initcall
> whether or not this is rc (and stable) material.
>
> For reference, we're talking about the sysfs_slab_remove() check on
> slab_state to prevent the WARN in the kobject code you hit with its fix
> below:

The only case where I reproduce that is an in-house kernel port that we
haven't published yet so it doesn't have to be -stable material as far
as I'm concerned.

Cheers,
Ben.

>
> From: Christoph Lameter <[email protected]>
>
> slub: Allow removal of slab caches during boot
>
> If a slab cache is removed before we have setup sysfs then simply skip over
> the sysfs handling.
>
> Cc: Benjamin Herrenschmidt <[email protected]>
> Cc: Roland Dreier <[email protected]>
> Signed-off-by: Christoph Lameter <[email protected]>
>
> ---
> mm/slub.c | 7 +++++++
> 1 file changed, 7 insertions(+)
>
> Index: linux-2.6/mm/slub.c
> ===================================================================
> --- linux-2.6.orig/mm/slub.c 2010-07-06 15:13:48.000000000 -0500
> +++ linux-2.6/mm/slub.c 2010-07-06 15:15:27.000000000 -0500
> @@ -4507,6 +4507,13 @@ static int sysfs_slab_add(struct kmem_ca
>
>  static void sysfs_slab_remove(struct kmem_cache *s)
>  {
> +	if (slab_state < SYSFS)
> +		/*
> +		 * Sysfs has not been setup yet so no need to remove the
> +		 * cache from sysfs.
> +		 */
> +		return;
> +
>  	kobject_uevent(&s->kobj, KOBJ_REMOVE);
>  	kobject_del(&s->kobj);
>  	kobject_put(&s->kobj);