LinuxLists.cc - [this_cpu_xx V6 7/7] this_cpu: slub aggressive use of this

[permalink] [raw]

Subject: Re: [this_cpu_xx V6 7/7] this_cpu: slub aggressive use of this_cpu operations in the hotpaths

On Mon, Oct 12, 2009 at 1:40 PM, Tejun Heo <[email protected]> wrote:
> [email protected] wrote:
>> Use this_cpu_* operations in the hotpath to avoid calculations of
>> kmem_cache_cpu pointer addresses.
>>
>> On x86 there is a trade off: Multiple uses segment prefixes against an
>> address calculation and more register pressure. Code size is reduced
>> also therefore it is an advantage icache wise.
>>
>> The use of prefixes is necessary if we want to use a scheme
>> for fastpaths that do not require disabling interrupts.
>>
>> Cc: Mathieu Desnoyers <[email protected]>
>> Cc: Pekka Enberg <[email protected]>
>> Signed-off-by: Christoph Lameter <[email protected]>
>
> The rest of the patches look good to me but I'm no expert in this area
> of code. ?But you're the maintainer of the allocator and the changes
> definitely are percpu related, so if you're comfortable with it, I can
> happily carry the patches through percpu tree.

The patch looks sane to me but the changelog contains no relevant
numbers on performance. I am fine with the patch going in -percpu but
the patch probably needs some more beating performance-wise before it
can go into .33. I'm CC'ing some more people who are known to do SLAB
performance testing just in case they're interested in looking at the
patch. In any case,

Acked-by: Pekka Enberg <[email protected]>

Pekka

2009-10-12 15:03:19

[permalink] [raw]

Subject: Re: [this_cpu_xx V6 7/7] this_cpu: slub aggressive use of this_cpu operations in the hotpaths

2009-10-13 09:47:13

[permalink] [raw]

Subject: Re: [this_cpu_xx V6 7/7] this_cpu: slub aggressive use of this_cpu operations in the hotpaths

On Mon, 12 Oct 2009, Pekka Enberg wrote:

> The patch looks sane to me but the changelog contains no relevant
> numbers on performance. I am fine with the patch going in -percpu but
> the patch probably needs some more beating performance-wise before it
> can go into .33. I'm CC'ing some more people who are known to do SLAB
> performance testing just in case they're interested in looking at the
> patch. In any case,
>

I ran 60-second netperf TCP_RR benchmarks with various thread counts over
two machines, both four quad-core Opterons. I ran the trials ten times
each with both vanilla per-cpu#for-next at 9288f99 and with v6 of this
patchset. The transfer rates were virtually identical showing no
improvement or regression with this patchset in this benchmark.

[ As I reported in http://marc.info/?l=linux-kernel&m=123839191416472,
this benchmark continues to be the most significant regression slub has
compared to slab. ]

2009-10-13 14:51:01

[permalink] [raw]

Subject: Re: [this_cpu_xx V6 7/7] this_cpu: slub aggressive use of this_cpu operations in the hotpaths

On Tue, 13 Oct 2009, David Rientjes wrote:

> I ran 60-second netperf TCP_RR benchmarks with various thread counts over
> two machines, both four quad-core Opterons. I ran the trials ten times
> each with both vanilla per-cpu#for-next at 9288f99 and with v6 of this
> patchset. The transfer rates were virtually identical showing no
> improvement or regression with this patchset in this benchmark.
>
> [ As I reported in http://marc.info/?l=linux-kernel&m=123839191416472,
> this benchmark continues to be the most significant regression slub has
> compared to slab. ]

Hmmm... Last time I ran the in kernel benchmarks this showed a reduction
in cycle counts. Did not get to get my tests yet.

Can you also try the irqless hotpath?

2009-10-13 19:22:36

[permalink] [raw]

Subject: Re: [this_cpu_xx V6 7/7] this_cpu: slub aggressive use of this_cpu operations in the hotpaths

Here are some cycle numbers w/o the slub patches and with. I will post the
full test results and the patches to do these in kernel tests in a new
thread. The regression may be due to caching behavior of SLUB that will
not change with these patches.

Alloc fastpath wins ~ 50%. kfree also has a 50% win if the fastpath is
being used. First test does 10000 kmallocs and then frees them all.
Second test alloc one and free one and does that 10000 times.

no this_cpu ops

1. Kmalloc: Repeatedly allocate then free test
10000 times kmalloc(8) -> 239 cycles kfree -> 261 cycles
10000 times kmalloc(16) -> 249 cycles kfree -> 208 cycles
10000 times kmalloc(32) -> 215 cycles kfree -> 232 cycles
10000 times kmalloc(64) -> 164 cycles kfree -> 216 cycles
10000 times kmalloc(128) -> 266 cycles kfree -> 275 cycles
10000 times kmalloc(256) -> 478 cycles kfree -> 199 cycles
10000 times kmalloc(512) -> 449 cycles kfree -> 201 cycles
10000 times kmalloc(1024) -> 484 cycles kfree -> 398 cycles
10000 times kmalloc(2048) -> 475 cycles kfree -> 559 cycles
10000 times kmalloc(4096) -> 792 cycles kfree -> 506 cycles
10000 times kmalloc(8192) -> 753 cycles kfree -> 679 cycles
10000 times kmalloc(16384) -> 968 cycles kfree -> 712 cycles
2. Kmalloc: alloc/free test
10000 times kmalloc(8)/kfree -> 292 cycles
10000 times kmalloc(16)/kfree -> 308 cycles
10000 times kmalloc(32)/kfree -> 326 cycles
10000 times kmalloc(64)/kfree -> 303 cycles
10000 times kmalloc(128)/kfree -> 257 cycles
10000 times kmalloc(256)/kfree -> 262 cycles
10000 times kmalloc(512)/kfree -> 293 cycles
10000 times kmalloc(1024)/kfree -> 262 cycles
10000 times kmalloc(2048)/kfree -> 289 cycles
10000 times kmalloc(4096)/kfree -> 274 cycles
10000 times kmalloc(8192)/kfree -> 265 cycles
10000 times kmalloc(16384)/kfree -> 1041 cycles

with this_cpu_xx

1. Kmalloc: Repeatedly allocate then free test
10000 times kmalloc(8) -> 134 cycles kfree -> 212 cycles
10000 times kmalloc(16) -> 109 cycles kfree -> 116 cycles
10000 times kmalloc(32) -> 157 cycles kfree -> 231 cycles
10000 times kmalloc(64) -> 168 cycles kfree -> 169 cycles
10000 times kmalloc(128) -> 263 cycles kfree -> 260 cycles
10000 times kmalloc(256) -> 430 cycles kfree -> 251 cycles
10000 times kmalloc(512) -> 415 cycles kfree -> 258 cycles
10000 times kmalloc(1024) -> 406 cycles kfree -> 432 cycles
10000 times kmalloc(2048) -> 457 cycles kfree -> 579 cycles
10000 times kmalloc(4096) -> 624 cycles kfree -> 553 cycles
10000 times kmalloc(8192) -> 851 cycles kfree -> 851 cycles
10000 times kmalloc(16384) -> 907 cycles kfree -> 722 cycles
2. Kmalloc: alloc/free test
10000 times kmalloc(8)/kfree -> 232 cycles
10000 times kmalloc(16)/kfree -> 150 cycles
10000 times kmalloc(32)/kfree -> 278 cycles
10000 times kmalloc(64)/kfree -> 263 cycles
10000 times kmalloc(128)/kfree -> 280 cycles
10000 times kmalloc(256)/kfree -> 279 cycles
10000 times kmalloc(512)/kfree -> 299 cycles
10000 times kmalloc(1024)/kfree -> 289 cycles
10000 times kmalloc(2048)/kfree -> 288 cycles
10000 times kmalloc(4096)/kfree -> 321 cycles
10000 times kmalloc(8192)/kfree -> 285 cycles
10000 times kmalloc(16384)/kfree -> 1002 cycles

2009-10-13 19:45:50

[permalink] [raw]

Subject: Re: [this_cpu_xx V6 7/7] this_cpu: slub aggressive use of this_cpu operations in the hotpaths

Hi Christoph,

Christoph Lameter wrote:
> Here are some cycle numbers w/o the slub patches and with. I will post the
> full test results and the patches to do these in kernel tests in a new
> thread. The regression may be due to caching behavior of SLUB that will
> not change with these patches.
>
> Alloc fastpath wins ~ 50%. kfree also has a 50% win if the fastpath is
> being used. First test does 10000 kmallocs and then frees them all.
> Second test alloc one and free one and does that 10000 times.

I wonder how reliable these numbers are. We did similar testing a while
back because we thought kmalloc-96 caches had weird cache behavior but
finally figured out the anomaly was explained by the order of the tests
run, not cache size.

AFAICT, we have similar artifact in these tests as well:

> no this_cpu ops
>
> 1. Kmalloc: Repeatedly allocate then free test
> 10000 times kmalloc(8) -> 239 cycles kfree -> 261 cycles
> 10000 times kmalloc(16) -> 249 cycles kfree -> 208 cycles
> 10000 times kmalloc(32) -> 215 cycles kfree -> 232 cycles
> 10000 times kmalloc(64) -> 164 cycles kfree -> 216 cycles

Notice the jump from 32 to 64 and then back to 64. One would expect we
see linear increase as object size grows as we hit the page allocator
more often, no?

> 10000 times kmalloc(128) -> 266 cycles kfree -> 275 cycles
> 10000 times kmalloc(256) -> 478 cycles kfree -> 199 cycles
> 10000 times kmalloc(512) -> 449 cycles kfree -> 201 cycles
> 10000 times kmalloc(1024) -> 484 cycles kfree -> 398 cycles
> 10000 times kmalloc(2048) -> 475 cycles kfree -> 559 cycles
> 10000 times kmalloc(4096) -> 792 cycles kfree -> 506 cycles
> 10000 times kmalloc(8192) -> 753 cycles kfree -> 679 cycles
> 10000 times kmalloc(16384) -> 968 cycles kfree -> 712 cycles
> 2. Kmalloc: alloc/free test
> 10000 times kmalloc(8)/kfree -> 292 cycles
> 10000 times kmalloc(16)/kfree -> 308 cycles
> 10000 times kmalloc(32)/kfree -> 326 cycles
> 10000 times kmalloc(64)/kfree -> 303 cycles
> 10000 times kmalloc(128)/kfree -> 257 cycles
> 10000 times kmalloc(256)/kfree -> 262 cycles
> 10000 times kmalloc(512)/kfree -> 293 cycles
> 10000 times kmalloc(1024)/kfree -> 262 cycles
> 10000 times kmalloc(2048)/kfree -> 289 cycles
> 10000 times kmalloc(4096)/kfree -> 274 cycles
> 10000 times kmalloc(8192)/kfree -> 265 cycles
> 10000 times kmalloc(16384)/kfree -> 1041 cycles
>
>
> with this_cpu_xx
>
> 1. Kmalloc: Repeatedly allocate then free test
> 10000 times kmalloc(8) -> 134 cycles kfree -> 212 cycles
> 10000 times kmalloc(16) -> 109 cycles kfree -> 116 cycles

Same artifact here.

> 10000 times kmalloc(32) -> 157 cycles kfree -> 231 cycles
> 10000 times kmalloc(64) -> 168 cycles kfree -> 169 cycles
> 10000 times kmalloc(128) -> 263 cycles kfree -> 260 cycles
> 10000 times kmalloc(256) -> 430 cycles kfree -> 251 cycles
> 10000 times kmalloc(512) -> 415 cycles kfree -> 258 cycles
> 10000 times kmalloc(1024) -> 406 cycles kfree -> 432 cycles
> 10000 times kmalloc(2048) -> 457 cycles kfree -> 579 cycles
> 10000 times kmalloc(4096) -> 624 cycles kfree -> 553 cycles
> 10000 times kmalloc(8192) -> 851 cycles kfree -> 851 cycles
> 10000 times kmalloc(16384) -> 907 cycles kfree -> 722 cycles

And looking at these numbers:

> 2. Kmalloc: alloc/free test
> 10000 times kmalloc(8)/kfree -> 232 cycles
> 10000 times kmalloc(16)/kfree -> 150 cycles
> 10000 times kmalloc(32)/kfree -> 278 cycles
> 10000 times kmalloc(64)/kfree -> 263 cycles
> 10000 times kmalloc(128)/kfree -> 280 cycles
> 10000 times kmalloc(256)/kfree -> 279 cycles
> 10000 times kmalloc(512)/kfree -> 299 cycles
> 10000 times kmalloc(1024)/kfree -> 289 cycles
> 10000 times kmalloc(2048)/kfree -> 288 cycles
> 10000 times kmalloc(4096)/kfree -> 321 cycles
> 10000 times kmalloc(8192)/kfree -> 285 cycles
> 10000 times kmalloc(16384)/kfree -> 1002 cycles

If there's 50% improvement in the kmalloc() path, why does the
this_cpu() version seem to be roughly as fast as the mainline version?

Pekka

2009-10-13 19:56:36

[permalink] [raw]

Subject: Re: [this_cpu_xx V6 7/7] this_cpu: slub aggressive use of this_cpu operations in the hotpaths

On Tue, 13 Oct 2009, Pekka Enberg wrote:

> I wonder how reliable these numbers are. We did similar testing a while back
> because we thought kmalloc-96 caches had weird cache behavior but finally
> figured out the anomaly was explained by the order of the tests run, not cache
> size.

Well you need to look behind these numbers to see when the allocator uses
the fastpath or slow path. Only the fast path is optimized here.

> AFAICT, we have similar artifact in these tests as well:
>
> > no this_cpu ops
> >
> > 1. Kmalloc: Repeatedly allocate then free test
> > 10000 times kmalloc(8) -> 239 cycles kfree -> 261 cycles
> > 10000 times kmalloc(16) -> 249 cycles kfree -> 208 cycles
> > 10000 times kmalloc(32) -> 215 cycles kfree -> 232 cycles
> > 10000 times kmalloc(64) -> 164 cycles kfree -> 216 cycles
>
> Notice the jump from 32 to 64 and then back to 64. One would expect we see
> linear increase as object size grows as we hit the page allocator more often,
> no?

64 is the cacheline size for the machine. At that point you have the
advantage of no overlapping data between different allocations and the
prefetcher may do a particularly good job.

> > 10000 times kmalloc(16384)/kfree -> 1002 cycles
>
> If there's 50% improvement in the kmalloc() path, why does the this_cpu()
> version seem to be roughly as fast as the mainline version?

Its not that the kmalloc() is faster. The instructions used for the
fastpath generate less cycles. Other components figure into the total
latency as well.

16k allocations for example are not handled by slub anymore. Fastpath has
no effect. The wins there is just the improved percpu handling in the page
allocator.

I have some numbers here for irqless which drops another half of the
fastpath latency (and it adds some code to the slow path, sigh):

1. Kmalloc: Repeatedly allocate then free test
10000 times kmalloc(8) -> 55 cycles kfree -> 251 cycles
10000 times kmalloc(16) -> 201 cycles kfree -> 261 cycles
10000 times kmalloc(32) -> 220 cycles kfree -> 261 cycles
10000 times kmalloc(64) -> 186 cycles kfree -> 224 cycles
10000 times kmalloc(128) -> 205 cycles kfree -> 125 cycles
10000 times kmalloc(256) -> 351 cycles kfree -> 267 cycles
10000 times kmalloc(512) -> 330 cycles kfree -> 310 cycles
10000 times kmalloc(1024) -> 416 cycles kfree -> 419 cycles
10000 times kmalloc(2048) -> 537 cycles kfree -> 439 cycles
10000 times kmalloc(4096) -> 458 cycles kfree -> 594 cycles
10000 times kmalloc(8192) -> 810 cycles kfree -> 678 cycles
10000 times kmalloc(16384) -> 879 cycles kfree -> 746 cycles
2. Kmalloc: alloc/free test
10000 times kmalloc(8)/kfree -> 66 cycles
10000 times kmalloc(16)/kfree -> 187 cycles
10000 times kmalloc(32)/kfree -> 116 cycles
10000 times kmalloc(64)/kfree -> 107 cycles
10000 times kmalloc(128)/kfree -> 115 cycles
10000 times kmalloc(256)/kfree -> 65 cycles
10000 times kmalloc(512)/kfree -> 66 cycles
10000 times kmalloc(1024)/kfree -> 206 cycles
10000 times kmalloc(2048)/kfree -> 65 cycles
10000 times kmalloc(4096)/kfree -> 193 cycles
10000 times kmalloc(8192)/kfree -> 65 cycles
10000 times kmalloc(16384)/kfree -> 976 cycles

2009-10-13 20:16:43

[permalink] [raw]

Subject: Re: [this_cpu_xx V6 7/7] this_cpu: slub aggressive use of this_cpu operations in the hotpaths

On Tue, 13 Oct 2009, Christoph Lameter wrote:

> > I wonder how reliable these numbers are. We did similar testing a while back
> > because we thought kmalloc-96 caches had weird cache behavior but finally
> > figured out the anomaly was explained by the order of the tests run, not cache
> > size.
>
> Well you need to look behind these numbers to see when the allocator uses
> the fastpath or slow path. Only the fast path is optimized here.
>

With the netperf -t TCP_RR -l 60 benchmark I ran, CONFIG_SLUB_STATS shows
the allocation fastpath is utilized quite a bit for a couple of key
caches:

cache ALLOC_FASTPATH ALLOC_SLOWPATH
kmalloc-256 98125871 31585955
kmalloc-2048 77243698 52347453

For an optimized fastpath, I'd expect such a workload would result in at
least a slightly higher transfer rate.

I'll try the irqless patch, but this particular benchmark may not
appropriately demonstrate any performance gain because of the added code
in the also significantly-used slowpath.

2009-10-13 20:33:31

[permalink] [raw]

Subject: Re: [this_cpu_xx V6 7/7] this_cpu: slub aggressive use of this_cpu operations in the hotpaths

2009-10-13 21:06:41

[permalink] [raw]

Subject: Re: [this_cpu_xx V6 7/7] this_cpu: slub aggressive use of this_cpu operations in the hotpaths

On Tue, 13 Oct 2009, David Rientjes wrote:

> For an optimized fastpath, I'd expect such a workload would result in at
> least a slightly higher transfer rate.

There will be no improvements if the load is dominated by the
instructions in the network layer or caching issues. None of that is
changed by the path. It only reduces the cycle count in the fastpath.

2009-10-13 22:54:15

[permalink] [raw]

Subject: Re: [this_cpu_xx V6 7/7] this_cpu: slub aggressive use of this_cpu operations in the hotpaths

On Tue, 13 Oct 2009, Christoph Lameter wrote:

> > For an optimized fastpath, I'd expect such a workload would result in at
> > least a slightly higher transfer rate.
>
> There will be no improvements if the load is dominated by the
> instructions in the network layer or caching issues. None of that is
> changed by the path. It only reduces the cycle count in the fastpath.
>

Right, but CONFIG_SLAB shows a 5-6% improvement over CONFIG_SLUB in the
same workload so it shows that the slab allocator does have an impact in
transfer rate. I understand that the performance gain with this patchset,
however, may not be representative with the benchmark since it also
frequently uses the slowpath for kmalloc-256 about 25% of the time and the
added code of the irqless patch may mask the fastpath gain.

2009-10-14 01:34:27

[permalink] [raw]

Subject: Re: [this_cpu_xx V6 7/7] this_cpu: slub aggressive use of this_cpu operations in the hotpaths

On Tue, 13 Oct 2009, Christoph Lameter wrote:

> > I ran 60-second netperf TCP_RR benchmarks with various thread counts over
> > two machines, both four quad-core Opterons. I ran the trials ten times
> > each with both vanilla per-cpu#for-next at 9288f99 and with v6 of this
> > patchset. The transfer rates were virtually identical showing no
> > improvement or regression with this patchset in this benchmark.
> >
> > [ As I reported in http://marc.info/?l=linux-kernel&m=123839191416472,
> > this benchmark continues to be the most significant regression slub has
> > compared to slab. ]
>
> Hmmm... Last time I ran the in kernel benchmarks this showed a reduction
> in cycle counts. Did not get to get my tests yet.
>
> Can you also try the irqless hotpath?
>

v6 of your patchset applied to percpu#for-next now at dec54bf "this_cpu:
Use this_cpu_xx in trace_functions_graph.c" works fine, but when I apply
the irqless patch from http://marc.info/?l=linux-kernel&m=125503037213262
it hangs my netserver machine within the first 60 seconds when running
this benchmark. These kernels both include the fixes to kmem_cache_open()
and dma_kmalloc_cache() you posted earlier. I'll have to debug why that's
happening before collecting results.

2009-10-14 13:35:33

by Mel Gorman

[permalink] [raw]

Subject: Re: [this_cpu_xx V6 7/7] this_cpu: slub aggressive use of this_cpu operations in the hotpaths

On Tue, Oct 13, 2009 at 03:53:00PM -0700, David Rientjes wrote:
> On Tue, 13 Oct 2009, Christoph Lameter wrote:
>
> > > For an optimized fastpath, I'd expect such a workload would result in at
> > > least a slightly higher transfer rate.
> >
> > There will be no improvements if the load is dominated by the
> > instructions in the network layer or caching issues. None of that is
> > changed by the path. It only reduces the cycle count in the fastpath.
> >
>
> Right, but CONFIG_SLAB shows a 5-6% improvement over CONFIG_SLUB in the
> same workload so it shows that the slab allocator does have an impact in
> transfer rate. I understand that the performance gain with this patchset,
> however, may not be representative with the benchmark since it also
> frequently uses the slowpath for kmalloc-256 about 25% of the time and the
> added code of the irqless patch may mask the fastpath gain.
>

I have a bit more detailed results based on the following machine

CPU type: AMD Phenom 9950
CPU counts: 1 CPU (4 cores)
CPU Speed: 1.3GHz
Motherboard: Gigabyte GA-MA78GM-S2H
Memory: 8GB

The reference kernel used is mmotm-2009-10-09-01-07. The patches applied
are the patches in this thread. The headings are a bit munged but it's

SLUB-vanilla where vanilla is mmotm-2009-10-09-01-07
SLUB-this-cpu mmotm-2009-10-09-01-07 + patches in this thread
SLAB-* same as above but SLAB configured instead of SLUB.
I know it wasn't necessary to run SLAB-this-cpu but
it gives an idea to what degree results can vary
between reboots even if results are stable once the
machine is running.

The benchmarks run were kernbench, netperf UDP_STREAM and TCP_STREAM and
sysbench with postgres.

Kernbench is 5 kernel compiles and an average taken. One kernel compile
is done at the start to warm the benchmark up and this result is
discarded.

Netperf is the _STREAM tests as opposed to the _RR tests reported
elsewhere. No special effort is done to bind processes to any particular
CPU. The results reported tried to be 99% confidence that the estimated
mean was within 1% of the true mean. Results where netperf failed to
achieve the necessary confidence are marked with a * and the line after
such a result states what percentage the estimated mean is to the true
mean. The test is run with different packet sizes.

Sysbench is a read-only test (to avoid IO) and is the "complex"
workload. The test is run with varying numbers of threads.

In all the results, SLUB-vanilla is the reference baseline. This allows
a comparison between SLUB-vanilla and SLAB-vanilla as well with the
patches applied.

kernbench-SLUB-vanilla-kernbench kernbench-SLUBkernbench-SLAB-vanilla-kernbench kernbench-SLAB
SLUB-vanilla this-cpu SLAB-vanilla this-cpu
Elapsed min 92.95 ( 0.00%) 92.62 ( 0.36%) 92.93 ( 0.02%) 92.62 ( 0.36%)
Elapsed mean 93.11 ( 0.00%) 92.74 ( 0.40%) 93.00 ( 0.13%) 92.82 ( 0.32%)
Elapsed stddev 0.10 ( 0.00%) 0.14 (-40.55%) 0.04 (55.47%) 0.18 (-84.33%)
Elapsed max 93.20 ( 0.00%) 92.95 ( 0.27%) 93.05 ( 0.16%) 93.09 ( 0.12%)
User min 323.21 ( 0.00%) 322.60 ( 0.19%) 322.50 ( 0.22%) 323.26 (-0.02%)
User mean 323.81 ( 0.00%) 323.20 ( 0.19%) 323.16 ( 0.20%) 323.54 ( 0.08%)
User stddev 0.40 ( 0.00%) 0.46 (-15.30%) 0.48 (-20.92%) 0.29 (26.07%)
User max 324.32 ( 0.00%) 323.72 ( 0.19%) 323.86 ( 0.14%) 323.98 ( 0.10%)
System min 35.95 ( 0.00%) 35.50 ( 1.25%) 35.35 ( 1.67%) 36.01 (-0.17%)
System mean 36.30 ( 0.00%) 35.96 ( 0.96%) 36.17 ( 0.36%) 36.23 ( 0.21%)
System stddev 0.25 ( 0.00%) 0.45 (-75.60%) 0.56 (-121.14%) 0.14 (46.14%)
System max 36.65 ( 0.00%) 36.67 (-0.05%) 36.94 (-0.79%) 36.39 ( 0.71%)
CPU min 386.00 ( 0.00%) 386.00 ( 0.00%) 386.00 ( 0.00%) 386.00 ( 0.00%)
CPU mean 386.25 ( 0.00%) 386.75 (-0.13%) 386.00 ( 0.06%) 387.25 (-0.26%)
CPU stddev 0.43 ( 0.00%) 0.83 (-91.49%) 0.00 (100.00%) 0.83 (-91.49%)
CPU max 387.00 ( 0.00%) 388.00 (-0.26%) 386.00 ( 0.26%) 388.00 (-0.26%)

Small gains in the User, System and Elapsed times with this-cpu patches
applied. It is interest to note for the mean times that the patches more
than close the gap between SLUB and SLAB for the most part - the
exception being User which has marginally better performance. This might
indicate that SLAB is still slightly better at giving back cache-hot
memory but this is speculation.

NETPERF UDP_STREAM
Packet netperf-udp udp-SLUB netperf-udp udp-SLAB
Size SLUB-vanilla this-cpu SLAB-vanilla this-cpu
64 148.48 ( 0.00%) 152.03 ( 2.34%) 147.45 (-0.70%) 150.07 ( 1.06%)
128 294.65 ( 0.00%) 299.92 ( 1.76%) 289.20 (-1.88%) 290.15 (-1.55%)
256 583.63 ( 0.00%) 609.14 ( 4.19%) 590.78 ( 1.21%) 586.42 ( 0.48%)
1024 2217.90 ( 0.00%) 2261.99 ( 1.95%) 2219.64 ( 0.08%) 2207.93 (-0.45%)
2048 4164.27 ( 0.00%) 4161.47 (-0.07%) 4216.46 ( 1.24%) 4155.11 (-0.22%)
3312 6284.17 ( 0.00%) 6383.24 ( 1.55%) 6231.88 (-0.84%) 6243.82 (-0.65%)
4096 7399.42 ( 0.00%) 7686.38 ( 3.73%) 7394.89 (-0.06%) 7487.91 ( 1.18%)
6144 10014.35 ( 0.00%) 10199.48 ( 1.82%) 9927.92 (-0.87%)* 10067.40 ( 0.53%)
1.00% 1.00% 1.08% 1.00%
8192 11232.50 ( 0.00%)* 11368.13 ( 1.19%)* 12280.88 ( 8.54%)* 12244.23 ( 8.26%)
1.65% 1.64% 1.32% 1.00%
10240 12961.87 ( 0.00%) 13099.82 ( 1.05%)* 13816.33 ( 6.18%)* 13927.18 ( 6.93%)
1.00% 1.03% 1.21% 1.00%
12288 14403.74 ( 0.00%)* 14276.89 (-0.89%)* 15173.09 ( 5.07%)* 15464.05 ( 6.86%)*
1.31% 1.63% 1.93% 1.55%
14336 15229.98 ( 0.00%)* 15218.52 (-0.08%)* 16412.94 ( 7.21%) 16252.98 ( 6.29%)
1.37% 2.76% 1.00% 1.00%
16384 15367.60 ( 0.00%)* 16038.71 ( 4.18%) 16635.91 ( 7.62%) 17128.87 (10.28%)*
1.29% 1.00% 1.00% 6.36%

The patches mostly improve the performance of netperf UDP_STREAM by a good
whack so the patches are a plus here. However, it should also be noted that
SLAB was mostly faster than SLUB, particularly for large packet sizes. Refresh
my memory, how do SLUB and SLAB differ in regards to off-loading large
allocations to the page allocator these days?

NETPERF TCP_STREAM
Packet netperf-tcp tcp-SLUB netperf-tcp tcp-SLAB
Size SLUB-vanilla this-cpu SLAB-vanilla this-cpu
64 1773.00 ( 0.00%) 1731.63 (-2.39%)* 1794.48 ( 1.20%) 2029.46 (12.64%)
1.00% 2.43% 1.00% 1.00%
128 3181.12 ( 0.00%) 3471.22 ( 8.36%) 3296.37 ( 3.50%) 3251.33 ( 2.16%)
256 4794.35 ( 0.00%) 4797.38 ( 0.06%) 4912.99 ( 2.41%) 4846.86 ( 1.08%)
1024 9438.10 ( 0.00%) 8681.05 (-8.72%)* 8270.58 (-14.12%) 8268.85 (-14.14%)
1.00% 7.31% 1.00% 1.00%
2048 9196.06 ( 0.00%) 9375.72 ( 1.92%) 11474.59 (19.86%) 9420.01 ( 2.38%)
3312 10338.49 ( 0.00%)* 10021.82 (-3.16%)* 12018.72 (13.98%)* 12069.28 (14.34%)*
9.49% 6.36% 1.21% 2.12%
4096 9931.20 ( 0.00%)* 10285.38 ( 3.44%)* 12265.59 (19.03%)* 10175.33 ( 2.40%)*
1.31% 1.38% 9.97% 8.33%
6144 12775.08 ( 0.00%)* 10559.63 (-20.98%) 13139.34 ( 2.77%) 13210.79 ( 3.30%)*
1.45% 1.00% 1.00% 2.99%
8192 10933.93 ( 0.00%)* 10534.41 (-3.79%)* 10876.42 (-0.53%)* 10738.25 (-1.82%)*
14.29% 2.10% 12.50% 9.55%
10240 12868.58 ( 0.00%) 12991.65 ( 0.95%) 10892.20 (-18.14%) 13106.01 ( 1.81%)
12288 11854.97 ( 0.00%) 12122.34 ( 2.21%)* 12129.79 ( 2.27%)* 12411.84 ( 4.49%)*
1.00% 6.61% 5.78% 8.95%
14336 12552.48 ( 0.00%)* 12501.71 (-0.41%)* 12274.54 (-2.26%) 12322.63 (-1.87%)*
6.05% 2.58% 1.00% 2.23%
16384 11733.09 ( 0.00%)* 12735.05 ( 7.87%)* 13195.68 (11.08%)* 14401.62 (18.53%)
1.14% 9.79% 10.30% 1.00%

The results for the patches are a bit all over the place for TCP_STREAM
with big gains and losses depending on the packet size, particularly 6144
for some reason. SLUB vs SLAB shows SLAB often has really massive advantages
and this is not always for the larger packet sizes where the page allocator
might be a suspect.

SYSBENCH
sysbench-SLUB-vanilla-sysbench sysbench-SLUBsysbench-SLAB-vanilla-sysbench sysbench-SLAB
SLUB-vanilla this-cpu SLAB-vanilla this-cpu
1 26950.79 ( 0.00%) 26822.05 (-0.48%) 26919.89 (-0.11%) 26746.18 (-0.77%)
2 51555.51 ( 0.00%) 51928.02 ( 0.72%) 51370.02 (-0.36%) 51129.82 (-0.83%)
3 76204.23 ( 0.00%) 76333.58 ( 0.17%) 76483.99 ( 0.37%) 75954.52 (-0.33%)
4 100599.12 ( 0.00%) 101757.98 ( 1.14%) 100499.65 (-0.10%) 101605.61 ( 0.99%)
5 100211.45 ( 0.00%) 100435.33 ( 0.22%) 100150.98 (-0.06%) 99398.11 (-0.82%)
6 99390.81 ( 0.00%) 99840.85 ( 0.45%) 99234.38 (-0.16%) 99244.42 (-0.15%)
7 98740.56 ( 0.00%) 98727.61 (-0.01%) 98305.88 (-0.44%) 98123.56 (-0.63%)
8 98075.89 ( 0.00%) 98048.62 (-0.03%) 98183.99 ( 0.11%) 97587.82 (-0.50%)
9 96502.22 ( 0.00%) 97276.80 ( 0.80%) 96819.88 ( 0.33%) 97320.51 ( 0.84%)
10 96598.70 ( 0.00%) 96545.37 (-0.06%) 96222.51 (-0.39%) 96221.69 (-0.39%)
11 95500.66 ( 0.00%) 95671.11 ( 0.18%) 95003.21 (-0.52%) 95246.81 (-0.27%)
12 94572.87 ( 0.00%) 95266.70 ( 0.73%) 93807.60 (-0.82%) 94859.82 ( 0.30%)
13 93811.85 ( 0.00%) 94309.18 ( 0.53%) 93219.81 (-0.64%) 93051.63 (-0.82%)
14 92972.16 ( 0.00%) 93849.87 ( 0.94%) 92641.50 (-0.36%) 92916.70 (-0.06%)
15 92276.06 ( 0.00%) 92454.94 ( 0.19%) 91094.04 (-1.30%) 91972.79 (-0.33%)
16 90265.35 ( 0.00%) 90416.26 ( 0.17%) 89309.26 (-1.07%) 90103.89 (-0.18%)

The patches mostly gain for sysbench although the gains are very marginal
and SLUB has a minor advantage over SLAB. I haven't actually checked how
slab-intensive this workload is. The differences are no marginal, I would
guess the answer is "not very".

Overall based on these results, I would say that the patches are a "Good Thing"
for this machine at least. With the patches applied, SLUB has a marginal
advantage over SLAB for kernbench. However, netperf TCP_STREAM and UDP_STREAM
both show significant disadvantages for SLUB and this cannot be always
explained by differing behaviour with respect to page-allocator offloading.

--
Mel Gorman
Part-time Phd Student Linux Technology Center
University of Limerick IBM Dublin Software Lab

2009-10-14 14:16:25

[permalink] [raw]

Subject: Re: [this_cpu_xx V6 7/7] this_cpu: slub aggressive use of this_cpu operations in the hotpaths

The test did not include the irqless patch I hope?

On Wed, 14 Oct 2009, Mel Gorman wrote:

> Small gains in the User, System and Elapsed times with this-cpu patches
> applied. It is interest to note for the mean times that the patches more
> than close the gap between SLUB and SLAB for the most part - the
> exception being User which has marginally better performance. This might
> indicate that SLAB is still slightly better at giving back cache-hot
> memory but this is speculation.

The queuing in SLAB allows a better cache hot behavior. Without a queue
SLUB has a difficult time improvising cache hot behavior based on objects
restricted to a slab page. Therefore the size of the slab page will
affect how much "queueing" SLUB can do.

> The patches mostly improve the performance of netperf UDP_STREAM by a good
> whack so the patches are a plus here. However, it should also be noted that
> SLAB was mostly faster than SLUB, particularly for large packet sizes. Refresh
> my memory, how do SLUB and SLAB differ in regards to off-loading large
> allocations to the page allocator these days?

SLUB offloads allocations > 8k to the page allocator.
SLAB does create large slabs.

2009-10-14 15:50:24

by Mel Gorman

[permalink] [raw]

Subject: Re: [this_cpu_xx V6 7/7] this_cpu: slub aggressive use of this_cpu operations in the hotpaths

On Wed, Oct 14, 2009 at 10:08:12AM -0400, Christoph Lameter wrote:
> The test did not include the irqless patch I hope?
>

Correct. Only the patches in this thread were tested.

> On Wed, 14 Oct 2009, Mel Gorman wrote:
>
> > Small gains in the User, System and Elapsed times with this-cpu patches
> > applied. It is interest to note for the mean times that the patches more
> > than close the gap between SLUB and SLAB for the most part - the
> > exception being User which has marginally better performance. This might
> > indicate that SLAB is still slightly better at giving back cache-hot
> > memory but this is speculation.
>
> The queuing in SLAB allows a better cache hot behavior. Without a queue
> SLUB has a difficult time improvising cache hot behavior based on objects
> restricted to a slab page. Therefore the size of the slab page will
> affect how much "queueing" SLUB can do.
>

Ok, so the speculation is a plausible explanation.

> > The patches mostly improve the performance of netperf UDP_STREAM by a good
> > whack so the patches are a plus here. However, it should also be noted that
> > SLAB was mostly faster than SLUB, particularly for large packet sizes. Refresh
> > my memory, how do SLUB and SLAB differ in regards to off-loading large
> > allocations to the page allocator these days?
>
> SLUB offloads allocations > 8k to the page allocator.
> SLAB does create large slabs.
>

Allocations >8k might explain then why 8K and 16K packets for UDP_STREAM
performance suffers. That can be marked as future possible work to sort
out within the allocator.

However, does it explain why TCP_STREAM suffers so badly even for packet
sizes like 2K? It's also important to note in some cases, SLAB was far
slower even when the packet sizes were greater than 8k so I don't think
the page allocator is an adequate explanation for TCP_STREAM.

--
Mel Gorman
Part-time Phd Student Linux Technology Center
University of Limerick IBM Dublin Software Lab

2009-10-14 15:54:27

[permalink] [raw]

Subject: Re: [this_cpu_xx V6 7/7] this_cpu: slub aggressive use of this_cpu operations in the hotpaths

Hi Mel,

Mel Gorman wrote:
>>> The patches mostly improve the performance of netperf UDP_STREAM by a good
>>> whack so the patches are a plus here. However, it should also be noted that
>>> SLAB was mostly faster than SLUB, particularly for large packet sizes. Refresh
>>> my memory, how do SLUB and SLAB differ in regards to off-loading large
>>> allocations to the page allocator these days?
>> SLUB offloads allocations > 8k to the page allocator.
>> SLAB does create large slabs.
>>
>
> Allocations >8k might explain then why 8K and 16K packets for UDP_STREAM
> performance suffers. That can be marked as future possible work to sort
> out within the allocator.
>
> However, does it explain why TCP_STREAM suffers so badly even for packet
> sizes like 2K? It's also important to note in some cases, SLAB was far
> slower even when the packet sizes were greater than 8k so I don't think
> the page allocator is an adequate explanation for TCP_STREAM.

SLAB is able to queue lots of large objects but SLUB can't do that
because it has no queues. In SLUB, each CPU gets a page assigned to it
that serves as a "queue" but the size of the queue gets smaller as
object size approaches page size.

We try to offset that with higher order allocations but IIRC we don't
increase the order linearly with object size and cap it to some
reasonable maximum.

Pekka

2009-10-14 16:04:38

[permalink] [raw]

Subject: Re: [this_cpu_xx V6 7/7] this_cpu: slub aggressive use of this_cpu operations in the hotpaths

On Wed, 14 Oct 2009, Pekka Enberg wrote:

> SLAB is able to queue lots of large objects but SLUB can't do that because it
> has no queues. In SLUB, each CPU gets a page assigned to it that serves as a
> "queue" but the size of the queue gets smaller as object size approaches page
> size.
>
> We try to offset that with higher order allocations but IIRC we don't increase
> the order linearly with object size and cap it to some reasonable maximum.

You can test to see if larger pages have an influence by passing

slub_max_order=6

or so on the kernel command line.

You can force a large page use in slub by setting

slub_min_order=3

f.e.

Or you can force a mininum number of objecxcts in slub through f.e.

slub_min_objects=50

slub_max_order=6 slub_min_objects=50

should result in pretty large slabs with lots of in page objects that
allow slub to queue better.

2009-10-14 16:15:08

[permalink] [raw]

Subject: Re: [this_cpu_xx V6 7/7] this_cpu: slub aggressive use of this_cpu operations in the hotpaths

Hi Christoph,

On Wed, 14 Oct 2009, Pekka Enberg wrote:
>> SLAB is able to queue lots of large objects but SLUB can't do that because it
>> has no queues. In SLUB, each CPU gets a page assigned to it that serves as a
>> "queue" but the size of the queue gets smaller as object size approaches page
>> size.
>>
>> We try to offset that with higher order allocations but IIRC we don't increase
>> the order linearly with object size and cap it to some reasonable maximum.

On Wed, Oct 14, 2009 at 6:56 PM, Christoph Lameter
<[email protected]> wrote:
> You can test to see if larger pages have an influence by passing
>
> slub_max_order=6
>
> or so on the kernel command line.
>
> You can force a large page use in slub by setting
>
> slub_min_order=3
>
> f.e.
>
> Or you can force a mininum number of objecxcts in slub through f.e.
>
> slub_min_objects=50
>
> slub_max_order=6 slub_min_objects=50
>
> should result in pretty large slabs with lots of in page objects that
> allow slub to queue better.

Yeah, that should help but it's probably not something we can do for
mainline. I'm not sure how we can fix SLUB to support large objects
out-of-the-box as efficiently as SLAB does.

Pekka

2009-10-14 18:27:51

[permalink] [raw]

Subject: Re: [this_cpu_xx V6 7/7] this_cpu: slub aggressive use of this_cpu operations in the hotpaths

On Wed, 14 Oct 2009, Pekka Enberg wrote:

> Yeah, that should help but it's probably not something we can do for
> mainline. I'm not sure how we can fix SLUB to support large objects
> out-of-the-box as efficiently as SLAB does.

We could add a per cpu "queue" through a pointer array in kmem_cache_cpu.
Which is more SLQB than SLUB.

2009-10-15 09:04:37

[permalink] [raw]

Subject: Re: [this_cpu_xx V6 7/7] this_cpu: slub aggressive use of this_cpu operations in the hotpaths

On Wed, 14 Oct 2009, Mel Gorman wrote:

> NETPERF TCP_STREAM
> Packet netperf-tcp tcp-SLUB netperf-tcp tcp-SLAB
> Size SLUB-vanilla this-cpu SLAB-vanilla this-cpu
> 64 1773.00 ( 0.00%) 1731.63 (-2.39%)* 1794.48 ( 1.20%) 2029.46 (12.64%)
> 1.00% 2.43% 1.00% 1.00%
> 128 3181.12 ( 0.00%) 3471.22 ( 8.36%) 3296.37 ( 3.50%) 3251.33 ( 2.16%)
> 256 4794.35 ( 0.00%) 4797.38 ( 0.06%) 4912.99 ( 2.41%) 4846.86 ( 1.08%)
> 1024 9438.10 ( 0.00%) 8681.05 (-8.72%)* 8270.58 (-14.12%) 8268.85 (-14.14%)
> 1.00% 7.31% 1.00% 1.00%
> 2048 9196.06 ( 0.00%) 9375.72 ( 1.92%) 11474.59 (19.86%) 9420.01 ( 2.38%)
> 3312 10338.49 ( 0.00%)* 10021.82 (-3.16%)* 12018.72 (13.98%)* 12069.28 (14.34%)*
> 9.49% 6.36% 1.21% 2.12%
> 4096 9931.20 ( 0.00%)* 10285.38 ( 3.44%)* 12265.59 (19.03%)* 10175.33 ( 2.40%)*
> 1.31% 1.38% 9.97% 8.33%
> 6144 12775.08 ( 0.00%)* 10559.63 (-20.98%) 13139.34 ( 2.77%) 13210.79 ( 3.30%)*
> 1.45% 1.00% 1.00% 2.99%
> 8192 10933.93 ( 0.00%)* 10534.41 (-3.79%)* 10876.42 (-0.53%)* 10738.25 (-1.82%)*
> 14.29% 2.10% 12.50% 9.55%
> 10240 12868.58 ( 0.00%) 12991.65 ( 0.95%) 10892.20 (-18.14%) 13106.01 ( 1.81%)
> 12288 11854.97 ( 0.00%) 12122.34 ( 2.21%)* 12129.79 ( 2.27%)* 12411.84 ( 4.49%)*
> 1.00% 6.61% 5.78% 8.95%
> 14336 12552.48 ( 0.00%)* 12501.71 (-0.41%)* 12274.54 (-2.26%) 12322.63 (-1.87%)*
> 6.05% 2.58% 1.00% 2.23%
> 16384 11733.09 ( 0.00%)* 12735.05 ( 7.87%)* 13195.68 (11.08%)* 14401.62 (18.53%)
> 1.14% 9.79% 10.30% 1.00%
>
> The results for the patches are a bit all over the place for TCP_STREAM
> with big gains and losses depending on the packet size, particularly 6144
> for some reason. SLUB vs SLAB shows SLAB often has really massive advantages
> and this is not always for the larger packet sizes where the page allocator
> might be a suspect.
>

TCP_STREAM stresses a few specific caches:

ALLOC_FASTPATH ALLOC_SLOWPATH FREE_FASTPATH FREE_SLOWPATH
kmalloc-256 3868530 3450592 95628 7223491
kmalloc-1024 2440434 429 2430825 10034
kmalloc-4096 3860625 1036723 85571 4811779

This demonstrates that freeing to full (or partial) slabs causes a lot of
pain since the fastpath normally can't be utilized and that's probably
beyond the scope of this patchset.

It's also different from the cpu slab thrashing issue I identified with
the TCP_RR benchmark and had a patchset to somewhat improve. The
criticism was the addition of an increment to a fastpath counter in struct
kmem_cache_cpu which could probably now be much cheaper with these
optimizations.

2009-10-16 10:50:52

by Mel Gorman

[permalink] [raw]

Subject: Re: [this_cpu_xx V6 7/7] this_cpu: slub aggressive use of this_cpu operations in the hotpaths

On Wed, Oct 14, 2009 at 11:56:29AM -0400, Christoph Lameter wrote:
> On Wed, 14 Oct 2009, Pekka Enberg wrote:
>
> > SLAB is able to queue lots of large objects but SLUB can't do that because it
> > has no queues. In SLUB, each CPU gets a page assigned to it that serves as a
> > "queue" but the size of the queue gets smaller as object size approaches page
> > size.
> >
> > We try to offset that with higher order allocations but IIRC we don't increase
> > the order linearly with object size and cap it to some reasonable maximum.
>
> You can test to see if larger pages have an influence by passing
>
> slub_max_order=6
>
> or so on the kernel command line.
>
> You can force a large page use in slub by setting
>
> slub_min_order=3
>
> f.e.
>
> Or you can force a mininum number of objecxcts in slub through f.e.
>
> slub_min_objects=50
>
>
>
> slub_max_order=6 slub_min_objects=50
>
> should result in pretty large slabs with lots of in page objects that
> allow slub to queue better.
>

Here are the results of that suggestion. They are side-by-side with the
other results so the columns are

SLUB-vanilla No other patches applied, SLUB configured
vanilla-highorder No other patches + slub_max_order=6 slub_min_objects=50
SLUB-this-cpu The patches in this set applied
this-cpu-higher These patches + slub_max_order=6 slub_min_objects=50
SLAB-vanilla No other patches, SLAB configured
SLAB-this-cpu Thes patches, SLAB configured

SLUB-vanilla vanilla-highorder SLUB-this-cpu this-cpu-highorder SLAB-vanilla SLAB-this-cpu
Elapsed min 92.95 ( 0.00%) 92.64 ( 0.33%) 92.62 ( 0.36%) 92.77 ( 0.19%) 92.93 ( 0.02%) 92.62 ( 0.36%)
Elapsed mean 93.11 ( 0.00%) 92.89 ( 0.24%) 92.74 ( 0.40%) 92.82 ( 0.31%) 93.00 ( 0.13%) 92.82 ( 0.32%)
Elapsed stddev 0.10 ( 0.00%) 0.15 (-58.74%) 0.14 (-40.55%) 0.09 ( 7.73%) 0.04 (55.47%) 0.18 (-84.33%)
Elapsed max 93.20 ( 0.00%) 93.04 ( 0.17%) 92.95 ( 0.27%) 92.98 ( 0.24%) 93.05 ( 0.16%) 93.09 ( 0.12%)
User min 323.21 ( 0.00%) 323.38 (-0.05%) 322.60 ( 0.19%) 323.26 (-0.02%) 322.50 ( 0.22%) 323.26 (-0.02%)
User mean 323.81 ( 0.00%) 323.64 ( 0.05%) 323.20 ( 0.19%) 323.56 ( 0.08%) 323.16 ( 0.20%) 323.54 ( 0.08%)
User stddev 0.40 ( 0.00%) 0.38 ( 4.24%) 0.46 (-15.30%) 0.27 (33.20%) 0.48 (-20.92%) 0.29 (26.07%)
User max 324.32 ( 0.00%) 324.30 ( 0.01%) 323.72 ( 0.19%) 323.96 ( 0.11%) 323.86 ( 0.14%) 323.98 ( 0.10%)
System min 35.95 ( 0.00%) 35.33 ( 1.72%) 35.50 ( 1.25%) 35.95 ( 0.00%) 35.35 ( 1.67%) 36.01 (-0.17%)
System mean 36.30 ( 0.00%) 35.99 ( 0.87%) 35.96 ( 0.96%) 36.20 ( 0.28%) 36.17 ( 0.36%) 36.23 ( 0.21%)
System stddev 0.25 ( 0.00%) 0.41 (-59.25%) 0.45 (-75.60%) 0.15 (41.61%) 0.56 (-121.14%) 0.14 (46.14%)
System max 36.65 ( 0.00%) 36.44 ( 0.57%) 36.67 (-0.05%) 36.32 ( 0.90%) 36.94 (-0.79%) 36.39 ( 0.71%)
CPU min 386.00 ( 0.00%) 386.00 ( 0.00%) 386.00 ( 0.00%) 386.00 ( 0.00%) 386.00 ( 0.00%) 386.00 ( 0.00%)
CPU mean 386.25 ( 0.00%) 386.75 (-0.13%) 386.75 (-0.13%) 386.75 (-0.13%) 386.00 ( 0.06%) 387.25 (-0.26%)
CPU stddev 0.43 ( 0.00%) 0.83 (-91.49%) 0.83 (-91.49%) 0.43 ( 0.00%) 0.00 (100.00%) 0.83 (-91.49%)
CPU max 387.00 ( 0.00%) 388.00 (-0.26%) 388.00 (-0.26%) 387.00 ( 0.00%) 386.00 ( 0.26%) 388.00 (-0.26%)

The high-order allocations help here, but not by a massive amount. In some
cases it made things slightly worse. However, the standard deviations are
generally high enough to file most of the results under "noise"

NETPERF UDP
SLUB-vanilla vanilla-highorder SLUB-this-cpu this-cpu-highorder SLAB-vanilla SLAB-this-cpu
64 148.48 ( 0.00%) 146.28 (-1.50%) 152.03 ( 2.34%) 152.20 ( 2.44%) 147.45 (-0.70%) 150.07 ( 1.06%)
128 294.65 ( 0.00%) 286.80 (-2.74%) 299.92 ( 1.76%) 302.55 ( 2.61%) 289.20 (-1.88%) 290.15 (-1.55%)
256 583.63 ( 0.00%) 564.84 (-3.33%) 609.14 ( 4.19%) 587.53 ( 0.66%) 590.78 ( 1.21%) 586.42 ( 0.48%)
1024 2217.90 ( 0.00%) 2176.12 (-1.92%) 2261.99 ( 1.95%) 2312.12 ( 4.08%) 2219.64 ( 0.08%) 2207.93 (-0.45%)
2048 4164.27 ( 0.00%) 4154.96 (-0.22%) 4161.47 (-0.07%) 4244.60 ( 1.89%) 4216.46 ( 1.24%) 4155.11 (-0.22%)
3312 6284.17 ( 0.00%) 6121.32 (-2.66%) 6383.24 ( 1.55%) 6356.61 ( 1.14%) 6231.88 (-0.84%) 6243.82 (-0.65%)
4096 7399.42 ( 0.00%) 7327.40 (-0.98%)* 7686.38 ( 3.73%) 7633.64 ( 3.07%) 7394.89 (-0.06%) 7487.91 ( 1.18%)
1.00% 1.07% 1.00% 1.00% 1.00% 1.00%
6144 10014.35 ( 0.00%) 10061.59 ( 0.47%) 10199.48 ( 1.82%) 10223.16 ( 2.04%) 9927.92 (-0.87%)* 10067.40 ( 0.53%)
1.00% 1.00% 1.00% 1.00% 1.08% 1.00%
8192 11232.50 ( 0.00%)* 11222.92 (-0.09%)* 11368.13 ( 1.19%)* 11403.82 ( 1.50%)* 12280.88 ( 8.54%)* 12244.23 ( 8.26%)
1.65% 1.37% 1.64% 1.16% 1.32% 1.00%
10240 12961.87 ( 0.00%) 12746.40 (-1.69%)* 13099.82 ( 1.05%)* 12767.02 (-1.53%)* 13816.33 ( 6.18%)* 13927.18 ( 6.93%)
1.00% 2.34% 1.03% 1.26% 1.21% 1.00%
12288 14403.74 ( 0.00%)* 14136.36 (-1.89%)* 14276.89 (-0.89%)* 14246.18 (-1.11%)* 15173.09 ( 5.07%)* 15464.05 ( 6.86%)*
1.31% 1.60% 1.63% 1.60% 1.93% 1.55%
14336 15229.98 ( 0.00%)* 14962.61 (-1.79%)* 15218.52 (-0.08%)* 15243.51 ( 0.09%) 16412.94 ( 7.21%) 16252.98 ( 6.29%)
1.37% 1.66% 2.76% 1.00% 1.00% 1.00%
16384 15367.60 ( 0.00%)* 15543.13 ( 1.13%)* 16038.71 ( 4.18%) 15870.54 ( 3.17%)* 16635.91 ( 7.62%) 17128.87 (10.28%)*
1.29% 1.34% 1.00% 2.18% 1.00% 6.36%

Configuring use of high-order pages actually hurt SLUB mostly on the unpatched
kernel. The results are mixed with the patches applied. Hard to draw anything
very conclusive to be honest. Based on these results, I wouldn't push the
high-order allocations aggressively.

NETPERF TCP
SLUB-vanilla vanilla-highorder SLUB-this-cpu this-cpu-highorder SLAB-vanilla SLAB-this-cpu
64 1773.00 ( 0.00%) 1812.07 ( 2.16%)* 1731.63 (-2.39%)* 1717.99 (-3.20%)* 1794.48 ( 1.20%) 2029.46 (12.64%)
1.00% 5.88% 2.43% 2.83% 1.00% 1.00%
128 3181.12 ( 0.00%) 3193.06 ( 0.37%)* 3471.22 ( 8.36%) 3154.79 (-0.83%) 3296.37 ( 3.50%) 3251.33 ( 2.16%)
1.00% 1.70% 1.00% 1.00% 1.00% 1.00%
256 4794.35 ( 0.00%) 4813.37 ( 0.40%) 4797.38 ( 0.06%) 4819.16 ( 0.51%) 4912.99 ( 2.41%) 4846.86 ( 1.08%)
1024 9438.10 ( 0.00%) 8144.02 (-15.89%) 8681.05 (-8.72%)* 8204.11 (-15.04%) 8270.58 (-14.12%) 8268.85 (-14.14%)
1.00% 1.00% 7.31% 1.00% 1.00% 1.00%
2048 9196.06 ( 0.00%) 11233.72 (18.14%) 9375.72 ( 1.92%) 10487.89 (12.32%)* 11474.59 (19.86%) 9420.01 ( 2.38%)
1.00% 1.00% 1.00% 9.43% 1.00% 1.00%
3312 10338.49 ( 0.00%)* 9730.79 (-6.25%)* 10021.82 (-3.16%)* 10089.90 (-2.46%)* 12018.72 (13.98%)* 12069.28 (14.34%)*
9.49% 2.51% 6.36% 5.96% 1.21% 2.12%
4096 9931.20 ( 0.00%)* 12447.88 (20.22%) 10285.38 ( 3.44%)* 10548.56 ( 5.85%)* 12265.59 (19.03%)* 10175.33 ( 2.40%)*
1.31% 1.00% 1.38% 8.22% 9.97% 8.33%
6144 12775.08 ( 0.00%)* 10489.24 (-21.79%)* 10559.63 (-20.98%) 11033.15 (-15.79%)* 13139.34 ( 2.77%) 13210.79 ( 3.30%)*
1.45% 8.46% 1.00% 12.65% 1.00% 2.99%
8192 10933.93 ( 0.00%)* 10340.42 (-5.74%)* 10534.41 (-3.79%)* 10845.36 (-0.82%)* 10876.42 (-0.53%)* 10738.25 (-1.82%)*
14.29% 2.38% 2.10% 1.83% 12.50% 9.55%
10240 12868.58 ( 0.00%) 11211.60 (-14.78%)* 12991.65 ( 0.95%) 11330.97 (-13.57%)* 10892.20 (-18.14%) 13106.01 ( 1.81%)
1.00% 11.36% 1.00% 6.64% 1.00% 1.00%
12288 11854.97 ( 0.00%) 11854.51 (-0.00%) 12122.34 ( 2.21%)* 12258.61 ( 3.29%)* 12129.79 ( 2.27%)* 12411.84 ( 4.49%)*
1.00% 1.00% 6.61% 5.69% 5.78% 8.95%
14336 12552.48 ( 0.00%)* 12309.15 (-1.98%) 12501.71 (-0.41%)* 13683.57 ( 8.27%)* 12274.54 (-2.26%) 12322.63 (-1.87%)*
6.05% 1.00% 2.58% 2.46% 1.00% 2.23%
16384 11733.09 ( 0.00%)* 11856.66 ( 1.04%)* 12735.05 ( 7.87%)* 13482.61 (12.98%)* 13195.68 (11.08%)* 14401.62 (18.53%)
1.14% 1.05% 9.79% 11.52% 10.30% 1.00%

Configuring high-rder helper in a few cases here and in one or two
cases close the gap with SLAB, particularly for large packet sizes.
However, it still suffered for the small packet sizes.

SYSBENCH
SLUB-vanilla vanilla-highorder SLUB-this-cpu this-cpu-highorder SLAB-vanilla SLAB-this-cpu
1 26950.79 ( 0.00%) 26723.98 (-0.85%) 26822.05 (-0.48%) 26877.71 (-0.27%) 26919.89 (-0.11%) 26746.18 (-0.77%)
2 51555.51 ( 0.00%) 51231.41 (-0.63%) 51928.02 ( 0.72%) 51794.47 ( 0.46%) 51370.02 (-0.36%) 51129.82 (-0.83%)
3 76204.23 ( 0.00%) 76060.77 (-0.19%) 76333.58 ( 0.17%) 76270.53 ( 0.09%) 76483.99 ( 0.37%) 75954.52 (-0.33%)
4 100599.12 ( 0.00%) 100825.16 ( 0.22%) 101757.98 ( 1.14%) 100273.02 (-0.33%) 100499.65 (-0.10%) 101605.61 ( 0.99%)
5 100211.45 ( 0.00%) 100096.77 (-0.11%) 100435.33 ( 0.22%) 101132.16 ( 0.91%) 100150.98 (-0.06%) 99398.11 (-0.82%)
6 99390.81 ( 0.00%) 99305.36 (-0.09%) 99840.85 ( 0.45%) 99200.53 (-0.19%) 99234.38 (-0.16%) 99244.42 (-0.15%)
7 98740.56 ( 0.00%) 98625.23 (-0.12%) 98727.61 (-0.01%) 98470.75 (-0.27%) 98305.88 (-0.44%) 98123.56 (-0.63%)
8 98075.89 ( 0.00%) 97609.30 (-0.48%) 98048.62 (-0.03%) 97092.44 (-1.01%) 98183.99 ( 0.11%) 97587.82 (-0.50%)
9 96502.22 ( 0.00%) 96685.39 ( 0.19%) 97276.80 ( 0.80%) 96800.23 ( 0.31%) 96819.88 ( 0.33%) 97320.51 ( 0.84%)
10 96598.70 ( 0.00%) 96272.05 (-0.34%) 96545.37 (-0.06%) 95936.97 (-0.69%) 96222.51 (-0.39%) 96221.69 (-0.39%)
11 95500.66 ( 0.00%) 95141.00 (-0.38%) 95671.11 ( 0.18%) 96057.84 ( 0.58%) 95003.21 (-0.52%) 95246.81 (-0.27%)
12 94572.87 ( 0.00%) 94811.46 ( 0.25%) 95266.70 ( 0.73%) 93767.06 (-0.86%) 93807.60 (-0.82%) 94859.82 ( 0.30%)
13 93811.85 ( 0.00%) 93597.39 (-0.23%) 94309.18 ( 0.53%) 93323.96 (-0.52%) 93219.81 (-0.64%) 93051.63 (-0.82%)
14 92972.16 ( 0.00%) 92936.53 (-0.04%) 93849.87 ( 0.94%) 92545.83 (-0.46%) 92641.50 (-0.36%) 92916.70 (-0.06%)
15 92276.06 ( 0.00%) 91559.63 (-0.78%) 92454.94 ( 0.19%) 91748.29 (-0.58%) 91094.04 (-1.30%) 91972.79 (-0.33%)
16 90265.35 ( 0.00%) 89707.32 (-0.62%) 90416.26 ( 0.17%) 89253.93 (-1.13%) 89309.26 (-1.07%) 90103.89 (-0.18%)

High-order didn't really help here either.

Overall, it would appear that high-order allocations occasionally help
but the margins are pretty small.

--
Mel Gorman
Part-time Phd Student Linux Technology Center
University of Limerick IBM Dublin Software Lab

2009-10-16 16:53:13

[permalink] [raw]

Subject: Re: [this_cpu_xx V6 7/7] this_cpu: slub aggressive use of this_cpu operations in the hotpaths

On Thu, 15 Oct 2009, David Rientjes wrote:

> TCP_STREAM stresses a few specific caches:
>
> ALLOC_FASTPATH ALLOC_SLOWPATH FREE_FASTPATH FREE_SLOWPATH
> kmalloc-256 3868530 3450592 95628 7223491
> kmalloc-1024 2440434 429 2430825 10034
> kmalloc-4096 3860625 1036723 85571 4811779
>
> This demonstrates that freeing to full (or partial) slabs causes a lot of
> pain since the fastpath normally can't be utilized and that's probably
> beyond the scope of this patchset.
>
> It's also different from the cpu slab thrashing issue I identified with
> the TCP_RR benchmark and had a patchset to somewhat improve. The
> criticism was the addition of an increment to a fastpath counter in struct
> kmem_cache_cpu which could probably now be much cheaper with these
> optimizations.

Can you redo the patch?

2009-10-16 18:40:51

[permalink] [raw]

Subject: Re: [this_cpu_xx V6 7/7] this_cpu: slub aggressive use of this_cpu operations in the hotpaths

On Fri, 16 Oct 2009, Mel Gorman wrote:

> NETPERF TCP
> SLUB-vanilla vanilla-highorder SLUB-this-cpu this-cpu-highorder SLAB-vanilla SLAB-this-cpu
> 64 1773.00 ( 0.00%) 1812.07 ( 2.16%)* 1731.63 (-2.39%)* 1717.99 (-3.20%)* 1794.48 ( 1.20%) 2029.46 (12.64%)
> 1.00% 5.88% 2.43% 2.83% 1.00% 1.00%
> 128 3181.12 ( 0.00%) 3193.06 ( 0.37%)* 3471.22 ( 8.36%) 3154.79 (-0.83%) 3296.37 ( 3.50%) 3251.33 ( 2.16%)
> 1.00% 1.70% 1.00% 1.00% 1.00% 1.00%
> 256 4794.35 ( 0.00%) 4813.37 ( 0.40%) 4797.38 ( 0.06%) 4819.16 ( 0.51%) 4912.99 ( 2.41%) 4846.86 ( 1.08%)
> 1024 9438.10 ( 0.00%) 8144.02 (-15.89%) 8681.05 (-8.72%)* 8204.11 (-15.04%) 8270.58 (-14.12%) 8268.85 (-14.14%)
> 1.00% 1.00% 7.31% 1.00% 1.00% 1.00%
> 2048 9196.06 ( 0.00%) 11233.72 (18.14%) 9375.72 ( 1.92%) 10487.89 (12.32%)* 11474.59 (19.86%) 9420.01 ( 2.38%)
> 1.00% 1.00% 1.00% 9.43% 1.00% 1.00%
> 3312 10338.49 ( 0.00%)* 9730.79 (-6.25%)* 10021.82 (-3.16%)* 10089.90 (-2.46%)* 12018.72 (13.98%)* 12069.28 (14.34%)*
> 9.49% 2.51% 6.36% 5.96% 1.21% 2.12%
> 4096 9931.20 ( 0.00%)* 12447.88 (20.22%) 10285.38 ( 3.44%)* 10548.56 ( 5.85%)* 12265.59 (19.03%)* 10175.33 ( 2.40%)*
> 1.31% 1.00% 1.38% 8.22% 9.97% 8.33%
> 6144 12775.08 ( 0.00%)* 10489.24 (-21.79%)* 10559.63 (-20.98%) 11033.15 (-15.79%)* 13139.34 ( 2.77%) 13210.79 ( 3.30%)*
> 1.45% 8.46% 1.00% 12.65% 1.00% 2.99%
> 8192 10933.93 ( 0.00%)* 10340.42 (-5.74%)* 10534.41 (-3.79%)* 10845.36 (-0.82%)* 10876.42 (-0.53%)* 10738.25 (-1.82%)*
> 14.29% 2.38% 2.10% 1.83% 12.50% 9.55%
> 10240 12868.58 ( 0.00%) 11211.60 (-14.78%)* 12991.65 ( 0.95%) 11330.97 (-13.57%)* 10892.20 (-18.14%) 13106.01 ( 1.81%)
> 1.00% 11.36% 1.00% 6.64% 1.00% 1.00%
> 12288 11854.97 ( 0.00%) 11854.51 (-0.00%) 12122.34 ( 2.21%)* 12258.61 ( 3.29%)* 12129.79 ( 2.27%)* 12411.84 ( 4.49%)*
> 1.00% 1.00% 6.61% 5.69% 5.78% 8.95%
> 14336 12552.48 ( 0.00%)* 12309.15 (-1.98%) 12501.71 (-0.41%)* 13683.57 ( 8.27%)* 12274.54 (-2.26%) 12322.63 (-1.87%)*
> 6.05% 1.00% 2.58% 2.46% 1.00% 2.23%
> 16384 11733.09 ( 0.00%)* 11856.66 ( 1.04%)* 12735.05 ( 7.87%)* 13482.61 (12.98%)* 13195.68 (11.08%)* 14401.62 (18.53%)
> 1.14% 1.05% 9.79% 11.52% 10.30% 1.00%
>
> Configuring high-rder helper in a few cases here and in one or two
> cases close the gap with SLAB, particularly for large packet sizes.
> However, it still suffered for the small packet sizes.
>

This is understandable considering the statistics that I posted for this
workload on my machine, higher order cpu slabs will naturally get freed to
more often from the fastpath, which also causes it to utilize the
allocation fastpath more often (and we can see the optimization of this
patchset), in addition to avoiding partial list handling.

The pain with the smaller packet sizes is probably the overhead from the
page allocator more than slub, a characteristic that also caused the
TCP_RR benchmark to suffer. It can be mitigated somewhat with slab
preallocation or a higher min_partial setting, but that's probably not an
optimal solution.

2009-10-16 18:43:27

[permalink] [raw]

Subject: Re: [this_cpu_xx V6 7/7] this_cpu: slub aggressive use of this_cpu operations in the hotpaths

On Fri, 16 Oct 2009, Christoph Lameter wrote:

> > TCP_STREAM stresses a few specific caches:
> >
> > ALLOC_FASTPATH ALLOC_SLOWPATH FREE_FASTPATH FREE_SLOWPATH
> > kmalloc-256 3868530 3450592 95628 7223491
> > kmalloc-1024 2440434 429 2430825 10034
> > kmalloc-4096 3860625 1036723 85571 4811779
> >
> > This demonstrates that freeing to full (or partial) slabs causes a lot of
> > pain since the fastpath normally can't be utilized and that's probably
> > beyond the scope of this patchset.
> >
> > It's also different from the cpu slab thrashing issue I identified with
> > the TCP_RR benchmark and had a patchset to somewhat improve. The
> > criticism was the addition of an increment to a fastpath counter in struct
> > kmem_cache_cpu which could probably now be much cheaper with these
> > optimizations.
>
> Can you redo the patch?
>

Sure, but it would be even more inexpensive if we can figure out why the
irqless patch is hanging my netserver machine within the first 60 seconds
on the TCP_RR benchmark. I guess nobody else has reproduced that yet.

2009-10-16 18:58:45