1) Tbench shows about a 30% regression on kernel 2.6.23-rc4 compared with 2.6.22,
and about a 10% regression on 2.6.23-rc1. I investigated 2.6.22 and 2.6.23-rc4.
2) Testing environment: x86_64, quad-core, 2 physical processors (8 cores in
total), 8GB memory. The kernel is built with CONFIG_SLUB=y and CONFIG_SLUB_DEBUG=y.
3) In my environment, I started CPU_NUMBER*2 tbench client processes and the
same number of server processes, so 16 tbench and 16 tbench_srv processes run
with a 1:1 mapping. Each tbench client communicates with its tbench_srv
interactively over a TCP socket.
4) Oprofile data shows __slab_alloc accounts for about 15% of samples in
2.6.23-rc4 versus about 3.8% in 2.6.22.
5) Slabinfo shows kmalloc-4096 and skbuff_head_cache are the active caches;
other slabs are mostly quiet.
6) I collected data about slab_alloc. The data consists of:
   a) the number of calls to slab_alloc;
   b) the number of objects obtained from the per-cpu slab cache;
   c) the number of objects obtained from a new slab or a partial slab;
   d) the number of objects freed outside the per-cpu cache.
The data shows that skbuff_head_cache allocations mostly succeed in the
per-cpu cache, so they do not cause many __slab_alloc calls. kmalloc-4096 is
the slab that causes most of the __slab_alloc calls (a sketch of this kind of
instrumentation follows this item).
On 2.6.22, about 58% of kmalloc-4096 allocations succeed in the per-cpu slab cache.
On 2.6.23-rc4, only about 12.5% of kmalloc-4096 allocations succeed in the per-cpu slab cache.
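For illustration, the kind of per-cpu counters behind a)-d) could look like
the sketch below. The structure, counter names, and hook points are
assumptions made for this sketch, not the actual instrumentation patch used
for these measurements.

#include <linux/percpu.h>

/* Illustrative only: per-cpu counters matching items a)-d) above. */
struct slab_alloc_stats {
	unsigned long total_allocs;	/* a) calls into slab_alloc()             */
	unsigned long cpu_cache_hits;	/* b) objects taken from the per-cpu slab */
	unsigned long slow_allocs;	/* c) objects from a new or partial slab  */
	unsigned long remote_frees;	/* d) frees that miss the per-cpu cache   */
};

static DEFINE_PER_CPU(struct slab_alloc_stats, slab_alloc_stats);

/* Called from the fast path when the per-cpu slab satisfies the request. */
static inline void count_cpu_cache_hit(void)
{
	struct slab_alloc_stats *s = &__get_cpu_var(slab_alloc_stats);

	s->total_allocs++;
	s->cpu_cache_hits++;
}

/* Called from __slab_alloc() when a new or partial slab has to be used. */
static inline void count_slow_alloc(void)
{
	struct slab_alloc_stats *s = &__get_cpu_var(slab_alloc_stats);

	s->total_allocs++;
	s->slow_allocs++;
}

/* Called from the free path when the object does not return to the per-cpu slab. */
static inline void count_remote_free(void)
{
	__get_cpu_var(slab_alloc_stats).remote_frees++;
}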
7) By instrumenting the kernel, I found that these kmalloc-4096 objects are
always allocated at tcp_sendmsg=>sk_stream_alloc_pskb and freed at
tcp_ack=>tcp_clean_rtx_queue=>sk_stream_free_skb. When a tbench client
communicates with its tbench_srv, the sender allocates a kmalloc-4096 object
and the receiver frees it.
8) kmalloc-4096 uses order-1 slabs (two 4 KB pages), so one slab holds only
2 objects and a partial slab always has exactly one free object. On the other
hand, SLUB caches only one slab per cpu. If a tbench client process takes a
kmalloc-4096 object from a partial slab on a cpu and makes that slab the
per-cpu cache, the slab is left with no free objects. So when another tbench
process later asks for a kmalloc-4096 object on the same cpu, it cannot get a
free object from the per-cpu cache; it gets an object, usually from another
partial slab, which then replaces the per-cpu slab cache even though the new
slab also has no free objects left.
9) I collected more per-cpu data to check whether an object is freed on the
same cpu on which it was allocated. On both 2.6.22 and 2.6.23-rc3, an object
is almost always allocated and freed on the same cpu. That means the tbench
client and tbench_srv processes that communicate with each other mostly run
on the same cpu.
10) I ran both kernels with the boot parameter maxcpus=1 and found the
regression drops to about 10%.
11) On my machine, on average, each cpu runs 2 tbench client processes and
2 tbench_srv processes, so there are a couple of scheduling scenarios:
   a) Client 1 allocates a 4096-byte object and installs the new, now-full
slab as the per-cpu slab. tbench_srv 1 consumes the data and frees the
4096-byte object on the same cpu, so the per-cpu slab now has a free object.
Then tbench_srv 1 replies to client 1 by allocating a new 4096-byte object,
or client 2 allocates a 4096-byte object from the per-cpu slab cache to talk
to tbench_srv 2. This scenario is ideal.
   b) Client 1 allocates a 4096-byte object, installs the new, now-full slab
as the per-cpu slab, and then sleeps waiting for tbench_srv 1 to reply. But
client 2 allocates a 4096-byte object, finds the per-cpu slab has no free
object, so it takes an object from a partial slab and installs that new,
now-full slab as the per-cpu slab. When tbench_srv 1 is scheduled in, it
frees its kmalloc-4096 object back to a partial slab, because the slab it
came from is no longer the per-cpu cache. tbench_srv 1 then tries to allocate
a new kmalloc-4096 object to reply to client 1, but because the per-cpu slab
has no free object, it also has to take a free object from a partial slab and
replace the per-cpu slab cache. This scenario is very bad.
Under both scenarios, I think the scheduler wakes up the sleeping processes
on the same cpu. In scenario a), the woken process is scheduled to run
quickly (immediately?). In scenario b), the woken process is scheduled later.
I think kernel 2.6.22 creates scenario a) and 2.6.23-rc4 creates scenario b).
12) How can the issue be resolved? There are two directions:
   a) Change the process scheduler to schedule woken processes first.
   b) Change the SLUB per-cpu slab cache to hold more than one slab.
page->lru could be used to link the slab pages into a list anchored in
kmem_cache->cpu_slab[], whose members would need to become list_heads. The
number of slabs allowed in a per-cpu cache could be a sysfs parameter under
/sys/slab/XXX/, with a default of 1 to suit big machines. A minimal sketch
of this idea follows.
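To make direction b) concrete, here is a minimal sketch. The structure layout
and helper below are assumptions for illustration only; the real
kmem_cache_cpu layout, locking, and eviction policy would have to be worked out.

#include <linux/list.h>
#include <linux/mm.h>

/*
 * Sketch only: let each cpu cache a short list of slabs instead of a single
 * one, linking the slab pages through page->lru. Names are illustrative,
 * not the existing mm/slub.c structures.
 */
struct kmem_cache_cpu_sketch {
	struct list_head slabs;		/* per-cpu slabs, linked via page->lru       */
	unsigned int nr_slabs;		/* slabs currently cached on this cpu        */
	unsigned int max_slabs;		/* tunable via /sys/slab/<cache>/, default 1 */
};

/* Cache a slab on this cpu; push the oldest one back when over the limit. */
static void cpu_cache_add_slab(struct kmem_cache_cpu_sketch *c,
			       struct page *slab_page)
{
	list_add(&slab_page->lru, &c->slabs);
	if (++c->nr_slabs > c->max_slabs) {
		struct page *victim = list_entry(c->slabs.prev,
						 struct page, lru);

		list_del(&victim->lru);
		c->nr_slabs--;
		/* the victim slab would go back to the node's partial list here */
	}
}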
--yanmin
On Wed, 5 Sep 2007, Zhang, Yanmin wrote:
> 8) kmalloc-4096 uses order-1 slabs (two 4 KB pages), so one slab holds only 2 objects.
You can change that by booting with slub_max_order=0. Then we can also use
the per-cpu queues to get these order-0 objects, which may speed up the
allocations because we do not have to take zone locks on slab allocation.
Note also that Andrew's tree has a page allocator pass-through for SLUB
for 4k kmallocs, bypassing slab completely. That may also address the
issue.
If you want SLUB to handle more objects in the 4k kmalloc cache
without going to the page allocator, then you can boot, for example, with
slub_max_order=3 slub_min_objects=8
which will result in a kmalloc-4096 that caches 8 objects.
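(For reference: with 4 KB pages, slub_max_order=3 allows an order-3 slab of
2^3 = 8 pages = 32 KB, and 32 KB / 4 KB per object gives the 8 cached objects
mentioned above.)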
> b) Change the SLUB per-cpu slab cache to hold more than one slab. page->lru
> could be used to link the slab pages into a list anchored in
> kmem_cache->cpu_slab[], whose members would need to become list_heads. The
> number of slabs allowed in a per-cpu cache could be a sysfs parameter under
> /sys/slab/XXX/, with a default of 1 to suit big machines.
Try the ways to address the issue that I mentioned above.
On Tue, 2007-09-04 at 20:59 -0700, Christoph Lameter wrote:
> On Wed, 5 Sep 2007, Zhang, Yanmin wrote:
>
> > 8) kmalloc-4096 uses order-1 slabs (two 4 KB pages), so one slab holds only 2 objects.
>
> You can change that by booting with slub_max_order=0. Then we can also use
> the per cpu queues to get these order 0 objects which may speed up the
> allocations because we do not have to take zone locks on slab allocation.
>
> Note also that Andrew's tree has a page allocator pass through for SLUB
> for 4k kmallocs bypassing slab completely. That may also address the
> issue.
>
> If you want SLUB to handle more objects in the 4k kmalloc cache
> without going to the page allocator then you can boot f.e. with
>
> slub_max_order=3 slub_min_objects=8
I tried this approach. The testing results show 2.6.23-rc4 is about
2.5% better than 2.6.22. It really resolves the issue.
However, this approach applies the same policy to all slabs. Could we
implement a per-slab approach like direction b)?
>
> which will result in a kmalloc-4096 that caches 8 objects.
>
> > b) Change the SLUB per-cpu slab cache to hold more than one slab. page->lru
> > could be used to link the slab pages into a list anchored in
> > kmem_cache->cpu_slab[], whose members would need to become list_heads. The
> > number of slabs allowed in a per-cpu cache could be a sysfs parameter under
> > /sys/slab/XXX/, with a default of 1 to suit big machines.
Direction b) above looks more flexible.
In addition, could the process scheduler be enhanced to schedule woken
processes first, or otherwise favor woken processes? From a cache-hotness
point of view, that might help performance, because the woken process and
the waker usually share some data.
> Try the ways to address the issue that I mentioned above.
I really appreciate your kind comments!
-yanmin
On Wed, 5 Sep 2007, Zhang, Yanmin wrote:
> On Tue, 2007-09-04 at 20:59 -0700, Christoph Lameter wrote:
> > On Wed, 5 Sep 2007, Zhang, Yanmin wrote:
> >
> > > 8) kmalloc-4096 uses order-1 slabs (two 4 KB pages), so one slab holds only 2 objects.
> >
> > You can change that by booting with slub_max_order=0. Then we can also use
> > the per cpu queues to get these order 0 objects which may speed up the
> > allocations because we do not have to take zone locks on slab allocation.
> >
> > Note also that Andrew's tree has a page allocator pass through for SLUB
> > for 4k kmallocs bypassing slab completely. That may also address the
> > issue.
> >
> > If you want SLUB to handle more objects in the 4k kmalloc cache
> > without going to the page allocator then you can boot f.e. with
> >
> > slub_max_order=3 slub_min_objects=8
> I tried this approach. The testing results show 2.6.23-rc4 is about
> 2.5% better than 2.6.22. It really resolves the issue.
>
> However, this approach applies the same policy to all slabs. Could we
> implement a per-slab approach like direction b)?
I am not sure what you mean by same policy. Same configuration for all
slabs?
> > Try the ways to address the issue that I mentioned above.
> I really appreciate your kind comments!
Would it be possible to try the two other approaches that I suggested? I
think both of those may also solve the issue. Try booting with
slub_max_order=0 and see what effect it has. The queues of the page
allocator can be much larger than what slab has for 4k pages. There is
really not much point in using a slab allocator for page-sized
allocations.
On Wed, 5 Sep 2007, Zhang, Yanmin wrote:
> > slub_max_order=3 slub_min_objects=8
> I tried this approach. The testing results show 2.6.23-rc4 is about
> 2.5% better than 2.6.22. It really resolves the issue.
Note also that the configuration you tried is the way SLUB is configured
in Andrew's tree.
On Tue, 2007-09-04 at 23:58 -0700, Christoph Lameter wrote:
> On Wed, 5 Sep 2007, Zhang, Yanmin wrote:
>
> > On Tue, 2007-09-04 at 20:59 -0700, Christoph Lameter wrote:
> > > On Wed, 5 Sep 2007, Zhang, Yanmin wrote:
> > >
> > > > 8) kmalloc-4096 uses order-1 slabs (two 4 KB pages), so one slab holds only 2 objects.
> > >
> > > You can change that by booting with slub_max_order=0. Then we can also use
> > > the per cpu queues to get these order 0 objects which may speed up the
> > > allocations because we do not have to take zone locks on slab allocation.
> > >
> > > Note also that Andrew's tree has a page allocator pass through for SLUB
> > > for 4k kmallocs bypassing slab completely. That may also address the
> > > issue.
> > >
> > > If you want SLUB to handle more objects in the 4k kmalloc cache
> > > without going to the page allocator then you can boot f.e. with
> > >
> > > slub_max_order=3 slub_min_objects=8
> > I tried this approach. The testing results show 2.6.23-rc4 is about
> > 2.5% better than 2.6.22. It really resolves the issue.
> >
> > However, this approach applies the same policy to all slabs. Could we
> > implement a per-slab approach like direction b)?
>
> I am not sure what you mean by same policy. Same configuration for all
> slabs?
Yes.
>
> > > Try the ways to address the issue that I mentioned above.
> > I really appreciate your kind comments!
>
> Would it be possible to try the two other approaches that I suggested? I
> think both of those may also solve the issue. Try booting with
> > slub_max_order=0
1) I tried slub_max_order=0 and the regression becomes 12.5%. It's still not good.
2) I applied the patch slub-direct-pass-through-of-page-size-or-higher-kmalloc.patch
to kernel 2.6.23-rc4. The new testing result is much better, only 1% below
2.6.22.
So the best solution is booting the kernel with "slub_max_order=3 slub_min_objects=8".
> and see what effect it has. The queues of the page
> allocator can be much larger than what slab has for 4k pages. There is
> really not much of a point in using a slab allocator for page sized
> allocations.
On Wed, 5 Sep 2007, Zhang, Yanmin wrote:
> > > However, this approach applies the same policy to all slabs. Could we
> > > implement a per-slab approach like direction b)?
> >
> > I am not sure what you mean by same policy. Same configuration for all
> > slabs?
> Yes.
Ok. I could add the ability to specify parameters for some slabs.
> > Would it be possible to try the two other approaches that I suggested? I
> > think both of those may also solve the issue. Try booting with
> > slub_max_order=0
> 1) I tried slub_max_order=0 and the regression becomes 12.5%. It's still
> not good.
>
> 2) I applied the patch
> slub-direct-pass-through-of-page-size-or-higher-kmalloc.patch to kernel
> 2.6.23-rc4. The new testing result is much better, only 1% below
> 2.6.22.
Ok. That seems to indicate that we should improve the alloc path in the
page allocator. The page allocator's performance needs to be competitive
for page-sized allocations. The problem will largely go away when we
merge the pass-through patch in 2.6.24.
On Wed, 2007-09-05 at 03:45 -0700, Christoph Lameter wrote:
> On Wed, 5 Sep 2007, Zhang, Yanmin wrote:
>
> > > > However, this approach applies the same policy to all slabs. Could we
> > > > implement a per-slab approach like direction b)?
> > >
> > > I am not sure what you mean by same policy. Same configuration for all
> > > slabs?
> > Yes.
>
> Ok. I could add the ability to specify parameters for some slabs.
Thanks. That will be more flexible.
>
> > > Would it be possible to try the two other approaches that I suggested? I
> > > think both of those may also solve the issue. Try booting with
> > > slub_max_order=0
> > 1) I tried slub_max_order=0 and the regression becomes 12.5%. It's still
> > not good.
> >
> > 2) I applied the patch
> > slub-direct-pass-through-of-page-size-or-higher-kmalloc.patch to kernel
> > 2.6.23-rc4. The new testing result is much better, only 1% below
> > 2.6.22.
I retested 2.6.22 booted with "slub_max_order=3 slub_min_objects=8".
The result is about 8.7% better than without the boot parameters.
So with both kernels booted with "slub_max_order=3 slub_min_objects=8", 2.6.22 is
about 5.8% better than 2.6.23-rc4. I suspect the process scheduler is responsible
for this remaining 5.8% regression.
>
> Ok. That seems to indicate that we should improve the alloc path in the
> page allocator. The page allocator performance needs to be competitive on
> page sized allocations. The problem will be largely going away when we
> merge the pass through patch in 2.6.24.
On Wednesday 05 September 2007 17:07, Christoph Lameter wrote:
> On Wed, 5 Sep 2007, Zhang, Yanmin wrote:
> > > slub_max_order=3 slub_min_objects=8
> >
> > I tried this approach. The testing results show 2.6.23-rc4 is about
> > 2.5% better than 2.6.22. It really resolves the issue.
>
> Note also that the configuration you tried is the way SLUB is configured
> in Andrew's tree.
It still doesn't sound like it is competitive with SLAB at the same sizes.
What's the problem?
On Sat, 2007-09-08 at 18:08 +1000, Nick Piggin wrote:
> On Wednesday 05 September 2007 17:07, Christoph Lameter wrote:
> > On Wed, 5 Sep 2007, Zhang, Yanmin wrote:
> > > > slub_max_order=3 slub_min_objects=8
> > >
> > > I tried this approach. The testing results show 2.6.23-rc4 is about
> > > 2.5% better than 2.6.22. It really resolves the issue.
> >
> > Note also that the configuration you tried is the way SLUB is configured
> > in Andrew's tree.
>
> It still doesn't sound like it is competitive with SLAB at the same sizes.
> What's the problem?
The process scheduler and the small SLUB per-cpu cache work together to create
the tbench regression.
Please see the start of the thread.
-yanmin
On Monday 10 September 2007 10:56, Zhang, Yanmin wrote:
> On Sat, 2007-09-08 at 18:08 +1000, Nick Piggin wrote:
> > On Wednesday 05 September 2007 17:07, Christoph Lameter wrote:
> > > On Wed, 5 Sep 2007, Zhang, Yanmin wrote:
> > > > > slub_max_order=3 slub_min_objects=8
> > > >
> > > > I tried this approach. The testing results show 2.6.23-rc4 is about
> > > > 2.5% better than 2.6.22. It really resolves the issue.
> > >
> > > Note also that the configuration you tried is the way SLUB is
> > > configured in Andrew's tree.
> >
> > It still doesn't sound like it is competitive with SLAB at the same
> > sizes. What's the problem?
>
> The process scheduler and the small SLUB per-cpu cache work together to create
> the tbench regression.
OK, so after isolating the scheduler, then SLUB should be as fast as SLAB
at the same allocation size. That's basically what we need to do before we
can replace SLAB with it, I think?
On Mon, 10 Sep 2007, Nick Piggin wrote:
> OK, so after isolating the scheduler, then SLUB should be as fast as SLAB
> at the same allocation size. That's basically what we need to do before we
> can replace SLAB with it, I think?
The regression is due to the limited number of objects in the per-cpu
"queue" in SLUB for 4k objects. With the .23 code this is one or two
(order-1 slab). So we have to call into the page allocator frequently, and
do it for order-1 pages, which requires the zone locks. Urgh.
I think the regression is best addressed by the page allocator pass-through
patch in mm, which makes the page allocator handle these objects.
They are single pages, so the pcp lists are used, and those provide much
larger queues than SLUB/SLAB.
IMHO >=4k objects should be handled by the page allocator. From the
numbers I have seen there is then still a 1% regression left. If
that is still the case after we have fixed the scheduler then maybe
we need to slim down the page allocator fast path.
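To make the pass-through idea concrete, a rough sketch follows. This is an
illustration under assumed names, not the actual patch in -mm: kmalloc()
requests of a page or more skip the slab layer and go straight to the page
allocator, whose per-cpu page lists then act as the queue.

#include <linux/mm.h>
#include <linux/slab.h>

/* Rough sketch of the page allocator pass-through idea; illustration only. */
static inline void *kmalloc_passthrough_sketch(size_t size, gfp_t flags)
{
	if (size >= PAGE_SIZE) {
		/*
		 * Page-sized and larger requests go straight to the page
		 * allocator; order-0 requests are served from the per-cpu
		 * (pcp) page lists, avoiding the zone lock. kfree() would
		 * need a matching check to hand such allocations back via
		 * free_pages().
		 */
		return (void *)__get_free_pages(flags, get_order(size));
	}

	/* Everything smaller stays in the slab allocator. */
	return kmalloc(size, flags);
}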
On Tuesday 11 September 2007 05:07, Christoph Lameter wrote:
> On Mon, 10 Sep 2007, Nick Piggin wrote:
> > OK, so after isolating the scheduler, then SLUB should be as fast as SLAB
> > at the same allocation size. That's basically what we need to do before
> > we can replace SLAB with it, I think?
>
> The regression is due to the limited number of objects in the per cpu
> "queue" in SLUB for 4k objects. With the .23 code this is one or two
> (order 1 slab). So we have to call into the page allocator frequently and
> do it for order 1 pages which requires the zone locks. Urgh.
The impression I got at vm meeting was that SLUB was good to go :(
> I think the regression is best addressed by the page allocator pass
> through patch in mm which makes the page allocator handle these objects.
> They are single pages so the pcp lists are in use which provide much
> larger queues than SLUB/SLAB.
>
> IMHO >=4k objects should be handled by the page allocator. From the
> numbers I have seen there is then still a 1% regression left. If
> that is still the case after we have fixed the scheduler then maybe
> we need to slim down the page allocator fast path.
It is trivial to test SLUB vs SLAB independently of the scheduler change.
And actually, a scheduler regression here might just never be fixed,
because it is likely to be a higher level thing where the scheduling just
happens not to interact with tbench so well (and either it would be
impossible to find out why, or no point tuning the scheduler for such a
case).
But slab allocations don't really control the macro behaviour of a
benchmark like that so much. So don't wait until something happens
with the scheduler, fix it now.
Yay, looks like we'll get yet more logic in the VM to polish the proverbial
turd that is higher order allocations :P
On Tue, 11 Sep 2007, Nick Piggin wrote:
> The impression I got at vm meeting was that SLUB was good to go :(
It's not? I have had Intel test this thoroughly and they assured me that it
is up to SLAB. This particular case is a synthetic test for a PAGE_SIZE
alloc, and SLUB was not optimized for that case because PAGE_SIZE
allocations should be handled by the page allocator. Quicklists were
introduced for the explicit purpose of getting these messy page-sized cases
out of the slab allocators.
> But slab allocations don't really control the macro behaviour of a
> benchmark like that so much. So don't wait until something happens
> with the scheduler, fix it now.
Ok, so you are for pushing the page allocator pass-through patch from mm
into rc6? Isn't it a bit late for such a change? I would think that 2.6.24
is early enough.
On Wednesday 12 September 2007 06:19, Christoph Lameter wrote:
> On Tue, 11 Sep 2007, Nick Piggin wrote:
> > The impression I got at vm meeting was that SLUB was good to go :(
>
> It's not? I have had Intel test this thoroughly and they assured me that it
> is up to SLAB. This particular case is a synthetic test for a PAGE_SIZE
> alloc, and SLUB was not optimized for that case because PAGE_SIZE
> allocations should be handled by the page allocator. Quicklists were
> introduced for the explicit purpose of getting these messy page-sized cases
> out of the slab allocators.
I heard from one person at KS and one person here that it is not. If they're
simply missing some patch that's in -mm, and there is no longer a SLUB vs
SLAB regression when using equivalent page allocation order, then that's
fine.
On Tue, Sep 11, 2007 at 01:19:30PM -0700, Christoph Lameter wrote:
> On Tue, 11 Sep 2007, Nick Piggin wrote:
>
> > The impression I got at vm meeting was that SLUB was good to go :(
>
> It's not? I have had Intel test this thoroughly and they assured me that it
> is up to SLAB.
Christoph, I'm not sure if you are referring to me here. But our
tests (at least with the database workloads) from approximately 1.5 months back
showed that on ia64 slub was on par with slab, and on x86_64 slub was 9% down.
After changing the slub min order and max order, slub performance on x86_64 is
down approximately 3.5% or so compared to slab.
While I don't rule out large allocations like PAGE_SIZE, I am mostly
certain that the critical allocations in this workload are not PAGE_SIZE
based. They are mostly in the range of 300-500 bytes or less.
Are there any changes in recent slub that take the pressure off the page
allocator, especially for architectures with smaller page sizes? If so, we can
redo some of the experiments. Looking at this thread, it doesn't sound like it?
thanks,
suresh
On Wed, 12 Sep 2007, Siddha, Suresh B wrote:
> Christoph, Not sure if you are referring to me or not here. But our
> tests (at least with the database workloads) approx 1.5 months or so back
> showed that on ia64 slub was on par with slab and on x86_64, slub was 9% down.
> And after changing the slub min order and max order, slub perf on x86_64 is
> down approx 3.5% or so compared to slab.
No, I was referring to another talk that I had at the OLS with Corey
Gough. I keep getting confusing information from Intel. Last I heard was
that IA64 had a regression and x86_64 was fine (but they were not allowed
to tell me details). Would you please straighten out your story and give
me details?
AFAIK the two of us discussed some issues related to object handover
between processors that cause cache line bouncing and I sent you a
patchset for testing but I did not get any feedback. The patches that were
discussed are now in mm.
> While I don't rule out large sized allocations like PAGE_SIZE, I am mostly
> certain that the critical allocations in this workload are not PAGE_SIZE
> based. Mostly they are in the range less than 300-500 bytes or so.
>
> Any changes in the recent slub which takes the pressure away from the page
> allocator especially for smaller page sized architectures? If so, we can
> redo some of the experiments. Looking at this thread, it doesn't sound like?
It's too late for 2.6.23. But we can certainly do things for .24. Could you
please test the patches queued up in Andrew's tree? In particular the page
allocator pass-through and the per-cpu structure optimizations?
There is more work out of tree to optimize the fastpath, mostly
driven by Mathieu Desnoyers. I hope to get that into mm in the next few weeks,
but I do not think that it is going to be available before .25.
The work of Mathieu also has implications for the page allocator. We may
be able to significantly speed up the fastpath there as well.
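For context, the general idea behind that fastpath work, as a hedged sketch
(it assumes a cpu-local cmpxchg_local() primitive is provided by the
architecture; this is an illustration of the idea, not Mathieu's actual
patches): pop objects off the per-cpu freelist without disabling interrupts.

/*
 * Sketch only: lockless per-cpu freelist pop using a cpu-local
 * compare-and-exchange. cmpxchg_local() is assumed to be available.
 */
static inline void *freelist_pop_sketch(void **freelist)
{
	void *object, *next;

	do {
		object = *freelist;
		if (!object)
			return NULL;		/* empty: fall back to the slow path */
		next = *(void **)object;	/* a free object stores the next pointer */
	} while (cmpxchg_local(freelist, object, next) != object);

	return object;
}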
Christoph,
On Thu, Sep 13, 2007 at 11:03:53AM -0700, Christoph Lameter wrote:
> On Wed, 12 Sep 2007, Siddha, Suresh B wrote:
>
> > Christoph, Not sure if you are referring to me or not here. But our
> > tests (at least with the database workloads) approx 1.5 months or so back
> > showed that on ia64 slub was on par with slab and on x86_64, slub was 9% down.
> > And after changing the slub min order and max order, slub perf on x86_64 is
> > down approx 3.5% or so compared to slab.
>
> No, I was referring to another talk that I had at the OLS with Corey
> Gough. I keep getting confusing information from Intel. Last I heard was
Please don't go by informal talks and discussions. Please demand the numbers
and make decisions and conclusions based on those numbers. AFAIK, we haven't
posted confusing numbers so far.
> that IA64 had a regression and x86_64 was fine (but they were not allowed
> to tell me details). Would you please straighten out your story and give
> me details?
The numbers I posted in the previous e-mail are the only story we have so far.
> AFAIK the two of us discussed some issues related to object handover
> between processors that cause cache line bouncing and I sent you a
> patchset for testing but I did not get any feedback. The patches that were
Sorry, these systems are huge and availability is limited. We are raising the
priority with the performance team to test the latest slub patches.
> discussed are now in mm.
>
> > While I don't rule out large sized allocations like PAGE_SIZE, I am mostly
> > certain that the critical allocations in this workload are not PAGE_SIZE
> > based. Mostly they are in the range less than 300-500 bytes or so.
> >
> > Any changes in the recent slub which takes the pressure away from the page
> > allocator especially for smaller page sized architectures? If so, we can
> > redo some of the experiments. Looking at this thread, it doesn't sound like?
>
> It's too late for 2.6.23. But we can certainly do things for .24. Could you
> please test the patches queued up in Andrew's tree? In particular the page
> allocator pass through and the per cpu structures optimizations?
We are trying to get the latest data with 2.6.23-rc4-mm1 with and without
slub. Is this good enough?
>
> There is more work out of tree to optimize the fastpath that is mostly
> driven by Mathieu Desnoyers. I hope to get that into mm in the next weeks
> but I do not think that it is going to be available before .25.
>
> The work of Mathieu also has implications for the page allocator. We may
> be able to significantly speed up the fastpath there as well.
Ok. At least until all the regressions are addressed and all these patches are
well tested, we shouldn't do away with slab in mainline anytime soon.
Other than us, who else are you counting on to analyse slub? Do
you have any numbers that you can share which show where slub
is good or bad?
thanks,
suresh
On Fri, 14 Sep 2007, Siddha, Suresh B wrote:
> The numbers I posted in the previous e-mail are the only story we have so far.
It would be interesting to know more about how the allocator is used
there.
> Sorry, These systems are huge and limited. We are raising the priority
> with the performance team to do the latest slub patch testing.
Ok. Thanks.
> > It's too late for 2.6.23. But we can certainly do things for .24. Could you
> > please test the patches queued up in Andrew's tree? In particular the page
> > allocator pass through and the per cpu structures optimizations?
>
> We are trying to get the latest data with 2.6.23-rc4-mm1 with and without
> slub. Is this good enough?
Good enough. If you are concerned about the page allocator pass-through,
then you may want to test the page allocator pass-through patchset
separately. The fastpath of the page allocator is currently not
competitive if you always free and allocate a single page. If contiguous
pages are allocated then the pass-through is superior.
> > The work of Mathieu also has implications for the page allocator. We may
> > be able to significantly speed up the fastpath there as well.
>
> Ok. At least until all the regressions are addressed and all these patches are
> well tested, we shouldn't do away with slab in mainline anytime soon.
Ok. We will hold off. It has been so quiet about this issue, though, that from
the talk with Corey I may have wrongly concluded that this was because the
issues were resolved.
> Other than us, who else are you banking on for analysing slub? Do
> you have any numbers that you can share, which show where slub
> is good or bad...
http://lwn.net/Articles/246927/ contains some cycle measurements for the
per-cpu patchset and also for the page allocator pass-through.
If there is a problem with certain sizes for the page allocator pass-through,
then we may want to increase the boundary so that the page allocator is
only called for objects larger than page size.
On Fri, Sep 14, 2007 at 12:51:34PM -0700, Christoph Lameter wrote:
> On Fri, 14 Sep 2007, Siddha, Suresh B wrote:
> > We are trying to get the latest data with 2.6.23-rc4-mm1 with and without
> > slub. Is this good enough?
>
> Good enough. If you are concerned about the page allocator pass through
> then you may want to test the page allocator pass through patchset
> separately. The fastpath of the page allocator is currently not
> competitive if you always free and allocate a single page. If contiguous
> pages are allocated then the pass through is superior.
We are having all sorts of stability issues with -mm kernels, let alone
perf testing :(
For now, we are trying to do slab vs. slub comparisons for the mainline kernels.
Let's see how that goes.
Meanwhile, is there any chance that you can point us at relevant recent
patches/fixes that are in -mm and that could perhaps be applied to the
mainline kernel?
thanks,
suresh
On Tue, 18 Sep 2007, Siddha, Suresh B wrote:
> For now, we are trying to do slab Vs slub comparisons for the mainline kernels.
> Let's see how that goes.
>
> Meanwhile, any chance that you can point us at relevant recent patches/fixes
> that are in -mm and perhaps that can be applied to mainline kernel?
Those can be found in the performance branch of the slab git tree.
See
http://git.kernel.org/?p=linux/kernel/git/christoph/slab.git;a=log;h=performance