Compared with 2.6.24, on my 16-core tigerton, hackbench process mode has about a
40% regression with 2.6.25-rc1, and more than a 20% regression with kernel
2.6.25-rc4, because rc4 includes the patch reverting the scheduler load-balance change.
Command to start it:
#hackbench 100 process 2000
I ran it 3 times and summed the values.
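For context, below is a minimal single-group sketch of roughly what that command exercises (a simplified illustration, not the actual hackbench.c; the constants and structure are assumptions). Each group is 20 senders and 20 receivers exchanging 100-byte messages over AF_UNIX socketpairs, so 100 groups give the 4000 processes, and 20*20*2000 messages per group times 100 groups lines up with the ~80 million slab allocations that show up in the statistics later in this thread.

/* mini_hackbench.c: one hackbench "process" group, illustrative only.
 * Every sender writes LOOPS 100-byte messages to each receiver over
 * AF_UNIX stream socketpairs, so kernel time goes into
 * unix_stream_sendmsg/unix_stream_recvmsg and the sk_buff/slab
 * allocations behind them.
 */
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <sys/socket.h>
#include <sys/wait.h>

#define NSENDERS	20
#define NRECEIVERS	20
#define LOOPS		2000	/* the "2000" on the command line */
#define DATASIZE	100

int main(void)
{
	int sp[NRECEIVERS][2];
	char buf[DATASIZE];
	int i, j, k;

	for (i = 0; i < NRECEIVERS; i++)
		if (socketpair(AF_UNIX, SOCK_STREAM, 0, sp[i]))
			exit(1);

	for (i = 0; i < NRECEIVERS; i++)	/* fork the receivers */
		if (fork() == 0) {
			for (j = 0; j < NSENDERS * LOOPS; j++) {
				int got = 0, n;

				while (got < DATASIZE) {	/* handle short reads */
					n = read(sp[i][1], buf + got, DATASIZE - got);
					if (n <= 0)
						exit(1);
					got += n;
				}
			}
			exit(0);
		}

	memset(buf, 0, DATASIZE);
	for (i = 0; i < NSENDERS; i++)		/* fork the senders */
		if (fork() == 0) {
			for (j = 0; j < LOOPS; j++)
				for (k = 0; k < NRECEIVERS; k++)
					if (write(sp[k][0], buf, DATASIZE) != DATASIZE)
						exit(1);
			exit(0);
		}

	while (wait(NULL) > 0)			/* wait for the whole group to finish */
		;
	return 0;
}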
I tried to investigate it by bisecting.
Kernel up to tag 0f4dafc0563c6c49e17fe14b3f5f356e4c4b8806 has the 20% regression.
Kernel up to tag 6e90aa972dda8ef86155eefcdbdc8d34165b9f39 has no regression.
Any bisect point between the above 2 tags causes a kernel hang. I tried checking out points between
these 2 tags manually many times, and the kernel always panicked.
All patches between the 2 tags are about the kobject restructuring. I guess such restructuring
creates more cache misses on the 16-core tigerton.
Any idea?
-yanmin
On Thu, 13 Mar 2008 15:46:57 +0800 "Zhang, Yanmin" <[email protected]> wrote:
> Comparing with 2.6.24, on my 16-core tigerton, hackbench process mode has about
> 40% regression with 2.6.25-rc1, and more than 20% regression with kernel
> 2.6.25-rc4, because rc4 includes the reverting patch of scheduler load balance.
>
> Command to start it.
> #hackbench 100 process 2000
> I ran it for 3 times and sum the values.
>
> I tried to investiagte it by bisect.
> Kernel up to tag 0f4dafc0563c6c49e17fe14b3f5f356e4c4b8806 has the 20% regression.
> Kernel up to tag 6e90aa972dda8ef86155eefcdbdc8d34165b9f39 hasn't regression.
>
> Any bisect between above 2 tags cause kernel hang. I tried to checkout to a point between
> these 2 tags for many times manually and kernel always paniced.
>
> All patches between the 2 tags are on kobject restructure. I guess such restructure
> creates more cache miss on the 16-core tigerton.
>
That's pretty surprising - hackbench spends most of its time in userspace
and zeroing out anonymous pages. It shouldn't be fiddling with kobjects
much at all.
Some kernel profiling might be needed here..
On Thu, 2008-03-13 at 01:48 -0700, Andrew Morton wrote:
> On Thu, 13 Mar 2008 15:46:57 +0800 "Zhang, Yanmin" <[email protected]> wrote:
>
> > Comparing with 2.6.24, on my 16-core tigerton, hackbench process mode has about
> > 40% regression with 2.6.25-rc1, and more than 20% regression with kernel
> > 2.6.25-rc4, because rc4 includes the reverting patch of scheduler load balance.
> >
> > Command to start it.
> > #hackbench 100 process 2000
> > I ran it for 3 times and sum the values.
> >
> > I tried to investiagte it by bisect.
> > Kernel up to tag 0f4dafc0563c6c49e17fe14b3f5f356e4c4b8806 has the 20% regression.
> > Kernel up to tag 6e90aa972dda8ef86155eefcdbdc8d34165b9f39 hasn't regression.
> >
> > Any bisect between above 2 tags cause kernel hang. I tried to checkout to a point between
> > these 2 tags for many times manually and kernel always paniced.
> >
> > All patches between the 2 tags are on kobject restructure. I guess such restructure
> > creates more cache miss on the 16-core tigerton.
> >
>
> That's pretty surprising - hackbench spends most of its time in userspace
> and zeroing out anonymous pages.
No. vmstat showed hackbench spends almost 100% in sys.
> It shouldn't be fiddling with kobjects
> much at all.
>
> Some kernel profiling might be needed here..
Thanks for your kind reminder. I don't know why I forgot it.
2.6.24 oprofile data:
CPU: Core 2, speed 1602 MHz (estimated)
Counted CPU_CLK_UNHALTED events (Clock cycles when not halted) with a unit mask of 0x00 (Unhalted core cycles) count 100000
samples % image name app name symbol name
40200494 43.3899 linux-2.6.24 linux-2.6.24 __slab_alloc
35338431 38.1421 linux-2.6.24 linux-2.6.24 add_partial_tail
2993156 3.2306 linux-2.6.24 linux-2.6.24 __slab_free
1365806 1.4742 linux-2.6.24 linux-2.6.24 sock_alloc_send_skb
1253820 1.3533 linux-2.6.24 linux-2.6.24 copy_user_generic_string
1141442 1.2320 linux-2.6.24 linux-2.6.24 unix_stream_recvmsg
846836 0.9140 linux-2.6.24 linux-2.6.24 unix_stream_sendmsg
777561 0.8393 linux-2.6.24 linux-2.6.24 kmem_cache_alloc
587127 0.6337 linux-2.6.24 linux-2.6.24 sock_def_readable
2.6.25-rc4 oprofile data:
CPU: Core 2, speed 1602 MHz (estimated)
Counted CPU_CLK_UNHALTED events (Clock cycles when not halted) with a unit mask of 0x00 (Unhalted core cycles) count 100000
samples % image name app name symbol name
46746994 43.3801 linux-2.6.25-rc4 linux-2.6.25-rc4 __slab_alloc
45986635 42.6745 linux-2.6.25-rc4 linux-2.6.25-rc4 add_partial
2577578 2.3919 linux-2.6.25-rc4 linux-2.6.25-rc4 __slab_free
1301644 1.2079 linux-2.6.25-rc4 linux-2.6.25-rc4 sock_alloc_send_skb
1185888 1.1005 linux-2.6.25-rc4 linux-2.6.25-rc4 copy_user_generic_string
969847 0.9000 linux-2.6.25-rc4 linux-2.6.25-rc4 unix_stream_recvmsg
806665 0.7486 linux-2.6.25-rc4 linux-2.6.25-rc4 kmem_cache_alloc
731059 0.6784 linux-2.6.25-rc4 linux-2.6.25-rc4 unix_stream_sendmsg
-yanmin
On Thu, 13 Mar 2008 17:28:58 +0800 "Zhang, Yanmin" <[email protected]> wrote:
> On Thu, 2008-03-13 at 01:48 -0700, Andrew Morton wrote:
> > On Thu, 13 Mar 2008 15:46:57 +0800 "Zhang, Yanmin" <[email protected]> wrote:
> >
> > > Comparing with 2.6.24, on my 16-core tigerton, hackbench process mode has about
> > > 40% regression with 2.6.25-rc1, and more than 20% regression with kernel
> > > 2.6.25-rc4, because rc4 includes the reverting patch of scheduler load balance.
> > >
> > > Command to start it.
> > > #hackbench 100 process 2000
> > > I ran it for 3 times and sum the values.
> > >
> > > I tried to investiagte it by bisect.
> > > Kernel up to tag 0f4dafc0563c6c49e17fe14b3f5f356e4c4b8806 has the 20% regression.
> > > Kernel up to tag 6e90aa972dda8ef86155eefcdbdc8d34165b9f39 hasn't regression.
> > >
> > > Any bisect between above 2 tags cause kernel hang. I tried to checkout to a point between
> > > these 2 tags for many times manually and kernel always paniced.
> > >
> > > All patches between the 2 tags are on kobject restructure. I guess such restructure
> > > creates more cache miss on the 16-core tigerton.
> > >
> >
> > That's pretty surprising - hackbench spends most of its time in userspace
> > and zeroing out anonymous pages.
> No. vmstat showed hackbench spends almost 100% in sys.
ah, I got confused about which test that is.
> > It shouldn't be fiddling with kobjects
> > much at all.
> >
> > Some kernel profiling might be needed here..
> Thanks for your kind reminder. I don't know why I forgot it.
>
> 2.6.24 oprofile data:
> CPU: Core 2, speed 1602 MHz (estimated)
> Counted CPU_CLK_UNHALTED events (Clock cycles when not halted) with a unit mask of 0x00 (Unhalted core cycles) count 100000
> samples % image name app name symbol name
> 40200494 43.3899 linux-2.6.24 linux-2.6.24 __slab_alloc
> 35338431 38.1421 linux-2.6.24 linux-2.6.24 add_partial_tail
> 2993156 3.2306 linux-2.6.24 linux-2.6.24 __slab_free
> 1365806 1.4742 linux-2.6.24 linux-2.6.24 sock_alloc_send_skb
> 1253820 1.3533 linux-2.6.24 linux-2.6.24 copy_user_generic_string
> 1141442 1.2320 linux-2.6.24 linux-2.6.24 unix_stream_recvmsg
> 846836 0.9140 linux-2.6.24 linux-2.6.24 unix_stream_sendmsg
> 777561 0.8393 linux-2.6.24 linux-2.6.24 kmem_cache_alloc
> 587127 0.6337 linux-2.6.24 linux-2.6.24 sock_def_readable
>
>
>
>
> 2.6.25-rc4 oprofile data:
> CPU: Core 2, speed 1602 MHz (estimated)
> Counted CPU_CLK_UNHALTED events (Clock cycles when not halted) with a unit mask of 0x00 (Unhalted core cycles) count 100000
> samples % image name app name symbol name
> 46746994 43.3801 linux-2.6.25-rc4 linux-2.6.25-rc4 __slab_alloc
> 45986635 42.6745 linux-2.6.25-rc4 linux-2.6.25-rc4 add_partial
> 2577578 2.3919 linux-2.6.25-rc4 linux-2.6.25-rc4 __slab_free
> 1301644 1.2079 linux-2.6.25-rc4 linux-2.6.25-rc4 sock_alloc_send_skb
> 1185888 1.1005 linux-2.6.25-rc4 linux-2.6.25-rc4 copy_user_generic_string
> 969847 0.9000 linux-2.6.25-rc4 linux-2.6.25-rc4 unix_stream_recvmsg
> 806665 0.7486 linux-2.6.25-rc4 linux-2.6.25-rc4 kmem_cache_alloc
> 731059 0.6784 linux-2.6.25-rc4 linux-2.6.25-rc4 unix_stream_sendmsg
>
So slub got a little slower?
(Is slab any better?)
Still, I don't think there are any kobject operations in these codepaths,
are there? Maybe some related to the network device, but I doubt it -
networking tends to go it alone on those things, mainly for performance
reasons.
On Thu, Mar 13, 2008 at 03:46:57PM +0800, Zhang, Yanmin wrote:
> Comparing with 2.6.24, on my 16-core tigerton, hackbench process mode has about
> 40% regression with 2.6.25-rc1, and more than 20% regression with kernel
> 2.6.25-rc4, because rc4 includes the reverting patch of scheduler load balance.
>
> Command to start it.
> #hackbench 100 process 2000
> I ran it for 3 times and sum the values.
>
> I tried to investiagte it by bisect.
> Kernel up to tag 0f4dafc0563c6c49e17fe14b3f5f356e4c4b8806 has the 20% regression.
> Kernel up to tag 6e90aa972dda8ef86155eefcdbdc8d34165b9f39 hasn't regression.
>
> Any bisect between above 2 tags cause kernel hang. I tried to checkout to a point between
> these 2 tags for many times manually and kernel always paniced.
Where is the kernel panicking? The changeset right after the last one
above, bc87d2fe7a1190f1c257af8a91fc490b1ee35954, is a change to efivars;
are you using that in your .config?
> All patches between the 2 tags are on kobject restructure. I guess such restructure
> creates more cache miss on the 16-core tigerton.
Nothing should be creating kobjects on a normal load like this, so a
regression seems very odd. Unless the /sys/kernel/uids/ stuff is
triggering this?
Do you have a link to where I can get hackbench (google seems to find
lots of reports with it, but not the source itself), so I can test to
see if we are accidentally creating kobjects with this load?
thanks,
greg k-h
On Thu, 13 Mar 2008 08:14:13 -0700 Greg KH wrote:
> On Thu, Mar 13, 2008 at 03:46:57PM +0800, Zhang, Yanmin wrote:
> > Comparing with 2.6.24, on my 16-core tigerton, hackbench process mode has about
> > 40% regression with 2.6.25-rc1, and more than 20% regression with kernel
> > 2.6.25-rc4, because rc4 includes the reverting patch of scheduler load balance.
> >
> > Command to start it.
> > #hackbench 100 process 2000
> > I ran it for 3 times and sum the values.
> >
> > I tried to investiagte it by bisect.
> > Kernel up to tag 0f4dafc0563c6c49e17fe14b3f5f356e4c4b8806 has the 20% regression.
> > Kernel up to tag 6e90aa972dda8ef86155eefcdbdc8d34165b9f39 hasn't regression.
> >
> > Any bisect between above 2 tags cause kernel hang. I tried to checkout to a point between
> > these 2 tags for many times manually and kernel always paniced.
>
> Where is the kernel panicing? The changeset right after the last one
> above: bc87d2fe7a1190f1c257af8a91fc490b1ee35954, is a change to efivars,
> are you using that in your .config?
>
> > All patches between the 2 tags are on kobject restructure. I guess such restructure
> > creates more cache miss on the 16-core tigerton.
>
> Nothing should be creating kobjects on a normal load like this, so a
> regression seems very odd. Unless the /sys/kernel/uids/ stuff is
> triggering this?
>
> Do you have a link to where I can get hackbench (google seems to find
> lots of reports with it, but not the source itself), so I can test to
> see if we are accidentally creating kobjects with this load?
The version that I see referenced most often (unscientifically :)
is somewhere under people.redhat.com/mingo/, like so:
http://people.redhat.com/mingo/cfs-scheduler/tools/hackbench.c
---
~Randy
On Thu, Mar 13, 2008 at 09:19:21AM -0700, Randy Dunlap wrote:
> On Thu, 13 Mar 2008 08:14:13 -0700 Greg KH wrote:
>
> > On Thu, Mar 13, 2008 at 03:46:57PM +0800, Zhang, Yanmin wrote:
> > > Comparing with 2.6.24, on my 16-core tigerton, hackbench process mode has about
> > > 40% regression with 2.6.25-rc1, and more than 20% regression with kernel
> > > 2.6.25-rc4, because rc4 includes the reverting patch of scheduler load balance.
> > >
> > > Command to start it.
> > > #hackbench 100 process 2000
> > > I ran it for 3 times and sum the values.
> > >
> > > I tried to investiagte it by bisect.
> > > Kernel up to tag 0f4dafc0563c6c49e17fe14b3f5f356e4c4b8806 has the 20% regression.
> > > Kernel up to tag 6e90aa972dda8ef86155eefcdbdc8d34165b9f39 hasn't regression.
> > >
> > > Any bisect between above 2 tags cause kernel hang. I tried to checkout to a point between
> > > these 2 tags for many times manually and kernel always paniced.
> >
> > Where is the kernel panicing? The changeset right after the last one
> > above: bc87d2fe7a1190f1c257af8a91fc490b1ee35954, is a change to efivars,
> > are you using that in your .config?
> >
> > > All patches between the 2 tags are on kobject restructure. I guess such restructure
> > > creates more cache miss on the 16-core tigerton.
> >
> > Nothing should be creating kobjects on a normal load like this, so a
> > regression seems very odd. Unless the /sys/kernel/uids/ stuff is
> > triggering this?
> >
> > Do you have a link to where I can get hackbench (google seems to find
> > lots of reports with it, but not the source itself), so I can test to
> > see if we are accidentally creating kobjects with this load?
>
> The version that I see referenced most often (unscientifically :)
> is somewhere under people.redhat.com/mingo/, like so:
> http://people.redhat.com/mingo/cfs-scheduler/tools/hackbench.c
Great, thanks for the link.
In using that version, I do not see any kobjects being created at all
when running the program. So I don't see how a kobject change could
have caused any slowdown.
Yanmin, is the above link the version you are using?
Hm, running with "hackbench 100 process 2000" seems to lock up my
laptop, maybe I shouldn't run 4000 tasks at once on such a memory
starved machine...
thanks,
greg k-h
Could you recompile the kernel with slub performance statistics and post
the output of
slabinfo -AD
?
On Thu, 2008-03-13 at 10:12 -0700, Greg KH wrote:
> On Thu, Mar 13, 2008 at 09:19:21AM -0700, Randy Dunlap wrote:
> > On Thu, 13 Mar 2008 08:14:13 -0700 Greg KH wrote:
> >
> > > On Thu, Mar 13, 2008 at 03:46:57PM +0800, Zhang, Yanmin wrote:
> > > > Comparing with 2.6.24, on my 16-core tigerton, hackbench process mode has about
> > > > 40% regression with 2.6.25-rc1, and more than 20% regression with kernel
> > > > 2.6.25-rc4, because rc4 includes the reverting patch of scheduler load balance.
> > > >
> > > > Command to start it.
> > > > #hackbench 100 process 2000
> > > > I ran it for 3 times and sum the values.
> > > >
> > > > I tried to investiagte it by bisect.
> > > > Kernel up to tag 0f4dafc0563c6c49e17fe14b3f5f356e4c4b8806 has the 20% regression.
> > > > Kernel up to tag 6e90aa972dda8ef86155eefcdbdc8d34165b9f39 hasn't regression.
> > > >
> > > > Any bisect between above 2 tags cause kernel hang. I tried to checkout to a point between
> > > > these 2 tags for many times manually and kernel always paniced.
> > >
> > > Where is the kernel panicing? The changeset right after the last one
> > > above: bc87d2fe7a1190f1c257af8a91fc490b1ee35954, is a change to efivars,
> > > are you using that in your .config?
> > >
> > > > All patches between the 2 tags are on kobject restructure. I guess such restructure
> > > > creates more cache miss on the 16-core tigerton.
> > >
> > > Nothing should be creating kobjects on a normal load like this, so a
> > > regression seems very odd. Unless the /sys/kernel/uids/ stuff is
> > > triggering this?
> > >
> > > Do you have a link to where I can get hackbench (google seems to find
> > > lots of reports with it, but not the source itself), so I can test to
> > > see if we are accidentally creating kobjects with this load?
> >
> > The version that I see referenced most often (unscientifically :)
> > is somewhere under people.redhat.com/mingo/, like so:
> > http://people.redhat.com/mingo/cfs-scheduler/tools/hackbench.c
>
> Great, thanks for the link.
>
> In using that version, I do not see any kobjects being created at all
> when running the program. So I don't see how a kobject change could
> have caused any slowdown.
>
> Yanmin, is the above link the version you are using?
Yes.
>
> Hm, running with "hackbench 100 process 2000" seems to lock up my
> laptop, maybe I shouldn't run 4000 tasks at once on such a memory
> starved machine...
The issue doesn't exist on my 8-core stoakley or on tulsa, so I don't think
you could reproduce it on a laptop.
From the oprofile data, perhaps we need to dig into SLUB first.
-yanmin
On Thu, 2008-03-13 at 17:16 -0700, Christoph Lameter wrote:
> Could you recompile the kernel with slub performance statistics and post
> the output of
>
> slabinfo -AD
Before testing with kernel 2.6.25-rc5:
Name Objects Alloc Free %Fast
vm_area_struct 2795 135185 132587 93 29
:0004096 25 119045 119043 99 98
:0000064 12257 119671 107742 98 50
:0000192 3312 78563 75370 92 21
:0000128 4648 48143 43738 97 53
dentry 15217 46675 31527 95 72
:0000080 12784 33674 21206 99 97
:0000016 4367 25871 23705 99 78
:0000096 3001 22591 20084 99 92
buffer_head 5536 18147 12884 97 42
anon_vma 1729 14948 14130 99 73
After testing:
Name Objects Alloc Free %Fast
:0000192 3428 80093958 80090708 92 8
:0000512 374 80016030 80015715 68 7
vm_area_struct 2875 224524 221868 94 20
:0000064 12408 134273 122227 98 47
:0004096 24 127397 127395 99 98
:0000128 4596 57837 53432 97 48
dentry 15659 51402 35824 95 64
:0000016 4584 29327 27161 99 76
:0000080 12784 33674 21206 99 97
:0000096 2998 26264 23757 99 93
So blocks 192 and 512 are very active, and their fast free percentages are low.
-yanmin
On Fri, 2008-03-14 at 11:04 +0800, Zhang, Yanmin wrote:
> On Thu, 2008-03-13 at 17:16 -0700, Christoph Lameter wrote:
> > Could you recompile the kernel with slub performance statistics and post
> > the output of
> >
> > slabinfo -AD
> Before testing with kernel 2.6.25-rc5:
> Name Objects Alloc Free %Fast
> vm_area_struct 2795 135185 132587 93 29
> :0004096 25 119045 119043 99 98
> :0000064 12257 119671 107742 98 50
> :0000192 3312 78563 75370 92 21
> :0000128 4648 48143 43738 97 53
> dentry 15217 46675 31527 95 72
> :0000080 12784 33674 21206 99 97
> :0000016 4367 25871 23705 99 78
> :0000096 3001 22591 20084 99 92
> buffer_head 5536 18147 12884 97 42
> anon_vma 1729 14948 14130 99 73
>
>
> After testing:
> Name Objects Alloc Free %Fast
> :0000192 3428 80093958 80090708 92 8
> :0000512 374 80016030 80015715 68 7
> vm_area_struct 2875 224524 221868 94 20
> :0000064 12408 134273 122227 98 47
> :0004096 24 127397 127395 99 98
> :0000128 4596 57837 53432 97 48
> dentry 15659 51402 35824 95 64
> :0000016 4584 29327 27161 99 76
> :0000080 12784 33674 21206 99 97
> :0000096 2998 26264 23757 99 93
>
>
> So block 192 and 512's and very active and their fast free percentage is low.
On my 8-core stoakley, there is no such regression. Below is the data after testing.
[root@lkp-st02-x8664 ~]# slabinfo -AD
Name Objects Alloc Free %Fast
:0000192 3170 80055388 80052280 92 1
:0000512 316 80012750 80012466 69 1
vm_area_struct 2642 194700 192193 94 16
:0000064 3846 74468 70820 97 53
:0004096 15 69014 69012 98 97
:0000128 1447 32920 31541 91 8
dentry 13485 33060 19652 92 42
:0000080 10639 23377 12953 98 98
:0000096 1662 16496 15036 99 94
:0000832 232 14422 14203 85 10
:0000016 2733 15102 13372 99 14
So block 192 and 512's fast free percentages are even smaller than the ones on tigerton.
Oprofile data on stoakley:
CPU: Core 2, speed 2660 MHz (estimated)
Counted CPU_CLK_UNHALTED events (Clock cycles when not halted) with a unit mask of 0x00 (Unhalted core cycles) count 100000
samples % app name symbol name
2897265 25.7603 linux-2.6.25-rc5 __slab_alloc
2689900 23.9166 linux-2.6.25-rc5 add_partial
629355 5.5957 linux-2.6.25-rc5 copy_user_generic_string
552309 4.9107 linux-2.6.25-rc5 __slab_free
514792 4.5771 linux-2.6.25-rc5 sock_alloc_send_skb
500879 4.4534 linux-2.6.25-rc5 unix_stream_recvmsg
274798 2.4433 linux-2.6.25-rc5 __kmalloc_track_caller
230283 2.0475 linux-2.6.25-rc5 kfree
222286 1.9764 linux-2.6.25-rc5 unix_stream_sendmsg
217413 1.9331 linux-2.6.25-rc5 memset_c
211589 1.8813 linux-2.6.25-rc5 kmem_cache_alloc
151500 1.3470 linux-2.6.25-rc5 system_call
132262 1.1760 linux-2.6.25-rc5 sock_def_readable
123130 1.0948 linux-2.6.25-rc5 kmem_cache_free
109518 0.9738 linux-2.6.25-rc5 sock_wfree
yanmin
On Fri, Mar 14, 2008 at 08:50:19AM +0800, Zhang, Yanmin wrote:
> On Thu, 2008-03-13 at 10:12 -0700, Greg KH wrote:
> > On Thu, Mar 13, 2008 at 09:19:21AM -0700, Randy Dunlap wrote:
> > > On Thu, 13 Mar 2008 08:14:13 -0700 Greg KH wrote:
> > >
> > > > On Thu, Mar 13, 2008 at 03:46:57PM +0800, Zhang, Yanmin wrote:
> > > > > Comparing with 2.6.24, on my 16-core tigerton, hackbench process mode has about
> > > > > 40% regression with 2.6.25-rc1, and more than 20% regression with kernel
> > > > > 2.6.25-rc4, because rc4 includes the reverting patch of scheduler load balance.
> > > > >
> > > > > Command to start it.
> > > > > #hackbench 100 process 2000
> > > > > I ran it for 3 times and sum the values.
> > > > >
> > > > > I tried to investiagte it by bisect.
> > > > > Kernel up to tag 0f4dafc0563c6c49e17fe14b3f5f356e4c4b8806 has the 20% regression.
> > > > > Kernel up to tag 6e90aa972dda8ef86155eefcdbdc8d34165b9f39 hasn't regression.
> > > > >
> > > > > Any bisect between above 2 tags cause kernel hang. I tried to checkout to a point between
> > > > > these 2 tags for many times manually and kernel always paniced.
> > > >
> > > > Where is the kernel panicing? The changeset right after the last one
> > > > above: bc87d2fe7a1190f1c257af8a91fc490b1ee35954, is a change to efivars,
> > > > are you using that in your .config?
> > > >
> > > > > All patches between the 2 tags are on kobject restructure. I guess such restructure
> > > > > creates more cache miss on the 16-core tigerton.
> > > >
> > > > Nothing should be creating kobjects on a normal load like this, so a
> > > > regression seems very odd. Unless the /sys/kernel/uids/ stuff is
> > > > triggering this?
> > > >
> > > > Do you have a link to where I can get hackbench (google seems to find
> > > > lots of reports with it, but not the source itself), so I can test to
> > > > see if we are accidentally creating kobjects with this load?
> > >
> > > The version that I see referenced most often (unscientifically :)
> > > is somewhere under people.redhat.com/mingo/, like so:
> > > http://people.redhat.com/mingo/cfs-scheduler/tools/hackbench.c
> >
> > Great, thanks for the link.
> >
> > In using that version, I do not see any kobjects being created at all
> > when running the program. So I don't see how a kobject change could
> > have caused any slowdown.
> >
> > Yanmin, is the above link the version you are using?
> Yes.
>
> >
> > Hm, running with "hackbench 100 process 2000" seems to lock up my
> > laptop, maybe I shouldn't run 4000 tasks at once on such a memory
> > starved machine...
> The issue doesn't exist on my 8-core stoakley and on tulsa. So I don't think
> you could reproduce it on laptop.
But I should see some kobjects being created and destroyed if, as you are
thinking, that is the problem here, right?
And I don't see any, so I'm thinking that this is probably something
else.
I'm still interested in why your machine was oopsing when bisecting
through the kobject commits. I thought it all should have worked
without problems, as I spent enough time trying to ensure it was so...
thanks,
greg k-h
On Fri, 2008-03-14 at 11:30 +0800, Zhang, Yanmin wrote:
> On Fri, 2008-03-14 at 11:04 +0800, Zhang, Yanmin wrote:
> > On Thu, 2008-03-13 at 17:16 -0700, Christoph Lameter wrote:
> > > Could you recompile the kernel with slub performance statistics and post
> > > the output of
> > >
> > > slabinfo -AD
> > Before testing with kernel 2.6.25-rc5:
> > Name Objects Alloc Free %Fast
> > vm_area_struct 2795 135185 132587 93 29
> > :0004096 25 119045 119043 99 98
> > :0000064 12257 119671 107742 98 50
> > :0000192 3312 78563 75370 92 21
> > :0000128 4648 48143 43738 97 53
> > dentry 15217 46675 31527 95 72
> > :0000080 12784 33674 21206 99 97
> > :0000016 4367 25871 23705 99 78
> > :0000096 3001 22591 20084 99 92
> > buffer_head 5536 18147 12884 97 42
> > anon_vma 1729 14948 14130 99 73
> >
> >
> > After testing:
> > Name Objects Alloc Free %Fast
> > :0000192 3428 80093958 80090708 92 8
> > :0000512 374 80016030 80015715 68 7
> > vm_area_struct 2875 224524 221868 94 20
> > :0000064 12408 134273 122227 98 47
> > :0004096 24 127397 127395 99 98
> > :0000128 4596 57837 53432 97 48
> > dentry 15659 51402 35824 95 64
> > :0000016 4584 29327 27161 99 76
> > :0000080 12784 33674 21206 99 97
> > :0000096 2998 26264 23757 99 93
> >
> >
> > So block 192 and 512's and very active and their fast free percentage is low.
> On my 8-core stoakley, there is no such regression. Below data is after testing.
>
> [root@lkp-st02-x8664 ~]# slabinfo -AD
> Name Objects Alloc Free %Fast
> :0000192 3170 80055388 80052280 92 1
> :0000512 316 80012750 80012466 69 1
> vm_area_struct 2642 194700 192193 94 16
> :0000064 3846 74468 70820 97 53
> :0004096 15 69014 69012 98 97
> :0000128 1447 32920 31541 91 8
> dentry 13485 33060 19652 92 42
> :0000080 10639 23377 12953 98 98
> :0000096 1662 16496 15036 99 94
> :0000832 232 14422 14203 85 10
> :0000016 2733 15102 13372 99 14
>
> So the block 192 and 512's fast free percentage is even smaller than the ones on tigerton.
>
> Oprofile data on stoakley:
>
> CPU: Core 2, speed 2660 MHz (estimated)
> Counted CPU_CLK_UNHALTED events (Clock cycles when not halted) with a unit mask of 0x00 (Unhalted core cycles) count 100000
> samples % app name symbol name
> 2897265 25.7603 linux-2.6.25-rc5 __slab_alloc
> 2689900 23.9166 linux-2.6.25-rc5 add_partial
> 629355 5.5957 linux-2.6.25-rc5 copy_user_generic_string
> 552309 4.9107 linux-2.6.25-rc5 __slab_free
> 514792 4.5771 linux-2.6.25-rc5 sock_alloc_send_skb
> 500879 4.4534 linux-2.6.25-rc5 unix_stream_recvmsg
> 274798 2.4433 linux-2.6.25-rc5 __kmalloc_track_caller
> 230283 2.0475 linux-2.6.25-rc5 kfree
> 222286 1.9764 linux-2.6.25-rc5 unix_stream_sendmsg
> 217413 1.9331 linux-2.6.25-rc5 memset_c
> 211589 1.8813 linux-2.6.25-rc5 kmem_cache_alloc
> 151500 1.3470 linux-2.6.25-rc5 system_call
> 132262 1.1760 linux-2.6.25-rc5 sock_def_readable
> 123130 1.0948 linux-2.6.25-rc5 kmem_cache_free
> 109518 0.9738 linux-2.6.25-rc5 sock_wfree
On tigerton, if I add "slub_max_order=3 slub_min_objects=16" to the kernel boot command line,
the result is improved significantly and the test takes just 1/10 of the original time.
Below is the new output of slabinfo -AD.
Name Objects Alloc Free %Fast
:0000192 3192 80087199 80084141 92 8
kmalloc-512 773 80016203 80015888 97 9
vm_area_struct 2787 223100 220525 94 17
:0004096 68 118322 118320 99 98
:0000064 12215 123575 111669 98 42
:0000128 4616 53826 49422 97 45
dentry 12373 49568 37286 95 65
:0000080 12823 33755 21206 99 97
So kmalloc-512 is the key.
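The arithmetic behind that (a quick illustrative check, not slabinfo output): at the default order-0 slab size a kmalloc-512 slab is one 4KB page holding only 8 objects, while slub_max_order=3 allows 32KB slabs holding 64, so far fewer trips through the partial-list slow path are needed.

#include <stdio.h>

int main(void)
{
	unsigned int size = 512;	/* kmalloc-512 object size */
	unsigned int order;

	/* objects per slab = (PAGE_SIZE << order) / object size */
	for (order = 0; order <= 3; order++)
		printf("order %u: %2u objects per slab\n",
		       order, (4096u << order) / size);
	return 0;
}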
Then, I tested it on stoakley with the same kernel command line. The improvement is about 50%.
One important thing is that without the boot parameter, hackbench on stoakley takes only 1/4 of the time
it takes on tigerton. With the boot parameter, hackbench on tigerton is faster than
on stoakley.
Is it possible to initialize slub_min_objects based on the possible cpu number? I mean,
cpu_possible_map(). We could calculate slub_min_objects with a formula.
-yanmin
On Thu, 2008-03-13 at 22:01 -0700, Greg KH wrote:
> On Fri, Mar 14, 2008 at 08:50:19AM +0800, Zhang, Yanmin wrote:
> > On Thu, 2008-03-13 at 10:12 -0700, Greg KH wrote:
> > > On Thu, Mar 13, 2008 at 09:19:21AM -0700, Randy Dunlap wrote:
> > > > On Thu, 13 Mar 2008 08:14:13 -0700 Greg KH wrote:
> > > >
> > > > > On Thu, Mar 13, 2008 at 03:46:57PM +0800, Zhang, Yanmin wrote:
> > > > > > Comparing with 2.6.24, on my 16-core tigerton, hackbench process mode has about
> > > > > > 40% regression with 2.6.25-rc1, and more than 20% regression with kernel
> > > > > > 2.6.25-rc4, because rc4 includes the reverting patch of scheduler load balance.
> > > > > >
> > > > > > Command to start it.
> > > > > > #hackbench 100 process 2000
> > > > > > I ran it for 3 times and sum the values.
> > > > > >
> > > > > > I tried to investiagte it by bisect.
> > > > > > Kernel up to tag 0f4dafc0563c6c49e17fe14b3f5f356e4c4b8806 has the 20% regression.
> > > > > > Kernel up to tag 6e90aa972dda8ef86155eefcdbdc8d34165b9f39 hasn't regression.
> > > > > >
> > > > > > Any bisect between above 2 tags cause kernel hang. I tried to checkout to a point between
> > > > > > these 2 tags for many times manually and kernel always paniced.
> > > > >
> > > > > Where is the kernel panicing? The changeset right after the last one
> > > > > above: bc87d2fe7a1190f1c257af8a91fc490b1ee35954, is a change to efivars,
> > > > > are you using that in your .config?
> > > > >
> > > > > > All patches between the 2 tags are on kobject restructure. I guess such restructure
> > > > > > creates more cache miss on the 16-core tigerton.
> > > > >
> > > > > Nothing should be creating kobjects on a normal load like this, so a
> > > > > regression seems very odd. Unless the /sys/kernel/uids/ stuff is
> > > > > triggering this?
> > > > >
> > > > > Do you have a link to where I can get hackbench (google seems to find
> > > > > lots of reports with it, but not the source itself), so I can test to
> > > > > see if we are accidentally creating kobjects with this load?
> > > >
> > > > The version that I see referenced most often (unscientifically :)
> > > > is somewhere under people.redhat.com/mingo/, like so:
> > > > http://people.redhat.com/mingo/cfs-scheduler/tools/hackbench.c
> > >
> > > Great, thanks for the link.
> > >
> > > In using that version, I do not see any kobjects being created at all
> > > when running the program. So I don't see how a kobject change could
> > > have caused any slowdown.
> > >
> > > Yanmin, is the above link the version you are using?
> > Yes.
> >
> > >
> > > Hm, running with "hackbench 100 process 2000" seems to lock up my
> > > laptop, maybe I shouldn't run 4000 tasks at once on such a memory
> > > starved machine...
> > The issue doesn't exist on my 8-core stoakley and on tulsa. So I don't think
> > you could reproduce it on laptop.
>
> But I should see any kobjects being created and destroyed as you are
> thinking that is the problem here, right?
Not just thinking; that's based on lots of testing. But as you know, performance
work is often complicated. Now I think maybe the kernel image change affects cache line alignment.
>
> And I don't see any, so I'm thinking that this is probably something
> else.
Yes.
>
> I'm still interested in why your machine was oopsing when bisecting
> through the kobject commits. I thought it all should have worked
> without problems, as I spend enough time trying to ensure it was so...
The kernel panics after printing a warning in kref_get when executing add_disk
in rd_init.
Thanks,
Yanmin
On Fri, 14 Mar 2008, Zhang, Yanmin wrote:
> After testing:
> Name Objects Alloc Free %Fast
> :0000192 3428 80093958 80090708 92 8
> :0000512 374 80016030 80015715 68 7
Ahhh... Okay those slabs did not change for 2.6.25-rc. Is there
really a difference to 2.6.24?
> So block 192 and 512's and very active and their fast free percentage is low.
Yes but that is to be expected given that hackbench does allocate objects
and then passes them to other processors for freeing.
Could you get me more details on the two critical slabs?
Do slabinfo -a and then pick one alias for each of those sizes.
Then do
slabinfo skbuff_head (whatever alias you want to use to refer to the slab)
for each of them. Should give some more insight as to how slub behaves
with these two slab caches.
On Fri, 14 Mar 2008, Zhang, Yanmin wrote:
> > So block 192 and 512's and very active and their fast free percentage
> > is low.
> On my 8-core stoakley, there is no such regression. Below data is after testing.
Ok get the detailed statistics for this configuration as well. Then we
can see what kind of slub behavior changes between both configurations.
The 16p is really one node? No strange variances in memory latencies?
On Fri, 14 Mar 2008, Zhang, Yanmin wrote:
> On tigerton, if I add "slub_max_order=3 slub_min_objects=16" to kernel
> boot cmdline, the result is improved significantly and it takes just
> 1/10 time of the original testing.
Hmmm... That means the updates to SLUB in mm will fix the regression that
you are seeing, because there we can use larger slab orders with fallback
for all slab caches. But I am still interested in getting to the details of
slub behavior on the 16p.
> So kmalloc-512 is the key.
Yeah in 2.6.26-rc kmalloc-512 has 8 objects per slab. The mm version
increases that with a larger allocation size.
> Then, I tested it on stoakley with the same kernel commandline.
> Improvement is about 50%. One important thing is without the boot
> parameter, hackbench on stoakey takes only 1/4 time of the one on
> tigerton. With the boot parameter, hackbench on tigerton is faster than
> the one on stoakely.
>
> Is it possible to initiate slub_min_objects based on possible cpu
> number? I mean, cpu_possible_map(). We could calculate slub_min_objects
> by a formular.
Hmmm... Interesting. Let's first get the details for 2.6.25-rc. Then we can
start toying around with the slub version in mm to configure slub in such
a way that we get best results on both machines.
On Thu, 2008-03-13 at 23:32 -0700, Christoph Lameter wrote:
> On Fri, 14 Mar 2008, Zhang, Yanmin wrote:
>
> > After testing:
> > Name Objects Alloc Free %Fast
> > :0000192 3428 80093958 80090708 92 8
> > :0000512 374 80016030 80015715 68 7
>
> Ahhh... Okay those slabs did not change for 2.6.25-rc. Is there
> really a difference to 2.6.24?
As oprofile shows the slub functions consuming more than 80% of the cpu time, I would like
to focus on optimizing SLUB before going back to 2.6.24.
>
> > So block 192 and 512's and very active and their fast free percentage is low.
>
> Yes but that is to be expected given that hackbench does allocate objects
> and then passes them to other processors for freeing.
>
> Could you get me more details on the two critical slabs?
Yes, definitely.
>
> Do slabinfo -a and then pick one alias for each of those sizes.
They are skbuff_head_cache and kmalloc-512.
>
> Then do
>
> slabinfo skbuff_head (whatever alias you want to use to refer to the slab)
Slabcache: skbuff_head_cache Aliases: 7 Order : 0 Objects: 2848
Sizes (bytes) Slabs Debug Memory
------------------------------------------------------------------------
Object : 192 Total : 142 Sanity Checks : Off Total: 581632
SlabObj: 192 Full : 126 Redzoning : Off Used : 546816
SlabSiz: 4096 Partial: 0 Poisoning : Off Loss : 34816
Loss : 0 CpuSlab: 16 Tracking : Off Lalig: 0
Align : 8 Objects: 21 Tracing : Off Lpadd: 9088
skbuff_head_cache has no kmem_cache operations
skbuff_head_cache: Kernel object allocation
-----------------------------------------------------------------------
No Data
skbuff_head_cache: Kernel object freeing
------------------------------------------------------------------------
No Data
skbuff_head_cache: No NUMA information available.
Slab Perf Counter Alloc Free %Al %Fr
--------------------------------------------------
Fastpath 74048234 6259131 92 7
Slowpath 6031994 73818377 7 92
Page Alloc 19746 19603 0 0
Add partial 0 4658709 0 5
Remove partial 4639106 19603 5 0
RemoteObj/SlabFrozen 0 3887872 0 4
Total 80080228 80077508
Refill 6031979
Deactivate Full=4658836(100%) Empty=0(0%) ToHead=0(0%) ToTail=0(0%)
Slabcache: kmalloc-512 Aliases: 1 Order : 0 Objects: 365
Sizes (bytes) Slabs Debug Memory
------------------------------------------------------------------------
Object : 512 Total : 61 Sanity Checks : Off Total: 249856
SlabObj: 512 Full : 36 Redzoning : Off Used : 186880
SlabSiz: 4096 Partial: 9 Poisoning : Off Loss : 62976
Loss : 0 CpuSlab: 16 Tracking : Off Lalig: 0
Align : 8 Objects: 8 Tracing : Off Lpadd: 0
kmalloc-512 has no kmem_cache operations
kmalloc-512: Kernel object allocation
-----------------------------------------------------------------------
No Data
kmalloc-512: Kernel object freeing
------------------------------------------------------------------------
No Data
kmalloc-512: No NUMA information available.
Slab Perf Counter Alloc Free %Al %Fr
--------------------------------------------------
Fastpath 55039159 5006829 68 6
Slowpath 24975754 75007769 31 93
Page Alloc 73840 73779 0 0
Add partial 0 24341085 0 30
Remove partial 24267297 73779 30 0
RemoteObj/SlabFrozen 0 953614 0 1
Total 80014913 80014598
Refill 24975738
Deactivate Full=24341121(100%) Empty=0(0%) ToHead=0(0%) ToTail=0(0%)
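(The %Al/%Fr columns are just the fast-path share of the totals; as a quick cross-check of the kmalloc-512 counters above, using my own arithmetic rather than slabinfo output:)

#include <stdio.h>

int main(void)
{
	/* kmalloc-512 counters from the dump above */
	double fast_alloc = 55039159, total_alloc = 80014913;
	double fast_free  =  5006829, total_free  = 80014598;

	printf("fast alloc: %.1f%%\n", 100.0 * fast_alloc / total_alloc);
	printf("fast free : %.1f%%\n", 100.0 * fast_free / total_free);
	return 0;
}

That gives roughly 68.8% fast-path allocations but only 6.3% fast-path frees, matching the 68/6 above and the low fast-free percentage in the earlier slabinfo -AD summary.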
>
> for each of them. Should give some more insight as to how slub behaves
> with these two slab caches.
>
On Thu, 2008-03-13 at 23:34 -0700, Christoph Lameter wrote:
> On Fri, 14 Mar 2008, Zhang, Yanmin wrote:
>
> > > So block 192 and 512's and very active and their fast free percentage
> > > is low.
> > On my 8-core stoakley, there is no such regression. Below data is after testing.
>
> Ok get the detailed statistics for this configuration as well. Then we
> can see what kind of slub behavior changes between both configurations.
I pasted that data in a prior email; copying it below.
On my 8-core stoakley, there is no such regression. Below is the data after testing.
[root@lkp-st02-x8664 ~]# slabinfo -AD
Name Objects Alloc Free %Fast
:0000192 3170 80055388 80052280 92 1
:0000512 316 80012750 80012466 69 1
vm_area_struct 2642 194700 192193 94 16
:0000064 3846 74468 70820 97 53
:0004096 15 69014 69012 98 97
:0000128 1447 32920 31541 91 8
dentry 13485 33060 19652 92 42
:0000080 10639 23377 12953 98 98
:0000096 1662 16496 15036 99 94
:0000832 232 14422 14203 85 10
:0000016 2733 15102 13372 99 14
I ran it many times and got similar output from slabinfo.
>
> The 16p is really one node?
Yes. It's an SMP machine.
> No strange variances in memory latencies?
No.
On Thu, 2008-03-13 at 23:39 -0700, Christoph Lameter wrote:
> On Fri, 14 Mar 2008, Zhang, Yanmin wrote:
>
> > On tigerton, if I add "slub_max_order=3 slub_min_objects=16" to kernel
> > boot cmdline, the result is improved significantly and it takes just
> > 1/10 time of the original testing.
>
> Hmmm... That means the updates to SLUB in mm will fix the regression that
> you are seeing because we there can use large orders of slabs and fallback
> for all slab caches. But I am still interested to get to the details of
> slub behavior on the 16p.
>
> > So kmalloc-512 is the key.
>
> Yeah in 2.6.26-rc kmalloc-512 has 8 objects per slab. The mm version
> increases that with a larger allocation size.
Would you like to give me a pointer to the patch? Is it one patch, or many patches?
>
> > Then, I tested it on stoakley with the same kernel commandline.
> > Improvement is about 50%. One important thing is without the boot
> > parameter, hackbench on stoakey takes only 1/4 time of the one on
> > tigerton. With the boot parameter, hackbench on tigerton is faster than
> > the one on stoakely.
> >
> > Is it possible to initiate slub_min_objects based on possible cpu
> > number? I mean, cpu_possible_map(). We could calculate slub_min_objects
> > by a formular.
>
> Hmmm... Interesting. Lets first get the details for 2.6.25-rc. Then we can
> start toying around with the slub version in mm to configure slub in such
> a way that we get best results on both machines.
Boot parameter "slub_max_order=3 slub_min_objects=16" could boost perforamnce
both on stoakley and on tigerton.
So should we keep slub_min_objects scalable based on possible cpu number? When a
machine has more cpu, it means more processes/threads will run on it and it will
take more time when they compete for the same resources. SLAB is such a typical
resource.
-yanmin
On Fri, 14 Mar 2008, Zhang, Yanmin wrote:
> > Yeah in 2.6.26-rc kmalloc-512 has 8 objects per slab. The mm version
> > increases that with a larger allocation size.
> Would you like to give me a pointer to the patch? Is it one patch, or many patches?
If you do a git pull of the slab-mm branch of my VM tree on kernel.org then
you have all you need. There will be an update in the next few days though,
since some of the data you gave me already suggests a couple of ways that
things may be made better.
> > Hmmm... Interesting. Lets first get the details for 2.6.25-rc. Then we can
> > start toying around with the slub version in mm to configure slub in such
> > a way that we get best results on both machines.
> Boot parameter "slub_max_order=3 slub_min_objects=16" could boost perforamnce
> both on stoakley and on tigerton.
Well the current slab-mm tree already does order 4 and min_objects=60
which is probably overkill. Next git push on slab-mm will reduce that
to the values you found to be sufficient.
> So should we keep slub_min_objects scalable based on possible cpu
> number? When a machine has more cpu, it means more processes/threads
> will run on it and it will take more time when they compete for the same
> resources. SLAB is such a typical resource.
We would have to do some experiments to see how cpu counts affect multiple
benchmarks. If we can establish a consistent benefit from varying these
parameters based on processor count then we should do so. There is already
one example in mm/vmstat.c how this could be done.
On Fri, 14 Mar 2008, Zhang, Yanmin wrote:
> On my 8-core stoakley, there is no such regression. Below data is after
> testing.
I was looking for the details on two slab caches. The comparison of the
detailed statistics is likely very interesting because we will be able to
see how the doubling of processor counts affects the internal behavior of
slub.
On Fri, 14 Mar 2008, Zhang, Yanmin wrote:
> > Ahhh... Okay those slabs did not change for 2.6.25-rc. Is there
> > really a difference to 2.6.24?
> As oprofile shows slub functions spend more than 80% cpu time, I would like
> to focus on optimizing SLUB before going back to 2.6.24.
I thought you wanted to address a regression vs 2.6.24?
> kmalloc-512: No NUMA information available.
>
> Slab Perf Counter Alloc Free %Al %Fr
> --------------------------------------------------
> Fastpath 55039159 5006829 68 6
> Slowpath 24975754 75007769 31 93
> Page Alloc 73840 73779 0 0
> Add partial 0 24341085 0 30
> Remove partial 24267297 73779 30 0
^^^ add partial/remove partial is likely the cause for
trouble here. 30% is unacceptably high. The larger allocs will reduce the
partial handling overhead. That is likely the effect that we see here.
> Refill 24975738
Duh refills at 50%? We could try to just switch to another slab instead of
reusing the existing one. May also affect the add/remove partial
situation.
Here is a patch to just not perform refills but switch slabs instead.
Could you check what effect doing so has on the statistics you see on the 16p?
---
mm/slub.c | 5 +----
1 file changed, 1 insertion(+), 4 deletions(-)
Index: linux-2.6/mm/slub.c
===================================================================
--- linux-2.6.orig/mm/slub.c 2008-03-14 16:49:36.000000000 -0700
+++ linux-2.6/mm/slub.c 2008-03-14 16:50:04.000000000 -0700
@@ -1474,10 +1474,7 @@ static void *__slab_alloc(struct kmem_ca
goto new_slab;
slab_lock(c->page);
- if (unlikely(!node_match(c, node)))
- goto another_slab;
-
- stat(c, ALLOC_REFILL);
+ goto another_slab;
load_freelist:
object = c->page->freelist;
On Fri, 2008-03-14 at 14:08 -0700, Christoph Lameter wrote:
> On Fri, 14 Mar 2008, Zhang, Yanmin wrote:
>
> > > Ahhh... Okay those slabs did not change for 2.6.25-rc. Is there
> > > really a difference to 2.6.24?
> > As oprofile shows slub functions spend more than 80% cpu time, I would like
> > to focus on optimizing SLUB before going back to 2.6.24.
>
> I thought you wanted to address a regression vs 2.6.24?
Initially I wanted to do so, but the oprofile data showed both 2.6.24 and 2.6.25-rc
aren't good with hackbench on tigerton.
The slub_min_objects boot parameter boosts performance substantially, so I think
we need to optimize that before addressing the regression.
>
> > kmalloc-512: No NUMA information available.
> >
> > Slab Perf Counter Alloc Free %Al %Fr
> > --------------------------------------------------
> > Fastpath 55039159 5006829 68 6
> > Slowpath 24975754 75007769 31 93
> > Page Alloc 73840 73779 0 0
> > Add partial 0 24341085 0 30
> > Remove partial 24267297 73779 30 0
>
> ^^^ add partial/remove partial is likely the cause for
> trouble here. 30% is unacceptably high. The larger allocs will reduce the
> partial handling overhead. That is likely the effect that we see here.
>
> > Refill 24975738
>
> Duh refills at 50%? We could try to just switch to another slab instead of
> reusing the existing one. May also affect the add/remove partial
> situation.
>
>
>
On Fri, 2008-03-14 at 17:15 -0700, Christoph Lameter wrote:
> Here is a patch to just not perform refills but switch slabs instead.
> Could check what effect doing so has on the statistics you see on the 16p?
>
> ---
> mm/slub.c | 5 +----
> 1 file changed, 1 insertion(+), 4 deletions(-)
>
> Index: linux-2.6/mm/slub.c
> ===================================================================
> --- linux-2.6.orig/mm/slub.c 2008-03-14 16:49:36.000000000 -0700
> +++ linux-2.6/mm/slub.c 2008-03-14 16:50:04.000000000 -0700
> @@ -1474,10 +1474,7 @@ static void *__slab_alloc(struct kmem_ca
> goto new_slab;
>
> slab_lock(c->page);
> - if (unlikely(!node_match(c, node)))
> - goto another_slab;
> -
> - stat(c, ALLOC_REFILL);
> + goto another_slab;
>
> load_freelist:
> object = c->page->freelist;
It doesn't help much. In 2.6.25-rc5, REFILL means a refill from c->page->freelist
and another_slab; its definition looks confusing. In the case of
hackbench, c->page->freelist is mostly NULL.
With #hackbench 100 process 2000, 100*20*2 (totally 4000) processes are started.
vmstat shows about 300~500 processes in the RUNNING state, so every processor runqueue
has more than 20 runnable processes on the 16p tigerton.
Below is the data with kernel 2.6.25-rc5+your_patch.
[ymzhang@lkp-tt01-x8664 ~]$ slabinfo kmalloc-512
Slabcache: kmalloc-512 Aliases: 1 Order : 0 Objects: 352
Sizes (bytes) Slabs Debug Memory
------------------------------------------------------------------------
Object : 512 Total : 56 Sanity Checks : Off Total: 229376
SlabObj: 512 Full : 36 Redzoning : Off Used : 180224
SlabSiz: 4096 Partial: 4 Poisoning : Off Loss : 49152
Loss : 0 CpuSlab: 16 Tracking : Off Lalig: 0
Align : 8 Objects: 8 Tracing : Off Lpadd: 0
kmalloc-512 has no kmem_cache operations
kmalloc-512: Kernel object allocation
-----------------------------------------------------------------------
No Data
kmalloc-512: Kernel object freeing
------------------------------------------------------------------------
No Data
kmalloc-512: No NUMA information available.
Slab Perf Counter Alloc Free %Al %Fr
--------------------------------------------------
Fastpath 55883575 6130576 69 7
Slowpath 24131134 73883818 30 92
Page Alloc 84844 84788 0 0
Add partial 270625 23860257 0 29
Remove partial 24046290 84752 30 0
RemoteObj/SlabFrozen 270825 439015 0 0
Total 80014709 80014394
Deactivate Full=23860293(98%) Empty=200(0%) ToHead=0(0%) ToTail=270625(1%)
On Fri, 2008-03-14 at 14:06 -0700, Christoph Lameter wrote:
> On Fri, 14 Mar 2008, Zhang, Yanmin wrote:
>
> > On my 8-core stoakley, there is no such regression. Below data is after
> > testing.
>
> I was looking for the details on two slab caches. The comparison of the
> details statistics is likely very interesting because we will be able to
> see how the doubling of processor counts affects the internal behavior of
> slub.
I collected more data on the 16p tigerton to try to find a possible relationship
between slub_min_objects and the processor number. The kernel is 2.6.25-rc5.
Command\slub_min_objects | slub_min_objects=8 | 16 | 32 | 64
-------------------------------------------------------------------------------------
./hackbench 100 process 2000 | 250 seconds | 23 | 18.6 | 17.5
./hackbench 200 process 2000 | 532 seconds | 44 | 35.6 | 33.5
The first command line will start 4000 processes and the second will start 8000 processes.
As the problematic slab is kmalloc-512, slub_min_objects=8 is just the default configuration.
Oprofile data shows the ratio of __slab_alloc+__slab_free+add_partial is no different
between the 2 command lines with the same kernel boot parameters.
slub_min_objects | 8 | 16 | 32 | 64
--------------------------------------------------------------------------------------------
slab(__slab_alloc+__slab_free+add_partial) cpu utilization | 88.00% | 44.00% | 13.00% | 12%
With slub_min_objects=32 we get a reasonable result; beyond 32, the improvement
is very small. 32 is just possible_cpu_number*2 on my tigerton.
It's hard to say whether hackbench simulates real applications closely, but it discloses a possible
performance bottleneck. Last year, we captured the kmalloc-2048 issue with tbench. So the
default slub_min_objects needs to be revised. On the other hand, slabs are allocated by alloc_page
when the size is equal to or more than half a page, so enlarging slub_min_objects won't create
too many slab page buffers.
As for NUMA, perhaps we could set slub_min_objects to 2*max_cpu_number_per_node.
-yanmin
On Mon, 17 Mar 2008, Zhang, Yanmin wrote:
> There is no much help. In 2.6.25-rc5, REFILL means refill from c->page->freelist
> and another_slab. It's looks like its definition is confusing. In the case of
> hackbench, mostly, c->page->freelist is NULL.
REFILL means refilling the per cpu objects from the freelist of the
per cpu slab page. That could be bad because it requires taking the slab
lock on the slab page.
> Slab Perf Counter Alloc Free %Al %Fr
> --------------------------------------------------
> Fastpath 55883575 6130576 69 7
> Slowpath 24131134 73883818 30 92
> Page Alloc 84844 84788 0 0
> Add partial 270625 23860257 0 29
> Remove partial 24046290 84752 30 0
Hmmm... I was hoping that the add/remove partial numbers would come down. Ok,
let's forget about the patch. Increasing min_objects does the trick.
On Mon, 17 Mar 2008, Zhang, Yanmin wrote:
> slub_min_objects | 8 | 16 | 32 | 64
> --------------------------------------------------------------------------------------------
> slab(__slab_alloc+__slab_free+add_partial) cpu utilization | 88.00% | 44.00% | 13.00% | 12%
>
>
> When slub_min_objects=32, we could get a reasonable value. Beyond 32, the improvement
> is very small. 32 is just possible_cpu_number*2 on my tigerton.
Interesting. What is the optimal configuration for your 8p? Could you
figure out the optimal configuration for a 4p and a 2p configuration?
> It's hard to say hackbench simulates real applications closely. But it discloses a possible
> performance bottlebeck. Last year, we once captured the kmalloc-2048 issue by tbench. So the
> default slub_min_objects need to be revised. In the other hand, slab is allocated by alloc_page
> when its size is equal to or more than a half page, so enlarging slub_min_objects won't create
> too many slab page buffers.
>
> As for NUMA, perhaps we could define slub_min_objects to 2*max_cpu_number_per_node.
Well, for a 4k-cpu config this would set min_objects to 8192. So I think
we could implement a form of logarithmic scaling based on cpu
counts, comparable to what is done for the statistics update in vmstat.c.
fls(num_online_cpus()) = 4
So maybe
slub_min_objects= 8 + (2 + fls(num_online_cpus())) * 4
On Mon, 2008-03-17 at 10:32 -0700, Christoph Lameter wrote:
> On Mon, 17 Mar 2008, Zhang, Yanmin wrote:
>
> > slub_min_objects | 8 | 16 | 32 | 64
> > --------------------------------------------------------------------------------------------
> > slab(__slab_alloc+__slab_free+add_partial) cpu utilization | 88.00% | 44.00% | 13.00% | 12%
> >
> >
> > When slub_min_objects=32, we could get a reasonable value. Beyond 32, the improvement
> > is very small. 32 is just possible_cpu_number*2 on my tigerton.
>
> Interesting. What is the optimal configuration for your 8p? Could you
> figure out the optimal configuration for an 4p and a 2p configuration?
I used the 8-core stoakley for testing, and tried booting the kernel with maxcpus=4 and 2.
I just ran ./hackbench 100 process 2000.
processor number\slub_min_objects | slub_min_objects=8 | 16 | 32 | 64
--------------------------------------------------------------------------------------------
8p | 60 seconds | 30 | 28.5 | 26.5
--------------------------------------------------------------------------------------------
4p | 50 seconds | 43 | 42 |
--------------------------------------------------------------------------------------------
2p | 92 seconds | 79 | |
As stoakley is just a multi-core machine without hyper-threading, I also tested on an old
harwich machine which has 4 physical processors and 8 logical processors with hyper-threading.
processor number\slub_min_objects | slub_min_objects=8 | 16 | 32 | 64
--------------------------------------------------------------------------------------------
8p | 78.7 seconds | 77.5 | |
>
> > It's hard to say hackbench simulates real applications closely. But it discloses a possible
> > performance bottlebeck. Last year, we once captured the kmalloc-2048 issue by tbench. So the
> > default slub_min_objects need to be revised. In the other hand, slab is allocated by alloc_page
> > when its size is equal to or more than a half page, so enlarging slub_min_objects won't create
> > too many slab page buffers.
> >
> > As for NUMA, perhaps we could define slub_min_objects to 2*max_cpu_number_per_node.
>
> Well for a 4k cpu configu this would set min_objects to 8192.
> So I think
> we could implement a form of logarithmic scaling based on cpu
> counts comparable to what is done for the statistics update in vmstat.c
>
> fls(num_online_cpus()) = 4
num_online_cpus as the input parameter is ok. A potential issue is how to consider cpu hot-plug.
When num_online_cpus()=16, fls(num_online_cpus())=5.
>
> So maybe
>
> slub_min_objects= 8 + (2 + fls(num_online_cpus())) * 4
So slub_min_objects= 8 + (1 + fls(num_online_cpus())) * 4.
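To see what the two variants actually yield (a quick illustrative calculation in userspace; fls_u below just mimics the kernel's fls()):

#include <stdio.h>

/* 1-indexed "find last set bit", like the kernel's fls() */
static int fls_u(unsigned int x)
{
	int r = 0;

	while (x) {
		r++;
		x >>= 1;
	}
	return r;
}

int main(void)
{
	unsigned int cpus;

	printf("cpus  8+(2+fls)*4  8+(1+fls)*4\n");
	for (cpus = 2; cpus <= 64; cpus <<= 1)
		printf("%4u  %11d  %11d\n", cpus,
		       8 + (2 + fls_u(cpus)) * 4,
		       8 + (1 + fls_u(cpus)) * 4);
	return 0;
}

The first variant reaches 32 at 8 cpus and the second at 16 cpus, which is the value the 16p measurements above pointed to.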
On Tue, 18 Mar 2008, Zhang, Yanmin wrote:
> num_online_cpus as the input parameter is ok. A potential issue is how to consider cpu hot-plug.
Yeah I used nr_cpu_ids instead in the patchset that I cced you on. Maybe
continue discussion on that thread?