2008-07-31 03:23:17

by Yanmin Zhang

Subject: VolanoMark regression with 2.6.27-rc1

Ingo,

VolanoMark regresses with 2.6.27-rc1:
1) 70% on a 16-core tigerton;
2) 18% on a new multi-core+HT machine.

I tried to use git bisect to locate the root cause, but git bisect always went
back to 2.6.26. Then I ran my mechanical bisect script linearly and located the commit below:

commit 82638844d9a8581bbf33201cc209a14876eca167
Merge: 9982fbf... 63cf13b...
Author: Ingo Molnar <[email protected]>
Date: Wed Jul 16 00:29:07 2008 +0200

Merge branch 'linus' into cpus4096

Conflicts:

arch/x86/xen/smp.c
kernel/sched_rt.c
net/iucv/iucv.c

Signed-off-by: Ingo Molnar <[email protected]>



But if I use 'git show 82638844d9a8581bbf33201cc209a14876eca167', it looks like I only
get a part of the patch. If I use the web to access the commit, I get a big patch. As it's
a merge, what are the commit numbers of the subpatches?


BTW, sysbench+mysql(oltp readonly) has about 15% regression, but git bisect looks crazy again.

-yanmin



2008-07-31 07:33:49

by Yanmin Zhang

Subject: Re: VolanoMark regression with 2.6.27-rc1


On Thu, 2008-07-31 at 11:20 +0800, Zhang, Yanmin wrote:
> Ingo,
>
> volanoMark has regression with 2.6.27-rc1.
> 1) 70% on 16-core tigerton;
> 2) 18% on a new multi-core+HT mahcine;
>
> I tried to use git bisect to locate the root cause, but git bisect always went
> back to 2.6.26. Then, I used my mechanical bisect script linearly to locate below commit:
>
> commit 82638844d9a8581bbf33201cc209a14876eca167
> Merge: 9982fbf... 63cf13b...
> Author: Ingo Molnar <[email protected]>
> Date: Wed Jul 16 00:29:07 2008 +0200
>
> Merge branch 'linus' into cpus4096
>
> Conflicts:
>
> arch/x86/xen/smp.c
> kernel/sched_rt.c
> net/iucv/iucv.c
>
> Signed-off-by: Ingo Molnar <[email protected]>
>
>
>
> But if I use 'git show 82638844d9a8581bbf33201cc209a14876eca167', it looks like I could only
> get a part of the patch. If I use web to acces the commit, I could get a big patch. As it's
> a merge, what're commit numbers of the subpatches?
>
>
> BTW, sysbench+mysql(oltp readonly) has about 15% regression, but git bisect looks crazy again.
Oh, these look like the old issues from 2.6.26-rc1; the 2 patches were reverted before 2.6.26.
New patches were merged into 2.6.27-rc1, but the issues are still not fully resolved.
http://www.uwsg.iu.edu/hypermail/linux/kernel/0805.2/1148.html

yanmin

2008-07-31 07:39:31

by Peter Zijlstra

Subject: Re: VolanoMark regression with 2.6.27-rc1

On Thu, 2008-07-31 at 15:31 +0800, Zhang, Yanmin wrote:
> On Thu, 2008-07-31 at 11:20 +0800, Zhang, Yanmin wrote:
> > Ingo,
> >
> > volanoMark has regression with 2.6.27-rc1.
> > 1) 70% on 16-core tigerton;
> > 2) 18% on a new multi-core+HT mahcine;
> >
> > I tried to use git bisect to locate the root cause, but git bisect always went
> > back to 2.6.26. Then, I used my mechanical bisect script linearly to locate below commit:
> >
> > commit 82638844d9a8581bbf33201cc209a14876eca167
> > Merge: 9982fbf... 63cf13b...
> > Author: Ingo Molnar <[email protected]>
> > Date: Wed Jul 16 00:29:07 2008 +0200
> >
> > Merge branch 'linus' into cpus4096
> >
> > Conflicts:
> >
> > arch/x86/xen/smp.c
> > kernel/sched_rt.c
> > net/iucv/iucv.c
> >
> > Signed-off-by: Ingo Molnar <[email protected]>
> >
> >
> >
> > But if I use 'git show 82638844d9a8581bbf33201cc209a14876eca167', it looks like I could only
> > get a part of the patch. If I use web to acces the commit, I could get a big patch. As it's
> > a merge, what're commit numbers of the subpatches?
> >
> >
> > BTW, sysbench+mysql(oltp readonly) has about 15% regression, but git bisect looks crazy again.

> Oh, it looks like they are the old issues in 2.6.26-rc1 and the 2 patches were reverted before 2.6.26.
> New patches are merged into 2.6.27-rc1, but the issues are still not resolved clearly.
> http://www.uwsg.iu.edu/hypermail/linux/kernel/0805.2/1148.html.

The new smp-group stuff doesn't remotely look like what was in .26

Also, on my quad (admittedly smaller than your machines) both volano and
sysbench didn't regress anymore - where they clearly did with the code
reverted from .26.

2008-07-31 07:52:26

by Yanmin Zhang

Subject: Re: VolanoMark regression with 2.6.27-rc1


On Thu, 2008-07-31 at 09:39 +0200, Peter Zijlstra wrote:
> On Thu, 2008-07-31 at 15:31 +0800, Zhang, Yanmin wrote:
> > On Thu, 2008-07-31 at 11:20 +0800, Zhang, Yanmin wrote:
> > > Ingo,
> > >
> > > volanoMark has regression with 2.6.27-rc1.
> > > 1) 70% on 16-core tigerton;
> > > 2) 18% on a new multi-core+HT mahcine;
> > >
> > > I tried to use git bisect to locate the root cause, but git bisect always went
> > > back to 2.6.26. Then, I used my mechanical bisect script linearly to locate below commit:
> > >
> > > commit 82638844d9a8581bbf33201cc209a14876eca167
> > > Merge: 9982fbf... 63cf13b...
> > > Author: Ingo Molnar <[email protected]>
> > > Date: Wed Jul 16 00:29:07 2008 +0200
> > >
> > > Merge branch 'linus' into cpus4096
> > >
> > > Conflicts:
> > >
> > > arch/x86/xen/smp.c
> > > kernel/sched_rt.c
> > > net/iucv/iucv.c
> > >
> > > Signed-off-by: Ingo Molnar <[email protected]>
> > >
> > >
> > >
> > > But if I use 'git show 82638844d9a8581bbf33201cc209a14876eca167', it looks like I could only
> > > get a part of the patch. If I use web to acces the commit, I could get a big patch. As it's
> > > a merge, what're commit numbers of the subpatches?
> > >
> > >
> > > BTW, sysbench+mysql(oltp readonly) has about 15% regression, but git bisect looks crazy again.
>
> > Oh, it looks like they are the old issues in 2.6.26-rc1 and the 2 patches were reverted before 2.6.26.
> > New patches are merged into 2.6.27-rc1, but the issues are still not resolved clearly.
> > http://www.uwsg.iu.edu/hypermail/linux/kernel/0805.2/1148.html.
>
> The new smp-group stuff doesn't remotely look like what was in .26
>
> Also, on my quad (admittedly smaller than your machines) both volano and
> sysbench didn't regress anymore - where they clearly did with the code
> reverted from .26.
The regression I reported exists on:
1) 8-core+HT (16 logical processors in total) tulsa: 40% regression with volano, 8% with oltp;
2) 8-core+HT Montvale Itanium: 9% regression with volano, 8% with oltp;
3) 16-core tigerton: 70% with volano, 18% with oltp;
4) 8-core stoakley: 15% with oltp; testing failed with volanoMark.

So the issues show up across different architectures.

yanmin

2008-08-01 00:42:19

by Yanmin Zhang

Subject: Re: VolanoMark regression with 2.6.27-rc1


On Thu, 2008-07-31 at 15:49 +0800, Zhang, Yanmin wrote:
> On Thu, 2008-07-31 at 09:39 +0200, Peter Zijlstra wrote:
> > On Thu, 2008-07-31 at 15:31 +0800, Zhang, Yanmin wrote:
> > > On Thu, 2008-07-31 at 11:20 +0800, Zhang, Yanmin wrote:
> > > > Ingo,
> > > >
> > > > volanoMark has regression with 2.6.27-rc1.
> > > > 1) 70% on 16-core tigerton;
> > > > 2) 18% on a new multi-core+HT mahcine;
> > > >
> > > > I tried to use git bisect to locate the root cause, but git bisect always went
> > > > back to 2.6.26. Then, I used my mechanical bisect script linearly to locate below commit:
> > > >
> > > > commit 82638844d9a8581bbf33201cc209a14876eca167
> > > > Merge: 9982fbf... 63cf13b...
> > > > Author: Ingo Molnar <[email protected]>
> > > > Date: Wed Jul 16 00:29:07 2008 +0200
> > > >
> > > > Merge branch 'linus' into cpus4096
> > > >
> > > > Conflicts:
> > > >
> > > > arch/x86/xen/smp.c
> > > > kernel/sched_rt.c
> > > > net/iucv/iucv.c
> > > >
> > > > Signed-off-by: Ingo Molnar <[email protected]>
> > > >
> > > >
> > > >
> > > > But if I use 'git show 82638844d9a8581bbf33201cc209a14876eca167', it looks like I could only
> > > > get a part of the patch. If I use web to acces the commit, I could get a big patch. As it's
> > > > a merge, what're commit numbers of the subpatches?
> > > >
> > > >
> > > > BTW, sysbench+mysql(oltp readonly) has about 15% regression, but git bisect looks crazy again.
> >
> > > Oh, it looks like they are the old issues in 2.6.26-rc1 and the 2 patches were reverted before 2.6.26.
> > > New patches are merged into 2.6.27-rc1, but the issues are still not resolved clearly.
> > > http://www.uwsg.iu.edu/hypermail/linux/kernel/0805.2/1148.html.
> >
> > The new smp-group stuff doesn't remotely look like what was in .26
> >
> > Also, on my quad (admittedly smaller than your machines) both volano and
> > sysbench didn't regress anymore - where they clearly did with the code
> > reverted from .26.
> The regression I reported exists on:
> 1) 8-core+HT(totally 16 logical processor) tulsa: 40% regression with volano, 8% with oltp;
> 2) 8-core+HT Montvale Itanium: 9% regression with volano; 8% with oltp;
> 3) 16-core tigerton: %70 with volano, %18 with oltp;
> 4) 8-core stoakley: %15 with oltp, testing failed with volanoMark.
>
> So the issues are popular on different architectures.
I know the kernel needs these features, and it might not be a good idea to reject them over and over again.
I will collect more data on tigerton and try to optimize it.

-yanmin

2008-08-01 02:37:46

by Miao Xie

Subject: Re: VolanoMark regression with 2.6.27-rc1

on 2008-8-1 8:39 Zhang, Yanmin wrote:
[snip]
>> The regression I reported exists on:
>> 1) 8-core+HT(totally 16 logical processor) tulsa: 40% regression with volano, 8% with oltp;
>> 2) 8-core+HT Montvale Itanium: 9% regression with volano; 8% with oltp;
>> 3) 16-core tigerton: %70 with volano, %18 with oltp;
>> 4) 8-core stoakley: %15 with oltp, testing failed with volanoMark.
>>
>> So the issues are popular on different architectures.
> I know kernel needs the features and it might not be a good idea to reject them over and over again.
> I will collect more data on tigerton and try to optimize it.

Could you tell me the exact config options? Are they the same as last time, as follows?

# CONFIG_CGROUPS is not set
CONFIG_GROUP_SCHED=y
CONFIG_FAIR_GROUP_SCHED=y
# CONFIG_RT_GROUP_SCHED is not set
CONFIG_USER_SCHED=y
# CONFIG_CGROUP_SCHED is not set


> -yanmin

2008-08-01 03:11:34

by Yanmin Zhang

Subject: Re: VolanoMark regression with 2.6.27-rc1


On Fri, 2008-08-01 at 10:35 +0800, Miao Xie wrote:
> on 2008-8-1 8:39 Zhang, Yanmin wrote:
> [snip]
> >> The regression I reported exists on:
> >> 1) 8-core+HT(totally 16 logical processor) tulsa: 40% regression with volano, 8% with oltp;
> >> 2) 8-core+HT Montvale Itanium: 9% regression with volano; 8% with oltp;
> >> 3) 16-core tigerton: %70 with volano, %18 with oltp;
> >> 4) 8-core stoakley: %15 with oltp, testing failed with volanoMark.
> >>
> >> So the issues are popular on different architectures.
> > I know kernel needs the features and it might not be a good idea to reject them over and over again.
> > I will collect more data on tigerton and try to optimize it.
>
> Could you tell me the exact config options? the same with last time?
Yes.

>
> as follows:
>
> # CONFIG_CGROUPS is not set
> CONFIG_GROUP_SCHED=y
> CONFIG_FAIR_GROUP_SCHED=y
> # CONFIG_RT_GROUP_SCHED is not set
> CONFIG_USER_SCHED=y
> # CONFIG_CGROUP_SCHED is not set

2008-08-01 05:14:39

by Dhaval Giani

Subject: Re: VolanoMark regression with 2.6.27-rc1

On Fri, Aug 01, 2008 at 08:39:14AM +0800, Zhang, Yanmin wrote:
>
> On Thu, 2008-07-31 at 15:49 +0800, Zhang, Yanmin wrote:
> > On Thu, 2008-07-31 at 09:39 +0200, Peter Zijlstra wrote:
> > > On Thu, 2008-07-31 at 15:31 +0800, Zhang, Yanmin wrote:
> > > > On Thu, 2008-07-31 at 11:20 +0800, Zhang, Yanmin wrote:
> > > > > Ingo,
> > > > >
> > > > Oh, it looks like they are the old issues in 2.6.26-rc1 and the 2 patches were reverted before 2.6.26.
> > > > New patches are merged into 2.6.27-rc1, but the issues are still not resolved clearly.
> > > > http://www.uwsg.iu.edu/hypermail/linux/kernel/0805.2/1148.html.
> > >
> > > The new smp-group stuff doesn't remotely look like what was in .26
> > >
> > > Also, on my quad (admittedly smaller than your machines) both volano and
> > > sysbench didn't regress anymore - where they clearly did with the code
> > > reverted from .26.
> > The regression I reported exists on:
> > 1) 8-core+HT(totally 16 logical processor) tulsa: 40% regression with volano, 8% with oltp;
> > 2) 8-core+HT Montvale Itanium: 9% regression with volano; 8% with oltp;
> > 3) 16-core tigerton: %70 with volano, %18 with oltp;
> > 4) 8-core stoakley: %15 with oltp, testing failed with volanoMark.
> >
> > So the issues are popular on different architectures.
> I know kernel needs the features and it might not be a good idea to reject them over and over again.
> I will collect more data on tigerton and try to optimize it.

Hi Yanmin,

Would it be possible for you to switch off the group scheduling feature
and see if the regression still exists? In all our testing, we did not
see a regression. I would like to eliminate it from your testing as
well.

The option to switch off would be CONFIG_GROUP_SCHED; that should disable
all the group scheduling features.

Thanks,
--
regards,
Dhaval

2008-08-01 12:26:16

by Hugh Dickins

Subject: Re: VolanoMark regression with 2.6.27-rc1

On Thu, 31 Jul 2008, Zhang, Yanmin wrote:
>
> I tried to use git bisect to locate the root cause, but git bisect
> always went back to 2.6.26. Then, I used my mechanical bisect script...
....
> BTW, sysbench+mysql(oltp readonly) has about 15% regression,
> but git bisect looks crazy again.

I'm no git expert, but didn't see anyone else comment on this:
you need to trust git more, it's like the Tour de France,
occasionally venturing into other countries for a little while.

Work which got merged into Linus's 2.6.26-git for 2.6.27-rc1 may
well have been developed on a 2.6.26-rcN base in someone else's
tree, and so the bisection may take you back there.

I think this is getting commoner now, since Linus spoke out
against rebasing: bisecting a net issue took me back to rc6-git
and rc4-git, but did end up at the right commit.

It can be a nuisance if you don't notice at "make install" time,
and reboot into a kernel other than the one you just built to test.

Hugh

2008-08-04 00:57:40

by Yanmin Zhang

Subject: Re: VolanoMark regression with 2.6.27-rc1


On Fri, 2008-08-01 at 13:25 +0100, Hugh Dickins wrote:
> On Thu, 31 Jul 2008, Zhang, Yanmin wrote:
> >
> > I tried to use git bisect to locate the root cause, but git bisect
> > always went back to 2.6.26. Then, I used my mechanical bisect script...
> ....
> > BTW, sysbench+mysql(oltp readonly) has about 15% regression,
> > but git bisect looks crazy again.
>
> I'm no git expert, but didn't see anyone else comment on this:
> you need to trust git more, it's like the Tour de France,
> occasionally venturing into other countries for a little while.
>
> Work which got merged into Linus's 2.6.26-git for 2.6.27-rc1 may
> well have been developed on a 2.6.26-rcN base in someone else's
> tree, and so the bisection may take you back there.
>
> I think this is getting commoner now, since Linus spoke out
> against rebasing: bisecting a net issue took me back to rc6-git
> and rc4-git, but did end up at the right commit.
Sometimes git bisect can locate the culprit, but it didn't this time.
I tend to keep quiet when things work and make noise when
something is wrong. :)

Thanks,
Yanmin

2008-08-04 05:08:18

by Yanmin Zhang

Subject: Re: VolanoMark regression with 2.6.27-rc1


On Fri, 2008-08-01 at 10:44 +0530, Dhaval Giani wrote:
> On Fri, Aug 01, 2008 at 08:39:14AM +0800, Zhang, Yanmin wrote:
> >
> > On Thu, 2008-07-31 at 15:49 +0800, Zhang, Yanmin wrote:
> > > On Thu, 2008-07-31 at 09:39 +0200, Peter Zijlstra wrote:
> > > > On Thu, 2008-07-31 at 15:31 +0800, Zhang, Yanmin wrote:
> > > > > On Thu, 2008-07-31 at 11:20 +0800, Zhang, Yanmin wrote:
> > > > > > Ingo,
> > > > > >
> > > > > Oh, it looks like they are the old issues in 2.6.26-rc1 and the 2 patches were reverted before 2.6.26.
> > > > > New patches are merged into 2.6.27-rc1, but the issues are still not resolved clearly.
> > > > > http://www.uwsg.iu.edu/hypermail/linux/kernel/0805.2/1148.html.
> > > >
> > > > The new smp-group stuff doesn't remotely look like what was in .26
> > > >
> > > > Also, on my quad (admittedly smaller than your machines) both volano and
> > > > sysbench didn't regress anymore - where they clearly did with the code
> > > > reverted from .26.
> > > The regression I reported exists on:
> > > 1) 8-core+HT(totally 16 logical processor) tulsa: 40% regression with volano, 8% with oltp;
> > > 2) 8-core+HT Montvale Itanium: 9% regression with volano; 8% with oltp;
> > > 3) 16-core tigerton: %70 with volano, %18 with oltp;
> > > 4) 8-core stoakley: %15 with oltp, testing failed with volanoMark.
> > >
> > > So the issues are popular on different architectures.
> > I know kernel needs the features and it might not be a good idea to reject them over and over again.
> > I will collect more data on tigerton and try to optimize it.
>
> Hi Yanmin,
>
> Would it be possible for you to switch of the group scheduling feature
> and see if the regression still exists. In all our testing, we did not
> see a regression. I would like to eliminate it from your testing as
> well.
I tested with CONFIG_GROUP_SCHED=n. To test faster, I simplified the benchmark parameters.

volanoMark:
kernel              | result
--------------------+--------
2.6.27-rc1_group    | 205901
2.6.27-rc1_nogroup  | 303377
2.6.26_group        | 529388


sysbench+mysql(readonly oltp):
kernel              | result
--------------------+--------
2.6.27-rc1_group    | 560636
2.6.27-rc1_nogroup  | 604937
2.6.26_group        | 627384



>
> The option to switch off would be CONFIG_GROUP_SCHED, that should disable
> all the group scheduling features.
>
> Thanks,

2008-08-04 05:23:25

by Dhaval Giani

Subject: Re: VolanoMark regression with 2.6.27-rc1

On Mon, Aug 04, 2008 at 01:04:38PM +0800, Zhang, Yanmin wrote:
>
> On Fri, 2008-08-01 at 10:44 +0530, Dhaval Giani wrote:
> > On Fri, Aug 01, 2008 at 08:39:14AM +0800, Zhang, Yanmin wrote:
> > >
> > > On Thu, 2008-07-31 at 15:49 +0800, Zhang, Yanmin wrote:
> > > > On Thu, 2008-07-31 at 09:39 +0200, Peter Zijlstra wrote:
> > > > > On Thu, 2008-07-31 at 15:31 +0800, Zhang, Yanmin wrote:
> > > > > > On Thu, 2008-07-31 at 11:20 +0800, Zhang, Yanmin wrote:
> > > > > > > Ingo,
> > > > > > >
> > > > > > Oh, it looks like they are the old issues in 2.6.26-rc1 and the 2 patches were reverted before 2.6.26.
> > > > > > New patches are merged into 2.6.27-rc1, but the issues are still not resolved clearly.
> > > > > > http://www.uwsg.iu.edu/hypermail/linux/kernel/0805.2/1148.html.
> > > > >
> > > > > The new smp-group stuff doesn't remotely look like what was in .26
> > > > >
> > > > > Also, on my quad (admittedly smaller than your machines) both volano and
> > > > > sysbench didn't regress anymore - where they clearly did with the code
> > > > > reverted from .26.
> > > > The regression I reported exists on:
> > > > 1) 8-core+HT(totally 16 logical processor) tulsa: 40% regression with volano, 8% with oltp;
> > > > 2) 8-core+HT Montvale Itanium: 9% regression with volano; 8% with oltp;
> > > > 3) 16-core tigerton: %70 with volano, %18 with oltp;
> > > > 4) 8-core stoakley: %15 with oltp, testing failed with volanoMark.
> > > >
> > > > So the issues are popular on different architectures.
> > > I know kernel needs the features and it might not be a good idea to reject them over and over again.
> > > I will collect more data on tigerton and try to optimize it.
> >
> > Hi Yanmin,
> >
> > Would it be possible for you to switch of the group scheduling feature
> > and see if the regression still exists. In all our testing, we did not
> > see a regression. I would like to eliminate it from your testing as
> > well.
> I tested with CONFIG_GROUP_SCHED=n. To test faster, I simplified the benchmark parameter.
>
> volanoMark:
> kernel | result
> ----------------------------------------------------------
> 2.6.27-rc1_group | 205901
> ----------------------------------------------------------
> 2.6.27-rc1_nogroup | 303377
> ----------------------------------------------------------
> 2.6.26_group | 529388
>

There seem to be two different regressions here. One in the user group
scheduling (which I do remember did have problems) and something totally
unrelated to group scheduling. In some of the runs I tried here, I got
similar results for 2.6.27-rc1_nogroup and 2.6.27-rc1_cgroup but had bad
results for user. Anyway, we will need to fix both the regressions.
Would it be possible for you to see what causes the regression between
2.6.26 and 2.6.27-rc1 for the non group scheduling case?

thanks,
--
regards,
Dhaval

2008-08-04 05:41:40

by Yanmin Zhang

Subject: Re: VolanoMark regression with 2.6.27-rc1


On Mon, 2008-08-04 at 10:52 +0530, Dhaval Giani wrote:
> On Mon, Aug 04, 2008 at 01:04:38PM +0800, Zhang, Yanmin wrote:
> >
> > On Fri, 2008-08-01 at 10:44 +0530, Dhaval Giani wrote:
> > > On Fri, Aug 01, 2008 at 08:39:14AM +0800, Zhang, Yanmin wrote:
> > > >
> > > > On Thu, 2008-07-31 at 15:49 +0800, Zhang, Yanmin wrote:
> > > > > On Thu, 2008-07-31 at 09:39 +0200, Peter Zijlstra wrote:
> > > > > > On Thu, 2008-07-31 at 15:31 +0800, Zhang, Yanmin wrote:
> > > > > > > On Thu, 2008-07-31 at 11:20 +0800, Zhang, Yanmin wrote:
> > > > > > > > Ingo,
> > > > > > > >
> > > > > > > Oh, it looks like they are the old issues in 2.6.26-rc1 and the 2 patches were reverted before 2.6.26.
> > > > > > > New patches are merged into 2.6.27-rc1, but the issues are still not resolved clearly.
> > > > > > > http://www.uwsg.iu.edu/hypermail/linux/kernel/0805.2/1148.html.
> > > > > >
> > > > > > The new smp-group stuff doesn't remotely look like what was in .26
> > > > > >
> > > > > > Also, on my quad (admittedly smaller than your machines) both volano and
> > > > > > sysbench didn't regress anymore - where they clearly did with the code
> > > > > > reverted from .26.
> > > > > The regression I reported exists on:
> > > > > 1) 8-core+HT(totally 16 logical processor) tulsa: 40% regression with volano, 8% with oltp;
> > > > > 2) 8-core+HT Montvale Itanium: 9% regression with volano; 8% with oltp;
> > > > > 3) 16-core tigerton: %70 with volano, %18 with oltp;
> > > > > 4) 8-core stoakley: %15 with oltp, testing failed with volanoMark.
> > > > >
> > > > > So the issues are popular on different architectures.
> > > > I know kernel needs the features and it might not be a good idea to reject them over and over again.
> > > > I will collect more data on tigerton and try to optimize it.
> > >
> > > Hi Yanmin,
> > >
> > > Would it be possible for you to switch of the group scheduling feature
> > > and see if the regression still exists. In all our testing, we did not
> > > see a regression. I would like to eliminate it from your testing as
> > > well.
> > I tested with CONFIG_GROUP_SCHED=n. To test faster, I simplified the benchmark parameter.
> >
> > volanoMark:
> > kernel | result
> > ----------------------------------------------------------
> > 2.6.27-rc1_group | 205901
> > ----------------------------------------------------------
> > 2.6.27-rc1_nogroup | 303377
> > ----------------------------------------------------------
> > 2.6.26_group | 529388
> >
>
> There seem to be two different regressions here. One in the user group
> scheduling (which I do remember did have problems) and something totally
> unrelated to group scheduling. In some of the runs I tried here, I got
> similar results for 2.6.27-rc1_nogroup and 2.6.27-rc1_cgroup
Does cgroup here mean CONFIG_CGROUPS? Or just a typo?

I never enable CONFIG_CGROUPS.

> but had bad
> results for user. Anyway, we will need to fix both the regressions.
That's great.

> Would it be possible for you to see what causes the regression between
> 2.6.26 and 2.6.27-rc1 for the non group scheduling case?
I will check it. But git bisect doesn't work on this issue. Most likely, it's still
caused by the scheduler. The old emails about 2.6.26-rc1 show that the
major scheduler issues were related to 2 patches, although I'm not sure
the current regression is still caused by them.

yanmin

2008-08-04 05:58:26

by Dhaval Giani

Subject: Re: VolanoMark regression with 2.6.27-rc1

On Mon, Aug 04, 2008 at 01:37:58PM +0800, Zhang, Yanmin wrote:
>
> On Mon, 2008-08-04 at 10:52 +0530, Dhaval Giani wrote:
> > On Mon, Aug 04, 2008 at 01:04:38PM +0800, Zhang, Yanmin wrote:
> > >
> > > On Fri, 2008-08-01 at 10:44 +0530, Dhaval Giani wrote:
> > > > On Fri, Aug 01, 2008 at 08:39:14AM +0800, Zhang, Yanmin wrote:
> > > > >
> > > > > On Thu, 2008-07-31 at 15:49 +0800, Zhang, Yanmin wrote:
> > > > > > On Thu, 2008-07-31 at 09:39 +0200, Peter Zijlstra wrote:
> > > > > > > On Thu, 2008-07-31 at 15:31 +0800, Zhang, Yanmin wrote:
> > > > > > > > On Thu, 2008-07-31 at 11:20 +0800, Zhang, Yanmin wrote:
> > > > > > > > > Ingo,
> > > > > > > > >
> > > > > > > > Oh, it looks like they are the old issues in 2.6.26-rc1 and the 2 patches were reverted before 2.6.26.
> > > > > > > > New patches are merged into 2.6.27-rc1, but the issues are still not resolved clearly.
> > > > > > > > http://www.uwsg.iu.edu/hypermail/linux/kernel/0805.2/1148.html.
> > > > > > >
> > > > > > > The new smp-group stuff doesn't remotely look like what was in .26
> > > > > > >
> > > > > > > Also, on my quad (admittedly smaller than your machines) both volano and
> > > > > > > sysbench didn't regress anymore - where they clearly did with the code
> > > > > > > reverted from .26.
> > > > > > The regression I reported exists on:
> > > > > > 1) 8-core+HT(totally 16 logical processor) tulsa: 40% regression with volano, 8% with oltp;
> > > > > > 2) 8-core+HT Montvale Itanium: 9% regression with volano; 8% with oltp;
> > > > > > 3) 16-core tigerton: %70 with volano, %18 with oltp;
> > > > > > 4) 8-core stoakley: %15 with oltp, testing failed with volanoMark.
> > > > > >
> > > > > > So the issues are popular on different architectures.
> > > > > I know kernel needs the features and it might not be a good idea to reject them over and over again.
> > > > > I will collect more data on tigerton and try to optimize it.
> > > >
> > > > Hi Yanmin,
> > > >
> > > > Would it be possible for you to switch of the group scheduling feature
> > > > and see if the regression still exists. In all our testing, we did not
> > > > see a regression. I would like to eliminate it from your testing as
> > > > well.
> > > I tested with CONFIG_GROUP_SCHED=n. To test faster, I simplified the benchmark parameter.
> > >
> > > volanoMark:
> > > kernel | result
> > > ----------------------------------------------------------
> > > 2.6.27-rc1_group | 205901
> > > ----------------------------------------------------------
> > > 2.6.27-rc1_nogroup | 303377
> > > ----------------------------------------------------------
> > > 2.6.26_group | 529388
> > >
> >
> > There seem to be two different regressions here. One in the user group
> > scheduling (which I do remember did have problems) and something totally
> > unrelated to group scheduling. In some of the runs I tried here, I got
> > similar results for 2.6.27-rc1_nogroup and 2.6.27-rc1_cgroup
> Does cgroup here mean CONFIG_CGROUPS? Or just a typo?
>

It means CONFIG_CGROUP_SCHED.

> I never enable CONFIG_CGROUP.
>
> > but had bad
> > results for user. Anyway, we will need to fix both the regressions.
> That's great.
>
> > Would it be possible for you to see what causes the regression between
> > 2.6.26 and 2.6.27-rc1 for the non group scheduling case?
> I will check it. But git bisect doesn't work on this issue. Mostly, it's still
> caused by scheduler. If checking the old emails about 2.6.26-rc1, we can find the
> major issues about scheduler are related to 2 patches, although I'm not sure
> current regression is still caused by them.
>

The current set of patches affects group scheduling. From your results,
there is a big performance regression between the 2.6.26 group
scheduling case and the 2.6.27-rc1 non group scheduling case (where normally
the non group scheduling case should have performed better). I don't recall any
major changes to the scheduler which would explain this regression.
Peter, vatsa, any ideas?

Thanks,
--
regards,
Dhaval

2008-08-04 06:26:19

by Peter Zijlstra

Subject: Re: VolanoMark regression with 2.6.27-rc1

On Mon, 2008-08-04 at 11:23 +0530, Dhaval Giani wrote:

> Peter, vatsa, any ideas?

---

Patches in tip/sched/clock

---
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 5270d44..ea436bc 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1572,28 +1572,13 @@ static inline void sched_clock_idle_sleep_event(void)
static inline void sched_clock_idle_wakeup_event(u64 delta_ns)
{
}
-
-#ifdef CONFIG_NO_HZ
-static inline void sched_clock_tick_stop(int cpu)
-{
-}
-
-static inline void sched_clock_tick_start(int cpu)
-{
-}
-#endif
-
-#else /* CONFIG_HAVE_UNSTABLE_SCHED_CLOCK */
+#else
extern void sched_clock_init(void);
extern u64 sched_clock_cpu(int cpu);
extern void sched_clock_tick(void);
extern void sched_clock_idle_sleep_event(void);
extern void sched_clock_idle_wakeup_event(u64 delta_ns);
-#ifdef CONFIG_NO_HZ
-extern void sched_clock_tick_stop(int cpu);
-extern void sched_clock_tick_start(int cpu);
#endif
-#endif /* CONFIG_HAVE_UNSTABLE_SCHED_CLOCK */

/*
* For kernel-internal use: high-speed (but slightly incorrect) per-cpu
diff --git a/kernel/Kconfig.hz b/kernel/Kconfig.hz
index 382dd5a..94fabd5 100644
--- a/kernel/Kconfig.hz
+++ b/kernel/Kconfig.hz
@@ -55,4 +55,4 @@ config HZ
default 1000 if HZ_1000

config SCHED_HRTICK
- def_bool HIGH_RES_TIMERS && USE_GENERIC_SMP_HELPERS
+ def_bool HIGH_RES_TIMERS && (!SMP || USE_GENERIC_SMP_HELPERS)
diff --git a/kernel/sched.c b/kernel/sched.c
index 21f7da9..9a76e92 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -834,7 +834,7 @@ static inline u64 global_rt_period(void)

static inline u64 global_rt_runtime(void)
{
- if (sysctl_sched_rt_period < 0)
+ if (sysctl_sched_rt_runtime < 0)
return RUNTIME_INF;

return (u64)sysctl_sched_rt_runtime * NSEC_PER_USEC;
diff --git a/kernel/sched_clock.c b/kernel/sched_clock.c
index 22ed55d..074edc9 100644
--- a/kernel/sched_clock.c
+++ b/kernel/sched_clock.c
@@ -32,14 +32,18 @@
#include <linux/ktime.h>
#include <linux/module.h>

+/*
+ * Scheduler clock - returns current time in nanosec units.
+ * This is default implementation.
+ * Architectures and sub-architectures can override this.
+ */
+unsigned long long __attribute__((weak)) sched_clock(void)
+{
+ return (unsigned long long)jiffies * (NSEC_PER_SEC / HZ);
+}

#ifdef CONFIG_HAVE_UNSTABLE_SCHED_CLOCK

-#define MULTI_SHIFT 15
-/* Max is double, Min is 1/2 */
-#define MAX_MULTI (2LL << MULTI_SHIFT)
-#define MIN_MULTI (1LL << (MULTI_SHIFT-1))
-
struct sched_clock_data {
/*
* Raw spinlock - this is a special case: this might be called
@@ -49,14 +53,9 @@ struct sched_clock_data {
raw_spinlock_t lock;

unsigned long tick_jiffies;
- u64 prev_raw;
u64 tick_raw;
u64 tick_gtod;
u64 clock;
- s64 multi;
-#ifdef CONFIG_NO_HZ
- int check_max;
-#endif
};

static DEFINE_PER_CPU_SHARED_ALIGNED(struct sched_clock_data, sched_clock_data);
@@ -84,90 +83,39 @@ void sched_clock_init(void)

scd->lock = (raw_spinlock_t)__RAW_SPIN_LOCK_UNLOCKED;
scd->tick_jiffies = now_jiffies;
- scd->prev_raw = 0;
scd->tick_raw = 0;
scd->tick_gtod = ktime_now;
scd->clock = ktime_now;
- scd->multi = 1 << MULTI_SHIFT;
-#ifdef CONFIG_NO_HZ
- scd->check_max = 1;
-#endif
}

sched_clock_running = 1;
}

-#ifdef CONFIG_NO_HZ
-/*
- * The dynamic ticks makes the delta jiffies inaccurate. This
- * prevents us from checking the maximum time update.
- * Disable the maximum check during stopped ticks.
- */
-void sched_clock_tick_stop(int cpu)
-{
- struct sched_clock_data *scd = cpu_sdc(cpu);
-
- scd->check_max = 0;
-}
-
-void sched_clock_tick_start(int cpu)
-{
- struct sched_clock_data *scd = cpu_sdc(cpu);
-
- scd->check_max = 1;
-}
-
-static int check_max(struct sched_clock_data *scd)
-{
- return scd->check_max;
-}
-#else
-static int check_max(struct sched_clock_data *scd)
-{
- return 1;
-}
-#endif /* CONFIG_NO_HZ */
-
/*
* update the percpu scd from the raw @now value
*
* - filter out backward motion
* - use jiffies to generate a min,max window to clip the raw values
*/
-static void __update_sched_clock(struct sched_clock_data *scd, u64 now, u64 *time)
+static u64 __update_sched_clock(struct sched_clock_data *scd, u64 now)
{
unsigned long now_jiffies = jiffies;
long delta_jiffies = now_jiffies - scd->tick_jiffies;
u64 clock = scd->clock;
u64 min_clock, max_clock;
- s64 delta = now - scd->prev_raw;
+ s64 delta = now - scd->tick_raw;

WARN_ON_ONCE(!irqs_disabled());
-
- /*
- * At schedule tick the clock can be just under the gtod. We don't
- * want to push it too prematurely.
- */
- min_clock = scd->tick_gtod + (delta_jiffies * TICK_NSEC);
- if (min_clock > TICK_NSEC)
- min_clock -= TICK_NSEC / 2;
+ min_clock = scd->tick_gtod + delta_jiffies * TICK_NSEC;

if (unlikely(delta < 0)) {
clock++;
goto out;
}

- /*
- * The clock must stay within a jiffie of the gtod.
- * But since we may be at the start of a jiffy or the end of one
- * we add another jiffy buffer.
- */
- max_clock = scd->tick_gtod + (2 + delta_jiffies) * TICK_NSEC;
-
- delta *= scd->multi;
- delta >>= MULTI_SHIFT;
+ max_clock = min_clock + TICK_NSEC;

- if (unlikely(clock + delta > max_clock) && check_max(scd)) {
+ if (unlikely(clock + delta > max_clock)) {
if (clock < max_clock)
clock = max_clock;
else
@@ -180,12 +128,10 @@ static void __update_sched_clock(struct sched_clock_data *scd, u64 now, u64 *tim
if (unlikely(clock < min_clock))
clock = min_clock;

- if (time)
- *time = clock;
- else {
- scd->prev_raw = now;
- scd->clock = clock;
- }
+ scd->tick_jiffies = now_jiffies;
+ scd->clock = clock;
+
+ return clock;
}

static void lock_double_clock(struct sched_clock_data *data1,
@@ -203,7 +149,7 @@ static void lock_double_clock(struct sched_clock_data *data1,
u64 sched_clock_cpu(int cpu)
{
struct sched_clock_data *scd = cpu_sdc(cpu);
- u64 now, clock;
+ u64 now, clock, this_clock, remote_clock;

if (unlikely(!sched_clock_running))
return 0ull;
@@ -212,43 +158,44 @@ u64 sched_clock_cpu(int cpu)
now = sched_clock();

if (cpu != raw_smp_processor_id()) {
- /*
- * in order to update a remote cpu's clock based on our
- * unstable raw time rebase it against:
- * tick_raw (offset between raw counters)
- * tick_gotd (tick offset between cpus)
- */
struct sched_clock_data *my_scd = this_scd();

lock_double_clock(scd, my_scd);

- now -= my_scd->tick_raw;
- now += scd->tick_raw;
+ this_clock = __update_sched_clock(my_scd, now);
+ remote_clock = scd->clock;

- now += my_scd->tick_gtod;
- now -= scd->tick_gtod;
+ /*
+ * Use the opportunity that we have both locks
+ * taken to couple the two clocks: we take the
+ * larger time as the latest time for both
+ * runqueues. (this creates monotonic movement)
+ */
+ if (likely(remote_clock < this_clock)) {
+ clock = this_clock;
+ scd->clock = clock;
+ } else {
+ /*
+ * Should be rare, but possible:
+ */
+ clock = remote_clock;
+ my_scd->clock = remote_clock;
+ }

__raw_spin_unlock(&my_scd->lock);
-
- __update_sched_clock(scd, now, &clock);
-
- __raw_spin_unlock(&scd->lock);
-
} else {
__raw_spin_lock(&scd->lock);
- __update_sched_clock(scd, now, NULL);
- clock = scd->clock;
- __raw_spin_unlock(&scd->lock);
+ clock = __update_sched_clock(scd, now);
}

+ __raw_spin_unlock(&scd->lock);
+
return clock;
}

void sched_clock_tick(void)
{
struct sched_clock_data *scd = this_scd();
- unsigned long now_jiffies = jiffies;
- s64 mult, delta_gtod, delta_raw;
u64 now, now_gtod;

if (unlikely(!sched_clock_running))
@@ -260,29 +207,14 @@ void sched_clock_tick(void)
now = sched_clock();

__raw_spin_lock(&scd->lock);
- __update_sched_clock(scd, now, NULL);
+ __update_sched_clock(scd, now);
/*
* update tick_gtod after __update_sched_clock() because that will
* already observe 1 new jiffy; adding a new tick_gtod to that would
* increase the clock 2 jiffies.
*/
- delta_gtod = now_gtod - scd->tick_gtod;
- delta_raw = now - scd->tick_raw;
-
- if ((long)delta_raw > 0) {
- mult = delta_gtod << MULTI_SHIFT;
- do_div(mult, delta_raw);
- scd->multi = mult;
- if (scd->multi > MAX_MULTI)
- scd->multi = MAX_MULTI;
- else if (scd->multi < MIN_MULTI)
- scd->multi = MIN_MULTI;
- } else
- scd->multi = 1 << MULTI_SHIFT;
-
scd->tick_raw = now;
scd->tick_gtod = now_gtod;
- scd->tick_jiffies = now_jiffies;
__raw_spin_unlock(&scd->lock);
}

@@ -301,7 +233,6 @@ EXPORT_SYMBOL_GPL(sched_clock_idle_sleep_event);
void sched_clock_idle_wakeup_event(u64 delta_ns)
{
struct sched_clock_data *scd = this_scd();
- u64 now = sched_clock();

/*
* Override the previous timestamp and ignore all
@@ -310,9 +241,7 @@ void sched_clock_idle_wakeup_event(u64 delta_ns)
* rq clock:
*/
__raw_spin_lock(&scd->lock);
- scd->prev_raw = now;
scd->clock += delta_ns;
- scd->multi = 1 << MULTI_SHIFT;
__raw_spin_unlock(&scd->lock);

touch_softlockup_watchdog();
@@ -321,16 +250,6 @@ EXPORT_SYMBOL_GPL(sched_clock_idle_wakeup_event);

#endif

-/*
- * Scheduler clock - returns current time in nanosec units.
- * This is default implementation.
- * Architectures and sub-architectures can override this.
- */
-unsigned long long __attribute__((weak)) sched_clock(void)
-{
- return (unsigned long long)jiffies * (NSEC_PER_SEC / HZ);
-}
-
unsigned long long cpu_clock(int cpu)
{
unsigned long long clock;
diff --git a/kernel/sched_fair.c b/kernel/sched_fair.c
index cf2cd6c..0fe94ea 100644
--- a/kernel/sched_fair.c
+++ b/kernel/sched_fair.c
@@ -899,7 +899,7 @@ static void hrtick_start_fair(struct rq *rq, struct task_struct *p)
* doesn't make sense. Rely on vruntime for fairness.
*/
if (rq->curr != p)
- delta = max(10000LL, delta);
+ delta = max_t(s64, 10000LL, delta);

hrtick_start(rq, delta);
}
diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
index 825b4c0..f5da526 100644
--- a/kernel/time/tick-sched.c
+++ b/kernel/time/tick-sched.c
@@ -289,7 +289,6 @@ void tick_nohz_stop_sched_tick(int inidle)
ts->tick_stopped = 1;
ts->idle_jiffies = last_jiffies;
rcu_enter_nohz();
- sched_clock_tick_stop(cpu);
}

/*
@@ -392,7 +391,6 @@ void tick_nohz_restart_sched_tick(void)
select_nohz_load_balancer(0);
now = ktime_get();
tick_do_update_jiffies64(now);
- sched_clock_tick_start(cpu);
cpu_clear(cpu, nohz_cpu_mask);

/*
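
For readers following the sched_clock.c hunks above, here is a minimal sketch of the clamping window that the simplified __update_sched_clock() appears to implement. It is a standalone model, not the kernel code: clamp_sched_clock() and its tick_nsec parameter are hypothetical names, locking is left out, and the branch taken when the clock already sits at or above max_clock is not visible in the hunk, so it is modelled as a no-op.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Simplified, lock-free model of the clamping step.
 * tick_nsec stands in for the kernel's TICK_NSEC (assumption).
 */
static uint64_t clamp_sched_clock(uint64_t clock, int64_t delta,
                                  uint64_t tick_gtod,
                                  unsigned long delta_jiffies,
                                  uint64_t tick_nsec)
{
    uint64_t min_clock = tick_gtod + (uint64_t)delta_jiffies * tick_nsec;
    uint64_t max_clock = min_clock + tick_nsec;

    assert(tick_nsec > 0);

    if (delta < 0) {
        /* Raw counter moved backwards: advance by a token 1 ns. */
        clock++;
    } else if (clock + (uint64_t)delta > max_clock) {
        /* Raw counter ran ahead: stop at the top of the window. */
        if (clock < max_clock)
            clock = max_clock;
    } else {
        clock += (uint64_t)delta;
    }

    /* Never fall behind what jiffies says has elapsed. */
    if (clock < min_clock)
        clock = min_clock;

    return clock;
}
```

With a 1 ms tick, a raw jump of 5 ms from a clock of 10000100 ns is cut off at 11000000 ns, i.e. one tick above tick_gtod, which is the "min,max window" the comment in the patch refers to.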

2008-08-04 06:26:31

by Peter Zijlstra

Subject: Re: VolanoMark regression with 2.6.27-rc1

On Mon, 2008-08-04 at 11:23 +0530, Dhaval Giani wrote:

> Peter, vatsa, any ideas?

---

Revert:
a7be37ac8e1565e00880531f4e2aff421a21c803 sched: revert the revert of: weight calculations
c9c294a630e28eec5f2865f028ecfc58d45c0a5a sched: fix calc_delta_asym()
ced8aa16e1db55c33c507174c1b1f9e107445865 sched: fix calc_delta_asym, #2

---
diff --git a/kernel/sched.c b/kernel/sched.c
index 21f7da9..7afb0fc 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -1244,9 +1244,6 @@ static void resched_task(struct task_struct *p)
*/
#define SRR(x, y) (((x) + (1UL << ((y) - 1))) >> (y))

-/*
- * delta *= weight / lw
- */
static unsigned long
calc_delta_mine(unsigned long delta_exec, unsigned long weight,
struct load_weight *lw)
@@ -1274,6 +1271,12 @@ calc_delta_mine(unsigned long delta_exec, unsigned long weight,
return (unsigned long)min(tmp, (u64)(unsigned long)LONG_MAX);
}

+static inline unsigned long
+calc_delta_fair(unsigned long delta_exec, struct load_weight *lw)
+{
+ return calc_delta_mine(delta_exec, NICE_0_LOAD, lw);
+}
+
static inline void update_load_add(struct load_weight *lw, unsigned long inc)
{
lw->weight += inc;
diff --git a/kernel/sched_fair.c b/kernel/sched_fair.c
index cf2cd6c..593af05 100644
--- a/kernel/sched_fair.c
+++ b/kernel/sched_fair.c
@@ -334,34 +334,6 @@ int sched_nr_latency_handler(struct ctl_table *table, int write,
#endif

/*
- * delta *= w / rw
- */
-static inline unsigned long
-calc_delta_weight(unsigned long delta, struct sched_entity *se)
-{
- for_each_sched_entity(se) {
- delta = calc_delta_mine(delta,
- se->load.weight, &cfs_rq_of(se)->load);
- }
-
- return delta;
-}
-
-/*
- * delta *= rw / w
- */
-static inline unsigned long
-calc_delta_fair(unsigned long delta, struct sched_entity *se)
-{
- for_each_sched_entity(se) {
- delta = calc_delta_mine(delta,
- cfs_rq_of(se)->load.weight, &se->load);
- }
-
- return delta;
-}
-
-/*
* The idea is to set a period in which each task runs once.
*
* When there are too many tasks (sysctl_sched_nr_latency) we have to stretch
@@ -390,80 +362,47 @@ static u64 __sched_period(unsigned long nr_running)
*/
static u64 sched_slice(struct cfs_rq *cfs_rq, struct sched_entity *se)
{
- return calc_delta_weight(__sched_period(cfs_rq->nr_running), se);
+ u64 slice = __sched_period(cfs_rq->nr_running);
+
+ for_each_sched_entity(se) {
+ cfs_rq = cfs_rq_of(se);
+
+ slice *= se->load.weight;
+ do_div(slice, cfs_rq->load.weight);
+ }
+
+
+ return slice;
}

/*
* We calculate the vruntime slice of a to be inserted task
*
- * vs = s*rw/w = p
+ * vs = s/w = p/rw
*/
static u64 sched_vslice_add(struct cfs_rq *cfs_rq, struct sched_entity *se)
{
unsigned long nr_running = cfs_rq->nr_running;
+ unsigned long weight;
+ u64 vslice;

if (!se->on_rq)
nr_running++;

- return __sched_period(nr_running);
-}
-
-/*
- * The goal of calc_delta_asym() is to be asymmetrically around NICE_0_LOAD, in
- * that it favours >=0 over <0.
- *
- * -20 |
- * |
- * 0 --------+-------
- * .'
- * 19 .'
- *
- */
-static unsigned long
-calc_delta_asym(unsigned long delta, struct sched_entity *se)
-{
- struct load_weight lw = {
- .weight = NICE_0_LOAD,
- .inv_weight = 1UL << (WMULT_SHIFT-NICE_0_SHIFT)
- };
+ vslice = __sched_period(nr_running);

for_each_sched_entity(se) {
- struct load_weight *se_lw = &se->load;
- unsigned long rw = cfs_rq_of(se)->load.weight;
-
-#ifdef CONFIG_FAIR_SCHED_GROUP
- struct cfs_rq *cfs_rq = se->my_q;
- struct task_group *tg = NULL
-
- if (cfs_rq)
- tg = cfs_rq->tg;
-
- if (tg && tg->shares < NICE_0_LOAD) {
- /*
- * scale shares to what it would have been had
- * tg->weight been NICE_0_LOAD:
- *
- * weight = 1024 * shares / tg->weight
- */
- lw.weight *= se->load.weight;
- lw.weight /= tg->shares;
-
- lw.inv_weight = 0;
-
- se_lw = &lw;
- rw += lw.weight - se->load.weight;
- } else
-#endif
+ cfs_rq = cfs_rq_of(se);

- if (se->load.weight < NICE_0_LOAD) {
- se_lw = &lw;
- rw += NICE_0_LOAD - se->load.weight;
- }
+ weight = cfs_rq->load.weight;
+ if (!se->on_rq)
+ weight += se->load.weight;

- delta = calc_delta_mine(delta, rw, se_lw);
+ vslice *= NICE_0_LOAD;
+ do_div(vslice, weight);
}

- return delta;
+ return vslice;
}

/*
@@ -480,7 +419,11 @@ __update_curr(struct cfs_rq *cfs_rq, struct sched_entity *curr,

curr->sum_exec_runtime += delta_exec;
schedstat_add(cfs_rq, exec_clock, delta_exec);
- delta_exec_weighted = calc_delta_fair(delta_exec, curr);
+ delta_exec_weighted = delta_exec;
+ if (unlikely(curr->load.weight != NICE_0_LOAD)) {
+ delta_exec_weighted = calc_delta_fair(delta_exec_weighted,
+ &curr->load);
+ }
curr->vruntime += delta_exec_weighted;
}

@@ -687,17 +630,8 @@ place_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int initial)

if (!initial) {
/* sleeps upto a single latency don't count. */
- if (sched_feat(NEW_FAIR_SLEEPERS)) {
- unsigned long thresh = sysctl_sched_latency;
-
- /*
- * convert the sleeper threshold into virtual time
- */
- if (sched_feat(NORMALIZED_SLEEPER))
- thresh = calc_delta_fair(thresh, se);
-
- vruntime -= thresh;
- }
+ if (sched_feat(NEW_FAIR_SLEEPERS))
+ vruntime -= sysctl_sched_latency;

/* ensure we never gain time by being placed backwards. */
vruntime = max_vruntime(se->vruntime, vruntime);
@@ -1277,13 +1211,11 @@ static unsigned long wakeup_gran(struct sched_entity *se)
unsigned long gran = sysctl_sched_wakeup_granularity;

/*
- * More easily preempt - nice tasks, while not making it harder for
- * + nice tasks.
+ * More easily preempt - nice tasks, while not making
+ * it harder for + nice tasks.
*/
- if (sched_feat(ASYM_GRAN))
- gran = calc_delta_asym(sysctl_sched_wakeup_granularity, se);
- else
- gran = calc_delta_fair(sysctl_sched_wakeup_granularity, se);
+ if (unlikely(se->load.weight > NICE_0_LOAD))
+ gran = calc_delta_fair(gran, &se->load);

return gran;
}
diff --git a/kernel/sched_features.h b/kernel/sched_features.h
index 862b06b..6cd8734 100644
--- a/kernel/sched_features.h
+++ b/kernel/sched_features.h
@@ -1,5 +1,4 @@
SCHED_FEAT(NEW_FAIR_SLEEPERS, 1)
-SCHED_FEAT(NORMALIZED_SLEEPER, 1)
SCHED_FEAT(WAKEUP_PREEMPT, 1)
SCHED_FEAT(START_DEBIT, 1)
SCHED_FEAT(AFFINE_WAKEUPS, 1)
@@ -7,7 +6,6 @@ SCHED_FEAT(CACHE_HOT_BUDDY, 1)
SCHED_FEAT(SYNC_WAKEUPS, 1)
SCHED_FEAT(HRTICK, 1)
SCHED_FEAT(DOUBLE_TICK, 0)
-SCHED_FEAT(ASYM_GRAN, 1)
SCHED_FEAT(LB_BIAS, 0)
SCHED_FEAT(LB_WAKEUP_UPDATE, 1)
SCHED_FEAT(ASYM_EFF_LOAD, 1)
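
The restored sched_slice() loop in the revert above can be read as "scale the period by weight / runqueue weight at every level of the group hierarchy". A minimal sketch of that arithmetic, assuming plain integer division in place of do_div(); struct level and slice_for() are hypothetical names used only for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* One level of the hierarchy: an entity and the runqueue it is queued on. */
struct level {
    unsigned long se_weight;   /* weight of the entity at this level */
    unsigned long rq_weight;   /* summed weight of that runqueue */
};

/* slice = period * product of (se_weight / rq_weight) over all levels. */
static uint64_t slice_for(uint64_t period_ns, const struct level *lv, int depth)
{
    uint64_t slice = period_ns;

    for (int i = 0; i < depth; i++) {
        assert(lv[i].rq_weight > 0);
        /* Multiply first, then divide, as the patch does. */
        slice = slice * lv[i].se_weight / lv[i].rq_weight;
    }

    return slice;
}
```

For a 20 ms period, a nice-0 task (weight 1024) on a runqueue with total weight 3072 gets 6666666 ns; nested two levels deep with a half share at each level it gets 5000000 ns.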

2008-08-04 06:54:42

by Peter Zijlstra

Subject: Re: VolanoMark regression with 2.6.27-rc1

On Mon, 2008-08-04 at 11:23 +0530, Dhaval Giani wrote:

> Peter, vatsa, any ideas?

---
Subject: sched: scale sysctl_sched_shares_ratelimit with nr_cpus

David reported that his Niagara spends a little too much time in
tg_shares_up(), which considering he has a large cpu count makes sense.

So scale the ratelimit value with the number of cpus like we do for
other controls as well.

Reported-by: David Miller <[email protected]>
Signed-off-by: Peter Zijlstra <[email protected]>
---
diff --git a/kernel/sched.c b/kernel/sched.c
index 9a76e92..7eddaea 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -809,9 +809,9 @@ const_debug unsigned int sysctl_sched_nr_migrate = 32;

/*
* ratelimit for updating the group shares.
- * default: 0.5ms
+ * default: 0.25ms
*/
-const_debug unsigned int sysctl_sched_shares_ratelimit = 500000;
+const_debug unsigned int sysctl_sched_shares_ratelimit = 250000;

/*
* period over which we measure -rt task cpu usage in us.
@@ -5732,6 +5732,8 @@ static inline void sched_init_granularity(void)
sysctl_sched_latency = limit;

sysctl_sched_wakeup_granularity *= factor;
+
+ sysctl_sched_shares_ratelimit *= factor;
}

#ifdef CONFIG_SMP
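
To get a feel for what the scaling above does to the ratelimit, here is a small sketch. The patch only shows "*= factor"; the definition of factor is outside the hunk, so the formula 1 + ilog2(ncpus) used below is an assumption, and ilog2_u() and scaled_ratelimit_ns() are hypothetical helper names.

```c
#include <assert.h>

/* Integer log2, rounded down; stands in for the kernel's ilog2(). */
static unsigned int ilog2_u(unsigned int n)
{
    unsigned int log = 0;

    assert(n > 0);
    while (n >>= 1)
        log++;

    return log;
}

/* ASSUMPTION: factor = 1 + ilog2(ncpus); not shown in the patch itself. */
static unsigned int scaled_ratelimit_ns(unsigned int base_ns, unsigned int ncpus)
{
    return base_ns * (1 + ilog2_u(ncpus));
}
```

Under that assumption the new 250000 ns base becomes 500000 ns on 2 CPUs (the old default) and 1250000 ns on 16 CPUs, so large machines update group shares less often.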

2008-08-04 07:10:45

by Dhaval Giani

Subject: Re: VolanoMark regression with 2.6.27-rc1

On Mon, Aug 04, 2008 at 08:26:11AM +0200, Peter Zijlstra wrote:
> On Mon, 2008-08-04 at 11:23 +0530, Dhaval Giani wrote:
>
> > Peter, vatsa, any ideas?
>
> ---
>
> Revert:
> a7be37ac8e1565e00880531f4e2aff421a21c803 sched: revert the revert of: weight calculations
> c9c294a630e28eec5f2865f028ecfc58d45c0a5a sched: fix calc_delta_asym()
> ced8aa16e1db55c33c507174c1b1f9e107445865 sched: fix calc_delta_asym, #2
>

Did we not fix those? :)
--
regards,
Dhaval

2008-08-04 07:12:29

by Peter Zijlstra

Subject: Re: VolanoMark regression with 2.6.27-rc1

On Mon, 2008-08-04 at 12:35 +0530, Dhaval Giani wrote:
> On Mon, Aug 04, 2008 at 08:26:11AM +0200, Peter Zijlstra wrote:
> > On Mon, 2008-08-04 at 11:23 +0530, Dhaval Giani wrote:
> >
> > > Peter, vatsa, any ideas?
> >
> > ---
> >
> > Revert:
> > a7be37ac8e1565e00880531f4e2aff421a21c803 sched: revert the revert of: weight calculations
> > c9c294a630e28eec5f2865f028ecfc58d45c0a5a sched: fix calc_delta_asym()
> > ced8aa16e1db55c33c507174c1b1f9e107445865 sched: fix calc_delta_asym, #2
> >
>
> Did we not fix those? :)

Works for me,.. just guessing here.

2008-08-08 07:30:37

by Peter Zijlstra

Subject: Re: VolanoMark regression with 2.6.27-rc1

On Tue, 2030-08-06 at 11:26 +0800, Zhang, Yanmin wrote:
> On Mon, 2008-08-04 at 09:12 +0200, Peter Zijlstra wrote:
> > On Mon, 2008-08-04 at 12:35 +0530, Dhaval Giani wrote:
> > > On Mon, Aug 04, 2008 at 08:26:11AM +0200, Peter Zijlstra wrote:
> > > > On Mon, 2008-08-04 at 11:23 +0530, Dhaval Giani wrote:
> > > >
> > > > > Peter, vatsa, any ideas?
> > > >
> > > > ---
> > > >
> > > > Revert:
> > > > a7be37ac8e1565e00880531f4e2aff421a21c803 sched: revert the revert of: weight calculations
> > > > c9c294a630e28eec5f2865f028ecfc58d45c0a5a sched: fix calc_delta_asym()
> > > > ced8aa16e1db55c33c507174c1b1f9e107445865 sched: fix calc_delta_asym, #2
> > > >
> > >
> > > Did we not fix those? :)
> >
> > Works for me,.. just guessing here.
> I did more investigation on 16-core tigerton.
>
> Firstly, let's focus on CONFIG_GROUP_SCHED=n. With 2.6.26, the result
> has little difference
> between with and without CONFIG_GROUP_SCHED.
>
> 1) I tried different sched_features and found AFFINE_WAKEUPS has big
> impact on volanoMark. Other
> features have little impact.
>
> 2) With kernel 2.6.26, if disabling AFFINE_WAKEUPS, the result is
> 260000; if enabling AFFINE_WAKEUPS,
> the result is 515000, so the improvement caused by AFFINE_WAKEUPS is
> about 100%. With kernel 2.6.27-rc1,
> the improvement is only about 25%.
>
> 3) I turned on CONFIG_SCHEDSTATS in kernel and collect
> ttwu_move_affine. Mostly, collect ttwu_move_affine,
> then recollect it after 30 seconds and calculate the difference. With
> 2.6.26, I got below data:

<snip data>

> So with kernel 2.6.27-rc1, the successful wakeup_affine is about
> double of the one of 2.6.26
> on domain 0, but about 10 times on domain 1. That means more tasks are
> woken up on waker cpus.
>
> Does that mean it doesn't follow cache-hot checking?

I'm a bit puzzled, but you're right - I too noticed that volanomark is
_very_ sensitive to affine wakeups.

I'll try and find what changed in that code for GROUP=n.

2008-08-15 15:37:34

by Ingo Molnar

Subject: Re: VolanoMark regression with 2.6.27-rc1


* Peter Zijlstra <[email protected]> wrote:

> On Mon, 2008-08-04 at 11:23 +0530, Dhaval Giani wrote:
>
> > Peter, vatsa, any ideas?
>
> ---
> Subject: sched: scale sysctl_sched_shares_ratelimit with nr_cpus
>
> David reported that his Niagara spends a little too much time in
> tg_shares_up(), which considering he has a large cpu count makes sense.
>
> So scale the ratelimit value with the number of cpus like we do for
> other controls as well.

i've queued this up in tip/sched/urgent as it makes sense - but i'm also
wondering, does this impact the volano numbers?

Ingo

2008-08-20 07:26:24

by Yanmin Zhang

Subject: Re: VolanoMark regression with 2.6.27-rc1


On Mon, 2008-08-18 at 10:51 +0530, Dhaval Giani wrote:
> > > > > > So with kernel 2.6.27-rc1, the successful wakeup_affine is about
> > > > > > > double of the one of 2.6.26
> > > > > > on domain 0, but about 10 times on domain 1. That means more tasks are
> > > > > > woken up on waker cpus.
> > > > > >
> > > > > > Does that mean it doesn't follow cache-hot checking?
> > > > >
> > > > > I'm a bit puzzled, but you're right - I too noticed that volanomark is
> > > > > _very_ sensitive to affine wakeups.
> > > > >
> > > > > I'll try and find what changed in that code for GROUP=n.
> > > >
> > > > hi Yanmin,
> > > >
> > > > I was wondering if you could send me your config and what sysctls you
> > > > have set. I have not been able to reproduce the 2.6.26 -> 2.6.27-rc1
> > > > GROUP=n regression.
> > > Pls. see the attachment. As for sysctl, I just set /proc/sys/kernel/sched_compat_yield=1.
> > >
> > > I am wondering if the load balance causes the regression when group=n. I manually delete
> > > all GROUP codes and do a diff against 26 and 27-rc1.
> > >
> >
> > You can disable load balancing by being in uniprocessor mode.
> >
>
> Hi,
>
> I can see this regression only with sched_compat_yield=1. Some numbers
> though, I see a 5% regression with max_cpus=1 whereas close to 50% with
> SMP on a 8 way.
After reverting the patch below, the VolanoMark regression drops to less than 2% with CONFIG_GROUP_SCHED=n
on my 8-core Stoakley. On the 16-core Tigerton the revert improves the result by about 44%, but there is
still about a 20% regression compared with 2.6.26_nogroup.


commit 93b75217df39e6d75889cc6f8050343286aff4a5
Author: Peter Zijlstra <[email protected]>
Date: Fri Jun 27 13:41:33 2008 +0200

sched: disable source/target_load bias

The bias given by source/target_load functions can be very large, disable
it by default to get faster convergence.

Signed-off-by: Peter Zijlstra <[email protected]>
Cc: Srivatsa Vaddagiri <[email protected]>
Cc: Mike Galbraith <[email protected]>
Signed-off-by: Ingo Molnar <[email protected]>



This patch adds a new feature, LB_BIAS, but tests it with a NOT, so I missed it when I tested
the sched features one by one. That also explains why wake_affine and load_balance_newidle
pull tasks successfully more often with kernel 2.6.27-rc: the MC and CPU domains' wake_idx
is 1, so this patch affects them.

Dhaval, could you test it on your 8-way machine?

>
> Peter do you have any patches already, which I can try?
>
> Thanks,

2008-08-20 07:41:32

by Peter Zijlstra

Subject: Re: VolanoMark regression with 2.6.27-rc1

On Wed, 2008-08-20 at 15:24 +0800, Zhang, Yanmin wrote:
> On Mon, 2008-08-18 at 10:51 +0530, Dhaval Giani wrote:
> > > > > > > So with kernel 2.6.27-rc1, the successful wakeup_affine is about
> > > > > > > > double of the one of 2.6.26
> > > > > > > on domain 0, but about 10 times on domain 1. That means more tasks are
> > > > > > > woken up on waker cpus.
> > > > > > >
> > > > > > > Does that mean it doesn't follow cache-hot checking?
> > > > > >
> > > > > > I'm a bit puzzled, but you're right - I too noticed that volanomark is
> > > > > > _very_ sensitive to affine wakeups.
> > > > > >
> > > > > > I'll try and find what changed in that code for GROUP=n.
> > > > >
> > > > > hi Yanmin,
> > > > >
> > > > > I was wondering if you could send me your config and what sysctls you
> > > > > have set. I have not been able to reproduce the 2.6.26 -> 2.6.27-rc1
> > > > > GROUP=n regression.
> > > > Pls. see the attachment. As for sysctl, I just set /proc/sys/kernel/sched_compat_yield=1.
> > > >
> > > > I am wondering if the load balance causes the regression when group=n. I manually delete
> > > > all GROUP codes and do a diff against 26 and 27-rc1.
> > > >
> > >
> > > You can disable load balancing by being in uniprocessor mode.
> > >
> >
> > Hi,
> >
> > I can see this regression only with sched_compat_yield=1. Some numbers
> > though, I see a 5% regression with max_cpus=1 whereas close to 50% with
> > SMP on a 8 way.
> After reverting below patch, volanoMark regression becomes less than 2% with CONFIG_GROUP_SCHED=n
> on my 8-core stoakely. The improvement on 16-core tigerton is about 44%, but there is still about
> 20% regression, comparing with 2.6.26_nogroup.
>
>
> commit 93b75217df39e6d75889cc6f8050343286aff4a5
> Author: Peter Zijlstra <[email protected]>
> Date: Fri Jun 27 13:41:33 2008 +0200
>
> sched: disable source/target_load bias
>
> The bias given by source/target_load functions can be very large, disable
> it by default to get faster convergence.
>
> Signed-off-by: Peter Zijlstra <[email protected]>
> Cc: Srivatsa Vaddagiri <[email protected]>
> Cc: Mike Galbraith <[email protected]>
> Signed-off-by: Ingo Molnar <[email protected]>
>
>
>
> This patch adds a new feature LB_BIAS, but uses it with a NOT, so I lost it when I tested
> single sched feature one by one. That also explains why wake_affine and load_balance_newidle
> have more successful task pulling with kernel 2.6.27-rc, because MC and CPU domain's wake_idx
> is 1, so this patch has impact on them.
>
> Dhaval, could you test it on your 8-way machine?

Ah - I assumed you already tried that knob since you mentioned fiddling
with the various feature flags.

And I must admit to having overlooked the effect on wake_affine..

Chris, could you see the effect of this on smp group fairness?

---
diff --git a/kernel/sched_features.h b/kernel/sched_features.h
index 862b06b..9353ca7 100644
--- a/kernel/sched_features.h
+++ b/kernel/sched_features.h
@@ -8,6 +8,6 @@ SCHED_FEAT(SYNC_WAKEUPS, 1)
SCHED_FEAT(HRTICK, 1)
SCHED_FEAT(DOUBLE_TICK, 0)
SCHED_FEAT(ASYM_GRAN, 1)
-SCHED_FEAT(LB_BIAS, 0)
+SCHED_FEAT(LB_BIAS, 1)
SCHED_FEAT(LB_WAKEUP_UPDATE, 1)
SCHED_FEAT(ASYM_EFF_LOAD, 1)
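
Why a one-character change in sched_features.h flips scheduler behaviour: lists like the one above are usually expanded several times with different macro definitions (the "X-macro" pattern) to build both the bit indices and the default bitmask. The kernel's own expansion is not shown in this thread, so the following is only an illustration of the pattern; SCHED_FEATURES, F_ENUM, F_MASK, default_features and feat_enabled() are hypothetical names.

```c
#include <assert.h>
#include <stdbool.h>

/* Illustrative subset of the feature list, with LB_BIAS back on. */
#define SCHED_FEATURES(F) \
    F(HRTICK, 1)          \
    F(DOUBLE_TICK, 0)     \
    F(LB_BIAS, 1)

/* First expansion: one enum constant (bit index) per feature. */
#define F_ENUM(name, enabled) FEAT_##name,
enum { SCHED_FEATURES(F_ENUM) FEAT_COUNT };

/* Second expansion: OR together the bits whose default is 1. */
#define F_MASK(name, enabled) | ((unsigned long)(enabled) << FEAT_##name)
static const unsigned long default_features = 0UL SCHED_FEATURES(F_MASK);

/* Test one feature bit in a feature mask. */
static bool feat_enabled(unsigned long features, int bit)
{
    assert(bit >= 0 && bit < FEAT_COUNT);
    return (features >> bit) & 1UL;
}
```

In this sketch, changing the second argument of F(LB_BIAS, ...) from 0 to 1 is all it takes to set the bit in default_features, which mirrors the shape of the one-line patch above.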

2008-08-20 10:57:20

by Ingo Molnar

Subject: Re: VolanoMark regression with 2.6.27-rc1


* Peter Zijlstra <[email protected]> wrote:

> And I must admit to having overlooked the effect on wake_affine..
>
> Chris, could you see the effect of this on smp group fairness?

applied your commit below to tip/sched/urgent, thanks.

Ingo

-------------->
From 939387c3a6141ec6aefc7acd40f8b186781bb098 Mon Sep 17 00:00:00 2001
From: Peter Zijlstra <[email protected]>
Date: Wed, 20 Aug 2008 12:44:55 +0200
Subject: [PATCH] sched: enable LB_BIAS by default

Yanmin reported a significant regression on his 16-core machine due to:

commit 93b75217df39e6d75889cc6f8050343286aff4a5
Author: Peter Zijlstra <[email protected]>
Date: Fri Jun 27 13:41:33 2008 +0200

Flip back to the old behaviour.

Reported-by: "Zhang, Yanmin" <[email protected]>
Signed-off-by: Peter Zijlstra <[email protected]>
Signed-off-by: Ingo Molnar <[email protected]>
---
kernel/sched_features.h | 2 +-
1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/kernel/sched_features.h b/kernel/sched_features.h
index 862b06b..9353ca7 100644
--- a/kernel/sched_features.h
+++ b/kernel/sched_features.h
@@ -8,6 +8,6 @@ SCHED_FEAT(SYNC_WAKEUPS, 1)
SCHED_FEAT(HRTICK, 1)
SCHED_FEAT(DOUBLE_TICK, 0)
SCHED_FEAT(ASYM_GRAN, 1)
-SCHED_FEAT(LB_BIAS, 0)
+SCHED_FEAT(LB_BIAS, 1)
SCHED_FEAT(LB_WAKEUP_UPDATE, 1)
SCHED_FEAT(ASYM_EFF_LOAD, 1)

2008-08-20 13:32:36

by Peter Zijlstra

Subject: Re: VolanoMark regression with 2.6.27-rc1

On Wed, 2008-08-20 at 12:51 +0200, Ingo Molnar wrote:
> * Peter Zijlstra <[email protected]> wrote:
>
> > And I must admit to having overlooked the effect on wake_affine..
> >
> > Chris, could you see the effect of this on smp group fairness?

Just realized my brainfart..

---
Subject: sched: load-balance bias fixes
From: Peter Zijlstra <[email protected]>
Date: Wed Aug 20 15:28:51 CEST 2008

Yanmin spotted a regression with my patch that introduces LB_BIAS:

commit 93b75217df39e6d75889cc6f8050343286aff4a5
Author: Peter Zijlstra <[email protected]>
Date: Fri Jun 27 13:41:33 2008 +0200

And I just spotted the brainfart - I should have replaced min/max with avg
instead of removing it completely.

Signed-off-by: Peter Zijlstra <[email protected]>
---
include/linux/kernel.h | 6 ++++++
kernel/sched.c | 10 ++++++++--
2 files changed, 14 insertions(+), 2 deletions(-)

Index: linux-2.6/include/linux/kernel.h
===================================================================
--- linux-2.6.orig/include/linux/kernel.h
+++ linux-2.6/include/linux/kernel.h
@@ -367,6 +367,12 @@ static inline char *pack_hex_byte(char *
(void) (&_max1 == &_max2); \
_max1 > _max2 ? _max1 : _max2; })

+#define avg(x, y) ({ \
+ typeof(x) _avg1 = ((x)+1)/2; \
+ typeof(x) _avg2 = ((y)+1)/2; \
+ (void) (&_avg1 == &_avg2); \
+ _avg1 + _avg2; })
+
/**
* clamp - return a value clamped to a given range with strict typechecking
* @val: current value
Index: linux-2.6/kernel/sched.c
===================================================================
--- linux-2.6.orig/kernel/sched.c
+++ linux-2.6/kernel/sched.c
@@ -2008,9 +2008,12 @@ static unsigned long source_load(int cpu
struct rq *rq = cpu_rq(cpu);
unsigned long total = weighted_cpuload(cpu);

- if (type == 0 || !sched_feat(LB_BIAS))
+ if (type == 0)
return total;

+ if (!sched_feat(LB_BIAS))
+ return avg(rq->cpu_load[type-1], total);
+
return min(rq->cpu_load[type-1], total);
}

@@ -2023,9 +2026,12 @@ static unsigned long target_load(int cpu
struct rq *rq = cpu_rq(cpu);
unsigned long total = weighted_cpuload(cpu);

- if (type == 0 || !sched_feat(LB_BIAS))
+ if (type == 0)
return total;

+ if (!sched_feat(LB_BIAS))
+ return avg(rq->cpu_load[type-1], total);
+
return max(rq->cpu_load[type-1], total);
}
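
The three behaviours the patch above distinguishes (instantaneous load, biased min/max, unbiased average) can be modelled as pure functions. This is a sketch under stated assumptions: hist stands for rq->cpu_load[type-1], now for weighted_cpuload(cpu), and the *_model names and avg_ul() are hypothetical. avg_ul() is an overflow-safe floor average, not the rounding macro from the patch.

```c
#include <assert.h>
#include <stdbool.h>

/* Overflow-safe floor((a + b) / 2); swapped in for the patch's avg(). */
static unsigned long avg_ul(unsigned long a, unsigned long b)
{
    return (a & b) + ((a ^ b) >> 1);
}

/* Load estimate for a CPU that tasks might be pulled from. */
static unsigned long source_load_model(unsigned long hist, unsigned long now,
                                       int type, bool lb_bias)
{
    assert(type >= 0);
    if (type == 0)
        return now;
    if (!lb_bias)
        return avg_ul(hist, now);
    return hist < now ? hist : now;   /* min: low guess for the source */
}

/* Load estimate for a CPU that tasks might be moved to. */
static unsigned long target_load_model(unsigned long hist, unsigned long now,
                                       int type, bool lb_bias)
{
    assert(type >= 0);
    if (type == 0)
        return now;
    if (!lb_bias)
        return avg_ul(hist, now);
    return hist > now ? hist : now;   /* max: high guess for the target */
}
```

With hist = 100 and now = 300, the biased variants return 100 for the source and 300 for the target, which discourages migration; with LB_BIAS off both return 200, whereas before this fix both returned the instantaneous 300.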


2008-08-20 13:47:43

by Ingo Molnar

Subject: Re: VolanoMark regression with 2.6.27-rc1


* Peter Zijlstra <[email protected]> wrote:

> Just realized my brainfart..
>
> ---
> Subject: sched: load-balance bias fixes
> From: Peter Zijlstra <[email protected]>
> Date: Wed Aug 20 15:28:51 CEST 2008
>
> Yanmin spotted a regression with my patch that introduces LB_BIAS:

ok, i've applied this one to tip/sched/urgent instead of the
feature-disabling patchlet. Yanmin, could you please check whether this
one does the trick?

Ingo

2008-08-20 14:31:00

by Alexey Dobriyan

Subject: Re: VolanoMark regression with 2.6.27-rc1

On Wed, Aug 20, 2008 at 03:32:17PM +0200, Peter Zijlstra wrote:
> On Wed, 2008-08-20 at 12:51 +0200, Ingo Molnar wrote:
> > * Peter Zijlstra <[email protected]> wrote:
> >
> > > And I must admit to having overlooked the effect on wake_affine..
> > >
> > > Chris, could you see the effect of this on smp group fairness?
>
> Just realized my brainfart..
>
> ---
> Subject: sched: load-balance bias fixes
> From: Peter Zijlstra <[email protected]>
> Date: Wed Aug 20 15:28:51 CEST 2008
>
> Yanmin spotted a regression with my patch that introduces LB_BIAS:
>
> commit 93b75217df39e6d75889cc6f8050343286aff4a5
> Author: Peter Zijlstra <[email protected]>
> Date: Fri Jun 27 13:41:33 2008 +0200
>
> And I just spotted the brainfart - I should have replaced min/max with avg
> instead of removing it completely.

> --- linux-2.6.orig/include/linux/kernel.h
> +++ linux-2.6/include/linux/kernel.h
> @@ -367,6 +367,12 @@ static inline char *pack_hex_byte(char *
> (void) (&_max1 == &_max2); \
> _max1 > _max2 ? _max1 : _max2; })
>
> +#define avg(x, y) ({ \
> + typeof(x) _avg1 = ((x)+1)/2; \
> + typeof(x) _avg2 = ((y)+1)/2; \

ITYM, typeof(y)

> + (void) (&_avg1 == &_avg2); \
> + _avg1 + _avg2; })

2008-08-20 14:33:28

by Peter Zijlstra

Subject: Re: VolanoMark regression with 2.6.27-rc1

On Wed, 2008-08-20 at 18:32 +0400, Alexey Dobriyan wrote:
> On Wed, Aug 20, 2008 at 03:32:17PM +0200, Peter Zijlstra wrote:
> > On Wed, 2008-08-20 at 12:51 +0200, Ingo Molnar wrote:
> > > * Peter Zijlstra <[email protected]> wrote:
> > >
> > > > And I must admit to having overlooked the effect on wake_affine..
> > > >
> > > > Chris, could you see the effect of this on smp group fairness?
> >
> > Just realized my brainfart..
> >
> > ---
> > Subject: sched: load-balance bias fixes
> > From: Peter Zijlstra <[email protected]>
> > Date: Wed Aug 20 15:28:51 CEST 2008
> >
> > Yanmin spotted a regression with my patch that introduces LB_BIAS:
> >
> > commit 93b75217df39e6d75889cc6f8050343286aff4a5
> > Author: Peter Zijlstra <[email protected]>
> > Date: Fri Jun 27 13:41:33 2008 +0200
> >
> > And I just spotted the brainfart - I should have replaced min/max with avg
> > instead of removing it completely.
>
> > --- linux-2.6.orig/include/linux/kernel.h
> > +++ linux-2.6/include/linux/kernel.h
> > @@ -367,6 +367,12 @@ static inline char *pack_hex_byte(char *
> > (void) (&_max1 == &_max2); \
> > _max1 > _max2 ? _max1 : _max2; })
> >
> > +#define avg(x, y) ({ \
> > + typeof(x) _avg1 = ((x)+1)/2; \
> > + typeof(x) _avg2 = ((y)+1)/2; \
>
> ITYM, typeof(y)

you thought right, I did mean that :-)

> > + (void) (&_avg1 == &_avg2); \
> > + _avg1 + _avg2; })
>

2008-08-20 15:10:54

by Nick Piggin

Subject: Re: VolanoMark regression with 2.6.27-rc1

On Thursday 21 August 2008 00:33, Peter Zijlstra wrote:
> On Wed, 2008-08-20 at 18:32 +0400, Alexey Dobriyan wrote:
> > On Wed, Aug 20, 2008 at 03:32:17PM +0200, Peter Zijlstra wrote:

> > > +#define avg(x, y) ({ \
> > > + typeof(x) _avg1 = ((x)+1)/2; \
> > > + typeof(x) _avg2 = ((y)+1)/2; \
> >
> > ITYM, typeof(y)
>
> you thought right, I did mean that :-)
>
> > > + (void) (&_avg1 == &_avg2); \
> > > + _avg1 + _avg2; })

I don't think this implementation of avg should go in kernel.h?

It gives an average of 1 and 1 to be 2, 3 and 3 is 4, 1 and 3 is
3 etc.

Maybe it is reasonable for very high numbers that would overflow
if added first, but it doesn't seem reasonable for a generic
averaging function.
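
To make the rounding concrete, here is a minimal sketch in plain C. The names avg_halve_first and avg_plain are made up for illustration and do not appear in the patch; the first mirrors the arithmetic of the quoted macro, the second is the ordinary truncating average for comparison.

```c
/* Illustrative only: these helpers are hypothetical, not kernel code. */

/* Halve each operand (rounding up) before adding, as the quoted macro does.
 * This avoids overflowing on the sum, but rounds up twice. */
static unsigned long avg_halve_first(unsigned long x, unsigned long y)
{
	return (x + 1) / 2 + (y + 1) / 2;
}

/* Ordinary truncating average; only safe when x + y cannot overflow. */
static unsigned long avg_plain(unsigned long x, unsigned long y)
{
	return (x + y) / 2;
}
```

With these, avg_halve_first(1, 1) yields 2 and avg_halve_first(3, 3) yields 4, while avg_plain gives 1 and 3 for the same inputs.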

2008-08-20 15:15:54

by Peter Zijlstra

Subject: Re: VolanoMark regression with 2.6.27-rc1

On Thu, 2008-08-21 at 01:10 +1000, Nick Piggin wrote:
> On Thursday 21 August 2008 00:33, Peter Zijlstra wrote:
> > On Wed, 2008-08-20 at 18:32 +0400, [email protected] wrote:
> > > On Wed, Aug 20, 2008 at 03:32:17PM +0200, Peter Zijlstra wrote:
>
> > > > +#define avg(x, y) ({ \
> > > > + typeof(x) _avg1 = ((x)+1)/2; \
> > > > + typeof(x) _avg2 = ((y)+1)/2; \
> > >
> > > ITYM, typeof(y)
> >
> > you thought right, I did mean that :-)
> >
> > > > + (void) (&_avg1 == &_avg2); \
> > > > + _avg1 + _avg2; })
>
> I don't think this implementation of avg should go in kernel.h?
>
> It gives an average of 1 and 1 to be 2, 3 and 3 is 4, 1 and 3 is
> 3 etc.
>
> Maybe it is reasonable for very high numbers that would overflow
> if added first, but it doesn't seem reasonable for a generic
> averaging function.

I had it in sched.c, then moved it to kernel.h and back again, etc.. I'm
fine with wherever..

---
Subject: sched: load-balance bias fixes
From: Peter Zijlstra <[email protected]>
Date: Wed Aug 20 15:28:51 CEST 2008

Yanmin spotted a regression with my patch that introduces LB_BIAS:

commit 93b75217df39e6d75889cc6f8050343286aff4a5
Author: Peter Zijlstra <[email protected]>
Date: Fri Jun 27 13:41:33 2008 +0200

And I just spotted the brainfart - I should have replaced min/max with avg
instead of removing it completely.

Signed-off-by: Peter Zijlstra <[email protected]>
---
kernel/sched.c | 16 ++++++++++++++--
1 file changed, 14 insertions(+), 2 deletions(-)

Index: linux-2.6/kernel/sched.c
===================================================================
--- linux-2.6.orig/kernel/sched.c
+++ linux-2.6/kernel/sched.c
@@ -1996,6 +1996,12 @@ void kick_process(struct task_struct *p)
preempt_enable();
}

+#define avg(x, y) ({ \
+ typeof(x) _avg1 = ((x)+1)/2; \
+ typeof(y) _avg2 = ((y)+1)/2; \
+ (void) (&_avg1 == &_avg2); \
+ _avg1 + _avg2; })
+
/*
* Return a low guess at the load of a migration-source cpu weighted
* according to the scheduling class and "nice" value.
@@ -2008,9 +2014,12 @@ static unsigned long source_load(int cpu
struct rq *rq = cpu_rq(cpu);
unsigned long total = weighted_cpuload(cpu);

- if (type == 0 || !sched_feat(LB_BIAS))
+ if (type == 0)
return total;

+ if (!sched_feat(LB_BIAS))
+ return avg(rq->cpu_load[type-1], total);
+
return min(rq->cpu_load[type-1], total);
}

@@ -2023,9 +2032,12 @@ static unsigned long target_load(int cpu
struct rq *rq = cpu_rq(cpu);
unsigned long total = weighted_cpuload(cpu);

- if (type == 0 || !sched_feat(LB_BIAS))
+ if (type == 0)
return total;

+ if (!sched_feat(LB_BIAS))
+ return avg(rq->cpu_load[type-1], total);
+
return max(rq->cpu_load[type-1], total);
}


2008-08-20 16:29:40

by Ray Lee

Subject: Re: VolanoMark regression with 2.6.27-rc1

On Wed, Aug 20, 2008 at 8:10 AM, Nick Piggin <[email protected]> wrote:
> On Thursday 21 August 2008 00:33, Peter Zijlstra wrote:
>> On Wed, 2008-08-20 at 18:32 +0400, [email protected] wrote:
>> > On Wed, Aug 20, 2008 at 03:32:17PM +0200, Peter Zijlstra wrote:
>
>> > > +#define avg(x, y) ({ \
>> > > + typeof(x) _avg1 = ((x)+1)/2; \
>> > > + typeof(x) _avg2 = ((y)+1)/2; \
>> >
>> > ITYM, typeof(y)
>>
>> you thought right, I did mean that :-)
>>
>> > > + (void) (&_avg1 == &_avg2); \
>> > > + _avg1 + _avg2; })
>
> I don't think this implementation of avg should go in kernel.h?
>
> It gives an average of 1 and 1 to be 2, 3 and 3 is 4, 1 and 3 is
> 3 etc.
>
> Maybe it is reasonable for very high numbers that would overflow
> if added first, but it doesn't seem reasonable for a generic
> averaging function.

The usual way of averaging numbers that may be large is

#define avg(x, y) ({ \
typeof(x) _x = (x); \
typeof(x) _y = (y); \
(void) (&_x == &_y); \
_x + (_y - _x)/2; })

...which also works for small and negative numbers.

2008-08-20 16:51:57

by Peter Zijlstra

Subject: Re: VolanoMark regression with 2.6.27-rc1

On Wed, 2008-08-20 at 09:29 -0700, Ray Lee wrote:
> On Wed, Aug 20, 2008 at 8:10 AM, Nick Piggin <[email protected]> wrote:
> > On Thursday 21 August 2008 00:33, Peter Zijlstra wrote:
> >> On Wed, 2008-08-20 at 18:32 +0400, [email protected] wrote:
> >> > On Wed, Aug 20, 2008 at 03:32:17PM +0200, Peter Zijlstra wrote:
> >
> >> > > +#define avg(x, y) ({ \
> >> > > + typeof(x) _avg1 = ((x)+1)/2; \
> >> > > + typeof(x) _avg2 = ((y)+1)/2; \
> >> >
> >> > ITYM, typeof(y)
> >>
> >> you thought right, I did mean that :-)
> >>
> >> > > + (void) (&_avg1 == &_avg2); \
> >> > > + _avg1 + _avg2; })
> >
> > I don't think this implementation of avg should go in kernel.h?
> >
> > It gives an average of 1 and 1 to be 2, 3 and 3 is 4, 1 and 3 is
> > 3 etc.
> >
> > Maybe it is reasonable for very high numbers that would overflow
> > if added first, but it doesn't seem reasonable for a generic
> > averaging function.
>
> The usual way of averaging numbers that may be large is
>
> #define avg(x, y) ({ \
> typeof(x) _x = (x); \
> typeof(x) _y = (y); \
> (void) (&_x == &_y); \
> _x + (_y - _x)/2; })
>
> ....which also works for small and negative numbers.

D'oh, why didn't I think of that..

2008-08-20 17:23:43

by Peter Zijlstra

Subject: Re: VolanoMark regression with 2.6.27-rc1

Ok, so one last time (I hope!)..

Everybody happy with this?

---
Subject: sched: load-balance bias fixes
From: Peter Zijlstra <[email protected]>
Date: Wed Aug 20 15:28:51 CEST 2008

Yanmin spotted a regression with my patch that introduces LB_BIAS:

commit 93b75217df39e6d75889cc6f8050343286aff4a5
Author: Peter Zijlstra <[email protected]>
Date: Fri Jun 27 13:41:33 2008 +0200

And I just spotted the brainfart - I should have replaced min/max with avg
instead of removing it completely.

[[email protected]: better avg implementation]
Signed-off-by: Peter Zijlstra <[email protected]>
---
include/linux/kernel.h | 6 ++++++
kernel/sched.c | 10 ++++++++--
2 files changed, 14 insertions(+), 2 deletions(-)

Index: linux-2.6/kernel/sched.c
===================================================================
--- linux-2.6.orig/kernel/sched.c
+++ linux-2.6/kernel/sched.c
@@ -2008,9 +2008,12 @@ static unsigned long source_load(int cpu
struct rq *rq = cpu_rq(cpu);
unsigned long total = weighted_cpuload(cpu);

- if (type == 0 || !sched_feat(LB_BIAS))
+ if (type == 0)
return total;

+ if (!sched_feat(LB_BIAS))
+ return avg(rq->cpu_load[type-1], total);
+
return min(rq->cpu_load[type-1], total);
}

@@ -2023,9 +2026,12 @@ static unsigned long target_load(int cpu
struct rq *rq = cpu_rq(cpu);
unsigned long total = weighted_cpuload(cpu);

- if (type == 0 || !sched_feat(LB_BIAS))
+ if (type == 0)
return total;

+ if (!sched_feat(LB_BIAS))
+ return avg(rq->cpu_load[type-1], total);
+
return max(rq->cpu_load[type-1], total);
}

Index: linux-2.6/include/linux/kernel.h
===================================================================
--- linux-2.6.orig/include/linux/kernel.h
+++ linux-2.6/include/linux/kernel.h
@@ -367,6 +367,12 @@ static inline char *pack_hex_byte(char *
(void) (&_max1 == &_max2); \
_max1 > _max2 ? _max1 : _max2; })

+#define avg(x, y) ({ \
+ typeof(x) _avg1 = (x); \
+ typeof(y) _avg2 = (y); \
+ (void) (&_avg1 == &_avg2); \
+ _avg1 + (_avg2 - _avg1)/2; })
+
/**
* clamp - return a value clamped to a given range with strict typechecking
* @val: current value

2008-08-20 17:56:33

by Nick Piggin

Subject: Re: VolanoMark regression with 2.6.27-rc1

On Thursday 21 August 2008 03:21, Peter Zijlstra wrote:
> Ok, so one last time (I hope!)..
>
> Everybody happy with this?


> Index: linux-2.6/include/linux/kernel.h
> ===================================================================
> --- linux-2.6.orig/include/linux/kernel.h
> +++ linux-2.6/include/linux/kernel.h
> @@ -367,6 +367,12 @@ static inline char *pack_hex_byte(char *
> (void) (&_max1 == &_max2); \
> _max1 > _max2 ? _max1 : _max2; })
>
> +#define avg(x, y) ({ \
> + typeof(x) _avg1 = (x); \
> + typeof(y) _avg2 = (y); \
> + (void) (&_avg1 == &_avg2); \
> + _avg1 + (_avg2 - _avg1)/2; })

That's not going to work with unsigned types.

2008-08-20 18:16:10

by Ray Lee

Subject: Re: VolanoMark regression with 2.6.27-rc1

On Wed, Aug 20, 2008 at 10:55 AM, Nick Piggin <[email protected]> wrote:
> On Thursday 21 August 2008 03:21, Peter Zijlstra wrote:
>> Ok, so one last time (I hope!)..
>>
>> Everybody happy with this?
>
>
>> Index: linux-2.6/include/linux/kernel.h
>> ===================================================================
>> --- linux-2.6.orig/include/linux/kernel.h
>> +++ linux-2.6/include/linux/kernel.h
>> @@ -367,6 +367,12 @@ static inline char *pack_hex_byte(char *
>> (void) (&_max1 == &_max2); \
>> _max1 > _max2 ? _max1 : _max2; })
>>
>> +#define avg(x, y) ({ \
>> + typeof(x) _avg1 = (x); \
>> + typeof(y) _avg2 = (y); \
>> + (void) (&_avg1 == &_avg2); \
>> + _avg1 + (_avg2 - _avg1)/2; })
>
> That's not going to work with unsigned types.

Uhm, I think it works fine, even with unsigned, even where _avg2 is
smaller than _avg1. Underflow is a good thing here. And I mocked up a
little test harness and it gives the correct answers for a half dozen
sets of values I tossed at it

But maybe I'm forgetting an obscure unsigned or signed int type
widening rule, so, care to elaborate?

2008-08-20 20:30:52

by Peter Zijlstra

Subject: Re: VolanoMark regression with 2.6.27-rc1

On Wed, 2008-08-20 at 11:15 -0700, Ray Lee wrote:
> On Wed, Aug 20, 2008 at 10:55 AM, Nick Piggin <[email protected]> wrote:
> > On Thursday 21 August 2008 03:21, Peter Zijlstra wrote:
> >> Ok, so one last time (I hope!)..
> >>
> >> Everybody happy with this?
> >
> >
> >> Index: linux-2.6/include/linux/kernel.h
> >> ===================================================================
> >> --- linux-2.6.orig/include/linux/kernel.h
> >> +++ linux-2.6/include/linux/kernel.h
> >> @@ -367,6 +367,12 @@ static inline char *pack_hex_byte(char *
> >> (void) (&_max1 == &_max2); \
> >> _max1 > _max2 ? _max1 : _max2; })
> >>
> >> +#define avg(x, y) ({ \
> >> + typeof(x) _avg1 = (x); \
> >> + typeof(y) _avg2 = (y); \
> >> + (void) (&_avg1 == &_avg2); \
> >> + _avg1 + (_avg2 - _avg1)/2; })
> >
> > That's not going to work with unsigned types.
>
> Uhm, I think it works fine, even with unsigned, even where _avg2 is
> smaller than _avg1. Underflow is a good thing here. And I mocked up a
> little test harness and it gives the correct answers for a half dozen
> sets of values I tossed at it
>
> But maybe I'm forgetting an obscure unsigned or signed int type
> widening rule, so, care to elaborate?

Nick is right, try:

int main(int argc, char **argv)
{
unsigned int x = 7, y = 5;
printf("%d\n", avg(x,y));
return 0;
}

It fails because 5-7 = -2, which needs a signed division or sign
extending right shift.

we'd need something like:

#define avg(x, y) ({ \
typeof(x) _avg1 = (x); \
typeof(y) _avg2 = (y); \
(void) (&_avg1 == &_avg2); \
_avg1 + (signed typeof(x))(_avg2 - _avg1)/2; })

except that typeof() doesn't work that way.

#define avg(x, y) ({ \
typeof(x) _avg1 = (x); \
typeof(y) _avg2 = (y); \
(void) (&_avg1 == &_avg2); \
_avg1 + (long)(_avg2 - _avg1)/2; })

works for the above example, but when I make it long long, so as to
match the longest supported type, it goes boom again - for as of yet
unknown reasons.

2008-08-20 20:57:21

by Peter Zijlstra

Subject: Re: VolanoMark regression with 2.6.27-rc1

On Wed, 2008-08-20 at 22:30 +0200, Peter Zijlstra wrote:
> On Wed, 2008-08-20 at 11:15 -0700, Ray Lee wrote:
> > On Wed, Aug 20, 2008 at 10:55 AM, Nick Piggin <[email protected]> wrote:
> > > On Thursday 21 August 2008 03:21, Peter Zijlstra wrote:
> > >> Ok, so one last time (I hope!)..
> > >>
> > >> Everybody happy with this?
> > >
> > >
> > >> Index: linux-2.6/include/linux/kernel.h
> > >> ===================================================================
> > >> --- linux-2.6.orig/include/linux/kernel.h
> > >> +++ linux-2.6/include/linux/kernel.h
> > >> @@ -367,6 +367,12 @@ static inline char *pack_hex_byte(char *
> > >> (void) (&_max1 == &_max2); \
> > >> _max1 > _max2 ? _max1 : _max2; })
> > >>
> > >> +#define avg(x, y) ({ \
> > >> + typeof(x) _avg1 = (x); \
> > >> + typeof(y) _avg2 = (y); \
> > >> + (void) (&_avg1 == &_avg2); \
> > >> + _avg1 + (_avg2 - _avg1)/2; })
> > >
> > > That's not going to work with unsigned types.
> >
> > Uhm, I think it works fine, even with unsigned, even where _avg2 is
> > smaller than _avg1. Underflow is a good thing here. And I mocked up a
> > little test harness and it gives the correct answers for a half dozen
> > sets of values I tossed at it
> >
> > But maybe I'm forgetting an obscure unsigned or signed int type
> > widening rule, so, care to elaborate?
>
> Nick is right, try:
>
> int main(int argc, char **argv)
> {
> unsigned int x = 7, y = 5;
> printf("%d\n", avg(x,y));
> return 0;
> }
>
> It fails because 5-7 = -2, which needs a signed division or sign
> extending right shift.
>
> we'd need something like:
>
> #define avg(x, y) ({ \
> typeof(x) _avg1 = (x); \
> typeof(y) _avg2 = (y); \
> (void) (&_avg1 == &_avg2); \
> _avg1 + (signed typeof(x))(_avg2 - _avg1)/2; })
>
> except that typeof() doesn't work that way.
>
> #define avg(x, y) ({ \
> typeof(x) _avg1 = (x); \
> typeof(y) _avg2 = (y); \
> (void) (&_avg1 == &_avg2); \
> _avg1 + (long)(_avg2 - _avg1)/2; })
>
> works for the above example, but when I make it long long, so as to
> match the longest supported type, it goes boom again - for as of yet
> unknown reasons.

Ok, people pointed out I got my promotion rules mixed up, I casted the
result of the division to signed, instead of ending up with a signed
division.

#define avg(x, y) ({ \
typeof(x) _avg1 = (x); \
typeof(y) _avg2 = (y); \
(void) (&_avg1 == &_avg2); \
(typeof(x))(_avg1 + ((long long)_avg2 - _avg1)/2); })

seems to work.
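
The promotion problem can be reproduced outside the kernel. The following is a minimal sketch assuming 32-bit operands (uint32_t); both function names are hypothetical. The first keeps the subtraction in unsigned arithmetic, the second widens to long long before subtracting so that the division is signed.

```c
#include <stdint.h>

/* Unsigned subtraction wraps when y < x, so the "half difference" is huge. */
static uint32_t avg_wrapping(uint32_t x, uint32_t y)
{
	return x + (y - x) / 2;
}

/* Widen first: the difference is computed and divided as a signed
 * long long, then the result is narrowed back to the operand type. */
static uint32_t avg_widened(uint32_t x, uint32_t y)
{
	return (uint32_t)(x + ((long long)y - x) / 2);
}
```

For x = 7, y = 5 the wrapping version returns 2147483654, while the widened version returns 6. Note that the widened form is not symmetric for odd differences, because signed division truncates toward zero: avg_widened(7, 4) is 6 but avg_widened(4, 7) is 5.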

2008-08-20 20:58:19

by Ray Lee

Subject: Re: VolanoMark regression with 2.6.27-rc1

On Wed, Aug 20, 2008 at 1:30 PM, Peter Zijlstra <[email protected]> wrote:
> Nick is right, try:
>
> int main(int argc, char **argv)
> {
> unsigned int x = 7, y = 5;
> printf("%d\n", avg(x,y));
> return 0;
> }
>
> It fails because 5-7 = -2, which needs a signed division or sign
> extending right shift.
>
> we'd need something like:
>
> #define avg(x, y) ({ \
> typeof(x) _avg1 = (x); \
> typeof(y) _avg2 = (y); \
> (void) (&_avg1 == &_avg2); \
> _avg1 + (signed typeof(x))(_avg2 - _avg1)/2; })
>
> except that typeof() doesn't work that way.
>
> #define avg(x, y) ({ \
> typeof(x) _avg1 = (x); \
> typeof(y) _avg2 = (y); \
> (void) (&_avg1 == &_avg2); \
> _avg1 + (long)(_avg2 - _avg1)/2; })
>
> works for the above example, but when I make it long long, so as to
> match the longest supported type, it goes boom again - for as of yet
> unknown reasons.

I think you'd want to cast it with a (signed) instead? as in:

#include <stdio.h>

#define avg(x, y) ({ \
typeof(x) _x = (x); \
typeof(y) _y = (y); \
(void) (&_x == &_y); \
_x + (signed)(_y - _x)/2; })

int main (void) {
unsigned long long a=7,b=5;

printf("%d %d\n", avg(a,b), avg(b,a));
}

...which works here, for me, but hey, I managed to goof up my other
test case, so take it for a spin.

Ray

2008-08-20 21:04:31

by Peter Zijlstra

Subject: Re: VolanoMark regression with 2.6.27-rc1

On Wed, 2008-08-20 at 13:58 -0700, Ray Lee wrote:
> On Wed, Aug 20, 2008 at 1:30 PM, Peter Zijlstra <[email protected]> wrote:
> > Nick is right, try:
> >
> > int main(int argc, char **argv)
> > {
> > unsigned int x = 7, y = 5;
> > printf("%d\n", avg(x,y));
> > return 0;
> > }
> >
> > It fails because 5-7 = -2, which needs a signed division or sign
> > extending right shift.
> >
> > we'd need something like:
> >
> > #define avg(x, y) ({ \
> > typeof(x) _avg1 = (x); \
> > typeof(y) _avg2 = (y); \
> > (void) (&_avg1 == &_avg2); \
> > _avg1 + (signed typeof(x))(_avg2 - _avg1)/2; })
> >
> > except that typeof() doesn't work that way.
> >
> > #define avg(x, y) ({ \
> > typeof(x) _avg1 = (x); \
> > typeof(y) _avg2 = (y); \
> > (void) (&_avg1 == &_avg2); \
> > _avg1 + (long)(_avg2 - _avg1)/2; })
> >
> > works for the above example, but when I make it long long, so as to
> > match the longest supported type, it goes boom again - for as of yet
> > unknown reasons.
>
> I think you'd want to cast it with a (signed) instead? as in:
>
> #include <stdio.h>
>
> #define avg(x, y) ({ \
> typeof(x) _x = (x); \
> typeof(y) _y = (y); \
> (void) (&_x == &_y); \
> _x + (signed)(_y - _x)/2; })

signed is short for signed int, which is too short for say long or long
long input.

Anyway, see my previous mail in which I explained that I got the cast
order wrong.

2008-08-21 02:26:45

by Yanmin Zhang

Subject: Re: VolanoMark regression with 2.6.27-rc1


On Wed, 2008-08-20 at 15:47 +0200, Ingo Molnar wrote:
> * Peter Zijlstra <[email protected]> wrote:
>
> > Just realized my brainfart..
> >
> > ---
> > Subject: sched: load-balance bias fixes
> > From: Peter Zijlstra <[email protected]>
> > Date: Wed Aug 20 15:28:51 CEST 2008
> >
> > Yanmin spotted a regression with my patch that introduces LB_BIAS:
>
> ok, i've applied this one to tip/sched/urgent instead of the
> feature-disabling patchlet. Yanmin, could you please check whether this
> one does the trick?
This new patch almost doesn't help volanoMark. Pls. use the patch
which sets LB_BIAS=1 by default.

-yanmin

2008-08-21 06:11:37

by Nick Piggin

Subject: Re: VolanoMark regression with 2.6.27-rc1

On Thursday 21 August 2008 06:56, Peter Zijlstra wrote:
> On Wed, 2008-08-20 at 22:30 +0200, Peter Zijlstra wrote:

> > works for the above example, but when I make it long long, so as to
> > match the longest supported type, it goes boom again - for as of yet
> > unknown reasons.
>
> Ok, people pointed out I got my promotion rules mixed up, I casted the
> result of the division to signed, instead of ending up with a signed
> division.
>
> #define avg(x, y) ({ \
> typeof(x) _avg1 = (x); \
> typeof(y) _avg2 = (y); \
> (void) (&_avg1 == &_avg2); \
> (typeof(x))(_avg1 + ((long long)_avg2 - _avg1)/2); })
>
> seems to work.

Right, I guess that will work, but unfortunately the code gen on 32-bit
is a monstrosity. If you're going to cast to 64-bit anyway, we might as
well then just do the normal add rather than playing the game to avoid
overflow.

Secondly, this is operating on the fixed point scaled load numbers, so in
the case of the scheduler I wouldn't worry too much about rounding... also
in most integer operations, rounding down is less surprising than rounding
up like the last code did.

I still don't know whether it is appropriate to put it into kernel.h
(because of rounding, and variability when it comes to what type size will
hold the sum of parameters), but for the scheduler, I would use this:

((unsigned long long)a + b) / 2;

Which gives this on 32-bit:
div:
movl %edx, %ecx
xorl %edx, %edx
pushl %ebx
xorl %ebx, %ebx
addl %ecx, %eax
adcl %ebx, %edx
popl %ebx
shrdl $1, %edx, %eax
shrl %edx
ret

Rather than this:
div:
subl $8, %esp
xorl %ecx, %ecx
movl %ebx, (%esp)
movl %edx, %ebx
movl %esi, 4(%esp)
xorl %esi, %esi
subl %eax, %ebx
sbbl %ecx, %esi
movl %esi, %ecx
movl %esi, %edx
sarl $31, %ecx
movl %ecx, %edx
xorl %ecx, %ecx
shrl $31, %edx
addl %ebx, %edx
movl (%esp), %ebx
adcl %esi, %ecx
movl 4(%esp), %esi
addl $8, %esp
shrdl $1, %ecx, %edx
addl %edx, %eax
sarl %ecx
ret

And it's also slightly better on 64-bit.
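
The widening add suggested above can be sketched as a standalone helper. This assumes 32-bit operands (uint32_t) so that a 64-bit intermediate is always wide enough to hold the sum; avg_u32 is a hypothetical name. With unsigned long operands on a 64-bit build the cast to unsigned long long would not widen anything, which is the type-size caveat raised above.

```c
#include <stdint.h>

/* Add in 64 bits so the sum of two 32-bit values cannot overflow,
 * then halve; the result rounds down and always fits in 32 bits. */
static uint32_t avg_u32(uint32_t a, uint32_t b)
{
	return (uint32_t)(((uint64_t)a + b) / 2);
}
```

Unlike the difference-based form, this is symmetric in its arguments and always rounds down: avg_u32(7, 4) and avg_u32(4, 7) both give 5.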

2008-08-21 06:13:35

by Ingo Molnar

Subject: Re: VolanoMark regression with 2.6.27-rc1


* Peter Zijlstra <[email protected]> wrote:

> Ok, so one last time (I hope!)..
>
> Everybody happy with this?

looks good to me. What is missing is Yanmin's confirmation that the
patch indeed solves/improves the regression :-)

Ingo

2008-08-21 06:16:51

by Ingo Molnar

Subject: Re: VolanoMark regression with 2.6.27-rc1


* Zhang, Yanmin <[email protected]> wrote:

> > ok, i've applied this one to tip/sched/urgent instead of the
> > feature-disabling patchlet. Yanmin, could you please check whether this
> > one does the trick?
>
> This new patch almost doesn't help volanoMark. Pls. use the patch
> which sets LB_BIAS=1 by default.

ok. That also removes the kernel.h complications ;-)

Ingo

2008-08-21 06:20:41

by Ingo Molnar

Subject: Re: VolanoMark regression with 2.6.27-rc1


* Peter Zijlstra <[email protected]> wrote:

> Ok, people pointed out I got my promotion rules mixed up, I casted the
> result of the division to signed, instead of ending up with a signed
> division.
>
> #define avg(x, y) ({ \
> typeof(x) _avg1 = (x); \
> typeof(y) _avg2 = (y); \
> (void) (&_avg1 == &_avg2); \
> (typeof(x))(_avg1 + ((long long)_avg2 - _avg1)/2); })

ok, could you please just send a patch that is local to sched.c and then
we can let this kernel.h change play out independently? There's too many
iterations of this and it's better to decouple the two.

Ingo

2008-08-21 06:50:21

by Yanmin Zhang

Subject: Re: VolanoMark regression with 2.6.27-rc1


On Thu, 2008-08-21 at 08:16 +0200, Ingo Molnar wrote:
> * Zhang, Yanmin <[email protected]> wrote:
>
> > > ok, i've applied this one to tip/sched/urgent instead of the
> > > feature-disabling patchlet. Yanmin, could you please check whether this
> > > one does the trick?
> >
> > This new patch almost doesn't help volanoMark. Pls. use the patch
> > which sets LB_BIAS=1 by default.
>
> ok. That also removes the kernel.h complications ;-)
Sorry, I have new update.
Originally, I worked on 2.6.27-rc1. I just move to 2.6.27-rc3 and found
something different when CONFIG_GROUP_SCHED=n.

With 2.6.27-rc3, on my 8-core stoakley, all volanoMark regression disappears,
no matter if I enable LB_BIAS. On 16-core tigerton, the regression is still
there if I don't enable LB_BIAS and regression becomes 11% from 65%.

8-core stoakley:
2.6.26_nogroup 340669
2.6.27-rc1_nogroup 267237
2.6.27-rc1_nogroup+LB_BIAS=1 330693
2.6.27-rc3_nogroup 352193
2.6.27-rc3_nogroup+LB_BIAS=1 355872

16-core tigerton:
2.6.26_nogroup 539644
2.6.27-rc1_nogroup 309334
2.6.27-rc1_nogroup+LB_BIAS=1 478360
2.6.27-rc3_nogroup 348426
2.6.27-rc3_nogroup+LB_BIAS=1 483150

-yanmin

2008-08-21 08:18:19

by Peter Zijlstra

Subject: Re: VolanoMark regression with 2.6.27-rc1

On Thu, 2008-08-21 at 16:11 +1000, Nick Piggin wrote:
> On Thursday 21 August 2008 06:56, Peter Zijlstra wrote:
> > On Wed, 2008-08-20 at 22:30 +0200, Peter Zijlstra wrote:
>
> > > works for the above example, but when I make it long long, so as to
> > > match the longest supported type, it goes boom again - for as of yet
> > > unknown reasons.
> >
> > Ok, people pointed out I got my promotion rules mixed up, I casted the
> > result of the division to signed, instead of ending up with a signed
> > division.
> >
> > #define avg(x, y) ({ \
> > typeof(x) _avg1 = (x); \
> > typeof(y) _avg2 = (y); \
> > (void) (&_avg1 == &_avg2); \
> > (typeof(x))(_avg1 + ((long long)_avg2 - _avg1)/2); })
> >
> > seems to work.
>
> Right, I guess that will work, but unfortunately the code gen on 32-bit
> is a monstrosity. If you're going to cast to 64-bit anyway, we might as
> well then just do the normal add rather than playing the game to avoid
> overflow.
>
> Secondly, this is operating on the fixed point scaled load numbers, so in
> the case of the scheduler I wouldn't worry too much about rounding... also
> in most integer operations, rounding down is less surprising than rounding
> up like the last code did.
>
> I still don't know whether it is appropriate to put it into kernel.h
> (because of rounding, and variability when it comes to what type size will
> hold the sum of parameters), but for the scheduler, I would use this:
>
> ((unsigned long long)a + b) / 2;

Right - anyway the point is moot - as Yanmin says it still sucks rocks.

But since I couldn't let it rest :-)

---
Index: linux-2.6/include/linux/kernel.h
===================================================================
--- linux-2.6.orig/include/linux/kernel.h
+++ linux-2.6/include/linux/kernel.h
@@ -367,6 +367,45 @@ static inline char *pack_hex_byte(char *
(void) (&_max1 == &_max2); \
_max1 > _max2 ? _max1 : _max2; })

+#define __avg_t(type, x, y) ({ \
+ typeof(x) __avg1 = (x); \
+ typeof(y) __avg2 = (y); \
+ __avg1 + ((type)(__avg2 - __avg1))/2; })
+
+extern void avg_unknown_size(void);
+
+#define __avg(x, y) ({ \
+ typeof(x) ret; \
+ switch (sizeof(ret)) { \
+ case 1: \
+ ret = __avg_t(s8, x, y); \
+ break; \
+ case 2: \
+ ret = __avg_t(s16, x, y); \
+ break; \
+ case 4: \
+ ret = __avg_t(s32, x, y); \
+ break; \
+ case 8: \
+ ret = __avg_t(s64, x, y); \
+ break; \
+ default: \
+ avg_unknown_size(); \
+ break; \
+ } \
+ ret; })
+
+#define avg(x, y) ({ \
+ typeof(x) _avg1 = (x); \
+ typeof(y) _avg2 = (y); \
+ (void) (&_avg1 == &_avg2); \
+ __avg(_avg1, _avg2); })
+
+#define avg_t(type, x, y) ({ \
+ type _avg1 = (x); \
+ type _avg2 = (y); \
+ __avg(_avg1, _avg2); })
+
/**
* clamp - return a value clamped to a given range with strict typechecking
* @val: current value

2008-08-29 03:34:51

by Yanmin Zhang

Subject: Re: VolanoMark regression with 2.6.27-rc1


On Thu, 2008-08-21 at 14:48 +0800, Zhang, Yanmin wrote:
> On Thu, 2008-08-21 at 08:16 +0200, Ingo Molnar wrote:
> > * Zhang, Yanmin <[email protected]> wrote:
> >
> > > > ok, i've applied this one to tip/sched/urgent instead of the
> > > > feature-disabling patchlet. Yanmin, could you please check whether this
> > > > one does the trick?
> > >
> > > This new patch almost doesn't help volanoMark. Pls. use the patch
> > > which sets LB_BIAS=1 by default.
> >
> > ok. That also removes the kernel.h complications ;-)
> Sorry, I have new update.
> Originally, I worked on 2.6.27-rc1. I just move to 2.6.27-rc3 and found
> something different when CONFIG_GROUP_SCHED=n.
>
> With 2.6.27-rc3, on my 8-core stoakley, all volanoMark regression disappears,
> no matter if I enable LB_BIAS. On 16-core tigerton, the regression is still
> there if I don't enable LB_BIAS and regression becomes 11% from 65%.
I have new updates on this regression. I checked volanoMark web page and
found the client command line has option rooms and users. rooms means how many
chat room will be started. users means how many users are in 1 room. The default
rooms is 10 and users is 20, so every room has about 800 threads. As all threads of a
room just communicate within this room, so the rooms number is important.

All my previous volanoMark testing uses default rooms 10 and users 20. With wake_affine
in kernel, waker/sleeper will be moved to the same cpu gradually. However, if the
rooms is not multiple of cpu number, due to load balance, kernel will move threads from
one cpu to another cpu continually. If there are too many threads to weaken the cache-hot
effect, load balance is more important. But if there are not too many threads running,
cache-hot is more important than load balance. Should we prefer to wake_affine more?

Below is some data I collected with numerous testing on 3 machines.


On 2-quadcore processor stoakley (8-core):
kernel\rooms | 8 | 10 | 16 | 32
-------------------------------------------------------------------------------------------
2.6.26_nogroup | 385617 | 351247 | 323324 | 231934
-------------------------------------------------------------------------------------------
2.6.27-rc4_nogroup | 359124 | 336984 | 335180 | 235258
-------------------------------------------------------------------------------------------
2.6.26group | 381425 | 343636 | 312280 | 179673
-------------------------------------------------------------------------------------------
2.6.27-rc4group | 212112 | 270000 | 300188 | 228465
-------------------------------------------------------------------------------------------


On 2-quadcore+HT processor new x86_64 (8-core+HT, total 16 threads):
kernel\rooms | 10 | 16 | 24 | 32 | 64
-------------------------------------------------------------------------
2.6.26_nogroup | 667668 | 671860 | 671662 | 621900 | 509482
-------------------------------------------------------------------------
2.6.27-rc4_nogroup | 732346 | 800290 | 709272 | 648561 | 497243
-------------------------------------------------------------------------
2.6.26group | 705579 | 759464 | 693697 | 636019 | 500744
-------------------------------------------------------------------------
2.6.27-rc4group | 572426 | 674977 | 627410 | 590984 | 445651
-------------------------------------------------------------------------


On 4-quadcore tigerton processor(16-core)(32 rooms testing isn't stable on the machine, so no 32):
kernel\rooms | 8 | 10 | 16
------------------------------------------------------------------
2.6.26_nogroup | 346410 | 382938 | 349405
------------------------------------------------------------------
2.6.27-rc4_nogroup | 359124 | 336984 | 335180
------------------------------------------------------------------
2.6.26group | 504802 | 376513 | 319020
------------------------------------------------------------------
2.6.27-rc4group | 247652 | 284784 | 355132
------------------------------------------------------------------

I also tried different users with rooms 8 and found the results of users 20/40/60 are very close.

With group scheduling, mostly, 2.6.26 is better than 2.6.27-rc4.
Without group scheduling, the result depends on the specific machine.

I also reran hackbench with group 10/16/32, and found the result difference between the 2 kernels
varies among group 10/16/32.

What's the most reasonable group/rooms we should use to test?

On the other hand, tbench (started with CPU_NUM*2 clients) has about a 4~5% regression with 2.6.27-rc kernels.
With 30second schedstat data during the testing, I found there is almost no wake remote and wake
affine with 2.6.26, but there are many either wake_affine or wake remote with 2.6.27-rc.

-yanmin

2008-08-29 03:38:05

by Yanmin Zhang

Subject: Re: VolanoMark regression with 2.6.27-rc1


On Fri, 2008-08-29 at 11:35 +0800, Zhang, Yanmin wrote:
> On Thu, 2008-08-21 at 14:48 +0800, Zhang, Yanmin wrote:
> > On Thu, 2008-08-21 at 08:16 +0200, Ingo Molnar wrote:
> > > * Zhang, Yanmin <[email protected]> wrote:
> > >
> > > > > ok, i've applied this one to tip/sched/urgent instead of the
> > > > > feature-disabling patchlet. Yanmin, could you please check whether this
> > > > > one does the trick?
> > > >
> > > > This new patch almost doesn't help volanoMark. Pls. use the patch
> > > > which sets LB_BIAS=1 by default.
> > >
> > > ok. That also removes the kernel.h complications ;-)
> > Sorry, I have new update.
> > Originally, I worked on 2.6.27-rc1. I just move to 2.6.27-rc3 and found
> > something different when CONFIG_GROUP_SCHED=n.
> >
> > With 2.6.27-rc3, on my 8-core stoakley, all volanoMark regression disappears,
> > no matter if I enable LB_BIAS. On 16-core tigerton, the regression is still
> > there if I don't enable LB_BIAS and regression becomes 11% from 65%.
> I have new updates on this regression. I checked volanoMark web page and
> found the client command line has option rooms and users. rooms means how many
> chat room will be started. users means how many users are in 1 room. The default
> rooms is 10 and users is 20, so every room has about 800 threads.
Sorry. every room has 80 threads.

> As all threads of a
> room just communicate within this room, so the rooms number is important.
>
> All my previous volanoMark testing uses default rooms 10 and users 20. With wake_affine
> in kernel, waker/sleeper will be moved to the same cpu gradually. However, if the
> rooms is not multiple of cpu number, due to load balance, kernel will move threads from
> one cpu to another cpu continually. If there are too many threads to weaken the cache-hot
> effect, load balance is more important. But if there are not too many threads running,
> cache-hot is more important than load balance. Should we prefer to wake_affine more?
>
> Below is some data I collected from numerous test runs on 3 machines.
>
>
> On 2-quadcore processor stoakley (8-core):
> kernel\rooms       |      8 |     10 |     16 |     32
> -------------------+--------+--------+--------+--------
> 2.6.26_nogroup     | 385617 | 351247 | 323324 | 231934
> 2.6.27-rc4_nogroup | 359124 | 336984 | 335180 | 235258
> 2.6.26group        | 381425 | 343636 | 312280 | 179673
> 2.6.27-rc4group    | 212112 | 270000 | 300188 | 228465
>
> 
> On 2-quadcore+HT processor new x86_64 (8-core+HT, total 16 threads):
> kernel\rooms       |     10 |     16 |     24 |     32 |     64
> -------------------+--------+--------+--------+--------+--------
> 2.6.26_nogroup     | 667668 | 671860 | 671662 | 621900 | 509482
> 2.6.27-rc4_nogroup | 732346 | 800290 | 709272 | 648561 | 497243
> 2.6.26group        | 705579 | 759464 | 693697 | 636019 | 500744
> 2.6.27-rc4group    | 572426 | 674977 | 627410 | 590984 | 445651
>
> 
> On 4-quadcore tigerton processor (16-core) (32 rooms testing isn't stable on the machine, so no 32):
> kernel\rooms       |      8 |     10 |     16
> -------------------+--------+--------+--------
> 2.6.26_nogroup     | 346410 | 382938 | 349405
> 2.6.27-rc4_nogroup | 359124 | 336984 | 335180
> 2.6.26group        | 504802 | 376513 | 319020
> 2.6.27-rc4group    | 247652 | 284784 | 355132
>
> I also tried different users values with rooms 8 and found the results for users 20/40/60 are very close.
>
> With group scheduling, 2.6.26 is mostly better than 2.6.27-rc4.
> Without group scheduling, the result depends on the specific machine.
>
> I also reran hackbench with group 10/16/32, and found the result difference between the 2 kernels
> varies among group 10/16/32.
>
> What's the most reasonable group/rooms we should use to test?
>
> On the other hand, tbench (started with CPU_NUM*2 clients) has about a 4~5% regression with 2.6.27-rc kernels.
> From 30 seconds of schedstat data collected during the test, I found there is almost no wake remote or wake
> affine with 2.6.26, but there are many wake_affine or wake remote events with 2.6.27-rc.

2008-08-06 03:26:37

by Yanmin Zhang

Subject: Re: VolanoMark regression with 2.6.27-rc1


On Mon, 2008-08-04 at 09:12 +0200, Peter Zijlstra wrote:
> On Mon, 2008-08-04 at 12:35 +0530, Dhaval Giani wrote:
> > On Mon, Aug 04, 2008 at 08:26:11AM +0200, Peter Zijlstra wrote:
> > > On Mon, 2008-08-04 at 11:23 +0530, Dhaval Giani wrote:
> > >
> > > > Peter, vatsa, any ideas?
> > >
> > > ---
> > >
> > > Revert:
> > > a7be37ac8e1565e00880531f4e2aff421a21c803 sched: revert the revert of: weight calculations
> > > c9c294a630e28eec5f2865f028ecfc58d45c0a5a sched: fix calc_delta_asym()
> > > ced8aa16e1db55c33c507174c1b1f9e107445865 sched: fix calc_delta_asym, #2
> > >
> >
> > Did we not fix those? :)
>
> Works for me,.. just guessing here.
I did more investigation on 16-core tigerton.

Firstly, let's focus on CONFIG_GROUP_SCHED=n. With 2.6.26, the result differs little
between kernels with and without CONFIG_GROUP_SCHED.

1) I tried different sched_features and found AFFINE_WAKEUPS has a big impact on volanoMark. Other
features have little impact.

2) With kernel 2.6.26, if AFFINE_WAKEUPS is disabled, the result is 260000; if AFFINE_WAKEUPS is enabled,
the result is 515000, so the improvement caused by AFFINE_WAKEUPS is about 100%. With kernel 2.6.27-rc1,
the improvement is only about 25%.

3) I turned on CONFIG_SCHEDSTATS in the kernel and collected ttwu_move_affine. Basically, I collect ttwu_move_affine,
then collect it again after 30 seconds and calculate the difference. With 2.6.26, I got the data below:
domain0 279521 142332 0
domain1 184589 22823 0
domain0 289170 142168 0
domain1 185491 23778 0
domain0 291842 139687 0
domain1 187807 23174 0
domain0 292426 144879 0
domain1 179721 22122 0
domain0 287669 137756 0
domain1 201236 25156 0
domain0 268374 139532 0
domain1 210145 25268 0
domain0 292002 144530 0
domain1 196146 24669 0
domain0 298406 145023 0
domain1 178381 22743 0
domain0 275685 141086 0
domain1 203797 25686 0
domain0 285818 140260 0
domain1 180506 23002 0
domain0 290562 139757 0
domain1 186669 23086 0
domain0 296466 142084 0
domain1 186346 24161 0
domain0 283394 137930 0
domain1 195596 23895 0
domain0 269296 142978 0
domain1 210648 25682 0
domain0 281672 144002 0
domain1 189959 23685 0
domain0 301834 145922 0
domain1 172737 22351 0


The 3rd column is the ttwu_move_affine difference.
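
A minimal sketch of the sampling step described in 3), in Python. It assumes the /proc/schedstat layout of these kernels, where each cpu line is followed by its domain lines and the last three counters on a domain line are the try_to_wake_up ones (wake remote, move affine, move balance). The function names are hypothetical and are not from any existing tool.

```python
def domain_counters(schedstat_text):
    """Map (cpu, domain) to the last three counters of each domain line.

    Assumption: those three fields are the try_to_wake_up counters
    (wake remote, move affine, move balance), in that order.
    """
    counters = {}
    cpu = None
    for line in schedstat_text.splitlines():
        fields = line.split()
        if not fields:
            continue
        if fields[0].startswith("cpu"):
            cpu = fields[0]  # the domain lines that follow belong to this cpu
        elif fields[0].startswith("domain"):
            counters[(cpu, fields[0])] = tuple(int(f) for f in fields[-3:])
    return counters


def counter_deltas(before_text, after_text):
    """Per-domain difference between two snapshots of the schedstat text."""
    before = domain_counters(before_text)
    after = domain_counters(after_text)
    return {
        key: tuple(a - b for a, b in zip(after[key], before[key]))
        for key in after
        if key in before
    }
```

To reproduce the listing, read /proc/schedstat once, sleep 30 seconds, read it again, and pass both texts to counter_deltas(); each output row then corresponds to one domain line of one cpu.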

With 2.6.27-rc1:
domain0 39054 302678 0
domain1 315384 245684 0
domain0 39142 304117 0
domain1 312896 244796 0
domain0 38636 304438 0
domain1 310687 244409 0
domain0 39534 304167 0
domain1 313746 245381 0
domain0 39082 304231 0
domain1 312592 245219 0
domain0 39057 305460 0
domain1 311395 245195 0
domain0 38224 301351 0
domain1 314482 244448 0
domain0 38016 300573 0
domain1 309031 241127 0
domain0 40285 306397 0
domain1 318707 243595 0
domain0 39685 305034 0
domain1 315380 241506 0
domain0 39828 306178 0
domain1 314515 243039 0
domain0 39870 303382 0
domain1 315457 244483 0
domain0 38892 304697 0
domain1 313808 241948 0
domain0 39255 303937 0
domain1 314531 244301 0
domain0 38850 300187 0
domain1 310727 240255 0
domain0 38847 302327 0
domain1 312538 241857 0


So with kernel 2.6.27-rc1, the successful wakeup_affine count is about double that of 2.6.26
on domain 0, but about 10 times on domain 1. That means more tasks are woken up on waker cpus.
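
The ratios can be checked directly from the two listings. The sketch below averages the 3rd-column values per domain for each kernel and divides them; the values are copied from the data above, and the variable and function names are only illustrative.

```python
from statistics import mean

# ttwu_move_affine deltas (3rd column) copied from the 2.6.26 listing.
affine_2_6_26 = {
    "domain0": [142332, 142168, 139687, 144879, 137756, 139532, 144530, 145023,
                141086, 140260, 139757, 142084, 137930, 142978, 144002, 145922],
    "domain1": [22823, 23778, 23174, 22122, 25156, 25268, 24669, 22743,
                25686, 23002, 23086, 24161, 23895, 25682, 23685, 22351],
}

# ttwu_move_affine deltas (3rd column) copied from the 2.6.27-rc1 listing.
affine_2_6_27_rc1 = {
    "domain0": [302678, 304117, 304438, 304167, 304231, 305460, 301351, 300573,
                306397, 305034, 306178, 303382, 304697, 303937, 300187, 302327],
    "domain1": [245684, 244796, 244409, 245381, 245219, 245195, 244448, 241127,
                243595, 241506, 243039, 244483, 241948, 244301, 240255, 241857],
}


def affine_ratio(domain):
    """Mean 2.6.27-rc1 delta divided by mean 2.6.26 delta for one domain."""
    return mean(affine_2_6_27_rc1[domain]) / mean(affine_2_6_26[domain])


for domain in ("domain0", "domain1"):
    print(f"{domain}: {affine_ratio(domain):.1f}x")
```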

Does that mean it doesn't follow cache-hot checking?

I will collect more data.

-yanmin

2008-08-13 08:52:49

by Yanmin Zhang

Subject: Re: VolanoMark regression with 2.6.27-rc1


On Fri, 2008-08-08 at 09:30 +0200, Peter Zijlstra wrote:
> On Tue, 2030-08-06 at 11:26 +0800, Zhang, Yanmin wrote:
> > On Mon, 2008-08-04 at 09:12 +0200, Peter Zijlstra wrote:
> > > On Mon, 2008-08-04 at 12:35 +0530, Dhaval Giani wrote:
> > > > On Mon, Aug 04, 2008 at 08:26:11AM +0200, Peter Zijlstra wrote:
> > > > > On Mon, 2008-08-04 at 11:23 +0530, Dhaval Giani wrote:
> > > > >
> > > > > > Peter, vatsa, any ideas?
> > > > >
> > > > > ---
> > > > >
> > > > > Revert:
> > > > > a7be37ac8e1565e00880531f4e2aff421a21c803 sched: revert the revert of: weight calculations
> > > > > c9c294a630e28eec5f2865f028ecfc58d45c0a5a sched: fix calc_delta_asym()
> > > > > ced8aa16e1db55c33c507174c1b1f9e107445865 sched: fix calc_delta_asym, #2
> > > > >
> > > >
> > > > Did we not fix those? :)
> > >
> > > Works for me,.. just guessing here.
> > I did more investigation on 16-core tigerton.
> >
> > Firstly, let's focus on CONFIG_GROUP_SCHED=n. With 2.6.26, the result
> > has little difference
> > between with and without CONFIG_GROUP_SCHED.
> >
> > 1) I tried different sched_features and found AFFINE_WAKEUPS has big
> > impact on volanoMark. Other
> > features have little impact.
> >
> > 2) With kernel 2.6.26, if disabling AFFINE_WAKEUPS, the result is
> > 260000; if enabling AFFINE_WAKEUPS,
> > the result is 515000, so the improvement caused by AFFINE_WAKEUPS is
> > about 100%. With kernel 2.6.27-rc1,
> > the improvement is only about 25%.
> >
> > 3) I turned on CONFIG_SCHEDSTATS in kernel and collect
> > ttwu_move_affine. Mostly, collect ttwu_move_affine,
> > then recollect it after 30 seconds and calculate the difference. With
> > 2.6.26, I got below data:
>
> <snip data>
>
> > So with kernel 2.6.27-rc1, the successful wakeup_affine is about
> > double of the one of 2.6.26
> > on domain 0, but about 10 times on domain 1. That means more tasks are
> > woken up on waker cpus.
> >
> > Does that mean it doesn't follow cache-hot checking?
>
> I'm a bit puzzled, but you're right - I too noticed that volanomark is
> _very_ sensitive to affine wakeups.
>
> I'll try and find what changed in that code for GROUP=n.
I collected more data and found that the CPU_NEWLY_IDLE balance schedstat looks abnormal.
Compared with 2.6.26, 2.6.27-rc1 has more successful move_tasks among cpu runqueues. I
instrumented the kernel and found that, with 2.6.26, a task is mostly hot when the kernel tries to
move it to another cpu. But with 2.6.27-rc1, the task is often moved successfully.
If I set /proc/sys/kernel/sched_migration_cost=1500000 (default is 500000), the volanoMark
result improves significantly, close to the result of 2.6.26. The above testing set
CONFIG_GROUP_SCHED=n. So perhaps some key data structures changed in 2.6.27-rc1
and create more cache misses. With 2.6.26, cpu idle is about 6~7%. With 2.6.27-rc1, cpu idle
is about 1%. I compared the 2 kernels and couldn't find which data structure change causes it.
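
To illustrate why a larger sched_migration_cost keeps more tasks in place, here is a simplified Python model of the cache-hot test. It assumes the tunable is in nanoseconds and that a task counts as hot when it last started running less than that long ago; the real kernel check has more conditions, and the function names here are hypothetical.

```python
DEFAULT_MIGRATION_COST_NS = 500_000    # default value quoted above
TUNED_MIGRATION_COST_NS = 1_500_000    # value tried above


def is_cache_hot(now_ns, exec_start_ns, migration_cost_ns):
    """Simplified model: hot if the task last started running recently enough."""
    return (now_ns - exec_start_ns) < migration_cost_ns


def movable_tasks(now_ns, exec_starts_ns, migration_cost_ns):
    """Return the exec_start values of tasks the balancer would treat as cold."""
    return [
        start for start in exec_starts_ns
        if not is_cache_hot(now_ns, start, migration_cost_ns)
    ]
```

With the default threshold, a task that last ran 1 ms ago is already cold and may be pulled to an idle cpu; with the threshold at 1500000 it is still treated as hot and stays where it is, which matches the direction of the improvement seen after raising the tunable.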

As for CONFIG_GROUP_SCHED=y, oprofile shows tg_shares_up consumes about 8% cpu utilization
on my 16-core tigerton. Enlarging /proc/sys/kernel/sched_shares_ratelimit doesn't help the
volanoMark result. I checked the group scheduling code and got an idea to improve it: add
share_percent, a new var in task_group->sched_entity[i], to record the percent this task group
occupies in the parent group. share_percent is updated in walk_tg_tree. In account_entity_enqueue,
if the task entity has a parent, we could just use share_percent and se->load.weight to calculate
a new weight and add the new weight to the parent entity weight, and in the end to the runqueue load weight.
So when sched_shares_ratelimit is enlarged, the various load balances could still work well. I think
volanoMark could benefit from it.
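
One possible reading of the share_percent idea, as a small Python model rather than kernel code. The Entity class, the account_enqueue name and the exact scaling rule (weight multiplied by each enclosing group's share_percent, then added to that group's weight and finally to the runqueue load) are assumptions made for illustration only.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Entity:
    weight: float = 0.0            # stands in for se->load.weight
    share_percent: float = 100.0   # proposed field: share of the parent group, in percent
    parent: Optional["Entity"] = None


def account_enqueue(task, rq_load):
    """Scale the task weight by each enclosing group's share_percent.

    The scaled weight is added to every group entity on the way up, and the
    final value is added to the runqueue load, which is returned.
    """
    scaled = task.weight
    group = task.parent
    while group is not None:
        scaled = scaled * group.share_percent / 100.0
        group.weight += scaled
        group = group.parent
    return rq_load + scaled
```

For a task of weight 1024 inside a group holding 25% of a parent group that itself holds 50% of the root, this model adds 256 to the inner group, 128 to the outer group and 128 to the runqueue load, without recomputing shares on every enqueue.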

BTW, with CONFIG_GROUP_SCHED=y, hackbench has about an 80% regression on my 8core+multi_thread
Montvale Itanium machine and Tulsa machines. It seems multi-thread machines have the regression.

-yanmin