2016-04-08 05:20:59

by Mike Galbraith

Subject: [regression] cross core scheduling frequency drop bisected to 0c313cb20732

Greetings,

While measuring current NO_HZ cost to light tasks jabbering cross core
at high frequency (~7% max), I noticed that master lost an improvement
for same acquired in 4.5, so bisected it.

4.5.0
homer:~ # taskset 0xc pipe-test 1
2.367681 usecs/loop -- avg 2.367681 844.7 KHz
2.372502 usecs/loop -- avg 2.368163 844.5 KHz
2.342506 usecs/loop -- avg 2.365597 845.5 KHz
2.383029 usecs/loop -- avg 2.367341 844.8 KHz
2.321859 usecs/loop -- avg 2.362792 846.5 KHz 1.00

master
homer:~ # taskset 0xc pipe-test 1
2.797656 usecs/loop -- avg 2.797656 714.9 KHz
2.804518 usecs/loop -- avg 2.798342 714.7 KHz
2.804206 usecs/loop -- avg 2.798929 714.6 KHz
2.802887 usecs/loop -- avg 2.799324 714.5 KHz
2.801577 usecs/loop -- avg 2.799550 714.4 KHz 0.84

master 0c313cb20732 reverted
homer:~ # !taskset
homer:~ # taskset 0xc pipe-test 1
2.277494 usecs/loop -- avg 2.277494 878.2 KHz
2.320979 usecs/loop -- avg 2.281843 876.5 KHz
2.272750 usecs/loop -- avg 2.280933 876.8 KHz
2.272209 usecs/loop -- avg 2.280061 877.2 KHz
2.277279 usecs/loop -- avg 2.279783 877.3 KHz 1.03

0c313cb207326f759a58f486214288411b25d4cf is the first bad commit
commit 0c313cb207326f759a58f486214288411b25d4cf
Author: Rafael J. Wysocki <[email protected]>
Date: Sun Mar 20 01:33:35 2016 +0100

cpuidle: menu: Fall back to polling if next timer event is near

Commit a9ceb78bc75c (cpuidle,menu: use interactivity_req to disable
polling) changed the behavior of the fallback state selection part
of menu_select() so it looks at interactivity_req instead of
data->next_timer_us when it makes its decision. That effectively
caused polling to be used more often as fallback idle which led to
significant increases of energy consumption in some cases.

Commit e132b9b3bc7f (cpuidle: menu: use high confidence factors
only when considering polling) changed that logic again to be more
predictable, but that didn't help with the increased energy
consumption problem.

For this reason, go back to making decisions on which state to fall
back to based on data->next_timer_us which is the time we know for
sure something will happen rather than a prediction (which may be
inaccurate and turns out to be so often enough to be problematic).
However, take the target residency of the first proper idle state
(C1) into account, so that state is not used as the fallback one
if its target residency is greater than data->next_timer_us.

Fixes: a9ceb78bc75c (cpuidle,menu: use interactivity_req to disable polling)
Signed-off-by: Rafael J. Wysocki <[email protected]>
Reported-and-tested-by: Doug Smythies <[email protected]>
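
A minimal sketch of the fallback choice that the commit message above describes, written in Python rather than kernel C for brevity. The function name, the state labels and the example residency value are assumptions made for illustration; the real logic lives in menu_select() and is not reproduced here.

```python
def pick_fallback_state(next_timer_us, c1_target_residency_us):
    """Hypothetical helper: choose the idle state to fall back to.

    Mirrors only the rule stated in the commit message: C1 is not used
    as the fallback if its target residency is greater than the time
    until the next timer event (data->next_timer_us in the kernel).
    """
    if c1_target_residency_us > next_timer_us:
        return "POLL"  # next timer is too near for C1 to pay off
    return "C1"        # otherwise fall back to the first proper idle state


if __name__ == "__main__":
    c1_residency_us = 2  # assumed value, not taken from any real driver table
    for next_timer_us in (1, 2, 50):
        state = pick_fallback_state(next_timer_us, c1_residency_us)
        print(f"next timer in {next_timer_us} us -> {state}")
```

The decision keys off the known timer deadline rather than a predicted idle length, which is the point the commit message makes about predictions being inaccurate often enough to be problematic.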


2016-04-08 06:45:15

by Peter Zijlstra

Subject: Re: [regression] cross core scheduling frequency drop bisected to 0c313cb20732

On Fri, Apr 08, 2016 at 07:20:54AM +0200, Mike Galbraith wrote:
> Greetings,
>
> While measuring current NO_HZ cost to light tasks jabbering cross core
> at high frequency (~7% max), I noticed that master lost an improvement
> for same acquired in 4.5, so bisected it.
>
> 4.5.0
> homer:~ # taskset 0xc pipe-test 1
> 2.367681 usecs/loop -- avg 2.367681 844.7 KHz
> 2.372502 usecs/loop -- avg 2.368163 844.5 KHz
> 2.342506 usecs/loop -- avg 2.365597 845.5 KHz
> 2.383029 usecs/loop -- avg 2.367341 844.8 KHz
> 2.321859 usecs/loop -- avg 2.362792 846.5 KHz 1.00
>
> master
> homer:~ # taskset 0xc pipe-test 1
> 2.797656 usecs/loop -- avg 2.797656 714.9 KHz
> 2.804518 usecs/loop -- avg 2.798342 714.7 KHz
> 2.804206 usecs/loop -- avg 2.798929 714.6 KHz
> 2.802887 usecs/loop -- avg 2.799324 714.5 KHz
> 2.801577 usecs/loop -- avg 2.799550 714.4 KHz 0.84
>
> master 0c313cb20732 reverted
> homer:~ # !taskset
> homer:~ # taskset 0xc pipe-test 1
> 2.277494 usecs/loop -- avg 2.277494 878.2 KHz
> 2.320979 usecs/loop -- avg 2.281843 876.5 KHz
> 2.272750 usecs/loop -- avg 2.280933 876.8 KHz
> 2.272209 usecs/loop -- avg 2.280061 877.2 KHz
> 2.277279 usecs/loop -- avg 2.279783 877.3 KHz 1.03
>
> 0c313cb207326f759a58f486214288411b25d4cf is the first bad commit
> commit 0c313cb207326f759a58f486214288411b25d4cf
> Author: Rafael J. Wysocki <[email protected]>
> Date: Sun Mar 20 01:33:35 2016 +0100
>
> cpuidle: menu: Fall back to polling if next timer event is near
>

Cute, I thought you used governor=performance for your runs?

2016-04-08 06:50:58

by Mike Galbraith

Subject: Re: [regression] cross core scheduling frequency drop bisected to 0c313cb20732

On Fri, 2016-04-08 at 08:45 +0200, Peter Zijlstra wrote:

> Cute, I thought you used governor=performance for your runs?

I do, and those numbers are with it thus set.

-Mike

2016-04-08 20:57:28

by Rafael J. Wysocki

Subject: Re: [regression] cross core scheduling frequency drop bisected to 0c313cb20732

On Friday, April 08, 2016 08:50:54 AM Mike Galbraith wrote:
> On Fri, 2016-04-08 at 08:45 +0200, Peter Zijlstra wrote:
>
> > Cute, I thought you used governor=performance for your runs?
>
> I do, and those numbers are with it thus set.

Well, this is a trade-off.

4.5 introduced a power regression here so this one goes back to the previous
state of things.

Thanks,
Rafael

2016-04-08 22:19:20

by Doug Smythies

Subject: RE: [regression] cross core scheduling frequency drop bisected to 0c313cb20732

On 2016.04.08 14:00 Rafael J. Wysocki wrote:
> On Friday, April 08, 2016 08:50:54 AM Mike Galbraith wrote:
>> On Fri, 2016-04-08 at 08:45 +0200, Peter Zijlstra wrote:
>>
>>> Cute, I thought you used governor=performance for your runs?
>>
>> I do, and those numbers are with it thus set.

> Well, this is a trade-off.
>
> 4.5 introduced a power regression here so this one goes back to the previous
> state of things.

Mike:

Could you send me, or point me to, the program "pipe-test"?
So far, I have only found one, but it is both old and not
the same program you are running (based on print statements).

I realize I might not be able to recreate your problem scenario anyhow,
I just want to try.

... Doug


2016-04-09 06:41:04

by Mike Galbraith

Subject: Re: [regression] cross core scheduling frequency drop bisected to 0c313cb20732

On Fri, 2016-04-08 at 22:59 +0200, Rafael J. Wysocki wrote:
> On Friday, April 08, 2016 08:50:54 AM Mike Galbraith wrote:
> > On Fri, 2016-04-08 at 08:45 +0200, Peter Zijlstra wrote:
> >
> > > Cute, I thought you used governor=performance for your runs?
> >
> > I do, and those numbers are with it thus set.
>
> Well, this is a trade-off.
>
> 4.5 introduced a power regression here so this one goes back to the previous
> state of things.

That sounds somewhat reasonable. Too bad I don't have a super duper
watt meter handy.. seeing that you really really are saving me money
would perhaps make me less fond of those prettier numbers.

-Mike

2016-04-09 06:44:41

by Mike Galbraith

Subject: Re: [regression] cross core scheduling frequency drop bisected to 0c313cb20732

On Fri, 2016-04-08 at 15:19 -0700, Doug Smythies wrote:

> Could you send me, or point me to, the program "pipe-test"?
> So far, I have only found one, but it is both old and not
> the same program you are running (based on print statements).

It's the same old pipe-test, just bent up a little to suit my usage.

-Mike

2016-04-09 07:17:57

by Doug Smythies

Subject: RE: [regression] cross core scheduling frequency drop bisected to 0c313cb20732

On 2016.04.08 15:19 Doug Smythies wrote:
> On 2016.04.08 14:00 Rafael J. Wysocki wrote:
>> On Friday, April 08, 2016 08:50:54 AM Mike Galbraith wrote:
>>> On Fri, 2016-04-08 at 08:45 +0200, Peter Zijlstra wrote:
>>>
>>>> Cute, I thought you used governor=performance for your runs?
>>>
>>> I do, and those numbers are with it thus set.

>> Well, this is a trade-off.
>>
>> 4.5 introduced a power regression here so this one goes back to the previous
>> state of things.

> Mike:
>
> Could you send me, or point me to, the program "pipe-test"?
> So far, I have only found one, but it is both old and not
> the same program you are running (based on print statements).
>
> I realize I might not be able to recreate your problem scenario anyhow,
> I just want to try.

I still didn't find the exact same program, but I think I found some
earlier version of the correct test.

I get (long term average):
Kernel 4.4.0-17: Powersave 3.93 usecs/loop ; Performance 3.93 usecs/loop 0.89
Kernel 4.5-rc7: Powersave 3.47 usecs/loop ; Performance 3.51 usecs/loop 1.00
Kernel 4.6-rc1: Powersave 3.84 usecs/loop ; Performance 3.88 usecs/loop 0.90

So, similar results (so far, I didn't try reverted yet).

... Doug


2016-04-09 07:27:33

by Mike Galbraith

Subject: Re: [regression] cross core scheduling frequency drop bisected to 0c313cb20732

On Sat, 2016-04-09 at 00:17 -0700, Doug Smythies wrote:

> I still didn't find the exact same program, but I think I found some
> earlier version of the correct test.
>
> I get (long term average):
> Kernel 4.4.0-17: Powersave 3.93 usecs/loop ; Performance 3.93 usecs/loop 0.89
> Kernel 4.5-rc7: Powersave 3.47 usecs/loop ; Performance 3.51 usecs/loop 1.00
> Kernel 4.6-rc1: Powersave 3.84 usecs/loop ; Performance 3.88 usecs/loop 0.90
>
> So, similar results (so far, I didn't try reverted yet).

I likely see a bit more go missing because I throttle no_hz when idle
is being hammered at high frequency.

-Mike

2016-04-09 11:07:40

by Peter Zijlstra

Subject: Re: [regression] cross core scheduling frequency drop bisected to 0c313cb20732

On Fri, Apr 08, 2016 at 10:59:59PM +0200, Rafael J. Wysocki wrote:
> On Friday, April 08, 2016 08:50:54 AM Mike Galbraith wrote:
> > On Fri, 2016-04-08 at 08:45 +0200, Peter Zijlstra wrote:
> >
> > > Cute, I thought you used governor=performance for your runs?
> >
> > I do, and those numbers are with it thus set.
>
> Well, this is a trade-off.
>
> 4.5 introduced a power regression here so this one goes back to the previous
> state of things.

Just for my elucidation; how can gov=performance have a 'power'
regression?

2016-04-09 11:08:11

by Peter Zijlstra

Subject: Re: [regression] cross core scheduling frequency drop bisected to 0c313cb20732

On Fri, Apr 08, 2016 at 03:19:14PM -0700, Doug Smythies wrote:
> Could you send me, or point me to, the program "pipe-test"?
> So far, I have only found one, but it is both old and not
> the same program you are running (based on print statements).

The latest public one lives as: perf bench sched pipe

2016-04-09 12:31:25

by Rafael J. Wysocki

Subject: Re: [regression] cross core scheduling frequency drop bisected to 0c313cb20732

On Sat, Apr 9, 2016 at 1:07 PM, Peter Zijlstra <[email protected]> wrote:
> On Fri, Apr 08, 2016 at 10:59:59PM +0200, Rafael J. Wysocki wrote:
>> On Friday, April 08, 2016 08:50:54 AM Mike Galbraith wrote:
>> > On Fri, 2016-04-08 at 08:45 +0200, Peter Zijlstra wrote:
>> >
>> > > Cute, I thought you used governor=performance for your runs?
>> >
>> > I do, and those numbers are with it thus set.
>>
>> Well, this is a trade-off.
>>
>> 4.5 introduced a power regression here so this one goes back to the previous
>> state of things.
>
> Just for my elucidation; how can gov=performance have a 'power'
> regression?

Because of what is used as the "default" idle state most of the time.

C1 was used before 4.5 and that changed to polling in 4.5.

2016-04-09 12:33:42

by Rafael J. Wysocki

Subject: Re: [regression] cross core scheduling frequency drop bisected to 0c313cb20732

On Sat, Apr 9, 2016 at 8:40 AM, Mike Galbraith <[email protected]> wrote:
> On Fri, 2016-04-08 at 22:59 +0200, Rafael J. Wysocki wrote:
>> On Friday, April 08, 2016 08:50:54 AM Mike Galbraith wrote:
>> > On Fri, 2016-04-08 at 08:45 +0200, Peter Zijlstra wrote:
>> >
>> > > Cute, I thought you used governor=performance for your runs?
>> >
>> > I do, and those numbers are with it thus set.
>>
>> Well, this is a trade-off.
>>
>> 4.5 introduced a power regression here so this one goes back to the previous
>> state of things.
>
> That sounds somewhat reasonable. Too bad I don't have a super duper
> watt meter handy.. seeing that you really really are saving me money
> would perhaps make me less fond of those prettier numbers.

You can look at the turbostat Watts numbers ("turbostat --debug" and
the last three columns of the output in turbostat as included in the
kernel source).

That requires an Intel CPU with RAPL.

2016-04-09 15:10:29

by Mike Galbraith

Subject: Re: [regression] cross core scheduling frequency drop bisected to 0c313cb20732

On Sat, 2016-04-09 at 14:33 +0200, Rafael J. Wysocki wrote:
> On Sat, Apr 9, 2016 at 8:40 AM, Mike Galbraith <[email protected]> wrote:
> > On Fri, 2016-04-08 at 22:59 +0200, Rafael J. Wysocki wrote:
> > > On Friday, April 08, 2016 08:50:54 AM Mike Galbraith wrote:
> > > > On Fri, 2016-04-08 at 08:45 +0200, Peter Zijlstra wrote:
> > > >
> > > > > Cute, I thought you used governor=performance for your runs?
> > > >
> > > > I do, and those numbers are with it thus set.
> > >
> > > Well, this is a trade-off.
> > >
> > > 4.5 introduced a power regression here so this one goes back to the previous
> > > state of things.
> >
> > That sounds somewhat reasonable. Too bad I don't have a super duper
> > watt meter handy.. seeing that you really really are saving me money
> > would perhaps make me less fond of those prettier numbers.
>
> You can look at the turbostat Watts numbers ("turbostat --debug" and
> the last three columns of the output in turbostat as included in the
> kernel source).

Hm. I think I want my prettier numbers back.

714KHz/877KHz = 0.81
25Watt/30Watt = 0.83

-Mike


2016-04-09 16:39:40

by Mike Galbraith

Subject: Re: [regression] cross core scheduling frequency drop bisected to 0c313cb20732


Hm, setting gov=performance, and taking the average of 3 30 second
interval PkgWatt samples as pipe-test runs..

714KHz/28.03Ws = 25.46
877KHz/30.28Ws = 28.96

..for pipe-test, the tradeoff looks a bit more like red than green.

-Mike
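
The efficiency figures above are throughput divided by the averaged PkgWatt reading. A small Python sketch of that arithmetic follows; the function and variable names are made up for illustration, and only the four input numbers come from the message. With these rounded inputs the results land within about 0.01 of the posted values.

```python
def throughput_per_ws(khz, pkg_ws):
    """Scheduling frequency (KHz) per unit of the posted PkgWatt figure."""
    return khz / pkg_ws


# Inputs as posted: pipe-test frequency and averaged PkgWatt sample.
stock = throughput_per_ws(714, 28.03)     # master with 0c313cb20732 applied
reverted = throughput_per_ws(877, 30.28)  # master with 0c313cb20732 reverted

if __name__ == "__main__":
    print(f"stock:    {stock:.2f}")
    print(f"reverted: {reverted:.2f}")
    print(f"stock/reverted: {stock / reverted:.2f}")
```

A higher figure means more work per unit of energy, which is why the reverted kernel comes out ahead for this particular workload.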

2016-04-10 03:44:51

by Rafael J. Wysocki

Subject: Re: [regression] cross core scheduling frequency drop bisected to 0c313cb20732

On Sat, Apr 9, 2016 at 6:39 PM, Mike Galbraith <[email protected]> wrote:
>
> Hm, setting gov=performance, and taking the average of 3 30 second
> interval PkgWatt samples as pipe-test runs..
>
> 714KHz/28.03Ws = 25.46
> 877KHz/30.28Ws = 28.96
>
> ..for pipe-test, the tradeoff looks a bit more like red than green.

Well, fair enough, but that's just pipe-test, and what about the
people who don't see the performance gain and see the energy loss,
like Doug?

Essentially, this trades performance gains in somewhat special
workloads for increased energy consumption in idle. Those workloads
need not be run by everybody, but idle is.

That said I applied the patch you're complaining about mostly because
the commit that introduced the change in question in 4.5 claimed that
it wouldn't affect idle power on systems with reasonably fast C1, but
that didn't pass the reality test. I'm not totally against restoring
that change, but it would need to be based on very solid evidence.

2016-04-10 07:16:55

by Doug Smythies

Subject: RE: [regression] cross core scheduling frequency drop bisected to 0c313cb20732

On 2016.04.09 20:45 Rafael J. Wysocki wrote:
>On Sat, Apr 9, 2016 at 6:39 PM, Mike Galbraith wrote:
>>
>> Hm, setting gov=performance, and taking the average of 3 30 second
>> interval PkgWatt samples as pipe-test runs..
>>
>> 714KHz/28.03Ws = 25.46
>> 877KHz/30.28Ws = 28.96
>>
>> ..for pipe-test, the tradeoff looks a bit more like red than green.
>
> Well, fair enough, but that's just pipe-test, and what about the
> people who don't see the performance gain and see the energy loss,
> like Doug?

Some numbers from my computer:

Pipe-test (100 seconds):

Kernel 4.6-rc2 gov=powersave:
Stock: 3.86 uSecs/loop and 3148.05 Joules
Reverted: 3.34 uSecs/loop and 3567.43 Joules

Reverted is 13% faster at a cost of 13% more energy.

Idle stats (done separately and for 20e6 loops)

State   k46rc2-ps (sec)   k46rc2-rev-ps (sec)
0.00               0.01                  4.09
1.00              38.68                  0.00
2.00               0.46                  0.27
3.00               0.01                  0.00
4.00             464.23                380.23

total            503.38                384.60

Kernel 4.6-rc2 gov=performance:
Stock: 3.89 uSecs/loop and 3154.72 Joules
Reverted: 3.25 uSecs/loop and 3445.90 Joules

Reverted is 16% faster at a cost of 9% more energy.

Idle stats (done separately and for 20e6 loops)

State   k46rc2-pf (sec)   k46rc2-rev-pf (sec)
0.00               0.00                  1.43
1.00              38.89                  0.04
2.00               2.08                  0.03
3.00               0.01                  0.01
4.00             463.05                381.54

total            504.03                383.05

9 incremental kernel compiles, with no changes:
(the reference test from last cycle):
(2000 seconds turbostat package energy sample time):
There is no detectable consistent change in compile times:

Kernel 4.6-rc2 gov=powersave:
Stock: 48557 Joules
Reverted: 65439 Joules

Reverted costs 34% more energy.
(note: this result is unusually high. There are variations test to test)

Kernel 4.6-rc2 gov=performance:
Stock: 49965 Joules
Reverted: 59232 Joules

Reverted costs 19% more energy.
(note: never tested gov=performance before)

Idle stats not re-done (we had several samples last cycle).

> Essentially, this trades performance gains in somewhat special
> workloads for increased energy consumption in idle. Those workloads
> need not be run by everybody, but idle is.
>
> That said I applied the patch you're complaining about mostly because
> the commit that introduced the change in question in 4.5 claimed that
> it wouldn't affect idle power on systems with reasonably fast C1, but
> that didn't pass the reality test. I'm not totally against restoring
> that change, but it would need to be based on very solid evidence.

... Doug
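
The percentages in the message above can be rechecked with a few lines of Python. This is only a sketch: the helper names are invented here, and "faster" is assumed to mean the relative reduction in time per loop, which is the reading that lands closest to the posted pipe-test figures. The results agree with the posted percentages to within about one point.

```python
def pct_less_time(stock, reverted):
    """Percent reduction in usecs/loop going from stock to reverted."""
    return 100.0 * (stock - reverted) / stock


def pct_more_energy(stock, reverted):
    """Percent increase in Joules going from stock to reverted."""
    return 100.0 * (reverted - stock) / stock


# (label, stock usecs/loop, reverted usecs/loop, stock J, reverted J)
pipe_test = [
    ("powersave",   3.86, 3.34, 3148.05, 3567.43),
    ("performance", 3.89, 3.25, 3154.72, 3445.90),
]
# (label, stock J, reverted J) for the kernel compile test
compile_test = [
    ("powersave",   48557, 65439),
    ("performance", 49965, 59232),
]

if __name__ == "__main__":
    for gov, t0, t1, e0, e1 in pipe_test:
        print(f"pipe-test {gov}: {pct_less_time(t0, t1):.1f}% less time, "
              f"{pct_more_energy(e0, e1):.1f}% more energy")
    for gov, e0, e1 in compile_test:
        print(f"compile {gov}: {pct_more_energy(e0, e1):.1f}% more energy")
```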


2016-04-10 09:35:19

by Mike Galbraith

Subject: Re: [regression] cross core scheduling frequency drop bisected to 0c313cb20732

On Sun, 2016-04-10 at 05:44 +0200, Rafael J. Wysocki wrote:
> On Sat, Apr 9, 2016 at 6:39 PM, Mike Galbraith <
> [email protected]> wrote:
> >
> > Hm, setting gov=performance, and taking the average of 3 30 second
> > interval PkgWatt samples as pipe-test runs..
> >
> > 714KHz/28.03Ws = 25.46
> > 877KHz/30.28Ws = 28.96
> >
> > ..for pipe-test, the tradeoff looks a bit more like red than green.
>
> Well, fair enough, but that's just pipe-test, and what about the
> people who don't see the performance gain and see the energy loss,
> like Doug?

Perhaps Doug sees increased power because he's not throttling no_hz,
whereas I am, so he burns more power getting _to_ idle? Dunno, maybe
he'll try the attached. If it's a general case energy loser, so be it,
numbers talk, bs walks and all that ;-)

> Essentially, this trades performance gains in somewhat special
> workloads for increased energy consumption in idle. Those workloads
> need not be run by everybody, but idle is.

Cross core scheduling is routine business, we do truckloads of that for
good reason, and lots of stuff does wakeups at high frequency.

> That said I applied the patch you're complaining about mostly because
> the commit that introduced the change in question in 4.5 claimed that
> it wouldn't affect idle power on systems with reasonably fast C1, but
> that didn't pass the reality test. I'm not totally against restoring
> that change, but it would need to be based on very solid evidence.

Understood. My box seems to be saying we can hug the trees hardest by
telling the CPU to get work done as quickly as possible, but I don't have
much experience at tree hugging measurement. Performance wise, tasks
talking via localhost is definitely not special.

tbench      1     2     4     8
base      752  1283  2250  3362

select_idle_sibling() off
          735  1344  2080  2884
delta    .977 1.047  .924  .857

select_idle_sibling() on, 0c313cb20732 reverted
          816  1317  2240  3388
delta   1.085 1.026  .995 1.007  vs base
delta   1.110  .979 1.076 1.174  vs off
             (^hm)

-Mike


Attachments:
sched-throttle-nohz.patch (1.50 kB)

2016-04-10 14:54:08

by Mike Galbraith

Subject: Re: [regression] cross core scheduling frequency drop bisected to 0c313cb20732

On Sun, 2016-04-10 at 11:35 +0200, Mike Galbraith wrote:
> On Sun, 2016-04-10 at 05:44 +0200, Rafael J. Wysocki wrote:
> > On Sat, Apr 9, 2016 at 6:39 PM, Mike Galbraith <
> > [email protected]> wrote:
> > >
> > > Hm, setting gov=performance, and taking the average of 3 30 second
> > > interval PkgWatt samples as pipe-test runs..
> > >
> > > 714KHz/28.03Ws = 25.46
> > > 877KHz/30.28Ws = 28.96
> > >
> > > ..for pipe-test, the tradeoff looks a bit more like red than green.
> >
> > Well, fair enough, but that's just pipe-test, and what about the
> > people who don't see the performance gain and see the energy loss,
> > like Doug?
>
> Perhaps Doug sees increased power because he's not throttling no_hz,
> whereas I am, so he burns more power getting _to_ idle? Dunno, maybe
> he'll try the attached. If it's a general case energy loser, so be it,
> numbers talk, bs walks and all that ;-)

And here are the rest of my numbers..

> tbench      1     2     4     8
> base      752  1283  2250  3362
>
> select_idle_sibling() off
>           735  1344  2080  2884
> delta    .977 1.047  .924  .857
>
> select_idle_sibling() on, 0c313cb20732 reverted
>           816  1317  2240  3388
> delta   1.085 1.026  .995 1.007  vs base
> delta   1.110  .979 1.076 1.174  vs off
>              (^hm)

tbench 2 turboboost off
base    1215  1.00  1215/32.24=37.68
revert  1252  1.03  1252/35.82=34.95=loser

tbench 2 throughput hm is apparently a turboboost oddity, and..

tbench (turboboost back on)
power        1      2      4      8
base     23.88  37.41  54.64  62.25
revert   31.25  42.53  55.11  62.66

MB/s/Ws      1      2      4      8
base     31.49  34.29  41.17  54.00
revert   26.11  30.96  40.64  54.06

..while single pipe-test pair said green/green, tbench numbers say
throughput green, but energy efficiency red across the board.

-Mike
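
The MB/s/Ws rows above are the tbench throughput rows from the earlier message divided by the power rows. A short Python sketch of that division follows; the dictionary layout and names are illustrative only, and the numbers are copied from the two messages. The computed ratios agree with the posted table to within 0.01.

```python
CLIENTS = (1, 2, 4, 8)

# tbench throughput (MB/s) per client count, from the earlier message.
throughput = {
    "base":   (752, 1283, 2250, 3362),
    "revert": (816, 1317, 2240, 3388),
}
# Power readings per client count, from the "power" rows above.
power = {
    "base":   (23.88, 37.41, 54.64, 62.25),
    "revert": (31.25, 42.53, 55.11, 62.66),
}


def efficiency(kernel):
    """Throughput divided by power for each client count."""
    return [mb / ws for mb, ws in zip(throughput[kernel], power[kernel])]


if __name__ == "__main__":
    print("MB/s/Ws " + " ".join(f"{n:>7}" for n in CLIENTS))
    for kernel in ("base", "revert"):
        row = " ".join(f"{v:7.2f}" for v in efficiency(kernel))
        print(f"{kernel:<8}{row}")
```

With the revert, efficiency drops noticeably at 1 and 2 clients and is close to a wash at 4 and 8, where both kernels draw nearly the same power.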

2016-04-10 15:40:05

by Mike Galbraith

Subject: Re: [regression] cross core scheduling frequency drop bisected to 0c313cb20732

On Sat, 2016-04-09 at 14:31 +0200, Rafael J. Wysocki wrote:
> On Sat, Apr 9, 2016 at 1:07 PM, Peter Zijlstra <[email protected]>
> wrote:
> > On Fri, Apr 08, 2016 at 10:59:59PM +0200, Rafael J. Wysocki wrote:
> > > On Friday, April 08, 2016 08:50:54 AM Mike Galbraith wrote:
> > > > On Fri, 2016-04-08 at 08:45 +0200, Peter Zijlstra wrote:
> > > >
> > > > > Cute, I thought you used governor=performance for your runs?
> > > >
> > > > I do, and those numbers are with it thus set.
> > >
> > > Well, this is a trade-off.
> > >
> > > 4.5 introduced a power regression here so this one goes back to
> > > the previous
> > > state of things.
> >
> > Just for my elucidation; how can gov=performance have a 'power'
> > regression?
>
> Because of what is used as the "default" idle state most of the time.
>
> C1 was used before 4.5 and that changed to polling in 4.5.

Should the default idle state not then be governor dependent? When I
set gov=performance, I'm expecting box to go just as fast as it can go
without melting. Does polling risk CPU -> lava conversion?

-Mike

2016-04-10 20:24:59

by Rik van Riel

Subject: Re: [regression] cross core scheduling frequency drop bisected to 0c313cb20732

On Sun, 2016-04-10 at 17:39 +0200, Mike Galbraith wrote:
> On Sat, 2016-04-09 at 14:31 +0200, Rafael J. Wysocki wrote:
> >
> > On Sat, Apr 9, 2016 at 1:07 PM, Peter Zijlstra <[email protected]>
> > wrote:
> > >
> > > On Fri, Apr 08, 2016 at 10:59:59PM +0200, Rafael J. Wysocki
> > > wrote:
> > > >
> > > > On Friday, April 08, 2016 08:50:54 AM Mike Galbraith wrote:
> > > > >
> > > > > On Fri, 2016-04-08 at 08:45 +0200, Peter Zijlstra wrote:
> > > > >
> > > > > >
> > > > > > Cute, I thought you used governor=performance for your
> > > > > > runs?
> > > > > I do, and those numbers are with it thus set.
> > > > Well, this is a trade-off.
> > > >
> > > > 4.5 introduced a power regression here so this one goes back to
> > > > the previous
> > > > state of things.
> > > Just for my elucidation; how can gov=performance have a 'power'
> > > regression?
> > Because of what is used as the "default" idle state most of the
> > time.
> >
> > C1 was used before 4.5 and that changed to polling in 4.5.
> Should the default idle state not then be governor dependent?  When I
> set gov=performance, I'm expecting box to go just as fast as it can
> go
> without melting.  Does polling risk CPU -> lava conversion?

Current CPUs can only have some cores run at full speed
(turbo mode) if other cores are idling and/or running at
lower speeds.

It may be time to stop pretending that gov=performance
actually results in better performance on current CPUs,
since it may inhibit entire levels of turbo mode.

--
All Rights Reversed.



2016-04-11 03:05:00

by Mike Galbraith

Subject: Re: [regression] cross core scheduling frequency drop bisected to 0c313cb20732

On Sun, 2016-04-10 at 16:24 -0400, Rik van Riel wrote:
> On Sun, 2016-04-10 at 17:39 +0200, Mike Galbraith wrote:

> > Should the default idle state not then be governor dependent? When I
> > set gov=performance, I'm expecting box to go just as fast as it can
> > go
> > without melting. Does polling risk CPU -> lava conversion?
>
> Current CPUs can only have some cores run at full speed
> (turbo mode) if other cores are idling and/or running at
> lower speeds.

The real world is very unlikely to miss the prettier numbers I'm
grieving over one tiny bit. Knowing that doesn't make giving them up
any easier though.. byebye cycles (sniff) ;-)

-Mike

2016-04-11 12:38:21

by Rik van Riel

Subject: Re: [regression] cross core scheduling frequency drop bisected to 0c313cb20732

On Mon, 2016-04-11 at 05:04 +0200, Mike Galbraith wrote:
> On Sun, 2016-04-10 at 16:24 -0400, Rik van Riel wrote:
> >
> > On Sun, 2016-04-10 at 17:39 +0200, Mike Galbraith wrote:
> >
> > >
> > > Should the default idle state not then be governor
> > > dependent?  When I
> > > set gov=performance, I'm expecting box to go just as fast as it
> > > can
> > > go
> > > without melting.  Does polling risk CPU -> lava conversion?
> > Current CPUs can only have some cores run at full speed
> > (turbo mode) if other cores are idling and/or running at
> > lower speeds.
> The real world is very unlikely to miss the prettier numbers I'm
> grieving over one tiny bit.  Knowing that doesn't make giving them up
> any easier though.. byebye cycles (sniff) ;-)

I suspect your pipe benchmark could be very relevant to
network performance numbers, too.

I would like to go into polling a little bit more aggressively
in a future kernel, and I think we can get away with it if we
teach the polling loop to exit after we have spent enough time
there that the menu governor will pick HLT after a few timed
out poll loops.

That way while we run a workload that actually benefits from
polling, we will get polling, but once we run a workload that
actually sleeps longer than the HLT threshold, we will quickly
fall back to HLT.

With 10Gbps network traffic, it could make a real difference
whether or not the CPU can wake up immediately, or takes a
microsecond to wake up...

--
All Rights Reversed.



2016-04-11 13:21:48

by Rafael J. Wysocki

Subject: Re: [regression] cross core scheduling frequency drop bisected to 0c313cb20732

On Mon, Apr 11, 2016 at 2:38 PM, Rik van Riel <[email protected]> wrote:
> On Mon, 2016-04-11 at 05:04 +0200, Mike Galbraith wrote:
>> On Sun, 2016-04-10 at 16:24 -0400, Rik van Riel wrote:
>> >
>> > On Sun, 2016-04-10 at 17:39 +0200, Mike Galbraith wrote:
>> >
>> > >
>> > > Should the default idle state not then be governor
>> > > dependent? When I
>> > > set gov=performance, I'm expecting box to go just as fast as it
>> > > can
>> > > go
>> > > without melting. Does polling risk CPU -> lava conversion?
>> > Current CPUs can only have some cores run at full speed
>> > (turbo mode) if other cores are idling and/or running at
>> > lower speeds.
>> The real world is very unlikely to miss the prettier numbers I'm
>> grieving over one tiny bit. Knowing that doesn't make giving them up
>> any easier though.. byebye cycles (sniff) ;-)
>
> I suspect your pipe benchmark could be very relevant to
> network performance numbers, too.
>
> I would like to go into polling a little bit more aggressively
> in a future kernel,

Agreed, but ->

> and I think we can get away with it if we
> teach the polling loop to exit after we have spent enough time
> there that the menu governor will pick HLT after a few timed
> out poll loops.

-> my concern about this approach is that it would add an artificial
point to the menu governor statistics at whatever the timeout is
chosen to be.

2016-04-11 13:39:05

by Rik van Riel

Subject: Re: [regression] cross core scheduling frequency drop bisected to 0c313cb20732

On Mon, 2016-04-11 at 15:21 +0200, Rafael J. Wysocki wrote:
> On Mon, Apr 11, 2016 at 2:38 PM, Rik van Riel <[email protected]>
> wrote:
> >
> > On Mon, 2016-04-11 at 05:04 +0200, Mike Galbraith wrote:
> > >
> > > On Sun, 2016-04-10 at 16:24 -0400, Rik van Riel wrote:
> > > >
> > > >
> > > > On Sun, 2016-04-10 at 17:39 +0200, Mike Galbraith wrote:
> > > >
> > > > >
> > > > >
> > > > > Should the default idle state not then be governor
> > > > > dependent?  When I
> > > > > set gov=performance, I'm expecting box to go just as fast as
> > > > > it
> > > > > can
> > > > > go
> > > > > without melting.  Does polling risk CPU -> lava conversion?
> > > > Current CPUs can only have some cores run at full speed
> > > > (turbo mode) if other cores are idling and/or running at
> > > > lower speeds.
> > > The real world is very unlikely to miss the prettier numbers I'm
> > > grieving over one tiny bit.  Knowing that doesn't make giving
> > > them up
> > > any easier though.. byebye cycles (sniff) ;-)
> > I suspect your pipe benchmark could be very relevant to
> > network performance numbers, too.
> >
> > I would like to go into polling a little bit more aggressively
> > in a future kernel,
> Agreed, but ->
>
> >
> > and I think we can get away with it if we
> > teach the polling loop to exit after we have spent enough time
> > there that the menu governor will pick HLT after a few timed
> > out poll loops.
> -> my concern about this approach is that it would add an artificial
> point to the menu governor statistics at whatever the timeout is
> chosen to be.

I would set the threshold to at least the HLT target residency +
exit latency, so 3 poll timeouts in 8 wakeups would cause
us to fall back to HLT.

On the other hand, if the system is legitimately very busy,
and we break out of the HLT loop due to activities happening
before the timeout most of the time, we should automatically
pick polling.

Does that make sense?

Am I overlooking something?

--
All Rights Reversed.
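
A toy Python model of the idea proposed above: poll by default, but fall back to HLT once enough recent idle periods ended in a poll timeout. Only the "3 poll timeouts in 8 wakeups" ratio comes from the message; the class, its method names and the simple counting rule are hypothetical. The real menu governor works from its own interval statistics, so this shows the intended behaviour and not how it would be implemented.

```python
from collections import deque


class PollHistory:
    """Hypothetical tracker of whether recent polls timed out."""

    def __init__(self, window=8, timeout_limit=3):
        # window and timeout_limit follow "3 poll timeouts in 8 wakeups".
        self.timeout_limit = timeout_limit
        self.events = deque(maxlen=window)

    def record(self, poll_timed_out):
        """Remember whether the latest idle period outlasted the poll timeout."""
        self.events.append(bool(poll_timed_out))

    def next_state(self):
        """Pick HLT once enough recent polls timed out, else keep polling."""
        if sum(self.events) >= self.timeout_limit:
            return "HLT"
        return "POLL"


if __name__ == "__main__":
    history = PollHistory()
    # A busy phase (wakeups arrive before the timeout), then a quiet phase.
    for timed_out in [False] * 8 + [True] * 3:
        history.record(timed_out)
        print(timed_out, "->", history.next_state())
```

Because the window only holds the last eight wakeups, a return to short idle periods pushes the timeouts out of the history and the choice swings back to polling.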

