2007-06-01 18:42:41

by djwong

Subject: Re: Dependent CPU core speed reporting not updated with CPUFREQ_SHARED_TYPE_HW?

On Thu, Mar 29, 2007 at 06:06:22PM -0700, Pallipadi, Venkatesh wrote:
> thought of making affected CPUs show the dependency in case of hw
> coord, but retaining the percpu control. But, it seemed complicated
> change for something that is cosmetic.

Actually, it's not so cosmetic any more. Our newest servers have a
power meter that measures power consumption, and I'm writing a program
to measure the power cost of various cpufreq transitions in order to
enforce a power cap. Due to the under-reporting in affected_cpus, the
app thinks that (taking your example above) CPUs 0 and 2 can be
controlled independently. Thus, a p-state transition of (x, x) ->
(x, x-1) yields no energy saving at all, while (x, x-1) -> (x-1, x-1)
does. My program considers the effects of a single CPU's transition
independently of which CPU it is and without considering what
frequencies the other CPUs are operating at, which means that it will
conclude that the cost of increasing speed (or the reward for decreasing
it) is half of what it is ... sort of. It's mildly broken as a result,
though amusingly enough it still seems to work ok. I suspect that it
might flail around trying to hit a cap a bit more than it would if
affected_cpus were more accurate.
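
The (x, x) -> (x, x-1) arithmetic can be sketched as follows. This is a
minimal model, assuming the hardware runs every core in a coordinated
group at the fastest requested P-state and that power depends only on the
effective state. The function names and the wattage table are made up for
illustration and are not taken from the program described here.

```python
def effective_states(requested):
    """Hardware-coordination model: every core in the group runs at the
    fastest requested state (a higher number means faster here)."""
    top = max(requested)
    return [top] * len(requested)


def group_power(requested, watts_per_state):
    """Total group power under the model above.
    watts_per_state is a hypothetical per-core lookup table."""
    return sum(watts_per_state[s] for s in effective_states(requested))


# Hypothetical per-core power for two P-states: x = 1, x-1 = 0.
WATTS = {0: 20.0, 1: 30.0}

both_fast = group_power([1, 1], WATTS)  # (x, x)
one_slow = group_power([1, 0], WATTS)   # (x, x-1)
both_slow = group_power([0, 0], WATTS)  # (x-1, x-1)

print("saving from (x, x) -> (x, x-1):    ", both_fast - one_slow)
print("saving from (x, x-1) -> (x-1, x-1):", one_slow - both_slow)
```

Under this model the first transition saves nothing and the second one
saves the whole amount, so averaging the two per-CPU transitions
attributes half of the real saving to each.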

--D



2007-06-01 20:40:46

by Andi Kleen

Subject: Re: Dependent CPU core speed reporting not updated with CPUFREQ_SHARED_TYPE_HW?

"Darrick J. Wong" <[email protected]> writes:

> On Thu, Mar 29, 2007 at 06:06:22PM -0700, Pallipadi, Venkatesh wrote:
> > thought of making affected CPUs show the dependency in case of hw
> > coord, but retaining the percpu control. But, it seemed complicated
> > change for something that is cosmetic.
>
> Actually, it's not so cosmetic any more. Our newest servers have a
> power meter that measures power consumption, and I'm writing a program
> to measure the power cost of various cpufreq transitions in order to
> enforce a power cap.

How would that work? You would adjust the power cap dynamically during
runtime based on the power meter feedback? How long would
the adjustment interval be?

> Due to the under-reporting in affected_cpus, the
> app thinks that (taking your example above) CPUs 0 and 2 can be
> controlled independently. Thus, a p-state transition of (x, x) ->
> (x, x-1) yields no energy saving at all, while (x, x-1) -> (x-1, x-1)
> does. My program considers the effects of a single CPU's transition
> independently of which CPU it is and without considering what
> frequencies the other CPUs are operating at, which means that it will
> conclude that the cost of increasing speed (or the reward for decreasing
> it) is half of what it is ... sort of. It's mildly broken as a result,
> though amusingly enough it still seems to work ok. I suspect that it
> might flail around trying to hit a cap a bit more than it would if
> affected_cpus were more accurate.

Not sure affected CPUs is accurate enough for your purposes anyways.
It cannot express "other core can be independent if I'm idle, otherwise not"
which is common on Intel systems.

-Andi

2007-06-01 22:38:32

by djwong

Subject: Re: Dependent CPU core speed reporting not updated with CPUFREQ_SHARED_TYPE_HW?

On Fri, Jun 01, 2007 at 11:37:07PM +0200, Andi Kleen wrote:
> > On Thu, Mar 29, 2007 at 06:06:22PM -0700, Pallipadi, Venkatesh wrote:

> How would that work? You would adjust the power cap dynamically during
> runtime based on the power meter feedback? How long would
> the adjustment interval be?

Yep, I adjust scaling_max_freq as needed. The adjustment is
currently done once per minute, though I've noticed that the BMC power meter
itself can react in about 10-15 seconds. Incidentally, the ACPI battery
meter seems to react in about 2-5 seconds on my T40. I suspect that I
could lower that adjustment interval even further, though on the AMD box
(x3755) the power meter is slow to read under high loads.
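
One adjustment step of such a loop could look like the sketch below. It
is not the program described here: it swaps in a plain step-down/step-up
rule, and the function name, the headroom parameter, and the example
frequencies are assumptions for illustration. Reading the power meter and
writing scaling_max_freq are left to the caller.

```python
def next_max_freq(available_khz, current_khz, measured_watts, cap_watts,
                  headroom_watts=10.0):
    """Pick the next scaling_max_freq value (in kHz) for one interval.

    Step down one frequency when over the cap, step up one when at
    least headroom_watts below it, otherwise hold the current value.
    """
    freqs = sorted(available_khz)
    i = freqs.index(current_khz)
    if measured_watts > cap_watts and i > 0:
        return freqs[i - 1]
    if measured_watts < cap_watts - headroom_watts and i < len(freqs) - 1:
        return freqs[i + 1]
    return current_khz


# Hypothetical frequency table and readings.
FREQS = [2000000, 2333000, 2667000]
print(next_max_freq(FREQS, 2667000, measured_watts=310.0, cap_watts=300.0))
```

The headroom gap keeps the loop from bouncing between two frequencies
when the reading sits just under the cap, which matters more as the
adjustment interval gets shorter.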

> Not sure affected CPUs is accurate enough for your purposes anyways.
> It cannot express "other core can be independent if I'm idle, otherwise not"
> which is common on Intel systems.

Yep, this is true too. Right now I'm using CPU offlining as a clumsy
mechanism to force a CPU into idle state; even with the incorrect
assumption that affected_cpus applies to forced idleness, it seems to
work ok. We can end up losing more cores than we need to, but so far it
has always been the case that we don't offline cores until we've run out
of lower p-states on all cores. But I imagine with 80-core behemoths
on the way, I ought to fix this particular bug somehow. It will
probably involve adding a transition rule for each of the
non-lowest-numbered CPUs in a cpufreq domain between 0 and whatever
speed the lowest-numbered CPU in that domain is running at.

--D

2007-06-02 01:59:37

by Pallipadi, Venkatesh

Subject: RE: Dependent CPU core speed reporting not updated with CPUFREQ_SHARED_TYPE_HW?



>-----Original Message-----
>From: Darrick J. Wong [mailto:[email protected]]
>Sent: Friday, June 01, 2007 11:44 AM
>To: Pallipadi, Venkatesh
>Cc: [email protected]
>Subject: Re: Dependent CPU core speed reporting not updated
>with CPUFREQ_SHARED_TYPE_HW?
>
>On Thu, Mar 29, 2007 at 06:06:22PM -0700, Pallipadi, Venkatesh wrote:
>> thought of making affected CPUs show the dependency in case of hw
>> coord, but retaining the percpu control. But, it seemed complicated
>> change for something that is cosmetic.
>
>Actually, it's not so cosmetic any more. Our newest servers have a
>power meter that measures power consumption, and I'm writing a program
>to measure the power cost of various cpufreq transitions in order to
>enforce a power cap. Due to the under-reporting in affected_cpus, the
>app thinks that (taking your example above) CPUs 0 and 2 can be
>controlled independently. Thus, a p-state transition of (x, x) ->
>(x, x-1) yields no energy saving at all, while (x, x-1) -> (x-1, x-1)
>does. My program considers the effects of a single CPU's transition
>independently of which CPU it is and without considering what
>frequencies the other CPUs are operating at, which means that it will
>conclude that the cost of increasing speed (or the reward for decreasing
>it) is half of what it is ... sort of. It's mildly broken as a result,
>though amusingly enough it still seems to work ok. I suspect that it
>might flail around trying to hit a cap a bit more than it would if
>affected_cpus were more accurate.

Hmmm. How about having a new cpufreq sysfs entry to say that
these CPUs are frequency dependent in hardware?

affected_cpus today has a single cpufreq directory for all affected_cpus
and we coordinate all CPUs in software. To change freq, we will have to
move among all affected_cpus and write an MSR.

Hardware coordination basically tells us that the kernel can control
frequency per CPU, but underneath, the hardware will pick the highest
requested freq among a group of CPUs. Instead of handling this case as
the existing software coordination case above, we can add a new entry in
cpufreq sysfs denoting the hardware-coordinated CPU group.

Though it will be confusing with too many interfaces, I feel this is the
right way to go about here.

Comments? Thoughts?

Thanks,
Venki

2007-06-02 06:43:55

by Dave Jones

Subject: Re: Dependent CPU core speed reporting not updated with CPUFREQ_SHARED_TYPE_HW?

On Fri, Jun 01, 2007 at 06:59:25PM -0700, Venki Pallipadi wrote:

> Hmmm. How about having a new cpufreq_sysfs entry to say
> these CPUs are frequency dependent in hardware.

Wait, wasn't this the entire purpose of affected_cpus in the first
place? So we could see which CPUs would be affected by a frequency
change? What went wrong here?

> affected_cpus today has a single cpufreq directory for all affected_cpus
> and we coordinate all CPUs in software. To change freq, we will have to
> move among all affected_cpus and write an MSR.

This I think is where the problem started. That these remained
independent. Changing one should also affect the others that it
'affects'. Is that not the case?

> Hardware coordination basically tells us that kernel can control
> frequency percpu, but underneath hardware will pick highest requested
> freq among a group of CPUs. Instead of handling this case as the
> existing software coordination case above, we can add a new entry in
> cpufreq /sysfs denoting hardware coordinated CPU group.
>
> Though it will be confusing with too many interfaces, I feel this is the
> right way to go about here.

If 'affected_cpus' doesn't do the right thing, I'd vote for making it
do so over adding more interfaces.

Dave

--
http://www.codemonkey.org.uk

2007-06-02 14:19:17

by Pallipadi, Venkatesh

Subject: RE: Dependent CPU core speed reporting not updated with CPUFREQ_SHARED_TYPE_HW?



>-----Original Message-----
>From: Dave Jones [mailto:[email protected]]
>Sent: Friday, June 01, 2007 11:43 PM
>To: Pallipadi, Venkatesh
>Cc: Darrick J. Wong; [email protected]
>Subject: Re: Dependent CPU core speed reporting not updated
>with CPUFREQ_SHARED_TYPE_HW?
>
>On Fri, Jun 01, 2007 at 06:59:25PM -0700, Venki Pallipadi wrote:
>
> > Hmmm. How about having a new cpufreq_sysfs entry to say
> > these CPUs are frequency dependent in hardware.
>
>Wait, wasn't this the entire purpose of affected_cpus in the first
>place? So we could see which CPUs would be affected by a frequency
>change? What went wrong here?
>
> > affected_cpus today has a single cpufreq directory for all
> > affected_cpus and we coordinate all CPUs in software. To change
> > freq, we will have to move among all affected_cpus and write an MSR.
>
>This I think is where the problem started. That these remained
>independent. Changing one should also affect the others that it
>'affects'. Is that not the case?
>

Yes. With the current affected_cpus, the CPUs are dependent from the
user's perspective: a single set of sysfs files is linked in to multiple
CPUs. But the kernel knows that these are multiple dependent CPUs, and
while changing freq, the driver typically goes from one CPU to the other
using set_cpus_allowed to write the MSR on all CPUs.
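
What userspace can see of this is sketched below, assuming each
/sys/devices/system/cpu/cpuN/cpufreq/affected_cpus file holds a
space-separated list of CPU numbers. The helper names and the sample file
contents are hypothetical; the second sample shows the hardware
coordinated case as reported in this thread, where each CPU lists only
itself.

```python
def parse_cpu_list(text):
    """Parse the contents of an affected_cpus file, e.g. "0 1\n"."""
    return tuple(sorted(int(tok) for tok in text.split()))


def group_domains(affected_by_cpu):
    """Group CPUs into the cpufreq domains that sysfs reports.

    affected_by_cpu maps a CPU number to the text read from that CPU's
    affected_cpus file.
    """
    domains = set()
    for cpu, text in affected_by_cpu.items():
        members = set(parse_cpu_list(text))
        members.add(cpu)  # a CPU always belongs to its own domain
        domains.add(tuple(sorted(members)))
    return sorted(domains)


# Software coordination: both CPUs share one policy.
print(group_domains({0: "0 1\n", 1: "0 1\n"}))
# Hardware coordination: each CPU reports only itself.
print(group_domains({0: "0\n", 1: "1\n"}))
```

A tool that trusts this grouping treats the two CPUs in the second case
as independent, even though the hardware ties their speeds together.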

> > Hardware coordination basically tells us that kernel can control
> > frequency percpu, but underneath hardware will pick highest
> > requested freq among a group of CPUs. Instead of handling this case
> > as the existing software coordination case above, we can add a new
> > entry in cpufreq /sysfs denoting hardware coordinated CPU group.
> >
> > Though it will be confusing with too many interfaces, I feel this is
> > the right way to go about here.
>
>If 'affected_cpus' doesn't do the right thing, I'd vote for making it
>do so over adding more interfaces.
>

The problem here is that with hardware coordination, kernel need not
do what we do for affected_cpus today. Kernel can manage each CPU
independently in terms of setting freq as underlying hardware
guarantees to do the coordination (picking up the highest freq
among a group of dependent cpus). So ideally we can just manage cpu
frequencies as we do today without affected_cpus. But, in this case
there is an FYI from hardware which says that even though the OS thinks
the CPUs are independent, hardware is doing the coordination across
these CPUs.

We cannot directly use affected_cpus for this. We can probably change
to use affected_cpus in a way that we enforce software coordination on
top of hardware coordination. But, maintaining freq as in current
affected_cpus may not be as optimal as doing a percpu policy and
decision.

Thanks,
Venki

2007-06-04 17:06:20

by djwong

Subject: Re: Dependent CPU core speed reporting not updated with CPUFREQ_SHARED_TYPE_HW?

On Sat, Jun 02, 2007 at 07:19:03AM -0700, Pallipadi, Venkatesh wrote:

> The problem here is that with hardware coordination, kernel need not
> do what we do for affected_cpus today. Kernel can manage each CPU
> independently in terms of setting freq as underlying hardware
> guarantees to do the coordination (picking up the highest freq
> among a group of dependent cpus). So ideally we can just manage cpu
> frequencies as we do today without affected_cpus. But, in this case
> there is a fyi from hardware which says even though OS is thinking that
> CPUs are independent, hardware is doing the coordination across these
> CPUs.

Yes, and ... it appears that (at least on Intel CPUs), writing
IA32_PERF_CTL on any core in the package causes the speed of all CPUs to
be set to the max of all CPU cores. Here's what I see when
reading/writing the performance control MSRs:

This is with all CPUs set to 2.6GHz. 0x198 = IA32_PERF_STATUS, 0x199 =
IA32_PERF_CTL.

root@elm3a188:/mnt/src/msr/msr-20060208# ./msr -n 0x198 0x199
CPU 0:
0x00000198 = 0x0828082806000828
0x00000199 = 0x0000000000000828
CPU 1:
0x00000198 = 0x0828082806000828
0x00000199 = 0x0000000000000828

Now, we set scaling_max_freq of CPU 0 to 2GHz:
root@elm3a188:/mnt/src/msr/msr-20060208# ./msr -n 0x198 0x199
CPU 0:
0x00000198 = 0x0828082806000828
0x00000199 = 0x000000000000061a
CPU 1:
0x00000198 = 0x0828082806000828
0x00000199 = 0x0000000000000828

And now likewise for CPU 1:
root@elm3a188:/mnt/src/msr/msr-20060208# ./msr -n 0x198 0x199
CPU 0:
0x00000198 = 0x082808280600061a
0x00000199 = 0x000000000000061a
CPU 1:
0x00000198 = 0x082808280600061a
0x00000199 = 0x000000000000061a

Notice that we've written the slower speed to the control register but
the status register says that we're still running at the higher speed.
This seems to corroborate my finding that the power use does not drop
until _both_ cores are lowered. Unfortunately, in that middle step we
are reporting an incorrect frequency for CPU 0--sysfs says 2GHz but the
hardware itself says 2.6. This is clearly a bad thing, because I just
set scaling_max_freq on CPU0 to 2GHz, yet it is running 667MHz faster
than it ought to be because CPU1 wants to go faster.
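
The register values in the dumps can be decoded with a short sketch. It
assumes that the low 16 bits hold the P-state, with the bus ratio in bits
15:8 and the voltage ID in bits 7:0, and that the bus clock is 333 MHz.
Both are assumptions picked because they reproduce the 2.6GHz and 2GHz
figures above; the layout differs between CPU generations, and the helper
names are hypothetical.

```python
BUS_MHZ = 1000.0 / 3.0  # assumed 333 MHz bus clock


def decode_pstate(msr_value):
    """Split the low 16 bits of IA32_PERF_STATUS or IA32_PERF_CTL into
    (bus ratio, voltage ID) under the assumed layout."""
    low16 = msr_value & 0xFFFF
    ratio = (low16 >> 8) & 0xFF
    vid = low16 & 0xFF
    return ratio, vid


def ratio_to_mhz(ratio, bus_mhz=BUS_MHZ):
    """Core clock in MHz for a given bus ratio."""
    return round(ratio * bus_mhz)


# Middle step from the dumps: CPU 0 asked for the slower state, but the
# status register still shows the faster one.
status_ratio, _ = decode_pstate(0x0828082806000828)  # IA32_PERF_STATUS
ctl_ratio, _ = decode_pstate(0x000000000000061A)     # IA32_PERF_CTL

print("running at:", ratio_to_mhz(status_ratio), "MHz")
print("requested: ", ratio_to_mhz(ctl_ratio), "MHz")
```

Under these assumptions the status value decodes to 2667 MHz and the
control value to 2000 MHz, which is the 667MHz gap described above.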

How about this: When hardware coordination is specified in _PSD,
scaling_{min,max}_freq between CPUs in a cpufreq domain are tied
together so that we can be sure that our caps are being followed.
Requests to change speed can be done as they always have, but
afterwards the value of scaling_cur_freq for all CPUs in the cpufreq
domain will be determined by reading the speed value from hardware
since we can't really be sure how the hardware decided to coordinate
things anyway. When it becomes the case that individual cores on a
package can run at different speeds, we can drop the _PSD entries. Does
this scheme sound reasonable?
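
Until something like that exists in the kernel, the tying can be
approximated from userspace by writing the same limit to every CPU in a
domain. The sketch below assumes the caller already knows which CPUs
share a domain, since affected_cpus does not report it in the hardware
coordinated case. The function name and the root parameter are
hypothetical; root exists only so the code can be tried against a scratch
directory instead of the real sysfs tree.

```python
import os


def set_domain_max_freq(cpus, khz, root="/sys/devices/system/cpu"):
    """Write the same scaling_max_freq (in kHz) for every CPU in cpus.

    Returns the list of files written. Writing to the real sysfs tree
    needs root privileges.
    """
    written = []
    for cpu in cpus:
        path = os.path.join(root, f"cpu{cpu}", "cpufreq", "scaling_max_freq")
        with open(path, "w") as f:
            f.write(f"{khz}\n")
        written.append(path)
    return written
```

For example, set_domain_max_freq([0, 1], 2000000) would cap both cores of
the package from the dumps above at 2GHz in one call, so the hardware has
no faster request left to honour.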

We might, however, want another sysfs file to tell userspace what kind
of coordination is taking place.

--D

