2002-08-28 11:49:30

by Dominik Brodowski

Subject: [PATCH][2.5.32] CPU frequency and voltage scaling (0/4)

Hi Linus, lkml,

The following patches add CPU frequency and voltage scaling
support (Intel SpeedStep, AMD PowerNow, etc.) to kernel 2.5.32.


Patch 1/4: cpufreq-core
-----------------------
The cpufreq core offers a common interface to the CPU clock
speed features of ARM, PPC and x86 CPUs.

For communication with user space, sysctl entries are placed in
/proc/sys/cpu/{0,1,...,NR_CPUS-1}/ . Entries provided are:

speed-min (readonly)
speed-max (readonly)
speed-sync (readonly - all CPUs need the same frequency,
changes affect all CPUs)
speed (read/write)
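
A minimal userspace sketch of how these entries might be read and written.
The helper names read_cpu_sysctl() and write_cpu_speed() are hypothetical,
and treating the values as plain decimal integers in kHz is an assumption;
the list above does not state a unit.

```c
#include <stdio.h>

/* Read one integer from a per-CPU sysctl file such as
 * "/proc/sys/cpu/0/speed-max". Returns -1 if the file cannot be
 * opened or does not start with a number. */
long read_cpu_sysctl(const char *path)
{
    FILE *f = fopen(path, "r");
    long value = -1;

    if (!f)
        return -1;
    if (fscanf(f, "%ld", &value) != 1)
        value = -1;
    fclose(f);
    return value;
}

/* Request a new speed by writing it to the read/write "speed" entry.
 * Returns 0 on success, -1 on failure. */
int write_cpu_speed(const char *path, long khz)
{
    FILE *f = fopen(path, "w");
    int ok;

    if (!f)
        return -1;
    ok = fprintf(f, "%ld\n", khz) > 0;
    if (fclose(f) != 0)
        ok = 0;
    return ok ? 0 : -1;
}
```

A caller would typically read speed-min and speed-max first and only write
values inside that range to speed.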

In order for this code to be built, an architecture must define the
CONFIG_CPU_FREQ configuration symbol. The merged ARM code already
has the necessary configuration in place, the i386 code follows in
parts 2 and 3.

Specifically on ARM CPUs, the core is especially important, since
various ARM system on a chip implementations derive peripheral clocks
from the CPU clock (eg, LCD controllers, SDRAM controllers, etc).
The core allows these peripherals to take action before and/or after
the actual CPU clock adjustment so we don't go out of tolerance.


Patch 2/4: cpufreq-i386-core
----------------------------
The main part of this patch is a CPUFreq notifier in arch/i386/kernel/time.c.
It updates the i386-specific cpu_khz, cpu_data[].loops_per_jiffy and
fast_gettimeoffset_quotient on each frequency change.
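
The update for cpu_khz and loops_per_jiffy is proportional arithmetic. A
sketch, assuming the calibrated value scales linearly with the clock;
scale_by_freq() is a hypothetical name and not necessarily the helper used
in the patch.

```c
#include <stdint.h>

/* Rescale a value calibrated at ref_khz (for example loops_per_jiffy)
 * so that it matches new_khz. The 64-bit intermediate keeps the
 * multiplication from overflowing on 32-bit longs. */
unsigned long scale_by_freq(unsigned long ref_value,
                            unsigned int ref_khz, unsigned int new_khz)
{
    if (ref_khz == 0)
        return ref_value;   /* nothing sensible to scale against */
    return (unsigned long)(((uint64_t)ref_value * new_khz) / ref_khz);
}
```

Halving the clock halves the number of delay loops per jiffy, which is what
keeps udelay() accurate after a frequency change.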

Additionally, this patch allows "cpu_khz" to be exported (it is needed
for some cpufreq drivers) and adds some MSR #defines to asm-i386/msr.h


Patch 3/4: cpufreq-i386-drivers
-------------------------------
Four i386 CPUFreq drivers are ready to be merged this time. These are:
elanfreq.c: The AMD Elan CPU family offers extensive clock scaling
longhaul.c: VIA Longhaul processor clock + voltage scaling
powernow-k6.c: mobile AMD K6-2+ / mobile AMD K6-3+ clock scaling
speedstep.c: clock and voltage scaling on mobile Intel Pentium 3 and 4s,
but (unfortunately) only on ICH2-M or ICH3-M based
chipsets.

Support for mobile AMD K7 processors is still in development.


Patch 4/4: cpufreq-doc
----------------------
Adds an entry to the CREDITS and MAINTAINERS files, Config.help texts, and
extensive documentation in linux/Documentation/cpufreq


Comments welcome; however, please ensure that the cpufreq development
list at cpufreq@www.linux.org.uk receives a copy of all comments.

Dominik



2002-08-28 18:41:18

by Linus Torvalds

Subject: Re: [PATCH][2.5.32] CPU frequency and voltage scaling (0/4)


On Wed, 28 Aug 2002, Dominik Brodowski wrote:
>
> The following patches add CPU frequency and voltage scaling
> support (Intel SpeedStep, AMD PowerNow, etc.) to kernel 2.5.32

The thing is, this interface appears fundamentally broken with respect to
CPU's that change their frequency on the fly. I happen to know one such
CPU rather well myself.

What is this interface supposed to do about a CPU that can change its
frequency dynamically several hundred times a second? Having the OS
control it simply isn't an option - the overhead of the control is _way_
more than is acceptable at that level.

In short, this interface is too broken to be called generic.

A quote from Peter Anvin:

"What is worse is that the interface is, in my opinion, fundamentally
broken for *ALL* CPUs. It doesn't present a policy interface to the
kernel, instead it presents a frequency-setting interface and expects
the policy to be done in userspace. The kernel is the only part of the
system which has sufficient information (idle times of all CPUs, for
example) to do a decent job managing the CPU frequency efficiently.
On Transmeta CPUs this policy should simply be passed down to CMS, of
course; on other CPUs the kernel needs to manage it."

In other words: there is no valid way that a _user_ can set the policy
right now: the user can set the frequency, but since any sane policy
depends on how busy the CPU is, the user isn't even the right person to
_do_ that, since the user doesn't _know_.

Also note that policy is not just about how busy the CPU is, but also
about how _hot_ the CPU is. Again, a user-mode application (that maybe
polls the situation every minute or so), simply _cannot_ handle this
situation. You need to have the ability to poll the CPU tens of times a
second to react to heat events, and clearly user mode cannot do that
without impacting performance in a big way.

The interface needs to be improved upon. It is simply _not_ valid to say
"run at this speed" as the primary policy.

Linus

2002-08-28 18:47:56

by Cort Dougan

Subject: Re: [PATCH][2.5.32] CPU frequency and voltage scaling (0/4)

It's even worse for some of the very new P4's that don't have their
heatsink seated properly. They heat up every few minutes and then throttle
themselves due to thermal overload. I think this situation is going to
become more and more common, now. We're at the mercy of every BIOS and
micro-code programmer now-a-days. That situation needs to be improved
upon, as well.

} The interface needs to be improved upon. It is simply _not_ valid to say
} "run at this speed" as the primary policy.

2002-08-28 19:18:37

by Alan

Subject: Re: [PATCH][2.5.32] CPU frequency and voltage scaling (0/4)

On Wed, 2002-08-28 at 19:48, Cort Dougan wrote:
> It's even worse for some of the very new P4's that don't have their
> heatsink seated properly. They heat up every few minutes and then throttle
> themselves due to thermal overload. I think this situation is going to
> become more and more common, now. We're at the mercy of every BIOS and

Systems designers are designing on the basis of thermal slowdowns being
the optimal way to build some systems. It's actually quite reasonable for
many workloads.

> micro-code programmer now-a-days. That situation needs to be improved
> upon, as well.

Cpufreq doesn't handle this case yet in the 2.4 code but it already
includes the needed udelay correction. Not everything CPUfreq has to
deal with need be user initiated.


2002-08-28 19:14:56

by Alan

Subject: Re: [PATCH][2.5.32] CPU frequency and voltage scaling (0/4)


> "What is worse is that the interface is, in my opinion, fundamentally
> broken for *ALL* CPUs. It doesn't present a policy interface to the
> kernel, instead it presents a frequency-setting interface and expects
> the policy to be done in userspace. The kernel is the only part of the
> system which has sufficient information (idle times of all CPUs, for
> example) to do a decent job managing the CPU frequency efficiently.
> On Transmeta CPUs this policy should simply be passed down to CMS, of
> course; on other CPUs the kernel needs to manage it."

You might want to read the paper on the original cpufreq for ARM. It
gives real world cases where the user -needs- to be able to control the
policy. I think you misunderstand what the interface is about. Large
numbers of systems benefit from usermode policy engines.

cpufreq is an interface that allows the user to control the processor
speed. Period. It is not a policy, it is not a management system. Its
business stops at "don't blow up the computer". It enables user space
policies to be handled. It sequences events and notifies device drivers
so they can handle speed changes that affect them. (eg reclocking the
serial ports on the AMD Elan)

If you need a kernel policy engine then that should be a completely
separate module. The cpufreq interface has to provide "please change
speed" methods to the kernel. We already have one example of a kernel
policy engine that wants this facility. That is ACPI with native
processor speed control. ACPI is simply one possible policy engine.

Cpufreq is to power management as /dev/hda is to file systems.


Alan

2002-08-28 19:32:09

by Cort Dougan

Subject: Re: [PATCH][2.5.32] CPU frequency and voltage scaling (0/4)

My frustration comes from the fact that my CPU time is being stolen from me
because of bad mechanical and software design. I'm not even notified of
it. If there were some way for the OS to over-ride or even be notified of
these events I'd have less of a problem with it. As it is now, poor
system design is affecting OS design more and more.

} Systems designers are designing on the basis of thermal slowdowns being
} the optimal way to build some systems. It's actually quite reasonable for
} many workloads.
}
} > micro-code programmer now-a-days. That situation needs to be improved
} > upon, as well.
}
} Cpufreq doesn't handle this case yet in the 2.4 code but it already
} includes the needed udelay correction. Not everything CPUfreq has to
} deal with need be user initiated
}

2002-08-28 19:40:57

by Peter Riocreux

Subject: Re: [PATCH][2.5.32] CPU frequency and voltage scaling (0/4)

Alan Cox <[email protected]> writes:

> On Wed, 2002-08-28 at 19:48, Cort Dougan wrote:
> > It's even worse for some of the very new P4's that don't have their
> > heatsink seated properly. They heat up every few minutes and then throttle
> > themselves due to thermal overload. I think this situation is going to
> > become more and more common, now. We're at the mercy of every BIOS and
>
> Systems designers are designing on the basis of thermal slowdowns being
> the optimal way to build some systems. It's actually quite reasonable for
> many workloads.

Don't forget the low end of the scale too...

An interface of this type even has applicability in the absence of a
clock. Research in the Amulet group at Manchester University (home of
the Amulet processors - self-timed ARM cores) and elsewhere is looking
at management of /maximum/ power consumption (instantaneous power
consumption is a function of the work to be done) by constraining the
maximum number of instructions in flight, rather than the clocked
equivalent of capping the clock frequency. This might be done where
the power supply's capability is very limited (solar, wireless
induction, smartcard, wind, etc).

This number can be managed by the processor if you build the right
peripheral into the system, and this would need an equivalent
interface for control - it wouldn't be a clock frequency, but it would
be a number controlling the /maximum/ speed of the processor.

Peter

2002-08-28 19:42:58

by Linus Torvalds

Subject: Re: [PATCH][2.5.32] CPU frequency and voltage scaling (0/4)


On 28 Aug 2002, Alan Cox wrote:
>
> You might want to read the paper on the original cpufreq for ARM. It
> gives real world cases where the user -needs- to be able to control the
> policy. I think you misunderstand what the interface is about. Large
> numbers of systems benefit from usermode policy engines.

That's not the point.

The point is that the _policy_ (not the end result) needs to be pushed
down to the kernel, so that the kernel can do the right thing with it.

That policy can be updated in "real time" from user space, of course. But
the fact is that you cannot just set a frequency and leave it at that, it
doesn't work.

Linus

2002-08-28 19:52:04

by Linus Torvalds

Subject: Re: [PATCH][2.5.32] CPU frequency and voltage scaling (0/4)


On 28 Aug 2002, Alan Cox wrote:
>
> Systems designers are designing on the basis of thermal slowdowns being
> the optimal way to build some systems. It's actually quite reasonable for
> many workloads.

Absolutely. Thermal policy is often an overriding thing, where even
non-transmeta CPU's will simply do the decision "on their own", without
input from the OS. That's simply because some designs will literally not
work above certain temperatures and do not have the heat sink capacity to
get out of a tight spot by purely external cooling.

But that's just one part of it. Even aside from thermal concerns, you want
to drop frequency aggressively when the machine is idle, because dropping
the frequency allows you to drop the voltage and effectively gets you a
cubed power reduction (which not only saves your battery, but also cools
the chip down so that when you _do_ start going full speed again you have
more thermal headroom).
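
The "cubed" figure follows from the usual first-order approximation for
CMOS switching power. A short derivation, assuming the supply voltage can be
lowered roughly in proportion to the frequency (real parts have a voltage
floor and leakage current, so this is an upper bound on the saving, not a
measurement):

```latex
\begin{align*}
P_{\mathrm{dyn}} &\approx C\,V^{2} f \\
V \propto f \;&\Rightarrow\; P_{\mathrm{dyn}} \propto f^{3} \\
\frac{P_{\mathrm{dyn}}(f/2)}{P_{\mathrm{dyn}}(f)} &= \left(\tfrac{1}{2}\right)^{3} = \tfrac{1}{8}
\end{align*}
```

Here C is the switched capacitance, V the supply voltage and f the clock
frequency. Under these assumptions, running at half speed costs about one
eighth of the power.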

So in order to avoid the thermal shutdown, you need to be proactive about
the frequency. Which again means that a user-level "once a second" or
"once in a blue moon" approach is fundamentally flawed.

I don't disagree with _also_ being able to set the frequency statically.
However, I do disagree with an interface that seems to be _purely_
designed for this, and nothing else.

Linus

2002-08-28 20:18:47

by Dominik Brodowski

Subject: Re: [PATCH][2.5.32] CPU frequency and voltage scaling (0/4)

On Wed, Aug 28, 2002 at 11:47:54AM -0700, Linus Torvalds wrote:
>
> On Wed, 28 Aug 2002, Dominik Brodowski wrote:
> >
> > The following patches add CPU frequency and voltage scaling
> > support (Intel SpeedStep, AMD PowerNow, etc.) to kernel 2.5.32
>
> The thing is, this interface appears fundamentally broken with respect to
> CPU's that change their frequency on the fly. I happen to know one such
> CPU rather well myself.
Do these CPUs need kernel support? E.g. do udelay() calls work as
expected? If so, there's no need to make a driver aware of something
which isn't in its scope to control. CPUFreq is basically for those many
systems where frequency switches need to be called from the OS.

> What is this interface supposed to do about a CPU that can change its
> frequency dynamically several hundred times a second? Having the OS
> control it simply isn't an option - the overhead of the control is _way_
> more than is acceptable at that level.
Well, that's probably the idea that's behind the microcode approach of
certain CPUs. However, for many voltage-scaling-able CPUs there is no such
microcode, and so the OS _needs_ to do it.

> In other words: there is no valid way that a _user_ can set the policy
> right now: the user can set the frequency, but since any sane policy
> depends on how busy the CPU is, the user isn't even the right person to
> _do_ that, since the user doesn't _know_.
This is only true on CPUs where frequency switches can occur "on the fly"
with very low latency. Most voltage-scaling CPUs are currently found on
mobile systems. For those notebook users it is a big step forward to have
this ability of switching between "full speed" and "low speed" depending on
the power source. And on LART systems even dynamic switching _from
userspace_ has proven to be successful (see OLS talk by Erik Mouw).
Please note that even a "kernel-based frequency selector" needs large parts
of the cpufreq core and drivers: udelay() calls need to be accurate,
external limitations (like on ARM systems) need to be integrated, and the
frequency and voltage switching mechanisms need to be implemented --
so is there any reason why this "kernel-based frequency selector" couldn't
just use the existing interface: cpufreq_set()?

> Also note that policy is not just about how busy the CPU is, but also
> about how _hot_ the CPU is. Again, a user-mode application (that maybe
> polls the situation every minute or so), simply _cannot_ handle this
> situation. You need to have the ability to poll the CPU tens of times a
> second to react to heat events, and clearly user mode cannot do that
> without impacting performance in a big way.
IMHO a cpufreq interface shouldn't become too bloated. If it tries to
drop the frequency quite aggressively, that should be enough;
critical-temperature shutdown / throttling mechanisms will take
care of emergencies.

> The interface needs to be improved upon. It is simply _not_ valid to say
> "run at this speed" as the primary policy.
This is right. But a sane kernel-based frequency selector doesn't exist yet.
Even if it existed, it would need large parts of the cpufreq patches
submitted today. And these offer some support which isn't the best thing
since sliced bread but still is happily used by users.

Dominik

2002-08-28 20:22:01

by Andrew Grover

Subject: RE: [PATCH][2.5.32] CPU frequency and voltage scaling (0/4)

> From: Linus Torvalds [mailto:[email protected]]
> In other words: there is no valid way that a _user_ can set the policy
> right now: the user can set the frequency, but since any sane policy
> depends on how busy the CPU is, the user isn't even the
> right person to
> _do_ that, since the user doesn't _know_.
>
> Also note that policy is not just about how busy the CPU is, but also
> about how _hot_ the CPU is. Again, a user-mode application
> (that maybe
> polls the situation every minute or so), simply _cannot_ handle this
> situation. You need to have the ability to poll the CPU tens
> of times a
> second to react to heat events, and clearly user mode cannot do that
> without impacting performance in a big way.
>
> The interface needs to be improved upon. It is simply _not_
> valid to say
> "run at this speed" as the primary policy.

Well TMTA CPUs would seem to be easy, because all this is done behind the
OS's back, right?

Let's talk about CPUs in which the OS has to control processor performance.
The way I see it, there are a bunch of inputs that are going to determine
CPU speed & voltage: user preference, workload, and thermals.

Wouldn't you have your initial perf setting determined by the workload, and
then revised down, based upon user preferences (such as "I want to conserve
battery") and the thermal requirements?

Any workload analysis has to be in the kernel. The user interface can be one
that just allows a limit to be placed upon the setting the workload demands.
Then, the thermal control can further drop the setting, if needs be.
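
A sketch of that ordering, with the workload choosing the starting point and
the other two inputs only ever lowering it. pick_khz() and its parameter
names are hypothetical and do not come from the patches.

```c
/* Start from the frequency the workload analysis asks for, then cap it
 * by the user's limit and by the current thermal limit. Neither cap can
 * raise the result above what the workload demanded. */
unsigned int pick_khz(unsigned int workload_khz,
                      unsigned int user_cap_khz,
                      unsigned int thermal_cap_khz)
{
    unsigned int khz = workload_khz;

    if (khz > user_cap_khz)
        khz = user_cap_khz;
    if (khz > thermal_cap_khz)
        khz = thermal_cap_khz;
    return khz;
}
```

With this shape the user interface only needs to expose the cap, not the
frequency itself.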

Regards -- Andy

2002-08-28 20:19:17

by Alan

Subject: Re: [PATCH][2.5.32] CPU frequency and voltage scaling (0/4)

On Wed, 2002-08-28 at 20:49, Linus Torvalds wrote:
> The point is that the _policy_ (not the end result) needs to be pushed
> down to the kernel, so that the kernel can do the right thing with it.
>
> That policy can be updated in "real time" from user space, of course. But
> the fact is that you cannot just set a frequency and leave it at that, it
> doesn't work.

If you look at the papers on the original ARM cpufreq code you'll see a
case where very long granularity user driven policy is pretty much
essential. The kernel sometimes does not have enough information.

Take a trivial example. My CPU is 99% idle. Should you drop the clock
speed right down. On a generic system yes, on a system which has to meet
very tight real time deadlines quite possibly not. In fact some
processors (eg the MediaGX) actually have hardware assists for speeding
the CPU clock up for realtime interrupt processing paths.

That argument ultimately boils down to "should the /proc interface to
cpufreq be a separate module from the core cpufreq code called by kernel
policy engines like ACPI?" The answer is obviously "yes" - /proc is just
one of the policy engines.




2002-08-28 20:22:28

by Linus Torvalds

Subject: Re: [PATCH][2.5.32] CPU frequency and voltage scaling (0/4)


On 28 Aug 2002, Alan Cox wrote:
>
> If you look at the papers on the original ARM cpufreq code you'll see a
> case where very long granularity user driven policy is pretty much
> essential. The kernel sometimes does not have enough information.

Alan, that is _not_ the point here.

It's ok to tell the kernel these "long-term" policies. But it has to be
told as a POLICY, not as a random number. Because I can show you a hundred
other cases where the user mode code does _not_have_a_clue_.

That's my argument. The kernel should be given a _policy_, not a "this
frequency". Because a frequency is provably not enough, and can be quite
hurtful.

And I do not want to get people used to passing in frequencies, when I can
absolutely _prove_ that it's the wrong thing for 99% of all uses.

Linus

2002-08-28 20:25:10

by Dominik Brodowski

Subject: Re: [PATCH][2.5.32] CPU frequency and voltage scaling (0/4)

On Wed, Aug 28, 2002 at 12:49:31PM -0700, Linus Torvalds wrote:
>
> On 28 Aug 2002, Alan Cox wrote:
> >
> > You might want to read the paper on the original cpufreq for ARM. It
> > gives real world cases where the user -needs- to be able to control the
> > policy. I think you misunderstand what the interface is about. Large
> > numbers of systems benefit from usermode policy engines.
>
> That's not the point.
>
> The point is that the _policy_ (not the end result) needs to be pushed
> down to the kernel, so that the kernel can do the right thing with it.
>
> That policy can be updated in "real time" from user space, of course. But
> the fact is that you cannot just set a frequency and leave it at that, it
> doesn't work.

In the long term, maybe. But cpufreq is only the driver, which lets you have
such a policy engine in the kernel. And as long as this policy engine
doesn't exist, why not offer the user some control over his system?

Dominik

2002-08-28 20:37:31

by Dominik Brodowski

Subject: Re: [PATCH][2.5.32] CPU frequency and voltage scaling (0/4)

On Wed, Aug 28, 2002 at 09:25:53PM +0100, Alan Cox wrote:
> On Wed, 2002-08-28 at 20:49, Linus Torvalds wrote:
> That argument ultimately boils down to "should the /proc interface to
> cpufreq be a separate module from the core cpufreq code called by kernel
> policy engines like ACPI?" The answer is obviously "yes" - /proc is just
> one of the policy engines.

So, what do all of you think of the following implementation?

#1 The "policy modules" (/proc-interface, kernel-based frequency selector,
...) determine the target CPU frequency.

#2 This is then passed to the cpufreq core. There it is validated,
loops_per_jiffy and other values are adjusted.

#3 Then the cpufreq driver is called to actually set the CPU frequency.


#3 is absolutely ready, #2 in parts (the "policy module" interface is
missing, and the /proc-interface needs to be removed), and #1 is TBD.
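
A rough sketch of steps #2 and #3 as listed above, assuming a driver that
exposes a frequency range and a single callback. All names here (freq_driver,
freq_core, core_set_khz, the fake_* stand-ins) are hypothetical and are not
the structures from the patches.

```c
#include <stdint.h>

struct freq_driver {
    unsigned int min_khz;
    unsigned int max_khz;
    int (*set_khz)(unsigned int khz);   /* step #3: program the hardware */
};

struct freq_core {
    const struct freq_driver *drv;
    unsigned int cur_khz;
    unsigned long loops_per_jiffy;
};

/* Step #2: validate the target chosen by a policy module, hand it to
 * the driver, then rescale loops_per_jiffy to the new clock.
 * Returns 0 on success, -1 if the target is rejected or the driver fails. */
int core_set_khz(struct freq_core *core, unsigned int target_khz)
{
    const struct freq_driver *drv = core->drv;

    if (core->cur_khz == 0 ||
        target_khz < drv->min_khz || target_khz > drv->max_khz)
        return -1;
    if (drv->set_khz(target_khz) != 0)
        return -1;
    core->loops_per_jiffy = (unsigned long)
        (((uint64_t)core->loops_per_jiffy * target_khz) / core->cur_khz);
    core->cur_khz = target_khz;
    return 0;
}

/* Stand-in driver that only records the last requested frequency. */
unsigned int fake_hw_khz;
int fake_set_khz(unsigned int khz) { fake_hw_khz = khz; return 0; }
const struct freq_driver fake_driver = { 300000, 1000000, fake_set_khz };
```

Note that only a bare frequency crosses each boundary in this sketch; nothing
about why that frequency was chosen reaches the driver.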

Comments?

Dominik

2002-08-28 20:36:10

by Linus Torvalds

Subject: Re: [PATCH][2.5.32] CPU frequency and voltage scaling (0/4)


On Wed, 28 Aug 2002, Dominik Brodowski wrote:
>
> Do these CPUs need kernel support? E.g. do udelay() calls work as
> expected?

Crusoe CPU's do not.

But Intel CPU's _do_ need this, for example (since they change the TSC
frequency).

And Intel CPU's do _not_ want to have user mode telling them what to do
several times a second - yet it's entirely reasonable to have a kernel
timer function that estimates processor load at every timer tick, and
reacts to that a few times a second.

Which is why such a CPU needs to be passed in a _policy_. Which is my
whole argument.

Let's put it another way, because I've seen people at Transmeta scramble
when Microsoft thought it was a good idea to have the OS tell what
frequency the CPU should run at, and trust me, they got it wrong. From my
contacts at Intel, I can promise you that they got it wrong wrt Intel's
chips too, so this is not a Transmeta-only issue.

All I'm saying is that instead of a frequency, you should take more of a
"what is the goal of this" approach, and pass in _that_. Then, in user
land, you might have a situation where you know that "the goal is to run
at 300MHz, and nothing else". That may sometimes be the right goal, but
quite often it isn't.

And THAT IS MY POINT. If you have a more policy-oriented interface,
everybody can work with it. If you have a strict "this frequency"
approach, some people literally _cannot_ live with it, and will end up
throttling behind your back.

The goals may be:
- "low power" vs "high performance"
    Obvious. "Aggressive power management" vs "Power management with
    performance as the primary goal"

- "strive for max 20% idle"
    The kernel may slow down the clock if the timer tick shows lots of
    idle time. Tell the rest of the system when you do so.

- "RT latency - 300MHz minimum"
    The idle loop might drop the frequency, but not past a certain
    point.

- "run at exactly 500 MHz"

Notice how only the _last_ goal is expressible in the "frequency" space.
Everything else needs at least one additional piece of information, ie the
policy the kernel/CPU should take wrt the power management.
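
One way to picture a goal-based interface is a goal plus a frequency range,
with "run at exactly X" being the special case where the range collapses to
one value. This is only a sketch: the names (cpu_goal, cpu_policy,
policy_next_khz) and the 25% step size are assumptions made for
illustration, not anything proposed in this thread.

```c
enum cpu_goal { GOAL_POWERSAVE, GOAL_PERFORMANCE, GOAL_MAX_IDLE };

struct cpu_policy {
    enum cpu_goal goal;
    unsigned int min_khz;       /* floor, e.g. an RT latency minimum */
    unsigned int max_khz;       /* ceiling */
    unsigned int max_idle_pct;  /* target used by GOAL_MAX_IDLE only */
};

static unsigned int clamp_khz(const struct cpu_policy *p, unsigned int khz)
{
    if (khz < p->min_khz)
        return p->min_khz;
    if (khz > p->max_khz)
        return p->max_khz;
    return khz;
}

/* Called periodically with the idle percentage seen since the last call.
 * Returns the frequency to run at next, always inside [min_khz, max_khz]. */
unsigned int policy_next_khz(const struct cpu_policy *p,
                             unsigned int cur_khz, unsigned int idle_pct)
{
    switch (p->goal) {
    case GOAL_POWERSAVE:
        return p->min_khz;
    case GOAL_PERFORMANCE:
        return p->max_khz;
    case GOAL_MAX_IDLE:
        if (idle_pct > p->max_idle_pct)         /* too idle: step down 25% */
            return clamp_khz(p, cur_khz - cur_khz / 4);
        if (idle_pct < p->max_idle_pct / 2)     /* busy: step up 25% */
            return clamp_khz(p, cur_khz + cur_khz / 4);
        break;
    }
    return clamp_khz(p, cur_khz);
}
```

Setting min_khz equal to max_khz pins the result to that value whatever the
load is, so the fixed-frequency case needs no separate code path.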

Linus

2002-08-28 20:39:31

by Linus Torvalds

Subject: RE: [PATCH][2.5.32] CPU frequency and voltage scaling (0/4)


On Wed, 28 Aug 2002, Grover, Andrew wrote:
>
> Well TMTA CPUs would seem to be easy, because all this is done behind the
> OS's back, right?

Yes. However, I certainly wouldn't mind having the same interfaces as
everybody else to set things like "aggressive" vs "powersave". Transmeta
does all the actual _work_ behind the OS's back, but you can still tell
the CPU what policy to take, and what frequency limits to use.

> Let's talk about CPUs in which the OS has to control processor performance.
> The way I see it, there are a bunch of inputs that are going to determine
> CPU speed & voltage: user preference, workload, and thermals.

Absolutely.

> Any workload analysis has to be in the kernel.

...with user mode input (ie user mode can know a lot of high-level stuff
that the kernel _doesn't_ know). So the kernel does potentially need user
input on policy.

Linus

2002-08-28 20:51:23

by Dominik Brodowski

Subject: Re: [PATCH][2.5.32] CPU frequency and voltage scaling (0/4)

On Wed, Aug 28, 2002 at 01:43:08PM -0700, Linus Torvalds wrote:
>
> On Wed, 28 Aug 2002, Dominik Brodowski wrote:
> >
> > Do these CPUs need kernel support? E.g. do udelay() calls work as
> > expected?
>
> Crusoe CPU's do not.
Great.

> But Intel CPU's _do_ need this, for example (since they change the TSC
> frequency).
And that's why there is some need for a cpufreq core (which manages
loops_per_jiffy etc.) and the need for the cpufreq drivers (#2 and #3 in my
previous mail).

> Which is why such a CPU needs to be passed in a _policy_. Which is my
> whole argument.
Which is #1 - the "input" to the cpufreq core. This can be separated from
the cpufreq core. So basically

"policy input" --> "frequency input" --> cpufreq core --> cpufreq driver
  user-space    |               k e r n e l - s p a c e

instead of

"policy input" --> "frequency input" --> cpufreq core --> cpufreq driver
        u s e r - s p a c e           |   k e r n e l - s p a c e


Linus, would you agree to the /proc interface as one of several
frequency "input"/management options? It's good for testing, for some
workloads (LART), and it's (almost) done (just needs separating from
the cpufreq core)...

Dominik

2002-08-28 21:01:48

by Linus Torvalds

Subject: Re: [PATCH][2.5.32] CPU frequency and voltage scaling (0/4)


On Wed, 28 Aug 2002, Dominik Brodowski wrote:
>
> "policy input" --> "frequency input" --> cpufreq core --> cpufreq driver
>   user-space    |               k e r n e l - s p a c e

No.

The "policy input" has to filter down ALL THE WAY. If you turn it into a
frequency-only input at _any_ time, you've lost information that the
lowest levels need.

THAT is the problem with the current #3 - it _assumes_ that the policy
input has already been converted to frequency, and since it assumes that,
it cannot handle the case where the hardware itself wants to know what the
policy was.

Linus

2002-08-28 20:59:02

by Linus Torvalds

Subject: Re: [PATCH][2.5.32] CPU frequency and voltage scaling (0/4)


On Wed, 28 Aug 2002, Dominik Brodowski wrote:
>
> #3 Then the cpufreq driver is called to actually set the CPU frequency.
>
> #3 is absolutely ready

#3 is _not_ ready, if it doesn't include a "policy" part in addition to
the frequency. That was what I started off talking about: on some CPU's
you absolutely do _not_ want to set a hard frequency, you want to tell the
CPU how to behave (possibly together with a frequency _range_).

Until that is done, no other upper layers can use this low-level
functionality, since all upper layers would be forced to come up with a
hard frequency goal.

THAT is the problem. If you want to build infrastructure for upper layers,
then that infrastructure has to be able to pass down sufficient
information from those upper layers.

Think of this as a driver abstraction layer. Some hardware will do more
for you, some will do less. Some hardware is the equivalent of a dumb
frame buffer (where software has to change frequency and voltage by hand,
and be careful about every single step and the delays in between), while
some other hardware contains internal accelerators where you just tell
them what you want, and the hardware will do it for you asynchronously.

The current abstraction layer _thinks_ that all hardware is stupid, and is
thus not actually usable with smart hardware. See?
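
A sketch of what such a driver abstraction could look like: the driver may
accept the whole policy (smart hardware), or only a frequency (dumb
hardware), and the layer above falls back accordingly. Every name here is
hypothetical and chosen for illustration; this is not the interface in the
patches.

```c
#include <stddef.h>

enum pm_goal { PM_POWERSAVE, PM_PERFORMANCE };

struct pm_policy {
    enum pm_goal goal;
    unsigned int min_khz;
    unsigned int max_khz;
};

struct pm_driver {
    /* Smart hardware: takes the policy and manages itself. May be NULL. */
    int (*set_policy)(const struct pm_policy *p);
    /* Dumb hardware: software must pick a frequency. May be NULL. */
    int (*set_khz)(unsigned int khz);
};

/* Pass the policy all the way down when the driver can take it;
 * otherwise reduce it to a single frequency at the last possible moment. */
int pm_apply(const struct pm_driver *drv, const struct pm_policy *p)
{
    if (drv->set_policy)
        return drv->set_policy(p);
    if (drv->set_khz)
        return drv->set_khz(p->goal == PM_POWERSAVE ? p->min_khz
                                                    : p->max_khz);
    return -1;
}

/* Two stand-in drivers that only record what they were asked to do. */
enum pm_goal smart_last_goal = PM_PERFORMANCE;
unsigned int dumb_last_khz;

int smart_set_policy(const struct pm_policy *p)
{
    smart_last_goal = p->goal;
    return 0;
}

int dumb_set_khz(unsigned int khz)
{
    dumb_last_khz = khz;
    return 0;
}
```

The same "save battery" request reaches the smart driver unchanged and
reaches the dumb one as its minimum frequency.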

Linus

2002-08-28 22:56:37

by George Anzinger

Subject: Re: [PATCH][2.5.32] CPU frequency and voltage scaling (0/4)

Linus Torvalds wrote:
>
> On Wed, 28 Aug 2002, Dominik Brodowski wrote:
> >
> > "policy input" --> "frequency input" --> cpufreq core --> cpufreq driver
> >   user-space    |               k e r n e l - s p a c e
>
> No.
>
> The "policy input" has to filter down ALL THE WAY. If you turn it into a
> frequency-only input at _any_ time, you've lost information that the
> lowest levels need.
>
> THAT is the problem with the current #3 - it _assumes_ that the policy
> input has already been converted to frequency, and since it assumes that,
> it cannot handle the case where the hardware itself wants to know what the
> policy was.

I wonder about converting it to frequency at most any
level. Why not some abstract such as % of full speed, or %
of full power. I, for one, don't want to have to think
absolute numbers. First thing you know I will have a new
box with different numbers. Then what?
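
A sketch of the percentage idea, assuming "full speed" means the maximum
frequency the hardware reports. percent_to_khz() is a hypothetical helper,
not part of the patches.

```c
#include <stdint.h>

/* Convert "percent of full speed" into a frequency for this particular
 * machine. Values above 100 are treated as 100, and the result never
 * drops below the hardware minimum. */
unsigned int percent_to_khz(unsigned int percent,
                            unsigned int min_khz, unsigned int max_khz)
{
    uint64_t khz;

    if (percent > 100)
        percent = 100;
    khz = (uint64_t)max_khz * percent / 100;
    if (khz < min_khz)
        khz = min_khz;
    return (unsigned int)khz;
}
```

The same request of 50 maps to 500000 kHz on a box with a 1 GHz maximum and
to 1000000 kHz on one with a 2 GHz maximum, so the setting survives a
hardware change.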
--
George Anzinger [email protected]
High-res-timers:
http://sourceforge.net/projects/high-res-timers/
Preemption patch:
http://www.kernel.org/pub/linux/kernel/people/rml

2002-08-28 23:23:18

by Alan

Subject: Re: [PATCH][2.5.32] CPU frequency and voltage scaling (0/4)

On Wed, 2002-08-28 at 22:08, Linus Torvalds wrote:
>
> On Wed, 28 Aug 2002, Dominik Brodowski wrote:
> >
> > "policy input" --> "frequency input" --> cpufreq core --> cpufreq driver
> >   user-space    |               k e r n e l - s p a c e
>
> No.
>
> The "policy input" has to filter down ALL THE WAY. If you turn it into a
> frequency-only input at _any_ time, you've lost information that the
> lowest levels need.

So what you are saying is that you want to be sure that something like
"please run at a low speed to save battery" is translated by smarter
cpus into "please save battery" and on spudstop the CPU would go "umm
duh ok 300MHz"

Ok now I think I understand your gripe.




2002-08-28 23:19:30

by Alan

Subject: Re: [PATCH][2.5.32] CPU frequency and voltage scaling (0/4)

On Wed, 2002-08-28 at 21:29, Linus Torvalds wrote:
> It's ok to tell the kernel these "long-term" policies. But it has to be
> told as a POLICY, not as a random number. Because I can show you a hundred
> other cases where the user mode code does _not_have_a_clue_.

Right, and for the one in one hundred that it does, I need a policy that
suits it.

> That's my argument. The kernel should be given a _policy_, not a "this
> frequency". Because a frequency is provably not enough, and can be quite
> hurtful.

One of the policies I need from the kernel is "run at the frequency I
told you to run". It's a policy; it's not the general case policy. The
/proc file is that policy.

> And I do not want to get people used to passing in frequencies, when I can
> absolutely _prove_ that it's the wrong thing for 99% of all uses.

99% of people should be using something like ACPI.

cpufreq is cpu speed control not power management policy. I agree
entirely that most people should not be using echo "500" >/proc/... as a
power management policy.

Likewise /dev/hda is not a file system and people should not be using dd
to store their files.

In both cases the ability to do so is sometimes useful and shouldnt be
excluded.




2002-08-28 23:39:13

by Linus Torvalds

Subject: Re: [PATCH][2.5.32] CPU frequency and voltage scaling (0/4)


On 29 Aug 2002, Alan Cox wrote:
>
> One of the policies I need from the kernel is "run at the frequency I
> told you to run". It's a policy; it's not the general-case policy. The
> /proc file is that policy.

That's ok, but the current code DOES NOT DO THAT.

The current code has no support at all for the notion of policies, and
gives absolutely _zero_ support for it. It blindly assumes that the CPU
can (and should) run at one frequency, and as long as it does that, I
don't want it in the kernel.

> cpufreq is cpu speed control not power management policy. I agree
> entirely that most people should not be using echo "500" >/proc/... as a
> power management policy.
>
> Likewise /dev/hda is not a file system and people should not be using dd
> to store their files.

You've had that argument before, and it was bogus then - and it is bogus
now.

It is possible to put a filesystem on top of /dev/hda - because the block
layer is designed to allow it. It is not possible to build sane policy
upon the current frequency patches, because it is _not_ designed for
passing down the policy.

Exactly because some chips _need_ to have the policy passed down, the
lowest levels need to be able to pass it down.

It is _then_ ok to say that "if you do a 'echo 500 > /proc/cpu/freq', that
will also imply a policy of a fixed frequency". But if the frequency
setting code does not allow for any policy interface AT ALL, then it is
fundamentally broken.

That's my beef with it. We should not have "generic" interfaces that are
known to be fundamentally broken. As it is, the code - as designed - is
useless for a growing class of devices.

Think of it as a layering issue:
- user level policy
- kernel interface (possibly many - for different policies)
- low-level driver

Ok?

Now, what the current patches do is (a) one kernel interface (the
fixed-frequency one) and (b) low-level drivers.

The kernel interface is fine - it doesn't do what I think many people
might want to do, but it's simple and I agree that other policies can be
implemented with other interfaces. Fine.

But the fact that low-level drivers don't even support the notion of a
policy means that they are useless for any other interface. And I'm saying
that it's a clear design bug, and for no good reason.
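
A minimal sketch of that layering, for illustration only (it is not code from
the patches). The names `Policy`, `TableDriver`, `fixed_frequency_interface`
and `powersave_interface` are hypothetical, as is the 1000/500 MHz speed
table. The point is only that every kernel interface hands the low-level
driver a policy rather than a bare frequency:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Policy:
    """What gets passed all the way down (hypothetical structure)."""
    min_mhz: int
    max_mhz: int
    hint: str  # "performance" or "powersave"


class TableDriver:
    """Hypothetical low-level driver for a CPU with a fixed speed table."""

    def __init__(self, speeds_mhz):
        self.speeds_mhz = sorted(speeds_mhz)
        self.current_mhz = self.speeds_mhz[-1]

    def set_policy(self, policy):
        allowed = [s for s in self.speeds_mhz
                   if policy.min_mhz <= s <= policy.max_mhz]
        if not allowed:
            raise ValueError("no supported speed inside the requested range")
        # A simple CPU resolves the policy to one frequency here; a smarter
        # one could program the whole policy into the hardware instead.
        if policy.hint == "performance":
            self.current_mhz = max(allowed)
        else:
            self.current_mhz = min(allowed)
        return self.current_mhz


# Two kernel interfaces, one driver entry point.
def fixed_frequency_interface(driver, mhz):
    # Writing a single frequency implies a policy of a fixed frequency.
    return driver.set_policy(Policy(mhz, mhz, "performance"))


def powersave_interface(driver):
    return driver.set_policy(Policy(0, 2**32 - 1, "powersave"))
```

With this shape, adding another interface means adding one more small
function that builds a `Policy`; the driver does not change.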

Linus

2002-08-28 23:57:34

by Linus Torvalds

Subject: Re: [PATCH][2.5.32] CPU frequency and voltage scaling (0/4)


On 29 Aug 2002, Alan Cox wrote:
>
> So what you are saying is that you want to be sure that something like
> "please run at a low speed to save battery" is translated by smarter
> cpus into "please save battery" and on spudstop the CPU would go "umm
> duh ok 300MHz"

Yup, exactly.

I suspect that this is also what most people actually want to use anyway:
you don't care that your CPU is a speedstep 1GHz/500MHz or a 700/300 (or
whatever the combinations are), you really want to just say "go to power
save mode" vs "go to performance mode".

Sure, for speedstep, you can obviously trivially _emulate_ this in user
mode with the frequency approach, but for the generic case it isn't.

I don't know how many policies would be needed (too many just adds
complexity for no gain), but I _suspect_ that something like a

{ min-Hz, max-Hz, policy }

triple with "policy" being just a few different values ("performance",
"powersave") is sufficient. Clearly this triple trivially _becomes_ the
"single MHz" by just making min and max be the same if you really want one
particular MHz (at which time "policy" doesn't matter).

With something like the above, you could do something like

{ 0, ~0UL, "performance" } => generic highest performance setting
{ 0, ~0UL, "power-save" } => generic power-save setting
{ 300, 500, "performance" } => give me a performance setting in the specified range
{ 1700, 1700, "performance" } => run at a fixed 1.7GHz

(maybe the "policy" thing actually makes a difference even for the
fixed-frequency case: it can give hints about whether to allow C1-C3
states when idle etc).
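
As a rough sketch of how a driver for a CPU with a fixed speed table might
resolve such a triple: the function name `resolve`, the `UNBOUNDED` constant
and the speed table below are assumptions made for illustration, and the
values are in MHz to match the examples above.

```python
UNBOUNDED = 2**32 - 1  # stands in for ~0UL on a 32-bit machine


def resolve(available_mhz, min_mhz, max_mhz, policy):
    """Pick one speed from a fixed table for a { min, max, policy } triple."""
    candidates = [f for f in available_mhz if min_mhz <= f <= max_mhz]
    if not candidates:
        raise ValueError("no supported frequency in the requested range")
    if policy == "performance":
        return max(candidates)
    if policy == "power-save":
        return min(candidates)
    raise ValueError(f"unknown policy: {policy!r}")


# Hypothetical speed table, in MHz.
table = [300, 500, 1200, 1700]

generic_fast = resolve(table, 0, UNBOUNDED, "performance")
generic_slow = resolve(table, 0, UNBOUNDED, "power-save")
ranged = resolve(table, 300, 500, "performance")
fixed = resolve(table, 1700, 1700, "performance")
```

When min and max are equal, only one candidate survives the range filter, so
the policy no longer affects the chosen frequency.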

Linus

2002-08-29 07:07:02

by Dominik Brodowski

Subject: Re: [PATCH][2.5.32] CPU frequency and voltage scaling (0/4)

On Thu, Aug 29, 2002 at 12:26:18AM +0100, Alan Cox wrote:
> > And I do not want to get people used to passing in frequencies, when I can
> > absolutely _prove_ that it's the wrong thing for 99% of all uses.
>
> 99% of people should be using something like ACPI.

Current ACPI code does not adjust frequencies on its own; it relies on user
input too.

Dominik

2002-08-29 07:07:05

by Dominik Brodowski

Subject: Re: [PATCH][2.5.32] CPU frequency and voltage scaling (0/4)

On Wed, Aug 28, 2002 at 05:08:14PM -0700, Linus Torvalds wrote:
> I don't know how many policies would be needed (too many just adds
> complexity for no gain), but I _suspect_ that something like a
<snip>
> (maybe the "policy" thing actually makes a difference even for the
> fixed-frequency case: it can give hints about whether to allow C1-C3
> states when idle etc).

OK, I see the problems you mention wrt current cpufreq. But let's keep the
next version simple: whether to allow C1-C3 is really not something cpufreq
should take care of, as this is pure ACPI policy.

Dominik

2002-08-29 09:48:13

by Padraig Brady

Subject: Re: [PATCH][2.5.32] CPU frequency and voltage scaling (0/4)

Linus Torvalds wrote:
> On 28 Aug 2002, Alan Cox wrote:
>
>>Systems designers are designing on the basis of thermal slowdowns being
>>the optimal way to build some systems. Its actually quite reasonable for
>>many workloads.
>
> Absolutely. Thermal policy is often an overriding thing, where even
> non-transmeta CPU's will simply do the decision "on their own", without
> input from the OS. That's simply because some designs will literally not
> work above certain temperatures and do not have the heat sink capacity to
> get out of a tight spot by purely external cooling.
>
> But that's just one part of it. Even aside from thermal concerns, you want
> to drop frequency aggressively when the machine is idle, because dropping
> the frequency allows you to drop the voltage and effectively gets you a
> cubed power reduction (which not only saves your battery, but also cools
> the chip down so that when you _do_ start going full speed again you have
> more thermal headroom).
>
> So in order to avoid the thermal shutdown, you need to be proactive about
> the frequency. Which again means that a user-level "once a second" or
> "once in a blue moon" approach is fundamentally flawed.

Just a data point from my application. First of all with VIA CPUs
there is a 1ms delay per change when it resyncs things, so it's
not practical to do it very often or for realtime apps etc.
My application was purely a heat reduction exercise where the
timescale from 0°C to 70°C was around 10 minutes at max voltage/frequency,
so it could easily be controlled by a userspace application.
In fact I didn't control it manually and only set the frequency
(for which the appropriate voltage was chosen by the cpufreq code)
at system startup.

> I don't disagree with _also_ being able to set the frequency statically.
> However, I do disagree with an interface that seems to be _purely_
> designed for this, and nothing else.
>
> Linus

Pádraig.

2002-08-29 09:59:12

by Padraig Brady

Subject: Re: [PATCH][2.5.32] CPU frequency and voltage scaling (0/4)

Linus Torvalds wrote:
> On 29 Aug 2002, Alan Cox wrote:
>
>>So what you are saying is that you want to be sure that something like
>>"please run at a low speed to save battery" is translated by smarter
>>cpus into "please save battery" and on spudstop the CPU would go "umm
>>duh ok 300MHz"
>
>
> Yup, exactly.
>
> I suspect that this is also what most people actually want to use anyway:
> you don't care that your CPU is a speedstep 1GHz/500MHz or a 700/300 (or
> whatever the combinations are), you really want to just say "go to power
> save mode" vs "go to performance mode".
>
> Sure, for speedstep, you can obviously trivially _emulate_ this in user
> mode with the frequency approach, but for the generic case it isn't.
>
> I don't know how many policies would be needed (too many just adds
> complexity for no gain), but I _suspect_ that something like a
>
> { min-Hz, max-Hz, policy }
>
> triple with "policy" being just a few different values ("performance",
> "powersave") is sufficient. Clearly this triple trivially _becomes_ the
> "single MHz" by just making min and max be the same if you really want one
> particular MHz (at which time "policy" doesn't matter).
>
> With something like the above, you could do something like
>
> { 0, ~0UL, "performance" } => generic highest performance setting
> { 0, ~0UL, "power-save" } => generic power-save setting
> { 300, 500, "performance" } => give me a performance setting in the specified range
> { 1700, 1700, "performance" } => run at a fixed 1.7GHz

I would go for a quadruple:

{ 0, ~0UL, 0, "performance" } => generic highest performance setting
{ 0, ~0UL, 0, "power-save" } => generic power-save setting
{ 300, 500, 0, "performance" } => give me a performance setting in the specified range
{ 1700, 1700, 70, "performance" } => run at a fixed 1.7GHz
{ 0, ~0UL, 70, "performance" } => performance but don't go above 70°C

Would you also want a hysteresis value?
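
To show what a hysteresis value would buy, here is a small sketch under
stated assumptions: the function names, the 70°C limit and the 5°C
hysteresis are made up for the example, and real hardware may behave
differently. Throttling starts at the limit and only stops once the
temperature has dropped below the limit by the hysteresis amount, which
avoids flapping around the threshold.

```python
def update_throttle(throttled, temp_c, limit_c, hysteresis_c):
    """Return the new throttle state for one temperature sample."""
    if temp_c >= limit_c:
        return True
    if temp_c <= limit_c - hysteresis_c:
        return False
    return throttled  # inside the dead band: keep the previous state


def run(samples_c, limit_c=70, hysteresis_c=5):
    """Feed a series of temperature samples through the throttle logic."""
    throttled = False
    states = []
    for temp in samples_c:
        throttled = update_throttle(throttled, temp, limit_c, hysteresis_c)
        states.append(throttled)
    return states
```

With a hysteresis of zero, a CPU hovering between 69°C and 70°C would switch
state on every sample; with a non-zero value it stays throttled until it has
cooled to the lower threshold.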

Pádraig.

> (maybe the "policy" thing actually makes a difference even for the
> fixed-frequency case: it can give hints about whether to allow C1-C3
> states when idle etc).
>
> Linus

2002-08-29 10:05:46

by Zwane Mwaikambo

Subject: Re: [PATCH][2.5.32] CPU frequency and voltage scaling (0/4)

On Wed, 28 Aug 2002, Cort Dougan wrote:

> My frustration comes from the fact that my CPU time is being stolen from me
> because of bad mechanical and software design. I'm not even notified of
> it. If there were some way for the OS to over-ride or even be notified of
> these events I'd have less of a problem with it. As it is now, poor
> system design is affecting OS design more and more.

The P4 processor does send a notification (interrupt), there is support
for this interrupt in newer kernels. I'm not sure about other processors.

Zwane

--
function.linuxpower.ca


2002-08-29 10:04:06

by Zwane Mwaikambo

Subject: Re: [PATCH][2.5.32] CPU frequency and voltage scaling (0/4)

On Wed, 28 Aug 2002, Cort Dougan wrote:

> It's even worse for some of the very new P4's that don't have their
> heatsink seated properly. They heat up every few minutes and then throttle
> themselves due to thermal overload. I think this situation is going to
> become more and more common, now. We're at the mercy of every BIOS and
> micro-code programmer now-a-days. That situation needs to be improved
> upon, as well.

The P4 clock modulation driver was aware of that condition and in fact the
processor would not allow changing the speed in a thermal throttled state
anyway.

Zwane

--
function.linuxpower.ca

2002-08-29 13:10:44

by Alan

Subject: Re: [PATCH][2.5.32] CPU frequency and voltage scaling (0/4)

> { min-Hz, max-Hz, policy }
>

For a few of the processors "event-hz" or similar would be nice. The
Geode supports hardware assisted bursting to full processor speed when
doing SMM, I/O and IRQ handling.

2002-08-29 13:34:26

by Dave Jones

Subject: Re: [PATCH][2.5.32] CPU frequency and voltage scaling (0/4)

On Thu, Aug 29, 2002 at 11:53:40AM +0100, Alan Cox wrote:
> > { min-Hz, max-Hz, policy }
> For a few of the processors "event-hz" or similar would be nice. The
> Geode supports hardware assisted bursting to full processor speed when
> doing SMM, I/O and IRQ handling.

If we do implement (for sake of argument) /proc/sys/performance or
whatever, changing the cpufreq interface so it performs 'stacking'
would be a good idea too.
Eg, on a K6-3+ system, we could do cpu scaling and chipset throttling
just by doing an echo "min" > /proc/sys/perf or whatever, and have
the code call 'chains' of performance related features.

Comments?

Dave
--
| Dave Jones. http://www.codemonkey.org.uk
| SuSE Labs

2002-08-29 15:04:55

by Pering, Trevor

Subject: RE: [PATCH][2.5.32] CPU frequency and voltage scaling (0/4)

And now comes the problem with the policy approach -- what to include in the
policy... event-HZ? Temperature? Mem Bus Freq? The list is endless. But,
here are my thoughts on the matter (not sure this adds anything new, but it
helps clarify it for me, at least).

Taking the graphics card analogy... Graphics subsystems have evolved over
many years. Initial implementations were just direct-access-framebuffers,
and in fact, were initially often indirect-access frame-buffers (e.g., the
Apple II didn't even have a linear memory map for the character display,
IIRC). Over time, individual companies would develop optimized hardware and
write libraries for it, and then eventually abstractions like OpenGL or
DirectX appeared -- *after* people knew what was useful and what was not.

In a "well-formed" world, we could probably start off with a very basic
interface, which is what cpufreq has tried to do, and then build policy on
top of that. However, we are already effectively building on top of
abstraction, so things are a little more complicated -- as Linus points out,
just specifying the exact frequency makes no sense anymore.

If you want to continue the graphics analogy, then we're in the position of
trying to write a driver that both handles bitmapped displays as well as
vector plotters. Just allowing direct bitmap access makes *no* sense in this
situation because they are meaningless for a vector plotter, which always
draws lines.

So, if you need to support two sets of graphics drivers, one that provides
draw_point(x,y), and one that provides draw_line(x1,y1,x2,y2) -- what do you
do? I would say you provide draw_line(x1,y1,x2,y2), and which can be reduced
to draw_point(x,y), if necessary.

So, cpufreq is trying to just support set_freq(x), while some processors
_require_ a call in the form of bound_freq(x1,x2). Exact same situation.

Given that, there are still a couple of open questions:

1) About the policy field -- I think this should be as simple as possible...
because the use is either going to be simple, or way too complex to
effectively capture. So, the enumerated policy is probably best. Start
simple, then add other things if absolutely necessary.

2) To use MHz or something else? The problem is that the number here is
virtually meaningless. It does not translate from machine to machine,
processor to processor, or application to application. So, if you have to
pick a meaningless metric, what do you use? I would actually argue for % of
full capacity instead of MHz, but it doesn't really matter in the end.

3) Thermal overloading -- this, I believe, is a separate issue from the
cpufreq setting for things. I would leave this out of the equation, and let
the lower-level components handle this. I.e., think of "cpufreq" as a
suggestion, and if the suggestion would break something, then it is ignored.
If you really wanted, you could have a policy that is something like
"IgnoreThermal" -- but I think that would be silly.

4) The whole "one number describes processor behavior" is also somewhat
silly -- there is the core frequency, memory bus frequency, internal bus
frequency, etc... multipliers, dividers, PLLs -- everywhere! Still not sure
what to do about this, at the moment -- but, I think this might be a
convenient use for the policy field. I.e., in "Performance" state it does
one thing (fastest mem bus freq available), in "Conservation" state it does
another (slowest available). But... (see point #1)... should this be a
separate field or not? Start simple, then build on later, if necessary.

Another way of looking at this is to break the calls up into component
parts:
freq_set_minmax(x,y)
freq_set_exact(x) (same as freq_set_minmax(x,x))
freq_set_policy(p)
(but then there are synchronization issues...)
freq_synchronize()
etc...
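
One way to picture the synchronization issue is to stage each component
call and apply them together. The class below is only a sketch: the method
names follow the list above, but `FreqControl`, its fields and the staging
behaviour are assumptions, not an existing interface.

```python
class FreqControl:
    """Hypothetical staging object for the component calls listed above."""

    def __init__(self):
        # Settings currently in effect; max=None means "no upper bound".
        self.active = {"min": 0, "max": None, "policy": "performance"}
        self._pending = {}

    def freq_set_minmax(self, lo, hi):
        if lo > hi:
            raise ValueError("min must not exceed max")
        self._pending.update(min=lo, max=hi)

    def freq_set_exact(self, x):
        # Same as freq_set_minmax(x, x).
        self.freq_set_minmax(x, x)

    def freq_set_policy(self, p):
        self._pending["policy"] = p

    def freq_synchronize(self):
        # Apply everything staged so far in one step, so the hardware never
        # sees a new range combined with a stale policy (or the reverse).
        self.active = {**self.active, **self._pending}
        self._pending = {}
        return self.active
```

Nothing takes effect until `freq_synchronize()` is called, which is the
price of splitting one triple into separate calls.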

Trevor


-----Original Message-----
From: Alan Cox [mailto:[email protected]]
Sent: Thursday, August 29, 2002 3:54 AM
To: Linus Torvalds
Cc: Dominik Brodowski; cpufreq@http://www.linux.org.uk;
[email protected]
Subject: Re: [PATCH][2.5.32] CPU frequency and voltage scaling (0/4)


> { min-Hz, max-Hz, policy }
>

For a few of the processors "event-hz" or similar would be nice. The
Geode supports hardware assisted bursting to full processor speed when
doing SMM, I/O and IRQ handling.


_______________________________________________
Cpufreq mailing list
Cpufreq@http://www.linux.org.uk
http://www.linux.org.uk/mailman/listinfo/cpufreq

2002-08-29 18:40:56

by Linus Torvalds

Subject: Re: [PATCH][2.5.32] CPU frequency and voltage scaling (0/4)

In article <[email protected]>,
Alan Cox <[email protected]> wrote:
>> { min-Hz, max-Hz, policy }
>>
>
>For a few of the processors "event-hz" or similar would be nice. The
>Geode supports hardware assisted bursting to full processor speed when
>doing SMM, I/O and IRQ handling.

Hmm.. I would assume that you'd just use the high frequency for that?
So, for example, assuming you have a 600/300 Geode, when you do

{ 0, ~0UL, "power-save" }

that would tell the Geode driver to run at 300MHz normally
("power-save"), and at 600MHz when doing critical events.

In contrast, a

{ 0, ~0UL, "performance" }

mode would mean that it always runs at 600MHz (modulo heat throttling,
of course).

And a

{ 300, 300, "power-save" }

means that you want the chip to always run at 300MHz, even when handling
critical events.

I don't know the exact details of what kinds of frequencies the Geode
supports, but it sounds to me like you don't really need another
frequency value..
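
The three cases above can be sketched as a mapping from the triple to a
pair of speeds, one for normal running and one for critical events. This is
only an illustration: `geode_speeds` is a hypothetical name, and the 600/300
speed pair is the assumed example from this message, not actual Geode data.

```python
def geode_speeds(min_mhz, max_mhz, policy, supported=(300, 600)):
    """Return (normal_mhz, event_mhz) for a { min, max, policy } triple."""
    allowed = [s for s in supported if min_mhz <= s <= max_mhz]
    if not allowed:
        raise ValueError("no supported speed in the requested range")
    # Critical events (SMM, I/O, IRQ handling) burst to the top of the range.
    event = max(allowed)
    # Normal running follows the policy hint.
    normal = min(allowed) if policy == "power-save" else max(allowed)
    return normal, event


UNBOUNDED = 2**32 - 1  # stands in for ~0UL

power_save = geode_speeds(0, UNBOUNDED, "power-save")
performance = geode_speeds(0, UNBOUNDED, "performance")
pinned = geode_speeds(300, 300, "power-save")
```

Here the range bounds both speeds, and the policy only decides where in the
range the chip sits when it is not handling an event.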

Linus

2002-08-29 19:16:46

by Alan

Subject: Re: [PATCH][2.5.32] CPU frequency and voltage scaling (0/4)

On Thu, 2002-08-29 at 19:47, Linus Torvalds wrote:
> Hmm.. I would assume that you'd just use the high frequency for that?

That doesn't work if I'm trying to keep the machine running slowly at low
power but still want the I/O to work. I guess it's fudgeable, however, and
the battery policy is enough info.

> I don't know the exact details of what kinds of frequencies the Geode
> supports, but it sounds to me like you don't really need another
> frequency value..

It tops out at about 300MHz. Life gets excessively interesting because
about half the I/O on the Geode is imaginary and caused by SMM traps.
When power saving you normally set the Geode to run slowly but spike
hard to full power when faking hardware. Without that, some stuff like
the SB emulation doesn't work very well.

Alan

2002-08-29 21:18:55

by George Anzinger

Subject: Re: [PATCH][2.5.32] CPU frequency and voltage scaling (0/4)

Linus Torvalds wrote:
>
> In article <[email protected]>,
> Alan Cox <[email protected]> wrote:
> >> { min-Hz, max-Hz, policy }
> >>
> >
> >For a few of the processors "event-hz" or similar would be nice. The
> >Geode supports hardware assisted bursting to full processor speed when
> >doing SMM, I/O and IRQ handling.
>
> Hmm.. I would assume that you'd just use the high frequency for that?
> So, for example, assuming you have a 600/300 Geode, when you do
>
> { 0, ~0UL, "power-save" }
>
> that would tell the Geode driver to run at 300MHz normally
> ("power-save"), and at 600MHz when doing critical events.
>
> In contrast, a
>
> { 0, ~0UL, "performance" }
>
> mode would mean that it always runs at 600MHz (modulo heat throttling,
> of course).
>
> And a
>
> { 300, 300, "power-save" }

How about { 50, 50, "power-save" } where the number refers
to percent of full?
I.e. same meaning IFF full is 600, but suppose it is 800.
>
> means that you want the chip to always run at 300MHz, even when handling
> critical events.
>
> I don't know the exact details of what kinds of frequencies the Geode
> supports, but it sounds to me like you don't really need another
> frequency value..
>
> Linus
> -
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at http://www.tux.org/lkml/

--
George Anzinger [email protected]
High-res-timers:
http://sourceforge.net/projects/high-res-timers/
Preemption patch:
http://www.kernel.org/pub/linux/kernel/people/rml

2002-08-30 00:35:00

by jw schultz

Subject: Re: [PATCH][2.5.32] CPU frequency and voltage scaling (0/4)

On Wed, Aug 28, 2002 at 04:49:45PM -0700, Linus Torvalds wrote:
>
> On 29 Aug 2002, Alan Cox wrote:
> >
> > One of the policies I need from the kernel is "run at the frequency I
> > told you to run". It's a policy; it's not the general-case policy. The
> > /proc file is that policy.
>
> That's ok, but the current code DOES NOT DO THAT.
>
> The current code has no support at all for the notion of policies, and
> gives absolutely _zero_ support for it. It blindly assumes that the CPU
> can (and should) run at one frequency, and as long as it does that, I
> don't want it in the kernel.
>
> > cpufreq is cpu speed control not power management policy. I agree
> > entirely that most people should not be using echo "500" >/proc/... as a
> > power management policy.
> >
> > Likewise /dev/hda is not a file system and people should not be using dd
> > to store their files.
>
> You've had that argument before, and it was bogus then - and it is bogus
> now.
>
> Exactly because some chips _need_ to have the policy passed down, the
> lowest levels need to be able to pass it down.
>
> It is _then_ ok to say that "if you do a 'echo 500 > /proc/cpu/freq', that
> will also imply a policy of a fixed frequency". But if the frequency
> setting code does not allow for any policy interface AT ALL, then it is
> fundamentally broken.
>
> That's my beef with it. We should not have "generic" interfaces that are
> known to be fundamentally broken. As it is, the code - as designed - is
> useless for a growing class of devices.
>
> Think of it as a layering issue:
> - user level policy
> - kernel interface (possibly many - for different policies)
> - low-level driver
>
> Ok?
>
> Now, what the current patches do is (a) one kernel interface (the
> fixed-frequency one) and (b) low-level drivers.
>
> The kernel interface is fine - it doesn't do what I think many people
> might want to do, but it's simple and I agree that other policies can be
> implemented with other interfaces. Fine.
>
> But the fact that low-level drivers don't even support the notion of a
> policy means that they are useless for any other interface. And I'm saying
> that it's a clear design bug, and for no good reason.

As a user (for the sake of argument) i don't want to see an
unnecessary proliferation of interfaces. There is no reason
why the interface cannot be made policy aware from the
beginning even if it starts out only supporting the one
policy of fixed.

Something like 'echo "fixed 500M" >/proc/cpu/freak'
would allow the addition of new policies without having to
add new interfaces. Notice that i give the policy as the
first parameter. Each policy should register itself with
its callbacks (read, write, ?).

Probably should have something like /proc/filesystems
that reports supported policies since supportable
policies would vary according to hardware + emulation.
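
A rough sketch of that idea, with the policy name as the first word of the
written line and each policy registering its own handler. Everything here
is hypothetical: the function names, the `K`/`M`/`G` suffix syntax and the
shape of the returned settings are assumptions for illustration.

```python
_policies = {}  # policy name -> write handler


def register_policy(name, write_handler):
    """Called by each policy to make itself available."""
    _policies[name] = write_handler


def supported_policies():
    """Analogous to reading /proc/filesystems: list what is registered."""
    return sorted(_policies)


def parse_hz(text):
    """Parse values such as '500M' into Hz (suffixes K, M, G)."""
    scale = {"K": 10**3, "M": 10**6, "G": 10**9}
    suffix = text[-1].upper()
    if suffix in scale:
        return int(text[:-1]) * scale[suffix]
    return int(text)


def write(line):
    """Handle one line written to the control file: '<policy> [args...]'."""
    name, *args = line.split()
    if name not in _policies:
        raise ValueError(f"unsupported policy: {name}")
    return _policies[name](args)


def fixed_write(args):
    """Write handler for the 'fixed' policy: exactly one frequency."""
    (value,) = args
    hz = parse_hz(value)
    return {"min_hz": hz, "max_hz": hz, "policy": "fixed"}


register_policy("fixed", fixed_write)
```

Adding a new policy later means one more `register_policy()` call; the
single control file and its parser stay the same.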


--
________________________________________________________________
J.W. Schultz Pegasystems Technologies
email address: [email protected]

Remember Cernan and Schmitt

2002-08-30 03:24:38

by David Lang

Subject: Re: [PATCH][2.5.32] CPU frequency and voltage scaling (0/4)

be careful here, it's not unreasonable to imagine a power saving mode that
shuts down one CPU of an SMP machine.

don't make assumptions about what parameters make sense for all possible
policies, try to find some way for the parameters to depend on the policy
selected.

David Lang


On 29 Aug 2002, Alan Cox wrote:

> Date: 29 Aug 2002 11:53:40 +0100
> From: Alan Cox <[email protected]>
> To: Linus Torvalds <[email protected]>
> Cc: Dominik Brodowski <[email protected]>, cpufreq@http://www.linux.org.uk,
> [email protected]
> Subject: Re: [PATCH][2.5.32] CPU frequency and voltage scaling (0/4)
>
> > { min-Hz, max-Hz, policy }
> >
>
> For a few of the processors "event-hz" or similar would be nice. The
> Geode supports hardware assisted bursting to full processor speed when
> doing SMM, I/O and IRQ handling.

2002-08-30 06:42:21

by David Gibson

Subject: Re: [PATCH][2.5.32] CPU frequency and voltage scaling (0/4)

On Thu, Aug 29, 2002 at 02:22:51PM -0700, george anzinger wrote:
> Linus Torvalds wrote:
> >
> > In article <[email protected]>,
> > Alan Cox <[email protected]> wrote:
> > >> { min-Hz, max-Hz, policy }
> > >>
> > >
> > >For a few of the processors "event-hz" or similar would be nice. The
> > >Geode supports hardware assisted bursting to full processor speed when
> > >doing SMM, I/O and IRQ handling.
> >
> > Hmm.. I would assume that you'd just use the high frequency for that?
> > So, for example, assuming you have a 600/300 Geode, when you do
> >
> > { 0, ~0UL, "power-save" }
> >
> > that would tell the Geode driver to run at 300MHz normally
> > ("power-save"), and at 600MHz when doing critical events.
> >
> > In contrast, a
> >
> > { 0, ~0UL, "performance" }
> >
> > mode would mean that it always runs at 600MHz (modulo heat throttling,
> > of course).
> >
> > And a
> >
> > { 300, 300, "power-save" }
>
> How about { 50, 50, "power-save" } where the number refers
> to percent of full?
> I.e. same meaning IFF full is 600, but suppose it is 800.

Um... how about not. I can't think of a single situation in which
specifying this as a percentage of full speed is useful. It's even
less useful than raw MHz.

--
David Gibson | For every complex problem there is a
[email protected] | solution which is simple, neat and
| wrong.
http://www.ozlabs.org/people/dgibson

2002-08-30 07:49:49

by Helge Hafting

Subject: Re: [PATCH][2.5.32] CPU frequency and voltage scaling (0/4)

george anzinger wrote:

> How about { 50, 50, "power-save" } where the number refers
> to percent of full?
> I.e. same meaning IFF full is 600, but suppose it is 800.

Percentages don't buy you anything. Sure, a new cpu has a
different max setting, but you may get the same problem with your
percentages:

The "old" cpu ran well with 50% for power saving and 100% for
performance. The "new" might want 30% for power to work
well. So, the numbers change anyway.
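
The arithmetic can be made concrete with a short sketch. The 600 and 800 MHz
full speeds are the ones from the quoted example, the 50% and 30% figures
are the illustrative ones above, and `percent_to_mhz` is a hypothetical
helper:

```python
def percent_to_mhz(percent, full_mhz):
    """Convert a percent-of-full setting into MHz (integer arithmetic)."""
    return full_mhz * percent // 100


# (full speed in MHz, percent that suits power saving) per CPU
old_cpu = (600, 50)
new_cpu = (800, 30)

old_powersave_mhz = percent_to_mhz(old_cpu[1], old_cpu[0])
new_powersave_mhz = percent_to_mhz(new_cpu[1], new_cpu[0])
# Carrying the old 50% setting over to the new CPU unchanged:
carried_over_mhz = percent_to_mhz(old_cpu[1], new_cpu[0])
```

Keeping 50% on the new CPU gives 400 MHz rather than the 240 MHz it wants,
so the setting has to be retuned per CPU whichever unit is used.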

Helge Hafting

2002-08-30 07:59:40

by Helge Hafting

Subject: Re: [PATCH][2.5.32] CPU frequency and voltage scaling (0/4)

"Pering, Trevor" wrote:

> 2) To use MHz or something else? The problem is that the number here is
> virtually meaningless. It does not translate from machine to machine,
> processor to processor, or application to application. So, if you have to
> pick a meaningless metric, what do you use? I would actually argue for % of
> full capacity instead of MHz, but it doesn't really matter in the end.

Percentages don't buy you much because they are as meaningless as
MHz numbers, or even more so. Percentages don't translate from
machine to machine either. One machine might find 50% speed
useful for power saving, another might want 33%. A third
one might work fine with 75% to prevent overheating.

An MHz carries more meaning - it is a measurable frequency.
Manufacturers tend to specify numbers in MHz.
Percentage of "full" is more problematic because "full"
isn't that well-defined.

Consider things like overclocking. That isn't merely a
hack - AMD specifies different max speeds for different
temperatures. I.e. they officially support higher
clock speed when using liquid cooling. The speed rating
stored in the cpu is only for the fan-cooling case.

Helge Hafting

2002-08-30 11:49:24

by Dave Jones

Subject: Re: [PATCH][2.5.32] CPU frequency and voltage scaling (0/4)

On Fri, Aug 30, 2002 at 10:04:20AM +0200, Helge Hafting wrote:
> An MHz carries more meaning - it is a measurable frequency.

It's equally meaningless (in fact, less meaningful).
- By your definition my 900MHz VIA C3 is faster than my 800MHz Athlon.
(Clue: It isn't).
- With trickery like AMD's quantispeed ratings, MHz really is a totally
meaningless number when relating to performance of a CPU.
- A MHz rating is only meaningful across the same vendor/family of CPUs.

Getting cpufreq's policy interface into something CPU agnostic therefore
precludes MHz ratings AFAICS.

Dave

--
| Dave Jones. http://www.codemonkey.org.uk
| SuSE Labs

2002-08-30 12:32:49

by Helge Hafting

Subject: Re: [PATCH][2.5.32] CPU frequency and voltage scaling (0/4)

Dave Jones wrote:
>
> On Fri, Aug 30, 2002 at 10:04:20AM +0200, Helge Hafting wrote:
> > An MHz carries more meaning - it is a measurable frequency.
>
> It's equally meaningless (in fact, less meaningful).
> - By your definition my 900MHz VIA C3 is faster than my 800MHz Athlon.
> (Clue: It isn't).
I never said such a thing!
You are right that MHz is useless for telling which
processor is the fastest. But this discussion wasn't about
comparing performance.

It was about:
Should we tell the kernel to run a cpu at "500MHz", or
"50% of max" in order to (save power|avoid overheating|whatever).

In this case MHz is useful - because that's what the manufacturer
specifies. That's what you program into cpu or
motherboard registers, and MHz is what you can measure with an
oscilloscope in order to verify correct operation of the driver.

Percentages don't buy you anything if you replace
the cpu with a different one. The other cpu may of course
have different MHz ratings for "full speed" and "power save|cold
running", but the percentages may very well be different too.
Some run cool at 80%, some at 60%...

Finally, "full speed" is ill-defined. Some AMD chips have different
speed ratings for different operating temperatures.

So, I think MHz is the better choice for setting up
speed policies for cooling and power saving.

> - With trickery like AMD's quantispeed ratings, MHz really is a totally
> meaningless number when relating to performance of a CPU.
> - A MHz rating is only meaningful across the same vendor/family of CPUs.

This is all fine for the purpose of comparing cpu's, but this
isn't about such comparisons. I would never compare an
intel and an amd chip based on frequency, I'd look at how
well they perform what I want them to do.

> Getting cpufreq's policy interface into something CPU agnostic therefore
> precludes MHz ratings AFAICS.
Why? It is not as if cpufreq is being used to tell who
has the faster machine...

Helge Hafting

2002-08-30 22:39:13

by George Anzinger

Subject: Re: [PATCH][2.5.32] CPU frequency and voltage scaling (0/4)

Helge Hafting wrote:
>
> Dave Jones wrote:
> >
> > On Fri, Aug 30, 2002 at 10:04:20AM +0200, Helge Hafting wrote:
> > > An MHz carries more meaning - it is a measurable frequency.
> >
> > It's equally meaningless (in fact, less meaningful).
> > - By your definition my 900MHz VIA C3 is faster than my 800MHz Athlon.
> > (Clue: It isn't).
> I never said such a thing!
> You are right that MHz is useless for telling which
> processor is the fastest. But this discussion wasn't about
> comparing performance.
>
> It was about:
> Should we tell the kernel to run a cpu at "500MHz", or
> "50% of max" in order to (save power|avoid overheating|whatever).
>
> In this case MHz is useful - because that's what the manufacturer
> specifies. That's what you program into cpu or
> motherboard registers, and MHz is what you can measure with an
> oscilloscope in order to verify correct operation of the driver.
>
> Percentages don't buy you anything if you replace
> the cpu with a different one. The other cpu may of course
> have different MHz ratings for "full speed" and "power save|cold
> running", but the percentages may very well be different too.
> Some run cool at 80%, some at 60%...
>
> finally - "full speed" is ill-defined. Some AMD chips have different
> speed ratings for different operating temperatures.
>
> So, I think MHz is the better choice for setting up
> speed policies for cooling and power saving.

If cooling and power saving are what is wanted, why not talk
in terms of degrees or watts? Or are you really trying to
say something like "it is ok to run this much slower to
conserve power/temp/battery"?

I really think we need to push down the power/temp to MHz
conversion to the lowest level. At the user level we should
only be expressing what we want and/or are willing to give
up for it. From this point of view, if you are talking
frequency you are already out of the box.
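
One way to picture that split is the rough sketch below. It is not code
from the posted patches: struct op_point, pick_op_point() and the idea of
a per-CPU table carrying power figures are all assumptions made for
illustration. The caller states only how much speed it is willing to give
up; the low-level code turns that into a frequency.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical operating point; not a structure from the cpufreq patches. */
struct op_point {
    unsigned int khz;        /* core clock at this point */
    unsigned int milliwatts; /* assumed power draw at this point */
};

/*
 * The caller says how much speed it will give up (max_loss_pct, 0..100).
 * Return the index of the lowest-power point that still runs at
 * (100 - max_loss_pct)% of the fastest point or better, or -1 on bad input.
 */
int pick_op_point(const struct op_point *tbl, size_t n, unsigned int max_loss_pct)
{
    size_t i, best = 0;
    unsigned int max_khz = 0;
    unsigned long long floor_khz;
    int found = 0;

    assert(tbl != NULL); /* a NULL table is a caller bug, not a runtime case */
    if (n == 0 || max_loss_pct > 100)
        return -1;

    /* Find the fastest point so the percentage has a reference. */
    for (i = 0; i < n; i++)
        if (tbl[i].khz > max_khz)
            max_khz = tbl[i].khz;

    floor_khz = (unsigned long long)max_khz * (100 - max_loss_pct) / 100;

    /* Among points at or above the floor, keep the one drawing least power. */
    for (i = 0; i < n; i++) {
        if (tbl[i].khz < floor_khz)
            continue;
        if (!found || tbl[i].milliwatts < tbl[best].milliwatts) {
            best = i;
            found = 1;
        }
    }
    return found ? (int)best : -1;
}
```

The frequency never appears in the caller's request; swapping in a
different CPU would mean swapping the table, not the policy.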

-g
>
> > - With trickery like AMD's quantispeed ratings, MHz really is a totally
> > meaningless number when relating to performance of a CPU.
> > - A MHz rating is only meaningful across the same vendor/family of CPUs.
>
> This is all fine for the purpose of comparing cpu's, but this
> isn't about such comparisons. I would never compare an
> intel and an amd chip based on frequency, I'd look at how
> well they perform what I want them to do.
>
> > Getting cpufreq's policy interface into something CPU agnostic therefore
> > precludes MHz ratings AFAICS.
> Why? It is not as if cpufreq is being used to tell who
> has the faster machine...
>
> Helge Hafting

--
George Anzinger [email protected]
High-res-timers:
http://sourceforge.net/projects/high-res-timers/
Preemption patch:
http://www.kernel.org/pub/linux/kernel/people/rml

2002-09-06 20:53:19

by Pavel Machek

Subject: Re: [PATCH][2.5.32] CPU frequency and voltage scaling (0/4)

Hi!

> > #3 Then the cpufreq driver is called to actually set the CPU frequency.
> >
> > #3 is absolutely ready
>
> #3 is _not_ ready, if it doesn't include a "policy" part in addition to
> the frequency. That was what I started off talking about: on some CPU's
> you absolutely do _not_ want to set a hard frequency, you want to tell the
> CPU how to behave (possibly together with a frequency _range_).
>
> Until that is done, no other upper layers can use this low-level
> functionality, since all upper layers would be forced to come up with a
> hard frequency goal.
>
> THAT is the problem. If you want to build infrastructure for upper layers,
> then that infrastructure has to be able to pass down sufficient
> information from those upper layers.

So... would you take a patch that passed a range down to the cpufreq "core"?

Dumb cpus would set speed to upper limit while smart cpus would get all
the info...
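
A minimal sketch of what such a hand-off might look like follows. Every
name in it (freq_range, freq_driver, core_apply_range and the two stub
drivers) is hypothetical and does not come from the cpufreq patches; it
only illustrates the "dumb gets the upper limit, smart gets everything"
idea.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical names throughout; none of these are the real cpufreq API. */
enum freq_policy { POLICY_POWERSAVE, POLICY_PERFORMANCE };

struct freq_range {
    unsigned int min_khz;
    unsigned int max_khz;
    enum freq_policy policy;
};

struct freq_driver {
    /* Non-NULL for "smart" CPUs that choose their own speed in a range. */
    int (*set_range)(const struct freq_range *r);
    /* Fallback for "dumb" CPUs that only accept one fixed frequency. */
    int (*set_khz)(unsigned int khz);
};

/* Core entry point: validate the range, then hand it to the driver. */
int core_apply_range(const struct freq_driver *drv, const struct freq_range *r)
{
    assert(drv != NULL && r != NULL);
    if (r->min_khz > r->max_khz)
        return -1;
    if (drv->set_range)
        return drv->set_range(r);        /* smart: gets all the info */
    if (drv->set_khz)
        return drv->set_khz(r->max_khz); /* dumb: pinned to the upper limit */
    return -1;
}

/* Stub drivers that only record what they were asked to do. */
static unsigned int last_khz;
static struct freq_range last_range;

static int dumb_set_khz(unsigned int khz)
{
    last_khz = khz;
    return 0;
}

static int smart_set_range(const struct freq_range *r)
{
    last_range = *r;
    return 0;
}

const struct freq_driver dumb_driver = { NULL, dumb_set_khz };
const struct freq_driver smart_driver = { smart_set_range, NULL };
```

With a 300000..600000 kHz range, the dumb stub ends up being asked for
600000 kHz, while the smart stub receives the whole range plus the policy.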

Pavel
--
Philips Velo 1: 1"x4"x8", 300gram, 60, 12MB, 40bogomips, linux, mutt,
details at http://atrey.karlin.mff.cuni.cz/~pavel/velo/index.html.