Message-ID: <51E5947F.4090109@linux.intel.com>
Date: Tue, 16 Jul 2013 11:44:15 -0700
From: Arjan van de Ven <arjan@linux.intel.com>
User-Agent: Mozilla/5.0 (Windows NT 6.1; WOW64; rv:17.0) Gecko/20130620 Thunderbird/17.0.7
MIME-Version: 1.0
To: Peter Zijlstra <peterz@infradead.org>
CC: Morten Rasmussen <morten.rasmussen@arm.com>, mingo@kernel.org,
        vincent.guittot@linaro.org, preeti@linux.vnet.ibm.com,
        alex.shi@intel.com, efault@gmx.de, pjt@google.com, len.brown@intel.com,
        corbet@lwn.net, akpm@linux-foundation.org,
        torvalds@linux-foundation.org, tglx@linutronix.de,
        catalin.marinas@arm.com, linux-kernel@vger.kernel.org,
        linaro-kernel@lists.linaro.org
Subject: Re: [RFC][PATCH 0/9] sched: Power scheduler design proposal
References: <1373385338-12983-1-git-send-email-morten.rasmussen@arm.com> <20130713064909.GW25631@dyad.programming.kicks-ass.net> <51E166C8.3000902@linux.intel.com> <20130715195914.GC23818@dyad.programming.kicks-ass.net> <51E45E8B.705@linux.intel.com> <20130715210650.GF23818@dyad.programming.kicks-ass.net> <20130715211230.GG23818@dyad.programming.kicks-ass.net> <51E47D30.5030203@linux.intel.com> <20130716173848.GA22795@dyad.programming.kicks-ass.net>
In-Reply-To: <20130716173848.GA22795@dyad.programming.kicks-ass.net>
Content-Type: text/plain; charset=ISO-8859-1; format=flowed
Content-Transfer-Encoding: 7bit
Sender: linux-kernel-owner@vger.kernel.org
Content-Length: 4282
Lines: 92

On 7/16/2013 10:38 AM, Peter Zijlstra wrote:
> On Mon, Jul 15, 2013 at 03:52:32PM -0700, Arjan van de Ven wrote:
>
>> yeah ondemand does this, but ondemand is actually a pretty bad governor.
>> not because of the sampling, but because of its algorithm.
>
> Is it good for any class of hardware still out there? Or should the thing be
> shot in the head?

for Intel, it's not too bad for anything predating Nehalem.


> You saying AMD patched the thing makes me confused; why would they patch a
> piece of crap?

it's still an improvement over something that's in use ;-)

>
>> HOWEVER, on modern CPUs, even many of the ARM ones, the frequency
>> when you're idle is zero anyway regardless of what you as OS ask for.
>
> Right, entire cores are power gated.
>
> So power wise the voltage you run at is important; so for hardware where lower
> frequencies allow lower voltage, does it still make sense to run the lowest
> possible voltage such that there is still some idle time?
>
> Or is the fact that you're running so much longer negating the power save from
> the lower voltage?

the race-to-idle argument again ;-)

>
>> Every 10 (or 100) milliseconds, ondemand makes a new P state decision.
>> It does this by asking the scheduler the time used, does a delta and
>> ends up at a utilization %age which then goes into a formula.
>> It's not that ondemand samples inbetween decision moments to see if the system
>> is busy or not; the microaccounting that the scheduler does is used instead,
>> and only at decision moments.
>
> OK.. So up to now you've mostly said what you want of the scheduler to make a
> better governor for the new Intel chips.
>
> However a power aware scheduler/balancer needs to interact with the policy as a
> whole; and I got confused by the fact that you never talked about
> raising/lowering speeds. As said there's already a very 'fine' problem where
> the cpufreq interacts with the utilization/runnable accounting we now do.

the interaction is "using the scheduler data using the scheduler provided function".

So I don't just want something that makes sense for today's Intel ;-)
We need something that has an interface that makes sense, where the things
that vary between chip generations/vendors are on the driver side
of the interface, and the things that are generic concepts or generically
useful enough are on the core side of the interface. Hardware has changed,
and hardware will be changing for all vendors for as far as we can even see
into the future, since power matters in the market a lot.
This means we need a level of interface that has some chance of being useful
for at least a while.

What frequency to run at is for me clearly a driver side thing since what
goes into choosing a P state that may translate into a frequency is a hardware
specific choice; the translation from "I need at least this much performance
and be power efficient at that" to a hardware register write is very hardware specific.

Things like "I need more compute capacity" or "This is very performance critical" or
"This is very latency critical" are generic concepts.
As is "behavior is now changed a lot in <this direction>" as a callback kind of thing.
(just as "I no longer need it" is a generic concept to complement the first one)
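
To make the split concrete, here is a rough sketch of what such an interface
could look like. It is only an illustration: every name in it (Hint,
PowerDriver, ToyPStateDriver, handle_hint) is made up for this example and is
not an existing kernel API, and the toy policy is not a proposal. The point is
only that the core side speaks in generic hints while the driver alone decides
what they mean for its P states.

```python
from abc import ABC, abstractmethod
from enum import Enum, auto


class Hint(Enum):
    """Generic, hardware-independent requests from the core side (hypothetical)."""
    NEED_MORE_CAPACITY = auto()    # "I need more compute capacity"
    NO_LONGER_NEEDED = auto()      # "I no longer need it"
    PERFORMANCE_CRITICAL = auto()  # "This is very performance critical"
    LATENCY_CRITICAL = auto()      # "This is very latency critical"


class PowerDriver(ABC):
    """Driver side of the interface: turns generic hints into hardware choices."""

    @abstractmethod
    def handle_hint(self, hint: Hint) -> None:
        ...


class ToyPStateDriver(PowerDriver):
    """Made-up driver that maps hints onto an index into a P-state table.

    Index 0 is assumed to be the fastest state; real hardware may differ.
    """

    def __init__(self, num_pstates: int) -> None:
        self.num_pstates = num_pstates
        self.pstate = num_pstates - 1  # start in the slowest state

    def handle_hint(self, hint: Hint) -> None:
        if hint in (Hint.PERFORMANCE_CRITICAL, Hint.LATENCY_CRITICAL):
            self.pstate = 0  # jump straight to the fastest state
        elif hint is Hint.NEED_MORE_CAPACITY:
            self.pstate = max(0, self.pstate - 1)  # one step faster
        elif hint is Hint.NO_LONGER_NEEDED:
            self.pstate = min(self.num_pstates - 1, self.pstate + 1)  # one step slower
```

A different vendor's driver could implement handle_hint() completely
differently (or ignore some hints) without the core side changing at all.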

The scheduler already has the utilization interfaces that are high enough level
for those who want to use utilization on the driver side to guide their hw decisions
(ondemand does not keep its own utilization, it uses straight scheduler data
for that); the very thin layer that ondemand and co add on top is the
percentage = (usage_at_time_b - usage_at_time_a) / (elapsed time) * 100%
formula so that they can do this over the interval of their choosing.
You can argue that the scheduler can do this; that's for me a small detail that we could
do either way; it's not anything relevant in the big picture.
With intervals being quite variable it might make sense to keep it on the driver side
just because it's hard to put this one formula into a nice interface.
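
For illustration, the thin layer above amounts to something like the sketch
below. The Sample type and the utilization_percent name are invented for this
example; the only assumption is that the scheduler exposes a cumulative
busy-time counter that a driver can read at two moments of its own choosing.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Sample:
    """One reading of a cumulative busy-time counter (hypothetical type)."""
    timestamp_us: int  # when the reading was taken
    busy_us: int       # cumulative busy time reported up to that moment


def utilization_percent(a: Sample, b: Sample) -> float:
    """percentage = (usage_at_time_b - usage_at_time_a) / (elapsed time) * 100"""
    elapsed = b.timestamp_us - a.timestamp_us
    if elapsed <= 0:
        raise ValueError("sample b must be taken after sample a")
    return 100.0 * (b.busy_us - a.busy_us) / elapsed
```

Because the caller picks both samples, the same function works for a 10 ms
window, a 100 ms window, or an irregular one, e.g.
utilization_percent(Sample(0, 0), Sample(10_000, 2_500)) gives 25.0.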

