2006-09-12 03:29:31

by Nick Orlov

Subject: [PATCH 2.6.18-rc6-mm1 0/2] cpufreq: make it harder for cpu to leave "hot" mode

Andrew,

I was playing with the ondemand cpufreq governor (gotta save on electricity
bills one day :) ) and I've noticed that gameplay became somewhat sluggish.
It is especially noticeable in something cpu-power demanding, like quake4.
A quick look at stats/trans_table confirmed that the CPU goes out of "hot"
mode way too often.

To make a long story short - reverting the -mm changes for cpufreq_ondemand.c
helps a _LOT_. I'm not sure if it is something powersave_bias related or
(most probably) due to the alignment of the "next do_dbs_timer() fire time",
which could make the "collect stats" window too short and introduce
significant errors. I have not checked specifically ...

After thinking about the issue for a while I came up with the following tweaks:
First of all, I made the hysteresis a little bit wider (20% instead of 10%).
Another idea was to increase the "sampling period" once the cpu is in "hot" mode.

The second part also has the benefit of reducing the load on an already
overloaded cpu. Plus it's damn trivial. To simplify further testing I have
exposed the "sampling_rate_hot" parameter through sysfs. Setting it to
sampling_rate * 10 works very well for me. Now I do not have to switch the
governor to "performance" during game sessions.

Tested on AMD64x2 (32 bit mode).

Could you please consider the following patches for inclusion in -mm?
They should be applied after reverting the -mm cpufreq_ondemand.c changes.

Thank you,
Nick Orlov

P.S. These are the first patches I'm sending to LKML, so please be patient :)

--
With best wishes,
Nick Orlov.


2006-09-12 03:30:36

by Nick Orlov

Subject: Re: [PATCH 2.6.18-rc6-mm1 1/2] cpufreq: make it harder for cpu to leave "hot" mode

From: Nick Orlov <[email protected]>

Make the hysteresis wider (20% instead of 10%).

Signed-off-by: Nick Orlov <[email protected]>

--- linux-2.6.18-rc6/drivers/cpufreq/cpufreq_ondemand.c 2006-09-11 21:22:50.000000000 -0400
+++ linux-2.6.18-rc6-mm1-nick/drivers/cpufreq/cpufreq_ondemand.c 2006-09-11 20:49:10.000000000 -0400
@@ -25,7 +25,7 @@
*/

#define DEF_FREQUENCY_UP_THRESHOLD (80)
-#define MIN_FREQUENCY_UP_THRESHOLD (11)
+#define MIN_FREQUENCY_UP_THRESHOLD (21)
#define MAX_FREQUENCY_UP_THRESHOLD (100)

/*
@@ -290,12 +315,12 @@
/*
* The optimal frequency is the frequency that is the lowest that
* can support the current CPU usage without triggering the up
- * policy. To be safe, we focus 10 points under the threshold.
+ * policy. To be safe, we focus 20 points under the threshold.
*/
- if (load < (dbs_tuners_ins.up_threshold - 10)) {
+ if (load < (dbs_tuners_ins.up_threshold - 20)) {
unsigned int freq_next;
freq_next = (policy->cur * load) /
- (dbs_tuners_ins.up_threshold - 10);
+ (dbs_tuners_ins.up_threshold - 20);

__cpufreq_driver_target(policy, freq_next, CPUFREQ_RELATION_L);
}
_

--
With best wishes,
Nick Orlov.

2006-09-12 03:31:29

by Nick Orlov

Subject: Re: [PATCH 2.6.18-rc6-mm1 2/2] cpufreq: make it harder for cpu to leave "hot" mode

From: Nick Orlov <[email protected]>

Increase the sampling period if the cpu is running in "hot" mode.
Expose the corresponding knob through sysfs.

Signed-off-by: Nick Orlov <[email protected]>

--- linux-2.6.18-rc6/drivers/cpufreq/cpufreq_ondemand.c 2006-09-11 21:22:50.000000000 -0400
+++ linux-2.6.18-rc6-mm1-5.swp/drivers/cpufreq/cpufreq_ondemand.c 2006-09-11 20:49:10.000000000 -0400
@@ -39,6 +39,7 @@
* All times here are in uS.
*/
static unsigned int def_sampling_rate;
+static unsigned int def_sampling_rate_hot;
#define MIN_SAMPLING_RATE_RATIO (2)
/* for correct statistics, we need at least 10 ticks between each measure */
#define MIN_STAT_SAMPLING_RATE (MIN_SAMPLING_RATE_RATIO * jiffies_to_usecs(10))
@@ -46,6 +47,7 @@
#define MAX_SAMPLING_RATE (500 * def_sampling_rate)
#define DEF_SAMPLING_RATE_LATENCY_MULTIPLIER (1000)
#define TRANSITION_LATENCY_LIMIT (10 * 1000)
+#define DEF_SAMPLING_RATE_HOT_MULTIPLIER (10)

static void do_dbs_timer(void *data);

@@ -74,6 +76,7 @@

struct dbs_tuners {
unsigned int sampling_rate;
+ unsigned int sampling_rate_hot;
unsigned int up_threshold;
unsigned int ignore_nice;
};
@@ -122,6 +125,7 @@
return sprintf(buf, "%u\n", dbs_tuners_ins.object); \
}
show_one(sampling_rate, sampling_rate);
+show_one(sampling_rate_hot, sampling_rate_hot);
show_one(up_threshold, up_threshold);
show_one(ignore_nice_load, ignore_nice);

@@ -144,6 +148,25 @@
return count;
}

+static ssize_t store_sampling_rate_hot(struct cpufreq_policy *unused,
+ const char *buf, size_t count)
+{
+ unsigned int input;
+ int ret;
+ ret = sscanf(buf, "%u", &input);
+
+ mutex_lock(&dbs_mutex);
+ if (ret != 1 || input > MAX_SAMPLING_RATE || input < MIN_SAMPLING_RATE) {
+ mutex_unlock(&dbs_mutex);
+ return -EINVAL;
+ }
+
+ dbs_tuners_ins.sampling_rate_hot = input;
+ mutex_unlock(&dbs_mutex);
+
+ return count;
+}
+
static ssize_t store_up_threshold(struct cpufreq_policy *unused,
const char *buf, size_t count)
{
@@ -203,6 +226,7 @@
__ATTR(_name, 0644, show_##_name, store_##_name)

define_one_rw(sampling_rate);
+define_one_rw(sampling_rate_hot);
define_one_rw(up_threshold);
define_one_rw(ignore_nice_load);

@@ -210,6 +234,7 @@
&sampling_rate_max.attr,
&sampling_rate_min.attr,
&sampling_rate.attr,
+ &sampling_rate_hot.attr,
&up_threshold.attr,
&ignore_nice_load.attr,
NULL
@@ -305,6 +330,8 @@
{
unsigned int cpu = smp_processor_id();
struct cpu_dbs_info_s *dbs_info = &per_cpu(cpu_dbs_info, cpu);
+ struct cpufreq_policy *policy;
+ unsigned int sampling_rate;

if (!dbs_info->enable)
return;
@@ -312,8 +339,14 @@
lock_cpu_hotplug();
dbs_check_cpu(dbs_info);
unlock_cpu_hotplug();
+
+ policy = dbs_info->cur_policy;
+ sampling_rate = (policy->cur == policy->max)
+ ? dbs_tuners_ins.sampling_rate_hot
+ : dbs_tuners_ins.sampling_rate;
+
queue_delayed_work_on(cpu, kondemand_wq, &dbs_info->work,
- usecs_to_jiffies(dbs_tuners_ins.sampling_rate));
+ usecs_to_jiffies(sampling_rate));
}

static inline void dbs_timer_init(unsigned int cpu)
@@ -394,7 +427,14 @@
if (def_sampling_rate < MIN_STAT_SAMPLING_RATE)
def_sampling_rate = MIN_STAT_SAMPLING_RATE;

+ def_sampling_rate_hot = def_sampling_rate *
+ DEF_SAMPLING_RATE_HOT_MULTIPLIER;
+
+ WARN_ON(def_sampling_rate_hot > MAX_SAMPLING_RATE);
+
dbs_tuners_ins.sampling_rate = def_sampling_rate;
+ dbs_tuners_ins.sampling_rate_hot =
+ def_sampling_rate_hot;
}
dbs_timer_init(policy->cpu);

_
--
With best wishes,
Nick Orlov.

2006-09-12 20:20:38

by Pallipadi, Venkatesh

Subject: Re: [PATCH 2.6.18-rc6-mm1 0/2] cpufreq: make it harder for cpu to leave "hot" mode

On Mon, Sep 11, 2006 at 11:29:24PM -0400, Nick Orlov wrote:
> Andrew,
>
> I was playing with the ondemand cpufreq governor (gotta save on electricity
> bills one day :) ) and I've noticed that gameplay became somewhat sluggish.
> It is especially noticeable in something cpu-power demanding, like quake4.
> A quick look at stats/trans_table confirmed that the CPU goes out of "hot"
> mode way too often.
>

The ondemand governor adjusts the sampling rate depending on the frequency
transition latency of the CPU. Say the frequency transition latency for a CPU
is 100uS; it will sample at most once every 100mS, and by that rule the
performance impact of the transitions alone should not be more than 0.1%.
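
To spell that arithmetic out (a minimal standalone sketch; the 1000x factor
is DEF_SAMPLING_RATE_LATENCY_MULTIPLIER from cpufreq_ondemand.c, and the
100uS latency is just the example value above):

#include <stdio.h>

int main(void)
{
	/* driver-reported frequency transition latency, in microseconds */
	unsigned int transition_latency_us = 100;
	/* DEF_SAMPLING_RATE_LATENCY_MULTIPLIER in cpufreq_ondemand.c */
	unsigned int latency_multiplier = 1000;

	unsigned int sampling_rate_us =
		transition_latency_us * latency_multiplier;

	/* 100 us * 1000 = 100,000 us = 100 ms between samples */
	printf("sampling period: %u us\n", sampling_rate_us);

	/* worst case, one transition per sample: 100 us / 100 ms = 0.1% */
	printf("transition overhead bound: %.3f%%\n",
	       100.0 * transition_latency_us / sampling_rate_us);
	return 0;
}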

There will be a performance impact, however, if it chooses a lower frequency
and the CPU becomes 100% busy immediately after that. That will be an issue
with any history-based speculation. One can have a workload that will fool
the algorithm into taking the wrong decision every time.

Having said that, I am curious to see the actual numbers that you are seeing
in different cases:
- The transition latency for frequency switching on your CPU.
- The default ondemand sampling rate on your system.
- The different frequencies supported on your system.
- I guess you have a dual core CPU. Whether they are changing frequencies
correctly at the same time, or whether changing the frequency of one CPU is
wrongly affecting the other CPU.
- The trans_table for the load with and without the -mm changes for ondemand.

> To make a long story short - reverting the -mm changes for
> cpufreq_ondemand.c helps a _LOT_. I'm not sure if it is something
> powersave_bias related or (most probably) due to the alignment of the
> "next do_dbs_timer() fire time", which could make the "collect stats"
> window too short and introduce significant errors.
> I have not checked specifically ...

The change in -mm should not change anything related to the sampling interval.
If it is indeed doing something like that, then it is a bug. Can you please
check sampling_rate (under sysfs, e.g.
/sys/devices/system/cpu/cpu0/cpufreq/ondemand/sampling_rate) with and without
the -mm changes? That will tell us how frequently the kernel is checking the
stats.

>
> After thinking about the issue for a while I came up with the following
> tweaks:
> First of all, I made the hysteresis a little bit wider (20% instead of 10%).

This is a heuristic, and this change will make ondemand more conservative
in some cases; ondemand will not be able to reduce the frequency and will
hence end up consuming more power.

If this is really needed, then it should be a tunable in ondemand rather than
a new absolute value. As this only changes the next frequency to be one that
keeps the CPU 60% busy rather than 70% busy, and these two frequencies must
be close to each other anyway, I don't think this can cause performance
degradation due to a wrong/low freq.

> Another idea was to increase the "sampling period" once the cpu is in
> "hot" mode.
>
> The second part also has the benefit of reducing the load on an already
> overloaded cpu. Plus it's damn trivial. To simplify further testing I have
> exposed the "sampling_rate_hot" parameter through sysfs. Setting it to
> sampling_rate * 10 works very well for me. Now I do not have to switch the
> governor to "performance" during game sessions.
>

Again, I don't think frequent checking in ondemand is a bad thing, as it
allows ondemand to be aggressive, save as much power as possible, and also
have a quick response time for increased load. If we really have an issue
with the sampling rate, it is possibly due to a wrong transition latency
advertised by the driver, and we are wasting more time doing transitions
than ondemand thinks it is spending.

If the sampling rate is indeed too high for your system/workload, you should
be able to get the same result by just increasing the sampling_rate in sysfs
cpufreq/ondemand rather than by adding a new tunable.

Thanks,
Venki

2006-09-14 04:48:23

by Nick Orlov

Subject: Re: [PATCH 2.6.18-rc6-mm1 0/2] cpufreq: make it harder for cpu to leave "hot" mode

On Tue, Sep 12, 2006 at 4:20:57PM EST, Venkatesh Pallipadi wrote:
>
> On Mon, Sep 11, 2006 at 11:29:24PM -0400, Nick Orlov wrote:
>>
>> I was playing with the ondemand cpufreq governor (gotta save on electricity
>> bills one day :) ) and I've noticed that gameplay became somewhat sluggish.
>> It is especially noticeable in something cpu-power demanding, like quake4.
>> A quick look at stats/trans_table confirmed that the CPU goes out of "hot"
>> mode way too often.
>>
>
> The ondemand governor adjusts the sampling rate depending on the frequency
> transition latency of the CPU. Say the frequency transition latency for a
> CPU is 100uS; it will sample at most once every 100mS, and by that rule the
> performance impact of the transitions alone should not be more than 0.1%.
>
> There will be a performance impact, however, if it chooses a lower
> frequency and the CPU becomes 100% busy immediately after that.

I think I've found the real bug. When we are slowing down, after determining
the desired frequency we are asking for the closest supported freq _below_
(CPUFREQ_RELATION_L vs CPUFREQ_RELATION_H). This is definitely going to make
the freq jump back and forth in some cases (under steady load).

Imagine the following: the CPU supports 1.0GHz, 1.8GHz and 2.0GHz (my case).
Let's say the load corresponds to 50% when the CPU is "hot" (running at
2.0GHz), and let's say we start at 1.0GHz. The load is 100%, so we go to
2.0GHz immediately.

Now the load is 50%, so we are trying to slow down. The desired freq is
~1.4GHz. And instead of selecting 1.8 and staying there, we select 1.0.
The load is 100% again and we are back at stage 1...

Am I missing something? Is this the desired behavior?
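
To spell out the arithmetic behind the example (a sketch; the "- 10" and the
default up_threshold of 80 come from the dbs_check_cpu() code quoted in patch
1/2 above):

#include <stdio.h>

int main(void)
{
	unsigned int cur = 2000000;	/* kHz, running "hot" at 2.0GHz */
	unsigned int load = 50;		/* percent */
	unsigned int up_threshold = 80;	/* DEF_FREQUENCY_UP_THRESHOLD */

	/* freq_next as computed in dbs_check_cpu() */
	unsigned int freq_next = (cur * load) / (up_threshold - 10);

	/* 2000000 * 50 / 70 ~= 1428571 kHz, i.e. ~1.4GHz; the question is
	 * whether the driver then rounds this up to 1.8GHz or down to
	 * 1.0GHz */
	printf("freq_next = %u kHz\n", freq_next);
	return 0;
}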


> That will be an issue with any history-based speculation. One can have a
> workload that will fool the algorithm into taking the wrong decision every
> time.
>
> Having said that, I am curious to see the actual numbers that you are
> seeing in different cases:
> - The transition latency for frequency switching on your CPU.

How exactly can I check it? (I mean, I can printk it, but is it exported
somewhere already?)

> - Default ondemand sampling rate on your system.

after rmmod/modprobe cpufreq_ondemand it is 1240000
(I'm assuming it means ~1.2 secs, which seems unexpectedly long)

> - Different frequencies supported on your system.

1.0, 1.8, 2.0 GHz

> - I guess you have a dual core CPU.

Yep, an E6 Amd64x2 3800+ (looks like I am the [un]lucky one who got a CPU
born to be a Toledo and sold as a Manchester :) )

> Whether they are changing frequencies correctly at the same time, or
> whether changing the frequency of one CPU is wrongly affecting the other
> CPU.

Well, I'm assuming it got things right - cpu1/cpufreq is a symlink to
cpu0/cpufreq ... cpu0/cpufreq/affected_cpus contains "0 1"

> - The trans_table for the load with and without the -mm changes for
> ondemand.
>
>> To make a long story short - reverting the -mm changes for
>> cpufreq_ondemand.c helps a _LOT_. I'm not sure if it is something
>> powersave_bias related or (most probably) due to the alignment of the
>> "next do_dbs_timer() fire time", which could make the "collect stats"
>> window too short and introduce significant errors.
>> I have not checked specifically ...
>
> The change in -mm should not change anything related to the sampling
> interval.

I agree; from looking at the changes, it should not. I just wanted to
eliminate an extra variable from the equation (I'm not using powersave_bias,
so it was "bullet-proof" to just revert the changes).

I'm speculating here, of course. I did not run comprehensive tests and
did not perform complex/long measurements. I just reverted the changes,
recompiled the kernel and played q4 a little bit. It definitely _felt_ less
sluggish (which is a purely subjective thing, same as "responsiveness").
I took a look at the trans_table and the numbers were lower (not by an order
of magnitude, but lower). But since each run is unique, it is not something
which "proves" things 100%.

> If it is indeed doing something like that, then it is a bug. Can you please
> check sampling_rate (under sysfs cpufreq/ondemand) with and without the -mm
> changes?

I do not see how it can be affected by the -mm changes (I have reverted the
cpufreq_ondemand.c changes only), but if you are still interested I will
reboot and check the value.

> That will tell us how frequently the kernel is checking the stats.
>
>>
>> After thinking about the issue for a while I came up with the following
>> tweaks:
>> First of all, I made the hysteresis a little bit wider (20% instead of 10%).
>
> This is a heuristic, and this change will make ondemand more conservative
> in some cases; ondemand will not be able to reduce the frequency and will
> hence end up consuming more power.

Totally agree. My objective was to make it "deliver power" as quickly as
possible, and I was ready to accept increased power consumption as long as
an "idle" CPU eventually goes into the most power-saving mode. (Isn't that
the primary difference between the conservative and ondemand governors, btw?
Warm up quicker, cool down somewhat slower...)

>
> If this is really needed, then it should be a tunable in ondemand rather
> than a new absolute value. As this only changes the next frequency to be
> one that keeps the CPU 60% busy rather than 70% busy, and these two
> frequencies must be close to each other anyway, I don't think this can
> cause performance degradation due to a wrong/low freq.
>
>> Another idea was to increase the "sampling period" once the cpu is in
>> "hot" mode.
>>
>> The second part also has the benefit of reducing the load on an already
>> overloaded cpu. Plus it's damn trivial. To simplify further testing I have
>> exposed the "sampling_rate_hot" parameter through sysfs. Setting it to
>> sampling_rate * 10 works very well for me. Now I do not have to switch the
>> governor to "performance" during game sessions.
>>
>
> Again, I don't think frequent checking in ondemand is a bad thing, as it
> allows ondemand to be aggressive, save as much power as possible, and also
> have a quick response time for increased load.

The point was that we are "hot" already, so the only way out is to slow down.
Which (IMHO) can wait...

> If we really have an issue with the sampling rate, it is possibly due to a
> wrong transition latency advertised by the driver, and we are wasting more
> time doing transitions than ondemand thinks it is spending.
>
> If the sampling rate is indeed too high for your system/workload, you
> should be able to get the same result by just increasing sampling_rate in
> sysfs cpufreq/ondemand rather than by adding a new tunable.

It is not about the "transition rate". It is about the fact that we are
leaving "full-speed" mode (which is somewhat unexpected under q4 load).
I was just hoping that by increasing the sampling period I'm making the "load
measurements" more reliable... On the other side, I did not want to affect
the "warm up" paths. I do want the CPU to speed up as quickly as possible
when needed.

I'm definitely not an expert in power management and I'm not an experienced
kernel hacker. I'm merely scratching the surface.
A lot of things are still not clear to me.
For example:

1. What are the "real-world" latencies for the execution of things scheduled
   through the work-queue? Can these latencies be as high as 1 sec on a
   ~1GHz CPU, for example? Under load?

   If execution can be delayed that much, then the "delay -= jiffies % delay;"
   alignment can potentially make the measurement period way too short
   (see the sketch after point 2)...

2. What if the process which produces most of the load migrates to another
   core in the middle of the sampling period? We will end up with 50%
   load on each core, but that does not mean we can still handle this
   load at a frequency twice as low... I'm pretty sure we can find an
   elegant "next best thing" solution. I just have not found a feasible
   one yet. Still thinking.
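
Here is a toy illustration of the alignment concern from point 1 (a sketch
with made-up numbers; HZ = 250 is only an example):

#include <stdio.h>

int main(void)
{
	/* sampling period in jiffies (1 sec at HZ = 250) */
	unsigned long delay = 250;
	/* hypothetical current jiffies value when the work item runs */
	unsigned long jiffies_now = 1237;

	/* the "delay -= jiffies % delay;" alignment quoted above */
	delay -= jiffies_now % delay;	/* 250 - 237 = 13 jiffies */

	/* if the previous sample ran late, the next one can fire only
	 * 13 jiffies (~52ms at HZ = 250) later, so its "collect stats"
	 * window is ~1/20 of the intended period */
	printf("next delay: %lu jiffies\n", delay);
	return 0;
}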

Thank you,
Nick Orlov.

P.S. Could you please CC me (I'm not subscribed to the list)

--
With best wishes,
Nick Orlov.

2006-09-14 21:38:26

by Andrew Morton

Subject: Re: [PATCH 2.6.18-rc6-mm1 0/2] cpufreq: make it harder for cpu to leave "hot" mode

On Mon, 11 Sep 2006 23:29:24 -0400
Nick Orlov <[email protected]> wrote:

> I was playing with the ondemand cpufreq governor (gotta save on electricity
> bills one day :) ) and I've noticed that gameplay became somewhat sluggish.
> It is especially noticeable in something cpu-power demanding, like quake4.
> A quick look at stats/trans_table confirmed that the CPU goes out of "hot"
> mode way too often.

I won't apply any of this in view of your ongoing discussion with
Venkatesh, but thanks for the support and help - cpufreq is, shall
we say, a rich source of future kernel improvements.

Next time you prepare a patch series please ensure that it uses a different
Subject: for each patch, as per
http://www.zip.com.au/~akpm/linux/patches/stuff/tpp.txt, section 2.


2006-09-16 00:33:37

by Pallipadi, Venkatesh

Subject: RE: [PATCH 2.6.18-rc6-mm1 0/2] cpufreq: make it harder for cpu to leave "hot" mode



>-----Original Message-----
>From: Nick Orlov [mailto:[email protected]]
>Sent: Wednesday, September 13, 2006 9:48 PM
>To: linux-kernel
>Cc: Pallipadi, Venkatesh
>Subject: Re: [PATCH 2.6.18-rc6-mm1 0/2] cpufreq: make it harder for cpu to leave "hot" mode
>
>On Tue, Sep 12, 2006 at 4:20:57PM EST, Venkatesh Pallipadi wrote:
>>
>> The ondemand governor adjusts the sampling rate depending on the frequency
>> transition latency of the CPU. Say the frequency transition latency for a
>> CPU is 100uS; it will sample at most once every 100mS, and by that rule
>> the performance impact of the transitions alone should not be more than
>> 0.1%.
>>
>> There will be a performance impact, however, if it chooses a lower
>> frequency and the CPU becomes 100% busy immediately after that.
>
>I think I've found the real bug. When we are slowing down, after determining
>the desired frequency we are asking for the closest supported freq _below_
>(CPUFREQ_RELATION_L vs CPUFREQ_RELATION_H). This is definitely going to make
>the freq jump back and forth in some cases (under steady load).
>
>Am I missing something? Is this the desired behavior?
>
>

RELATION_L and RELATION_H are somewhat confusing. RELATION_L stands for the
lowest frequency which is at least as high as the requested value - that is,
the next value greater than or equal to the given one. So there should not
be an issue here.
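
To illustrate with the frequency table from your system (a sketch; the real
lookup happens in cpufreq_frequency_table_target(), this only shows the
RELATION_L rule):

#include <stdio.h>

/* supported frequencies in kHz: 1.0GHz, 1.8GHz, 2.0GHz */
static const unsigned int table[] = { 1000000, 1800000, 2000000 };
#define NFREQS (sizeof(table) / sizeof(table[0]))

/* RELATION_L: the lowest frequency at or above the target */
static unsigned int relation_l(unsigned int target)
{
	unsigned int i;

	for (i = 0; i < NFREQS; i++)
		if (table[i] >= target)
			return table[i];
	return table[NFREQS - 1];	/* clamp to the highest */
}

int main(void)
{
	/* the ~1.4GHz request from your example picks 1.8GHz, not 1.0GHz */
	printf("RELATION_L(1428571) = %u kHz\n", relation_l(1428571));
	return 0;
}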


>> That will be an issue with any history-based speculation. One can have a
>> workload that will fool the algorithm into taking the wrong decision
>> every time.
>>
>> Having said that, I am curious to see the actual numbers that you are
>> seeing in different cases:
>> - The transition latency for frequency switching on your CPU.
>
>How exactly can I check it? (I mean, I can printk it, but is it exported
>somewhere already?)
>
>> - The default ondemand sampling rate on your system.
>
>after rmmod/modprobe cpufreq_ondemand it is 1240000
>(I'm assuming it means ~1.2 secs, which seems unexpectedly long)

It is 1.24 seconds. The sampling rate is calculated from the transition
latency and the jiffy rate. The transition latency on this CPU seems to be
1.24 milliseconds (1.24mS times the 1000x latency multiplier gives 1.24s).

>> - I guess you have a dual core CPU.
>
>Yep, an E6 Amd64x2 3800+ (looks like I am the [un]lucky one who got a CPU
>born to be a Toledo and sold as a Manchester :) )
>
>> Whether they are changing frequencies correctly at the same time, or
>> whether changing the frequency of one CPU is wrongly affecting the other
>> CPU.
>
>Well, I'm assuming it got things right - cpu1/cpufreq is a symlink to
>cpu0/cpufreq ... cpu0/cpufreq/affected_cpus contains "0 1"
>

Hmm. You have software coordination enabled, in which case resyncing the
sampling on all CPUs at the same time should not matter at all. With
software coordination there is only one CPU doing the utilization check
on both cores and taking both cores to the appropriate frequency.

>> - The trans_table for the load with and without the -mm changes for
>> ondemand.
>>
>>> To make a long story short - reverting the -mm changes for
>>> cpufreq_ondemand.c helps a _LOT_. I'm not sure if it is something
>>> powersave_bias related or (most probably) due to the alignment of the
>>> "next do_dbs_timer() fire time", which could make the "collect stats"
>>> window too short and introduce significant errors.
>>> I have not checked specifically ...
>>
>> The change in -mm should not change anything related to the sampling
>> interval.
>
>I agree; from looking at the changes, it should not. I just wanted to
>eliminate an extra variable from the equation (I'm not using powersave_bias,
>so it was "bullet-proof" to just revert the changes).
>
>I'm speculating here, of course. I did not run comprehensive tests and
>did not perform complex/long measurements. I just reverted the changes,
>recompiled the kernel and played q4 a little bit. It definitely _felt_ less
>sluggish (which is a purely subjective thing, same as "responsiveness").
>I took a look at the trans_table and the numbers were lower (not by an
>order of magnitude, but lower). But since each run is unique, it is not
>something which "proves" things 100%.

Yes. It would be nice to have some workload and measure the difference.
As you have narrowed it down to this patch, I guess there is something
going wrong with the patch. We just have to figure out why and how.

>> This is a heuristic, and this change will make ondemand more conservative
>> in some cases; ondemand will not be able to reduce the frequency and will
>> hence end up consuming more power.
>
>Totally agree. My objective was to make it "deliver power" as quickly as
>possible, and I was ready to accept increased power consumption as long as
>an "idle" CPU eventually goes into the most power-saving mode. (Isn't that
>the primary difference between the conservative and ondemand governors,
>btw? Warm up quicker, cool down somewhat slower...)
>

First, I don't think it will be a major impact if we sample at 12 sec as
opposed to 1.2 sec. As you mentioned, 1.2 sec is already a lot of time from
the CPU's perspective and should not be causing any jitters as such. My
guess is that with 1.2 sec the CPU changes frequency so rarely that the
possibility of seeing the issue in response time is much lower.
But I agree with your logic of the frequency going down slowly. In fact
this is a full circle for ondemand. Ondemand, when it was first added in
2.6.9, had sampling for frequency decrease happening at 1/10 the rate of
sampling for frequency increase. In fact that option, called
sampling_down_factor, was in ondemand as late as 2.6.14. But, as we saw,
we were losing opportunities to save power with this option, and hence
moved to the same sampling frequency for up and down.

Conservative is slightly different in that it conservatively
increases/decreases the frequency one P-state at a time and not across
multiple P-states.

>> If we really have an issue with the sampling rate, it is possibly due to
>> a wrong transition latency advertised by the driver, and we are wasting
>> more time doing transitions than ondemand thinks it is spending.
>>
>> If the sampling rate is indeed too high for your system/workload, you
>> should be able to get the same result by just increasing sampling_rate
>> in sysfs cpufreq/ondemand rather than by adding a new tunable.
>
>It is not about the "transition rate". It is about the fact that we are
>leaving "full-speed" mode (which is somewhat unexpected under q4 load).
>I was just hoping that by increasing the sampling period I'm making the
>"load measurements" more reliable... On the other side, I did not want to
>affect the "warm up" paths. I do want the CPU to speed up as quickly as
>possible when needed.
>

I understand. But that will make ondemand less responsive in conserving
power in environments where the CPU is loaded at a steady rate, say 60% or
so, over say a 1 second interval. As I said, we did try this option earlier,
and found issues with how much power we can save that way.

>I'm definitely not an expert in power management and I'm not an experienced
>kernel hacker. I'm merely scratching the surface.

Well.. You will be by the time we resolve this issue ;)

>A lot of things are still not clear to me.
>For example:
>
>1. What are the "real-world" latencies for the execution of things scheduled
>   through the work-queue? Can these latencies be as high as 1 sec on a
>   ~1GHz CPU, for example? Under load?
>
>   If execution can be delayed that much, then the "delay -= jiffies % delay;"
>   alignment can potentially make the measurement period way too short...
>

Yes. This is what I thought was happening in your case. But with 1.24
second sampling I doubt this is happening.

>2. What if the process which produces most of the load migrates to another
>   core in the middle of the sampling period? We will end up with 50%
>   load on each core, but that does not mean we can still handle this
>   load at a frequency twice as low... I'm pretty sure we can find an
>   elegant "next best thing" solution. I just have not found a feasible
>   one yet. Still thinking.
>

Yes. This is a known issue right now. The only solution for this is to
let the frequency adapt to the load quickly enough, or to only change the
frequency when all the cores are totally idle.
Most platforms that ondemand runs on have a sampling rate of 100mS or
less, which is comparable to load balancing when busy, and hence we
should not see this issue.

But if the sampling latency is > 1s, we can indeed be seeing this issue. A
simple test is to run a kernel make in a single thread, with one core and
with two cores enabled. You will see a noticeable slowdown due to this
issue if the frequency is changed that infrequently.

We need someone familiar with these CPUs who can help us understand this
issue. As you can guess, I am not an expert on these CPUs :). Copying
Jacob to see whether he has seen any issues with ondemand.

Thanks,
Venki