I noticed on my machine that the "conservative" cpufreq governor wasn't
working properly in v2.6.31.1 or Linus' latest tree, but it worked fine on
v2.6.30.8, so I decided I should figure out where this issue was coming
from. The issue is pretty clear...
Here's the expected:
cpufrequtils 005: cpufreq-info (C) Dominik Brodowski 2004-2006
Report errors and bugs to [email protected], please.
analyzing CPU 0:
driver: acpi-cpufreq
CPUs which need to switch frequency at the same time: 0
hardware limits: 1000 MHz - 2.33 GHz
available frequency steps: 2.33 GHz, 2.17 GHz, 2.00 GHz, 1.83 GHz, 1.67 GHz, 1.50 GHz, 1.33 GHz, 1000 MHz
available cpufreq governors: ondemand, userspace, powersave, conservative, performance
current policy: frequency should be within 1000 MHz and 2.33 GHz.
The governor "conservative" may decide which speed to use
within this range.
current CPU frequency is 1000 MHz (asserted by call to hardware).
cpufreq stats: 2.33 GHz:0.59%, 2.17 GHz:1.41%, 2.00 GHz:0.88%, 1.83 GHz:1.22%, 1.67 GHz:0.88%, 1.50 GHz:1.41%, 1.33 GHz:10.98%, 1000 MHz:82.63% (33)
analyzing CPU 1:
driver: acpi-cpufreq
CPUs which need to switch frequency at the same time: 1
hardware limits: 1000 MHz - 2.33 GHz
available frequency steps: 2.33 GHz, 2.17 GHz, 2.00 GHz, 1.83 GHz, 1.67 GHz, 1.50 GHz, 1.33 GHz, 1000 MHz
available cpufreq governors: ondemand, userspace, powersave, conservative, performance
current policy: frequency should be within 1000 MHz and 2.33 GHz.
The governor "conservative" may decide which speed to use
within this range.
current CPU frequency is 1000 MHz (asserted by call to hardware).
cpufreq stats: 2.33 GHz:0.40%, 2.17 GHz:0.16%, 2.00 GHz:0.16%, 1.83 GHz:0.35%, 1.67 GHz:0.16%, 1.50 GHz:0.35%, 1.33 GHz:0.16%, 1000 MHz:98.27% (7)
And here is the broken version (note the 'cpufreq stats' line):
cpufrequtils 005: cpufreq-info (C) Dominik Brodowski 2004-2006
Report errors and bugs to [email protected], please.
analyzing CPU 0:
driver: acpi-cpufreq
CPUs which need to switch frequency at the same time: 0
hardware limits: 1000 MHz - 2.33 GHz
available frequency steps: 2.33 GHz, 2.17 GHz, 2.00 GHz, 1.83 GHz, 1.67 GHz, 1.50 GHz, 1.33 GHz, 1000 MHz
available cpufreq governors: ondemand, userspace, powersave, conservative, performance
current policy: frequency should be within 1000 MHz and 2.33 GHz.
The governor "conservative" may decide which speed to use
within this range.
current CPU frequency is 2.33 GHz (asserted by call to hardware).
cpufreq stats: 2.33 GHz:100.00%, 2.17 GHz:0.00%, 2.00 GHz:0.00%, 1.83 GHz:0.00%, 1.67 GHz:0.00%, 1.50 GHz:0.00%, 1.33 GHz:0.00%, 1000 MHz:0.00%
analyzing CPU 1:
driver: acpi-cpufreq
CPUs which need to switch frequency at the same time: 1
hardware limits: 1000 MHz - 2.33 GHz
available frequency steps: 2.33 GHz, 2.17 GHz, 2.00 GHz, 1.83 GHz, 1.67 GHz, 1.50 GHz, 1.33 GHz, 1000 MHz
available cpufreq governors: ondemand, userspace, powersave, conservative, performance
current policy: frequency should be within 1000 MHz and 2.33 GHz.
The governor "conservative" may decide which speed to use
within this range.
current CPU frequency is 2.33 GHz (asserted by call to hardware).
cpufreq stats: 2.33 GHz:100.00%, 2.17 GHz:0.00%, 2.00 GHz:0.00%, 1.83 GHz:0.00%, 1.67 GHz:0.00%, 1.50 GHz:0.00%, 1.33 GHz:0.00%, 1000 MHz:0.00%
So basically, it just never clocks down from the maximum frequency.
Here's the bisection log:
# bad: [2147b209] Linux 2.6.31.1
# good: [a1c4c06a] Linux 2.6.30.8
# good: [07a2039b] Linux 2.6.30
# good: [452dac45] V4L/DVB (11761): dvb-ttpci: Fixed VIDEO_SLOWMOTION
# bad: [906e8d97] e1000e: delay second read of PHY_STATUS register o
# good: [36e84467] Staging: heci: fix userspace pointer mess
# bad: [df36b439] Merge branch 'for-2.6.31' of git://git.linux-nfs.o
# skip: [12e24f34] Merge branch 'perfcounters-fixes-for-linus' of git
# good: [48c93112] powerpc: Fix invalid construct in our CPU selectio
# bad: [eca41044] n_r3964: fix lock imbalance
# good: [93db6294] Merge branch 'for-linus' of git://git.kernel.org/p
# bad: [1eb51c33] Merge branch 'sched-fixes-for-linus' of git://git.
# good: [1d991001] Merge branch 'x86/mce3' into x86/urgent
# good: [71e308a2] function-graph: add stack frame test
# bad: [38df92b8] Merge branch 'timers-fixes-for-linus' of git://git
# good: [ad5cf46b] Merge git://git.kernel.org/pub/scm/linux/kernel/gi
# good: [7fd5b632] Merge branch 'for-linus' of git://git.monstr.eu/li
# good: [c4c5ab30] Merge branch 'x86-fixes-for-linus' of git://git.ke
# bad: [f2e21c96] NOHZ: Properly feed cpufreq ondemand governor
And finally, the commit that broke "conservative":
commit f2e21c9610991e95621a81407cdbab881226419b
Author: Eero Nurkkala <[email protected]>
Date: Mon May 25 09:57:37 2009 +0300
NOHZ: Properly feed cpufreq ondemand governor
A call from irq_exit() may occasionally pause the timing
info for cpufreq ondemand governor. This results in the
cpufreq ondemand governor to fail to calculate the
system load properly. Thus, relocate the checks for this
particular case to keep the governor always functional.
Signed-off-by: Eero Nurkkala <[email protected]>
Reported-by: Tero Kristo <[email protected]>
Acked-by: Rik van Riel <[email protected]>
Acked-by: Venkatesh Pallipadi <[email protected]>
Signed-off-by: Thomas Gleixner <[email protected]>
I'd work on fixing it myself and whip up a patch, but I'm going to be gone
all day and I'm not too familiar with cpufreq anyway.
- Steven
On Mon, 2009-10-05 at 18:32 +0200, ext Steven Noonan wrote:
> I noticed on my machine that the "conservative" cpufreq governor wasn't
> working properly in v2.6.31.1 or Linus' latest tree, but it worked fine on
> v2.6.30.8, so I decided I should figure out where this issue was coming
> from. The issue is pretty clear...
>
I had some troubles with cpufreq-info as all values in "cpufreq stats"
were being as 0,00% (I fixed it by replacing unsigned long longs with
unsigned longs, and recompiled)
If this shows still insane values:
cat /sys/devices/system/cpu/cpu*/cpufreq/stats/time_in_state
I guess your system is indeed broken.
However. I get:
(cat /sys/devices/system/cpu/cpu*/cpufreq/stats/time_in_state)
(OP1 == highest Frequency)
OP1 7148
OP2 242
OP3 2307
OP4 43145
And another round:
cpufreq stats: OP1:16,78%, OP2:0,24%, OP3:5,14%, OP4:77,83% (72)
Just once more after doing nothing:
OP1:7,41%, OP2:0,11%, OP3:2,38%, OP4:90,10% (82)
So I can't agree it's broken. The patch you bisected, actually filtered
out such phenomenon, in which an IRQ made the cpufreq framework
occasionally think we were idling, although we were not. So you got
"bonus" idle time that shouldn't been there in the first place. Now that
the "bonus" idle time is not there, your system load may indeed be so
high that the system never spends 80% or more time in idle? Could that
be the case? Of course, even though I can't agree it's broken, doesn't
mean it isn't somehow broken ;) It'd be nice to get info on other
systems as well...
>
> Here's the expected:
>
> cpufrequtils 005: cpufreq-info (C) Dominik Brodowski 2004-2006
> Report errors and bugs to [email protected], please.
> analyzing CPU 0:
> driver: acpi-cpufreq
> CPUs which need to switch frequency at the same time: 0
> hardware limits: 1000 MHz - 2.33 GHz
> available frequency steps: 2.33 GHz, 2.17 GHz, 2.00 GHz, 1.83 GHz, 1.67 GHz, 1.50 GHz, 1.33 GHz, 1000 MHz
> available cpufreq governors: ondemand, userspace, powersave, conservative, performance
> current policy: frequency should be within 1000 MHz and 2.33 GHz.
> The governor "conservative" may decide which speed to use
> within this range.
> current CPU frequency is 1000 MHz (asserted by call to hardware).
> cpufreq stats: 2.33 GHz:0.59%, 2.17 GHz:1.41%, 2.00 GHz:0.88%, 1.83 GHz:1.22%, 1.67 GHz:0.88%, 1.50 GHz:1.41%, 1.33 GHz:10.98%, 1000 MHz:82.63% (33)
> analyzing CPU 1:
> driver: acpi-cpufreq
> CPUs which need to switch frequency at the same time: 1
> hardware limits: 1000 MHz - 2.33 GHz
> available frequency steps: 2.33 GHz, 2.17 GHz, 2.00 GHz, 1.83 GHz, 1.67 GHz, 1.50 GHz, 1.33 GHz, 1000 MHz
> available cpufreq governors: ondemand, userspace, powersave, conservative, performance
> current policy: frequency should be within 1000 MHz and 2.33 GHz.
> The governor "conservative" may decide which speed to use
> within this range.
> current CPU frequency is 1000 MHz (asserted by call to hardware).
> cpufreq stats: 2.33 GHz:0.40%, 2.17 GHz:0.16%, 2.00 GHz:0.16%, 1.83 GHz:0.35%, 1.67 GHz:0.16%, 1.50 GHz:0.35%, 1.33 GHz:0.16%, 1000 MHz:98.27% (7)
>
>
>
> And here is the broken version (note the 'cpufreq stats' line):
>
> cpufrequtils 005: cpufreq-info (C) Dominik Brodowski 2004-2006
> Report errors and bugs to [email protected], please.
> analyzing CPU 0:
> driver: acpi-cpufreq
> CPUs which need to switch frequency at the same time: 0
> hardware limits: 1000 MHz - 2.33 GHz
> available frequency steps: 2.33 GHz, 2.17 GHz, 2.00 GHz, 1.83 GHz, 1.67 GHz, 1.50 GHz, 1.33 GHz, 1000 MHz
> available cpufreq governors: ondemand, userspace, powersave, conservative, performance
> current policy: frequency should be within 1000 MHz and 2.33 GHz.
> The governor "conservative" may decide which speed to use
> within this range.
> current CPU frequency is 2.33 GHz (asserted by call to hardware).
> cpufreq stats: 2.33 GHz:100.00%, 2.17 GHz:0.00%, 2.00 GHz:0.00%, 1.83 GHz:0.00%, 1.67 GHz:0.00%, 1.50 GHz:0.00%, 1.33 GHz:0.00%, 1000 MHz:0.00%
> analyzing CPU 1:
> driver: acpi-cpufreq
> CPUs which need to switch frequency at the same time: 1
> hardware limits: 1000 MHz - 2.33 GHz
> available frequency steps: 2.33 GHz, 2.17 GHz, 2.00 GHz, 1.83 GHz, 1.67 GHz, 1.50 GHz, 1.33 GHz, 1000 MHz
> available cpufreq governors: ondemand, userspace, powersave, conservative, performance
> current policy: frequency should be within 1000 MHz and 2.33 GHz.
> The governor "conservative" may decide which speed to use
> within this range.
> current CPU frequency is 2.33 GHz (asserted by call to hardware).
> cpufreq stats: 2.33 GHz:100.00%, 2.17 GHz:0.00%, 2.00 GHz:0.00%, 1.83 GHz:0.00%, 1.67 GHz:0.00%, 1.50 GHz:0.00%, 1.33 GHz:0.00%, 1000 MHz:0.00%
>
>
> So basically, it just never clocks down from the maximum frequency.
>
>
> Here's the bisection log:
>
> # bad: [2147b209] Linux 2.6.31.1
> # good: [a1c4c06a] Linux 2.6.30.8
> # good: [07a2039b] Linux 2.6.30
> # good: [452dac45] V4L/DVB (11761): dvb-ttpci: Fixed VIDEO_SLOWMOTION
> # bad: [906e8d97] e1000e: delay second read of PHY_STATUS register o
> # good: [36e84467] Staging: heci: fix userspace pointer mess
> # bad: [df36b439] Merge branch 'for-2.6.31' of git://git.linux-nfs.o
> # skip: [12e24f34] Merge branch 'perfcounters-fixes-for-linus' of git
> # good: [48c93112] powerpc: Fix invalid construct in our CPU selectio
> # bad: [eca41044] n_r3964: fix lock imbalance
> # good: [93db6294] Merge branch 'for-linus' of git://git.kernel.org/p
> # bad: [1eb51c33] Merge branch 'sched-fixes-for-linus' of git://git.
> # good: [1d991001] Merge branch 'x86/mce3' into x86/urgent
> # good: [71e308a2] function-graph: add stack frame test
> # bad: [38df92b8] Merge branch 'timers-fixes-for-linus' of git://git
> # good: [ad5cf46b] Merge git://git.kernel.org/pub/scm/linux/kernel/gi
> # good: [7fd5b632] Merge branch 'for-linus' of git://git.monstr.eu/li
> # good: [c4c5ab30] Merge branch 'x86-fixes-for-linus' of git://git.ke
> # bad: [f2e21c96] NOHZ: Properly feed cpufreq ondemand governor
>
>
> And finally, the commit that broke "conservative":
>
> commit f2e21c9610991e95621a81407cdbab881226419b
> Author: Eero Nurkkala <[email protected]>
> Date: Mon May 25 09:57:37 2009 +0300
>
> NOHZ: Properly feed cpufreq ondemand governor
>
> A call from irq_exit() may occasionally pause the timing
> info for cpufreq ondemand governor. This results in the
> cpufreq ondemand governor to fail to calculate the
> system load properly. Thus, relocate the checks for this
> particular case to keep the governor always functional.
>
> Signed-off-by: Eero Nurkkala <[email protected]>
> Reported-by: Tero Kristo <[email protected]>
> Acked-by: Rik van Riel <[email protected]>
> Acked-by: Venkatesh Pallipadi <[email protected]>
> Signed-off-by: Thomas Gleixner <[email protected]>
>
>
> I'd work on fixing it myself and whip up a patch, but I'm going to be gone
> all day and I'm not too familiar with cpufreq anyway.
>
> - Steven
On Tue, Oct 6, 2009 at 12:31 AM, Eero Nurkkala
<[email protected]> wrote:
> On Mon, 2009-10-05 at 18:32 +0200, ext Steven Noonan wrote:
>> I noticed on my machine that the "conservative" cpufreq governor wasn't
>> working properly in v2.6.31.1 or Linus' latest tree, but it worked fine on
>> v2.6.30.8, so I decided I should figure out where this issue was coming
>> from. The issue is pretty clear...
>>
>
> I had some troubles with cpufreq-info as all values in "cpufreq stats"
> were being as 0,00% (I fixed it by replacing unsigned long longs with
> unsigned longs, and recompiled)
That doesn't make much sense for my case. I used the same build of
cpufreq-info through the whole bisection and the problem was very
consistently reproducible throughout. And the bisected commit is
indeed related to cpufreq.
Regardless, cpufreq-info wasn't the only reason that I thought
something was wrong. My machine was unusually warm, fans were going
pretty much constantly, and the Gnome applet for CPU frequency
monitoring constantly showed it was using 2.33GHz and never scaling
down. I checked system load, which stayed at zero. Multiple different
sources (top, perf top, etc) seemed to indicate there was no
justification for the "conservative" governor's behaviour.
> If this shows still insane values:
> cat /sys/devices/system/cpu/cpu*/cpufreq/stats/time_in_state
> I guess your system is indeed broken.
Here's the output:
2333000 6880
2167000 0
2000000 0
1833000 0
1667000 0
1500000 0
1333000 0
1000000 0
2333000 6880
2167000 0
2000000 0
1833000 0
1667000 0
1500000 0
1333000 0
1000000 0
Pretty broken, alright.
> However. I get:
> (cat /sys/devices/system/cpu/cpu*/cpufreq/stats/time_in_state)
>
> (OP1 == highest Frequency)
> OP1 7148
> OP2 242
> OP3 2307
> OP4 43145
I would suspect you have to have CONFIG_NO_HZ enabled to be able to
reproduce the issue (considering the title of the bisected commit and
my own config). Do you have it enabled?
> And another round:
>
> cpufreq stats: OP1:16,78%, OP2:0,24%, OP3:5,14%, OP4:77,83% ?(72)
>
> Just once more after doing nothing:
> OP1:7,41%, OP2:0,11%, OP3:2,38%, OP4:90,10% ?(82)
>
> So I can't agree it's broken. The patch you bisected, actually filtered
> out such phenomenon, in which an IRQ made the cpufreq framework
> occasionally think we were idling, although we were not. So you got
> "bonus" idle time that shouldn't been there in the first place. Now that
> the "bonus" idle time is not there, your system load may indeed be so
> high that the system never spends 80% or more time in idle? Could that
> be the case? Of course, even though I can't agree it's broken, doesn't
> mean it isn't somehow broken ;) It'd be nice to get info on other
> systems as well...
Interestingly, "ondemand" (the governor fixed by the bisected commit)
works fine. "conservative" is the only broken one.
>>
>> Here's the expected:
>>
>> cpufrequtils 005: cpufreq-info (C) Dominik Brodowski 2004-2006
>> Report errors and bugs to [email protected], please.
>> analyzing CPU 0:
>> ? driver: acpi-cpufreq
>> ? CPUs which need to switch frequency at the same time: 0
>> ? hardware limits: 1000 MHz - 2.33 GHz
>> ? available frequency steps: 2.33 GHz, 2.17 GHz, 2.00 GHz, 1.83 GHz, 1.67 GHz, 1.50 GHz, 1.33 GHz, 1000 MHz
>> ? available cpufreq governors: ondemand, userspace, powersave, conservative, performance
>> ? current policy: frequency should be within 1000 MHz and 2.33 GHz.
>> ? ? ? ? ? ? ? ? ? The governor "conservative" may decide which speed to use
>> ? ? ? ? ? ? ? ? ? within this range.
>> ? current CPU frequency is 1000 MHz (asserted by call to hardware).
>> ? cpufreq stats: 2.33 GHz:0.59%, 2.17 GHz:1.41%, 2.00 GHz:0.88%, 1.83 GHz:1.22%, 1.67 GHz:0.88%, 1.50 GHz:1.41%, 1.33 GHz:10.98%, 1000 MHz:82.63% ?(33)
>> analyzing CPU 1:
>> ? driver: acpi-cpufreq
>> ? CPUs which need to switch frequency at the same time: 1
>> ? hardware limits: 1000 MHz - 2.33 GHz
>> ? available frequency steps: 2.33 GHz, 2.17 GHz, 2.00 GHz, 1.83 GHz, 1.67 GHz, 1.50 GHz, 1.33 GHz, 1000 MHz
>> ? available cpufreq governors: ondemand, userspace, powersave, conservative, performance
>> ? current policy: frequency should be within 1000 MHz and 2.33 GHz.
>> ? ? ? ? ? ? ? ? ? The governor "conservative" may decide which speed to use
>> ? ? ? ? ? ? ? ? ? within this range.
>> ? current CPU frequency is 1000 MHz (asserted by call to hardware).
>> ? cpufreq stats: 2.33 GHz:0.40%, 2.17 GHz:0.16%, 2.00 GHz:0.16%, 1.83 GHz:0.35%, 1.67 GHz:0.16%, 1.50 GHz:0.35%, 1.33 GHz:0.16%, 1000 MHz:98.27% ?(7)
>>
>>
>>
>> And here is the broken version (note the 'cpufreq stats' line):
>>
>> cpufrequtils 005: cpufreq-info (C) Dominik Brodowski 2004-2006
>> Report errors and bugs to [email protected], please.
>> analyzing CPU 0:
>> ? driver: acpi-cpufreq
>> ? CPUs which need to switch frequency at the same time: 0
>> ? hardware limits: 1000 MHz - 2.33 GHz
>> ? available frequency steps: 2.33 GHz, 2.17 GHz, 2.00 GHz, 1.83 GHz, 1.67 GHz, 1.50 GHz, 1.33 GHz, 1000 MHz
>> ? available cpufreq governors: ondemand, userspace, powersave, conservative, performance
>> ? current policy: frequency should be within 1000 MHz and 2.33 GHz.
>> ? ? ? ? ? ? ? ? ? The governor "conservative" may decide which speed to use
>> ? ? ? ? ? ? ? ? ? within this range.
>> ? current CPU frequency is 2.33 GHz (asserted by call to hardware).
>> ? cpufreq stats: 2.33 GHz:100.00%, 2.17 GHz:0.00%, 2.00 GHz:0.00%, 1.83 GHz:0.00%, 1.67 GHz:0.00%, 1.50 GHz:0.00%, 1.33 GHz:0.00%, 1000 MHz:0.00%
>> analyzing CPU 1:
>> ? driver: acpi-cpufreq
>> ? CPUs which need to switch frequency at the same time: 1
>> ? hardware limits: 1000 MHz - 2.33 GHz
>> ? available frequency steps: 2.33 GHz, 2.17 GHz, 2.00 GHz, 1.83 GHz, 1.67 GHz, 1.50 GHz, 1.33 GHz, 1000 MHz
>> ? available cpufreq governors: ondemand, userspace, powersave, conservative, performance
>> ? current policy: frequency should be within 1000 MHz and 2.33 GHz.
>> ? ? ? ? ? ? ? ? ? The governor "conservative" may decide which speed to use
>> ? ? ? ? ? ? ? ? ? within this range.
>> ? current CPU frequency is 2.33 GHz (asserted by call to hardware).
>> ? cpufreq stats: 2.33 GHz:100.00%, 2.17 GHz:0.00%, 2.00 GHz:0.00%, 1.83 GHz:0.00%, 1.67 GHz:0.00%, 1.50 GHz:0.00%, 1.33 GHz:0.00%, 1000 MHz:0.00%
>>
>>
>> So basically, it just never clocks down from the maximum frequency.
>>
>>
>> Here's the bisection log:
>>
>> ?# bad: ?[2147b209] Linux 2.6.31.1
>> ?# good: [a1c4c06a] Linux 2.6.30.8
>> ?# good: [07a2039b] Linux 2.6.30
>> ?# good: [452dac45] V4L/DVB (11761): dvb-ttpci: Fixed VIDEO_SLOWMOTION
>> ?# bad: ?[906e8d97] e1000e: delay second read of PHY_STATUS register o
>> ?# good: [36e84467] Staging: heci: fix userspace pointer mess
>> ?# bad: ?[df36b439] Merge branch 'for-2.6.31' of git://git.linux-nfs.o
>> ?# skip: [12e24f34] Merge branch 'perfcounters-fixes-for-linus' of git
>> ?# good: [48c93112] powerpc: Fix invalid construct in our CPU selectio
>> ?# bad: ?[eca41044] n_r3964: fix lock imbalance
>> ?# good: [93db6294] Merge branch 'for-linus' of git://git.kernel.org/p
>> ?# bad: ?[1eb51c33] Merge branch 'sched-fixes-for-linus' of git://git.
>> ?# good: [1d991001] Merge branch 'x86/mce3' into x86/urgent
>> ?# good: [71e308a2] function-graph: add stack frame test
>> ?# bad: ?[38df92b8] Merge branch 'timers-fixes-for-linus' of git://git
>> ?# good: [ad5cf46b] Merge git://git.kernel.org/pub/scm/linux/kernel/gi
>> ?# good: [7fd5b632] Merge branch 'for-linus' of git://git.monstr.eu/li
>> ?# good: [c4c5ab30] Merge branch 'x86-fixes-for-linus' of git://git.ke
>> ?# bad: ?[f2e21c96] NOHZ: Properly feed cpufreq ondemand governor
>>
>>
>> And finally, the commit that broke "conservative":
>>
>> commit f2e21c9610991e95621a81407cdbab881226419b
>> Author: Eero Nurkkala <[email protected]>
>> Date: ? Mon May 25 09:57:37 2009 +0300
>>
>> ? ? NOHZ: Properly feed cpufreq ondemand governor
>>
>> ? ? A call from irq_exit() may occasionally pause the timing
>> ? ? info for cpufreq ondemand governor. This results in the
>> ? ? cpufreq ondemand governor to fail to calculate the
>> ? ? system load properly. Thus, relocate the checks for this
>> ? ? particular case to keep the governor always functional.
>>
>> ? ? Signed-off-by: Eero Nurkkala <[email protected]>
>> ? ? Reported-by: Tero Kristo <[email protected]>
>> ? ? Acked-by: Rik van Riel <[email protected]>
>> ? ? Acked-by: Venkatesh Pallipadi <[email protected]>
>> ? ? Signed-off-by: Thomas Gleixner <[email protected]>
>>
>>
>> I'd work on fixing it myself and whip up a patch, but I'm going to be gone
>> all day and I'm not too familiar with cpufreq anyway.
>>
>> - Steven
>
>
On Tue, 2009-10-06 at 12:22 +0200, ext Steven Noonan wrote:
>
> I would suspect you have to have CONFIG_NO_HZ enabled to be able to
> reproduce the issue (considering the title of the bisected commit and
> my own config). Do you have it enabled?
>
Yes, it's enabled.
> > And another round:
> >
> > cpufreq stats: OP1:16,78%, OP2:0,24%, OP3:5,14%, OP4:77,83% (72)
> >
> > Just once more after doing nothing:
> > OP1:7,41%, OP2:0,11%, OP3:2,38%, OP4:90,10% (82)
> >
> > So I can't agree it's broken. The patch you bisected, actually filtered
> > out such phenomenon, in which an IRQ made the cpufreq framework
> > occasionally think we were idling, although we were not. So you got
> > "bonus" idle time that shouldn't been there in the first place. Now that
> > the "bonus" idle time is not there, your system load may indeed be so
> > high that the system never spends 80% or more time in idle? Could that
> > be the case? Of course, even though I can't agree it's broken, doesn't
> > mean it isn't somehow broken ;) It'd be nice to get info on other
> > systems as well...
>
> Interestingly, "ondemand" (the governor fixed by the bisected commit)
> works fine. "conservative" is the only broken one.
>
If you took timestamps in /arch/x86/kernel/process_**.c:
(let's assume process_64.c) in cpu_idle()
around enter_idle(); and __exit_idle(), took the diff,
added the diffs up, and compared it to system uptime, you could see how
much time you spend in idle()? I think it's possible that
even if the cpu load is near 0%, the system may idle only for a bare
moment (that translates to a buggy pm_idle()), and time is spent
elsewhere (less than 80% in idle).
If I had similar findings like you, I'd be worried; but traces here and
there will point out where things go wrong ;)
- Eero
On Tue, Oct 6, 2009 at 3:43 AM, Eero Nurkkala
<[email protected]> wrote:
> On Tue, 2009-10-06 at 12:22 +0200, ext Steven Noonan wrote:
>>
>> I would suspect you have to have CONFIG_NO_HZ enabled to be able to
>> reproduce the issue (considering the title of the bisected commit and
>> my own config). Do you have it enabled?
>>
>
> Yes, it's enabled.
>
>> > And another round:
>> >
>> > cpufreq stats: OP1:16,78%, OP2:0,24%, OP3:5,14%, OP4:77,83% ?(72)
>> >
>> > Just once more after doing nothing:
>> > OP1:7,41%, OP2:0,11%, OP3:2,38%, OP4:90,10% ?(82)
>> >
>> > So I can't agree it's broken. The patch you bisected, actually filtered
>> > out such phenomenon, in which an IRQ made the cpufreq framework
>> > occasionally think we were idling, although we were not. So you got
>> > "bonus" idle time that shouldn't been there in the first place. Now that
>> > the "bonus" idle time is not there, your system load may indeed be so
>> > high that the system never spends 80% or more time in idle? Could that
>> > be the case? Of course, even though I can't agree it's broken, doesn't
>> > mean it isn't somehow broken ;) It'd be nice to get info on other
>> > systems as well...
>>
>> Interestingly, "ondemand" (the governor fixed by the bisected commit)
>> works fine. "conservative" is the only broken one.
>>
>
> If you took timestamps in /arch/x86/kernel/process_**.c:
> (let's assume process_64.c) in cpu_idle()
> around enter_idle(); and __exit_idle(), took the diff,
> added the diffs up, and compared it to system uptime, you could see how
> much time you spend in idle()? I think it's possible that
> even if the cpu load is near 0%, the system may idle only for a bare
> moment (that translates to a buggy pm_idle()), and time is spent
> elsewhere (less than 80% in idle).
This makes logical sense, but how should I test this? Is there a way
to do this with existing tracers?
By the way, a revert of the bisected commit changes the cpufreq stats favorably:
2333000 31
2167000 13
2000000 6
1833000 79
1667000 239
1500000 193
1333000 959
1000000 10260
2333000 11
2167000 13
2000000 13
1833000 26
1667000 33
1500000 333
1333000 59
1000000 11293
- Steven
On Tue, 2009-10-06 at 13:22 +0200, ext Steven Noonan wrote:
> On Tue, Oct 6, 2009 at 3:43 AM, Eero Nurkkala
> <[email protected]> wrote:
> > On Tue, 2009-10-06 at 12:22 +0200, ext Steven Noonan wrote:
> >>
> >> I would suspect you have to have CONFIG_NO_HZ enabled to be able to
> >> reproduce the issue (considering the title of the bisected commit and
> >> my own config). Do you have it enabled?
> >>
> >
> > Yes, it's enabled.
> >
> >> > And another round:
> >> >
> >> > cpufreq stats: OP1:16,78%, OP2:0,24%, OP3:5,14%, OP4:77,83% (72)
> >> >
> >> > Just once more after doing nothing:
> >> > OP1:7,41%, OP2:0,11%, OP3:2,38%, OP4:90,10% (82)
> >> >
> >> > So I can't agree it's broken. The patch you bisected, actually filtered
> >> > out such phenomenon, in which an IRQ made the cpufreq framework
> >> > occasionally think we were idling, although we were not. So you got
> >> > "bonus" idle time that shouldn't been there in the first place. Now that
> >> > the "bonus" idle time is not there, your system load may indeed be so
> >> > high that the system never spends 80% or more time in idle? Could that
> >> > be the case? Of course, even though I can't agree it's broken, doesn't
> >> > mean it isn't somehow broken ;) It'd be nice to get info on other
> >> > systems as well...
> >>
> >> Interestingly, "ondemand" (the governor fixed by the bisected commit)
> >> works fine. "conservative" is the only broken one.
> >>
> >
> > If you took timestamps in /arch/x86/kernel/process_**.c:
> > (let's assume process_64.c) in cpu_idle()
> > around enter_idle(); and __exit_idle(), took the diff,
> > added the diffs up, and compared it to system uptime, you could see how
> > much time you spend in idle()? I think it's possible that
> > even if the cpu load is near 0%, the system may idle only for a bare
> > moment (that translates to a buggy pm_idle()), and time is spent
> > elsewhere (less than 80% in idle).
>
> This makes logical sense, but how should I test this? Is there a way
> to do this with existing tracers?
Tracers may by themselves add some load into the system.
If I were you, I'd add something like: (I have only one CPU BTW)
static ktime_t time_prior_idle;
static int64_t idle_total;
time_prior_idle = ktime_get();
<idle stuff>
idle_total += ktime_to_ns(ktime_sub(ktime_get(), time_prior_idle));
and have a sysfs hook (something already present, so you can just cat
it) with a trace:
printk("Times: %lld, %lld \n", idle_total, ktime_to_ns(ktime_get()));
Sample output:
374758812519, 386768249832
So I have 386768249832 / 386768249832 = 96.9 % time spent in idle in
this case.
(Right, this should provide somewhat descent info, hopefully ;) )
- Eero
On Tue, Oct 6, 2009 at 5:36 AM, Eero Nurkkala
<[email protected]> wrote:
> On Tue, 2009-10-06 at 13:22 +0200, ext Steven Noonan wrote:
>> On Tue, Oct 6, 2009 at 3:43 AM, Eero Nurkkala
>> <[email protected]> wrote:
>> > On Tue, 2009-10-06 at 12:22 +0200, ext Steven Noonan wrote:
>> >>
>> >> I would suspect you have to have CONFIG_NO_HZ enabled to be able to
>> >> reproduce the issue (considering the title of the bisected commit and
>> >> my own config). Do you have it enabled?
>> >>
>> >
>> > Yes, it's enabled.
>> >
>> >> > And another round:
>> >> >
>> >> > cpufreq stats: OP1:16,78%, OP2:0,24%, OP3:5,14%, OP4:77,83% ?(72)
>> >> >
>> >> > Just once more after doing nothing:
>> >> > OP1:7,41%, OP2:0,11%, OP3:2,38%, OP4:90,10% ?(82)
>> >> >
>> >> > So I can't agree it's broken. The patch you bisected, actually filtered
>> >> > out such phenomenon, in which an IRQ made the cpufreq framework
>> >> > occasionally think we were idling, although we were not. So you got
>> >> > "bonus" idle time that shouldn't been there in the first place. Now that
>> >> > the "bonus" idle time is not there, your system load may indeed be so
>> >> > high that the system never spends 80% or more time in idle? Could that
>> >> > be the case? Of course, even though I can't agree it's broken, doesn't
>> >> > mean it isn't somehow broken ;) It'd be nice to get info on other
>> >> > systems as well...
>> >>
>> >> Interestingly, "ondemand" (the governor fixed by the bisected commit)
>> >> works fine. "conservative" is the only broken one.
>> >>
>> >
>> > If you took timestamps in /arch/x86/kernel/process_**.c:
>> > (let's assume process_64.c) in cpu_idle()
>> > around enter_idle(); and __exit_idle(), took the diff,
>> > added the diffs up, and compared it to system uptime, you could see how
>> > much time you spend in idle()? I think it's possible that
>> > even if the cpu load is near 0%, the system may idle only for a bare
>> > moment (that translates to a buggy pm_idle()), and time is spent
>> > elsewhere (less than 80% in idle).
>>
>> This makes logical sense, but how should I test this? Is there a way
>> to do this with existing tracers?
>
> Tracers may by themselves add some load into the system.
>
> If I were you, I'd add something like: (I have only one CPU BTW)
>
> static ktime_t time_prior_idle;
> static int64_t idle_total;
>
> time_prior_idle = ktime_get();
> <idle stuff>
> idle_total += ktime_to_ns(ktime_sub(ktime_get(), time_prior_idle));
>
> and have a sysfs hook (something already present, so you can just cat
> it) with a trace:
>
> printk("Times: %lld, %lld \n", idle_total, ktime_to_ns(ktime_get()));
>
> Sample output:
> 374758812519, 386768249832
>
> So I have 386768249832 / 386768249832 = 96.9 % time spent in idle in
> this case.
>
> (Right, this should provide somewhat descent info, hopefully ;) )
>
Well, I tried adding the code to cpu_idle() as suggested, but it never
printed anything. Apparently cpu_idle() isn't ever being called here.
Even added a 'BUG();' at the beginning of the function and it never
hit it. Of course, I'm probably missing something obvious. Is there a
separate cpu_idle()-esque function for SMP?
- Steven
On Tue, Oct 6, 2009 at 5:54 PM, Steven Noonan <[email protected]> wrote:
> On Tue, Oct 6, 2009 at 5:36 AM, Eero Nurkkala
> <[email protected]> wrote:
>> On Tue, 2009-10-06 at 13:22 +0200, ext Steven Noonan wrote:
>>> On Tue, Oct 6, 2009 at 3:43 AM, Eero Nurkkala
>>> <[email protected]> wrote:
>>> > On Tue, 2009-10-06 at 12:22 +0200, ext Steven Noonan wrote:
>>> >>
>>> >> I would suspect you have to have CONFIG_NO_HZ enabled to be able to
>>> >> reproduce the issue (considering the title of the bisected commit and
>>> >> my own config). Do you have it enabled?
>>> >>
>>> >
>>> > Yes, it's enabled.
>>> >
>>> >> > And another round:
>>> >> >
>>> >> > cpufreq stats: OP1:16,78%, OP2:0,24%, OP3:5,14%, OP4:77,83% ?(72)
>>> >> >
>>> >> > Just once more after doing nothing:
>>> >> > OP1:7,41%, OP2:0,11%, OP3:2,38%, OP4:90,10% ?(82)
>>> >> >
>>> >> > So I can't agree it's broken. The patch you bisected, actually filtered
>>> >> > out such phenomenon, in which an IRQ made the cpufreq framework
>>> >> > occasionally think we were idling, although we were not. So you got
>>> >> > "bonus" idle time that shouldn't been there in the first place. Now that
>>> >> > the "bonus" idle time is not there, your system load may indeed be so
>>> >> > high that the system never spends 80% or more time in idle? Could that
>>> >> > be the case? Of course, even though I can't agree it's broken, doesn't
>>> >> > mean it isn't somehow broken ;) It'd be nice to get info on other
>>> >> > systems as well...
>>> >>
>>> >> Interestingly, "ondemand" (the governor fixed by the bisected commit)
>>> >> works fine. "conservative" is the only broken one.
>>> >>
>>> >
>>> > If you took timestamps in /arch/x86/kernel/process_**.c:
>>> > (let's assume process_64.c) in cpu_idle()
>>> > around enter_idle(); and __exit_idle(), took the diff,
>>> > added the diffs up, and compared it to system uptime, you could see how
>>> > much time you spend in idle()? I think it's possible that
>>> > even if the cpu load is near 0%, the system may idle only for a bare
>>> > moment (that translates to a buggy pm_idle()), and time is spent
>>> > elsewhere (less than 80% in idle).
>>>
>>> This makes logical sense, but how should I test this? Is there a way
>>> to do this with existing tracers?
>>
>> Tracers may by themselves add some load into the system.
>>
>> If I were you, I'd add something like: (I have only one CPU BTW)
>>
>> static ktime_t time_prior_idle;
>> static int64_t idle_total;
>>
>> time_prior_idle = ktime_get();
>> <idle stuff>
>> idle_total += ktime_to_ns(ktime_sub(ktime_get(), time_prior_idle));
>>
>> and have a sysfs hook (something already present, so you can just cat
>> it) with a trace:
>>
>> printk("Times: %lld, %lld \n", idle_total, ktime_to_ns(ktime_get()));
>>
>> Sample output:
>> 374758812519, 386768249832
>>
>> So I have 386768249832 / 386768249832 = 96.9 % time spent in idle in
>> this case.
>>
>> (Right, this should provide somewhat descent info, hopefully ;) )
>>
>
> Well, I tried adding the code to cpu_idle() as suggested, but it never
> printed anything. Apparently cpu_idle() isn't ever being called here.
> Even added a 'BUG();' at the beginning of the function and it never
> hit it. Of course, I'm probably missing something obvious. Is there a
> separate cpu_idle()-esque function for SMP?
>
Oh crap. Perhaps it's more insidious. I reverted the bisected commit
and it _DID_ hit the line I added. So cpu_idle is never entered with
the bisected commit. Bizarre.
- Steven
On Wed, 2009-10-07 at 03:22 +0200, ext Steven Noonan wrote:
> >>
> >> Tracers may by themselves add some load into the system.
> >>
> >> If I were you, I'd add something like: (I have only one CPU BTW)
> >>
> >> static ktime_t time_prior_idle;
> >> static int64_t idle_total;
> >>
> >> time_prior_idle = ktime_get();
> >> <idle stuff>
> >> idle_total += ktime_to_ns(ktime_sub(ktime_get(), time_prior_idle));
> >>
> >> and have a sysfs hook (something already present, so you can just cat
> >> it) with a trace:
> >>
> >> printk("Times: %lld, %lld \n", idle_total, ktime_to_ns(ktime_get()));
> >>
> >> Sample output:
> >> 374758812519, 386768249832
> >>
> >> So I have 386768249832 / 386768249832 = 96.9 % time spent in idle in
> >> this case.
> >>
> >> (Right, this should provide somewhat descent info, hopefully ;) )
> >>
> >
> > Well, I tried adding the code to cpu_idle() as suggested, but it never
> > printed anything. Apparently cpu_idle() isn't ever being called here.
> > Even added a 'BUG();' at the beginning of the function and it never
> > hit it. Of course, I'm probably missing something obvious. Is there a
> > separate cpu_idle()-esque function for SMP?
> >
>
> Oh crap. Perhaps it's more insidious. I reverted the bisected commit
> and it _DID_ hit the line I added. So cpu_idle is never entered with
> the bisected commit. Bizarre.
>
> - Steven
So you're saying that cpu_idle() is not ever called.. The commit in
question - whether you got it or had it reverted - should, and must not
be a trigger for your system to hit cpu_idle(). What if you took a look
at init/main.c, and added traces before the first call to cpu_idle():
start_kernel() ->
rest_init() ->
cpu_idle()
With the patch not reverted?
So I guess, that an irq may actually start your "system up" (patch
reverted) somehow:
irq_exit() ->
tick_nohz_stop_sched_tick() ->
tick_nohz_start_idle() ->
sched_clock_idle_sleep_event() ?
So it looks that maybe the conservative governor isn't broken after
all ;)
(BTW I guess, with SMP, you may wish to have fix things to work in your
setup
static ktime_t time_prior_idle[2];
static int64_t idle_total[2];
and
time_prior_idle[smp_processor_id()] = ktime_get();
etc, and print out times separately).
- Eero
On Tue, Oct 6, 2009 at 10:27 PM, Eero Nurkkala
<[email protected]> wrote:
> On Wed, 2009-10-07 at 03:22 +0200, ext Steven Noonan wrote:
>> >>
>> >> Tracers may by themselves add some load into the system.
>> >>
>> >> If I were you, I'd add something like: (I have only one CPU BTW)
>> >>
>> >> static ktime_t time_prior_idle;
>> >> static int64_t idle_total;
>> >>
>> >> time_prior_idle = ktime_get();
>> >> <idle stuff>
>> >> idle_total += ktime_to_ns(ktime_sub(ktime_get(), time_prior_idle));
>> >>
>> >> and have a sysfs hook (something already present, so you can just cat
>> >> it) with a trace:
>> >>
>> >> printk("Times: %lld, %lld \n", idle_total, ktime_to_ns(ktime_get()));
>> >>
>> >> Sample output:
>> >> 374758812519, 386768249832
>> >>
>> >> So I have 386768249832 / 386768249832 = 96.9 % time spent in idle in
>> >> this case.
>> >>
>> >> (Right, this should provide somewhat descent info, hopefully ;) )
>> >>
>> >
>> > Well, I tried adding the code to cpu_idle() as suggested, but it never
>> > printed anything. Apparently cpu_idle() isn't ever being called here.
>> > Even added a 'BUG();' at the beginning of the function and it never
>> > hit it. Of course, I'm probably missing something obvious. Is there a
>> > separate cpu_idle()-esque function for SMP?
>> >
>>
>> Oh crap. Perhaps it's more insidious. I reverted the bisected commit
>> and it _DID_ hit the line I added. So cpu_idle is never entered with
>> the bisected commit. Bizarre.
>>
>> - Steven
>
> So you're saying that cpu_idle() is not ever called.. The commit in
> question - whether you got it or had it reverted - should, and must not
> be a trigger for your system to hit cpu_idle(). What if you took a look
> at init/main.c, and added traces before the first call to cpu_idle():
>
> start_kernel() ->
> rest_init() ->
> cpu_idle()
Okay, wow, I'm a moron. I misread what cpu_idle() was intended to be
for. I thought that cpu_idle() was a function that was periodically
called whenever the CPU had nothing to do, but now I see that it's
actually the main loop. I should really read the code next time.
I've moved the statistics printout code to the _inside_ of that
infinite loop and retested. I had it print every several hundred
iterations. Here's the results (note the machine was idle the whole
time, except for about the first 10-20 seconds while the machine
booted):
[ 3.627716] timings[0]: 2250511125 / 3627716116
[ 6.946216] timings[0]: 4780901366 / 6946213531
[ 13.355182] timings[0]: 9385417604 / 13355183525
[ 18.551304] timings[1]: 16300853077 / 18551301189
[ 21.589039] timings[0]: 15984495433 / 21589037480
[ 47.152733] timings[1]: 44386121538 / 47152731476
[ 51.682630] timings[0]: 45713834076 / 51682628295
[ 79.587359] timings[0]: 73524821916 / 79587356820
[ 88.630110] timings[1]: 85324277596 / 88630109605
[ 96.082386] timings[0]: 89691306072 / 96082384539
>
> With the patch not reverted?
>
> So I guess, that an irq may actually start your "system up" (patch
> reverted) somehow:
>
> irq_exit() ->
> tick_nohz_stop_sched_tick() ->
> tick_nohz_start_idle() ->
> sched_clock_idle_sleep_event() ?
>
> So it looks that maybe the conservative governor isn't broken after
> all ;)
Symptomatically, it is broken. But it's likely just that: a symptom of
a larger issue.
>
>
> (BTW I guess, with SMP, you may wish to have fix things to work in your
> setup
> static ktime_t time_prior_idle[2];
> static int64_t idle_total[2];
> and
> time_prior_idle[smp_processor_id()] = ktime_get();
> etc, and print out times separately).
>
> - Eero
>
>
On Wed, 2009-10-07 at 09:30 +0200, ext Steven Noonan wrote:
>
> Okay, wow, I'm a moron. I misread what cpu_idle() was intended to be
> for. I thought that cpu_idle() was a function that was periodically
> called whenever the CPU had nothing to do, but now I see that it's
> actually the main loop. I should really read the code next time.
>
> I've moved the statistics printout code to the _inside_ of that
> infinite loop and retested. I had it print every several hundred
> iterations. Here's the results (note the machine was idle the whole
> time, except for about the first 10-20 seconds while the machine
> booted):
>
> [ 3.627716] timings[0]: 2250511125 / 3627716116
> [ 6.946216] timings[0]: 4780901366 / 6946213531
> [ 13.355182] timings[0]: 9385417604 / 13355183525
> [ 18.551304] timings[1]: 16300853077 / 18551301189
> [ 21.589039] timings[0]: 15984495433 / 21589037480
> [ 47.152733] timings[1]: 44386121538 / 47152731476
> [ 51.682630] timings[0]: 45713834076 / 51682628295
> [ 79.587359] timings[0]: 73524821916 / 79587356820
> [ 88.630110] timings[1]: 85324277596 / 88630109605
> [ 96.082386] timings[0]: 89691306072 / 96082384539
>
Those look good.
Well, might as well then go for:
/drivers/cpufreq/cpufreq_conservative.c
dbs_check_cpu() ->
load = 100 * (wall_time - idle_time) / wall_time; <- What is your load?
Let assume load is sane, look for (in dbs_check_cpu())
if (load < (dbs_tuners_ins.down_threshold - 10)) {
whether it is taken ever...if not, what is your
(dbs_tuners_ins.down_threshold - 10) ?
- Eero
On Wed, Oct 7, 2009 at 12:49 AM, Eero Nurkkala
<[email protected]> wrote:
> On Wed, 2009-10-07 at 09:30 +0200, ext Steven Noonan wrote:
>>
>> Okay, wow, I'm a moron. I misread what cpu_idle() was intended to be
>> for. I thought that cpu_idle() was a function that was periodically
>> called whenever the CPU had nothing to do, but now I see that it's
>> actually the main loop. I should really read the code next time.
>>
>> I've moved the statistics printout code to the _inside_ of that
>> infinite loop and retested. I had it print every several hundred
>> iterations. Here's the results (note the machine was idle the whole
>> time, except for about the first 10-20 seconds while the machine
>> booted):
>>
>> [ ? ?3.627716] timings[0]: 2250511125 / 3627716116
>> [ ? ?6.946216] timings[0]: 4780901366 / 6946213531
>> [ ? 13.355182] timings[0]: 9385417604 / 13355183525
>> [ ? 18.551304] timings[1]: 16300853077 / 18551301189
>> [ ? 21.589039] timings[0]: 15984495433 / 21589037480
>> [ ? 47.152733] timings[1]: 44386121538 / 47152731476
>> [ ? 51.682630] timings[0]: 45713834076 / 51682628295
>> [ ? 79.587359] timings[0]: 73524821916 / 79587356820
>> [ ? 88.630110] timings[1]: 85324277596 / 88630109605
>> [ ? 96.082386] timings[0]: 89691306072 / 96082384539
>>
>
> Those look good.
>
> Well, might as well then go for:
> /drivers/cpufreq/cpufreq_conservative.c
> dbs_check_cpu() ->
> load = 100 * (wall_time - idle_time) / wall_time; <- What is your load?
That's probably the problem...
[ 40.632277] cpufreq load = 100 * (66667 - 3310) / 66667 = 95
[ 40.698947] cpufreq load = 100 * (66661 - 3238) / 66661 = 95
[ 73.965425] cpufreq load = 100 * (66667 - 12820) / 66667 = 80
[ 74.032095] cpufreq load = 100 * (66661 - 1124) / 66661 = 98
[ 107.298571] cpufreq load = 100 * (66666 - 13092) / 66666 = 80
[ 107.365301] cpufreq load = 100 * (66722 - 3317) / 66722 = 95
[ 140.631717] cpufreq load = 100 * (66666 - 3311) / 66666 = 95
[ 140.698387] cpufreq load = 100 * (66662 - 3237) / 66662 = 95
idle_time is wrong.
> Let assume load is sane, look for (in dbs_check_cpu())
> ? ? ? ?if (load < (dbs_tuners_ins.down_threshold - 10)) {
>
> whether it is taken ever...if not, what is your
> (dbs_tuners_ins.down_threshold - 10) ?
>
> - Eero
>
>
On Wed, Oct 7, 2009 at 1:05 AM, Steven Noonan <[email protected]> wrote:
> On Wed, Oct 7, 2009 at 12:49 AM, Eero Nurkkala
> <[email protected]> wrote:
>> On Wed, 2009-10-07 at 09:30 +0200, ext Steven Noonan wrote:
>>>
>>> Okay, wow, I'm a moron. I misread what cpu_idle() was intended to be
>>> for. I thought that cpu_idle() was a function that was periodically
>>> called whenever the CPU had nothing to do, but now I see that it's
>>> actually the main loop. I should really read the code next time.
>>>
>>> I've moved the statistics printout code to the _inside_ of that
>>> infinite loop and retested. I had it print every several hundred
>>> iterations. Here's the results (note the machine was idle the whole
>>> time, except for about the first 10-20 seconds while the machine
>>> booted):
>>>
>>> [ ? ?3.627716] timings[0]: 2250511125 / 3627716116
>>> [ ? ?6.946216] timings[0]: 4780901366 / 6946213531
>>> [ ? 13.355182] timings[0]: 9385417604 / 13355183525
>>> [ ? 18.551304] timings[1]: 16300853077 / 18551301189
>>> [ ? 21.589039] timings[0]: 15984495433 / 21589037480
>>> [ ? 47.152733] timings[1]: 44386121538 / 47152731476
>>> [ ? 51.682630] timings[0]: 45713834076 / 51682628295
>>> [ ? 79.587359] timings[0]: 73524821916 / 79587356820
>>> [ ? 88.630110] timings[1]: 85324277596 / 88630109605
>>> [ ? 96.082386] timings[0]: 89691306072 / 96082384539
>>>
>>
>> Those look good.
>>
>> Well, might as well then go for:
>> /drivers/cpufreq/cpufreq_conservative.c
>> dbs_check_cpu() ->
>> load = 100 * (wall_time - idle_time) / wall_time; <- What is your load?
>
> That's probably the problem...
>
> [ ? 40.632277] cpufreq load = 100 * (66667 - 3310) / 66667 = 95
> [ ? 40.698947] cpufreq load = 100 * (66661 - 3238) / 66661 = 95
> [ ? 73.965425] cpufreq load = 100 * (66667 - 12820) / 66667 = 80
> [ ? 74.032095] cpufreq load = 100 * (66661 - 1124) / 66661 = 98
> [ ?107.298571] cpufreq load = 100 * (66666 - 13092) / 66666 = 80
> [ ?107.365301] cpufreq load = 100 * (66722 - 3317) / 66722 = 95
> [ ?140.631717] cpufreq load = 100 * (66666 - 3311) / 66666 = 95
> [ ?140.698387] cpufreq load = 100 * (66662 - 3237) / 66662 = 95
>
> idle_time is wrong.
Actually, it's more likely that the idle_time there is correct and
there's something else that's going runaway. My system's fans are
running at about 4000 RPM, when they'd normally be running at 2000 RPM
for this load average. I suspect there's something actually going
wild.
>
>> Let assume load is sane, look for (in dbs_check_cpu())
>> ? ? ? ?if (load < (dbs_tuners_ins.down_threshold - 10)) {
>>
>> whether it is taken ever...if not, what is your
>> (dbs_tuners_ins.down_threshold - 10) ?
>>
>> - Eero
>>
>>
>
On Wed, 2009-10-07 at 10:05 +0200, ext Steven Noonan wrote:
> > Those look good.
> >
> > Well, might as well then go for:
> > /drivers/cpufreq/cpufreq_conservative.c
> > dbs_check_cpu() ->
> > load = 100 * (wall_time - idle_time) / wall_time; <- What is your load?
>
> That's probably the problem...
>
> [ 40.632277] cpufreq load = 100 * (66667 - 3310) / 66667 = 95
> [ 40.698947] cpufreq load = 100 * (66661 - 3238) / 66661 = 95
> [ 73.965425] cpufreq load = 100 * (66667 - 12820) / 66667 = 80
> [ 74.032095] cpufreq load = 100 * (66661 - 1124) / 66661 = 98
> [ 107.298571] cpufreq load = 100 * (66666 - 13092) / 66666 = 80
> [ 107.365301] cpufreq load = 100 * (66722 - 3317) / 66722 = 95
> [ 140.631717] cpufreq load = 100 * (66666 - 3311) / 66666 = 95
> [ 140.698387] cpufreq load = 100 * (66662 - 3237) / 66662 = 95
>
> idle_time is wrong.
>
> > Let assume load is sane, look for (in dbs_check_cpu())
> > if (load < (dbs_tuners_ins.down_threshold - 10)) {
> >
> > whether it is taken ever...if not, what is your
> > (dbs_tuners_ins.down_threshold - 10) ?
> >
> > - Eero
> >
> >
<------------Sidenote------------------------------>
Vankatesh, I think the following is not smp friendly:
cpufreq_conservative.c:
/* Get Absolute Load */
for_each_cpu(j, policy->cpus) {
...
load = 100 * (wall_time - idle_time) / wall_time;
now the load is only considered from the last cpu in line?
in ondemand, load goes to max_load_freq...so max is taken
from all cpus out there. (well, doesn't seem like the case with
conservative). Of course I may be wrong, and things are like
supposed.
</------------Sidenote------------------------------>
Steven, how do the cpu loads look like without the patch?
- Eero
On Wed, Oct 7, 2009 at 1:14 AM, Eero Nurkkala
<[email protected]> wrote:
> On Wed, 2009-10-07 at 10:05 +0200, ext Steven Noonan wrote:
>> > Those look good.
>> >
>> > Well, might as well then go for:
>> > /drivers/cpufreq/cpufreq_conservative.c
>> > dbs_check_cpu() ->
>> > load = 100 * (wall_time - idle_time) / wall_time; <- What is your load?
>>
>> That's probably the problem...
>>
>> [ ? 40.632277] cpufreq load = 100 * (66667 - 3310) / 66667 = 95
>> [ ? 40.698947] cpufreq load = 100 * (66661 - 3238) / 66661 = 95
>> [ ? 73.965425] cpufreq load = 100 * (66667 - 12820) / 66667 = 80
>> [ ? 74.032095] cpufreq load = 100 * (66661 - 1124) / 66661 = 98
>> [ ?107.298571] cpufreq load = 100 * (66666 - 13092) / 66666 = 80
>> [ ?107.365301] cpufreq load = 100 * (66722 - 3317) / 66722 = 95
>> [ ?140.631717] cpufreq load = 100 * (66666 - 3311) / 66666 = 95
>> [ ?140.698387] cpufreq load = 100 * (66662 - 3237) / 66662 = 95
>>
>> idle_time is wrong.
>>
>> > Let assume load is sane, look for (in dbs_check_cpu())
>> > ? ? ? ?if (load < (dbs_tuners_ins.down_threshold - 10)) {
>> >
>> > whether it is taken ever...if not, what is your
>> > (dbs_tuners_ins.down_threshold - 10) ?
>> >
>> > - Eero
>> >
>> >
>
> <------------Sidenote------------------------------>
> Vankatesh, I think the following is not smp friendly:
> ? ? ? ?cpufreq_conservative.c:
> ? ? ? ?/* Get Absolute Load */
> ? ? ? ?for_each_cpu(j, policy->cpus) {
> ? ? ? ?...
> ? ? ? ?load = 100 * (wall_time - idle_time) / wall_time;
>
> ? ? ? ?now the load is only considered from the last cpu in line?
> ? ? ? ?in ondemand, load goes to max_load_freq...so max is taken
> ? ? ? ?from all cpus out there. (well, doesn't seem like the case with
> ? ? ? ?conservative). Of course I may be wrong, and things are like
> ? ? ? ?supposed.
> </------------Sidenote------------------------------>
>
> Steven, how do the cpu loads look like without the patch?
They're sane:
[ 40.019381] cpufreq load = 100 * (66666 - 66337) / 66666 = 0
[ 40.019396] cpufreq load = 100 * (66666 - 66299) / 66666 = 0
[ 73.352580] cpufreq load = 100 * (66717 - 66349) / 66717 = 0
[ 73.352595] cpufreq load = 100 * (66634 - 63848) / 66634 = 4
On Wed, 2009-10-07 at 10:24 +0200, ext Steven Noonan wrote:
> >
> > Steven, how do the cpu loads look like without the patch?
>
> They're sane:
>
> [ 40.019381] cpufreq load = 100 * (66666 - 66337) / 66666 = 0
> [ 40.019396] cpufreq load = 100 * (66666 - 66299) / 66666 = 0
> [ 73.352580] cpufreq load = 100 * (66717 - 66349) / 66717 = 0
> [ 73.352595] cpufreq load = 100 * (66634 - 63848) / 66634 = 4
Thank you. Could you please try the following:
Now, if ts->nohz_mode == NOHZ_MODE_INACTIVE, ts->inidle is not set
and all subsequent calls from irq_exit() think we weren't idling,
which is not true.
--- a/kernel/time/tick-sched.c
+++ b/kernel/time/tick-sched.c
@@ -233,6 +233,8 @@ void tick_nohz_stop_sched_tick(int inidle)
now = tick_nohz_start_idle(ts);
+ ts->inidle = 1;
+
/*
* If this cpu is offline and it is the one which updates
* jiffies, then give up the assignment and let it be taken by
@@ -248,8 +250,6 @@ void tick_nohz_stop_sched_tick(int inidle)
if (unlikely(ts->nohz_mode == NOHZ_MODE_INACTIVE))
goto end;
- ts->inidle = 1;
-
if (need_resched())
goto end;
On Wed, Oct 7, 2009 at 1:43 AM, Eero Nurkkala
<[email protected]> wrote:
> On Wed, 2009-10-07 at 10:24 +0200, ext Steven Noonan wrote:
>> >
>> > Steven, how do the cpu loads look like without the patch?
>>
>> They're sane:
>>
>> [ ? 40.019381] cpufreq load = 100 * (66666 - 66337) / 66666 = 0
>> [ ? 40.019396] cpufreq load = 100 * (66666 - 66299) / 66666 = 0
>> [ ? 73.352580] cpufreq load = 100 * (66717 - 66349) / 66717 = 0
>> [ ? 73.352595] cpufreq load = 100 * (66634 - 63848) / 66634 = 4
>
> Thank you. Could you please try the following:
> Now, if ts->nohz_mode == NOHZ_MODE_INACTIVE, ts->inidle is not set
> and all subsequent calls from irq_exit() think we weren't idling,
> which is not true.
>
> --- a/kernel/time/tick-sched.c
> +++ b/kernel/time/tick-sched.c
> @@ -233,6 +233,8 @@ void tick_nohz_stop_sched_tick(int inidle)
>
> ? ? ? ?now = tick_nohz_start_idle(ts);
>
> + ? ? ? ts->inidle = 1;
> +
> ? ? ? ?/*
> ? ? ? ? * If this cpu is offline and it is the one which updates
> ? ? ? ? * jiffies, then give up the assignment and let it be taken by
> @@ -248,8 +250,6 @@ void tick_nohz_stop_sched_tick(int inidle)
> ? ? ? ?if (unlikely(ts->nohz_mode == NOHZ_MODE_INACTIVE))
> ? ? ? ? ? ? ? ?goto end;
>
> - ? ? ? ts->inidle = 1;
> -
> ? ? ? ?if (need_resched())
> ? ? ? ? ? ? ? ?goto end;
>
Yeah, that fixed it. The load is now sane, my system isn't heating up,
and 'conservative' is now clocking my CPU down to the minimum as is
appropriate.
Can you explain why this only affected 'conservative', why it caused
my machine to heat up, etc?
Also, this fix should probably be passed on to Greg K. H. (so it goes
in 2.6.31-stable) as well as Ingo Molnar (so it goes into -tip, and
hopefully to Linus for 2.6.32). Both CC'd.
- Steven
On Wed, 2009-10-07 at 10:52 +0200, ext Steven Noonan wrote:
> On Wed, Oct 7, 2009 at 1:43 AM, Eero Nurkkala
> <[email protected]> wrote:
> > On Wed, 2009-10-07 at 10:24 +0200, ext Steven Noonan wrote:
> >> >
> >> > Steven, how do the cpu loads look like without the patch?
> >>
> >> They're sane:
> >>
> >> [ 40.019381] cpufreq load = 100 * (66666 - 66337) / 66666 = 0
> >> [ 40.019396] cpufreq load = 100 * (66666 - 66299) / 66666 = 0
> >> [ 73.352580] cpufreq load = 100 * (66717 - 66349) / 66717 = 0
> >> [ 73.352595] cpufreq load = 100 * (66634 - 63848) / 66634 = 4
> >
> > Thank you. Could you please try the following:
> > Now, if ts->nohz_mode == NOHZ_MODE_INACTIVE, ts->inidle is not set
> > and all subsequent calls from irq_exit() think we weren't idling,
> > which is not true.
> >
> > --- a/kernel/time/tick-sched.c
> > +++ b/kernel/time/tick-sched.c
> > @@ -233,6 +233,8 @@ void tick_nohz_stop_sched_tick(int inidle)
> >
> > now = tick_nohz_start_idle(ts);
> >
> > + ts->inidle = 1;
> > +
> > /*
> > * If this cpu is offline and it is the one which updates
> > * jiffies, then give up the assignment and let it be taken by
> > @@ -248,8 +250,6 @@ void tick_nohz_stop_sched_tick(int inidle)
> > if (unlikely(ts->nohz_mode == NOHZ_MODE_INACTIVE))
> > goto end;
> >
> > - ts->inidle = 1;
> > -
> > if (need_resched())
> > goto end;
> >
>
> Yeah, that fixed it. The load is now sane, my system isn't heating up,
> and 'conservative' is now clocking my CPU down to the minimum as is
> appropriate.
>
For some reason, this path is taken:
if (ts->nohz_mode == NOHZ_MODE_INACTIVE)
goto end;
> Can you explain why this only affected 'conservative', why it caused
> my machine to heat up, etc?
>
Maybe the loads dont look good with ondemand either? if you trace them
out. Below is a patch that has explanation on what's happening in
your system.
> Also, this fix should probably be passed on to Greg K. H. (so it goes
> in 2.6.31-stable) as well as Ingo Molnar (so it goes into -tip, and
> hopefully to Linus for 2.6.32). Both CC'd.
>
> - Steven
From: Eero Nurkkala <[email protected]>
Date: Wed, 7 Oct 2009 11:54:26 +0300
Subject: [PATCH] NOHZ: update idle state properly
Commit f2e21c9610991e95621a81407cdbab881226419b
had unfortunate side effects with cpufreq governors on
some systems.
If NOHZ_MODE_INACTIVE was set, ts->inidle was not being
set. Then, all subsequent calls from irq_exit() bypassed
calls to tick_nohz_start_idle() which resulted in wrong
information passed to cpufreq governors. Fix this by
updating the state of ts->inidle where it fits the best.
Reported-by: Steven Noonan <[email protected]>
Signed-off-by: Eero Nurkkala <[email protected]>
---
kernel/time/tick-sched.c | 4 ++--
1 files changed, 2 insertions(+), 2 deletions(-)
diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
index e0f59a2..4c1b0ac 100644
--- a/kernel/time/tick-sched.c
+++ b/kernel/time/tick-sched.c
@@ -233,6 +233,8 @@ void tick_nohz_stop_sched_tick(int inidle)
now = tick_nohz_start_idle(ts);
+ ts->inidle = 1;
+
/*
* If this cpu is offline and it is the one which updates
* jiffies, then give up the assignment and let it be taken by
@@ -248,8 +250,6 @@ void tick_nohz_stop_sched_tick(int inidle)
if (unlikely(ts->nohz_mode == NOHZ_MODE_INACTIVE))
goto end;
- ts->inidle = 1;
-
if (need_resched())
goto end;
--
1.6.0.4
On Wed, 7 Oct 2009, Eero Nurkkala wrote:
> For some reason, this path is taken:
> if (ts->nohz_mode == NOHZ_MODE_INACTIVE)
> goto end;
Well, probably because the machine never switched to NOHZ :)
> > Also, this fix should probably be passed on to Greg K. H. (so it goes
> > in 2.6.31-stable) as well as Ingo Molnar (so it goes into -tip, and
> > hopefully to Linus for 2.6.32). Both CC'd.
Don't worry, it'll go the right way. No need to cc the world and some
more :)
> > - Steven
>
> From: Eero Nurkkala <[email protected]>
> Date: Wed, 7 Oct 2009 11:54:26 +0300
> Subject: [PATCH] NOHZ: update idle state properly
>
> Commit f2e21c9610991e95621a81407cdbab881226419b
> had unfortunate side effects with cpufreq governors on
> some systems.
>
> If NOHZ_MODE_INACTIVE was set, ts->inidle was not being
> set. Then, all subsequent calls from irq_exit() bypassed
> calls to tick_nohz_start_idle() which resulted in wrong
> information passed to cpufreq governors. Fix this by
> updating the state of ts->inidle where it fits the best.
Hmm, nasty that code silently relies on NOHZ functionality even when
NOHZ is not active. That needs a big fat comment at least.
Nice detective work though !
Thanks,
tglx
On Wed, Oct 7, 2009 at 3:42 AM, Thomas Gleixner <[email protected]> wrote:
> On Wed, 7 Oct 2009, Eero Nurkkala wrote:
>> For some reason, this path is taken:
>> if (ts->nohz_mode == NOHZ_MODE_INACTIVE)
>> ? ? ? ? goto end;
>
> Well, probably because the machine never switched to NOHZ :)
>
>> > Also, this fix should probably be passed on to Greg K. H. (so it goes
>> > in 2.6.31-stable) as well as Ingo Molnar (so it goes into -tip, and
>> > hopefully to Linus for 2.6.32). Both CC'd.
>
> Don't worry, it'll go the right way. No need to cc the world and some
> more :)
>
Sorry if it's not kosher. I was told that when dealing with a
bisection you should CC:
1. those involved with a bisected patch (the Signed-off-by people),
2. those responsible for the affected subsystem (doing an author check
via script[1]),
3. and those to whom a fix should be passed on to.
Any tips so I don't bug the wrong people (or too many people) in the future?
[1] http://www.uplinklabs.net/~tycho/linux/scripts/git-authors-email.sh
(credits to Ingo for making this one. It's handy)
>>
>> From: Eero Nurkkala <[email protected]>
>> Date: Wed, 7 Oct 2009 11:54:26 +0300
>> Subject: [PATCH] NOHZ: update idle state properly
>>
>> Commit f2e21c9610991e95621a81407cdbab881226419b
>> had unfortunate side effects with cpufreq governors on
>> some systems.
>>
>> If NOHZ_MODE_INACTIVE was set, ts->inidle was not being
>> set. Then, all subsequent calls from irq_exit() bypassed
>> calls to tick_nohz_start_idle() which resulted in wrong
>> information passed to cpufreq governors. Fix this by
>> updating the state of ts->inidle where it fits the best.
>
> Hmm, nasty that code silently relies on NOHZ functionality even when
> NOHZ is not active. That needs a big fat comment at least.
>
> Nice detective work though !
>
> Thanks,
>
> ? ? ? ?tglx
>
Commit-ID: fdc6f192e7e1ae80565af23cc33dc88e3dcdf184
Gitweb: http://git.kernel.org/tip/fdc6f192e7e1ae80565af23cc33dc88e3dcdf184
Author: Eero Nurkkala <[email protected]>
AuthorDate: Wed, 7 Oct 2009 11:54:26 +0300
Committer: Thomas Gleixner <[email protected]>
CommitDate: Wed, 7 Oct 2009 13:05:05 +0200
NOHZ: update idle state also when NOHZ is inactive
Commit f2e21c9610991e95621a81407cdbab881226419b had unfortunate side
effects with cpufreq governors on some systems.
If the system did not switch into NOHZ mode ts->inidle is not set when
tick_nohz_stop_sched_tick() is called from the idle routine. Therefor
all subsequent calls from irq_exit() to tick_nohz_stop_sched_tick()
fail to call tick_nohz_start_idle(). This results in bogus idle
accounting information which is passed to cpufreq governors.
Set the inidle flag unconditionally of the NOHZ active state to keep
the idle time accounting correct in any case.
[ tglx: Added comment and tweaked the changelog ]
Reported-by: Steven Noonan <[email protected]>
Signed-off-by: Eero Nurkkala <[email protected]>
Cc: Rik van Riel <[email protected]>
Cc: Venkatesh Pallipadi <[email protected]>
Cc: Greg KH <[email protected]>
Cc: Steven Noonan <[email protected]>
Cc: [email protected]
LKML-Reference: <1254907901.30157.93.camel@eenurkka-desktop>
Signed-off-by: Thomas Gleixner <[email protected]>
---
kernel/time/tick-sched.c | 9 +++++++--
1 files changed, 7 insertions(+), 2 deletions(-)
diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
index e0f59a2..89aed59 100644
--- a/kernel/time/tick-sched.c
+++ b/kernel/time/tick-sched.c
@@ -231,6 +231,13 @@ void tick_nohz_stop_sched_tick(int inidle)
if (!inidle && !ts->inidle)
goto end;
+ /*
+ * Set ts->inidle unconditionally. Even if the system did not
+ * switch to NOHZ mode the cpu frequency governers rely on the
+ * update of the idle time accounting in tick_nohz_start_idle().
+ */
+ ts->inidle = 1;
+
now = tick_nohz_start_idle(ts);
/*
@@ -248,8 +255,6 @@ void tick_nohz_stop_sched_tick(int inidle)
if (unlikely(ts->nohz_mode == NOHZ_MODE_INACTIVE))
goto end;
- ts->inidle = 1;
-
if (need_resched())
goto end;
On Wed, 7 Oct 2009, Steven Noonan wrote:
> > Well, probably because the machine never switched to NOHZ :)
> >
> >> > Also, this fix should probably be passed on to Greg K. H. (so it goes
> >> > in 2.6.31-stable) as well as Ingo Molnar (so it goes into -tip, and
> >> > hopefully to Linus for 2.6.32). Both CC'd.
> >
> > Don't worry, it'll go the right way. No need to cc the world and some
> > more :)
> >
>
> Sorry if it's not kosher. I was told that when dealing with a
> bisection you should CC:
> 1. those involved with a bisected patch (the Signed-off-by people),
> 2. those responsible for the affected subsystem (doing an author check
> via script[1]),
> 3. and those to whom a fix should be passed on to.
>
> Any tips so I don't bug the wrong people (or too many people) in the future?
No worry. It's just Linus and Greg who got some extra mail into their
inbox. The flow is that the patch is picked up by the maintainer and
pushed into his git tree which is then offered to Linus for pulling it
into mainline. We add a Cc: [email protected] tag to the commit, which
triggers a notifier to the stable kernel team when the patch hits
Linus tree.
Thanks,
tglx
tip-bot for Eero Nurkkala wrote:
> Commit-ID: fdc6f192e7e1ae80565af23cc33dc88e3dcdf184
> Gitweb: http://git.kernel.org/tip/fdc6f192e7e1ae80565af23cc33dc88e3dcdf184
> Author: Eero Nurkkala <[email protected]>
> AuthorDate: Wed, 7 Oct 2009 11:54:26 +0300
> Committer: Thomas Gleixner <[email protected]>
> CommitDate: Wed, 7 Oct 2009 13:05:05 +0200
>
> NOHZ: update idle state also when NOHZ is inactive
>
> Commit f2e21c9610991e95621a81407cdbab881226419b had unfortunate side
> effects with cpufreq governors on some systems.
>
> If the system did not switch into NOHZ mode ts->inidle is not set when
> tick_nohz_stop_sched_tick() is called from the idle routine. Therefor
> all subsequent calls from irq_exit() to tick_nohz_stop_sched_tick()
> fail to call tick_nohz_start_idle(). This results in bogus idle
> accounting information which is passed to cpufreq governors.
>
> Set the inidle flag unconditionally of the NOHZ active state to keep
> the idle time accounting correct in any case.
Acked-by: Rik van Riel <[email protected]>
--
All rights reversed.