2010-08-25 00:48:55

by Cesar Eduardo Barros

[permalink] [raw]
Subject: intel ips: CPU TDP doesn't match expected value

I get the following messages on v2.6.36-rc2-203-g502adf5:

intel ips 0000:00:1f.6: Warning: CPU TDP doesn't match expected value
(found 25, expected 35)
intel ips 0000:00:1f.6: PCI INT C -> GSI 18 (level, low) -> IRQ 18
intel ips 0000:00:1f.6: IPS driver initialized, MCP temp limit 65535
intel ips 0000:00:1f.6: MCP power or thermal limit exceeded
intel ips 0000:00:1f.6: CPU power or thermal limit exceeded

(The last two messages repeat very often, one of them every five
seconds; if I understand it correctly, that is normal and expected.)

The code implies the "CPU TDP doesn't match expected value" is a sanity
check failing. Should that happen, or is there something wrong?


/proc/cpuinfo for the first core:
processor : 0
vendor_id : GenuineIntel
cpu family : 6
model : 37
model name : Intel(R) Core(TM) i3 CPU M 330 @ 2.13GHz
stepping : 2
cpu MHz : 933.000
cache size : 3072 KB
physical id : 0
siblings : 4
core id : 0
cpu cores : 2
apicid : 0
initial apicid : 0
fpu : yes
fpu_exception : yes
cpuid level : 11
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov
pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx
rdtscp lm constant_tsc arch_perfmon pebs bts rep_good xtopology
nonstop_tsc aperfmperf pni dtes64 monitor ds_cpl vmx est tm2 ssse3 cx16
xtpr pdcm sse4_1 sse4_2 popcnt lahf_lm arat tpr_shadow vnmi flexpriority
ept vpid
bogomips : 4256.29
clflush size : 64
cache_alignment : 64
address sizes : 36 bits physical, 48 bits virtual
power management:

/sys/devices/virtual/dmi/id/:
bios_date: 06/21/2010
bios_vendor: Dell Inc.
bios_version: A05
board_asset_tag:
board_name: 07NTDG
board_vendor: Dell Inc.
board_version: A05
chassis_asset_tag:
chassis_type: 8
chassis_vendor: Dell Inc.
chassis_version: A05
product_name: Inspiron N4010
product_version: A05
sys_vendor: Dell Inc.

--
Cesar Eduardo Barros
[email protected]
[email protected]


2010-08-25 01:37:13

by Jesse Barnes

[permalink] [raw]
Subject: Re: intel ips: CPU TDP doesn't match expected value

On Tue, 24 Aug 2010 21:48:45 -0300
Cesar Eduardo Barros <[email protected]> wrote:

> I get the following messages on v2.6.36-rc2-203-g502adf5:
>
> intel ips 0000:00:1f.6: Warning: CPU TDP doesn't match expected value
> (found 25, expected 35)
> intel ips 0000:00:1f.6: PCI INT C -> GSI 18 (level, low) -> IRQ 18
> intel ips 0000:00:1f.6: IPS driver initialized, MCP temp limit 65535
> intel ips 0000:00:1f.6: MCP power or thermal limit exceeded
> intel ips 0000:00:1f.6: CPU power or thermal limit exceeded
>
> (The last two messages repeat very often, one of them every five
> seconds; if I understand it correctly, that is normal and expected.)
>
> The code implies the "CPU TDP doesn't match expected value" is a sanity
> check failing. Should that happen, or is there something wrong?

Yeah, it's normal, and should probably be downgraded to an
informational message. It just means your BIOS doesn't want you to
push the CPU all the way up to its maximum TDP, probably because the
platform or chassis wasn't designed for that much heat or power
consumption.

--
Jesse Barnes, Intel Open Source Technology Center

2010-08-26 23:29:38

by Cesar Eduardo Barros

[permalink] [raw]
Subject: [PATCH] intel_ips: quieten "power or thermal limit exceeded" messages

On my Dell Inspiron N4010, one of these messages is printed every five
seconds. Change both to dev_dbg to quieten them even more.

Signed-off-by: Cesar Eduardo Barros <[email protected]>
---
drivers/platform/x86/intel_ips.c | 4 ++--
1 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/platform/x86/intel_ips.c b/drivers/platform/x86/intel_ips.c
index 9024480..7833adc 100644
--- a/drivers/platform/x86/intel_ips.c
+++ b/drivers/platform/x86/intel_ips.c
@@ -607,7 +607,7 @@ static bool mcp_exceeded(struct ips_driver *ips)
spin_unlock_irqrestore(&ips->turbo_status_lock, flags);

if (ret)
- dev_info(&ips->dev->dev,
+ dev_dbg(&ips->dev->dev,
"MCP power or thermal limit exceeded\n");

return ret;
@@ -635,7 +635,7 @@ static bool cpu_exceeded(struct ips_driver *ips, int cpu)
spin_unlock_irqrestore(&ips->turbo_status_lock, flags);

if (ret)
- dev_info(&ips->dev->dev,
+ dev_dbg(&ips->dev->dev,
"CPU power or thermal limit exceeded\n");

return ret;
--
1.7.2.2

2010-08-26 23:33:48

by Joe Perches

[permalink] [raw]
Subject: Re: [PATCH] intel_ips: quieten "power or thermal limit exceeded" messages

On Thu, 26 Aug 2010, Cesar Eduardo Barros wrote:

> On my Dell Inspiron N4010, one of these messages is printed every five
> seconds. Change both to dev_dbg to quieten them even more.

I think you should instead fix your hardware or maybe change
your thermal throttling settings.

> @@ -607,7 +607,7 @@ static bool mcp_exceeded(struct ips_driver *ips)
> - dev_info(&ips->dev->dev,
> + dev_dbg(&ips->dev->dev,
> "MCP power or thermal limit exceeded\n");

2010-08-27 00:01:50

by Alan

[permalink] [raw]
Subject: Re: [PATCH] intel_ips: quieten "power or thermal limit exceeded" messages

On Thu, 26 Aug 2010 20:29:01 -0300
Cesar Eduardo Barros <[email protected]> wrote:

> On my Dell Inspiron N4010, one of these messages is printed every five
> seconds. Change both to dev_dbg to quieten them even more.

It's not the kind of thing you want to quieten since in most cases it is
stuff you want to know about.

>
> Signed-off-by: Cesar Eduardo Barros <[email protected]>
> ---
> drivers/platform/x86/intel_ips.c | 4 ++--
> 1 files changed, 2 insertions(+), 2 deletions(-)
>
> diff --git a/drivers/platform/x86/intel_ips.c b/drivers/platform/x86/intel_ips.c
> index 9024480..7833adc 100644
> --- a/drivers/platform/x86/intel_ips.c
> +++ b/drivers/platform/x86/intel_ips.c
> @@ -607,7 +607,7 @@ static bool mcp_exceeded(struct ips_driver *ips)
> spin_unlock_irqrestore(&ips->turbo_status_lock, flags);
>
> if (ret)
> - dev_info(&ips->dev->dev,
> + dev_dbg(&ips->dev->dev,
> "MCP power or thermal limit exceeded\n");

Probably better to ratelimit it. Being told it occurs is important, being
told every five seconds is indeed excessive.

Alan

2010-08-27 00:11:13

by Cesar Eduardo Barros

[permalink] [raw]
Subject: Re: [PATCH] intel_ips: quieten "power or thermal limit exceeded" messages

Em 26-08-2010 20:33, Joe Perches escreveu:
> On Thu, 26 Aug 2010, Cesar Eduardo Barros wrote:
>
>> On my Dell Inspiron N4010, one of these messages is printed every five
>> seconds. Change both to dev_dbg to quieten them even more.
>
> I think you should instead fix your hardware or maybe change
> your thermal throttling settings.

How can I know if this laptop is broken or not? According to Jesse's
reply, the BIOS lowered the limit, which could explain why it hits the
limit more often:

intel ips 0000:00:1f.6: Warning: CPU TDP doesn't match expected value
(found 25, expected 35)

The thermal throttling seems to be also managed by the BIOS (and there
are no settings about it on the BIOS setup that I remember):

CPU: Physical Processor ID: 0
CPU: Processor Core ID: 0
mce: CPU supports 9 MCE banks
CPU0: Thermal monitoring handled by SMI
using mwait in idle threads.
[...]
CPU0: Intel(R) Core(TM) i3 CPU M 330 @ 2.13GHz stepping 02
Booting Node 0, Processors #1
CPU1: Thermal monitoring handled by SMI
#2
CPU2: Thermal monitoring handled by SMI
#3 Ok.
CPU3: Thermal monitoring handled by SMI
Brought up 4 CPUs

And ACPI is of no help, it shows two thermal zones, one always showing
27 C, the other always showing 0 C (and even though that is below the
reported thresholds of 55 C and 71 C for the fans, the fan still speeds
up and down on its own, showing that it is just the reporting that is
broken).

The fan does slow down to almost nothing (or even off) when idle and
spins up to a strong hot breeze when compiling the kernel (make -j8), so
the thermal monitoring on the BIOS seems to be working fine. I would
expect broken hardware to run the fan on high speed all the time (not
cooling enough) or low speed all the time (thermal sensor broken).

The coretemp module (which is not autoloaded) seems to be more helpful;
it shows between 43000 and 45000 for temp1_input on both cores when
idle, going up to 50000-56000 when lightly loaded (I have not looked at
it yet while doing a heavy compile).

Is there a way to know if all this is just an oddness of this model, or
if there is something which is not working quite right?

(All the output above is from 2.6.35.3; I am not running 2.3.36-rc2+
right now because it hangs on resume, and I have not yet had the time to
look at it.)

>
>> @@ -607,7 +607,7 @@ static bool mcp_exceeded(struct ips_driver *ips)
>> - dev_info(&ips->dev->dev,
>> + dev_dbg(&ips->dev->dev,
>> "MCP power or thermal limit exceeded\n");
>
>


--
Cesar Eduardo Barros
[email protected]
[email protected]

2010-08-27 00:22:48

by Cesar Eduardo Barros

[permalink] [raw]
Subject: Re: [PATCH] intel_ips: quieten "power or thermal limit exceeded" messages

Em 26-08-2010 21:18, Alan Cox escreveu:
> On Thu, 26 Aug 2010 20:29:01 -0300
> Cesar Eduardo Barros<[email protected]> wrote:
>
>> On my Dell Inspiron N4010, one of these messages is printed every five
>> seconds. Change both to dev_dbg to quieten them even more.
>
> It's not the kind of thing you want to quieten since in most cases it is
> stuff you want to know about.

Isn't it a normal thing, however? If I understand it correctly, this
driver allows you to use a bit more power but throttles back if you are
using too much or it gets too hot. Thus, there should be nothing wrong
or even unusual with the situations where these messages appear.

>> if (ret)
>> - dev_info(&ips->dev->dev,
>> + dev_dbg(&ips->dev->dev,
>> "MCP power or thermal limit exceeded\n");
>
> Probably better to ratelimit it. Being told it occurs is important, being
> told every five seconds is indeed excessive.

Agreed. Even if it does not happen when idle, it should happen a lot
while playing a CPU-heavy and GPU-heavy 3D game.

--
Cesar Eduardo Barros
[email protected]
[email protected]

2010-08-27 00:41:07

by Joe Perches

[permalink] [raw]
Subject: Re: [PATCH] intel_ips: quieten "power or thermal limit exceeded" messages

On Thu, 2010-08-26 at 21:11 -0300, Cesar Eduardo Barros wrote:
> Em 26-08-2010 20:33, Joe Perches escreveu:
> > On Thu, 26 Aug 2010, Cesar Eduardo Barros wrote:
> >> On my Dell Inspiron N4010, one of these messages is printed every five
> >> seconds. Change both to dev_dbg to quieten them even more.
> > I think you should instead fix your hardware or maybe change
> > your thermal throttling settings.

I was probably a bit hasty in writing that.

> Is there a way to know if all this is just an oddness of this model, or
> if there is something which is not working quite right?
>
> (All the output above is from 2.6.35.3; I am not running 2.3.36-rc2+
> right now because it hangs on resume, and I have not yet had the time to
> look at it.)

Perhaps you might try this patch and get a bit more information.

It seems a sensible patch and perhaps should be applied anyway.

Signed-off-by: Joe Perches <[email protected]>
---
drivers/platform/x86/intel_ips.c | 58 +++++++++++++++++++++++++++++--------
1 files changed, 45 insertions(+), 13 deletions(-)

diff --git a/drivers/platform/x86/intel_ips.c b/drivers/platform/x86/intel_ips.c
index 9024480..73f9ad1 100644
--- a/drivers/platform/x86/intel_ips.c
+++ b/drivers/platform/x86/intel_ips.c
@@ -598,17 +598,36 @@ static bool mcp_exceeded(struct ips_driver *ips)
{
unsigned long flags;
bool ret = false;
+ u16 mcp_avg_temp;
+ u16 mcp_temp_limit;
+ u16 mcp_power_limit;
+ u32 cpu_avg_power;
+ u32 mch_avg_power;

spin_lock_irqsave(&ips->turbo_status_lock, flags);
- if (ips->mcp_avg_temp > (ips->mcp_temp_limit * 100))
- ret = true;
- if (ips->cpu_avg_power + ips->mch_avg_power > ips->mcp_power_limit)
- ret = true;
+
+ mcp_avg_temp = ips->mcp_avg_temp;
+ mcp_temp_limit = ips->mcp_temp_limit;
+ mcp_power_limit = ips->mcp_power_limit;
+ cpu_avg_power = ips->cpu_avg_power;
+ mch_avg_power = ips->mch_avg_power;
+
spin_unlock_irqrestore(&ips->turbo_status_lock, flags);

- if (ret)
+ if ((cpu_avg_power + mch_avg_power) > mcp_power_limit) {
dev_info(&ips->dev->dev,
- "MCP power or thermal limit exceeded\n");
+ "MCP power limit %d exceeded: %d\n",
+ mcp_power_limit,
+ cpu_avg_power + mch_avg_power);
+ ret = true;
+ }
+ if (mcp_avg_temp > (mcp_temp_limit * 100)) {
+ dev_info(&ips->dev->dev,
+ "MCP thermal limit %d exceeded: %d\n",
+ mcp_temp_limit * 100,
+ mcp_avg_temp);
+ ret = true;
+ }

return ret;
}
@@ -623,20 +642,33 @@ static bool mcp_exceeded(struct ips_driver *ips)
static bool cpu_exceeded(struct ips_driver *ips, int cpu)
{
unsigned long flags;
- int avg;
bool ret = false;
+ int avg;
+ int core_temp_limit;
+ u16 core_power_limit;
+ u32 cpu_avg_power;

spin_lock_irqsave(&ips->turbo_status_lock, flags);
+
avg = cpu ? ips->ctv2_avg_temp : ips->ctv1_avg_temp;
- if (avg > (ips->limits->core_temp_limit * 100))
- ret = true;
- if (ips->cpu_avg_power > ips->core_power_limit * 100)
- ret = true;
+ core_temp_limit = ips->limits->core_temp_limit;
+ core_power_limit = ips->core_power_limit;
+ cpu_avg_power = ips->cpu_avg_power;
+
spin_unlock_irqrestore(&ips->turbo_status_lock, flags);

- if (ret)
+ if (cpu_avg_power > (core_power_limit * 100)) {
+ dev_info(&ips->dev->dev,
+ "CPU power limit %d exceeded: %d\n",
+ cpu_avg_power, core_power_limit * 100);
+ ret = true;
+ }
+ if (avg > (core_temp_limit * 100)) {
dev_info(&ips->dev->dev,
- "CPU power or thermal limit exceeded\n");
+ "CPU thermal limit %d exceeded: %d\n",
+ core_temp_limit * 100, avg);
+ ret = true;
+ }

return ret;
}

2010-08-27 01:38:45

by Cesar Eduardo Barros

[permalink] [raw]
Subject: Re: [PATCH] intel_ips: quieten "power or thermal limit exceeded" messages

Em 26-08-2010 21:41, Joe Perches escreveu:
> On Thu, 2010-08-26 at 21:11 -0300, Cesar Eduardo Barros wrote:
>> Em 26-08-2010 20:33, Joe Perches escreveu:
>>> On Thu, 26 Aug 2010, Cesar Eduardo Barros wrote:
>>>> On my Dell Inspiron N4010, one of these messages is printed every five
>>>> seconds. Change both to dev_dbg to quieten them even more.
>>> I think you should instead fix your hardware or maybe change
>>> your thermal throttling settings.
>
> I was probably a bit hasty in writing that.
>
>> Is there a way to know if all this is just an oddness of this model, or
>> if there is something which is not working quite right?
>>
>> (All the output above is from 2.6.35.3; I am not running 2.3.36-rc2+
>> right now because it hangs on resume, and I have not yet had the time to
>> look at it.)
>
> Perhaps you might try this patch and get a bit more information.

Running with it right now. Unless I missed one, the messages do happen
exactly every five seconds. Here are the first few lines of the dmesg
(grepping for 'intel ips'), and it does seem a bit strange:

$ dmesg | fgrep intel\ ips
intel ips 0000:00:1f.6: Warning: CPU TDP doesn't match expected value
(found 25, expected 35)
intel ips 0000:00:1f.6: PCI INT C -> GSI 18 (level, low) -> IRQ 18
intel ips 0000:00:1f.6: IPS driver initialized, MCP temp limit 65535
intel ips 0000:00:1f.6: MCP power limit 65535 exceeded: 25615533
intel ips 0000:00:1f.6: CPU power limit 8183 exceeded: 0
intel ips 0000:00:1f.6: CPU power limit 6276 exceeded: 0
intel ips 0000:00:1f.6: CPU power limit 7952 exceeded: 0
intel ips 0000:00:1f.6: CPU power limit 7155 exceeded: 0
intel ips 0000:00:1f.6: CPU power limit 5928 exceeded: 0
intel ips 0000:00:1f.6: MCP power limit 65535 exceeded: 70848
intel ips 0000:00:1f.6: CPU power limit 6430 exceeded: 0
intel ips 0000:00:1f.6: CPU power limit 6474 exceeded: 0
intel ips 0000:00:1f.6: CPU power limit 5508 exceeded: 0
intel ips 0000:00:1f.6: CPU power limit 6569 exceeded: 0
intel ips 0000:00:1f.6: CPU power limit 5250 exceeded: 0
intel ips 0000:00:1f.6: CPU power limit 5023 exceeded: 0
intel ips 0000:00:1f.6: CPU power limit 6209 exceeded: 0
intel ips 0000:00:1f.6: CPU power limit 7276 exceeded: 0
intel ips 0000:00:1f.6: CPU power limit 9027 exceeded: 0
intel ips 0000:00:1f.6: CPU power limit 7008 exceeded: 0
intel ips 0000:00:1f.6: CPU power limit 5478 exceeded: 0
intel ips 0000:00:1f.6: CPU power limit 6658 exceeded: 0
intel ips 0000:00:1f.6: CPU power limit 5192 exceeded: 0
intel ips 0000:00:1f.6: CPU power limit 6347 exceeded: 0
intel ips 0000:00:1f.6: CPU power limit 5506 exceeded: 0
intel ips 0000:00:1f.6: CPU power limit 4447 exceeded: 0
intel ips 0000:00:1f.6: CPU power limit 4462 exceeded: 0
intel ips 0000:00:1f.6: CPU power limit 4382 exceeded: 0
intel ips 0000:00:1f.6: CPU power limit 4862 exceeded: 0
intel ips 0000:00:1f.6: CPU power limit 5218 exceeded: 0
intel ips 0000:00:1f.6: CPU power limit 4865 exceeded: 0
intel ips 0000:00:1f.6: CPU power limit 4131 exceeded: 0
intel ips 0000:00:1f.6: CPU power limit 5331 exceeded: 0
intel ips 0000:00:1f.6: CPU power limit 6012 exceeded: 0
intel ips 0000:00:1f.6: CPU power limit 5323 exceeded: 0
intel ips 0000:00:1f.6: CPU power limit 4727 exceeded: 0
intel ips 0000:00:1f.6: CPU power limit 4949 exceeded: 0
intel ips 0000:00:1f.6: MCP power limit 65535 exceeded: 69539
intel ips 0000:00:1f.6: CPU power limit 5045 exceeded: 0
intel ips 0000:00:1f.6: CPU power limit 5070 exceeded: 0
intel ips 0000:00:1f.6: CPU power limit 5113 exceeded: 0
intel ips 0000:00:1f.6: CPU power limit 4262 exceeded: 0
intel ips 0000:00:1f.6: CPU power limit 5158 exceeded: 0
intel ips 0000:00:1f.6: CPU power limit 4865 exceeded: 0
intel ips 0000:00:1f.6: CPU power limit 4302 exceeded: 0
intel ips 0000:00:1f.6: CPU power limit 4430 exceeded: 0
intel ips 0000:00:1f.6: CPU power limit 4841 exceeded: 0
intel ips 0000:00:1f.6: CPU power limit 4532 exceeded: 0
intel ips 0000:00:1f.6: CPU power limit 4814 exceeded: 0
intel ips 0000:00:1f.6: CPU power limit 4803 exceeded: 0
intel ips 0000:00:1f.6: CPU power limit 5139 exceeded: 0
intel ips 0000:00:1f.6: CPU power limit 4050 exceeded: 0
intel ips 0000:00:1f.6: CPU power limit 6176 exceeded: 0
intel ips 0000:00:1f.6: CPU power limit 4746 exceeded: 0
intel ips 0000:00:1f.6: CPU power limit 3739 exceeded: 0
intel ips 0000:00:1f.6: CPU power limit 4863 exceeded: 0
intel ips 0000:00:1f.6: CPU power limit 4917 exceeded: 0
intel ips 0000:00:1f.6: CPU power limit 4099 exceeded: 0
intel ips 0000:00:1f.6: CPU power limit 4956 exceeded: 0
intel ips 0000:00:1f.6: CPU power limit 5018 exceeded: 0
intel ips 0000:00:1f.6: CPU power limit 3863 exceeded: 0

I think you put the parameters in the wrong order on the dev_info() call
for the CPU power limit; it is the limit that is 0.

Two bogus things I can see:

- The first "MCP power limit exceeded" seems very bogus.
- What do you mean, core_power_limit is zero?

I will go back to 2.6.35; not feeling like debugging my (unrelated)
suspend/resume problem right now. If you have another debugging patch, I
will probably test it only tomorrow night.

--
Cesar Eduardo Barros
[email protected]
[email protected]

2010-08-27 01:43:09

by Matthew Garrett

[permalink] [raw]
Subject: Re: [PATCH] intel_ips: quieten "power or thermal limit exceeded" messages

I agree that something like this is needed, but I'm happy to defer to
Jesse in terms of the implementation (he has the specs, I don't...). I'm
also about to head offline for a week, so feel free to push the patch
once you've got a solution and if I disagree with it I'll write whiny
emails once I get back.

--
Matthew Garrett | [email protected]

2010-08-27 07:39:17

by Joe Perches

[permalink] [raw]
Subject: Re: [PATCH] intel_ips: quieten "power or thermal limit exceeded" messages

On Thu, 2010-08-26 at 22:38 -0300, Cesar Eduardo Barros wrote:
> intel ips 0000:00:1f.6: CPU power limit 3863 exceeded: 0
>
> I think you put the parameters in the wrong order on the dev_info() call
> for the CPU power limit; it is the limit that is 0.

Yes. Fixed.

> Two bogus things I can see:
>
> - The first "MCP power limit exceeded" seems very bogus.
> - What do you mean, core_power_limit is zero?

I added a logging message whenever the turbo limits change
and logging messages for power/temp on MCH for completeness.

Maybe this will show something useful like when/how
CPU power limit gets set to 0.

drivers/platform/x86/intel_ips.c | 105 +++++++++++++++++++++++++++++++-------
1 files changed, 87 insertions(+), 18 deletions(-)

diff --git a/drivers/platform/x86/intel_ips.c b/drivers/platform/x86/intel_ips.c
index 9024480..9c65e66 100644
--- a/drivers/platform/x86/intel_ips.c
+++ b/drivers/platform/x86/intel_ips.c
@@ -598,17 +598,36 @@ static bool mcp_exceeded(struct ips_driver *ips)
{
unsigned long flags;
bool ret = false;
+ u16 mcp_avg_temp;
+ u16 mcp_temp_limit;
+ u16 mcp_power_limit;
+ u32 cpu_avg_power;
+ u32 mch_avg_power;

spin_lock_irqsave(&ips->turbo_status_lock, flags);
- if (ips->mcp_avg_temp > (ips->mcp_temp_limit * 100))
- ret = true;
- if (ips->cpu_avg_power + ips->mch_avg_power > ips->mcp_power_limit)
- ret = true;
+
+ mcp_avg_temp = ips->mcp_avg_temp;
+ mcp_temp_limit = ips->mcp_temp_limit;
+ mcp_power_limit = ips->mcp_power_limit;
+ cpu_avg_power = ips->cpu_avg_power;
+ mch_avg_power = ips->mch_avg_power;
+
spin_unlock_irqrestore(&ips->turbo_status_lock, flags);

- if (ret)
+ if ((cpu_avg_power + mch_avg_power) > mcp_power_limit) {
+ dev_info(&ips->dev->dev,
+ "MCP power limit %u exceeded: cpu:%u + mch:%u\n",
+ mcp_power_limit,
+ cpu_avg_power, mch_avg_power);
+ ret = true;
+ }
+ if (mcp_avg_temp > (mcp_temp_limit * 100)) {
dev_info(&ips->dev->dev,
- "MCP power or thermal limit exceeded\n");
+ "MCP thermal limit %d exceeded: %d\n",
+ mcp_temp_limit * 100,
+ mcp_avg_temp);
+ ret = true;
+ }

return ret;
}
@@ -623,20 +642,33 @@ static bool mcp_exceeded(struct ips_driver *ips)
static bool cpu_exceeded(struct ips_driver *ips, int cpu)
{
unsigned long flags;
- int avg;
bool ret = false;
+ int avg;
+ int core_temp_limit;
+ u16 core_power_limit;
+ u32 cpu_avg_power;

spin_lock_irqsave(&ips->turbo_status_lock, flags);
+
avg = cpu ? ips->ctv2_avg_temp : ips->ctv1_avg_temp;
- if (avg > (ips->limits->core_temp_limit * 100))
- ret = true;
- if (ips->cpu_avg_power > ips->core_power_limit * 100)
- ret = true;
+ core_temp_limit = ips->limits->core_temp_limit;
+ core_power_limit = ips->core_power_limit;
+ cpu_avg_power = ips->cpu_avg_power;
+
spin_unlock_irqrestore(&ips->turbo_status_lock, flags);

- if (ret)
+ if (cpu_avg_power > (core_power_limit * 100)) {
+ dev_info(&ips->dev->dev,
+ "CPU power limit %d exceeded: %d\n",
+ core_power_limit * 100, cpu_avg_power);
+ ret = true;
+ }
+ if (avg > (core_temp_limit * 100)) {
dev_info(&ips->dev->dev,
- "CPU power or thermal limit exceeded\n");
+ "CPU thermal limit %d exceeded: %d\n",
+ core_temp_limit * 100, avg);
+ ret = true;
+ }

return ret;
}
@@ -652,13 +684,32 @@ static bool mch_exceeded(struct ips_driver *ips)
unsigned long flags;
bool ret = false;

+ u16 mch_avg_temp;
+ int mch_temp_limit;
+ u32 mch_avg_power;
+ u16 mch_power_limit;
+
spin_lock_irqsave(&ips->turbo_status_lock, flags);
- if (ips->mch_avg_temp > (ips->limits->mch_temp_limit * 100))
- ret = true;
- if (ips->mch_avg_power > ips->mch_power_limit)
- ret = true;
+
+ mch_avg_temp = ips->mch_avg_temp;
+ mch_temp_limit = ips->limits->mch_temp_limit;
+ mch_avg_power = ips->mch_avg_power;
+ mch_power_limit = ips->mch_power_limit;
+
spin_unlock_irqrestore(&ips->turbo_status_lock, flags);

+ if (mch_avg_power > mch_power_limit) {
+ dev_info(&ips->dev->dev,
+ "MCH power limit %d exceeded: %d\n",
+ mch_power_limit, mch_avg_power);
+ ret = true;
+ }
+ if (mch_avg_temp > (mch_temp_limit * 100)) {
+ dev_info(&ips->dev->dev,
+ "MCH thermal limit %d exceeded: %d\n",
+ mch_temp_limit * 100, mch_avg_temp);
+ ret = true;
+ }
return ret;
}

@@ -689,6 +740,16 @@ static void update_turbo_limits(struct ips_driver *ips)
/* Ignore BIOS CPU vs GPU pref */
}

+static void show_turbo_limits(const struct ips_driver *ips, const char *caller)
+{
+ dev_info(&ips->dev->dev,
+ "%s:%s cte:%u gte:%u cpt:%u mpl:%u mtl:%u mpl:%u\n",
+ __func__, caller,
+ ips->cpu_turbo_enabled, ips->gpu_turbo_enabled,
+ ips->core_power_limit, ips->mch_power_limit,
+ ips->mcp_temp_limit, ips->mcp_power_limit);
+}
+
/**
* ips_adjust - adjust power clamp based on thermal state
* @data: ips driver structure
@@ -734,12 +795,18 @@ static int ips_adjust(void *data)
do {
bool cpu_busy = ips_cpu_busy(ips);
bool gpu_busy = ips_gpu_busy(ips);
+ bool updated = false;

spin_lock_irqsave(&ips->turbo_status_lock, flags);
- if (ips->poll_turbo_status)
+ if (ips->poll_turbo_status) {
update_turbo_limits(ips);
+ updated = true;
+ }
spin_unlock_irqrestore(&ips->turbo_status_lock, flags);

+ if (updated)
+ show_turbo_limits(ips, __func__);
+
/* Update turbo status if necessary */
if (ips->cpu_turbo_enabled)
ips_enable_cpu_turbo(ips);
@@ -1158,6 +1225,8 @@ static irqreturn_t ips_irq_handler(int irq, void *arg)
spin_unlock(&ips->turbo_status_lock);

thm_writeb(THM_SEC, SEC_ACK);
+
+ show_turbo_limits(ips, __func__);
}
thm_writeb(THM_TES, tes);
}

2010-08-27 23:12:22

by Cesar Eduardo Barros

[permalink] [raw]
Subject: Re: [PATCH] intel_ips: quieten "power or thermal limit exceeded" messages

Em 27-08-2010 04:39, Joe Perches escreveu:
> On Thu, 2010-08-26 at 22:38 -0300, Cesar Eduardo Barros wrote:
>> intel ips 0000:00:1f.6: CPU power limit 3863 exceeded: 0
>>
>> I think you put the parameters in the wrong order on the dev_info() call
>> for the CPU power limit; it is the limit that is 0.
>
> Yes. Fixed.
>
>> Two bogus things I can see:
>>
>> - The first "MCP power limit exceeded" seems very bogus.
>> - What do you mean, core_power_limit is zero?
>
> I added a logging message whenever the turbo limits change
> and logging messages for power/temp on MCH for completeness.
>
> Maybe this will show something useful like when/how
> CPU power limit gets set to 0.

Running with it right now, did not help much:

$ dmesg | fgrep 'intel ips'
intel ips 0000:00:1f.6: Warning: CPU TDP doesn't match expected value
(found 25, expected 35)
intel ips 0000:00:1f.6: PCI INT C -> GSI 18 (level, low) -> IRQ 18
intel ips 0000:00:1f.6: IPS driver initialized, MCP temp limit 65535
intel ips 0000:00:1f.6: MCP power limit 65535 exceeded: cpu:8058 +
mch:23392829
intel ips 0000:00:1f.6: CPU power limit 0 exceeded: 5675
intel ips 0000:00:1f.6: CPU power limit 0 exceeded: 6369
intel ips 0000:00:1f.6: MCP power limit 65535 exceeded: cpu:5095 + mch:65379
intel ips 0000:00:1f.6: CPU power limit 0 exceeded: 7387
intel ips 0000:00:1f.6: CPU power limit 0 exceeded: 8326
intel ips 0000:00:1f.6: CPU power limit 0 exceeded: 5943
intel ips 0000:00:1f.6: CPU power limit 0 exceeded: 6428
intel ips 0000:00:1f.6: CPU power limit 0 exceeded: 5775
intel ips 0000:00:1f.6: CPU power limit 0 exceeded: 7061
intel ips 0000:00:1f.6: CPU power limit 0 exceeded: 5153
intel ips 0000:00:1f.6: CPU power limit 0 exceeded: 5098
intel ips 0000:00:1f.6: CPU power limit 0 exceeded: 5208
intel ips 0000:00:1f.6: CPU power limit 0 exceeded: 7500
intel ips 0000:00:1f.6: CPU power limit 0 exceeded: 9144
intel ips 0000:00:1f.6: CPU power limit 0 exceeded: 6722
intel ips 0000:00:1f.6: CPU power limit 0 exceeded: 7156
intel ips 0000:00:1f.6: MCP power limit 65535 exceeded: cpu:5693 + mch:64598
intel ips 0000:00:1f.6: CPU power limit 0 exceeded: 5856
intel ips 0000:00:1f.6: CPU power limit 0 exceeded: 4209
intel ips 0000:00:1f.6: CPU power limit 0 exceeded: 4726
intel ips 0000:00:1f.6: CPU power limit 0 exceeded: 5259
intel ips 0000:00:1f.6: CPU power limit 0 exceeded: 5212
intel ips 0000:00:1f.6: CPU power limit 0 exceeded: 4862
intel ips 0000:00:1f.6: CPU power limit 0 exceeded: 5281
intel ips 0000:00:1f.6: CPU power limit 0 exceeded: 4235
intel ips 0000:00:1f.6: CPU power limit 0 exceeded: 4897
intel ips 0000:00:1f.6: CPU power limit 0 exceeded: 5257
intel ips 0000:00:1f.6: MCP power limit 65535 exceeded: cpu:5111 + mch:64134
intel ips 0000:00:1f.6: CPU power limit 0 exceeded: 4843
intel ips 0000:00:1f.6: CPU power limit 0 exceeded: 4909
intel ips 0000:00:1f.6: CPU power limit 0 exceeded: 5904
intel ips 0000:00:1f.6: CPU power limit 0 exceeded: 6059
intel ips 0000:00:1f.6: CPU power limit 0 exceeded: 5579
intel ips 0000:00:1f.6: CPU power limit 0 exceeded: 5970
intel ips 0000:00:1f.6: CPU power limit 0 exceeded: 5213
intel ips 0000:00:1f.6: CPU power limit 0 exceeded: 6388
intel ips 0000:00:1f.6: CPU power limit 0 exceeded: 4444
intel ips 0000:00:1f.6: CPU power limit 0 exceeded: 6545
intel ips 0000:00:1f.6: CPU power limit 0 exceeded: 4439
intel ips 0000:00:1f.6: CPU power limit 0 exceeded: 4682
intel ips 0000:00:1f.6: CPU power limit 0 exceeded: 4337
intel ips 0000:00:1f.6: CPU power limit 0 exceeded: 4499

Just for fun, here is what debugfs has to say:

$ ls /sys/kernel/debug/ips/
cpu_clamp cpu_power cpu_temp mch_power mch_temp
$ cat /sys/kernel/debug/ips/*
23.0W 21.0A
4415mW
46.99
59670mW
0.00

All numbers except the last vary a lot. A few more examples:

21.0W 21.0A
4379mW
51.09
6493mW
0.00

23.0W 21.0A
4681mW
45.13
34062mW
0.00

--
Cesar Eduardo Barros
[email protected]
[email protected]

2010-08-28 02:21:47

by Joe Perches

[permalink] [raw]
Subject: Re: [PATCH] intel_ips: quieten "power or thermal limit exceeded" messages

On Fri, 2010-08-27 at 20:12 -0300, Cesar Eduardo Barros wrote:
> Em 27-08-2010 04:39, Joe Perches escreveu:
> > On Thu, 2010-08-26 at 22:38 -0300, Cesar Eduardo Barros wrote:
> >> - The first "MCP power limit exceeded" seems very bogus.
> >> - What do you mean, core_power_limit is zero?
> > I added a logging message whenever the turbo limits change
> > and logging messages for power/temp on MCH for completeness.
> > Maybe this will show something useful like when/how
> > CPU power limit gets set to 0.

> Running with it right now, did not help much:
>
> $ dmesg | fgrep 'intel ips'
> intel ips 0000:00:1f.6: Warning: CPU TDP doesn't match expected value
> (found 25, expected 35)
> intel ips 0000:00:1f.6: PCI INT C -> GSI 18 (level, low) -> IRQ 18
> intel ips 0000:00:1f.6: IPS driver initialized, MCP temp limit 65535
> intel ips 0000:00:1f.6: MCP power limit 65535 exceeded: cpu:8058 +
> mch:23392829
> intel ips 0000:00:1f.6: CPU power limit 0 exceeded: 5675
> intel ips 0000:00:1f.6: CPU power limit 0 exceeded: 6369

I believe all these limits should always have non-zero values.
So I still think you've hardware problems, but I suppose it
could be the driver not reading the right registers or some
such. It seems odd that the driver never printed a logging
message for either of the polling or irq methods to read the
device cpu and thermal limits.

Jesse or any Intel folk, can you verify or suggest anything
better?

If cpu_power_limit, or any _limit, is not set perhaps changing
the test style to verify limit and adding a printed_once alert
for each 0 value limit. At least that'd shut up the continuous
logging but at least give a notification message.

if (limit) {
if (measured_val > limit)
dev_info(foo)
} else
dev_alert_once()

Maybe something like this:

drivers/platform/x86/intel_ips.c | 160 +++++++++++++++++++++++++++++++++-----
1 files changed, 140 insertions(+), 20 deletions(-)

diff --git a/drivers/platform/x86/intel_ips.c b/drivers/platform/x86/intel_ips.c
index 9024480..10eba04 100644
--- a/drivers/platform/x86/intel_ips.c
+++ b/drivers/platform/x86/intel_ips.c
@@ -598,17 +598,53 @@ static bool mcp_exceeded(struct ips_driver *ips)
{
unsigned long flags;
bool ret = false;
+ static bool printed_zero_mcp_power_limit;
+ static bool printed_zero_mcp_temp_limit;
+ u16 mcp_avg_temp;
+ u16 mcp_temp_limit;
+ u16 mcp_power_limit;
+ u32 cpu_avg_power;
+ u32 mch_avg_power;

spin_lock_irqsave(&ips->turbo_status_lock, flags);
- if (ips->mcp_avg_temp > (ips->mcp_temp_limit * 100))
- ret = true;
- if (ips->cpu_avg_power + ips->mch_avg_power > ips->mcp_power_limit)
- ret = true;
+
+ mcp_avg_temp = ips->mcp_avg_temp;
+ mcp_temp_limit = ips->mcp_temp_limit;
+ mcp_power_limit = ips->mcp_power_limit;
+ cpu_avg_power = ips->cpu_avg_power;
+ mch_avg_power = ips->mch_avg_power;
+
spin_unlock_irqrestore(&ips->turbo_status_lock, flags);

- if (ret)
- dev_info(&ips->dev->dev,
- "MCP power or thermal limit exceeded\n");
+ if (mcp_power_limit) {
+ if ((cpu_avg_power + mch_avg_power) > mcp_power_limit) {
+ dev_info(&ips->dev->dev,
+ "MCP power limit %u exceeded: cpu:%u+mch:%u\n",
+ mcp_power_limit,
+ cpu_avg_power, mch_avg_power);
+ ret = true;
+ }
+ } else {
+ if (!printed_zero_mcp_power_limit) {
+ printed_zero_mcp_power_limit = true;
+ dev_alert(&ips->dev->dev, "MCP power limit is 0!\n");
+ }
+ }
+
+ if (mcp_temp_limit) {
+ if (mcp_avg_temp > (mcp_temp_limit * 100)) {
+ dev_info(&ips->dev->dev,
+ "MCP thermal limit %d exceeded: %d\n",
+ mcp_temp_limit * 100,
+ mcp_avg_temp);
+ ret = true;
+ }
+ } else {
+ if (!printed_zero_mcp_temp_limit) {
+ printed_zero_mcp_temp_limit = true;
+ dev_alert(&ips->dev->dev, "MCP thermal limit is 0!\n");
+ }
+ }

return ret;
}
@@ -623,20 +659,50 @@ static bool mcp_exceeded(struct ips_driver *ips)
static bool cpu_exceeded(struct ips_driver *ips, int cpu)
{
unsigned long flags;
- int avg;
bool ret = false;
+ static bool printed_zero_core_power_limit;
+ static bool printed_zero_core_temp_limit;
+ int avg;
+ int core_temp_limit;
+ u16 core_power_limit;
+ u32 cpu_avg_power;

spin_lock_irqsave(&ips->turbo_status_lock, flags);
+
avg = cpu ? ips->ctv2_avg_temp : ips->ctv1_avg_temp;
- if (avg > (ips->limits->core_temp_limit * 100))
- ret = true;
- if (ips->cpu_avg_power > ips->core_power_limit * 100)
- ret = true;
+ core_temp_limit = ips->limits->core_temp_limit;
+ core_power_limit = ips->core_power_limit;
+ cpu_avg_power = ips->cpu_avg_power;
+
spin_unlock_irqrestore(&ips->turbo_status_lock, flags);

- if (ret)
- dev_info(&ips->dev->dev,
- "CPU power or thermal limit exceeded\n");
+ if (core_power_limit) {
+ if (cpu_avg_power > (core_power_limit * 100)) {
+ dev_info(&ips->dev->dev,
+ "CPU power limit %d exceeded: %d\n",
+ core_power_limit * 100, cpu_avg_power);
+ ret = true;
+ }
+ } else {
+ if (!printed_zero_core_power_limit) {
+ printed_zero_core_power_limit = true;
+ dev_alert(&ips->dev->dev, "CPU power limit is 0!\n");
+ }
+ }
+
+ if (core_temp_limit) {
+ if (avg > (core_temp_limit * 100)) {
+ dev_info(&ips->dev->dev,
+ "CPU thermal limit %d exceeded: %d\n",
+ core_temp_limit * 100, avg);
+ ret = true;
+ }
+ } else {
+ if (!printed_zero_core_temp_limit) {
+ printed_zero_core_temp_limit = true;
+ dev_alert(&ips->dev->dev, "CPU thermal limit is 0!\n");
+ }
+ }

return ret;
}
@@ -651,14 +717,50 @@ static bool mch_exceeded(struct ips_driver *ips)
{
unsigned long flags;
bool ret = false;
+ static bool printed_zero_mch_power_limit;
+ static bool printed_zero_mch_temp_limit;
+ u16 mch_avg_temp;
+ int mch_temp_limit;
+ u32 mch_avg_power;
+ u16 mch_power_limit;

spin_lock_irqsave(&ips->turbo_status_lock, flags);
- if (ips->mch_avg_temp > (ips->limits->mch_temp_limit * 100))
- ret = true;
- if (ips->mch_avg_power > ips->mch_power_limit)
- ret = true;
+
+ mch_avg_temp = ips->mch_avg_temp;
+ mch_temp_limit = ips->limits->mch_temp_limit;
+ mch_avg_power = ips->mch_avg_power;
+ mch_power_limit = ips->mch_power_limit;
+
spin_unlock_irqrestore(&ips->turbo_status_lock, flags);

+ if (mch_power_limit) {
+ if (mch_avg_power > mch_power_limit) {
+ dev_info(&ips->dev->dev,
+ "MCH power limit %d exceeded: %d\n",
+ mch_power_limit, mch_avg_power);
+ ret = true;
+ }
+ } else {
+ if (!printed_zero_mch_power_limit) {
+ printed_zero_mch_power_limit = true;
+ dev_alert(&ips->dev->dev, "MCH power limit is 0!\n");
+ }
+ }
+
+ if (mch_temp_limit) {
+ if (mch_avg_temp > (mch_temp_limit * 100)) {
+ dev_info(&ips->dev->dev,
+ "MCH thermal limit %d exceeded: %d\n",
+ mch_temp_limit * 100, mch_avg_temp);
+ ret = true;
+ }
+ } else {
+ if (!printed_zero_mch_temp_limit) {
+ printed_zero_mch_temp_limit = true;
+ dev_alert(&ips->dev->dev, "MCH thermal limit is 0!\n");
+ }
+ }
+
return ret;
}

@@ -689,6 +791,16 @@ static void update_turbo_limits(struct ips_driver *ips)
/* Ignore BIOS CPU vs GPU pref */
}

+static void show_turbo_limits(const struct ips_driver *ips, const char *caller)
+{
+ dev_info(&ips->dev->dev,
+ "%s:%s cte:%u gte:%u cpt:%u mpl:%u mtl:%u mpl:%u\n",
+ __func__, caller,
+ ips->cpu_turbo_enabled, ips->gpu_turbo_enabled,
+ ips->core_power_limit, ips->mch_power_limit,
+ ips->mcp_temp_limit, ips->mcp_power_limit);
+}
+
/**
* ips_adjust - adjust power clamp based on thermal state
* @data: ips driver structure
@@ -734,12 +846,18 @@ static int ips_adjust(void *data)
do {
bool cpu_busy = ips_cpu_busy(ips);
bool gpu_busy = ips_gpu_busy(ips);
+ bool updated = false;

spin_lock_irqsave(&ips->turbo_status_lock, flags);
- if (ips->poll_turbo_status)
+ if (ips->poll_turbo_status) {
update_turbo_limits(ips);
+ updated = true;
+ }
spin_unlock_irqrestore(&ips->turbo_status_lock, flags);

+ if (updated)
+ show_turbo_limits(ips, __func__);
+
/* Update turbo status if necessary */
if (ips->cpu_turbo_enabled)
ips_enable_cpu_turbo(ips);
@@ -1158,6 +1276,8 @@ static irqreturn_t ips_irq_handler(int irq, void *arg)
spin_unlock(&ips->turbo_status_lock);

thm_writeb(THM_SEC, SEC_ACK);
+
+ show_turbo_limits(ips, __func__);
}
thm_writeb(THM_TES, tes);
}

2010-08-28 10:46:14

by Cesar Eduardo Barros

[permalink] [raw]
Subject: Re: [PATCH] intel_ips: quieten "power or thermal limit exceeded" messages

Em 27-08-2010 23:21, Joe Perches escreveu:
> On Fri, 2010-08-27 at 20:12 -0300, Cesar Eduardo Barros wrote:
>> Em 27-08-2010 04:39, Joe Perches escreveu:
>>> On Thu, 2010-08-26 at 22:38 -0300, Cesar Eduardo Barros wrote:
>>>> - The first "MCP power limit exceeded" seems very bogus.
>>>> - What do you mean, core_power_limit is zero?
>>> I added a logging message whenever the turbo limits change
>>> and logging messages for power/temp on MCH for completeness.
>>> Maybe this will show something useful like when/how
>>> CPU power limit gets set to 0.
>
>> Running with it right now, did not help much:
>>
>> $ dmesg | fgrep 'intel ips'
>> intel ips 0000:00:1f.6: Warning: CPU TDP doesn't match expected value
>> (found 25, expected 35)
>> intel ips 0000:00:1f.6: PCI INT C -> GSI 18 (level, low) -> IRQ 18
>> intel ips 0000:00:1f.6: IPS driver initialized, MCP temp limit 65535
>> intel ips 0000:00:1f.6: MCP power limit 65535 exceeded: cpu:8058 +
>> mch:23392829
>> intel ips 0000:00:1f.6: CPU power limit 0 exceeded: 5675
>> intel ips 0000:00:1f.6: CPU power limit 0 exceeded: 6369
>
> I believe all these limits should always have non-zero values.
> So I still think you've hardware problems, but I suppose it
> could be the driver not reading the right registers or some
> such. It seems odd that the driver never printed a logging
> message for either of the polling or irq methods to read the
> device cpu and thermal limits.

Come on, no blaming the BIOS? ;-)

If I read the code with your previous patch correctly, show_turbo_limits
will never be called if poll_turbo_status is false but no interrupt
happens. And we know no interrupt happened (at least not with nonzero
register values), because the interrupt handler does two dev_info()
right at the beginning. So the limits could still be the ones initially
set at ips_probe().

I will try to enable dev_dbg() later and see what it prints.

>
> Jesse or any Intel folk, can you verify or suggest anything
> better?
>
> If cpu_power_limit, or any _limit, is not set perhaps changing
> the test style to verify limit and adding a printed_once alert
> for each 0 value limit. At least that'd shut up the continuous
> logging but at least give a notification message.
>
> if (limit) {
> if (measured_val> limit)
> dev_info(foo)
> } else
> dev_alert_once()

Wouldn't it make more sense to do the alert when the limit is set,
instead of when it is used? Also, it should still treat it as limit
exceeded (better safe than sorry). Something like:

if (measured_val > limit) {
if (limit)
dev_info(...);
ret = true;
}

--
Cesar Eduardo Barros
[email protected]
[email protected]

2010-08-28 12:52:41

by Cesar Eduardo Barros

[permalink] [raw]
Subject: Re: [PATCH] intel_ips: quieten "power or thermal limit exceeded" messages

You seem to have dropped the CC list by accident, adding it back.

Em 28-08-2010 08:15, Joe Perches escreveu:
> On Sat, 2010-08-28 at 07:46 -0300, Cesar Eduardo Barros wrote:
>> If I read the code with your previous patch correctly, show_turbo_limits
>> will never be called if poll_turbo_status is false but no interrupt
>> happens. And we know no interrupt happened (at least not with nonzero
>> register values), because the interrupt handler does two dev_info()
>> right at the beginning. So the limits could still be the ones initially
>> set at ips_probe().
>>
>> I will try to enable dev_dbg() later and see what it prints.
>
> or add:
>
> diff --git a/drivers/platform/x86/intel_ips.c b/drivers/platform/x86/intel_ips.c
> index 9024480..450ea44 100644
> --- a/drivers/platform/x86/intel_ips.c
> +++ b/drivers/platform/x86/intel_ips.c
> @@ -1600,6 +1600,7 @@ static int ips_probe(struct pci_dev *dev, const struct pci_device_id *id)
> ips->second_cpu = true;
>
> update_turbo_limits(ips);
> + show_turbo_limits(ips, __func__);
> dev_dbg(&dev->dev, "max cpu power clamp: %dW\n",
> ips->mcp_power_limit / 10);
> dev_dbg(&dev->dev, "max core power clamp: %dW\n",

Here it is:

intel ips 0000:00:1f.6: Warning: CPU TDP doesn't match expected value
(found 25, expected 35)
intel ips 0000:00:1f.6: PCI INT C -> GSI 18 (level, low) -> IRQ 18
intel ips 0000:00:1f.6: show_turbo_limits:ips_probe cte:1 gte:1 cpt:0
mpl:65535 mtl:65535 mpl:65535
intel ips 0000:00:1f.6: IPS driver initialized, MCP temp limit 65535
intel ips 0000:00:1f.6: MCP power limit 65535 exceeded: cpu:8004 +
mch:25353039
intel ips 0000:00:1f.6: CPU power limit 0 exceeded: 4841
intel ips 0000:00:1f.6: CPU power limit 0 exceeded: 5283
intel ips 0000:00:1f.6: CPU power limit 0 exceeded: 5586
intel ips 0000:00:1f.6: CPU power limit 0 exceeded: 6077
intel ips 0000:00:1f.6: MCP power limit 65535 exceeded: cpu:5871 + mch:64538
intel ips 0000:00:1f.6: CPU power limit 0 exceeded: 5466
intel ips 0000:00:1f.6: CPU power limit 0 exceeded: 8589
intel ips 0000:00:1f.6: CPU power limit 0 exceeded: 5744
intel ips 0000:00:1f.6: CPU power limit 0 exceeded: 4859
intel ips 0000:00:1f.6: MCP power limit 65535 exceeded: cpu:4834 + mch:62385
intel ips 0000:00:1f.6: CPU power limit 0 exceeded: 4874
intel ips 0000:00:1f.6: CPU power limit 0 exceeded: 5356
intel ips 0000:00:1f.6: CPU power limit 0 exceeded: 6557
intel ips 0000:00:1f.6: MCP power limit 65535 exceeded: cpu:7589 + mch:59343
intel ips 0000:00:1f.6: MCP power limit 65535 exceeded: cpu:5536 + mch:60020
intel ips 0000:00:1f.6: CPU power limit 0 exceeded: 6676
intel ips 0000:00:1f.6: CPU power limit 0 exceeded: 4401
intel ips 0000:00:1f.6: CPU power limit 0 exceeded: 5634
intel ips 0000:00:1f.6: CPU power limit 0 exceeded: 4038
intel ips 0000:00:1f.6: CPU power limit 0 exceeded: 4700
intel ips 0000:00:1f.6: CPU power limit 0 exceeded: 5086
intel ips 0000:00:1f.6: CPU power limit 0 exceeded: 4930
intel ips 0000:00:1f.6: CPU power limit 0 exceeded: 4697
intel ips 0000:00:1f.6: CPU power limit 0 exceeded: 5034
intel ips 0000:00:1f.6: CPU power limit 0 exceeded: 5381
intel ips 0000:00:1f.6: CPU power limit 0 exceeded: 4417
intel ips 0000:00:1f.6: CPU power limit 0 exceeded: 6839
intel ips 0000:00:1f.6: CPU power limit 0 exceeded: 4787

--
Cesar Eduardo Barros
[email protected]
[email protected]

2010-08-28 13:01:20

by Cesar Eduardo Barros

[permalink] [raw]
Subject: Re: [PATCH] intel_ips: quieten "power or thermal limit exceeded" messages

Em 28-08-2010 09:52, Cesar Eduardo Barros escreveu:
> You seem to have dropped the CC list by accident, adding it back.

It seems I made some mistake when copying and pasting the CC: list, I
fixed it on this message.

--
Cesar Eduardo Barros
[email protected]
[email protected]

2010-08-28 13:29:17

by Joe Perches

[permalink] [raw]
Subject: Re: [PATCH] intel_ips: quieten "power or thermal limit exceeded" messages

On Sat, 2010-08-28 at 09:52 -0300, Cesar Eduardo Barros wrote:
> Em 28-08-2010 08:15, Joe Perches escreveu:
> >> I will try to enable dev_dbg() later and see what it prints.
> > or add:
[]
> Here it is:
>
> intel ips 0000:00:1f.6: Warning: CPU TDP doesn't match expected value
> (found 25, expected 35)
> intel ips 0000:00:1f.6: PCI INT C -> GSI 18 (level, low) -> IRQ 18
> intel ips 0000:00:1f.6: show_turbo_limits:ips_probe cte:1 gte:1 cpt:0
> mpl:65535 mtl:65535 mpl:65535

So ips->core_power_limit is still 0.

Is it bad bios or driver?

I expect the bios/hw is faulty but fingers are easy to point.

Your choice on how to minimize your current logging problem.

Changing the message level to dev_dbg probably isn't the
right thing to do overall, but it may suit you right now.

Thanks for testing.

2010-08-28 14:18:41

by Cesar Eduardo Barros

[permalink] [raw]
Subject: Re: [PATCH] intel_ips: quieten "power or thermal limit exceeded" messages

Em 28-08-2010 10:29, Joe Perches escreveu:
> On Sat, 2010-08-28 at 09:52 -0300, Cesar Eduardo Barros wrote:
>> Em 28-08-2010 08:15, Joe Perches escreveu:
>>>> I will try to enable dev_dbg() later and see what it prints.
>>> or add:
> []
>> Here it is:
>>
>> intel ips 0000:00:1f.6: Warning: CPU TDP doesn't match expected value
>> (found 25, expected 35)
>> intel ips 0000:00:1f.6: PCI INT C -> GSI 18 (level, low) -> IRQ 18
>> intel ips 0000:00:1f.6: show_turbo_limits:ips_probe cte:1 gte:1 cpt:0
>> mpl:65535 mtl:65535 mpl:65535
>
> So ips->core_power_limit is still 0.
>
> Is it bad bios or driver?
>
> I expect the bios/hw is faulty but fingers are easy to point.
>
> Your choice on how to minimize your current logging problem.
>
> Changing the message level to dev_dbg probably isn't the
> right thing to do overall, but it may suit you right now.

The solution here probably is not less logging. The best solution IMO
would be to do some sanity checking when loading the module, and if the
values do not make sense, print something to the log and return -ENODEV.

I expect that, when it works as it should, the first read while loading
the module already returns sane values, so a sanity check there should
not have many false positives. OTOH, it is best to not load the module
when you think things are strange.

--
Cesar Eduardo Barros
[email protected]
[email protected]

Subject: Re: [PATCH] intel_ips: quieten "power or thermal limit exceeded" messages

On Sat, 28 Aug 2010, Cesar Eduardo Barros wrote:
> The solution here probably is not less logging. The best solution
> IMO would be to do some sanity checking when loading the module, and
> if the values do not make sense, print something to the log and
> return -ENODEV.

As long as your sanity checking won't make the module fail to load in the
following scenario:

1. environment temperature control fails, room starts to heat up
2. things go south, server reboots due to exceeded temperature limits
3. OS boots in an overheat situation
4. module refuse to load because it expects to never start in a overheating
situation.

If the sanity checks will cause (4), then don't add them. rate-limit the
thermal alarms (issue them only once every T, and only if temperature has
increased more than, say, 5?C from the last alarm).

If a given platform is buggy crap (or just el-cheapo trash that overheats
all the time) to the point that the module is useless, blacklist it by DMI
and inform the user.

> I expect that, when it works as it should, the first read while
> loading the module already returns sane values, so a sanity check

well, as long as "sane" does include server-is-too-hot situations...

> there should not have many false positives. OTOH, it is best to not
> load the module when you think things are strange.

What good is an alarm module that refuses to load when there is an alarm
condition happening already?

--
"One disk to rule them all, One disk to find them. One disk to bring
them all and in the darkness grind them. In the Land of Redmond
where the shadows lie." -- The Silicon Valley Tarot
Henrique Holschuh

2010-08-28 19:07:14

by Cesar Eduardo Barros

[permalink] [raw]
Subject: Re: [PATCH] intel_ips: quieten "power or thermal limit exceeded" messages

Em 28-08-2010 12:23, Henrique de Moraes Holschuh escreveu:
> On Sat, 28 Aug 2010, Cesar Eduardo Barros wrote:
>> The solution here probably is not less logging. The best solution
>> IMO would be to do some sanity checking when loading the module, and
>> if the values do not make sense, print something to the log and
>> return -ENODEV.
>
> As long as your sanity checking won't make the module fail to load in the
> following scenario:
>
> 1. environment temperature control fails, room starts to heat up
> 2. things go south, server reboots due to exceeded temperature limits
> 3. OS boots in an overheat situation
> 4. module refuse to load because it expects to never start in a overheating
> situation.
>
> If the sanity checks will cause (4), then don't add them. rate-limit the
> thermal alarms (issue them only once every T, and only if temperature has
> increased more than, say, 5°C from the last alarm).

I have not read the datasheet (I do not even know if it is available to
the public; I have not looked), but I would not expect to see a power
limit of 0 even if the CPU is on fire. Of course, you have to be more
cautious when validating the current temperature (and even then, if it
says the CPU is encased in a block of ice, something odd is going on).

> If a given platform is buggy crap (or just el-cheapo trash that overheats
> all the time) to the point that the module is useless, blacklist it by DMI
> and inform the user.
>
>> I expect that, when it works as it should, the first read while
>> loading the module already returns sane values, so a sanity check
>
> well, as long as "sane" does include server-is-too-hot situations...

Of course. (But you most probably will want to s/server/laptop/ here.)

>> there should not have many false positives. OTOH, it is best to not
>> load the module when you think things are strange.
>
> What good is an alarm module that refuses to load when there is an alarm
> condition happening already?

This is not an alarm module; AFAIK it is a module for the feature in
recent Intel CPU/GPU chips which allow you to overclock it a bit as long
as the thermal and power limit has not been exceeded:

config INTEL_IPS
tristate "Intel Intelligent Power Sharing"
depends on ACPI
---help---
Intel Calpella platforms support dynamic power sharing between the
CPU and GPU, maximizing performance in a given TDP. This driver,
along with the CPU frequency and i915 drivers, provides that
functionality. If in doubt, say Y here; it will only load on
supported platforms.

If the module is not loaded, it simply will not be able to go above its
nominal clock, so refusing to load it is not that much of a problem.

--
Cesar Eduardo Barros
[email protected]
[email protected]

2010-08-30 16:30:04

by Jesse Barnes

[permalink] [raw]
Subject: Re: [PATCH] intel_ips: quieten "power or thermal limit exceeded" messages

On Sat, 28 Aug 2010 16:07:08 -0300
Cesar Eduardo Barros <[email protected]> wrote:

> Em 28-08-2010 12:23, Henrique de Moraes Holschuh escreveu:
> > On Sat, 28 Aug 2010, Cesar Eduardo Barros wrote:
> >> The solution here probably is not less logging. The best solution
> >> IMO would be to do some sanity checking when loading the module, and
> >> if the values do not make sense, print something to the log and
> >> return -ENODEV.
> >
> > As long as your sanity checking won't make the module fail to load in the
> > following scenario:
> >
> > 1. environment temperature control fails, room starts to heat up
> > 2. things go south, server reboots due to exceeded temperature limits
> > 3. OS boots in an overheat situation
> > 4. module refuse to load because it expects to never start in a overheating
> > situation.
> >
> > If the sanity checks will cause (4), then don't add them. rate-limit the
> > thermal alarms (issue them only once every T, and only if temperature has
> > increased more than, say, 5°C from the last alarm).
>
> I have not read the datasheet (I do not even know if it is available to
> the public; I have not looked), but I would not expect to see a power
> limit of 0 even if the CPU is on fire. Of course, you have to be more
> cautious when validating the current temperature (and even then, if it
> says the CPU is encased in a block of ice, something odd is going on).
>
> > If a given platform is buggy crap (or just el-cheapo trash that overheats
> > all the time) to the point that the module is useless, blacklist it by DMI
> > and inform the user.
> >
> >> I expect that, when it works as it should, the first read while
> >> loading the module already returns sane values, so a sanity check
> >
> > well, as long as "sane" does include server-is-too-hot situations...
>
> Of course. (But you most probably will want to s/server/laptop/ here.)
>
> >> there should not have many false positives. OTOH, it is best to not
> >> load the module when you think things are strange.
> >
> > What good is an alarm module that refuses to load when there is an alarm
> > condition happening already?
>
> This is not an alarm module; AFAIK it is a module for the feature in
> recent Intel CPU/GPU chips which allow you to overclock it a bit as long
> as the thermal and power limit has not been exceeded:
>
> config INTEL_IPS
> tristate "Intel Intelligent Power Sharing"
> depends on ACPI
> ---help---
> Intel Calpella platforms support dynamic power sharing between the
> CPU and GPU, maximizing performance in a given TDP. This driver,
> along with the CPU frequency and i915 drivers, provides that
> functionality. If in doubt, say Y here; it will only load on
> supported platforms.
>
> If the module is not loaded, it simply will not be able to go above its
> nominal clock, so refusing to load it is not that much of a problem.

It sounds like either your BIOS values are wrong or the driver isn't
fetching them correctly on your platform.

It's possible that the main issue here is bad thermal limits. There's
obviously a relationship between power and thermal output, but the
driver tries to monitor both. However it's up to the BIOS to provide
the driver with accurate thermal limits, as well as accurate power
limits. The power limits sound reasonable at 25W, thus the
informational output about 35W vs 25W (35W is what the MCP can handle,
but some platforms are designed to handle less, so they clamp it down a
bit). But the temp limits look all wrong. I'll see if I can find info
on getting better data into the driver...

Thanks,
--
Jesse Barnes, Intel Open Source Technology Center

2010-08-30 21:43:06

by Cesar Eduardo Barros

[permalink] [raw]
Subject: Re: [PATCH] intel_ips: quieten "power or thermal limit exceeded" messages

Em 30-08-2010 13:29, Jesse Barnes escreveu:
> It's possible that the main issue here is bad thermal limits. There's
> obviously a relationship between power and thermal output, but the
> driver tries to monitor both. However it's up to the BIOS to provide
> the driver with accurate thermal limits, as well as accurate power
> limits. The power limits sound reasonable at 25W, thus the
> informational output about 35W vs 25W (35W is what the MCP can handle,
> but some platforms are designed to handle less, so they clamp it down a
> bit). But the temp limits look all wrong. I'll see if I can find info
> on getting better data into the driver...

If you need more information from this laptop (dmidecode, ACPI AML,
etc), just ask.

--
Cesar Eduardo Barros
[email protected]
[email protected]