2009-11-25 10:28:56

by Christoph Hellwig

Subject: regression: 2.6.32-rc8 shuts down after reaching critical temperature

I recently upgraded my Thinkpad T500 from Linux 2.6.31 to Linux
2.6.32-rc (first -rc7, but I've also tried -rc6 and -rc8). When I
put load on it, e.g. by building a kernel tree, it soon shuts down
with the

"Critical temperature reached (%ld C), shutting down.\n"

printk from drivers/thermal/thermal_sys.c, where the temperature is
usually 100 C or slightly above. The system is a Lenovo Thinkpad T500
with an Intel Core 2 Duo T9600 running a 32-bit kernel.

I've made some attempts at bisecting it, but for most of the 2.6.32-rc
series the system crashes during boot in ACPI code with a backtrace
longer than the screen can display.


2009-11-25 21:56:23

by Len Brown

Subject: Re: regression: 2.6.32-rc8 shuts down after reaching critical temperature

On Wed, 25 Nov 2009, Christoph Hellwig wrote:

> I recently upgraded my Thinkpad T500 from Linux 2.6.31 to Linux
> 2.6.32-rc (first -rc7 but I've also tried with -rc6 and -rc8) and when
> putting load on it, e.g. by building a kernel tree. It shuts down soon
> with the
>
> Critical temperature reached (%ld C), shutting down.\n"
>
> printk from drivers/thermal/thermal_sys.c, where the temperature is
> usually 100C or slightly above. The system is a Lenovo Thinkpad T500
> with a Intel Core 2 Dueo T9600 running a 32 bit kernel.

"thermal.nocrt=1" will disable the actual shutdown -- but you'll
still get the warning -- which might be helpful. "thermal.crt=105"
would override all critical trip points to be 105, for example,
but otherwise not change any behaviour.

> I've done some attempts at bisecting it, but for most of the 2.6.32-rc
> series the system crashes during boot in ACPI code with a backtrace
> longer than the screen can display.

Hmm, I don't have a T500, but I've not seen such crashes during -rc.
Please send along your .config
Do you still get them when disabling the thinkpad-acpi driver?

Probably the most interesting place to bisect is drivers/acpi/ec.c

If you can send along the dmesg from the recent failing kernel,
plus the output from "grep . /proc/acpi/thermal_zone/*/*"
that may be helpful.

thanks,
-Len

2009-11-25 22:14:11

by Christoph Hellwig

Subject: Re: regression: 2.6.32-rc8 shuts down after reaching critical temperature

On Wed, Nov 25, 2009 at 04:56:19PM -0500, Len Brown wrote:
> "thermal.nocrt=1" will disable the actual shutdown -- but you'll
> still get the warning -- which might be helpful. "thermal.crt=105"
> would override all critical trip points to be 105, for example,
> but otherwise not change any behaviour.

Well, I suspect the warnings are there for a reason, e.g. with 2.6.32-rc
I also hear the fan regularly, which I almost never did before. So my
guess is that throttling has problems.

> > I've done some attempts at bisecting it, but for most of the 2.6.32-rc
> > series the system crashes during boot in ACPI code with a backtrace
> > longer than the screen can display.
>
> Hmm, I don't have a T500, but I've not seen such crashes during -rc.
> Please send along your .config

Attached.

> Do you still get them when disabling the thinkpad-acpi driver?

I'll try. This is my main work machine (and I'm travelling right now),
so any sort of bisection and testing will take a while..

> Probably the most interesting place to bisect is drivers/acpi/ec.c
>
> If you can send along the dmesg from the recent failing kernel,

Do you mean the one that fails to boot, or the 2.6.32-rc8 kernel?
Unfortunately there is no chance to capture the dmesg of the one that
fails to boot.

> plus the output from "grep . /proc/acpi/thermal_zone/*/*"
> that may be helpful.

brick:~# grep . /proc/acpi/thermal_zone/*/*
/proc/acpi/thermal_zone/THM0/cooling_mode:<setting not supported>
/proc/acpi/thermal_zone/THM0/polling_frequency:<polling disabled>
/proc/acpi/thermal_zone/THM0/state:state: ok
/proc/acpi/thermal_zone/THM0/temperature:temperature: 46 C
/proc/acpi/thermal_zone/THM0/trip_points:critical (S5): 127 C
/proc/acpi/thermal_zone/THM1/cooling_mode:<setting not supported>
/proc/acpi/thermal_zone/THM1/polling_frequency:<polling disabled>
/proc/acpi/thermal_zone/THM1/state:state: ok
/proc/acpi/thermal_zone/THM1/temperature:temperature: 49 C
/proc/acpi/thermal_zone/THM1/trip_points:critical (S5): 100 C
/proc/acpi/thermal_zone/THM1/trip_points:passive: 96 C: tc1=5 tc2=4 tsp=600 devices=CPU0 CPU1
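
A minimal Python sketch of how the trip-point lines in the output above
could be parsed to see how much headroom there is between the passive and
critical trip points of a zone. The function name parse_trip_points and
the regular expression are assumptions made for illustration; they are
not part of any kernel or ACPI tooling.

```python
import re

# Trip-point lines taken from the grep output in this message.
SAMPLE = """\
/proc/acpi/thermal_zone/THM0/trip_points:critical (S5): 127 C
/proc/acpi/thermal_zone/THM1/trip_points:critical (S5): 100 C
/proc/acpi/thermal_zone/THM1/trip_points:passive: 96 C:
"""

# Hypothetical pattern: "<...>/thermal_zone/<ZONE>/trip_points:<kind>...: <N> C"
TRIP_RE = re.compile(
    r"thermal_zone/(?P<zone>\w+)/trip_points:"
    r"(?P<kind>critical|passive)[^:]*:\s*(?P<temp>\d+) C"
)


def parse_trip_points(text):
    """Return {zone: {kind: temperature in C}} for critical/passive trip points."""
    zones = {}
    for line in text.splitlines():
        m = TRIP_RE.search(line)
        if m:
            zones.setdefault(m["zone"], {})[m["kind"]] = int(m["temp"])
    return zones


if __name__ == "__main__":
    for zone, trips in parse_trip_points(SAMPLE).items():
        print(zone, trips)
```

For THM1 this gives a passive trip point of 96 C and a critical one of
100 C, so passive cooling has only 4 C in which to take effect.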


Attachments:
(No filename) (2.15 kB)
.config.bz2 (6.61 kB)

2009-11-25 23:08:04

by Len Brown

Subject: Re: regression: 2.6.32-rc8 shuts down after reaching critical temperature

On Wed, 25 Nov 2009, Christoph Hellwig wrote:

> On Wed, Nov 25, 2009 at 04:56:19PM -0500, Len Brown wrote:
> > "thermal.nocrt=1" will disable the actual shutdown -- but you'll
> > still get the warning -- which might be helpful. "thermal.crt=105"
> > would override all critical trip points to be 105, for example,
> > but otherwise not change any behaviour.
>
> Well, I suspect the warnings are there for a reason, e.g. with 2.6.32-rc
> I also hear the fan regularly while I've almost never done before. So I
> guess the reason for it is that throtteling might have problems.

Does the fan noise go away when you revert back to an older kernel?
(if no, unclog your fan:-)

> > > I've done some attempts at bisecting it, but for most of the 2.6.32-rc
> > > series the system crashes during boot in ACPI code with a backtrace
> > > longer than the screen can display.
> >
> > Hmm, I don't have a T500, but I've not seen such crashes during -rc.
> > Please send along your .config
>
> Attached.

can you send the .config that you boot on the (Thinkpad?) T500?
you sent a PPC64 config.

> > Do you still get them when disabling the thinkpad-acpi driver?
>
> I'll try. This is my main work machine (and I'm travelling right now),
> so any sort of bisection and testing will take a while..
>
> > Probably the most interesting place to bisect is drivers/acpi/ec.c
> >
> > If you can send along the dmesg from the recent failing kernel,
>
> That is failing to boot, or the 2.6.32-rc8 kernel? No chance to capture
> the dmesg of the one failing to boot unfortunately..

I thought that rc8 boots, but overheats - can you capture
the dmesg before it overheats?

> > plus the output from "grep . /proc/acpi/thermal_zone/*/*"
> > that may be helpful.
>
> brick:~# grep . /proc/acpi/thermal_zone/*/*
> /proc/acpi/thermal_zone/THM0/cooling_mode:<setting not supported>
> /proc/acpi/thermal_zone/THM0/polling_frequency:<polling disabled>
> /proc/acpi/thermal_zone/THM0/state:state: ok
> /proc/acpi/thermal_zone/THM0/temperature:temperature: 46 C
> /proc/acpi/thermal_zone/THM0/trip_points:critical (S5): 127 C
> /proc/acpi/thermal_zone/THM1/cooling_mode:<setting not supported>
> /proc/acpi/thermal_zone/THM1/polling_frequency:<polling disabled>
> /proc/acpi/thermal_zone/THM1/state:state: ok
> /proc/acpi/thermal_zone/THM1/temperature:temperature: 49 C
> /proc/acpi/thermal_zone/THM1/trip_points:critical (S5): 100 C
> /proc/acpi/thermal_zone/THM1/trip_points:passive: 96 C:
> tc1=5 tc2=4 tsp=600 devices=CPU0 CPU1

This part looks normal.

thanks,
-Len

by Henrique de Moraes Holschuh

Subject: Re: regression: 2.6.32-rc8 shuts down after reaching critical temperature

On Wed, 25 Nov 2009, Christoph Hellwig wrote:
> I recently upgraded my Thinkpad T500 from Linux 2.6.31 to Linux
> 2.6.32-rc (first -rc7 but I've also tried with -rc6 and -rc8) and when
> putting load on it, e.g. by building a kernel tree. It shuts down soon
> with the
>
> Critical temperature reached (%ld C), shutting down.\n"
>
> printk from drivers/thermal/thermal_sys.c, where the temperature is
> usually 100C or slightly above. The system is a Lenovo Thinkpad T500
> with a Intel Core 2 Dueo T9600 running a 32 bit kernel.

If you go back to the previous kernel, does the problem disappear? Please
test and report (don't go by past experiences).

If it does disappear, what are the highest temperatures your thinkpad hits?

"sensors" from lm-sensors will tell you the current temperatures as long as
the "thermal" and "thinkpad-acpi" modules are loaded (don't run
sensors-detect).

--
"One disk to rule them all, One disk to find them. One disk to bring
them all and in the darkness grind them. In the Land of Redmond
where the shadows lie." -- The Silicon Valley Tarot
Henrique Holschuh

2009-11-26 09:52:04

by Christoph Hellwig

Subject: Re: regression: 2.6.32-rc8 shuts down after reaching critical temperature

On Thu, Nov 26, 2009 at 12:23:11AM -0200, Henrique de Moraes Holschuh wrote:
> If you go back to the previous kernel, does the problem disappear? Please
> test and report (don't go by past experiences).

Yes - in fact I had to revert to 2.6.31 because 2.6.32-rc is totally
unusable for a kernel developer's workload (this issue plus the wifi
disconnects reported elsewhere).

> If it does disappear, what are the highest temperatures your thinkpad hits?

I haven't found a good monitoring applet to do constant monitoring, but
the highest I've seen was around 95C so far.

2009-11-26 10:17:56

by Christoph Hellwig

Subject: Re: regression: 2.6.32-rc8 shuts down after reaching critical temperature

On Wed, Nov 25, 2009 at 06:07:51PM -0500, Len Brown wrote:
> Does the fan noise go away when you revert back to an older kernel?
> (if no, unclog your fan:-)

Yes.

> can you send the .config that you boot on the (Thinkpad?) T500?
> you sent a PPC64 config.

Oops, copied from the wrong build directory. The correct one is
attached.

> I thought that rc8 boots, but overheats - can you capture
> the dmesg before it overheats?

Kinda hard to capture the exact moment. I'll try to grab one a bit
after starting a kernel build - once the shit hits the fan it's too late
to still grab it as the shutdown happens quite quickly.


Attachments:
(No filename) (627.00 B)
.config.bz2 (15.02 kB)

2009-11-26 10:36:11

by Christoph Hellwig

Subject: Re: regression: 2.6.32-rc8 shuts down after reaching critical temperature

On Wed, Nov 25, 2009 at 06:07:51PM -0500, Len Brown wrote:
> I thought that rc8 boots, but overheats - can you capture
> the dmesg before it overheats?

Here's a dmesg from ~ 30 seconds before it overheated. The workload
was:

- boot
- log into X
- start a kernel compile using make -j4
- ssh into another box to read mail in parallel
- take dmesgs semi-regularly in yet another xterm


Attachments:
(No filename) (391.00 B)
dmesg.bz2 (17.65 kB)

2009-12-02 11:56:18

by Thomas Renninger

Subject: Re: regression: 2.6.32-rc8 shuts down after reaching critical temperature

On Thursday 26 November 2009 10:52:02 Christoph Hellwig wrote:
> On Thu, Nov 26, 2009 at 12:23:11AM -0200, Henrique de Moraes Holschuh wrote:
> > If you go back to the previous kernel, does the problem disappear? Please
> > test and report (don't go by past experiences).
>
> Yes - in fact I had to revert to 2.6.31 because 2.6.32-rc is totally
> unusable for a kernel developers workload (this issue plus the wifi
> disconnects reported elsewhere).
>
> > If it does disappear, what are the highest temperatures your thinkpad hits?
>
> I haven't found a good monitoring applet to do constant monitoring, but
> the highest I've seen was around 95C so far.

The critical temperature for the shutdown should still be logged in:
/var/log/messages
Can you dig that out, please?

But from what we know, this sounds like real overheating and not a
wrongly read value. Also, Thinkpads used to have a switch to let the
temperature jump above the critical trip point under certain conditions.
This seems to be the case on THM0 (I remember 127 C...):
/proc/acpi/thermal_zone/THM0/trip_points:critical (S5): 127 C

Can we be sure it's because of the fans?
2.6.31 works?
Also the latest stable one?
Then this one might be unrelated (11.2, 2.6.31.X based, Acer Aspire 5315):
[Bug 557850] System Fan is being stopped on boot, CPU goes overheated and shut down instantly
https://bugzilla.novell.com/show_bug.cgi?id=557850

Thomas

2009-12-02 13:31:30

by Christoph Hellwig

Subject: Re: regression: 2.6.32-rc8 shuts down after reaching critical temperature

On Wed, Dec 02, 2009 at 12:56:20PM +0100, Thomas Renninger wrote:
> The critical temperature for the shutdown should still be logged in:
> /var/log/messages
> Can you grab that one out, please.

brick:/home/hch# grep "Critical temperature reached" /var/log/kern.log.1
Nov 23 17:58:27 brick kernel: [ 976.674070] Critical temperature reached (100 C), shutting down.
Nov 23 18:05:59 brick kernel: [ 411.639261] Critical temperature reached (102 C), shutting down.
Nov 23 22:20:43 brick kernel: [15159.677691] Critical temperature reached (102 C), shutting down.
Nov 25 11:06:33 brick kernel: [10947.132102] Critical temperature reached (103 C), shutting down.
Nov 25 11:53:50 brick kernel: [ 2795.261765] Critical temperature reached (102 C), shutting down.
Nov 25 11:53:55 brick kernel: [ 2800.128671] Critical temperature reached (100 C), shutting down.
Nov 26 11:30:24 brick kernel: [ 395.692667] Critical temperature reached (100 C), shutting down.
Nov 27 13:36:18 brick kernel: [ 390.408366] Critical temperature reached (100 C), shutting down.
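
A small Python sketch of how the temperatures could be pulled out of such
log lines, for example to find the highest value logged before a shutdown.
The three sample lines are copied from the output above; the helper name
shutdown_temps is made up for this illustration.

```python
import re

# Three of the kern.log lines shown above.
LOG = """\
Nov 23 17:58:27 brick kernel: [ 976.674070] Critical temperature reached (100 C), shutting down.
Nov 25 11:06:33 brick kernel: [10947.132102] Critical temperature reached (103 C), shutting down.
Nov 27 13:36:18 brick kernel: [ 390.408366] Critical temperature reached (100 C), shutting down.
"""

CRIT_RE = re.compile(r"Critical temperature reached \((\d+) C\)")


def shutdown_temps(log_text):
    """Return the temperature (in C) of every critical-shutdown message, in order."""
    return [int(m.group(1)) for m in CRIT_RE.finditer(log_text)]


if __name__ == "__main__":
    temps = shutdown_temps(LOG)
    print("shutdowns:", len(temps), "highest:", max(temps), "C")
```

Every logged value sits at or just above the 100 C critical trip point
reported for THM1.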

>
> But from what we know, this sounds like a real overheating and not a
> wrongly read value,

It sounds like that to me, but something in .32-rc causes it to not
throttle early enough. .31 goes very close to the max all the time, but
something throttles it so that it doesn't actually go over it.

> Can we be sure it's because of the fans?

No idea, really. The fan rarely ever starts when using .31, but comes
on a lot using .32-rc.

> 2.6.31 works?

Yes, perfectly. I have been running it again for a couple of days now,
after I had all these reproducible .32-rc shutdowns when testing it.

> Also the latest stable one?

Haven't tried that yet, will do if it helps you.

> Then this one might be unrelated (11.2, 2.6.31.X based, Acer Aspire 5315):
> [Bug 557850] System Fan is being stopped on boot, CPU goes overheated and shut down instantly
> https://bugzilla.novell.com/show_bug.cgi?id=557850

Well, I do hear the fan on .32-rc (not on .31 usually).

2009-12-02 15:07:50

by Thomas Renninger

Subject: Re: regression: 2.6.32-rc8 shuts down after reaching critical temperature

On Wednesday 02 December 2009 14:30:32 Christoph Hellwig wrote:
> On Wed, Dec 02, 2009 at 12:56:20PM +0100, Thomas Renninger wrote:
...
> > 2.6.31 works?
>
> Yes, perfectly. Have been running it for a couple of days now again
> after I had all these reproducible .32-rc shutdowns when testiong it.
>
> > Also the latest stable one?
>
> Haven't tried that yet, will do if it helps you.

No need. It looks unrelated: that system seems to overheat because of
no fan activity at all, while yours seems to have a "passive cooling does
not work or kicks in too late" (and possibly also a fan?) problem.

Best would be to open a bug on bugzilla.kernel.org and assign it to the
acpi component (and add Rui, Henrique and myself to CC). I won't be that
active, at least not in the next days; I just wanted to check that
this isn't a duplicate.
dmesg, acpidump, grep . /proc/acpi/thermal_zone/*/*
and the shutdown messages should be the most important info to
attach there.

Some more hints you may want to try:

- Does cpufreq work at all?
  Does this dir exist: /sys/devices/system/cpu/cpu*/cpufreq
  If the temperature shown by:
    watch -n1 cat /proc/acpi/thermal_zone/THM1/temperature
  goes beyond 96 C, an ACPI processor event must get thrown and this:
    /sys/devices/system/cpu/cpu*/cpufreq/scaling_max_freq
  will get limited (lower than ../cpufreq/cpuinfo_max_freq).
    echo xy >/sys/devices/system/cpu/cpu*/cpufreq/scaling_max_freq
  may be a bad workaround.
  These boot params: thermal.psv=90 thermal.tzp=10
  which lower all passive trip points to 90 and enable polling,
  might be a better one (with which you might be able to better
  test passive cooling). This really should be a runtime sysfs
  per thermal_zone parameter, but this is another story...

- Is the ACPI event thrown at all?
  SUSE has acpi_listen; I'm not sure whether it's part of the acpid
  mainline project, but I think it is. Do you see an ACPI event when
  96 C is passed?
  If not, this might work around your issue:
    echo 10 >/proc/acpi/thermal_zone/THM1/polling_frequency (or similar)

- T500 sounds pretty new. Still, make sure your fans are clean.
  E.g. the air coming out must get really hot at some point.

- Also listen a bit to the fans. With the thinkpad-acpi driver you might
  be able to monitor the fans (T500 is rather new/untested):
    cat /proc/acpi/ibm/fan # path from memory
  You might also be able to alter the fan behavior there.
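
The cpufreq check from the first hint can be sketched in Python roughly as
follows: for each CPU, compare scaling_max_freq against cpuinfo_max_freq
and report whether the limit has been lowered. The function names and the
cpu_root parameter are assumptions for illustration; only the two sysfs
file names come from the text above.

```python
from pathlib import Path


def read_khz(path):
    """Read a single integer frequency value (kHz) from a sysfs-style file."""
    return int(Path(path).read_text().strip())


def is_limited(cpufreq_dir):
    """True if scaling_max_freq is below cpuinfo_max_freq in this cpufreq dir."""
    d = Path(cpufreq_dir)
    return read_khz(d / "scaling_max_freq") < read_khz(d / "cpuinfo_max_freq")


def limited_cpus(cpu_root="/sys/devices/system/cpu"):
    """Map each CPU name (e.g. "cpu0") to whether its max frequency is limited."""
    return {
        d.parent.name: is_limited(d)
        for d in sorted(Path(cpu_root).glob("cpu[0-9]*/cpufreq"))
    }


if __name__ == "__main__":
    for cpu, limited in limited_cpus().items():
        print(cpu, "limited" if limited else "not limited")
```

Running this in a loop while the temperature crosses the passive trip
point would show whether the limit ever drops; an empty result suggests
that the cpufreq directories do not exist at all.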

Good luck,

Thomas

2009-12-02 15:15:03

by Christoph Hellwig

Subject: Re: regression: 2.6.32-rc8 shuts down after reaching critical temperature

On Wed, Nov 25, 2009 at 04:56:19PM -0500, Len Brown wrote:
> > I've done some attempts at bisecting it, but for most of the 2.6.32-rc
> > series the system crashes during boot in ACPI code with a backtrace
> > longer than the screen can display.
>
> Hmm, I don't have a T500, but I've not seen such crashes during -rc.
> Please send along your .config
> Do you still get them when disabling the thinkpad-acpi driver?

I managed to boot 2.6.32-rc1 and 2.6.32-rc2 with thinkpad-acpi disabled.
Both of them show the shutdown problem. I'm now looking into bisecting
it further.

2009-12-02 15:35:37

by Christoph Hellwig

Subject: Re: regression: 2.6.32-rc8 shuts down after reaching critical temperature

On Wed, Dec 02, 2009 at 04:07:52PM +0100, Thomas Renninger wrote:
> Some more hints you may want to try:
>
> - Does cpufreq work at all?
> Does this dir exist: /sys/devices/system/cpu/cpu*/cpufreq

Yes.

> If temp of:
> watch -n1 cat /proc/acpi/thermal_zone/THM1/temperature
> goes beyond 96 C
> an ACPI processor event must get thrown and this:
> /sys/devices/system/cpu/cpu*/cpufreq/scaling_max_freq
> will get limited (lower than ../cpufreq/cpuinfo_max_freq).

The speeds change quite constantly under a kernel compile workload, but
most of the time it's at 2240800 vs cpuinfo_max_freq which is 2801000

> echo xy >/sys/devices/system/cpu/cpu*/cpufreq/scaling_max_freq
> may be bad workaround.

echo 2240800 > /sys/devices/system/cpu/cpu0/cpufreq/scaling_max_freq
echo 2240800 > /sys/devices/system/cpu/cpu1/cpufreq/scaling_max_freq

made it survive a kernel compile for me, with an observed maximum
temperature of 87 C.
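
For reference, the same workaround written as a small Python sketch that
caps every CPU at once. It needs root to write to sysfs; the function
names and the cpu_root parameter are assumptions for illustration, while
the two frequency values are the ones quoted above.

```python
from pathlib import Path


def cap_max_freq(khz, cpu_root="/sys/devices/system/cpu"):
    """Write khz to every cpuN/cpufreq/scaling_max_freq; return the paths written."""
    written = []
    for f in sorted(Path(cpu_root).glob("cpu[0-9]*/cpufreq/scaling_max_freq")):
        f.write_text(f"{khz}\n")
        written.append(str(f))
    return written


def cap_ratio(cap_khz, max_khz):
    """Fraction of the hardware maximum that the cap allows."""
    return cap_khz / max_khz


if __name__ == "__main__":
    print("cap is", cap_ratio(2240800, 2801000), "of cpuinfo_max_freq")
    for path in cap_max_freq(2240800):
        print("capped", path)
```

The cap of 2240800 kHz is 80% of the 2801000 kHz cpuinfo_max_freq.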

More later..

2009-12-11 06:49:19

by Len Brown

Subject: Re: regression: 2.6.32-rc8 shuts down after reaching critical temperature

On Wed, 2 Dec 2009, Christoph Hellwig wrote:

> On Wed, Dec 02, 2009 at 04:07:52PM +0100, Thomas Renninger wrote:
> > Some more hints you may want to try:
> >
> > - Does cpufreq work at all?
> > Does this dir exist: /sys/devices/system/cpu/cpu*/cpufreq
>
> Yes.
>
> > If temp of:
> > watch -n1 cat /proc/acpi/thermal_zone/THM1/temperature
> > goes beyond 96 C
> > an ACPI processor event must get thrown and this:
> > /sys/devices/system/cpu/cpu*/cpufreq/scaling_max_freq
> > will get limited (lower than ../cpufreq/cpuinfo_max_freq).
>
> The speeds change quite constantly under a kernel compile workload, but
> most of the time it's at 2240800 vs cpuinfo_max_freq which is 2801000
>
> > echo xy >/sys/devices/system/cpu/cpu*/cpufreq/scaling_max_freq
> > may be bad workaround.
>
> echo 2240800 > /sys/devices/system/cpu/cpu0/cpufreq/scaling_max_freq
> echo 2240800 > /sys/devices/system/cpu/cpu1/cpufreq/scaling_max_freq
>
> made it survive a kernel compile for me, with an observed maximum
> temperature of 87 C.

Interesting. I wonder if the T500 is using P-states for thermal control,
or if it is relying on us to hit passive trip and pull the P-states down.
In either case, the mystery remains what is different after 2.6.31.

Perhaps we can simplify...
Say you run two copies of "# cat /dev/zero > /dev/null" on 2.6.31.
Does the frequency go up to max and stay there forever,
or does it come down?

If it stays there, what do you see for the temperature,
and do you hear the fans? Presumably the same test
under 2.6.32 will result in a prompt thermal shutdown.

please send the output from
grep . /sys/devices/system/cpu/cpu*/cpufreq/*

thanks,
-Len Brown, Intel Open Source Technology Center
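
A rough Python equivalent of the requested "grep ." over the cpufreq
files, in case it is more convenient to collect the values from a script.
The function name dump_cpufreq and the cpu_root parameter are made up for
this sketch. Files that cannot be read (some need root) and
subdirectories are skipped.

```python
from pathlib import Path


def dump_cpufreq(cpu_root="/sys/devices/system/cpu"):
    """Return "path:line" strings for every non-empty line in cpuN/cpufreq/*."""
    lines = []
    for f in sorted(Path(cpu_root).glob("cpu[0-9]*/cpufreq/*")):
        if not f.is_file():
            continue  # skip subdirectories such as stats/
        try:
            text = f.read_text()
        except OSError:
            continue  # unreadable without sufficient privileges
        lines.extend(f"{f}:{ln}" for ln in text.splitlines() if ln)
    return lines


if __name__ == "__main__":
    print("\n".join(dump_cpufreq()))
```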