2007-08-02 09:02:28

by Knut Petersen

Subject: 2.6.22 regression: thermal trip points

Hi everybody!

Kernel 2.6.22 decreases performance by about 50% on my system.
No, I do not like that. The reason is a broken BIOS, granted, but there
was a perfect workaround in the kernel that has been dropped.

mainboard: AOpen i915GMm-hfs, AWARD BIOS
cpu: Pentium-M 750 (0.8 to 1.86 GHz)
openSuSE 10.2 with kernel 2.6.22.1

The cpu fan cannot be controlled by the Linux kernel.
The BIOS will switch on the cpu fan a bit above 50 deg. Celsius.
The active and passive trip points both are set to 50 deg. Celsius.
Temperature of the idle cpu at 800 MHz: 34 to 42 deg. C.
The BIOS never changes the trip points.
Cpufreq does work perfectly.

Previously there was the possibility to add something like

echo "100:0:65:70:0" > /proc/acpi/thermal_zone/THRM/trip_points
echo 2 > /proc/acpi/thermal_zone/THRM/polling_frequency
echo ondemand > /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor

to e.g. /etc/init.d/boot.local. With 2.6.22 that solution does not exist
any longer. Now the code in thermal.c slows down the cpu under load
to prevent "overheating". Kernel compile time increases from about 12
to 18 minutes. No, I don't like that, nobody would.
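
A boot script carrying this workaround can at least notice when the kernel no longer accepts a write. The following is a minimal sketch, not part of the original boot.local: the helper names `try_write` and `apply_thermal_workaround` are made up for illustration, and the default paths and values are simply the ones from the commands above. Each write is attempted and a failure is reported instead of being silently ignored.

```shell
#!/bin/sh
# Hypothetical helper (not from the original boot.local): try to write a
# value to a control file and report whether the write was accepted.
try_write() {
    value=$1
    file=$2
    # Run the write in a subshell so that a failed redirection is reported
    # as a failure here instead of printing a shell error.
    if ( printf '%s\n' "$value" > "$file" ) 2>/dev/null; then
        echo "ok: $file"
    else
        echo "rejected: $file" >&2
        return 1
    fi
}

# Apply the three settings from the mail; stop at the first failure.
# The two directories can be overridden, which is only useful for testing.
apply_thermal_workaround() {
    zone=${1:-/proc/acpi/thermal_zone/THRM}
    cpufreq=${2:-/sys/devices/system/cpu/cpu0/cpufreq}
    try_write "100:0:65:70:0" "$zone/trip_points" || return 1
    try_write 2 "$zone/polling_frequency" || return 1
    try_write ondemand "$cpufreq/scaling_governor"
}
```

Calling `apply_thermal_workaround` without arguments from a boot script would use the real paths; on a kernel that rejects the trip_points write it returns a non-zero status after the first message.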

Possible solutions:

1. Get a better BIOS! --- There is none.

2. Fix DSDT! --- Recompiling gives a number of errors ... I do not know
how to fix it.

3. Don't include thermal.c! --- That does help, but as this is a 24/7
system, the cpu fan could break. At that time I do not want to rely on
the BIOS to save my system (the next trip point is at 100 deg. Celsius).

4. Revert Len Brown's commit 11ccc0f249cb01a129f54760b8ff087f242935d4

I would vote for option 4, but I do understand some of the arguments of
Len in the 2.6.22-rc1-mm1 discussion in May. Yes, communicating trip
points to thermal.c is a hack, it will fail on systems that change trip
points dynamically, and it might be dangerous for the machine if
unreasonable trip points are chosen. But it does help to keep the machine
quiet, and to work around unsuitable (e.g. too low) trip points defined
by the BIOS.

If it should not be acceptable to revert the commit in question without
changes, would it be acceptable to make rw trip_points a kernel config
option?

cu,
Knut Petersen


2007-08-02 08:53:51

by Andrew Morton

Subject: Re: 2.6.22 regression: thermal trip points

On Thu, 02 Aug 2007 10:40:44 +0200 Knut Petersen <[email protected]> wrote:

> Hi everybody!
>
> Kernel 2.6.22 decreases performance by about 50% on my system.
> No, I do not like that. The reason is a broken BIOS, granted, but there
> was a perfect workaround in the kernel that has been dropped.
>
> mainboard: AOpen i915GMm-hfs, AWARD BIOS
> cpu: Pentium-M 750 (0.8 to 1.86 GHz)
> openSuSE 10.2 with kernel 2.6.22.1
>
> The cpu fan cannot be controlled by the Linux kernel.
> The BIOS will switch on the cpu fan a bit above 50 deg. Celsius.
> The active and passive trip points both are set to 50 deg. Celsius.
> Temperature of the idle cpu at 800 MHz: 34 to 42 deg. C.
> The BIOS never changes the trip points.
> Cpufreq does work perfectly.
>
> Previously there was the possibility to add something like
>
> echo "100:0:65:70:0" > /proc/acpi/thermal_zone/THRM/trip_points
> echo 2 > /proc/acpi/thermal_zone/THRM/polling_frequency
> echo ondemand > /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor
>
> to e.g. /etc/init.d/boot.local. With 2.6.22 that solution does not exist
> any longer. Now the code in thermal.c slows down the cpu under load
> to prevent "overheating". Kernel compile time increases from about 12
> to 18 minutes. No, I don't like that, nobody would.
>
> Possible solutions:
>
> 1. Get a better BIOS! --- There is none.
>
> 2. Fix DSDT! --- Recompiling gives a number of errors ... I do not know
> how to fix it.
>
> 3. Don't include thermal.c! --- That does help, but as this is a 24/7
> system, the cpu fan could break. At that time I do not want to rely on
> the BIOS to save my system (the next trip point is at 100 deg. Celsius).
>
> 4. Revert Len Brown's commit 11ccc0f249cb01a129f54760b8ff087f242935d4
>
> I would vote for option 4, but I do understand some of the arguments of
> Len in the 2.6.22-rc1-mm1 discussion in May. Yes, communicating trip
> points to thermal.c is a hack, it will fail on systems that change trip
> points dynamically, and it might be dangerous for the machine if
> unreasonable trip points are chosen. But it does help to keep the machine
> quiet, and to work around unsuitable (e.g. too low) trip points defined
> by the BIOS.

I didn't understand the arguments either, actually.

Here we had obviously-useful-to-you functionality which was taken away
without, afaik, providing any alternative.

> If it should be not acceptable to revert the questionable commit without
> changes,
> would it be acceptable to make rw trip_points a kernel config option?

Well we obviously need to do _something_. And reverting that commit until
we get a decent replacement in place sounds like a fine idea to me.

2007-08-02 09:42:41

by Thomas Renninger

Subject: Re: 2.6.22 regression: thermal trip points

On Thu, 2007-08-02 at 10:40 +0200, Knut Petersen wrote:
> Hi everybody!
>
> Kernel 2.6.22 decreases performance by about 50% on my system.
> No, I do not like that. The reason is a broken BIOS, granted, but there
> was a perfect workaround in the kernel that has been dropped.
>
> mainboard: AOpen i915GMm-hfs, AWARD BIOS
> cpu: Pentium-M 750 (0.8 to 1.86 GHz)
> openSuSE 10.2 with kernel 2.6.22.1
Is this a DELL laptop that gets throttled by 75% to throttling state 6
if 60 degrees are exceeded?
Adrian has such a machine... I have no idea what is going on with that
one, but the only workaround to get any use out of this machine is to
override at least the passive trip point.
>
> The cpu fan cannot be controlled by the Linux kernel.
> The BIOS will switch on the cpu fan a bit above 50 deg. Celsius.
> The active and passive trip points both are set to 50 deg. Celsius.
> Temperature of the idle cpu at 800 MHz: 34 to 42 deg. C.
> The BIOS never changes the trip points.
> Cpufreq does work perfectly.
>
> Previously there was the possibility to add something like
>
> echo "100:0:65:70:0" > /proc/acpi/thermal_zone/THRM/trip_points
> echo 2 > /proc/acpi/thermal_zone/THRM/polling_frequency
> echo ondemand > /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor
>
> to e.g. /etc/init.d/boot.local. With 2.6.22 that solution does not exist
> any longer. Now the code in thermal.c slows down the cpu under load
> to prevent "overheating". Kernel compile time increases from about 12
> to 18 minutes. No, I don't like that, nobody would.
>
> Possible solutions:
>
> 1. Get a better BIOS! --- There is none.
>
> 2. Fix DSDT! --- Recompiling gives a number of errors ... I do not know
> how to fix it.
>
> 3. Don't include thermal.c! --- That does help, but as this is a 24/7
> system, the cpu fan could break. At that time I do not want to rely on
> the BIOS to save my system (the next trip point is at 100 deg. Celsius).
>
> 4. Revert Len Brown's commit 11ccc0f249cb01a129f54760b8ff087f242935d4
>
> I would vote for option 4, but I do understand some of the arguments of
> Len in the 2.6.22-rc1-mm1 discussion in May. Yes, communicating trip
> points to thermal.c is a hack, it will fail on systems that change trip
> points dynamically, and it might be dangerous for the machine if
> unreasonable trip points are chosen. But it does help to keep the machine
> quiet, and to work around unsuitable (e.g. too low) trip points defined
> by the BIOS.
>
> If it should be not acceptable to revert the questionable commit without
> changes,
As 2.6.22 was shipped without, I think reverting is not a real option.

> would it be acceptable to make rw trip_points a kernel config option?
IMO something new should be added.
In the long term, maybe it's possible to marry ACPI thermal control with
the hwmon interface. AFAIK there are already efforts to do so, but I don't
know much about it. Still, overriding trip points is a problem because
the BIOS can change them at runtime... IMO it should just be possible, and
on machines that change them at runtime either:
- BIOS updates do change the user's overrides
- or trip points are simply fixed after the user has overridden them
-> my favorite (Don't care for hysteresis BIOS implementations,
if the user changes them, it's his fault, he doesn't need to...)
Sanity checks that trip points can only get lowered (compared to the
initially provided ones) need to be added.
Len, Rui: For the short term, can something like that be added at least to
the new sysfs interface (I am willing to help if this is a "would be
nice to have, but no time, maybe later" issue)?

Especially passive trip point modification is IMO a powerful feature.
You can easily build a passively cooled system, running at the performance
level your cooling system allows (CPU frequency simply gets lowered
before fans kick in).
Architectures other than ACPI-powered ones already make use of CPU
frequency scaling. An ACPI-independent passive cooling implementation
connecting thermal control (hwmon?) and the cpufreq interface should be
desired for the future? (could get tricky because the ACPI spec has some
special needs for passive cooling)

Thomas

2007-08-02 09:45:28

by Adrian Schröter

Subject: Re: 2.6.22 regression: thermal trip points

On Thursday 02 August 2007 11:42:27 wrote Thomas Renninger:
> On Thu, 2007-08-02 at 10:40 +0200, Knut Petersen wrote:
> > Hi everybody!
> >
> > Kernel 2.6.22 decreases performance by about 50% on my system.
> > No, I do not like that. The reason is a broken BIOS, granted, but there
> > was a perfect workaround in the kernel that has been dropped.
> >
> > mainboard: AOpen i915GMm-hfs, AWARD BIOS
> > cpu: Pentium-M 750 (0.8 to 1.86 GHz)
> > openSuSE 10.2 with kernel 2.6.22.1
>
> Is this a DELL laptop that gets throttled by 75% to throttling state 6
> if 60 degrees are exceeded?
> Adrian has such a machine..., no idea what is going on with that one,
> but only workaround to get any use out of this machine is to override at
> least the passive trip point.

JFYI, there are plenty of these systems around; it was one out of four
standard Novell models. I am maybe just the first one who uses Factory on
it, but expect more bug reports when 10.3 gets released ...

bye
adrian

--

Adrian Schroeter
SUSE LINUX Products GmbH, GF: Markus Rex, HRB 16746 (AG Nürnberg)
email: [email protected]

2007-08-02 09:54:29

by Andi Kleen

Subject: Re: 2.6.22 regression: thermal trip points

Andrew Morton <[email protected]> writes:

> I didn't understand the arguments either, actually.

The issue is that you can actually kill hardware by setting this wrong.
We've had such cases where trip point problems eventually led
to overheated laptops with hard disks dying etc.

Also it runs the system out of spec and is similar to overclocking
which we also do not support.

> Here we had obviously-useful-to-you functionality which was taken away
> without, afaik, providing any alternative.

I don't think it's that unreasonable to require source code modifications
for anything that can kill hardware. At least that raises the barrier
a bit and hopefully ensures people think twice about it and then really
only blame themselves if anything goes wrong.

-Andi

2007-08-02 09:58:32

by Thomas Renninger

Subject: Re: 2.6.22 regression: thermal trip points

On Thu, 2007-08-02 at 11:45 +0200, Adrian Schröter wrote:
> On Thursday 02 August 2007 11:42:27 wrote Thomas Renninger:
> > On Thu, 2007-08-02 at 10:40 +0200, Knut Petersen wrote:
> > > Hi everybody!
> > >
> > > Kernel 2.6.22 decreases performance by about 50% on my system.
> > > No, I do not like that. The reason is a broken BIOS, granted, but there
> > > was a perfect workaround in the kernel that has been dropped.
> > >
> > > mainboard: AOpen i915GMm-hfs, AWARD BIOS
> > > cpu: Pentium-M 750 (0.8 to 1.86 GHz)
> > > openSuSE 10.2 with kernel 2.6.22.1
> >
> > Is this a DELL laptop that gets throttled by 75% to throttling state 6
> > if 60 degrees are exceeded?
> > Adrian has such a machine..., no idea what is going on with that one,
> > but only workaround to get any use out of this machine is to override at
> > least the passive trip point.
>
> JFYI, there are plenty of these systems around; it was one out of four
> standard Novell models. I am maybe just the first one who uses Factory on
> it, but expect more bug reports when 10.3 gets released ...

Oops. So this is not broken HW/BIOS, but definitely a kernel problem?
The only idea that comes to my mind for finding this is to grep through
the DSDT and look for code that accesses CPU throttling HW ports. Maybe
the ACPI subsystem gets something wrong when processing this code and
activates throttling by accident?

Anyway, the only solution/workaround to use these machines with current
kernels is to override trip points; maybe the patch should really just
be reverted...

Thomas

2007-08-02 10:54:17

by Alan

Subject: Re: 2.6.22 regression: thermal trip points

> Also it runs the system out of spec and is similar to overclocking
> which we also do not support.

We do not systematically prevent overclocking. There are lots of cases
where altering the trip points is helpful, and if you look in vendor
bugzilla databases there are multiple moans from people whose laptops now
run slow, or in many cases are simply unusable as a result of Len's
change.

Given you can achieve some of the same result by not loading the relevant
ACPI code in the first place your argument makes no rational sense at all.

Set a taint flag, print a loud message but don't stop users actually
doing things they intend as root. Or have you forgotten the original Unix
philosophy too ?
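
For reference, a taint flag set by the kernel shows up in user space as a bitmask in /proc/sys/kernel/tainted. Which bit an overridden trip point would set is not specified anywhere in this thread, so the sketch below (with a made-up function name, `taint_status`) only reports whether any bit is set.

```shell
#!/bin/sh
# Report whether a taint bitmask has any bit set.
# Typical use: taint_status "$(cat /proc/sys/kernel/tainted)"
# The function name is illustrative; no specific taint bit is assumed.
taint_status() {
    if [ "${1:-0}" -ne 0 ]; then
        echo "tainted (mask $1)"
    else
        echo "not tainted"
    fi
}
```

A bug-report script could print this line first, so that a report from a machine with overridden trip points is recognisable as such.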

> > Here we had obviously-useful-to-you functionality which was taken away
> > without, afaik, providing any alternative.
>
> I don't think it's that unreasonable to require source code modifications
> for anything that can kill hardware.

As root you can erase the bios, lock the hard disk with a random
password, reflash your video card ....

Sorry Andi, you simply do not know better than all end users.

Alan

2007-08-02 10:56:16

by Alan

Subject: Re: 2.6.22 regression: thermal trip points

> Anyway, only solution/workaround to use these machines with current
> kernels is to override trip points, maybe the patch should really just
> be reverted...

The question really is whether the vendors will all revert it and carry
it as a patch or whether the main tree will accept reality on this one.

Reverting it and adding a taint marker if you do it is much preferable I
suspect to having every vendor revert this bogus if well meaning
changeset.

2007-08-02 11:13:55

by Matthew Garrett

Subject: Re: 2.6.22 regression: thermal trip points

On Thu, Aug 02, 2007 at 12:02:21PM +0100, Alan Cox wrote:
> > Anyway, only solution/workaround to use these machines with current
> > kernels is to override trip points, maybe the patch should really just
> > be reverted...
>
> The question really is whether the vendors will all revert it and carry
> it as a patch or whether the main tree will accept reality on this one.
>
> Reverting it and adding a taint marker if you do it is much preferable I
> suspect to having every vendor revert this bogus if well meaning
> changeset.

I strongly suspect that the vast majority[1] of hardware that "needs"
the trip points changing works perfectly well under Windows, so it's
likely to be papering over bugs in the kernel. It'd be nice if we fixed
those rather than encouraging people to poke stuff into /proc,
especially when doing so is guaranteed to break in really confusing ways
with a lot of hardware. The firmware can reset the trip points at
essentially arbitrary times and is well within its rights to expect the
OS to actually pay attention to them.

[1] Some hardware is simply broken. We don't carry phc just because some
vendors put the wrong voltage values in their tables, either
--
Matthew Garrett | [email protected]

2007-08-02 11:32:29

by Knut Petersen

Subject: Re: 2.6.22 regression: thermal trip points

Thomas Renninger wrote:
>> mainboard: AOpen i915GMm-hfs, AWARD BIOS
>> cpu: Pentium-M 750 (0.8 to 1.86 GHz)
>> openSuSE 10.2 with kernel 2.6.22.1
> Is this a DELL laptop that gets throttled by 75% to throttling state 6
> if 60 degrees are exceeded?
No, it is a Pentium M desktop board:
Chipset i915GM, FSB 533MHz, max 2GB DDR2 RAM, 2 PCI
and 1 16x PCI Express slots, serial, parallel, usb, firewire,
2x Marvell Gigabit Ethernet, Realtek ALC 880 sound, IDE,
Intel SATA and SiI SATA Raid, FDC, DVI and VGA video out etc.
Very low power consumption: ~40W to 65W for the whole system,
except monitor.
> As 2.6.22 was shipped without, I think reverting is not a real option.
Well, it would not be the first time that a regression is eliminated by
reverting a patch after it was accepted previously.
>> Sanity checks that trip points only can get lowered (compared to initial
>> provided ones) needs to be added.
>> Len, Rui: For short-term can some
But I _need_ to raise the unreasonably low passive trip point. We could
decide to protect the innocent user by allowing write access to
trip_points only after a previous

echo "I know what I am doing" > /proc/acpi/thermal_zone/THRM/enable_really_dangerous_options

if we believe that this is a good idea ...
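
The gating idea can be modelled in user space. This is only a sketch of the proposed behaviour, not kernel code: enable_really_dangerous_options does not exist in any kernel, and the function name `guarded_set_trip_points` and its arguments are hypothetical. Ordinary files stand in for the two control files, and the write is accepted only if the gate file already contains the confirmation phrase.

```shell
#!/bin/sh
# Model of the proposed gate. Both files are ordinary files here; the real
# proposal would put this check into the kernel's write handler.
CONFIRM_PHRASE="I know what I am doing"

guarded_set_trip_points() {
    gate_file=$1      # stands in for enable_really_dangerous_options
    trip_file=$2      # stands in for trip_points
    new_points=$3     # e.g. "100:0:65:70:0"

    # Refuse unless the exact phrase has been written to the gate first.
    if [ ! -r "$gate_file" ] || [ "$(cat "$gate_file")" != "$CONFIRM_PHRASE" ]; then
        echo "trip_points stays read-only until the gate is opened" >&2
        return 1
    fi
    printf '%s\n' "$new_points" > "$trip_file"
}
```

The point of the design is that a single stray echo cannot change a trip point; two deliberate writes are needed.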

Andi Kleen wrote:

> I don't think it's that unreasonable to require source code modifications
> for anything that can kill hardware. At least that raises the barrier
> a bit and hopefully ensures people think twice about it and then really
> only blame themselves if anything goes wrong.

Andi, would the above mechanism be sufficiently safe for your taste?

cu,
Knut

2007-08-02 11:45:18

by Thomas Renninger

Subject: Re: 2.6.22 regression: thermal trip points

On Thu, 2007-08-02 at 12:13 +0100, Matthew Garrett wrote:
> On Thu, Aug 02, 2007 at 12:02:21PM +0100, Alan Cox wrote:
> > > Anyway, only solution/workaround to use these machines with current
> > > kernels is to override trip points, maybe the patch should really just
> > > be reverted...
> >
> > The question really is whether the vendors will all revert it and carry
> > it as a patch or whether the main tree will accept reality on this one.
> >
> > Reverting it and adding a taint marker if you do it is much preferable I
> > suspect to having every vendor revert this bogus if well meaning
> > changeset.
>
> I strongly suspect that the vast majority[1] of hardware that "needs"
> the trip points changing works perfectly well under Windows, so it's
> likely to be papering over bugs in the kernel. It'd be nice if we fixed
> those rather than encouraging people to poke stuff into /proc,
Some arguments against that:
- You cannot tell a customer: Wait for the kernel in half a year.
This is at least the time it takes until a laptop got sold, the
problem is found, a patch is written and checked in and finally
hits the distribution.
- You also cannot backport fixes, as ACPI patches mostly have the
potential to break other machines/BIOSes
- There also exists the policy to not fix up/work around totally broken
AML BIOS implementations
- We do not need to and never will be able to copy or do the same as
Windows is doing
- ...

> especially when doing so is guaranteed to break in really confusing ways
> with a lot of hardware. The firmware can reset the trip points at
> essentially arbitrary times and is well within its rights to expect the
> OS to actually pay attention to them.
What the hell is so wrong with:

Let the user override the trip points. If he does so, ignore
thermal trip point updates from BIOS. Don't care for hysteresis
BIOS implementations (these are the BIOS trip point updates).
If user changes them, it's his fault, he doesn't need to...
Make sure that trip points can only be lowered, compared to the
initially fetched one from BIOS.

This is neither confusing, nor dangerous in any way (beside the fact
that the critical trip point might get dynamically lowered by BIOS,
which is totally insane).
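
The policy described above can be written down as a small state model for a single trip point. This is a sketch under assumptions, not an implementation: the variable and function names are invented for illustration, temperatures are whole degrees Celsius, and nothing here talks to real ACPI interfaces.

```shell
#!/bin/sh
# Model of the proposed policy for one trip point (degrees Celsius).
# All names are illustrative only.
bios_initial=0      # value first fetched from the BIOS
current=0           # value the thermal code acts on
user_override=no    # becomes "yes" once the user has overridden the value

init_trip_point() {
    bios_initial=$1
    current=$1
    user_override=no
}

# User request: accepted only if it does not exceed the initial BIOS value.
user_set_trip_point() {
    if [ "$1" -gt "$bios_initial" ]; then
        echo "refused: $1 is above the initial BIOS value $bios_initial" >&2
        return 1
    fi
    current=$1
    user_override=yes
}

# BIOS notification: applied normally, ignored after a user override.
bios_update_trip_point() {
    if [ "$user_override" = yes ]; then
        return 0
    fi
    current=$1
}
```

As written, a request to raise a trip point above the initial BIOS value is always refused, and once the user has set a value, later BIOS updates no longer change it.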

Thomas

>
> [1] Some hardware is simply broken. We don't carry phc just because some
> vendors put the wrong voltage values in their tables, either


2007-08-02 11:53:44

by Alan

Subject: Re: 2.6.22 regression: thermal trip points

> I strongly suspect that the vast majority[1] of hardware that "needs"
> the trip points changing works perfectly well under Windows, so it's

Windows as I understand it has vendor mechanisms to allow the bits
shipped with the OS to override/ignore just about everything, trip points
included. Lots of hardware that requires fixups in Linux and just works
in Windows is not Linux bugs but Windows magic .inf files and other
registry gunge done by the machine vendor. We see this in ATA, in power
management and elsewhere.

Alan

2007-08-02 11:56:58

by Matthew Garrett

Subject: Re: 2.6.22 regression: thermal trip points

On Thu, Aug 02, 2007 at 01:45:00PM +0200, Thomas Renninger wrote:
> On Thu, 2007-08-02 at 12:13 +0100, Matthew Garrett wrote:
> > I strongly suspect that the vast majority[1] of hardware that "needs"
> > the trip points changing works perfectly well under Windows, so it's
> > likely to be papering over bugs in the kernel. It'd be nice if we fixed
> > those rather than encouraging people to poke stuff into /proc,
> Some arguments against that:
> - You cannot tell a customer: Wait for the kernel in half a year.
> This is the time it at least needs until a laptop got sold, the
> problem is found, a patch is written and checked in and finally
> hits the distribution.

We have to do so frequently. New hardware often exposes bugs in the
kernel.

> - You can also not backport fixes as ACPI patches mostly have the
> potential to break other machines/BIOSes
> - There also exist the policy to not fix up/workaround totally broken
> AML BIOS implementations

The policy has been to attempt to be bug-compatible with Windows
whenever possible for some time now.

> - We do not need to and never will be able to copy or do the same
> Windows is doing

Given that many vendors still only test against Windows, that's exactly
what we need to do.

> > especially when doing so is guaranteed to break in really confusing ways
> > with a lot of hardware. The firmware can reset the trip points at
> > essentially arbitrary times and is well within its rights to expect the
> > OS to actually pay attention to them.
> What the hell is so wrong with:
>
> Let the user override the trip points. If he does so, ignore
> thermal trip point updates from BIOS. Don't care for hysteresis
> BIOS implementations (these are the BIOS trip point updates).

No, that's not the only reason for notifications. Alteration in hardware
state may also force a recalculation of trip point (adding a battery to
a bay rather than a DVD drive may require the platform to be kept at a
lower temperature)

> If user changes them, it's his fault, he doesn't need to...
> Make sure that trip points can only be lowered, compared to the
> initially fetched one from BIOS.

Surely people want this functionality so that they can raise trip
points?

--
Matthew Garrett | [email protected]

2007-08-02 11:58:10

by Matthew Garrett

Subject: Re: 2.6.22 regression: thermal trip points

On Thu, Aug 02, 2007 at 12:59:47PM +0100, Alan Cox wrote:
> > I strongly suspect that the vast majority[1] of hardware that "needs"
> > the trip points changing works perfectly well under Windows, so it's
>
> Windows as I understand it has vendor mechanisms to allow the bits
> shipped with the OS to override/ignore just about everything trip points
> included. Lots of hardware that requires fixups in Linux and just works
> in Windows is not Linux bugs but Windows magic .inf files and other
> registry gunge done by the machine vendor. We see this in ATA, in power
> management and elsewhere.

I've seen no evidence that this happens with thermal trip points.
--
Matthew Garrett | [email protected]

2007-08-02 12:05:29

by Andi Kleen

Subject: Re: 2.6.22 regression: thermal trip points

> Set a taint flag,

That's hardly of any use if the machine is dead afterwards.

> print a loud message

Neither.

You'll just end up with "Linux destroyed my laptop" headlines all
over the internet and rightfully very annoyed users.

> Or have you forgotten the original Unix
> philosophy too ?

The philosophy didn't include physically destroying hardware
as far as I know.

> > > Here we had obviously-useful-to-you functionality which was taken away
> > > without, afaik, providing any alternative.
> >
> > I don't think it's that unreasonable to require source code modifications
> > for anything that can kill hardware.
>
> As root you can erase the bios,

We don't ship the devbios driver for good reasons.

> lock the hard disk with a random
> password, reflash your video card ....

That all requires significant effort and custom software. It's not as if we
have a one-liner echo destroy > /sys/.../flash-bios.

-Andi

2007-08-02 12:06:26

by Andi Kleen

Subject: Re: 2.6.22 regression: thermal trip points

> Andi Kleen wrote:
>
> > I don't think it's that unreasonable to require source code modifications
> > for anything that can kill hardware. At least that raises the barrier
> > a bit and hopefully ensures people think twice about it and then really
> > only blame themselves if anything goes wrong.
>
> Andi, would the above be mechanism sufficiently safe for your taste?

No.

-Andi

2007-08-02 12:06:40

by Thomas Renninger

Subject: Re: 2.6.22 regression: thermal trip points

On Thu, 2007-08-02 at 12:57 +0100, Matthew Garrett wrote:
> On Thu, Aug 02, 2007 at 12:59:47PM +0100, Alan Cox wrote:
> > > I strongly suspect that the vast majority[1] of hardware that "needs"
> > > the trip points changing works perfectly well under Windows, so it's
> >
> > Windows as I understand it has vendor mechanisms to allow the bits
> > shipped with the OS to override/ignore just about everything trip points
> > included. Lots of hardware that requires fixups in Linux and just works
> > in Windows is not Linux bugs but Windows magic .inf files and other
> > registry gunge done by the machine vendor. We see this in ATA, in power
> > management and elsewhere.
>
> I've seen no evidence that this happens with thermal trip points.

WMI needed for fan control -- FSC Amilo M3438G
http://bugzilla.kernel.org/show_bug.cgi?id=5670

Thomas

2007-08-02 12:16:22

by Matthew Garrett

Subject: Re: 2.6.22 regression: thermal trip points

On Thu, Aug 02, 2007 at 02:06:26PM +0200, Thomas Renninger wrote:
> On Thu, 2007-08-02 at 12:57 +0100, Matthew Garrett wrote:
> > On Thu, Aug 02, 2007 at 12:59:47PM +0100, Alan Cox wrote:
> > > Windows as I understand it has vendor mechanisms to allow the bits
> > > shipped with the OS to override/ignore just about everything trip points
> > > included. Lots of hardware that requires fixups in Linux and just works
> > > in Windows is not Linux bugs but Windows magic .inf files and other
> > > registry gunge done by the machine vendor. We see this in ATA, in power
> > > management and elsewhere.
> >
> > I've seen no evidence that this happens with thermal trip points.
>
> WMI needed for fan control -- FSC Amilo M3438G
> http://bugzilla.kernel.org/show_bug.cgi?id=5670

That machine has no active thermal trip points, so I'm not sure how it's
relevant here. By the sounds of the bug log, I suspect Linux just runs
slightly hotter on the machine than Windows does - especially since the
user isn't running the closed nvidia driver, so there's nothing to carry
out any power management on the GPU.
--
Matthew Garrett | [email protected]

2007-08-02 12:35:30

by Thomas Renninger

Subject: Re: 2.6.22 regression: thermal trip points

On Thu, 2007-08-02 at 13:15 +0100, Matthew Garrett wrote:
> On Thu, Aug 02, 2007 at 02:06:26PM +0200, Thomas Renninger wrote:
> > On Thu, 2007-08-02 at 12:57 +0100, Matthew Garrett wrote:
> > > On Thu, Aug 02, 2007 at 12:59:47PM +0100, Alan Cox wrote:
> > > > Windows as I understand it has vendor mechanisms to allow the bits
> > > > shipped with the OS to override/ignore just about everything trip points
> > > > included. Lots of hardware that requires fixups in Linux and just works
> > > > in Windows is not Linux bugs but Windows magic .inf files and other
> > > > registry gunge done by the machine vendor. We see this in ATA, in power
> > > > management and elsewhere.
> > >
> > > I've seen no evidence that this happens with thermal trip points.
> >
> > WMI needed for fan control -- FSC Amilo M3438G
> > http://bugzilla.kernel.org/show_bug.cgi?id=5670
>
> That machine has no active thermal trip points, so I'm not sure how it's
> relevant here.
From above: "Windows as I understand it has vendor mechanisms to..."
Maybe thermal trip points are not influenced here, but it's at least about
thermal management and more proof that we cannot just try to copy
Windows behavior, but need to provide workarounds wherever possible.

Thomas

> By the sounds of the bug log, I suspect Linux just runs
> slightly hotter on the machine than Windows does - especially since the
> user isn't running the closed nvidia driver, so there's nothing to carry
> out any power management on the GPU.

2007-08-02 12:42:35

by Thomas Renninger

Subject: Re: 2.6.22 regression: thermal trip points

On Thu, 2007-08-02 at 12:56 +0100, Matthew Garrett wrote:
> On Thu, Aug 02, 2007 at 01:45:00PM +0200, Thomas Renninger wrote:
> > On Thu, 2007-08-02 at 12:13 +0100, Matthew Garrett wrote:
> > > I strongly suspect that the vast majority[1] of hardware that "needs"
> > > the trip points changing works perfectly well under Windows, so it's
> > > likely to be papering over bugs in the kernel. It'd be nice if we fixed
> > > those rather than encouraging people to poke stuff into /proc,
> > Some arguments against that:
> > - You cannot tell a customer: Wait for the kernel in half a year.
> > This is the time it at least needs until a laptop got sold, the
> > problem is found, a patch is written and checked in and finally
> > hits the distribution.
>
> We have to do so frequently. New hardware often exposes bugs in the
> kernel.
And often we can provide a boot param or whatever, that makes it at
least useable.
>
> > - You can also not backport fixes as ACPI patches mostly have the
> > potential to break other machines/BIOSes
> > - There also exist the policy to not fix up/workaround totally broken
> > AML BIOS implementations
>
> The policy has been to attempt to be bug-compatible with Windows
> whenever possible for some time now.
*whenever possible*
>
> > - We do not need to and never will be able to copy or do the same
> > Windows is doing
>
> Given that many vendors still only test against Windows, that's exactly
> what we need to do.
But we cannot (copy all windows (mis-)behavior).
>
> > > especially when doing so is guaranteed to break in really confusing ways
> > > with a lot of hardware. The firmware can reset the trip points at
> > > essentially arbitrary times and is well within its rights to expect the
> > > OS to actually pay attention to them.
> > What the hell is so wrong with:
> >
> > Let the user override the trip points. If he does so, ignore
> > thermal trip point updates from BIOS. Don't care for hysteresis
> > BIOS implementations (these are the BIOS trip point updates).
>
> No, that's not the only reason for notifications. Alteration in hardware
> state may also force a recalculation of trip point (adding a battery to
> a bay rather than a DVD drive may require the platform to be kept at a
> lower temperature)
"I've seen no evidence that this happens...", but I see the point.
> > If user changes them, it's his fault, he doesn't need to...
> > Make sure that trip points can only be lowered, compared to the
> > initially fetched one from BIOS.
>
> Surely people want this functionality so that they can raise trip
> points?
For Adrian it would be enough to be able to lower them.
Also being able to define a passive trip point (even if not provided by
BIOS) could help a lot of machines.

What about at least:
- Be able to override passive cooling trip point
- If BIOS does not provide one, let user be able to define it
This should already make a lot of people happy.

Thomas


2007-08-02 12:47:53

by Matthew Garrett

Subject: Re: 2.6.22 regression: thermal trip points

On Thu, Aug 02, 2007 at 02:35:18PM +0200, Thomas Renninger wrote:
> On Thu, 2007-08-02 at 13:15 +0100, Matthew Garrett wrote:
> > That machine has no active thermal trip points, so I'm not sure how it's
> > relevant here.
> From above: "Windows as I understand it has vendor mechanisms to..."
> Maybe thermal trip points are not influenced here, it's at least about
> thermal management and another proof that we cannot just try to copy
> Windows behavior, but need to provide workarounds wherever possible.

There's absolutely no evidence in the bug log there that the user's
problems are in any way due to Windows-specific code. The SetSilentMode
stuff is an additional item of functionality that underclocks various
bits of hardware, not one that's actually required for the platform to
function correctly.
--
Matthew Garrett | [email protected]

2007-08-02 12:56:04

by Matthew Garrett

Subject: Re: 2.6.22 regression: thermal trip points

On Thu, Aug 02, 2007 at 02:42:19PM +0200, Thomas Renninger wrote:
> On Thu, 2007-08-02 at 12:56 +0100, Matthew Garrett wrote:
> > The policy has been to attempt to be bug-compatible with Windows
> > whenever possible for some time now.
> *whenever possible*

But there's no evidence whatsoever that this is something we can't
handle...

> > No, that's not the only reason for notifications. Alteration in hardware
> > state may also force a recalculation of trip point (adding a battery to
> > a bay rather than a DVD drive may require the platform to be kept at a
> > lower temperature)
> "I've seen no evidence that this happens...", but I see the point.

It's explicitly mentioned as one of the use cases for trip point
alteration in the spec.

> > Surely people want this functionality so that they can raise trip
> > points?
> For Adrian it would be enough to be able to lower them.

Which suggests that we're probably doing something wrong at some more
fundamental level...

> Also being able to define a passive trip point (even if not provided by
> BIOS) could help a lot of machines.

I agree that being able to lower trip points is unlikely to result in
hardware damage, but still think that it's likely to be papering over
genuine bugs that we could fix properly.

--
Matthew Garrett | [email protected]

2007-08-02 12:58:33

by Alan

Subject: Re: 2.6.22 regression: thermal trip points

> > Set a taint flag,
> That's hardly any useful if the machine is dead afterwards.

It won't be; the hardware will do a failsafe shutdown first.

> You'll just end up with "Linux destroyed my laptop" headlines all
> over the internet and rightfully very annoyed users.

You have to systematically sit down and tweak your machine.

> The philosophy didn't include physically destroying hardware
> as far as I know.

It most certainly did. With safety checks you could override.

> > As root you can erase the bios,
> We don't ship the devbios driver for good reasons.

That's debatably a bad reason (the user space API is wrong, that's all), and
one that's totally inconsistent with some of the other drivers we do ship.

> > lock the hard disk with a random
> > password, reflash your video card ....
>
> That all requires significant effort and custom software. It's not that we
> have a one liner echo destroy > /sys/.../flash-bios.

Well you can do the hard disk one in one line of perl, the video card one
in a small bit of C. And this merely makes the argument that raising the
trip points should be harder.

Alan

2007-08-02 13:00:34

by Alan

Subject: Re: 2.6.22 regression: thermal trip points

> > Andi, would the above be mechanism sufficiently safe for your taste?
>
> No.

I don't believe Andi's taste (or lack thereof) is relevant to this
discussion. He has not, for example, explained why it's better to force
people to disable all the ACPI power and thermal control on their system
rather than adjust trip points.

2007-08-02 13:16:33

by Andi Kleen

Subject: Re: 2.6.22 regression: thermal trip points

On Thu, Aug 02, 2007 at 02:04:42PM +0100, Alan Cox wrote:
> > > Set a taint flag,
> > That's hardly any useful if the machine is dead afterwards.
>
> It won't be; the hardware will do a failsafe shutdown first.

Not necessarily. At SUSE we had at least one broken laptop
with wrong trip points. The machine ran very hot for some time
and afterwards the hard disk was dead.

-Andi

2007-08-02 15:56:14

by Pavel Machek

Subject: Re: 2.6.22 regression: thermal trip points

Hi!

> > I didn't understand the arguments either, actually.
>
> The issue is that you can actually kill hardware by setting this wrong.
> We've had such cases where trip point problems eventually led
> to overheated laptops with hard disks dying etc.

Actually, that was my machine. Omnibook xe3; BIOS provided trip points
*did* kill the disk. At least I was able to work around it by
writing to trip points.

Yes, ACPI mandates emergency shutdown when critical+delta point is
reached, *in hardware*. So this only endangers very broken machines,
and it also fixes a lot of them.

Pavel
--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

2007-08-02 15:58:18

by Pavel Machek

Subject: Re: 2.6.22 regression: thermal trip points

On Thu 2007-08-02 15:16:22, Andi Kleen wrote:
> On Thu, Aug 02, 2007 at 02:04:42PM +0100, Alan Cox wrote:
> > > > Set a taint flag,
> > > That's hardly any useful if the machine is dead afterwards.
> >
> > It won't be; the hardware will do a failsafe shutdown first.
>
> Not necessarily. At SUSE we had at least one broken laptop
> with wrong trip points. The machine ran very hot for some time
> and afterwards the hard disk was dead.

Yes, but it was original BIOS trip points that were wrong. And yes,
its failsafe shutdown was too late. At least lowering the trip points
would allow me to run it safely.

--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

2007-08-02 16:08:06

by Pavel Machek

Subject: Re: 2.6.22 regression: thermal trip points

Hi!

> Well, it would not be the first time to eliminate a regression by
> reverting a patch after it was accepted previously.
> >> Sanity checks that trip points only can get lowered (compared to initial
> >> provided ones) needs to be added.
> >> Len, Rui: For short-term can some
> But I _need_ to raise the unreasonably low passive trip point. We could
> decide to protect the innocent user by allowing write access to
> trip_points only after a previous

Actually, you should lower your active trip point, and keep cpu temp
below 50C.

> echo "I know what I am doing" >
> /proc/acpi/thermal_zone/THRM/enable_really_dangerous_options

No... but a patch that only permits lowering could be acceptable.

Pavel
--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

2007-08-02 18:38:42

by Andi Kleen

Subject: Re: 2.6.22 regression: thermal trip points

On Thu, Aug 02, 2007 at 03:57:54PM +0000, Pavel Machek wrote:
> On Thu 2007-08-02 15:16:22, Andi Kleen wrote:
> > On Thu, Aug 02, 2007 at 02:04:42PM +0100, Alan Cox wrote:
> > > > > Set a taint flag,
> > > > That's hardly any useful if the machine is dead afterwards.
> > >
> > > It won't be; the hardware will do a failsafe shutdown first.
> >
> > Not necessarily. At SUSE we had at least one broken laptop
> > with wrong trip points. The machine ran very hot for some time
> > and afterwards the hard disk was dead.
>
> Yes, but it was original BIOS trip points that were wrong. And yes,
> its failsafe shutdown was too late. At least lowering the trip points
> would allow me to run it safely.

I have no problem with lowering them (in fact I proposed this
to Thomas as a possible solution at some point). Just raising
is a bad idea.

-Andi

2007-08-02 18:40:43

by Matthew Garrett

Subject: Re: 2.6.22 regression: thermal trip points

On Thu, Aug 02, 2007 at 08:38:30PM +0200, Andi Kleen wrote:
> On Thu, Aug 02, 2007 at 03:57:54PM +0000, Pavel Machek wrote:
> > Yes, but it was original BIOS trip points that were wrong. And yes,
> > its failsafe shutdown was too late. At least lowering the trip points
> > would allow me to run it safely.
>
> I have no problem with lowering them (in fact I proposed this
> to Thomas as a possible solution at some point). Just raising
> is a bad idea.

Though for this to be reliable, you need to ignore any notifications
that would raise the trip points while still paying attention to any
that would lower them.
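
The rule described here amounts to acting on the minimum of the user's
override and whatever the firmware currently reports. A minimal sketch
in Python, using hypothetical names that do not correspond to anything
in thermal.c:

```python
def effective_trip(user_override, bios_value):
    """Return the trip point (deg C) the OS should act on.

    user_override is None when the user has not set anything.
    A BIOS update that would raise the limit above the override is
    ignored; one that lowers it below the override takes effect.
    """
    if user_override is None:
        return bios_value
    return min(user_override, bios_value)


# The user lowered the passive trip point to 60 C.
print(effective_trip(60, 70))  # BIOS notification raises to 70: stays at 60
print(effective_trip(60, 55))  # BIOS notification lowers to 55: follows BIOS
```

The override never has to be rewritten on a notification; only the
value the OS compares against changes.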

--
Matthew Garrett | [email protected]

2007-08-02 19:25:46

by Krzysztof Halasa

Subject: Re: 2.6.22 regression: thermal trip points

Knut Petersen <[email protected]> writes:

> echo "I know what I am doing" >
> /proc/acpi/thermal_zone/THRM/enable_really_dangerous_options

There is a shorter version:
$ su
Password:
#
--
Krzysztof Halasa

2007-08-02 21:57:26

by Len Brown

Subject: Re: 2.6.22 regression: thermal trip points

On Thursday 02 August 2007 04:40, Knut Petersen wrote:

> Kernel 2.6.22 decreases performance by about 50% on my system.
> No, I do not like that. The reason is a broken BIOS, granted, but there
> was a perfect workaround in the kernel that has been dropped.
>
> mainboard: AOpen i915GMm-hfs, AWARD BIOS
> cpu: Pentium-M 750 (0.8 to 1.86 GHz)
> openSuSE 10.2 with kernel 2.6.22.1
>
> The cpu fan can not be controlled by linux kernel.
> The BIOS will switch on the cpu fan a bit above 50 deg. Celsius.
> The active and passive trip points both are set to 50 deg. Celsius.
> Temperature of the idle cpu at 800 MHz: 34 to 42 deg. C.
> The BIOS never changes the trip points.
> Cpufreq does work perfectly.
>
> Previously there was the possibility to add something like
>
> echo "100:0:65:70:0" > /proc/acpi/thermal_zone/THRM/trip_points
> echo 2 > /proc/acpi/thermal_zone/THRM/polling_frequency
> echo ondemand > /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor
>
> to e.g. /etc/init.d/boot.local. With 2.6.22 that solution does not exist
> any longer. Now the code in thermal.c slows down the cpu under load
> to prevent "overheating". Kernel compile time increases from about 12
> to 18 minutes. No, I don't like that, nobody would.
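
For reference, the string written to trip_points in the quoted commands
is a colon-separated list of temperatures in degrees Celsius. The
sketch below decodes it in Python, assuming the field order
critical:hot:passive:active0:active1... of the pre-2.6.22 write
handler; that order, the function name and the returned layout are
assumptions made for illustration.

```python
def parse_trip_points(spec):
    """Decode a trip_points string such as "100:0:65:70:0".

    Assumed field order: critical, hot, passive, then one field per
    active trip point. Values are degrees Celsius.
    """
    fields = [int(f) for f in spec.strip().split(":")]
    if len(fields) < 3:
        raise ValueError("expected at least critical:hot:passive")
    critical, hot, passive = fields[:3]
    return {
        "critical": critical,
        "hot": hot,
        "passive": passive,
        "active": fields[3:],
    }


trips = parse_trip_points("100:0:65:70:0")
print(trips)
```

Read this way, the quoted command raises the passive trip point to 65 C
and the first active trip point to 70 C.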


Thanks for the sighting, Knut!
This regression is dramatic when put in terms of a 50% performance hit!
I guess the good news is that thermal throttling is doing the job
we are asking it to:-)

The statement above regarding the existence of active trip points
and the kernel not being able to control the fan are inconsistent
with each other.

Please open a sighting for this machine here:

http://bugzilla.kernel.org/enter_bug.cgi?product=ACPI
vs. Power-Thermal
and attach the output from acpidump, cat /proc/acpi/thermal_zone/*/*
and assign it to [email protected]

BTW. does the board boot and run properly with "acpi=off"?

thanks,
-Len

2007-08-02 21:57:40

by Len Brown

Subject: Re: 2.6.22 regression: thermal trip points

On Thursday 02 August 2007 05:45, Adrian Schröter wrote:
> On Thursday 02 August 2007 11:42:27 wrote Thomas Renninger:
> > On Thu, 2007-08-02 at 10:40 +0200, Knut Petersen wrote:
> > > Hi everybody!
> > >
> > > Kernel 2.6.22 decreases performance by about 50% on my system.
> > > No, I do not like that. The reason is a broken BIOS, granted, but there
> > > was a perfect workaround in the kernel that has been dropped.
> > >
> > > mainboard: AOpen i915GMm-hfs, AWARD BIOS
> > > cpu: Pentium-M 750 (0.8 to 1.86 GHz)
> > > openSuSE 10.2 with kernel 2.6.22.1
> >
> > Is this a DELL laptop that gets throttled by 75% to throttling state 6
> > if 60 degrees are exceeded?
> > Adrian has such a machine..., no idea what is going on with that one,
> > but only workaround to get any use out of this machine is to override at
> > least the passive trip point.
>
> JFYI, there are plenty of these systems around, it was one out of four
> standard Novell models. I am maybe just the first one who uses Factory on
> it, but expect more bug reports when 10.3 gets released ...

That's very good news, Adrian. In the past all we had to go on
was the memory of a machine that died several years ago.
But if you've got a live failure, that is really valuable.

Please go here
http://bugzilla.kernel.org/enter_bug.cgi?product=ACPI
and submit a new sighting vs. Power-Thermal
and attach the output from acpidump, cat /proc/acpi/thermal_zone/*/*
and assign it to [email protected]

thanks,
-Len

2007-08-03 11:17:16

by Thomas Renninger

Subject: Re: 2.6.22 regression: thermal trip points

On Thu, 2007-08-02 at 20:38 +0200, Andi Kleen wrote:
> On Thu, Aug 02, 2007 at 03:57:54PM +0000, Pavel Machek wrote:
> > On Thu 2007-08-02 15:16:22, Andi Kleen wrote:
> > > On Thu, Aug 02, 2007 at 02:04:42PM +0100, Alan Cox wrote:
> > > > > > Set a taint flag,
> > > > > That's hardly any useful if the machine is dead afterwards.
> > > >
> > > > It won't be; the hardware will do a failsafe shutdown first.
> > >
> > > Not necessarily. At SUSE we had at least one broken laptop
> > > with wrong trip points. The machine ran very hot for some time
> > > and afterwards the hard disk was dead.
> >
> > Yes, but it was original BIOS trip points that were wrong. And yes,
> > its failsafe shutdown was too late. At least lowering the trip points
> > would allow me to run it safely.
>
> I have no problem with lowering them (in fact I proposed this
> to Thomas as a possible solution at some point). Just raising
> is a bad idea.

Ok.
If nobody screams (especially Len who has to accept this in the end, I
don't want to do work for nothing..), I'll try an implementation that:
- Allows lowering trip points
- If BIOS modifies trip points, the overridden ones might also
  get lowered if they are even lower
- Allows the definition of a passive trip point (with some default
  values for hysteresis), even if the thermal zone does not
  provide one
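
The first and third rules above can be sketched as a validation step
for a user-supplied passive trip point. This is only an illustration in
Python; the function and parameter names are hypothetical, and the
default hysteresis values mentioned above are left out.

```python
def apply_user_passive(bios_passive, requested, bios_critical):
    """Validate a user-requested passive trip point (deg C).

    bios_passive is None when the thermal zone defines no passive
    trip point. Returns the accepted value or raises ValueError.
    """
    if bios_passive is None:
        # Nothing to compare against: accept a new definition, but
        # keep it below the critical trip point.
        if requested >= bios_critical:
            raise ValueError("passive trip point must be below critical")
        return requested
    if requested > bios_passive:
        raise ValueError("trip points may only be lowered")
    return requested


print(apply_user_passive(50, 45, 100))    # lowering 50 -> 45 is accepted
print(apply_user_passive(None, 65, 100))  # zone has no passive trip point
```

Under these rules a request that raises a 50 C passive trip point to
65 C would be rejected with ValueError.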

If we have something like this, we could still discuss a config option,
that also allows to increase trip points, marking it with "If you set
this you can destroy your machine, you have been warned...". While this
would not be an option for distributions to compile in, some people may
come around the biggest hammer -> overriding DSDT.

I cannot promise, but I will try to get this in for 2.6.24.

Thomas

2007-08-03 11:43:57

by Renato S. Yamane

Subject: Re: 2.6.22 regression: thermal trip points

Len Brown wrote:
> On Thursday 02 August 2007 04:40, Knut Petersen wrote:
>> mainboard: AOpen i915GMm-hfs, AWARD BIOS
>> cpu: Pentium-M 750 (0.8 to 1.86 GHz)
>> openSuSE 10.2 with kernel 2.6.22.1
>>
>> The cpu fan can not be controlled by linux kernel.
>> The BIOS will switch on the cpu fan a bit above 50 deg. Celsius.
>> The active and passive trip points both are set to 50 deg. Celsius.
>> Temperature of the idle cpu at 800 MHz: 34 to 42 deg. C.
>> The BIOS never changes the trip points.
>> Cpufreq does work perfectly.

On my Toshiba M45-S355 (Toshiba BIOS, Pentium M 750 - 0.8 to 1.86 GHz,
Debian Etch) I see the same using kernel 2.6.21.6

>> Previously there was the possibility to add something like
>>
>> echo "100:0:65:70:0" > /proc/acpi/thermal_zone/THRM/trip_points
>> echo 2 > /proc/acpi/thermal_zone/THRM/polling_frequency
>> echo ondemand > /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor

I never did that, but see below (kernel 2.6.21.6):

cat /proc/acpi/thermal_zone/TZCL/trip_points
critical (S5): 105 C

cat /proc/acpi/thermal_zone/TZCL/polling_frequency
polling frequency: 2 seconds

cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor
ondemand

Regards,
Renato S. Yamane

2007-08-03 12:54:21

by Knut Petersen

Subject: Re: 2.6.22 regression: thermal trip points

Len Brown :
>
>
> Thanks for the sighting, Knut!
> This regression is dramatic when put in the terms of 50% performance hit!
> I guess the good news is that thermal throttling is doing the job
> we are asking it to:-)
>
>
>
Thermal management by cpufreq is working really fine ;-)

My problems are definitely not related to a linux bug. All trip_points
are fixed, hardcoded in the system BIOS at address 0x000FF810.

Yes, I could hack and flash a custom BIOS.

After reading a lot I think I even could fix the DSDT.

But all that would only be a solution for my system. The principal
question is whether the hook that allowed overriding unreasonable
trip point definitions is too dangerous to be a part of the linux kernel.

You and some others believed it should not be part of the kernel,
and so it was eliminated a while ago. Some people want it back,
either because
- they need it desperately to allow their machines healthy operation,
- they need it to restore performance of their machines, or
- they want a really quiet system.

Root should be allowed to smoke his system - ask him if he really
wants to do so, ask him to echo "Yes, it's me who is guilty" to
some file prior to allowing trip point changes, but do not eliminate
hooks useful for the management of buggy machines from our
kernel.

We do need writable trip points again. And, Thomas, some people
also need to raise the defaults.

cu,
Knut

2007-08-03 18:30:49

by Len Brown

Subject: Re: 2.6.22 regression: thermal trip points

On Friday 03 August 2007 08:53, Knut Petersen wrote:
> Len Brown :
> >
> >
> > Thanks for the sighting, Knut!
> > This regression is dramatic when put in the terms of 50% performance hit!
> > I guess the good news is that thermal throttling is doing the job
> > we are asking it to:-)
> >
> >
> >
> Thermal management by cpufreq is working really fine ;-)

Unfortunately, a lot of people don't understand the ";-)" after this
statement, and they really think that cpufreq is a
solution for thermal management. It isn't. Systems still
need to be thermally sane when they are fully utilized, and
cpufreq does not help there.

> My problems are definitely not related to a linux bug. All trip_points
> are fixed, hardcoded in the system BIOS at address 0x000FF810.
>
> Yes, I could hack and flash a custom BIOS.
>
> After reading a lot I think I even could fix the DSDT.

No, you should never have to override your BIOS --
except for debugging.

If Windows works out-of-the-box on this system,
then Linux should too - even if we have to use a DMI-based
workaround for a BIOS bug.

I'm looking forward to seeing the bug report that you are
going to file. Please include the dmidecode output in addition
to the acpidump output.

thanks,
-Len

2007-08-03 18:35:51

by Len Brown

Subject: Re: 2.6.22 regression: thermal trip points

On Friday 03 August 2007 07:43, Renato S. Yamane wrote:
> Len Brown wrote:
> > On Thursday 02 August 2007 04:40, Knut Petersen wrote:
> >> mainboard: AOpen i915GMm-hfs, AWARD BIOS
> >> cpu: Pentium-M 750 (0.8 to 1.86 GHz)
> >> openSuSE 10.2 with kernel 2.6.22.1
> >>
> >> The cpu fan can not be controlled by linux kernel.
> >> The BIOS will switch on the cpu fan a bit above 50 deg. Celsius.
> >> The active and passive trip points both are set to 50 deg. Celsius.
> >> Temperature of the idle cpu at 800 MHz: 34 to 42 deg. C.
> >> The BIOS never changes the trip points.
> >> Cpufreq does work perfectly.
>
> On my Toshiba M45-S355 (Toshiba BIOS, Pentium M 750 - 0.8 to 1.86 GHz,
> Debian Etch) I see the same using kernel 2.6.21.6
>
> >> Previously there was the possibility to add something like
> >>
> >> echo "100:0:65:70:0" > /proc/acpi/thermal_zone/THRM/trip_points
> >> echo 2 > /proc/acpi/thermal_zone/THRM/polling_frequency
> >> echo ondemand > /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor
>
> I never did that, but see below (kernel 2.6.21.6):
>
> cat /proc/acpi/thermal_zone/TZCL/trip_points
> critical (S5): 105 C
>
> cat /proc/acpi/thermal_zone/TZCL/polling_frequency
> polling frequency: 2 seconds
>
> cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor
> ondemand
>

Renato,
I don't understand how your Toshiba is similar to Knut's AOpen.
You've got a single critical trip point at 105C, but no active or passive
trip points.

Are you reporting some kind of failure?

The only thing wrong with your system is that polling_frequency != 0 --
but that is probably a distro configuration issue rather than
a kernel issue.

thanks,
-Len

2007-08-03 18:59:40

by Len Brown

Subject: Re: 2.6.22 regression: thermal trip points

On Friday 03 August 2007 07:16, Thomas Renninger wrote:
> On Thu, 2007-08-02 at 20:38 +0200, Andi Kleen wrote:
> > On Thu, Aug 02, 2007 at 03:57:54PM +0000, Pavel Machek wrote:
> > > On Thu 2007-08-02 15:16:22, Andi Kleen wrote:
> > > > On Thu, Aug 02, 2007 at 02:04:42PM +0100, Alan Cox wrote:
> > > > > > > Set a taint flag,
> > > > > > That's hardly any useful if the machine is dead afterwards.
> > > > >
> > > > > It won't be; the hardware will do a failsafe shutdown first.
> > > >
> > > > Not necessarily. At SUSE we had at least one broken laptop
> > > > with wrong trip points. The machine ran very hot for some time
> > > > and afterwards the hard disk was dead.
> > >
> > > Yes, but it was original BIOS trip points that were wrong. And yes,
> > > its failsafe shutdown was too late. At least lowering the trip points
> > > would allow me to run it safely.
> >
> > I have no problem with lowering them (in fact I proposed this
> > to Thomas as a possible solution at some point). Just raising
> > is a bad idea.
>
> Ok.
> If nobody screams (especially Len who has to accept this in the end, I
> don't want to do work for nothing..), I'll try an implementation that:
> - Allows lowering trip points
> - If BIOS modifies trip points, the overridden ones might also
>   get lowered if they are even lower
> - Allows the definition of a passive trip point (with some default
>   values for hysteresis), even if the thermal zone does not
>   provide one
>
> If we have something like this, we could still discuss a config option,
> that also allows to increase trip points, marking it with "If you set
> this you can destroy your machine, you have been warned...". While this
> would not be an option for distributions to compile in, some people may
> come around the biggest hammer -> overriding DSDT.
>
> I cannot promise, but I try to get this for 2.6.24.

I think if you are enamored with overriding trip points at SuSE,
that you should simply restore the original scheme as the "value add"
for SuSE kernels. Seriously, I'm totally fine with that.

You should be aware, however, that (one of) the fundamental flaws
with that scheme, shared with what you describe above, is that the OS
can not actually change the trip points in the thermal sensor.
The sensor is going to trip at the temperature that _it_ thinks
the trip point is at -- not the trip point that you are letting
the user think it is at. I.e., what is advertised as a trip-point
override actually defeats the entire concept of trip-points,
and it is mandatory that you enable periodic polling of the
current temperature to compare with your new thresholds
to work-around that.
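
The polling workaround mentioned here can be sketched as a loop that
reads the temperature itself and compares it with software thresholds
instead of waiting for the sensor to raise an event. The Python sketch
below assumes a hypothetical read_temperature callable and made-up
threshold values; it is not the thermal.c implementation.

```python
import time


def poll_thermal(read_temperature, passive, critical, interval, cycles,
                 sleep=time.sleep):
    """Compare polled temperatures (deg C) with software trip points.

    Returns the list of (temperature, action) pairs observed, where
    action is "ok", "throttle" or "shutdown". Polling stops at the
    first "shutdown".
    """
    log = []
    for _ in range(cycles):
        temp = read_temperature()
        if temp >= critical:
            log.append((temp, "shutdown"))
            break
        action = "throttle" if temp >= passive else "ok"
        log.append((temp, action))
        sleep(interval)
    return log


# Simulated sensor readings; a real implementation would query the
# thermal zone instead.
readings = iter([48, 61, 67, 72])
result = poll_thermal(lambda: next(readings), passive=65, critical=100,
                      interval=0, cycles=4)
print(result)
```

Between two polls the OS does not see temperature changes at all, so
the chosen interval bounds how late a software threshold can be
noticed.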

This faking out the user, plus the fact that the BIOS does change
trip-points at run-time, made the original scheme fundamentally
unsound. Further, I've not yet found a single system where use
of this scheme wasn't papering over some other problem. For the
upstream kernel, I think it is more appropriate to expose and fix
the fundamental problems. For distro kernels, I'm less concerned
if you hide bugs instead of fixing them.

We had quite a long discussion when I deleted the trip-point-override
scheme in -mm. Then it rode through the entire 2.6.22 release cycle.
However, I have yet to see a single bug report filed that has shown
that Linux should be doing this, or something like it. I'm hopeful
that Knut's or Adrian's will be the first -- but I'm still waiting.

-Len

2007-08-06 11:07:46

by Pavel Machek

Subject: Re: 2.6.22 regression: thermal trip points

Hi!

> > If we have something like this, we could still discuss a config option,
> > that also allows to increase trip points, marking it with "If you set
> > this you can destroy your machine, you have been warned...". While this
> > would not be an option for distributions to compile in, some people may
> > come around the biggest hammer -> overriding DSDT.
> >
> > I cannot promise, but I try to get this for 2.6.24.
>
> I think if you are enamored with overriding trip points at SuSE,
> that you should simply restore the original scheme as the "value add"
> for SuSE kernels. Seriously, I'm totally fine with that.
>
> You should be aware, however, that (one of) the fundamental flaws
> with that scheme, shared with what you describe above, is that the OS
> can not actually change the trip points in the thermal sensor.
> The sensor is going to trip at the temperature that _it_ thinks

Yep, you work around this one by enabling polling.

> This faking out the user, plus the fact that the BIOS does change
> trip-points at run-time, made the original scheme fundamentally
> unsound. Further, I've not yet found a single system where use

Yes, this one is uglier. But maybe "enable polling automatically +
ignore any updates from bios" (+ maybe "only enable lowering") is
a better solution than "just remove the knob"? After all, "the knob" is
still useful for debugging at least.

> of this scheme wasn't papering over some other problem. For the
> upstream kernel, I think it is more appropriate to expose and fix
> the fundamental problems. For distro kernels, I'm less concerned
> if you hide bugs instead of fixing them.

This is okay as long as you are willing to work around the fundamental
problems in kernel. You are unable to _fix_ them. They are broken
BIOSes.
Pavel
--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

2007-08-07 18:59:50

by Len Brown

Subject: Re: 2.6.22 regression: thermal trip points

On Monday 06 August 2007 05:55, Pavel Machek wrote:
> > For the
> > upstream kernel, I think it is more appropriate to expose and fix
> > the fundamental problems. For distro kernels, I'm less concerned
> > if you hide bugs instead of fixing them.
>
> This is okay as long as you are willing to work around the fundamental
> problems in kernel. You are unable to _fix_ them. They are broken
> BIOSes.

The thing Linux needs to figure out is why Windows doesn't
get confused by what Linux claims to be broken BIOS.

So far I have one live sighting to be addressed by
the upstream kernel (from Knut). I'm certainly looking
forward to the 2nd live sighting...

-Len

2007-08-07 22:08:16

by Pavel Machek

Subject: Re: 2.6.22 regression: thermal trip points

On Tue 2007-08-07 14:58:45, Len Brown wrote:
> On Monday 06 August 2007 05:55, Pavel Machek wrote:
> > > For the
> > > upstream kernel, I think it is more appropriate to expose and fix
> > > the fundamental problems. For distro kernels, I'm less concerned
> > > if you hide bugs instead of fixing them.
> >
> > This is okay as long as you are willing to work around the fundamental
> > problems in kernel. You are unable to _fix_ them. They are broken
> > BIOSes.
>
> The thing Linux needs to figure out is why Windows doesn't
> get confused by what Linux claims to be broken BIOS.

Why do you assume that Windows works? Yes, it probably will not have
the 'machine runs at 50% speed' problem, but I'd be very surprised if
critical shutdown worked properly on more than 90% of notebooks....

> So far I have one live sighting to be addressed by
> the upstream kernel (from Knut). I'm certainly looking
> forward to the 2nd live sighting...

Ok, I guess I should steal that old xe3 I was talking about...

Vojtech, could I have that machine from table football room for a few
experiments? I keep using it as counterexample.
Pavel
--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

2007-08-13 18:58:08

by Pavel Machek

Subject: Re: 2.6.22 regression: thermal trip points

Hi!

> > > > For the
> > > > upstream kernel, I think it is more appropriate to expose and fix
> > > > the fundamental problems. For distro kernels, I'm less concerned
> > > > if you hide bugs instead of fixing them.
> > >
> > > This is okay as long as you are willing to work around the fundamental
> > > problems in kernel. You are unable to _fix_ them. They are broken
> > > BIOSes.
> >
> > The thing Linux needs to figure out is why Windows doesn't
> > get confused by what Linux claims to be broken BIOS.
>
> Why do you assume that Windows works? Yes, it probably will not have
> the 'machine runs at 50% speed' problem, but I'd be very surprised if
> critical shutdown worked properly on more than 90% of notebooks....
>
> > So far I have one live sighting to be addressed by
> > the upstream kernel (from Knut). I'm certainly looking
> > forward to the 2nd live sighting...
>
> Ok, I guess I should steal that old xe3 I was talking about...

Done, xe3 was re-built from parts.

/proc/acpi/.../trip_points:
critical (S5): 100 C
passive: 83 C...
active[0]: 100 C...

(hmm, active=critical? Interesting. Fortunately fan seems to be driven
by BIOS).

Temperature is ~63 C in "normal" use. Now let's simulate fan failure...
and let's load the cpu...

temperature slowly rises, 1min00 -- 72C, 1min15 -- 75C, 1min30 --
77C, 1min45 -- 80C, 2min00 -- 82C, 2min15 -- 83C, 2min45 -- sudden
powerdown, presumably because of hardware failsafe.

So we have two bugs here: machine should have attempted to use passive
cooling sooner, so that critical temperature would not be reached, and
machine should have attempted shutdown before hardware failsafe killed
the power. I could do both in 2.6.21, by echoing new trip points and
enabling polling.
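
To illustrate why a lower passive trip point would have helped here,
the Python sketch below replays the temperatures reported above, in
order, against the BIOS value of 83 C and against a lowered value. The
75 C override is a made-up figure for illustration, and the >=
comparison is an assumption about how the trip point is evaluated.

```python
# Temperatures (deg C) in the order reported above, under load with
# the fan disabled.
readings = [72, 75, 77, 80, 82, 83]


def first_trip(readings, passive):
    """Index of the first reading at or above the passive trip point,
    or None if it is never reached."""
    for i, temp in enumerate(readings):
        if temp >= passive:
            return i
    return None


print(first_trip(readings, 83))  # BIOS-provided passive trip point
print(first_trip(readings, 75))  # hypothetical lowered trip point
```

With the BIOS value, passive cooling would only start at the last
reading before the powerdown; with the lowered value it would start at
the second reading.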

Pavel
--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html