I found this BIOS bug some days ago.
The positive side of this one is that it nicely shows the
need for some things I came up with recently
(points 1 and 2; points 3 and 4 are further suggestions):
1) Do not be transparent to Windows in the ACPI _OSI parts
-> and, as a long-term goal, do not pretend to be Windows
2) Document _OSI BIOS developer usage in
Documentation/acpi/known_bios_osi_workarounds
3) Linuxfirmwarekit needs kernel support
4) ACPI AML functionality to report errors to the OS
The problem:
HP makes extensive use of ACPI thermal zones.
It seems they hit a bug in Vista which probably caused their
machines to be shut down through a critical temperature event.
They now work around that Vista bug by returning zero for _CRT
(which is the critical temperature in Kelvin * 10).
So they return -273 degrees Celsius, which leads to a critical
temperature shutdown as soon as the ACPI thermal driver is loaded.
This is, in short, the corresponding ACPI BIOS code:
// BIOS checks which OS is running (most parts cut off).
// Linux returns true for all of these strings except
// "Windows 2006 SP1" (Vista SP1) and "Linux".
...
If (_OSI ("Windows 2001 SP3"))
{
    Store (0x12, OSTB)
    Store (0x12, TPOS)
}
If (_OSI ("Windows 2006"))
{
    Store (0x40, OSTB)
    Store (0x40, TPOS)
}
If (_OSI ("Windows 2006 SP1"))
{
    Store (0x41, OSTB)
    Store (0x40, TPOS)
}
If (_OSI ("Linux"))
{
    Store (One, LINX)
    Store (0x80, OSTB)
    Store (0x80, TPOS)
}
// Valid critical/hot temperature: 105 (0x69)
Name (TPC, 0x69)
...
Method (_HOT, 0, Serialized)
{
    // !!! Match for Vista only, not for Vista SP1!
    If (LEqual (TPOS, 0x40))
    {
        Return (Add (0x0AAC, Multiply (TPC, 0x0A)))
    }
    Else
    {
        Return (Zero)
    }
}
Method (_CRT, 0, Serialized)
{
    // !!! Returns valid values for all Windows versions before Vista
    If (LLess (TPOS, 0x40))
    {
        // This is the valid one: 105 C -> (105 * 10) + 2732 (Kelvin * 10)
        Return (Add (0x0AAC, Multiply (TPC, 0x0A)))
    }
    Else
    {
        // This is returned on Windows Vista
        Return (Zero)
    }
}
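The arithmetic behind these return values can be checked with a few lines of C. This is only an illustrative sketch: the helper names are made up for this example, and the 2732 offset is simply the 0x0AAC constant the ASL above adds for 0 C.

```c
/* ACPI reports trip points in tenths of a Kelvin.  0x0AAC == 2732 is the
 * offset the BIOS code above uses for 0 C.  Helper names are hypothetical. */
#define DECI_KELVIN_AT_0C 2732

/* Convert tenths of a Kelvin to tenths of a degree Celsius. */
static long deci_kelvin_to_deci_celsius(unsigned long dk)
{
	return (long)dk - DECI_KELVIN_AT_0C;
}

/* What the BIOS computes: Add (0x0AAC, Multiply (TPC, 0x0A)) */
static unsigned long bios_trip_point(unsigned long celsius)
{
	return 0x0AAC + celsius * 0x0A;
}
```

With TPC = 0x69 the valid branch yields 3782, i.e. 105.0 C, while the Return (Zero) branch converts to -2732 tenths of a degree, i.e. -273.2 C.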
----------------------
This is the fix for this from Arjan:
ACPI: Reject below-freezing temperatures as invalid critical temperatures
My laptop thinks that it's a good idea to give -73C as the critical
CPU temperature.... which isn't the best thing since it causes a shutdown
right at bootup.
Temperatures below freezing are clearly invalid critical thresholds
so just reject these as such.
commit a39a2d7c72b358c6253a2ec28e17b023b7f6f41c
@@ -364,10 +364,17 @@ static int acpi_thermal_trips_update(struct acpi_thermal *tz, int flag)
 	if (flag & ACPI_TRIPS_CRITICAL) {
 		status = acpi_evaluate_integer(tz->device->handle,
 				"_CRT", NULL, &tz->trips.critical.temperature);
-		if (ACPI_FAILURE(status)) {
+		/*
+		 * Treat freezing temperatures as invalid as well; some
+		 * BIOSes return really low values and cause reboots at startup.
+		 * Below zero (Celcius) values clearly aren't right for sure..
+		 * ... so lets discard those as invalid.
+		 */
+		if (ACPI_FAILURE(status) ||
+				tz->trips.critical.temperature <= 2732) {
 			tz->trips.critical.flags.valid = 0;
 			ACPI_EXCEPTION((AE_INFO, status,
-					"No critical threshold"));
+					"No or invalid critical threshold"));
 			return -ENODEV;
 		} else {
 			tz->trips.critical.flags.valid = 1;
----------------------
What are the consequences of:
1) The fact that BIOS vendors have to fix Windows bugs/errata through
ACPI _OSI hooks (this is nearly the only way BIOS vendors use the
_OSI interface)
2) The current Linux _OSI implementation being transparent to Windows
3) The invalid critical temperature being simply ignored and the trip
point not being shown to userspace
The consequences:
1) One must assume that such a Vista-only or Vista-SP1-only bug workaround
will be spread by HP to all of their BIOSes, thus preventing all
ACPI-aware Linux kernels from working.
2) Vendors who want to provide Linux and Windows support
have to provide a separate BIOS or patch the Linux kernel so that they
do not have to run Windows errata workarounds through _OSI hooks.
3) This Vista bug can be worked around by checking for zero.
Things could get more complex.
Linux cannot implement all bugs of all Windows versions in the
long term.
4) HP certifies (at least some of) their laptops to work with distributions.
The above patch absorbs the BIOS bug, making it impossible for the current
Linuxfirmwarekit implementation to detect it.
The above BIOS update could have been rejected by certification -> this needs
a kernel facility to report BIOS bugs. Or at least the certified
distribution could have been patched along with this BIOS update/
breakage.
5) It is just a matter of time until Windows-version-specific ACPI bugs are
worked around in BIOSes in the server area as well.
Therefore some suggestions (from above):
1) As a long-term goal, Linux should not be transparent to Windows.
Nearly all _OSI conditions where ACPI code checks which OS is
running implement Windows bug workarounds. Vendors are not able
to fix the Windows implementation, therefore they have to do it in
the BIOS. While the next Windows generation might have fixed the cause,
Linux tries to implement (be compatible with) all Windows bugs.
2) Document Windows bugs worked around via _OSI in
Documentation/acpi/known_osi_hooks
3) Document Linux _OSI behavior. No ACPI BIOS developer is aware that
Linux violates the spec. All recent ACPI BIOSes check for "Linux"
as the running OS, but Linux does not return true for the call.
I have started to document current _OSI behavior on Linux. I then
realized it might be a good idea to extend it a bit and talk about
general ACPI BIOS problems on Linux. It's here: ftp://ftp.suse.com/pub/people/trenn/ACPI_BIOS_on_Linux_guide/acpi_guideline_for_vendors.pdf
Comments on enhancements, additions, etc. are appreciated.
I'll announce that separately.
4) Provide a facility to tell userspace about BIOS bugs.
This is the
FIRMWARE_BUG(severity, "Message");
interface idea I mentioned recently in an unrelated thread.
The idea is something similar to printk, but intended to be used intensively
on each possibly bogus value returned from the BIOS (also for documentation),
and which can be compiled out so that it does not waste memory on
production kernels.
At the end is a patch that extends Arjan's patch by also checking return
values for hot (already an issue with HP BIOSes), passive and active
trip points. In the case of a wrong BIOS value we want to inform userspace
that something in the BIOS is bogus, so that HW vendors who care about Linux
see that something could go wrong.
5) Something ACPI specific; maybe Intel is able to push this into the
ACPI specification in the (very) long term:
ACPI BIOS developers cannot report error conditions.
Therefore you often end up with invalid values, as they have to return
a value if a function is provided, even if
they know it does not make any sense at all.
Ideas:
1) Provide an error object similar to the debug object.
-> Just to have something in the logs
2) Add error values to each ACPI function or to sets of ACPI functions
-> cumbersome
3) Introduce return_error statement which can be used instead of
return. If it is used, the kernel must ignore the value
of the function.
-> would help a lot, similar functionality like 2., but easier
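To make the FIRMWARE_BUG() idea from suggestion 4 a bit more concrete, here is a rough userspace sketch. Nothing like this exists in the kernel at this point; the severity levels, the NO_FIRMWARE_BUG_REPORTING switch, the message prefix and the counter are all assumptions made for illustration.

```c
#include <stdio.h>

/* Hypothetical severity levels. */
enum fw_bug_severity { FW_BUG_INFO, FW_BUG_WARN, FW_BUG_ERR };

/* Counts reported firmware bugs so that a test kit could query the total. */
static unsigned int firmware_bug_count;

#ifndef NO_FIRMWARE_BUG_REPORTING
#define FIRMWARE_BUG(severity, msg)					\
	do {								\
		firmware_bug_count++;					\
		fprintf(stderr, "[Firmware Bug] severity %d: %s\n",	\
			(int)(severity), (msg));			\
	} while (0)
#else
/* Compiled out: the message string is never referenced, so it costs nothing. */
#define FIRMWARE_BUG(severity, msg) do { } while (0)
#endif
```

A caller in the thermal code could then write something like FIRMWARE_BUG(FW_BUG_ERR, "_CRT returned 0") next to the check that discards the value, so the bug is absorbed and still reported.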
Thomas
This patch also covers hot, passive and active trip points: if zero
(or any other value at or below 0 C) is returned as temperature, the
trip point is invalidated.
Hopefully this can be reported as a firmware bug soon.
diff --git a/drivers/acpi/thermal.c b/drivers/acpi/thermal.c
index 84c795f..f6344f6 100644
--- a/drivers/acpi/thermal.c
+++ b/drivers/acpi/thermal.c
@@ -400,7 +400,8 @@ static int acpi_thermal_trips_update(struct acpi_thermal *tz, int flag)
 	if (flag & ACPI_TRIPS_HOT) {
 		status = acpi_evaluate_integer(tz->device->handle,
 				"_HOT", NULL, &tz->trips.hot.temperature);
-		if (ACPI_FAILURE(status)) {
+		if (ACPI_FAILURE(status) ||
+		    tz->trips.hot.temperature <= 2732) {
 			tz->trips.hot.flags.valid = 0;
 			ACPI_DEBUG_PRINT((ACPI_DB_INFO,
 					"No hot threshold\n"));
@@ -425,7 +426,8 @@ static int acpi_thermal_trips_update(struct acpi_thermal *tz, int flag)
 				"_PSV", NULL, &tz->trips.passive.temperature);
 		}
 
-		if (ACPI_FAILURE(status))
+		if (ACPI_FAILURE(status) ||
+		    tz->trips.passive.temperature <= 2732)
 			tz->trips.passive.flags.valid = 0;
 		else {
 			tz->trips.passive.flags.valid = 1;
@@ -480,7 +482,8 @@ static int acpi_thermal_trips_update(struct acpi_thermal *tz, int flag)
 		if (flag & ACPI_TRIPS_ACTIVE) {
 			status = acpi_evaluate_integer(tz->device->handle,
 				name, NULL, &tz->trips.active[i].temperature);
-			if (ACPI_FAILURE(status)) {
+			if (ACPI_FAILURE(status) ||
+			    tz->trips.active[i].temperature <= 2732) {
 				tz->trips.active[i].flags.valid = 0;
 				if (i == 0)
 					break;
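With this patch the same "<= 2732" test appears for every trip point type. A minimal sketch of how the check could be factored out is shown below; `struct trip` and `trip_update()` are hypothetical names used only for illustration and are not part of drivers/acpi/thermal.c.

```c
#include <stdbool.h>

/* 0 C in tenths of a Kelvin, the threshold used in the patch above. */
#define KELVIN_OFFSET 2732

/* Simplified stand-in for the kernel's trip point structures. */
struct trip {
	unsigned long temperature;	/* tenths of a Kelvin, as returned by AML */
	bool valid;
};

/*
 * Mark a trip point valid only if the AML evaluation succeeded and the
 * firmware returned a temperature above freezing.
 */
static bool trip_update(struct trip *t, bool eval_ok, unsigned long value)
{
	t->temperature = value;
	t->valid = eval_ok && value > KELVIN_OFFSET;
	return t->valid;
}
```

A single helper like this would also be the natural place to hook in a firmware bug report once such a facility exists.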
Thomas Renninger wrote:
> This is the fix for this from Arjan:
>
> ACPI: Reject below-freezing temperatures as invalid critical temperatures
>
> My laptop thinks that it's a good idea to give -73C as the critical
> CPU temperature.... which isn't the best thing since it causes a shutdown
> right at bootup.
>
> Temperatures below freezing are clearly invalid critical thresholds
> so just reject these as such.
>
btw on my laptop, it wasn't 0 that was returned, but 2007.
This is suspected to be related to how Windows finds some random other ACPI state to return,
but it significantly was an AML issue on the BIOS side.
The effect was just a trainwreck, so I added a check to the kernel (in addition to getting
full info to Robert for the ACPICA side of the issue).
Thomas,
re: OSI(Windows...)
Linux will continue to claim OSI compatibility with Windows
until the day when the majority of Linux systems
have passed a Linux compatibility test rather than
a Windows compatibility test.
Re: OSI(Linux)
I've looked at O(100) DSDT's that look at OSI(Linux),
and all but several systems from two vendors do it by mistake.
They simply copied it from the bugged Intel reference code.
OSI(Linux) will _never_ be restored to Linux, ever.
re: the HP BIOS bug at hand.
Linux deletes the entire thermal zone when we see this.
(arguably, we could have just disabled the CRT
and kept the rest of the thermal zone).
If HP cared about testing Linux on this laptop
and had tools such that they could actually
test Linux compatibility, it would be pretty clear
from user-space that their thermal zone was missing.
thanks,
-Len
Len Brown wrote:
> Thomas,
>
> re: OSI(Windows...)
Thomas,
I discussed this with Len here in Ottawa. First I fully agree with his
reasoning for the current behaviour.
The main problem with OSI(Linux) is that it would be a quickly moving
target so checking for it wouldn't really help the BIOSes.
Still there might be special cases where BIOSes will still check (e.g.
if they intend to work with specific distribution releases which might
have specific bugs). Or to check for specific Linux features.
One way to do the latter would be to define new OSI flags for specific
features. Haven't got a good proposal for that currently, but it's a
possibility.
The other thing that could be done is to define OSI flags specific for
special distribution releases so BIOSes could potentially check for bugs
in SLED10 or RHWS5 or something like this, which are hopefully stable
in that behaviour doesn't move as quickly. The way to do that wouldn't
be to change the kernel though, but just specify them on the command
line using acpi_osi=...
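As a rough illustration of how strings given via acpi_osi= could extend or shrink the set of _OSI strings an OS answers true to, consider the sketch below. It is not the kernel's implementation: the table, the function names and the treatment of a leading '!' as "remove this string" are assumptions made for this example.

```c
#include <stdbool.h>
#include <string.h>

#define MAX_OSI 16

/* Strings this hypothetical OS answers "true" to in _OSI(). */
static const char *osi_table[MAX_OSI] = {
	"Windows 2001", "Windows 2001 SP3", "Windows 2006",
};

/*
 * Handle one acpi_osi= argument: "name" adds a string, "!name" removes it.
 * The pointer is stored as-is, so the caller must keep the string alive.
 */
static bool osi_setup(const char *arg)
{
	bool remove = (arg[0] == '!');
	const char *name = remove ? arg + 1 : arg;
	int i, free_slot = -1;

	for (i = 0; i < MAX_OSI; i++) {
		if (!osi_table[i]) {
			if (free_slot < 0)
				free_slot = i;
		} else if (strcmp(osi_table[i], name) == 0) {
			if (remove)
				osi_table[i] = NULL;
			return true;	/* removed, or already present */
		}
	}
	if (remove || free_slot < 0)
		return false;	/* nothing to remove, or table full */
	osi_table[free_slot] = name;
	return true;
}

/* What _OSI(name) would evaluate to. */
static bool osi_query(const char *name)
{
	int i;

	for (i = 0; i < MAX_OSI; i++)
		if (osi_table[i] && strcmp(osi_table[i], name) == 0)
			return true;
	return false;
}
```

Booting with a distribution-specific string would then correspond to a call like osi_setup("SLED10"), after which a BIOS checking _OSI("SLED10") would see true without any kernel change.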
-Andi
On Friday 25 July 2008 02:04:32 Len Brown wrote:
> Thomas,
>
> re: OSI(Windows...)
>
> Linux will continue to claim OSI compatibility with Windows
> until the day when the majority of Linux systems
> have passed a Linux compatibility test rather than
> a Windows compatibility test.
And to try that out we need the acpi_osi=windows_false boot
param I sent recently. So will you accept that one?
Also we need this documented.
Will you accept a Documentation/acpi/known_osi_vendor_hooks.txt
file? That way we get an idea of what kind of features come
in through which Windows version and, more importantly, what kind of
ugly Windows bug workarounds exist (the latter will probably be more numerous).
> Re: OSI(Linux)
>
> I've looked at O(100) DSDT's that look at OSI(Linux),
> and all but several systems from two vendors do it by mistake.
> They simply copied it from the bugged Intel reference code.
>
> OSI(Linux) will _never_ be restored to Linux, ever.
But it should not have been removed without announcing it half a
year in advance. It silently moved distributions and vendors into a
situation where they cannot support Linux and Windows with
the same BIOS anymore.
_OSI is in reality mostly not used for interfaces/features
(as you stated in the other mail), but to work around very
specific Windows version bugs.
While the mainline kernel stays transparent to _OSI, you
advise distributions to do exactly the opposite and provide e.g.
a "SLE 11" or "RHEL X" _OSI string to be able to
support the system on Linux and Windows, is that correct?
Or do you advise them to provide two separate BIOSes?
The last option, "do not implement Windows version bug
fixes", we cannot influence.
I do not see more options with the current implementation.
> re: the HP BIOS bug at hand.
>
> Linux deletes the entire thermal zone when we see this.
OpenSUSE 11.0 (2.6.25) and SLES10-SP2 (2.6.16) shut down when
the thermal driver is loaded. Probably every kernel in every
distribution out there currently is doing that.
> (arguably, we could have just disabled the CRT
> and kept the rest of the thermal zone).
> If HP cared about testing Linux on this laptop
> and had tools such that they could actually
> test Linux compatibility, it would be pretty clear
> from user-space that their thermal zone was missing.
Len, this is not about the thermal zone; it is just
a real-world example of something I told you would happen
if Linux stays _OSI transparent with Windows.
The point is that they have to provide a BIOS hot-fix for
VISTA or VISTA SP, thus breaking Linux because there
is no way to distinguish anymore.
Windows 2007 will likely have that fixed and they will provide
a sane _CRT trip point again.
This is an example of Windows version workarounds that could
get much more complex, like initializing HW differently or
whatever.
_OSI is used by vendors as a convenient way to
adjust to/work around Windows bugs in their BIOSes, without
the need to pay millions to Microsoft to fix their things.
Thomas
On Friday, 25 of July 2008, Thomas Renninger wrote:
> On Friday 25 July 2008 02:04:32 Len Brown wrote:
[--snip--]
>
> Len, this is not about the thermal zone; it is just
> a real-world example of something I told you would happen
> if Linux stays _OSI transparent with Windows.
>
> The point is that they have to provide a BIOS hot-fix for
> VISTA or VISTA SP, thus breaking Linux because there
> is no way to distinguish anymore.
> Windows 2007 will likely have that fixed and they will provide
> a sane _CRT trip point again.
> This is an example of Windows version workarounds that could
> get much more complex, like initializing HW differently or
> whatever.
> _OSI is used by vendors as a convenient way to
> adjust to/work around Windows bugs in their BIOSes, without
> the need to pay millions to Microsoft to fix their things.
This is a valid point, IMO.
If vendors use _OSI(Windows) to work around Windows bugs, we get broken
automatically on those systems unless we put in some DMI-based hacks.
Thanks,
Rafael
Len Brown wrote:
> Thomas,
>
> re: OSI(Windows...)
>
> Linux will continue to claim OSI compatibility with Windows
> until the day when the majority of Linux systems
> have passed a Linux compatibility test rather than
> a Windows compatibility test.
>
> Re: OSI(Linux)
>
> I've looked at O(100) DSDT's that look at OSI(Linux),
> and all but several systems from two vendors do it by mistake.
> They simply copied it from the bugged Intel reference code.
>
> OSI(Linux) will _never_ be restored to Linux, ever.
>
Just out of curiosity, let's imagine that today HP decides to fix its
BIOS. What would be the way to do it? Of course, without causing
additional problems when Windows is booted.
What they would want is to provide workarounds for each given version of
Windows and provide a completely ACPI-compliant version when Linux is
running. I fail to see how it is possible to do that today.
Well... they could detect Linux by checking that several OSI's for
Windows pass, but that would be a really nasty kludge.
So, am I understanding correctly that we are in a desperate need for a
good OSI solution? Until then, we can only bash and complain at the BIOS
developers, but they have no way to fix the problems.
Eric
The goal for ACPICA has gone from being a complete "reference
implementation" of the ACPI specification to being a "Windows
bug-for-bug compatible" ACPI implementation.
So when we report _OS("Microsoft Windows NT") and respond OK to all _OSI
queries with Microsoft strings, we mean it.
Bob
>-----Original Message-----
>From: Eric Piel
>Sent: Friday, July 25, 2008 3:11 PM
>To: Len Brown
>Cc: Thomas Renninger; Arjan van de Ven; linux-acpi; Moore, Robert; Linux
>Kernel Mailing List; Andi Kleen; Christian Kornacker
>Subject: Re: ACPI OSI disaster on latest HP laptops - critical temperature
>shutdown
>
>Len Brown wrote:
>> Thomas,
>>
>> re: OSI(Windows...)
>>
>> Linux will continue to claim OSI compatibility with Windows
>> until the day when the majority of Linux systems
>> have passed a Linux compatibility test rather than
>> a Windows compatibility test.
>>
>> Re: OSI(Linux)
>>
>> I've looked at O(100) DSDT's that look at OSI(Linux),
>> and all but several systems from two vendors do it by mistake.
>> They simply copied it from the bugged Intel reference code.
>>
>> OSI(Linux) will _never_ be restored to Linux, ever.
>>
>Just out of curiosity, let's imagine that today HP decides to fix its
>BIOS. What would be the way to do it? Of course, without causing
>additional problems when Windows is booted.
>
>What they would want is to provide workarounds for each given version of
>Windows and provide a completely ACPI-compliant version when Linux is
>running. I fail to see how it is possible to do that today.
>Well... they could detect Linux by checking that several OSI's for
>Windows pass, but that would be a really nasty kludge.
>
>So, am I understanding correctly that we are in a desperate need for a
>good OSI solution? Until then, we can only bash and complain at the BIOS
>developers, but they have no way to fix the problems.
>
>Eric
> If vendors use _OSI(Windows) to work around Windows bugs, we get broken
> automatically on those systems unless we put in some DMI-based hacks.
The general goal of ACPICA is to be bug-to-bug compatible with Windows.
So it might be needed for ACPICA to just emulate the respective bugs.
That said for this case I don't think that's needed, Linux just has to
detect the workarounds (which it already does I think)
-Andi
Date: Sat, 26 Jul 2008 14:40:35 -0400 (EDT)
From: Len Brown
To: Thomas Renninger
Cc: Arjan van de Ven, linux-acpi, "Moore, Robert", Linux Kernel Mailing List, Andi Kleen, Christian Kornacker
Subject: Re: ACPI OSI disaster on latest HP laptops - critical temperature shutdowns
Thomas,
Thank you for debugging and reporting this issue.
I agree with some of your observations and conclusions,
but not with others, so let's review this carefully.
a39a2d7c72b358c6253a2ec28e17b023b7f6f41c
(ACPI: Reject below-freezing temperatures as invalid critical temperatures)
was a general workaround resulting from a specific HP machine
with a BIOS bug.
The machine functioned properly in 2.6.25, but shut down
in 2.6.26-rc1. Arjan and I debugged this together.
Unfortunately, we both neglected to put the bug URL
in the commit, so here it is:
http://bugzilla.kernel.org/show_bug.cgi?id=10686
The failure in bug 10686 is similar, but not identical
to the one you reported here with CRT returning 0.
Arjan's HP has a _CRT with no return statement at all.
In Linux-2.6.25, this _CRT was rejected with
ACPI Exception (thermal-0365): AE_BAD_DATA, No critical threshold [20070126]
and the entire thermal zone was rejected.
4e3156b183aa087bc19804b3295c7c1a71f64752
(ACPICA: changed order of interpretation of operand objects)
ironically, a MS bug compatibility patch,
had the side effect of causing the implicit return
workaround applied to _CRT to return 2007 rather than bombing out.
This was interpreted as 200.7K, or -73C.
Bob looked into this one, and determined that the latest
ACPICA will return 0 here.
http://bugzilla.kernel.org/show_bug.cgi?id=10686#c9
Bob,
It may be helpful if you can elaborate on "latest ACPICA"
in this comment -- ie what release, or better yet, what patch
will cause Linux behavior to change on this code fragment?
If we suddenly start returning 0 there, we'll still be okay
because Arjan's patch above will still catch it.
Anyway, we had a choice of simple fixes for Arjan's HP.
At the time, the question was whether to reject
the entire thermal zone -- failing like 2.6.25
(a thermal zone w/o a _CRT is invalid per spec)
or to reject just the _CRT (ala thermal.nocrt).
We decided to keep it simple (and similar to 2.6.25)
and reject the entire thermal zone. Thinking about this more,
I think it would be a good idea to instead go
the thermal.nocrt route -- for if this machine
had ACPI fan control (this one doesn't),
the rest of the thermal zone
would be pretty important to normal use....
Rui,
as maintainer of ACPI_THERMAL, perhaps you can look into that,
if Thomas doesn't beat you to it?
In light of Thomas' sighting and Bob's mention that the
latest interpreter will return 0 here...
ALL THIS TELLS US is that Vista doesn't fail certification
when _CRT returns 0.
IT DOES NOT TELL US that Vista has any sort of _CRT bug,
or that Vista mandates _CRT=0.
The T61 I'm typing on has a valid _CRT and a Vista sticker...
The AML Thomas showed did this:
If (_OSI ("Windows 2006")) {
    Store (0x40, TPOS)
}
Method (_CRT, 0, Serialized) {
    If (LLess (TPOS, 0x40)) {
        Return (...valid...)
    }
    Else {
        Return (Zero)
    }
}
I draw a totally different conclusion than Thomas does.
This does not look like a Vista workaround to me,
it looks like a simple BIOS bug that Vista doesn't catch.
We've seen BIOS bugs like this many times.
They are consistent with this conversation:
Morning:
BIOS Manager: "please quickly update this platform to support Vista"
BIOS writer: "I'm busy today, but have 30 minutes if I work through lunch..."
Afternoon:
BIOS Manager: "did you look at that Vista update yet?"
BIOS Writer: "yes, I think I did it in only 20 minutes"
BIOS Manager: "you're awesome! let's send it through WHQL,
as I've got something else for you to do."
The BIOS passes WHQL and nobody with a brain ever looks
at the source code again...
It would be useful to find out what Vista actually _does_
with _CRT=0. ie. do they throw out the thermal zone,
or just the _CRT. Linux should ideally do the same.
However, the fact that plenty of systems with Vista stickers
are shipping with valid _CRT proves that it isn't Vista that
is mandating _CRT=0.
So I DO NOT BELIEVE that this sighting is proof that we should disable
OSI compatibility with Vista or any other version of Windows.
I feel STRONGLY that it is better to be compatible with the
tested path through the BIOS -- even if that tested
path includes workarounds for BIOS bugs that Windows
doesn't catch. (or workarounds for real Windows bugs --
though I don't believe this thread is an example of one)
The alternative would be the FAR GREATER EVIL of trying
to be compatible with an entirely untested path
through the BIOS. We've been there before and it
was horrific.
I think we all agree that the LONG term solution is to have
tools where OEMs can CERTIFY compatibility with Linux
and a large portion of the machines that Linux runs on
having passed that certification. When that happens,
that is the time to re-visit our current strategy of
being bug compatible with Windows. While I believe that
this is a realistic and valuable goal in some markets,
it seems unrealistic in the foreseeable future
in other markets. ie. I think it is valuable and worth pursuing,
but I would not expect universal success in the foreseeable
future.
Andi,
I ACK Thomas' suggestion to check for <= 0C for HOT,
PSV and ACx trip points. While we don't have such a
failure in hand and thus this is not urgent, it can
only make Linux more bomb proof. We might dress it
up a bit, however. I think that with acpi=strict,
we should complain loudly if this workaround is invoked,
if not disable it altogether. Thus an OEM who can
boot with acpi=strict and not get warnings or failures
knows that they're not requiring any of our out-of-spec
workarounds.
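The policy described above could look roughly like the following sketch. It is an assumption about how a strict mode might behave, not existing kernel code; `check_trip()`, the verdict names and the message text are hypothetical.

```c
#include <stdbool.h>
#include <stdio.h>

#define KELVIN_OFFSET 2732	/* 0 C in tenths of a Kelvin */

enum trip_verdict {
	TRIP_OK,		/* value is plausible, use it */
	TRIP_WORKED_AROUND,	/* bogus value quietly discarded */
	TRIP_FW_ERROR,		/* bogus value, strict mode: complain loudly */
};

/*
 * Classify a trip point temperature.  In strict mode the workaround is
 * not applied silently; the out-of-spec value is reported instead.
 */
static enum trip_verdict check_trip(const char *name, unsigned long temp,
				    bool strict)
{
	if (temp > KELVIN_OFFSET)
		return TRIP_OK;
	if (strict) {
		fprintf(stderr, "ACPI: %s returned %lu (<= 0 C), BIOS is out of spec\n",
			name, temp);
		return TRIP_FW_ERROR;
	}
	return TRIP_WORKED_AROUND;
}
```

An OEM booting with the strict setting would then see every invocation of the workaround, while default boots keep running.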
Further, Thomas' sighting demonstrates that it is important
to get Arjan's patch back into the .stable releases.
thanks,
-Len
Hi!
> I ACK Thomas' suggestion to check for <= 0C for HOT,
> PSV and ACx trip points. While we don't have such a
Hmm, I don't think that's a good idea. 0 Celsius is not a special value.
While having _CRT below 25 Celsius would be quite strange (a machine
that cannot run at room temperature?), I could imagine it (cryogenic
cooling, CPU is overclocked, needs -10C to work).
A machine with AC0 == -10C is very easy to imagine, OTOH. It is a normal
notebook that needs fans during normal temperatures, but does not need
them if it is really cold.
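To put numbers on this objection, the sketch below converts -10C into the tenths-of-Kelvin unit ACPI uses and applies the proposed check to it. The function names are made up for illustration, and the zero-only variant is just an assumed narrower alternative for comparison, not something from the patch.

```c
#include <stdbool.h>

#define KELVIN_OFFSET 2732	/* 0 C in tenths of a Kelvin */

/* Whole degrees Celsius to tenths of a Kelvin. */
static long celsius_to_deci_kelvin(long celsius)
{
	return KELVIN_OFFSET + celsius * 10;
}

/* The check proposed in the patch: anything at or below 0 C is invalid. */
static bool rejected_below_freezing(long deci_kelvin)
{
	return deci_kelvin <= KELVIN_OFFSET;
}

/* Narrower alternative (assumption): only a literal zero is bogus. */
static bool rejected_zero_only(long deci_kelvin)
{
	return deci_kelvin == 0;
}
```

An AC0 of -10C is 2632 in these units, so the below-freezing check would throw it away while a zero-only check would keep it. On the other hand, a zero-only check would not catch the 2007 that Arjan reported.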
Pavel