2006-03-21 09:11:50

by Luming Yu

Subject: RE: 2.6.16-rc5: known regressions [TP 600X S3, vanilla DSDT]

>With _TMP faked in the kernel and one whole zone ignored, this is what
>I get:
>
>Zone to ignore | Result
>---------------+----------------------------------------
>THM0           | OK (10 cycles)
>THM2           | "kernel panic! attempted to kill init"

I guess, if you fake DSDT by completely removing THM2
you won't see this.

>THM6           | Hangs (4th cycle)
Does it still hang at SMPI?

>THM7           | OK (8 cycles)
>
>So THM6 seems healthy, but THM0 and THM7 (and maybe THM2) interact
>badly. If I unload THM2, THM6, and THM7, then it's okay (previous
>experiments with faking _TMP but with only THM0 loaded). But unloading
>THM6 is not enough.

Please try removing THM2 to judge whether it is JUST a
problem of THM0 && THM7.

>
>The kernel panic for the don't-load-THM2 kernel is very strange. I had
>another kernel panic while doing another set of tests, which I also
>couldn't explain. The only difference between the no-THM0 and the
>no-THM2 kernels is:

Could you just printk device->pnp? It could be a null pointer (due to
your hack?)

>
>diff -r b7ad6c906aba -r 213308f0ec31 drivers/acpi/thermal.c
>--- a/drivers/acpi/thermal.c Tue Mar 21 02:23:30 2006 -0500
>+++ b/drivers/acpi/thermal.c Tue Mar 21 02:36:42 2006 -0500
>@@ -1324,7 +1324,7 @@ static int acpi_thermal_add(struct acpi_
>
> 	if (!device)
> 		return_VALUE(-EINVAL);
>-	if (strcmp("THM2", device->pnp.bus_id) == 0) {
>+	if (strcmp("THM0", device->pnp.bus_id) == 0) {
> 		printk(KERN_INFO PREFIX "thermal_add: ignoring %s\n",
> 		       device->pnp.bus_id);
> 		return_VALUE(-EINVAL);
>
>
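
The quoted patch hard-codes a single zone name, so each experiment needs a new strcmp() edit. A minimal user-space sketch of a table-driven check is below. The names ignored_zones and zone_is_ignored are hypothetical and appear nowhere in the thread or in drivers/acpi/thermal.c; the list contents are only an example.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical list of thermal zone names to skip; edit per experiment. */
static const char *const ignored_zones[] = { "THM2", "THM6" };

/* Return 1 if bus_id names a zone listed in ignored_zones, else 0. */
static int zone_is_ignored(const char *bus_id)
{
	size_t i;

	assert(bus_id != NULL);
	for (i = 0; i < sizeof(ignored_zones) / sizeof(ignored_zones[0]); i++) {
		if (strcmp(ignored_zones[i], bus_id) == 0)
			return 1;
	}
	return 0;
}
```

In an add routine, the test would then read something like "if (zone_is_ignored(device->pnp.bus_id)) return -EINVAL;", assuming bus_id is a NUL-terminated string.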


2006-03-21 20:38:36

by Sanjoy Mahajan

Subject: Re: 2.6.16-rc5: known regressions [TP 600X S3, vanilla DSDT]

The following tests all have acpi_evaluate_integer() hacked to return
_TMP=27C.
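
For readers who want to picture the hack, here is a rough user-space sketch, not the actual patch. ACPI reports _TMP in tenths of a kelvin, so 27C corresponds to roughly 3002 using an assumed 273.2 K offset. The names faked_evaluate_integer, real_evaluate_integer, and celsius_to_decikelvin are hypothetical stand-ins, and the simplified signature is not that of the kernel's acpi_evaluate_integer().

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for the real evaluator; it always fails here. */
static int real_evaluate_integer(const char *method, unsigned long *value)
{
	(void)method;
	(void)value;
	return -1;
}

/* Convert whole degrees Celsius to tenths of a kelvin (assumed 273.2 K offset). */
static unsigned long celsius_to_decikelvin(long celsius)
{
	return (unsigned long)(celsius * 10 + 2732);
}

/* Return a fixed 27C for _TMP; pass every other method through unchanged. */
static int faked_evaluate_integer(const char *method, unsigned long *value)
{
	assert(method != NULL && value != NULL);
	if (strcmp(method, "_TMP") == 0) {
		*value = celsius_to_decikelvin(27);
		return 0;
	}
	return real_evaluate_integer(method, value);
}
```

The point of such a fake is that no AML for _TMP runs at all, so whatever the DSDT method would have touched is taken out of the picture.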

>> The kernel panic for the don't-load-THM2 kernel is very strange. I
>> had another kernel panic while doing another set of tests, which I
>> also couldn't explain. The only difference between the no-THM0 and
>> the no-THM2 kernels is:

> Could you just printk device->pnp? It could be a null pointer (due to
> your hack?)

device->pnp is a struct and I couldn't figure out how to printk it, so
I just printk'ed device->pnp.bus_id (most of its other elements aren't
initialized by then anyway):

diff -r ac486e270597 -r 8b088512dd1d drivers/acpi/thermal.c
--- a/drivers/acpi/thermal.c Sat Mar 18 08:35:34 2006 -0500
+++ b/drivers/acpi/thermal.c Tue Mar 21 11:32:31 2006 -0500
@@ -1324,6 +1324,7 @@ static int acpi_thermal_add(struct acpi_

 	if (!device)
 		return_VALUE(-EINVAL);
+	printk(KERN_INFO PREFIX "pnp.bus_id=0x%x\n", (u32) device->pnp.bus_id);
 
 	tz = kmalloc(sizeof(struct acpi_thermal), GFP_KERNEL);
 	if (!tz)

It produced nothing surprising:

ACPI: pnp.bus_id=0xe3ed7830
ACPI: pnp.bus_id=0xe3ed7430
ACPI: pnp.bus_id=0xe3ed7030
ACPI: pnp.bus_id=0xe3ed8c30
ACPI: pnp.bus_id=0xe3ed4030

for THM0,2,6,7, and _TZ.
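
Note that the values above are addresses, since the cast prints where bus_id lives rather than what it contains. If bus_id is a short NUL-terminated char array (an assumption here, though the strcmp() in the earlier patch suggests it), printing it with %s would show the zone name too. A user-space sketch with hypothetical stand-in types (pnp_info, fake_device, describe_bus_id are not kernel names):

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical mirror of the relevant field; the real struct has more. */
struct pnp_info {
	char bus_id[5];	/* assumed: short NUL-terminated name such as "THM0" */
};

struct fake_device {
	struct pnp_info pnp;
};

/* Write the zone name and the address of bus_id into buf. */
static int describe_bus_id(const struct fake_device *device,
			   char *buf, size_t len)
{
	assert(device != NULL && buf != NULL);
	return snprintf(buf, len, "pnp.bus_id=%s at %p",
			device->pnp.bus_id,
			(const void *)device->pnp.bus_id);
}
```

Because bus_id is embedded in the device structure, its address can never be NULL once device itself is non-NULL, which would explain why the printed values look unremarkable.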

So I still don't know why getting rid of THM2 in the kernel causes the
panic.

But while I had this kernel booted, I tried a few sleep cycles, and it
hung on the second one as expected (it's just the vanilla kernel&DSDT
with acpi_evaluate_integer() hacked to return _TMP=27C).

>> THM6 Hangs (4th cycle)
> Does it still hang at SMPI?

It looked like the usual hang, but I had debug_{layer,level}=0x10. I
increased debug_layer to 0xFFFFFFFF to see the function traces.
However, the hang didn't occur even after 15 cycles. So I rebooted with
debug_layer=0x10 and still couldn't reproduce the hang even after 12
cycles. But the same kernel hung yesterday after 4 cycles [I save all
the kernels tagged by their revision hash], so I don't know what to
think about THM6.

>> THM2 "kernel panic! attempted to kill init"

> I guess, if you fake DSDT by completely removing THM2 you won't see
> this.

Right, it booted fine when I removed THM2 from the DSDT instead of from
the kernel.

>> So THM6 seems healthy, but THM0 and THM7 (and maybe THM2) interact
>> badly. If I unload THM2, THM6, and THM7, then it's okay (previous
>> experiments with faking _TMP but with only THM0 loaded). But
>> unloading THM6 is not enough.

> Please try removing THM2 to judge whether it is JUST a problem of
> THM0 && THM7.

I tried the kernel with THM2 taken out of the DSDT, and it was fine (so
the total change was that plus _TMP faked in acpi_evaluate_integer()).

-Sanjoy

`A society of sheep must in time beget a government of wolves.'
- Bertrand de Jouvenel

2006-03-21 22:09:49

by Sanjoy Mahajan

Subject: Re: 2.6.16-rc5: known regressions [TP 600X S3, vanilla DSDT]

Two more experiments:

With a vanilla kernel, I faked EC0.UPDT() to just return 0x00, and the
system hung on the second sleep.

Then, again in the DSDT, I also faked the 4 _TMP methods (one in each
thermal zone), and the system hung on the second sleep.

I think we've raced too far ahead by trying to debug many thermal zones
at once. Perhaps there are two bugs. So let's find them one by one.

One bug is quite repeatable and we know a lot about it. With all zones
except THM0 commented out, the system hung. With the EC0.UPDT line in
THM0._TMP also commented out, the system didn't hang. So there's a
problem related to the EC, even with only THM0. And finding that
problem may give ideas for what else may be wrong.

-Sanjoy