2010-06-21 16:03:21

by James Bottomley

[permalink] [raw]
Subject: 2.6.35-rc3 regression: IBM Maia system is unbootable [ACPI related?]

The system in question has a set of root bridges, with the CD rom (which
is seen) being on bus 0 and the SCSI controller, which doesn't show up
being on bus 1 (with several other things on busses > 1).

The system uses ACPI to detect the multiple bridges, so it seems that
this failure in 2.6.35-rc3:

ACPI: Core revision 20100428
ACPI Error: Hardware did not enter ACPI mode (20100428/evxfevnt-103)
ACPI Warning: AcpiEnable failed (20100428/utxface-147)
ACPI: Unable to enable ACPI

Means that we can't detect any PCI bus other than 0 and thus the system
won't boot.

The full dmesgs for a successful (2.6.34) and failed (2.6.35-rc3) boot
are below.

James



Attachments:
dmesg.2.6.34 (118.81 kB)
dmesg.2.6.35-rc3 (69.32 kB)
Download all attachments

2010-06-21 16:14:55

by James Bottomley

[permalink] [raw]
Subject: Re: 2.6.35-rc3 regression: IBM Maia system is unbootable [ACPI related?]

On Mon, 2010-06-21 at 11:03 -0500, James Bottomley wrote:
> The system in question has a set of root bridges, with the CD rom (which
> is seen) being on bus 0 and the SCSI controller, which doesn't show up
> being on bus 1 (with several other things on busses > 1).
>
> The system uses ACPI to detect the multiple bridges, so it seems that
> this failure in 2.6.35-rc3:
>
> ACPI: Core revision 20100428
> ACPI Error: Hardware did not enter ACPI mode (20100428/evxfevnt-103)
> ACPI Warning: AcpiEnable failed (20100428/utxface-147)
> ACPI: Unable to enable ACPI
>
> Means that we can't detect any PCI bus other than 0 and thus the system
> won't boot.
>
> The full dmesgs for a successful (2.6.34) and failed (2.6.35-rc3) boot
> are below.

This didn't make either the PCI or the ACPI lists ... probably because
of the attachments. However, the full message (with attachments) is on
lkml:

http://marc.info/?l=linux-kernel&m=127713622821166

James

2010-06-21 16:18:24

by Andi Kleen

[permalink] [raw]
Subject: Re: 2.6.35-rc3 regression: IBM Maia system is unbootable [ACPI related?]

James Bottomley <[email protected]> writes:

> The system in question has a set of root bridges, with the CD rom (which
> is seen) being on bus 0 and the SCSI controller, which doesn't show up
> being on bus 1 (with several other things on busses > 1).
>
> The system uses ACPI to detect the multiple bridges, so it seems that
> this failure in 2.6.35-rc3:
>
> ACPI: Core revision 20100428
> ACPI Error: Hardware did not enter ACPI mode (20100428/evxfevnt-103)
> ACPI Warning: AcpiEnable failed (20100428/utxface-147)
> ACPI: Unable to enable ACPI

That probably causes all the other problems. Without ACPI enabled
modern systems generally do not work.

Just guessing, but maybe try to revert

b430acbd7c4b919886fa7fd92eeb7a695f1940d3

That was the last change in this area.

-Andi
--
[email protected] -- Speaking for myself only.

2010-06-21 16:33:58

by James Bottomley

[permalink] [raw]
Subject: Re: 2.6.35-rc3 regression: IBM Maia system is unbootable [ACPI related?]

On Mon, 2010-06-21 at 18:18 +0200, Andi Kleen wrote:
> James Bottomley <[email protected]> writes:
>
> > The system in question has a set of root bridges, with the CD rom (which
> > is seen) being on bus 0 and the SCSI controller, which doesn't show up
> > being on bus 1 (with several other things on busses > 1).
> >
> > The system uses ACPI to detect the multiple bridges, so it seems that
> > this failure in 2.6.35-rc3:
> >
> > ACPI: Core revision 20100428
> > ACPI Error: Hardware did not enter ACPI mode (20100428/evxfevnt-103)
> > ACPI Warning: AcpiEnable failed (20100428/utxface-147)
> > ACPI: Unable to enable ACPI
>
> That probably causes all the other problems. Without ACPI enabled
> modern systems generally do not work.
>
> Just guessing, but maybe try to revert
>
> b430acbd7c4b919886fa7fd92eeb7a695f1940d3
>
> That was the last change in this area.

I can confirm that reverting this over a vanilla 2.6.35-rc3 fixes my
boot problem.

James

2010-06-21 16:54:54

by James Bottomley

[permalink] [raw]
Subject: Re: 2.6.35-rc3 regression: IBM Maia system is unbootable [ACPI related?]

On Mon, 2010-06-21 at 11:33 -0500, James Bottomley wrote:
> On Mon, 2010-06-21 at 18:18 +0200, Andi Kleen wrote:
> > James Bottomley <[email protected]> writes:
> >
> > > The system in question has a set of root bridges, with the CD rom (which
> > > is seen) being on bus 0 and the SCSI controller, which doesn't show up
> > > being on bus 1 (with several other things on busses > 1).
> > >
> > > The system uses ACPI to detect the multiple bridges, so it seems that
> > > this failure in 2.6.35-rc3:
> > >
> > > ACPI: Core revision 20100428
> > > ACPI Error: Hardware did not enter ACPI mode (20100428/evxfevnt-103)
> > > ACPI Warning: AcpiEnable failed (20100428/utxface-147)
> > > ACPI: Unable to enable ACPI
> >
> > That probably causes all the other problems. Without ACPI enabled
> > modern systems generally do not work.
> >
> > Just guessing, but maybe try to revert
> >
> > b430acbd7c4b919886fa7fd92eeb7a695f1940d3
> >
> > That was the last change in this area.
>
> I can confirm that reverting this over a vanilla 2.6.35-rc3 fixes my
> boot problem.

Andi asked for the timings with the commit reverted, so here they are
(from ACPI version print to PCI root busses found):

[ 0.084086] ACPI: Core revision 20100428
[ 0.119779] Setting APIC routing to physical flat
[ 0.124259] ..TIMER: vector=0x30 apic1=0 pin1=2 apic2=-1 pin2=-1
[ 0.128000] ..MP-BIOS bug: 8254 timer not connected to IO-APIC
[ 0.128000] ...trying to set up timer (IRQ0) through the 8259A ...
[ 0.128000] ..... (found apic 0 pin 2) ...
[ 0.128000] ....... failed.
[ 0.128000] ...trying to set up timer as Virtual Wire IRQ...
[ 0.169681] ..... works.
[ 0.172005] CPU0: Intel(R) Xeon(TM) MP CPU 3.16GHz stepping 01
[ 0.188000] lockdep: fixing up alternatives.
[ 0.188133] Booting Node 0, Processors #1 Ok.
[ 0.288089] Brought up 2 CPUs
[ 0.292006] Total of 2 processors activated (12676.86 BogoMIPS).
[ 0.300923] khelper used greatest stack depth: 5864 bytes left
[ 0.305017] khelper used greatest stack depth: 5176 bytes left
[ 0.325510] NET: Registered protocol family 16
[ 0.341322] ACPI: bus type pci registered
[ 0.344474] PCI: Using configuration type 1 for base access
[ 0.359165] bio: create slab <bio-0> at 0
[ 0.478153] ACPI: Interpreter enabled
[ 0.488006] ACPI: (supports S0 S5)
[ 0.497875] ACPI: Using IOAPIC for interrupt routing
[ 0.548145] ACPI: No dock devices found.
[ 0.552176] PCI: Ignoring host bridge windows from ACPI; if necessary, use "pci=use_crs" and report a bug
[ 0.556263] ACPI: PCI Root Bridge [VP00] (domain 0000 [bus 00])
[ 0.568051] ACPI: PCI Root Bridge [VP01] (domain 0000 [bus 01])
[ 0.576336] ACPI: PCI Root Bridge [VP02] (domain 0000 [bus 02-03])
[ 0.596889] ACPI: PCI Root Bridge [VP03] (domain 0000 [bus 04-05])
[ 0.605222] ACPI: PCI Root Bridge [VP04] (domain 0000 [bus 06-07])
[ 0.613530] ACPI: PCI Root Bridge [VP05] (domain 0000 [bus 08-09])
[ 0.617658] ACPI: PCI Root Bridge [VP06] (domain 0000 [bus 0a-0b])
[ 0.621672] ACPI: PCI Root Bridge [VP07] (domain 0000 [bus 0c-0d])
[ 0.625948] HEST: Table is not found!

James

2010-06-21 17:26:12

by James Bottomley

[permalink] [raw]
Subject: Re: 2.6.35-rc3 regression: IBM Maia system is unbootable [ACPI related?]

On Mon, 2010-06-21 at 11:54 -0500, James Bottomley wrote:
> On Mon, 2010-06-21 at 11:33 -0500, James Bottomley wrote:
> > On Mon, 2010-06-21 at 18:18 +0200, Andi Kleen wrote:
> > > James Bottomley <[email protected]> writes:
> > >
> > > > The system in question has a set of root bridges, with the CD rom (which
> > > > is seen) being on bus 0 and the SCSI controller, which doesn't show up
> > > > being on bus 1 (with several other things on busses > 1).
> > > >
> > > > The system uses ACPI to detect the multiple bridges, so it seems that
> > > > this failure in 2.6.35-rc3:
> > > >
> > > > ACPI: Core revision 20100428
> > > > ACPI Error: Hardware did not enter ACPI mode (20100428/evxfevnt-103)
> > > > ACPI Warning: AcpiEnable failed (20100428/utxface-147)
> > > > ACPI: Unable to enable ACPI
> > >
> > > That probably causes all the other problems. Without ACPI enabled
> > > modern systems generally do not work.
> > >
> > > Just guessing, but maybe try to revert
> > >
> > > b430acbd7c4b919886fa7fd92eeb7a695f1940d3
> > >
> > > That was the last change in this area.
> >
> > I can confirm that reverting this over a vanilla 2.6.35-rc3 fixes my
> > boot problem.
>
> Andi asked for the timings with the commit reverted, so here they are
> (from ACPI version print to PCI root busses found):
>
> [ 0.084086] ACPI: Core revision 20100428
> [ 0.119779] Setting APIC routing to physical flat
> [ 0.124259] ..TIMER: vector=0x30 apic1=0 pin1=2 apic2=-1 pin2=-1
> [ 0.128000] ..MP-BIOS bug: 8254 timer not connected to IO-APIC
> [ 0.128000] ...trying to set up timer (IRQ0) through the 8259A ...
> [ 0.128000] ..... (found apic 0 pin 2) ...
> [ 0.128000] ....... failed.
> [ 0.128000] ...trying to set up timer as Virtual Wire IRQ...
> [ 0.169681] ..... works.
> [ 0.172005] CPU0: Intel(R) Xeon(TM) MP CPU 3.16GHz stepping 01
> [ 0.188000] lockdep: fixing up alternatives.
> [ 0.188133] Booting Node 0, Processors #1 Ok.
> [ 0.288089] Brought up 2 CPUs
> [ 0.292006] Total of 2 processors activated (12676.86 BogoMIPS).
> [ 0.300923] khelper used greatest stack depth: 5864 bytes left
> [ 0.305017] khelper used greatest stack depth: 5176 bytes left
> [ 0.325510] NET: Registered protocol family 16
> [ 0.341322] ACPI: bus type pci registered
> [ 0.344474] PCI: Using configuration type 1 for base access
> [ 0.359165] bio: create slab <bio-0> at 0
> [ 0.478153] ACPI: Interpreter enabled
> [ 0.488006] ACPI: (supports S0 S5)
> [ 0.497875] ACPI: Using IOAPIC for interrupt routing
> [ 0.548145] ACPI: No dock devices found.
> [ 0.552176] PCI: Ignoring host bridge windows from ACPI; if necessary, use "pci=use_crs" and report a bug
> [ 0.556263] ACPI: PCI Root Bridge [VP00] (domain 0000 [bus 00])
> [ 0.568051] ACPI: PCI Root Bridge [VP01] (domain 0000 [bus 01])
> [ 0.576336] ACPI: PCI Root Bridge [VP02] (domain 0000 [bus 02-03])
> [ 0.596889] ACPI: PCI Root Bridge [VP03] (domain 0000 [bus 04-05])
> [ 0.605222] ACPI: PCI Root Bridge [VP04] (domain 0000 [bus 06-07])
> [ 0.613530] ACPI: PCI Root Bridge [VP05] (domain 0000 [bus 08-09])
> [ 0.617658] ACPI: PCI Root Bridge [VP06] (domain 0000 [bus 0a-0b])
> [ 0.621672] ACPI: PCI Root Bridge [VP07] (domain 0000 [bus 0c-0d])
> [ 0.625948] HEST: Table is not found!

Another update: It's clear the problem is the new expectation that the
hardware sets ACPI mode instantly. I applied the following patch on my
system to see how long it actually takes. I see a single loop:

[ 0.084087] ACPI: Core revision 20100428
[ 0.119720] ACPI Error: Hardware did not enter ACPI mode, retry = 0 (20100428/evxfevnt-104)
[ 0.128240] Setting APIC routing to physical flat

So it needs a delay of about 1ms (or one retry).

James

---
diff --git a/drivers/acpi/acpica/evxfevnt.c b/drivers/acpi/acpica/evxfevnt.c
index d5a5efc..bace473 100644
--- a/drivers/acpi/acpica/evxfevnt.c
+++ b/drivers/acpi/acpica/evxfevnt.c
@@ -70,6 +70,7 @@ acpi_ev_get_gpe_device(struct acpi_gpe_xrupt_info *gpe_xrupt_info,
acpi_status acpi_enable(void)
{
acpi_status status;
+ int retry = 0;

ACPI_FUNCTION_TRACE(acpi_enable);

@@ -98,6 +99,11 @@ acpi_status acpi_enable(void)

/* Sanity check that transition succeeded */

+ while (acpi_hw_get_mode() != ACPI_SYS_MODE_ACPI && retry < 3000) {
+ ACPI_ERROR((AE_INFO,
+ "Hardware did not enter ACPI mode, retry = %d", retry++));
+ acpi_os_stall(1000);
+ }
if (acpi_hw_get_mode() != ACPI_SYS_MODE_ACPI) {
ACPI_ERROR((AE_INFO,
"Hardware did not enter ACPI mode"));

2010-06-22 14:28:23

by Maciej Rutecki

[permalink] [raw]
Subject: Re: 2.6.35-rc3 regression: IBM Maia system is unbootable [ACPI related?]

On poniedziaƂek, 21 czerwca 2010 o 18:03:09 James Bottomley wrote:
> The system in question has a set of root bridges, with the CD rom (which
> is seen) being on bus 0 and the SCSI controller, which doesn't show up
> being on bus 1 (with several other things on busses > 1).
>
> The system uses ACPI to detect the multiple bridges, so it seems that
> this failure in 2.6.35-rc3:
>
> ACPI: Core revision 20100428
> ACPI Error: Hardware did not enter ACPI mode (20100428/evxfevnt-103)
> ACPI Warning: AcpiEnable failed (20100428/utxface-147)
> ACPI: Unable to enable ACPI
>
> Means that we can't detect any PCI bus other than 0 and thus the system
> won't boot.
>
> The full dmesgs for a successful (2.6.34) and failed (2.6.35-rc3) boot
> are below.
>
> James

I created a Bugzilla entry at
https://bugzilla.kernel.org/show_bug.cgi?id=16271
for your bug report, please add your address to the CC list in there, thanks!

--
Maciej Rutecki
http://www.maciek.unixy.pl

2010-06-29 00:15:46

by Len Brown

[permalink] [raw]
Subject: Re: 2.6.35-rc3 regression: IBM Maia system is unbootable [ACPI related?]

James,
Is this a production system with a production BIOS?

BTW, "Instantly" here isn't actually instantly, it is
"before the return from SMM".

thanks,
Len Brown, Intel Open Source Technology Center

2010-06-29 03:50:07

by Len Brown

[permalink] [raw]
Subject: Re: 2.6.35-rc3 regression: IBM Maia system is unbootable [ACPI related?]

From: Len Brown <[email protected]>
Subject: [PATCH] ACPI: handle systems which asynchoronously enable ACPI mode

Folklore suggested that such systems existed
in the pre-history of ACPI.

However, we removed the SCI_EN polling loop from
acpi_hw_set_mode() in b430acbd7c4b919886fa7fd92eeb7a695f1940d3
because it delayed resume by 3 seconds on boxes
that refused to set SCI_EN.

Matthew removed the call to acpi_enable() from
the suspend resume path.

James found a modern system that still needs to be polled
upon boot.

So here we restore the workaround, except that we
put it in acpi_enable() rather than the low level
acpi_hw_set_mode().

https://bugzilla.kernel.org/show_bug.cgi?id=16271

Signed-off-by: Len Brown <[email protected]>
---

James, What does the IBM system see with this patch?

thanks,
-Len

drivers/acpi/acpica/evxfevnt.c | 19 +++++++++++--------
1 files changed, 11 insertions(+), 8 deletions(-)

diff --git a/drivers/acpi/acpica/evxfevnt.c b/drivers/acpi/acpica/evxfevnt.c
index d97b8dc..18b3f14 100644
--- a/drivers/acpi/acpica/evxfevnt.c
+++ b/drivers/acpi/acpica/evxfevnt.c
@@ -70,6 +70,7 @@ acpi_ev_get_gpe_device(struct acpi_gpe_xrupt_info *gpe_xrupt_info,
acpi_status acpi_enable(void)
{
acpi_status status;
+ int retry;

ACPI_FUNCTION_TRACE(acpi_enable);

@@ -98,16 +99,18 @@ acpi_status acpi_enable(void)

/* Sanity check that transition succeeded */

- if (acpi_hw_get_mode() != ACPI_SYS_MODE_ACPI) {
- ACPI_ERROR((AE_INFO,
- "Hardware did not enter ACPI mode"));
- return_ACPI_STATUS(AE_NO_HARDWARE_RESPONSE);
+ for (retry = 0; retry < 30000; ++retry) {
+ if (acpi_hw_get_mode() == ACPI_SYS_MODE_ACPI) {
+ if (retry != 0)
+ ACPI_WARNING((AE_INFO,
+ "Platform took > %d00 usec to enter ACPI mode", retry));
+ return_ACPI_STATUS(AE_OK);
+ }
+ acpi_os_stall(100); /* 100 usec */
}

- ACPI_DEBUG_PRINT((ACPI_DB_INIT,
- "Transition to ACPI mode successful\n"));
-
- return_ACPI_STATUS(AE_OK);
+ ACPI_ERROR((AE_INFO, "Hardware did not enter ACPI mode"));
+ return_ACPI_STATUS(AE_NO_HARDWARE_RESPONSE);
}

ACPI_EXPORT_SYMBOL(acpi_enable)
--
1.7.2.rc0

2010-06-29 04:01:52

by James Bottomley

[permalink] [raw]
Subject: Re: 2.6.35-rc3 regression: IBM Maia system is unbootable [ACPI related?]

On Mon, 2010-06-28 at 17:59 -0400, Len Brown wrote:
> Is this a production system with a production BIOS?

It's got a production BIOS, yes ... not sure about the system, it might
be a B model.

> BTW, "Instantly" here isn't actually instantly, it is
> "before the return from SMM".

ENOCONTEXT?

As I explained in the previous email, it takes about 1ms before it
reports as entered ACPI mode.

James

2010-06-29 16:49:13

by James Bottomley

[permalink] [raw]
Subject: Re: 2.6.35-rc3 regression: IBM Maia system is unbootable [ACPI related?]

On Mon, 2010-06-28 at 23:49 -0400, Len Brown wrote:
> From: Len Brown <[email protected]>
> Subject: [PATCH] ACPI: handle systems which asynchoronously enable ACPI mode
>
> Folklore suggested that such systems existed
> in the pre-history of ACPI.
>
> However, we removed the SCI_EN polling loop from
> acpi_hw_set_mode() in b430acbd7c4b919886fa7fd92eeb7a695f1940d3
> because it delayed resume by 3 seconds on boxes
> that refused to set SCI_EN.
>
> Matthew removed the call to acpi_enable() from
> the suspend resume path.
>
> James found a modern system that still needs to be polled
> upon boot.
>
> So here we restore the workaround, except that we
> put it in acpi_enable() rather than the low level
> acpi_hw_set_mode().
>
> https://bugzilla.kernel.org/show_bug.cgi?id=16271
>
> Signed-off-by: Len Brown <[email protected]>
> ---
>
> James, What does the IBM system see with this patch?

The output is this:

[ 0.084088] ACPI: Core revision 20100428
[ 0.119451] ACPI Warning: Platform took > 100 usec to enter ACPI mode (20100428/evxfevnt-106)
[ 0.128231] Setting APIC routing to physical flat

So it's very fast ... actually it's behaving like there's some caching
issue and it needs two reads to return the correct value

James