Hello,
finally I found the time to track down the reason why my
Toshiba Satellite 2430-301 laptop did not boot when the
"local APIC support on uniprocessors" option was enabled.
The problem occurs on 2.4.21 as well as on 2.6.0-test9. The
following analysis applies to the 2.6.0-test9 kernel source
with Debian patches. All the gory details may be found at
http://bugs.debian.org/218768 .
The crash happens as follows:
start_kernel(init/main.c):
    calls 'setup_arch'
setup_arch(arch/i386/kernel/setup.c):
    calls 'get_smp_config'
    (this is conditional on CONFIG_X86_LOCAL_APIC)
get_smp_config(arch/i386/kernel/mpparse.c):
    obtains struct intel_mp_floating *mpf
    the values (obtained via early_printk,
    compare memory dumps at the bottom) are
        mpf->mpf_signature="_MP_"
        mpf->mpf_physptr=0x0009F830
        mpf->mpf_length=1
        mpf->mpf_specification=4
        mpf->mpf_checksum=111
        mpf->mpf_feature1=0
        mpf->mpf_feature2=0
        mpf->mpf_feature3=0
        mpf->mpf_feature4=0
        mpf->mpf_feature5=0
    mpf_physptr points into the second block of
    BIOS-provided physical RAM map:
        BIOS-e820: 0000000000000000 - 000000000009f800 (usable)
        BIOS-e820: 000000000009f800 - 00000000000a0000 (reserved)
        BIOS-e820: 00000000000dc000 - 0000000000100000 (reserved)
        BIOS-e820: 0000000000100000 - 000000001ff70000 (usable)
        BIOS-e820: 000000001ff70000 - 000000001ff7b000 (ACPI data)
        BIOS-e820: 000000001ff7b000 - 000000001ff80000 (ACPI NVS)
        BIOS-e820: 000000001ff80000 - 0000000020000000 (reserved)
        BIOS-e820: 00000000ff800000 - 00000000ffc00000 (reserved)
        BIOS-e820: 00000000fffffc00 - 0000000100000000 (reserved)
    smp_read_mpc is called
smp_read_mpc(arch/i386/kernel/mpparse.c):
    The argument mpc points to a 'struct mp_config_table',
    which is filled with zero bytes (compare memory dump
    below).
    The 'if (memcmp(mpc->mpc_signature,MPC_SIGNATURE,4))' test
    fails because of this and calls 'panic'.
    The kernel never returns from the call to 'panic'.
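To make the "second block" statement easy to check, here is a minimal
userspace sketch that looks an address up in the e820 map quoted above.
The names 'e820_range', 'e820_map' and 'e820_lookup' are made up for this
illustration and are not the kernel's own data structures.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical type for this sketch; not the kernel's e820 structure. */
struct e820_range {
    uint64_t start;   /* first address in the range */
    uint64_t end;     /* one past the last address in the range */
    const char *type; /* label as printed in the boot log */
};

/* The BIOS-e820 lines from the report, in the same order. */
static const struct e820_range e820_map[] = {
    { 0x0000000000000000ULL, 0x000000000009f800ULL, "usable" },
    { 0x000000000009f800ULL, 0x00000000000a0000ULL, "reserved" },
    { 0x00000000000dc000ULL, 0x0000000000100000ULL, "reserved" },
    { 0x0000000000100000ULL, 0x000000001ff70000ULL, "usable" },
    { 0x000000001ff70000ULL, 0x000000001ff7b000ULL, "ACPI data" },
    { 0x000000001ff7b000ULL, 0x000000001ff80000ULL, "ACPI NVS" },
    { 0x000000001ff80000ULL, 0x0000000020000000ULL, "reserved" },
    { 0x00000000ff800000ULL, 0x00000000ffc00000ULL, "reserved" },
    { 0x00000000fffffc00ULL, 0x0000000100000000ULL, "reserved" },
};

/* Return the index of the range containing addr, or -1 if none does. */
static int e820_lookup(uint64_t addr)
{
    size_t i;

    for (i = 0; i < sizeof(e820_map) / sizeof(e820_map[0]); i++) {
        if (addr >= e820_map[i].start && addr < e820_map[i].end)
            return (int)i;
    }
    return -1;
}
```

With these values, e820_lookup(0x0009F830) yields index 1, the reserved
block 0x9f800 - 0xa0000, which matches the statement above.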
Herbert Xu produced a patch, which converts the crash into an error
message, so the symptoms are cured for me.
Now for my questions: As far as I can see it, the invalid
SMP mptable is a BIOS bug on my machine. Do you think so,
too? Or are there other possibilities?
Do you think it would be helpful to contact Toshiba (my
laptop's vendor) about this? Or would they just ignore
such a report? I have never tried to report something like
this before, and feel a little uncomfortable about doing
so, because I don't know what a "SMP mptable" is for, and I
might write stupid things in my report.
Thank you,
Jochen
PS.: please Cc: me on replies, because I'm not on the list.
PPS.: I append a memory dump (extracted from /dev/kmem) of the
relevant regions. Starting at the "_MP_" you can see the
'struct intel_mp_floating *mpf'
000F6A80: F0 4A F7 1F 00 00 00 00 00 00 00 00 00 00 00 00 .J..............
000F6A90: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
000F6AA0: 5F 33 32 5F 80 D7 0F 00 00 01 76 00 00 00 00 00 _32_......v.....
000F6AB0: 5F 4D 50 5F 30 F8 09 00 01 04 6F 00 00 00 00 00 _MP_0.....o.....
000F6AC0: 00 00 00 00 01 10 64 95 3C 3D 6F 00 00 00 00 00 ......d.<=o.....
000F6AD0: 24 50 44 4D 01 0B 2B 47 8D 00 F0 00 00 00 00 00 $PDM..+G........
000F6AE0: 24 50 6E 50 10 21 00 00 A8 00 04 00 00 B1 97 00 $PnP.!..........
000F6AF0: F0 CF 97 00 00 0F 00 00 00 00 00 40 00 00 04 00 ...........@....
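The 16 bytes starting at 000F6AB0 can be decoded by hand. The following
is a small userspace sketch (not kernel code) that checks the "_MP_"
signature, the length byte and the byte-sum checksum, and reads the
physical pointer. The helper names 'mp_checksum', 'read_le32' and
'mpf_looks_valid' are invented for this example; the field layout assumed
is signature, pointer, length in 16-byte units, spec revision, checksum.

```c
#include <stdint.h>
#include <string.h>

/* The 16 bytes at 000F6AB0 in the dump above. */
static const unsigned char mpf_bytes[16] = {
    0x5F, 0x4D, 0x50, 0x5F, 0x30, 0xF8, 0x09, 0x00,
    0x01, 0x04, 0x6F, 0x00, 0x00, 0x00, 0x00, 0x00
};

/* Sum all bytes modulo 256; a valid MP structure sums to zero. */
static unsigned char mp_checksum(const unsigned char *p, size_t len)
{
    unsigned int sum = 0;

    while (len--)
        sum += *p++;
    return (unsigned char)(sum & 0xFF);
}

/* Read a little-endian 32-bit value, independent of host byte order. */
static uint32_t read_le32(const unsigned char *p)
{
    return (uint32_t)p[0] | ((uint32_t)p[1] << 8) |
           ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}

/* Return 1 if buf holds a floating pointer with valid signature,
 * length and checksum, 0 otherwise. */
static int mpf_looks_valid(const unsigned char *buf)
{
    size_t len = (size_t)buf[8] * 16; /* length byte counts 16-byte units */

    return memcmp(buf, "_MP_", 4) == 0 && len == 16 &&
           mp_checksum(buf, len) == 0;
}
```

For the bytes above the structure passes all three checks and
read_le32(mpf_bytes + 4) gives 0x0009F830, so the floating pointer itself
is intact; only the table it points to is missing.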
This is the place where the 'struct mp_config_table' is
supposed to reside.
0009F800: 02 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
0009F810: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
0009F820: 00 00 00 00 00 00 00 02 00 00 00 00 00 00 00 00 ................
0009F830: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
0009F840: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
0009F850: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
0009F860: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
0009F870: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
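As an illustration of why a zero-filled region is rejected, here is a
sketch of a table check that reports a status code instead of aborting.
It only looks at the first eight bytes of the header (signature, 16-bit
length, spec revision, checksum byte); the real header has more fields.
The enum and the function name 'check_config_table' are hypothetical and
this is not the code from mpparse.c.

```c
#include <stddef.h>
#include <string.h>

enum table_status {
    TABLE_OK,
    TABLE_BAD_SIGNATURE,
    TABLE_BAD_LENGTH,
    TABLE_BAD_CHECKSUM
};

/*
 * Validate the start of an MP configuration table held in buf, of which
 * avail bytes are readable. Returns a status so the caller can decide to
 * continue without the table.
 */
static enum table_status check_config_table(const unsigned char *buf,
                                            size_t avail)
{
    size_t len, i;
    unsigned int sum = 0;

    if (avail < 8)
        return TABLE_BAD_LENGTH;
    if (memcmp(buf, "PCMP", 4) != 0)
        return TABLE_BAD_SIGNATURE;

    /* base table length, stored little-endian in bytes 4 and 5 */
    len = (size_t)buf[4] | ((size_t)buf[5] << 8);
    if (len < 8 || len > avail)
        return TABLE_BAD_LENGTH;

    /* all bytes of the table must sum to zero modulo 256 */
    for (i = 0; i < len; i++)
        sum += buf[i];
    return (sum & 0xFF) == 0 ? TABLE_OK : TABLE_BAD_CHECKSUM;
}
```

Fed with zero bytes like the region at 0009F830, this returns
TABLE_BAD_SIGNATURE, which a caller can log and then ignore.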
I looked for the strings "PCMP" or "_OEM" (the signatures
for 'mp_config_table' and 'mp_config_oemtable'). These
strings do not occur on word-aligned addresses in the
address ranges
BIOS-e820: 000000000009f800 - 00000000000a0000 (reserved)
BIOS-e820: 00000000000dc000 - 0000000000100000 (reserved)
BIOS-e820: 000000001ff70000 - 000000001ff7b000 (ACPI data)
BIOS-e820: 000000001ff7b000 - 000000001ff80000 (ACPI NVS)
BIOS-e820: 000000001ff80000 - 0000000020000000 (reserved)
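A search of this kind can be sketched as follows, assuming the memory
range has already been copied into a buffer. The function name
'find_aligned_sig' is made up, and the alignment is a parameter because
"word-aligned" could mean 2 or 4 bytes depending on convention.

```c
#include <stddef.h>
#include <string.h>

/*
 * Return the offset of the first 4-byte signature sig that starts on a
 * multiple of align within buf, or -1 if there is none.
 */
static long find_aligned_sig(const unsigned char *buf, size_t len,
                             const char *sig, size_t align)
{
    size_t off;

    if (align == 0)
        return -1; /* avoid an endless loop on a bogus alignment */

    for (off = 0; off + 4 <= len; off += align) {
        if (memcmp(buf + off, sig, 4) == 0)
            return (long)off;
    }
    return -1;
}
```

Calling it once with "PCMP" and once with "_OEM" for each dumped range
reproduces the search described above.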
--
http://seehuhn.de/
On Thu, 13 Nov 2003, Jochen Voss wrote:
>
> smp_read_mpc(arch/i386/kernel/mpparse.c):
> The argument mpc points to a 'struct mp_config_table',
> which is filled with zero bytes (compare memory dump
> below).
> The 'if (memcmp(mpc->mpc_signature,MPC_SIGNATURE,4))' test
> fails because of this and calls 'panic'.
> The kernel never returns from the call to 'panic'.
>
> Herbert Xu produced a patch, which converts the crash into an error
> message, so the symptoms are cured for me.
Ok. That panic is obviously crud from a lazy initial developer, and yes,
it's always silly (and very very wrong) to crash if you can just continue.
Can you send the (tested) patch over?
> Now for my questions: As far as I can see it, the invalid
> SMP mptable is a BIOS bug on my machine. Do you think so,
> too? Or are there other possibilities?
I think it's a Linux bug too, although I'll agree that it was triggered by
some really bad BIOS behaviour. I bet the laptop vendor doesn't care: they
probably depend on ACPI to set the thing up on Windows, and Windows is
likely to just ignore the MP table (properly) when it doesn't need it (or
when it is corrupt).
> Do you think it would be helpful to contact Toshiba (my
> laptop's vendor) about this?
I really think that the Linux behaviour is more of a bug than the BIOS
behaviour. There's no excuse for panicking just because some signature
for a data block that we don't even strictly need isn't there.
Linus
On Thu, 13 Nov 2003, Jochen Voss wrote:
>
> With the patch the crash goes away, but I get the error message
>
> BIOS bug, MP table errors detected!...
> ... disabling SMP support. (tell your hw vendor)
>
> now. I guess that means no hyperthreading for me :-(
Hmm.. Do you have ACPI enabled? We really shouldn't need the MP table if
the information is elsewhere, but the mptable assumptions might be a bit
entrenched.
Linus
On Thu, Nov 13, 2003 at 08:40:16AM -0800, Linus Torvalds wrote:
>
> On Thu, 13 Nov 2003, Jochen Voss wrote:
> > Herbert Xu produced a patch, which converts the crash into an error
> > message, so the symptoms are cured for me.
>
> Ok. That panic is obviously crud from a lazy initial developer, and yes,
> it's always silly (and very very wrong) to crash if you can just continue.
>
> Can you send the (tested) patch over?
Herbert sent it to the list (Subject: [i386] Remove bogus panic calls
in mpparse.c). You can find the corresponding message under
http://www.ussg.iu.edu/hypermail/linux/kernel/0311.1/0879.html
> I really think that the Linux behaviour is more of a bug than the BIOS
> behaviour. There's no excuse for panicking just because some signature
> for a data block that we don't even strictly need isn't there.
With the patch the crash goes away, but I get the error message
BIOS bug, MP table errors detected!...
... disabling SMP support. (tell your hw vendor)
now. I guess that means no hyperthreading for me :-(
Jochen
--
http://seehuhn.de/
Hello,
On Thu, Nov 13, 2003 at 10:13:08AM -0800, Linus Torvalds wrote:
> Hmm.. Do you have ACPI enabled? We really shouldn't need the MP table if
> the information is elsewhere, but the mptable assumptions might be a bit
> entrenched.
With SMP and ACPI enabled I get the following kernel
boot messages
Intel MultiProcessor Specification v1.4
Virtual Wire compatibility mode.
SMP mptable: bad signature [0x0]!
BIOS bug, MP table errors detected!...
... disabling SMP support. (tell your hw vendor)
The last message is generated by the following code
in arch/i386/kernel/mpparse.c:
	if (!smp_read_mpc((void *)mpf->mpf_physptr)) {
		smp_found_config = 0;
		printk(KERN_ERR "BIOS bug, MP table errors detected!...\n");
		printk(KERN_ERR "... disabling SMP support. (tell your hw vendor)\n");
		return;
	}
I don't know what effects are caused by the 'smp_found_config = 0',
but later on the following messages appear:
No local APIC present or hardware disabled
...
CPU: After generic identify, caps: bfebf9ff 00000000 00000000 00000000
CPU: After vendor identify, caps: bfebf9ff 00000000 00000000 00000000
CPU: Trace cache: 12K uops, L1 D cache: 8K
CPU: L2 cache: 512K
CPU: Hyper-Threading is disabled
CPU: After all inits, caps: bfebf9ff 00000000 00000000 00000080
...
SMP motherboard not detected.
Local APIC not detected. Using dummy APIC emulation.
Starting migration thread for cpu 0
CPUS done 4
I hope this helps,
Jochen
--
http://seehuhn.de/
Which kernel is this? In 2.6 we don't look at the MPS table if ACPI is
available. Or ACPI detection is failing?
Jun
Hello,
On Thu, Nov 13, 2003 at 10:56:34AM -0800, Nakajima, Jun wrote:
> Which kernel is this?
It is 2.6.0-test9 source from Debian, with Herbert Xu's patch
to make it boot.
> In 2.6 we don't look at the MPS table if ACPI is
> available. Or ACPI detection is failing?
How do I check this? The calling chain which leads to the "BIOS bug,
MP table errors detected!" message is described in my original report
http://www.ussg.iu.edu/hypermail/linux/kernel/0311.1/0894.html
Other relevant information:
the full dmesg output: http://seehuhn.de/comp/dmesg-2.6.0-test9
my kernel config file: http://seehuhn.de/comp/config-2.6.0-test9
Herbert's patch: http://www.ussg.iu.edu/hypermail/linux/kernel/0311.1/0879.html
I hope this helps,
Jochen
--
http://seehuhn.de/
On Thu, 13 Nov 2003, Jochen Voss wrote:
>
> > In 2.6 we don't look at the MPS table if ACPI is
> > available. Or ACPI detection is failing?
>
> How do I check this?
Well, I just checked, and with my setup the IOAPIC and LAPIC information
is all from ACPI.
In particular, if your ACPI tables have the information, you should have
seen something like this:
..
ACPI: Local APIC address 0xfee00000
ACPI: LAPIC (acpi_id[0x00] lapic_id[0x00] enabled)
Processor #0 15:2 APIC version 20
ACPI: LAPIC (acpi_id[0x01] lapic_id[0x01] enabled)
Processor #1 15:2 APIC version 20
ACPI: LAPIC_NMI (acpi_id[0x00] polarity[0x1] trigger[0x1] lint[0x1])
ACPI: LAPIC_NMI (acpi_id[0x01] polarity[0x1] trigger[0x1] lint[0x1])
ACPI: IOAPIC (id[0x02] address[0xfec00000] global_irq_base[0x0])
...
and you shouldn't ever have gotten to the SMP table parsing, because the
kernel doesn't need the information.
However, your dmesg doesn't have that. Which means that either you don't
have the proper ACPI tables, or you don't have CONFIG_ACPI_BOOT set.
Your .config file says you _do_ have CONFIG_ACPI_BOOT set, which would
imply that your BIOS tables really _are_ UP-only, even for ACPI.
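The check described here can be done by eye, but as a rough sketch it
amounts to counting the "ACPI: LAPIC (" lines in the boot log. The
function name 'count_acpi_lapics' is hypothetical, and the search string
is taken from the sample output above; other kernel versions may word
these lines differently.

```c
#include <string.h>

/* Count boot-log lines that report a local APIC found via ACPI. */
static int count_acpi_lapics(const char *dmesg)
{
    const char *needle = "ACPI: LAPIC (";
    const char *p = dmesg;
    int n = 0;

    /* "ACPI: LAPIC_NMI (" does not match, because of the underscore */
    while ((p = strstr(p, needle)) != NULL) {
        n++;
        p += strlen(needle);
    }
    return n;
}
```

A result of 0 corresponds to the situation in this thread: no processor
entries were reported through ACPI.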
Did they actually sell you the thing as being HT-enabled? It doesn't look
like it is..
Linus
Hello,
On Thu, Nov 13, 2003 at 02:21:12PM -0800, Linus Torvalds wrote:
> ..., which would
> imply that your BIOS tables really _are_ UP-only, even for ACPI.
>
> Did they actually sell you the thing as being HT-enabled? It doesn't look
> like it is..
No, they did not even mention HT. Which, I guess now,
stands for "HT will not work". I just wanted to try it.
I think the best thing would be, to incorporate the patch to
prevent the crashes with "local APIC support on
uniprocessors" enabled and ignore the rest of the problem.
Thank you for your help,
Jochen
--
http://seehuhn.de/
On Thu, 13 Nov 2003, Jochen Voss wrote:
>
> I think the best thing would be, to incorporate the patch to
> prevent the crashes with "local APIC support on
> uniprocessors" enabled and ignore the rest of the problem.
Yup, I'm going to commit a minimal patch that just changes the panic calls
into printk's.
Thanks for the debugging,
Linus
What's happening is that the kernel did not find anything interesting in
ACPI (no local APIC and no I/O APIC), and decided to fall back to MPS.
The MPS detection code was not robust and called panic(). So we should
fix the MPS code that calls panic().
We can stop the fallback to MPS if ACPI is detected right, but probably
we should keep the priority at this point, assuming MPS is still more
correct; this is a bug with MPS parsing.
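The ordering described here can be written down as a tiny decision
function. This is only a sketch of the logic as stated in this thread,
with made-up names ('cpu_config_source', 'pick_cpu_config_source'); the
real kernel code is spread over several functions and is not reproduced.

```c
enum cpu_config_source { SOURCE_ACPI, SOURCE_MPS, SOURCE_NONE };

/*
 * acpi_has_apic_info: nonzero if ACPI described a local APIC / I/O APIC.
 * mps_table_valid:    nonzero if the MP configuration table passed its
 *                     signature and checksum tests.
 */
static enum cpu_config_source pick_cpu_config_source(int acpi_has_apic_info,
                                                     int mps_table_valid)
{
    if (acpi_has_apic_info)
        return SOURCE_ACPI;  /* MPS table is never looked at */
    if (mps_table_valid)
        return SOURCE_MPS;   /* fallback when ACPI has nothing useful */
    return SOURCE_NONE;      /* log an error and run uniprocessor */
}
```

On the laptop in question both inputs are 0, so the third branch is the
one that has to work without calling panic().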
Jun
Linus Torvalds <[email protected]> wrote:
>
> On Thu, 13 Nov 2003, Jochen Voss wrote:
>>
>> I think the best thing would be, to incorporate the patch to
>> prevent the crashes with "local APIC support on
>> uniprocessors" enabled and ignore the rest of the problem.
>
> Yup, I'm going to commit a minimal patch that just changes the panic calls
> into printk's.
That patch produces a message with no terminating newline on the
machine in question. This is because one of the four bytes that
you're printing out is NUL. The following patch avoids that problem.
Thanks,
--
Debian GNU/Linux 3.0 is out! ( http://www.debian.org/ )
Email: Herbert Xu 许志壬 <[email protected]>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
--
--- kernel-source-2.5/arch/i386/kernel/mpparse.c.orig 2003-11-14 19:40:49.000000000 +1100
+++ kernel-source-2.5/arch/i386/kernel/mpparse.c 2003-11-13 20:48:50.000000000 +1100
@@ -361,15 +361,12 @@
 	unsigned char *mpt=((unsigned char *)mpc)+count;
 
 	if (memcmp(mpc->mpc_signature,MPC_SIGNATURE,4)) {
-		printk("SMP mptable: bad signature [%c%c%c%c]!\n",
-			mpc->mpc_signature[0],
-			mpc->mpc_signature[1],
-			mpc->mpc_signature[2],
-			mpc->mpc_signature[3]);
+		printk(KERN_ERR "SMP mptable: bad signature [0x%x]!\n",
+			*(u32 *)mpc->mpc_signature);
 		return 0;
 	}
 	if (mpf_checksum((unsigned char *)mpc,mpc->mpc_length)) {
-		printk("SMP mptable: checksum error!\n");
+		printk(KERN_ERR "SMP mptable: checksum error!\n");
 		return 0;
 	}
 	if (mpc->mpc_spec!=0x01 && mpc->mpc_spec!=0x04) {
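The effect of a NUL byte in a "%c" conversion can be reproduced in
userspace. The following sketch uses snprintf as an analogy; it does not
model printk internals, and the function names are invented for this
example. The hex variant copies the four bytes with memcpy instead of
casting the pointer, which is an assumption made here to keep the
userspace code free of alignment problems.

```c
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Format the signature with %c; returns what snprintf reports as the
 * number of characters produced. */
static int format_sig_chars(char *out, size_t outlen,
                            const unsigned char sig[4])
{
    return snprintf(out, outlen,
                    "SMP mptable: bad signature [%c%c%c%c]!\n",
                    sig[0], sig[1], sig[2], sig[3]);
}

/* Format the signature as one hex word; the printed value depends on
 * host byte order unless all four bytes are equal. */
static int format_sig_hex(char *out, size_t outlen,
                          const unsigned char sig[4])
{
    uint32_t word;

    memcpy(&word, sig, sizeof(word));
    return snprintf(out, outlen,
                    "SMP mptable: bad signature [0x%x]!\n",
                    (unsigned int)word);
}
```

With an all-zero signature the first variant leaves a buffer that, read
as a C string, stops right after the '[' and never reaches the newline,
while the second one yields the complete line
"SMP mptable: bad signature [0x0]!".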
[email protected] (Jochen Voss) wrote on 13.11.03 in <[email protected]>:
> With SMP and ACPI enabled I get the following kernel
> boot messages
> but later on the following messages appear:
>
> No local APIC present or hardware disabled
> Local APIC not detected. Using dummy APIC emulation.
Hmmm ... are you sure you didn't confuse ACPI with APIC?
MfG Kai
Hello Kai,
On Sun, Nov 16, 2003 at 06:31:00PM +0200, Kai Henningsen wrote:
> [email protected] (Jochen Voss) wrote on 13.11.03 in <[email protected]>:
>
> > With SMP and ACPI enabled I get the following kernel
> > boot messages
>
> > but later on the following messages appear:
> >
> > No local APIC present or hardware disabled
>
> > Local APIC not detected. Using dummy APIC emulation.
>
> Hmmm ... are you sure you didn't confuse ACPI with APIC?
Yes, I am sure. I had ACPI enabled (on request by Linus)
and the messages are about the APIC.
As far as I understand this, the situation is as follows:
The original problem was related to the APIC and the
multiprocessor (here: hyper-threading) configuration. The
system tries to use ACPI to acquire information about the
multiprocessor configuration. If ACPI succeeded in doing
so, then my kernel would not try to read the mptable, and
the crash would not occur.
I hope this helps,
Jochen
--
http://seehuhn.de/