A UP kernel compiled with CONFIG_X86_LOCAL_APIC=y dies a very horrible
death on an SMP Athlon motherboard (Tyan S2462 and S2468), flooding the
console with the following messages:
...
masked ExtINT on CPU#0
ESR value before enabling vector: 00000008
ESR value afteC error on CPU8(08)
<6>APIC <6>APIC error PU0: 08(08)
<6C error on CP08(08)
<6>APICor on CPU0: 08
<6>APIC erro CPU0: 08(08)
PIC error on0: 08(08)
<6>Aerror on CPU0:08)
<6>APIC e on CPU0: 08(06>APIC error oU0: 08(08)
<6C error on CPU8(08)
...
[this is a serial console, and the errors are produced faster than the
console can print them]
The kernel is 2.4.21-pre4, but any recent 2.4/2.5 kernels produce the same
results: 2.4.20, 2.5.62, and Red Hat's 2.4.18-{18,24}. It's not a
machine-specific problem, because it shows up on at least 5 different
machines (all of them dual Athlon, using Tyan MB's). The error message is
always "APIC error on CPU0: 08(08)".
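For reference, here is a rough sketch of how the value in that message can be decoded. It assumes the hex numbers are readings of the local APIC Error Status Register (ESR) and that the K7 follows the P6-family bit layout described in Intel's manual. The helper names are made up for illustration and are not kernel functions.

```c
#include <assert.h>
#include <stdint.h>

/* P6-family ESR bit names, lowest bit first (bit 4 is reserved). */
static const char *const esr_bit_names[8] = {
    "send checksum error",       /* bit 0 */
    "receive checksum error",    /* bit 1 */
    "send accept error",         /* bit 2 */
    "receive accept error",      /* bit 3 */
    "reserved",                  /* bit 4 */
    "send illegal vector",       /* bit 5 */
    "received illegal vector",   /* bit 6 */
    "illegal register address",  /* bit 7 */
};

/* Hypothetical helper: index of the lowest set ESR bit, or -1 if none. */
static int esr_first_error_bit(uint8_t esr)
{
    for (int bit = 0; bit < 8; bit++)
        if (esr & (1u << bit))
            return bit;
    return -1;
}

/* Hypothetical helper: human-readable name for one ESR bit. */
static const char *esr_bit_name(int bit)
{
    assert(bit >= 0 && bit < 8);
    return esr_bit_names[bit];
}
```

Under these assumptions, 0x08 has only bit 3 set, which the P6 layout calls a receive accept error.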
A bit of binary searching between kernel versions shows that the problem
was introduced in 2.4.10-pre12.
The IO-APIC option (CONFIG_X86_UP_IOAPIC) does not matter, only the local
APIC option does. The kernel is compiled for an Athlon (CONFIG_MK7=y and
everything that implies).
Ideas?
Thanks,
Ion
--
It is better to keep your mouth shut and be thought a fool,
than to open it and remove all doubt.
Ion Badulescu wrote:
> A UP kernel compiled with CONFIG_X86_LOCAL_APIC=y dies a very horrible
> death on an SMP Athlon motherboard (Tyan S2462 and S2468), flooding the
> console with the following messages:
IMO just assume this option is just broken, unless you absolutely need it.
Red Hat ships UP kernels with this option disabled, because either the
code, the BIOS, or both are typically broken.
Jeff
On Thu, 20 Feb 2003, Jeff Garzik wrote:
> Ion Badulescu wrote:
> > A UP kernel compiled with CONFIG_X86_LOCAL_APIC=y dies a very horrible
> > death on an SMP Athlon motherboard (Tyan S2462 and S2468), flooding the
> > console with the following messages:
>
> IMO just assume this option is just broken, unless you absolutely need it.
My only boxes on which this is a problem are the SMP athlons, and only
with UP kernels...
> Red Hat ships UP kernels with this option disabled, because either the
> code, the BIOS, or both are typically broken.
Only recently, though, and probably because they started receiving
complaints that the BOOT kernel (most importantly) and the UP kernel were
not booting up correctly on SMP athlons. At least that's the impression I
got browsing bugzilla.redhat.com.
Moreover, it makes a measurable difference in interrupt latency (and
consequently in the number of UDP packets dropped under stress), so on my
production machines I run RH kernels with this option re-enabled (among
other changes).
Anyway, I'd like to get to the bottom of this, since I've narrowed it down
so much. Anyone know who submitted the APIC changes in 2.4.10-pre12? I'd
debug it myself, but I know next to nothing about the APIC. If you know
where to get some documentation, I'm more than willing to give it a shot.
Thanks,
Ion
Ion Badulescu writes:
> On Thu, 20 Feb 2003, Jeff Garzik wrote:
>
> > Ion Badulescu wrote:
> > > A UP kernel compiled with CONFIG_X86_LOCAL_APIC=y dies a very horrible
> > > death on an SMP Athlon motherboard (Tyan S2462 and S2468), flooding the
> > > console with the following messages:
> >
> > IMO just assume this option is just broken, unless you absolutely need it.
>
> My only boxes on which this is a problem are the SMP athlons, and only
> with UP kernels...
Chipset?
Is the second CPU installed or not?
If the second CPU is installed, has it been disabled in BIOS?
Relevant config? What combinations of UP_APIC and UP_IOAPIC have
you been using? Has ACPI been enabled or not?
A plain kernel with UP_APIC but no SMP or UP_IOAPIC shouldn't
provoke the kinds of APIC errors you mentioned, unless the APIC
bus is noisy due to a missing second CPU (just a theory).
> Anyway, I'd like to get to the bottom of this, since I've narrowed it down
> so much. Anyone know who submitted the APIC changes in 2.4.10-pre12?
Ingo Molnar, Maciej W. Rozycki, and myself.
> debug it myself, but I know next to nothing about the APIC. If you know
> where to get some documentation, I'm more than willing to give it a shot.
Intel's IA32 manual set, Volume 3, is required reading.
/Mikael
Hi Mikael,
On Fri, 21 Feb 2003, Mikael Pettersson wrote:
> > My only boxes on which this is a problem are the SMP athlons, and only
> > with UP kernels...
>
> Chipset?
AMD 760MP and 760MPX, both have this problem.
> Is the second CPU installed or not?
Installed.
> If the second CPU is installed, has it been disabled in BIOS?
It hasn't been disabled (the BIOS doesn't have that option).
> Relevant config? What combinations of UP_APIC and UP_IOAPIC have
> you been using?
CONFIG_MK7=y
CONFIG_NOHIGHMEM=y
CONFIG_MTRR=y
CONFIG_X86_UP_IOAPIC=y
CONFIG_X86_IO_APIC=y
CONFIG_X86_LOCAL_APIC=y
but CONFIG_X86_IO_APIC can be turned off and the problem still persists.
> Has ACPI been enabled or not?
The problem is present both with and without ACPI.
> A plain kernel with UP_APIC but no SMP or UP_IOAPIC shouldn't
> provoke the kinds of APIC errors you mentioned, unless the APIC
> bus is noisy due to a missing second CPU (just a theory).
Well, the second CPU is there, and there are no problems with the APIC and
the IO-APIC if the kernel is compiled with CONFIG_SMP=y. Only the UP case
causes the problem. So I don't think the bus itself is noisy, unless the
noises are produced by the second CPU somehow, and we can't do anything
about it because we're not touching that second CPU.
> > Anyway, I'd like to get to the bottom of this, since I've narrowed it down
> > so much. Anyone know who submitted the APIC changes in 2.4.10-pre12?
>
> Ingo Molnar, Maciej W. Rozycki, and myself.
Thanks for emailing back. :)
Yeah, I noticed your name in most of the relevant changes between
2.4.10-pre11 and pre12, so I was going to email you directly after
narrowing it down some more. Right now I'm trying to isolate the smallest
portion of the pre11-pre12 patch that triggers the problem.
But if you have any ideas or patches to try, please do let me know...
> Intel's IA32 manual set, Volume 3, is required reading.
Thanks, I'll try to get it.
I know that AMD's K7 APIC is supposed to be compatible with the Intel P6
APIC, but do you think there might be some incompatibility between them
that causes this? Or perhaps some undefined behavior we rely on, and which
is different between Intel and AMD?...
Anyway, I'll keep on digging.
Thanks,
Ion
Ion Badulescu writes:
> AMD 760MP and 760MPX, both have this problem.
Ok, AMD's chipsets are reasonable.
> > Is the second CPU installed or not?
>
> Installed.
>
> > If the second CPU is installed, has it been disabled in BIOS?
>
> It hasn't been disabled (the BIOS doesn't have that option).
That kills the noisy-bus theory.
> Well, the second CPU is there, and there are no problems with the APIC and
> the IO-APIC if the kernel is compiled with CONFIG_SMP=y. Only the UP case
> causes the problem. So I don't think the bus itself is noisy, unless the
> noises are produced by the second CPU somehow, and we can't do anything
> about it because we're not touching that second CPU.
A UP_APIC kernel without IOAPIC shouldn't generate any APIC bus messages.
Have you checked if the BIOS has an option for choosing "PIC" or "APIC"
interrupt delivery? Try setting it to PIC mode.
> I know that AMD's K7 APIC is supposed to be compatible with the Intel P6
> APIC, but do you think there might be some incompatibility between them
> that causes this? Or perhaps some undefined behavior we rely on, and which
> is different between Intel and AMD?...
None that I know of, to both questions.
All problems I've seen have been caused by non-Intel chipsets.
/Mikael
On Fri, 21 Feb 2003, Ion Badulescu wrote:
> Anyway, I'll keep on digging.
And this is what I found: eliminating two lines from
APIC_init_uniprocessor() makes the problem go away.
diff -urNX diff_kernel_excludes linux-2.4.10-pre12/arch/i386/kernel/apic.c linux-2.4.10-pre11++/arch/i386/kernel/apic.c
--- linux-2.4.10-pre12/arch/i386/kernel/apic.c	Wed Feb 19 23:53:15 2003
+++ linux-2.4.10-pre11++/arch/i386/kernel/apic.c	Fri Feb 21 15:37:06 2003
@@ -1087,9 +1087,6 @@
 
 	connect_bsp_APIC();
 
-	phys_cpu_present_map = 1;
-	apic_write_around(APIC_ID, boot_cpu_id);
-
 	apic_pm_init2();
 
 	setup_local_APIC();
[patch against 2.4.10-pre12, but 2.4.21-pre4 is reasonably similar]
Ion
On Friday 21 February 2003 21:42, Ion Badulescu wrote:
Hi Ion,
> And this is what I found: eliminating two lines from
> APIC_init_uniprocessor() makes the problem go away.
> diff -urNX diff_kernel_excludes linux-2.4.10-pre12/arch/i386/kernel/apic.c linux-2.4.10-pre11++/arch/i386/kernel/apic.c
> --- linux-2.4.10-pre12/arch/i386/kernel/apic.c Wed Feb 19 23:53:15 2003
> +++ linux-2.4.10-pre11++/arch/i386/kernel/apic.c Fri Feb 21 15:37:06 2003
> @@ -1087,9 +1087,6 @@
>
> connect_bsp_APIC();
>
> - phys_cpu_present_map = 1;
> - apic_write_around(APIC_ID, boot_cpu_id);
> -
> apic_pm_init2();
>
> setup_local_APIC();
>
> [patch against 2.4.10-pre12, but 2.4.21-pre4 is reasonably similar]
Don't do this. I am pretty sure it will break all Intels. I still cannot
understand why this fixes your AMD Athlon problem.
ciao, Marc
On Fri, 21 Feb 2003, Marc-Christian Petersen wrote:
> Don't do this. I am pretty sure it will break all Intels. I still cannot
> understand why this fixes your AMD Athlon problem.
Oh, I don't doubt it -- it was just the result of my process of
elimination, trying to find the change that broke it in 2.4.10-pre12.
Somebody who understands the APIC stuff better than I do will have to draw
some conclusions from this little experiment...
Thanks,
Ion
On Fri, 21 Feb 2003 22:41:23 +0100, Marc-Christian Petersen wrote:
>> And this is what I found: eliminating two lines from
>> APIC_init_uniprocessor() makes the problem go away.
>> diff -urNX diff_kernel_excludes linux-2.4.10-pre12/arch/i386/kernel/apic.c linux-2.4.10-pre11++/arch/i386/kernel/apic.c
>> --- linux-2.4.10-pre12/arch/i386/kernel/apic.c Wed Feb 19 23:53:15 2003
>> +++ linux-2.4.10-pre11++/arch/i386/kernel/apic.c Fri Feb 21 15:37:06 2003
>> @@ -1087,9 +1087,6 @@
>>
>> connect_bsp_APIC();
>>
>> - phys_cpu_present_map = 1;
>> - apic_write_around(APIC_ID, boot_cpu_id);
>> -
>> apic_pm_init2();
>>
>> setup_local_APIC();
>>
>> [patch against 2.4.10-pre12, but 2.4.21-pre4 is reasonably similar]
>Don't do this. I am pretty sure it will break all Intels. I still cannot
>understand why this fixes your AMD Athlon problem.
This is interesting. Very interesting, even. A plain UP_APIC kernel
(with IO_APIC not enabled or not detected) shouldn't need to touch
APIC_ID at all. I strongly suspect this is a remnant of apic.c's old
SMP-only history, and that it should be removed for UP_APIC-only.
I need to get some downtime (zzz...) but I'll look into this ASAP.
/Mikael
On Sat, 22 Feb 2003, Mikael Pettersson wrote:
> This is interesting. Very interesting, even. A plain UP_APIC kernel
> (with IO_APIC not enabled or not detected) shouldn't need to touch
> APIC_ID at all. I strongly suspect this is a remnant of apic.c's old
> SMP-only history, and that it should be removed for UP_APIC-only.
>
> I need to get some downtime (zzz...) but I'll look into this ASAP.
More testing on more platforms actually led me to a slightly different
patch, which makes a lot more sense as far as phys_cpu_present_map is
concerned:
--- linux-2.4.21-pre4/arch/i386/kernel/apic.c.old	Fri Jan 31 10:32:12 2003
+++ linux-2.4.21-pre4/arch/i386/kernel/apic.c	Sat Feb 22 02:47:02 2003
@@ -1169,8 +1169,8 @@
 
 	connect_bsp_APIC();
 
-	phys_cpu_present_map = 1;
-	apic_write_around(APIC_ID, boot_cpu_physical_apicid);
+	phys_cpu_present_map = (1 << boot_cpu_physical_apicid);
+	printk("Setting %d in the phys_cpu_present_map\n", boot_cpu_physical_apicid);
 
 	apic_pm_init2();
 
This has passed my testing on the following platforms:
* P3 (I820 chipset, no IO-APIC, APIC originally disabled by BIOS)
* P3 (440BX chipset, no IO-APIC, APIC originally disabled by BIOS)
* single P3 (ServerWorks OSB4 chipset, one CPU in dual CPU MB)
* dual P3 (ServerWorks OSB4 chipset, both CPU's present)
* dual P4 Xeon (I7500 chipset, both CPU's present, HT enabled)
* K7 (VIA KT400 chipset, IO-APIC present)
* K7 (VIA KM133 chipset, IO-APIC present)
* dual K7 (AMD 760MP chipset, both CPU's present)
* dual K7 (AMD 760MPX chipset, both CPU's present)
As a matter of fact, I got very interesting numbers from that printk() I
added:
- all the Intel and single proc AMD printed "0".
- all the dual AMD machines printed "1".
So the reason it was crashing on the dual Athlons is two-fold:
- It would try to write 1 into APIC_ID, when instead it should have
written (1 << 24). So it was effectively setting the APIC ID to 0.
- It would unconditionally set bit 0 in the phys_cpu_present_map bitmap,
but later on it would check bit number boot_cpu_physical_apicid and BUG()
if it found it clear.
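The second point can be illustrated with a small user-space model. This is only a sketch: the function names are invented for the example, the map is assumed to be a 32-bit bitmap indexed by physical APIC ID, and the real kernel check is represented by a plain boolean test rather than BUG().

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Old behaviour: always mark bit 0, whatever the boot CPU's APIC ID is. */
static uint32_t present_map_old(unsigned int boot_apicid)
{
    (void)boot_apicid;          /* deliberately ignored, as in the old code */
    return 1u;
}

/* Patched behaviour: mark the bit that matches the boot CPU's APIC ID. */
static uint32_t present_map_new(unsigned int boot_apicid)
{
    assert(boot_apicid < 32);   /* shift must stay inside the 32-bit map */
    return 1u << boot_apicid;
}

/* Stand-in for the later sanity check on the boot CPU's bit. */
static bool boot_cpu_is_present(uint32_t map, unsigned int boot_apicid)
{
    assert(boot_apicid < 32);
    return (map >> boot_apicid) & 1u;
}
```

With a boot APIC ID of 0 both variants pass the check. With a boot APIC ID of 1, as printed on the dual Athlons, only the patched variant does.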
So I think the patch above is safe. We could maybe replace the
old apic_write_around() with something like
apic_write_around(APIC_ID, (boot_cpu_physical_apicid << 24))
but it's probably unnecessary, as you said.
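To make the shift concrete, here is a minimal sketch of the register encoding. It assumes the P6-style layout, with a 4-bit ID field in bits 24-27 of the APIC_ID register. The macro and function names are hypothetical and only model the arithmetic. They do not touch any hardware.

```c
#include <assert.h>
#include <stdint.h>

#define APIC_ID_SHIFT 24        /* assumed position of the ID field */
#define APIC_ID_MASK  0x0Fu     /* assumed 4-bit ID, as on P6-family parts */

/* Build the register value that encodes a given physical APIC ID. */
static uint32_t apic_id_to_reg(unsigned int id)
{
    assert(id <= APIC_ID_MASK);
    return (uint32_t)id << APIC_ID_SHIFT;
}

/* Extract the physical APIC ID from a register value. */
static unsigned int reg_to_apic_id(uint32_t reg)
{
    return (reg >> APIC_ID_SHIFT) & APIC_ID_MASK;
}
```

Under this layout, writing the raw value 1 leaves the ID field at 0, while apic_id_to_reg(1) yields 0x01000000 and reads back as ID 1.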
Ok. Enough for today. Zzzz catching time for me too...
Thanks,
Ion
Hi !
Good catch, Ion !
On Sat, Feb 22, 2003 at 03:05:37AM -0500, Ion Badulescu wrote:
> As a matter of fact, I got very interesting numbers from that printk() I
> added:
>
> - all the Intel and single proc AMD printed "0".
> - all the dual AMD machines printed "1".
Same here, dual AMD/760MPX.
BTW, there's something I don't understand. The only reference to
APIC_init_uniprocessor() I found was in smpboot.c:1044. It's called when the
SMP config has not been found at boot time (and it also sets
phys_cpu_present_map to 1, BTW). My problem is that this function is executed
on my dual-k7, on an SMP kernel (because I see the added message), but I
don't see the "SMP motherboard not detected" message which should be displayed
just before APIC_init_uniprocessor().
So I suspect there's something strange in this code that might explain why
only CPU0 receives the interrupts, but I don't understand the code path !
I'd appreciate it if someone has a clue... I can provide .config and dmesg
if needed. In the meantime, I'll add printk's everywhere in the kernel.
Cheers,
Willy
On Sat, Feb 22, 2003 at 10:06:04AM +0100, Willy Tarreau wrote:
> BTW, there's something I don't understand. The only reference to
> APIC_init_uniprocessor() I found was in smpboot.c:1044. It's called when the
> SMP config has not been found at boot time (and it also sets
> phys_cpu_present_map to 1, BTW). My problem is that this function is executed
> on my dual-k7, on an SMP kernel (because I see the added message), but I
> don't see the "SMP motherboard not detected" message which should be displayed
> just before APIC_init_uniprocessor().
Oops ! Sorry for the noise, I confused the message about phys_id_present_map
with the one I added to APIC_init_uniprocessor(). So I confirm that this
function is NOT executed on my dual-k7.
Cheers,
Willy