2002-01-21 10:42:26

by Peter Monta

[permalink] [raw]
Subject: APIC errors, Asus A7M266-D (760MPX chipset)

I'm getting a storm of APIC errors with an Asus A7M266-D motherboard
at the beginning of boot (and no further boot progress is made).
The console message is "APIC error on CPU0: 04(04)". A single
Athlon MP 1200 is installed; the kernel is compiled with SMP=n
but both "local APIC" and "use IO-APIC" are enabled. I'm using
2.5.2pre10.

Oddly, booting this same kernel with the "noapic" option results in
the same problem, but recompiling with all APIC options disabled
gives a successful boot.

While running in this legacy interrupt mode there seem to be a lot
of ERR: interrupts in /proc/interrupts, in fact about five times as
many as "real" ones listed above. Does this affect performance? How
much is interrupt latency increased when the APIC bus must reissue
commands like this (if I understand the documentation correctly)?

Cheers,
Peter Monta


2002-01-21 16:37:42

by James Cleverdon

[permalink] [raw]
Subject: Re: APIC errors, Asus A7M266-D (760MPX chipset)

On Monday 21 January 2002 02:42 am, Peter Monta wrote:
> I'm getting a storm of APIC errors with an Asus A7M266-D motherboard
> at the beginning of boot (and no further boot progress is made).
> The console message is "APIC error on CPU0: 04(04)". A single
> Athlon MP 1200 is installed; the kernel is compiled with SMP=n
> but both "local APIC" and "use IO-APIC" are enabled. I'm using
> 2.5.2pre10.

ESR == 04 means Send Accept Error. The P6 manual says:

Set when the local APIC detects that a message it sent was not
accepted by any APIC on the bus.

Now, I wonder why it was sending anything at all. Maybe that CPU was the one
that does the PIC emulation and has one of the LVT{0,1} entries set to ExtInt.

BTW, be aware that Intel has changed the way the ESR works. The P6 manual
says that software must do two back-to-back writes to clear ESR bits. The
most recent P4s say that a read clears bits and writes are ignored, save for
the requirement of a write before the read to get the ESR up to date. AFAIK,
Intel's Linux gang hasn't posted any patches to fix smp_error_interrupt() for
P4s. I've been meaning to do that myself....

> Oddly, booting this same kernel with the "noapic" option results in
> the same problem, but recompiling with all APIC options disabled
> gives a successful boot.

Sounds like the APIC was up to something anyway....

> While running in this legacy interrupt mode there seem to be a lot
> of ERR: interrupts in /proc/interrupts, in fact about five times as
> many as "real" ones listed above. Does this affect performance? How
> much is interrupt latency increased when the APIC bus must reissue
> commands like this (if I understand the documentation correctly)?

As you would expect performance takes a nosedive from the printks, if nothing
else, and keeping the APIC bus busy with retransmits increases latency. They
use a round robin fairness algorithm so no interrupt source gets starved, but
because the APIC bus is a serial bus (relatively slow) waiting for a turn can
take a while. How long? Beats me. The HW guys are no longer leasing an
APIC bus analyzer pod, so your guess is as good as mine.

> Cheers,
> Peter Monta

--
James Cleverdon, IBM xSeries Platform (NUMA), Beaverton
[email protected] | [email protected]

2002-01-21 17:01:54

by Maciej W. Rozycki

[permalink] [raw]
Subject: Re: APIC errors, Asus A7M266-D (760MPX chipset)

On Mon, 21 Jan 2002, James Cleverdon wrote:

> BTW, be aware that Intel has changed the way the ESR works. The P6 manual
> says that software must do two back-to-back writes to clear ESR bits. The
> most recent P4s say that a read clears bits and writes are ignored, save for
> the requirement of a write before the read to get the ESR up to date. AFAIK,
> Intel's Linux gang hasn't posted any patches to fix smp_error_interrupt() for
> P4s. I've been meaning to do that myself....

Have you verified this empirically? AFAIK, the intended way for the ESR
to work is to update upon write and clear upon read. Now there are
various errata within specific implementations, but our code seems immune
to them (except from the Pentium case we handle explicitly), i.e. the
ESR's value as fetched by the handler is correct. The value outside the
handler may be meaningless, but who cares?

> > Oddly, booting this same kernel with the "noapic" option results in
> > the same problem, but recompiling with all APIC options disabled
> > gives a successful boot.
>
> Sounds like the APIC was up to something anyway....

The "noapic" option only affects I/O APICs. Local APICs get programmed
anyway.

--
+ Maciej W. Rozycki, Technical University of Gdansk, Poland +
+--------------------------------------------------------------+
+ e-mail: [email protected], PGP key available +