2001-04-05 22:51:46

by Kurt Garloff

Subject: APIC errors ...

Hi,

Having recently upgraded my dual-BX motherboard to two PIII-850 CPUs, I have
run into some trouble.
At first I had an asymmetric configuration (iPIII-850 + iPII-350), which
Linux did not support; I created a fix and sent it to LKML. It worked
perfectly, i.e. without the problems described below.

Now I have two iPIII-850s, but I run into a different kind of trouble:
(a) The BIOS sometimes does not recognize the second CPU.
(b) Linux reports APIC errors and occasionally stops processing IRQs on the
second CPU, or crashes (2.4.x kernel).

Some details: DFI P2XBL/D, i440BX, BIOS Award mid 2000 (MPS 1.4), microcode
patches end 2000 patched into BIOS (which yields the rev. 08 for my pIII
(868)). The board is unable to supply the needed 1.7V for the CPUs,
therefore the Slot Adapter (from PowerLeap) contains voltage regulators and
VID is faked to 2.2V. The mainboard by specs supports up to 800MHz (max
multiplier 8 with FSB 100MHz).

The config should be fine; the multipliers are fixed anyway nowadays. However:
(a) If I explicitly specify 100, 103 or 112 MHz FSB freq., the second CPU is
not recognized by the BIOS (and subsequently not by Linux) most of the
time. If set to automatic (which yields 100MHz), the BIOS always recognizes
the 2nd CPU. Strange! With 83, 75, or 66 MHz FSB, the 2nd CPU is
recognized as well.
(b) The 2.2.16 kernel seems to be happy (I did not run it long enough to
really check stability), but the 2.4.x kernels report lots of APIC errors.
"Lots" is something between 1/minute (almost idle computer) and more than
1/second (gears Mesa demo running). After some time, the 2nd CPU
eventually does not get IRQs any more; I've even seen some lockups (after a
day or so) of Linux, which I'm not used to :-(
Going back to 83/75/66 MHz FSB seems to also solve this problem, but
I do not consider that a solution.
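
To put numbers on the FSB settings listed above: the core clock is the FSB
frequency times the (locked) multiplier. A minimal sketch, assuming a
multiplier of 8.5; that value is not stated in this mail, it is inferred from
850 MHz at the 100 MHz FSB that the automatic setting yields. The function
name is made up for illustration.

```python
# Core clock = FSB * multiplier.
# ASSUMPTION: multiplier 8.5, inferred from 850 MHz at 100 MHz FSB.
MULTIPLIER = 8.5


def core_clock_mhz(fsb_mhz, multiplier=MULTIPLIER):
    """Return the CPU core clock in MHz for a given FSB setting."""
    return fsb_mhz * multiplier


if __name__ == "__main__":
    # FSB settings mentioned in the mail (nominal values)
    for fsb in (66, 75, 83, 100, 103, 112):
        print(f"FSB {fsb:3d} MHz -> core {core_clock_mhz(fsb):6.1f} MHz")
```

Under that assumption, the 75 MHz setting gives 637.5 MHz, a quarter below
the nominal 850 MHz, which is why the lower FSB settings are only a
workaround.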

Here's some excerpt: (dmesg)
APIC error on CPU1: 02(02)
APIC error on CPU0: 01(01)
APIC error on CPU1: 02(02)
APIC error on CPU0: 01(05)
APIC error on CPU1: 02(02)
unexpected IRQ trap at vector d0
unexpected IRQ trap at vector 88
APIC error on CPU1: 02(02)
APIC error on CPU0: 05(01)
APIC error on CPU1: 02(02)
APIC error on CPU0: 01(01)
APIC error on CPU1: 02(02)
APIC error on CPU0: 01(01)
APIC error on CPU0: 01(01)
APIC error on CPU1: 02(02)
APIC error on CPU0: 01(01)

pckurt:~ # cat /proc/interrupts
           CPU0       CPU1
  0:    5180522    2357505   IO-APIC-edge  timer
  1:      24284      15803   IO-APIC-edge  keyboard
  2:          0          0         XT-PIC  cascade
  3:          2          0   IO-APIC-edge
  4:          0          2   IO-APIC-edge  serial
  5:      35031      27240   IO-APIC-edge  snd-card-als100 - DSP
  6:          1          2   IO-APIC-edge
  7:          2          0   IO-APIC-edge  parport0
  8:          0          1   IO-APIC-edge  rtc
 10:          1          0   IO-APIC-edge  snd-card-als100 - MPU-401
 12:       5124       5959   IO-APIC-edge  PS/2 Mouse
 14:      18953      18258   IO-APIC-edge  ide0
 17:      21728      20208  IO-APIC-level  eth0
 18:      23418      22327  IO-APIC-level  sym53c8xx
 19:       9553       9442  IO-APIC-level  aic7xxx, bttv
 28:          0         13           none
136:          0         35           none
140:          0          3           none
152:          0          1           none
156:          0          2           none
160:          0          2           none
172:          0         14           none
200:          0          1           none
204:          0          2           none
208:          0         13           none
NMI:          0          0
LOC:    7538766    7538742
ERR:        777

(Note that I patched the IRQ reporting stuff, so you can get a count for
bogus IRQ vectors.) The AGP slot (MGA400) is mapped to IRQ16. (Not visible
above.)
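
The listing above can be summarized with a short script. This is only a
sketch: the helper names are made up, the sample is abridged to the timer
line, the "none" rows and the ERR line, and two things are assumptions rather
than facts from this mail: that the timer ticks at HZ=100 (the usual i386 rate
for 2.4 kernels) and that the ERR counter reflects the errors being logged.

```python
# Rows copied from the /proc/interrupts listing above (abridged to the
# timer line, every row of type "none", and the ERR counter).
SAMPLE = """\
0: 5180522 2357505 IO-APIC-edge timer
28: 0 13 none
136: 0 35 none
140: 0 3 none
152: 0 1 none
156: 0 2 none
160: 0 2 none
172: 0 14 none
200: 0 1 none
204: 0 2 none
208: 0 13 none
ERR: 777
"""


def parse_rows(text):
    """Yield (label, counts, remaining_fields) for each 'label: n n ...' row."""
    for line in text.splitlines():
        fields = line.split()
        if not fields or not fields[0].endswith(":"):
            continue
        label = fields[0][:-1]
        counts = []
        for field in fields[1:]:
            if not field.isdigit():
                break  # counts end where the type/device text begins
            counts.append(int(field))
        yield label, counts, fields[1 + len(counts):]


def bogus_vectors(text):
    """Map row number -> per-CPU counts for rows of type 'none'."""
    return {int(label): counts
            for label, counts, rest in parse_rows(text)
            if rest == ["none"]}


def seconds_per_error(text, hz=100):
    """Mean seconds between errors. ASSUMPTION: timer (IRQ 0) runs at hz."""
    rows = {label: counts for label, counts, _ in parse_rows(text)}
    uptime_s = sum(rows["0"]) / hz
    return uptime_s / rows["ERR"][0]


if __name__ == "__main__":
    bogus = bogus_vectors(SAMPLE)
    print("bogus interrupts per CPU:", [sum(c) for c in zip(*bogus.values())])
    # dmesg prints vectors in hex; the table rows are decimal.
    for hexvec in ("d0", "88"):
        print(hexvec, "->", int(hexvec, 16), int(hexvec, 16) in bogus)
    print(f"about {seconds_per_error(SAMPLE):.0f} s per error")
```

On this sample, all 86 bogus interrupts were counted on CPU1 and none on
CPU0. The vectors d0 and 88 from the dmesg excerpt are 208 and 136 in decimal,
which both appear as "none" rows, so those rows appear to be indexed by vector
number. Under the HZ=100 assumption the ERR counter works out to roughly one
error every 97 seconds averaged over the uptime.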

As you can see, the APIC on CPU1 seems to suffer from noise!
It gets APIC errors (which it acknowledges, causing CPU0 to also get an
error) and occasionally receives bogus IRQ vectors.
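
The hex pairs in the dmesg excerpt can be decoded bit by bit. The sketch
below assumes that both numbers in "xx(yy)" are readings of the local APIC
Error Status Register (the first read, and a second read after the register
has been written); that is my understanding of the 2.4 handler, not something
stated in this mail. The bit names follow Intel's description of the ESR, and
the function names are made up for illustration.

```python
import re

# Local APIC Error Status Register bits (names per Intel's documentation).
ESR_BITS = {
    0x01: "send checksum error",
    0x02: "receive checksum error",
    0x04: "send accept error",
    0x08: "receive accept error",
    0x20: "send illegal vector",
    0x40: "receive illegal vector",
    0x80: "illegal register address",
}

# ASSUMPTION: message format "APIC error on CPU<n>: <esr>(<esr>)", hex values.
LINE_RE = re.compile(r"APIC error on CPU(\d+): ([0-9a-f]{2})\(([0-9a-f]{2})\)")


def decode_esr(value):
    """Return the names of all error bits set in an ESR value."""
    return [name for bit, name in ESR_BITS.items() if value & bit]


def decode_line(line):
    """Return (cpu, first_reading, second_reading), or None if no match."""
    match = LINE_RE.search(line)
    if match is None:
        return None
    cpu = int(match.group(1))
    first = decode_esr(int(match.group(2), 16))
    second = decode_esr(int(match.group(3), 16))
    return cpu, first, second


if __name__ == "__main__":
    for line in ("APIC error on CPU1: 02(02)",
                 "APIC error on CPU0: 01(01)",
                 "APIC error on CPU0: 01(05)"):
        print(line, "->", decode_line(line))
```

Read this way, CPU1's 02 is a receive checksum error and CPU0's 01 is a send
checksum error; the occasional 05 on CPU0 adds a send accept error. That is
consistent with messages being corrupted on their way to CPU1, with CPU0
noticing the failed delivery.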

So this looks like a HW problem. Some reports on LKML seem to indicate that
this is indeed the case.

Somebody mentioned that the voltage regulators may not deliver a really
stable voltage (without load?), which would cause the noise. A resistor with
a capacitor should help then ... However, sensors reports 2.20V without any
flakiness. Are any details on this known?
It could also be that the MoBo chipset (IO-APIC?) has problems recognizing
the signals from 1.7V CPUs, expecting at least 1.8 (or 2.2) V. Maybe faking
the VID to 2.0V instead of 2.2V would be useful then.

I would be thankful for any knowledge on this issue!

(As this is slightly off-topic, you may reply via PM. If I happen to solve
my problems, I'll post a summary to LKML.)

Regards,
--
Kurt Garloff <[email protected]> [Eindhoven, NL]
Physics: Plasma simulations <[email protected]> [TU Eindhoven, NL]
Linux: SCSI, Security <[email protected]> [SuSE Nuernberg, FRG]
(See mail header or public key servers for PGP2 and GPG public keys.)


2001-04-18 23:37:22

by Bruce Harada

Subject: Re: APIC errors ...

On Wed, 18 Apr 2001 15:21:17 -0700 (PDT)
"Dr. Kelsey Hudson" <[email protected]> wrote:

*snip*

> You have a couple of solutions: Upgrade the motherboard to one of the VIA
> 133MHz chipsets (I don't care for the VIA chipset so this really doesn't
> strike my fancy) or upgrade to that other Intel chipset that supports SMP;
> unfortunately it also is a Rambus board.... ServerWorks also has a chipset
> out that does dual Intel chips at 133MHz; I've heard only good things
> about it.

Er... I believe there was some discussion on l-k some while ago regarding a
certain lack of forthcomingness by Serverworks and the resultant general
flakiness of Linux support for their chipsets...

2001-04-19 02:27:32

by Rico Tudor

Subject: Re: APIC errors ...

Another problem area is ECC monitoring. I'm still waiting for info from
ServerWorks, and so is Dan Hollis. Alexander Stohr has even submitted
code to Jim Foster for approval, without evident effect. I have 18GB of
RAM divided among five ServerWorks boxes, so the matter is not academic.

2001-04-18 22:22:02

by Dr. Kelsey Hudson

Subject: Re: APIC errors ...

Um... Looks like when you clock the BX-chipset out of spec (>100MHz FSB)
you get the error. Since BX wasn't ever designed to be run at >100MHz
these errors are *expected*.

You have a couple of solutions: Upgrade the motherboard to one of the VIA
133MHz chipsets (I don't care for the VIA chipset so this really doesn't
strike my fancy) or upgrade to that other Intel chipset that supports SMP;
unfortunately it also is a Rambus board.... ServerWorks also has a chipset
out that does dual Intel chips at 133MHz; I've heard only good things
about it.

but, from what it looks like, your board is flaky up high...

good luck,
-kelsey


--
Kelsey Hudson [email protected]
Software Engineer
Compendium Technologies, Inc (619) 725-0771
---------------------------------------------------------------------------

2001-04-19 01:23:00

by Alan

Subject: Re: APIC errors ...

> Er... I believe there was some discussion on l-k some while ago regarding a
> certain lack of forthcomingness by Serverworks and the resultant general
> flakiness of Linux support for their chipsets...

Serverworks stuff is pretty well supported now - they've been working to make
some stuff available. Having said that, their AGP isn't supported (and is
reportedly pretty poor in Windows) and they seem to lack UDMA100 IDE right now.

Of course server customers think IDE is an attachment for cdroms and zip drives..

2001-04-19 02:06:27

by Kurt Garloff

Subject: Re: APIC errors ...

On Wed, Apr 18, 2001 at 03:21:17PM -0700, Dr. Kelsey Hudson wrote:
> Um... Looks like when you clock the BX-chipset out of spec (>100MHz FSB)
> you get the error. Since BX wasn't ever designed to be run at >100MHz
> these errors are *expected*.

No, the APIC errors also occur at exactly 100MHz.
Unfortunately, my MoBo does not offer 95MHz, so I'm running 75MHz now :-(

Reading APIC specs and errata now.
The funny thing is that the errors occur on the APIC bus, which runs
independently @ 33MHz, no matter whether the FSB is 66 or 100 MHz, if I
understand the docs correctly. Maybe some timing stuff at the local APICs ...

Regards,
--
Kurt Garloff <[email protected]> Eindhoven, NL
GPG key: See mail header, key servers Linux kernel development
SuSE GmbH, Nuernberg, FRG SCSI, Security


2001-04-19 04:43:20

by Chris Wedgwood

Subject: Re: APIC errors ...

On Wed, Apr 18, 2001 at 09:27:12PM -0500, Rico Tudor wrote:

    Another problem area is ECC monitoring. I'm still waiting for
    info from ServerWorks, and so is Dan Hollis. Alexander Stohr has
    even submitted code to Jim Foster for approval, without evident
    effect. I have 18GB of RAM divided among five ServerWorks boxes,
    so the matter is not academic.

Add environment monitoring. One of my play machines is a Dell 2540,
dual AC power, lots of fans and temperature sensing; I'd really like
to be able to get this information from it (yeah, closed source Dell
drivers are worth almost zero).



--cw

2001-04-19 12:10:45

by Steffen Persvold

Subject: Re: APIC errors ...

Chris Wedgwood wrote:
>
> On Wed, Apr 18, 2001 at 09:27:12PM -0500, Rico Tudor wrote:
>
> Another problem area is ECC monitoring. I'm still waiting for
> info from ServerWorks, and so is Dan Hollis. Alexander Stohr has
> even submitted code to Jim Foster for approval, without evident
> effect. I have 18GB of RAM divided among five ServerWorks boxes,
> so the matter is not academic.
>
> Add environment monitoring. One of my play machines is a Dell 2540,
> dual AC power, lots of fans and temperature sensing; I'd really like
> to be able to get this information from it (yeah, closed source Dell
> drivers are worth almost zero).
>

This must be a Dell issue then, because I wrote an lm_sensors
(http://www.netroedge.com/~lm78/) driver for the ServerWorks OSB4
(SouthBridge) some time ago and they have merged it with the PIIX4 driver.
lm_sensors 2.5.5 and above should have support for the ServerWorks System
Management Bus. I have been running lm_sensors 2.5.5 on several mobos with
ServerWorks chipsets of all kinds (LE, HE, HE-SL), and most of them work
with the PIIX4 driver (with OSB4 support). The only one I've had problems
with so far is the Compaq DL360, which seems to have the SMB on the OSB4
disabled and uses another (proprietary) approach instead. This could be the
problem with the Dell machines too (2450, 2550, 1550).

Best regards
--
Steffen Persvold Systems Engineer
Email : mailto:[email protected] Scali AS (http://www.scali.com)
Norway : Pho : (+47) 2262 8950 Olaf Helsets vei 6
Fax : (+47) 2262 8951 N-0621 Oslo, Norway

USA : Pho : (+1) 713 706 0544 10500 Richmond Ave, Suite 190
Houston, Texas 77042, USA