2004-03-23 12:49:30

by Chris Stromsoe

Subject: apic errors and looping with 2.4, none with 2.2

I have one machine that won't run 2.4. As soon as a 2.4 kernel boots, it
starts throwing APIC errors.

The machine is a dual CPU PIII 933MHz system with 512MB RAM on a
SuperMicro motherboard, either a P3TDLR or a 370DLR, with the ServerWorks
LE chipset. I'm booting using lilo with append="noapic".

As soon as I boot into a 2.4 kernel, I start getting APIC errors on both
CPUs. Varying combinations of:

Mar 23 00:40:45 dahlia kernel: APIC error on CPU0: 02(08)
Mar 23 00:40:45 dahlia kernel: APIC error on CPU1: 01(08)
Mar 23 00:45:45 dahlia kernel: APIC error on CPU1: 08(08)
Mar 23 00:45:45 dahlia kernel: APIC error on CPU0: 08(08)
Mar 23 00:58:27 dahlia kernel: APIC error on CPU0: 08(01)
Mar 23 00:58:27 dahlia kernel: APIC error on CPU1: 08(02)
Mar 23 01:04:54 dahlia kernel: APIC error on CPU1: 02(02)
Mar 23 01:04:54 dahlia kernel: APIC error on CPU0: 01(02)
Mar 23 01:05:46 dahlia kernel: APIC error on CPU1: 02(08)
Mar 23 01:05:46 dahlia kernel: APIC error on CPU0: 02(08)
Mar 23 01:08:37 dahlia kernel: APIC error on CPU1: 08(02)
Mar 23 01:08:37 dahlia kernel: APIC error on CPU0: 08(02)
Mar 23 01:11:04 dahlia kernel: APIC error on CPU1: 02(02)
Mar 23 01:11:04 dahlia kernel: APIC error on CPU0: 02(0a)
Mar 23 01:11:04 dahlia kernel: APIC error on CPU1: 02(08)
Mar 23 01:25:45 dahlia kernel: APIC error on CPU1: 08(08)
Mar 23 01:25:45 dahlia kernel: APIC error on CPU0: 0a(08)
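
For reading the lines above, here is a minimal sketch of a decoder. It assumes the two hex fields are the local APIC Error Status Register (ESR) as read before and after the error handler rewrites it, and that the bit meanings match the comment in the 2.4 kernel's smp_error_interrupt(). The names ESR_BITS, decode_esr and decode_line are hypothetical, and the bit table should be checked against the source tree actually in use.

```python
import re

# Bit meanings as listed in the comment in the 2.4 kernel's
# smp_error_interrupt(). This table is an assumption to verify
# against the kernel source being run.
ESR_BITS = {
    0: "send CS error",
    1: "receive CS error",
    2: "send accept error",
    3: "receive accept error",
    5: "send illegal vector",
    6: "received illegal vector",
    7: "illegal register address",
}

LINE_RE = re.compile(r"APIC error on CPU(\d+): ([0-9a-f]{2})\(([0-9a-f]{2})\)")


def decode_esr(value):
    """Return the names of the bits set in an ESR value, lowest bit first."""
    return [name for bit, name in sorted(ESR_BITS.items()) if value & (1 << bit)]


def decode_line(line):
    """Parse one syslog line; return (cpu, first, second) or None if no match."""
    m = LINE_RE.search(line)
    if m is None:
        return None
    cpu = int(m.group(1))
    return cpu, decode_esr(int(m.group(2), 16)), decode_esr(int(m.group(3), 16))


if __name__ == "__main__":
    sample = "Mar 23 01:11:04 dahlia kernel: APIC error on CPU0: 02(0a)"
    print(decode_line(sample))
```

Under those assumptions, 01 decodes to a send checksum error, 02 to a receive checksum error, 08 to a receive accept error, and 0a to the last two combined.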

After a few hours of uptime, the box stops responding to keyboard input.
It begins printing the above messages to console over and over. I have
several other identical machines that I received in the same batch that
run 2.4 without any problems (though they do seem to require "noapic").

It runs fine with 2.2 and is running 2.2.26 right now.

The machine is not in production use and can be used to test. Any ideas
for what I should look at?


-Chris


2004-03-24 10:50:39

by Chris Stromsoe

Subject: Re: apic errors and looping with 2.4, none with 2.2 (supermicro/serverworks LE chipset)

I've rebooted with noapic and nolapic and the machine seemed to be stable
for a while. Then I got:

Mar 24 00:27:08 dahlia kernel: APIC error on CPU1: 00(02)
Mar 24 00:27:08 dahlia kernel: APIC error on CPU0: 00(02)
Mar 24 00:27:08 dahlia kernel: spurious APIC interrupt on CPU#0, should never happen.
Mar 24 00:27:13 dahlia kernel: APIC error on CPU1: 02(08)
Mar 24 00:27:13 dahlia kernel: APIC error on CPU0: 02(08)
Mar 24 00:28:07 dahlia kernel: APIC error on CPU1: 08(02)
Mar 24 00:28:07 dahlia kernel: APIC error on CPU0: 08(02)
Mar 24 00:28:07 dahlia kernel: APIC error on CPU0: 02(08)
Mar 24 00:28:07 dahlia kernel: APIC error on CPU0: 08(02)
Mar 24 00:28:07 dahlia kernel: APIC error on CPU1: 02(0a)

I added nosmp to the lilo append line and rebooted.

With noapic, nolapic, and nosmp the machine seems to be stable. I haven't had anything
logged in the last 2 hours. Are there known APIC or SMP problems with
serverworks LE chipsets or supermicro motherboards and 2.4? What are the
steps to troubleshooting an APIC problem?
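
One possible first step, sketched under assumptions: group the logged errors by syslog timestamp to see whether both CPUs report at the same second. Errors that always arrive together on both CPUs might point at something shared between them rather than at a single CPU, but that reading is a guess and not a diagnosis. The function names are hypothetical and the regular expression only matches the log format shown in this message.

```python
import re
from collections import defaultdict

# Matches lines such as:
#   Mar 24 00:27:08 dahlia kernel: APIC error on CPU1: 00(02)
LINE_RE = re.compile(
    r"^(\w{3}\s+\d+\s+\d{2}:\d{2}:\d{2})\s+\S+\s+kernel: "
    r"APIC error on CPU(\d+):"
)


def group_by_second(lines):
    """Map each syslog timestamp to the set of CPUs that logged an error."""
    groups = defaultdict(set)
    for line in lines:
        m = LINE_RE.match(line)
        if m:
            groups[m.group(1)].add(int(m.group(2)))
    return dict(groups)


def shared_fraction(groups):
    """Fraction of timestamps at which more than one CPU reported an error."""
    if not groups:
        return 0.0
    shared = sum(1 for cpus in groups.values() if len(cpus) > 1)
    return shared / len(groups)


if __name__ == "__main__":
    log = [
        "Mar 24 00:27:08 dahlia kernel: APIC error on CPU1: 00(02)",
        "Mar 24 00:27:08 dahlia kernel: APIC error on CPU0: 00(02)",
        "Mar 24 00:27:13 dahlia kernel: APIC error on CPU1: 02(08)",
        "Mar 24 00:27:13 dahlia kernel: APIC error on CPU0: 02(08)",
    ]
    groups = group_by_second(log)
    print(groups, shared_fraction(groups))
```

Feeding it the full kernel log instead of the four sample lines gives a quick picture of whether the errors are paired across CPUs.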


-Chris

On Tue, 23 Mar 2004, Chris Stromsoe wrote:

> I have one machine that won't run 2.4. As soon as a 2.4 kernel boots, it
> starts throwing APIC errors.
>
> The machine is a dual CPU PIII 933MHz system with 512MB RAM on a
> SuperMicro motherboard, either a P3TDLR or a 370DLR, with the ServerWorks
> LE chipset. I'm booting using lilo with append="noapic".
>
> As soon as I boot into a 2.4 kernel, I start getting APIC errors on both
> CPUs. Varying combinations of:
>
> Mar 23 00:40:45 dahlia kernel: APIC error on CPU0: 02(08)
> Mar 23 00:40:45 dahlia kernel: APIC error on CPU1: 01(08)
> Mar 23 00:45:45 dahlia kernel: APIC error on CPU1: 08(08)
> Mar 23 00:45:45 dahlia kernel: APIC error on CPU0: 08(08)
> Mar 23 00:58:27 dahlia kernel: APIC error on CPU0: 08(01)
> Mar 23 00:58:27 dahlia kernel: APIC error on CPU1: 08(02)
> Mar 23 01:04:54 dahlia kernel: APIC error on CPU1: 02(02)
> Mar 23 01:04:54 dahlia kernel: APIC error on CPU0: 01(02)
> Mar 23 01:05:46 dahlia kernel: APIC error on CPU1: 02(08)
> Mar 23 01:05:46 dahlia kernel: APIC error on CPU0: 02(08)
> Mar 23 01:08:37 dahlia kernel: APIC error on CPU1: 08(02)
> Mar 23 01:08:37 dahlia kernel: APIC error on CPU0: 08(02)
> Mar 23 01:11:04 dahlia kernel: APIC error on CPU1: 02(02)
> Mar 23 01:11:04 dahlia kernel: APIC error on CPU0: 02(0a)
> Mar 23 01:11:04 dahlia kernel: APIC error on CPU1: 02(08)
> Mar 23 01:25:45 dahlia kernel: APIC error on CPU1: 08(08)
> Mar 23 01:25:45 dahlia kernel: APIC error on CPU0: 0a(08)
>
> After a few hours of uptime, the box stops responding to keyboard input.
> It begins printing the above messages to console over and over. I have
> several other identical machines that I received in the same batch that
> run 2.4 without any problems (though they do seem to require "noapic").
>
> It runs fine with 2.2 and is running 2.2.26 right now.
>
> The machine is not in production use and can be used to test. Any ideas
> for what I should look at?
>
>
> -Chris

2004-03-24 20:27:47

by Marcelo Tosatti

Subject: Re: apic errors and looping with 2.4, none with 2.2 (supermicro/serverworks LE chipset)


Chris,

All I know is that similar IOAPIC errors have been seen due to
BIOS/hardware misconfigurations.

Maybe Maciej or Mikael have a better idea of what might be happening.

On Wed, Mar 24, 2004 at 02:50:32AM -0800, Chris Stromsoe wrote:
> I've rebooted with noapic and nolapic and the machine seemed to be stable
> for a while. Then I got:
>
> Mar 24 00:27:08 dahlia kernel: APIC error on CPU1: 00(02)
> Mar 24 00:27:08 dahlia kernel: APIC error on CPU0: 00(02)
> Mar 24 00:27:08 dahlia kernel: spurious APIC interrupt on CPU#0, should never happen.
> Mar 24 00:27:13 dahlia kernel: APIC error on CPU1: 02(08)
> Mar 24 00:27:13 dahlia kernel: APIC error on CPU0: 02(08)
> Mar 24 00:28:07 dahlia kernel: APIC error on CPU1: 08(02)
> Mar 24 00:28:07 dahlia kernel: APIC error on CPU0: 08(02)
> Mar 24 00:28:07 dahlia kernel: APIC error on CPU0: 02(08)
> Mar 24 00:28:07 dahlia kernel: APIC error on CPU0: 08(02)
> Mar 24 00:28:07 dahlia kernel: APIC error on CPU1: 02(0a)
>
> I added nosmp to the lilo append line and rebooted.
>
> With noapic, nolapic, and nosmp the machine seems to be stable. I haven't had anything
> logged in the last 2 hours. Are there known APIC or SMP problems with
> serverworks LE chipsets or supermicro motherboards and 2.4? What are the
> steps to troubleshooting an APIC problem?
>
>
> -Chris

2004-03-24 21:27:51

by Chris Stromsoe

Subject: Re: apic errors and looping with 2.4, none with 2.2 (supermicro/serverworks LE chipset)

I also tested with 2.6.5-rc2 with no kernel command line parameters and
locked up the machine hard (no serial console) after about 2 minutes of
APIC errors. I'll check the BIOS in a few hours after I go reboot the
machine.


-Chris

On Wed, 24 Mar 2004, Marcelo Tosatti wrote:

>
> Chris,
>
> All I know is that similar IOAPIC errors have been seen due to
> BIOS/hardware misconfigurations.
>
> Maybe Maciej or Mikael have a better idea of what might be happening.
>
> On Wed, Mar 24, 2004 at 02:50:32AM -0800, Chris Stromsoe wrote:
> > I've rebooted with noapic and nolapic and the machine seemed to be stable
> > for a while. Then I got:
> >
> > Mar 24 00:27:08 dahlia kernel: APIC error on CPU1: 00(02)
> > Mar 24 00:27:08 dahlia kernel: APIC error on CPU0: 00(02)
> > Mar 24 00:27:08 dahlia kernel: spurious APIC interrupt on CPU#0, should never happen.
> > Mar 24 00:27:13 dahlia kernel: APIC error on CPU1: 02(08)
> > Mar 24 00:27:13 dahlia kernel: APIC error on CPU0: 02(08)
> > Mar 24 00:28:07 dahlia kernel: APIC error on CPU1: 08(02)
> > Mar 24 00:28:07 dahlia kernel: APIC error on CPU0: 08(02)
> > Mar 24 00:28:07 dahlia kernel: APIC error on CPU0: 02(08)
> > Mar 24 00:28:07 dahlia kernel: APIC error on CPU0: 08(02)
> > Mar 24 00:28:07 dahlia kernel: APIC error on CPU1: 02(0a)
> >
> > I added nosmp to the lilo append line and rebooted.
> >
> > With noapic, nolapic, and nosmp the machine seems to be stable. I haven't had anything
> > logged in the last 2 hours. Are there known APIC or SMP problems with
> > serverworks LE chipsets or supermicro motherboards and 2.4? What are the
> > steps to troubleshooting an APIC problem?
> >
> >
> > -Chris

2004-03-25 16:40:08

by Mikael Pettersson

Subject: Re: apic errors and looping with 2.4, none with 2.2

Chris Stromsoe writes:
> I have one machine that won't run 2.4. As soon as a 2.4 kernel boots, it
> starts throwing APIC errors.
>
> The machine is a dual CPU PIII 933MHz system with 512MB RAM on a
> SuperMicro motherboard, either a P3TDLR or a 370DLR, with the ServerWorks
> LE chipset. I'm booting using lilo with append="noapic".
>
> As soon as I boot into a 2.4 kernel, I start getting APIC errors on both
> CPUs. Varying combinations of:
>
> Mar 23 00:40:45 dahlia kernel: APIC error on CPU0: 02(08)
> Mar 23 00:40:45 dahlia kernel: APIC error on CPU1: 01(08)
...
> After a few hours of uptime, the box stops responding to keyboard input.
> It begins printing the above messages to console over and over. I have
> several other identical machines that I received in the same batch that
> run 2.4 without any problems (though they do seem to require "noapic").
>
> It runs fine with 2.2 and is running 2.2.26 right now.

Like Maciej wrote, the machine's local APIC bus corrupts messages.
The fact that your other supposedly-identical machines don't have
this problem indicates strongly that this particular box has a HW
problem. It could be a weak power-supply, inadequate cooling, a
bad CPU, or a bad motherboard.

I just checked the 2.2.25 kernel and it doesn't seem to enable
APIC_LVTERR. You're probably getting APIC bus errors with 2.2
too, you just won't see them logged anywhere.

(Regarding your other boxes that need "noapic": you have tried
w/o ACPI, right? acpi=off or pci=noacpi.)
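
To confirm which of these options a given boot actually used, here is a small sketch that reads the kernel command line. It assumes /proc/cmdline is available on the running kernel. The name boot_flags and the FLAGS tuple are hypothetical and cover only the options mentioned in this thread.

```python
# Options mentioned in this thread; illustrative, not exhaustive.
FLAGS = ("noapic", "nolapic", "nosmp", "acpi=off", "pci=noacpi")


def boot_flags(cmdline):
    """Report which of the thread's boot options appear on a command line."""
    tokens = cmdline.split()
    return {flag: flag in tokens for flag in FLAGS}


if __name__ == "__main__":
    try:
        with open("/proc/cmdline") as f:
            current = f.read()
    except OSError:
        current = ""  # not on Linux, or /proc is not mounted
    for flag, present in boot_flags(current).items():
        print(flag, "set" if present else "not set")
```

Matching on whole tokens avoids counting "noapic" as present when only "pci=noacpi" was given, since the two are easy to confuse.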

/Mikael