2001-10-01 21:16:27

by Thomas Davis

Subject: endless APIC error messages..

Yes, I know, I've got a busted ABIT-BP2 system board.

Running 2.4, I get thousands of the APIC error messages, which fill my
syslog.

Is there a reason for this constant spewing? The short little patch
below simply stops the system from complaining after the first 50
messages - it's busted - we know that.

--- linux/arch/i386/kernel/apic.c Mon Oct 1 14:12:50 2001
+++ linux-2.4.9-ac16/arch/i386/kernel/apic.c Mon Oct 1 14:10:19 2001
@@ -37,6 +37,8 @@
 int prof_old_multiplier[NR_CPUS] = { 1, };
 int prof_counter[NR_CPUS] = { 1, };
 
+static int apic_error_count = 50;
+
 int get_maxlvt(void)
 {
 	unsigned int v, ver, maxlvt;
@@ -1061,8 +1063,11 @@
 	   6: Received illegal vector
 	   7: Illegal register address
 	*/
-	printk (KERN_ERR "APIC error on CPU%d: %02lx(%02lx)\n",
-		smp_processor_id(), v , v1);
+	if (apic_error_count != 0) {
+		apic_error_count--;
+		printk (KERN_ERR "APIC error on CPU%d: %02lx(%02lx)\n",
+			smp_processor_id(), v , v1);
+	}
 }
 
 /*

--
------------------------+--------------------------------------------------
Thomas Davis | ASG Cluster guy
[email protected] |
(510) 486-4524 | "80 nodes and chugging Captain!"
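
The patch above prints the first 50 errors and then goes silent for good. A possible middle ground, sketched here as standalone userspace C rather than kernel code, is to allow a small burst per time window and report how many messages were dropped when the next window opens. Every name in this sketch (msg_limiter, msg_limiter_allow, the tick units) is hypothetical and is not taken from the 2.4 tree.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical limiter state; none of these names exist in the kernel. */
struct msg_limiter {
    unsigned long window_start; /* tick at which the current window began */
    unsigned long window_len;   /* window length, in ticks */
    unsigned int  burst;        /* messages allowed per window */
    unsigned int  printed;      /* messages let through in this window */
    unsigned int  suppressed;   /* messages dropped in this window */
};

/*
 * Returns 1 if the caller should print now, 0 if the message should be
 * dropped. When a new window opens, *missed receives the number of
 * messages dropped during the previous window so the caller can log it;
 * otherwise *missed is set to 0.
 */
int msg_limiter_allow(struct msg_limiter *l, unsigned long now,
                      unsigned int *missed)
{
    assert(l != NULL && missed != NULL);
    *missed = 0;

    /* Start a new window once the old one has expired. */
    if (now - l->window_start >= l->window_len) {
        *missed = l->suppressed;
        l->window_start = now;
        l->printed = 0;
        l->suppressed = 0;
    }

    if (l->printed < l->burst) {
        l->printed++;
        return 1;
    }
    l->suppressed++;
    return 0;
}
```

With a limiter like this the log stays readable, but the errors never disappear from it entirely, and the dropped-message count still shows how bad things are.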


2001-10-01 22:03:54

by Mark Hahn

Subject: Re: endless APIC error messages..

> Running 2.4, I get thousands of the APIC error messages, which fill my
> syslog.
>
> Is there a reason for this constant spewing? The short little patch,

yes: any machine with enough apic errors to annoy
is a machine that is *not* catching all corrupt apic messages.
you don't want that. if you want any patch at all, have it panic()
if it ever sees, say, two apic errors per jiffy.

your patch is about like removing the battery from your smoke alarm...
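
The "two apic errors per jiffy" check suggested above could look roughly like the following. This is a userspace sketch under stated assumptions: the caller passes in the current tick (standing in for jiffies), the struct and function names are made up for illustration, and the function only returns a flag at the point where a kernel version would call panic().

```c
#include <assert.h>
#include <stddef.h>

#define APIC_ERRORS_PER_TICK_LIMIT 2 /* "say, two apic errors per jiffy" */

/* Hypothetical per-CPU bookkeeping; not a real kernel structure. */
struct apic_error_burst {
    unsigned long tick;  /* tick of the most recent error */
    unsigned int  count; /* errors seen so far during that tick */
};

/*
 * Record one error seen at time `now` (in ticks). Returns 1 when the
 * limit has been reached within a single tick, 0 otherwise.
 */
int apic_error_burst_hit(struct apic_error_burst *b, unsigned long now)
{
    assert(b != NULL);

    /* First error ever, or first error in a new tick: restart the count. */
    if (b->count == 0 || b->tick != now) {
        b->tick = now;
        b->count = 0;
    }
    b->count++;
    return b->count >= APIC_ERRORS_PER_TICK_LIMIT;
}
```

Errors spread over different ticks never trip the check; only a burst inside one tick does.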

2001-10-01 22:20:26

by Thomas Davis

Subject: Re: endless APIC error messages..

Before anyone sends me any more messages to junk the hardware, I quote
from the SMP-HOWTO at http://www.linuxdoc.org/HOWTO/SMP-HOWTO-3.html

"APIC error interrupt on CPU#n, should never happen" messages in logs

A message like:


APIC error interrupt on CPU#0, should never happen.
... APIC ESR0: 00000002
... APIC ESR1: 00000000

indicates a 'receive checksum error'. This cannot be caused by Linux as
the APIC message checksumming part is completely in hardware. It might
be marginal hardware. As long as you dont see any instability, they are
not a problem - APIC messages are retried until delivered. (Ingo Molnar)


I am NOT seeing instability, just tons of these messages.
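
For reference, the ESR value in the HOWTO excerpt (00000002) has only bit 1 set. A small userspace decoder for the low eight bits might look like the sketch below. Bits 6 and 7 match the comment visible in the patch context; the names for bits 0 through 5 follow the same kernel comment as I recall it and should be checked against apic.c or Intel's documentation before being relied on. The function name esr_describe is made up for this example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Bit names for the low byte of the APIC Error Status Register.
 * Assumption: bits 0-5 are quoted from memory, not from this thread. */
static const char *const esr_bit_names[8] = {
    "Send checksum error",      /* bit 0 */
    "Receive checksum error",   /* bit 1 */
    "Send accept error",        /* bit 2 */
    "Receive accept error",     /* bit 3 */
    "Reserved",                 /* bit 4 */
    "Send illegal vector",      /* bit 5 */
    "Received illegal vector",  /* bit 6 */
    "Illegal register address", /* bit 7 */
};

/*
 * Write a comma-separated list of the set bits' names into buf.
 * Returns the number of names written in full; buf is always
 * NUL-terminated.
 */
int esr_describe(unsigned long esr, char *buf, size_t len)
{
    int nbits = 0;
    size_t used = 0;

    assert(buf != NULL && len > 0);
    buf[0] = '\0';

    for (int bit = 0; bit < 8; bit++) {
        if (!(esr & (1UL << bit)))
            continue;
        int n = snprintf(buf + used, len - used, "%s%s",
                         nbits ? ", " : "", esr_bit_names[bit]);
        if (n < 0 || (size_t)n >= len - used)
            break; /* out of room: stop here, buf stays terminated */
        used += (size_t)n;
        nbits++;
    }
    return nbits;
}
```

Calling esr_describe(0x02, buf, sizeof buf) fills buf with the receive checksum entry, which is the case the HOWTO describes.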


2001-10-01 22:33:09

by Alan

Subject: Re: endless APIC error messages..

> the APIC message checksumming part is completely in hardware. It might
> be marginal hardware. As long as you dont see any instability, they are
> not a problem - APIC messages are retried until delivered. (Ingo Molnar)

APIC checksums are not, it appears, stunningly robust. You also want a
-ac kernel or 2.4.10 to handle the apic event rerun bug.