2008-08-18 14:17:20

by Andy Chittenden

Subject: Linux 2.6.26 edac errors and ASUS P5W DH Deluxe motherboard

I've just installed the linux-image-2.6.26-1-amd64 debian package on
three of our ASUS P5W DH Deluxe based machines and they've all started
spewing out messages:

Message from syslogd@savage at Mon Aug 18 14:01:52 2008 ...
savage kernel: [ 74.389644] EDAC MC0: UE page 0x7fe03, offset 0x0,
grain 128, row 2, labels ":": i82975x UE

Message from syslogd@savage at Mon Aug 18 14:01:53 2008 ...
savage kernel: [ 75.555862] EDAC MC0: UE page 0x7fd44, offset 0x0,
grain 128, row 2, labels ":": i82975x UE

Message from syslogd@savage at Mon Aug 18 14:01:54 2008 ...
savage kernel: [ 76.628039] EDAC MC0: UE page 0x7fd41, offset 0x0,
grain 128, row 2, labels ":": i82975x UE

Message from syslogd@savage at Mon Aug 18 14:01:55 2008 ...
savage kernel: [ 77.629260] EDAC MC0: UE page 0x7fd27, offset 0x0,
grain 128, row 2, labels ":": i82975x UE

every second.

I've removed that kernel package and they're running previous versions
of the kernel (eg linux-image-2.6.25-2-amd64) happily. I've run memtest
on one of them with no problems. So, anyone got any ideas what's causing
this? (FWIW the machines have all got ECC memory in them).

--
Andy, BlueArc Engineering


2008-08-18 19:53:09

by Bernd Schubert

Subject: Re: Linux 2.6.26 edac errors and ASUS P5W DH Deluxe motherboard

Andy Chittenden wrote:

> I've just installed the linux-image-2.6.26-1-amd64 debian package on
> three of our ASUS P5W DH Deluxe based machines and they've all started
> spewing out messages:
>
> Message from syslogd@savage at Mon Aug 18 14:01:52 2008 ...
> savage kernel: [ 74.389644] EDAC MC0: UE page 0x7fe03, offset 0x0,
> grain 128, row 2, labels ":": i82975x UE
>
> Message from syslogd@savage at Mon Aug 18 14:01:53 2008 ...
> savage kernel: [ 75.555862] EDAC MC0: UE page 0x7fd44, offset 0x0,
> grain 128, row 2, labels ":": i82975x UE
>
> Message from syslogd@savage at Mon Aug 18 14:01:54 2008 ...
> savage kernel: [ 76.628039] EDAC MC0: UE page 0x7fd41, offset 0x0,
> grain 128, row 2, labels ":": i82975x UE
>
> Message from syslogd@savage at Mon Aug 18 14:01:55 2008 ...
> savage kernel: [ 77.629260] EDAC MC0: UE page 0x7fd27, offset 0x0,
> grain 128, row 2, labels ":": i82975x UE
>
> every second.
>
> I've removed that kernel package and they're running previous versions
> of the kernel (eg linux-image-2.6.25-2-amd64) happily. I've run memtest
> on one of them with no problems. So, anyone got any ideas what's causing
> this? (FWIW the machines have all got ECC memory in them).
>

Do you have an IPMI card installed in these systems? There is a known issue
with Asus boards + IPMI; you then need to disable a few IPMI sensors.


Cheers,
Bernd

2008-08-18 19:53:36

by Doug Thompson

Subject: Re: Linux 2.6.26 edac errors and ASUS P5W DH Deluxe motherboard


--- Andy Chittenden wrote:

> I've just installed the linux-image-2.6.26-1-amd64 debian package on
> three of our ASUS P5W DH Deluxe based machines and they've all started
> spewing out messages:
>
> Message from syslogd@savage at Mon Aug 18 14:01:52 2008 ...
> savage kernel: [ 74.389644] EDAC MC0: UE page 0x7fe03, offset 0x0,
> grain 128, row 2, labels ":": i82975x UE
>
> Message from syslogd@savage at Mon Aug 18 14:01:53 2008 ...
> savage kernel: [ 75.555862] EDAC MC0: UE page 0x7fd44, offset 0x0,
> grain 128, row 2, labels ":": i82975x UE
>
> Message from syslogd@savage at Mon Aug 18 14:01:54 2008 ...
> savage kernel: [ 76.628039] EDAC MC0: UE page 0x7fd41, offset 0x0,
> grain 128, row 2, labels ":": i82975x UE
>
> Message from syslogd@savage at Mon Aug 18 14:01:55 2008 ...
> savage kernel: [ 77.629260] EDAC MC0: UE page 0x7fd27, offset 0x0,
> grain 128, row 2, labels ":": i82975x UE
>
> every second.
>
> I've removed that kernel package and they're running previous versions
> of the kernel (eg linux-image-2.6.25-2-amd64) happily. I've run memtest
> on one of them with no problems. So, anyone got any ideas what's causing
> this? (FWIW the machines have all got ECC memory in them).
>
> --
> Andy, BlueArc Engineering


I don't know which version of the source code was used in the 25 or the 26 versions of the debian
package, but it might be that the later one is really finding errors, as I remember there were some
patches against the i82975x module.

The reports printed above are consistent: they are ALL in Chip Select Row 2, yet all 3 of the
machines are outputting messages.

Are they ALL the same row, or are they different rows? If different, they could be legit. If they
are all the same row, there might be an issue.
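
One way to answer the same-row question across machines is to tally the UE reports per row straight from the logs. The following is a minimal Python sketch, not part of EDAC or edac-utils: the regular expression is written against the message format quoted in this thread only, and the 4 KiB page size (PAGE_SHIFT) is an assumption used to turn the reported page number into a physical address.

```python
import re
from collections import Counter

# Matches the UE lines quoted in this thread, e.g.
#   EDAC MC0: UE page 0x7fe03, offset 0x0, grain 128, row 2, labels ":": i82975x UE
UE_RE = re.compile(
    r"EDAC MC(?P<mc>\d+): UE page 0x(?P<page>[0-9a-fA-F]+), "
    r"offset 0x(?P<offset>[0-9a-fA-F]+), grain (?P<grain>\d+), row (?P<row>\d+)"
)

PAGE_SHIFT = 12  # assumption: 4 KiB pages


def parse_ue_events(text):
    """Extract UE reports from log text, tolerating entries wrapped over lines."""
    # Collapse all whitespace so a report split across lines still matches.
    flat = " ".join(text.split())
    events = []
    for m in UE_RE.finditer(flat):
        page = int(m.group("page"), 16)
        events.append({
            "mc": int(m.group("mc")),
            "row": int(m.group("row")),
            "page": page,
            "phys_addr": (page << PAGE_SHIFT) + int(m.group("offset"), 16),
        })
    return events


def count_by_row(events):
    """Count reports per (memory controller, csrow) pair."""
    return Counter((e["mc"], e["row"]) for e in events)
```

Feeding it the contents of kern.log from each machine and comparing the resulting counters shows at a glance whether every report lands in one row. Under the 4 KiB assumption, page 0x7fe03 corresponds to physical address 0x7fe03000, just below the 2 GiB boundary.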

Reading the manual for the mobo (http://support.asus.com/download/download.aspx?SLanguage=en-us) I
see that there are 4 slots for memory:

DIMM_A1
DIMM_A2
DIMM_B1
DIMM_B2

In the output above, you can see the following:

labels ":"

When properly set by the edac-utils (http://sourceforge.net/projects/edac-utils/) user space support
package (IF the target motherboard is in its database), the labels field will contain the name of
the offending DIMM, like "DIMM_A2" or such. This aids in identifying the problem DIMM. If you
already have it installed, you might need to add your motherboard's DIMM labels to the motherboard
database to see them.
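
To check what the kernel currently holds for each row, the per-csrow sysfs attributes can be read directly. This is only a sketch under stated assumptions: it assumes the per-csrow EDAC sysfs layout (csrowN directories under /sys/devices/system/edac/mc/mc0 containing ue_count, ce_count and chN_dimm_label files), and the function name read_csrow_status is made up for illustration. Check the layout on your own kernel before relying on it.

```python
from pathlib import Path


def _read(path):
    """Return the stripped contents of a sysfs attribute, or None if unreadable."""
    try:
        return path.read_text().strip()
    except OSError:
        return None


def read_csrow_status(mc_dir="/sys/devices/system/edac/mc/mc0"):
    """Collect error counts and DIMM labels for every csrow of one controller.

    The default path is an assumption about the sysfs layout; pass another
    directory if your kernel exposes the controller elsewhere.
    """
    status = []
    for csrow in sorted(Path(mc_dir).glob("csrow*")):
        labels = {p.name: _read(p) for p in sorted(csrow.glob("ch*_dimm_label"))}
        status.append({
            "csrow": csrow.name,
            "ue_count": int(_read(csrow / "ue_count") or 0),
            "ce_count": int(_read(csrow / "ce_count") or 0),
            "labels": labels,
        })
    return status
```

Calling read_csrow_status() and printing each entry shows which rows are accumulating errors. An empty label string means no label has been loaded for that row and channel, which would match the empty labels ":" field in the messages above.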

Since I don't have one of these chipsets, is it possible I could get access to one or more of these
machines to take a look around?

doug t


W1DUG

2008-08-19 08:18:03

by Andy Chittenden

Subject: RE: Linux 2.6.26 edac errors and ASUS P5W DH Deluxe motherboard

Hi Doug

> I don't know which version of the source code was used in the 25 or
> the 26 versions of the debian package, but it might be that the later
> one is really finding errors as I remember there was some patches
> against the i82975x module.

I've done a diff between 2.6.25 and 2.6.26 source code of the
i82975x_edac module. As you can see, there's not much difference:

# diff -u linux-2.6.2[56]/drivers/edac/i82975x_edac.c
--- linux-2.6.25/drivers/edac/i82975x_edac.c 2008-04-17 03:49:44.000000000 +0100
+++ linux-2.6.26/drivers/edac/i82975x_edac.c 2008-07-13 22:51:29.000000000 +0100
@@ -14,7 +14,7 @@
 #include <linux/pci.h>
 #include <linux/pci_ids.h>
 #include <linux/slab.h>
-
+#include <linux/edac.h>
 #include "edac_core.h"
 
 #define I82975X_REVISION " Ver: 1.0.0 " __DATE__
@@ -611,6 +611,9 @@
 
 	debugf3("%s()\n", __func__);
 
+	/* Ensure that the OPSTATE is set correctly for POLL or NMI */
+	opstate_init();
+
 	pci_rc = pci_register_driver(&i82975x_driver);
 	if (pci_rc < 0)
 		goto fail0;
@@ -664,3 +667,6 @@
 MODULE_LICENSE("GPL");
 MODULE_AUTHOR("Arvind R. <[email protected]>");
 MODULE_DESCRIPTION("MC support for Intel 82975 memory hub controllers");
+
+module_param(edac_op_state, int, 0444);
+MODULE_PARM_DESC(edac_op_state, "EDAC Error Reporting state: 0=Poll,1=NMI");


> Are they ALL the same row, or are they different rows? If different,
> they could be legit. The same row there might be an issue.

Hmm, they're different. On another m/c, I've managed to find the logged
info when it booted up 2.6.26:

/var/log/kern.log.1.gz:Aug 4 11:38:15 diesel kernel: [ 9.079151] EDAC MC0: UE page 0x7fe0b, offset 0x0, grain 128, row 1, labels ":": i82975x UE
/var/log/kern.log.1.gz:Aug 4 11:38:15 diesel kernel: [ 10.104762] EDAC MC0: UE page 0x7e451, offset 0x0, grain 128, row 1, labels ":": i82975x UE
/var/log/kern.log.1.gz:Aug 4 11:38:15 diesel kernel: [ 11.110256] EDAC MC0: UE page 0x7e7ae, offset 0x0, grain 128, row 1, labels ":": i82975x UE
...
/var/log/kern.log.1.gz:Aug 4 11:52:05 diesel kernel: [ 11.636753] EDAC MC0: UE page 0x60000, offset 0x0, grain 128, row 1, labels ":": i82975x UE
/var/log/kern.log.1.gz:Aug 4 11:52:05 diesel kernel: [ 12.641616] EDAC MC0: UE page 0xde771, offset 0x0, grain 128, row 3, labels ":": i82975x UE
/var/log/kern.log.1.gz:Aug 4 11:52:05 diesel kernel: [ 13.734052] EDAC MC0: UE page 0xde771, offset 0x0, grain 128, row 3, labels ":": i82975x UE
/var/log/kern.log.1.gz:Aug 4 11:52:05 diesel kernel: [ 14.743449] EDAC MC0: UE page 0xde771, offset 0x0, grain 128, row 3, labels ":": i82975x UE

> When properly set by edac-utils
> (http://sourceforge.net/projects/edac-utils/) ...

Thanks for the pointer. I've now installed edac-utils on the offending
motherboards. It seems that the motherboard is half known about:

# edac-ctl --mainboard
edac-ctl: mainboard: ASUSTEK COMPUTER INC P5W DH Deluxe
# edac-ctl --print-labels
No dimm labels for ASUSTEK COMPUTER INC P5W DH Deluxe

dmidecode gives some memory module info:

Handle 0x0009, DMI type 6, 12 bytes
Memory Module Information
Socket Designation: DIMM0
Bank Connections: 9 11
Current Speed: 30 ns
Type: Unknown FPM Parity ECC SDRAM
Installed Size: 2048 MB (Double-bank Connection)
Enabled Size: 2048 MB (Double-bank Connection)
Error Status: OK

Handle 0x000A, DMI type 6, 12 bytes
Memory Module Information
Socket Designation: DIMM1
Bank Connections: 9 11
Current Speed: 30 ns
Type: Unknown FPM Parity ECC SDRAM
Installed Size: 2048 MB (Double-bank Connection)
Enabled Size: 2048 MB (Double-bank Connection)
Error Status: OK

Handle 0x000B, DMI type 6, 12 bytes
Memory Module Information
Socket Designation: DIMM2
Bank Connections: 9 11
Current Speed: 30 ns
Type: Unknown FPM Parity ECC SDRAM
Installed Size: 2048 MB (Double-bank Connection)
Enabled Size: 2048 MB (Double-bank Connection)
Error Status: OK

Handle 0x000C, DMI type 6, 12 bytes
Memory Module Information
Socket Designation: DIMM3
Bank Connections: 9 11
Current Speed: 30 ns
Type: Unknown FPM Parity ECC SDRAM
Installed Size: 2048 MB (Double-bank Connection)
Enabled Size: 2048 MB (Double-bank Connection)
Error Status: OK

> Since I don't have one of these chipsets, is it possible I could
> access to one or more of these machines to take a look around?

Unfortunately not. If there are any commands you'd like me to run, then
please let me know.

If you could let me know what I need to put in /etc/edac/labels.db, that
would be appreciated too.

--
Andy, BlueArc Engineering

2008-08-19 17:48:49

by Doug Thompson

Subject: RE: Linux 2.6.26 edac errors and ASUS P5W DH Deluxe motherboard


--- Andy Chittenden wrote:

>
> If you could let me know what I need to put in /etc/edac/labels.db, that
> would be appreciated too.
>

Discovering the mapping of DIMMs to the silkscreen labels is a manual, one-time task.

One command is 'dmidecode', which is run as root and dumps the BIOS DMI tables. Unfortunately,
many BIOSes do not set these tables to the correct DIMM silkscreen labels.
Because of this lack, EDAC and edac-utils were created to provide a mechanism for end users to set
the labels themselves.

If your system does provide correct DIMM labels, you can create/correct the entry for your
motherboard in the database file for edac-utils.

If your system provides simple generic labels, then you will need to physically move DIMMs from
slot to slot and watch as the error "moves" with the DIMM. This will take a few iterations and
a state table.

Usually, a DIMM will have 2 Chip-Select Rows (csrows).

The first set of DIMMs forms a 128-bit data path (called dual channel operation) and has csrows 0
and 1.

The second set of DIMMs will have csrows 2 and 3.

Therefore, you need to examine which csrow and which channel the error is being reported in.
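
As a rough illustration of that csrow arithmetic, here is a small Python sketch. It assumes 2 csrows per set of DIMMs, as described above. The pairing of silkscreen slots into sets (HYPOTHETICAL_SETS) is invented for the example and has NOT been confirmed for the P5W DH Deluxe; the real pairing has to come from the manual or from the swap test.

```python
CSROWS_PER_SET = 2  # assumption: each set of DIMMs occupies 2 csrows

# HYPOTHETICAL slot pairing, indexed by set number and then by channel.
# Replace with the mapping found for the actual board.
HYPOTHETICAL_SETS = {
    0: ("DIMM_A1", "DIMM_B1"),
    1: ("DIMM_A2", "DIMM_B2"),
}


def dimm_set_for_csrow(csrow, csrows_per_set=CSROWS_PER_SET):
    """Return the index of the DIMM set that a reported csrow belongs to."""
    if csrow < 0:
        raise ValueError("csrow must be non-negative")
    return csrow // csrows_per_set


def candidate_dimms(csrow, channel=None, sets=HYPOTHETICAL_SETS):
    """Return the slots that could hold the failing DIMM for a report.

    Without a channel, both DIMMs of the set are candidates; with a channel
    (0 or 1), only the DIMM on that channel remains.
    """
    pair = sets[dimm_set_for_csrow(csrow)]
    return pair if channel is None else (pair[channel],)
```

With this table, a report in row 2 or 3 points at the second set and a report in row 0 or 1 at the first. Knowing the channel as well narrows it to a single slot.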

doug t


W1DUG