LinuxLists.cc - EDAC chipkill messages

2007-01-18 22:56:13

Subject: EDAC chipkill messages

Can someone please explain to me what these mean?

EDAC k8 MC1: general bus error: participating processor(local node
origin), time-out(no timeout) memory transaction type(generic read), mem
or i/o(mem access), cache level(generic)
EDAC MC1: CE page 0xfbf6f, offset 0x4d0, grain 8, syndrome 0xc8f4, row
1, channel 0, label "": k8_edac
EDAC MC1: CE - no information available: k8_edac Error Overflow set
EDAC k8 MC1: extended error code: ECC chipkill x4 error

Thanks!

--
Orion Poplawski
Technical Manager 303-415-9701 x222
NWRA/CoRA Division FAX: 303-415-9702
3380 Mitchell Lane [email protected]
Boulder, CO 80301 http://www.cora.nwra.com

2007-01-19 00:12:55

by Robert Hancock

[permalink] [raw]

Subject: Re: EDAC chipkill messages

Orion Poplawski wrote:
> Can someone please explain to me what these mean?
>
> EDAC k8 MC1: general bus error: participating processor(local node
> origin), time-out(no timeout) memory transaction type(generic read), mem
> or i/o(mem access), cache level(generic)
> EDAC MC1: CE page 0xfbf6f, offset 0x4d0, grain 8, syndrome 0xc8f4, row
> 1, channel 0, label "": k8_edac
> EDAC MC1: CE - no information available: k8_edac Error Overflow set
> EDAC k8 MC1: extended error code: ECC chipkill x4 error
>
> Thanks!
>

Sounds like you're having some memory ECC errors.. some Memtest86, etc.
runs may be in order. You may be able to figure out from this info what
DIMM is having the problem.

--
Robert Hancock Saskatoon, SK, Canada
To email, remove "nospam" from [email protected]
Home Page: http://www.roberthancock.com/

2007-01-19 16:46:14

by Orion Poplawski

[permalink] [raw]

Subject: Re: EDAC chipkill messages

Robert Hancock wrote:
> Orion Poplawski wrote:
>> Can someone please explain to me what these mean?
>>
>> EDAC k8 MC1: general bus error: participating processor(local node
>> origin), time-out(no timeout) memory transaction type(generic read),
>> mem or i/o(mem access), cache level(generic)
>> EDAC MC1: CE page 0xfbf6f, offset 0x4d0, grain 8, syndrome 0xc8f4, row
>> 1, channel 0, label "": k8_edac
>> EDAC MC1: CE - no information available: k8_edac Error Overflow set
>> EDAC k8 MC1: extended error code: ECC chipkill x4 error
>>
>> Thanks!
>>
>
> Sounds like you're having some memory ECC errors.. some Memtest86, etc.
> runs may be in order. You may be able to figure out from this info what
> DIMM is having the problem.
>

That was my assumption as well, but was hoping someone could decode the
above information and point me to the problem chip. I ran Memtest86
overnight but found no problems, but don't know if it needs to run in a
particular ECC mode.

This is a dual proc 275 system with 4 1GB DIMMs. Guessing that MC1 is
the controller on the second CPU. Would row 1 be the second DIMM?

2007-01-19 19:34:54

by Doug Thompson

[permalink] [raw]

Subject: Re: EDAC chipkill messages

--- Orion Poplawski <[email protected]> wrote:

> Robert Hancock wrote:
> > Orion Poplawski wrote:
> >> Can someone please explain to me what these mean?
> >>
> >> EDAC k8 MC1: general bus error: participating processor(local node
>
> >> origin), time-out(no timeout) memory transaction type(generic
> read),
> >> mem or i/o(mem access), cache level(generic)
> >> EDAC MC1: CE page 0xfbf6f, offset 0x4d0, grain 8, syndrome 0xc8f4,
> row
> >> 1, channel 0, label "": k8_edac
> >> EDAC MC1: CE - no information available: k8_edac Error Overflow
> set
> >> EDAC k8 MC1: extended error code: ECC chipkill x4 error
> >>
> >> Thanks!
> >>
> >
> > Sounds like you're having some memory ECC errors.. some Memtest86,
> etc.
> > runs may be in order. You may be able to figure out from this info
> what
> > DIMM is having the problem.
> >
>
> That was my assumption as well, but was hoping someone could decode
> the
> above information and point me to the problem chip. I ran Memtest86
> overnight but found no problems, but don't know if it needs to run in
> a
> particular ECC mode.
>
> This is a dual proc 275 system with 4 1GB DIMMs. Guessing that MC1
> is
> the controller on the second CPU. Would row 1 be the second DIMM?

No that would be the FIRST DIMM, on Channel 0

Each DIMM has 2 ChipSelect Rows (CSROW)

Each csrow covers two channels across, therefore on a 4 socket memory
array, there are CSROWS 0 and 1 on the first DIMM row and CSROWS 2 and
3 on the second DIMM row.

WWWWWWWWW XXXXXXXXXXX
YYYYYYYYY ZZZZZZZZZZZ

The W and the Y DIMMs are channel 0
The X and the Z DIMMs are channel 1

csrows 0 and 1 would cross over Y and Z DIMMs
csrows 2 and 3 would cross over W and X DIMMs

The mapping problem occurs in then identifying each of the above goes
to which silk screen labeled sockets on the mobo.

Usually they are labeled:

H0_DIMM2A H0_DIMM2B
H0_DIMM1A H0_DIMM1B

where A is channel 0 and B is channel 1 and
the "DIMM1" would indicate the CSROWs 0 and 1
and "DIMM2" would indicate the CSROWs 2 and 3

The string 'label ""' can be filled in by a userspace script to
properly identify the DIMM silk screen according to the motherboard
used.

The lines with "EDAC MC1:" are EDAC CORE output messages, while the
"EDAC K8:" lines are EDAC Memory Controller driver messages.
"CE" is correctable error
MC1 is memory controller 1 (0 based)

ECC ChipKill x4 was what found the error and corrected it.

The FRU, (field replaceable unit) is the DIMM located at socket
H1_DIMM1A, according to the labeling I mentioned above.

caveat: the detector is not 100% perfect but gives a general area to
look at, the DIMM specification. Sometimes other errors can cause what
looks like a memory error, but usually a bad memory DIMM is the root
cause of the vast majority of such errors.

In addition, memtest86+ doesn't find all the bad memory in all cases,
but it is still a VERY useful tool

doug thompson