Date: Fri, 19 Jan 2007 11:34:52 -0800 (PST)
From: Doug Thompson <norsk5@yahoo.com>
Subject: Re: EDAC chipkill messages
To: Orion Poplawski <orion@cora.nwra.com>, linux-kernel@vger.kernel.org
In-Reply-To: <45B0F5B7.4080703@cora.nwra.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=US-ASCII
Content-Transfer-Encoding: 7BIT
Message-ID: <703726.11655.qm@web50115.mail.yahoo.com>


--- Orion Poplawski <orion@cora.nwra.com> wrote:

> Robert Hancock wrote:
> > Orion Poplawski wrote:
> >> Can someone please explain to me what these mean?
> >>
> >> EDAC k8 MC1: general bus error: participating processor(local node origin), time-out(no timeout) memory transaction type(generic read), mem or i/o(mem access), cache level(generic)
> >> EDAC MC1: CE page 0xfbf6f, offset 0x4d0, grain 8, syndrome 0xc8f4, row 1, channel 0, label "": k8_edac
> >> EDAC MC1: CE - no information available: k8_edac Error Overflow set
> >> EDAC k8 MC1: extended error code: ECC chipkill x4 error
> >>
> >> Thanks!
> >>
> > 
> > Sounds like you're having some memory ECC errors.. some Memtest86, etc.
> > runs may be in order. You may be able to figure out from this info what
> > DIMM is having the problem.
> > 
> 
> That was my assumption as well, but was hoping someone could decode the
> above information and point me to the problem chip.  I ran Memtest86
> overnight but found no problems, but don't know if it needs to run in a
> particular ECC mode.
>
> This is a dual proc 275 system with 4 1GB DIMMs.  Guessing that MC1 is
> the controller on the second CPU.  Would row 1 be the second DIMM?


No, that would be the FIRST DIMM, on channel 0.

Each DIMM has 2 ChipSelect Rows (CSROWs).

Each csrow spans both channels, so on a 4 socket memory array CSROWs 0
and 1 are on the first DIMM row and CSROWs 2 and 3 are on the second
DIMM row.

WWWWWWWWW  XXXXXXXXXXX
YYYYYYYYY  ZZZZZZZZZZZ

The W and the Y DIMMs are channel 0
The X and the Z DIMMs are channel 1

csrows 0 and 1 would cross over Y and Z DIMMs
csrows 2 and 3 would cross over W and X DIMMs

The mapping problem is then identifying which silk screen labeled
socket on the mobo each of the above corresponds to.

Usually they are labeled:

H0_DIMM2A   H0_DIMM2B
H0_DIMM1A   H0_DIMM1B

where A is channel 0 and B is channel 1 and
the "DIMM1" would indicate the CSROWs 0 and 1
and "DIMM2" would indicate the CSROWs 2 and 3
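
A rough sketch of that mapping, for anyone who wants it in script form.
It assumes the "H<node>_DIMM<row><A|B>" labeling shown above, with the
H number following the memory controller number; boards differ, so check
the silk screen. The function name is made up for illustration.

```python
def silk_screen_label(mc: int, csrow: int, channel: int) -> str:
    """Guess the silk screen label for an EDAC (mc, csrow, channel) triple.

    Assumption: the board uses the common "H<node>_DIMM<row><A|B>" scheme
    described in this mail. Real boards may label sockets differently.
    """
    if mc < 0 or csrow < 0:
        raise ValueError("mc and csrow must be non-negative")
    if channel not in (0, 1):
        raise ValueError("channel must be 0 or 1")

    dimm_row = csrow // 2 + 1        # csrows 0,1 -> DIMM1; csrows 2,3 -> DIMM2
    channel_letter = "AB"[channel]   # channel 0 -> A, channel 1 -> B
    return f"H{mc}_DIMM{dimm_row}{channel_letter}"


# The error reported in this thread: MC1, row 1, channel 0
print(silk_screen_label(1, 1, 0))    # prints: H1_DIMM1A
```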

The string 'label ""' can be filled in by a userspace script to
properly identify the DIMM silk screen according to the motherboard
used.
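
Such a script could look roughly like the sketch below. It assumes the
labels are exposed as writable files named
mc<M>/csrow<R>/ch<C>_dimm_label under /sys/devices/system/edac/mc;
check the EDAC documentation for your kernel, since the layout may
differ. The function name and the example table are hypothetical and
must be adapted to the motherboard in use.

```python
import os

# Hypothetical table for one board: (mc, csrow, channel) -> silk screen label.
EXAMPLE_LABELS = {
    (1, 0, 0): "H1_DIMM1A",
    (1, 1, 0): "H1_DIMM1A",
    (1, 0, 1): "H1_DIMM1B",
    (1, 1, 1): "H1_DIMM1B",
}


def write_dimm_labels(labels, edac_root="/sys/devices/system/edac/mc"):
    """Write each label to <edac_root>/mc<M>/csrow<R>/ch<C>_dimm_label.

    The sysfs layout is an assumption (see above). Entries whose file
    does not exist are skipped. Returns the list of paths written.
    """
    written = []
    for (mc, csrow, channel), label in sorted(labels.items()):
        path = os.path.join(
            edac_root, f"mc{mc}", f"csrow{csrow}", f"ch{channel}_dimm_label"
        )
        if not os.path.exists(path):
            continue  # csrow/channel not populated or not exposed
        with open(path, "w") as f:
            f.write(label)
        written.append(path)
    return written
```

Run as root, write_dimm_labels(EXAMPLE_LABELS) would fill in the labels
so that later CE messages carry the socket name instead of label "".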

The lines with "EDAC MC1:" are EDAC CORE output messages, while the
"EDAC k8 MC1:" lines are EDAC Memory Controller driver messages.
"CE" is correctable error.
MC1 is memory controller 1 (0 based), i.e. the second controller.
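
If you want to pull the fields out of the EDAC CORE "CE" line with a
script, something like this sketch would do. The regular expression is
written against the message format quoted in this thread only; other
kernel versions may print the line differently, and the names used here
are just for illustration.

```python
import re

# Matches the EDAC CORE correctable-error line as quoted in this thread.
CE_LINE = re.compile(
    r'EDAC MC(?P<mc>\d+): CE page 0x(?P<page>[0-9a-fA-F]+), '
    r'offset 0x(?P<offset>[0-9a-fA-F]+), grain (?P<grain>\d+), '
    r'syndrome 0x(?P<syndrome>[0-9a-fA-F]+), row (?P<row>\d+), '
    r'channel (?P<channel>\d+), label "(?P<label>[^"]*)": (?P<driver>\S+)'
)


def parse_ce_line(line):
    """Return a dict of fields from a CE line, or None if it does not match."""
    m = CE_LINE.search(line)
    if m is None:
        return None
    fields = m.groupdict()
    for key in ("mc", "grain", "row", "channel"):
        fields[key] = int(fields[key])        # decimal fields
    for key in ("page", "offset", "syndrome"):
        fields[key] = int(fields[key], 16)    # hex fields, "0x" already stripped
    return fields


sample = ('EDAC MC1: CE page 0xfbf6f, offset 0x4d0, grain 8, syndrome 0xc8f4, '
          'row 1, channel 0, label "": k8_edac')
info = parse_ce_line(sample)
print(info["mc"], info["row"], info["channel"])   # prints: 1 1 0
```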

ECC ChipKill x4 was what found the error and corrected it.

The FRU (field replaceable unit) is the DIMM located at socket
H1_DIMM1A, according to the labeling I mentioned above (H1 rather than
H0 because this is MC1).

Caveat: the detector is not 100% perfect, but it gives a general area
to look at, namely the DIMM. Sometimes other errors can cause what looks
like a memory error, but a bad DIMM is the root cause of the vast
majority of such errors.

In addition, memtest86+ doesn't find bad memory in all cases, but it is
still a VERY useful tool.

doug thompson
