2011-02-08 13:38:13

by martin f krafft

Subject: Opteron ECC/ChipKill error

Dear list,

I just got to see the following message on my Opteron server:

kernel: [810137.744689] Northbridge Error, node 1
kernel: [810137.756250] ECC/ChipKill ECC error.
kernel: [810137.766975] EDAC amd64 MC1: CE ERROR_ADDRESS= 0x26bdd40f0
kernel: [810137.766982] EDAC MC1: CE page 0x26bdd4, offset 0xf0, grain 0, syndrome 0xe1e2, row 6, channel 1, label "": amd64_edac

Is there any way to deduce from these data the actual
culprit/component to replace?

Thanks,

--
martin | http://madduck.net/ | http://two.sentenc.es/

"a cigarette is the perfect type of pleasure.
it is exquisite, and it leaves one unsatisfied."
-- oscar wilde

spamtraps: [email protected]



2011-02-08 13:49:21

by Borislav Petkov

Subject: Re: Opteron ECC/ChipKill error

On Tue, Feb 08, 2011 at 02:30:11PM +0100, martin f krafft wrote:
> Dear list,
>
> I just got to see the following message on my Opteron server:
>
> kernel: [810137.744689] Northbridge Error, node 1
> kernel: [810137.756250] ECC/ChipKill ECC error.
> kernel: [810137.766975] EDAC amd64 MC1: CE ERROR_ADDRESS= 0x26bdd40f0
> kernel: [810137.766982] EDAC MC1: CE page 0x26bdd4, offset 0xf0, grain 0, syndrome 0xe1e2, row 6, channel 1, label "": amd64_edac
>
> Is there any way to deduce from these data the actual
> culprit/component to replace?

It is a DRAM ECC error on one of the DIMMs on your node 1. If it is a
single occurrence I wouldn't start to worry yet - I'd monitor to see
whether the same row above (row 6) starts increasing its error rate.
Also, sometimes reseating the DIMMs helps.
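
To make the "monitor row 6" advice concrete, here is a minimal sketch that tallies corrected errors per (MC, row, channel) from kernel log lines in the format quoted above. The names `parse_ce` and `tally` are made up for illustration, and the address reconstruction assumes 4 KiB pages; it is not an existing tool.

```python
import re
from collections import Counter

# Matches the summary line the EDAC core prints for a corrected error (CE),
# in the format quoted in this thread.
CE_LINE = re.compile(
    r"EDAC MC(?P<mc>\d+): CE page 0x(?P<page>[0-9a-f]+), "
    r"offset 0x(?P<offset>[0-9a-f]+), .*?"
    r"row (?P<row>\d+), channel (?P<channel>\d+)"
)

PAGE_SHIFT = 12  # assumption: 4 KiB pages


def parse_ce(line):
    """Return the CE fields of one log line, or None if it is not a CE line."""
    m = CE_LINE.search(line)
    if m is None:
        return None
    page = int(m.group("page"), 16)
    offset = int(m.group("offset"), 16)
    return {
        "mc": int(m.group("mc")),
        "row": int(m.group("row")),
        "channel": int(m.group("channel")),
        # page and offset together give back the reported ERROR_ADDRESS
        "address": (page << PAGE_SHIFT) | offset,
    }


def tally(lines):
    """Count corrected errors per (mc, row, channel)."""
    counts = Counter()
    for line in lines:
        event = parse_ce(line)
        if event is not None:
            counts[(event["mc"], event["row"], event["channel"])] += 1
    return counts
```

Feeding it the kernel log (for example `tally(open("/var/log/kern.log"))`, where the path is an assumption about the local syslog setup) shows whether one row keeps reappearing.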

Can you send your dmesg please?

Thanks.

--
Regards/Gruss,
Boris.

Advanced Micro Devices GmbH
Einsteinring 24, 85609 Dornach
General Managers: Alberto Bozzo, Andrew Bowd
Registration: Dornach, Gemeinde Aschheim, Landkreis Muenchen
Registergericht Muenchen, HRB Nr. 43632

2011-02-08 14:00:05

by martin f krafft

Subject: Re: Opteron ECC/ChipKill error

also sprach Borislav Petkov <[email protected]> [2011.02.08.1449 +0100]:
> It is a DRAM ECC error on one of the DIMMs on your node 1. If it is a
> single occurrence I wouldn't start to worry yet - I'd monitor to see
> whether the same row above (row 6) starts increasing its error rate.
> Also, sometimes reseating the DIMMs helps.

Thanks. I really hope this won't happen again as I really don't want
to go to the hosting place and open the server. ;)

> Can you send your dmesg please?

Don't want to spam the list, so:
http://scratch.madduck.net/__tmp__dmesg.gz

--
martin | http://madduck.net/ | http://two.sentenc.es/

"women, when they are not in love,
have all the cold blood of an experienced attorney."
-- honoré de balzac

spamtraps: [email protected]



2011-02-08 14:24:41

by Borislav Petkov

Subject: Re: Opteron ECC/ChipKill error

On Tue, Feb 08, 2011 at 08:59:56AM -0500, martin f krafft wrote:
> also sprach Borislav Petkov <[email protected]> [2011.02.08.1449 +0100]:
> > It is a DRAM ECC error on one of the DIMMs on your node 1. If it is a
> > single occurrence I wouldn't start to worry yet - I'd monitor to see
> > whether the same row above (row 6) starts increasing its error rate.
> > Also, sometimes reseating the DIMMs helps.
>
> Thanks. I really hope this won't happen again as I really don't want
> to go to the hosting place and open the server. ;)

Yeah, well, keep your fingers crossed. Just to reiterate, getting ECCs
is not a problem per se - they may appear even during normal operation
and in this case get corrected just fine by the memory controller. Only
an increase in the error rate may hint at a failing DRAM device, so if
the error starts repeating you might start thinking about when the
downtime to replace the failing DIMM would hurt least.

Cough.. IMHO. :)
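
One way to watch for such an increase without grepping logs is to snapshot the per-csrow corrected-error counters that EDAC exposes in sysfs. A rough sketch, assuming the legacy `mcN/csrowM/ce_count` layout under `/sys/devices/system/edac/mc` is present on the running kernel; the function names are hypothetical.

```python
from pathlib import Path

# Assumption: the legacy per-csrow EDAC sysfs layout is available here.
EDAC_ROOT = Path("/sys/devices/system/edac/mc")


def read_ce_counts(root=EDAC_ROOT):
    """Return {(mc, csrow): corrected-error count} from the EDAC sysfs tree."""
    counts = {}
    for counter in Path(root).glob("mc*/csrow*/ce_count"):
        mc = int(counter.parent.parent.name[len("mc"):])
        csrow = int(counter.parent.name[len("csrow"):])
        counts[(mc, csrow)] = int(counter.read_text().strip())
    return counts


def increased(before, after):
    """Return the rows whose count grew between two snapshots, with the delta."""
    return {
        key: after[key] - before.get(key, 0)
        for key in after
        if after[key] > before.get(key, 0)
    }
```

Taking a snapshot from cron once a day and comparing it with the previous one via `increased()` is enough to see whether (1, 6) keeps climbing.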

> > Can you send your dmesg please?
>
> Don't want to spam the list, so:
> http://scratch.madduck.net/__tmp__dmesg.gz

Ah ok, this is a .32 kernel and it doesn't have the information I was
looking for. I've changed that in later kernels so that EDAC dumps the
DRAM chip selects placement on the memory controller. Here's an example:


[ 15.256809] EDAC MC: DCT0 chip selects:
[ 15.261007] EDAC amd64: MC: 0: 2048MB 1: 2048MB
[ 15.266073] EDAC amd64: MC: 2: 2048MB 3: 2048MB
[ 15.271140] EDAC amd64: MC: 4: 0MB 5: 0MB
[ 15.276207] EDAC amd64: MC: 6: 0MB 7: 0MB
[ 15.291246] EDAC MC: DCT1 chip selects:
[ 15.295443] EDAC amd64: MC: 0: 2048MB 1: 2048MB
[ 15.300511] EDAC amd64: MC: 2: 2048MB 3: 2048MB
[ 15.305579] EDAC amd64: MC: 4: 0MB 5: 0MB
[ 15.310647] EDAC amd64: MC: 6: 0MB 7: 0MB
[ 15.315711] EDAC amd64: using x8 syndromes.

and from this I can see that I have 4 DIMMs on the node, 2 per channel,
and each DIMM is 4G (dual-ranked). The dual-ranked part I know from the
DIMM type.
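
The reading of the dump above can be sketched in a few lines: parse the chip-select sizes per DCT, then add up consecutive chip selects into DIMMs. The pairing of two chip selects per DIMM is the dual-rank assumption stated above and does not come from the log itself; `parse_chip_selects` and `dimms` are illustrative names.

```python
import re


def parse_chip_selects(lines):
    """Return {dct: {chip_select: size_mb}} from the EDAC chip-select dump."""
    table, dct = {}, None
    for line in lines:
        header = re.search(r"DCT(\d+) chip selects", line)
        if header:
            dct = int(header.group(1))
            table[dct] = {}
        elif dct is not None:
            # each data line carries "<chip select>: <size>MB" pairs
            for cs, size in re.findall(r"(\d+):\s*(\d+)MB", line):
                table[dct][int(cs)] = int(size)
    return table


def dimms(table, ranks_per_dimm=2):
    """Group chip selects into populated DIMMs: {dct: {slot: size_mb}}.

    Assumes each DIMM occupies `ranks_per_dimm` consecutive chip selects.
    """
    result = {}
    for dct, selects in table.items():
        sizes = {}
        for cs, mb in selects.items():
            slot = cs // ranks_per_dimm
            sizes[slot] = sizes.get(slot, 0) + mb
        result[dct] = {slot: mb for slot, mb in sizes.items() if mb}
    return result
```

Run on the example dump, this yields two 4096 MB DIMMs on each DCT, i.e. the 4 x 4G layout described above.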

In your case, the ECC error comes from chip select 6, which should mean
the last DIMM on the node on the second channel. You have to look at the
silkscreen labels on the board to pinpoint which DIMM it is, or search
through the board layout manuals. (I know, this should be easier, I
know...)
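
As a back-of-the-envelope check of that mapping, assuming 8 chip selects per channel and dual-rank DIMMs (both are assumptions about the board, and `locate` is a made-up helper):

```python
def locate(csrow, channel, ranks_per_dimm=2, selects_per_channel=8):
    """Map an EDAC (row, channel) pair to a zero-based DIMM slot on that channel.

    Assumes each DIMM occupies `ranks_per_dimm` consecutive chip selects;
    single-rank or quad-rank DIMMs would map differently.
    """
    slot = csrow // ranks_per_dimm
    slots = selects_per_channel // ranks_per_dimm
    return {"channel": channel, "slot": slot, "is_last_slot": slot == slots - 1}
```

For the reported error, `locate(6, 1)` gives slot 3 of 4 on channel 1, i.e. the last slot on the second channel. Which physical socket that is still has to come from the silkscreen or the board manual.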

Btw, you should be able to get the above output if you enable
CONFIG_EDAC_DEBUG or upgrade your kernel.

HTH.

--
Regards/Gruss,
Boris.

Advanced Micro Devices GmbH
Einsteinring 24, 85609 Dornach
General Managers: Alberto Bozzo, Andrew Bowd
Registration: Dornach, Gemeinde Aschheim, Landkreis Muenchen
Registergericht Muenchen, HRB Nr. 43632

2011-02-08 17:47:33

by martin f krafft

Subject: Re: Opteron ECC/ChipKill error

also sprach Borislav Petkov <[email protected]> [2011.02.08.1524 +0100]:
> Yeah, well, keep your fingers crossed. Just to reiterate, getting ECCs
> is not a problem per se - they may appear even during normal operation
> and in this case get corrected just fine by the memory controller.

That's what I thought. Many thanks for confirming it.

> > Don't want to spam the list, so:
> > http://scratch.madduck.net/__tmp__dmesg.gz
>
> Ah ok, this is a .32 kernel and it doesn't have the information I was
> looking for. I've changed that in later kernels so that EDAC dumps the
> DRAM chip selects placement on the memory controller.

Excellent to see you are working to improve this. If the problems
increase, then I shall either turn on CONFIG_EDAC_DEBUG or upgrade
to 2.6.38.

Thank you for your help!

--
martin | http://madduck.net/ | http://two.sentenc.es/

wind catches lily,
scattering petals to the ground.
segmentation fault.

spamtraps: [email protected]

