2006-01-12 23:37:34

by don fisher

Subject: machine check errors

MCE 0
CPU 2 4 northbridge TSC 967b1992c66
ADDR 2a52cb5f0
Northbridge Chipkill ECC error
Chipkill ECC syndrome = 40b9
bit32 = err cpu0
bit46 = corrected ecc error
bit62 = error overflow (multiple errors)
bus error 'local node origin, request didn't time out
generic read mem transaction
memory access, level generic'
STATUS d45cc00140080813 MCGSTATUS 0
MCE 1
CPU 2 4 northbridge TSC a101bbf7338
ADDR 2922df698
Northbridge Chipkill ECC error
Chipkill ECC syndrome = 6051
bit46 = corrected ecc error
bit62 = error overflow (multiple errors)
bus error 'local node response, request didn't time out
generic read mem transaction
memory access, level generic'
STATUS d428c00060080a13 MCGSTATUS 0
MCE 2
CPU 2 4 northbridge TSC ab885e5bdbe
ADDR 2922df698
Northbridge Chipkill ECC error
Chipkill ECC syndrome = 6051
bit46 = corrected ecc error
bit62 = error overflow (multiple errors)
bus error 'local node response, request didn't time out
generic read mem transaction
memory access, level generic'
STATUS d428c00060080a13 MCGSTATUS 0
MCE 0
CPU 2 4 northbridge TSC b60f17bf394
ADDR 2918d98d0
Northbridge Chipkill ECC error
Chipkill ECC syndrome = 6051
bit46 = corrected ecc error
bit62 = error overflow (multiple errors)
bus error 'local node response, request didn't time out
generic read mem transaction
memory access, level generic'
STATUS d428c00060080a13 MCGSTATUS 0
MCE 1
CPU 2 4 northbridge TSC c095ba23a7e
ADDR 2918cdff0
Northbridge Chipkill ECC error
Chipkill ECC syndrome = 40b9
bit46 = corrected ecc error
bit62 = error overflow (multiple errors)
bus error 'local node response, request didn't time out
generic read mem transaction
memory access, level generic'
STATUS d45cc00040080a13 MCGSTATUS 0
MCE 2
CPU 2 4 northbridge TSC cb1c7387269
ADDR 2bf0cfa50
Northbridge Chipkill ECC error
Chipkill ECC syndrome = 6051
bit46 = corrected ecc error
bit62 = error overflow (multiple errors)
bus error 'local node response, request didn't time out
generic read mem transaction
memory access, level generic'
STATUS d428c00060080a13 MCGSTATUS 0
MCE 3
CPU 2 4 northbridge TSC d5a315f1a34
ADDR 2900df990
Northbridge Chipkill ECC error
Chipkill ECC syndrome = 20e8
bit46 = corrected ecc error
bit62 = error overflow (multiple errors)
bus error 'local node response, request didn't time out
generic read mem transaction
memory access, level generic'
STATUS d474400020080a13 MCGSTATUS 0
MCE 4
CPU 2 4 northbridge TSC e029b8504a6
ADDR 2900dd030
Northbridge Chipkill ECC error
Chipkill ECC syndrome = 6051
bit46 = corrected ecc error
bit62 = error overflow (multiple errors)
bus error 'local node response, request didn't time out
generic read mem transaction
memory access, level generic'
STATUS d428c00060080a13 MCGSTATUS 0
MCE 5
CPU 2 4 northbridge TSC eab071b5316
ADDR 291ac9d98
Northbridge Chipkill ECC error
Chipkill ECC syndrome = 6051
bit46 = corrected ecc error
bit62 = error overflow (multiple errors)
bus error 'local node response, request didn't time out
generic read mem transaction
memory access, level generic'
STATUS d428c00060080a13 MCGSTATUS 0
MCE 6
CPU 2 4 northbridge TSC f537141ab1c
ADDR 2918dfe78
Northbridge Chipkill ECC error
Chipkill ECC syndrome = 40b9
bit33 = err cpu1
bit46 = corrected ecc error
bit62 = error overflow (multiple errors)
bus error 'local node origin, request didn't time out
generic read mem transaction
memory access, level generic'
STATUS d45cc00240080813 MCGSTATUS 0
MCE 7
CPU 2 4 northbridge TSC ffbdb67fd26
ADDR 2beac9010
Northbridge Chipkill ECC error
Chipkill ECC syndrome = 40b9
bit46 = corrected ecc error
bit62 = error overflow (multiple errors)
bus error 'local node response, request didn't time out
generic read mem transaction
memory access, level generic'
STATUS d45cc00040080a13 MCGSTATUS 0
MCE 0
CPU 2 4 northbridge TSC 10a446fe04fa
ADDR 2cfcc9870
Northbridge Chipkill ECC error
Chipkill ECC syndrome = 40b9
bit32 = err cpu0
bit46 = corrected ecc error
bit62 = error overflow (multiple errors)
bus error 'local node origin, request didn't time out
generic read mem transaction
memory access, level generic'
STATUS d45cc00140080813 MCGSTATUS 0
MCE 1
CPU 2 4 northbridge TSC 114cb12451a2
ADDR 291ac9630
Northbridge Chipkill ECC error
Chipkill ECC syndrome = 6051
bit33 = err cpu1
bit46 = corrected ecc error
bit62 = error overflow (multiple errors)
bus error 'local node origin, request didn't time out
generic read mem transaction
memory access, level generic'
STATUS d428c00260080813 MCGSTATUS 0
MCE 2
CPU 2 4 northbridge TSC 11f51cba9b82
ADDR 2c10cb010
Northbridge Chipkill ECC error
Chipkill ECC syndrome = 20e8
bit32 = err cpu0
bit46 = corrected ecc error
bit62 = error overflow (multiple errors)
bus error 'local node origin, request didn't time out
generic read mem transaction
memory access, level generic'
STATUS d474400120080813 MCGSTATUS 0
MCE 3
CPU 2 4 northbridge TSC 129d86e0d26a
ADDR 294ec9390
Northbridge Chipkill ECC error
Chipkill ECC syndrome = 6051
bit32 = err cpu0
bit46 = corrected ecc error
bit62 = error overflow (multiple errors)
bus error 'local node origin, request didn't time out
generic read mem transaction
memory access, level generic'
STATUS d428c00160080813 MCGSTATUS 0


Attachments:
mcelog (5.57 kB)

2006-01-13 04:38:56

by Robert Hancock

Subject: Re: machine check errors

don fisher wrote:
> I have a Tyan S2892 board with a pair of Opteron 288 dual-core CPUs and
> 16GB DRAM. I receive the errors shown in the attached file, mcelog. It
> appears that these occur when the free memory becomes small, there is a
> lot in the cache, and a lot of IO.
>
> The Tyan S2892 has an Nvidia Crush K8-04, which I think they call the
> southbridge. My errors appear to be related to the north bridge. There
> is an AMD 8131 PCI-X controller that runs the PCI slots. There is a
> 3WARE 9500-12 located in one of the PCI-X slots.
>
> I have run Memtest86+-1.65 for 24 hours without errors. I recently
> upgraded the BIOS to V2.00 without any remarkable changes.
>
> I am running 2.6.15 within a current Fedora Core4 configuration.
>
> I would appreciate any advice as to how to proceed. I have not noticed
> any adverse behavior from the MCEs. But that could be masked is data
> transferred or ???.
>
> Could there be any connection with the memory cache? Thanks in advance
> for your assistance.

I would say you likely do have some bad RAM; that seems to be what those
MCEs are indicating. Depending on the configuration, Memtest86 may not
find all the errors if they are being corrected by ECC.

--
Robert Hancock Saskatoon, SK, Canada
To email, remove "nospam" from [email protected]
Home Page: http://www.roberthancock.com/

2006-01-13 10:07:19

by Alan

Subject: Re: machine check errors

On Thu, 2006-01-12 at 16:37 -0700, don fisher wrote:
> CPU 2 4 northbridge TSC 967b1992c66
> ADDR 2a52cb5f0
> Northbridge Chipkill ECC error
> Chipkill ECC syndrome = 40b9

These are corrected ECC errors from memory. You've got bad memory, but
because it is ECC memory the hardware was able to recover from the failure.

Alan
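
As a cross-check of the reading above, here is a minimal Python sketch that
recomputes some of the decoded fields from the raw STATUS values in the log.
Bits 46 (corrected ECC error), 62 (error overflow) and 32/33 (err cpu0/cpu1)
are the ones mcelog itself prints. The syndrome layout (low byte in bits
54:47, high byte in bits 31:24) is an assumption about the K8 northbridge
status register; it does reproduce the syndromes shown in the log. The
function name decode_nb_status is made up for this example.

```python
def decode_nb_status(status: int) -> dict:
    """Decode a few fields of a northbridge MCE STATUS value.

    Assumption: syndrome bits 7:0 sit in STATUS bits 54:47 and
    syndrome bits 15:8 sit in STATUS bits 31:24.
    """
    syndrome_lo = (status >> 47) & 0xFF
    syndrome_hi = (status >> 24) & 0xFF
    return {
        "valid": bool((status >> 63) & 1),
        "overflow": bool((status >> 62) & 1),        # "error overflow"
        "uncorrected": bool((status >> 61) & 1),     # UC flag
        "corrected_ecc": bool((status >> 46) & 1),   # "corrected ecc error"
        "err_cpu0": bool((status >> 32) & 1),
        "err_cpu1": bool((status >> 33) & 1),
        "syndrome": (syndrome_hi << 8) | syndrome_lo,
    }


if __name__ == "__main__":
    # STATUS values copied from the log at the top of the thread.
    for raw in ("d45cc00140080813", "d428c00060080a13", "d474400020080a13"):
        fields = decode_nb_status(int(raw, 16))
        print(raw,
              f"syndrome={fields['syndrome']:04x}",
              f"corrected={fields['corrected_ecc']}",
              f"uncorrected={fields['uncorrected']}")
```

For every STATUS value in the posted log, the corrected flag is set and the
uncorrected flag is clear, which matches the "corrected ecc error" lines.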

2006-01-13 15:21:14

by Roger Heflin

Subject: RE: machine check errors



> -----Original Message-----
> From: [email protected]
> [mailto:[email protected]] On Behalf Of don fisher
> Sent: Thursday, January 12, 2006 5:37 PM
> To: [email protected]
> Subject: machine check errors
>
> I have a Tyan S2892 board with a pair of Opteron 288 dual-core
> CPUs and 16GB DRAM. I receive the errors shown in the
> attached file, mcelog. It appears that these occur when the
> free memory becomes small, there is a lot in the cache, and a
> lot of IO.

You probably mean Opteron 285s or Opteron 280s.

>
> The Tyan S2892 has an Nvidia Crush K8-04, which I think they
> call the southbridge. My errors appear to be related to the
> north bridge. There is an AMD 8131 PCI-X controller that runs
> the PCI slots. There is a 3WARE 9500-12 located in one of the
> PCI-X slots.
>
> I have run Memtest86+-1.65 for 24 hours without errors. I
> recently upgraded the BIOS to V2.00 without any remarkable changes.

Does Memtest86+ support reading ECC errors on that motherboard?
If it does not, Memtest won't tell you anything: the hardware
ECC will correct the errors and Memtest will not find them. If
that version of Memtest is ECC-aware, it will register an ECC error.

>
> I am running 2.6.15 within a current Fedora Core4 configuration.
>
> I would appreciate any advice as to how to proceed. I have
> not noticed any adverse behavior from the MCEs. But that
> could be masked is data transferred or ???.

Download edac/bluesmoke from SourceForge, then compile and install
it. It monitors ECC errors from Linux and should tell you
if you are getting them.

Certain other Linux distributions would not report the MCEs,
because they lack the mcelog program, but the errors
would still have been there.

>
> Could there be any connection with the memory cache? Thanks
> in advance for your assistance.
>
> don

Non-fatal MCEs are usually ECC faults and *USUALLY* track back
to bad memory, though the cause can also be an overheating CPU, a
problematic CPU, or, rarely, the motherboard.

ECC/MCE counts get worse under load; unless the problem is
really severe, you won't see them at idle.

Roger
Atipa Technologies
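
One way to follow up on the "bad memory" theory is to tally the mcelog
output by syndrome and address. The posted log contains only three distinct
syndrome values (40b9, 6051, 20e8). Reading a small set of recurring
syndromes as a sign of one failing part, rather than random noise, is an
interpretation and not something the log states. The sketch below is a
minimal Python example that assumes the text format shown in the posted
mcelog. SAMPLE is three records trimmed from that log, and the name tally is
made up for this example.

```python
import re
from collections import Counter

# Patterns match the mcelog text format shown at the top of the thread.
SYNDROME_RE = re.compile(r"Chipkill ECC syndrome = ([0-9a-f]+)")
ADDR_RE = re.compile(r"^ADDR ([0-9a-f]+)", re.MULTILINE)

# Three records from the posted log, trimmed to the relevant lines.
SAMPLE = """\
MCE 0
CPU 2 4 northbridge TSC 967b1992c66
ADDR 2a52cb5f0
Northbridge Chipkill ECC error
Chipkill ECC syndrome = 40b9
MCE 1
CPU 2 4 northbridge TSC a101bbf7338
ADDR 2922df698
Northbridge Chipkill ECC error
Chipkill ECC syndrome = 6051
MCE 2
CPU 2 4 northbridge TSC ab885e5bdbe
ADDR 2922df698
Northbridge Chipkill ECC error
Chipkill ECC syndrome = 6051
"""


def tally(log_text: str):
    """Return (syndrome counts, address counts) for an mcelog text dump."""
    syndromes = Counter(SYNDROME_RE.findall(log_text))
    addresses = Counter(ADDR_RE.findall(log_text))
    return syndromes, addresses


if __name__ == "__main__":
    syndromes, addresses = tally(SAMPLE)
    for value, count in syndromes.most_common():
        print(f"syndrome {value}: {count}")
    for value, count in addresses.most_common():
        print(f"addr {value}: {count}")
```

Running the same tally on logs captured at idle and under heavy I/O would
show whether the counts rise with load, as described above.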

2006-01-14 00:28:47

by Doug Thompson

Subject: RE: machine check errors

>
> Download edac/bluesmoke from sourceforge and compile and install
> it, this will monitor ecc errors from linux, and should tell you
> if you are getting ecc errors.
>

Download bluesmoke from bluesmoke.sourceforge.net for now, as EDAC does not have the
Opteron driver yet. It is in the queue.

(EDAC is the name for the new module that will be in the kernel shortly. Bluesmoke is the legacy name for
EDAC.)

doug t




"If you think Education is expensive, just try Ignorance"

"Don't tell people HOW to do things, tell them WHAT you
want and they will surprise you with their ingenuity."
Gen George Patton
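
For readers on a kernel that already ships EDAC with a driver for their
memory controller, the per-controller error counters can be read from sysfs.
The sketch below is a minimal Python example. It assumes the counters are
exposed as ce_count and ue_count under /sys/devices/system/edac/mc/mcN,
which may not hold for the bluesmoke releases discussed in this thread. The
function name read_edac_counts is made up for this example.

```python
from pathlib import Path


def read_edac_counts(root="/sys/devices/system/edac/mc"):
    """Return {controller: {"corrected": n, "uncorrected": n}}.

    Assumption: each memory controller directory (mc0, mc1, ...)
    contains plain-text ce_count and ue_count files.
    """
    counts = {}
    for mc in sorted(Path(root).glob("mc[0-9]*")):
        try:
            corrected = int((mc / "ce_count").read_text().strip())
            uncorrected = int((mc / "ue_count").read_text().strip())
        except (OSError, ValueError):
            continue  # skip controllers whose counters cannot be read
        counts[mc.name] = {"corrected": corrected, "uncorrected": uncorrected}
    return counts


if __name__ == "__main__":
    counts = read_edac_counts()
    if not counts:
        print("no EDAC memory controllers found")
    for name, c in counts.items():
        print(f"{name}: corrected={c['corrected']} uncorrected={c['uncorrected']}")
```

A corrected count that keeps climbing under load, with the uncorrected count
staying at zero, would be consistent with the corrected ECC errors reported
by mcelog earlier in the thread.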