2006-05-16 08:55:50

by Andy Chittenden

Subject: EDAC MC0: UE page 0x1fffa, offset 0x0, grain 4096, row 0, labels ":": i82875p UE

Every one of our ASUS P4C800-E and ASUS P4C800 based machines that I've
installed a 2.6.16 smp based kernel on is logging messages of the form:

EDAC MC0: UE page 0x1fffa, offset 0x0, grain 4096, row 0, labels ":":
i82875p UE

every second or so. So I've downgraded them back to 2.6.15. I believe
the message is moaning that the ECC memory has unrecoverable errors.
However, the memory in the machines I've tried passes memtest. And
I'd've expected system hangs which we don't get.

So what's wrong?

--
Andy, BlueArc Engineering
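
For readers who want to pull the fields out of a message like the one above, here is a minimal sketch in Python. It assumes the whole message sits on one log line (the mail wrapped it), that "page" is a page frame number, and that pages are 4 KiB, which matches the reported grain of 4096. The names parse_edac_ue and PAGE_SIZE are made up for this illustration and are not part of the kernel or of any EDAC tool.

```python
import re

# Matches only the UE (uncorrected error) line format quoted in this thread.
EDAC_UE = re.compile(
    r"EDAC MC(?P<mc>\d+): UE page (?P<page>0x[0-9a-fA-F]+), "
    r"offset (?P<offset>0x[0-9a-fA-F]+), grain (?P<grain>\d+), "
    r"row (?P<row>\d+)"
)

PAGE_SIZE = 4096  # assumption: 4 KiB pages


def parse_edac_ue(line):
    """Return the fields of an EDAC UE log line as a dict, or None if no match."""
    m = EDAC_UE.search(line)
    if m is None:
        return None
    page = int(m.group("page"), 16)
    offset = int(m.group("offset"), 16)
    return {
        "mc": int(m.group("mc")),
        "page": page,
        "offset": offset,
        "grain": int(m.group("grain")),
        "row": int(m.group("row")),
        # Approximate physical address, under the page-size assumption above.
        "phys_addr": page * PAGE_SIZE + offset,
    }


if __name__ == "__main__":
    sample = ('EDAC MC0: UE page 0x1fffa, offset 0x0, grain 4096, row 0, '
              'labels ":": i82875p UE')
    info = parse_edac_ue(sample)
    print(hex(info["phys_addr"]))
```

Under those assumptions, the reported page 0x1fffa corresponds to an address of 0x1fffa000, a few pages below the 512 MiB boundary (0x20000000).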


2006-05-16 12:30:55

by Andy Chittenden

Subject: RE: EDAC MC0: UE page 0x1fffa, offset 0x0, grain 4096, row 0, labels ":": i82875p UE

> Every one of our ASUS P4C800-E and ASUS P4C800 based machines
> that I've installed a 2.6.16 smp based kernel on is logging
> messages of the form:
>
> EDAC MC0: UE page 0x1fffa, offset 0x0, grain 4096, row 0,
> labels ":": i82875p UE
>
> every second or so. So I've downgraded them back to 2.6.15. I
> believe the message is moaning that the ECC memory has
> unrecoverable errors. However, the memory in the machines
> I've tried passes memtest. And I'd've expected system hangs
> which we don't get.

Well, memtest passes if I don't enable ECC in memtest. However, if I do,
it fails. So it looks like we've got a memory/memory controller issue
(the fact that it's happening on all machines with these motherboards
implies to me that it's a controller/BIOS issue rather than a memory
issue). If I change the AGP aperture in the BIOS to 256MB (from 64MB),
memtest with ECC enabled passes, but Linux then boots extremely
slowly. So is this also going to be a motherboard/BIOS issue?

--
Andy, BlueArc Engineering

2006-05-28 23:09:11

by Daniel Roesen

Subject: Re: EDAC MC0: UE page 0x1fffa, offset 0x0, grain 4096, row 0, labels ":": i82875p UE

On Tue, May 16, 2006 at 09:55:47AM +0100, Andy Chittenden wrote:
> Every one of our ASUS P4C800-E and ASUS P4C800 based machines that I've
> installed a 2.6.16 smp based kernel on is logging messages of the form:
>
> EDAC MC0: UE page 0x1fffa, offset 0x0, grain 4096, row 0, labels ":":
> i82875p UE
>
> every second or so. So I've downgraded them back to 2.6.15. I believe
> the message is moaning that the ECC memory has unrecoverable errors.
> However, the memory in the machines I've tried passes memtest. And
> I'd've expected system hangs which we don't get.

I'm experiencing the same problem, which sorta keeps me from
using an up-to-date kernel.

Also reported to Fedora:
https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=191506

rmmod'ing the EDAC driver makes those messages go away, but that's
sticking my head in the sand.

Are errors actually happening when those messages pop up? The system
has been running rock-solid for ages. It's only with the new kernels and
their EDAC module that those errors pop up...


Best regards,
Daniel

--
CLUE-RIPE -- Jabber: [email protected] -- dr@IRCnet -- PGP: 0xA85C8AA0
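
One way to check whether the driver keeps recording errors, rather than just watching the log, is to read the per-controller counters that EDAC exposes through sysfs. The sketch below is only an illustration: the directory /sys/devices/system/edac/mc/mc0 and the file names ce_count and ue_count are assumptions about the sysfs layout and may differ between kernel versions, and read_edac_counts is a made-up helper name.

```python
from pathlib import Path


def read_edac_counts(mc_dir):
    """Read corrected/uncorrected error counters from one EDAC controller dir.

    A counter is reported as None if its file is missing or unreadable.
    """
    counts = {}
    for name in ("ce_count", "ue_count"):
        path = Path(mc_dir) / name
        try:
            counts[name] = int(path.read_text().strip())
        except (OSError, ValueError):
            counts[name] = None
    return counts


if __name__ == "__main__":
    # Assumed location of memory controller 0; adjust for your kernel.
    print(read_edac_counts("/sys/devices/system/edac/mc/mc0"))
```

Sampling this twice a few seconds apart would show whether ue_count is still climbing at the same rate as the log messages.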