LinuxLists.cc - MSI K8D-Master

2003-08-04 01:05:14

Subject: MSI K8D-Master - GART error 3

Hi lists,

(please cc me on any replies as I am not currently subscribed to lkml)

I have recently installed Red Hat Enterprise Linux 2.9.5 Beta (Taroon)
x86-64 on an MSI K8D-Master (MSI-9131) motherboard with Dual Opteron 240
processors.

While the system is running, every 30 seconds I get the following on
system console and in /var/log/messages:

Aug 4 12:52:41 terra kernel: Northbridge status 9405c00000000a13
Aug 4 12:52:41 terra kernel: GART error 3
Aug 4 12:52:41 terra kernel: Lost an northbridge error
Aug 4 12:52:41 terra kernel: NB error address 00000000002e0310
Aug 4 12:53:11 terra kernel: Northbridge status 9405c00000000a13
Aug 4 12:53:11 terra kernel: GART error 3
Aug 4 12:53:11 terra kernel: Lost an northbridge error
Aug 4 12:53:11 terra kernel: NB error address 0000000004432320

and so forth. This also occurred under SuSE Linux 8.2 Beta x86_64 and it
even occurs while running the Red Hat installer (isolinux).

Otherwise the system seems to run fine. Can anyone shed some light on
what this means, and how concerned should I be? Is it fixable?

I thought GART referred to the AGP aperture - this system doesn't
actually have an AGP port, could that be the cause of this? (It has an
onboard ATI Rage XL chip)

# uname -a
Linux terra 2.4.21-1.1931.2.349.2.2.entsmp #1 SMP Fri Jul 18 00:06:19
EDT 2003 x86_64 x86_64 x86_64 GNU/Linux

The system also has an Adaptec 2120S scsi raid card.

cheers,

-Simon
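
A rough way to confirm the 30-second spacing from the log itself is to
diff the timestamps of the "Northbridge status" lines. This is only a
sketch: the helper names (status_times, intervals) are made up for
illustration, and the year is an assumption because syslog lines do not
carry one.

```python
from datetime import datetime

# Two reports copied from the log excerpt above (other lines omitted).
LOG = """\
Aug 4 12:52:41 terra kernel: Northbridge status 9405c00000000a13
Aug 4 12:52:41 terra kernel: GART error 3
Aug 4 12:53:11 terra kernel: Northbridge status 9405c00000000a13
Aug 4 12:53:11 terra kernel: GART error 3
"""


def status_times(log_text, year=2003):
    """Return the timestamp of every 'Northbridge status' line."""
    times = []
    for line in log_text.splitlines():
        if "Northbridge status" not in line:
            continue
        month, day, clock = line.split()[:3]
        stamp = f"{year} {month} {day} {clock}"
        times.append(datetime.strptime(stamp, "%Y %b %d %H:%M:%S"))
    return times


def intervals(times):
    """Seconds between consecutive timestamps."""
    return [(b - a).total_seconds() for a, b in zip(times, times[1:])]


print(intervals(status_times(LOG)))
```

A constant gap points at a periodic reporter rather than at the moments
the errors actually occur.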

2003-08-05 00:12:17

by Andi Kleen

Subject: Re: MSI K8D-Master - GART error 3

"Simon Garner" <[email protected]> writes:

> Aug 4 12:52:41 terra kernel: Northbridge status 9405c00000000a13
> Aug 4 12:52:41 terra kernel: GART error 3

There is nothing in any of my trees that generates such a message.
If it was GART related it would be either "GART TLB error ..." or
"extended error gart error". But even that should not happen anymore,
see below.

I don't know what the Red Hat kernel does; they may have changed the MCE
handler relative to the reference port.

> The system also has an Adaptec 2120S scsi raid card.

Probably the driver is doing something bad with the pci_dma API
(which uses the GART on x86-64).

You can always disable it with mce=off or, better, mce=0,
as the message seems to be caused by the periodic non-fatal MCE check timer.

However, there was a bug in the MCE handler where it managed to turn on
a GART-related MCE event through the backdoor that doesn't work
correctly and is sometimes raised spuriously. But at least in the SuSE
beta9 kernel or recent x86-64.org kernels this should have been
fixed. But it doesn't generate such an error message anyway,
so it's hard to know what the exact cause is.

I would suggest retrying with a recent x86-64.org CVS kernel to see
if it still happens there.

-Andi

2003-08-05 00:45:09

by Simon Garner

Subject: Re: MSI K8D-Master - GART error 3

Andi Kleen wrote:

> There is nothing in any of my trees that generates such a message.
> If it was GART related it would be either "GART TLB error ..." or
> "extended error gart error". But even that should not happen anymore,
> see below.
>
> I don't know what the RedHat kernel does, they may have changed the
> MCE handler over the reference port.
>

A quick google brings up this reference:
http://www.iglu.org.il/lxr/source/arch/x86_64/kernel/bluesmoke.c

The error appears to be generated by the code starting around line 152
in that file.

Btw, what is 'bluesmoke'?

>> The system also has an Adaptec 2120S scsi raid card.
>
> Probably the driver is doing something bad with the pci_dma API
> (which uses the GART on x86-64)
>

I certainly had a lot of trouble with this card. I was pleased the
aacraid driver worked well enough to even let me install this time - the
Red Hat GinGin64 installer gave a kernel panic - so I wouldn't be surprised
if this card/driver were the cause. :(

> You can always disable it with mce=off or better mce=0
> as the message seems to be caused by the periodic non fatal MCE check
> timer.
>

What will I lose by disabling this?

I just tried booting with mce=0 and I am still getting the same errors.

> However there was a bug in the MCE handler where it managed to turn on
> a GART related MCE event through the backdoor that doesn't work
> correctly and is sometimes raised spuriously. But at least in the SuSE
> beta9 kernel or recent x86-64.org kernels this should have been
> fixed. But it doesn't generate such an error message anyways,
> so it's hard to know what the exact cause is.
>
> I would suggest to retry with a recent x86-64.org CVS kernel and see
> if it still happens there.
>

I will give that a go and see what happens.

Thanks for the response Andi.

-Simon

2003-08-05 13:42:48

by Andi Kleen

Subject: Re: MSI K8D-Master - GART error 3

On Tue, Aug 05, 2003 at 12:45:01PM +1200, Simon Garner wrote:
> Andi Kleen wrote:
>
> > There is nothing in any of my trees that generates such a message.
> > If it was GART related it would be either "GART TLB error ..." or
> > "extended error gart error". But even that should not happen anymore,
> > see below.
> >
> > I don't know what the RedHat kernel does, they may have changed the
> > MCE handler over the reference port.
> >
>
> A quick google brings up this reference:
> http://www.iglu.org.il/lxr/source/arch/x86_64/kernel/bluesmoke.c

OK, that's the very old MCE code that incorrectly enabled the northbridge
machine check. Don't use that, or use mce=off. However, I still think
it's a driver bug in your case. If it was the shaky GART MCE itself
you would get a panic, because it's an unrecoverable MCE. More
likely the driver is accessing PCI DMA mappings after they got unmapped,
which is a serious bug, but somehow not serious enough that the
northbridge triggers the MCE.

I was confused by your statement that the SuSE 8.2 beta9 kernel
generated that. It didn't because it doesn't contain that old code.

What does a modern kernel like the SuSE one or an x86-64.org kernel
generate exactly?

>
> The error appears to be generated by the code starting around line 152
> in that file.
>
> Btw, what is 'bluesmoke'?

Alan Cox's sense of humour. Look it up in the jargon file.

> > You can always disable it with mce=off or better mce=0
> > as the message seems to be caused by the periodic non fatal MCE check
> > timer.
> >
>
> What will I lose by disabling this?

mce=0 turns off periodic MCE checking for non-fatal errors.
That's not a big issue; the worst you lose is reporting of single-bit
corrected ECC memory failures.

mce=off turns off MCE reporting for fatal MCE exceptions (however,
your box may still crash when something really bad happens).

mce=0 should have turned off the periodic check, and your
message very much looks like a periodic one, as actual MCE
exceptions report more data. I'm a bit puzzled why it doesn't
kill the message here. You can try mce=off, but I'm not
sure it will help either.

Using a newer kernel is probably a good idea anyway, as there
have been many bugfixes since then.

-Andi

2003-08-10 22:45:06

by Simon Garner

Subject: Re: MSI K8D-Master - GART error 3

On Wednesday, August 06, 2003 1:42 AM [GMT+1200=NZT],
Andi Kleen wrote:
>
> Ok that's the very old MCE code that incorrectly enabled the
> northbridge machine check. Don't use that or use mce=off. However I
> still think it's a driver bug in your case. If it was the shaky GART
> MCE itself you would get a panic because it's an unrecoverable MCE.
> More likely the driver is accessing PCI DMA mappings after they got
> unmapped, which is a serious bug, but somehow not serious enough that
> the northbridge triggers the MCE.
>
> I was confused by your statement that the SuSE 8.2 beta9 kernel
> generated that. It didn't because it doesn't contain that old code.
>
> What does a modern kernel like the SuSE one or an x86-64.org kernel
> generate exactly?
>

I have reinstalled SuSE now, and I apologise as I was only partially
correct. I do get errors, but they are slightly different from RH. They
appear to be saying the same thing, though. Every 30 seconds I get:

Aug 11 10:37:06 terra kernel: Northbridge status 9405c00000000a13
Aug 11 10:37:06 terra kernel: ECC syndrome bits b
Aug 11 10:37:06 terra kernel: extended error ecc error
Aug 11 10:37:06 terra kernel: link number 0
Aug 11 10:37:06 terra kernel: corrected ecc error
Aug 11 10:37:06 terra kernel: error address valid
Aug 11 10:37:06 terra kernel: error enable
Aug 11 10:37:06 terra kernel: previous error lost
Aug 11 10:37:06 terra kernel: error address 00000000003e4710
Aug 11 10:37:36 terra kernel: Northbridge status 9405c00000000813
Aug 11 10:37:36 terra kernel: ECC syndrome bits b
Aug 11 10:37:36 terra kernel: extended error ecc error
Aug 11 10:37:36 terra kernel: link number 0
Aug 11 10:37:36 terra kernel: corrected ecc error
Aug 11 10:37:36 terra kernel: error address valid
Aug 11 10:37:36 terra kernel: error enable
Aug 11 10:37:36 terra kernel: previous error lost
Aug 11 10:37:36 terra kernel: error address 00000000003c4220

These suggest it's just reporting ECC corrections. Why would it do this
exactly every 30 seconds? (or is that just the reporting interval?)

# uname -a
Linux terra 2.4.19-SMP #1 SMP Wed Jun 25 21:37:18 UTC 2003 x86_64
unknown unknown GNU/Linux

thanks for the help,

-Simon
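
The flags printed above can be cross-checked by hand against the raw
status word. The sketch below is not the kernel's decoder: the function
name is made up, and the bit positions are an assumption based on the AMD
K8 northbridge machine-check status layout, so verify them against AMD's
documentation before relying on them. For the two status values in this
log they do reproduce what the kernel printed (syndrome b, link 0,
corrected ECC, address valid, error enable).

```python
def decode_nb_status(status):
    """Split a 64-bit northbridge status word into a few named fields.

    Bit positions are assumed from the AMD K8 layout, not taken from
    the kernel source.
    """
    return {
        "error_enable": bool((status >> 60) & 1),
        "address_valid": bool((status >> 58) & 1),
        "syndrome": (status >> 47) & 0xFF,         # low 8 syndrome bits
        "corrected_ecc": bool((status >> 46) & 1),
        "uncorrected_ecc": bool((status >> 45) & 1),
        "link": (status >> 36) & 0xF,
        "ext_error_code": (status >> 16) & 0xF,    # 0 is reported as "ecc error"
        "error_code": status & 0xFFFF,
    }


for raw in (0x9405C00000000A13, 0x9405C00000000813):
    fields = decode_nb_status(raw)
    print(f"{raw:016x}: syndrome={fields['syndrome']:x} "
          f"corrected={fields['corrected_ecc']} "
          f"uncorrected={fields['uncorrected_ecc']} "
          f"code={fields['error_code']:04x}")
```

Only the low 16-bit error code differs between the two reports; the ECC
fields are identical.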

2003-08-10 22:56:29

by Andi Kleen

Subject: Re: MSI K8D-Master - GART error 3

On Mon, Aug 11, 2003 at 10:43:57AM +1200, Simon Garner wrote:
> These suggest it's just reporting ECC corrections. Why would it do this

Yep. You have faulty DIMMs, consider replacing them.

> exactly every 30 seconds? (or is that just the reporting interval?)

The interval timer checking for "silent" MCEs runs every 30s.

You can change that by booting with mce=<number>; the check will then run
every <number> seconds. 0 should turn it off.

-Andi
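
The mce= settings described in this thread can be summarised as a small
model. This is only a sketch of the behaviour as described in these
messages, not the kernel's actual option parser: the function name and
dictionary keys are hypothetical, and treating mce=off as also stopping
the periodic check is an assumption.

```python
DEFAULT_POLL_SECONDS = 30  # the interval timer runs every 30s by default


def interpret_mce_option(value=None):
    """Model the effect of the mce= boot option as described in this thread.

    Returns a dict with:
      exceptions   - whether fatal MCE exceptions are reported
      poll_seconds - period of the non-fatal check, 0 meaning disabled
    """
    if value is None:
        # No option given: exceptions reported, periodic check every 30s.
        return {"exceptions": True, "poll_seconds": DEFAULT_POLL_SECONDS}
    if value == "off":
        # Assumption: "off" also stops the periodic check.
        return {"exceptions": False, "poll_seconds": 0}
    seconds = int(value)  # raises ValueError for anything non-numeric
    if seconds < 0:
        raise ValueError("interval must not be negative")
    return {"exceptions": True, "poll_seconds": seconds}


for opt in (None, "0", "120", "off"):
    print(opt, interpret_mce_option(opt))
```

With mce=0 only the reporting of corrected, non-fatal errors is lost; fatal
exceptions are still reported.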

2003-08-12 23:24:22

by Simon Garner

Subject: Re: MSI K8D-Master - GART error 3

On Monday, August 11, 2003 10:56 AM [GMT+1200=NZT],
Andi Kleen wrote:

> On Mon, Aug 11, 2003 at 10:43:57AM +1200, Simon Garner wrote:
>> These suggest it's just reporting ECC corrections. Why would it do
>> this
>
> Yep. You have faulty DIMMs, consider replacing them.
>

Well, I found that a little hard to stomach (since there are four DIMMs -
surely they couldn't all be faulty - and I had already been through a
whole other complete set with the same results, when the supplier sent
the wrong speed modules), but now that I knew the errors were
memory-related I did some more experimenting.

(Here is the memory population chart from the motherboard manual to help
make sense of this:
http://www.expio.co.nz/~sgarner/terra/msi9131memorypop.gif)

First I found that if I disabled ECC in the BIOS then the system
wouldn't even POST. But if I rearranged the modules so that they were in
single channel operation (using only three DIMMs in slots 2,4,6) then
the system would boot and I got no errors in SuSE (even after reenabling
ECC).

Then I tried using a different memory population layout, using all four
DIMMs as dual channel w/ ECC in slots 3,4,5,6 where I had been using
1,2,5,6. The system booted and again I got no errors in SuSE.

"That's strange," thought I, so I tried putting the memory back as it
was, in slots 1,2,5,6, with ECC enabled. Booted the system and still no
errors in SuSE.

So I'm not sure what I did exactly but the system is now running fine
and the ECC errors are gone. I'm still using the same DIMMs - the only
thing that may have changed is the DIMMs may be arranged differently
among the slots. I have tried swapping them around though and I still
can't get the ECC errors back. But that's fine because I didn't
particularly want the errors anyway! :)

-Simon

PS: Under the Northbridge/ECC configuration in the BIOS, the motherboard
has options for DRAM, L2 and L1 cache "BG Scrub" which are selected as
times from 40ns through to some microseconds. There are also options for
"DRAM Scrub REDIRECT" and "ECC Chip Kill". The motherboard manual offers
no advice as to the preferred values for these settings or what they do.
Can anyone suggest good values for these? I currently have them
disabled.