2006-02-01 19:14:59

Subject: [PATCH] AMD64: fix mce_cpu_quirks typos

The spurious MCE is TLB-related. I *think* the bit for the correct
status code is stored at position 10 HEX, not 10 DEC. At least I
still get those MCEs on a two-way Opteron box, even though they are
supposed to be filtered out.

Signed-off-by: Florian Weimer <[email protected]>

---

arch/x86_64/kernel/mce.c | 4 ++--
1 files changed, 2 insertions(+), 2 deletions(-)

102dfead12550ecaf7363a8ca7269ac0f1241bac
diff --git a/arch/x86_64/kernel/mce.c b/arch/x86_64/kernel/mce.c
index 13a2ead..975d128 100644
--- a/arch/x86_64/kernel/mce.c
+++ b/arch/x86_64/kernel/mce.c
@@ -350,9 +350,9 @@ static void __cpuinit mce_cpu_quirks(str
 {
 	/* This should be disabled by the BIOS, but isn't always */
 	if (c->x86_vendor == X86_VENDOR_AMD && c->x86 == 15) {
-		/* disable GART TBL walk error reporting, which trips off
+		/* disable GART TLB walk error reporting, which trips off
 		   incorrectly with the IOMMU & 3ware & Cerberus. */
-		clear_bit(10, &bank[4]);
+		clear_bit(0x10, &bank[4]);
 		/* Lots of broken BIOS around that don't clear them
 		   by default and leave crap in there. Don't log. */
 		mce_bootlog = 0;
--
1.1.5

2006-02-01 19:44:24

by Dave Jones

Subject: Re: [PATCH] AMD64: fix mce_cpu_quirks typos

On Wed, Feb 01, 2006 at 08:14:56PM +0100, Florian Weimer wrote:
> The spurious MCE is TLB-related. I *think* the bit for the correct
> status code is stored at position 10 HEX, not 10 DEC

Not true. According to the BIOS writer's guide, it's bit 10.
The register only defines bits up to bit 12.

Your patch makes it poke a reserved part of the register, which
is definitely undesired.

Dave

2006-02-01 19:50:02

by Florian Weimer

Subject: Re: [PATCH] AMD64: fix mce_cpu_quirks typos

* Dave Jones:

> On Wed, Feb 01, 2006 at 08:14:56PM +0100, Florian Weimer wrote:
> > The spurious MCE is TLB-related. I *think* the bit for the correct
> > status code is stored at position 10 HEX, not 10 DEC
>
> Not true. According to the BIOS writer's guide, it's bit 10.
> The register only defines bits up to bit 12.

Okay, so why am I still getting these MCEs?

MCE 0
CPU 0 4 northbridge TSC 91ec03f09330
ADDR 104500000
Northbridge GART error
bit61 = error uncorrected
TLB error 'generic transaction, level generic'
STATUS a40000000005001b MCGSTATUS 0

They are supposed to be disabled by the quirks routine, aren't they?

2006-02-01 19:59:16

by Andi Kleen

Subject: Re: [PATCH] AMD64: fix mce_cpu_quirks typos

Florian Weimer <[email protected]> writes:

First, please cc x86-64 patches to the maintainer; things can
get lost in the noise of the list.

> The spurious MCE is TLB-related. I *think* the bit for the correct
> status code is stored at position 10 HEX, not 10 DEC. At least I
> still get those MCEs on a two-way Opteron box, even though they are
> supposed to be filtered out.

No, 10 is the correct bit index. But normally it's set by BIOS anyways.

The reason you still see it is that setting the bit here only
prevents MCE exceptions, but it's still logged and the regular polling
picks them up anyways. I have not found a nice way to handle this
(other than adding an ugly CPU specific special case in the middle
of the nice cpu independent machine check handler, which I couldn't
bring myself to do so far...)

-Andi

2006-02-01 20:21:32

by Florian Weimer

Subject: Re: [PATCH] AMD64: fix mce_cpu_quirks typos

* Andi Kleen:

> Florian Weimer <[email protected]> writes:
>
> First, please cc x86-64 patches to the maintainer; things can
> get lost in the noise of the list.

Oops, sorry about that. Perhaps I should repeat that MCE for the sake
of discuss@, as decoded by mcelog:

MCE 0
CPU 0 4 northbridge TSC 91ec03f09330
ADDR 104500000
Northbridge GART error
bit61 = error uncorrected
TLB error 'generic transaction, level generic'
STATUS a40000000005001b MCGSTATUS 0

>> The spurious MCE is TLB-related. I *think* the bit for the correct
>> status code is stored at position 10 HEX, not 10 DEC. At least I
>> still get those MCEs on a two-way Opteron box, even though they are
>> supposed to be filtered out.
>
> No, 10 is the correct bit index. But normally it's set by BIOS anyways.
>
> The reason you still see it is that setting the bit here only
> prevents MCE exceptions,

And thus a kernel panic?

> but it's still logged and the regular polling picks them up
> anyways. I have not found a nice way to handle this (other than
> adding an ugly CPU specific special case in the middle of the nice
> cpu independent machine check handler, which I couldn't bring myself
> to do so far...)

Someone tried to track these messages down together with someone else
from AMD, but they never got it finished.

For reference, here's the lspci -n output for the system. It's a
two-way Opteron box (248, 2.2 GHz, stepping 10) with 8 GB of RAM.
(BIOS and chipset details are not available to me at the moment.)
The MCEs only appeared after a switch to a 64-bit kernel (2.6.15.2),
adding the second CPU, along with 4 GB of RAM. Previously, the box
ran 2.6.13 in 32-bit mode, and no MCEs appeared regularly.

In the history of the system, there was one more MCE, but we thought
at that time it was related to thermal issues (it happened after
someone had switched off air conditioning in the server room *cough*).

0000:00:06.0 PCI bridge: Advanced Micro Devices [AMD] AMD-8111 PCI (rev 07)
0000:00:07.0 ISA bridge: Advanced Micro Devices [AMD] AMD-8111 LPC (rev 05)
0000:00:07.1 IDE interface: Advanced Micro Devices [AMD] AMD-8111 IDE (rev 03)
0000:00:07.2 SMBus: Advanced Micro Devices [AMD] AMD-8111 SMBus 2.0 (rev 02)
0000:00:07.3 Bridge: Advanced Micro Devices [AMD] AMD-8111 ACPI (rev 05)
0000:00:0a.0 PCI bridge: Advanced Micro Devices [AMD] AMD-8131 PCI-X Bridge (rev 12)
0000:00:0a.1 PIC: Advanced Micro Devices [AMD] AMD-8131 PCI-X APIC (rev 01)
0000:00:0b.0 PCI bridge: Advanced Micro Devices [AMD] AMD-8131 PCI-X Bridge (rev 12)
0000:00:0b.1 PIC: Advanced Micro Devices [AMD] AMD-8131 PCI-X APIC (rev 01)
0000:00:18.0 Host bridge: Advanced Micro Devices [AMD] K8 NorthBridge
0000:00:18.1 Host bridge: Advanced Micro Devices [AMD] K8 NorthBridge
0000:00:18.2 Host bridge: Advanced Micro Devices [AMD] K8 NorthBridge
0000:00:18.3 Host bridge: Advanced Micro Devices [AMD] K8 NorthBridge
0000:00:19.0 Host bridge: Advanced Micro Devices [AMD] K8 NorthBridge
0000:00:19.1 Host bridge: Advanced Micro Devices [AMD] K8 NorthBridge
0000:00:19.2 Host bridge: Advanced Micro Devices [AMD] K8 NorthBridge
0000:00:19.3 Host bridge: Advanced Micro Devices [AMD] K8 NorthBridge
0000:01:05.0 RAID bus controller: 3ware Inc 3ware ATA-RAID
0000:02:09.0 Ethernet controller: Broadcom Corporation NetXtreme BCM5704 Gigabit Ethernet (rev 03)
0000:02:09.1 Ethernet controller: Broadcom Corporation NetXtreme BCM5704 Gigabit Ethernet (rev 03)
0000:03:00.0 USB Controller: Advanced Micro Devices [AMD] AMD-8111 USB (rev 0b)
0000:03:00.1 USB Controller: Advanced Micro Devices [AMD] AMD-8111 USB (rev 0b)
0000:03:06.0 VGA compatible controller: ATI Technologies Inc Rage XL (rev 27)
0000:03:08.0 Ethernet controller: Intel Corp. 82557/8/9 [Ethernet Pro 100] (rev 10)

2006-02-01 22:55:46

by Andi Kleen

Subject: Re: [PATCH] AMD64: fix mce_cpu_quirks typos

On Wednesday 01 February 2006 21:21, Florian Weimer wrote:

> > but it's still logged and the regular polling picks them up
> > anyways. I have not found a nice way to handle this (other than
> > adding an ugly CPU specific special case in the middle of the nice
> > cpu independent machine check handler, which I couldn't bring myself
> > to do so far...)
>
> Someone tried to track these messages down together with someone else
> from AMD, but they never got it finished.

They could have saved themselves a lot of work by just asking
on the right mailing list (which is not l-k, BTW).

> For reference, here's the lspci -n output for the system. It's a
> two-way Opteron box (248, 2.2 GHz, stepping 10) with 8 GB of RAM.
> (BIOS and chipset details are not available to me at the moment.)
> The MCEs only appeared after a switch to a 64-bit kernel (2.6.15.2),
> adding the second CPU, along with 4 GB of RAM. Previously, the box
> ran 2.6.13 in 32-bit mode, and no MCEs appeared regularly.

The 64bit kernel uses the AGP aperture as IOMMU, the 32bit kernel doesn't.
It's a known documented hardware bug that this causes spurious GART errors.
That is why the BIOS and Linux disable them. Unfortunately the Linux
MCE handler is too thorough and picks them up anyways as corrected events.

-Andi

2006-02-02 12:59:10

by Florian Weimer

Subject: Re: [PATCH] AMD64: fix mce_cpu_quirks typos

* Andi Kleen:

> On Wednesday 01 February 2006 21:21, Florian Weimer wrote:
>
>> > but it's still logged and the regular polling picks them up
>> > anyways. I have not found a nice way to handle this (other than
>> > adding an ugly CPU specific special case in the middle of the nice
>> > cpu independent machine check handler, which I couldn't bring myself
>> > to do so far...)
>>
>> Someone tried to track these messages down together with someone else
>> from AMD, but they never got it finished.
>
> They could have saved themselves a lot of work by just asking
> on the right mailing list (which is not l-k, BTW).

Marc Michelsen brought this up last year on <[email protected]>
(which I suppose is the right list), but he didn't receive many
comments (not publicly, at least).

> The 64bit kernel uses the AGP aperture as IOMMU, the 32bit kernel
> doesn't. It's a known documented hardware bug that this causes
> spurious GART errors.

Someone from AMD told Marc that fixes in pci-gart.c (probably related
to iommu_fullflush, see the comment there) are supposed to suppress
the error in the first place. That's why we are a bit confused
whether the errors are really harmless (our machines do run stable,
though).

It also seems that the bug is not as well-documented as it deserves to
be. (The search engines will pick up this thread, though.) Our
vendor told us to have the RAM tested, for instance. 8->

> That is why the BIOS and Linux disable them. Unfortunately the Linux
> MCE handler is too thorough and picks them up anyways as corrected
> events.

If the errors are really harmless, it probably makes sense to add a
warning to the mcelog output that this MCE is expected, preferably
with an AMD errata reference.

Filtering in the kernel seems to be overkill because the rate of those
spurious MCEs is fairly low, and they won't lead to loss of other,
more important MCEs.

2006-02-02 13:31:43

by Andi Kleen

Subject: Re: [discuss] Re: [PATCH] AMD64: fix mce_cpu_quirks typos

On Thursday 02 February 2006 13:59, Florian Weimer wrote:

>
> > The 64bit kernel uses the AGP aperture as IOMMU, the 32bit kernel
> > doesn't. It's a known documented hardware bug that this causes
> > spurious GART errors.
>
> Someone from AMD told Marc that fixes in pci-gart.c (probably related
> to iommu_fullflush, see the comment there) are supposed to suppress
> the error in the first place. That's why we are a bit confused
> whether the errors are really harmless (our machines do run stable,
> though).

Long ago there was a real bug in this area which caused these GART
errors legitimately, but even when that one was fixed they still
occurred occasionally.

I was told back then that there was a bug in the Northbridge
that causes them occasionally - that is why BIOSes turn them off.
The kernel did that eventually too.

Of course there is some probability that you have a driver
that accesses the buffers after unmapping. The GART is currently
not flushed on unmapping because that would be

Normally such drivers are caught though because some other IOMMU
implementations on other architectures have stronger checking in this area.

-Andi