2004-06-17 00:06:06

by Tim Hockin

Subject: Opteron fatal machine check during PCI probe

Hey all,

I have a couple of dual Opteron boxen that consistently get an MCE during
PCI probing. This is from linux-2.6.6, but the EXACT same scenario happens
on a 2.4.x kernel.

The MCE shows that the error is an IO read, with the address 0xfdfc000cfe.
The RIP points to pci_conf1_read(), when we try to inw() from the PCI data
register.

This is called during the PCI probing, and stops the kernel dead in its
tracks. The disassembly of the surrounding code is:

ffffffff802822c5: 89 ca mov %ecx,%edx
ffffffff802822c7: 83 e2 02 and $0x2,%edx
ffffffff802822ca: 66 81 c2 fc 0c add $0xcfc,%dx
ffffffff802822cf: 66 ed in (%dx),%ax

This all seems legit to me.

What is interesting is that the address 0xfdfc000cfe is correct in the
low-order 16 bits. The extra 0xfdfc000000 is what is puzzling to me, or
maybe it's a red herring.

I added a show_registers() to the MCE handler, and %rdx *really* is all
zeros, other than the 0xcfe.
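
For reference, a minimal sketch of the port arithmetic the disassembly performs, assuming standard PCI configuration mechanism #1 (address register at 0xCF8, data register at 0xCFC). The helper names conf1_address(), conf1_data_port16() and io_port_of() are hypothetical and only illustrate the computation; they are not the kernel's own functions, and nothing here touches real I/O ports.

```c
#include <stdint.h>

/* Hypothetical helper: value written to the address register (port 0xCF8)
 * under configuration mechanism #1. The low two bits of the register
 * offset are dropped because the address selects an aligned dword. */
static uint32_t conf1_address(uint8_t bus, uint8_t devfn, uint8_t reg)
{
    return 0x80000000u
         | ((uint32_t)bus << 16)
         | ((uint32_t)devfn << 8)
         | (reg & 0xFCu);
}

/* Hypothetical helper: data port for a 16-bit read. This mirrors
 * "and $0x2,%edx; add $0xcfc,%dx" from the disassembly: bit 1 of the
 * register offset selects the upper or lower word of the dword. */
static uint16_t conf1_data_port16(uint8_t reg)
{
    return (uint16_t)(0xCFCu + (reg & 2u));
}

/* Hypothetical helper: an I/O port number is 16 bits wide, so only the
 * low 16 bits of the address reported by the MCE identify the port. */
static uint16_t io_port_of(uint64_t reported_addr)
{
    return (uint16_t)(reported_addr & 0xFFFFu);
}
```

With these, conf1_data_port16() for any register offset with bit 1 set gives 0xcfe, which is what io_port_of(0xfdfc000cfe) also yields. That is consistent with the faulting access being an ordinary 16-bit read of the upper half of the config data register.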

If I disable MCE, then the system boots fine, and runs fine.

Anyone have any ideas?

Tim


2004-06-17 00:27:51

by Andi Kleen

Subject: Re: Opteron fatal machine check during PCI probe

On Wed, 16 Jun 2004 17:06:02 -0700
Tim Hockin <[email protected]> wrote:

> Hey all,
>
> I have a couple of dual Opteron boxen that consistently get an MCE during
> PCI probing. This is from linux-2.6.6, but the EXACT same scenario happens
> on a 2.4.x kernel.

> The MCE shows that the error is an IO read, with the address 0xfdfc000cfe.
> The RIP points to pci_conf1_read(), when we try to inw() from the PCI data
> register.

Is it a master abort (0x100 set in MC4_STATUS)?
If yes, it's a BIOS issue; the BIOS is supposed to disable that one.

> This is called during the PCI probing, and stops the kernel dead in its
> tracks. The disassembly of the surrounding code is:
>
> ffffffff802822c5: 89 ca mov %ecx,%edx
> ffffffff802822c7: 83 e2 02 and $0x2,%edx
> ffffffff802822ca: 66 81 c2 fc 0c add $0xcfc,%dx
> ffffffff802822cf: 66 ed in (%dx),%ax
>
> This all seems legit to me.
>
> What is interesting is that the address 0xfdfc000cfe is correct in the
> low-order 16 bits. The extra 0xfdfc000000 is what is puzzling to me, or
> maybe it's a red herring.

It is a red herring. The in instruction only uses the low 16 bits of its
port operand (%dx).


-Andi