2008-10-07 14:20:37

by Gernot Hillier

Subject: e1000e: sporadic "hardware error"s with Intel 82563EB on Supermicro X7DB3

Hi there,

On at least two machines using the Supermicro X7DB3 board with Intel
82563EB (a.k.a. PCI device 8086:1096), we see sporadic problems on modprobe
(about 1 time in some hundred tries):

e1000e: Intel(R) PRO/1000 Network Driver - 0.3.3.3-k2
e1000e: Copyright (c) 1999-2008 Intel Corporation.
e1000e 0000:06:00.0: PCI INT A -> GSI 18 (level, low) -> IRQ 18
e1000e 0000:06:00.0: setting latency timer to 64
0000:06:00.0: 0000:06:00.0: Hardware Error
0000:06:00.0: eth0: (PCI Express:2.5GB/s:Width x4) 00:30:48:67:f5:f6
0000:06:00.0: eth0: Intel(R) PRO/1000 Network Connection
0000:06:00.0: eth0: MAC: 3, PHY: 5, PBA No: 2050ff-0ff
e1000e 0000:06:00.1: PCI INT B -> GSI 19 (level, low) -> IRQ 19
e1000e 0000:06:00.1: setting latency timer to 64
0000:06:00.1: eth1: (PCI Express:2.5GB/s:Width x4) 00:30:48:67:f5:f7
0000:06:00.1: eth1: Intel(R) PRO/1000 Network Connection
0000:06:00.1: eth1: MAC: 3, PHY: 5, PBA No: 2050ff-0ff
0000:06:00.0: eth0: Hardware Error

eth0 is not available after module loading. During boot, this means the
machine won't come up correctly. Problem can be "fixed" by removing and
reloading the module.

This happens on the rather old SUSE-patched 2.6.25.11 with e1000e 0.2.0 as
well as with vanilla 2.6.27-rc8 including e1000e 0.3.3.3-k2.

The machines are equipped with two Quad-Core Xeons E5440 and 8GB of RAM.
Both kernels are compiled for x86_64.

Supermicro claims that there's no known hardware problem with these boards
and that the Windows driver doesn't show any issue...

Is there anything I can do to help narrow down the problem? Anything I
can test? Any help greatly appreciated...

TIA!

--
Gernot Hillier
Siemens AG, CT SE 2, Corporate Competence Center Embedded Linux


2008-10-08 10:29:36

by Krzysztof Halasa

Subject: Re: e1000e: sporadic "hardware error"s with Intel 82563EB on Supermicro X7DB3

Hi,

"Hillier, Gernot" <[email protected]> writes:

> On at least two machines using the Supermicro X7DB3 board with Intel
> 82563EB (a.k.a. PCI device 8086:1096), we see sporadic problems on modprobe
> (about 1 time in some hundred tries):
>
> e1000e: Intel(R) PRO/1000 Network Driver - 0.3.3.3-k2
> e1000e: Copyright (c) 1999-2008 Intel Corporation.
> e1000e 0000:06:00.0: PCI INT A -> GSI 18 (level, low) -> IRQ 18
> e1000e 0000:06:00.0: setting latency timer to 64
> 0000:06:00.0: 0000:06:00.0: Hardware Error

What does "lspci -vv" say about it when the above happens?

A spurious chip reset (hardware) could probably cause that.
--
Krzysztof Halasa

2008-10-08 13:30:37

by Gernot Hillier

Subject: Re: e1000e: sporadic "hardware error"s with Intel 82563EB on Supermicro X7DB3

Hello!

Krzysztof Halasa wrote:
> Hi,
>
> "Hillier, Gernot" <[email protected]> writes:
>
>> On at least two machines using the Supermicro X7DB3 board with Intel
>> 82563EB (a.k.a. PCI device 8086:1096), we see sporadic problems on modprobe
>> (about 1 time in some hundred tries):
>>
>> e1000e: Intel(R) PRO/1000 Network Driver - 0.3.3.3-k2
>> e1000e: Copyright (c) 1999-2008 Intel Corporation.
>> e1000e 0000:06:00.0: PCI INT A -> GSI 18 (level, low) -> IRQ 18
>> e1000e 0000:06:00.0: setting latency timer to 64
>> 0000:06:00.0: 0000:06:00.0: Hardware Error
>
> What does "lspci -vv" say about it when the above happens?
>
> A spurious chip reset (hardware) could probably cause that.

Here's the output of "lspci -vv" in the error case (for the eth devices):

------- SNIP -----------
06:00.0 Class 0200: Device 8086:1096 (rev 01)
    Subsystem: Device 15d9:1096
    Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr+ Stepping- SERR+ FastB2B- DisINTx-
    Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
    Latency: 0, Cache Line Size: 32 bytes
    Interrupt: pin A routed to IRQ 18
    Region 0: Memory at d0020000 (32-bit, non-prefetchable) [size=128K]
    Region 1: Memory at d0000000 (32-bit, non-prefetchable) [size=128K]
    Region 2: I/O ports at 4000 [size=32]
    [virtual] Expansion ROM at d0080000 [disabled] [size=64K]
    Capabilities: [c8] Power Management version 2
        Flags: PMEClk- DSI+ D1- D2- AuxCurrent=0mA PME(D0+,D1-,D2-,D3hot+,D3cold+)
        Status: D0 PME-Enable- DSel=0 DScale=1 PME-
    Capabilities: [d0] Message Signalled Interrupts: Mask- 64bit+ Queue=0/0 Enable-
        Address: 00000000feeff00c Data: 4158
    Capabilities: [e0] Express (v1) Endpoint, MSI 00
        DevCap: MaxPayload 256 bytes, PhantFunc 0, Latency L0s <512ns, L1 <64us
            ExtTag- AttnBtn- AttnInd- PwrInd- RBE- FLReset-
        DevCtl: Report errors: Correctable- Non-Fatal- Fatal- Unsupported-
            RlxdOrd+ ExtTag- PhantFunc- AuxPwr- NoSnoop+
            MaxPayload 128 bytes, MaxReadReq 4096 bytes
        DevSta: CorrErr- UncorrErr+ FatalErr- UnsuppReq+ AuxPwr+ TransPend-
        LnkCap: Port #0, Speed 2.5GT/s, Width x4, ASPM unknown, Latency L0 <128ns, L1 <64us
            ClockPM- Suprise- LLActRep- BwNot-
        LnkCtl: ASPM Disabled; RCB 64 bytes Disabled- Retrain- CommClk-
            ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
        LnkSta: Speed 2.5GT/s, Width x4, TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
    Capabilities: [100] Advanced Error Reporting <?>
    Capabilities: [140] Device Serial Number 06-c7-66-ff-ff-48-30-00
    Kernel driver in use: e1000e
    Kernel modules: e1000e

06:00.1 Class 0200: Device 8086:1096 (rev 01)
    Subsystem: Device 15d9:1096
    Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr+ Stepping- SERR+ FastB2B- DisINTx-
    Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
    Latency: 0, Cache Line Size: 32 bytes
    Interrupt: pin B routed to IRQ 19
    Region 0: Memory at d0060000 (32-bit, non-prefetchable) [size=128K]
    Region 1: Memory at d0040000 (32-bit, non-prefetchable) [size=128K]
    Region 2: I/O ports at 4020 [size=32]
    [virtual] Expansion ROM at d0090000 [disabled] [size=64K]
    Capabilities: [c8] Power Management version 2
        Flags: PMEClk- DSI+ D1- D2- AuxCurrent=0mA PME(D0+,D1-,D2-,D3hot+,D3cold+)
        Status: D0 PME-Enable- DSel=0 DScale=1 PME-
    Capabilities: [d0] Message Signalled Interrupts: Mask- 64bit+ Queue=0/0 Enable-
        Address: 0000000000000000 Data: 0000
    Capabilities: [e0] Express (v1) Endpoint, MSI 00
        DevCap: MaxPayload 256 bytes, PhantFunc 0, Latency L0s <512ns, L1 <64us
            ExtTag- AttnBtn- AttnInd- PwrInd- RBE- FLReset-
        DevCtl: Report errors: Correctable- Non-Fatal- Fatal- Unsupported-
            RlxdOrd+ ExtTag- PhantFunc- AuxPwr- NoSnoop+
            MaxPayload 128 bytes, MaxReadReq 4096 bytes
        DevSta: CorrErr- UncorrErr+ FatalErr- UnsuppReq+ AuxPwr+ TransPend-
        LnkCap: Port #0, Speed 2.5GT/s, Width x4, ASPM unknown, Latency L0 <128ns, L1 <64us
            ClockPM- Suprise- LLActRep- BwNot-
        LnkCtl: ASPM Disabled; RCB 64 bytes Disabled- Retrain- CommClk-
            ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
        LnkSta: Speed 2.5GT/s, Width x4, TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
    Capabilities: [100] Advanced Error Reporting <?>
    Capabilities: [140] Device Serial Number 06-c7-66-ff-ff-48-30-00
    Kernel driver in use: e1000e
    Kernel modules: e1000e
------- SNIP -----------

Retried this several times in the error and normal case. The only things
which change are three values for device 06:00.0:

- Control "DisINTx-" changes to "DisINTx+" if the card is correctly
initialized
- Interrupt changes from IRQ 18 to IRQ 4345 if card is correctly initialized
- Message Signalled Interrupts change from "Enable-" to "Enable+"

In addition, the "Data" field from "Message Signalled Interrupts" seems to
change w/o any clear pattern.

For 06:00.1, everything seems to be the same in the error as well as in the
normal case.
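
A comparison like the one above can be automated. The following is a minimal
Python sketch, not something used for the original report: it assumes two
saved "lspci -vv" dumps (one from a good load, one from a failed load) that
list the same fields in the same order, and that device blocks start with a
bus address such as "06:00.0" without a PCI domain prefix. The function names
are illustrative only.

```python
import re

# Assumption: a device block starts with a bus address like "06:00.0"
# at the beginning of a line (no "0000:" domain prefix).
DEVICE_RE = re.compile(r"^([0-9a-f]{2}:[0-9a-f]{2}\.[0-7])\s", re.IGNORECASE)


def split_devices(dump):
    """Map each bus address to the list of stripped lines in its block."""
    devices = {}
    current = None
    for raw in dump.splitlines():
        match = DEVICE_RE.match(raw)
        if match:
            current = match.group(1)
            devices[current] = []
        if current is not None and raw.strip():
            devices[current].append(raw.strip())
    return devices


def diff_devices(good_dump, bad_dump):
    """Return {address: [(good_line, bad_line), ...]} for lines that differ.

    Lines are compared pairwise by position, so both dumps must contain the
    same fields in the same order for each device.
    """
    good = split_devices(good_dump)
    bad = split_devices(bad_dump)
    changes = {}
    for addr in sorted(set(good) & set(bad)):
        pairs = [(g, b) for g, b in zip(good[addr], bad[addr]) if g != b]
        if pairs:
            changes[addr] = pairs
    return changes
```

Calling diff_devices() with the two dumps returns entries only for devices
whose lines differ, which makes changes such as the IRQ number or the MSI
"Enable" flag easy to spot.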

Does this tell you anything valuable?

--
Gernot Hillier, Siemens AG, CT SE 2

2008-10-08 15:26:16

by Graham, David

Subject: RE: e1000e: sporadic "hardware error"s with Intel 82563EB on Supermicro X7DB3

Hi Gernot,

Thanks for reporting this issue. We have witnessed this in our labs too,
only on platforms that have BMC management firmware. I'm very familiar
with the problem, and believe that we have fixed it, though the
application of the fix may not be simple. The problem is a result of
improper synchronization between the platform FW and the e1000e driver
when they attempt concurrent access to LAN resources, and fixes were
made both on the driver side, and on the FW side. On some platforms a
simple driver update resolves the problem, others require FW fixes too.

The 0.2.0 driver in 2.6.25 has no fixes for this problem, and so I am
not surprised that you see it there. The first set of changes for this
issue are already in the 0.3.3.3-k2 driver that you are still seeing the
problem with on 2.6.27-rc8, so either those changes are not good, or your
issue requires one of the additional fixes.

There have been further improvements made to the driver synchronization
code since the 0.3.3.3-k2 driver, and it is possible that a newer driver
would resolve the issue. It'd be good for us to know if that's the case.
The driver version is not yet (AFAICS) upstream, but is already
available in the standalone e1000e-0.4.1.7 driver on sourceforge.
(google "sourceforge e1000e"). Would you be able to try that, as a first
step ?

If this does not resolve the issue for the Supermicro board, you likely
also require a "FW-side" fix, and this comes in one of two flavors. If
the board has an INTEL BMC, then we will need to update it with a new
BMC version. If the board has a Supermicro BMC (I expect that it does),
then we can provide a patch to some of the platform microcode using an
EEPROM update. To determine which is appropriate for you, we'll need to
know more about the platform. There's probably a BMC version number on
one of the BIOS menus. I can work with you to find the info we need, and
then, to help you to perform the necessary steps to perform an upgrade.

Dave

2008-10-08 21:37:49

by Stephen Hemminger

Subject: Re: e1000e: sporadic "hardware error"s with Intel 82563EB on Supermicro X7DB3

On Wed, 8 Oct 2008 08:25:49 -0700
"Graham, David" <[email protected]> wrote:

> Hi Gernot,
>
> Thanks for reporting this issue. We have witnessed this in our labs too,
> only on platforms that have BMC management firmware. I'm very familiar
> with the problem, and believe that we have fixed it, though the
> application of the fix may not be simple. The problem is a result of
> improper synchronization between the platform FW and the e1000e driver
> when they attempt concurrent access to LAN resources, and fixes were
> made both on the driver side, and on the FW side. On some platforms a
> simple driver update resolves the problem, others require FW fixes too.
>
> The 0.2.0 driver in 2.6.25 has no fixes for this problem, and so I am
> not surprised that you see it there. The first set of changes for this
> issue are already in the 0.3.3.3-k2 driver that you are still seeing the
> problem with on 2.6.27-rc8, so either those changes are not good, or your
> issue requires one of the additional fixes.
>
> There have been further improvements made to the driver synchronization
> code since the 0.3.3.3-k2 driver, and it is possible that a newer driver
> would resolve the issue. It'd be good for us to know if that's the case.
> The driver version is not yet (AFAICS) upstream, but is already
> available in the standalone e1000e-0.4.1.7 driver on sourceforge.
> (google "sourceforge e1000e"). Would you be able to try that, as a first
> step ?

Repeat rant heard from many users and vendors:
Why does Intel continue to not do driver development in mainline kernel?

2008-10-08 22:03:19

by Krzysztof Halasa

Subject: Re: e1000e: sporadic "hardware error"s with Intel 82563EB on Supermicro X7DB3

"Hillier, Gernot" <[email protected]> writes:

> Region 0: Memory at d0020000 (32-bit, non-prefetchable) [size=128K]
> Region 1: Memory at d0000000 (32-bit, non-prefetchable) [size=128K]
> Region 2: I/O ports at 4000 [size=32]
> [virtual] Expansion ROM at d0080000 [disabled] [size=64K]

Doesn't look like a spurious reset, though.

> - Control "DisINTx-" changes to "DisINTx+" if the card is correctly
> initialized
> - Interrupt changes from IRQ 18 to IRQ 4345 if card is correctly initialized
> - Message Signalled Interrupts change from "Enable-" to "Enable+"

The "normal" effects of enabling MSI.
--
Krzysztof Halasa

2008-10-09 13:14:00

by Gernot Hillier

Subject: Re: e1000e: sporadic "hardware error"s with Intel 82563EB on Supermicro X7DB3

Dear David,

first of all thanks for your quick answer! This is what I call great
support from a hardware vendor!! :-)

Graham, David wrote:
> Thanks for reporting this issue. We have witnessed this in our labs too,
> only on platforms that have BMC management firmware. I'm very familiar
> with the problem, and believe that we have fixed it, though the
> application of the fix may not be simple. The problem is a result of
> improper synchronization between the platform FW and the e1000e driver
> when they attempt concurrent access to LAN resources, and fixes were
> made both on the driver side, and on the FW side. On some platforms a
> simple driver update resolves the problem, others require FW fixes too.

That sounds quite promising and seems to fit our problem.

However, one detail confuses us: we can currently reproduce this problem on
two machines. One of them is equipped with an optional IPMI card, the other
one isn't. (The Supermicro X7DB3 doesn't include full IPMI support onboard,
but has a "LP IPMI 2.0 (SIMLP) Slot" where you can place an optional card).

The box with the IPMI card shows the hardware errors quite often (in one of
about 200 tries) while the other box still shows the problem, but much more
rarely (in one of >1000 tries). Now we wonder if the BMC is on the IPMI
card or on the board itself - in the first case, I'm not sure if your thesis
fully explains the problems we can see.

And there's another detail I'd like to mention: we first found the problem
by doing continuous reboots as originally described, but we found we can
also reproduce it with an endless loop of "rmmod;sleep 3;modprobe". Does
this somehow contradict your thesis?
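
The reload loop mentioned above could be scripted roughly as follows. This is
a hedged Python sketch, not the script actually used for these tests: the
helper names (reload_loop, count_hardware_errors) and the pause after
modprobe are assumptions, and it has to run as root on a test machine. A load
counts as failed when the kernel log contains "Hardware Error" for
0000:06:00.0.

```python
import subprocess
import time


def count_hardware_errors(log_text, device="0000:06:00.0"):
    """Count kernel log lines reporting 'Hardware Error' for one device."""
    return sum(
        1
        for line in log_text.splitlines()
        if device in line and "Hardware Error" in line
    )


def _run(cmd):
    """Run a command and return its stdout as text (raises on failure)."""
    result = subprocess.run(cmd, check=True, capture_output=True, text=True)
    return result.stdout


def reload_loop(tries, module="e1000e", pause=3.0, run=_run, sleep=time.sleep):
    """Reload the module `tries` times; return how many loads logged an error."""
    failures = 0
    for _ in range(tries):
        run(["rmmod", module])
        sleep(pause)
        run(["dmesg", "-c"])          # print-and-clear: discard old messages
        run(["modprobe", module])
        sleep(pause)                  # assumed settle time before reading the log
        log = run(["dmesg", "-c"])    # messages produced by this load only
        if count_hardware_errors(log) > 0:
            failures += 1
    return failures
```

The run and sleep parameters exist only so the loop can be exercised without
touching real hardware, for example by passing a fake runner in a test.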

> There have been further improvements made to the driver synchronization
> code since the 0.3.3.3-k2 driver, and it is possible that a newer driver
> would resolve the issue. It'd be good for us to know if that's the case.
> The driver version is not yet (AFAICS) upstream, but is already
> available in the standalone e1000e-0.4.1.7 driver on sourceforge.
> (google "sourceforge e1000e"). Would you be able to try that, as a first
> step ?

Yes, I did. Unfortunately, 0.4.1.7 still shows the problem - on both machines:

e1000e: Intel(R) PRO/1000 Network Driver - 0.4.1.7-NAPI
e1000e: Copyright (c) 1999-2008 Intel Corporation.
ACPI: PCI Interrupt 0000:06:00.0[A] -> GSI 18 (level, low) -> IRQ 18
PCI: Setting latency timer of device 0000:06:00.0 to 64
0000:06:00.0: 0000:06:00.0: Hardware Error
0000:06:00.0: eth0: (PCI Express:2.5GB/s:Width x4) 00:30:48:66:c7:06
0000:06:00.0: eth0: Intel(R) PRO/1000 Network Connection
0000:06:00.0: eth0: MAC: 5, PHY: 5, PBA No: 2050ff-0ff
ACPI: PCI Interrupt 0000:06:00.1[B] -> GSI 19 (level, low) -> IRQ 19
PCI: Setting latency timer of device 0000:06:00.1 to 64
0000:06:00.1: eth1: (PCI Express:2.5GB/s:Width x4) 00:30:48:66:c7:07
0000:06:00.1: eth1: Intel(R) PRO/1000 Network Connection
0000:06:00.1: eth1: MAC: 5, PHY: 5, PBA No: 2050ff-0ff
0000:06:00.0: eth0: Hardware Error
0000:06:00.0: eth0: Hardware Error
0000:06:00.0: eth0: Hardware Error
0000:06:00.0: eth0: Hardware Error
0000:06:00.0: eth0: Hardware Error

Is there any further debug code I could add to narrow down things?

> If this does not resolve the issue for the Supermicro board, you likely
> also require a "FW-side" fix, and this comes in one of two flavors. If
> the board has an INTEL BMC, then we will need to update it with a new
> BMC version. If the board has a Supermicro BMC (I expect that it does),
> then we can provide a patch to some of the platform microcode using an
> EEPROM update. To determine which is appropriate for you, we'll need to
> know more about the platform. There's probably a BMC version number on
> one of the BIOS menus. I can work with you to find the info we need, and
> then, to help you to perform the necessary steps to perform an upgrade.

Sorry, but we can't provide any further details about this yet. We are still
trying to get through to the Supermicro developers, but so far our FAE contact
insists on telling us "don't use e1000e, e1000 is the right driver for your
hardware".

--
Gernot Hillier
Siemens AG, CT SE 2

2008-10-14 09:18:49

by Gernot Hillier

Subject: Re: e1000e: sporadic "hardware error"s with Intel 82563EB on Supermicro X7DB3

Hi Dave!

Sorry for the delay (and the self-follow-up), but now I can hopefully
provide answers to all your questions...

Hillier, Gernot wrote:
> However, one detail confuses us: we can currently reproduce this problem on
> two machines. One of them is equipped with an optional IPMI card, the other
> one isn't. (The Supermicro X7DB3 doesn't include full IPMI support onboard,
> but has a "LP IPMI 2.0 (SIMLP) Slot" where you can place an optional card).

The "IPMI card" we use is a "Supermicro AOC-SIMLP-B".

Overview: http://www.supermicro.com/products/accessories/addon/sim.cfm
Manual: http://www.supermicro.com/manuals/other/AOC-SIMLP.pdf

> The box with the IPMI card shows the hardware errors quite often (in one of
> about 200 tries) while the other box still shows the problem, but much more
> rarely (in one of >1000 tries). Now we wonder if the BMC is on the IPMI
> card or on the board itself - in the first case, I'm not sure if your thesis
> fully explains the problems we can see.

However, after digging through some manuals, I'm quite sure the BMC is
integrated in the Intel ESB2 I/O Controller Hub used on our board, not
on the IPMI card. So we should have an Intel BMC.

> And there's another detail I'd like to mention: we first found the problem
> by doing continuous reboots as originally described, but we found we can
> also reproduce it with an endless loop of "rmmod;sleep 3;modprobe". Does
> this somehow contradict your thesis?
>
>> There have been further improvements made to the driver synchronization
>> code since the 0.3.3.3-k2 driver, and it is possible that a newer driver
>> would resolve the issue. It'd be good for us to know if that's the case.
>> The driver version is not yet (AFAICS) upstream, but is already
>> available in the standalone e1000e-0.4.1.7 driver on sourceforge.
>> (google "sourceforge e1000e"). Would you be able to try that, as a first
>> step ?
>
> Yes, I did. Unfortunately, 0.4.1.7 still shows the problem - on both machines:
>
> e1000e: Intel(R) PRO/1000 Network Driver - 0.4.1.7-NAPI
> e1000e: Copyright (c) 1999-2008 Intel Corporation.
> ACPI: PCI Interrupt 0000:06:00.0[A] -> GSI 18 (level, low) -> IRQ 18
> PCI: Setting latency timer of device 0000:06:00.0 to 64
> 0000:06:00.0: 0000:06:00.0: Hardware Error
> 0000:06:00.0: eth0: (PCI Express:2.5GB/s:Width x4) 00:30:48:66:c7:06
> 0000:06:00.0: eth0: Intel(R) PRO/1000 Network Connection
> 0000:06:00.0: eth0: MAC: 5, PHY: 5, PBA No: 2050ff-0ff
> ACPI: PCI Interrupt 0000:06:00.1[B] -> GSI 19 (level, low) -> IRQ 19
> PCI: Setting latency timer of device 0000:06:00.1 to 64
> 0000:06:00.1: eth1: (PCI Express:2.5GB/s:Width x4) 00:30:48:66:c7:07
> 0000:06:00.1: eth1: Intel(R) PRO/1000 Network Connection
> 0000:06:00.1: eth1: MAC: 5, PHY: 5, PBA No: 2050ff-0ff
> 0000:06:00.0: eth0: Hardware Error
> 0000:06:00.0: eth0: Hardware Error
> 0000:06:00.0: eth0: Hardware Error
> 0000:06:00.0: eth0: Hardware Error
> 0000:06:00.0: eth0: Hardware Error
>
> Is there any further debug code I could add to narrow down things?
>
>> If this does not resolve the issue for the Supermicro board, you likely
>> also require a "FW-side" fix, and this comes in one of two flavors. If
>> the board has an INTEL BMC, then we will need to update it with a new
>> BMC version. If the board has a Supermicro BMC (I expect that it does),
>> then we can provide a patch to some of the platform microcode using an
>> EEPROM update. To determine which is appropriate for you, we'll need to
>> know more about the platform. There's probably a BMC version number on
>> one of the BIOS menus. I can work with you to find the info we need, and
>> then, to help you to perform the necessary steps to perform an upgrade.
>
[...]
Still no helpful contact within Supermicro, but we found the following
information in the web interface provided by the "IPMI card":

Device Information
Product Name: Supermicro Daughter Card
Serial Number: 02969601ac46a6df
Device IP Address: 192.168.2.4
Device MAC Address: 08:15:08:15:08:15
Firmware Version: 01.59.00
Firmware Build Number: 5420
Firmware Description: Sep-29-2008-09-45-NonKVM
Hardware Revision: 0x22

The BIOS IPMI menu itself says:

IPMI Specification Version: 2.0
Firmware Version: 1.59

I hope that those details answered your questions, so that we can
proceed with your suggestions. Think we now need the "new BMC version"
you mentioned, right?

If there's anything I can test or look up from the software side to
speed things up (like additional debugging of the driver, etc.), please
don't hesitate to ask!

--
Gernot

2008-10-15 16:38:16

by Graham, David

Subject: RE: e1000e: sporadic "hardware error"s with Intel 82563EB on Supermicro X7DB3

Hi Gernot,

I think that the system with the SuperMicro IPMI card is configured as
having an "external BMC" from the perspective of the INTEL-based system.
My experience of such configurations is that the IPMI traffic is handled
by the BMC in the card, but routed in/out of the system over the "eth0"
on-motherboard esb2 interface. I looked at the AOC-SIMLP-B card
described in the SuperMicro link you provided and see that it too has an
ethernet interface. I'm not sure if the interface on the card provides a
second IPMI interface to the system, or that IPMI to the mainboard eth0
is disabled. I have IPMI management contacts here in INTEL, and am
trying to find out.

If this system does route IPMI traffic between the SuperMicro card & the
mainboard LAN eth0, the onboard LAN now has two clients, one on the
SuperMicro card, and one in the host OS. INTEL provides APIs to external
BMCs so that they can use the LAN, and hidden behind those APIs is code
to allow each client to operate without having to be aware of the state
of the other. There is a bug in this code that can be exposed when the
host resets the LAN. The bug is resolved by a patch to the API code,
which is applied as an EEPROM update to the system. I am working with
Jeff Hockert & others in-house to find out details of how we are
deploying that EEPROM update.

I continue to review, with help, the information that you have already
provided, to determine whether this system does match the IPMI
configuration that I think it does. I'll keep you up to date.

OK, now for the system without the IPMI card. Probably that one does
have an active INTEL BMC. And, if it does, the core bug that I (sort-of)
explained above is also relevant there, though it's not fixable in the
same way because the buggy code in this case is integrated directly as
part of the INTEL BMC. In this case, you'll need a BMC upgrade. But
first, just like for the other case, I need to confirm that the
configuration is what I think it is.

It would help if you could provide a little more information. Could you
provide (for one of each of the two configurations that you have - one
with the IPMI card, one without):

lspci -t
lspci -vvv -xxxx
ethtool -e eth0
BIOS "IPMI" menus (I know you already gave us one, but both
would be good)
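
As a rough sketch (not part of the original request), the three command
outputs listed above could be gathered into files in one go. The output file
names and the collect() helper are assumptions made for illustration; the
commands themselves are the ones listed, and "ethtool -e" typically needs
root. The BIOS "IPMI" menus still have to be copied by hand.

```python
import subprocess
from pathlib import Path

# Output file name (hypothetical) -> command taken from the list above.
COMMANDS = {
    "lspci-tree.txt": ["lspci", "-t"],
    "lspci-verbose.txt": ["lspci", "-vvv", "-xxxx"],
    "eeprom-eth0.txt": ["ethtool", "-e", "eth0"],
}


def _run(cmd):
    """Run a command and return whatever it printed on stdout."""
    return subprocess.run(cmd, capture_output=True, text=True).stdout


def collect(outdir, run=_run):
    """Write the output of each command to its own file; return the paths."""
    out = Path(outdir)
    out.mkdir(parents=True, exist_ok=True)
    written = []
    for name, cmd in COMMANDS.items():
        path = out / name
        path.write_text(run(cmd))
        written.append(path)
    return written
```

Running collect() once on each machine, with a separate directory for the box
with the IPMI card and the box without it, keeps the two configurations apart.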

Thanks

Dave

2008-10-16 12:27:49

by Gernot Hillier

Subject: Re: e1000e: sporadic "hardware error"s with Intel 82563EB on Supermicro X7DB3

Hi Dave!

I added Zoltan Fodor from your PAE department to the distribution list as
he also supports us regarding this problem.

On 15.10.2008 18:37, Graham, David wrote:
> I think that the system with the SuperMicro IPMI card is configured as
> having an "external BMC" from the perspective of the INTEL-based system.

Exactly. That's what our hardware experts told me in the meantime, too.

> My experience of such configurations is that the IPMI traffic is handled
> by the BMC in the card, but routed in/out of the system over the "eth0"
> on-motherboard esb2 interface. I looked at the AOC-SIMLP-B card
> described in the SuperMicro link you provided and see that it too has an
> ethernet interface. I'm not sure if the interface on the card provides
> a second IPMI interface to the system, or that IPMI to the mainboard
> eth0 is disabled. I have IPMI management contacts here in INTEL, and am
> trying to find out.
>
> If this system does route IPMI traffic between the SuperMicro card & the
> mainboard LAN eth0, the onboard LAN now has two clients, one on the
> SuperMicro card, and one in the host OS.

The latter is true for us. This IPMI card has its own eth interface, as you
mentioned, but due to product requirements we can't use it and need the
"shared NIC" feature instead. Therefore, this card is configured (jumpered) to
route its IPMI traffic through the eth0 on the motherboard.

> INTEL provides APIs to external BMCs so that they can use the LAN, and
> hidden behind those APIs is code to allow each client to operate without
> having to be aware of the state of the other. There is a bug in this
> code that can be exposed when the host resets the LAN. The bug is
> resolved by a patch to the API code, which is applied as an EEPROM
> update to the system. I am working with Jeff Hockert & others in-house
> to find out details of how we are deploying that EEPROM update.

Thanks for the explanation. I would be more than happy to try anything in
that area!

> I continue to review - with help- the information that you have already
> provided, to determine whether this system does match the IPMI
> configuration that I think it does. I'll keep you up to date.

As explained above, your assumptions should exactly apply to our scenario, yes.

> OK, now for the system without the IPMI card. Probably that one does
> have an active INTEL BMC. And, if it does, the core bug that I (sort-of)
> explained above is also relevant there, though it's not fixable in the
> same way because the buggy code in this case is integrated directly as
> part of the INTEL BMC. In this case, you'll need a BMC upgrade. But
> first, just like for the other case, I need to confirm that the
> configuration is what I think it is.
>
> It would help if you could provide a little more information. Could you
> provide (for one of each of the two configurations that you have - one
> with the IPMI card, one without):
>
> lspci -t
> lspci -vvv -xxxx
> ethtool -e eth0

I will provide those as soon as possible. Currently, they would probably be
meaningless for you, as our hardware experts tried some kind of firmware
update which broke the "Shared NIC" feature - so I doubt we can reproduce
the bug in the current configuration.

As soon as I can get the machines back to the state where we can reproduce
the issue, I'll send you the requested details.

> BIOS "IPMI" menus (I know you
> already gave us one, but both would be good)

For this, I can already tell that there is no BIOS IPMI menu available if
there's no IPMI card plugged in. Seems like the Supermicro BIOS developers
deny access to the Intel BMC in standalone mode...

--
With kind regards,
Gernot Hillier
Siemens AG, CT SE 2, Corporate Competence Center Embedded Linux

2008-10-16 16:02:43

by Gernot Hillier

Subject: Re: e1000e: sporadic "hardware error"s with Intel 82563EB on Supermicro X7DB3

Hi there!

Graham, David wrote:
> It would help if you could provide a little more information. Could you
> provide (for one of each of the two configurations that you have - one
> with the IPMI card, one without):
>
> lspci -t
> lspci -vvv -xxxx
> ethtool -e eth0

Ok, it turned out that we still can reproduce the problem - even after the
firmware upgrade. So I collected the information you requested from two
machines:

- BVSIM3 is the one with the IPMI card.
- BVSIM5 is the one without the IPMI card.

As the output from the commands you requested is rather large, I uploaded
it to the following URLs instead of posting it to the list, hope that's ok:

http://www.hillier.de/bvsim5-ethtool-e.txt
http://www.hillier.de/bvsim5-lspci-t.txt
http://www.hillier.de/bvsim5-lspci-vvv-xxxx.txt
http://www.hillier.de/bvsim3-ethtool-e.txt
http://www.hillier.de/bvsim3-lspci-t.txt
http://www.hillier.de/bvsim3-lspci-vvv-xxxx.txt

All commands were run in the error case, i.e. after e1000e said "Hardware
error".
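
For anyone who wants to gather the same data on their own machines, the
collection step could be sketched roughly as below. This is only an
illustration: the function name collect_diag and the HOST-*.txt naming
(modelled on the file names in the URLs above) are assumptions, not the
script that was actually used here. The three commands themselves are the
ones Dave requested; reading the extended config space with -xxxx and
dumping the EEPROM normally require root.

```shell
#!/bin/sh
# Hypothetical helper: save the three requested dumps for one machine.
# Usage: collect_diag LABEL [IFACE]   (IFACE defaults to eth0)
collect_diag() {
    host=$1
    iface=${2:-eth0}

    # PCI topology as a tree
    lspci -t            > "$host-lspci-t.txt"
    # Full verbose listing plus hex dump of the extended config space
    lspci -vvv -xxxx    > "$host-lspci-vvv-xxxx.txt"
    # EEPROM contents of the NIC
    ethtool -e "$iface" > "$host-ethtool-e.txt"
}
```

Called as "collect_diag bvsim5" right after the driver has logged the
error, this would write bvsim5-lspci-t.txt, bvsim5-lspci-vvv-xxxx.txt and
bvsim5-ethtool-e.txt into the current directory.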

> BIOS "IPMI" menus (I know you already gave us one, but both
> would be good)

BVSIM3 shows the IPMI menu I already provided
BVSIM5 shows no IPMI menu

Thanks in advance!

--
With kind regards,
Gernot Hillier
Siemens AG, CT SE 2, Corporate Competence Center Embedded Linux

2008-11-11 09:59:43

by Gernot Hillier

Subject: Re: e1000e: sporadic "hardware error"s with Intel 82563EB on Supermicro X7DB3

Dear Dave,

On 2008-10-16, Hillier, Gernot wrote:
> Hi there!
>
> Graham, David wrote:
>> It would help if you could provide a little more information. Could you
>> provide (for one of each of the two configurations that you have - one
>> with the IPMI card, one without):
>>
>> lspci -t
>> lspci -vvv -xxxx
>> ethtool -e eth0
>
> Ok, it turned out that we still can reproduce the problem - even after the
> firmware upgrade. So I collected the information you requested from two
> machines:

Wanted to let you know that this problem seems to be fixed for us.

We received a preliminary update from Supermicro which contains a new NIC
firmware 2.5 they recently received from your side with improved Shared LAN
support (56313.eep, release date 2008-10-01).

After flashing this update together with a new BMC card firmware 1.59, the
problem has finally vanished for us. (For some reason, updating only the
NIC firmware wasn't possible, so unfortunately we can't pin down which
part of the update really fixed the problem.)

1.5 days of an rmmod/insmod loop and 2.5 days of a complete OS reboot loop
have now passed without problems. Previously, both tests reliably triggered
the problem within at most one day.
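
For reference, an rmmod/insmod loop of this kind could look roughly like
the sketch below. It is not the exact script used for the tests above. The
function names, the 5 second settle time and the use of modprobe rather
than raw rmmod/insmod are assumptions; the only thing taken from the
thread is that a failed load shows up as a "Hardware Error" line in the
kernel log. It needs root and clears the kernel ring buffer on every
iteration.

```shell
#!/bin/sh
# Succeeds if the kernel log text on stdin contains a "Hardware Error" line.
has_hw_error() {
    grep -q 'Hardware Error'
}

# reload_loop MAX: reload e1000e up to MAX times, stop at the first failure.
reload_loop() {
    max=$1
    i=1
    while [ "$i" -le "$max" ]; do
        dmesg -c >/dev/null      # read and clear the ring buffer
        modprobe -r e1000e       # unload the driver
        modprobe e1000e          # load it again
        sleep 5                  # assumed settle time for late messages
        if dmesg | has_hw_error; then
            echo "Hardware Error after $i reloads"
            return 1
        fi
        i=$((i + 1))
    done
    echo "no Hardware Error in $max reloads"
}
```

With a failure rate of about one in a few hundred loads, something like
"reload_loop 10000" should run long enough to say whether the problem is
still present.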

So thanks again for your help!

--
With kind regards,

Gernot Hillier
Siemens AG, Corporate Competence Center Embedded Linux