2012-11-08 06:25:14

by Joe Jin

Subject: 82571EB: Detected Hardware Unit Hang

Hi list,

I have a customer who reported "82571EB Detected Hardware Unit Hang" on an HP ProLiant DL360 G6, and
they have to reboot the server to recover:

e1000e 0000:06:00.1: eth3: Detected Hardware Unit Hang:
TDH <1a>
TDT <1a>
next_to_use <1a>
next_to_clean <18>
buffer_info[next_to_clean]:
time_stamp <10047a74e>
next_to_watch <18>
jiffies <10047a88c>
next_to_watch.status <1>
MAC Status <80383>
PHY Status <792d>
PHY 1000BASE-T Status <3800>
PHY Extended Status <3000>
PCI Status <10>

The issue is still reproducible with the newer e1000e driver 2.0.0.1.
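The hang report mixes raw jiffies counts (time_stamp, jiffies) with TX ring indices (next_to_use, next_to_clean). A minimal Python sketch of the arithmetic behind those fields; the ring size of 256 descriptors and the candidate HZ values are assumptions, since neither is stated in the report:

```python
# Values copied from the "Detected Hardware Unit Hang" report above.
TIME_STAMP = 0x10047A74E  # buffer_info[next_to_clean].time_stamp, in jiffies
JIFFIES = 0x10047A88C     # jiffies when the hang was reported
NEXT_TO_USE = 0x1A
NEXT_TO_CLEAN = 0x18
RING_SIZE = 256           # assumption: the TX ring length is not in the report


def elapsed_jiffies(now: int, then: int) -> int:
    """Jiffies between two 64-bit samples, tolerant of counter wrap-around."""
    return (now - then) & 0xFFFF_FFFF_FFFF_FFFF


def pending_descriptors(next_to_use: int, next_to_clean: int, ring_size: int) -> int:
    """Descriptors handed to hardware but not yet cleaned by the driver."""
    return (next_to_use - next_to_clean) % ring_size


if __name__ == "__main__":
    delta = elapsed_jiffies(JIFFIES, TIME_STAMP)
    print(f"elapsed: {delta} jiffies")  # elapsed: 318 jiffies
    for hz in (100, 250, 1000):         # HZ is a kernel build option
        print(f"  at HZ={hz}: {delta / hz:.3f} s")
    print("pending:", pending_descriptors(NEXT_TO_USE, NEXT_TO_CLEAN, RING_SIZE))  # pending: 2
```

With these values the buffer at next_to_clean had been queued 318 jiffies before the report (about 1.27 s if HZ=250), and two descriptors were still waiting to be cleaned.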

Device info:
06:00.1 Ethernet controller: Intel Corporation 82571EB Gigabit Ethernet Controller (Copper) (rev 06)
06:00.1 0200: 8086:10bc (rev 06)

I compared the lspci output before and after the issue; the difference is below:
06:00.1 Ethernet controller: Intel Corporation 82571EB Gigabit Ethernet Controller (Copper) (rev 06)
Subsystem: Hewlett-Packard Company NC364T PCI Express Quad Port Gigabit Server Adapter
Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr+ Stepping- SERR- FastB2B- DisINTx-
- Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
+ Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx+
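The only difference between the two lspci snapshots is the INTx flag on the Status line, which lspci takes from bit 3 (Interrupt Status) of the 16-bit PCI Status register at config offset 0x06; the "PCI Status <10>" line in the hang report appears to be the same register. A sketch decoder using the bit layout from the PCI Local Bus Specification and lspci's flag names:

```python
# Bit layout of the PCI Status register (config space offset 0x06), keyed by
# bit number and labelled with the names lspci prints on its "Status:" line.
STATUS_BITS = {
    3: "INTx",      # Interrupt Status: the function is asserting INTx
    4: "Cap",       # Capabilities List present
    5: "66MHz",     # 66 MHz capable
    6: "UDF",       # User Definable Features (obsolete)
    7: "FastB2B",   # Fast Back-to-Back capable
    8: "ParErr",    # Master Data Parity Error
    11: ">TAbort",  # Signaled Target Abort
    12: "<TAbort",  # Received Target Abort
    13: "<MAbort",  # Received Master Abort
    14: ">SERR",    # Signaled System Error
    15: "<PERR",    # Detected Parity Error
}
DEVSEL_TIMING = {0: "fast", 1: "medium", 2: "slow"}  # bits 10:9


def decode_pci_status(value: int) -> dict:
    """Return {flag name: bool} plus the DEVSEL timing as a string."""
    flags = {name: bool((value >> bit) & 1) for bit, name in STATUS_BITS.items()}
    flags["DEVSEL"] = DEVSEL_TIMING.get((value >> 9) & 0x3, "reserved")
    return flags


if __name__ == "__main__":
    at_hang = decode_pci_status(0x10)  # "PCI Status <10>" from the report
    print([name for name, on in at_hang.items() if on is True])  # ['Cap']
    print(at_hang["INTx"], decode_pci_status(0x18)["INTx"])      # False True
```

So 0x10 on its own decodes to Cap+ with INTx-; a value of 0x18 would correspond to the INTx+ seen in the later lspci snapshot.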


Could you please help with this?

Thanks in advance,
Joe

--
Oracle <http://www.oracle.com>
Joe Jin | Software Development Senior Manager | +8610.6106.5624
ORACLE | Linux and Virtualization
No. 24 Zhongguancun Software Park, Haidian District | 100193 Beijing


2012-11-08 20:35:53

by Dave, Tushar N

Subject: RE: 82571EB: Detected Hardware Unit Hang

Are you sure this is not similar to the issue you reported before?
i.e.
On Mon, 2012-07-09 at 16:51 +0800, Joe Jin wrote:
> I'm seeing a Unit Hang even with the latest e1000e driver 2.0.0 when
> doing scp test. this issue is easy do reproduced on SUN FIRE X2270 M2,
> just copy a big file (>500M) from another server will hit it at once.

All devices in the path from the root complex to the 82571 should have the *same* max payload size; otherwise it can cause a hang.
Can you double-check this?
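On Linux, one way to check this from userspace: resolving /sys/bus/pci/devices/<address> with realpath gives a path whose components include every PCI bridge between the root complex and the device, and `lspci -vvv` shows each one's DevCap MaxPayload. A sketch, with a hypothetical topology and hypothetical helper names, that lists the path and computes the largest payload size every device on it supports:

```python
import re

# A PCI address as it appears in sysfs: domain:bus:device.function.
PCI_ADDR = re.compile(r"[0-9a-f]{4}:[0-9a-f]{2}:[0-9a-f]{2}\.[0-7]")


def pcie_path(sysfs_realpath: str) -> list:
    """PCI addresses from the root port down to the device itself.

    sysfs_realpath is os.path.realpath("/sys/bus/pci/devices/<addr>"); each
    bridge between the root complex and the device is one path component.
    """
    return [part for part in sysfs_realpath.split("/") if PCI_ADDR.fullmatch(part)]


def common_max_payload(devcap_mps: dict) -> int:
    """Largest MPS (bytes) that every device on the path supports."""
    return min(devcap_mps.values())


if __name__ == "__main__":
    # Hypothetical topology; substitute the real realpath and lspci values.
    path = pcie_path("/sys/devices/pci0000:00/0000:00:07.0/0000:50:00.0/"
                     "0000:51:02.0/0000:52:00.1")
    print(path)
    supported = {"0000:00:07.0": 256, "0000:50:00.0": 256,
                 "0000:51:02.0": 128, "0000:52:00.1": 256}
    print(common_max_payload(supported))  # 128
```

Following the advice above, each device's DevCtl MaxPayload along that path would then be set to the same value, no larger than this minimum.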

-Tushar

2012-11-09 01:22:43

by Joe Jin

Subject: Re: 82571EB: Detected Hardware Unit Hang

On 11/09/12 04:35, Dave, Tushar N wrote:
> Are you sure this is not similar issue as before that you reported.
> i.e.

Tushar,

Thanks for your quick response. I'll check with the customer whether they can modify the max
payload size from the BIOS; this time the issue hit an HP server.

Thanks again,
Joe

2012-11-14 02:48:11

by Joe Jin

Subject: Re: 82571EB: Detected Hardware Unit Hang

On 11/09/12 04:35, Dave, Tushar N wrote:
> All devices in path from root complex to 82571, should have *same* max payload size otherwise it can cause hang.
> Can you double check this?

Hi Tushar,

I checked with the hardware vendor, and they said there is no way to modify the max payload size
from the BIOS. Can I modify it from the driver side?

Thanks,
Joe

2012-11-14 03:37:33

by Li Yu

Subject: Re: 82571EB: Detected Hardware Unit Hang

On 2012-11-09 04:35, Dave, Tushar N wrote:
> Are you sure this is not similar issue as before that you reported.
> i.e.
> On Mon, 2012-07-09 at 16:51 +0800, Joe Jin wrote:
>> I'm seeing a Unit Hang even with the latest e1000e driver 2.0.0 when
>> doing scp test. this issue is easy do reproduced on SUN FIRE X2270 M2,
>> just copy a big file (>500M) from another server will hit it at once.
>
> All devices in path from root complex to 82571, should have *same* max payload size otherwise it can cause hang.
> Can you double check this?
>

We have also seen this kind of hang on an 82599EB (ixgbe driver) with the RHEL 6.3
kernel. We tried upgrading to the latest driver version (3.8.21 or 3.10.17),
but it still happens.

Could it also be caused by a wrong max payload size set in the BIOS?

Thanks

Yu


2012-11-14 03:43:38

by Dave, Tushar N

Subject: RE: 82571EB: Detected Hardware Unit Hang

>-----Original Message-----
>From: Li Yu [mailto:[email protected]]
>Sent: Tuesday, November 13, 2012 7:37 PM
>To: Dave, Tushar N
>Cc: Joe Jin; [email protected]; [email protected]; linux-
>[email protected]; Mary Mcgrath
>Subject: Re: 82571EB: Detected Hardware Unit Hang
>
>We also found such hang problem on 82599EB (ixgbe driver) in RHEL6.3
>kernel, we ever tried to upgrade to latest version (3.8.21 or 3.10.17),
>but it still happens.
>
>Is it probably also due to wrong "max payload size" set in BIOS?
>
It may or may not be. I suggest starting a separate thread for that issue, as these two devices are significantly different.

-Tushar

2012-11-14 03:45:33

by Dave, Tushar N

Subject: RE: 82571EB: Detected Hardware Unit Hang

>-----Original Message-----
>From: Joe Jin [mailto:[email protected]]
>Sent: Tuesday, November 13, 2012 6:48 PM
>To: Dave, Tushar N
>Cc: [email protected]; [email protected]; linux-
>[email protected]; Mary Mcgrath
>Subject: Re: 82571EB: Detected Hardware Unit Hang
>
>On 11/09/12 04:35, Dave, Tushar N wrote:
>> All devices in path from root complex to 82571, should have *same* max
>payload size otherwise it can cause hang.
>> Can you double check this?
>
>Hi Tushar,
>
>Checked with hardware vendor and they said no way to modify the max
>payload size from BIOS, can I modify it from driver side?

If you want to change the value for the 82571 device, you can do it from the EEPROM, but for the other upstream devices I am not sure. I will check with my team.

-Tushar

2012-11-15 00:33:01

by Joe Jin

Subject: Re: 82571EB: Detected Hardware Unit Hang

On 11/14/12 11:45, Dave, Tushar N wrote:
>> -----Original Message-----
>> From: Joe Jin [mailto:[email protected]]
>> Sent: Tuesday, November 13, 2012 6:48 PM
>> To: Dave, Tushar N
>> Cc: [email protected]; [email protected]; linux-
>> [email protected]; Mary Mcgrath
>> Subject: Re: 82571EB: Detected Hardware Unit Hang
>>
>> On 11/09/12 04:35, Dave, Tushar N wrote:
>>> All devices in path from root complex to 82571, should have *same* max
>> payload size otherwise it can cause hang.
>>> Can you double check this?
>>
>> Hi Tushar,
>>
>> Checked with hardware vendor and they said no way to modify the max
>> payload size from BIOS, can I modify it from driver side?
>
> If you want to change value for 82571 device you can do it from eeprom but for other upstream devices I am not sure. I will check with my team.

Hi Tushar,

Could you please help me find the offset of the max payload size in the EEPROM?
I'd like to try modifying it with ethtool.

Thanks in advance,
Joe

2012-11-15 20:26:39

by Dave, Tushar N

Subject: RE: 82571EB: Detected Hardware Unit Hang

>-----Original Message-----
>From: Joe Jin [mailto:[email protected]]
>Sent: Wednesday, November 14, 2012 4:33 PM
>To: Dave, Tushar N
>Cc: [email protected]; [email protected]; linux-
>[email protected]; Mary Mcgrath
>Subject: Re: 82571EB: Detected Hardware Unit Hang
>
>Hi Tushar,
>
>Would you please help to fine the offset of max payload size in eeprom?
>I'd like to have a try to modify it by ethtool.

It is defined by bit 8 of word 0x1A.
Bit value 0 = 128B, bit value 1 = 256B.
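The `ethtool -e` dumps in this thread are byte-addressed, while the offset above is a 16-bit word index. Each word appears in the dump low byte first (for example, the device ID 0x10a4 shows up as "a4 10"), so word 0x1A sits at byte offsets 0x34-0x35. A sketch that extracts it and applies the bit-8 rule; the parser assumes the plain "Offset Values" layout ethtool prints here, and the helper names are hypothetical:

```python
def parse_ethtool_dump(text: str) -> bytes:
    """Flatten `ethtool -e` output ("0x0030 f6 6c ...") into raw bytes."""
    data = bytearray()
    for line in text.splitlines():
        fields = line.split()
        if not fields or not fields[0].startswith("0x"):
            continue  # header, separator, or annotation line
        if int(fields[0], 16) != len(data):
            raise ValueError(f"unexpected offset {fields[0]}")
        data.extend(int(byte, 16) for byte in fields[1:])
    return bytes(data)


def nvm_word(dump: bytes, index: int) -> int:
    """EEPROM word `index`; each word is dumped low byte first."""
    return dump[2 * index] | (dump[2 * index + 1] << 8)


def max_payload_from_word_1a(word: int) -> int:
    """Bit 8 of word 0x1A: 0 means 128 bytes, 1 means 256 bytes."""
    return 256 if (word >> 8) & 1 else 128


if __name__ == "__main__":
    dump = parse_ethtool_dump("""\
Offset Values
------ ------
0x0000 00 15 17 16 ed 86 24 05 ff ff a2 50 ff ff ff ff
0x0010 57 d4 07 74 2f a4 a4 11 86 80 a4 10 86 80 65 b1
0x0020 08 00 a4 10 00 58 00 00 01 50 00 00 00 00 00 01
0x0030 f6 6c b0 37 a6 07 03 84 83 07 00 00 03 c3 02 06
""")
    word = nvm_word(dump, 0x1A)
    print(hex(word), max_payload_from_word_1a(word))  # 0x7a6 256
```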

-Tushar

2012-11-19 05:38:36

by Joe Jin

Subject: Re: 82571EB: Detected Hardware Unit Hang

On 11/16/12 04:26, Dave, Tushar N wrote:
>> Would you please help to fine the offset of max payload size in eeprom?
>> I'd like to have a try to modify it by ethtool.
>
> It is defined using bit 8 of word 0x1A.
> Bit value 0 = 128B , bit value 1 = 256B

Hi Tushar,

I checked one of my servers whose max payload size is 128:

# lspci -vvv -s 52:00.1
52:00.1 Ethernet controller: Intel Corporation 82571EB Gigabit Ethernet Controller (rev 06)
Subsystem: Intel Corporation PRO/1000 PT Quad Port Server Adapter
Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr+ Stepping- SERR- FastB2B- DisINTx+
Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
Latency: 0, Cache Line Size: 64 bytes
Interrupt: pin B routed to IRQ 266
Region 0: Memory at dfea0000 (32-bit, non-prefetchable) [size=128K]
Region 1: Memory at dfe80000 (32-bit, non-prefetchable) [size=128K]
Region 2: I/O ports at 6020 [size=32]
[virtual] Expansion ROM at d8120000 [disabled] [size=128K]
Capabilities: [c8] Power Management version 2
Flags: PMEClk- DSI+ D1- D2- AuxCurrent=0mA PME(D0+,D1-,D2-,D3hot+,D3cold-)
Status: D0 NoSoftRst- PME-Enable- DSel=0 DScale=1 PME-
Capabilities: [d0] MSI: Enable+ Count=1/1 Maskable- 64bit+
Address: 00000000fee00000 Data: 409a
Capabilities: [e0] Express (v1) Endpoint, MSI 00
DevCap: MaxPayload 256 bytes, PhantFunc 0, Latency L0s <512ns, L1 <64us
ExtTag- AttnBtn- AttnInd- PwrInd- RBE- FLReset-
DevCtl: Report errors: Correctable+ Non-Fatal+ Fatal+ Unsupported+
RlxdOrd+ ExtTag- PhantFunc- AuxPwr- NoSnoop+
MaxPayload 128 bytes, MaxReadReq 4096 bytes
DevSta: CorrErr- UncorrErr- FatalErr+ UnsuppReq+ AuxPwr- TransPend-
LnkCap: Port #0, Speed 2.5GT/s, Width x4, ASPM L0s, Latency L0 <4us, L1 <64us
ClockPM- Surprise- LLActRep- BwNot-
LnkCtl: ASPM Disabled; RCB 64 bytes Disabled- Retrain- CommClk+
ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
LnkSta: Speed 2.5GT/s, Width x4, TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
Capabilities: [100 v1] Advanced Error Reporting
UESta: DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq+ ACSViol-
UEMsk: DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq+ ACSViol-
UESvrt: DLP+ SDES- TLP+ FCP+ CmpltTO+ CmpltAbrt+ UnxCmplt+ RxOF+ MalfTLP+ ECRC- UnsupReq+ ACSViol-
CESta: RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr-
CEMsk: RxErr+ BadTLP+ BadDLLP+ Rollover+ Timeout+ NonFatalErr-
AERCap: First Error Pointer: 14, GenCap- CGenEn- ChkCap- ChkEn-
Capabilities: [140 v1] Device Serial Number 00-15-17-ff-ff-16-ed-86
Kernel driver in use: e1000e
Kernel modules: e1000e

And eeprom dump as below:

Offset Values
------ ------
0x0000 00 15 17 16 ed 86 24 05 ff ff a2 50 ff ff ff ff
0x0010 57 d4 07 74 2f a4 a4 11 86 80 a4 10 86 80 65 b1
0x0020 08 00 a4 10 00 58 00 00 01 50 00 00 00 00 00 01
0x0030 f6 6c b0 37 a6 07 03 84 83 07 00 00 03 c3 02 06
0x0040 08 00 f0 0e 64 21 40 00 01 40 00 00 00 00 00 00
0x0050 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
0x0060 00 01 00 40 1e 12 07 40 00 01 00 40 ff ff ff ff


If I understand correctly, the value of word 0x1A (bytes 0x34-0x35 in the dump) is 0x07a6, so bit 8 is 1, but
my NIC's MPS is 128B. What am I getting wrong?

Thanks,
Joe

2012-11-20 08:59:06

by Dave, Tushar N

Subject: RE: 82571EB: Detected Hardware Unit Hang

>-----Original Message-----
>From: Joe Jin [mailto:[email protected]]
>Sent: Sunday, November 18, 2012 9:38 PM
>To: Dave, Tushar N
>Cc: [email protected]; [email protected]; linux-
>[email protected]; Mary Mcgrath
>Subject: Re: 82571EB: Detected Hardware Unit Hang
>
>If I did not misunderstand, the value of offset 0x1a is 0x07a6, then the
>bit 8 is 1, but my NIC's MPS is 128b, anything I'm wrong?

Have you powered off the system completely after modifying the EEPROM? If not, please do so.
-Tushar

2012-11-20 13:24:25

by Joe Jin

Subject: Re: 82571EB: Detected Hardware Unit Hang

On 11/20/12 16:59, Dave, Tushar N wrote:
> Have you power off the system completely after modifying eeprom? If not please do so.

It doesn't seem to work for me. Could you please help check what is wrong with my steps?

Original EEPROM dump:

# ethtool -e eth3 | head -8
Offset Values
------ ------
0x0000 00 15 17 16 ee 9a 24 05 ff ff a2 50 ff ff ff ff
0x0010 57 d4 07 74 2f a4 a4 11 86 80 a4 10 86 80 65 b1
0x0020 08 00 a4 10 00 58 00 00 01 50 00 00 00 00 00 01
0x0030 f6 6c b0 37 a6 07 03 84 83 07 00 00 03 c3 02 06
^^^^^
0x0040 08 00 f0 0e 64 21 40 00 01 40 00 00 00 00 00 00
0x0050 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00

# lspci -s 0000:52:00.1 -vvv
52:00.1 Ethernet controller: Intel Corporation 82571EB Gigabit Ethernet Controller (rev 06)
<--snip-->
Capabilities: [e0] Express (v1) Endpoint, MSI 00
DevCap: MaxPayload 256 bytes, PhantFunc 0, Latency L0s <512ns, L1 <64us
ExtTag- AttnBtn- AttnInd- PwrInd- RBE- FLReset-
DevCtl: Report errors: Correctable+ Non-Fatal+ Fatal+ Unsupported+
RlxdOrd+ ExtTag- PhantFunc- AuxPwr- NoSnoop+
MaxPayload 128 bytes, MaxReadReq 4096 bytes
^^^^^^^^^^^^^^^^^^^^^
<--snip-->

# ethtool eth3
Settings for eth3:
Supported ports: [ TP ]
Supported link modes: 10baseT/Half 10baseT/Full
100baseT/Half 100baseT/Full
1000baseT/Full
Supports auto-negotiation: Yes
Advertised link modes: 10baseT/Half 10baseT/Full
100baseT/Half 100baseT/Full
1000baseT/Full
Advertised pause frame use: No
Advertised auto-negotiation: Yes
Speed: 1000Mb/s
Duplex: Full
Port: Twisted Pair
PHYAD: 1
Transceiver: internal
Auto-negotiation: on
MDI-X: off
Supports Wake-on: d
Wake-on: d
Current message level: 0x00000007 (7)
Link detected: yes

# ethtool -E eth3 magic 0x10a48086 offset 0x34 value 0xa7
# ethtool -e eth3 | head -8
Offset Values
------ ------
0x0000 00 15 17 16 ee 9a 24 05 ff ff a2 50 ff ff ff ff
0x0010 57 d4 07 74 2f a4 a4 11 86 80 a4 10 86 80 65 b1
0x0020 08 00 a4 10 00 58 00 00 01 50 00 00 00 00 00 01
0x0030 f6 6c b0 37 a7 07 03 84 83 07 00 00 03 c3 02 06
^^^^^ <== a6 --> a7
0x0040 08 00 f0 0e 64 21 40 00 01 40 00 00 00 00 00 00
0x0050 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00

# reboot

# ethtool -e eth3 | head -8
Offset Values
------ ------
0x0000 00 15 17 16 ee 9a 24 05 ff ff a2 50 ff ff ff ff
0x0010 57 d4 07 74 2f a4 a4 11 86 80 a4 10 86 80 65 b1
0x0020 08 00 a4 10 00 58 00 00 01 50 00 00 00 00 00 01
0x0030 f6 6c b0 37 a7 07 03 84 83 07 00 00 03 c3 02 06
0x0040 08 00 f0 0e 64 21 40 00 01 40 00 00 00 00 00 00
0x0050 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00

# lspci -s 0000:52:00.1 -vvv
52:00.1 Ethernet controller: Intel Corporation 82571EB Gigabit Ethernet Controller (rev 06)
<--snip-->
Capabilities: [e0] Express (v1) Endpoint, MSI 00
DevCap: MaxPayload 256 bytes, PhantFunc 0, Latency L0s <512ns, L1 <64us
ExtTag- AttnBtn- AttnInd- PwrInd- RBE- FLReset-
DevCtl: Report errors: Correctable+ Non-Fatal+ Fatal+ Unsupported+
RlxdOrd+ ExtTag- PhantFunc- AuxPwr- NoSnoop+
MaxPayload 128 bytes, MaxReadReq 4096 bytes
^^^^^^^^^^^^^^^^^^^^^
DevSta: CorrErr- UncorrErr- FatalErr+ UnsuppReq+ AuxPwr- TransPend-
LnkCap: Port #0, Speed 2.5GT/s, Width x4, ASPM L0s, Latency L0 <4us, L1 <64us
ClockPM- Surprise- LLActRep- BwNot-
LnkCtl: ASPM Disabled; RCB 64 bytes Disabled- Retrain- CommClk+
ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
LnkSta: Speed 2.5GT/s, Width x4, TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
<--snip-->

# ethtool -E eth3 magic 0x10a48086 offset 0x35 value 0x17

# ethtool -e eth3 | head -8
Offset Values
------ ------
0x0000 00 15 17 16 ee 9a 24 05 ff ff a2 50 ff ff ff ff
0x0010 57 d4 07 74 2f a4 a4 11 86 80 a4 10 86 80 65 b1
0x0020 08 00 a4 10 00 58 00 00 01 50 00 00 00 00 00 01
0x0030 f6 6c b0 37 a6 17 03 84 83 07 00 00 03 c3 02 06
^^^^^<== 07 -> 17
0x0040 08 00 f0 0e 64 21 40 00 01 40 00 00 00 00 00 00
0x0050 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00

# reboot

# ethtool -e eth3 | head -8
Offset Values
------ ------
0x0000 00 15 17 16 ee 9a 24 05 ff ff a2 50 ff ff ff ff
0x0010 57 d4 07 74 2f a4 a4 11 86 80 a4 10 86 80 65 b1
0x0020 08 00 a4 10 00 58 00 00 01 50 00 00 00 00 00 01
0x0030 f6 6c b0 37 a6 17 03 84 83 07 00 00 03 c3 02 06
^^^^^<== 07 -> 17
0x0040 08 00 f0 0e 64 21 40 00 01 40 00 00 00 00 00 00
0x0050 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00

# lspci -s 0000:52:00.1 -vvv
52:00.1 Ethernet controller: Intel Corporation 82571EB Gigabit Ethernet Controller (rev 06)
<--snip-->
Capabilities: [e0] Express (v1) Endpoint, MSI 00
DevCap: MaxPayload 256 bytes, PhantFunc 0, Latency L0s <512ns, L1 <64us
ExtTag- AttnBtn- AttnInd- PwrInd- RBE- FLReset-
DevCtl: Report errors: Correctable+ Non-Fatal+ Fatal+ Unsupported+
RlxdOrd+ ExtTag- PhantFunc- AuxPwr- NoSnoop+
MaxPayload 128 bytes, MaxReadReq 4096 bytes
^^^^^^^^^^^^^^^^^^^^^
DevSta: CorrErr- UncorrErr- FatalErr+ UnsuppReq+ AuxPwr- TransPend-
LnkCap: Port #0, Speed 2.5GT/s, Width x4, ASPM L0s, Latency L0 <4us, L1 <64us
ClockPM- Surprise- LLActRep- BwNot-
LnkCtl: ASPM Disabled; RCB 64 bytes Disabled- Retrain- CommClk+
ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
LnkSta: Speed 2.5GT/s, Width x4, TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
<--snip-->
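A way to see which EEPROM bits each `ethtool -E` write above touched: with `value` given, ethtool writes a single byte, and byte offset N falls in word N // 2 (low byte for even N, high byte for odd N, matching the dump layout). The sketch also reproduces the magic value, which e1000e compares against the PCI vendor ID in the low 16 bits and the device ID in the high 16 bits; the helper names are hypothetical:

```python
VENDOR_ID = 0x8086
DEVICE_ID = 0x10A4  # PCI device ID; bytes 0x1a-0x1b of the dumps above read "a4 10"


def e1000e_eeprom_magic(vendor: int, device: int) -> int:
    """Magic accepted by e1000e for EEPROM writes: vendor ID | (device ID << 16)."""
    return vendor | (device << 16)


def word_bits_changed(byte_offset: int, old: int, new: int) -> tuple:
    """For a one-byte EEPROM write, return (word index, list of word bits flipped)."""
    word = byte_offset // 2
    shift = 8 if byte_offset % 2 else 0  # odd byte offset = high byte of the word
    diff = (old ^ new) << shift
    return word, [bit for bit in range(16) if (diff >> bit) & 1]


if __name__ == "__main__":
    print(hex(e1000e_eeprom_magic(VENDOR_ID, DEVICE_ID)))  # 0x10a48086
    print(word_bits_changed(0x34, 0xA6, 0xA7))             # (26, [0])
    print(word_bits_changed(0x35, 0x07, 0x17))             # (26, [12])
    print((0x07A6 >> 8) & 1)                               # 1
```

Both writes land in word 0x1A (26), but on bits 0 and 12; bit 8 is the low bit of byte 0x35, which was already 1 (0x07) in the original dump.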

Thanks,
Joe

2012-11-26 16:23:43

by Fujinaka, Todd

[permalink] [raw]
Subject: RE: [E1000-devel] 82571EB: Detected Hardware Unit Hang

On Tue, 20 Nov 2012, Joe Jin wrote:

> On 11/20/12 16:59, Dave, Tushar N wrote:
>> Have you powered off the system completely after modifying the EEPROM? If not, please do so.
>
> Hi Tushar,
>
> It doesn't seem to work for me. Would you please help check what is wrong with my steps?

...

> # lspci -s 0000:52:00.1 -vvv
> 52:00.1 Ethernet controller: Intel Corporation 82571EB Gigabit Ethernet Controller (rev 06)
> <--snip-->
> Capabilities: [e0] Express (v1) Endpoint, MSI 00
> DevCap: MaxPayload 256 bytes, PhantFunc 0, Latency L0s <512ns, L1 <64us
^^^^^^^^^^^^^^^^^^^^
> ExtTag- AttnBtn- AttnInd- PwrInd- RBE- FLReset-
> DevCtl: Report errors: Correctable+ Non-Fatal+ Fatal+ Unsupported+
> RlxdOrd+ ExtTag- PhantFunc- AuxPwr- NoSnoop+
> MaxPayload 128 bytes, MaxReadReq 4096 bytes
> ^^^^^^^^^^^^^^^^^^^^^
>
<--snip-->

If you look at the previous section, DevCap, you'll see that it's
correctly advertising 256 bytes but the system is negotiating 128 for
the link to the Ethernet controller. Things on the "other" side of the
link are controlled outside of the e1000 driver.
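The DevCap-versus-DevCtl comparison above can be scripted. This Python sketch assumes `lspci -vvv` text in the format quoted above: the supported size comes from the `DevCap:` line, the programmed size from the `MaxPayload ..., MaxReadReq ...` line under `DevCtl:`. The function name is illustrative.

```python
import re
from typing import Optional, Tuple

# "DevCap: MaxPayload 256 bytes, ..." -> size the device supports (MPSS).
DEVCAP_RE = re.compile(r"DevCap:\s+MaxPayload (\d+) bytes")
# "MaxPayload 128 bytes, MaxReadReq 4096 bytes" -> size programmed in DevCtl (MPS).
DEVCTL_RE = re.compile(r"MaxPayload (\d+) bytes, MaxReadReq (\d+) bytes")


def payload_sizes(lspci_vvv: str) -> Tuple[Optional[int], Optional[int]]:
    """Return (supported, programmed) MaxPayload in bytes, or None if absent."""
    cap = DEVCAP_RE.search(lspci_vvv)
    ctl = DEVCTL_RE.search(lspci_vvv)
    return (int(cap.group(1)) if cap else None,
            int(ctl.group(1)) if ctl else None)


sample = """\
Capabilities: [e0] Express (v1) Endpoint, MSI 00
DevCap: MaxPayload 256 bytes, PhantFunc 0, Latency L0s <512ns, L1 <64us
DevCtl: Report errors: Correctable+ Non-Fatal+ Fatal+ Unsupported+
RlxdOrd+ ExtTag- PhantFunc- AuxPwr- NoSnoop+
MaxPayload 128 bytes, MaxReadReq 4096 bytes
"""
supported, programmed = payload_sizes(sample)
print(f"supported={supported} programmed={programmed}")  # supported=256 programmed=128
```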

Tushar's first suggestion was to check the PCIe payload settings in the
entire chain. Have you done that? Mismatches will cause hangs.

Todd Fujinaka
Technical Marketing Engineer
LAN Access Division (LAD)
Intel Corporation
[email protected]
(503) 712-4565

2012-11-27 00:59:49

by Joe Jin

[permalink] [raw]
Subject: Re: [E1000-devel] 82571EB: Detected Hardware Unit Hang

On 11/27/12 00:23, Fujinaka, Todd wrote:
> If you look at the previous section, DevCap, you'll see that it's
> correctly advertising 256 bytes but the system is negotiating 128 for
> the link to the Ethernet controller. Things on the "other" side of the
> link are controlled outside of the e1000 driver.
>
> Tushar's first suggestion was to check the PCIe payload settings in the
> entire chain. Have you done that? Mismatches will cause hangs.

Hi Todd,

So far I have had to work out how to modify the MaxPayload size. The BIOS
has no option to change it, so I have to use ethtool, and now I need the
offset of the MaxPayload size in the EEPROM. I tried to find it in Intel's
online documentation but failed. Any idea?

Thanks in advance,
Joe

2012-11-27 02:07:06

by Mary Mcgrath

[permalink] [raw]
Subject: RE: [E1000-devel] 82571EB: Detected Hardware Unit Hang

Joe
Thank you for working this.
I would love to find out how they expect a customer to make the modification
to "word 0x1A, and see if the 8th bit is 0 or 1, and to change to 0."

I have in turn asked the customer for the lspci output for eth3; maybe the incorrect setting is upstream.

Again, thank you.
Regards
Mary




2012-11-27 17:32:32

by Fujinaka, Todd

[permalink] [raw]
Subject: RE: [E1000-devel] 82571EB: Detected Hardware Unit Hang

Forgive me if I'm being too repetitious as I think some of this has been mentioned in the past.

We (and by we I mean the Ethernet part and driver) can only change the advertised availability of a larger MaxPayloadSize. The size is negotiated by both sides of the link when the link is established. The driver should not change the size of the link as it would be poking at registers outside of its scope and is controlled by the upstream bridge (not us).

You also need to check all the PCIe links on the way to the device. There can be several between the root complex, through bridges, and the endpoint Ethernet controller. The Ethernet part and driver have no control over any other links. You'll have to talk to the motherboard manufacturer about those links.

Your original problem appears to be hangs, and Tushar asked you to check the entire path of PCIe connections from the root complex to the endpoint. Any payload mismatch can cause hangs, and I believe you have hit this problem in the past. I'm sure you remember the lspci commands to list the tree view and to dump the details of each link; I would suggest using them to check that the payload sizes match. What I do is run "lspci -tvvv" to see what's connected, then "lspci -s xx:xx.x -vvv" to check the devices on each link.
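Once `lspci -s xx:xx.x -vvv` output has been captured for every device between the root complex and the endpoint, the comparison can be done mechanically. A Python sketch, assuming the captured text is keyed by bus:device.function; the upstream BDFs and their sizes in the example are hypothetical placeholders, not the customer's topology.

```python
import re
from typing import Dict

# Matches the DevCtl payload line, e.g. "MaxPayload 128 bytes, MaxReadReq 4096 bytes".
DEVCTL_MPS_RE = re.compile(r"MaxPayload (\d+) bytes, MaxReadReq")


def programmed_mps(lspci_by_bdf: Dict[str, str]) -> Dict[str, int]:
    """Map each bus:dev.fn to the MaxPayload programmed in its DevCtl register."""
    result = {}
    for bdf, text in lspci_by_bdf.items():
        match = DEVCTL_MPS_RE.search(text)
        if match:
            result[bdf] = int(match.group(1))
    return result


def mismatches(mps: Dict[str, int]) -> Dict[str, int]:
    """Devices whose MPS is larger than the smallest MPS on the path."""
    if not mps:
        return {}
    floor = min(mps.values())
    return {bdf: size for bdf, size in mps.items() if size > floor}


path = {
    "00:07.0": "MaxPayload 256 bytes, MaxReadReq 128 bytes",   # hypothetical root port
    "50:00.0": "MaxPayload 256 bytes, MaxReadReq 128 bytes",   # hypothetical switch port
    "52:00.1": "MaxPayload 128 bytes, MaxReadReq 4096 bytes",  # NIC port from the thread
}
mps = programmed_mps(path)
print(mps)
print("mismatched:", mismatches(mps) or "none")
```

Flagging everything above the smallest value is a deliberately simple rule: it reports that the path is inconsistent without deciding which side is wrong.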

Thanks.

Todd Fujinaka
Technical Marketing Engineer
LAN Access Division (LAD)
Intel Corporation
[email protected]
(503) 712-4565



_______________________________________________
E1000-devel mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/e1000-devel
To learn more about Intel&#174; Ethernet, visit http://communities.intel.com/community/wired

2012-11-27 18:10:45

by Ben Hutchings

[permalink] [raw]
Subject: RE: [E1000-devel] 82571EB: Detected Hardware Unit Hang

On Tue, 2012-11-27 at 17:32 +0000, Fujinaka, Todd wrote:
> Forgive me if I'm being too repetitious as I think some of this has
> been mentioned in the past.
>
> We (and by we I mean the Ethernet part and driver) can only change the
> advertised availability of a larger MaxPayloadSize. The size is
> negotiated by both sides of the link when the link is established. The
> driver should not change the size of the link as it would be poking at
> registers outside of its scope and is controlled by the upstream
> bridge (not us).
[...]

MaxPayloadSize (MPS) is not negotiated between devices but is programmed
by the system firmware (at least for devices present at boot - the
kernel may be responsible in case of hotplug). You can use the kernel
parameter 'pci=pcie_bus_perf' (or one of several others) to set a policy
that overrides this, but no policy will allow setting MPS above the
device's MaxPayloadSizeSupported (MPSS).

(These parameters are not documented in
Documentation/kernel-parameters.txt! Someone ought to fix that.)
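The constraint that a policy can only choose an MPS at or below each device's MPSS can be illustrated with a simplified Python model. It is not the kernel's code: it assumes a single root-to-endpoint path, uses the PCIe 3-bit size encoding (128 bytes = 0 up to 4096 bytes = 5), and picks the smallest MPSS on the path, which is the idea behind the kernel's `pcie_bus_safe` option. The MPSS values for the bridges in the example are hypothetical; 256 for the endpoint comes from the DevCap line quoted in this thread.

```python
from typing import Sequence

SIZES = (128, 256, 512, 1024, 2048, 4096)


def encode_size(size: int) -> int:
    """PCIe 3-bit size code: 128 bytes -> 0, 256 -> 1, ..., 4096 -> 5."""
    if size not in SIZES:
        raise ValueError(f"invalid payload size {size}")
    return SIZES.index(size)


def safe_mps(mpss_on_path: Sequence[int]) -> int:
    """Largest MPS that every device on the path supports (smallest MPSS)."""
    if not mpss_on_path:
        raise ValueError("empty path")
    return min(mpss_on_path)


def clamp_mps(requested: int, mpss: int) -> int:
    """Whatever the policy asks for, MPS may not exceed the device's MPSS."""
    return min(requested, mpss)


# Hypothetical path: root port, switch port, then the 82571EB port (DevCap 256).
path_mpss = [512, 256, 256]
print(safe_mps(path_mpss), encode_size(safe_mps(path_mpss)))  # 256 1
print(clamp_mps(4096, 256))                                    # 256
```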

Ben.

--
Ben Hutchings, Staff Engineer, Solarflare
Not speaking for my employer; that's the marketing department's job.
They asked us to note that Solarflare product names are trademarked.

2012-11-27 18:24:37

by Fujinaka, Todd

[permalink] [raw]
Subject: RE: [E1000-devel] 82571EB: Detected Hardware Unit Hang

Thanks for the clarification. I was just going by the PCIe spec, which says the lowest value of both ends is used, and I figured SOMETHING had to be looking at that and doing some sort of negotiation. I'm no BIOS guy, so I'm not sure what's actually going on, whether something walks the PCIe tree or if the BIOS just sets all the values to the minimum.

Todd Fujinaka
Technical Marketing Engineer
LAN Access Division (LAD)
Intel Corporation
[email protected]
(503) 712-4565




2012-11-28 08:31:36

by Joe Jin

[permalink] [raw]
Subject: Re: [E1000-devel] 82571EB: Detected Hardware Unit Hang

On 11/28/12 02:10, Ben Hutchings wrote:
> On Tue, 2012-11-27 at 17:32 +0000, Fujinaka, Todd wrote:
>> Forgive me if I'm being too repetitious as I think some of this has
>> been mentioned in the past.
>>
>> We (and by we I mean the Ethernet part and driver) can only change the
>> advertised availability of a larger MaxPayloadSize. The size is
>> negotiated by both sides of the link when the link is established. The
>> driver should not change the size of the link as it would be poking at
>> registers outside of its scope and is controlled by the upstream
>> bridge (not us).
> [...]
>
> MaxPayloadSize (MPS) is not negotiated between devices but is programmed
> by the system firmware (at least for devices present at boot - the
> kernel may be responsible in case of hotplug). You can use the kernel
> parameter 'pci=pcie_bus_perf' (or one of several others) to set a policy
> that overrides this, but no policy will allow setting MPS above the
> device's MaxPayloadSizeSupported (MPSS).
>

Ben,

Unfortunately I'm using a 3.0.x kernel, which doesn't include this. So I'm
trying to modify it in the EEPROM with ethtool to see whether that helps.


Todd, I'll review the MaxPayload settings of all devices. But I should say
that if they mismatch, the customer cannot change them from the BIOS because
there is no option for it there. To test it, we need a way to verify whether
this is the root cause, so I still need to find the offset in the EEPROM.

Thanks in advance,
Joe

2012-11-28 15:53:05

by Fujinaka, Todd

[permalink] [raw]
Subject: RE: [E1000-devel] 82571EB: Detected Hardware Unit Hang

The only EEPROM I know about or can speak to is the one attached to the 82571 and it doesn't set the MaxPayloadSize. That's done by the BIOS.

Todd Fujinaka
Technical Marketing Engineer
LAN Access Division (LAD)
Intel Corporation
[email protected]
(503) 712-4565




2012-11-29 03:10:26

by ethan zhao

[permalink] [raw]
Subject: Re: [E1000-devel] 82571EB: Detected Hardware Unit Hang

Joe,
Possibly your customer is running a kernel without source code on
a platform whose vendor won't fix the BIOS issue (is it an
HP/Dell server?).
Anyway, to see whether it is a payload issue, you could use the setpci
tool to change the payload size of those devices and then set the link
retrain bit to trigger link retraining. That lets you debug the issue and
identify the root cause, and I think it is much easier than modifying the
BIOS or the NIC's EEPROM.

For example:
Set the Device Control register to 0x000f (128-byte payload size):
# setpci -v -s 00:02.0 98.w=000f
Set the Link Control register to 0x60 (retrain the link):
# setpci -v -s 00:02.0 a0.b=60
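The two values in the example can be decoded against the PCIe register layout. In Device Control, bits 7:5 hold Max_Payload_Size and bits 14:12 Max_Read_Request_Size, both encoded 0 = 128 bytes up to 5 = 4096 bytes; bits 0-3 are the error-reporting enables. So 0x000f sets both sizes to 128 bytes, enables all four error reports, and clears everything else, including Relaxed Ordering and No Snoop. In Link Control, 0x60 sets Retrain Link (bit 5) and Common Clock Configuration (bit 6), and because the whole byte is written it also clears the other bits in it, such as ASPM control. The offsets 0x98 and 0xa0 are specific to device 00:02.0 in this example. The Python sketch below (helper names hypothetical) decodes these fields and shows a read-modify-write that changes only the payload field; the 0x581f input is reconstructed from the flags lspci printed for 52:00.1, not read from hardware.

```python
# Device Control register fields (bit positions from the PCIe base spec).
MPS_SHIFT = 5          # bits 7:5   Max_Payload_Size
MRRS_SHIFT = 12        # bits 14:12 Max_Read_Request_Size
SIZE_FIELD = 0x7       # both size fields are 3 bits wide
# Link Control register bits.
LNKCTL_RETRAIN = 1 << 5       # Retrain Link
LNKCTL_COMMON_CLOCK = 1 << 6  # Common Clock Configuration


def decode_size(devctl: int, shift: int) -> int:
    """Decode a 3-bit size field: 0 -> 128 bytes, 1 -> 256, ..., 5 -> 4096."""
    return 128 << ((devctl >> shift) & SIZE_FIELD)


def with_mps(devctl: int, size: int) -> int:
    """Return devctl with only Max_Payload_Size replaced (read-modify-write)."""
    if size < 128 or size > 4096 or size & (size - 1):
        raise ValueError(f"invalid payload size {size}")
    code = size.bit_length() - 8           # 128 -> 0, 256 -> 1, ...
    cleared = devctl & ~(SIZE_FIELD << MPS_SHIFT)
    return cleared | (code << MPS_SHIFT)


print(decode_size(0x000F, MPS_SHIFT), decode_size(0x000F, MRRS_SHIFT))  # 128 128
print(bool(0x60 & LNKCTL_RETRAIN), bool(0x60 & LNKCTL_COMMON_CLOCK))    # True True
# DevCtl of 52:00.1 rebuilt from lspci flags (errors+, RlxdOrd+, NoSnoop+, MPS 128, MRRS 4096).
endpoint = 0x581F
print(hex(with_mps(endpoint, 256)), decode_size(endpoint, MRRS_SHIFT))  # 0x583f 4096
```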

Hope it works; just my 2 cents.

[email protected]


2012-11-29 15:52:24

by Fujinaka, Todd

[permalink] [raw]
Subject: RE: [E1000-devel] 82571EB: Detected Hardware Unit Hang

Someone else pointed this out to me locally. If you have a non-client BIOS, you should be able to set the MaxPayloadSize using setpci. You have to make sure that you're being consistent throughout all the associated links.
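The consistency requirement and the setpci approach can be combined: compute one Device Control payload value and apply it to every device on the path. The Python sketch below only builds the command strings. It uses setpci's value:mask form so that only the Max_Payload_Size bits (mask 0x00e0) are written, leaving the error-reporting, Relaxed Ordering and No Snoop bits alone. The path and offsets are hypothetical: 0x98 is taken from the 00:02.0 example earlier in the thread, and 0xe8 assumes the [e0] Express capability shown for 52:00.1 (Device Control sits 8 bytes into that capability). A link retrain, as in that example, may still be needed afterwards.

```python
from typing import Dict, List

MPS_MASK = 0x00E0  # Device Control bits 7:5, Max_Payload_Size


def setpci_commands(devctl_offsets: Dict[str, int], size: int) -> List[str]:
    """Build one setpci command per device so every link gets the same MPS.

    devctl_offsets maps bus:dev.fn to the config-space offset of that device's
    Device Control register (PCIe capability offset + 8).
    """
    if size < 128 or size > 4096 or size & (size - 1):
        raise ValueError(f"invalid payload size {size}")
    value = (size.bit_length() - 8) << 5   # 128 -> 0x0000, 256 -> 0x0020
    return [f"setpci -v -s {bdf} {offset:x}.w={value:04x}:{MPS_MASK:04x}"
            for bdf, offset in devctl_offsets.items()]


# Hypothetical path; list every port between the root complex and the NIC.
path = {"00:02.0": 0x98, "52:00.1": 0xE8}
for command in setpci_commands(path, 128):
    print(command)
```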

Todd Fujinaka
Technical Marketing Engineer
LAN Access Division (LAD)
Intel Corporation
[email protected]
(503) 712-4565

