2018-04-24 15:16:28

by Holger Schurig

[permalink] [raw]
Subject: [BUG] igb: reconnecting of cable not always detected

Hi all,

I'm on kernel 4.16.4 and have an issue with eth0, driver is igb. When I
remove the ethernet cable, this is always detected:

[ 2.772360] igb: Intel(R) Gigabit Ethernet Network Driver - version 5.4.0-k
[ 2.772363] igb: Copyright (c) 2007-2014 Intel Corporation.
[ 3.023707] igb 0000:02:00.0: added PHC on eth0
[ 3.023710] igb 0000:02:00.0: Intel(R) Gigabit Ethernet Network Connection
[ 3.023713] igb 0000:02:00.0: eth0: (PCIe:2.5Gb/s:Width x1) 00:13:95:1a:54:33
[ 3.023758] igb 0000:02:00.0: eth0: PBA No: 000300-000
[ 3.023762] igb 0000:02:00.0: Using MSI-X interrupts. 4 rx queue(s), 4 tx queue(s)
[ 7.984921] igb 0000:02:00.0 eth0: igb: eth0 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: RX/TX
[ 11.184593] igb 0000:02:00.0 eth0: igb: eth0 NIC Link is Down

Sometimes, plugging the cable back in is detected ...

[ 43.736922] igb 0000:02:00.0 eth0: igb: eth0 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: RX/TX

... but sometimes this is *NOT* detected. I can put the cable in and
even after two minutes nothing has been detected.

But when I run "rmmod igb" followed by "modpobe igb", the link is
detected again:

[ 100.528609] igb 0000:02:00.0 eth0: igb: eth0 NIC Link is Down
[ 2336.583244] igb 0000:02:00.0: removed PHC on eth0
[ 2339.693521] igb: Intel(R) Gigabit Ethernet Network Driver - version 5.4.0-k
[ 2339.693524] igb: Copyright (c) 2007-2014 Intel Corporation.
[ 2339.990553] pps pps0: new PPS source ptp0
[ 2339.990561] igb 0000:02:00.0: added PHC on eth0
[ 2339.990565] igb 0000:02:00.0: Intel(R) Gigabit Ethernet Network Connection
[ 2339.990569] igb 0000:02:00.0: eth0: (PCIe:2.5Gb/s:Width x1) 00:13:95:1a:54:33
[ 2339.990611] igb 0000:02:00.0: eth0: PBA No: 000300-000
[ 2339.990615] igb 0000:02:00.0: Using MSI-X interrupts. 4 rx queue(s), 4 tx queue(s)
[ 2343.001114] igb 0000:02:00.0 eth0: igb: eth0 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: RX/TX

(In above dmesg snippet the ethernet cable was the whole time inserted).


Any tips on how I can debug this further?

PS: I already tried a different switch and also a direct connection from
device-to-device, without a switch.


2018-04-24 18:10:39

by Alexander Duyck

[permalink] [raw]
Subject: Re: [BUG] igb: reconnecting of cable not always detected

On Tue, Apr 24, 2018 at 8:14 AM, Holger Schurig <[email protected]> wrote:
> Hi all,
>
> I'm on kernel 4.16.4 and have an issue with eth0, driver is igb. When I
> remove the ethernet cable, this is always detected:
>
> [ 2.772360] igb: Intel(R) Gigabit Ethernet Network Driver - version 5.4.0-k
> [ 2.772363] igb: Copyright (c) 2007-2014 Intel Corporation.
> [ 3.023707] igb 0000:02:00.0: added PHC on eth0
> [ 3.023710] igb 0000:02:00.0: Intel(R) Gigabit Ethernet Network Connection
> [ 3.023713] igb 0000:02:00.0: eth0: (PCIe:2.5Gb/s:Width x1) 00:13:95:1a:54:33
> [ 3.023758] igb 0000:02:00.0: eth0: PBA No: 000300-000
> [ 3.023762] igb 0000:02:00.0: Using MSI-X interrupts. 4 rx queue(s), 4 tx queue(s)
> [ 7.984921] igb 0000:02:00.0 eth0: igb: eth0 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: RX/TX
> [ 11.184593] igb 0000:02:00.0 eth0: igb: eth0 NIC Link is Down
>
> Sometimes, plugging the cable back in is detected ...
>
> [ 43.736922] igb 0000:02:00.0 eth0: igb: eth0 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: RX/TX
>
> ... but sometimes this is *NOT* detected. I can put the cable in and
> even after two minutes nothing has been detected.
>
> But when I run "rmmod igb" followed by "modpobe igb", the link is
> detected again:
>
> [ 100.528609] igb 0000:02:00.0 eth0: igb: eth0 NIC Link is Down
> [ 2336.583244] igb 0000:02:00.0: removed PHC on eth0
> [ 2339.693521] igb: Intel(R) Gigabit Ethernet Network Driver - version 5.4.0-k
> [ 2339.693524] igb: Copyright (c) 2007-2014 Intel Corporation.
> [ 2339.990553] pps pps0: new PPS source ptp0
> [ 2339.990561] igb 0000:02:00.0: added PHC on eth0
> [ 2339.990565] igb 0000:02:00.0: Intel(R) Gigabit Ethernet Network Connection
> [ 2339.990569] igb 0000:02:00.0: eth0: (PCIe:2.5Gb/s:Width x1) 00:13:95:1a:54:33
> [ 2339.990611] igb 0000:02:00.0: eth0: PBA No: 000300-000
> [ 2339.990615] igb 0000:02:00.0: Using MSI-X interrupts. 4 rx queue(s), 4 tx queue(s)
> [ 2343.001114] igb 0000:02:00.0 eth0: igb: eth0 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: RX/TX
>
> (In above dmesg snippet the ethernet cable was the whole time inserted).
>
>
> Any tips on how I can debug this further?
>
> PS: I already tried a different switch and also a direct connection from
> device-to-device, without a switch.

Sounds like the link is failing to re-establish. You might double
check a few things. One is to verify if the link partner is
recognizing the link as coming up or not. That would help to tell us
if this is a problem of the driver detecting the link, or if the link
itself is not being re-established.

Another thing you could look at doing is running "ethtool -r eth0"
after plugging the cable in to see if that re-establishes the link or
not. It should be easier anyway than having to unload and reload the
driver.

If you could also provide an "lspci -vvv" and "ethtool -i" for the
device it would help us in the debugging process as it would provide
us with information on what NIC it is you are using and what firmware
is in use on it.

Thanks.

- Alex

2018-04-25 03:31:45

by Richard Cochran

[permalink] [raw]
Subject: Re: [BUG] igb: reconnecting of cable not always detected

On Tue, Apr 24, 2018 at 11:09:02AM -0700, Alexander Duyck wrote:
> On Tue, Apr 24, 2018 at 8:14 AM, Holger Schurig <[email protected]> wrote:
> > Sometimes, plugging the cable back in is detected ...
> >
> > [ 43.736922] igb 0000:02:00.0 eth0: igb: eth0 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: RX/TX
> >
> > ... but sometimes this is *NOT* detected. I can put the cable in and
> > even after two minutes nothing has been detected.
> >
> > But when I run "rmmod igb" followed by "modpobe igb", the link is
> > detected again:

FWIW, I have noticed over the past months (or even years?) that my
i210 cards (or the igb driver) also fail to detect link changes after
a few physical link interruptions. I never bothered to try and debug
this, but it is super annoying.

Thanks,
Richard

2018-04-25 09:48:39

by Holger Schurig

[permalink] [raw]
Subject: Re: [BUG] igb: reconnecting of cable not always detected

Hi Alex,

(Sent a 2nd time, this time with "Reply to all" and without HTML, so
that it hits the kernel archives as well. Sorry for the noise.




> Sounds like the link is failing to re-establish. You might double
> check a few things. One is to verify if the link partner is
> recognizing the link as coming up or not.

It turns on differently. Before I remove the cable, the LED on the TP
LINK "TL SG-108" was green. After removing the cable, the LED went off.
After reinserting the cable, it became orange after some while.

Green LED means 1000 MB/s, orange LED means 10/100 MB/s.


I have a different, even older switch: "Allnet ALL8039". Here the same:
the switch detects a link, but igb not.



> If you could also provide an "lspci -vvv"

02:00.0 Ethernet controller: Intel Corporation I210 Gigabit Network
Connection (rev 03)
Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr-
Stepping- SERR- FastB2B- DisINTx+
Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort-
<TAbort- <MAbort- >SERR- <PERR- INTx-
Latency: 0, Cache Line Size: 64 bytes
Interrupt: pin A routed to IRQ 19
Region 0: Memory at 90600000 (32-bit, non-prefetchable) [size=512K]
Region 2: I/O ports at d000 [size=32]
Region 3: Memory at 90680000 (32-bit, non-prefetchable) [size=16K]
Capabilities: [40] Power Management version 3
Flags: PMEClk- DSI+ D1- D2- AuxCurrent=0mA
PME(D0+,D1-,D2-,D3hot+,D3cold+)
Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=1 PME-
Capabilities: [50] MSI: Enable- Count=1/1 Maskable+ 64bit+
Address: 0000000000000000 Data: 0000
Masking: 00000000 Pending: 00000000
Capabilities: [70] MSI-X: Enable+ Count=5 Masked-
Vector table: BAR=3 offset=00000000
PBA: BAR=3 offset=00002000
Capabilities: [a0] Express (v2) Endpoint, MSI 00
DevCap: MaxPayload 512 bytes, PhantFunc 0, Latency L0s
<512ns, L1 <64us
ExtTag- AttnBtn- AttnInd- PwrInd- RBE+ FLReset+
SlotPowerLimit 0.000W
DevCtl: Report errors: Correctable+ Non-Fatal+ Fatal+
Unsupported+
RlxdOrd- ExtTag- PhantFunc- AuxPwr- NoSnoop+
FLReset-
MaxPayload 128 bytes, MaxReadReq 512 bytes
DevSta: CorrErr+ UncorrErr- FatalErr- UnsuppReq+ AuxPwr+
TransPend-
LnkCap: Port #0, Speed 2.5GT/s, Width x1, ASPM L0s L1, Exit
Latency L0s <2us, L1 <16us
ClockPM- Surprise- LLActRep- BwNot- ASPMOptComp+
LnkCtl: ASPM Disabled; RCB 64 bytes Disabled- CommClk+
ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
LnkSta: Speed 2.5GT/s, Width x1, TrErr- Train- SlotClk+
DLActive- BWMgmt- ABWMgmt-
DevCap2: Completion Timeout: Range ABCD, TimeoutDis+, LTR-,
OBFF Not Supported
DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis+,
LTR-, OBFF Disabled
LnkCtl2: Target Link Speed: 2.5GT/s, EnterCompliance-
SpeedDis-
Transmit Margin: Normal Operating Range,
EnterModifiedCompliance- ComplianceSOS-
Compliance De-emphasis: -6dB
LnkSta2: Current De-emphasis Level: -6dB,
EqualizationComplete-, EqualizationPhase1-
EqualizationPhase2-, EqualizationPhase3-,
LinkEqualizationRequest-
Capabilities: [100 v2] Advanced Error Reporting
UESta: DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt-
RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
UEMsk: DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt-
RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
UESvrt: DLP+ SDES+ TLP- FCP+ CmpltTO- CmpltAbrt- UnxCmplt-
RxOF+ MalfTLP+ ECRC- UnsupReq- ACSViol-
CESta: RxErr- BadTLP- BadDLLP- Rollover- Timeout-
NonFatalErr-
CEMsk: RxErr- BadTLP- BadDLLP- Rollover- Timeout-
NonFatalErr+
AERCap: First Error Pointer: 00, GenCap+ CGenEn- ChkCap+
ChkEn-
Capabilities: [140 v1] Device Serial Number 00-13-95-ff-ff-1a-54-33
Capabilities: [1a0 v1] Transaction Processing Hints
Device specific mode supported
Steering table in TPH capability structure
Kernel driver in use: igb
Kernel modules: igb

> and "ethtool -i" for the

driver: igb
version: 5.4.0-k
firmware-version: 3.20, 0x80000553
expansion-rom-version:
bus-info: 0000:02:00.0
supports-statistics: yes
supports-test: yes
supports-eeprom-access: yes
supports-register-dump: yes
supports-priv-flags: yes



One thing that is interesting is how igb reacts to ethtool inquiries
once it goes into the failed state. You inquired for "ethtool -i eth0",
but in the failed state I only get this:

Cannot restart autonegotiation: No such device

But eth0 is of course still there, "ip -d link show eth0" shows:


2: eth0: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc mq state DOWN
mode DEFAULT group default qlen 1000
link/ether 00:13:95:1a:54:33 brd ff:ff:ff:ff:ff:ff promiscuity 0
numtxqueues 8 numrxqueues 8 gso_max_size 65536 gso_max_segs 65535





Other ethtool commands also don't report any information once the link
went bogus. Here one output from "ethtool eth0":

Settings for eth0:
Supported ports: [ TP ]
Supported link modes: 10baseT/Half 10baseT/Full
100baseT/Half 100baseT/Full
1000baseT/Full
Supported pause frame use: Symmetric
Supports auto-negotiation: Yes
Advertised link modes: 10baseT/Half 10baseT/Full
100baseT/Half 100baseT/Full
1000baseT/Full
Advertised pause frame use: Symmetric
Advertised auto-negotiation: Yes
Speed: 1000Mb/s
Duplex: Full
Port: Twisted Pair
PHYAD: 1
Transceiver: internal
Auto-negotiation: on
MDI-X: off (auto)
Supports Wake-on: pumbg
Wake-on: g
Current message level: 0x00000007 (7)
drv probe link
Link detected: yes

... and here another:

Settings for eth0:
Cannot get device settings: No such device
Cannot get wake-on-lan settings: No such device
Cannot get message level: No such device
Cannot get link status: No such device
Settings for eth0:
No data available



I'm willing to pepper the source with printk, if this helps :-)


Greetings,
Holger

2018-04-25 16:03:48

by Alexander Duyck

[permalink] [raw]
Subject: Re: [BUG] igb: reconnecting of cable not always detected

On Wed, Apr 25, 2018 at 2:47 AM, Holger Schurig <[email protected]> wrote:
> Hi Alex,
>
> (Sent a 2nd time, this time with "Reply to all" and without HTML, so
> that it hits the kernel archives as well. Sorry for the noise.
>
>
>
>
>> Sounds like the link is failing to re-establish. You might double
>> check a few things. One is to verify if the link partner is
>> recognizing the link as coming up or not.
>
> It turns on differently. Before I remove the cable, the LED on the TP
> LINK "TL SG-108" was green. After removing the cable, the LED went off.
> After reinserting the cable, it became orange after some while.
>
> Green LED means 1000 MB/s, orange LED means 10/100 MB/s.

Was the orange LED on the igb NIC or on the TL SG-108? Based on the
comment below I am assuming it is the switch.

Based on that I am thinking we probably need to work on the PHY configuration.

> I have a different, even older switch: "Allnet ALL8039". Here the same:
> the switch detects a link, but igb not.
>
>
>
>> If you could also provide an "lspci -vvv"
>
> 02:00.0 Ethernet controller: Intel Corporation I210 Gigabit Network
> Connection (rev 03)

Okay so we are working with an i210.

> Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr-
> Stepping- SERR- FastB2B- DisINTx+
> Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort-
> <TAbort- <MAbort- >SERR- <PERR- INTx-
> Latency: 0, Cache Line Size: 64 bytes
> Interrupt: pin A routed to IRQ 19
> Region 0: Memory at 90600000 (32-bit, non-prefetchable) [size=512K]
> Region 2: I/O ports at d000 [size=32]
> Region 3: Memory at 90680000 (32-bit, non-prefetchable) [size=16K]
> Capabilities: [40] Power Management version 3
> Flags: PMEClk- DSI+ D1- D2- AuxCurrent=0mA
> PME(D0+,D1-,D2-,D3hot+,D3cold+)
> Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=1 PME-
> Capabilities: [50] MSI: Enable- Count=1/1 Maskable+ 64bit+
> Address: 0000000000000000 Data: 0000
> Masking: 00000000 Pending: 00000000
> Capabilities: [70] MSI-X: Enable+ Count=5 Masked-
> Vector table: BAR=3 offset=00000000
> PBA: BAR=3 offset=00002000
> Capabilities: [a0] Express (v2) Endpoint, MSI 00
> DevCap: MaxPayload 512 bytes, PhantFunc 0, Latency L0s
> <512ns, L1 <64us
> ExtTag- AttnBtn- AttnInd- PwrInd- RBE+ FLReset+
> SlotPowerLimit 0.000W
> DevCtl: Report errors: Correctable+ Non-Fatal+ Fatal+
> Unsupported+
> RlxdOrd- ExtTag- PhantFunc- AuxPwr- NoSnoop+
> FLReset-
> MaxPayload 128 bytes, MaxReadReq 512 bytes
> DevSta: CorrErr+ UncorrErr- FatalErr- UnsuppReq+ AuxPwr+
> TransPend-
> LnkCap: Port #0, Speed 2.5GT/s, Width x1, ASPM L0s L1, Exit
> Latency L0s <2us, L1 <16us
> ClockPM- Surprise- LLActRep- BwNot- ASPMOptComp+
> LnkCtl: ASPM Disabled; RCB 64 bytes Disabled- CommClk+
> ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
> LnkSta: Speed 2.5GT/s, Width x1, TrErr- Train- SlotClk+
> DLActive- BWMgmt- ABWMgmt-
> DevCap2: Completion Timeout: Range ABCD, TimeoutDis+, LTR-,
> OBFF Not Supported
> DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis+,
> LTR-, OBFF Disabled
> LnkCtl2: Target Link Speed: 2.5GT/s, EnterCompliance-
> SpeedDis-
> Transmit Margin: Normal Operating Range,
> EnterModifiedCompliance- ComplianceSOS-
> Compliance De-emphasis: -6dB
> LnkSta2: Current De-emphasis Level: -6dB,
> EqualizationComplete-, EqualizationPhase1-
> EqualizationPhase2-, EqualizationPhase3-,
> LinkEqualizationRequest-
> Capabilities: [100 v2] Advanced Error Reporting
> UESta: DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt-
> RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
> UEMsk: DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt-
> RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
> UESvrt: DLP+ SDES+ TLP- FCP+ CmpltTO- CmpltAbrt- UnxCmplt-
> RxOF+ MalfTLP+ ECRC- UnsupReq- ACSViol-
> CESta: RxErr- BadTLP- BadDLLP- Rollover- Timeout-
> NonFatalErr-
> CEMsk: RxErr- BadTLP- BadDLLP- Rollover- Timeout-
> NonFatalErr+
> AERCap: First Error Pointer: 00, GenCap+ CGenEn- ChkCap+
> ChkEn-
> Capabilities: [140 v1] Device Serial Number 00-13-95-ff-ff-1a-54-33
> Capabilities: [1a0 v1] Transaction Processing Hints
> Device specific mode supported
> Steering table in TPH capability structure
> Kernel driver in use: igb
> Kernel modules: igb
>
>> and "ethtool -i" for the
>
> driver: igb
> version: 5.4.0-k
> firmware-version: 3.20, 0x80000553
> expansion-rom-version:
> bus-info: 0000:02:00.0
> supports-statistics: yes
> supports-test: yes
> supports-eeprom-access: yes
> supports-register-dump: yes
> supports-priv-flags: yes
>
>
>
> One thing that is interesting is how igb reacts to ethtool inquiries
> once it goes into the failed state. You inquired for "ethtool -i eth0",
> but in the failed state I only get this:
>
> Cannot restart autonegotiation: No such device

I assume you mean "ethtool -r" since that is what is supposed to be
restarting negotiation. The "ethtool -i" is what you provided above.

The fact that the device disappears is a bit concerning. I'm wondering
if we are somehow triggering the surprise removal code.

> But eth0 is of course still there, "ip -d link show eth0" shows:
>
>
> 2: eth0: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc mq state DOWN
> mode DEFAULT group default qlen 1000
> link/ether 00:13:95:1a:54:33 brd ff:ff:ff:ff:ff:ff promiscuity 0
> numtxqueues 8 numrxqueues 8 gso_max_size 65536 gso_max_segs 65535
>
>
>
>
>
> Other ethtool commands also don't report any information once the link
> went bogus. Here one output from "ethtool eth0":
>
> Settings for eth0:
> Supported ports: [ TP ]
> Supported link modes: 10baseT/Half 10baseT/Full
> 100baseT/Half 100baseT/Full
> 1000baseT/Full
> Supported pause frame use: Symmetric
> Supports auto-negotiation: Yes
> Advertised link modes: 10baseT/Half 10baseT/Full
> 100baseT/Half 100baseT/Full
> 1000baseT/Full
> Advertised pause frame use: Symmetric
> Advertised auto-negotiation: Yes
> Speed: 1000Mb/s
> Duplex: Full
> Port: Twisted Pair
> PHYAD: 1
> Transceiver: internal
> Auto-negotiation: on
> MDI-X: off (auto)
> Supports Wake-on: pumbg
> Wake-on: g
> Current message level: 0x00000007 (7)
> drv probe link
> Link detected: yes
>
> ... and here another:
>
> Settings for eth0:
> Cannot get device settings: No such device
> Cannot get wake-on-lan settings: No such device
> Cannot get message level: No such device
> Cannot get link status: No such device
> Settings for eth0:
> No data available
>
>
>
> I'm willing to pepper the source with printk, if this helps :-)
>
>
> Greetings,
> Holger

Thanks. I'm suspecting we may need to instrument igb_rd32 at this
point. In order to trigger what you are seeing I am assuming the
device has been detached due to a read failure of some sort.

Another thing you could look at doing is narrowing down the possible
factors involved. You could go through and limit phy settings and look
at possibly dropping features such as EEE if it is enabled on the
device.

Thanks.

- Alex

2018-04-26 07:56:14

by Holger Schurig

[permalink] [raw]
Subject: Re: [BUG] igb: reconnecting of cable not always detected

> Was the orange LED on the igb NIC or on the TL SG-108? Based on the
> comment below I am assuming it is the switch.

The LEDs were on the switch.

When everything works, the switch says green == 1000 MB/s.

When cable is disconnected, switch doesn't light any LED.

When cable is inserted and things fail, the switch says orange LED == 100 MB/s.
Sometimes the insertion process works, then the switch will go, of
course, to the green LED == 1000 MB/s.

I must admit that I didn't look at the LEDs of the device.


Now I looked there, and the device the left+green LED is on. In the
failed case (so, in the dmesg output the last thing I see is "Link is
Down", but the device still has left+green LED on.

The right+orange LED on the device seems to indicate traffic, and it is
constantly off in the failed case.



> I assume you mean "ethtool -r" since that is what is supposed to be
> restarting negotiation. The "ethtool -i" is what you provided above.

Maybe I've edited my text too much and moved output along. Anyway, in
the failed case neither "ethtool- r eth0" nor "ethtool -i eth0" nor
"mii-tool eth0" work at all, they all emit error warning.

> Thanks. I'm suspecting we may need to instrument igb_rd32 at this
> point. In order to trigger what you are seeing I am assuming the
> device has been detached due to a read failure of some sort.

I'll do that and reply later. I first need to understand this source
part :-)


> Another thing you could look at doing is narrowing down the possible
> factors involved. You could go through and limit phy settings and look
> at possibly dropping features such as EEE if it is enabled on the
> device.

I actually tried a driver patch to remove 1000 GB/s from the driver, in
the assumption that maybe this specific hardware has a bad layout and
thus trouble (I don't really think that, because I never observed any
data transfer problem).

So, is the following patch (that didn't help) what in the line of what
you suggested?


Index: linux-4.16/drivers/net/ethernet/intel/igb/igb_main.c
===================================================================
--- linux-4.16.orig/drivers/net/ethernet/intel/igb/igb_main.c 2018-04-01 23:20:27.000000000 +0200
+++ linux-4.16/drivers/net/ethernet/intel/igb/igb_main.c 2018-04-24 11:35:17.420760650 +0200
@@ -2080,7 +2080,7 @@

if ((adapter->flags & IGB_FLAG_EEE) &&
(!hw->dev_spec._82575.eee_disable))
- adapter->eee_advert = MDIO_EEE_100TX | MDIO_EEE_1000T;
+ adapter->eee_advert = MDIO_EEE_100TX /* | MDIO_EEE_1000T */;

return 0;
}
@@ -2908,7 +2908,7 @@
/* Initialize link properties that are user-changeable */
adapter->fc_autoneg = true;
hw->mac.autoneg = true;
- hw->phy.autoneg_advertised = 0x2f;
+ hw->phy.autoneg_advertised = 0x0f;

hw->fc.requested_mode = e1000_fc_default;
hw->fc.current_mode = e1000_fc_default;
@@ -3099,7 +3099,7 @@
if ((!err) &&
(!hw->dev_spec._82575.eee_disable)) {
adapter->eee_advert =
- MDIO_EEE_100TX | MDIO_EEE_1000T;
+ MDIO_EEE_100TX /* | MDIO_EEE_1000T */;
adapter->flags |= IGB_FLAG_EEE;
}
break;
@@ -3110,7 +3110,7 @@
if ((!err) &&
(!hw->dev_spec._82575.eee_disable)) {
adapter->eee_advert =
- MDIO_EEE_100TX | MDIO_EEE_1000T;
+ MDIO_EEE_100TX /* | MDIO_EEE_1000T */;
adapter->flags |= IGB_FLAG_EEE;
}
}
Index: linux-4.16/drivers/net/ethernet/intel/igb/igb_ethtool.c
===================================================================
--- linux-4.16.orig/drivers/net/ethernet/intel/igb/igb_ethtool.c 2018-04-01 23:20:27.000000000 +0200
+++ linux-4.16/drivers/net/ethernet/intel/igb/igb_ethtool.c 2018-04-24 11:42:36.737959749 +0200
@@ -170,7 +170,7 @@
SUPPORTED_10baseT_Full |
SUPPORTED_100baseT_Half |
SUPPORTED_100baseT_Full |
- SUPPORTED_1000baseT_Full|
+ /* SUPPORTED_1000baseT_Full| */
SUPPORTED_Autoneg |
SUPPORTED_TP |
SUPPORTED_Pause);
@@ -3003,7 +3003,7 @@
(hw->phy.media_type != e1000_media_type_copper))
return -EOPNOTSUPP;

- edata->supported = (SUPPORTED_1000baseT_Full |
+ edata->supported = (/* SUPPORTED_1000baseT_Full | */
SUPPORTED_100baseT_Full);
if (!hw->dev_spec._82575.eee_disable)
edata->advertised =

2018-04-26 09:10:13

by Holger Schurig

[permalink] [raw]
Subject: Re: [BUG] igb: reconnecting of cable not always detected

Hi,

> Thanks. I'm suspecting we may need to instrument igb_rd32 at this
> point. In order to trigger what you are seeing I am assuming the
> device has been detached due to a read failure of some sort.

Okay, I added a printk to igb_rd32. And because no one calls this
function directly (all access goes via the rd32/rd32_array macro) I also
added the output of the calling function. This should help greatly in
identifying the read from the hardware to the consumer.

Finally, I noticed that igb_update_stats() produced a lot of churn that
most likely are unrelated. So I helper variable to make output from this
function go away.

I installed this modified driver, rebooted, and removed / inserted the
LAN cable until the error was present.

As before, "ethtool" and "mii-tool" now said that the device is not
there, while "ip link" showed the device as present.


The full output of "journalctl -fk | grep igb" is 600 kB. So put the
whole file at Google Drive:

https://drive.google.com/open?id=1p9cCT2d_EHnSHh29oS3AepUgFTKGFSeA



I looked at the output to see patterns, e.g with

grep -n igb_get_cfg_done_i210 igb.error.txt
grep -n __igb_shutdown igb.error.txt
...

(and almost all other function names). I hoped to see patterns. But for
my untrained eye, things looked not out of the order.





(For reference, here is the debug patch)

Index: linux-4.16/drivers/net/ethernet/intel/igb/igb_main.c
===================================================================
--- linux-4.16.orig/drivers/net/ethernet/intel/igb/igb_main.c 2018-04-01 23:20:27.000000000 +0200
+++ linux-4.16/drivers/net/ethernet/intel/igb/igb_main.c 2018-04-26 10:36:09.625135952 +0200
@@ -759,7 +759,8 @@
}
}

-u32 igb_rd32(struct e1000_hw *hw, u32 reg)
+int igb_rd32_silent = 0;
+u32 igb_rd32(const char *func, struct e1000_hw *hw, u32 reg)
{
struct igb_adapter *igb = container_of(hw, struct igb_adapter, hw);
u8 __iomem *hw_addr = READ_ONCE(hw->hw_addr);
@@ -769,6 +770,8 @@
return ~value;

value = readl(&hw_addr[reg]);
+ if (!igb_rd32_silent)
+ printk("rd32 %s %08x %08x\n", func, reg, value);

/* reads should not return all F's */
if (!(~value) && (!reg || !(~readl(hw_addr)))) {
@@ -5935,6 +5938,7 @@
if (pci_channel_offline(pdev))
return;

+ igb_rd32_silent = 1;
bytes = 0;
packets = 0;

@@ -6100,6 +6104,7 @@
adapter->stats.b2ospc += rd32(E1000_B2OSPC);
adapter->stats.b2ogprc += rd32(E1000_B2OGPRC);
}
+ igb_rd32_silent = 0;
}

static void igb_tsync_interrupt(struct igb_adapter *adapter)
Index: linux-4.16/drivers/net/ethernet/intel/igb/e1000_regs.h
===================================================================
--- linux-4.16.orig/drivers/net/ethernet/intel/igb/e1000_regs.h 2018-04-01 23:20:27.000000000 +0200
+++ linux-4.16/drivers/net/ethernet/intel/igb/e1000_regs.h 2018-04-26 10:34:24.332157000 +0200
@@ -370,7 +370,8 @@

struct e1000_hw;

-u32 igb_rd32(struct e1000_hw *hw, u32 reg);
+extern int igb_rd32_silent;
+u32 igb_rd32(const char *fname, struct e1000_hw *hw, u32 reg);

/* write operations, indexed using DWORDS */
#define wr32(reg, val) \
@@ -380,14 +381,14 @@
writel((val), &hw_addr[(reg)]); \
} while (0)

-#define rd32(reg) (igb_rd32(hw, reg))
+#define rd32(reg) (igb_rd32(__func__, hw, reg))

#define wrfl() ((void)rd32(E1000_STATUS))

#define array_wr32(reg, offset, value) \
wr32((reg) + ((offset) << 2), (value))

-#define array_rd32(reg, offset) (igb_rd32(hw, reg + ((offset) << 2)))
+#define array_rd32(reg, offset) (igb_rd32(__func__, hw, reg + ((offset) << 2)))

/* DMA Coalescing registers */
#define E1000_PCIEMISC 0x05BB8 /* PCIE misc config register */

2018-04-26 16:03:59

by Alexander Duyck

[permalink] [raw]
Subject: Re: [BUG] igb: reconnecting of cable not always detected

On Thu, Apr 26, 2018 at 2:08 AM, Holger Schurig <[email protected]> wrote:
> Hi,
>
>> Thanks. I'm suspecting we may need to instrument igb_rd32 at this
>> point. In order to trigger what you are seeing I am assuming the
>> device has been detached due to a read failure of some sort.
>
> Okay, I added a printk to igb_rd32. And because no one calls this
> function directly (all access goes via the rd32/rd32_array macro) I also
> added the output of the calling function. This should help greatly in
> identifying the read from the hardware to the consumer.
>
> Finally, I noticed that igb_update_stats() produced a lot of churn that
> most likely are unrelated. So I helper variable to make output from this
> function go away.
>
> I installed this modified driver, rebooted, and removed / inserted the
> LAN cable until the error was present.
>
> As before, "ethtool" and "mii-tool" now said that the device is not
> there, while "ip link" showed the device as present.
>
>
> The full output of "journalctl -fk | grep igb" is 600 kB. So put the
> whole file at Google Drive:
>
> https://drive.google.com/open?id=1p9cCT2d_EHnSHh29oS3AepUgFTKGFSeA
>
>
>
> I looked at the output to see patterns, e.g with
>
> grep -n igb_get_cfg_done_i210 igb.error.txt
> grep -n __igb_shutdown igb.error.txt
> ...
>
> (and almost all other function names). I hoped to see patterns. But for
> my untrained eye, things looked not out of the order.


Thanks for the data. It is actually useful. There are a few things
that I see that seem to point to an obvious issue.

The first are the following 2 lines from your dump:
Apr 26 10:42:49 kernel: igb 0000:02:00.0 eth0: igb: eth0 NIC Link is
Up 1000 Mbps Half Duplex, Flow Control: RX
Apr 26 10:42:49 kernel: igb 0000:02:00.0: EEE Disabled: unsupported at
half duplex. Re-enable using ethtool when at full duplex.

In case you aren't aware 1000Mbps Half Duplex is not a valid combination.

The other bit that catches my attention is:
Apr 26 10:42:51 kernel: igb 0000:02:00.0: exceed max 2 second

Which appears to be a timeout error that is triggered in response to
the above error which I believe is the fact that it didn't actually
link at 1000Mbps.

As I get time I will try to look into this further. I will have to go
through the MDIC reads to figure out if there is something in there
that is providing us with bad information from the PHY or if we are
misinterpreting something.

Thanks.

- Alex

2018-04-27 10:41:09

by Holger Schurig

[permalink] [raw]
Subject: Re: [BUG] igb: reconnecting of cable not always detected

Hi Alex,

> The first are the following 2 lines from your dump:
> Apr 26 10:42:49 kernel: igb 0000:02:00.0 eth0: igb: eth0 NIC Link is
> Up 1000 Mbps Half Duplex, Flow Control: RX
> Apr 26 10:42:49 kernel: igb 0000:02:00.0: EEE Disabled: unsupported at
> half duplex. Re-enable using ethtool when at full duplex.

Can it be the case that this is just a follow-up error?

In one of the mails from yesterday I showed you my patch to disable 1000
MB/s ... and still I had the link-always-down.

Similarly when I used a 10/100 MB/s switch only.

Both scenarios disabled 1000 MB/s, one more strictly than the other :-)


2018-05-18 07:36:26

by Holger Schurig

[permalink] [raw]
Subject: Re: [BUG] igb: reconnecting of cable not always detected

Alexander Duyck <[email protected]> writes:
> Thanks for the data. It is actually useful. There are a few things
> that I see that seem to point to an obvious issue.

Any news on this?

A collegue of mine states (I have not checked this) that a kernel
4.9.0-6-686 from a Debian Live ISO (debian-live-9.4.0-i386-kde.iso)
didn't show this behavior, so we have some kind of regression perhaps?


Greetings,
Holger

2019-01-17 21:56:08

by Jeff Kirsher

[permalink] [raw]
Subject: Re: [Intel-wired-lan] [BUG] igb: reconnecting of cable not always detected

On Fri, May 18, 2018 at 12:36 AM Holger Schurig <[email protected]> wrote:
>
> Alexander Duyck <[email protected]> writes:
> > Thanks for the data. It is actually useful. There are a few things
> > that I see that seem to point to an obvious issue.
>
> Any news on this?
>
> A collegue of mine states (I have not checked this) that a kernel
> 4.9.0-6-686 from a Debian Live ISO (debian-live-9.4.0-i386-kde.iso)
> didn't show this behavior, so we have some kind of regression perhaps?

Our validation team was only able to reproduce this once, but is not
able to reproduce the issue again or even consistently to be able to
adequate debug the issue.

Are you still seeing the issue with the latest upstream kernel from
either David Miller's net-next tree or Linus's tree?

--
Cheers,
Jeff