Hi there,
I have some servers with an 82574L based NIC and recently upgraded from
a 4.4 series kernel to 4.7. Upon doing so, servers with this chipset
have begun frequently reporting "Link is Down" and "Link is Up"
messages. No other related network errors are reported by the kernel or
e1000e driver. I saw some reports about using "ethtool -s $iface msglvl
6" to reveal more information, but nothing extra was reported.
Some testing showed that this was introduced between the 4.4 and 4.5
series. I was able to further narrow it down to two commits that look
related:
e1000e: Do not write lsc to ics in msi-x mode
(a61cfe4ffad7864a07e0c74969ca7ceb77ab2f1f)
e1000e: Do not read ICR in Other interrupt
(16ecba59bc333d6282ee057fb02339f77a880beb)
Reverting these two commits resolves the Link is Down/Link is Up
messages. This has been tested on about six servers so far and all have
stopped reporting these link flaps.
In total I have about ten servers that are frequently seeing this issue,
and a couple dozen more triggering it sporadically.
This is about the extent of my troubleshooting knowledge so far. I am
happy to test code changes and provide any additional information as
necessary. While I do not understand what specifically causes the link
flaps, they reliably begin occurring on the affected servers within a
couple hours of boot.
A snip of one such instance is below.
Thank you for any assistance troubleshooting this.
Kind regards,
Jack Suter
# ethtool -i enp2s0
driver: e1000e
version: 3.2.6-k
firmware-version: 2.1-2
bus-info: 0000:02:00.0
supports-statistics: yes
supports-test: yes
supports-eeprom-access: yes
supports-register-dump: yes
supports-priv-flags: no
[ 3532.745587] e1000e: enp2s0 NIC Link is Down
[ 3532.771461] e1000e: enp2s0 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: Rx/Tx
[15463.117592] e1000e: enp2s0 NIC Link is Down
[15463.119419] e1000e: enp2s0 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: Rx/Tx
[15469.155922] e1000e: enp2s0 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: Rx/Tx
[15648.196579] e1000e: enp2s0 NIC Link is Down
[15651.405310] e1000e: enp2s0 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: Rx/Tx
[15728.959981] e1000e: enp2s0 NIC Link is Down
[15729.000625] e1000e: enp2s0 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: Rx/Tx
[15835.132034] e1000e: enp2s0 NIC Link is Down
[15835.185222] e1000e: enp2s0 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: Rx/Tx
[15839.104020] e1000e: enp2s0 NIC Link is Down
[15839.142346] e1000e: enp2s0 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: Rx/Tx
[15845.142287] e1000e: enp2s0 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: Rx/Tx
[16401.940127] e1000e: enp2s0 NIC Link is Down
[16401.945106] e1000e: enp2s0 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: Rx/Tx
[16408.121843] e1000e: enp2s0 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: Rx/Tx
[17025.823220] e1000e: enp2s0 NIC Link is Down
[17025.825473] e1000e: enp2s0 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: Rx/Tx
[17032.100202] e1000e: enp2s0 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: Rx/Tx
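For anyone who wants to quantify flaps like the ones in the log above, here is a minimal Python sketch. It is not part of the original report. It counts "Link is Down" and "Link is Up" messages per interface and pairs each Down with the next Up to estimate how long the link was reported down. It assumes the e1000e message format shown above; the function name `summarize_flaps` and the embedded sample are illustrative only.

```python
import re

# Matches the e1000e link messages shown above. Only the first part of each
# message is needed, so mail-wrapped "Control: Rx/Tx" lines are harmless.
LINK_RE = re.compile(
    r"\[\s*(\d+\.\d+)\]\s+e1000e:\s+(\S+)\s+NIC Link is (Down|Up)"
)

def summarize_flaps(log_text):
    """Return per-interface Down/Up counts and Down->Up gaps in seconds."""
    stats = {}
    last_down = {}
    for match in LINK_RE.finditer(log_text):
        ts = float(match.group(1))
        iface, state = match.group(2), match.group(3)
        entry = stats.setdefault(iface, {"down": 0, "up": 0, "gaps": []})
        if state == "Down":
            entry["down"] += 1
            last_down[iface] = ts
        else:
            entry["up"] += 1
            # Only the first Up after a Down closes a gap; extra Up
            # messages are counted but not paired.
            if iface in last_down:
                entry["gaps"].append(round(ts - last_down.pop(iface), 6))
    return stats

sample = """\
[ 3532.745587] e1000e: enp2s0 NIC Link is Down
[ 3532.771461] e1000e: enp2s0 NIC Link is Up 1000 Mbps Full Duplex, Flow
Control: Rx/Tx
[15463.117592] e1000e: enp2s0 NIC Link is Down
[15463.119419] e1000e: enp2s0 NIC Link is Up 1000 Mbps Full Duplex, Flow
Control: Rx/Tx
[15469.155922] e1000e: enp2s0 NIC Link is Up 1000 Mbps Full Duplex, Flow
Control: Rx/Tx
"""

result = summarize_flaps(sample)
print(result["enp2s0"])
```

In practice the input would be the output of `dmesg` rather than an embedded string. Note that several "Link is Up" messages in the log have no preceding "Link is Down" and arrive roughly six seconds after a flap; the sketch counts them but does not assign them a gap.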
On 2016/11/01 19:56, Jack Suter wrote:
> Hi there,
>
> I have some servers with an 82574L based NIC and recently upgraded from
> a 4.4 series kernel to 4.7. Upon doing so, servers with this chipset
> have begun frequently reporting "Link is Down" and "Link is Up"
> messages. No other related network errors are reported by the kernel or
> e1000e driver. I saw some reports about using "ethtool -s $iface msglvl
> 6" to reveal more information, but nothing extra was reported.
>
> Some testing showed that this was introduced between the 4.4 and 4.5
> series. I was able to further narrow it down to two commits that look
> related:
>
> e1000e: Do not write lsc to ics in msi-x mode
> (a61cfe4ffad7864a07e0c74969ca7ceb77ab2f1f)
> e1000e: Do not read ICR in Other interrupt
> (16ecba59bc333d6282ee057fb02339f77a880beb)
I'm just about to get on a plane but I'll be able to look at this on
Monday. Two guesses are that:
1) Something other than LSC triggers the "other" interrupt. Even if
that is so, however, it should not cause e1000e_check_for_copper_link
to report link down.
2) The link down events are real but some lsc interrupts were not
processed properly prior to this patchset, causing the events to be
lost/ignored.
> From: Jack Suter [mailto:[email protected]]
> Sent: Tuesday, November 1, 2016 4:57 PM
> To: Kirsher, Jeffrey T <[email protected]>
> Cc: [email protected]; [email protected]; Brown, Aaron F
> <[email protected]>; [email protected]; linux-
> [email protected]
> Subject: Kernel regression introduced by "e1000e: Do not write lsc to ics in
> msi-x mode" and/or "e1000e: Do not read ICR in Other interrupt"
>
> Hi there,
>
> I have some servers with an 82574L based NIC and recently upgraded from
> a 4.4 series kernel to 4.7. Upon doing so, servers with this chipset
> have begun frequently reporting "Link is Down" and "Link is Up"
> messages. No other related network errors are reported by the kernel or
> e1000e driver. I saw some reports about using "ethtool -s $iface msglvl
> 6" to reveal more information, but nothing extra was reported.
>
> Some testing showed that this was introduced between the 4.4 and 4.5
> series. I was able to further narrow it down to two commits that look
> related:
>
> e1000e: Do not write lsc to ics in msi-x mode
> (a61cfe4ffad7864a07e0c74969ca7ceb77ab2f1f)
> e1000e: Do not read ICR in Other interrupt
> (16ecba59bc333d6282ee057fb02339f77a880beb)
I did not notice any link flapping when I tested those patches; I would have rejected them if I had. I have several systems with 82574L LOMs and so far have not been able to reproduce a link flap with recent upstream kernels/drivers (net-next 4.8.0 on one and 4.9.0-rc3 on another).
One of those systems is dedicated to kernel regression testing. I checked its test logs and see no evidence of flaps in the 4.4 through 4.6 range either.
>
> Reverting these two commits resolves the Link is Down/Link is Up
> messages. This has been tested on about six servers so far and all have
> stopped reporting these link flaps.
Are you able to revert either of the patches independently? I don't recall whether they were standalone or not.
>
> In total I have about ten servers that are frequently seeing this issue,
> and a couple dozen more triggering it sporadically.
Are they all 82574L or does it affect others?
>
> This is about the extent of my troubleshooting knowledge so far. I am
> happy to test code changes and provide any additional information as
> necessary. While I do not understand what specifically causes the link
> flaps, they reliably begin occurring on the affected servers within a
> couple hours of boot.
Is there any particular traffic pattern involved? Sitting idle, moderate use, heavy constant flow?
>
> A snip of one such instance is below.
>
> Thank you for any assistance troubleshooting this.
Which kernel tree are you using? Linus's upstream kernel from kernel.org, a distribution-provided one, or something else? I'm generally working off of David Miller's net-next, but I can try to reproduce the issue on my boxes if I know the exact kernel to work from.
Perhaps a power saving state trying to kick in? Bad cables or speed/duplex mismatches are common causes of link flap, but they seem unlikely given reverting the patches resolves the issue.
Those patches are interrupt related. What kind of interrupts are in use? What is interrupt moderation (coalescing) set to? What is the link partner? Is it the same type of switch for all problem machines, or a mix?
cat /proc/interrupts
ethtool -c enp2s0
Maybe an `lspci` dump could help shed some more light.
>
> Kind regards,
>
> Jack Suter
>
> # ethtool -i enp2s0
> driver: e1000e
> version: 3.2.6-k
> firmware-version: 2.1-2
> bus-info: 0000:02:00.0
> supports-statistics: yes
> supports-test: yes
> supports-eeprom-access: yes
> supports-register-dump: yes
> supports-priv-flags: no
>
> [ 3532.745587] e1000e: enp2s0 NIC Link is Down
> [ 3532.771461] e1000e: enp2s0 NIC Link is Up 1000 Mbps Full Duplex, Flow
> Control: Rx/Tx
> [15463.117592] e1000e: enp2s0 NIC Link is Down
> [15463.119419] e1000e: enp2s0 NIC Link is Up 1000 Mbps Full Duplex, Flow
> Control: Rx/Tx
> [15469.155922] e1000e: enp2s0 NIC Link is Up 1000 Mbps Full Duplex, Flow
> Control: Rx/Tx
> [15648.196579] e1000e: enp2s0 NIC Link is Down
> [15651.405310] e1000e: enp2s0 NIC Link is Up 1000 Mbps Full Duplex, Flow
> Control: Rx/Tx
> [15728.959981] e1000e: enp2s0 NIC Link is Down
> [15729.000625] e1000e: enp2s0 NIC Link is Up 1000 Mbps Full Duplex, Flow
> Control: Rx/Tx
> [15835.132034] e1000e: enp2s0 NIC Link is Down
> [15835.185222] e1000e: enp2s0 NIC Link is Up 1000 Mbps Full Duplex, Flow
> Control: Rx/Tx
> [15839.104020] e1000e: enp2s0 NIC Link is Down
> [15839.142346] e1000e: enp2s0 NIC Link is Up 1000 Mbps Full Duplex, Flow
> Control: Rx/Tx
> [15845.142287] e1000e: enp2s0 NIC Link is Up 1000 Mbps Full Duplex, Flow
> Control: Rx/Tx
> [16401.940127] e1000e: enp2s0 NIC Link is Down
> [16401.945106] e1000e: enp2s0 NIC Link is Up 1000 Mbps Full Duplex, Flow
> Control: Rx/Tx
> [16408.121843] e1000e: enp2s0 NIC Link is Up 1000 Mbps Full Duplex, Flow
> Control: Rx/Tx
> [17025.823220] e1000e: enp2s0 NIC Link is Down
> [17025.825473] e1000e: enp2s0 NIC Link is Up 1000 Mbps Full Duplex, Flow
> Control: Rx/Tx
> [17032.100202] e1000e: enp2s0 NIC Link is Up 1000 Mbps Full Duplex, Flow
> Control: Rx/Tx
Hello,
We have not been able to reproduce this problem in our labs either. We tested an X99 server platform with an 82574L NIC and the 4.8.0 kernel.
You wrote that you have several servers with this issue. Which platforms are you using? Is there anything specific about the platform or link partner configuration? It would also be interesting to know whether you see the problem with stable 4.8.4 or mainline 4.9-rc3.
Sasha
> > Reverting these two commits resolves the Link is Down/Link is Up
> > messages. This has been tested on about six servers so far and all have
> > stopped reporting these link flaps.
>
> Are you able to revert either of the patches independently, I don't
> recall if they were stand alone or not.
I can try this shortly.
> Are they all 82574L or does it affect others?
All are 82574L. If the server has an 82574L it has seen a flap at least
once in the past two weeks when kernel upgrades began, though some are
much more frequent than others.
Except for one, the affected servers are all HP DL120 G7s. The NIC has
firmware 2.1-2 as reported by `ethtool -i`.
The exception is a Supermicro server with NIC firmware 1.8-0. Link flaps
occur most frequently on this server: 3963 instances, compared to at
most 429 on an HP. It also has more network/disk activity than the
affected HPs.
> Is there any particular traffic pattern involved? Sitting idle, moderate
> use, heavy constant flow?
All are being used as file servers, so heavy network traffic and disk
I/O can be expected at times. `vnstat -d` shows daily averages of
100 - 200 Mbit/s for the HP servers. The Supermicro's daily average is
closer to 300 Mbit/s.
> Which kernel tree are you using? Linus's upstream kernel from
> kernel.org, a distribution provided one or? I'm generally working off of
> David Miller's net-next, but can try to repro the issue on my boxes if I
> know the exact kernel to work from.
I'm using a Gentoo Hardened kernel; specifically 4.7.9. It follows
grsecurity's patch so a 4.8 / 4.9 kernel is not available yet.
> Perhaps a power saving state trying to kick in? Bad cables or
> speed/duplex mismatches are common causes of link flap, but they seem
> unlikely given reverting the patches resolves the issue.
I'm not aware of any power save settings that should be trying to kick
in but I can investigate this angle further if you think it may be
related.
One of the HP servers was upgraded to (Gentoo Hardened) 4.5.7 back in
August and began experiencing these flaps shortly after. At the time it
was one of only a few servers on a 4.5+ series kernel and the first to
experience this issue, so it was treated as a physical layer issue. No
interface errors were seen switch-side[1], but the network cable was
replaced regardless. The link flaps on that server still continued.
[1] As reported to me. I am not sure if the switch saw the link flaps
occurring.
> Those patches are interrupt related, what kind of interrupts are in use?
> What is interrupt moderation (coalescing set to)? What is the link
> partner? Same type switch for all problem machines or a mix?
>
> cat /proc/interrupts
> ethtool -c enp2s0
Mostly the same type of switch; either Juniper EX3200 or EX3300. All
single connections to the switch, no LACP or anything fancy.
`cat /proc/interrupts` from two HP servers are below. One server is
still experiencing flaps; the other was rebooted ~30 hours ago into the
patched kernel. I can provide /proc/interrupts for the Supermicro server
too, but there isn't a similar server to compare it to. It also has many
more CPUs so its output is a bit messier.
From `ethtool -c` (all other values are zero; the full output is below).
This applies to both the HP and Supermicro servers.
Adaptive RX: off TX: off
rx-usecs: 3
> maybe an `lspci` dump could help shed some more light.
From the HPs:
00:00.0 Host bridge: Intel Corporation 2nd Generation Core Processor
Family DRAM Controller (rev 09)
00:01.0 PCI bridge: Intel Corporation Xeon E3-1200/2nd Generation Core
Processor Family PCI Express Root Port (rev 09)
00:06.0 PCI bridge: Intel Corporation Xeon E3-1200/2nd Generation Core
Processor Family PCI Express Root Port (rev 09)
00:1a.0 USB controller: Intel Corporation 6 Series/C200 Series Chipset
Family USB Enhanced Host Controller #2 (rev 05)
00:1c.0 PCI bridge: Intel Corporation 6 Series/C200 Series Chipset
Family PCI Express Root Port 1 (rev b5)
00:1c.4 PCI bridge: Intel Corporation 6 Series/C200 Series Chipset
Family PCI Express Root Port 5 (rev b5)
00:1c.5 PCI bridge: Intel Corporation 6 Series/C200 Series Chipset
Family PCI Express Root Port 6 (rev b5)
00:1c.6 PCI bridge: Intel Corporation 6 Series/C200 Series Chipset
Family PCI Express Root Port 7 (rev b5)
00:1c.7 PCI bridge: Intel Corporation 6 Series/C200 Series Chipset
Family PCI Express Root Port 8 (rev b5)
00:1d.0 USB controller: Intel Corporation 6 Series/C200 Series Chipset
Family USB Enhanced Host Controller #1 (rev 05)
00:1e.0 PCI bridge: Intel Corporation 82801 PCI Bridge (rev a5)
00:1f.0 ISA bridge: Intel Corporation C204 Chipset Family LPC Controller
(rev 05)
00:1f.2 SATA controller: Intel Corporation 6 Series/C200 Series Chipset
Family SATA AHCI Controller (rev 05)
01:00.0 System peripheral: Hewlett-Packard Company Integrated Lights-Out
Standard Slave Instrumentation & System Support (rev 05)
01:00.1 VGA compatible controller: Matrox Electronics Systems Ltd. MGA
G200EH
01:00.2 System peripheral: Hewlett-Packard Company Integrated Lights-Out
Standard Management Processor Support and Messaging (rev 05)
01:00.4 USB controller: Hewlett-Packard Company Integrated Lights-Out
Standard Virtual USB Controller (rev 02)
02:00.0 Ethernet controller: Intel Corporation 82574L Gigabit Network
Connection
03:00.0 Ethernet controller: Intel Corporation 82574L Gigabit Network
Connection
And the Supermicro:
00:00.0 Host bridge: Intel Corporation 5500 I/O Hub to ESI Port (rev 22)
00:01.0 PCI bridge: Intel Corporation 5520/5500/X58 I/O Hub PCI Express
Root Port 1 (rev 22)
00:03.0 PCI bridge: Intel Corporation 5520/5500/X58 I/O Hub PCI Express
Root Port 3 (rev 22)
00:07.0 PCI bridge: Intel Corporation 5520/5500/X58 I/O Hub PCI Express
Root Port 7 (rev 22)
00:09.0 PCI bridge: Intel Corporation 7500/5520/5500/X58 I/O Hub PCI
Express Root Port 9 (rev 22)
00:13.0 PIC: Intel Corporation 7500/5520/5500/X58 I/O Hub I/OxAPIC
Interrupt Controller (rev 22)
00:14.0 PIC: Intel Corporation 7500/5520/5500/X58 I/O Hub System
Management Registers (rev 22)
00:14.1 PIC: Intel Corporation 7500/5520/5500/X58 I/O Hub GPIO and
Scratch Pad Registers (rev 22)
00:14.2 PIC: Intel Corporation 7500/5520/5500/X58 I/O Hub Control Status
and RAS Registers (rev 22)
00:14.3 PIC: Intel Corporation 7500/5520/5500/X58 I/O Hub Throttle
Registers (rev 22)
00:16.0 System peripheral: Intel Corporation 5520/5500/X58 Chipset
QuickData Technology Device (rev 22)
00:16.1 System peripheral: Intel Corporation 5520/5500/X58 Chipset
QuickData Technology Device (rev 22)
00:16.2 System peripheral: Intel Corporation 5520/5500/X58 Chipset
QuickData Technology Device (rev 22)
00:16.3 System peripheral: Intel Corporation 5520/5500/X58 Chipset
QuickData Technology Device (rev 22)
00:16.4 System peripheral: Intel Corporation 5520/5500/X58 Chipset
QuickData Technology Device (rev 22)
00:16.5 System peripheral: Intel Corporation 5520/5500/X58 Chipset
QuickData Technology Device (rev 22)
00:16.6 System peripheral: Intel Corporation 5520/5500/X58 Chipset
QuickData Technology Device (rev 22)
00:16.7 System peripheral: Intel Corporation 5520/5500/X58 Chipset
QuickData Technology Device (rev 22)
00:1a.0 USB controller: Intel Corporation 82801JI (ICH10 Family) USB
UHCI Controller #4
00:1a.1 USB controller: Intel Corporation 82801JI (ICH10 Family) USB
UHCI Controller #5
00:1a.2 USB controller: Intel Corporation 82801JI (ICH10 Family) USB
UHCI Controller #6
00:1a.7 USB controller: Intel Corporation 82801JI (ICH10 Family) USB2
EHCI Controller #2
00:1c.0 PCI bridge: Intel Corporation 82801JI (ICH10 Family) PCI Express
Root Port 1
00:1c.4 PCI bridge: Intel Corporation 82801JI (ICH10 Family) PCI Express
Root Port 5
00:1c.5 PCI bridge: Intel Corporation 82801JI (ICH10 Family) PCI Express
Root Port 6
00:1d.0 USB controller: Intel Corporation 82801JI (ICH10 Family) USB
UHCI Controller #1
00:1d.1 USB controller: Intel Corporation 82801JI (ICH10 Family) USB
UHCI Controller #2
00:1d.2 USB controller: Intel Corporation 82801JI (ICH10 Family) USB
UHCI Controller #3
00:1d.7 USB controller: Intel Corporation 82801JI (ICH10 Family) USB2
EHCI Controller #1
00:1e.0 PCI bridge: Intel Corporation 82801 PCI Bridge (rev 90)
00:1f.0 ISA bridge: Intel Corporation 82801JIR (ICH10R) LPC Interface
Controller
00:1f.2 SATA controller: Intel Corporation 82801JI (ICH10 Family) SATA
AHCI Controller
00:1f.3 SMBus: Intel Corporation 82801JI (ICH10 Family) SMBus Controller
05:00.0 RAID bus controller: 3ware Inc 9650SE SATA-II RAID PCIe (rev 01)
06:00.0 Ethernet controller: Intel Corporation 82574L Gigabit Network
Connection
07:00.0 Ethernet controller: Intel Corporation 82574L Gigabit Network
Connection
08:01.0 VGA compatible controller: Matrox Electronics Systems Ltd. MGA
G200eW WPCM450 (rev 0a)
fe:00.0 Host bridge: Intel Corporation Xeon 5600 Series QuickPath
Architecture Generic Non-core Registers (rev 02)
fe:00.1 Host bridge: Intel Corporation Xeon 5600 Series QuickPath
Architecture System Address Decoder (rev 02)
fe:02.0 Host bridge: Intel Corporation Xeon 5600 Series QPI Link 0 (rev
02)
fe:02.1 Host bridge: Intel Corporation Xeon 5600 Series QPI Physical 0
(rev 02)
fe:02.2 Host bridge: Intel Corporation Xeon 5600 Series Mirror Port Link
0 (rev 02)
fe:02.3 Host bridge: Intel Corporation Xeon 5600 Series Mirror Port Link
1 (rev 02)
fe:02.4 Host bridge: Intel Corporation Xeon 5600 Series QPI Link 1 (rev
02)
fe:02.5 Host bridge: Intel Corporation Xeon 5600 Series QPI Physical 1
(rev 02)
fe:03.0 Host bridge: Intel Corporation Xeon 5600 Series Integrated
Memory Controller Registers (rev 02)
fe:03.1 Host bridge: Intel Corporation Xeon 5600 Series Integrated
Memory Controller Target Address Decoder (rev 02)
fe:03.2 Host bridge: Intel Corporation Xeon 5600 Series Integrated
Memory Controller RAS Registers (rev 02)
fe:03.4 Host bridge: Intel Corporation Xeon 5600 Series Integrated
Memory Controller Test Registers (rev 02)
fe:04.0 Host bridge: Intel Corporation Xeon 5600 Series Integrated
Memory Controller Channel 0 Control (rev 02)
fe:04.1 Host bridge: Intel Corporation Xeon 5600 Series Integrated
Memory Controller Channel 0 Address (rev 02)
fe:04.2 Host bridge: Intel Corporation Xeon 5600 Series Integrated
Memory Controller Channel 0 Rank (rev 02)
fe:04.3 Host bridge: Intel Corporation Xeon 5600 Series Integrated
Memory Controller Channel 0 Thermal Control (rev 02)
fe:05.0 Host bridge: Intel Corporation Xeon 5600 Series Integrated
Memory Controller Channel 1 Control (rev 02)
fe:05.1 Host bridge: Intel Corporation Xeon 5600 Series Integrated
Memory Controller Channel 1 Address (rev 02)
fe:05.2 Host bridge: Intel Corporation Xeon 5600 Series Integrated
Memory Controller Channel 1 Rank (rev 02)
fe:05.3 Host bridge: Intel Corporation Xeon 5600 Series Integrated
Memory Controller Channel 1 Thermal Control (rev 02)
fe:06.0 Host bridge: Intel Corporation Xeon 5600 Series Integrated
Memory Controller Channel 2 Control (rev 02)
fe:06.1 Host bridge: Intel Corporation Xeon 5600 Series Integrated
Memory Controller Channel 2 Address (rev 02)
fe:06.2 Host bridge: Intel Corporation Xeon 5600 Series Integrated
Memory Controller Channel 2 Rank (rev 02)
fe:06.3 Host bridge: Intel Corporation Xeon 5600 Series Integrated
Memory Controller Channel 2 Thermal Control (rev 02)
ff:00.0 Host bridge: Intel Corporation Xeon 5600 Series QuickPath
Architecture Generic Non-core Registers (rev 02)
ff:00.1 Host bridge: Intel Corporation Xeon 5600 Series QuickPath
Architecture System Address Decoder (rev 02)
ff:02.0 Host bridge: Intel Corporation Xeon 5600 Series QPI Link 0 (rev
02)
ff:02.1 Host bridge: Intel Corporation Xeon 5600 Series QPI Physical 0
(rev 02)
ff:02.2 Host bridge: Intel Corporation Xeon 5600 Series Mirror Port Link
0 (rev 02)
ff:02.3 Host bridge: Intel Corporation Xeon 5600 Series Mirror Port Link
1 (rev 02)
ff:02.4 Host bridge: Intel Corporation Xeon 5600 Series QPI Link 1 (rev
02)
ff:02.5 Host bridge: Intel Corporation Xeon 5600 Series QPI Physical 1
(rev 02)
ff:03.0 Host bridge: Intel Corporation Xeon 5600 Series Integrated
Memory Controller Registers (rev 02)
ff:03.1 Host bridge: Intel Corporation Xeon 5600 Series Integrated
Memory Controller Target Address Decoder (rev 02)
ff:03.2 Host bridge: Intel Corporation Xeon 5600 Series Integrated
Memory Controller RAS Registers (rev 02)
ff:03.4 Host bridge: Intel Corporation Xeon 5600 Series Integrated
Memory Controller Test Registers (rev 02)
ff:04.0 Host bridge: Intel Corporation Xeon 5600 Series Integrated
Memory Controller Channel 0 Control (rev 02)
ff:04.1 Host bridge: Intel Corporation Xeon 5600 Series Integrated
Memory Controller Channel 0 Address (rev 02)
ff:04.2 Host bridge: Intel Corporation Xeon 5600 Series Integrated
Memory Controller Channel 0 Rank (rev 02)
ff:04.3 Host bridge: Intel Corporation Xeon 5600 Series Integrated
Memory Controller Channel 0 Thermal Control (rev 02)
ff:05.0 Host bridge: Intel Corporation Xeon 5600 Series Integrated
Memory Controller Channel 1 Control (rev 02)
ff:05.1 Host bridge: Intel Corporation Xeon 5600 Series Integrated
Memory Controller Channel 1 Address (rev 02)
ff:05.2 Host bridge: Intel Corporation Xeon 5600 Series Integrated
Memory Controller Channel 1 Rank (rev 02)
ff:05.3 Host bridge: Intel Corporation Xeon 5600 Series Integrated
Memory Controller Channel 1 Thermal Control (rev 02)
ff:06.0 Host bridge: Intel Corporation Xeon 5600 Series Integrated
Memory Controller Channel 2 Control (rev 02)
ff:06.1 Host bridge: Intel Corporation Xeon 5600 Series Integrated
Memory Controller Channel 2 Address (rev 02)
ff:06.2 Host bridge: Intel Corporation Xeon 5600 Series Integrated
Memory Controller Channel 2 Rank (rev 02)
ff:06.3 Host bridge: Intel Corporation Xeon 5600 Series Integrated
Memory Controller Channel 2 Thermal Control (rev 02)
From an HP server without the two reverted commits, still experiencing
flaps:
# dmesg | grep 'Link is Down' | wc -l
160
# cat /proc/interrupts
            CPU0       CPU1
   0:         71          0   IO-APIC  2-edge          timer
   1:          9          0   IO-APIC  1-edge          i8042
   8:         26          0   IO-APIC  8-edge          rtc0
   9:          0          0   IO-APIC  9-fasteoi       acpi
  12:          5          0   IO-APIC  12-edge         i8042
  16:        101          0   IO-APIC  16-fasteoi      uhci_hcd:usb3
  20:         29          0   IO-APIC  20-fasteoi      ehci_hcd:usb2
  21:         31          0   IO-APIC  21-fasteoi      ehci_hcd:usb1
  26:  466035195          0   PCI-MSI  512000-edge     ahci[0000:00:1f.2]
  27: 4011416578          0   PCI-MSI  1048576-edge    enp2s0-rx-0
  28: 2635120533          0   PCI-MSI  1048577-edge    enp2s0-tx-0
  29:      21247          0   PCI-MSI  1048578-edge    enp2s0
 NMI:      32827      13374   Non-maskable interrupts
 LOC:  639865868  608834533   Local timer interrupts
 SPU:          0          0   Spurious interrupts
 PMI:      32827      13374   Performance monitoring interrupts
 IWI:          6          0   IRQ work interrupts
 RTR:          0          0   APIC ICR read retries
 RES:   53178810  784807944   Rescheduling interrupts
 CAL:      47602      16104   Function call interrupts
 TLB:   14655054    5994312   TLB shootdowns
 TRM:          0          0   Thermal event interrupts
 THR:          0          0   Threshold APIC interrupts
 DFR:          0          0   Deferred Error APIC interrupts
 MCE:          0          0   Machine check exceptions
 MCP:       3134       3134   Machine check polls
 ERR:          0
 MIS:          0
 PIN:          0          0   Posted-interrupt notification event
 PIW:          0          0   Posted-interrupt wakeup event
From an HP server that was previously affected but now has the patched
kernel:
# cat /proc/interrupts
            CPU0       CPU1
   0:         27          0   IO-APIC  2-edge          timer
   1:          9          0   IO-APIC  1-edge          i8042
   8:         63          0   IO-APIC  8-edge          rtc0
   9:          0          0   IO-APIC  9-fasteoi       acpi
  12:          5          0   IO-APIC  12-edge         i8042
  16:          0          0   IO-APIC  16-fasteoi      uhci_hcd:usb3
  20:         29          0   IO-APIC  20-fasteoi      ehci_hcd:usb2
  21:         31          0   IO-APIC  21-fasteoi      ehci_hcd:usb1
  26:   10222204          0   PCI-MSI  512000-edge     ahci[0000:00:1f.2]
  27:  260871340          0   PCI-MSI  1048576-edge    enp2s0-rx-0
  28:  320328246          0   PCI-MSI  1048577-edge    enp2s0-tx-0
  29:          2          0   PCI-MSI  1048578-edge    enp2s0
 NMI:       1023        520   Non-maskable interrupts
 LOC:   55824119   46253516   Local timer interrupts
 SPU:          0          0   Spurious interrupts
 PMI:       1023        520   Performance monitoring interrupts
 IWI:          4          0   IRQ work interrupts
 RTR:          0          0   APIC ICR read retries
 RES:     963280   23369703   Rescheduling interrupts
 CAL:        711        450   Function call interrupts
 TLB:     104153      57497   TLB shootdowns
 TRM:          0          0   Thermal event interrupts
 THR:          0          0   Threshold APIC interrupts
 DFR:          0          0   Deferred Error APIC interrupts
 MCE:          0          0   Machine check exceptions
 MCP:        381        381   Machine check polls
 ERR:          0
 MIS:          0
 PIN:          0          0   Posted-interrupt notification event
 PIW:          0          0   Posted-interrupt wakeup event
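The most visible difference between the two listings above is the third enp2s0 vector (IRQ 29): 21247 interrupts on the unpatched server versus 2 on the patched one. The sketch below, which is not from the original thread, totals the per-CPU counts for each vector belonging to an interface so the two machines can be compared quickly. It assumes, without confirmation from the thread, that the vector named after the bare interface is the driver's "other" (link status) vector in MSI-X mode. The function name `irq_totals` and the sample are illustrative, and rows with several handlers sharing one IRQ are not handled.

```python
def irq_totals(proc_interrupts_text, iface):
    """Sum per-CPU counts for every numbered IRQ row owned by iface."""
    lines = proc_interrupts_text.strip().splitlines()
    ncpus = len(lines[0].split())  # header row: CPU0 CPU1 ...
    totals = {}
    for line in lines[1:]:
        fields = line.split()
        # Skip summary rows such as NMI:, LOC: or ERR:, which are not
        # numbered IRQs and may have fewer columns.
        if len(fields) < ncpus + 2 or not fields[0].rstrip(":").isdigit():
            continue
        name = fields[-1]
        if name == iface or name.startswith(iface + "-"):
            totals[name] = sum(int(count) for count in fields[1:1 + ncpus])
    return totals

# Rows copied from the unpatched HP server's listing.
sample = """
 CPU0 CPU1
26: 466035195 0 PCI-MSI 512000-edge ahci[0000:00:1f.2]
27: 4011416578 0 PCI-MSI 1048576-edge enp2s0-rx-0
28: 2635120533 0 PCI-MSI 1048577-edge enp2s0-tx-0
29: 21247 0 PCI-MSI 1048578-edge enp2s0
NMI: 32827 13374 Non-maskable interrupts
ERR: 0
"""

totals = irq_totals(sample, "enp2s0")
print(totals)
```

The raw counts are not directly comparable between the two servers, because the patched one had been up for only about 30 hours when the listing was taken.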
# ethtool -c enp2s0
Coalesce parameters for enp2s0:
Adaptive RX: off TX: off
stats-block-usecs: 0
sample-interval: 0
pkt-rate-low: 0
pkt-rate-high: 0
rx-usecs: 3
rx-frames: 0
rx-usecs-irq: 0
rx-frames-irq: 0
tx-usecs: 0
tx-frames: 0
tx-usecs-irq: 0
tx-frames-irq: 0
rx-usecs-low: 0
rx-frame-low: 0
tx-usecs-low: 0
tx-frame-low: 0
rx-usecs-high: 0
rx-frame-high: 0
tx-usecs-high: 0
tx-frame-high: 0
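When comparing coalescing settings across many servers, it can help to reduce `ethtool -c` output to just the non-zero numeric values. The following is a minimal sketch under the assumption that the output keeps the `name: value` layout shown above; `nonzero_coalesce` is a hypothetical helper, not part of ethtool.

```python
def nonzero_coalesce(ethtool_c_text):
    """Return the numeric coalesce settings whose value is not zero."""
    settings = {}
    for line in ethtool_c_text.splitlines():
        key, sep, value = line.partition(":")
        value = value.strip()
        # Header and "Adaptive RX: off TX: off" lines have no plain
        # integer after the first colon and are ignored.
        if sep and value.lstrip("-").isdigit() and int(value) != 0:
            settings[key.strip()] = int(value)
    return settings

# Abbreviated copy of the output above.
sample = """\
Coalesce parameters for enp2s0:
Adaptive RX: off TX: off
stats-block-usecs: 0
rx-usecs: 3
rx-frames: 0
tx-usecs: 0
"""

settings = nonzero_coalesce(sample)
print(settings)
```

For the output shown above, this leaves only `rx-usecs`, matching the summary given earlier in this message.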
On 2016/11/02 21:19, Brown, Aaron F wrote:
> > From: Jack Suter [mailto:[email protected]]
> > Sent: Tuesday, November 1, 2016 4:57 PM
> > To: Kirsher, Jeffrey T <[email protected]>
> > Cc: [email protected]; [email protected]; Brown, Aaron F
> > <[email protected]>; [email protected]; linux-
> > [email protected]
> > Subject: Kernel regression introduced by "e1000e: Do not write lsc to ics in
> > msi-x mode" and/or "e1000e: Do not read ICR in Other interrupt"
> >
> > Hi there,
> >
> > I have some servers with an 82574L based NIC and recently upgraded from
> > a 4.4 series kernel to 4.7. Upon doing so, servers with this chipset
> > have begun frequently reporting "Link is Down" and "Link is Up"
> > messages. No other related network errors are reported by the kernel or
> > e1000e driver. I saw some reports about using "ethtool -s $iface msglvl
> > 6" to reveal more information, but nothing extra was reported.
> >
> > Some testing showed that this was introduced between the 4.4 and 4.5
> > series. I was able to further narrow it down to two commits that look
> > related:
> >
> > e1000e: Do not write lsc to ics in msi-x mode
> > (a61cfe4ffad7864a07e0c74969ca7ceb77ab2f1f)
> > e1000e: Do not read ICR in Other interrupt
> > (16ecba59bc333d6282ee057fb02339f77a880beb)
>
> I did not notice any link flapping when I tested those patches, I would have rejected them if I had. I have several systems with 82574L LOMs and as yet am not able to reproduce a link flap with recent upstream kernels/drivers (net-next 4.8.0 on one and 4.9.0-rc3 on another.)
>
> One of those systems is dedicated to a kernel regression setup, I checked the test logs from it and am not seeing any evidence of flaps in the 4.4, through 4.6 range either.
>
> >
> > Reverting these two commits resolves the Link is Down/Link is Up
> > messages. This has been tested on about six servers so far and all have
> > stopped reporting these link flaps.
>
> Are you able to revert either of the patches independently, I don't recall if they were stand alone or not.
From what I recall, the series is entirely bisectable. I tested again
just now and could do a netperf RR test after applying each commit
sequentially.