2017-07-18 14:31:06

by Lennart Sorensen

[permalink] [raw]
Subject: commit 16ecba59 breaks 82574L under heavy load.

Commit 16ecba59bc333d6282ee057fb02339f77a880beb has apparently broken
at least the 82574L under heavy load (as in load heavy enough to cause
packet drops). In this case, when running in MSI-X mode, the Other
Causes interrupt fires about 3000 times per second, but not due to link
state changes. Unfortunately this commit changed the driver to assume
that the Other Causes interrupt can only mean link state change and
hence sets the flag that (unfortunately) means both link is down and link
state should be checked. Since this now happens 3000 times per second,
the chances of it happening while the watchdog_task is checking the link
state becomes pretty high, and it if does happen to coincice, then the
watchdog_task will reset the adapter, which causes a real loss of link.

Reverting the commit makes everything work fine again (of course packets
are still dropped, but at least the link stays up, the adapter isn't
reset, and most packets make it through).

I tried checking what the bits in the ICR actually were under these
conditions, and it would appear that the only bit set is 24 (the Other
Causes interrupt bit). So I don't know what the real cause is although
rx buffer overrun would be my guess, and in fact I see nothing in the
datasheet indicating that you can actually disable the rx buffer overrun
from generating an interrupt.

Prior to this commit, the interrupt handler explicitly checked that the
interrupt was caused by a link state change and only then did it trigger
a recheck which worked fine and did not cause incorrect adapter resets,
although it of course still had lots of undesired interrupts to deal with.

Of course ideally there would be a way to make these 3000 pointless
interrupts per second not happen, but unless there is a way to determine
that, I think this commit needs reverting, since it apparently causes
link failures on actual hardware that exists.

The ports are onboard intel 82574L on a Supermicro X7SPA-HF-D525 with
1.2a BIOS (upgrading to 1.2b to check if it makes a difference is not
an option unfortunately).

--
Len Sorensen


2017-07-18 23:14:42

by Benjamin Poirier

[permalink] [raw]
Subject: Re: commit 16ecba59 breaks 82574L under heavy load.

On 2017/07/18 10:21, Lennart Sorensen wrote:
> Commit 16ecba59bc333d6282ee057fb02339f77a880beb has apparently broken
> at least the 82574L under heavy load (as in load heavy enough to cause
> packet drops). In this case, when running in MSI-X mode, the Other
> Causes interrupt fires about 3000 times per second, but not due to link
> state changes. Unfortunately this commit changed the driver to assume
> that the Other Causes interrupt can only mean link state change and

Thanks for the detailed analysis.

Refering to the original discussion around this patch series, it seemed like
the IMS bit for a condition had to be set for the Other interrupt to be raised
for that condition.

https://lkml.org/lkml/2015/11/4/683

In this case however, E1000_ICR_RXT0 is not set in IMS so Other shouldn't be
raised for Receiver Overrun. Apparently something is going on...

I can reproduce the spurious Other interrupts with a simple mdelay()
With the debugging patch at the end of the mail I see stuff like this
while blasting with udp frames:
<idle>-0 [086] d.h1 15338.742675: e1000_msix_other: got Other interrupt, count 15127
<...>-54504 [086] d.h. 15338.742724: e1000_msix_other: got Other interrupt, count 1
<...>-54504 [086] d.h. 15338.742774: e1000_msix_other: got Other interrupt, count 1
<...>-54504 [086] d.h. 15338.742824: e1000_msix_other: got Other interrupt, count 1
<idle>-0 [086] d.h1 15340.745123: e1000_msix_other: got Other interrupt, count 27584
<...>-54504 [086] d.h. 15340.745172: e1000_msix_other: got Other interrupt, count 1
<...>-54504 [086] d.h. 15340.745222: e1000_msix_other: got Other interrupt, count 1
<...>-54504 [086] d.h. 15340.745272: e1000_msix_other: got Other interrupt, count 1

> hence sets the flag that (unfortunately) means both link is down and link
> state should be checked. Since this now happens 3000 times per second,
> the chances of it happening while the watchdog_task is checking the link
> state becomes pretty high, and it if does happen to coincice, then the
> watchdog_task will reset the adapter, which causes a real loss of link.

Through which path does watchdog_task reset the adapter? I didn't
reproduce that.

diff --git a/drivers/net/ethernet/intel/e1000e/netdev.c b/drivers/net/ethernet/intel/e1000e/netdev.c
index b3679728caac..689ad76d0d12 100644
--- a/drivers/net/ethernet/intel/e1000e/netdev.c
+++ b/drivers/net/ethernet/intel/e1000e/netdev.c
@@ -46,6 +46,8 @@

#include "e1000.h"

+DEFINE_RATELIMIT_STATE(e1000e_ratelimit_state, 2 * HZ, 4);
+
#define DRV_EXTRAVERSION "-k"

#define DRV_VERSION "3.2.6" DRV_EXTRAVERSION
@@ -937,6 +939,8 @@ static bool e1000_clean_rx_irq(struct e1000_ring *rx_ring, int *work_done,
bool cleaned = false;
unsigned int total_rx_bytes = 0, total_rx_packets = 0;

+ mdelay(10);
+
i = rx_ring->next_to_clean;
rx_desc = E1000_RX_DESC_EXT(*rx_ring, i);
staterr = le32_to_cpu(rx_desc->wb.upper.status_error);
@@ -1067,6 +1071,13 @@ static bool e1000_clean_rx_irq(struct e1000_ring *rx_ring, int *work_done,

adapter->total_rx_bytes += total_rx_bytes;
adapter->total_rx_packets += total_rx_packets;
+
+ if (__ratelimit(&e1000e_ratelimit_state)) {
+ static unsigned int max;
+ max = max(max, total_rx_packets);
+ trace_printk("received %u max %u\n", total_rx_packets, max);
+ }
+
return cleaned;
}

@@ -1904,9 +1915,16 @@ static irqreturn_t e1000_msix_other(int __always_unused irq, void *data)
struct net_device *netdev = data;
struct e1000_adapter *adapter = netdev_priv(netdev);
struct e1000_hw *hw = &adapter->hw;
+ static unsigned int count;

hw->mac.get_link_status = true;

+ count++;
+ if (__ratelimit(&e1000e_ratelimit_state)) {
+ trace_printk("got Other interrupt, count %u\n", count);
+ count = 0;
+ }
+
/* guard against interrupt when we're going down */
if (!test_bit(__E1000_DOWN, &adapter->state)) {
mod_timer(&adapter->watchdog_timer, jiffies + 1);
@@ -7121,7 +7139,7 @@ static int e1000_probe(struct pci_dev *pdev, const struct pci_device_id *ent)
netdev->netdev_ops = &e1000e_netdev_ops;
e1000e_set_ethtool_ops(netdev);
netdev->watchdog_timeo = 5 * HZ;
- netif_napi_add(netdev, &adapter->napi, e1000e_poll, 64);
+ netif_napi_add(netdev, &adapter->napi, e1000e_poll, 500);
strlcpy(netdev->name, pci_name(pdev), sizeof(netdev->name));

netdev->mem_start = mmio_start;
@@ -7327,6 +7345,8 @@ static int e1000_probe(struct pci_dev *pdev, const struct pci_device_id *ent)
if (err)
goto err_register;

+ ratelimit_set_flags(&e1000e_ratelimit_state, RATELIMIT_MSG_ON_RELEASE);
+
/* carrier off reporting is important to ethtool even BEFORE open */
netif_carrier_off(netdev);


2017-07-19 14:19:56

by Lennart Sorensen

[permalink] [raw]
Subject: Re: commit 16ecba59 breaks 82574L under heavy load.

On Tue, Jul 18, 2017 at 04:14:35PM -0700, Benjamin Poirier wrote:
> Thanks for the detailed analysis.
>
> Refering to the original discussion around this patch series, it seemed like
> the IMS bit for a condition had to be set for the Other interrupt to be raised
> for that condition.
>
> https://lkml.org/lkml/2015/11/4/683
>
> In this case however, E1000_ICR_RXT0 is not set in IMS so Other shouldn't be
> raised for Receiver Overrun. Apparently something is going on...
>
> I can reproduce the spurious Other interrupts with a simple mdelay()
> With the debugging patch at the end of the mail I see stuff like this
> while blasting with udp frames:
> <idle>-0 [086] d.h1 15338.742675: e1000_msix_other: got Other interrupt, count 15127
> <...>-54504 [086] d.h. 15338.742724: e1000_msix_other: got Other interrupt, count 1
> <...>-54504 [086] d.h. 15338.742774: e1000_msix_other: got Other interrupt, count 1
> <...>-54504 [086] d.h. 15338.742824: e1000_msix_other: got Other interrupt, count 1
> <idle>-0 [086] d.h1 15340.745123: e1000_msix_other: got Other interrupt, count 27584
> <...>-54504 [086] d.h. 15340.745172: e1000_msix_other: got Other interrupt, count 1
> <...>-54504 [086] d.h. 15340.745222: e1000_msix_other: got Other interrupt, count 1
> <...>-54504 [086] d.h. 15340.745272: e1000_msix_other: got Other interrupt, count 1
>
> > hence sets the flag that (unfortunately) means both link is down and link
> > state should be checked. Since this now happens 3000 times per second,
> > the chances of it happening while the watchdog_task is checking the link
> > state becomes pretty high, and it if does happen to coincice, then the
> > watchdog_task will reset the adapter, which causes a real loss of link.
>
> Through which path does watchdog_task reset the adapter? I didn't
> reproduce that.

The other interrupt happens and sets get_link_status to true. At some
point the watchdog_task runs on some core and calls e1000e_has_link,
which then calls check_for_link to find out the current link status.
While e1000e_check_for_copper_link is checking the link state and
after updating get_link_status to false to indicate link is up, another
interrupt occurs and another core handles it and changes get_link_status
to true again. So by the time e1000e_has_link goes to determine the
return value, get_link_state has changed back again so now it returns
link down, and as a result the watchdog_task calls reset, because we
have packets in the transmit queue (we were busy forwarding over 100000
packets per second when it happened).

Running on an Atom D525 which isn't very fast and uses hyperthreading
might have something to do with how the scheduling manages to trigger
this race condition. On a faster CPU you very likely would be done
checking the link state quickly enough that the interrupt handler rarely
gets a chance to interfere. Also we have the irq affinity set so the
RX/TX of one port is handled by one CPU, the RX/TX of the other port
by a different CPU and the Other interrupts and other tasks (like the
watchdog) are handled by the last two CPUs.

Either making the current link state its own bool and keeping it's meaning
away from get_link_state, or making the interrupt handler only change
get_link_state when LSC is actually present makes the problem go away.
Having two meanings to get_link_state (both link state needs checking
and what the link state is) causes issues. After all it is using a bool
to store 3 values: Link is up, link needs checking but is up and link
needs checking but is down. Of course the last two states are rather
quantum, in that you don't know which it is until you check.

> diff --git a/drivers/net/ethernet/intel/e1000e/netdev.c b/drivers/net/ethernet/intel/e1000e/netdev.c
> index b3679728caac..689ad76d0d12 100644
> --- a/drivers/net/ethernet/intel/e1000e/netdev.c
> +++ b/drivers/net/ethernet/intel/e1000e/netdev.c
> @@ -46,6 +46,8 @@
>
> #include "e1000.h"
>
> +DEFINE_RATELIMIT_STATE(e1000e_ratelimit_state, 2 * HZ, 4);
> +
> #define DRV_EXTRAVERSION "-k"
>
> #define DRV_VERSION "3.2.6" DRV_EXTRAVERSION
> @@ -937,6 +939,8 @@ static bool e1000_clean_rx_irq(struct e1000_ring *rx_ring, int *work_done,
> bool cleaned = false;
> unsigned int total_rx_bytes = 0, total_rx_packets = 0;
>
> + mdelay(10);
> +
> i = rx_ring->next_to_clean;
> rx_desc = E1000_RX_DESC_EXT(*rx_ring, i);
> staterr = le32_to_cpu(rx_desc->wb.upper.status_error);
> @@ -1067,6 +1071,13 @@ static bool e1000_clean_rx_irq(struct e1000_ring *rx_ring, int *work_done,
>
> adapter->total_rx_bytes += total_rx_bytes;
> adapter->total_rx_packets += total_rx_packets;
> +
> + if (__ratelimit(&e1000e_ratelimit_state)) {
> + static unsigned int max;
> + max = max(max, total_rx_packets);
> + trace_printk("received %u max %u\n", total_rx_packets, max);
> + }
> +
> return cleaned;
> }
>
> @@ -1904,9 +1915,16 @@ static irqreturn_t e1000_msix_other(int __always_unused irq, void *data)
> struct net_device *netdev = data;
> struct e1000_adapter *adapter = netdev_priv(netdev);
> struct e1000_hw *hw = &adapter->hw;
> + static unsigned int count;
>
> hw->mac.get_link_status = true;
>
> + count++;
> + if (__ratelimit(&e1000e_ratelimit_state)) {
> + trace_printk("got Other interrupt, count %u\n", count);
> + count = 0;
> + }
> +
> /* guard against interrupt when we're going down */
> if (!test_bit(__E1000_DOWN, &adapter->state)) {
> mod_timer(&adapter->watchdog_timer, jiffies + 1);
> @@ -7121,7 +7139,7 @@ static int e1000_probe(struct pci_dev *pdev, const struct pci_device_id *ent)
> netdev->netdev_ops = &e1000e_netdev_ops;
> e1000e_set_ethtool_ops(netdev);
> netdev->watchdog_timeo = 5 * HZ;
> - netif_napi_add(netdev, &adapter->napi, e1000e_poll, 64);
> + netif_napi_add(netdev, &adapter->napi, e1000e_poll, 500);
> strlcpy(netdev->name, pci_name(pdev), sizeof(netdev->name));
>
> netdev->mem_start = mmio_start;
> @@ -7327,6 +7345,8 @@ static int e1000_probe(struct pci_dev *pdev, const struct pci_device_id *ent)
> if (err)
> goto err_register;
>
> + ratelimit_set_flags(&e1000e_ratelimit_state, RATELIMIT_MSG_ON_RELEASE);
> +
> /* carrier off reporting is important to ethtool even BEFORE open */
> netif_carrier_off(netdev);

--
Len Sorensen

2017-07-20 00:07:59

by Benjamin Poirier

[permalink] [raw]
Subject: Re: commit 16ecba59 breaks 82574L under heavy load.

On 2017/07/19 10:19, Lennart Sorensen wrote:
> On Tue, Jul 18, 2017 at 04:14:35PM -0700, Benjamin Poirier wrote:
> > Thanks for the detailed analysis.
> >
> > Refering to the original discussion around this patch series, it seemed like
> > the IMS bit for a condition had to be set for the Other interrupt to be raised
> > for that condition.
> >
> > https://lkml.org/lkml/2015/11/4/683
> >
> > In this case however, E1000_ICR_RXT0 is not set in IMS so Other shouldn't be
> > raised for Receiver Overrun. Apparently something is going on...
> >
> > I can reproduce the spurious Other interrupts with a simple mdelay()
> > With the debugging patch at the end of the mail I see stuff like this
> > while blasting with udp frames:
> > <idle>-0 [086] d.h1 15338.742675: e1000_msix_other: got Other interrupt, count 15127
> > <...>-54504 [086] d.h. 15338.742724: e1000_msix_other: got Other interrupt, count 1
> > <...>-54504 [086] d.h. 15338.742774: e1000_msix_other: got Other interrupt, count 1
> > <...>-54504 [086] d.h. 15338.742824: e1000_msix_other: got Other interrupt, count 1
> > <idle>-0 [086] d.h1 15340.745123: e1000_msix_other: got Other interrupt, count 27584
> > <...>-54504 [086] d.h. 15340.745172: e1000_msix_other: got Other interrupt, count 1
> > <...>-54504 [086] d.h. 15340.745222: e1000_msix_other: got Other interrupt, count 1
> > <...>-54504 [086] d.h. 15340.745272: e1000_msix_other: got Other interrupt, count 1
> >
> > > hence sets the flag that (unfortunately) means both link is down and link
> > > state should be checked. Since this now happens 3000 times per second,
> > > the chances of it happening while the watchdog_task is checking the link
> > > state becomes pretty high, and it if does happen to coincice, then the
> > > watchdog_task will reset the adapter, which causes a real loss of link.
> >
> > Through which path does watchdog_task reset the adapter? I didn't
> > reproduce that.
>
> The other interrupt happens and sets get_link_status to true. At some
> point the watchdog_task runs on some core and calls e1000e_has_link,
> which then calls check_for_link to find out the current link status.
> While e1000e_check_for_copper_link is checking the link state and
> after updating get_link_status to false to indicate link is up, another
> interrupt occurs and another core handles it and changes get_link_status
> to true again. So by the time e1000e_has_link goes to determine the
> return value, get_link_state has changed back again so now it returns
> link down, and as a result the watchdog_task calls reset, because we
> have packets in the transmit queue (we were busy forwarding over 100000
> packets per second when it happened).

Ah I see. Thanks again.

In your previous mail,
On 2017/07/18 10:21, Lennart Sorensen wrote:
[...]
> I tried checking what the bits in the ICR actually were under these
> conditions, and it would appear that the only bit set is 24 (the Other
> Causes interrupt bit). So I don't know what the real cause is although

Are you sure about this? In my testing, while triggering the overrun
with the msleep, I read ICR when entering e1000_msix_other() and RXO is
consistently set.

I'm working on a patch that uses that fact to handle the situation and
limit the interrupt.

2017-07-20 14:00:32

by Lennart Sorensen

[permalink] [raw]
Subject: Re: commit 16ecba59 breaks 82574L under heavy load.

On Wed, Jul 19, 2017 at 05:07:47PM -0700, Benjamin Poirier wrote:
> Are you sure about this? In my testing, while triggering the overrun
> with the msleep, I read ICR when entering e1000_msix_other() and RXO is
> consistently set.

I had thousands of calls to e1000_msix_other where the only bit set
was OTHER.

I don't know if the cause is overruns, it just seems plausible.

> I'm working on a patch that uses that fact to handle the situation and
> limit the interrupt.

Excellent.

Running in MSI mode rather than MSI-X seems to not have the problem of
unexpected interrupts, but has other issues (such as loosing the IRQ
affinity setting if you do ifconfig down;ifconfig up on the interface,
which does not happen in MSI-X's case.) That's rather annoying as you
can't set the affinity before bringing up the interface which is rather
undesirable.

--
Len Sorensen

2017-07-20 23:45:04

by Benjamin Poirier

[permalink] [raw]
Subject: Re: commit 16ecba59 breaks 82574L under heavy load.

On 2017/07/20 10:00, Lennart Sorensen wrote:
> On Wed, Jul 19, 2017 at 05:07:47PM -0700, Benjamin Poirier wrote:
> > Are you sure about this? In my testing, while triggering the overrun
> > with the msleep, I read ICR when entering e1000_msix_other() and RXO is
> > consistently set.
>
> I had thousands of calls to e1000_msix_other where the only bit set
> was OTHER.
>
> I don't know if the cause is overruns, it just seems plausible.
>
> > I'm working on a patch that uses that fact to handle the situation and
> > limit the interrupt.
>
> Excellent.
>

Could you please test the following patch and let me know if it:
1) reduces the interrupt rate of the Other msi-x vector
2) avoids the link flaps
or
3) logs some dmesg warnings of the form "Other interrupt with unhandled [...]"
In this case, please paste icr values printed.

Thanks

diff --git a/drivers/net/ethernet/intel/e1000e/defines.h b/drivers/net/ethernet/intel/e1000e/defines.h
index 0641c0098738..afb7ebe20b24 100644
--- a/drivers/net/ethernet/intel/e1000e/defines.h
+++ b/drivers/net/ethernet/intel/e1000e/defines.h
@@ -398,6 +398,7 @@
#define E1000_ICR_LSC 0x00000004 /* Link Status Change */
#define E1000_ICR_RXSEQ 0x00000008 /* Rx sequence error */
#define E1000_ICR_RXDMT0 0x00000010 /* Rx desc min. threshold (0) */
+#define E1000_ICR_RXO 0x00000040 /* Receiver Overrun */
#define E1000_ICR_RXT0 0x00000080 /* Rx timer intr (ring 0) */
#define E1000_ICR_ECCER 0x00400000 /* Uncorrectable ECC Error */
/* If this bit asserted, the driver should claim the interrupt */
diff --git a/drivers/net/ethernet/intel/e1000e/e1000.h b/drivers/net/ethernet/intel/e1000e/e1000.h
index c7c994eb410e..f7b46eba3efb 100644
--- a/drivers/net/ethernet/intel/e1000e/e1000.h
+++ b/drivers/net/ethernet/intel/e1000e/e1000.h
@@ -351,6 +351,10 @@ struct e1000_adapter {
s32 ptp_delta;

u16 eee_advert;
+
+ unsigned int uh_count;
+ u32 uh_values[16];
+ unsigned int uh_values_nb;
};

struct e1000_info {
diff --git a/drivers/net/ethernet/intel/e1000e/netdev.c b/drivers/net/ethernet/intel/e1000e/netdev.c
index b3679728caac..46697338c0e1 100644
--- a/drivers/net/ethernet/intel/e1000e/netdev.c
+++ b/drivers/net/ethernet/intel/e1000e/netdev.c
@@ -46,6 +46,8 @@

#include "e1000.h"

+DEFINE_RATELIMIT_STATE(other_uh_ratelimit_state, HZ, 1);
+
#define DRV_EXTRAVERSION "-k"

#define DRV_VERSION "3.2.6" DRV_EXTRAVERSION
@@ -1904,12 +1906,60 @@ static irqreturn_t e1000_msix_other(int __always_unused irq, void *data)
struct net_device *netdev = data;
struct e1000_adapter *adapter = netdev_priv(netdev);
struct e1000_hw *hw = &adapter->hw;
+ u32 icr;
+ bool enable = true;
+ bool handled = false;
+ unsigned int i;

- hw->mac.get_link_status = true;
+ icr = er32(ICR);
+ if (icr & E1000_ICR_RXO) {
+ ew32(ICR, E1000_ICR_RXO);
+ enable = false;
+ /* napi poll will re-enable Other, make sure it runs */
+ if (napi_schedule_prep(&adapter->napi)) {
+ adapter->total_rx_bytes = 0;
+ adapter->total_rx_packets = 0;
+ __napi_schedule(&adapter->napi);
+ }
+ handled = true;
+ }
+ if (icr & E1000_ICR_LSC) {
+ ew32(ICR, E1000_ICR_LSC);
+ hw->mac.get_link_status = true;
+ /* guard against interrupt when we're going down */
+ if (!test_bit(__E1000_DOWN, &adapter->state)) {
+ mod_timer(&adapter->watchdog_timer, jiffies + 1);
+ }
+ handled = true;
+ }

- /* guard against interrupt when we're going down */
- if (!test_bit(__E1000_DOWN, &adapter->state)) {
- mod_timer(&adapter->watchdog_timer, jiffies + 1);
+ if (!handled) {
+ adapter->uh_count++;
+ /* only print unseen icr values */
+ if (adapter->uh_values_nb < ARRAY_SIZE(adapter->uh_values)) {
+ for (i = 0; i < ARRAY_SIZE(adapter->uh_values); i++) {
+ if (adapter->uh_values[i] == icr) {
+ break;
+ }
+ }
+ if (i == ARRAY_SIZE(adapter->uh_values)) {
+ adapter->uh_values[adapter->uh_values_nb] =
+ icr;
+ adapter->uh_values_nb++;
+ netdev_warn(netdev,
+ "Other interrupt with unhandled icr 0x%08x\n",
+ icr);
+ }
+ }
+ }
+ if (adapter->uh_count && __ratelimit(&other_uh_ratelimit_state)) {
+ netdev_warn(netdev,
+ "Other interrupt with unhandled cause, count %u\n",
+ adapter->uh_count);
+ adapter->uh_count = 0;
+ }
+
+ if (enable && !test_bit(__E1000_DOWN, &adapter->state)) {
ew32(IMS, E1000_IMS_OTHER);
}

@@ -2681,7 +2731,8 @@ static int e1000e_poll(struct napi_struct *napi, int weight)
napi_complete_done(napi, work_done);
if (!test_bit(__E1000_DOWN, &adapter->state)) {
if (adapter->msix_entries)
- ew32(IMS, adapter->rx_ring->ims_val);
+ ew32(IMS, adapter->rx_ring->ims_val |
+ E1000_IMS_OTHER);
else
e1000_irq_enable(adapter);
}
@@ -4197,7 +4248,7 @@ static void e1000e_trigger_lsc(struct e1000_adapter *adapter)
struct e1000_hw *hw = &adapter->hw;

if (adapter->msix_entries)
- ew32(ICS, E1000_ICS_OTHER);
+ ew32(ICS, E1000_ICS_LSC | E1000_ICS_OTHER);
else
ew32(ICS, E1000_ICS_LSC);
}
@@ -7572,6 +7623,8 @@ static int __init e1000_init_module(void)
e1000e_driver_version);
pr_info("Copyright(c) 1999 - 2015 Intel Corporation.\n");

+ ratelimit_set_flags(&other_uh_ratelimit_state, RATELIMIT_MSG_ON_RELEASE);
+
return pci_register_driver(&e1000_driver);
}
module_init(e1000_init_module);

2017-07-21 15:27:39

by Lennart Sorensen

[permalink] [raw]
Subject: Re: commit 16ecba59 breaks 82574L under heavy load.

On Thu, Jul 20, 2017 at 04:44:55PM -0700, Benjamin Poirier wrote:
> Could you please test the following patch and let me know if it:
> 1) reduces the interrupt rate of the Other msi-x vector
> 2) avoids the link flaps
> or
> 3) logs some dmesg warnings of the form "Other interrupt with unhandled [...]"
> In this case, please paste icr values printed.

I will give it a try.

Another odd behaviour I see is that the driver will hang in
napi_synchronize on shutdown if there is traffic at the time (at least
I think that's the trigger, maybe the trigger is if there has been an
overload of traffic and the backlog in napi was used).

>From doing some searching, this seems to be a problem that has plagued
some people for years with this driver.

I am having trouble figuring out exactly what napi_synchronize is waiting
for and who is supposed to toggle the flag it is waiting on. The flag
appears to work backwards from what I would have expected it to do.
I see lots of places that can set the bit, but only napi_enable seems
to clear it again, and I don't see how that would get called for all
the places that potentially set the bit.

--
Len Sorensen

2017-07-21 16:09:39

by Lennart Sorensen

[permalink] [raw]
Subject: Re: commit 16ecba59 breaks 82574L under heavy load.

On Fri, Jul 21, 2017 at 11:27:09AM -0400, wrote:
> On Thu, Jul 20, 2017 at 04:44:55PM -0700, Benjamin Poirier wrote:
> > Could you please test the following patch and let me know if it:
> > 1) reduces the interrupt rate of the Other msi-x vector
> > 2) avoids the link flaps
> > or
> > 3) logs some dmesg warnings of the form "Other interrupt with unhandled [...]"
> > In this case, please paste icr values printed.
>
> I will give it a try.

So test looks excellent. Seems to only get interrupts when link state
actually changes now.

> Another odd behaviour I see is that the driver will hang in
> napi_synchronize on shutdown if there is traffic at the time (at least
> I think that's the trigger, maybe the trigger is if there has been an
> overload of traffic and the backlog in napi was used).
>
> From doing some searching, this seems to be a problem that has plagued
> some people for years with this driver.
>
> I am having trouble figuring out exactly what napi_synchronize is waiting
> for and who is supposed to toggle the flag it is waiting on. The flag
> appears to work backwards from what I would have expected it to do.
> I see lots of places that can set the bit, but only napi_enable seems
> to clear it again, and I don't see how that would get called for all
> the places that potentially set the bit.

I just realized NAPI_STATE_SCHED and NAPIF_STATE_SCHED are the same
thing and I need to look at both of those.

Still something seems odd in some corner case where napi gets stuck and
you can't close the port anymore due to napi_synchronize never being
able to finish. Some traffic pattern causes that SCHED state bit to
get into the wrong state and nothing ever clears it. Even managed to
see it get stuck so it never passed traffic again and hung on shutdown.
The napi poll was never called again.

--
Len Sorensen

2017-07-21 18:36:52

by Benjamin Poirier

[permalink] [raw]
Subject: [PATCH 1/5] e1000e: Fix error path in link detection

In case of error from e1e_rphy(), the loop will exit early and "success"
will be set to true erroneously.

Signed-off-by: Benjamin Poirier <[email protected]>
---
drivers/net/ethernet/intel/e1000e/phy.c | 7 ++++---
1 file changed, 4 insertions(+), 3 deletions(-)

diff --git a/drivers/net/ethernet/intel/e1000e/phy.c b/drivers/net/ethernet/intel/e1000e/phy.c
index d78d47b41a71..86ff0969efb6 100644
--- a/drivers/net/ethernet/intel/e1000e/phy.c
+++ b/drivers/net/ethernet/intel/e1000e/phy.c
@@ -1744,6 +1744,7 @@ s32 e1000e_phy_has_link_generic(struct e1000_hw *hw, u32 iterations,
s32 ret_val = 0;
u16 i, phy_status;

+ *success = false;
for (i = 0; i < iterations; i++) {
/* Some PHYs require the MII_BMSR register to be read
* twice due to the link bit being sticky. No harm doing
@@ -1763,16 +1764,16 @@ s32 e1000e_phy_has_link_generic(struct e1000_hw *hw, u32 iterations,
ret_val = e1e_rphy(hw, MII_BMSR, &phy_status);
if (ret_val)
break;
- if (phy_status & BMSR_LSTATUS)
+ if (phy_status & BMSR_LSTATUS) {
+ *success = true;
break;
+ }
if (usec_interval >= 1000)
msleep(usec_interval / 1000);
else
udelay(usec_interval);
}

- *success = (i < iterations);
-
return ret_val;
}

--
2.13.2

2017-07-21 18:36:57

by Benjamin Poirier

[permalink] [raw]
Subject: [PATCH 2/5] e1000e: Fix wrong comment related to link detection

Reading e1000e_check_for_copper_link() shows that get_link_status is set to
false after link has been detected. Therefore, it stays TRUE until then.

Signed-off-by: Benjamin Poirier <[email protected]>
---
drivers/net/ethernet/intel/e1000e/netdev.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/net/ethernet/intel/e1000e/netdev.c b/drivers/net/ethernet/intel/e1000e/netdev.c
index 2dcb5463d9b8..58a87134d2e5 100644
--- a/drivers/net/ethernet/intel/e1000e/netdev.c
+++ b/drivers/net/ethernet/intel/e1000e/netdev.c
@@ -5074,7 +5074,7 @@ static bool e1000e_has_link(struct e1000_adapter *adapter)

/* get_link_status is set on LSC (link status) interrupt or
* Rx sequence error interrupt. get_link_status will stay
- * false until the check_for_link establishes link
+ * true until the check_for_link establishes link
* for copper adapters ONLY
*/
switch (hw->phy.media_type) {
@@ -5092,7 +5092,7 @@ static bool e1000e_has_link(struct e1000_adapter *adapter)
break;
case e1000_media_type_internal_serdes:
ret_val = hw->mac.ops.check_for_link(hw);
- link_active = adapter->hw.mac.serdes_has_link;
+ link_active = hw->mac.serdes_has_link;
break;
default:
case e1000_media_type_unknown:
--
2.13.2

2017-07-21 18:37:02

by Benjamin Poirier

[permalink] [raw]
Subject: [PATCH 4/5] e1000e: Separate signaling for link check/link up

Lennart reported the following race condition:

\ e1000_watchdog_task
\ e1000e_has_link
\ hw->mac.ops.check_for_link() === e1000e_check_for_copper_link
/* link is up */
mac->get_link_status = false;

/* interrupt */
\ e1000_msix_other
hw->mac.get_link_status = true;

link_active = !hw->mac.get_link_status
/* link_active is false, wrongly */

This problem arises because the single flag get_link_status is used to
signal two different states: link status needs checking and link status is
down.

Avoid the problem by using the return value of .check_for_link to signal
the link status to e1000e_has_link().

Reported-by: Lennart Sorensen <[email protected]>
Signed-off-by: Benjamin Poirier <[email protected]>
---
drivers/net/ethernet/intel/e1000e/mac.c | 11 ++++++++---
drivers/net/ethernet/intel/e1000e/netdev.c | 2 +-
2 files changed, 9 insertions(+), 4 deletions(-)

diff --git a/drivers/net/ethernet/intel/e1000e/mac.c b/drivers/net/ethernet/intel/e1000e/mac.c
index b322011ec282..f457c5703d0c 100644
--- a/drivers/net/ethernet/intel/e1000e/mac.c
+++ b/drivers/net/ethernet/intel/e1000e/mac.c
@@ -410,6 +410,9 @@ void e1000e_clear_hw_cntrs_base(struct e1000_hw *hw)
* Checks to see of the link status of the hardware has changed. If a
* change in link status has been detected, then we read the PHY registers
* to get the current speed/duplex if link exists.
+ *
+ * Returns a negative error code (-E1000_ERR_*) or 0 (link down) or 1 (link
+ * up).
**/
s32 e1000e_check_for_copper_link(struct e1000_hw *hw)
{
@@ -423,7 +426,7 @@ s32 e1000e_check_for_copper_link(struct e1000_hw *hw)
* Change or Rx Sequence Error interrupt.
*/
if (!mac->get_link_status)
- return 0;
+ return 1;

/* First we want to see if the MII Status Register reports
* link. If so, then we want to get the current speed/duplex
@@ -461,10 +464,12 @@ s32 e1000e_check_for_copper_link(struct e1000_hw *hw)
* different link partner.
*/
ret_val = e1000e_config_fc_after_link_up(hw);
- if (ret_val)
+ if (ret_val) {
e_dbg("Error configuring flow control\n");
+ return ret_val;
+ }

- return ret_val;
+ return 1;
}

/**
diff --git a/drivers/net/ethernet/intel/e1000e/netdev.c b/drivers/net/ethernet/intel/e1000e/netdev.c
index fc6a1d9999b2..5a8ab1136566 100644
--- a/drivers/net/ethernet/intel/e1000e/netdev.c
+++ b/drivers/net/ethernet/intel/e1000e/netdev.c
@@ -5081,7 +5081,7 @@ static bool e1000e_has_link(struct e1000_adapter *adapter)
case e1000_media_type_copper:
if (hw->mac.get_link_status) {
ret_val = hw->mac.ops.check_for_link(hw);
- link_active = !hw->mac.get_link_status;
+ link_active = ret_val > 0;
} else {
link_active = true;
}
--
2.13.2

2017-07-21 18:37:21

by Benjamin Poirier

[permalink] [raw]
Subject: [PATCH 5/5] e1000e: Avoid receiver overrun interrupt bursts

When e1000e_poll() is not fast enough to keep up with incoming traffic, the
adapter (when operating in msix mode) raises the Other interrupt to signal
Receiver Overrun.

This is a double problem because 1) at the moment e1000_msix_other()
assumes that it is only called in case of Link Status Change and 2) if the
condition persists, the interrupt is repeatedly raised again in quick
succession.

Ideally we would configure the Other interrupt to not be raised in case of
receiver overrun but this doesn't seem possible on this adapter. Instead,
we handle the first part of the problem by reverting to the practice of
reading ICR in the other interrupt handler, like before commit 16ecba59bc33
("e1000e: Do not read ICR in Other interrupt"). Thanks to commit
0a8047ac68e5 ("e1000e: Fix msi-x interrupt automask") which cleared IAME
from CTRL_EXT, reading ICR doesn't interfere with RxQ0, TxQ0 interrupts
anymore. We handle the second part of the problem by not re-enabling the
Other interrupt right away when there is overrun. Instead, we wait until
traffic subsides, napi polling mode is exited and interrupts are
re-enabled.

Reported-by: Lennart Sorensen <[email protected]>
Fixes: 16ecba59bc33 ("e1000e: Do not read ICR in Other interrupt")
Signed-off-by: Benjamin Poirier <[email protected]>
---
drivers/net/ethernet/intel/e1000e/defines.h | 1 +
drivers/net/ethernet/intel/e1000e/netdev.c | 33 +++++++++++++++++++++++------
2 files changed, 27 insertions(+), 7 deletions(-)

diff --git a/drivers/net/ethernet/intel/e1000e/defines.h b/drivers/net/ethernet/intel/e1000e/defines.h
index 0641c0098738..afb7ebe20b24 100644
--- a/drivers/net/ethernet/intel/e1000e/defines.h
+++ b/drivers/net/ethernet/intel/e1000e/defines.h
@@ -398,6 +398,7 @@
#define E1000_ICR_LSC 0x00000004 /* Link Status Change */
#define E1000_ICR_RXSEQ 0x00000008 /* Rx sequence error */
#define E1000_ICR_RXDMT0 0x00000010 /* Rx desc min. threshold (0) */
+#define E1000_ICR_RXO 0x00000040 /* Receiver Overrun */
#define E1000_ICR_RXT0 0x00000080 /* Rx timer intr (ring 0) */
#define E1000_ICR_ECCER 0x00400000 /* Uncorrectable ECC Error */
/* If this bit asserted, the driver should claim the interrupt */
diff --git a/drivers/net/ethernet/intel/e1000e/netdev.c b/drivers/net/ethernet/intel/e1000e/netdev.c
index 5a8ab1136566..803edd1a6401 100644
--- a/drivers/net/ethernet/intel/e1000e/netdev.c
+++ b/drivers/net/ethernet/intel/e1000e/netdev.c
@@ -1910,12 +1910,30 @@ static irqreturn_t e1000_msix_other(int __always_unused irq, void *data)
struct net_device *netdev = data;
struct e1000_adapter *adapter = netdev_priv(netdev);
struct e1000_hw *hw = &adapter->hw;
+ u32 icr;
+ bool enable = true;
+
+ icr = er32(ICR);
+ if (icr & E1000_ICR_RXO) {
+ ew32(ICR, E1000_ICR_RXO);
+ enable = false;
+ /* napi poll will re-enable Other, make sure it runs */
+ if (napi_schedule_prep(&adapter->napi)) {
+ adapter->total_rx_bytes = 0;
+ adapter->total_rx_packets = 0;
+ __napi_schedule(&adapter->napi);
+ }
+ }
+ if (icr & E1000_ICR_LSC) {
+ ew32(ICR, E1000_ICR_LSC);
+ hw->mac.get_link_status = true;
+ /* guard against interrupt when we're going down */
+ if (!test_bit(__E1000_DOWN, &adapter->state)) {
+ mod_timer(&adapter->watchdog_timer, jiffies + 1);
+ }
+ }

- hw->mac.get_link_status = true;
-
- /* guard against interrupt when we're going down */
- if (!test_bit(__E1000_DOWN, &adapter->state)) {
- mod_timer(&adapter->watchdog_timer, jiffies + 1);
+ if (enable && !test_bit(__E1000_DOWN, &adapter->state)) {
ew32(IMS, E1000_IMS_OTHER);
}

@@ -2687,7 +2705,8 @@ static int e1000e_poll(struct napi_struct *napi, int weight)
napi_complete_done(napi, work_done);
if (!test_bit(__E1000_DOWN, &adapter->state)) {
if (adapter->msix_entries)
- ew32(IMS, adapter->rx_ring->ims_val);
+ ew32(IMS, adapter->rx_ring->ims_val |
+ E1000_IMS_OTHER);
else
e1000_irq_enable(adapter);
}
@@ -4204,7 +4223,7 @@ static void e1000e_trigger_lsc(struct e1000_adapter *adapter)
struct e1000_hw *hw = &adapter->hw;

if (adapter->msix_entries)
- ew32(ICS, E1000_ICS_OTHER);
+ ew32(ICS, E1000_ICS_LSC | E1000_ICS_OTHER);
else
ew32(ICS, E1000_ICS_LSC);
}
--
2.13.2

2017-07-21 18:38:11

by Benjamin Poirier

[permalink] [raw]
Subject: [PATCH 3/5] e1000e: Fix return value test

All the helpers return -E1000_ERR_PHY.

Signed-off-by: Benjamin Poirier <[email protected]>
---
drivers/net/ethernet/intel/e1000e/netdev.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/intel/e1000e/netdev.c b/drivers/net/ethernet/intel/e1000e/netdev.c
index 58a87134d2e5..fc6a1d9999b2 100644
--- a/drivers/net/ethernet/intel/e1000e/netdev.c
+++ b/drivers/net/ethernet/intel/e1000e/netdev.c
@@ -5099,7 +5099,7 @@ static bool e1000e_has_link(struct e1000_adapter *adapter)
break;
}

- if ((ret_val == E1000_ERR_PHY) && (hw->phy.type == e1000_phy_igp_3) &&
+ if ((ret_val == -E1000_ERR_PHY) && (hw->phy.type == e1000_phy_igp_3) &&
(er32(CTRL) & E1000_PHY_CTRL_GBE_DISABLE)) {
/* See e1000_kmrn_lock_loss_workaround_ich8lan() */
e_info("Gigabit has been disabled, downgrading speed\n");
--
2.13.2

2017-07-21 18:48:50

by Lennart Sorensen

[permalink] [raw]
Subject: Re: [PATCH 5/5] e1000e: Avoid receiver overrun interrupt bursts

On Fri, Jul 21, 2017 at 11:36:27AM -0700, Benjamin Poirier wrote:
> When e1000e_poll() is not fast enough to keep up with incoming traffic, the
> adapter (when operating in msix mode) raises the Other interrupt to signal
> Receiver Overrun.
>
> This is a double problem because 1) at the moment e1000_msix_other()
> assumes that it is only called in case of Link Status Change and 2) if the
> condition persists, the interrupt is repeatedly raised again in quick
> succession.
>
> Ideally we would configure the Other interrupt to not be raised in case of
> receiver overrun but this doesn't seem possible on this adapter. Instead,
> we handle the first part of the problem by reverting to the practice of
> reading ICR in the other interrupt handler, like before commit 16ecba59bc33
> ("e1000e: Do not read ICR in Other interrupt"). Thanks to commit
> 0a8047ac68e5 ("e1000e: Fix msi-x interrupt automask") which cleared IAME
> from CTRL_EXT, reading ICR doesn't interfere with RxQ0, TxQ0 interrupts
> anymore. We handle the second part of the problem by not re-enabling the
> Other interrupt right away when there is overrun. Instead, we wait until
> traffic subsides, napi polling mode is exited and interrupts are
> re-enabled.
>
> Reported-by: Lennart Sorensen <[email protected]>
> Fixes: 16ecba59bc33 ("e1000e: Do not read ICR in Other interrupt")
> Signed-off-by: Benjamin Poirier <[email protected]>

Any chance of this fix hitting -stable? After all adapter reset under
load is not nice.

--
Len Sorensen

2017-07-21 18:50:04

by Lennart Sorensen

[permalink] [raw]
Subject: Re: [PATCH 4/5] e1000e: Separate signaling for link check/link up

On Fri, Jul 21, 2017 at 11:36:26AM -0700, Benjamin Poirier wrote:
> Lennart reported the following race condition:
>
> \ e1000_watchdog_task
> \ e1000e_has_link
> \ hw->mac.ops.check_for_link() === e1000e_check_for_copper_link
> /* link is up */
> mac->get_link_status = false;
>
> /* interrupt */
> \ e1000_msix_other
> hw->mac.get_link_status = true;
>
> link_active = !hw->mac.get_link_status
> /* link_active is false, wrongly */
>
> This problem arises because the single flag get_link_status is used to
> signal two different states: link status needs checking and link status is
> down.
>
> Avoid the problem by using the return value of .check_for_link to signal
> the link status to e1000e_has_link().
>
> Reported-by: Lennart Sorensen <[email protected]>
> Signed-off-by: Benjamin Poirier <[email protected]>

This too seems potentially -stable worthy, although with patch 5, the
problem becomes much much less likely to occur.

--
Len Sorensen

2017-07-21 19:02:47

by Lennart Sorensen

[permalink] [raw]
Subject: Re: commit 16ecba59 breaks 82574L under heavy load.

On Thu, Jul 20, 2017 at 04:44:55PM -0700, Benjamin Poirier wrote:
> Could you please test the following patch and let me know if it:
> 1) reduces the interrupt rate of the Other msi-x vector
> 2) avoids the link flaps
> or
> 3) logs some dmesg warnings of the form "Other interrupt with unhandled [...]"
> In this case, please paste icr values printed.

By the way, while at fixing the e1000e, I just noticed that
if you are blasting the port with traffic when it comes up,
you risk getting a transmit queue time out, because the queue
is started before the carrier is up. ixgbe already fixed that in
cdc04dcce0598fead6029a2f95e95a4d2ea419c2. igb has the same problem (which
goes away by moving the queue start to the watchdog after carrier_on,
I just haven't got around to sending that patch yet).

I am going to try moving the queue start to the watchdog and try it again.

Trace looked like this:

------------[ cut here ]------------
WARNING: CPU: 0 PID: 0 at net/sched/sch_generic.c:316 dev_watchdog+0x1f9/0x200
NETDEV WATCHDOG: eth0 (e1000e): transmit queue 0 timed out
Modules linked in: dpi_drv(PO) ccu_util(PO) ipv4_mb(PO) l2bridge_config_util(PO) l2_config_util(PO) route_config_util(PO) qos_config_util(PO) sysapp_common(PO) chantry_fwd_eng_2800_config(PO) shim_module(PO) sadb_cc(PO) ipsecXformer(PO) libeCrypto(PO) ipmatch_cc(PO) l2h_cc(PO) ndproxy_cc(PO) arpint_cc(PO) portinfo_cc(PO) chantryqos_cc(PO) redirector_cc(PO) ix_ph(PO) fpm_core_cc(PO) pulse_cc(PO) vnstt_cc(PO) vnsap_cc(PO) fm_cc(PO) rutm_cc(PO) mutm_cc(PO) ethernet_tx_cc(PO) stkdrv_cc(PO) l2bridge_cc(PO) events_util(PO) sched_cc(PO) qm_cc(PO) ipv4_cc(PO) wred_cc(PO) tc_meter_cc(PO) dscp_classifier_cc(PO) classifier_6t_cc(PO) ent586_cc(PO) dev_cc_arp(PO) chantry_fwd_eng_2800_tables(PO) ether_arp_lib(PO) rtmv4_lib(PO) lkup_lib(PO) l2tm_lib(PO) fragmentation_lib(PO) properties_lib(PO) msg_support_lib(PO)
utilities_lib(PO) cci_lib(PO) rm_lib(PO) libossl(O) vip(O) productSpec_x86_dp(PO) e1000e
CPU: 0 PID: 0 Comm: swapper/0 Tainted: P O 4.9.24 #20
Hardware name: Supermicro X7SPA-HF/X7SPA-HF, BIOS 1.2a 06/23/12
0000000000000000 ffffffff811cef1b ffff88007fc03e88 0000000000000000
ffffffff81037ade 0000000000000000 ffff88007fc03ed8 0000000000000001
0000000000000000 0000000000000082 0000000000000001 ffffffff81037b4c
Call Trace:
<IRQ>
[<ffffffff811cef1b>] ? dump_stack+0x46/0x5b
[<ffffffff81037ade>] ? __warn+0xbe/0xe0
[<ffffffff81037b4c>] ? warn_slowpath_fmt+0x4c/0x50
[<ffffffff8107ac92>] ? mod_timer+0xf2/0x150
[<ffffffff812ffe69>] ? dev_watchdog+0x1f9/0x200
[<ffffffff812ffc70>] ? dev_graft_qdisc+0x70/0x70
[<ffffffff8107aeb1>] ? call_timer_fn.isra.26+0x11/0x80
[<ffffffff8107b048>] ? run_timer_softirq+0x128/0x150
[<ffffffff8103b16b>] ? __do_softirq+0xeb/0x1f0
[<ffffffff8103b365>] ? irq_exit+0x55/0x60
[<ffffffff81024da9>] ? smp_apic_timer_interrupt+0x39/0x50
[<ffffffff813ab19c>] ? apic_timer_interrupt+0x7c/0x90
<EOI>
[<ffffffff813aa1e1>] ? mwait_idle+0x51/0x80
[<ffffffff81067717>] ? cpu_startup_entry+0xa7/0x130
[<ffffffff81663cf4>] ? start_kernel+0x306/0x30e
---[ end trace ee759b7a56e1110b ]---

--
Len Sorensen

2017-07-24 21:56:23

by Philip Prindeville

[permalink] [raw]
Subject: Re: commit 16ecba59 breaks 82574L under heavy load.


> On Jul 20, 2017, at 5:44 PM, Benjamin Poirier <[email protected]> wrote:
>
> [snip]
> Could you please test the following patch and let me know if it:
> 1) reduces the interrupt rate of the Other msi-x vector
> 2) avoids the link flaps
> or
> 3) logs some dmesg warnings of the form "Other interrupt with unhandled [...]"
> In this case, please paste icr values printed.
>
> Thanks
>
> diff --git a/drivers/net/ethernet/intel/e1000e/defines.h b/drivers/net/ethernet/intel/e1000e/defines.h
> index 0641c0098738..afb7ebe20b24 100644
> --- a/drivers/net/ethernet/intel/e1000e/defines.h
> +++ b/drivers/net/ethernet/intel/e1000e/defines.h
> @@ -398,6 +398,7 @@
> #define E1000_ICR_LSC 0x00000004 /* Link Status Change */
> #define E1000_ICR_RXSEQ 0x00000008 /* Rx sequence error */
> #define E1000_ICR_RXDMT0 0x00000010 /* Rx desc min. threshold (0) */
> +#define E1000_ICR_RXO 0x00000040 /* Receiver Overrun */
> #define E1000_ICR_RXT0 0x00000080 /* Rx timer intr (ring 0) */
> #define E1000_ICR_ECCER 0x00400000 /* Uncorrectable ECC Error */
> /* If this bit asserted, the driver should claim the interrupt */
> diff --git a/drivers/net/ethernet/intel/e1000e/e1000.h b/drivers/net/ethernet/intel/e1000e/e1000.h
> index c7c994eb410e..f7b46eba3efb 100644
> --- a/drivers/net/ethernet/intel/e1000e/e1000.h
> +++ b/drivers/net/ethernet/intel/e1000e/e1000.h
> @@ -351,6 +351,10 @@ struct e1000_adapter {
> s32 ptp_delta;
>
> u16 eee_advert;
> +
> + unsigned int uh_count;
> + u32 uh_values[16];
> + unsigned int uh_values_nb;
> };
>
> struct e1000_info {
> diff --git a/drivers/net/ethernet/intel/e1000e/netdev.c b/drivers/net/ethernet/intel/e1000e/netdev.c
> index b3679728caac..46697338c0e1 100644
> --- a/drivers/net/ethernet/intel/e1000e/netdev.c
> +++ b/drivers/net/ethernet/intel/e1000e/netdev.c
> @@ -46,6 +46,8 @@
>
> #include "e1000.h"
>
> +DEFINE_RATELIMIT_STATE(other_uh_ratelimit_state, HZ, 1);
> +
> #define DRV_EXTRAVERSION "-k"
>
> #define DRV_VERSION "3.2.6" DRV_EXTRAVERSION
> @@ -1904,12 +1906,60 @@ static irqreturn_t e1000_msix_other(int __always_unused irq, void *data)
> struct net_device *netdev = data;
> struct e1000_adapter *adapter = netdev_priv(netdev);
> struct e1000_hw *hw = &adapter->hw;
> + u32 icr;
> + bool enable = true;
> + bool handled = false;
> + unsigned int i;
>
> - hw->mac.get_link_status = true;
> + icr = er32(ICR);
> + if (icr & E1000_ICR_RXO) {
> + ew32(ICR, E1000_ICR_RXO);
> + enable = false;
> + /* napi poll will re-enable Other, make sure it runs */
> + if (napi_schedule_prep(&adapter->napi)) {
> + adapter->total_rx_bytes = 0;
> + adapter->total_rx_packets = 0;
> + __napi_schedule(&adapter->napi);
> + }
> + handled = true;
> + }
> + if (icr & E1000_ICR_LSC) {
> + ew32(ICR, E1000_ICR_LSC);
> + hw->mac.get_link_status = true;
> + /* guard against interrupt when we're going down */
> + if (!test_bit(__E1000_DOWN, &adapter->state)) {
> + mod_timer(&adapter->watchdog_timer, jiffies + 1);
> + }
> + handled = true;
> + }
>
> - /* guard against interrupt when we're going down */
> - if (!test_bit(__E1000_DOWN, &adapter->state)) {
> - mod_timer(&adapter->watchdog_timer, jiffies + 1);
> + if (!handled) {
> + adapter->uh_count++;
> + /* only print unseen icr values */
> + if (adapter->uh_values_nb < ARRAY_SIZE(adapter->uh_values)) {
> + for (i = 0; i < ARRAY_SIZE(adapter->uh_values); i++) {
> + if (adapter->uh_values[i] == icr) {
> + break;
> + }
> + }
> + if (i == ARRAY_SIZE(adapter->uh_values)) {
> + adapter->uh_values[adapter->uh_values_nb] =
> + icr;
> + adapter->uh_values_nb++;
> + netdev_warn(netdev,
> + "Other interrupt with unhandled icr 0x%08x\n",
> + icr);
> + }
> + }
> + }
> + if (adapter->uh_count && __ratelimit(&other_uh_ratelimit_state)) {
> + netdev_warn(netdev,
> + "Other interrupt with unhandled cause, count %u\n",
> + adapter->uh_count);
> + adapter->uh_count = 0;
> + }
> +
> + if (enable && !test_bit(__E1000_DOWN, &adapter->state)) {
> ew32(IMS, E1000_IMS_OTHER);
> }
>
> @@ -2681,7 +2731,8 @@ static int e1000e_poll(struct napi_struct *napi, int weight)
> napi_complete_done(napi, work_done);
> if (!test_bit(__E1000_DOWN, &adapter->state)) {
> if (adapter->msix_entries)
> - ew32(IMS, adapter->rx_ring->ims_val);
> + ew32(IMS, adapter->rx_ring->ims_val |
> + E1000_IMS_OTHER);
> else
> e1000_irq_enable(adapter);
> }
> @@ -4197,7 +4248,7 @@ static void e1000e_trigger_lsc(struct e1000_adapter *adapter)
> struct e1000_hw *hw = &adapter->hw;
>
> if (adapter->msix_entries)
> - ew32(ICS, E1000_ICS_OTHER);
> + ew32(ICS, E1000_ICS_LSC | E1000_ICS_OTHER);
> else
> ew32(ICS, E1000_ICS_LSC);
> }
> @@ -7572,6 +7623,8 @@ static int __init e1000_init_module(void)
> e1000e_driver_version);
> pr_info("Copyright(c) 1999 - 2015 Intel Corporation.\n");
>
> + ratelimit_set_flags(&other_uh_ratelimit_state, RATELIMIT_MSG_ON_RELEASE);
> +
> return pci_register_driver(&e1000_driver);
> }
> module_init(e1000_init_module);



We’ve been running this patch on a Lanner FW-7568 (Atom D525, 6x 82574L NICs) under heavy load both with and without RPS enabled and have yet to see a single link flap.

Is it going into linux-stable?

Thanks,

-Philip

2017-08-02 11:28:12

by Sasha Neftin

[permalink] [raw]
Subject: Re: [Intel-wired-lan] [PATCH 4/5] e1000e: Separate signaling for link check/link up

On 7/21/2017 21:36, Benjamin Poirier wrote:
> Lennart reported the following race condition:
>
> \ e1000_watchdog_task
> \ e1000e_has_link
> \ hw->mac.ops.check_for_link() === e1000e_check_for_copper_link
> /* link is up */
> mac->get_link_status = false;
>
> /* interrupt */
> \ e1000_msix_other
> hw->mac.get_link_status = true;
>
> link_active = !hw->mac.get_link_status
> /* link_active is false, wrongly */
>
> This problem arises because the single flag get_link_status is used to
> signal two different states: link status needs checking and link status is
> down.
>
> Avoid the problem by using the return value of .check_for_link to signal
> the link status to e1000e_has_link().
>
> Reported-by: Lennart Sorensen <[email protected]>
> Signed-off-by: Benjamin Poirier <[email protected]>
> ---
> drivers/net/ethernet/intel/e1000e/mac.c | 11 ++++++++---
> drivers/net/ethernet/intel/e1000e/netdev.c | 2 +-
> 2 files changed, 9 insertions(+), 4 deletions(-)
>
> diff --git a/drivers/net/ethernet/intel/e1000e/mac.c b/drivers/net/ethernet/intel/e1000e/mac.c
> index b322011ec282..f457c5703d0c 100644
> --- a/drivers/net/ethernet/intel/e1000e/mac.c
> +++ b/drivers/net/ethernet/intel/e1000e/mac.c
> @@ -410,6 +410,9 @@ void e1000e_clear_hw_cntrs_base(struct e1000_hw *hw)
> * Checks to see of the link status of the hardware has changed. If a
> * change in link status has been detected, then we read the PHY registers
> * to get the current speed/duplex if link exists.
> + *
> + * Returns a negative error code (-E1000_ERR_*) or 0 (link down) or 1 (link
> + * up).
> **/
> s32 e1000e_check_for_copper_link(struct e1000_hw *hw)
> {
> @@ -423,7 +426,7 @@ s32 e1000e_check_for_copper_link(struct e1000_hw *hw)
> * Change or Rx Sequence Error interrupt.
> */
> if (!mac->get_link_status)
> - return 0;
> + return 1;
>
> /* First we want to see if the MII Status Register reports
> * link. If so, then we want to get the current speed/duplex
> @@ -461,10 +464,12 @@ s32 e1000e_check_for_copper_link(struct e1000_hw *hw)
> * different link partner.
> */
> ret_val = e1000e_config_fc_after_link_up(hw);
> - if (ret_val)
> + if (ret_val) {
> e_dbg("Error configuring flow control\n");
> + return ret_val;
> + }
>
> - return ret_val;
> + return 1;
> }
>
> /**
> diff --git a/drivers/net/ethernet/intel/e1000e/netdev.c b/drivers/net/ethernet/intel/e1000e/netdev.c
> index fc6a1d9999b2..5a8ab1136566 100644
> --- a/drivers/net/ethernet/intel/e1000e/netdev.c
> +++ b/drivers/net/ethernet/intel/e1000e/netdev.c
> @@ -5081,7 +5081,7 @@ static bool e1000e_has_link(struct e1000_adapter *adapter)
> case e1000_media_type_copper:
> if (hw->mac.get_link_status) {
> ret_val = hw->mac.ops.check_for_link(hw);
> - link_active = !hw->mac.get_link_status;
> + link_active = ret_val > 0;
> } else {
> link_active = true;
> }

Hello Benjamin,

Will this patch fix any serious problem with link indication? Is it
necessary? Can we consider your patch series without 4/5 part?

2017-08-02 14:34:40

by Lennart Sorensen

[permalink] [raw]
Subject: Re: [Intel-wired-lan] [PATCH 4/5] e1000e: Separate signaling for link check/link up

On Wed, Aug 02, 2017 at 02:28:07PM +0300, Neftin, Sasha wrote:
> On 7/21/2017 21:36, Benjamin Poirier wrote:
> > Lennart reported the following race condition:
> >
> > \ e1000_watchdog_task
> > \ e1000e_has_link
> > \ hw->mac.ops.check_for_link() === e1000e_check_for_copper_link
> > /* link is up */
> > mac->get_link_status = false;
> >
> > /* interrupt */
> > \ e1000_msix_other
> > hw->mac.get_link_status = true;
> >
> > link_active = !hw->mac.get_link_status
> > /* link_active is false, wrongly */
> >
> > This problem arises because the single flag get_link_status is used to
> > signal two different states: link status needs checking and link status is
> > down.
> >
> > Avoid the problem by using the return value of .check_for_link to signal
> > the link status to e1000e_has_link().
> >
> > Reported-by: Lennart Sorensen <[email protected]>
> > Signed-off-by: Benjamin Poirier <[email protected]>
> > ---
> > drivers/net/ethernet/intel/e1000e/mac.c | 11 ++++++++---
> > drivers/net/ethernet/intel/e1000e/netdev.c | 2 +-
> > 2 files changed, 9 insertions(+), 4 deletions(-)
> >
> > diff --git a/drivers/net/ethernet/intel/e1000e/mac.c b/drivers/net/ethernet/intel/e1000e/mac.c
> > index b322011ec282..f457c5703d0c 100644
> > --- a/drivers/net/ethernet/intel/e1000e/mac.c
> > +++ b/drivers/net/ethernet/intel/e1000e/mac.c
> > @@ -410,6 +410,9 @@ void e1000e_clear_hw_cntrs_base(struct e1000_hw *hw)
> > * Checks to see of the link status of the hardware has changed. If a
> > * change in link status has been detected, then we read the PHY registers
> > * to get the current speed/duplex if link exists.
> > + *
> > + * Returns a negative error code (-E1000_ERR_*) or 0 (link down) or 1 (link
> > + * up).
> > **/
> > s32 e1000e_check_for_copper_link(struct e1000_hw *hw)
> > {
> > @@ -423,7 +426,7 @@ s32 e1000e_check_for_copper_link(struct e1000_hw *hw)
> > * Change or Rx Sequence Error interrupt.
> > */
> > if (!mac->get_link_status)
> > - return 0;
> > + return 1;
> > /* First we want to see if the MII Status Register reports
> > * link. If so, then we want to get the current speed/duplex
> > @@ -461,10 +464,12 @@ s32 e1000e_check_for_copper_link(struct e1000_hw *hw)
> > * different link partner.
> > */
> > ret_val = e1000e_config_fc_after_link_up(hw);
> > - if (ret_val)
> > + if (ret_val) {
> > e_dbg("Error configuring flow control\n");
> > + return ret_val;
> > + }
> > - return ret_val;
> > + return 1;
> > }
> > /**
> > diff --git a/drivers/net/ethernet/intel/e1000e/netdev.c b/drivers/net/ethernet/intel/e1000e/netdev.c
> > index fc6a1d9999b2..5a8ab1136566 100644
> > --- a/drivers/net/ethernet/intel/e1000e/netdev.c
> > +++ b/drivers/net/ethernet/intel/e1000e/netdev.c
> > @@ -5081,7 +5081,7 @@ static bool e1000e_has_link(struct e1000_adapter *adapter)
> > case e1000_media_type_copper:
> > if (hw->mac.get_link_status) {
> > ret_val = hw->mac.ops.check_for_link(hw);
> > - link_active = !hw->mac.get_link_status;
> > + link_active = ret_val > 0;
> > } else {
> > link_active = true;
> > }
>
> Hello Benjamin,
>
> Will this patch fix any serious problem with link indication? Is it
> necessary? Can we consider your patch series without 4/5 part?

Without this patch, you have the race condition that can make the
watchdog_task mistakenly think the link is down when it isn't, and then
it resets the adapter, which does make the link go down.

So it is rather catastrophic for the interface.

The other patch to the interrupt handling should make it never get hit,
but the issue does still exist if not fixed and I wouldn't rule out that
it could possibly still happen even with the other fix in place.

--
Len Sorensen

2017-08-02 14:49:33

by Benjamin Poirier

[permalink] [raw]
Subject: Re: [Intel-wired-lan] [PATCH 4/5] e1000e: Separate signaling for link check/link up

On 2017/08/02 10:34, Lennart Sorensen wrote:
> On Wed, Aug 02, 2017 at 02:28:07PM +0300, Neftin, Sasha wrote:
> > On 7/21/2017 21:36, Benjamin Poirier wrote:
> > > Lennart reported the following race condition:
> > >
> > > \ e1000_watchdog_task
> > > \ e1000e_has_link
> > > \ hw->mac.ops.check_for_link() === e1000e_check_for_copper_link
> > > /* link is up */
> > > mac->get_link_status = false;
> > >
> > > /* interrupt */
> > > \ e1000_msix_other
> > > hw->mac.get_link_status = true;
> > >
> > > link_active = !hw->mac.get_link_status
> > > /* link_active is false, wrongly */
> > >
> > > This problem arises because the single flag get_link_status is used to
> > > signal two different states: link status needs checking and link status is
> > > down.
> > >
> > > Avoid the problem by using the return value of .check_for_link to signal
> > > the link status to e1000e_has_link().
> > >
> > > Reported-by: Lennart Sorensen <[email protected]>
> > > Signed-off-by: Benjamin Poirier <[email protected]>
> > > ---
> > > drivers/net/ethernet/intel/e1000e/mac.c | 11 ++++++++---
> > > drivers/net/ethernet/intel/e1000e/netdev.c | 2 +-
> > > 2 files changed, 9 insertions(+), 4 deletions(-)
> > >
> > > diff --git a/drivers/net/ethernet/intel/e1000e/mac.c b/drivers/net/ethernet/intel/e1000e/mac.c
> > > index b322011ec282..f457c5703d0c 100644
> > > --- a/drivers/net/ethernet/intel/e1000e/mac.c
> > > +++ b/drivers/net/ethernet/intel/e1000e/mac.c
> > > @@ -410,6 +410,9 @@ void e1000e_clear_hw_cntrs_base(struct e1000_hw *hw)
> > > * Checks to see of the link status of the hardware has changed. If a
> > > * change in link status has been detected, then we read the PHY registers
> > > * to get the current speed/duplex if link exists.
> > > + *
> > > + * Returns a negative error code (-E1000_ERR_*) or 0 (link down) or 1 (link
> > > + * up).
> > > **/
> > > s32 e1000e_check_for_copper_link(struct e1000_hw *hw)
> > > {
> > > @@ -423,7 +426,7 @@ s32 e1000e_check_for_copper_link(struct e1000_hw *hw)
> > > * Change or Rx Sequence Error interrupt.
> > > */
> > > if (!mac->get_link_status)
> > > - return 0;
> > > + return 1;
> > > /* First we want to see if the MII Status Register reports
> > > * link. If so, then we want to get the current speed/duplex
> > > @@ -461,10 +464,12 @@ s32 e1000e_check_for_copper_link(struct e1000_hw *hw)
> > > * different link partner.
> > > */
> > > ret_val = e1000e_config_fc_after_link_up(hw);
> > > - if (ret_val)
> > > + if (ret_val) {
> > > e_dbg("Error configuring flow control\n");
> > > + return ret_val;
> > > + }
> > > - return ret_val;
> > > + return 1;
> > > }
> > > /**
> > > diff --git a/drivers/net/ethernet/intel/e1000e/netdev.c b/drivers/net/ethernet/intel/e1000e/netdev.c
> > > index fc6a1d9999b2..5a8ab1136566 100644
> > > --- a/drivers/net/ethernet/intel/e1000e/netdev.c
> > > +++ b/drivers/net/ethernet/intel/e1000e/netdev.c
> > > @@ -5081,7 +5081,7 @@ static bool e1000e_has_link(struct e1000_adapter *adapter)
> > > case e1000_media_type_copper:
> > > if (hw->mac.get_link_status) {
> > > ret_val = hw->mac.ops.check_for_link(hw);
> > > - link_active = !hw->mac.get_link_status;
> > > + link_active = ret_val > 0;
> > > } else {
> > > link_active = true;
> > > }
> >
> > Hello Benjamin,
> >
> > Will this patch fix any serious problem with link indication? Is it
> > necessary? Can we consider your patch series without 4/5 part?
>
> Without this patch, you have the race condition that can make the
> watchdog_task mistakenly think the link is down when it isn't, and then
> it resets the adapter, which does make the link go down.
>
> So it is rather catastrophic for the interface.
>
> The other patch to the interrupt handling should make it never get hit,
> but the issue does still exist if not fixed and I wouldn't rule out that
> it could possibly still happen even with the other fix in place.

Exactly. I wouldn't have explained it better, thanks.

2017-08-12 02:13:59

by Philip Prindeville

[permalink] [raw]
Subject: Re: [PATCH 5/5] e1000e: Avoid receiver overrun interrupt bursts


> On Jul 21, 2017, at 12:48 PM, Lennart Sorensen <[email protected]> wrote:
>
> On Fri, Jul 21, 2017 at 11:36:27AM -0700, Benjamin Poirier wrote:
>> When e1000e_poll() is not fast enough to keep up with incoming traffic, the
>> adapter (when operating in msix mode) raises the Other interrupt to signal
>> Receiver Overrun.
>>
>> This is a double problem because 1) at the moment e1000_msix_other()
>> assumes that it is only called in case of Link Status Change and 2) if the
>> condition persists, the interrupt is repeatedly raised again in quick
>> succession.
>>
>> Ideally we would configure the Other interrupt to not be raised in case of
>> receiver overrun but this doesn't seem possible on this adapter. Instead,
>> we handle the first part of the problem by reverting to the practice of
>> reading ICR in the other interrupt handler, like before commit 16ecba59bc33
>> ("e1000e: Do not read ICR in Other interrupt"). Thanks to commit
>> 0a8047ac68e5 ("e1000e: Fix msi-x interrupt automask") which cleared IAME
>> from CTRL_EXT, reading ICR doesn't interfere with RxQ0, TxQ0 interrupts
>> anymore. We handle the second part of the problem by not re-enabling the
>> Other interrupt right away when there is overrun. Instead, we wait until
>> traffic subsides, napi polling mode is exited and interrupts are
>> re-enabled.
>>
>> Reported-by: Lennart Sorensen <[email protected]>
>> Fixes: 16ecba59bc33 ("e1000e: Do not read ICR in Other interrupt")
>> Signed-off-by: Benjamin Poirier <[email protected]>
>
> Any chance of this fix hitting -stable? After all adapter reset under
> load is not nice.
>


I tried this patch sequence and I’m seeing a 2% drop in throughput. CPU utilization at softIRQ is also about 8% higher. The previous single patch that went out to fix this problem had better performance.

This is on an Atom D525 with an 82574L and running 2 GB streams across a pair of interfaces with iperf3.

-Philip


2017-08-12 02:47:25

by Philip Prindeville

[permalink] [raw]
Subject: Re: [PATCH 5/5] e1000e: Avoid receiver overrun interrupt bursts


> On Aug 11, 2017, at 8:13 PM, Philip Prindeville <[email protected]> wrote:
>
>>
>> On Jul 21, 2017, at 12:48 PM, Lennart Sorensen <[email protected]> wrote:
>>
>> On Fri, Jul 21, 2017 at 11:36:27AM -0700, Benjamin Poirier wrote:
>>> When e1000e_poll() is not fast enough to keep up with incoming traffic, the
>>> adapter (when operating in msix mode) raises the Other interrupt to signal
>>> Receiver Overrun.
>>>
>>> This is a double problem because 1) at the moment e1000_msix_other()
>>> assumes that it is only called in case of Link Status Change and 2) if the
>>> condition persists, the interrupt is repeatedly raised again in quick
>>> succession.
>>>
>>> Ideally we would configure the Other interrupt to not be raised in case of
>>> receiver overrun but this doesn't seem possible on this adapter. Instead,
>>> we handle the first part of the problem by reverting to the practice of
>>> reading ICR in the other interrupt handler, like before commit 16ecba59bc33
>>> ("e1000e: Do not read ICR in Other interrupt"). Thanks to commit
>>> 0a8047ac68e5 ("e1000e: Fix msi-x interrupt automask") which cleared IAME
>>> from CTRL_EXT, reading ICR doesn't interfere with RxQ0, TxQ0 interrupts
>>> anymore. We handle the second part of the problem by not re-enabling the
>>> Other interrupt right away when there is overrun. Instead, we wait until
>>> traffic subsides, napi polling mode is exited and interrupts are
>>> re-enabled.
>>>
>>> Reported-by: Lennart Sorensen <[email protected]>
>>> Fixes: 16ecba59bc33 ("e1000e: Do not read ICR in Other interrupt")
>>> Signed-off-by: Benjamin Poirier <[email protected]>
>>
>> Any chance of this fix hitting -stable? After all adapter reset under
>> load is not nice.
>>
>
>
> I tried this patch sequence and I’m seeing a 2% drop in throughput. CPU utilization at softIRQ is also about 8% higher. The previous single patch that went out to fix this problem had better performance.
>
> This is on an Atom D525 with an 82574L and running 2 GB streams across a pair of interfaces with iperf3.
>
> -Philip


Actually, after turning off MSI-X mode (and using MSI mode instead), and setting InterruptRateThrottle to “4” (conservative dynamic mode) across all interfaces, I’m actually seeing slightly better throughput than the earlier patch… with comparable overall CPU utilization and SoftIRQ utilization.

So setting the module parameters correctly for routing (and not end-system parameters) makes a big difference when routing.

-Philip

2017-08-21 17:17:22

by Benjamin Poirier

[permalink] [raw]
Subject: Re: [PATCH 5/5] e1000e: Avoid receiver overrun interrupt bursts

On 2017/07/21 11:36, Benjamin Poirier wrote:
> When e1000e_poll() is not fast enough to keep up with incoming traffic, the
> adapter (when operating in msix mode) raises the Other interrupt to signal
> Receiver Overrun.
>
> This is a double problem because 1) at the moment e1000_msix_other()
> assumes that it is only called in case of Link Status Change and 2) if the
> condition persists, the interrupt is repeatedly raised again in quick
> succession.
>
> Ideally we would configure the Other interrupt to not be raised in case of
> receiver overrun but this doesn't seem possible on this adapter. Instead,
> we handle the first part of the problem by reverting to the practice of
> reading ICR in the other interrupt handler, like before commit 16ecba59bc33
> ("e1000e: Do not read ICR in Other interrupt"). Thanks to commit
> 0a8047ac68e5 ("e1000e: Fix msi-x interrupt automask") which cleared IAME
> from CTRL_EXT, reading ICR doesn't interfere with RxQ0, TxQ0 interrupts
> anymore. We handle the second part of the problem by not re-enabling the
> Other interrupt right away when there is overrun. Instead, we wait until
> traffic subsides, napi polling mode is exited and interrupts are
> re-enabled.
>
> Reported-by: Lennart Sorensen <[email protected]>
> Fixes: 16ecba59bc33 ("e1000e: Do not read ICR in Other interrupt")
> Signed-off-by: Benjamin Poirier <[email protected]>

What's the status on these patches please? One month later they still
show up as "new" in patchwork.

2017-09-15 00:18:51

by Brown, Aaron F

[permalink] [raw]
Subject: RE: [Intel-wired-lan] [PATCH 1/5] e1000e: Fix error path in link detection

> From: Intel-wired-lan [mailto:[email protected]] On Behalf
> Of Benjamin Poirier
> Sent: Friday, July 21, 2017 11:36 AM
> To: Kirsher, Jeffrey T <[email protected]>
> Cc: [email protected]; [email protected]; linux-
> [email protected]; Lennart Sorensen <[email protected]>
> Subject: [Intel-wired-lan] [PATCH 1/5] e1000e: Fix error path in link detection
>
> In case of error from e1e_rphy(), the loop will exit early and "success"
> will be set to true erroneously.
>
> Signed-off-by: Benjamin Poirier <[email protected]>
> ---
> drivers/net/ethernet/intel/e1000e/phy.c | 7 ++++---
> 1 file changed, 4 insertions(+), 3 deletions(-)
>

Tested-by: Aaron Brown <[email protected]>

2017-09-15 00:20:21

by Brown, Aaron F

[permalink] [raw]
Subject: RE: [Intel-wired-lan] [PATCH 3/5] e1000e: Fix return value test

> From: Intel-wired-lan [mailto:[email protected]] On Behalf
> Of Benjamin Poirier
> Sent: Friday, July 21, 2017 11:36 AM
> To: Kirsher, Jeffrey T <[email protected]>
> Cc: [email protected]; [email protected]; linux-
> [email protected]; Lennart Sorensen <[email protected]>
> Subject: [Intel-wired-lan] [PATCH 3/5] e1000e: Fix return value test
>
> All the helpers return -E1000_ERR_PHY.
>
> Signed-off-by: Benjamin Poirier <[email protected]>
> ---
> drivers/net/ethernet/intel/e1000e/netdev.c | 2 +-
> 1 file changed, 1 insertion(+), 1 deletion(-)

Tested-by: Aaron Brown <[email protected]>

2017-09-15 00:27:50

by Brown, Aaron F

[permalink] [raw]
Subject: RE: [Intel-wired-lan] [PATCH 4/5] e1000e: Separate signaling for link check/link up


On 7/21/2017 21:36, Benjamin Poirier wrote:
> Lennart reported the following race condition:
>
> \ e1000_watchdog_task
> \ e1000e_has_link
> \ hw->mac.ops.check_for_link() === e1000e_check_for_copper_link
> /* link is up */
> mac->get_link_status = false;
>
> /* interrupt */
> \ e1000_msix_other
> hw->mac.get_link_status = true;
>
> link_active = !hw->mac.get_link_status
> /* link_active is false, wrongly */
>
> This problem arises because the single flag get_link_status is used to
> signal two different states: link status needs checking and link status is
> down.
>
> Avoid the problem by using the return value of .check_for_link to signal
> the link status to e1000e_has_link().
>
> Reported-by: Lennart Sorensen <[email protected]>
> Signed-off-by: Benjamin Poirier <[email protected]>
> ---
> drivers/net/ethernet/intel/e1000e/mac.c | 11 ++++++++---
> drivers/net/ethernet/intel/e1000e/netdev.c | 2 +-
> 2 files changed, 9 insertions(+), 4 deletions(-)

Tested-by: Aaron Brown <[email protected]>

2017-09-15 00:38:26

by Brown, Aaron F

[permalink] [raw]
Subject: RE: [Intel-wired-lan] [PATCH 5/5] e1000e: Avoid receiver overrun interrupt bursts

> From: Intel-wired-lan [mailto:[email protected]] On Behalf
> Of Benjamin Poirier
> Sent: Friday, July 21, 2017 11:36 AM
> To: Kirsher, Jeffrey T <[email protected]>
> Cc: [email protected]; [email protected]; linux-
> [email protected]; Lennart Sorensen <[email protected]>
> Subject: [Intel-wired-lan] [PATCH 5/5] e1000e: Avoid receiver overrun
> interrupt bursts
>
> When e1000e_poll() is not fast enough to keep up with incoming traffic, the
> adapter (when operating in msix mode) raises the Other interrupt to signal
> Receiver Overrun.
>
> This is a double problem because 1) at the moment e1000_msix_other()
> assumes that it is only called in case of Link Status Change and 2) if the
> condition persists, the interrupt is repeatedly raised again in quick
> succession.
>
> Ideally we would configure the Other interrupt to not be raised in case of
> receiver overrun but this doesn't seem possible on this adapter. Instead,
> we handle the first part of the problem by reverting to the practice of
> reading ICR in the other interrupt handler, like before commit 16ecba59bc33
> ("e1000e: Do not read ICR in Other interrupt"). Thanks to commit
> 0a8047ac68e5 ("e1000e: Fix msi-x interrupt automask") which cleared IAME
> from CTRL_EXT, reading ICR doesn't interfere with RxQ0, TxQ0 interrupts
> anymore. We handle the second part of the problem by not re-enabling the
> Other interrupt right away when there is overrun. Instead, we wait until
> traffic subsides, napi polling mode is exited and interrupts are
> re-enabled.
>
> Reported-by: Lennart Sorensen <[email protected]>
> Fixes: 16ecba59bc33 ("e1000e: Do not read ICR in Other interrupt")
> Signed-off-by: Benjamin Poirier <[email protected]>
> ---
> drivers/net/ethernet/intel/e1000e/defines.h | 1 +
> drivers/net/ethernet/intel/e1000e/netdev.c | 33
> +++++++++++++++++++++++------
> 2 files changed, 27 insertions(+), 7 deletions(-)
>

I get an error and a few warnings out of checkpatch from this, but I think the error is false (thinking the reference to a commit in the description is this commit, a fixes commit or something like that) and I'm more concerned with the fix than the warnings, so...

Tested-by: Aaron Brown <[email protected]>

Here is the checkpatch output in case anyone has a different opinion on the severity:
-------------
u1484:[0]/usr/src/kernels/next-queue> git format-patch d81d1e6 -1 --stdout|./scripts/checkpatch.pl -
ERROR: Please use git commit description style 'commit <12+ chars of sha1> ("<title line>")' - ie: 'commit 0a8047ac68e5 ("e1000e: Fix msi-x interrupt automask")'
#20:
0a8047ac68e5 ("e1000e: Fix msi-x interrupt automask") which cleared IAME

WARNING: braces {} are not necessary for single statement blocks
#73: FILE: drivers/net/ethernet/intel/e1000e/netdev.c:1931:
+ if (!test_bit(__E1000_DOWN, &adapter->state)) {
+ mod_timer(&adapter->watchdog_timer, jiffies + 1);
+ }

WARNING: braces {} are not necessary for single statement blocks
#83: FILE: drivers/net/ethernet/intel/e1000e/netdev.c:1936:
+ if (enable && !test_bit(__E1000_DOWN, &adapter->state)) {
ew32(IMS, E1000_IMS_OTHER);
}

total: 1 errors, 2 warnings, 0 checks, 59 lines checked

NOTE: For some of the reported defects, checkpatch may be able to
mechanically convert to the typical style using --fix or --fix-inplace.

Your patch has style problems, please review.

NOTE: If any of the errors are false positives, please report
them to the maintainer, see CHECKPATCH in MAINTAINERS.
u1484:[0]/usr/src/kernels/next-queue>

2017-09-19 00:13:45

by Brown, Aaron F

[permalink] [raw]
Subject: RE: [Intel-wired-lan] [PATCH 2/5] e1000e: Fix wrong comment related to link detection

> From: Intel-wired-lan [mailto:[email protected]] On Behalf
> Of Benjamin Poirier
> Sent: Friday, July 21, 2017 11:36 AM
> To: Kirsher, Jeffrey T <[email protected]>
> Cc: [email protected]; [email protected]; linux-
> [email protected]; Lennart Sorensen <[email protected]>
> Subject: [Intel-wired-lan] [PATCH 2/5] e1000e: Fix wrong comment related to
> link detection
>
> Reading e1000e_check_for_copper_link() shows that get_link_status is set to
> false after link has been detected. Therefore, it stays TRUE until then.
>
> Signed-off-by: Benjamin Poirier <[email protected]>
> ---
> drivers/net/ethernet/intel/e1000e/netdev.c | 4 ++--
> 1 file changed, 2 insertions(+), 2 deletions(-)
>

Tested-by: Aaron Brown <[email protected]>

2017-09-19 18:38:12

by Philip Prindeville

[permalink] [raw]
Subject: Re: [5/5] e1000e: Avoid receiver overrun interrupt bursts

Hi.

We’ve been running this patchset (all 5) for about as long as they’ve been under review… about 2 months. And in a burn-in lab with heavy traffic.

We’ve not seen a single link-flap in hundreds of ours of saturated traffic.

Would love to see some resolution soon on this as we don’t want to ship a release with unsanctioned patches.

Is there an estimate on when that might be?

Thanks,

-Philip



> On Jul 21, 2017, at 12:36 PM, Benjamin Poirier <[email protected]> wrote:
>
> When e1000e_poll() is not fast enough to keep up with incoming traffic, the
> adapter (when operating in msix mode) raises the Other interrupt to signal
> Receiver Overrun.
>
> This is a double problem because 1) at the moment e1000_msix_other()
> assumes that it is only called in case of Link Status Change and 2) if the
> condition persists, the interrupt is repeatedly raised again in quick
> succession.
>
> Ideally we would configure the Other interrupt to not be raised in case of
> receiver overrun but this doesn't seem possible on this adapter. Instead,
> we handle the first part of the problem by reverting to the practice of
> reading ICR in the other interrupt handler, like before commit 16ecba59bc33
> ("e1000e: Do not read ICR in Other interrupt"). Thanks to commit
> 0a8047ac68e5 ("e1000e: Fix msi-x interrupt automask") which cleared IAME
> from CTRL_EXT, reading ICR doesn't interfere with RxQ0, TxQ0 interrupts
> anymore. We handle the second part of the problem by not re-enabling the
> Other interrupt right away when there is overrun. Instead, we wait until
> traffic subsides, napi polling mode is exited and interrupts are
> re-enabled.
>
> Reported-by: Lennart Sorensen <[email protected]>
> Fixes: 16ecba59bc33 ("e1000e: Do not read ICR in Other interrupt")
> Signed-off-by: Benjamin Poirier <[email protected]>
> Tested-by: Aaron Brown <[email protected]>
> ---
> drivers/net/ethernet/intel/e1000e/defines.h | 1 +
> drivers/net/ethernet/intel/e1000e/netdev.c | 33 +++++++++++++++++++++++------
> 2 files changed, 27 insertions(+), 7 deletions(-)
>
> diff --git a/drivers/net/ethernet/intel/e1000e/defines.h b/drivers/net/ethernet/intel/e1000e/defines.h
> index 0641c0098738..afb7ebe20b24 100644
> --- a/drivers/net/ethernet/intel/e1000e/defines.h
> +++ b/drivers/net/ethernet/intel/e1000e/defines.h
> @@ -398,6 +398,7 @@
> #define E1000_ICR_LSC 0x00000004 /* Link Status Change */
> #define E1000_ICR_RXSEQ 0x00000008 /* Rx sequence error */
> #define E1000_ICR_RXDMT0 0x00000010 /* Rx desc min. threshold (0) */
> +#define E1000_ICR_RXO 0x00000040 /* Receiver Overrun */
> #define E1000_ICR_RXT0 0x00000080 /* Rx timer intr (ring 0) */
> #define E1000_ICR_ECCER 0x00400000 /* Uncorrectable ECC Error */
> /* If this bit asserted, the driver should claim the interrupt */
> diff --git a/drivers/net/ethernet/intel/e1000e/netdev.c b/drivers/net/ethernet/intel/e1000e/netdev.c
> index 5a8ab1136566..803edd1a6401 100644
> --- a/drivers/net/ethernet/intel/e1000e/netdev.c
> +++ b/drivers/net/ethernet/intel/e1000e/netdev.c
> @@ -1910,12 +1910,30 @@ static irqreturn_t e1000_msix_other(int __always_unused irq, void *data)
> struct net_device *netdev = data;
> struct e1000_adapter *adapter = netdev_priv(netdev);
> struct e1000_hw *hw = &adapter->hw;
> + u32 icr;
> + bool enable = true;
> +
> + icr = er32(ICR);
> + if (icr & E1000_ICR_RXO) {
> + ew32(ICR, E1000_ICR_RXO);
> + enable = false;
> + /* napi poll will re-enable Other, make sure it runs */
> + if (napi_schedule_prep(&adapter->napi)) {
> + adapter->total_rx_bytes = 0;
> + adapter->total_rx_packets = 0;
> + __napi_schedule(&adapter->napi);
> + }
> + }
> + if (icr & E1000_ICR_LSC) {
> + ew32(ICR, E1000_ICR_LSC);
> + hw->mac.get_link_status = true;
> + /* guard against interrupt when we're going down */
> + if (!test_bit(__E1000_DOWN, &adapter->state)) {
> + mod_timer(&adapter->watchdog_timer, jiffies + 1);
> + }
> + }
>
> - hw->mac.get_link_status = true;
> -
> - /* guard against interrupt when we're going down */
> - if (!test_bit(__E1000_DOWN, &adapter->state)) {
> - mod_timer(&adapter->watchdog_timer, jiffies + 1);
> + if (enable && !test_bit(__E1000_DOWN, &adapter->state)) {
> ew32(IMS, E1000_IMS_OTHER);
> }
>
> @@ -2687,7 +2705,8 @@ static int e1000e_poll(struct napi_struct *napi, int weight)
> napi_complete_done(napi, work_done);
> if (!test_bit(__E1000_DOWN, &adapter->state)) {
> if (adapter->msix_entries)
> - ew32(IMS, adapter->rx_ring->ims_val);
> + ew32(IMS, adapter->rx_ring->ims_val |
> + E1000_IMS_OTHER);
> else
> e1000_irq_enable(adapter);
> }
> @@ -4204,7 +4223,7 @@ static void e1000e_trigger_lsc(struct e1000_adapter *adapter)
> struct e1000_hw *hw = &adapter->hw;
>
> if (adapter->msix_entries)
> - ew32(ICS, E1000_ICS_OTHER);
> + ew32(ICS, E1000_ICS_LSC | E1000_ICS_OTHER);
> else
> ew32(ICS, E1000_ICS_LSC);
> }

2017-09-19 19:41:07

by Benjamin Poirier

[permalink] [raw]
Subject: Re: [5/5] e1000e: Avoid receiver overrun interrupt bursts

On 2017/09/19 12:38, Philip Prindeville wrote:
> Hi.
>
> We’ve been running this patchset (all 5) for about as long as they’ve been under review… about 2 months. And in a burn-in lab with heavy traffic.
>
> We’ve not seen a single link-flap in hundreds of ours of saturated traffic.
>
> Would love to see some resolution soon on this as we don’t want to ship a release with unsanctioned patches.
>
> Is there an estimate on when that might be?

The patches have been added to Jeff Kirsher's next-queue tree. I guess
they will be submitted for v4.15 which might be released in early
2018...
http://phb-crystal-ball.org/