LinuxLists.cc - [PATCH 0/3] e1000e: Revert interrupt handling changes

2018-01-26 09:16:24

by Benjamin Poirier

[permalink] [raw]

Subject: [PATCH 0/3] e1000e: Revert interrupt handling changes

As discussed in the thread "[RFC PATCH] e1000e: Remove Other from EIAC.",
https://www.spinics.net/lists/netdev/msg479311.html

The following list of commits was considered:
4d432f67ff00 e1000e: Remove unreachable code (v4.5-rc1)
16ecba59bc33 e1000e: Do not read ICR in Other interrupt (v4.5-rc1)
a61cfe4ffad7 e1000e: Do not write lsc to ics in msi-x mode (v4.5-rc1)
0a8047ac68e5 e1000e: Fix msi-x interrupt automask (v4.5-rc1)
19110cfbb34d e1000e: Separate signaling for link check/link up (v4.15-rc1)
4aea7a5c5e94 e1000e: Avoid receiver overrun interrupt bursts (v4.15-rc1)
4110e02eb45e e1000e: Fix e1000_check_for_copper_link_ich8lan return value. (v4.15-rc8)

There have a been a slew of regressions due to unforeseen consequences
(receive overflow triggers Other, vmware's emulated e1000e) and programming
mistakes (4110e02eb45e). Since the e1000e driver is supposed to be in
maintenance mode, this patch series revisits the above changes to prune
them down.

After this series, the remaining differences related to how interrupts were
handled at commit 4d432f67ff00 ("e1000e: Remove unreachable code",
v4.5-rc1) are:
* the changes in commit 0a8047ac68e5 ("e1000e: Fix msi-x interrupt
automask", v4.5-rc1) are preserved.
* we manually clear Other from icr in e1000_msix_other().

We try to go back to a long lost time when things were simple and drivers
ran smoothly.

----------------------------------------------------------------
Benjamin Poirier (3):
Partial revert "e1000e: Avoid receiver overrun interrupt bursts"
Revert "e1000e: Separate signaling for link check/link up"
Revert "e1000e: Do not read ICR in Other interrupt"

drivers/net/ethernet/intel/e1000e/defines.h | 1 -
drivers/net/ethernet/intel/e1000e/ich8lan.c | 11 ++------
drivers/net/ethernet/intel/e1000e/mac.c | 11 ++------
drivers/net/ethernet/intel/e1000e/netdev.c | 44 ++++++++++++++---------------
4 files changed, 27 insertions(+), 40 deletions(-)

--
2.15.1

2018-01-26 09:13:53

by Benjamin Poirier

[permalink] [raw]

Subject: [PATCH 3/3] Revert "e1000e: Do not read ICR in Other interrupt"

This reverts commit 16ecba59bc333d6282ee057fb02339f77a880beb.

It was reported that emulated e1000e devices in vmware esxi 6.5 Build
7526125 do not link up after commit 4aea7a5c5e94 ("e1000e: Avoid receiver
overrun interrupt bursts"). Some tracing shows that after
e1000e_trigger_lsc() is called, ICR reads out as 0x0 in e1000_msix_other()
on emulated e1000e devices. In comparison, on real e1000e 82574 hardware,
icr=0x80000004 (_INT_ASSERTED | _LSC) in the same situation.

Some experimentation showed that this flaw in vmware e1000e emulation can
be worked around by not setting Other in EIAC. This is how it was before
commit 16ecba59bc33 ("e1000e: Do not read ICR in Other interrupt").

Since the ICR read in the Other interrupt handler has already been
restored, this patch effectively reverts the remainder of commit
16ecba59bc33 ("e1000e: Do not read ICR in Other interrupt").

Fixes: 4aea7a5c5e94 ("e1000e: Avoid receiver overrun interrupt bursts")
Signed-off-by: Benjamin Poirier <[email protected]>
---
drivers/net/ethernet/intel/e1000e/netdev.c | 10 ++++++++--
1 file changed, 8 insertions(+), 2 deletions(-)

diff --git a/drivers/net/ethernet/intel/e1000e/netdev.c b/drivers/net/ethernet/intel/e1000e/netdev.c
index ed103b9a8d3a..fffc1f0e3895 100644
--- a/drivers/net/ethernet/intel/e1000e/netdev.c
+++ b/drivers/net/ethernet/intel/e1000e/netdev.c
@@ -1916,6 +1916,13 @@ static irqreturn_t e1000_msix_other(int __always_unused irq, void *data)
struct e1000_hw *hw = &adapter->hw;
u32 icr = er32(ICR);

+ /* Certain events (such as RXO) which trigger Other do not set
+ * INT_ASSERTED. In that case, read to clear of icr does not take
+ * place.
+ */
+ if (!(icr & E1000_ICR_INT_ASSERTED))
+ ew32(ICR, E1000_ICR_OTHER);
+
if (icr & adapter->eiac_mask)
ew32(ICS, (icr & adapter->eiac_mask));

@@ -2033,7 +2040,6 @@ static void e1000_configure_msix(struct e1000_adapter *adapter)
hw->hw_addr + E1000_EITR_82574(vector));
else
writel(1, hw->hw_addr + E1000_EITR_82574(vector));
- adapter->eiac_mask |= E1000_IMS_OTHER;

/* Cause Tx interrupts on every write back */
ivar |= BIT(31);
@@ -2258,7 +2264,7 @@ static void e1000_irq_enable(struct e1000_adapter *adapter)

if (adapter->msix_entries) {
ew32(EIAC_82574, adapter->eiac_mask & E1000_EIAC_MASK_82574);
- ew32(IMS, adapter->eiac_mask | E1000_IMS_LSC);
+ ew32(IMS, adapter->eiac_mask | E1000_IMS_OTHER | E1000_IMS_LSC);
} else if (hw->mac.type >= e1000_pch_lpt) {
ew32(IMS, IMS_ENABLE_MASK | E1000_IMS_ECCER);
} else {
--
2.15.1

2018-01-26 09:14:26

by Benjamin Poirier

[permalink] [raw]

Subject: [PATCH 1/3] Partial revert "e1000e: Avoid receiver overrun interrupt bursts"

This reverts commit 4aea7a5c5e940c1723add439f4088844cd26196d.

We keep the fix for the first part of the problem (1) described in the log
of that commit however we use the implementation of e1000_msix_other() from
before commit 16ecba59bc33 ("e1000e: Do not read ICR in Other interrupt").
We remove the fix for the second part of the problem (2).

Bursts of "Other" interrupts may once again occur during rxo (receive
overflow) traffic conditions. This is deemed acceptable in the interest of
reverting driver code back to its previous state. The e1000e driver should
be in "maintenance" mode and we want to avoid unforeseen fallout from
changes that are not strictly necessary.

Link: https://www.spinics.net/lists/netdev/msg480675.html
Signed-off-by: Benjamin Poirier <[email protected]>
---
drivers/net/ethernet/intel/e1000e/defines.h | 1 -
drivers/net/ethernet/intel/e1000e/netdev.c | 32 +++++++++++------------------
2 files changed, 12 insertions(+), 21 deletions(-)

diff --git a/drivers/net/ethernet/intel/e1000e/defines.h b/drivers/net/ethernet/intel/e1000e/defines.h
index afb7ebe20b24..0641c0098738 100644
--- a/drivers/net/ethernet/intel/e1000e/defines.h
+++ b/drivers/net/ethernet/intel/e1000e/defines.h
@@ -398,7 +398,6 @@
#define E1000_ICR_LSC 0x00000004 /* Link Status Change */
#define E1000_ICR_RXSEQ 0x00000008 /* Rx sequence error */
#define E1000_ICR_RXDMT0 0x00000010 /* Rx desc min. threshold (0) */
-#define E1000_ICR_RXO 0x00000040 /* Receiver Overrun */
#define E1000_ICR_RXT0 0x00000080 /* Rx timer intr (ring 0) */
#define E1000_ICR_ECCER 0x00400000 /* Uncorrectable ECC Error */
/* If this bit asserted, the driver should claim the interrupt */
diff --git a/drivers/net/ethernet/intel/e1000e/netdev.c b/drivers/net/ethernet/intel/e1000e/netdev.c
index 9f18d39bdc8f..398e940436f8 100644
--- a/drivers/net/ethernet/intel/e1000e/netdev.c
+++ b/drivers/net/ethernet/intel/e1000e/netdev.c
@@ -1914,30 +1914,23 @@ static irqreturn_t e1000_msix_other(int __always_unused irq, void *data)
struct net_device *netdev = data;
struct e1000_adapter *adapter = netdev_priv(netdev);
struct e1000_hw *hw = &adapter->hw;
- u32 icr;
- bool enable = true;
-
- icr = er32(ICR);
- if (icr & E1000_ICR_RXO) {
- ew32(ICR, E1000_ICR_RXO);
- enable = false;
- /* napi poll will re-enable Other, make sure it runs */
- if (napi_schedule_prep(&adapter->napi)) {
- adapter->total_rx_bytes = 0;
- adapter->total_rx_packets = 0;
- __napi_schedule(&adapter->napi);
- }
- }
- if (icr & E1000_ICR_LSC) {
- ew32(ICR, E1000_ICR_LSC);
+ u32 icr = er32(ICR);
+
+ if (icr & adapter->eiac_mask)
+ ew32(ICS, (icr & adapter->eiac_mask));
+
+ if (icr & E1000_ICR_OTHER) {
+ if (!(icr & E1000_ICR_LSC))
+ goto no_link_interrupt;
hw->mac.get_link_status = true;
/* guard against interrupt when we're going down */
if (!test_bit(__E1000_DOWN, &adapter->state))
mod_timer(&adapter->watchdog_timer, jiffies + 1);
}

- if (enable && !test_bit(__E1000_DOWN, &adapter->state))
- ew32(IMS, E1000_IMS_OTHER);
+no_link_interrupt:
+ if (!test_bit(__E1000_DOWN, &adapter->state))
+ ew32(IMS, E1000_IMS_LSC | E1000_IMS_OTHER);

return IRQ_HANDLED;
}
@@ -2707,8 +2700,7 @@ static int e1000e_poll(struct napi_struct *napi, int weight)
napi_complete_done(napi, work_done);
if (!test_bit(__E1000_DOWN, &adapter->state)) {
if (adapter->msix_entries)
- ew32(IMS, adapter->rx_ring->ims_val |
- E1000_IMS_OTHER);
+ ew32(IMS, adapter->rx_ring->ims_val);
else
e1000_irq_enable(adapter);
}
--
2.15.1

2018-01-26 09:14:44

by Benjamin Poirier

[permalink] [raw]

Subject: [PATCH 2/3] Revert "e1000e: Separate signaling for link check/link up"

This reverts commit 19110cfbb34d4af0cdfe14cd243f3b09dc95b013.
This reverts commit 4110e02eb45ea447ec6f5459c9934de0a273fb91.

... because they cause an extra 2s delay for the link to come up when
autoneg is off.

After reverting, the race condition described in the log of commit
19110cfbb34d ("e1000e: Separate signaling for link check/link up") is
reintroduced. It may still be triggered by LSC events but this should not
result in link flap. It may no longer be triggered by RXO events because
commit 4aea7a5c5e94 ("e1000e: Avoid receiver overrun interrupt bursts")
restored reading icr in the Other handler.

As discussed, the driver should be in "maintenance mode". In the interest
of stability, revert to the original code as much as possible instead of a
half-baked solution.

Link: https://www.spinics.net/lists/netdev/msg479923.html
Signed-off-by: Benjamin Poirier <[email protected]>
---
drivers/net/ethernet/intel/e1000e/ich8lan.c | 11 +++--------
drivers/net/ethernet/intel/e1000e/mac.c | 11 +++--------
drivers/net/ethernet/intel/e1000e/netdev.c | 2 +-
3 files changed, 7 insertions(+), 17 deletions(-)

diff --git a/drivers/net/ethernet/intel/e1000e/ich8lan.c b/drivers/net/ethernet/intel/e1000e/ich8lan.c
index 31277d3bb7dc..d6d4ed7acf03 100644
--- a/drivers/net/ethernet/intel/e1000e/ich8lan.c
+++ b/drivers/net/ethernet/intel/e1000e/ich8lan.c
@@ -1367,9 +1367,6 @@ static s32 e1000_disable_ulp_lpt_lp(struct e1000_hw *hw, bool force)
* Checks to see of the link status of the hardware has changed. If a
* change in link status has been detected, then we read the PHY registers
* to get the current speed/duplex if link exists.
- *
- * Returns a negative error code (-E1000_ERR_*) or 0 (link down) or 1 (link
- * up).
**/
static s32 e1000_check_for_copper_link_ich8lan(struct e1000_hw *hw)
{
@@ -1385,7 +1382,7 @@ static s32 e1000_check_for_copper_link_ich8lan(struct e1000_hw *hw)
* Change or Rx Sequence Error interrupt.
*/
if (!mac->get_link_status)
- return 1;
+ return 0;

/* First we want to see if the MII Status Register reports
* link. If so, then we want to get the current speed/duplex
@@ -1616,12 +1613,10 @@ static s32 e1000_check_for_copper_link_ich8lan(struct e1000_hw *hw)
* different link partner.
*/
ret_val = e1000e_config_fc_after_link_up(hw);
- if (ret_val) {
+ if (ret_val)
e_dbg("Error configuring flow control\n");
- return ret_val;
- }

- return 1;
+ return ret_val;
}

static s32 e1000_get_variants_ich8lan(struct e1000_adapter *adapter)
diff --git a/drivers/net/ethernet/intel/e1000e/mac.c b/drivers/net/ethernet/intel/e1000e/mac.c
index f457c5703d0c..b322011ec282 100644
--- a/drivers/net/ethernet/intel/e1000e/mac.c
+++ b/drivers/net/ethernet/intel/e1000e/mac.c
@@ -410,9 +410,6 @@ void e1000e_clear_hw_cntrs_base(struct e1000_hw *hw)
* Checks to see of the link status of the hardware has changed. If a
* change in link status has been detected, then we read the PHY registers
* to get the current speed/duplex if link exists.
- *
- * Returns a negative error code (-E1000_ERR_*) or 0 (link down) or 1 (link
- * up).
**/
s32 e1000e_check_for_copper_link(struct e1000_hw *hw)
{
@@ -426,7 +423,7 @@ s32 e1000e_check_for_copper_link(struct e1000_hw *hw)
* Change or Rx Sequence Error interrupt.
*/
if (!mac->get_link_status)
- return 1;
+ return 0;

/* First we want to see if the MII Status Register reports
* link. If so, then we want to get the current speed/duplex
@@ -464,12 +461,10 @@ s32 e1000e_check_for_copper_link(struct e1000_hw *hw)
* different link partner.
*/
ret_val = e1000e_config_fc_after_link_up(hw);
- if (ret_val) {
+ if (ret_val)
e_dbg("Error configuring flow control\n");
- return ret_val;
- }

- return 1;
+ return ret_val;
}

/**
diff --git a/drivers/net/ethernet/intel/e1000e/netdev.c b/drivers/net/ethernet/intel/e1000e/netdev.c
index 398e940436f8..ed103b9a8d3a 100644
--- a/drivers/net/ethernet/intel/e1000e/netdev.c
+++ b/drivers/net/ethernet/intel/e1000e/netdev.c
@@ -5091,7 +5091,7 @@ static bool e1000e_has_link(struct e1000_adapter *adapter)
case e1000_media_type_copper:
if (hw->mac.get_link_status) {
ret_val = hw->mac.ops.check_for_link(hw);
- link_active = ret_val > 0;
+ link_active = !hw->mac.get_link_status;
} else {
link_active = true;
}
--
2.15.1

2018-01-26 16:51:32

by Alexander Duyck

[permalink] [raw]

Subject: Re: [PATCH 1/3] Partial revert "e1000e: Avoid receiver overrun interrupt bursts"

On Fri, Jan 26, 2018 at 1:12 AM, Benjamin Poirier <[email protected]> wrote:
> This reverts commit 4aea7a5c5e940c1723add439f4088844cd26196d.
>
> We keep the fix for the first part of the problem (1) described in the log
> of that commit however we use the implementation of e1000_msix_other() from
> before commit 16ecba59bc33 ("e1000e: Do not read ICR in Other interrupt").
> We remove the fix for the second part of the problem (2).
>
> Bursts of "Other" interrupts may once again occur during rxo (receive
> overflow) traffic conditions. This is deemed acceptable in the interest of
> reverting driver code back to its previous state. The e1000e driver should
> be in "maintenance" mode and we want to avoid unforeseen fallout from
> changes that are not strictly necessary.
>
> Link: https://www.spinics.net/lists/netdev/msg480675.html
> Signed-off-by: Benjamin Poirier <[email protected]>

Thanks for doing this.

Only a few minor tweaks I would recommend. Otherwise it looks good.

> ---
> drivers/net/ethernet/intel/e1000e/defines.h | 1 -
> drivers/net/ethernet/intel/e1000e/netdev.c | 32 +++++++++++------------------
> 2 files changed, 12 insertions(+), 21 deletions(-)
>
> diff --git a/drivers/net/ethernet/intel/e1000e/defines.h b/drivers/net/ethernet/intel/e1000e/defines.h
> index afb7ebe20b24..0641c0098738 100644
> --- a/drivers/net/ethernet/intel/e1000e/defines.h
> +++ b/drivers/net/ethernet/intel/e1000e/defines.h
> @@ -398,7 +398,6 @@
> #define E1000_ICR_LSC 0x00000004 /* Link Status Change */
> #define E1000_ICR_RXSEQ 0x00000008 /* Rx sequence error */
> #define E1000_ICR_RXDMT0 0x00000010 /* Rx desc min. threshold (0) */
> -#define E1000_ICR_RXO 0x00000040 /* Receiver Overrun */
> #define E1000_ICR_RXT0 0x00000080 /* Rx timer intr (ring 0) */
> #define E1000_ICR_ECCER 0x00400000 /* Uncorrectable ECC Error */
> /* If this bit asserted, the driver should claim the interrupt */
> diff --git a/drivers/net/ethernet/intel/e1000e/netdev.c b/drivers/net/ethernet/intel/e1000e/netdev.c
> index 9f18d39bdc8f..398e940436f8 100644
> --- a/drivers/net/ethernet/intel/e1000e/netdev.c
> +++ b/drivers/net/ethernet/intel/e1000e/netdev.c
> @@ -1914,30 +1914,23 @@ static irqreturn_t e1000_msix_other(int __always_unused irq, void *data)
> struct net_device *netdev = data;
> struct e1000_adapter *adapter = netdev_priv(netdev);
> struct e1000_hw *hw = &adapter->hw;
> - u32 icr;
> - bool enable = true;
> -
> - icr = er32(ICR);
> - if (icr & E1000_ICR_RXO) {
> - ew32(ICR, E1000_ICR_RXO);
> - enable = false;
> - /* napi poll will re-enable Other, make sure it runs */
> - if (napi_schedule_prep(&adapter->napi)) {
> - adapter->total_rx_bytes = 0;
> - adapter->total_rx_packets = 0;
> - __napi_schedule(&adapter->napi);
> - }
> - }
> - if (icr & E1000_ICR_LSC) {
> - ew32(ICR, E1000_ICR_LSC);
> + u32 icr = er32(ICR);
> +
> + if (icr & adapter->eiac_mask)
> + ew32(ICS, (icr & adapter->eiac_mask));

I'm not sure about this bit as it includes queue interrupts if I am
not mistaken. What we should be focusing on are the bits that are not
auto-cleared so this should probably be icr & ~adapter->eiac_mask.
Actually what you might do is something like:
icr &= ~adapter->eiac_mask;
if (icr)
ew32(ICS, icr);

Also a comment explaining that we have to clear the bits that are not
automatically cleared by other interrupt causes might be useful.

> + if (icr & E1000_ICR_OTHER) {
> + if (!(icr & E1000_ICR_LSC))
> + goto no_link_interrupt;

I wouldn't bother with the jump tag. Let's just make this one if
statement checking both OTHER && LSC.

> hw->mac.get_link_status = true;
> /* guard against interrupt when we're going down */
> if (!test_bit(__E1000_DOWN, &adapter->state))
> mod_timer(&adapter->watchdog_timer, jiffies + 1);
> }
>
> - if (enable && !test_bit(__E1000_DOWN, &adapter->state))
> - ew32(IMS, E1000_IMS_OTHER);
> +no_link_interrupt:

If you just combine the if statement logic the code naturally flows to
this point anyway without the jump label being needed.

> + if (!test_bit(__E1000_DOWN, &adapter->state))
> + ew32(IMS, E1000_IMS_LSC | E1000_IMS_OTHER);
>
> return IRQ_HANDLED;
> }
> @@ -2707,8 +2700,7 @@ static int e1000e_poll(struct napi_struct *napi, int weight)
> napi_complete_done(napi, work_done);
> if (!test_bit(__E1000_DOWN, &adapter->state)) {
> if (adapter->msix_entries)
> - ew32(IMS, adapter->rx_ring->ims_val |
> - E1000_IMS_OTHER);
> + ew32(IMS, adapter->rx_ring->ims_val);
> else
> e1000_irq_enable(adapter);
> }
> --
> 2.15.1
>

2018-01-26 17:04:15

by Alexander Duyck

[permalink] [raw]

Subject: Re: [PATCH 2/3] Revert "e1000e: Separate signaling for link check/link up"

On Fri, Jan 26, 2018 at 1:12 AM, Benjamin Poirier <[email protected]> wrote:
> This reverts commit 19110cfbb34d4af0cdfe14cd243f3b09dc95b013.
> This reverts commit 4110e02eb45ea447ec6f5459c9934de0a273fb91.
>
> ... because they cause an extra 2s delay for the link to come up when
> autoneg is off.
>
> After reverting, the race condition described in the log of commit
> 19110cfbb34d ("e1000e: Separate signaling for link check/link up") is
> reintroduced. It may still be triggered by LSC events but this should not
> result in link flap. It may no longer be triggered by RXO events because
> commit 4aea7a5c5e94 ("e1000e: Avoid receiver overrun interrupt bursts")
> restored reading icr in the Other handler.

With the RXO events removed the only cause for us to transition the
bit should be LSC. I'm not sure if the race condition in that state is
a valid concern or not as the LSC should only get triggered if the
link state toggled, even briefly.

The bigger concern I would have would be the opposite of the original
race that was pointed out:
\ e1000_watchdog_task
\ e1000e_has_link
\ hw->mac.ops.check_for_link() === e1000e_check_for_copper_link
/* link is up */
mac->get_link_status = false;

/* interrupt */
\ e1000_msix_other
hw->mac.get_link_status = true;

link_active = !hw->mac.get_link_status
/* link_active is false, wrongly */

So the question I would have is what if we see the LSC for a link down
just after the check_for_copper_link call completes? It may not be
anything seen in the real world since I don't know if we have any link
flapping issues on e1000e or not without this patch. It is something
to keep in mind for the future though.

> As discussed, the driver should be in "maintenance mode". In the interest
> of stability, revert to the original code as much as possible instead of a
> half-baked solution.

If nothing else we may want to do a follow-up on this patch as we
probably shouldn't be returning the error values to trigger link up.
There are definitely issues to be found here. If nothing else we may
want to explore just returning 1 if auto-neg is disabled instead of
returning an error code.

> Link: https://www.spinics.net/lists/netdev/msg479923.html
> Signed-off-by: Benjamin Poirier <[email protected]>
> ---
> drivers/net/ethernet/intel/e1000e/ich8lan.c | 11 +++--------
> drivers/net/ethernet/intel/e1000e/mac.c | 11 +++--------
> drivers/net/ethernet/intel/e1000e/netdev.c | 2 +-
> 3 files changed, 7 insertions(+), 17 deletions(-)
>
> diff --git a/drivers/net/ethernet/intel/e1000e/ich8lan.c b/drivers/net/ethernet/intel/e1000e/ich8lan.c
> index 31277d3bb7dc..d6d4ed7acf03 100644
> --- a/drivers/net/ethernet/intel/e1000e/ich8lan.c
> +++ b/drivers/net/ethernet/intel/e1000e/ich8lan.c
> @@ -1367,9 +1367,6 @@ static s32 e1000_disable_ulp_lpt_lp(struct e1000_hw *hw, bool force)
> * Checks to see of the link status of the hardware has changed. If a
> * change in link status has been detected, then we read the PHY registers
> * to get the current speed/duplex if link exists.
> - *
> - * Returns a negative error code (-E1000_ERR_*) or 0 (link down) or 1 (link
> - * up).
> **/
> static s32 e1000_check_for_copper_link_ich8lan(struct e1000_hw *hw)
> {
> @@ -1385,7 +1382,7 @@ static s32 e1000_check_for_copper_link_ich8lan(struct e1000_hw *hw)
> * Change or Rx Sequence Error interrupt.
> */
> if (!mac->get_link_status)
> - return 1;
> + return 0;
>
> /* First we want to see if the MII Status Register reports
> * link. If so, then we want to get the current speed/duplex
> @@ -1616,12 +1613,10 @@ static s32 e1000_check_for_copper_link_ich8lan(struct e1000_hw *hw)
> * different link partner.
> */
> ret_val = e1000e_config_fc_after_link_up(hw);
> - if (ret_val) {
> + if (ret_val)
> e_dbg("Error configuring flow control\n");
> - return ret_val;
> - }
>
> - return 1;
> + return ret_val;
> }
>
> static s32 e1000_get_variants_ich8lan(struct e1000_adapter *adapter)
> diff --git a/drivers/net/ethernet/intel/e1000e/mac.c b/drivers/net/ethernet/intel/e1000e/mac.c
> index f457c5703d0c..b322011ec282 100644
> --- a/drivers/net/ethernet/intel/e1000e/mac.c
> +++ b/drivers/net/ethernet/intel/e1000e/mac.c
> @@ -410,9 +410,6 @@ void e1000e_clear_hw_cntrs_base(struct e1000_hw *hw)
> * Checks to see of the link status of the hardware has changed. If a
> * change in link status has been detected, then we read the PHY registers
> * to get the current speed/duplex if link exists.
> - *
> - * Returns a negative error code (-E1000_ERR_*) or 0 (link down) or 1 (link
> - * up).
> **/
> s32 e1000e_check_for_copper_link(struct e1000_hw *hw)
> {
> @@ -426,7 +423,7 @@ s32 e1000e_check_for_copper_link(struct e1000_hw *hw)
> * Change or Rx Sequence Error interrupt.
> */
> if (!mac->get_link_status)
> - return 1;
> + return 0;
>
> /* First we want to see if the MII Status Register reports
> * link. If so, then we want to get the current speed/duplex
> @@ -464,12 +461,10 @@ s32 e1000e_check_for_copper_link(struct e1000_hw *hw)
> * different link partner.
> */
> ret_val = e1000e_config_fc_after_link_up(hw);
> - if (ret_val) {
> + if (ret_val)
> e_dbg("Error configuring flow control\n");
> - return ret_val;
> - }
>
> - return 1;
> + return ret_val;
> }
>
> /**
> diff --git a/drivers/net/ethernet/intel/e1000e/netdev.c b/drivers/net/ethernet/intel/e1000e/netdev.c
> index 398e940436f8..ed103b9a8d3a 100644
> --- a/drivers/net/ethernet/intel/e1000e/netdev.c
> +++ b/drivers/net/ethernet/intel/e1000e/netdev.c
> @@ -5091,7 +5091,7 @@ static bool e1000e_has_link(struct e1000_adapter *adapter)
> case e1000_media_type_copper:
> if (hw->mac.get_link_status) {
> ret_val = hw->mac.ops.check_for_link(hw);
> - link_active = ret_val > 0;
> + link_active = !hw->mac.get_link_status;
> } else {
> link_active = true;
> }
> --
> 2.15.1
>

2018-01-26 21:02:25

by Alexander Duyck

[permalink] [raw]

Subject: Re: [PATCH 3/3] Revert "e1000e: Do not read ICR in Other interrupt"

On Fri, Jan 26, 2018 at 1:12 AM, Benjamin Poirier <[email protected]> wrote:
> This reverts commit 16ecba59bc333d6282ee057fb02339f77a880beb.
>
> It was reported that emulated e1000e devices in vmware esxi 6.5 Build
> 7526125 do not link up after commit 4aea7a5c5e94 ("e1000e: Avoid receiver
> overrun interrupt bursts"). Some tracing shows that after
> e1000e_trigger_lsc() is called, ICR reads out as 0x0 in e1000_msix_other()
> on emulated e1000e devices. In comparison, on real e1000e 82574 hardware,
> icr=0x80000004 (_INT_ASSERTED | _LSC) in the same situation.
>
> Some experimentation showed that this flaw in vmware e1000e emulation can
> be worked around by not setting Other in EIAC. This is how it was before
> commit 16ecba59bc33 ("e1000e: Do not read ICR in Other interrupt").
>
> Since the ICR read in the Other interrupt handler has already been
> restored, this patch effectively reverts the remainder of commit
> 16ecba59bc33 ("e1000e: Do not read ICR in Other interrupt").
>
> Fixes: 4aea7a5c5e94 ("e1000e: Avoid receiver overrun interrupt bursts")
> Signed-off-by: Benjamin Poirier <[email protected]>
> ---
> drivers/net/ethernet/intel/e1000e/netdev.c | 10 ++++++++--
> 1 file changed, 8 insertions(+), 2 deletions(-)
>
> diff --git a/drivers/net/ethernet/intel/e1000e/netdev.c b/drivers/net/ethernet/intel/e1000e/netdev.c
> index ed103b9a8d3a..fffc1f0e3895 100644
> --- a/drivers/net/ethernet/intel/e1000e/netdev.c
> +++ b/drivers/net/ethernet/intel/e1000e/netdev.c
> @@ -1916,6 +1916,13 @@ static irqreturn_t e1000_msix_other(int __always_unused irq, void *data)
> struct e1000_hw *hw = &adapter->hw;
> u32 icr = er32(ICR);
>
> + /* Certain events (such as RXO) which trigger Other do not set
> + * INT_ASSERTED. In that case, read to clear of icr does not take
> + * place.
> + */
> + if (!(icr & E1000_ICR_INT_ASSERTED))
> + ew32(ICR, E1000_ICR_OTHER);
> +

This piece doesn't make sense to me. Why are we clearing OTHER if
ICR_INT_ASSERTED is not set? The original code that was removed was in
commit 4d432f67ff00 "e1000e: Remove unreachable code" was setting IMS
and returning, not clearing the ICR register. I would argue that the
code is probably unreachable and if we just have the checks for OTHER
and LSC then we should be taking care of all of this in the task
anyway. All this code the code in the original was doing was
re-enabling the interrupt via IMS so we probably don't need this bit
as long as we are clearing OTHER and LSC when they are set so that we
can get future interrupts.

> if (icr & adapter->eiac_mask)
> ew32(ICS, (icr & adapter->eiac_mask));
>
> @@ -2033,7 +2040,6 @@ static void e1000_configure_msix(struct e1000_adapter *adapter)
> hw->hw_addr + E1000_EITR_82574(vector));
> else
> writel(1, hw->hw_addr + E1000_EITR_82574(vector));
> - adapter->eiac_mask |= E1000_IMS_OTHER;
>
> /* Cause Tx interrupts on every write back */
> ivar |= BIT(31);
> @@ -2258,7 +2264,7 @@ static void e1000_irq_enable(struct e1000_adapter *adapter)
>
> if (adapter->msix_entries) {
> ew32(EIAC_82574, adapter->eiac_mask & E1000_EIAC_MASK_82574);
> - ew32(IMS, adapter->eiac_mask | E1000_IMS_LSC);
> + ew32(IMS, adapter->eiac_mask | E1000_IMS_OTHER | E1000_IMS_LSC);
> } else if (hw->mac.type >= e1000_pch_lpt) {
> ew32(IMS, IMS_ENABLE_MASK | E1000_IMS_ECCER);
> } else {
> --
> 2.15.1
>

2018-01-29 07:30:59

by Benjamin Poirier

[permalink] [raw]

Subject: Re: [PATCH 1/3] Partial revert "e1000e: Avoid receiver overrun interrupt bursts"

On 2018/01/26 08:50, Alexander Duyck wrote:
> On Fri, Jan 26, 2018 at 1:12 AM, Benjamin Poirier <[email protected]> wrote:
> > This reverts commit 4aea7a5c5e940c1723add439f4088844cd26196d.
> >
> > We keep the fix for the first part of the problem (1) described in the log
> > of that commit however we use the implementation of e1000_msix_other() from
> > before commit 16ecba59bc33 ("e1000e: Do not read ICR in Other interrupt").
> > We remove the fix for the second part of the problem (2).
> >
> > Bursts of "Other" interrupts may once again occur during rxo (receive
> > overflow) traffic conditions. This is deemed acceptable in the interest of
> > reverting driver code back to its previous state. The e1000e driver should
> > be in "maintenance" mode and we want to avoid unforeseen fallout from
> > changes that are not strictly necessary.
> >
> > Link: https://www.spinics.net/lists/netdev/msg480675.html
> > Signed-off-by: Benjamin Poirier <[email protected]>
>
> Thanks for doing this.
>
> Only a few minor tweaks I would recommend. Otherwise it looks good.
>
> > ---
> > drivers/net/ethernet/intel/e1000e/defines.h | 1 -
> > drivers/net/ethernet/intel/e1000e/netdev.c | 32 +++++++++++------------------
> > 2 files changed, 12 insertions(+), 21 deletions(-)
> >
> > diff --git a/drivers/net/ethernet/intel/e1000e/defines.h b/drivers/net/ethernet/intel/e1000e/defines.h
> > index afb7ebe20b24..0641c0098738 100644
> > --- a/drivers/net/ethernet/intel/e1000e/defines.h
> > +++ b/drivers/net/ethernet/intel/e1000e/defines.h
> > @@ -398,7 +398,6 @@
> > #define E1000_ICR_LSC 0x00000004 /* Link Status Change */
> > #define E1000_ICR_RXSEQ 0x00000008 /* Rx sequence error */
> > #define E1000_ICR_RXDMT0 0x00000010 /* Rx desc min. threshold (0) */
> > -#define E1000_ICR_RXO 0x00000040 /* Receiver Overrun */
> > #define E1000_ICR_RXT0 0x00000080 /* Rx timer intr (ring 0) */
> > #define E1000_ICR_ECCER 0x00400000 /* Uncorrectable ECC Error */
> > /* If this bit asserted, the driver should claim the interrupt */
> > diff --git a/drivers/net/ethernet/intel/e1000e/netdev.c b/drivers/net/ethernet/intel/e1000e/netdev.c
> > index 9f18d39bdc8f..398e940436f8 100644
> > --- a/drivers/net/ethernet/intel/e1000e/netdev.c
> > +++ b/drivers/net/ethernet/intel/e1000e/netdev.c
> > @@ -1914,30 +1914,23 @@ static irqreturn_t e1000_msix_other(int __always_unused irq, void *data)
> > struct net_device *netdev = data;
> > struct e1000_adapter *adapter = netdev_priv(netdev);
> > struct e1000_hw *hw = &adapter->hw;
> > - u32 icr;
> > - bool enable = true;
> > -
> > - icr = er32(ICR);
> > - if (icr & E1000_ICR_RXO) {
> > - ew32(ICR, E1000_ICR_RXO);
> > - enable = false;
> > - /* napi poll will re-enable Other, make sure it runs */
> > - if (napi_schedule_prep(&adapter->napi)) {
> > - adapter->total_rx_bytes = 0;
> > - adapter->total_rx_packets = 0;
> > - __napi_schedule(&adapter->napi);
> > - }
> > - }
> > - if (icr & E1000_ICR_LSC) {
> > - ew32(ICR, E1000_ICR_LSC);
> > + u32 icr = er32(ICR);
> > +
> > + if (icr & adapter->eiac_mask)
> > + ew32(ICS, (icr & adapter->eiac_mask));
>
> I'm not sure about this bit as it includes queue interrupts if I am
> not mistaken. What we should be focusing on are the bits that are not
> auto-cleared so this should probably be icr & ~adapter->eiac_mask.
> Actually what you might do is something like:
> icr &= ~adapter->eiac_mask;
> if (icr)
> ew32(ICS, icr);
>
> Also a comment explaining that we have to clear the bits that are not
> automatically cleared by other interrupt causes might be useful.

I've re-read your comment many times and I think you misunderstood what
that code is doing.

This:
> > + if (icr & adapter->eiac_mask)
> > + ew32(ICS, (icr & adapter->eiac_mask));

re-raises the queue interrupts in case the txq or rxq bits were set in
icr and the Other interrupt handler read and cleared icr before the
queue interrupt was raised. It's not about "clear the bits that are not
automatically cleared". It's a write to ICS, not ICR.

>
> > + if (icr & E1000_ICR_OTHER) {
> > + if (!(icr & E1000_ICR_LSC))
> > + goto no_link_interrupt;
>
> I wouldn't bother with the jump tag. Let's just make this one if
> statement checking both OTHER && LSC.
>
> > hw->mac.get_link_status = true;
> > /* guard against interrupt when we're going down */
> > if (!test_bit(__E1000_DOWN, &adapter->state))
> > mod_timer(&adapter->watchdog_timer, jiffies + 1);
> > }
> >
> > - if (enable && !test_bit(__E1000_DOWN, &adapter->state))
> > - ew32(IMS, E1000_IMS_OTHER);
> > +no_link_interrupt:
>
> If you just combine the if statement logic the code naturally flows to
> this point anyway without the jump label being needed.

After this patch, the implementation of e1000_msix_other() is what it
was before commit 16ecba59bc33 ("e1000e: Do not read ICR in Other
interrupt"), verbatim.

I agree with your comment though and the binary code is the same.
Changed for the next version.

>
> > + if (!test_bit(__E1000_DOWN, &adapter->state))
> > + ew32(IMS, E1000_IMS_LSC | E1000_IMS_OTHER);
> >
> > return IRQ_HANDLED;
> > }
> > @@ -2707,8 +2700,7 @@ static int e1000e_poll(struct napi_struct *napi, int weight)
> > napi_complete_done(napi, work_done);
> > if (!test_bit(__E1000_DOWN, &adapter->state)) {
> > if (adapter->msix_entries)
> > - ew32(IMS, adapter->rx_ring->ims_val |
> > - E1000_IMS_OTHER);
> > + ew32(IMS, adapter->rx_ring->ims_val);
> > else
> > e1000_irq_enable(adapter);
> > }
> > --
> > 2.15.1
> >
>

2018-01-29 07:40:44

by Benjamin Poirier

[permalink] [raw]

Subject: Re: [PATCH 3/3] Revert "e1000e: Do not read ICR in Other interrupt"

On 2018/01/26 13:01, Alexander Duyck wrote:
> On Fri, Jan 26, 2018 at 1:12 AM, Benjamin Poirier <[email protected]> wrote:
> > This reverts commit 16ecba59bc333d6282ee057fb02339f77a880beb.
> >
> > It was reported that emulated e1000e devices in vmware esxi 6.5 Build
> > 7526125 do not link up after commit 4aea7a5c5e94 ("e1000e: Avoid receiver
> > overrun interrupt bursts"). Some tracing shows that after
> > e1000e_trigger_lsc() is called, ICR reads out as 0x0 in e1000_msix_other()
> > on emulated e1000e devices. In comparison, on real e1000e 82574 hardware,
> > icr=0x80000004 (_INT_ASSERTED | _LSC) in the same situation.
> >
> > Some experimentation showed that this flaw in vmware e1000e emulation can
> > be worked around by not setting Other in EIAC. This is how it was before
> > commit 16ecba59bc33 ("e1000e: Do not read ICR in Other interrupt").
> >
> > Since the ICR read in the Other interrupt handler has already been
> > restored, this patch effectively reverts the remainder of commit
> > 16ecba59bc33 ("e1000e: Do not read ICR in Other interrupt").
> >
> > Fixes: 4aea7a5c5e94 ("e1000e: Avoid receiver overrun interrupt bursts")
> > Signed-off-by: Benjamin Poirier <[email protected]>
> > ---
> > drivers/net/ethernet/intel/e1000e/netdev.c | 10 ++++++++--
> > 1 file changed, 8 insertions(+), 2 deletions(-)
> >
> > diff --git a/drivers/net/ethernet/intel/e1000e/netdev.c b/drivers/net/ethernet/intel/e1000e/netdev.c
> > index ed103b9a8d3a..fffc1f0e3895 100644
> > --- a/drivers/net/ethernet/intel/e1000e/netdev.c
> > +++ b/drivers/net/ethernet/intel/e1000e/netdev.c
> > @@ -1916,6 +1916,13 @@ static irqreturn_t e1000_msix_other(int __always_unused irq, void *data)
> > struct e1000_hw *hw = &adapter->hw;
> > u32 icr = er32(ICR);
> >
> > + /* Certain events (such as RXO) which trigger Other do not set
> > + * INT_ASSERTED. In that case, read to clear of icr does not take
> > + * place.
> > + */
> > + if (!(icr & E1000_ICR_INT_ASSERTED))
> > + ew32(ICR, E1000_ICR_OTHER);
> > +
>
> This piece doesn't make sense to me. Why are we clearing OTHER if
> ICR_INT_ASSERTED is not set?

Datasheet ?10.2.4.1 ("Interrupt Cause Read Register") says that ICR read
to clear only occurs if INT_ASSERTED is set. This corresponds to what I
observed.

However, while working on these issues, I noticed that when there is an rxo
event, INT_ASSERTED is not always set even though the interrupt is raised. I
think this is a hardware flaw.

For example, if doing
ew32(ICS, E1000_ICS_LSC | E1000_ICS_OTHER);
we enter e1000_msix_other() and two consecutive reads of ICR result in
0x81000004
0x00000000

If doing
ew32(ICS, E1000_ICS_RXO | E1000_ICS_OTHER);
we enter e1000_msix_other() and two consecutive reads of ICR result in
0x01000041
0x01000041

Consequently, we must clear OTHER manually from ICR, otherwise the
interrupt is immediately re-raised after exiting the handler.

These observations are the same whether the interrupt is triggered via a
write to ICS or in hardware.

Furthermore, I tested that this behavior is the same for other Other
events (MDAC, SRPD, ACK, MNG). Those were tested via a write to ICS
only, not in hardware.

This is a version of the test patch that I used to trigger lsc and rxo in
software and hardware. It applies over this patch series.

diff --git a/drivers/net/ethernet/intel/e1000e/defines.h b/drivers/net/ethernet/intel/e1000e/defines.h
index 0641c0098738..f54e7ac9c934 100644
--- a/drivers/net/ethernet/intel/e1000e/defines.h
+++ b/drivers/net/ethernet/intel/e1000e/defines.h
@@ -398,6 +398,7 @@
#define E1000_ICR_LSC 0x00000004 /* Link Status Change */
#define E1000_ICR_RXSEQ 0x00000008 /* Rx sequence error */
#define E1000_ICR_RXDMT0 0x00000010 /* Rx desc min. threshold (0) */
+#define E1000_ICR_RXO 0x00000040 /* rx overrun */
#define E1000_ICR_RXT0 0x00000080 /* Rx timer intr (ring 0) */
#define E1000_ICR_ECCER 0x00400000 /* Uncorrectable ECC Error */
/* If this bit asserted, the driver should claim the interrupt */
diff --git a/drivers/net/ethernet/intel/e1000e/ethtool.c b/drivers/net/ethernet/intel/e1000e/ethtool.c
index 003cbd605799..4933c1beac74 100644
--- a/drivers/net/ethernet/intel/e1000e/ethtool.c
+++ b/drivers/net/ethernet/intel/e1000e/ethtool.c
@@ -1802,98 +1802,20 @@ static void e1000_diag_test(struct net_device *netdev,
struct ethtool_test *eth_test, u64 *data)
{
struct e1000_adapter *adapter = netdev_priv(netdev);
- u16 autoneg_advertised;
- u8 forced_speed_duplex;
- u8 autoneg;
- bool if_running = netif_running(netdev);
+ struct e1000_hw *hw = &adapter->hw;

pm_runtime_get_sync(netdev->dev.parent);

set_bit(__E1000_TESTING, &adapter->state);

- if (!if_running) {
- /* Get control of and reset hardware */
- if (adapter->flags & FLAG_HAS_AMT)
- e1000e_get_hw_control(adapter);
-
- e1000e_power_up_phy(adapter);
-
- adapter->hw.phy.autoneg_wait_to_complete = 1;
- e1000e_reset(adapter);
- adapter->hw.phy.autoneg_wait_to_complete = 0;
- }
-
if (eth_test->flags == ETH_TEST_FL_OFFLINE) {
- /* Offline tests */
-
- /* save speed, duplex, autoneg settings */
- autoneg_advertised = adapter->hw.phy.autoneg_advertised;
- forced_speed_duplex = adapter->hw.mac.forced_speed_duplex;
- autoneg = adapter->hw.mac.autoneg;
-
- e_info("offline testing starting\n");
-
- if (if_running)
- /* indicate we're in test mode */
- e1000e_close(netdev);
-
- if (e1000_reg_test(adapter, &data[0]))
- eth_test->flags |= ETH_TEST_FL_FAILED;
-
- e1000e_reset(adapter);
- if (e1000_eeprom_test(adapter, &data[1]))
- eth_test->flags |= ETH_TEST_FL_FAILED;
-
- e1000e_reset(adapter);
- if (e1000_intr_test(adapter, &data[2]))
- eth_test->flags |= ETH_TEST_FL_FAILED;
-
- e1000e_reset(adapter);
- if (e1000_loopback_test(adapter, &data[3]))
- eth_test->flags |= ETH_TEST_FL_FAILED;
-
- /* force this routine to wait until autoneg complete/timeout */
- adapter->hw.phy.autoneg_wait_to_complete = 1;
- e1000e_reset(adapter);
- adapter->hw.phy.autoneg_wait_to_complete = 0;
-
- if (e1000_link_test(adapter, &data[4]))
- eth_test->flags |= ETH_TEST_FL_FAILED;
-
- /* restore speed, duplex, autoneg settings */
- adapter->hw.phy.autoneg_advertised = autoneg_advertised;
- adapter->hw.mac.forced_speed_duplex = forced_speed_duplex;
- adapter->hw.mac.autoneg = autoneg;
- e1000e_reset(adapter);
-
- clear_bit(__E1000_TESTING, &adapter->state);
- if (if_running)
- e1000e_open(netdev);
+ // LSC, RXO, MDAC, SRPD, ACK, MNG
+ ew32(ICS, E1000_ICR_RXO | E1000_ICR_OTHER);
} else {
- /* Online tests */
-
- e_info("online testing starting\n");
-
- /* register, eeprom, intr and loopback tests not run online */
- data[0] = 0;
- data[1] = 0;
- data[2] = 0;
- data[3] = 0;
-
- if (e1000_link_test(adapter, &data[4]))
- eth_test->flags |= ETH_TEST_FL_FAILED;
-
- clear_bit(__E1000_TESTING, &adapter->state);
- }
-
- if (!if_running) {
- e1000e_reset(adapter);
-
- if (adapter->flags & FLAG_HAS_AMT)
- e1000e_release_hw_control(adapter);
+ ew32(ICS, E1000_ICR_LSC | E1000_ICR_OTHER);
}

- msleep_interruptible(4 * 1000);
+ clear_bit(__E1000_TESTING, &adapter->state);

pm_runtime_put_sync(netdev->dev.parent);
}
diff --git a/drivers/net/ethernet/intel/e1000e/netdev.c b/drivers/net/ethernet/intel/e1000e/netdev.c
index fffc1f0e3895..5b3a0feaf052 100644
--- a/drivers/net/ethernet/intel/e1000e/netdev.c
+++ b/drivers/net/ethernet/intel/e1000e/netdev.c
@@ -46,6 +46,10 @@

#include "e1000.h"

+DEFINE_RATELIMIT_STATE(rx_ratelimit_state, 2 * HZ, 1);
+DEFINE_RATELIMIT_STATE(other_ratelimit_state, 2 * HZ, 1);
+DEFINE_RATELIMIT_STATE(other_ratelimit_state2, 2 * HZ, 1);
+
#define DRV_EXTRAVERSION "-k"

#define DRV_VERSION "3.2.6" DRV_EXTRAVERSION
@@ -936,6 +940,9 @@ static bool e1000_clean_rx_irq(struct e1000_ring *rx_ring, int *work_done,
int cleaned_count = 0;
bool cleaned = false;
unsigned int total_rx_bytes = 0, total_rx_packets = 0;
+ static unsigned int count;
+
+ mdelay(10);

i = rx_ring->next_to_clean;
rx_desc = E1000_RX_DESC_EXT(*rx_ring, i);
@@ -1067,6 +1074,16 @@ static bool e1000_clean_rx_irq(struct e1000_ring *rx_ring, int *work_done,

adapter->total_rx_bytes += total_rx_bytes;
adapter->total_rx_packets += total_rx_packets;
+
+ count++;
+ if (__ratelimit(&rx_ratelimit_state)) {
+ static unsigned int max;
+ max = max(max, total_rx_packets);
+ trace_printk("rx %u now, max %u, %u rounds\n",
+ total_rx_packets, max, count);
+ count = 0;
+ }
+
return cleaned;
}

@@ -1914,14 +1931,30 @@ static irqreturn_t e1000_msix_other(int __always_unused irq, void *data)
struct net_device *netdev = data;
struct e1000_adapter *adapter = netdev_priv(netdev);
struct e1000_hw *hw = &adapter->hw;
- u32 icr = er32(ICR);
+ static unsigned int count;
+ u32 icr2, icr = er32(ICR);

/* Certain events (such as RXO) which trigger Other do not set
* INT_ASSERTED. In that case, read to clear of icr does not take
* place.
*/
+ /*
if (!(icr & E1000_ICR_INT_ASSERTED))
ew32(ICR, E1000_ICR_OTHER);
+ */
+
+ icr2 = er32(ICR);
+
+ count++;
+ if (__ratelimit(&other_ratelimit_state)) {
+ trace_printk("icr 0x%08x icr2 0x%08x count %u\n", icr, icr2,
+ count);
+ count = 0;
+ }
+ if (icr & E1000_ICR_RXO && icr & E1000_ICR_INT_ASSERTED &&
+ __ratelimit(&other_ratelimit_state2)) {
+ trace_printk("special icr 0x%08x icr2 0x%08x\n", icr, icr2);
+ }

if (icr & adapter->eiac_mask)
ew32(ICS, (icr & adapter->eiac_mask));

2018-01-29 10:25:23

by Benjamin Poirier

[permalink] [raw]

Subject: Re: [PATCH 2/3] Revert "e1000e: Separate signaling for link check/link up"

On 2018/01/26 09:03, Alexander Duyck wrote:
> On Fri, Jan 26, 2018 at 1:12 AM, Benjamin Poirier <[email protected]> wrote:
> > This reverts commit 19110cfbb34d4af0cdfe14cd243f3b09dc95b013.
> > This reverts commit 4110e02eb45ea447ec6f5459c9934de0a273fb91.
> >
> > ... because they cause an extra 2s delay for the link to come up when
> > autoneg is off.
> >
> > After reverting, the race condition described in the log of commit
> > 19110cfbb34d ("e1000e: Separate signaling for link check/link up") is
> > reintroduced. It may still be triggered by LSC events but this should not
> > result in link flap. It may no longer be triggered by RXO events because
> > commit 4aea7a5c5e94 ("e1000e: Avoid receiver overrun interrupt bursts")
> > restored reading icr in the Other handler.
>
> With the RXO events removed the only cause for us to transition the
> bit should be LSC. I'm not sure if the race condition in that state is
> a valid concern or not as the LSC should only get triggered if the
> link state toggled, even briefly.
>
> The bigger concern I would have would be the opposite of the original
> race that was pointed out:
> \ e1000_watchdog_task
> \ e1000e_has_link
> \ hw->mac.ops.check_for_link() === e1000e_check_for_copper_link
> /* link is up */
> mac->get_link_status = false;
>
> /* interrupt */
> \ e1000_msix_other
> hw->mac.get_link_status = true;
>
> link_active = !hw->mac.get_link_status
> /* link_active is false, wrongly */
>
> So the question I would have is what if we see the LSC for a link down
> just after the check_for_copper_link call completes? It may not be

Can you write out exactly what that race would be, in a format similar to the
above?

> anything seen in the real world since I don't know if we have any link
> flapping issues on e1000e or not without this patch. It is something
> to keep in mind for the future though.
>
>
> > As discussed, the driver should be in "maintenance mode". In the interest
> > of stability, revert to the original code as much as possible instead of a
> > half-baked solution.
>
> If nothing else we may want to do a follow-up on this patch as we
> probably shouldn't be returning the error values to trigger link up.
> There are definitely issues to be found here. If nothing else we may
> want to explore just returning 1 if auto-neg is disabled instead of
> returning an error code.
>
> > Link: https://www.spinics.net/lists/netdev/msg479923.html
> > Signed-off-by: Benjamin Poirier <[email protected]>
[...]

2018-01-29 15:43:40

by Alexander Duyck

[permalink] [raw]

Subject: Re: [PATCH 1/3] Partial revert "e1000e: Avoid receiver overrun interrupt bursts"

On Sun, Jan 28, 2018 at 11:20 PM, Benjamin Poirier <[email protected]> wrote:
> On 2018/01/26 08:50, Alexander Duyck wrote:
>> On Fri, Jan 26, 2018 at 1:12 AM, Benjamin Poirier <[email protected]> wrote:
>> > This reverts commit 4aea7a5c5e940c1723add439f4088844cd26196d.
>> >
>> > We keep the fix for the first part of the problem (1) described in the log
>> > of that commit however we use the implementation of e1000_msix_other() from
>> > before commit 16ecba59bc33 ("e1000e: Do not read ICR in Other interrupt").
>> > We remove the fix for the second part of the problem (2).
>> >
>> > Bursts of "Other" interrupts may once again occur during rxo (receive
>> > overflow) traffic conditions. This is deemed acceptable in the interest of
>> > reverting driver code back to its previous state. The e1000e driver should
>> > be in "maintenance" mode and we want to avoid unforeseen fallout from
>> > changes that are not strictly necessary.
>> >
>> > Link: https://www.spinics.net/lists/netdev/msg480675.html
>> > Signed-off-by: Benjamin Poirier <[email protected]>
>>
>> Thanks for doing this.
>>
>> Only a few minor tweaks I would recommend. Otherwise it looks good.
>>
>> > ---
>> > drivers/net/ethernet/intel/e1000e/defines.h | 1 -
>> > drivers/net/ethernet/intel/e1000e/netdev.c | 32 +++++++++++------------------
>> > 2 files changed, 12 insertions(+), 21 deletions(-)
>> >
>> > diff --git a/drivers/net/ethernet/intel/e1000e/defines.h b/drivers/net/ethernet/intel/e1000e/defines.h
>> > index afb7ebe20b24..0641c0098738 100644
>> > --- a/drivers/net/ethernet/intel/e1000e/defines.h
>> > +++ b/drivers/net/ethernet/intel/e1000e/defines.h
>> > @@ -398,7 +398,6 @@
>> > #define E1000_ICR_LSC 0x00000004 /* Link Status Change */
>> > #define E1000_ICR_RXSEQ 0x00000008 /* Rx sequence error */
>> > #define E1000_ICR_RXDMT0 0x00000010 /* Rx desc min. threshold (0) */
>> > -#define E1000_ICR_RXO 0x00000040 /* Receiver Overrun */
>> > #define E1000_ICR_RXT0 0x00000080 /* Rx timer intr (ring 0) */
>> > #define E1000_ICR_ECCER 0x00400000 /* Uncorrectable ECC Error */
>> > /* If this bit asserted, the driver should claim the interrupt */
>> > diff --git a/drivers/net/ethernet/intel/e1000e/netdev.c b/drivers/net/ethernet/intel/e1000e/netdev.c
>> > index 9f18d39bdc8f..398e940436f8 100644
>> > --- a/drivers/net/ethernet/intel/e1000e/netdev.c
>> > +++ b/drivers/net/ethernet/intel/e1000e/netdev.c
>> > @@ -1914,30 +1914,23 @@ static irqreturn_t e1000_msix_other(int __always_unused irq, void *data)
>> > struct net_device *netdev = data;
>> > struct e1000_adapter *adapter = netdev_priv(netdev);
>> > struct e1000_hw *hw = &adapter->hw;
>> > - u32 icr;
>> > - bool enable = true;
>> > -
>> > - icr = er32(ICR);
>> > - if (icr & E1000_ICR_RXO) {
>> > - ew32(ICR, E1000_ICR_RXO);
>> > - enable = false;
>> > - /* napi poll will re-enable Other, make sure it runs */
>> > - if (napi_schedule_prep(&adapter->napi)) {
>> > - adapter->total_rx_bytes = 0;
>> > - adapter->total_rx_packets = 0;
>> > - __napi_schedule(&adapter->napi);
>> > - }
>> > - }
>> > - if (icr & E1000_ICR_LSC) {
>> > - ew32(ICR, E1000_ICR_LSC);
>> > + u32 icr = er32(ICR);
>> > +
>> > + if (icr & adapter->eiac_mask)
>> > + ew32(ICS, (icr & adapter->eiac_mask));
>>
>> I'm not sure about this bit as it includes queue interrupts if I am
>> not mistaken. What we should be focusing on are the bits that are not
>> auto-cleared so this should probably be icr & ~adapter->eiac_mask.
>> Actually what you might do is something like:
>> icr &= ~adapter->eiac_mask;
>> if (icr)
>> ew32(ICS, icr);
>>
>> Also a comment explaining that we have to clear the bits that are not
>> automatically cleared by other interrupt causes might be useful.
>
> I've re-read your comment many times and I think you misunderstood what
> that code is doing.

You are right. I did.

> This:
>> > + if (icr & adapter->eiac_mask)
>> > + ew32(ICS, (icr & adapter->eiac_mask));
>
> re-raises the queue interrupts in case the txq or rxq bits were set in
> icr and the Other interrupt handler read and cleared icr before the
> queue interrupt was raised. It's not about "clear the bits that are not
> automatically cleared". It's a write to ICS, not ICR.

Actually this raises other possible issues. Now I am wondering if the
actual issue is the fact that we are reading the ICR at all in MSI-X
mode with EIAME enabled. For now I guess we need it since reading the
ICR could potentially be causing the events to be cleared.

There is actually HW Errata 12 that we need to take into account as
well, the document is available at:
https://www.intel.com/content/www/us/en/embedded/products/networking/82574-gbe-controller-spec-update.html

I think I may check with our Client silicon team to see if there is
any way for us to get away from reading ICR and still avoid the RXO
issue since it seems like reading ICR is just going to be problematic
at best. If nothing else we probably need a new HW errata for the RXO
problem since it seems like it is firing interrupts even though it
should be masked.

>>
>> > + if (icr & E1000_ICR_OTHER) {
>> > + if (!(icr & E1000_ICR_LSC))
>> > + goto no_link_interrupt;
>>
>> I wouldn't bother with the jump tag. Let's just make this one if
>> statement checking both OTHER && LSC.
>>
>> > hw->mac.get_link_status = true;
>> > /* guard against interrupt when we're going down */
>> > if (!test_bit(__E1000_DOWN, &adapter->state))
>> > mod_timer(&adapter->watchdog_timer, jiffies + 1);
>> > }
>> >
>> > - if (enable && !test_bit(__E1000_DOWN, &adapter->state))
>> > - ew32(IMS, E1000_IMS_OTHER);
>> > +no_link_interrupt:
>>
>> If you just combine the if statement logic the code naturally flows to
>> this point anyway without the jump label being needed.
>
> After this patch, the implementation of e1000_msix_other() is what it
> was before commit 16ecba59bc33 ("e1000e: Do not read ICR in Other
> interrupt"), verbatim.
>
> I agree with your comment though and the binary code is the same.
> Changed for the next version.

We may need to look at clearing the ICR still, or are you expecting
the read auto cleared it?

>>
>> > + if (!test_bit(__E1000_DOWN, &adapter->state))
>> > + ew32(IMS, E1000_IMS_LSC | E1000_IMS_OTHER);
>> >
>> > return IRQ_HANDLED;
>> > }
>> > @@ -2707,8 +2700,7 @@ static int e1000e_poll(struct napi_struct *napi, int weight)
>> > napi_complete_done(napi, work_done);
>> > if (!test_bit(__E1000_DOWN, &adapter->state)) {
>> > if (adapter->msix_entries)
>> > - ew32(IMS, adapter->rx_ring->ims_val |
>> > - E1000_IMS_OTHER);
>> > + ew32(IMS, adapter->rx_ring->ims_val);
>> > else
>> > e1000_irq_enable(adapter);
>> > }
>> > --
>> > 2.15.1
>> >
>>

2018-01-29 15:49:18

by Alexander Duyck

[permalink] [raw]

Subject: Re: [PATCH 2/3] Revert "e1000e: Separate signaling for link check/link up"

On Sun, Jan 28, 2018 at 11:21 PM, Benjamin Poirier <[email protected]> wrote:
> On 2018/01/26 09:03, Alexander Duyck wrote:
>> On Fri, Jan 26, 2018 at 1:12 AM, Benjamin Poirier <[email protected]> wrote:
>> > This reverts commit 19110cfbb34d4af0cdfe14cd243f3b09dc95b013.
>> > This reverts commit 4110e02eb45ea447ec6f5459c9934de0a273fb91.
>> >
>> > ... because they cause an extra 2s delay for the link to come up when
>> > autoneg is off.
>> >
>> > After reverting, the race condition described in the log of commit
>> > 19110cfbb34d ("e1000e: Separate signaling for link check/link up") is
>> > reintroduced. It may still be triggered by LSC events but this should not
>> > result in link flap. It may no longer be triggered by RXO events because
>> > commit 4aea7a5c5e94 ("e1000e: Avoid receiver overrun interrupt bursts")
>> > restored reading icr in the Other handler.
>>
>> With the RXO events removed the only cause for us to transition the
>> bit should be LSC. I'm not sure if the race condition in that state is
>> a valid concern or not as the LSC should only get triggered if the
>> link state toggled, even briefly.
>>
>> The bigger concern I would have would be the opposite of the original
>> race that was pointed out:
>> \ e1000_watchdog_task
>> \ e1000e_has_link
>> \ hw->mac.ops.check_for_link() === e1000e_check_for_copper_link
>> /* link is up */
>> mac->get_link_status = false;
>>
>> /* interrupt */
>> \ e1000_msix_other
>> hw->mac.get_link_status = true;
>>
>> link_active = !hw->mac.get_link_status
>> /* link_active is false, wrongly */

\ e1000_watchdog_task
\ e1000e_has_link
\ hw->mac.ops.check_for_link() === e1000e_check_for_copper_link
/* interrupt */
\ e1000_msix_other
hw->mac.get_link_status = true;

/* link is up */
mac->get_link_status = false;

link_active = !hw->mac.get_link_status
/* link_active is true, wrongly */

So basically the idea would be that the interrupt fires after we check
for link, but before we have updated get_link_status. It should be a
pretty narrow window and I don't know if it will be much of an issue.

>> So the question I would have is what if we see the LSC for a link down
>> just after the check_for_copper_link call completes? It may not be
>
> Can you write out exactly what that race would be, in a format similar to the
> above?

I just did the copy/paste/edit above if that works. Hopefully that
makes it clearer.

>> anything seen in the real world since I don't know if we have any link
>> flapping issues on e1000e or not without this patch. It is something
>> to keep in mind for the future though.
>>
>>
>> > As discussed, the driver should be in "maintenance mode". In the interest
>> > of stability, revert to the original code as much as possible instead of a
>> > half-baked solution.
>>
>> If nothing else we may want to do a follow-up on this patch as we
>> probably shouldn't be returning the error values to trigger link up.
>> There are definitely issues to be found here. If nothing else we may
>> want to explore just returning 1 if auto-neg is disabled instead of
>> returning an error code.
>>
>> > Link: https://www.spinics.net/lists/netdev/msg479923.html
>> > Signed-off-by: Benjamin Poirier <[email protected]>
> [...]

2018-01-29 17:24:21

by Alexander Duyck

[permalink] [raw]

Subject: Re: [PATCH 3/3] Revert "e1000e: Do not read ICR in Other interrupt"

On Sun, Jan 28, 2018 at 11:28 PM, Benjamin Poirier <[email protected]> wrote:
> On 2018/01/26 13:01, Alexander Duyck wrote:
>> On Fri, Jan 26, 2018 at 1:12 AM, Benjamin Poirier <[email protected]> wrote:
>> > This reverts commit 16ecba59bc333d6282ee057fb02339f77a880beb.
>> >
>> > It was reported that emulated e1000e devices in vmware esxi 6.5 Build
>> > 7526125 do not link up after commit 4aea7a5c5e94 ("e1000e: Avoid receiver
>> > overrun interrupt bursts"). Some tracing shows that after
>> > e1000e_trigger_lsc() is called, ICR reads out as 0x0 in e1000_msix_other()
>> > on emulated e1000e devices. In comparison, on real e1000e 82574 hardware,
>> > icr=0x80000004 (_INT_ASSERTED | _LSC) in the same situation.
>> >
>> > Some experimentation showed that this flaw in vmware e1000e emulation can
>> > be worked around by not setting Other in EIAC. This is how it was before
>> > commit 16ecba59bc33 ("e1000e: Do not read ICR in Other interrupt").
>> >
>> > Since the ICR read in the Other interrupt handler has already been
>> > restored, this patch effectively reverts the remainder of commit
>> > 16ecba59bc33 ("e1000e: Do not read ICR in Other interrupt").
>> >
>> > Fixes: 4aea7a5c5e94 ("e1000e: Avoid receiver overrun interrupt bursts")
>> > Signed-off-by: Benjamin Poirier <[email protected]>
>> > ---
>> > drivers/net/ethernet/intel/e1000e/netdev.c | 10 ++++++++--
>> > 1 file changed, 8 insertions(+), 2 deletions(-)
>> >
>> > diff --git a/drivers/net/ethernet/intel/e1000e/netdev.c b/drivers/net/ethernet/intel/e1000e/netdev.c
>> > index ed103b9a8d3a..fffc1f0e3895 100644
>> > --- a/drivers/net/ethernet/intel/e1000e/netdev.c
>> > +++ b/drivers/net/ethernet/intel/e1000e/netdev.c
>> > @@ -1916,6 +1916,13 @@ static irqreturn_t e1000_msix_other(int __always_unused irq, void *data)
>> > struct e1000_hw *hw = &adapter->hw;
>> > u32 icr = er32(ICR);
>> >
>> > + /* Certain events (such as RXO) which trigger Other do not set
>> > + * INT_ASSERTED. In that case, read to clear of icr does not take
>> > + * place.
>> > + */
>> > + if (!(icr & E1000_ICR_INT_ASSERTED))
>> > + ew32(ICR, E1000_ICR_OTHER);
>> > +
>>
>> This piece doesn't make sense to me. Why are we clearing OTHER if
>> ICR_INT_ASSERTED is not set?
>
> Datasheet §10.2.4.1 ("Interrupt Cause Read Register") says that ICR read
> to clear only occurs if INT_ASSERTED is set. This corresponds to what I
> observed.
>
> However, while working on these issues, I noticed that when there is an rxo
> event, INT_ASSERTED is not always set even though the interrupt is raised. I
> think this is a hardware flaw.

I agree. I need to check with our silicon team to see what we can determine.

> For example, if doing
> ew32(ICS, E1000_ICS_LSC | E1000_ICS_OTHER);
> we enter e1000_msix_other() and two consecutive reads of ICR result in
> 0x81000004
> 0x00000000
>
> If doing
> ew32(ICS, E1000_ICS_RXO | E1000_ICS_OTHER);
> we enter e1000_msix_other() and two consecutive reads of ICR result in
> 0x01000041
> 0x01000041

This is interesting. So the ICR is doing the clear on read, so that
answers the question I had about the earlier patch.

One thought on this.. Is there any reason why you are limiting this to
only the OTHER bit? It seems like RXO and the other causes that aren't
supposed to be included in the mask should probably be cleared as
well, are they auto-cleared, ignored, or is there some advantage to
leaving them set?

> Consequently, we must clear OTHER manually from ICR, otherwise the
> interrupt is immediately re-raised after exiting the handler.
>
> These observations are the same whether the interrupt is triggered via a
> write to ICS or in hardware.
>
> Furthermore, I tested that this behavior is the same for other Other
> events (MDAC, SRPD, ACK, MNG). Those were tested via a write to ICS
> only, not in hardware.
>
> This is a version of the test patch that I used to trigger lsc and rxo in
> software and hardware. It applies over this patch series.

I plan to look into this some more over the next few days. Ideally if
we could mask these "OTHER" interrupts besides the LSC we could comply
with all the needed bits for MSI-X. My concern is that we are still
stuck reading the ICR at this point because of this and it is going to
make dealing with MSI-X challenging on 82574 since it seems like the
intention was that you weren't supposed to be reading the ICR when
MSI-X is enabled based on the list of current issues and HW errata.

At this point it seems like the interrupts is firing and the
INT_ASSERTED is all we really need to be checking for if I understand
this all correctly. Basically if LSC is set it will trigger OTHER and
INT_ASSERTED, if any of the other causes are set they are only setting
OTHER.

> diff --git a/drivers/net/ethernet/intel/e1000e/defines.h b/drivers/net/ethernet/intel/e1000e/defines.h
> index 0641c0098738..f54e7ac9c934 100644
> --- a/drivers/net/ethernet/intel/e1000e/defines.h
> +++ b/drivers/net/ethernet/intel/e1000e/defines.h
> @@ -398,6 +398,7 @@
> #define E1000_ICR_LSC 0x00000004 /* Link Status Change */
> #define E1000_ICR_RXSEQ 0x00000008 /* Rx sequence error */
> #define E1000_ICR_RXDMT0 0x00000010 /* Rx desc min. threshold (0) */
> +#define E1000_ICR_RXO 0x00000040 /* rx overrun */
> #define E1000_ICR_RXT0 0x00000080 /* Rx timer intr (ring 0) */
> #define E1000_ICR_ECCER 0x00400000 /* Uncorrectable ECC Error */
> /* If this bit asserted, the driver should claim the interrupt */
> diff --git a/drivers/net/ethernet/intel/e1000e/ethtool.c b/drivers/net/ethernet/intel/e1000e/ethtool.c
> index 003cbd605799..4933c1beac74 100644
> --- a/drivers/net/ethernet/intel/e1000e/ethtool.c
> +++ b/drivers/net/ethernet/intel/e1000e/ethtool.c
> @@ -1802,98 +1802,20 @@ static void e1000_diag_test(struct net_device *netdev,
> struct ethtool_test *eth_test, u64 *data)
> {
> struct e1000_adapter *adapter = netdev_priv(netdev);
> - u16 autoneg_advertised;
> - u8 forced_speed_duplex;
> - u8 autoneg;
> - bool if_running = netif_running(netdev);
> + struct e1000_hw *hw = &adapter->hw;
>
> pm_runtime_get_sync(netdev->dev.parent);
>
> set_bit(__E1000_TESTING, &adapter->state);
>
> - if (!if_running) {
> - /* Get control of and reset hardware */
> - if (adapter->flags & FLAG_HAS_AMT)
> - e1000e_get_hw_control(adapter);
> -
> - e1000e_power_up_phy(adapter);
> -
> - adapter->hw.phy.autoneg_wait_to_complete = 1;
> - e1000e_reset(adapter);
> - adapter->hw.phy.autoneg_wait_to_complete = 0;
> - }
> -
> if (eth_test->flags == ETH_TEST_FL_OFFLINE) {
> - /* Offline tests */
> -
> - /* save speed, duplex, autoneg settings */
> - autoneg_advertised = adapter->hw.phy.autoneg_advertised;
> - forced_speed_duplex = adapter->hw.mac.forced_speed_duplex;
> - autoneg = adapter->hw.mac.autoneg;
> -
> - e_info("offline testing starting\n");
> -
> - if (if_running)
> - /* indicate we're in test mode */
> - e1000e_close(netdev);
> -
> - if (e1000_reg_test(adapter, &data[0]))
> - eth_test->flags |= ETH_TEST_FL_FAILED;
> -
> - e1000e_reset(adapter);
> - if (e1000_eeprom_test(adapter, &data[1]))
> - eth_test->flags |= ETH_TEST_FL_FAILED;
> -
> - e1000e_reset(adapter);
> - if (e1000_intr_test(adapter, &data[2]))
> - eth_test->flags |= ETH_TEST_FL_FAILED;
> -
> - e1000e_reset(adapter);
> - if (e1000_loopback_test(adapter, &data[3]))
> - eth_test->flags |= ETH_TEST_FL_FAILED;
> -
> - /* force this routine to wait until autoneg complete/timeout */
> - adapter->hw.phy.autoneg_wait_to_complete = 1;
> - e1000e_reset(adapter);
> - adapter->hw.phy.autoneg_wait_to_complete = 0;
> -
> - if (e1000_link_test(adapter, &data[4]))
> - eth_test->flags |= ETH_TEST_FL_FAILED;
> -
> - /* restore speed, duplex, autoneg settings */
> - adapter->hw.phy.autoneg_advertised = autoneg_advertised;
> - adapter->hw.mac.forced_speed_duplex = forced_speed_duplex;
> - adapter->hw.mac.autoneg = autoneg;
> - e1000e_reset(adapter);
> -
> - clear_bit(__E1000_TESTING, &adapter->state);
> - if (if_running)
> - e1000e_open(netdev);
> + // LSC, RXO, MDAC, SRPD, ACK, MNG
> + ew32(ICS, E1000_ICR_RXO | E1000_ICR_OTHER);
> } else {
> - /* Online tests */
> -
> - e_info("online testing starting\n");
> -
> - /* register, eeprom, intr and loopback tests not run online */
> - data[0] = 0;
> - data[1] = 0;
> - data[2] = 0;
> - data[3] = 0;
> -
> - if (e1000_link_test(adapter, &data[4]))
> - eth_test->flags |= ETH_TEST_FL_FAILED;
> -
> - clear_bit(__E1000_TESTING, &adapter->state);
> - }
> -
> - if (!if_running) {
> - e1000e_reset(adapter);
> -
> - if (adapter->flags & FLAG_HAS_AMT)
> - e1000e_release_hw_control(adapter);
> + ew32(ICS, E1000_ICR_LSC | E1000_ICR_OTHER);
> }
>
> - msleep_interruptible(4 * 1000);
> + clear_bit(__E1000_TESTING, &adapter->state);
>
> pm_runtime_put_sync(netdev->dev.parent);
> }
> diff --git a/drivers/net/ethernet/intel/e1000e/netdev.c b/drivers/net/ethernet/intel/e1000e/netdev.c
> index fffc1f0e3895..5b3a0feaf052 100644
> --- a/drivers/net/ethernet/intel/e1000e/netdev.c
> +++ b/drivers/net/ethernet/intel/e1000e/netdev.c
> @@ -46,6 +46,10 @@
>
> #include "e1000.h"
>
> +DEFINE_RATELIMIT_STATE(rx_ratelimit_state, 2 * HZ, 1);
> +DEFINE_RATELIMIT_STATE(other_ratelimit_state, 2 * HZ, 1);
> +DEFINE_RATELIMIT_STATE(other_ratelimit_state2, 2 * HZ, 1);
> +
> #define DRV_EXTRAVERSION "-k"
>
> #define DRV_VERSION "3.2.6" DRV_EXTRAVERSION
> @@ -936,6 +940,9 @@ static bool e1000_clean_rx_irq(struct e1000_ring *rx_ring, int *work_done,
> int cleaned_count = 0;
> bool cleaned = false;
> unsigned int total_rx_bytes = 0, total_rx_packets = 0;
> + static unsigned int count;
> +
> + mdelay(10);
>
> i = rx_ring->next_to_clean;
> rx_desc = E1000_RX_DESC_EXT(*rx_ring, i);
> @@ -1067,6 +1074,16 @@ static bool e1000_clean_rx_irq(struct e1000_ring *rx_ring, int *work_done,
>
> adapter->total_rx_bytes += total_rx_bytes;
> adapter->total_rx_packets += total_rx_packets;
> +
> + count++;
> + if (__ratelimit(&rx_ratelimit_state)) {
> + static unsigned int max;
> + max = max(max, total_rx_packets);
> + trace_printk("rx %u now, max %u, %u rounds\n",
> + total_rx_packets, max, count);
> + count = 0;
> + }
> +
> return cleaned;
> }
>
> @@ -1914,14 +1931,30 @@ static irqreturn_t e1000_msix_other(int __always_unused irq, void *data)
> struct net_device *netdev = data;
> struct e1000_adapter *adapter = netdev_priv(netdev);
> struct e1000_hw *hw = &adapter->hw;
> - u32 icr = er32(ICR);
> + static unsigned int count;
> + u32 icr2, icr = er32(ICR);
>
> /* Certain events (such as RXO) which trigger Other do not set
> * INT_ASSERTED. In that case, read to clear of icr does not take
> * place.
> */
> + /*
> if (!(icr & E1000_ICR_INT_ASSERTED))
> ew32(ICR, E1000_ICR_OTHER);
> + */
> +
> + icr2 = er32(ICR);
> +
> + count++;
> + if (__ratelimit(&other_ratelimit_state)) {
> + trace_printk("icr 0x%08x icr2 0x%08x count %u\n", icr, icr2,
> + count);
> + count = 0;
> + }
> + if (icr & E1000_ICR_RXO && icr & E1000_ICR_INT_ASSERTED &&
> + __ratelimit(&other_ratelimit_state2)) {
> + trace_printk("special icr 0x%08x icr2 0x%08x\n", icr, icr2);
> + }
>
> if (icr & adapter->eiac_mask)
> ew32(ICS, (icr & adapter->eiac_mask));

2018-02-08 06:41:31

by Benjamin Poirier

[permalink] [raw]

Subject: Re: [PATCH 3/3] Revert "e1000e: Do not read ICR in Other interrupt"

On 2018/01/29 09:22, Alexander Duyck wrote:
[...]
>
> > Consequently, we must clear OTHER manually from ICR, otherwise the
> > interrupt is immediately re-raised after exiting the handler.
> >
> > These observations are the same whether the interrupt is triggered via a
> > write to ICS or in hardware.
> >
> > Furthermore, I tested that this behavior is the same for other Other
> > events (MDAC, SRPD, ACK, MNG). Those were tested via a write to ICS
> > only, not in hardware.
> >
> > This is a version of the test patch that I used to trigger lsc and rxo in
> > software and hardware. It applies over this patch series.
>
> I plan to look into this some more over the next few days. Ideally if
> we could mask these "OTHER" interrupts besides the LSC we could comply
> with all the needed bits for MSI-X. My concern is that we are still
> stuck reading the ICR at this point because of this and it is going to
> make dealing with MSI-X challenging on 82574 since it seems like the
> intention was that you weren't supposed to be reading the ICR when
> MSI-X is enabled based on the list of current issues and HW errata.

I totally agree with you that it looks like the msi-x interface was
designed so you don't need to read icr. That's also why I was happy to
go that direction with the (now infamous) commit 16ecba59bc33 ("e1000e:
Do not read ICR in Other interrupt", v4.5-rc1).

However, we looked at it before and there seems to be no way to mask
individual Other interrupt causes (masking rxo but getting lsc). Because
of that, I think we have to keep reading icr in the Other interrupt
handler.

>
> At this point it seems like the interrupts is firing and the
> INT_ASSERTED is all we really need to be checking for if I understand
> this all correctly. Basically if LSC is set it will trigger OTHER and
> INT_ASSERTED, if any of the other causes are set they are only setting
> OTHER.

I think that's right and it's related to the fact that currently LSC is
set in IMS but not the other causes. Since we have to read icr (as I
wrote above) but we want to avoid reading it without INT_ASSERTED set
(as per errata 12) the solution will be to set all of the causes related
to Other in IMS. Patches incoming...

2018-02-26 02:32:42

by Benjamin Poirier

[permalink] [raw]

Subject: [RFC PATCH] e1000e: Fix link check race condition.

Alex reported the following race condition:

/* link goes up... interrupt... schedule watchdog */
\ e1000_watchdog_task
\ e1000e_has_link
\ hw->mac.ops.check_for_link() === e1000e_check_for_copper_link
\ e1000e_phy_has_link_generic(..., &link)
link = true

/* link goes down... interrupt */
\ e1000_msix_other
hw->mac.get_link_status = true

/* link is up */
mac->get_link_status = false

link_active = true
/* link_active is true, wrongly, and stays so because
* get_link_status is false */

Avoid this problem by making sure that we don't set get_link_status = false
after having checked the link.

It seems this problem has been present since the introduction of e1000e.

Link: https://lkml.org/lkml/2018/1/29/338
Reported-by: Alexander Duyck <[email protected]>
Signed-off-by: Benjamin Poirier <[email protected]>
---
drivers/net/ethernet/intel/e1000e/ich8lan.c | 41 ++++++++++++++++-------------
drivers/net/ethernet/intel/e1000e/mac.c | 14 +++++++---
2 files changed, 33 insertions(+), 22 deletions(-)

diff --git a/drivers/net/ethernet/intel/e1000e/ich8lan.c b/drivers/net/ethernet/intel/e1000e/ich8lan.c
index ff308b05d68c..3c2c4f87e075 100644
--- a/drivers/net/ethernet/intel/e1000e/ich8lan.c
+++ b/drivers/net/ethernet/intel/e1000e/ich8lan.c
@@ -1386,6 +1386,7 @@ static s32 e1000_check_for_copper_link_ich8lan(struct e1000_hw *hw)
*/
if (!mac->get_link_status)
return 1;
+ mac->get_link_status = false;

/* First we want to see if the MII Status Register reports
* link. If so, then we want to get the current speed/duplex
@@ -1393,12 +1394,12 @@ static s32 e1000_check_for_copper_link_ich8lan(struct e1000_hw *hw)
*/
ret_val = e1000e_phy_has_link_generic(hw, 1, 0, &link);
if (ret_val)
- return ret_val;
+ goto out;

if (hw->mac.type == e1000_pchlan) {
ret_val = e1000_k1_gig_workaround_hv(hw, link);
if (ret_val)
- return ret_val;
+ goto out;
}

/* When connected at 10Mbps half-duplex, some parts are excessively
@@ -1431,7 +1432,7 @@ static s32 e1000_check_for_copper_link_ich8lan(struct e1000_hw *hw)

ret_val = hw->phy.ops.acquire(hw);
if (ret_val)
- return ret_val;
+ goto out;

if (hw->mac.type == e1000_pch2lan)
emi_addr = I82579_RX_CONFIG;
@@ -1453,7 +1454,7 @@ static s32 e1000_check_for_copper_link_ich8lan(struct e1000_hw *hw)
hw->phy.ops.release(hw);

if (ret_val)
- return ret_val;
+ goto out;

if (hw->mac.type >= e1000_pch_spt) {
u16 data;
@@ -1462,14 +1463,14 @@ static s32 e1000_check_for_copper_link_ich8lan(struct e1000_hw *hw)
if (speed == SPEED_1000) {
ret_val = hw->phy.ops.acquire(hw);
if (ret_val)
- return ret_val;
+ goto out;

ret_val = e1e_rphy_locked(hw,
PHY_REG(776, 20),
&data);
if (ret_val) {
hw->phy.ops.release(hw);
- return ret_val;
+ goto out;
}

ptr_gap = (data & (0x3FF << 2)) >> 2;
@@ -1483,18 +1484,18 @@ static s32 e1000_check_for_copper_link_ich8lan(struct e1000_hw *hw)
}
hw->phy.ops.release(hw);
if (ret_val)
- return ret_val;
+ goto out;
} else {
ret_val = hw->phy.ops.acquire(hw);
if (ret_val)
- return ret_val;
+ goto out;

ret_val = e1e_wphy_locked(hw,
PHY_REG(776, 20),
0xC023);
hw->phy.ops.release(hw);
if (ret_val)
- return ret_val;
+ goto out;

}
}
@@ -1521,7 +1522,7 @@ static s32 e1000_check_for_copper_link_ich8lan(struct e1000_hw *hw)
(hw->adapter->pdev->device == E1000_DEV_ID_PCH_I218_V3)) {
ret_val = e1000_k1_workaround_lpt_lp(hw, link);
if (ret_val)
- return ret_val;
+ goto out;
}
if (hw->mac.type >= e1000_pch_lpt) {
/* Set platform power management values for
@@ -1529,7 +1530,7 @@ static s32 e1000_check_for_copper_link_ich8lan(struct e1000_hw *hw)
*/
ret_val = e1000_platform_pm_pch_lpt(hw, link);
if (ret_val)
- return ret_val;
+ goto out;
}

/* Clear link partner's EEE ability */
@@ -1551,22 +1552,22 @@ static s32 e1000_check_for_copper_link_ich8lan(struct e1000_hw *hw)
ew32(FEXTNVM6, fextnvm6);
}

- if (!link)
+ if (!link) {
+ mac->get_link_status = true;
return 0; /* No link detected */
-
- mac->get_link_status = false;
+ }

switch (hw->mac.type) {
case e1000_pch2lan:
ret_val = e1000_k1_workaround_lv(hw);
if (ret_val)
- return ret_val;
+ goto out;
/* fall-thru */
case e1000_pchlan:
if (hw->phy.type == e1000_phy_82578) {
ret_val = e1000_link_stall_workaround_hv(hw);
if (ret_val)
- return ret_val;
+ goto out;
}

/* Workaround for PCHx parts in half-duplex:
@@ -1595,7 +1596,7 @@ static s32 e1000_check_for_copper_link_ich8lan(struct e1000_hw *hw)
if (hw->phy.type > e1000_phy_82579) {
ret_val = e1000_set_eee_pchlan(hw);
if (ret_val)
- return ret_val;
+ goto out;
}

/* If we are forcing speed/duplex, then we simply return since
@@ -1618,10 +1619,14 @@ static s32 e1000_check_for_copper_link_ich8lan(struct e1000_hw *hw)
ret_val = e1000e_config_fc_after_link_up(hw);
if (ret_val) {
e_dbg("Error configuring flow control\n");
- return ret_val;
+ goto out;
}

return 1;
+
+out:
+ mac->get_link_status = true;
+ return ret_val;
}

static s32 e1000_get_variants_ich8lan(struct e1000_adapter *adapter)
diff --git a/drivers/net/ethernet/intel/e1000e/mac.c b/drivers/net/ethernet/intel/e1000e/mac.c
index db735644b312..60c8beaf5cb3 100644
--- a/drivers/net/ethernet/intel/e1000e/mac.c
+++ b/drivers/net/ethernet/intel/e1000e/mac.c
@@ -427,6 +427,7 @@ s32 e1000e_check_for_copper_link(struct e1000_hw *hw)
*/
if (!mac->get_link_status)
return 1;
+ mac->get_link_status = false;

/* First we want to see if the MII Status Register reports
* link. If so, then we want to get the current speed/duplex
@@ -434,12 +435,13 @@ s32 e1000e_check_for_copper_link(struct e1000_hw *hw)
*/
ret_val = e1000e_phy_has_link_generic(hw, 1, 0, &link);
if (ret_val)
- return ret_val;
+ goto out;

- if (!link)
+ if (!link) {
+ mac->get_link_status = true;
return 0; /* No link detected */
+ }

- mac->get_link_status = false;

/* Check if there was DownShift, must be checked
* immediately after link-up
@@ -466,10 +468,14 @@ s32 e1000e_check_for_copper_link(struct e1000_hw *hw)
ret_val = e1000e_config_fc_after_link_up(hw);
if (ret_val) {
e_dbg("Error configuring flow control\n");
- return ret_val;
+ goto out;
}

return 1;
+
+out:
+ mac->get_link_status = true;
+ return ret_val;
}

/**
--
2.16.1

2018-02-26 16:16:01

by Alexander Duyck

[permalink] [raw]

Subject: Re: [RFC PATCH] e1000e: Fix link check race condition.

On Sun, Feb 25, 2018 at 6:31 PM, Benjamin Poirier <[email protected]> wrote:
> Alex reported the following race condition:
>
> /* link goes up... interrupt... schedule watchdog */
> \ e1000_watchdog_task
> \ e1000e_has_link
> \ hw->mac.ops.check_for_link() === e1000e_check_for_copper_link
> \ e1000e_phy_has_link_generic(..., &link)
> link = true
>
> /* link goes down... interrupt */
> \ e1000_msix_other
> hw->mac.get_link_status = true
>
> /* link is up */
> mac->get_link_status = false
>
> link_active = true
> /* link_active is true, wrongly, and stays so because
> * get_link_status is false */
>
> Avoid this problem by making sure that we don't set get_link_status = false
> after having checked the link.
>
> It seems this problem has been present since the introduction of e1000e.
>
> Link: https://lkml.org/lkml/2018/1/29/338
> Reported-by: Alexander Duyck <[email protected]>
> Signed-off-by: Benjamin Poirier <[email protected]>
> ---
> drivers/net/ethernet/intel/e1000e/ich8lan.c | 41 ++++++++++++++++-------------
> drivers/net/ethernet/intel/e1000e/mac.c | 14 +++++++---
> 2 files changed, 33 insertions(+), 22 deletions(-)
>
> diff --git a/drivers/net/ethernet/intel/e1000e/ich8lan.c b/drivers/net/ethernet/intel/e1000e/ich8lan.c
> index ff308b05d68c..3c2c4f87e075 100644
> --- a/drivers/net/ethernet/intel/e1000e/ich8lan.c
> +++ b/drivers/net/ethernet/intel/e1000e/ich8lan.c
> @@ -1386,6 +1386,7 @@ static s32 e1000_check_for_copper_link_ich8lan(struct e1000_hw *hw)
> */
> if (!mac->get_link_status)
> return 1;
> + mac->get_link_status = false;
>
> /* First we want to see if the MII Status Register reports
> * link. If so, then we want to get the current speed/duplex
> @@ -1393,12 +1394,12 @@ static s32 e1000_check_for_copper_link_ich8lan(struct e1000_hw *hw)
> */
> ret_val = e1000e_phy_has_link_generic(hw, 1, 0, &link);
> if (ret_val)
> - return ret_val;
> + goto out;
>
> if (hw->mac.type == e1000_pchlan) {
> ret_val = e1000_k1_gig_workaround_hv(hw, link);
> if (ret_val)
> - return ret_val;
> + goto out;
> }
>
> /* When connected at 10Mbps half-duplex, some parts are excessively
> @@ -1431,7 +1432,7 @@ static s32 e1000_check_for_copper_link_ich8lan(struct e1000_hw *hw)
>
> ret_val = hw->phy.ops.acquire(hw);
> if (ret_val)
> - return ret_val;
> + goto out;
>
> if (hw->mac.type == e1000_pch2lan)
> emi_addr = I82579_RX_CONFIG;
> @@ -1453,7 +1454,7 @@ static s32 e1000_check_for_copper_link_ich8lan(struct e1000_hw *hw)
> hw->phy.ops.release(hw);
>
> if (ret_val)
> - return ret_val;
> + goto out;
>
> if (hw->mac.type >= e1000_pch_spt) {
> u16 data;
> @@ -1462,14 +1463,14 @@ static s32 e1000_check_for_copper_link_ich8lan(struct e1000_hw *hw)
> if (speed == SPEED_1000) {
> ret_val = hw->phy.ops.acquire(hw);
> if (ret_val)
> - return ret_val;
> + goto out;
>
> ret_val = e1e_rphy_locked(hw,
> PHY_REG(776, 20),
> &data);
> if (ret_val) {
> hw->phy.ops.release(hw);
> - return ret_val;
> + goto out;
> }
>
> ptr_gap = (data & (0x3FF << 2)) >> 2;
> @@ -1483,18 +1484,18 @@ static s32 e1000_check_for_copper_link_ich8lan(struct e1000_hw *hw)
> }
> hw->phy.ops.release(hw);
> if (ret_val)
> - return ret_val;
> + goto out;
> } else {
> ret_val = hw->phy.ops.acquire(hw);
> if (ret_val)
> - return ret_val;
> + goto out;
>
> ret_val = e1e_wphy_locked(hw,
> PHY_REG(776, 20),
> 0xC023);
> hw->phy.ops.release(hw);
> if (ret_val)
> - return ret_val;
> + goto out;
>
> }
> }
> @@ -1521,7 +1522,7 @@ static s32 e1000_check_for_copper_link_ich8lan(struct e1000_hw *hw)
> (hw->adapter->pdev->device == E1000_DEV_ID_PCH_I218_V3)) {
> ret_val = e1000_k1_workaround_lpt_lp(hw, link);
> if (ret_val)
> - return ret_val;
> + goto out;
> }
> if (hw->mac.type >= e1000_pch_lpt) {
> /* Set platform power management values for
> @@ -1529,7 +1530,7 @@ static s32 e1000_check_for_copper_link_ich8lan(struct e1000_hw *hw)
> */
> ret_val = e1000_platform_pm_pch_lpt(hw, link);
> if (ret_val)
> - return ret_val;
> + goto out;
> }
>
> /* Clear link partner's EEE ability */
> @@ -1551,22 +1552,22 @@ static s32 e1000_check_for_copper_link_ich8lan(struct e1000_hw *hw)
> ew32(FEXTNVM6, fextnvm6);
> }
>
> - if (!link)
> + if (!link) {
> + mac->get_link_status = true;
> return 0; /* No link detected */
> -
> - mac->get_link_status = false;
> + }

If I am not mistaken this could be just another jump to the "out"
label. We should have initialized "ret_val" to 0 near the start of
this function when we made the call to phy_has_link_generic.

>
> switch (hw->mac.type) {
> case e1000_pch2lan:
> ret_val = e1000_k1_workaround_lv(hw);
> if (ret_val)
> - return ret_val;
> + goto out;
> /* fall-thru */
> case e1000_pchlan:
> if (hw->phy.type == e1000_phy_82578) {
> ret_val = e1000_link_stall_workaround_hv(hw);
> if (ret_val)
> - return ret_val;
> + goto out;
> }
>
> /* Workaround for PCHx parts in half-duplex:
> @@ -1595,7 +1596,7 @@ static s32 e1000_check_for_copper_link_ich8lan(struct e1000_hw *hw)
> if (hw->phy.type > e1000_phy_82579) {
> ret_val = e1000_set_eee_pchlan(hw);
> if (ret_val)
> - return ret_val;
> + goto out;
> }
>
> /* If we are forcing speed/duplex, then we simply return since
> @@ -1618,10 +1619,14 @@ static s32 e1000_check_for_copper_link_ich8lan(struct e1000_hw *hw)
> ret_val = e1000e_config_fc_after_link_up(hw);
> if (ret_val) {
> e_dbg("Error configuring flow control\n");
> - return ret_val;
> + goto out;
> }

Technically these changes would be a change in behavior. For these we
may just want to leave them as-is since I am not certain they would
have any actual impact on the link state other than delaying the
link-up. For example do we really care if we fail to negotiate flow
control? We may not so we might report link up and just a debug
message indicating we didn't negotiate that part of the link.

> return 1;
> +
> +out:
> + mac->get_link_status = true;
> + return ret_val;
> }
>
> static s32 e1000_get_variants_ich8lan(struct e1000_adapter *adapter)
> diff --git a/drivers/net/ethernet/intel/e1000e/mac.c b/drivers/net/ethernet/intel/e1000e/mac.c
> index db735644b312..60c8beaf5cb3 100644
> --- a/drivers/net/ethernet/intel/e1000e/mac.c
> +++ b/drivers/net/ethernet/intel/e1000e/mac.c
> @@ -427,6 +427,7 @@ s32 e1000e_check_for_copper_link(struct e1000_hw *hw)
> */
> if (!mac->get_link_status)
> return 1;
> + mac->get_link_status = false;
>
> /* First we want to see if the MII Status Register reports
> * link. If so, then we want to get the current speed/duplex
> @@ -434,12 +435,13 @@ s32 e1000e_check_for_copper_link(struct e1000_hw *hw)
> */
> ret_val = e1000e_phy_has_link_generic(hw, 1, 0, &link);
> if (ret_val)
> - return ret_val;
> + goto out;
>
> - if (!link)
> + if (!link) {
> + mac->get_link_status = true;
> return 0; /* No link detected */
> + }

Same here. We just initialized it to 0 via the call to
e1000e_phy_has_link_generic so we could just jump to "out" if the link
is not set. You could probably even combine the two conditions into
one even with a check for "ret_val || !link".

> - mac->get_link_status = false;
>
> /* Check if there was DownShift, must be checked
> * immediately after link-up
> @@ -466,10 +468,14 @@ s32 e1000e_check_for_copper_link(struct e1000_hw *hw)
> ret_val = e1000e_config_fc_after_link_up(hw);
> if (ret_val) {
> e_dbg("Error configuring flow control\n");
> - return ret_val;
> + goto out;
> }
>
> return 1;
> +
> +out:
> + mac->get_link_status = true;
> + return ret_val;
> }
>
> /**
> --
> 2.16.1
>

2018-02-28 05:20:14

by Benjamin Poirier

[permalink] [raw]

Subject: Re: [RFC PATCH] e1000e: Fix link check race condition.

On 2018/02/26 08:14, Alexander Duyck wrote:
[...]
>
> >
> > switch (hw->mac.type) {
> > case e1000_pch2lan:
> > ret_val = e1000_k1_workaround_lv(hw);
> > if (ret_val)
> > - return ret_val;
> > + goto out;
> > /* fall-thru */
> > case e1000_pchlan:
> > if (hw->phy.type == e1000_phy_82578) {
> > ret_val = e1000_link_stall_workaround_hv(hw);
> > if (ret_val)
> > - return ret_val;
> > + goto out;
> > }
> >
> > /* Workaround for PCHx parts in half-duplex:
> > @@ -1595,7 +1596,7 @@ static s32 e1000_check_for_copper_link_ich8lan(struct e1000_hw *hw)
> > if (hw->phy.type > e1000_phy_82579) {
> > ret_val = e1000_set_eee_pchlan(hw);
> > if (ret_val)
> > - return ret_val;
> > + goto out;
> > }
> >
> > /* If we are forcing speed/duplex, then we simply return since
> > @@ -1618,10 +1619,14 @@ static s32 e1000_check_for_copper_link_ich8lan(struct e1000_hw *hw)
> > ret_val = e1000e_config_fc_after_link_up(hw);
> > if (ret_val) {
> > e_dbg("Error configuring flow control\n");
> > - return ret_val;
> > + goto out;
> > }
>
> Technically these changes would be a change in behavior. For these we
> may just want to leave them as-is since I am not certain they would
> have any actual impact on the link state other than delaying the
> link-up. For example do we really care if we fail to negotiate flow
> control? We may not so we might report link up and just a debug
> message indicating we didn't negotiate that part of the link.

You're right and actually that raises yet another problem with commit
19110cfbb34d ("e1000e: Separate signaling for link check/link up").

Previously these errors which come after the "get_link_status = false"
would be ignored and the link considered up. After that commit, any
error implies that the link is down:

- link_active = !hw->mac.get_link_status;
+ link_active = ret_val > 0;

I'll send a patch to fix that problem first and then get back to this
race.