Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1756404AbYAPFC3 (ORCPT ); Wed, 16 Jan 2008 00:02:29 -0500 Received: (majordomo@vger.kernel.org) by vger.kernel.org id S1751351AbYAPFCQ (ORCPT ); Wed, 16 Jan 2008 00:02:16 -0500 Received: from 74-93-104-97-Washington.hfc.comcastbusiness.net ([74.93.104.97]:33649 "EHLO sunset.davemloft.net" rhost-flags-OK-FAIL-OK-OK) by vger.kernel.org with ESMTP id S1750910AbYAPFCP (ORCPT ); Wed, 16 Jan 2008 00:02:15 -0500 Date: Tue, 15 Jan 2008 21:02:14 -0800 (PST) Message-Id: <20080115.210214.170759690.davem@davemloft.net> To: jesse.brandeburg@intel.com Cc: slavon@bigtelecom.ru, elendil@planet.nl, netdev@vger.kernel.org, linux-kernel@vger.kernel.org Subject: Re: [REGRESSION] 2.6.24-rc7: e1000: Detected Tx Unit Hang From: David Miller In-Reply-To: <36D9DB17C6DE9E40B059440DB8D95F52042FA541@orsmsx418.amr.corp.intel.com> References: <200801151504.10047.elendil@planet.nl> <20080115190458.rxt3yhb2o8o404kc@mail.bigtelecom.ru> <36D9DB17C6DE9E40B059440DB8D95F52042FA541@orsmsx418.amr.corp.intel.com> X-Mailer: Mew version 5.2 on Emacs 22.1 / Mule 5.0 (SAKAKI) Mime-Version: 1.0 Content-Type: Text/Plain; charset=us-ascii Content-Transfer-Encoding: 7bit Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Content-Length: 6249 Lines: 160 From: "Brandeburg, Jesse" Date: Tue, 15 Jan 2008 13:53:43 -0800 > The tx code has an "early exit" that tries to limit the amount of tx > packets handled in a single poll loop and requires napi or interrupt > rescheduling based on the return value from e1000_clean_tx_irq. That explains everything, thanks Jesse. Ok, here is the patch I'll propose to fix this. The goal is to make it as simple as possible without regressing the thing we were trying to fix. Something more sophisticated can be done later. Three of the 5 Intel drivers had the TX breakout logic. e1000, e1000e, and ixgbe. e100 and ixgb did not, so they don't have any problems we need to fix here. What the fix does is behave as if the budget was fully consumed if *_clean_tx_irq() returns true. The only valid way to return from ->poll() without copleting the NAPI poll is by returning work_done == budget. That signals to the caller that the NAPI instance has not been descheduled and therefore the caller fully owns the NAPI context. This does mean that for these drivers any time TX work is done, we'll loop at least one extra time in the ->poll() loop of net_rx_work() but that is historically what these drivers have caused to happen for years. For 2.6.25 or similar I would suggest investigating courses of action to bring closure and consistency to this: 1) Determine whether the loop breakout is actually necessary. Jesse explained to me that they had seen a case where a thread on one cpu feeding the TX ring could keep a thread on another cpu constantly running the *_clean_tx_irq() code in a loop. I find this hard to believe since even the slowest CPU should be able to free up TX entries faster than they can be transmitted on gigabit links :-) 2) If the investigation in #1 deems the breakout logic is necessary, then consistently amongst all the 5 drivers a policy should be implemented which is integrated with the NAPI budgetting logic. For example, the simplest thing to do is to pass the budget and the "work_done" thing down into *_clean_tx_irq() and break out if it is exceeded. As a further refinement we can say that TX work is about 1/4 the expense of RX work and adjust the budget checking logic to match that. [NET]: Fix TX timeout regression in Intel drivers. This fixes a regression added by changeset 53e52c729cc169db82a6105fac7a166e10c2ec36 ("[NET]: Make ->poll() breakout consistent in Intel ethernet drivers.") As pointed out by Jesse Brandeburg, for three of the drivers edited above there is breakout logic in the *_clean_tx_irq() code to prevent running TX reclaim forever. If this occurs, we have to elide NAPI poll completion or else those TX events will never be serviced. Signed-off-by: David S. Miller diff --git a/drivers/net/e1000/e1000_main.c b/drivers/net/e1000/e1000_main.c index 13d57b0..0c9a6f7 100644 --- a/drivers/net/e1000/e1000_main.c +++ b/drivers/net/e1000/e1000_main.c @@ -3919,7 +3919,7 @@ e1000_clean(struct napi_struct *napi, int budget) { struct e1000_adapter *adapter = container_of(napi, struct e1000_adapter, napi); struct net_device *poll_dev = adapter->netdev; - int work_done = 0; + int tx_cleaned = 0, work_done = 0; /* Must NOT use netdev_priv macro here. */ adapter = poll_dev->priv; @@ -3929,14 +3929,17 @@ e1000_clean(struct napi_struct *napi, int budget) * simultaneously. A failure obtaining the lock means * tx_ring[0] is currently being cleaned anyway. */ if (spin_trylock(&adapter->tx_queue_lock)) { - e1000_clean_tx_irq(adapter, - &adapter->tx_ring[0]); + tx_cleaned = e1000_clean_tx_irq(adapter, + &adapter->tx_ring[0]); spin_unlock(&adapter->tx_queue_lock); } adapter->clean_rx(adapter, &adapter->rx_ring[0], &work_done, budget); + if (tx_cleaned) + work_done = budget; + /* If budget not fully consumed, exit the polling mode */ if (work_done < budget) { if (likely(adapter->itr_setting & 3)) diff --git a/drivers/net/e1000e/netdev.c b/drivers/net/e1000e/netdev.c index 4a6fc74..2ab3bfb 100644 --- a/drivers/net/e1000e/netdev.c +++ b/drivers/net/e1000e/netdev.c @@ -1384,7 +1384,7 @@ static int e1000_clean(struct napi_struct *napi, int budget) { struct e1000_adapter *adapter = container_of(napi, struct e1000_adapter, napi); struct net_device *poll_dev = adapter->netdev; - int work_done = 0; + int tx_cleaned = 0, work_done = 0; /* Must NOT use netdev_priv macro here. */ adapter = poll_dev->priv; @@ -1394,12 +1394,15 @@ static int e1000_clean(struct napi_struct *napi, int budget) * simultaneously. A failure obtaining the lock means * tx_ring is currently being cleaned anyway. */ if (spin_trylock(&adapter->tx_queue_lock)) { - e1000_clean_tx_irq(adapter); + tx_cleaned = e1000_clean_tx_irq(adapter); spin_unlock(&adapter->tx_queue_lock); } adapter->clean_rx(adapter, &work_done, budget); + if (tx_cleaned) + work_done = budget; + /* If budget not fully consumed, exit the polling mode */ if (work_done < budget) { if (adapter->itr_setting & 3) diff --git a/drivers/net/ixgbe/ixgbe_main.c b/drivers/net/ixgbe/ixgbe_main.c index a564916..de3f45e 100644 --- a/drivers/net/ixgbe/ixgbe_main.c +++ b/drivers/net/ixgbe/ixgbe_main.c @@ -1468,13 +1468,16 @@ static int ixgbe_clean(struct napi_struct *napi, int budget) struct ixgbe_adapter *adapter = container_of(napi, struct ixgbe_adapter, napi); struct net_device *netdev = adapter->netdev; - int work_done = 0; + int tx_cleaned = 0, work_done = 0; /* In non-MSIX case, there is no multi-Tx/Rx queue */ - ixgbe_clean_tx_irq(adapter, adapter->tx_ring); + tx_cleaned = ixgbe_clean_tx_irq(adapter, adapter->tx_ring); ixgbe_clean_rx_irq(adapter, &adapter->rx_ring[0], &work_done, budget); + if (tx_cleaned) + work_done = budget; + /* If budget not fully consumed, exit the polling mode */ if (work_done < budget) { netif_rx_complete(netdev, napi); -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/