Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1752028AbbBUQWc (ORCPT ); Sat, 21 Feb 2015 11:22:32 -0500 Received: from mail-pa0-f49.google.com ([209.85.220.49]:42448 "EHLO mail-pa0-f49.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1751867AbbBUQW3 (ORCPT ); Sat, 21 Feb 2015 11:22:29 -0500 Message-ID: <1424535746.5565.42.camel@edumazet-glaptop2.roam.corp.google.com> Subject: Re: 1e918876 breaks r8169 (linux-3.18+) From: Eric Dumazet To: Florian Westphal Cc: Tomas Szepe , Francois Romieu , Hayes Wang , Eric Dumazet , Tom Herbert , "David S. Miller" , Marco Berizzi , linux-kernel@vger.kernel.org, netdev Date: Sat, 21 Feb 2015 08:22:26 -0800 In-Reply-To: <20150221103104.GA26574@breakpoint.cc> References: <20150203100816.GA5807@louise.pinerecords.com> <20150203104214.GG24751@breakpoint.cc> <20150210154536.GB16264@breakpoint.cc> <20150221101512.GB17223@louise.pinerecords.com> <20150221103104.GA26574@breakpoint.cc> Content-Type: text/plain; charset="UTF-8" X-Mailer: Evolution 3.10.4-0ubuntu2 Mime-Version: 1.0 Content-Transfer-Encoding: 7bit Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Content-Length: 2827 Lines: 84 On Sat, 2015-02-21 at 11:31 +0100, Florian Westphal wrote: > Tomas Szepe wrote: > > > I tried to reproduce this without success so far on my RTL8168d/8111d device. > > > I've been running 40 parallel netperf TCP_STREAM tests (1gbit) for the > > > last 5 hours and so far I saw no watchdog tx timeouts. > > > > > > I'll keep this running for a day or so to see if it just takes more time > > > to trigger. > > > > So, how's this coming along? Don't you think the patch should be reverted > > until the problem is diagnosed/understood/fixed? > > Sorry. > > David, please consider reverting > > 1e918876853aa85435e0f17fd8b4a92dcfff53d6 > (r8169: add support for Byte Queue Limits) > > and > > 0bec3b700d106a8b0a34227b2976d1a582f1aab7 > (r8169: add support for xmit_more) > > I cannot reproduce any hangs (tried for 2days with 40 parallel > netperfs using both 100mbit and 1gbit receiver). > > And I don't see anything wrong with the change either. > Seems like some revisions of the HW are just dodgy? > > I hate giving up, but I have no means to diagnose this any further. > Even reporter says it doesn't affect all of his r8169 nics. > > So I think the change is correct per se, but might be revealing some > HW/firmware bug. Hold on. I believe there is one race in the way you access skb->xmit_more _after_ txd->opts1 = cpu_to_le32(status); After this point, TX might have completed and TX completion already have freed skb Could Tomas try following fix ? diff --git a/drivers/net/ethernet/realtek/r8169.c b/drivers/net/ethernet/realtek/r8169.c index ad0020af2193..f2764366a36c 100644 --- a/drivers/net/ethernet/realtek/r8169.c +++ b/drivers/net/ethernet/realtek/r8169.c @@ -7050,6 +7050,7 @@ static netdev_tx_t rtl8169_start_xmit(struct sk_buff *skb, u32 opts[2]; int frags; bool stop_queue; + bool xmit_more; if (unlikely(!TX_FRAGS_READY_FOR(tp, skb_shinfo(skb)->nr_frags))) { netif_err(tp, drv, dev, "BUG! Tx Ring full when queue awake!\n"); @@ -7091,7 +7092,7 @@ static netdev_tx_t rtl8169_start_xmit(struct sk_buff *skb, txd->opts2 = cpu_to_le32(opts[1]); netdev_sent_queue(dev, skb->len); - + xmit_more = skb->xmit_more; skb_tx_timestamp(skb); /* Force memory writes to complete before releasing descriptor */ @@ -7108,7 +7109,7 @@ static netdev_tx_t rtl8169_start_xmit(struct sk_buff *skb, stop_queue = !TX_FRAGS_READY_FOR(tp, MAX_SKB_FRAGS); - if (!skb->xmit_more || stop_queue || + if (!xmit_more || stop_queue || netif_xmit_stopped(netdev_get_tx_queue(dev, 0))) { RTL_W8(TxPoll, NPQ); -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/