Date: Tue, 17 Jun 2008 10:32:20 +0200
From: Ingo Molnar
To: David Miller
Cc: kuznet@ms2.inr.ac.ru, vgusev@openvz.org, mcmanus@ducksong.com,
    xemul@openvz.org, netdev@vger.kernel.org, ilpo.jarvinen@helsinki.fi,
    linux-kernel@vger.kernel.org, e1000-devel@lists.sourceforge.net,
    "Rafael J. Wysocki"
Subject: Re: [TCP]: TCP_DEFER_ACCEPT causes leak sockets
Message-ID: <20080617083220.GA11393@elte.hu>
References: <20080613114746.GA27811@elte.hu>
    <20080616.165900.189566405.davem@davemloft.net>
    <20080617072658.GA12535@elte.hu>
    <20080617.003832.130616157.davem@davemloft.net>
    <20080617080958.GC12535@elte.hu>
In-Reply-To: <20080617080958.GC12535@elte.hu>

* Ingo Molnar wrote:

>
> > FWIW I don't think your TX timeout problem has anything to do with
> > packet ordering. The TX element of the network device is totally
> > stateless, but it's hanging under some set of circumstances to the
> > point where we timeout and reset the hardware to get it going again.
>
> ok. That's e1000 then.
> Cc:s added. Stock T60 laptop, 32-bit:
>
> 02:00.0 Ethernet controller: Intel Corporation 82573L Gigabit Ethernet Controller
>         Subsystem: Lenovo ThinkPad T60
>         Flags: bus master, fast devsel, latency 0, IRQ 16
>         Memory at ee000000 (32-bit, non-prefetchable) [size=128K]
>         I/O ports at 2000 [size=32]
>         Capabilities:
>         Kernel driver in use: e1000
>
> the problem is this non-fatal warning showing up after bootup,
> sporadically, in a non-reproducible way:
>
> [ 173.354049] NETDEV WATCHDOG: eth0: transmit timed out
> [ 173.354148] ------------[ cut here ]------------
> [ 173.354221] WARNING: at net/sched/sch_generic.c:222 dev_watchdog+0x9a/0xec()
> [ 173.354298] Modules linked in:
> [ 173.354421] Pid: 13452, comm: cc1 Tainted: G W 2.6.26-rc6-00273-g81ae43a-dirty #2573
> [ 173.354516] [] warn_on_slowpath+0x46/0x76
> [ 173.354641] [] ? try_to_wake_up+0x1d6/0x1e0
> [ 173.354815] [] ? trace_hardirqs_off+0xb/0xd
> [ 173.357370] [] ? default_wake_function+0xb/0xd
> [ 173.357370] [] ? trace_hardirqs_off_caller+0x15/0xc9
> [ 173.357370] [] ? trace_hardirqs_off+0xb/0xd
> [ 173.357370] [] ? trace_hardirqs_on+0xb/0xd
> [ 173.357370] [] ? trace_hardirqs_on_caller+0x16/0x15b
> [ 173.357370] [] ? trace_hardirqs_on+0xb/0xd
> [ 173.357370] [] ? _spin_unlock_irqrestore+0x5b/0x71
> [ 173.357370] [] ? __queue_work+0x2d/0x32
> [ 173.357370] [] ? queue_work+0x50/0x72
> [ 173.357483] [] ? schedule_work+0x14/0x16
> [ 173.357654] [] dev_watchdog+0x9a/0xec
> [ 173.357783] [] run_timer_softirq+0x13d/0x19d
> [ 173.357905] [] ? dev_watchdog+0x0/0xec
> [ 173.358073] [] ? dev_watchdog+0x0/0xec
> [ 173.360804] [] __do_softirq+0xb2/0x15c
> [ 173.360804] [] ? __do_softirq+0x0/0x15c
> [ 173.360804] [] do_softirq+0x84/0xe9
> [ 173.360804] [] irq_exit+0x4b/0x88
> [ 173.360804] [] smp_apic_timer_interrupt+0x73/0x81
> [ 173.360804] [] apic_timer_interrupt+0x2d/0x34
> [ 173.360804] =======================
> [ 173.360804] ---[ end trace a7919e7f17c0a725 ]---
>
> full report can be found at:
>
>   http://lkml.org/lkml/2008/6/13/224
>
> i have 3 other test-systems with e1000 (with a similar CPU) which are
> _not_ showing this symptom, so this could be some model-specific e1000
> issue.

btw., this reminds me that this is the same system that has a serious
e1000 network latency bug which i have reported more than a year ago,
but which still does not appear to be fixed in latest mainline:

 PING europe (10.0.1.15) 56(84) bytes of data.
 64 bytes from europe (10.0.1.15): icmp_seq=1 ttl=64 time=1.51 ms
 64 bytes from europe (10.0.1.15): icmp_seq=2 ttl=64 time=404 ms
 64 bytes from europe (10.0.1.15): icmp_seq=3 ttl=64 time=487 ms
 64 bytes from europe (10.0.1.15): icmp_seq=4 ttl=64 time=296 ms
 64 bytes from europe (10.0.1.15): icmp_seq=5 ttl=64 time=305 ms
 64 bytes from europe (10.0.1.15): icmp_seq=6 ttl=64 time=1011 ms
 64 bytes from europe (10.0.1.15): icmp_seq=7 ttl=64 time=0.209 ms
 64 bytes from europe (10.0.1.15): icmp_seq=8 ttl=64 time=763 ms
 64 bytes from europe (10.0.1.15): icmp_seq=9 ttl=64 time=1000 ms
 64 bytes from europe (10.0.1.15): icmp_seq=10 ttl=64 time=0.438 ms
 64 bytes from europe (10.0.1.15): icmp_seq=11 ttl=64 time=1000 ms
 64 bytes from europe (10.0.1.15): icmp_seq=12 ttl=64 time=0.299 ms
 ^C
 --- europe ping statistics ---
 12 packets transmitted, 12 received, 0% packet loss, time 11085ms

those up to 1000 msec delays can be 'felt' via ssh too, if this problem
triggers then the system is almost unusable via the network. Local
latencies are perfect so it's an e1000 problem.
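A minimal sketch for putting numbers on the ping output above. It is not part of the original report: the function names `parse_rtts` and `summarize` and the 100 ms "slow" cutoff are assumptions chosen for illustration, and the sample text is copied from the ping lines in this mail.

```python
import re
import statistics

# Ping reply lines copied from the report above.
PING_OUTPUT = """\
64 bytes from europe (10.0.1.15): icmp_seq=1 ttl=64 time=1.51 ms
64 bytes from europe (10.0.1.15): icmp_seq=2 ttl=64 time=404 ms
64 bytes from europe (10.0.1.15): icmp_seq=3 ttl=64 time=487 ms
64 bytes from europe (10.0.1.15): icmp_seq=4 ttl=64 time=296 ms
64 bytes from europe (10.0.1.15): icmp_seq=5 ttl=64 time=305 ms
64 bytes from europe (10.0.1.15): icmp_seq=6 ttl=64 time=1011 ms
64 bytes from europe (10.0.1.15): icmp_seq=7 ttl=64 time=0.209 ms
64 bytes from europe (10.0.1.15): icmp_seq=8 ttl=64 time=763 ms
64 bytes from europe (10.0.1.15): icmp_seq=9 ttl=64 time=1000 ms
64 bytes from europe (10.0.1.15): icmp_seq=10 ttl=64 time=0.438 ms
64 bytes from europe (10.0.1.15): icmp_seq=11 ttl=64 time=1000 ms
64 bytes from europe (10.0.1.15): icmp_seq=12 ttl=64 time=0.299 ms
"""


def parse_rtts(text):
    """Return the round-trip times (in ms) found in ping reply lines."""
    return [float(value) for value in re.findall(r"time=([\d.]+) ms", text)]


def summarize(rtts, slow_ms=100.0):
    """Summarize RTTs; slow_ms is an arbitrary cutoff, not a kernel threshold."""
    return {
        "count": len(rtts),
        "min": min(rtts),
        "median": statistics.median(rtts),
        "max": max(rtts),
        "slow": sum(1 for rtt in rtts if rtt > slow_ms),
    }


if __name__ == "__main__":
    print(summarize(parse_rtts(PING_OUTPUT)))
```

With the sample above, replies split into two groups: a handful near or below 1.5 ms and the rest in the hundreds of milliseconds, which is the pattern described as being 'felt' over ssh.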
	Ingo