Date: Sat, 7 Jun 2008 00:12:55 +0300 (EEST)
From: Ilpo Järvinen
To: Patrick McManus, Arjan van de Ven
Cc: Ingo Molnar, David Miller, peterz@infradead.org, LKML, Netdev, rjw@sisk.pl, Andrew Morton, johnpol@2ka.mipt.ru
Subject: Re: [fixed] [patch] Re: [bug] stuck localhost TCP connections, v2.6.26-rc3+

...added Arjan.

On Fri, 6 Jun 2008, Patrick McManus wrote:

> This is all a bit confusing, but here are the conclusions I have drawn.

Your observations here match what I've understood :-).

> There definitely is a problem with the locking of the DA commit
> ec3c0982a2dd1e671bad8e9d26c28dcba0039d87.
> That code was part of 26-rc1
> but it never appeared in 25. It exists in pretty much the same form in
> rc5 (there was 1 patch to it over that time to fix a different problem).
>
> We're certain this code has a problem with the accept queue both because
> of code inspection and the fact that Ingo can back it out (as the
> significant part of the 3-patch revert) and the problem goes away in his
> testing.

The problems were at least these:
- Accept queue addition was racy and could leave dangling items
- Dangling items caused an inconsistent sk_ack_backlog
- Checking for still being in LISTEN state was racy; the state could
  change after the check was made (shouldn't happen with distcc though)

I didn't read ->sk_data_ready that carefully; it could have some
additional problems that are not listed (but they all should be fixed by
the added locking anyway).

AFAICT, the rest of that ec3c change is safe wrt. locking; just holding sk
is enough for the rest, and those bits mostly shouldn't be executed with
a distcc setup anyway.

> I have run tests that can reproduce the hung socket with distcc over
> localhost using 26-rc5. I can also apparently cure it using the locking
> fix patch Ilpo sent (c9454f0..d21d2b9) on top of that. (My test of rc5
> +lockpatch is at 4.5+ hrs and counting without failures, it fails 6
> times an hour with vanilla rc5)
>
> Based on all of that, the right thing to do seems to be to apply the
> lockpatch (c9454f0..d21d2b9) to Linus's tree and not revert anything -
> just fix the code and I'll send Ilpo and Ingo cookies at Christmas time
> for being great guys. Alternatively, Ingo could run the distcc servers
> and clients on -tip with the lockpatch (nothing reverted) for more
> testing.

Anyway, we would still have the option to revert both the DA change + the
locking fix later if the problem is still clearly more likely than with
stable-2.6.25.

> The only lingering problem is Ingo's report yesterday
> http://marc.info/?l=linux-netdev&m=121267587715976&w=2
> of a distcc hang.
> In this one it was not over localhost and the distcc
> server had the ec3c DA changes totally reverted. (The server is really
> the only stack that matters in this case - the client is not impacted by
> the DA changes).

It definitely didn't fit the picture that well if we were talking about
just a single bug here. ...I wish Ingo had provided the receiver state
already then. :-)

> This has to be a different issue, because the ec3c code
> we're talking about here wasn't on the server at all. As Ilpo mentions,
> Hakon is believed to have a different problem and maybe you've tripped
> over that too?

...Håkon's case is definitely a different thing. The symptoms are also
quite different, because there's no deadlock at all but the TCP flow
eventually dies; I don't yet know on what timescale that dying happens.
The only common denominator actually was this receiver process missing,
though it provably still was there. Besides, I don't know how long Ingo
waited in this case before concluding that the TCP was stuck again?

> If we're sure of that conclusion we should just take Ilpo's DA patch as
> that will narrow the field for finding Hakon's issue. It's just with all
> of these data points I'm not sure if I'm reaching the right conclusion.

Let's widen the scope to two to three bugs then, one down already...

In case you missed it, btw, Arjan also reported some problem quite early,
but in his case claws mua+imap was the workload, so I doubt that
DEFER_ACCEPT would be involved, but who knows without strace. See here:

http://marc.info/?l=linux-kernel&m=121182171000434&w=2

Arjan, can you please check if your workload uses setsockopt
TCP_DEFER_ACCEPT for the LISTENing socket? ...If not, then your case is
different from Ingo's.

--
 i.
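
For anyone wanting to check the TCP_DEFER_ACCEPT question from user space, here is a minimal sketch. It assumes Linux and Python's standard socket module; make_listener and uses_defer_accept are hypothetical helper names for illustration, not anything taken from the patches discussed above. It can only inspect a socket that the process itself holds, so for a third-party workload, watching its setsockopt calls with strace is probably still the more practical route.

```python
import socket


def make_listener(defer_secs=None, host="127.0.0.1"):
    """Return a listening TCP socket on an ephemeral port.

    If defer_secs is given, TCP_DEFER_ACCEPT is enabled (Linux only), so
    accept() is not woken until data arrives on the new connection or
    the deferral period runs out.
    """
    srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    srv.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    if defer_secs is not None:
        # The kernel may round this value to its retransmit schedule.
        srv.setsockopt(socket.IPPROTO_TCP, socket.TCP_DEFER_ACCEPT, defer_secs)
    srv.bind((host, 0))
    srv.listen(16)
    return srv


def uses_defer_accept(sock):
    """Report whether TCP_DEFER_ACCEPT is currently set on this socket."""
    return sock.getsockopt(socket.IPPROTO_TCP, socket.TCP_DEFER_ACCEPT) != 0
```

A listener created with make_listener(defer_secs=5) should report the option as set, while one created with no argument should not. Only the former kind of socket would exercise the DEFER_ACCEPT code path discussed in this thread.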