From: swivel@shells.gnugeneration.com
Date: Mon, 20 Oct 2008 06:39:42 -0500
To: Nicolas Cannasse <ncannasse@motion-twin.com>
Cc: linux-kernel@vger.kernel.org
Subject: Re: poll() blocked / packets not received ?
Message-ID: <20081020113942.GJ2811@fc6222126.aspadmin.net>
References: <48FC4066.9060303@motion-twin.com> <20081020101549.GH2811@fc6222126.aspadmin.net> <48FC61A0.7010003@motion-twin.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <48FC61A0.7010003@motion-twin.com>
User-Agent: Mutt/1.5.13 (2006-08-11)
Sender: linux-kernel-owner@vger.kernel.org
Content-Length: 2802
Lines: 64

On Mon, Oct 20, 2008 at 12:46:56PM +0200, Nicolas Cannasse wrote:
> >>We have Shorewall installed and enabled, but what seems strange is that 
> >>the problem depends on multithreading. It also occurs much more often on 
> >>the 4 core machines than on a 2 core ones (both with Hyperthreading 
> >>activated). We're using kernel 2.6.20-15-server (#2 SMP) provided by 
> >>Ubuntu.
> >>
> >>Any tip on how we could fix that or investigate further would be 
> >>appreciated. After one month of debugging we're really out of solution 
> >>now.
> >>
> >>Best,
> >>Nicolas
> >
> >Your usage pattern is a very common one, I highly doubt you are
> >experiencing a kernel bug here or many people (including myself) would
> >be complaining.
> >
> >Shorewall sounds like it might be suspect, are FIN's not coming in when the
> >remote closes?  You can look in the output of netstat to see what state the
> >TCP is in, still ESTABLISHED?
> 
> Yes, it's still ESTABLISHED, but we can't see the corresponding 
> connection on the other machine while running netstat. I'm not a TCP 
> expert, so I'm not sure in which case this can occur.

If the end that's blocking still has the TCP in ESTABLISHED state, and
the other end doesn't have the TCP at all... you've already identified
why the one end is still ESTABLISHED.  ESTABLISHED state won't be left
until the FIN is received from the other end, at which point it enters
CLOSE_WAIT state.

When the other end of the TCP is _gone_, that leads me to believe a FIN
will not be coming, hence the indefinite ESTABLISHED state.  Why it's
gone is a different question; maybe your problem is at the other end?
The end initiating a shutdown has to enter FIN_WAIT_1 then FIN_WAIT_2,
and those transitions require the other side to at least receive the
FIN and ACK it, which takes it out of ESTABLISHED.
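
To make the state argument concrete, here is a rough sketch of just the
transitions I'm talking about, written in Python for brevity.  It is a
simplified subset of the RFC 793 state machine, not the kernel's
implementation, and the event names (recv_fin, close, recv_ack_of_fin)
are made up for illustration.

```python
# Simplified subset of the TCP state machine (RFC 793).
# Event names are hypothetical labels chosen for this sketch.
TRANSITIONS = {
    ("ESTABLISHED", "recv_fin"): "CLOSE_WAIT",        # peer closed first
    ("ESTABLISHED", "close"): "FIN_WAIT_1",           # we closed first, FIN sent
    ("FIN_WAIT_1", "recv_ack_of_fin"): "FIN_WAIT_2",  # peer ACKed our FIN
    ("FIN_WAIT_2", "recv_fin"): "TIME_WAIT",          # peer sent its own FIN
    ("CLOSE_WAIT", "close"): "LAST_ACK",              # we close after the peer did
    ("LAST_ACK", "recv_ack_of_fin"): "CLOSED",
}


def step(state, event):
    """Return the next state; events with no matching transition are ignored."""
    return TRANSITIONS.get((state, event), state)


def run(state, events):
    """Apply a sequence of events to a starting state."""
    for event in events:
        state = step(state, event)
    return state
```

With no events arriving from the peer, run("ESTABLISHED", []) just stays
in ESTABLISHED, which is the situation netstat is showing you.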

> 
> I agree with your comment in general, except that we have been running 
> the same application in single-thread environment for years without 
> running into this very specific problem.
> 

Perhaps when you run in multicore/threaded you are stressing the network
stacks at both ends more, including everything in-between?  The
correlation with threading is probably coincidental rather than causal.

What is the protocol?  Are there any timeouts to take care of these
situations?  Do you schedule an alarm or use SO_RCVTIMEO to shut down
dead connections and free up consumed threads?
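
In case it helps, a minimal sketch of the SO_RCVTIMEO approach, again in
Python for brevity.  The helper names set_recv_timeout and recv_or_none
are my own, and packing the timeout as two native longs assumes the
usual Linux struct timeval layout; check that against your platform.

```python
import errno
import socket
import struct


def set_recv_timeout(sock, seconds):
    """Set SO_RCVTIMEO so a blocking recv() gives up after `seconds`.

    Assumption: struct timeval is two native longs (tv_sec, tv_usec).
    """
    sec = int(seconds)
    usec = int((seconds - sec) * 1_000_000)
    timeval = struct.pack("ll", sec, usec)
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_RCVTIMEO, timeval)


def recv_or_none(sock, nbytes):
    """Return received bytes, or None if the receive timeout expired."""
    try:
        return sock.recv(nbytes)
    except OSError as exc:
        # An expired SO_RCVTIMEO surfaces as EAGAIN/EWOULDBLOCK.
        if exc.errno in (errno.EAGAIN, errno.EWOULDBLOCK):
            return None
        raise
```

A worker thread that gets None back can close the socket and return to
the pool instead of sitting in the read forever.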

Because TCP is reliable, a read on an idle connection can block
indefinitely.  You can employ TCP keepalive (SO_KEEPALIVE) to change
indefinite to quite a long time (a bit over two hours with the Linux
defaults).
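
A sketch of turning keepalive on per socket with shorter timers, in
Python for brevity.  The TCP_KEEPIDLE, TCP_KEEPINTVL and TCP_KEEPCNT
options are Linux-specific; the function name enable_keepalive and the
60/10/5 values are only examples I picked, not a recommendation.

```python
import socket


def enable_keepalive(sock, idle=60, interval=10, count=5):
    """Enable TCP keepalive on `sock` with per-socket timers (Linux).

    Returns the rough number of seconds before a dead peer is noticed:
    `idle` seconds of silence, then `count` unanswered probes sent
    `interval` seconds apart.
    """
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPIDLE, idle)
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPINTVL, interval)
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPCNT, count)
    return idle + interval * count
```

Once the probes go unanswered the kernel drops the connection and the
blocked read returns with an error, so the thread gets freed.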

Regards,
Vito Caputo