From: "David Schwartz" <davids@webmaster.com>
To: <swivel@shells.gnugeneration.com>, <ncannasse@motion-twin.com>
Cc: <linux-kernel@vger.kernel.org>
Subject: RE: poll() blocked / packets not received ?
Date: Mon, 20 Oct 2008 08:53:10 -0700
Message-ID: <MDEHLPKNGKAHNMBLJOLKAEJCAIAD.davids@webmaster.com>
MIME-Version: 1.0
Content-Type: text/plain;
	charset="iso-8859-1"
Content-Transfer-Encoding: 7bit
In-Reply-To: <48FC7BEE.1020701@motion-twin.com>
Importance: Normal
Reply-To: davids@webmaster.com
Sender: linux-kernel-owner@vger.kernel.org
Content-Length: 2050
Lines: 52


Nick Cannasse wrote:

> Ok, funny thing is that we just found what is occurring...
>
> We had a process that was on a regular basis doing the following :
>
> conntrack -F
>
> This was done in order to prevent the table to grow too big, because we
> were reaching the maximum size as told by :
>
> /proc/sys/net/ipv4/netfilter/ip_conntrack_max
>    and
> /proc/sys/net/ipv4/netfilter/ip_conntrack_count
>
> Seems like when there are active connections, this will break netfilter
> and stop delivering packets to the socket.
>
> At least I will have nice sleep tonight.

Note that this solved your symptom, not your problem. You actually have two
problems:

1) You rely on TCP to detect a lost connection, even on a side that never
transmits any data. TCP simply does not do this. If you are not trying
to send data, you have no assurance that a lost connection will ever be
detected. (You need either a timeout or to send or dribble some data,
depending on the protocol.)

2) You hold a lock on a shared resource while you wait for a reply over a
network. If this is a low-level "block and wait indefinitely" lock, many
threads will line up behind a single slow or stuck thread. The right fix
depends on your circumstances, but you need a synchronization primitive
that suits the job. (You need to be able to use multiple connections or
defer operations without tying up a thread.)
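
For point 1, here is a minimal sketch of the timeout approach, in Python for
brevity. The helper name wait_for_reply and the 4096-byte read size are made
up for illustration and are not taken from your code. It assumes a Linux-like
system where poll() is available. The idea is only that the wait is bounded,
so a silent peer produces a timeout you can act on instead of an indefinite
block.

```python
import select
import socket


def wait_for_reply(sock, timeout_s):
    """Wait up to timeout_s seconds for data on sock.

    Returns the received bytes, b"" if the peer closed the connection,
    or None if nothing arrived before the deadline.
    """
    poller = select.poll()
    poller.register(sock, select.POLLIN)
    # poll() takes milliseconds; an empty result means the deadline passed.
    events = poller.poll(int(timeout_s * 1000))
    if not events:
        return None
    return sock.recv(4096)


if __name__ == "__main__":
    # A local socket pair stands in for the real connection (assumption).
    ours, peer = socket.socketpair()
    print(wait_for_reply(ours, 0.1))  # peer is silent, so this times out
    peer.sendall(b"pong")
    print(wait_for_reply(ours, 0.1))
    ours.close()
    peer.close()
```

On a timeout the caller can close the connection and reconnect, or send an
application-level ping if the protocol has one. Which is right depends on
the protocol.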

With both of these bugs, you are vulnerable to precisely the scenario you
observed. The TCP connection close packets were lost (in this case due to
premature expiration of the connection tracking, but other things can do
it, such as the server rebooting). TCP could not detect the lost connection
because you never sent any data, so one thread blocked forever and the
other threads got in line behind it.
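
To illustrate the locking side, here is a rough Python sketch of a bounded
lock wait. SharedConnection, do_request and lock_timeout_s are hypothetical
names for illustration. I do not know how your code is structured, so treat
this as one possible shape and not the fix. Callers give up after a deadline
instead of queueing forever behind a stuck holder.

```python
import threading


class SharedConnection:
    """Hypothetical wrapper around one connection shared by many threads."""

    def __init__(self, do_request):
        self._lock = threading.Lock()
        # do_request is an assumed callable that performs the network
        # round trip; it should itself be bounded by a timeout.
        self._do_request = do_request

    def request(self, payload, lock_timeout_s):
        # Bounded wait: if another thread is stuck holding the lock,
        # fail fast instead of lining up behind it indefinitely.
        if not self._lock.acquire(timeout=lock_timeout_s):
            raise TimeoutError("connection busy")
        try:
            return self._do_request(payload)
        finally:
            self._lock.release()
```

A caller that gets TimeoutError can retry on another connection or defer
the operation. A bounded lock wait only limits the damage, so the round
trip made while holding the lock still needs its own timeout.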

DS

