2000-12-11 05:33:08

by Denis Perchine

[permalink] [raw]
Subject: Bad behavior of recv on already closed sockets.

Hello,

there was a small thread in PostgreSQL list about behavior of recv syscall on
Linux 2.2.16. Does anyone have similar expirience? Why such things can
happend? And how to workaround/fix this?

---------- Forwarded Message ----------
Subject: [HACKERS] Strange behavior of PostgreSQL on Linux
Date: Sun, 10 Dec 2000 12:51:46 +0600
From: Denis Perchine <[email protected]>
To: [email protected]


Hello,

I have quite strange problem. It's lsof -i -n -P
postmaste 20018 postgres 6u IPv4 12241380 TCP
127.0.0.1:5432->127.0.0.1:6651 (CLOSE)

And there is no pair for it.

Backtrace of the backend shows:
(gdb) bt
#0 0x40198ba2 in recv () from /lib/libc.so.6
#1 0x80ab2c5 in pq_recvbuf ()
#2 0x80ab395 in pq_getbytes ()
#3 0x80eaa7c in SocketBackend ()
#4 0x80eab13 in ReadCommand ()
#5 0x80ebb9c in PostgresMain ()
#6 0x80d69a2 in DoBackend ()
#7 0x80d6581 in BackendStartup ()
#8 0x80d593a in ServerLoop ()
#9 0x80d53c4 in PostmasterMain ()
#10 0x80abbb6 in main ()
#11 0x400fe9cb in __libc_start_main () at ../sysdeps/generic/libc-start.c:122

[root@mx /root]# ps axwfl|grep 20018
040 507 20018 21602 0 0 15292 0 tcp_re SW pts/0 0:00 \_
[postmaster]

[root@mx /root]# uname -a
Linux mx.xxx.com 2.2.16-3 #1 Mon Jun 19 19:11:44 EDT 2000 i686 unknown

Looks like it tries to read on socket which is already closed from other
side. And it seems like recv did not return in this case. Is this OK, or
kernel bug?

On the other side I see entries like this:
httpd 4260 root 4u IPv4 12173018 TCP
127.0.0.1:3994->127.0.0.1:5432 (CLOSE_WAIT)

And again. There is no any corresponding postmaster process. Does anyone has
such expirience before? And what can be the reason of such strange things.

--
Sincerely Yours,
Denis Perchine

----------------------------------
E-Mail: [email protected]
HomePage: http://www.perchine.com/dyp/
FidoNet: 2:5000/120.5
----------------------------------

-------------------------------------------------------


---------- Forwarded Message ----------
Subject: Re: [HACKERS] Strange behavior of PostgreSQL on Linux
Date: Sun, 10 Dec 2000 13:18:17 -0500
From: Tom Lane <[email protected]>
To: Denis Perchine <[email protected]>


Denis Perchine <[email protected]> writes:
> Looks like it tries to read on socket which is already closed from other
> side. And it seems like recv did not return in this case. Is this OK, or
> kernel bug?

Sounds like a kernel bug --- recv() should *always* return immediately
if the socket is known closed. I'd think the kernel didn't believe the
socket was closed, if not for your lsof evidence. That's certainly
pointing a finger at the kernel...

We've heard (undetailed) reports before of backends hanging around when
the client was long gone. I always assumed that the client machine had
failed to disconnect properly, but now I wonder. A kernel bug might
explain those reports.

regards, tom lane

-------------------------------------------------------

--
Sincerely Yours,
Denis Perchine

----------------------------------
E-Mail: [email protected]
HomePage: http://www.perchine.com/dyp/
FidoNet: 2:5000/120.5
----------------------------------


2000-12-11 20:06:36

by Alexey Kuznetsov

[permalink] [raw]
Subject: Re: Bad behavior of recv on already closed sockets.

Hello!

> Looks like it tries to read on socket which is already closed from other
> side. And it seems like recv did not return in this case. Is this OK, or
> kernel bug?

This smells like an unknown bug in kernel.

It is unknown, hence there is no workaround (but upgrading to 2.4).

It would be better to understand the issue f.e. trying to restore
the history of this descriptor.


> On the other side I see entries like this:
> httpd 4260 root 4u IPv4 12173018 TCP
> 127.0.0.1:3994->127.0.0.1:5432 (CLOSE_WAIT)
>
> And again. There is no any corresponding postmaster process. Does anyone has
> such expirience before? And what can be the reason of such strange things.

And this is bug in the application, which forgot to close file.
Descriptor leakage in httpd or it is blocked at some another job.

But remembering about the first case, I am not so sure.
What does httpd make this time?

Alexey

2000-12-12 13:08:41

by Denis Perchine

[permalink] [raw]
Subject: Re: Bad behavior of recv on already closed sockets.

Hi,

> > Looks like it tries to read on socket which is already closed from other
> > side. And it seems like recv did not return in this case. Is this OK, or
> > kernel bug?
>
> This smells like an unknown bug in kernel.
>
> It is unknown, hence there is no workaround (but upgrading to 2.4).
>
> It would be better to understand the issue f.e. trying to restore
> the history of this descriptor.

How to do this? I mean what should I do to provide you with more information?

> > On the other side I see entries like this:
> > httpd 4260 root 4u IPv4 12173018 TCP
> > 127.0.0.1:3994->127.0.0.1:5432 (CLOSE_WAIT)
> >
> > And again. There is no any corresponding postmaster process. Does anyone
> > has such expirience before? And what can be the reason of such strange
> > things.
>
> And this is bug in the application, which forgot to close file.
> Descriptor leakage in httpd or it is blocked at some another job.
>
> But remembering about the first case, I am not so sure.
> What does httpd make this time?

Hmmm... It's like this. There is an apache with mod_perl. Actually this
connection should be closed, if there was a failure, but somehow it is not.
And possibly it is a fd in DBD::Pg code (or libpq code). I will think how to
check this...

--
Sincerely Yours,
Denis Perchine

----------------------------------
E-Mail: [email protected]
HomePage: http://www.perchine.com/dyp/
FidoNet: 2:5000/120.5
----------------------------------

2000-12-12 20:25:02

by Alexey Kuznetsov

[permalink] [raw]
Subject: Re: Bad behavior of recv on already closed sockets.

Hello!

> > It would be better to understand the issue f.e. trying to restore
> > the history of this descriptor.
>
> How to do this? I mean what should I do to provide you with more information?

I do not know exactly. It depends on curcumstances, frequency
of the stalls and... your luck. 8)

Usually in such cases I ask to start from gathering single huge
binary tcpdump (tcpdump -i lo -b ip -w big.dump port 3994)
until the problem happens. Then you cut of it the problematic session
with tcpdump again (tcpdump -n -vvv -r big.dump port 2994 and port 5432).
Then we will have at least picture at network side.

If after this we will not understand what happened then tracing
at application level. It is more complicated and depends mostly
on particular application. strace is too promiscuous as rule
and does not work well with intensively forking applications.

Alexey