2008-10-10 16:43:31

by Nicolas Cannasse

Subject: recv() hangs until SIGCHLD ?

Hi,

We've been tracking a bug in our server application for some time now,
and now that we have isolated it we're stuck without a meaningful
explanation. Hopefully someone will be able to give us some answers.

We run a multithreaded application that uses pthreads and sockets.
One thread calls accept() and then dispatches the socket to one of the
worker threads, which processes it. A socket is therefore never used by
several threads simultaneously.

In some rare cases, one (or several) threads hang in recv().
Both lsof and ls /proc/<pid>/fd show that the socket used is in the
ESTABLISHED state, but when checking on the host it is connected to
(a MySQL DB) we can't find the corresponding client socket (as it has
already been closed on the other side).

We are using the Boehm GC which uses the signals SIGXCPU and SIGPWR to
pause+restart the threads when running a GC cycle. We are correctly
handling EINTR in send() and recv() by restarting the call in case
they get interrupted this way.
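
For reference, a minimal sketch of the kind of restart loop described
above. The helper name recv_retry is hypothetical and this is not the
application's actual code; it only assumes a POSIX system where recv()
fails with errno set to EINTR when a caught signal interrupts it.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <sys/types.h>
#include <sys/socket.h>

/* Hypothetical helper: call recv() again whenever it fails with EINTR,
 * i.e. when a caught signal interrupted it before any data arrived. */
ssize_t recv_retry(int fd, void *buf, size_t len, int flags)
{
    ssize_t n;

    assert(buf != NULL);
    do {
        n = recv(fd, buf, len, flags);
    } while (n == -1 && errno == EINTR);

    /* > 0: bytes read; 0: orderly shutdown by the peer (for len > 0);
     * -1: an error other than EINTR, with errno set by recv(). */
    return n;
}
```

A loop like this only helps if recv() actually returns; if the call is
never interrupted and no data or close notification arrives, the loop
body is simply never re-entered.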

However, when attaching GDB to our locked thread it seems that even
when the GC runs, recv() does not exit (the breakpoint after it is not
reached). If we send SIGCHLD to the hanging thread with GDB, recv()
does exit and the thread is correctly unlocked. If we don't, it will
hang forever.

Additional details: recv() is called with MSG_NOSIGNAL and we have
enabled TCP_NODELAY on the socket using setsockopt(). Some other
non-multithreaded applications use the same databases and this
behavior does not occur for them.
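
As a sketch of the socket setup mentioned above, assuming a Linux or
other POSIX system: the helper names enable_nodelay and nodelay_enabled
are hypothetical, and only the setsockopt()/getsockopt() calls with
IPPROTO_TCP and TCP_NODELAY are standard.

```c
#include <assert.h>
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>

/* Hypothetical helper: disable Nagle's algorithm on a TCP socket.
 * Returns 0 on success, -1 on error (errno set by setsockopt()). */
int enable_nodelay(int fd)
{
    int on = 1;

    assert(fd >= 0);
    return setsockopt(fd, IPPROTO_TCP, TCP_NODELAY, &on, sizeof on);
}

/* Hypothetical helper: read the option back.
 * Returns 1 if TCP_NODELAY is enabled, 0 if not, -1 on error. */
int nodelay_enabled(int fd)
{
    int val = 0;
    socklen_t len = sizeof val;

    assert(fd >= 0);
    if (getsockopt(fd, IPPROTO_TCP, TCP_NODELAY, &val, &len) == -1)
        return -1;
    return val != 0;
}
```

Note that TCP_NODELAY only affects how outgoing segments are batched,
and on Linux MSG_NOSIGNAL is documented as a send() flag that
suppresses SIGPIPE. Neither setting puts a limit on how long recv()
may block.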

Any idea how we can stop this from happening or what additional things
we can check to get more information on what's occurring?

Thanks a lot,
Nicolas


2008-10-11 04:54:28

by David Schwartz

Subject: RE: recv() hangs until SIGCHLD ?


Nicolas Cannasse wrote:

> In some rare cases, one (or several) threads hang in recv().
> Both lsof and ls /proc/<pid>/fd show that the socket used is in the
> ESTABLISHED state, but when checking on the host it is connected to
> (a MySQL DB) we can't find the corresponding client socket (as it has
> already been closed on the other side).

Blocking sockets will block until data is received. If no other thread is
sending data, this can block forever.

> We are using the Boehm GC which uses the signals SIGXCPU and SIGPWR to
> pause+restart the threads when running a GC cycle. We are correctly
> handling EINTR in send() and recv() by restarting the call in case
> they get interrupted this way.
>
> However, when attaching GDB to our locked thread it seems that even
> when the GC runs, recv() does not exit (the breakpoint after it is not
> reached). If we send SIGCHLD to the hanging thread with GDB, recv()
> does exit and the thread is correctly unlocked. If we don't, it will
> hang forever.

Why shouldn't it hang forever? What was supposed to wake it that's not?

> Any idea how we can stop this from happening or what additional things
> we can check to get more information on what's occurring?

You say a thread is hanging in receive and not returning. But you've yet to
explain why it should return. Was it interrupted by a signal? Was data
received? Is the socket non-blocking? Why isn't this expected behavior?
Blocking sockets block, full stop.

DS

2008-10-11 09:57:44

by Samuel Thibault

Subject: Re: recv() hangs until SIGCHLD ?

David Schwartz wrote, on Fri 10 Oct 2008 21:48:45 -0700:
> > We are using the Boehm GC which uses the signals SIGXCPU and SIGPWR to
> > pause+restart the threads when running a GC cycle.
> > [...]
> >
> > However, when attaching GDB to our locked thread it seems that even
> > when the GC runs, recv() does not exit (the breakpoint after it is not
> > reached).
>
> But you've yet to explain why it should return. Was it interrupted by
> a signal?

See quote above.

> Was data received?

No

> Is the socket non-blocking?

No

> Why isn't this expected behavior? Blocking sockets block, full stop.

But using a signal is a common technique for getting it unblocked,
since SUS says:

"[EINTR] The recv() function was interrupted by a signal that was
caught, before any data was available."

Samuel