2006-01-09 00:02:25

by Davide Libenzi

[permalink] [raw]
Subject: [PATCH/RFC] POLLHUP tinkering ...


Some time ago there was a discussion about adding a simple flag to the
Linux poll subsystem, to signal half-closed links during read operations and
epoll:

http://lkml.org/lkml/2003/7/12/116

The problem for users of EPOLLET is that they rely on a shorter-than-buffersize
read to detect the end of the currant payload, and go back to wait for events
again. What they do is basically (in a very simplified way), once EPOLLIN is
received:

do {
n = read(fd, buf, bufsiz);
...
} while (n == bufsiz);

But if and hangup happened with some data (data + FIN), they won't receive any
more events for the Linux poll subsystem (and epoll, when using the event
triggered interface), so they are forced to issue an extra read() after the
loop to detect the EOF condition. Besides from the extra read() overhead,
the code does not come exactly pretty. During the last few months I
received quite a few messages from dudes asking me why, after a
substantial agreement, that kind of patch has never been implemented.
After looking at the code yesterday, I realized that there is not need of
extra flags, and a correct POLLHUP reporting from f_op->poll() can do it.
Linux almost everywhere reports the POLLHUP in a way that is useful for
EPOLLET users, with only few exceptions. In those few places the full
SHUTDOWN_MASK is expected in order to report POLLHUP, whereas the simple
RCV_SHUTDOWN would be sufficent to detect a shutdown peer. When
RCV_SHUTDOWN is set, the outbound data channel is definitely shutdown, so
POLLHUP is deserved in such condition. Returning POLLHUP upon RCV_SHUTDOWN
would automatically solve the above problem, and won't conflict with the
SUS definition of POLLHUP:

[POLLHUP]
The device has been disconnected. This event and POLLOUT are mutually
exclusive; a stream can never be writable if a hangup has occurred.
However, this event and POLLIN, POLLRDNORM, POLLRDBAND or POLLPRI are not
mutually exclusive. This flag is only valid in the revents bitmask; it is
ignored in the events member.

With that in place, EPOLLET users can simply check for the EPOLLHUP flag
returned, and yank the session w/out extra cumbersome code.
The attached patch is against 2.6.15-git4. Comments?



- Davide



net/bluetooth/af_bluetooth.c | 5 ++---
net/core/datagram.c | 5 ++---
net/dccp/proto.c | 4 ++--
net/ipv4/tcp.c | 4 ++--
net/sctp/socket.c | 5 ++---
net/unix/af_unix.c | 5 ++---
6 files changed, 12 insertions(+), 16 deletions(-)


Attachments:
pollhup-2.6.15.git4-0.1.diff (4.12 kB)

2006-01-09 00:08:16

by David Miller

[permalink] [raw]
Subject: Re: [PATCH/RFC] POLLHUP tinkering ...

From: Davide Libenzi <[email protected]>
Date: Sun, 8 Jan 2006 16:02:10 -0800 (PST)

> But if and hangup happened with some data (data + FIN), they won't
> receive any more events for the Linux poll subsystem (and epoll,
> when using the event triggered interface), so they are forced to
> issue an extra read() after the loop to detect the EOF
> condition. Besides from the extra read() overhead, the code does not
> come exactly pretty.

The extra last read is always necessary, it's an error synchronization
barrier. Did you know that?

If a partial read or write hits an error, the successful amount of
bytes read or written before the error occurred is returned. Then any
subsequent read or write will report the error immediately.

2006-01-09 00:11:14

by Davide Libenzi

[permalink] [raw]
Subject: Re: [PATCH/RFC] POLLHUP tinkering ...

On Sun, 8 Jan 2006, David S. Miller wrote:

> From: Davide Libenzi <[email protected]>
> Date: Sun, 8 Jan 2006 16:02:10 -0800 (PST)
>
>> But if and hangup happened with some data (data + FIN), they won't
>> receive any more events for the Linux poll subsystem (and epoll,
>> when using the event triggered interface), so they are forced to
>> issue an extra read() after the loop to detect the EOF
>> condition. Besides from the extra read() overhead, the code does not
>> come exactly pretty.
>
> The extra last read is always necessary, it's an error synchronization
> barrier. Did you know that?
>
> If a partial read or write hits an error, the successful amount of
> bytes read or written before the error occurred is returned. Then any
> subsequent read or write will report the error immediately.

Sorry for the missing info, but I was clearly talking about O_NONBLOCK
here.



- Davide


2006-01-09 00:20:58

by David Miller

[permalink] [raw]
Subject: Re: [PATCH/RFC] POLLHUP tinkering ...

From: Davide Libenzi <[email protected]>
Date: Sun, 8 Jan 2006 16:11:10 -0800 (PST)

> On Sun, 8 Jan 2006, David S. Miller wrote:
>
> > The extra last read is always necessary, it's an error synchronization
> > barrier. Did you know that?
> >
> > If a partial read or write hits an error, the successful amount of
> > bytes read or written before the error occurred is returned. Then any
> > subsequent read or write will report the error immediately.
>
> Sorry for the missing info, but I was clearly talking about O_NONBLOCK
> here.

What I said still applies to O_NONBLOCK.

2006-01-09 00:35:38

by Davide Libenzi

[permalink] [raw]
Subject: Re: [PATCH/RFC] POLLHUP tinkering ...

On Sun, 8 Jan 2006, David S. Miller wrote:

> From: Davide Libenzi <[email protected]>
> Date: Sun, 8 Jan 2006 16:11:10 -0800 (PST)
>
>> On Sun, 8 Jan 2006, David S. Miller wrote:
>>
>>> The extra last read is always necessary, it's an error synchronization
>>> barrier. Did you know that?
>>>
>>> If a partial read or write hits an error, the successful amount of
>>> bytes read or written before the error occurred is returned. Then any
>>> subsequent read or write will report the error immediately.
>>
>> Sorry for the missing info, but I was clearly talking about O_NONBLOCK
>> here.
>
> What I said still applies to O_NONBLOCK.

I thought you said in _not_ necessary, sorry. The extra read() for error
discovery is just bogus, w/out proper Linux poll reporting. The epoll
interface will have wait queue heads dropped inside the monitored devices
wait queue, so I assume that an error condition would trigger a wakeup ->
epoll event. If this is not true (but I'm pretty much sure it is), look at
the extra read() for error reporting:

1) Good

read_loop();
--> Error happen on device
if (read() == ERROR)
gotcha();

2)

read_loop();
if (read() == ERROR)
whoops();
--> Error happen on device




- Davide


2006-01-09 03:14:04

by David Schwartz

[permalink] [raw]
Subject: RE: [PATCH/RFC] POLLHUP tinkering ...


> From: Davide Libenzi <[email protected]>
> Date: Sun, 8 Jan 2006 16:02:10 -0800 (PST)

> > But if and hangup happened with some data (data + FIN), they won't
> > receive any more events for the Linux poll subsystem (and epoll,
> > when using the event triggered interface), so they are forced to
> > issue an extra read() after the loop to detect the EOF
> > condition. Besides from the extra read() overhead, the code does not
> > come exactly pretty.

> The extra last read is always necessary, it's an error synchronization
> barrier. Did you know that?

If there is an error, an error event must be returned. An edge-triggered
interface must report every event that occurs with an indication of that
type.

> If a partial read or write hits an error, the successful amount of
> bytes read or written before the error occurred is returned. Then any
> subsequent read or write will report the error immediately.

If the connection closes and the edge-triggered interface does not give a
HUP indication, then it is broken.

A HUP is not a read event and signalling a read is not sufficient.

DS


2006-01-09 04:55:56

by Davide Libenzi

[permalink] [raw]
Subject: RE: [PATCH/RFC] POLLHUP tinkering ...

On Sun, 8 Jan 2006, David Schwartz wrote:

>
>> From: Davide Libenzi <[email protected]>
>> Date: Sun, 8 Jan 2006 16:02:10 -0800 (PST)
>
>>> But if and hangup happened with some data (data + FIN), they won't
>>> receive any more events for the Linux poll subsystem (and epoll,
>>> when using the event triggered interface), so they are forced to
>>> issue an extra read() after the loop to detect the EOF
>>> condition. Besides from the extra read() overhead, the code does not
>>> come exactly pretty.
>
>> The extra last read is always necessary, it's an error synchronization
>> barrier. Did you know that?
>
> If there is an error, an error event must be returned. An edge-triggered
> interface must report every event that occurs with an indication of that
> type.

Yes, that's the case.



>> If a partial read or write hits an error, the successful amount of
>> bytes read or written before the error occurred is returned. Then any
>> subsequent read or write will report the error immediately.
>
> If the connection closes and the edge-triggered interface does not give a
> HUP indication, then it is broken.

Same as above. I think DaveM was thinking at the classical poll/select
usage scenario, where the wait queue head stays in the device's wait queue
only during the poll/select system call. With epoll, they're resident and
always collection events through wakeups (callbacks in the epoll case)
done by the device on its poll wait queue.



- Davide