2004-04-01 16:54:47

by Ben Mansell

Subject: epoll reporting events when it hasn't been asked to

Hi,

epoll seems to behave oddly with TCP sockets that return ECONNRESET on a
read. Here's an abridged strace snippet:
(my strace doesn't cope very well with epoll syscalls, so I've mixed in
some of the program's debug output as well)

epoll_create(0x400) = 5

...

socket(PF_INET, SOCK_STREAM, IPPROTO_TCP) = 10
setsockopt(10, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
setsockopt(10, SOL_SOCKET, SO_KEEPALIVE, [1], 4) = 0
bind(10, {sa_family=AF_INET, sin_port=htons(2342), sin_addr=inet_addr("10.100.1.208")}, 16) = 0

EPollMultiplexer::Add( fd 10 events 1 )
epoll_ctl(0x5, 0x1, 0xa, 0xbffff010) = 0

EPollMultiplexer::Poll()
epoll_wait(0x5, 0x831b390, 0x20, 0x3e8) = 1
fd 10 has events 1

accept(10, {sa_family=AF_INET, sin_port=htons(47255), sin_addr=inet_addr("10.100.1.208")}, [16]) = 7

EPollMultiplexer::Add( fd 7 events 1 )
epoll_ctl(0x5, 0x1, 0x7, 0xbffff5c0) = 0

socket(PF_INET, SOCK_STREAM, IPPROTO_TCP) = 11
connect(11, {sa_family=AF_INET, sin_port=htons(21), sin_addr=inet_addr("10.100.1.5")}, 16) = -1 EINPROGRESS (Operation now in progress)

EPollMultiplexer::Add( fd 11 events 1 )
epoll_ctl(0x5, 0x1, 0xb, 0xbffff3f0) = 0

EPollMultiplexer::Poll()
epoll_wait(0x5, 0x831b390, 0x20, 0x3e8) = 1
fd 11 has events 1

EPollMultiplexer::Poll()
epoll_wait(0x5, 0x831b390, 0x20, 0x3e8) = 2
fd 10 has events 1
fd 7 has events 25

read(7, 0x8336240, 4096) = -1 ECONNRESET (Connection reset by peer)

EPollMultiplexer::Poll()
Setting events on fd 7 to 0
epoll_ctl(0x5, 0x3, 0x7, 0x8301550) = 0
epoll_wait(0x5, 0x831b390, 0x20, 0x3e8) = 2
fd 11 has events 1
fd 7 has events 16


This is odd. The epoll_ctl() got rid of all events for FD 7, yet the
epoll_wait() following it returns event 16 (EPOLLHUP). Is this a bug?
I don't normally see this behaviour for most connections; is it perhaps
to do with the read returning ECONNRESET?

(This is on 2.6.2-rc1-mm1 and on 2.6.5-rc3-mm3, on x86 and x86_64)


Ben


2004-04-01 17:51:09

by Davide Libenzi

Subject: Re: epoll reporting events when it hasn't been asked to

On Thu, 1 Apr 2004, Ben wrote:

> This is odd. The epoll_ctl() got rid of all events for FD 7, yet the
> epoll_wait() following it returns event 16 (EPOLLHUP). Is this a bug?
> I don't normally see this behaviour for most connections, is it perhaps
> to do with the read returning ECONNRESET ?

It is a feature. epoll ORs the user's events with POLLHUP|POLLERR so that,
even if the user sets the event mask to zero, the application can still
find out when one of those abnormal conditions has happened. What problem
do you see with this?
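
A minimal C sketch of the behaviour described above, assuming a Linux system with epoll. The function name is hypothetical, and a pipe stands in for the TCP socket from the original report: the read end is registered with an empty event mask, the write end is closed, and epoll_wait() still reports a hangup.

```c
#include <stdint.h>
#include <sys/epoll.h>
#include <unistd.h>

/* Registers the read end of a pipe with an EMPTY event mask, closes the
 * write end to provoke a hangup, and returns the event bits that
 * epoll_wait() reports anyway (0 if nothing is reported or on error). */
uint32_t events_reported_with_empty_mask(void)
{
    int p[2];
    uint32_t got = 0;

    if (pipe(p) < 0)
        return 0;
    int epfd = epoll_create(1);
    if (epfd < 0) {
        close(p[0]);
        close(p[1]);
        return 0;
    }

    struct epoll_event ev = { .events = 0 };   /* ask for nothing at all */
    ev.data.fd = p[0];
    if (epoll_ctl(epfd, EPOLL_CTL_ADD, p[0], &ev) == 0) {
        close(p[1]);            /* last writer gone: hangup on read end */
        p[1] = -1;
        struct epoll_event out;
        if (epoll_wait(epfd, &out, 1, 1000) == 1)
            got = out.events;
    }
    if (p[1] >= 0)
        close(p[1]);
    close(p[0]);
    close(epfd);
    return got;
}
```

On kernels that behave as described in this thread, the returned mask should include EPOLLHUP even though no events were requested.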



- Davide


2004-04-01 18:25:52

by Ben Mansell

Subject: Re: epoll reporting events when it hasn't been asked to

> It is a feature. epoll ORs the user's events with POLLHUP|POLLERR so that,
> even if the user sets the event mask to zero, the application can still
> find out when one of those abnormal conditions has happened. What problem
> do you see with this?

What should the application do if it gets events that it didn't ask for?
If you choose to ignore them, the next time epoll_wait() is called it
will return instantly with these same messages, so the app will spin and
eat CPU.

The alternative is to put some kind of sanity-check wrapper around
epoll_wait() calls, and match the output against what the app asked for.
If epoll starts returning events that the app doesn't want, the only
option is to get heavy-handed and shut epoll up with EPOLL_CTL_DEL on
the file descriptor. But this seems like fighting against the OS.

Perhaps it should only OR the user event with POLLHUP|POLLERR if
POLLIN or POLLOUT is set?


Ben

2004-04-01 19:29:58

by Davide Libenzi

Subject: Re: epoll reporting events when it hasn't been asked to

On Thu, 1 Apr 2004, Ben Mansell wrote:

> > It is a feature. epoll ORs the user's events with POLLHUP|POLLERR so that,
> > even if the user sets the event mask to zero, the application can still
> > find out when one of those abnormal conditions has happened. What problem
> > do you see with this?
>
> What should the application do if it gets events that it didn't ask for?
> If you choose to ignore them, the next time epoll_wait() is called it
> will return instantly with these same messages, so the app will spin and
> eat CPU.

Shouldn't the application handle those exceptional conditions instead of
ignoring them?



> Perhaps it should only OR the user event with POLLHUP|POLLERR if
> POLLIN or POLLOUT is set?

This can certainly be done, since it's a one-liner fix. I'm not sure if it
is the correct behaviour. Anyone else?



- Davide


2004-04-01 23:29:43

by Steven Dake

Subject: Re: epoll reporting events when it hasn't been asked to

On Thu, 2004-04-01 at 12:28, Davide Libenzi wrote:
> On Thu, 1 Apr 2004, Ben Mansell wrote:
>
> > > It is a feature. epoll ORs the user's events with POLLHUP|POLLERR so that,
> > > even if the user sets the event mask to zero, the application can still
> > > find out when one of those abnormal conditions has happened. What problem
> > > do you see with this?
> >
> > What should the application do if it gets events that it didn't ask for?
> > If you choose to ignore them, the next time epoll_wait() is called it
> > will return instantly with these same messages, so the app will spin and
> > eat CPU.
>
> Shouldn't the application handle those exceptional conditions instead of
> ignoring them?
>
>

If an exception occurs (for example, a socket is disconnected), the socket
should be removed from the fd list. There is really no point in passing
in an excepted fd.

epoll works just like poll, and follows the expected SUS behavior, in this
regard.

Thanks
-steve


2004-04-02 09:04:29

by Ben Mansell

Subject: Re: epoll reporting events when it hasn't been asked to

On Thu, 1 Apr 2004, Steven Dake wrote:

> On Thu, 2004-04-01 at 12:28, Davide Libenzi wrote:
> > On Thu, 1 Apr 2004, Ben Mansell wrote:
> >
> > > > It is a feature. epoll ORs the user's events with POLLHUP|POLLERR so that,
> > > > even if the user sets the event mask to zero, the application can still
> > > > find out when one of those abnormal conditions has happened. What problem
> > > > do you see with this?
> > >
> > > What should the application do if it gets events that it didn't ask for?
> > > If you choose to ignore them, the next time epoll_wait() is called it
> > > will return instantly with these same messages, so the app will spin and
> > > eat CPU.
> >
> > Shouldn't the application handle those exceptional conditions instead of
> > ignoring them?
>
> If an exception occurs (for example, a socket is disconnected), the socket
> should be removed from the fd list. There is really no point in passing
> in an excepted fd.

Is there any difference, speed-wise, between turning off all events to
listen to with EPOLL_CTL_MOD, and removing the file descriptor with
EPOLL_CTL_DEL? I had vaguely assumed that the former would be faster
(especially if you might later want to resume listening for events),
although that was just a guess.


Ben

2004-04-02 15:22:36

by Davide Libenzi

Subject: Re: epoll reporting events when it hasn't been asked to

On Fri, 2 Apr 2004, Ben Mansell wrote:

> > If an exception occurs (for example, a socket is disconnected), the socket
> > should be removed from the fd list. There is really no point in passing
> > in an excepted fd.
>
> Is there any difference, speed-wise, between turning off all events to
> listen to with EPOLL_CTL_MOD, and removing the file descriptor with
> EPOLL_CTL_DEL? I had vaguely assumed that the former would be faster
> (especially if you might later want to resume listening for events),
> although that was just a guess.

It is faster. OTOH nothing prevents you from using your current method. You
only have to handle exceptional conditions instead of ignoring them, for
example by removing the fd from the epoll set and
unregistering/freeing the associated data structures. IMO we can leave the
current behaviour, but if someone sees huge problems with this, the fix is
a one-liner.



- Davide


2004-04-02 18:40:55

by Jamie Lokier

Subject: Is POLLHUP an input-only or bidirectional condition? (was: epoll reporting events when it hasn't been asked to)

[New thread because I want people who understand POLLHUP to clarify.
The parent thread's question was: why does epoll always report POLLHUP
and POLLERR conditions even when the program didn't ask for those.
The trivial answer is because that's what poll() does.]

Davide Libenzi wrote:
> Handling by, for example, removing the fd from the epoll set and
> unregistering/freeing the associated data structures. IMO we can leave the
> current behaviour, but if someone sees huge problems with this, the fix is
> a one-liner.

None of select, poll or epoll allow a program to ignore POLLERR while
checking POLLIN or POLLOUT. So at least epoll is consistent with the
other two.

It is possible to ignore POLLHUP conditions with select(), but not
poll() or epoll. For sockets at least, POLLHUP should indicate the
socket is fully closed, so that reading and writing will both fail.
Thus it makes sense that POLLHUP is not ignorable, although curiously
select() only treats POLLHUP as an _input_ condition, so it won't wake
something that's waiting only for output readiness. poll() will
always wake even if you're only waiting for POLLOUT.

POLLERR is set by UDP sockets with a pending error condition, and that
will be reported whether you read or write to the socket (except in
some perverse conditions where MSG_MORE has been used - then app state
machines could get confused). So it's appropriate for a POLLIN or
POLLOUT waiter to be woken when there's a POLLERR condition.

Summary: epoll is consistent with poll(). I'm not sure why poll() and
select() treat POLLHUP differently. A poll() for POLLOUT will be
woken by a POLLHUP condition, yet a select() for output will _not_ be
woken by a POLLHUP condition.

Perhaps that indicates some confusion over what POLLHUP is supposed to
mean, and when it should be set by devices and/or sockets: is it for
input hangup conditions that allow further output, or for total hangup
conditions where input and output are both guaranteed to fail?

If it's the latter, as it seems to be for sockets, then the poll() and
epoll behaviour makes sense, but select() doesn't. If it's the
former, then the select() behaviour is the only one that makes sense.

Hence my question: does anyone know for sure which POLLHUP behaviour
is correct and sensible?

-- Jamie


2004-04-03 12:19:36

by Richard Kettlewell

Subject: Re: Is POLLHUP an input-only or bidirectional condition?

Jamie Lokier writes:
> Perhaps that indicates some confusion over what POLLHUP is supposed to
> mean, and when it should be set by devices and/or sockets: is it for
> input hangup conditions that allow further output, or for total hangup
> conditions where input and output are both guaranteed to fail?

I spent some time a while ago looking into how various platforms treat
POLLHUP. It's rather random...
http://www.greenend.org.uk/rjk/2001/06/poll.html

--
http://www.greenend.org.uk/rjk/

2004-04-03 21:44:08

by Davide Libenzi

Subject: Re: Is POLLHUP an input-only or bidirectional condition? (was: epoll reporting events when it hasn't been asked to)

On Fri, 2 Apr 2004, Jamie Lokier wrote:

> None of select, poll or epoll allow a program to ignore POLLERR while
> checking POLLIN or POLLOUT. So at least epoll is consistent with the
> other two.
>
> It is possible to ignore POLLHUP conditions with select(), but not
> poll() or epoll. For sockets at least, POLLHUP should indicate the
> socket is fully closed, so that reading and writing will both fail.
> Thus it makes sense that POLLHUP is not ignorable, although curiously
> select() only treats POLLHUP as an _input_ condition, so it won't wake
> something that's waiting only for output readiness. poll() will
> always wake even if you're only waiting for POLLOUT.
>
> POLLERR is set by UDP sockets with a pending error condition, and that
> will be reported whether you read or write to the socket (except in
> some perverse conditions where MSG_MORE has been used - then app state
> machines could get confused). So it's appropriate for a POLLIN or
> POLLOUT waiter to be woken when there's a POLLERR condition.
>
> Summary: epoll is consistent with poll(). I'm not sure why poll() and
> select() treat POLLHUP differently. A poll() for POLLOUT will be
> woken by a POLLHUP condition, yet a select() for output will _not_ be
> woken by a POLLHUP condition.

The issue here was a little bit different. Unlike with poll(2), with
epoll an fd is resident inside the interest set, and epoll allows you to
set its interest event mask to 0. In that case, epoll still reports
POLLHUP and POLLERR events for the 0-masked fd, and this was Ben's
original argument. Looking at poll(2) though, it seems that it does the
same thing if you set the event mask to 0. So epoll is consistent with
poll(2) in this. I personally believe that an application should handle
those exceptional events in any case, by simply removing the fd from the
epoll set (and lazily freeing the associated userspace data structures).
So, if no strong arguments come up against this, I'd prefer to keep the
current behaviour. OTOH the patch would be trivial (one or two lines), so
there would be no design problems in doing this.



- Davide


2004-04-03 22:35:56

by Jamie Lokier

Subject: Re: Is POLLHUP an input-only or bidirectional condition? (was: epoll reporting events when it hasn't been asked to)

Davide Libenzi wrote:
> Looking at poll(2) though, it seems that it does the same thing if
> you set the event mask to 0. So epoll is coherent with poll(2) in this.

Yes. SUSv3 says POLLHUP, POLLERR and POLLNVAL are always reported
even if not requested.
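
A small C sketch of that poll() rule, assuming a POSIX system (the function names are hypothetical). It polls with an empty events mask and returns whatever revents comes back, once for a pipe whose writer has gone away and once for a descriptor number that is no longer open.

```c
#include <poll.h>
#include <unistd.h>

/* Polls fd with an empty events mask and a zero timeout.
 * Returns revents, or -1 if poll() itself fails. */
int revents_with_empty_mask(int fd)
{
    struct pollfd pfd = { .fd = fd, .events = 0, .revents = 0 };

    if (poll(&pfd, 1, 0) < 0)
        return -1;
    return pfd.revents;
}

/* revents for the read end of a pipe after its only writer was closed. */
int revents_after_hangup(void)
{
    int p[2];

    if (pipe(p) < 0)
        return -1;
    close(p[1]);
    int r = revents_with_empty_mask(p[0]);
    close(p[0]);
    return r;
}

/* revents for a descriptor number that has already been closed. */
int revents_for_closed_fd(void)
{
    int p[2];

    if (pipe(p) < 0)
        return -1;
    close(p[0]);
    close(p[1]);
    return revents_with_empty_mask(p[0]);
}
```

On Linux the first case should come back with POLLHUP set and the second with POLLNVAL, although nothing was requested in events.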

> I personally believe that an application should handle those
> exceptional events in any case, by simply removing the fd from the
> epoll set (and lazily freeing the associated userspace data structures).

Take a look at the new subject line :)

Linux select() treats it as an input-only condition, implying that
there might be useful things you can do with output to a file
descriptor that's reporting POLLHUP, including waiting for output.

However, SUSv3 says "This event [POLLHUP] and POLLOUT are mutually
exclusive; a stream can never be writable if a hangup has occurred",
implying that Linux select() is the oddity.

> So, if no big arguments come up against this, I'd rather prefer to keep
> such behaviour. OTOH the patch would be trivial (one or two lines), so
> there will be no design problems in doing this.

I agree; in fact I'd argue specifically against changing it.

Programmers familiar with poll() know that you don't have to set
POLLHUP in the input mask -- because SUSv3 says so ("This flag
[POLLHUP] is only valid in the revents bitmask; it is ignored in the
events member"). They'd not be likely to notice a difference that
subtle for epoll, when they convert application code, so it's good
that there isn't a difference.

Btw, I notice epoll never reports POLLNVAL. Is that correct?

-- Jamie

2004-04-04 01:28:40

by Davide Libenzi

Subject: Re: Is POLLHUP an input-only or bidirectional condition? (was: epoll reporting events when it hasn't been asked to)

On Sat, 3 Apr 2004, Jamie Lokier wrote:

> Btw, I notice epoll never reports POLLNVAL. Is that correct?

Yep, epoll does not allow you to push an invalid/unopen file descriptor
inside the set. So you get an EBADF from epoll_ctl().
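
A short C sketch of that check, assuming Linux (the function name is hypothetical). It tries to add a descriptor number that has already been closed and returns the errno that epoll_ctl() sets.

```c
#include <errno.h>
#include <sys/epoll.h>
#include <unistd.h>

/* Tries to EPOLL_CTL_ADD a descriptor that is no longer open.
 * Returns the errno from epoll_ctl(), 0 if the call unexpectedly
 * succeeds, or -1 if the setup fails. */
int errno_for_closed_fd(void)
{
    int p[2];

    if (pipe(p) < 0)
        return -1;
    int epfd = epoll_create(1);     /* created while p[0] is still open, */
    if (epfd < 0) {                 /* so it cannot reuse that number    */
        close(p[0]);
        close(p[1]);
        return -1;
    }
    int stale = p[0];
    close(p[0]);
    close(p[1]);

    struct epoll_event ev = { .events = EPOLLIN };
    ev.data.fd = stale;
    int err = 0;
    if (epoll_ctl(epfd, EPOLL_CTL_ADD, stale, &ev) < 0)
        err = errno;
    close(epfd);
    return err;
}
```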



- Davide


2004-04-04 02:09:07

by Jamie Lokier

Subject: Re: Is POLLHUP an input-only or bidirectional condition? (was: epoll reporting events when it hasn't been asked to)

Davide Libenzi wrote:
> > Btw, I notice epoll never reports POLLNVAL. Is that correct?
>
> Yep, epoll does not allow you to push an invalid/unopen file descriptor
> inside the set. So you get an EBADF from epoll_ctl().

A comment in eventpoll.c says:

* This semaphore is acquired by ep_free() during the epoll file
* cleanup path and it is also acquired by eventpoll_release()
* if a file has been pushed inside an epoll set and it is then
* close()d without a previous call to epoll_ctl(EPOLL_CTL_DEL).

I.e. implying that the final close() is possible while it's registered.
(Btw, a function called eventpoll_release() doesn't exist).

What happens when a file descriptor is closed while it is inside the set?

I guess it's simply dropped from the set, is that right?

-- Jamie

2004-04-04 02:49:08

by Davide Libenzi

Subject: Re: Is POLLHUP an input-only or bidirectional condition? (was: epoll reporting events when it hasn't been asked to)

On Sun, 4 Apr 2004, Jamie Lokier wrote:

> A comment in eventpoll.c says:
>
> * This semaphore is acquired by ep_free() during the epoll file
> * cleanup path and it is also acquired by eventpoll_release()
> * if a file has been pushed inside an epoll set and it is then
> * close()d without a previous call to epoll_ctl(EPOLL_CTL_DEL).
>
> I.e. implying that the final close() is possible while it's registered.
> (Btw, a function called eventpoll_release() doesn't exist).

Woops, you're right. The function is inside include/linux/eventpoll.h
because it has been split into an inline to handle the fast path, plus the
slow path eventpoll_release_file(). I'll send a patch to Andrew to fix the
comments.


> What happens when a file descriptor is closed while it is inside the set?
>
> I guess it's simply dropped from the set, is that right?

Yes, it is automatically removed from the epoll set, if and only if the
reference count on the underlying file* goes to zero.
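
A C sketch of that rule, assuming Linux (the function name and the keep_dup parameter are hypothetical). A readable pipe end is registered for EPOLLIN and then closed. With no other descriptor referring to the same open file, the entry disappears from the set. If a dup() keeps the file alive, the entry stays and keeps reporting.

```c
#include <sys/epoll.h>
#include <unistd.h>

/* Registers a readable pipe end, closes that descriptor, and returns how
 * many events epoll_wait() then reports (-1 on setup failure).
 * keep_dup != 0: a dup() keeps the underlying file open after close(). */
int ready_after_close(int keep_dup)
{
    int p[2], n = -1, extra = -1;

    if (pipe(p) < 0)
        return -1;
    int epfd = epoll_create(1);
    if (epfd < 0) {
        close(p[0]);
        close(p[1]);
        return -1;
    }

    struct epoll_event ev = { .events = EPOLLIN }, out;
    ev.data.fd = p[0];
    if (write(p[1], "x", 1) == 1 &&
        epoll_ctl(epfd, EPOLL_CTL_ADD, p[0], &ev) == 0) {
        if (keep_dup)
            extra = dup(p[0]);  /* second reference to the same file */
        close(p[0]);            /* close WITHOUT EPOLL_CTL_DEL */
        p[0] = -1;
        n = epoll_wait(epfd, &out, 1, 0);
    }
    if (p[0] >= 0)
        close(p[0]);
    if (extra >= 0)
        close(extra);
    close(p[1]);
    close(epfd);
    return n;
}
```

Calling ready_after_close(0) should report no events, while ready_after_close(1) should still report one, because the duplicate keeps the file's reference count above zero.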


- Davide


2004-04-04 18:51:38

by Ben Mansell

Subject: Re: Is POLLHUP an input-only or bidirectional condition? (was: epoll reporting events when it hasn't been asked to)

On Sat, 3 Apr 2004, Jamie Lokier wrote:

> Davide Libenzi wrote:
> > Looking at poll(2) though, it seems that it does the same thing if
> > you set the event mask to 0. So epoll is coherent with poll(2) in this.
>
> Yes. SUSv3 says POLLHUP, POLLERR and POLLNVAL are always reported
> even if not requested.

Fair enough. However, if you're writing poll()-based code that doesn't
want to listen to any events on a fd, you just wouldn't add it into the
pollfd array in the first place. Since you have to generate the pollfd
array for each time you call poll(), there is no real extra cost in
taking a fd out temporarily, and putting it back in later when we care
about what is going on with it.

With epoll, adding a fd into the epoll set is a separate operation from
the epoll_wait(), so if you really don't want to listen for any events
on one FD, you'll have to do an EPOLL_CTL_DEL, and then later on do an
EPOLL_CTL_ADD again if you want to bring it back in. Which is a bit nasty
and inefficient.

It would be nice if there was some shortcut to doing this. I'd still
favour epoll only reporting POLLHUP|POLLERR events if the fd was also
registered for POLLIN|POLLOUT. It may not be consistent with SUSv3's
definition of poll(), but does it matter? epoll is not poll.

As Richard Kettlewell's excellent poll test shows, relying on anything
but the basics of poll() is impossible if you are trying to write code
for several different OSs (or just different versions of the same OS!)
Whatever poll() returns, all you can do is force a read() or a write()
to try and find out what events really happened. This is not something
you'd want to do if the application, by unsetting POLLIN & POLLOUT, has
shown that it doesn't want to read() or write().



Ben

2004-04-04 19:40:59

by Davide Libenzi

Subject: Re: Is POLLHUP an input-only or bidirectional condition? (was: epoll reporting events when it hasn't been asked to)

On Sun, 4 Apr 2004, Ben Mansell wrote:

> With epoll, adding a fd into the epoll set is a separate operation from
> the epoll_wait(), so if you really don't want to listen for any events
> on one FD, you'll have to do an EPOLL_CTL_DEL, and then later on do an
> EPOLL_CTL_ADD again if you want to bring it back in. Which is a bit nasty
> and inefficient.

I really fail to see how handling POLLHUP and POLLERR would be a problem,
even for fds where you specified a 0 event mask. If you receive them, you
remove the fd from the set, and you flag the associated data structure for
a lazy removal at the end of the current event loop. Where is the problem
here?
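
A C sketch of the handling described above, assuming Linux. The struct conn type, its fields, and all function names are hypothetical illustrations, not code from any real application. On EPOLLHUP or EPOLLERR the fd is removed from the epoll set and its structure is only flagged; the actual close and free happen in a sweep at the end of the loop iteration.

```c
#include <stdint.h>
#include <stdlib.h>
#include <sys/epoll.h>
#include <unistd.h>

struct conn {               /* hypothetical per-connection state */
    int fd;
    int dead;               /* set on HUP/ERR, freed after the loop pass */
    struct conn *next;
};

/* Handles one event. On HUP/ERR: drop the fd from the set and flag the
 * structure for lazy removal. Returns 1 if the connection was flagged. */
int handle_event(int epfd, struct conn *c, uint32_t events)
{
    if (events & (EPOLLHUP | EPOLLERR)) {
        struct epoll_event dummy = { 0 };   /* non-NULL for old kernels */
        epoll_ctl(epfd, EPOLL_CTL_DEL, c->fd, &dummy);
        c->dead = 1;
        return 1;
    }
    /* normal EPOLLIN/EPOLLOUT processing would go here */
    return 0;
}

/* End-of-iteration sweep: closes and frees every flagged connection.
 * Returns the number of connections reaped. */
int reap_dead(struct conn **head)
{
    int n = 0;

    while (*head) {
        struct conn *c = *head;
        if (c->dead) {
            *head = c->next;
            close(c->fd);
            free(c);
            n++;
        } else {
            head = &c->next;
        }
    }
    return n;
}

/* Usage sketch: one connection with an empty mask hangs up.
 * Returns the number reaped, or -1 on setup failure. */
int demo_reap(void)
{
    int p[2], reaped = -1;
    struct epoll_event ev = { 0 }, out;
    struct conn *head = calloc(1, sizeof *head);
    int epfd = epoll_create(1);

    if (!head || epfd < 0 || pipe(p) < 0) {
        free(head);
        if (epfd >= 0)
            close(epfd);
        return -1;
    }
    head->fd = p[0];
    ev.data.ptr = head;                 /* ev.events stays 0 */
    if (epoll_ctl(epfd, EPOLL_CTL_ADD, p[0], &ev) == 0) {
        close(p[1]);                    /* provoke EPOLLHUP */
        if (epoll_wait(epfd, &out, 1, 1000) == 1)
            handle_event(epfd, out.data.ptr, out.events);
        reaped = reap_dead(&head);      /* end of this loop iteration */
    } else {
        close(p[1]);
    }
    if (head) {                         /* not reaped: plain cleanup */
        close(head->fd);
        free(head);
    }
    close(epfd);
    return reaped;
}
```

Deferring the free to the sweep means other events in the same epoll_wait() batch can still safely refer to the structure.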



- Davide


2004-04-04 20:24:54

by Jamie Lokier

Subject: Re: Is POLLHUP an input-only or bidirectional condition? (was: epoll reporting events when it hasn't been asked to)

Ben Mansell wrote:
> Since you have to generate the pollfd array for each time you call
> poll(), there is no real extra cost in taking a fd out temporarily

Wrong. You don't have to generate the pollfd array each time. That's
why there are separate events and revents fields. Quite often no
changes are required between each call to poll(), and only small
changes the rest of the time.

(Of course there is no escaping the O(n) overhead that the _kernel_
has when it scans the array, but it's avoidable in userspace).

> With epoll, adding a fd into the epoll set is a separate operation from
> the epoll_wait(), so if you really don't want to listen for any events
> on one FD, you'll have to do an EPOLL_CTL_DEL, and then later on do an
> EPOLL_CTL_ADD again if you want to bring it back in. Which is a bit nasty
> and inefficient.

No. If you don't want to listen for any events, and you predict those
events haven't occurred already (POLLHUP, POLLERR, usually POLLIN),
don't do any epoll_ctl() operations at all. Just call epoll_wait().

When you receive an event that you didn't want to listen for, set the
corresponding flag in your userspace structure, and call EPOLL_CTL_MOD
or EPOLL_CTL_DEL, depending on whether there are any other events you
still want to listen for.

See, your proposed method is slower than mine. I avoid *all*
epoll_ctl() calls in the common path.

Only in the uncommon path might I process an unwanted POLLHUP or
POLLERR event, and in those cases either I may as well close the fd
now (POLLHUP, after a read to determine whether it's EOF or an error),
or the cost of the EPOLL_CTL_DEL, if I want to ignore that fd for a
while (POLLERR), is negligible because that's a rare event.

> As Richard Kettlewell's excellent poll test shows, relying on anything
> but the basics of poll() is impossible if you are trying to write code
> for several different OSs (or just different versions of the same OS!)
> Whatever poll() returns, all you can do is force a read() or a write()
> to try and find out what events really happened.

Indeed. Sometimes I wonder why there is anything other than POLLIN
and POLLOUT, given that the only reasonable response to the other
flags is to call read() to find out what happened. (Then again, maybe
read() isn't enough to get error conditions (as flagged by POLLERR) on
some broken OSs, and only MSG_ERRQUEUE will report them? I don't know).

> This is not something you'd want to do if the application, by
> unsetting POLLIN & POLLOUT, has shown that it doesn't want to read()
> or write().

Indeed. That's why if you do receive POLLHUP or POLLERR and you're
not interested in handling them right now, then _after_ receiving the
events call EPOLL_CTL_DEL, not before. That lazy method usually
avoids the system call.
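
The lazy method can be sketched in C roughly as follows, assuming Linux. The struct watch type, its fields, and the function names are hypothetical. Losing interest in an fd costs no system call; only when an unwanted event actually arrives is it recorded in userspace and the registration adjusted with EPOLL_CTL_MOD or EPOLL_CTL_DEL.

```c
#include <stdint.h>
#include <sys/epoll.h>
#include <unistd.h>

struct watch {              /* hypothetical userspace bookkeeping */
    int fd;
    uint32_t wanted;        /* events the app wants to act on right now */
    uint32_t registered;    /* mask last passed to epoll_ctl() */
    uint32_t pending;       /* events seen while unwanted, handled later */
    int in_set;             /* still registered with the epoll set? */
};

/* Called for each event returned by epoll_wait().
 * Returns the bits the application should handle immediately. */
uint32_t lazy_filter(int epfd, struct watch *w, uint32_t got)
{
    uint32_t unwanted = got & ~w->wanted;

    if (unwanted) {
        struct epoll_event ev = { 0 };

        w->pending |= unwanted;             /* remember it for later */
        ev.events = w->registered & w->wanted;
        ev.data.ptr = w;
        if ((unwanted & (EPOLLHUP | EPOLLERR)) || ev.events == 0) {
            /* HUP/ERR cannot be masked, or nothing is left to wait for */
            epoll_ctl(epfd, EPOLL_CTL_DEL, w->fd, &ev);
            w->in_set = 0;
            w->registered = 0;
        } else {
            epoll_ctl(epfd, EPOLL_CTL_MOD, w->fd, &ev);
            w->registered = ev.events;
        }
    }
    return got & w->wanted;
}

/* Usage sketch: the app registered EPOLLIN earlier, then lost interest
 * without calling epoll_ctl(). Returns 1 if the lazy path behaved as
 * intended, 0 if not, -1 on setup failure. */
int demo_lazy(void)
{
    int p[2], ok = 0;
    struct watch w = { 0 };
    struct epoll_event ev = { 0 }, out;
    int epfd = epoll_create(1);

    if (epfd < 0)
        return -1;
    if (pipe(p) < 0) {
        close(epfd);
        return -1;
    }
    w.fd = p[0];
    w.registered = EPOLLIN;
    w.in_set = 1;
    w.wanted = 0;                           /* interest dropped lazily */
    ev.events = EPOLLIN;
    ev.data.ptr = &w;
    if (epoll_ctl(epfd, EPOLL_CTL_ADD, p[0], &ev) == 0 &&
        write(p[1], "x", 1) == 1 &&
        epoll_wait(epfd, &out, 1, 1000) == 1) {
        uint32_t now = lazy_filter(epfd, out.data.ptr, out.events);
        ok = now == 0 && w.pending == EPOLLIN && !w.in_set &&
             epoll_wait(epfd, &out, 1, 0) == 0;
    }
    close(p[0]);
    close(p[1]);
    close(epfd);
    return ok;
}
```

When the application becomes interested again, it would first act on w.pending and re-add the fd only if it is no longer in the set.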

-- Jamie

2004-04-14 17:59:47

by Dirk Morris

Subject: Re: epoll reporting events when it hasn't been asked to

Davide Libenzi wrote:

>On Fri, 2 Apr 2004, Ben Mansell wrote:
>
>
>>>If an exception occurs (for example, a socket is disconnected), the socket
>>>should be removed from the fd list. There is really no point in passing
>>>in an excepted fd.
>>>
>>>
>>Is there any difference, speed-wise, between turning off all events to
>>listen to with EPOLL_CTL_MOD, and removing the file descriptor with
>>EPOLL_CTL_DEL? I had vaguely assumed that the former would be faster
>>(especially if you might later want to resume listening for events),
>>although that was just a guess.
>>
>>

I'd like to weigh in on this issue, as I'm having the same problem as Ben.
My application doesn't consider these to be exceptional events, but
normal, expected events, and thus I need them to be handled like normal
events. (I can explain more off-list if you'd like.)
So I just want to ignore all events for some time and then deal with any
HUPs or ERRs at the appropriate time.
When I used poll(), I always accomplished this by leaving the fd out of
the poll fd set.
This wasn't a huge hit because I basically had to rebuild the poll fd set
at every iteration anyway, as it changes rapidly.

Now I'm switching to epoll, and the great thing about the epoll
interface is that I don't have to rebuild the entire fd set at every
iteration.
Like Ben, I'd prefer to be able to disable ALL events on a file descriptor
for some time, instead of removing it entirely.
Since with poll I had to rebuild the set anyway, this 'disable' feature
wasn't really useful there, but it would be a nice-to-have for epoll.
:))






2004-04-14 19:40:05

by Jamie Lokier

Subject: Re: epoll reporting events when it hasn't been asked to

Dirk Morris wrote:
> I need them to be handled like normal events. (I can explain more off
> list if you'd like)

Did you read my explanation of how to do this using the present epoll
behaviour using _fewer_ syscalls than you are asking for?

-- Jamie

2004-04-14 20:23:01

by Dirk Morris

Subject: Re: epoll reporting events when it hasn't been asked to

Jamie Lokier wrote:

>Dirk Morris wrote:
>
>
>>I need them to be handled like normal events. (I can explain more off
>>list if you'd like)
>>
>>
>
>Did you read my explanation of how to do this using the present epoll
>behaviour using _fewer_ syscalls than you are asking for?
>
>
Ah yes, I just went back and read it.
From what I understand you're proposing to remove the fd from the set
lazily instead of immediately.
Which will save system calls in the cases where the HUP/ERR condition
does not occur during the 'disabled' time.

In my case, which you may choose to disregard, this condition is not
irregular or in any way a special case.
So the revision you have proposed is just an optimization.
You could even use this same optimization with the disable feature
(disable it lazily) and get even better performance with the same number
of syscalls you proposed.

I see no downside, except that it no longer conforms to the semantics of
poll and select.
Whether or not it's worth it to deviate from this behavior over such a
detail, I don't know. :)



2004-04-14 21:48:36

by Jamie Lokier

Subject: Re: epoll reporting events when it hasn't been asked to

Dirk Morris wrote:
> From what I understand you're proposing to remove the fd from the set
> lazily instead of immediately.
> Which will save system calls in the cases where the HUP/ERR condition
> does not occur during the 'disabled' time.
>
> In my case, which you may choose to disregard, this condition is not
> irregular or in any way a special case.
> So the revision you have proposed is just an optimization.
> You could even use this same optimization with the disable feature
> (disable it lazily) and get even better performance with the same number
> of syscalls you proposed.

I don't think you would get any better performance even when HUP/ERR
conditions are commonplace.

A HUP condition means you cannot read & write from the fd any more, so
even though you may defer handling it in userspace, there's nothing to
be gained from disabling all epoll events after you receive the HUP:
instead of lazy disabling, you might as well delete the fd from epoll
as soon as you receive the HUP even though you don't want to handle it yet.
(Btw, the comment about HUP in net/ipv4/tcp.c:tcp_poll() is illuminating).

ERRs can occur many times while a socket is open so the algorithmic
efficiency is worth considering.

An ERR condition, at least on a socket, forces you to examine the
error before you can perform a further read or write. That's because
read and write operations will both check for pending error, so when
you know there's an error condition, you know that the next read or
write call is really a "tell me the error" call.

So, assuming you apply the lazy strategy, after you receive an ERR, in
principle you could decide that you want to do nothing with it until
the next IN or OUT which represents non-error data readiness, and then
you will examine the error code (by doing a read or write call) and
then read or write actual data. Then indeed being able to ignore just
ERR and still listen for IN and/or OUT would make a difference. But
you might as well just read the error condition using MSG_ERRQUEUE,
and spend your one system call that way instead - you can still defer
the processing of the error code.

There is a situation where that is algorithmically not as good as
being able to ignore just ERR: The example is when you are receiving a
malicious flood of ICMP packets which cause lots of error conditions
on a UDP socket (or other similar things), and you want to ignore all
of those while efficiently handling a lower rate of non-error data
transfer. That's a very unusual situation and it doesn't occur except
under attack circumstances with UDP (because real error ICMPs are a
response to something you transmitted yourself).

Other socket types or devices might give ERR a different meaning which
causes them to be common relative to read and write readiness. If so,
they probably shouldn't.

> I see no downside, except that it no longer conforms to the semantics of
> poll and select.
> Whether or not it's worth it to deviate from this behavior over such a
> detail, I don't know. :)

I see two downsides. They're not performance downsides, just practical:

1. Currently, you can implement epoll in terms of poll(), if you
have an epoll-based program and want to create epoll emulation
functions for running on an old kernel. If epoll were extended
to permit ignoring HUP/ERR, that would no longer be possible.

2. Perhaps most programs will use a flexible library like libevent
or something made for the program. It's possible that library
will offer an API which sends the POLLIN/OUT/ERR/HUP bits to the
application, and lets the application programmer interpret those
bits in whatever way is appropriate. If it becomes easy to
ignore HUP when the library works with epoll, applications may
accidentally end up depending on that, and will unexpectedly fail
when they are run one day on an older system, or even another OS
where the same library works by calling poll() or select().

Summary: epoll is fine the way it is. Imho.

-- Jamie