2008-10-26 14:42:52

by Paul P

Subject: unexpected extra pollout events from epoll

I am writing a server using the epoll interface. The receive portion of the server works fine, but while implementing the send portion I noticed a few things in the kernel's epoll implementation that seem like strange behavior.

I'm running openSUSE 11, which has a 2.6.25 kernel.

The behavior that I am seeing is that when I do a full read on an edge-triggered fd, it seems to trigger an EPOLLOUT event after each loop of the read events on a socket (before I've done any writes at all to the socket).

This is very strange behavior, as I would expect the EPOLLOUT event to be triggered only if I did a write and the socket received an ACK that cleared out the send buffer.

The documentation on EPOLLOUT is really sparse, so any help at all from the list would be very much appreciated. Do I need to manually arm the EPOLLOUT flag after a write? I thought this was only necessary for level-triggered epoll.

I was hoping someone more knowledgeable on the subject here might be able to help explain the epollout behavior and whether or not the extra events are normal and if so, what is the traditional way to handle these extra events in an edge triggered scenario.

Thanks!

Paul



2008-10-26 16:43:19

by Robert Hancock

Subject: Re: unexpected extra pollout events from epoll

Paul P wrote:
> I am programming a server using the epoll interface and have the receive portion of the server working fine, but for some reason as I implement the send portion, I noticed a few things that seem like strange behaviors in the implementation of epoll in the kernel.
>
> I'm running Opensuse 11 and it has a 2.6.25 kernel.
>
> The behavior that I am seeing is when I do a full read on an edge triggered fd, for some reason, it seems to be triggering an epollout event after each loop of the read events on a socket. (before I've done any writes at all to the socket)
>
> This is very strange behavior as I would expect that the epollout event would only be triggered if I did a write and the socket received an ack which cleared out the send buffer.
>
> The documentation on epollout is really sparse, so any help at all from the list would be very much appreciated. Do I need to manually arm the epollout flag after a write? I thought this was only necessary for level triggered epoll.
>
> I was hoping someone more knowledgeable on the subject here might be able to help explain the epollout behavior and whether or not the extra events are normal and if so, what is the traditional way to handle these extra events in an edge triggered scenario.

I'm not too familiar with the edge triggered mode, but you shouldn't be
requesting EPOLLOUT notifications if you don't care about them (i.e. if
you are not trying to write anything).
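
A minimal sketch of that advice, assuming Linux and an AF_UNIX socketpair standing in for a real client connection. The names watch_reads_only and demo_reads_only are made up for illustration. With only EPOLLIN in the interest mask, incoming data should be reported as EPOLLIN alone, even though the socket is also writable at that moment:

```c
#include <string.h>
#include <sys/epoll.h>
#include <sys/socket.h>
#include <unistd.h>

/* Hypothetical helper: register fd for edge-triggered read events only. */
int watch_reads_only(int epfd, int fd)
{
    struct epoll_event ev;

    memset(&ev, 0, sizeof ev);
    ev.events = EPOLLIN | EPOLLET;      /* no EPOLLOUT: nothing to send yet */
    ev.data.fd = fd;
    return epoll_ctl(epfd, EPOLL_CTL_ADD, fd, &ev);
}

/* Hypothetical demo: returns the event mask reported after the peer sends
 * one byte, or 0 if any step fails. */
unsigned int demo_reads_only(void)
{
    struct epoll_event got;
    unsigned int mask = 0;
    int sv[2];
    int epfd = epoll_create(1);         /* the size argument is only a hint */

    if (epfd < 0)
        return 0;
    if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) < 0) {
        close(epfd);
        return 0;
    }
    if (watch_reads_only(epfd, sv[0]) == 0 &&
        write(sv[1], "x", 1) == 1 &&            /* peer sends one byte */
        epoll_wait(epfd, &got, 1, 1000) == 1)
        mask = got.events;
    close(sv[0]);
    close(sv[1]);
    close(epfd);
    return mask;
}
```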

2008-10-26 22:07:38

by Davide Libenzi

Subject: Re: unexpected extra pollout events from epoll

On Sun, 26 Oct 2008, Paul P wrote:

> I am programming a server using the epoll interface and have the receive portion of the server working fine, but for some reason as I implement the send portion, I noticed a few things that seem like strange behaviors in the implementation of epoll in the kernel.
>
> I'm running Opensuse 11 and it has a 2.6.25 kernel.
>
> The behavior that I am seeing is when I do a full read on an edge
> triggered fd, for some reason, it seems to be triggering an epollout
> event after each loop of the read events on a socket. (before I've done
> any writes at all to the socket)
>
> This is very strange behavior as I would expect that the epollout event
> would only be triggered if I did a write and the socket received an ack
> which cleared out the send buffer.
>
> The documentation on epollout is really sparse, so any help at all from
> the list would be very much appreciated. Do I need to manually arm the
> epollout flag after a write? I thought this was only necessary for
> level triggered epoll.

epoll works by hooking into the existing kernel poll
subsystem. It hooks into the poll wakeups via a callback, and in that way
it knows that "something" has changed. Then it reads the status of the file
via f_op->poll().
What happens is that, if you listen for EPOLLIN|EPOLLOUT, when a packet
arrives the callback hook is hit, and the file is put into a maybe-ready
list. Maybe-ready because at the time of the callback, epoll has no clue
of what happened.
After that, via epoll_wait(), f_op->poll() is called to get the status of
the file, and since POLLIN|POLLOUT is returned (and since you're listening
for EPOLLIN|EPOLLOUT), that gets reported back to you.
The POLLOUT event, in the sense of a buffer-full->buffer-avail transition,
did not really happen, but since POLLOUT is true, it gets reported back too.
Again, this is because epoll has no clue what happened at callback hit time.
I'm working on changes that will make epoll aware (by using the existing
support for the "key" parameter of wakeups) of events at callback time,
but this is something that is still up for discussion and definitely won't
be in .28.
The best way to do it ATM, is to wait for POLLOUT only when really needed.
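
A minimal sketch that reproduces the sequence described above, assuming Linux and an AF_UNIX socketpair standing in for a TCP connection (the name demo_extra_epollout is made up for illustration). The fd is registered edge-triggered for EPOLLIN|EPOLLOUT, the initial "writable" event is consumed, and then the peer sends one byte. Nothing is ever written on the watched fd, yet the reported mask should contain EPOLLOUT next to EPOLLIN:

```c
#include <string.h>
#include <sys/epoll.h>
#include <sys/socket.h>
#include <unistd.h>

/* Hypothetical demo: returns the event mask reported when data arrives on
 * an fd registered for EPOLLIN|EPOLLOUT|EPOLLET, or 0 if any step fails. */
unsigned int demo_extra_epollout(void)
{
    struct epoll_event ev, got;
    unsigned int mask = 0;
    int sv[2];
    int epfd = epoll_create(1);         /* the size argument is only a hint */

    if (epfd < 0)
        return 0;
    if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) < 0) {
        close(epfd);
        return 0;
    }
    memset(&ev, 0, sizeof ev);
    ev.events = EPOLLIN | EPOLLOUT | EPOLLET;
    ev.data.fd = sv[0];

    if (epoll_ctl(epfd, EPOLL_CTL_ADD, sv[0], &ev) == 0 &&
        epoll_wait(epfd, &got, 1, 0) == 1 &&    /* initial "writable" edge */
        epoll_wait(epfd, &got, 1, 0) == 0 &&    /* idle: nothing new */
        write(sv[1], "x", 1) == 1 &&            /* peer sends one byte */
        epoll_wait(epfd, &got, 1, 1000) == 1)   /* wakeup caused by the read side */
        mask = got.events;

    close(sv[0]);
    close(sv[1]);
    close(epfd);
    return mask;
}
```

If the behavior matches the explanation, demo_extra_epollout() returns EPOLLIN|EPOLLOUT: the wakeup came from the incoming byte, but the status check at epoll_wait() time also finds the socket writable.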




- Davide

2008-10-26 22:48:56

by Paul P

Subject: Re: unexpected extra pollout events from epoll

> After that, via epoll_wait(), f_op->poll() is called to get the status of the file, and since POLLIN|POLLOUT is returned (and since you're listening for EPOLLIN|EPOLLOUT), that gets reported back to you. The POLLOUT event, by meaning a buffer-full->buffer-avail transition, did not really happen, but since POLLOUT is true, that gets reported back too.

OK, so to make sure I understand you correctly: you're saying that currently the kernel doesn't have awareness of the difference between EPOLLIN and EPOLLOUT events because at the time of the event, both EPOLLIN and EPOLLOUT are returned from the kernel, and that at least for the near term that's not going to change. At some point we can expect EPOLLOUT to give the correct event, but not until after .28.

> The best way to do it ATM, is to wait for POLLOUT only when
> really needed.

I'm a little unclear how to do this. If I set the epoll_wait call to wait for just epollin events, that's fine. But when I send a large buffer of data and use epoll_ctl to look for epollin|epollout events, don't I have the same problem?

Let's say I'm sending a large buffer of data and I arm the fd to epollin|epollout (I'm adding an epollin flag because a message could come in while I'm sending)

If an event gets triggered on an fd, then I have no way of knowing if the event is from the socket being available to send data or if there is data waiting to be received since the epollin|epollout flag could be either one. So what am I to do when I get an event?

Are you saying that I can't do sending and receiving simultaneously with epoll? If that's the case, then is everyone simply setting the epollout flag when sending and ignoring the possibility of data coming in while data is being sent?

I didn't want to have to manually set fds with epoll_ctl, but now I guess the EPOLLONESHOT flag makes more sense.

Paul


2008-10-26 23:12:17

by Davide Libenzi

Subject: Re: unexpected extra pollout events from epoll

[Can you try to trim lines at 80 chars or so?]


On Sun, 26 Oct 2008, Paul P wrote:

> > After that, via epoll_wait(), f_op->poll() is called to get the status
> > of the file, and since POLLIN|POLLOUT is returned (and since you're
> > listening for EPOLLIN|EPOLLOUT), that gets reported back to you. The
> > POLLOUT event, by meaning a buffer-full->buffer-avail transition, did
> > not really happen, but since POLLOUT is true, that gets reported back
> > too.
>
> Ok, so make sure I understand you correctly, you're saying that
> currently the kernel doesn't have awareness of the difference between
> EPOLLIN and EPOLLOUT events because at the time of the event, both
> EPOLLIN/EPOLLOUT are returned from the kernel and that at least for the
> near term that's not going to change. At some point, we can expect the
> EPOLLOUT to give the correct event, but not till later than .28.

The kernel knows the difference between EPOLLIN and EPOLLOUT, of course.
At the moment, though, that condition is not reported during wakeups, and
this is what is going to change.



> > The best way to do it ATM, is to wait for POLLOUT only when
> > really needed.
>
> I'm a little unclear how to do this. If I set the epoll_wait call to
> wait for just epollin events, that's fine. But when I send a large
> buffer of data and use epoll_ctl to look for epollin|epollout events,
> don't I have the same problem?

You do that by writing data until it's finished or you get EAGAIN. If you
get EAGAIN, you listen for EPOLLOUT.
Reading is the same, but you'd wait for EPOLLIN.
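
A minimal sketch of that write path, assuming Linux, a non-blocking fd already registered with epoll, and edge-triggered mode. The names send_or_arm and demo_send_or_arm are made up for illustration, and the 8 MiB buffer is only assumed to be larger than the socket's send buffer:

```c
#include <errno.h>
#include <fcntl.h>
#include <string.h>
#include <sys/epoll.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: write until done or EAGAIN. On EAGAIN, add EPOLLOUT
 * to the fd's interest mask and set *need_out. Returns the number of bytes
 * written so far, or -1 on a real error. */
ssize_t send_or_arm(int epfd, int fd, const char *buf, size_t len, int *need_out)
{
    size_t done = 0;

    *need_out = 0;
    while (done < len) {
        ssize_t n = write(fd, buf + done, len - done);

        if (n > 0) {
            done += (size_t)n;
        } else if (n < 0 && errno == EINTR) {
            continue;                           /* interrupted, just retry */
        } else if (n < 0 && (errno == EAGAIN || errno == EWOULDBLOCK)) {
            struct epoll_event ev;

            memset(&ev, 0, sizeof ev);
            ev.events = EPOLLIN | EPOLLOUT | EPOLLET;   /* now we care about EPOLLOUT */
            ev.data.fd = fd;
            if (epoll_ctl(epfd, EPOLL_CTL_MOD, fd, &ev) < 0)
                return -1;
            *need_out = 1;
            break;
        } else {
            return -1;                          /* real error (or unexpected 0) */
        }
    }
    return (ssize_t)done;
}

/* Hypothetical demo on a socketpair: returns 1 if the write stalled, EPOLLOUT
 * was armed, and draining the peer then produced an EPOLLOUT event. */
int demo_send_or_arm(void)
{
    static char big[8 << 20];       /* assumed larger than the send buffer */
    char sink[65536];
    struct epoll_event ev, got;
    int sv[2], need_out = 0, ok = 0;
    ssize_t sent, left, n;
    int epfd = epoll_create(1);

    if (epfd < 0)
        return 0;
    if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) < 0) {
        close(epfd);
        return 0;
    }
    fcntl(sv[0], F_SETFL, fcntl(sv[0], F_GETFL) | O_NONBLOCK);
    memset(&ev, 0, sizeof ev);
    ev.events = EPOLLIN | EPOLLET;  /* start out interested in reads only */
    ev.data.fd = sv[0];
    epoll_ctl(epfd, EPOLL_CTL_ADD, sv[0], &ev);

    sent = send_or_arm(epfd, sv[0], big, sizeof big, &need_out);
    for (left = sent; left > 0; left -= n)      /* peer drains what was queued */
        if ((n = read(sv[1], sink, sizeof sink)) <= 0)
            break;
    if (need_out && sent > 0 && (size_t)sent < sizeof big &&
        epoll_wait(epfd, &got, 1, 1000) == 1 && (got.events & EPOLLOUT))
        ok = 1;                     /* time to resume writing the remainder */
    close(sv[0]);
    close(sv[1]);
    close(epfd);
    return ok;
}
```

A real server would keep the unsent remainder per connection, resume from it when EPOLLOUT arrives, and drop EPOLLOUT from the mask again once the buffer has been flushed.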



- Davide

2008-10-27 00:59:00

by Paul P

Subject: Re: unexpected extra pollout events from epoll

> You do that by writing data until it's finished, or you
> get EAGAIN. If you
> get EAGAIN, you listen for EPOLLOUT.
> Reading is same, but you'd wait for EPOLLIN.

I've got a few questions about this approach. The most logical
way to do this seems to be:

1) Leave the epoll_wait with the EPOLLIN|EPOLLOUT event flags and
use epoll_ctl to switch the interest mask for each fd between EPOLLIN
and EPOLLOUT on a per fd basis.

2) When I'm ready to write, I do a write and if it does not fully
write and I get the EAGAIN flag, I switch the fd with epoll_ctl(fd,MOD,EPOLLOUT).

However, I get strange behavior when I tried adding fd's with only the
EPOLLIN interest mask. If I use epoll_wait with both the EPOLLIN and
EPOLLOUT interest mask, but add fd's with only the EPOLLIN interest mask,
I still seem to get EPOLLOUT events on the fd.

Am I supposed to change the main loop with epoll_wait so that when one
socket is reading that I switch the main loop to get EPOLLOUT events?
That means that I'm not receiving on any fd while I'm sending, so this
probably isn't right.

So, I'm a little confused.

Thanks in advance.

Paul



2008-10-27 01:18:27

by Davide Libenzi

Subject: Re: unexpected extra pollout events from epoll

On Sun, 26 Oct 2008, Paul P wrote:

> > You do that by writing data until it's finished, or you
> > get EAGAIN. If you
> > get EAGAIN, you listen for EPOLLOUT.
> > Reading is same, but you'd wait for EPOLLIN.
>
> I've got a few questions about this approach. The most logical
> way to do this seems to be:
>
> 1) Leave the epoll_wait with the EPOLLIN|EPOLLOUT event flags and
> use epoll_ctl to switch the interest mask for each fd between EPOLLIN
> and EPOLLOUT on a per fd basis.

Which version of epoll do you have? The epoll_wait() function does not
accept an event mask (like the EPOLLIN|EPOLLOUT you write above). It never
has.
But yes, you'd switch interest with epoll_ctl().



> 2) When I'm ready to write, I do a write and if it does not fully
> write and I get the EAGAIN flag, I switch the fd with epoll_ctl(fd,MOD,EPOLLOUT).

As an optimization, if the EPOLLOUT bit is already set, you don't need to
keep calling epoll_ctl(fd,MOD,EPOLLOUT).
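
One way to apply that optimization, sketched under the assumption that the application keeps a small per-connection record of the mask it last registered. The struct conn, conn_set_interest, demo_interest_cache and ctl_calls names are made up for illustration:

```c
#include <stdint.h>
#include <string.h>
#include <sys/epoll.h>
#include <sys/socket.h>
#include <unistd.h>

/* Hypothetical per-connection state: the fd plus the interest mask that was
 * last handed to epoll_ctl(). */
struct conn {
    int fd;
    uint32_t interest;
};

static int ctl_calls;   /* counts real epoll_ctl() calls, for the demo only */

/* Change the interest mask only when it differs from what is registered. */
int conn_set_interest(int epfd, struct conn *c, uint32_t want)
{
    struct epoll_event ev;

    if (c->interest == want)
        return 0;                   /* already registered that way: no syscall */
    memset(&ev, 0, sizeof ev);
    ev.events = want;
    ev.data.ptr = c;
    if (epoll_ctl(epfd, EPOLL_CTL_MOD, c->fd, &ev) < 0)
        return -1;
    ctl_calls++;
    c->interest = want;
    return 0;
}

/* Hypothetical demo: returns how many epoll_ctl(MOD) calls were really made
 * for three requests, or -1 on setup failure. */
int demo_interest_cache(void)
{
    struct conn c;
    struct epoll_event ev;
    int sv[2];
    int epfd = epoll_create(1);

    if (epfd < 0)
        return -1;
    if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) < 0) {
        close(epfd);
        return -1;
    }
    c.fd = sv[0];
    c.interest = EPOLLIN | EPOLLET;
    memset(&ev, 0, sizeof ev);
    ev.events = c.interest;
    ev.data.ptr = &c;
    epoll_ctl(epfd, EPOLL_CTL_ADD, c.fd, &ev);

    ctl_calls = 0;
    conn_set_interest(epfd, &c, EPOLLIN | EPOLLOUT | EPOLLET);  /* real call */
    conn_set_interest(epfd, &c, EPOLLIN | EPOLLOUT | EPOLLET);  /* skipped */
    conn_set_interest(epfd, &c, EPOLLIN | EPOLLET);             /* real call */

    close(sv[0]);
    close(sv[1]);
    close(epfd);
    return ctl_calls;
}
```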



> However, I get strange behavior when I tried adding fd's with only the
> EPOLLIN interest mask. If I use epoll_wait with both the EPOLLIN and
> EPOLLOUT interest mask, but add fd's with only the EPOLLIN interest mask,
> I still seem to get EPOLLOUT events on the fd.

Again, how the heck do you "use epoll_wait with both the EPOLLIN and
EPOLLOUT"?!? There is no such thing.




> So, I'm a little confused.

From the wording above, that doesn't seem like a wrong guess.



- Davide

2008-10-27 01:23:46

by Davide Libenzi

Subject: Re: unexpected extra pollout events from epoll

On Sun, 26 Oct 2008, Davide Libenzi wrote:

> On Sun, 26 Oct 2008, Paul P wrote:
>
> > However, I get strange behavior when I tried adding fd's with only the
> > EPOLLIN interest mask. If I use epoll_wait with both the EPOLLIN and
> > EPOLLOUT interest mask, but add fd's with only the EPOLLIN interest mask,
> > I still seem to get EPOLLOUT events on the fd.
>
> Again, how the heck do you "use epoll_wait with both the EPOLLIN and
> EPOLLOUT"?!? There is not such a thing.

Wait, you're not passing EPOLLIN or EPOLLOUT as the "maxevents"
parameter of epoll_wait(), are you?
That's the maximum event count you want to fetch, not an event mask.
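
A short sketch of where each value goes, assuming Linux and two AF_UNIX socketpairs as stand-ins for client connections. The names MAX_EVENTS, watch_fd and demo_maxevents are made up for illustration. The interest mask travels through epoll_ctl() in struct epoll_event, while the third argument of epoll_wait() is only the capacity of the result array. Since EPOLLIN is 0x001 and EPOLLOUT is 0x004 on Linux, passing EPOLLIN|EPOLLOUT there would simply ask for at most 5 events:

```c
#include <stdint.h>
#include <string.h>
#include <sys/epoll.h>
#include <sys/socket.h>
#include <unistd.h>

#define MAX_EVENTS 16   /* capacity of the result array, not an event mask */

/* Hypothetical helper: the interest mask is passed here, via epoll_ctl(). */
static int watch_fd(int epfd, int fd, uint32_t mask)
{
    struct epoll_event ev;

    memset(&ev, 0, sizeof ev);
    ev.events = mask;
    ev.data.fd = fd;
    return epoll_ctl(epfd, EPOLL_CTL_ADD, fd, &ev);
}

/* Hypothetical demo: two sockets become readable, and epoll_wait() returns
 * how many entries of events[] it filled (at most MAX_EVENTS). */
int demo_maxevents(void)
{
    struct epoll_event events[MAX_EVENTS];
    int a[2], b[2], n = -1;
    int epfd = epoll_create(1);

    if (epfd < 0)
        return -1;
    if (socketpair(AF_UNIX, SOCK_STREAM, 0, a) < 0) {
        close(epfd);
        return -1;
    }
    if (socketpair(AF_UNIX, SOCK_STREAM, 0, b) < 0) {
        close(a[0]);
        close(a[1]);
        close(epfd);
        return -1;
    }
    if (watch_fd(epfd, a[0], EPOLLIN) == 0 &&
        watch_fd(epfd, b[0], EPOLLIN) == 0 &&
        write(a[1], "x", 1) == 1 &&
        write(b[1], "y", 1) == 1)
        n = epoll_wait(epfd, events, MAX_EVENTS, 1000);  /* count, not mask */

    close(a[0]);
    close(a[1]);
    close(b[0]);
    close(b[1]);
    close(epfd);
    return n;
}
```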



- Davide

2008-10-27 03:48:43

by Paul P

Subject: Re: unexpected extra pollout events from epoll

> Which version of epoll do you have? The epoll_wait()
> function does not
> accept an event mask (like you write above,
> EPOLLIN|EPOLLOUT).

lol, I was a bit tired when I wrote that. Ok, ignore the stuff related
to epoll_wait in my previous post.

> As optimization, if the EPOLLOUT bit is already set, you
> don't need to
> keep calling epoll_ctl(fd,MOD,EPOLLOUT).

This is good to know.

So, I've got a few questions about what happens to data that accumulates
while I am sending and the fd is set to EPOLLOUT? If I am sending out a
large buffer and incoming data wants to stream in on a full duplex
connection, what happens to that data when I am processing the socket
while it is in epollout mode?

Is the following accurate? When data comes in while I am sending, I guess
the data fills up the receive buffers until they are full and then it
stops accepting data until it is cleared out? When I switch back to
EPOLLIN, I'm guessing that I will get a notification on that fd that there
is data waiting.

The other question I have is whether there is a way to do full-duplex
networking so that I can receive network messages while I am sending, or
vice versa. It seems that the method of switching the socket between EPOLLIN
and EPOLLOUT means that I can't do both operations simultaneously. Thanks

Paul


2008-10-27 05:59:27

by Robert Hancock

Subject: Re: unexpected extra pollout events from epoll

Paul P wrote:
>> Which version of epoll do you have? The epoll_wait()
>> function does not
>> accept an event mask (like you write above,
>> EPOLLIN|EPOLLOUT).
>
> lol, I was a bit tired when I wrote that. Ok, ignore the stuff related
> to epoll_wait in my previous post.
>
>> As optimization, if the EPOLLOUT bit is already set, you
>> don't need to
>> keep calling epoll_ctl(fd,MOD,EPOLLOUT).
>
> This is good to know.
>
> So, I've got a few questions about what happens to data that accumulates
> while I am sending and the fd is set to EPOLLOUT? If I am sending out a
> large buffer and incoming data wants to stream in on a full duplex
> connection, what happens to that data when I am processing the socket
> while it is in epollout mode?
>
> Is the following accurate? When data comes in while I am sending, I guess
> the data fills up the receive buffers until they are full and then it
> stops accepting data until it is cleared out? When I switch back to
> EPOLLIN, I'm guessing that I will get a notification on that fd that there
> is data waiting.
>
> The other question I have is there a way to do full-duplex networking so
> that I can receive network messages while I am sending or vice versa? It
> seems that the method of switching the socket between EPOLLIN and EPOLLOUT
> means that I can't do both operations simultaneously. Thanks

I don't quite follow. You shouldn't be switching back and forth if
you're trying to both send and receive; you can be registered for both
notifications at the same time and respond to whatever notifications
you get. If you're not trying to write anything at the moment, then
you shouldn't be registered for EPOLLOUT, though; the same goes for
reading and EPOLLIN.
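
A minimal sketch of handling both directions from one registration, assuming Linux, a non-blocking fd and edge-triggered mode. The struct duplex, duplex_handle and demo_duplex names are made up for illustration. The events field returned by epoll_wait() says which conditions hold, so each bit can be tested and serviced independently:

```c
#include <fcntl.h>
#include <string.h>
#include <sys/epoll.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical connection state for this sketch. */
struct duplex {
    int fd;
    const char *out;    /* pending outgoing data */
    size_t out_len;     /* bytes still to send, 0 when nothing is pending */
    size_t in_total;    /* bytes received so far */
};

/* Handle one event. The two bits are tested independently, so a single
 * wakeup can service both directions. */
void duplex_handle(struct duplex *d, unsigned int events)
{
    char buf[4096];
    ssize_t n;

    if (events & EPOLLIN) {
        /* Edge-triggered: drain until read() stops returning data. */
        while ((n = read(d->fd, buf, sizeof buf)) > 0)
            d->in_total += (size_t)n;
    }
    if ((events & EPOLLOUT) && d->out_len > 0) {
        n = write(d->fd, d->out, d->out_len);
        if (n > 0) {
            d->out += n;
            d->out_len -= (size_t)n;
        }
    }
}

/* Hypothetical demo: the peer sends "ping" while "pong" is pending.
 * Returns 1 if one event serviced both directions. */
int demo_duplex(void)
{
    struct duplex d;
    struct epoll_event ev, got;
    char reply[8];
    int sv[2], ok = 0;
    int epfd = epoll_create(1);

    if (epfd < 0)
        return 0;
    if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) < 0) {
        close(epfd);
        return 0;
    }
    fcntl(sv[0], F_SETFL, fcntl(sv[0], F_GETFL) | O_NONBLOCK);
    d.fd = sv[0];
    d.out = "pong";
    d.out_len = 4;
    d.in_total = 0;
    memset(&ev, 0, sizeof ev);
    ev.events = EPOLLIN | EPOLLOUT | EPOLLET;   /* data is pending in both directions */
    ev.data.ptr = &d;

    if (epoll_ctl(epfd, EPOLL_CTL_ADD, sv[0], &ev) == 0 &&
        write(sv[1], "ping", 4) == 4 &&
        epoll_wait(epfd, &got, 1, 1000) == 1) {
        duplex_handle(got.data.ptr, got.events);
        ok = d.in_total == 4 && d.out_len == 0 &&
             read(sv[1], reply, sizeof reply) == 4 &&
             memcmp(reply, "pong", 4) == 0;
    }
    close(sv[0]);
    close(sv[1]);
    close(epfd);
    return ok;
}
```

Once out_len reaches 0, a real server would presumably drop EPOLLOUT from the mask with epoll_ctl(), in line with the advice above not to stay registered for EPOLLOUT when there is nothing to write.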