2009-12-19 11:55:20

by Nikolai ZHUBR

Subject: epoll'ing tcp sockets for reading

Hello people, I have a question about epoll'ing tcp sockets.

Is it possible (with epoll or some other good method) to get userspace
notified not only of the fact that some data has become available on
the socket, but also of the _size_ of the data available for reading,
associated with this exact event?

Yes, read'ing until EAGAIN or using FIONREAD would provide this
sort of information, but there is a problem: if data keeps arriving
continuously, an application could get stuck reading from one socket
indefinitely (after epoll returns, before the next epoll call), unless
it implements some kind of artificial safety measure.
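
To illustrate, here is the kind of per-event handler I mean (untested
sketch; consume(), handle_eof() and handle_error() stand for whatever
the application would do):

#include <errno.h>
#include <unistd.h>

/* Drain the socket completely on every EPOLLIN event.  If the peer
 * keeps sending, this loop can run for a very long time, starving
 * all the other ready sockets. */
static void on_readable(int fd)
{
	char buf[4096];

	for (;;) {
		ssize_t n = read(fd, buf, sizeof(buf));

		if (n > 0) {
			consume(buf, (size_t)n);	/* application-defined */
			continue;
		}
		if (n == 0)
			handle_eof(fd);		/* peer closed */
		else if (errno == EAGAIN || errno == EWOULDBLOCK)
			break;			/* truly drained */
		else
			handle_error(fd);
		return;
	}
}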

To my understanding, EPOLLONESHOT does not suffice here, because
it only "fixes" the epoll'ing part but not the read'ing part, whereas
_both_ epoll'ing and read'ing appear to be responsible. Therefore,
"event atomicity" is lost anyway. Or am I wrong?

(Please CC me, I'm not subscribed)

Thank you!

Nikolai ZHUBR


2009-12-19 18:07:13

by Davide Libenzi

Subject: Re: epoll'ing tcp sockets for reading

On Sat, 19 Dec 2009, Nikolai ZHUBR wrote:

> Hello people, I have a question about epoll'ing tcp sockets.
>
> Is it possible (with epoll or some other good method) to get userspace
> notified not only of the fact that some data has become available on
> the socket, but also of the _size_ of the data available for reading,
> associated with this exact event?
>
> Yes, read'ing until EAGAIN or using FIONREAD would provide this
> sort of information, but there is a problem: if data keeps arriving
> continuously, an application could get stuck reading from one socket
> indefinitely (after epoll returns, before the next epoll call), unless
> it implements some kind of artificial safety measure.

It is up to your application to handle data arrival correctly, according
to the latency/throughput constraints of your software.
The "read until EAGAIN" cited in the epoll man pages does not mean that
you have to exhaust the data in one single event-processing loop.
After you have read and processed "enough data" (where "enough" depends
on the nature and constraints of your software), you can just drop that
fd into a "hot list" and pick the timeout for your next epoll_wait()
depending on whether that list is empty or not (you'd pick zero if not
empty). Proper handling of new and hot events will ensure that no
connection starves for service.
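
Something along the lines of this untested sketch (struct conn, the
READ_BUDGET value, consume() and close_conn() are placeholders for
your own code):

#include <errno.h>
#include <stddef.h>
#include <unistd.h>
#include <sys/epoll.h>

#define READ_BUDGET (64 * 1024)		/* "enough data" per event */

struct conn {
	int fd;
	int hot;			/* already on the hot list? */
	struct conn *hot_next;
};

static struct conn *hot_head;

/* Read at most READ_BUDGET bytes; if the socket may still hold data
 * afterwards, park it on the hot list instead of looping to EAGAIN. */
static void service(struct conn *c)
{
	char buf[4096];
	size_t done = 0;

	while (done < READ_BUDGET) {
		ssize_t n = read(c->fd, buf, sizeof(buf));

		if (n > 0) {
			consume(buf, (size_t)n);
			done += (size_t)n;
			continue;
		}
		if (n == 0 || (errno != EAGAIN && errno != EWOULDBLOCK))
			close_conn(c);	/* must also unlink it if hot */
		return;			/* drained (or dead): not hot */
	}
	if (!c->hot) {			/* budget used up: mark hot */
		c->hot = 1;
		c->hot_next = hot_head;
		hot_head = c;
	}
}

static void event_loop(int epfd)
{
	struct epoll_event ev[128];

	for (;;) {
		/* Zero timeout while hot connections are pending. */
		int i, n = epoll_wait(epfd, ev, 128, hot_head ? 0 : -1);
		struct conn *hot = hot_head;

		hot_head = NULL;
		for (i = 0; i < n; i++)
			service(ev[i].data.ptr); /* set at EPOLL_CTL_ADD */
		while (hot) {
			struct conn *c = hot;

			hot = c->hot_next;
			c->hot = 0;
			service(c);
		}
	}
}

The hot flag is only there so that a connection reported by both
epoll_wait() and the hot list in the same turn is not linked twice.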



- Davide

2009-12-19 22:32:07

by Nikolai ZHUBR

Subject: Re[2]: epoll'ing tcp sockets for reading

Saturday, December 19, 2009, 9:07:05 PM, Davide Libenzi wrote:
> On Sat, 19 Dec 2009, Nikolai ZHUBR wrote:

>> Hello people, I have a question about epoll'ing tcp sockets.
>>
>> Is it possible (with epoll or some other good method) to get userspace
>> notified not only of the fact that some data has become available on
>> the socket, but also of the _size_ of the data available for reading,
>> associated with this exact event?
>>
>> Yes, read'ing until EAGAIN or using FIONREAD would provide this
>> sort of information, but there is a problem: if data keeps arriving
>> continuously, an application could get stuck reading from one socket
>> indefinitely (after epoll returns, before the next epoll call), unless
>> it implements some kind of artificial safety measure.

> It is up to your application to handle data arrival correctly, according
> to the latency/throughput constraints of your software.
> The "read until EAGAIN" cited in the epoll man pages does not mean that
> you have to exhaust the data in one single event-processing loop.
> After you have read and processed "enough data" (where "enough" depends
> on the nature and constraints of your software), you can just drop that
> fd into a "hot list" and pick the timeout for your next epoll_wait()
> depending on whether that list is empty or not (you'd pick zero if not
> empty). Proper handling of new and hot events will ensure that no
> connection starves for service.

Well, no doubt, terrible starvation can be avoided this way, ok.
But doesn't this mean userspace code is forced to make decisions (when
to pause reading new data and move on to other sockets, etc.) based on
rather abstract/imprecise/overcomplicated assumptions, and/or with the
help of additional syscalls, while a simple and reasonable hint for
such a decision is wasted somewhere on the way from kernelspace to
userspace?

(Not that I have anything better, really; I'm just trying to find the
best approach and its limitations.)

Thank you!

Nikolai ZHUBR

> - Davide

2009-12-19 22:56:26

by Davide Libenzi

Subject: Re[2]: epoll'ing tcp sockets for reading

On Sun, 20 Dec 2009, Nikolai ZHUBR wrote:

> > It is up to your application to handle data arrival correctly, according
> > to the latency/throughput constraints of your software.
> > The "read until EAGAIN" cited in the epoll man pages does not mean that
> > you have to exhaust the data in one single event-processing loop.
> > After you have read and processed "enough data" (where "enough" depends
> > on the nature and constraints of your software), you can just drop that
> > fd into a "hot list" and pick the timeout for your next epoll_wait()
> > depending on whether that list is empty or not (you'd pick zero if not
> > empty). Proper handling of new and hot events will ensure that no
> > connection starves for service.
>
> Well, no doubt, terrible starvation can be avoided this way, ok.
> But doesn't this mean userspace code is forced to make decisions (when
> to pause reading new data and move on to other sockets, etc.) based on
> rather abstract/imprecise/overcomplicated assumptions, and/or with the
> help of additional syscalls, while a simple and reasonable hint for
> such a decision is wasted somewhere on the way from kernelspace to
> userspace?

The kernel cannot make decisions based on something whose knowledge is
userspace bound.
What you define as "abstract/imprecise/overcomplicated" are simply
decisions that you, as implementor of the upper-layer protocol, have to
make in order to implement your userspace code.
And I, personally, see nothing even close to complicated in such code.
Whenever you're asking for an abstraction/API to implement a kind of
software that exists in large quantities on a system, you've got to ask
yourself how relevant such an abstraction really is, if all the existing
software has done w/out it.



- Davide

2009-12-20 00:21:04

by Nikolai ZHUBR

Subject: Re[3]: epoll'ing tcp sockets for reading

Sunday, December 20, 2009, 1:56:22 AM, Davide Libenzi wrote:
[trim]
> The kernel cannot make decisions based on something whose knowledge is
> userspace bound.
I didn't mean that. I just meant it would be useful to also let the
caller of epoll know, immediately and in some "atomic" manner, the size
of the data related to a specific EPOLLIN event, because the kernel
probably knows this size already.
The same thing can be approximately "emulated" by requesting FIONREAD
for all EPOLLIN-ready sockets just after epoll returns, before any
other work. It just wouldn't look very elegant, IMHO.
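
I.e., something like this untested sketch, right after epoll_wait()
returns (struct conn and its read_limit field are hypothetical
application-side bookkeeping):

#include <stddef.h>
#include <sys/epoll.h>
#include <sys/ioctl.h>

struct conn {
	int fd;
	size_t read_limit;	/* bytes this turn may consume */
};

/* Snapshot how much data each ready socket had at the moment
 * epoll_wait() returned; reads during this turn are then capped at
 * read_limit, so bytes arriving later wait for the next turn. */
static void snapshot_ready(struct epoll_event *ev, int nevents)
{
	int i;

	for (i = 0; i < nevents; i++) {
		struct conn *c = ev[i].data.ptr;
		int pending = 0;

		if ((ev[i].events & EPOLLIN) &&
		    ioctl(c->fd, FIONREAD, &pending) == 0)
			c->read_limit = (size_t)pending;
	}
}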

> What you define as "abstract/imprecise/overcomplicated" are simply
> decisions that you, as implementor of the upper-layer protocol, have to
> make in order to implement your userspace code.
> And I, personally, see nothing even close to complicated in such code.
> Whenever you're asking for an abstraction/API to implement a kind of
> software that exists in large quantities on a system, you've got to ask
> yourself how relevant such an abstraction really is, if all the existing
> software has done w/out it.
Ok, maybe that's what I should have asked in the first place. What
would you recommend as a reference implementation showing clean,
efficient, fail-safe usage of epoll with sockets?
I've actually been googling for about two weeks to find one, but the
results are rather scarce and dubious.

Thank you!

Nikolai ZHUBR


> - Davide

2009-12-20 15:54:18

by Davide Libenzi

Subject: Re[3]: epoll'ing tcp sockets for reading

On Sun, 20 Dec 2009, Nikolai ZHUBR wrote:

> Sunday, December 20, 2009, 1:56:22 AM, Davide Libenzi wrote:
> [trim]
> > The kernel cannot make decisions based on something whose knowledge is
> > userspace bound.
> I didn't mean that. I just meant it would be useful to also let the
> caller of epoll know, immediately and in some "atomic" manner, the size
> of the data related to a specific EPOLLIN event, because the kernel
> probably knows this size already.
> The same thing can be approximately "emulated" by requesting FIONREAD
> for all EPOLLIN-ready sockets just after epoll returns, before any
> other work. It just wouldn't look very elegant, IMHO.

There is no such thing as an "atomic manner", since by the time you
read the event, more data might have come. It's just flawed, you see
that?



> > What you define as "abstract/imprecise/overcomplicated" are simply
> > decisions that you, as implementor of the upper-layer protocol, have to
> > make in order to implement your userspace code.
> > And I, personally, see nothing even close to complicated in such code.
> > Whenever you're asking for an abstraction/API to implement a kind of
> > software that exists in large quantities on a system, you've got to ask
> > yourself how relevant such an abstraction really is, if all the existing
> > software has done w/out it.
> Ok, maybe that's what I should have asked in the first place. What
> would you recommend as a reference implementation showing clean,
> efficient, fail-safe usage of epoll with sockets?
> I've actually been googling for about two weeks to find one, but the
> results are rather scarce and dubious.

My reference was not to epoll in particular, but more generally to all
the network/server software that has been written using select/poll and
the current POSIX APIs, w/out needing to know how much data there was
on the link.



- Davide

2009-12-20 16:14:30

by Willy Tarreau

Subject: Re: epoll'ing tcp sockets for reading

Hi Davide,

On Sun, Dec 20, 2009 at 07:54:09AM -0800, Davide Libenzi wrote:
> On Sun, 20 Dec 2009, Nikolai ZHUBR wrote:
>
> > Sunday, December 20, 2009, 1:56:22 AM, Davide Libenzi wrote:
> > [trim]
> > > The kernel cannot make decisions based on something whose knowledge is
> > > userspace bound.
> > I didn't mean that. I just meant it would be useful to also let the
> > caller of epoll know, immediately and in some "atomic" manner, the size
> > of the data related to a specific EPOLLIN event, because the kernel
> > probably knows this size already.
> > The same thing can be approximately "emulated" by requesting FIONREAD
> > for all EPOLLIN-ready sockets just after epoll returns, before any
> > other work. It just wouldn't look very elegant, IMHO.
>
> There is no such thing as an "atomic manner", since by the time you
> read the event, more data might have come. It's just flawed, you see
> that?

I think that what Nikolai meant was the ability to wake up as soon as
there are *at least* XXX bytes ready. But while I can understand why
it would in theory save some code, in practice he would still have to
properly handle corner cases, which would defeat the original purpose
of his modification:

- if he waits for more data than the socket buffer can hold, he
will never wake up;

- if my memory serves me right, the copy_and_cksum() code only knows
whether a segment is correct during its transfer to userland, which
means that epoll() could very well wake up with XXX apparent bytes
ready, but the read would fail before XXX due to an invalid checksum
on an intermediate segment. So the code would still have to take
care of that situation anyway.

The last point implies the complete implementation of the code he wants
to avoid anyway, and the first one implies it will be hard to know when
this would work and when it would not. This means that while at first
glance this behaviour could look useful, in practice it would be useless.
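
For the record, the closest existing knob for "wake me at XXX bytes"
is SO_RCVLOWAT, but whether select/poll/epoll honour it has varied
across kernel versions, and the corner cases above still apply, so I
would not build on it. Untested sketch:

#include <sys/socket.h>

/* Ask the socket layer to report readability only once at least
 * 'bytes' are queued.  Check your kernel before relying on this. */
static int set_rcvlowat(int fd, int bytes)
{
	return setsockopt(fd, SOL_SOCKET, SO_RCVLOWAT,
			  &bytes, sizeof(bytes));
}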

Regards,
Willy

2009-12-20 23:18:20

by Nikolai ZHUBR

Subject: Re[2]: epoll'ing tcp sockets for reading

Hello Willy,
Sunday, December 20, 2009, 7:14:22 PM, Willy Tarreau wrote:
>> > The same thing can be approximately "emulated" by requesting FIONREAD
>> > for all EPOLLIN-ready sockets just after epoll returns, before any
>> > other work. It just wouldn't look very elegant, IMHO.
>>
>> There is no such thing as an "atomic manner", since by the time you
>> read the event, more data might have come. It's just flawed, you see
>> that?
Well, a careful application should choose not to read such newly
arrived data at this point yet, because that data actually belongs to
the next turn, see below. In other words, the read limit is known at
the time epoll returns, and this value need not change until the next
epoll call, no matter how much more data arrives meanwhile. (And that
is why FIONREAD is not perfectly suited for this - it always reports
all the data present at the moment it is called.)

> I think that what Nikolai meant was the ability to wake up as soon as
> there are *at least* XXX bytes ready. But while I can understand why
> it would in theory save some code, in practice he would still have to

Uhhh, no. What I want is to ensure that incoming blocks of network data
(possibly belonging to different connections) are pulled in and processed
by the application in approximately the same order as they arrive from
the network. As long as no real queue exists for that, an application
must at least take care to _limit_ the amount of data it reads from any
one socket per epoll call. (Otherwise, some very active connection with
lots of incoming data might cause other connections to starve badly.)
So, the application needs to find a value for the above limit. The most
reasonable value, imho, would simply be the amount of data that actually
arrived on this socket between two successive epoll calls (the latest
one and the previous one). My point was that it would be handy if epoll
offered some way to get this value automatically (filled into
epoll_event, maybe?).
(Though FIONREAD can probably do the job reasonably well in most cases.)
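
For example (untested; read_limit would be the FIONREAD snapshot taken
right after epoll returns, as in my earlier sketch, and consume() is
application-defined):

#include <stddef.h>
#include <unistd.h>

struct conn {
	int fd;
	size_t read_limit;	/* FIONREAD snapshot from epoll time */
};

/* Read at most c->read_limit bytes this turn; whatever arrived after
 * the snapshot stays queued for the next epoll turn. */
static void service_turn(struct conn *c)
{
	char buf[4096];

	while (c->read_limit > 0) {
		size_t want = sizeof(buf) < c->read_limit ?
			      sizeof(buf) : c->read_limit;
		ssize_t n = read(c->fd, buf, want);

		if (n <= 0)
			break;	/* EAGAIN/EOF/error: handled elsewhere */
		consume(buf, (size_t)n);
		c->read_limit -= (size_t)n;
	}
}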

Thank you!

Nikolai ZHUBR

> properly handle corner cases, which would defeat the original purpose
> of his modification:

> - if he waits for more data than the socket buffer can hold, he
> will never wake up;

> - if my memory serves me right, the copy_and_cksum() code only knows
> whether a segment is correct during its transfer to userland, which
> means that epoll() could very well wake up with XXX apparent bytes
> ready, but the read would fail before XXX due to an invalid checksum
> on an intermediate segment. So the code would still have to take
> care of that situation anyway.

> The last point implies the complete implementation of the code he wants
> to avoid anyway, and the first one implies it will be hard to know when
> this would work and when it would not. This means that while at first
> glance this behaviour could look useful, in practice it would be useless.

> Regards,
> Willy

2009-12-21 05:46:21

by Willy Tarreau

Subject: Re: epoll'ing tcp sockets for reading

On Mon, Dec 21, 2009 at 02:26:19AM +0300, Nikolai ZHUBR wrote:
> Hello Willy,
> Sunday, December 20, 2009, 7:14:22 PM, Willy Tarreau wrote:
> >> > The same thing can be approximately "emulated" by requesting FIONREAD
> >> > for all EPOLLIN-ready sockets just after epoll returns, before any
> >> > other work. It just wouldn't look very elegant, IMHO.
> >>
> >> There is no such thing as an "atomic manner", since by the time you
> >> read the event, more data might have come. It's just flawed, you see
> >> that?
> Well, a careful application should choose not to read such newly
> arrived data at this point yet, because that data actually belongs to
> the next turn, see below. In other words, the read limit is known at
> the time epoll returns, and this value need not change until the next
> epoll call, no matter how much more data arrives meanwhile. (And that
> is why FIONREAD is not perfectly suited for this - it always reports
> all the data present at the moment it is called.)
>
> > I think that what Nikolai meant was the ability to wake up as soon as
> > there are *at least* XXX bytes ready. But while I can understand why
> > it would in theory save some code, in practice he would still have to
>
> Uhhh, no. What I want is to ensure that incoming blocks of network data
> (possibly belonging to different connections) are pulled in and processed
> by the application in approximately the same order as they arrive from
> the network.

Then I don't see even a remote possibility of that, because pollers
generally return a limited number of file descriptors on which activity
was detected, so it is very possible that the one that talked first
will be reported last.

> As long as no real queue exists for that, an application must
> at least take care to _limit_ the amount of data it reads from any
> one socket per epoll call. (Otherwise, some very active connection
> with lots of incoming data might cause other connections to starve
> badly.)

Then it's up to the application to limit the amount of data it reads from
an FD. But in general, you'd prefer to read everything and only make that
data available to the readers at a lower pace.

> So, the application needs to find a value for the above limit. The most
> reasonable value, imho, would simply be the amount of data that actually
> arrived on this socket between two successive epoll calls (the latest
> one and the previous one). My point was that it would be handy if epoll
> offered some way to get this value automatically (filled into
> epoll_event, maybe?).

It will not change anything, as Davide explained, because that amount
will have increased by the time you proceed with read(), and you will
get more data. So it is the read() you have to limit, not epoll.

Also, you should be aware that the time you'd waste doing this is
probably greater than the delta time between two reads on two distinct
sockets, which means that in practice you will not necessarily report
data in network order, but you'll just be approximately fair between
all FDs, at the expense of a high syscall overhead caused by small
reads.

> (Though FIONREAD can probably do the job reasonably well in most cases.)

FIONREAD will just tell you how many bytes are pending for a read. Its
only use is for allocating memory before performing the read. If you
already have a buffer available, just read into it, you'll get the
info too.
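
That is, at best something like this untested sketch:

#include <stdlib.h>
#include <unistd.h>
#include <sys/ioctl.h>

/* Use FIONREAD only to size a buffer before the read.  With a
 * preallocated buffer, a plain read() returns the same byte count
 * without the extra syscall. */
static ssize_t read_pending(int fd, char **out)
{
	int pending = 0;
	ssize_t n;

	if (ioctl(fd, FIONREAD, &pending) < 0 || pending <= 0)
		return -1;
	*out = malloc((size_t)pending);
	if (*out == NULL)
		return -1;
	n = read(fd, *out, (size_t)pending);
	if (n < 0) {
		free(*out);
		*out = NULL;
	}
	return n;
}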

Willy

2009-12-21 09:26:46

by Nikolai ZHUBR

Subject: Re[2]: epoll'ing tcp sockets for reading

Monday, December 21, 2009, 8:46:15 AM, Willy Tarreau wrote:
> Also, you should be aware that the time you'd waste doing this is
> probably greater than the delta time between two reads on two distinct
> sockets, which means that in practice you will not necessarily report
> data in network order, but you'll just be approximately fair between
> all FDs, at the expense of a high syscall overhead caused by small
> reads.
Agreed absolutely.
Anyway, I think I should stop abusing the ml at this point :)
Thank you!

Nikolai ZHUBR