2002-10-17 17:54:09

by John Myers

Subject: epoll (was Re: [PATCH] async poll for 2.5)

Charles 'Buck' Krasic wrote:

>Not exactly. I'm saying that the context in which /dev/epoll is used
>(at least originally), is non-blocking socket IO. Anybody who has
>worked with that API can tell you there are subtleties, and that if
>they're ignored, will certainly lead to pitfalls. These are not the
>fault of the /dev/epoll interface.
>
The particular subtlety I am pointing out is the fault of the currently
defined /dev/epoll interface and can be fixed by making a minor change
to the /dev/epoll interface.

>I agree if you are talking about AIO as a whole. But epoll is more
>limited in its scope, it really relates only to poll()/select not the
>whole IO api.
>
>
My objection to the current /dev/epoll API does apply to said "limited
scope"; it is not dependent on the scope being "AIO as a whole."



2002-10-18 17:52:38

by Davide Libenzi

Subject: Re: epoll (was Re: [PATCH] async poll for 2.5)

On Fri, 18 Oct 2002, Dan Kegel wrote:

> I was afraid someone would be confused by the examples. Davide loves
> coroutines (check out http://www.xmailserver.org/linux-patches/nio-improve.html )
> and I think his examples are written in that style. He really means
> what you think he should be meaning :-)
> which is something like
> while (1) {
>     grab next bunch of events from epoll
>     for each event
>         while (do_io(event->fd) != EAGAIN);
> }
> I'm pretty sure.

Yes, I like coroutines :) even if sometimes you have to be careful with
the stack usage ( at least if you do not want to waste all your memory ).
Since there are N coroutines/stacks for N connections, even 4Kb does mean
something when N is about 100000. The other solution is a state machine,
cheaper in memory, a little bit more complex to code. Coroutines,
though, help a graceful migration from a thread based application to a
multiplexed one. If you take a threaded application and you code your
connect()/accept()/recv()/send() like the ones coded in the example http
server linked inside the epoll page, you can easily migrate a threaded app
by simply adding a distribution loop like :

for (;;) {
    get_events();
    for_each_fd
        call_coroutines_associated_with_ready_fd();
}



- Davide
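
For comparison with the coroutine approach, here is a minimal sketch of the state-machine alternative mentioned above. It is only an illustration: the names (struct conn, conn_step, demo_state_machine) and the toy "read one line, send one reply" protocol are assumptions, not taken from the example http server. Each connection keeps an explicit state instead of a private stack, and the dispatch loop calls conn_step() whenever the fd is reported ready.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/* All names here are hypothetical, for illustration only. */
enum conn_state { ST_READ_REQUEST, ST_WRITE_REPLY, ST_DONE };

struct conn {
    int fd;                 /* non-blocking socket */
    enum conn_state state;  /* takes the place of a coroutine's saved stack */
    char req[256];
    size_t req_len;
    const char *reply;
    size_t reply_off;
};

static int would_block(void)
{
    return errno == EAGAIN || errno == EWOULDBLOCK;
}

/* Advance the connection as far as possible without blocking.
 * Returns 0 to wait for the next event, 1 when finished, -1 on error. */
static int conn_step(struct conn *c)
{
    ssize_t n;

    assert(c != NULL);
    for (;;) {
        switch (c->state) {
        case ST_READ_REQUEST:
            n = read(c->fd, c->req + c->req_len, sizeof c->req - c->req_len);
            if (n < 0 && errno == EINTR)
                continue;
            if (n < 0)
                return would_block() ? 0 : -1;
            if (n == 0)
                return -1;              /* peer closed before a full line */
            c->req_len += (size_t)n;
            if (memchr(c->req, '\n', c->req_len))
                c->state = ST_WRITE_REPLY;
            else if (c->req_len == sizeof c->req)
                return -1;              /* request line too long */
            break;
        case ST_WRITE_REPLY:
            n = write(c->fd, c->reply + c->reply_off,
                      strlen(c->reply) - c->reply_off);
            if (n < 0 && errno == EINTR)
                continue;
            if (n < 0)
                return would_block() ? 0 : -1;
            c->reply_off += (size_t)n;
            if (c->reply_off == strlen(c->reply))
                c->state = ST_DONE;
            break;
        case ST_DONE:
            return 1;
        }
    }
}

/* Demo over a socketpair: the request arrives in two pieces.
 * Returns 0 if every step behaved as the comments say. */
static int demo_state_machine(void)
{
    int sv[2];
    char out[16] = {0};
    struct conn c = { .state = ST_READ_REQUEST, .reply = "ok\n" };

    if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) < 0)
        return -1;
    fcntl(sv[0], F_SETFL, fcntl(sv[0], F_GETFL) | O_NONBLOCK);
    c.fd = sv[0];

    if (conn_step(&c) != 0)                     /* nothing to read yet */
        return -2;
    if (write(sv[1], "GET", 3) != 3 || conn_step(&c) != 0)
        return -3;                              /* partial line: keep waiting */
    if (write(sv[1], " /\n", 3) != 3 || conn_step(&c) != 1)
        return -4;                              /* full line: reply was sent */
    if (read(sv[1], out, sizeof out - 1) != 3 || strcmp(out, "ok\n") != 0)
        return -5;
    close(sv[0]);
    close(sv[1]);
    return 0;
}
```

The per-connection cost is the size of struct conn rather than a whole stack, which is the memory trade-off described above; the price is that the control flow has to be spelled out as states.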


2002-10-18 17:31:59

by Dan Kegel

Subject: Re: epoll (was Re: [PATCH] async poll for 2.5)

Mark Mielke wrote:
>>>>> while (read() == EAGAIN)
>>>>>     wait(POLLIN);
>
> I find myself still not understanding this thread. Lots of examples of
> code that should or should not be used, but I would always choose:
>
> ... ensure file descriptor is blocking ...
> for (;;) {
>     int nread = read(...);
>     ...
> }
>
> Over the above, or any derivative of the above.
>
> What would be the point of using an event notification mechanism for
> synchronous reads with no other multiplexed options?
>
> A 'proper' event loop is significantly more complicated. Since everybody
> here knows this... I'm still confused...

I was afraid someone would be confused by the examples. Davide loves
coroutines (check out http://www.xmailserver.org/linux-patches/nio-improve.html )
and I think his examples are written in that style. He really means
what you think he should be meaning :-)
which is something like
while (1) {
    grab next bunch of events from epoll
    for each event
        while (do_io(event->fd) != EAGAIN);
}
I'm pretty sure.

- Dan
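
The loop sketched above can be written out in C roughly as follows. This is a hedged illustration, not the code under discussion: it assumes the epoll system-call interface as documented in epoll(7) (epoll_create1, epoll_ctl, epoll_wait, EPOLLET) rather than the /dev/epoll ioctl interface of this thread, and drain_fd, dispatch_once and demo_event_loop are made-up names. A pipe stands in for a socket.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <sys/epoll.h>
#include <unistd.h>

/* Hypothetical helper: the "while (do_io(fd) != EAGAIN)" part.
 * Reads until EAGAIN or EOF; returns bytes consumed, or -1 on error. */
static long drain_fd(int fd)
{
    char buf[4096];
    long total = 0;

    assert(fd >= 0);
    for (;;) {
        ssize_t n = read(fd, buf, sizeof buf);
        if (n > 0)
            total += n;
        else if (n == 0)
            return total;               /* EOF */
        else if (errno == EINTR)
            continue;
        else if (errno == EAGAIN || errno == EWOULDBLOCK)
            return total;               /* nothing left: safe to wait again */
        else
            return -1;
    }
}

/* One pass of the outer loop: grab a batch of events, drain each ready fd.
 * Returns the bytes consumed in this pass, or -1 on error. */
static long dispatch_once(int epfd, int timeout_ms)
{
    struct epoll_event evs[64];
    long total = 0;
    int n = epoll_wait(epfd, evs, 64, timeout_ms);

    if (n < 0)
        return -1;
    for (int i = 0; i < n; i++) {
        long got = drain_fd(evs[i].data.fd);
        if (got < 0)
            return -1;
        total += got;
    }
    return total;
}

/* Demo: register the read end of a pipe edge-triggered, write 11 bytes,
 * run one dispatch pass. Returns the number of bytes the pass consumed. */
static long demo_event_loop(void)
{
    int p[2];
    struct epoll_event ev = { .events = EPOLLIN | EPOLLET };
    int epfd = epoll_create1(0);
    long got;

    if (epfd < 0 || pipe(p) < 0)
        return -1;
    fcntl(p[0], F_SETFL, fcntl(p[0], F_GETFL) | O_NONBLOCK);
    ev.data.fd = p[0];
    if (epoll_ctl(epfd, EPOLL_CTL_ADD, p[0], &ev) < 0)
        return -1;
    if (write(p[1], "hello epoll", 11) != 11)
        return -1;
    got = dispatch_once(epfd, 100);
    close(p[0]);
    close(p[1]);
    close(epfd);
    return got;
}
```

A real server would call dispatch_once() inside for (;;) and look up a per-connection structure from the event instead of using the bare fd.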


2002-10-18 18:57:18

by Mark Mielke

Subject: Re: epoll (was Re: [PATCH] async poll for 2.5)

On Fri, Oct 18, 2002 at 02:55:20PM -0400, Chris Friesen wrote:
> Mark Mielke wrote:
> >I find myself still not understanding this thread. Lots of examples of
> The main point here is determining which of many open connections need
> servicing.

> select() and poll() do not scale well, so this is where stuff like
> /dev/epoll comes in--to tell you which of those file descriptors need to be
> serviced.

I know what the point is, and I know the concept behind /dev/epoll.

I'm speaking more about the snippets of code that fail to show the
real benefits of /dev/epoll, and the following discussion about which
snippet is better than which other snippet.

mark

--
[email protected]/[email protected]/[email protected] __________________________
. . _ ._ . . .__ . . ._. .__ . . . .__ | Neighbourhood Coder
|\/| |_| |_| |/ |_ |\/| | |_ | |/ |_ |
| | | | | \ | \ |__ . | | .|. |__ |__ | \ |__ | Ottawa, Ontario, Canada

One ring to rule them all, one ring to find them, one ring to bring them all
and in the darkness bind them...

http://mark.mielke.cc/

2002-10-18 19:11:28

by Chris Friesen

Subject: Re: epoll (was Re: [PATCH] async poll for 2.5)

Mark Mielke wrote:
>>>>> while (read() == EAGAIN)
>>>>>     wait(POLLIN);
>>>>>
>
> I find myself still not understanding this thread. Lots of examples of
> code that should or should not be used, but I would always choose:
>
> ... ensure file descriptor is blocking ...
> for (;;) {
>     int nread = read(...);
>     ...
> }
>
> Over the above, or any derivative of the above.

The main point here is determining which of many open connections need servicing.

select() and poll() do not scale well, so this is where stuff like /dev/epoll comes in--to tell you
which of those file descriptors need to be serviced.

Chris



--
Chris Friesen | MailStop: 043/33/F10
Nortel Networks | work: (613) 765-0557
3500 Carling Avenue | fax: (613) 765-2986
Nepean, ON K2H 8E9 Canada | email: [email protected]

2002-10-18 18:51:55

by Mark Mielke

Subject: Re: epoll (was Re: [PATCH] async poll for 2.5)

On Fri, Oct 18, 2002 at 10:41:28AM -0700, Davide Libenzi wrote:
> Yes, I like coroutines :) even if sometimes you have to be careful with
> the stack usage ( at least if you do not want to waste all your memory ).
> Since there are N coroutines/stacks for N connections, even 4Kb does mean
> something when N is about 100000. The other solution is a state machine,
> cheaper in memory, a little bit more complex to code. Coroutines,
> though, help a graceful migration from a thread based application to a
> multiplexed one. If you take a threaded application and you code your
> connect()/accept()/recv()/send() like the ones coded in the example http
> server linked inside the epoll page, you can easily migrate a threaded app
> by simply adding a distribution loop like :

> for (;;) {
>     get_events();
>     for_each_fd
>         call_coroutines_associated_with_ready_fd();
> }

If each of these co-routines does "while (read() != EAGAIN) wait",
your implementation is seriously flawed unless you do not mind certain
file descriptors that may have lower numbers to have a real time
priority higher than file descriptors with a higher number.

If efficiency is the true goal - the event loop itself needs to be
abstracted - not just the query and dispatch routines. Using /dev/epoll
in a way that is compatible with other applications is an exercise
in abuse, that will only show positive results because the alternatives
to /dev/epoll are ineffective, not because /dev/epoll is a better model.

I like the idea of /dev/epoll - which means that if I used it, I would
implement an efficient model with it, not a traditional model that through
hackery will function transparently using /dev/epoll.

But that might just be me...

mark


2002-10-18 19:05:06

by Davide Libenzi

Subject: Re: epoll (was Re: [PATCH] async poll for 2.5)

On Fri, 18 Oct 2002, Mark Mielke wrote:

> On Fri, Oct 18, 2002 at 10:41:28AM -0700, Davide Libenzi wrote:
> > Yes, I like coroutines :) even if sometimes you have to be careful with
> > the stack usage ( at least if you do not want to waste all your memory ).
> > Since there are N coroutines/stacks for N connections, even 4Kb does mean
> > something when N is about 100000. The other solution is a state machine,
> > cheaper in memory, a little bit more complex to code. Coroutines,
> > though, help a graceful migration from a thread based application to a
> > multiplexed one. If you take a threaded application and you code your
> > connect()/accept()/recv()/send() like the ones coded in the example http
> > server linked inside the epoll page, you can easily migrate a threaded app
> > by simply adding a distribution loop like :
>
> > for (;;) {
> >     get_events();
> >     for_each_fd
> >         call_coroutines_associated_with_ready_fd();
> > }
>
> If each of these co-routines does "while (read() != EAGAIN) wait",
> your implementation is seriously flawed unless you do not mind certain
> file descriptors that may have lower numbers to have a real time
> priority higher than file descriptors with a higher number.

No, once a file descriptor is ready the associated coroutine is called
with a co_call() and, for example, the read function is :

int dph_read(struct dph_conn *conn, char *buf, int nbyte)
{
    int n;

    while ((n = read(conn->sfd, buf, nbyte)) < 0) {
        if (errno == EINTR)
            continue;
        if (errno != EAGAIN && errno != EWOULDBLOCK)
            return -1;
        /* No data right now: record interest and yield to the scheduler */
        conn->events = POLLIN | POLLERR | POLLHUP;
        co_resume(conn);
    }
    return n;
}

where co_resume() preempts the current coroutine and runs another one ( that
is the scheduler coroutine ). The scheduler coroutine looks like :

static int dph_scheduler(int loop, unsigned int timeout)
{
    int ii;
    static int nfds = 0;
    struct dph_conn *conn;
    static struct pollfd *pfds = NULL;

    do {
        if (!nfds) {
            struct evpoll evp;

            evp.ep_timeout = timeout;
            evp.ep_resoff = 0;

            nfds = ioctl(kdpfd, EP_POLL, &evp);
            pfds = (struct pollfd *) (map + evp.ep_resoff);
        }
        for (ii = 0; ii < EPLIMTEVENTS && nfds > 0; ii++, nfds--, pfds++) {
            if ((conn = dph_find(pfds->fd))) {
                conn->revents = pfds->revents;

                if (conn->revents & conn->events)
                    co_call(conn->co, conn);
            }
        }
    } while (loop);
    return 0;
}

These functions are taken from the really simple example http server used
to test/compare /dev/epoll with poll()/select()/rt-sig//dev/poll :

http://www.xmailserver.org/linux-patches/dphttpd_last.tar.gz




- Davide


2002-10-19 06:50:21

by Mark Mielke

Subject: Re: epoll (was Re: [PATCH] async poll for 2.5)

On Fri, Oct 18, 2002 at 12:16:48PM -0700, Davide Libenzi wrote:
> These functions are taken from the really simple example http server used
> to test/compare /dev/epoll with poll()/select()/rt-sig//dev/poll :

They still represent an excessively complicated model that attempts to
implement /dev/epoll the same way that one would implement poll()/select().

Sometimes the answer isn't emulation, or comparability.

Sometimes the answer is innovation.

mark


2002-10-19 16:04:59

by Charlie Krasic

Subject: Re: epoll (was Re: [PATCH] async poll for 2.5)


Mark Mielke <[email protected]> writes:

> They still represent an excessively complicated model that attempts to
> implement /dev/epoll the same way that one would implement poll()/select().

epoll is about fixing one aspect of an otherwise well established api.
That is, fixing the scalability of poll()/select() for applications
based on non-blocking sockets.

Yes, programming non-blocking sockets has its complexity. Most people
probably end up writing their own API to simplify things, building
wrapper libraries above. However, a wrapper library can not fix the
performance problems of poll()/select().

epoll() is a relatively modest kernel modification that delivers
impressive performance benefits. It isn't exactly a drop-in
replacement for poll()/select(), especially because of the difference
between level and edge semantics. That's why wrapper libraries make
sense.

> Sometimes the answer isn't emulation, or comparability.
>
> Sometimes the answer is innovation.

> mark

Yes, building a better API into the kernel makes sense. Not only to
eliminate wrappers, but to generalize beyond sockets. I went to a local
user group meeting this week where a guy gave an overview of what's
new in kernel 2.5. When AIO was presented, the first question was
"why not use nonblocking IO?". I bet more than half the programmers
in the room had no idea that nonblocking flags have absolutely no
effect for files.

It seems to me that most people pushing on AIO so far have been using
it for files, e.g. Oracle. They have no choice. It's either that or
use clumsy worker thread arrangements. Network code on the other
hand, has had non-blocking IO available for years. That's why there
are all kinds of examples of web servers that use threads only for the
disk side of things.

But let's not mix them up. AIO is a new model (for linux anyway),
whose implementation is necessitating fairly major architectural
changes to the kernel.

epoll is a nice well contained fix, that fits neatly into the existing
architecture.

As it happens, epoll will probably have a place in AIO too. But there
is no mutual dependency between them. They are both important on
their own.

-- Buck

2002-10-19 17:05:31

by Davide Libenzi

Subject: Re: epoll (was Re: [PATCH] async poll for 2.5)

On Sat, 19 Oct 2002, Mark Mielke wrote:

> On Fri, Oct 18, 2002 at 12:16:48PM -0700, Davide Libenzi wrote:
> > These functions are taken from the really simple example http server used
> > to test/compare /dev/epoll with poll()/select()/rt-sig//dev/poll :
>
> They still represent an excessively complicated model that attempts to
> implement /dev/epoll the same way that one would implement poll()/select().
>
> Sometimes the answer isn't emulation, or comparability.
>
> Sometimes the answer is innovation.

Hem ... they're about 100 lines of code and they represent a complete I/O
dispatching engine. And yes, as you could guess from the source code,
the same skeleton was used, with different event retrieval methods, to
test poll() , rt-sig , /dev/poll and /dev/epoll. And, as I said in the
previous email, you could have implemented an I/O driven state machine. I
personally like coroutines a little bit more; that is the reason for this
implementation.



- Davide


2002-10-19 17:29:19

by Dan Kegel

Subject: Re: epoll (was Re: [PATCH] async poll for 2.5)

Mark Mielke wrote:
> On Fri, Oct 18, 2002 at 05:55:21PM -0700, John Myers wrote:
>
>>So whether or not a proposed set of epoll semantics is consistent with
>>your Platonic ideal of "use the fd until EAGAIN" is simply not an issue.
>>What matters is what works best in practice.
>
>
> From this side of the fence: One vote for "use the fd until EAGAIN" being
> flawed. If I wanted a method of monopolizing the event loop with real time
> priorities, I would implement real time priorities within the event loop.

The choice I see is between:
1. re-arming the one-shot notification when the user gets EAGAIN
2. re-arming the one-shot notification when the user reads all the data
   that was waiting (such that the very next read would return EAGAIN).

#1 is what Davide wants; I think John and Mark are arguing for #2.

I suspect that Davide would be happy with #2, but advises
programmers to read until EAGAIN anyway just to make things clear.

If the programmer is smart enough to figure out how to do that without
hitting EAGAIN, that's fine. Essentially, if he tries to get away
without getting an EAGAIN, and his program stalls because he didn't
read all the data that's available and thereby doesn't reset the
one-shot readiness event, it's his own damn fault, and he should
go back to using level-triggered techniques like classical poll()
or blocking i/o.

- Dan
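
The stall described above can be reproduced with a short experiment. The sketch below is an illustration under assumptions: it uses the epoll system-call interface with EPOLLET as documented in epoll(7), not the /dev/epoll interface of this thread, a pipe stands in for a socket, and demo_partial_read is a made-up name. It records the result of three zero-timeout epoll_wait() calls: after the data arrives, after reading only part of it, and after draining and receiving fresh data.

```c
#include <assert.h>
#include <fcntl.h>
#include <sys/epoll.h>
#include <unistd.h>

/* Hypothetical demo. Fills results[0..2] with the return values of three
 * zero-timeout epoll_wait() calls. Returns 0 on success, -1 on setup error. */
static int demo_partial_read(int results[3])
{
    int p[2];
    char buf[16];
    struct epoll_event ev = { .events = EPOLLIN | EPOLLET }, out;
    int epfd = epoll_create1(0);

    assert(results != NULL);
    if (epfd < 0 || pipe(p) < 0)
        return -1;
    fcntl(p[0], F_SETFL, fcntl(p[0], F_GETFL) | O_NONBLOCK);
    ev.data.fd = p[0];
    if (epoll_ctl(epfd, EPOLL_CTL_ADD, p[0], &ev) < 0)
        return -1;

    if (write(p[1], "0123456789", 10) != 10)
        return -1;
    results[0] = epoll_wait(epfd, &out, 1, 0);  /* edge: data arrived */

    if (read(p[0], buf, 4) != 4)                /* consume only part of it */
        return -1;
    results[1] = epoll_wait(epfd, &out, 1, 0);  /* 6 bytes left, no new edge */

    while (read(p[0], buf, sizeof buf) > 0)     /* now drain until EAGAIN */
        ;
    if (write(p[1], "x", 1) != 1)
        return -1;
    results[2] = epoll_wait(epfd, &out, 1, 0);  /* fresh data, fresh edge */

    close(p[0]);
    close(p[1]);
    close(epfd);
    return 0;
}
```

The middle call is the interesting one: data is still sitting in the buffer, yet no event is reported, so a program that went back to waiting at that point would stall until the peer sent something more.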


2002-10-19 18:46:32

by Charlie Krasic

Subject: Re: epoll (was Re: [PATCH] async poll for 2.5)


Dan Kegel <[email protected]> writes:

> The choice I see is between:
> 1. re-arming the one-shot notification when the user gets EAGAIN
> 2. re-arming the one-shot notification when the user reads all the data
>    that was waiting (such that the very next read would return EAGAIN).

> #1 is what Davide wants; I think John and Mark are arguing for #2.

I thought the debate was over how the initial arming of the one-shot
was done.

The choice above is a different issue.

Neither 1 nor 2 above accurately reflects what the code in the
kernel actually does.

Hitting EAGAIN does not "re-arm" the one-shot notification.

Consider TCP.

The tcp write code issues a POLLOUT edge when the socket-buffer fill
level drops below a hi-water mark (tcp_min_write_space()).

For reads, AFAIK, tcp issues POLLIN for every new TCP segment that
arrives, which get coalesced automatically by virtue of the getevents
barrier.

A short count means the application has hit an extreme end of the
buffer (completely full or empty). EAGAIN means the buffer was
already at the extreme. Either way you know just as well that new
activity will trigger a new edge event.

In summary, a short count is every bit as reliable as EAGAIN to know
that it is safe to wait on epoll_getevents.

Davide?

-- Buck

> I suspect that Davide would be happy with #2, but advises
> programmers to read until EAGAIN anyway just to make things clear.

It's a minor point, but based on the logic above, I think this advice
is overkill.

-- Buck

> If the programmer is smart enough to figure out how to do that without
> hitting EAGAIN, that's fine. Essentially, if he tries to get away
> without getting an EAGAIN, and his program stalls because he didn't
> read all the data that's available and thereby doesn't reset the
> one-shot readiness event, it's his own damn fault, and he should
> go back to using level-triggered techniques like classical poll()
> or blocking i/o.

> - Dan

2002-10-19 20:13:03

by Charlie Krasic

Subject: Re: epoll (was Re: [PATCH] async poll for 2.5)


Whoops. I just realized a flaw in my own argument.

With read, a short count might precede EOF. Indeed, in that case,
calling epoll_getevents would cause the connection to get stuck.

Never mind my earlier message then.

-- Buck

"Charles 'Buck' Krasic" <[email protected]> writes:

> In summary, a short count is every bit as reliable as EAGAIN to know
> that it is safe to wait on epoll_getevents.

> -- Buck

2002-10-19 20:48:53

by Dan Kegel

Subject: Re: epoll (was Re: [PATCH] async poll for 2.5)

Charles 'Buck' Krasic wrote:
>>In summary, a short count is every bit as reliable as EAGAIN to know
>>that it is safe to wait on epoll_getevents.
>
> Whoops. I just realized a flaw in my own argument.
>
> With read, a short count might precede EOF. Indeed, in that case,
> calling epoll_getevents would cause the connection to get stuck.

Maybe epoll should be extended with a specific EOF event.
Then short reads would be fine.
- Dan
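
To make the short-count problem concrete, here is a small sketch showing that a short read can be followed either by EAGAIN or by EOF, so the short count alone does not say which. The names (try_read, after_short_read, the R_* constants) are hypothetical and a pipe stands in for a socket. As an aside, the epoll interface that eventually shipped did gain an EPOLLRDHUP flag for stream sockets (Linux 2.6.17), which is close in spirit to the EOF event suggested above; the sketch does not rely on it.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/* Hypothetical classification of a single non-blocking read. */
enum read_outcome { R_DATA, R_EOF, R_WOULD_BLOCK, R_ERROR };

static enum read_outcome try_read(int fd, char *buf, size_t len, ssize_t *got)
{
    assert(buf != NULL && got != NULL);
    for (;;) {
        *got = read(fd, buf, len);
        if (*got > 0)
            return R_DATA;
        if (*got == 0)
            return R_EOF;               /* peer closed: no further edges */
        if (errno == EINTR)
            continue;
        if (errno == EAGAIN || errno == EWOULDBLOCK)
            return R_WOULD_BLOCK;       /* buffer empty: safe to wait */
        return R_ERROR;
    }
}

/* Write 5 bytes into a pipe, optionally close the writer, then read twice
 * with a 16-byte buffer. The first read is a short count of 5 either way;
 * the return value is the outcome of the second read (or -1 on setup error). */
static int after_short_read(int close_writer)
{
    int p[2];
    char buf[16];
    ssize_t got;
    enum read_outcome second;

    if (pipe(p) < 0)
        return -1;
    fcntl(p[0], F_SETFL, fcntl(p[0], F_GETFL) | O_NONBLOCK);
    if (write(p[1], "hello", 5) != 5)
        return -1;
    if (close_writer)
        close(p[1]);

    if (try_read(p[0], buf, sizeof buf, &got) != R_DATA || got != 5)
        return -1;
    second = try_read(p[0], buf, sizeof buf, &got);

    close(p[0]);
    if (!close_writer)
        close(p[1]);
    return (int)second;
}
```

With the writer still open the second read reports "would block"; with the writer closed it reports EOF. A program that stopped after the short count and waited for the next edge would never see the EOF in the second case, which is the stuck connection described above.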

2002-10-22 17:16:39

by Mark Mielke

Subject: Re: epoll (was Re: [PATCH] async poll for 2.5)

On Sat, Oct 19, 2002 at 09:10:52AM -0700, Charles 'Buck' Krasic wrote:
> Mark Mielke <[email protected]> writes:
> > They still represent an excessively complicated model that attempts to
> > implement /dev/epoll the same way that one would implement poll()/select().
> epoll is about fixing one aspect of an otherwise well established api.
> That is, fixing the scalability of poll()/select() for applications
> based on non-blocking sockets.

epoll is not a poll()/select() enhancement (unless it is used in
conjunction with poll()/select()). It is a poll()/select()
replacement.

Meaning... purposefully creating an API that is designed the way one
would design a poll()/select() loop is purposefully limiting the benefits
of /dev/epoll.

It's like inventing a power drill to replace the common screw driver,
but rather than plugging the power drill in, manually turning the
drill as if it was a socket wrench for the drill bit.

I find it an exercise in self defeat... except that /dev/epoll used the
same way one would use poll()/select() happens to perform better even
when it is crippled.

mark


2002-10-22 17:26:12

by Dan Kegel

Subject: Re: epoll (was Re: [PATCH] async poll for 2.5)

Mark Mielke wrote:
> epoll is not a poll()/select() enhancement (unless it is used in
> conjunction with poll()/select()). It is a poll()/select()
> replacement.
>
> Meaning... purposefully creating an API that is designed the way one
> would design a poll()/select() loop is purposefully limiting the benefits
> of /dev/epoll.
>
> It's like inventing a power drill to replace the common screw driver,
> but rather than plugging the power drill in, manually turning the
> drill as if it was a socket wrench for the drill bit.
>
> I find it an exercise in self defeat... except that /dev/epoll used the
> same way one would use poll()/select() happens to perform better even
> when it is crippled.

Agreed.
- Dan


2002-10-22 17:32:47

by Davide Libenzi

Subject: Re: epoll (was Re: [PATCH] async poll for 2.5)

On Tue, 22 Oct 2002, Mark Mielke wrote:

> On Sat, Oct 19, 2002 at 09:10:52AM -0700, Charles 'Buck' Krasic wrote:
> > Mark Mielke <[email protected]> writes:
> > > They still represent an excessively complicated model that attempts to
> > > implement /dev/epoll the same way that one would implement poll()/select().
> > epoll is about fixing one aspect of an otherwise well established api.
> > That is, fixing the scalability of poll()/select() for applications
> > based on non-blocking sockets.
>
> epoll is not a poll()/select() enhancement (unless it is used in
> conjunction with poll()/select()). It is a poll()/select()
> replacement.
>
> Meaning... purposefully creating an API that is designed the way one
> would design a poll()/select() loop is purposefully limiting the benefits
> of /dev/epoll.
>
> It's like inventing a power drill to replace the common screw driver,
> but rather than plugging the power drill in, manually turning the
> drill as if it was a socket wrench for the drill bit.
>
> I find it an exercise in self defeat... except that /dev/epoll used the
> same way one would use poll()/select() happens to perform better even
> when it is crippled.

Since the sys_epoll ( and /dev/epoll ) fd supports standard polling, you
can mix sys_epoll handling with other methods like poll() and AIO's
POLL function once it is ready. For example, for devices that sys_epoll
intentionally does not support, you can use a method like :

put_sys_epoll_fd_inside_XXX();
...
wait_for_XXX_events();
...
if (XXX_event_fd() == sys_epoll_fd) {
    sys_epoll_wait();
    for_each_sys_epoll_event {
        handle_fd_event();
    }
}



- Davide
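
Taking poll() as the "XXX" mechanism, the nesting outlined above might look like the sketch below. It is an illustration under assumptions: it uses the epoll system calls as documented in epoll(7), where an epoll fd is itself pollable and reads as ready when it has events waiting; demo_epoll_inside_poll is a made-up name and a pipe stands in for a real connection.

```c
#include <assert.h>
#include <poll.h>
#include <sys/epoll.h>
#include <unistd.h>

/* Hypothetical demo: watch an epoll fd from an outer poll() loop.
 * Returns the number of inner events handled, or -1 on setup error. */
static int demo_epoll_inside_poll(void)
{
    int p[2];
    struct epoll_event ev = { .events = EPOLLIN }, evs[8];
    struct pollfd pfd;
    int epfd = epoll_create1(0);
    int handled = 0;

    if (epfd < 0 || pipe(p) < 0)
        return -1;
    ev.data.fd = p[0];
    if (epoll_ctl(epfd, EPOLL_CTL_ADD, p[0], &ev) < 0)
        return -1;

    /* put_sys_epoll_fd_inside_XXX(): the outer set holds only the epoll fd */
    pfd.fd = epfd;
    pfd.events = POLLIN;
    pfd.revents = 0;
    if (poll(&pfd, 1, 0) != 0)          /* nothing pending yet */
        return -1;

    if (write(p[1], "x", 1) != 1)
        return -1;

    /* wait_for_XXX_events(), then collect whatever the epoll fd holds */
    if (poll(&pfd, 1, 100) == 1 && (pfd.revents & POLLIN)) {
        int n = epoll_wait(epfd, evs, 8, 0);
        for (int i = 0; i < n; i++) {
            char c;
            assert(evs[i].data.fd == p[0]);
            if (read(evs[i].data.fd, &c, 1) == 1)   /* handle_fd_event() */
                handled++;
        }
    }

    close(p[0]);
    close(p[1]);
    close(epfd);
    return handled;
}
```

In a real program the outer pollfd array would also contain the descriptors that cannot be registered with epoll, and the inner epoll_wait() would use a zero timeout as here so that the outer loop never blocks twice.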


2002-10-22 17:52:22

by Alan Cox

Subject: Re: epoll (was Re: [PATCH] async poll for 2.5)

On Tue, 2002-10-22 at 18:47, Davide Libenzi wrote:
> Since the sys_epoll ( and /dev/epoll ) fd supports standard polling, you
> can mix sys_epoll handling with other methods like poll() and AIO's
> POLL function once it is ready. For example, for devices that sys_epoll
> intentionally does not support, you can use a method like :

The more important question is why you need epoll. Asynchronous I/O
completions setting a list of futexes can already be made to do the job
and is much more flexible.

2002-10-22 18:05:50

by Davide Libenzi

Subject: Re: epoll (was Re: [PATCH] async poll for 2.5)

On 22 Oct 2002, Alan Cox wrote:

> On Tue, 2002-10-22 at 18:47, Davide Libenzi wrote:
> > Since the sys_epoll ( and /dev/epoll ) fd supports standard polling, you
> > can mix sys_epoll handling with other methods like poll() and AIO's
> > POLL function once it is ready. For example, for devices that sys_epoll
> > intentionally does not support, you can use a method like :
>
> The more important question is why you need epoll. Asynchronous I/O
> completions setting a list of futexes can already be made to do the job
> and is much more flexible.

Alan, could you provide a code snippet to show how easy it is and how well
it fits a 1:N ( one task/thread , N connections ) architecture ? And
looking at Ben's presentation about benchmarks ( and for pipes ), you'll
discover that both poll() and AIO are "a little bit slower" than
sys_epoll. Anyway, I do not want anything superfluous added to the kernel
w/out reason; that's why, besides Ben's presentation, there are currently
people benchmarking existing solutions.




- Davide


2002-10-22 18:36:51

by Charlie Krasic

Subject: Re: epoll (was Re: [PATCH] async poll for 2.5)


I don't think the big picture is that complicated.

epoll is useful for programs that use the old nonblocking socket API.
It improves performance significantly for a case where poll() and
select() are deficient (large numbers of slow or idle connections).

There are a number of people, including myself, who already use epoll.
At least some of us don't think it is too complicated. I claim most
of the complication is in the nonblocking socket API, in which case
the complexity falls under the category of "the devil you know...".

The old nonblocking socket API (and hence epoll) does nothing for file
IO, and it just doesn't make sense relative to file IO. (EAGAIN,
POLLIN, POLLOUT, etc. aren't terribly useful signals from a disk
device).

So, it's a great thing that the new AIO API is forthcoming.

So maybe epoll's moment of utility is only transient. It should have
been in the kernel a long time ago. Is it too late now that AIO is
imminent?

-- Buck

Mark Mielke <[email protected]> writes:

> On Sat, Oct 19, 2002 at 09:10:52AM -0700, Charles 'Buck' Krasic wrote:
> > Mark Mielke <[email protected]> writes:
> > > They still represent an excessively complicated model that attempts to
> > > implement /dev/epoll the same way that one would implement poll()/select().
> > epoll is about fixing one aspect of an otherwise well established api.
> > That is, fixing the scalability of poll()/select() for applications
> > based on non-blocking sockets.
>
> epoll is not a poll()/select() enhancement (unless it is used in
> conjunction with poll()/select()). It is a poll()/select()
> replacement.
>
> Meaning... purposefully creating an API that is designed the way one
> would design a poll()/select() loop is purposefully limiting the benefits
> of /dev/epoll.
>
> It's like inventing a power drill to replace the common screw driver,
> but rather than plugging the power drill in, manually turning the
> drill as if it was a socket wrench for the drill bit.
>
> I find it an exercise in self defeat... except that /dev/epoll used the
> same way one would use poll()/select() happens to perform better even
> when it is crippled.
>
> mark

2002-10-22 18:31:00

by Benjamin LaHaise

Subject: Re: epoll (was Re: [PATCH] async poll for 2.5)

On Tue, Oct 22, 2002 at 11:18:20AM -0700, Davide Libenzi wrote:
> Alan, could you provide a code snippet to show how easy it is and how well
> it fits a 1:N ( one task/thread , N connections ) architecture ? And
> looking at Ben's presentation about benchmarks ( and for pipes ), you'll
> discover that both poll() and AIO are "a little bit slower" than
> sys_epoll. Anyway, I do not want anything superfluous added to the kernel
> w/out reason; that's why, besides Ben's presentation, there are currently
> people benchmarking existing solutions.

That's why I was hoping async poll would get fixed to have the same
performance characteristics as /dev/epoll. But.... :-/

-ben
--
"Do you seek knowledge in time travel?"

2002-10-22 19:16:57

by John Myers

Subject: Re: epoll (was Re: [PATCH] async poll for 2.5)



Benjamin LaHaise wrote:

>That's why I was hoping async poll would get fixed to have the same
>performance characteristics as /dev/epoll. But.... :-/
>
If you would like this to happen, it would help if you would respond to
questions asked and proposals made on the linux-aio mailing list. It's
a little difficult trying to read your mind.



2002-10-22 19:20:20

by Davide Libenzi

Subject: Re: epoll (was Re: [PATCH] async poll for 2.5)

On 22 Oct 2002, Charles 'Buck' Krasic wrote:

> So maybe epoll's moment of utility is only transient. It should have
> been in the kernel a long time ago. Is it too late now that AIO is
> imminent?

This is not my call actually. But besides comparing actual performance
between AIO and sys_epoll, one of the advantages that the patch has is
this :

 arch/i386/kernel/entry.S  |    4
 drivers/char/Makefile     |    4
 fs/Makefile               |    4
 fs/file_table.c           |    4
 fs/pipe.c                 |   36 +
 include/asm-i386/poll.h   |    1
 include/asm-i386/unistd.h |    3
 include/linux/fs.h        |    4
 include/linux/list.h      |    5
 include/linux/pipe_fs_i.h |    4
 include/linux/sys.h       |    2
 include/net/sock.h        |   10
 net/ipv4/tcp.c            |    4

That is, it has very little "intrusion" in the original code, plugging
into the existing architecture.



- Davide


2002-10-22 19:22:38

by Benjamin LaHaise

Subject: Re: epoll (was Re: [PATCH] async poll for 2.5)

On Tue, Oct 22, 2002 at 12:22:59PM -0700, John Gardiner Myers wrote:
>
>
> Benjamin LaHaise wrote:
>
> >That's why I was hoping async poll would get fixed to have the same
> >performance characteristics as /dev/epoll. But.... :-/
> >
> If you would like this to happen, it would help if you would respond to
> questions asked and proposals made on the linux-aio mailing list. It's
> a little difficult trying to read your mind.

*Which* proposals? There was enough of a discussion that I don't know
what people had decided on.

-ben
--
"Do you seek knowledge in time travel?"

2002-10-22 19:35:01

by Davide Libenzi

Subject: Re: epoll (was Re: [PATCH] async poll for 2.5)

On Tue, 22 Oct 2002, Benjamin LaHaise wrote:

> On Tue, Oct 22, 2002 at 11:18:20AM -0700, Davide Libenzi wrote:
> > Alan, could you provide a code snippet to show how easy it is and how well
> > it fits a 1:N ( one task/thread , N connections ) architecture ? And
> > looking at Ben's presentation about benchmarks ( and for pipes ), you'll
> > discover that both poll() and AIO are "a little bit slower" than
> > sys_epoll. Anyway, I do not want anything superfluous added to the kernel
> > w/out reason; that's why, besides Ben's presentation, there are currently
> > people benchmarking existing solutions.
>
> That's why I was hoping async poll would get fixed to have the same
> performance characteristics as /dev/epoll. But.... :-/

Yep, as I wrote you yesterday ( and you did not answer ), I was
trying to add support for AIO to the test HTTP server I use for
benchmark tests ( http://www.xmailserver.org/linux-patches/ephttpd-0.1.tar.gz )
but since there's no support for AIO POLL, it is impossible ( at least to
my knowledge ) to compare sys_epoll and AIO on a "real HTTP load"
networking test.




- Davide


2002-10-22 19:44:44

by John Myers

[permalink] [raw]
Subject: Re: epoll (was Re: [PATCH] async poll for 2.5)



Benjamin LaHaise wrote:

>*Which* proposals? There was enough of a discussion that I don't know
>what people had decided on.
>
Primarily the ones in my message of Tue, 15 Oct 2002 16:26:59 -0700. In
that I repeat a question I posed in my message of Tue, 01 Oct 2002
14:16:23 -0700.

There's also the IOCB_CMD_NOOP strawman of Fri, 18 Oct 2002 17:16:41 -0700.


Attachments:
smime.p7s (3.62 kB)
S/MIME Cryptographic Signature

2002-10-22 19:54:18

by Benjamin LaHaise

[permalink] [raw]
Subject: Re: epoll (was Re: [PATCH] async poll for 2.5)

On Tue, Oct 22, 2002 at 12:50:44PM -0700, John Gardiner Myers wrote:
>
>
> Benjamin LaHaise wrote:
>
> >*Which* proposals? There was enough of a discussion that I don't know
> >what people had decided on.
> >
> Primarily the ones in my message of Tue, 15 Oct 2002 16:26:59 -0700. In
> that I repeat a question I posed in my message of Tue, 01 Oct 2002
> 14:16:23 -0700.

How does it perform?

> There's also the IOCB_CMD_NOOP strawman of Fri, 18 Oct 2002 17:16:41 -0700.

That's going in unless there are any other objections to it from folks.
Part of it was that I had problems with 2.4.43-bk not working on my
test machines last week that delayed a few things.

-ben
--
"Do you seek knowledge in time travel?"

2002-10-22 20:19:29

by John Myers

[permalink] [raw]
Subject: async poll



Benjamin LaHaise wrote:

>How does it perform?
>
>
That was one of the questions, since you claimed it didn't "scale"
without being specific as to which axes that was with respect to.

If the problem is the cost of reregistration (which seems likely) then
fixing that requires extending the model upon which the aio framework is
based. If the proposed extended model is not acceptable, it would be
best to fix that before spending the time to implement it.


Attachments:
smime.p7s (3.62 kB)
S/MIME Cryptographic Signature

2002-10-22 21:56:55

by Dan Kegel

[permalink] [raw]
Subject: Re: epoll (was Re: [PATCH] async poll for 2.5)

Erich Nahum wrote:
>>There are a couple of reasons why the drop of the initial event is a waste
>>of time :
>>
>>1) The I/O write space is completely available at fd creation
>>2) For sockets it's very likely that the first packet brought something
>> more than the SYN == The I/O read space might have something for you
>>
>>I strongly believe that the concept "use the fd until EAGAIN" should be
>>applied even at creation time, w/out making exceptions to what is the
>>API's rule to follow.
>
>
> There is a third way, described in the original Banga/Mogul/Druschel
> paper, available via Dan Kegel's web site: extend the accept() call to
> return whether an event has already happened on that FD. That way you
> can service a ready FD without reading /dev/epoll or calling
> sigtimedwait, and you don't have to waste a read() call on the socket
> only to find out you got EAGAIN.
>
> Of course, this changes the accept API, which is another matter. But
> if we're talking a new API then there's no problem.

That would be the fastest way of finding out, maybe.
But I'd rather use a uniform way of notifying about readiness events.
Rather than using a new API (acceptEx :-) or a rule ("always ready initially"),
why not just deliver the initial readiness event via the usual channel?
And for ease of coding for the moment, Davide can just deliver
a 'ready for everything' event initially unconditionally.

No API changes, just a simple, uniform way of tickling the user's
"I gotta do I/O" code.
- Dan
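
The "use the fd until EAGAIN" rule discussed above can be sketched in C as
follows. This is a minimal illustration, not code from the thread: the helper
names set_nonblocking() and drain_fd() are hypothetical, and only standard
POSIX calls (fcntl, read) are assumed. Treating a freshly accepted fd as
"ready for everything" would then amount to calling drain_fd() once right
after accept(), before waiting for any event.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/* Put fd into non-blocking mode; returns 0 on success, -1 on error. */
int set_nonblocking(int fd)
{
    int flags = fcntl(fd, F_GETFL, 0);
    if (flags == -1)
        return -1;
    return fcntl(fd, F_SETFL, flags | O_NONBLOCK);
}

/*
 * Hypothetical helper: read from a non-blocking fd until the kernel
 * reports EAGAIN. Returns the total number of bytes consumed, or -1 on
 * a real error. *eof is set to 1 if the other end closed the descriptor.
 */
long drain_fd(int fd, int *eof)
{
    char buf[4096];
    long total = 0;

    assert(eof != NULL);
    *eof = 0;
    for (;;) {
        ssize_t n = read(fd, buf, sizeof buf);
        if (n > 0) {
            total += n;                 /* got data, keep going */
        } else if (n == 0) {
            *eof = 1;                   /* end of stream */
            return total;
        } else if (errno == EAGAIN || errno == EWOULDBLOCK) {
            return total;               /* nothing left for now */
        } else if (errno != EINTR) {
            return -1;                  /* genuine error */
        }
    }
}
```

A real server would hand each chunk in buf to its protocol code instead of
just counting bytes.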


2002-10-23 05:37:51

by Suparna Bhattacharya

[permalink] [raw]
Subject: Latest aio code (was Re: [PATCH] async poll for 2.5)

On Tue, Oct 22, 2002 at 04:00:22PM -0400, Benjamin LaHaise wrote:
> On Tue, Oct 22, 2002 at 12:50:44PM -0700, John Gardiner Myers wrote:
> >
> >
> > Benjamin LaHaise wrote:
> >
> > >*Which* proposals? There was enough of a discussion that I don't know
> > >what people had decided on.
> > >
> > Primarily the ones in my message of Tue, 15 Oct 2002 16:26:59 -0700. In
> > that I repeat a question I posed in my message of Tue, 01 Oct 2002
> > 14:16:23 -0700.
>
> How does it perform?
>
> > There's also the IOCB_CMD_NOOP strawman of Fri, 18 Oct 2002 17:16:41 -0700.
>
> That's going in unless there are any other objections to it from folks.
> Part of it was that I had problems with 2.4.43-bk not working on my
> test machines last week that delayed a few things.

Ben,

Is there a patch against 2.5.44 with all the latest fixes that
we can sync up with ?

Regards
Suparna

>
> -ben
> --
> "Do you seek knowledge in time travel?"
> --
> To unsubscribe, send a message with 'unsubscribe linux-aio' in
> the body to [email protected]. For more info on Linux AIO,
> see: http://www.kvack.org/aio/

2002-10-23 16:31:33

by Dan Kegel

[permalink] [raw]
Subject: Re: epoll (was Re: [PATCH] async poll for 2.5)

Davide Libenzi <[email protected]> wrote:
> On 22 Oct 2002, Charles 'Buck' Krasic wrote:
>
>> So maybe epoll's moment of utility is only transient. It should have
>> been in the kernel a long time ago. Is it too late now that AIO is
>> imminent?
>
> This is not my call actually. But beside comparing actual performance
> between AIO and sys_epoll, one of the advantages that the patch had is
> ... it makes very little "intrusion" into the original code, because it
> plugs into the existing architecture.

epoll has another benefit: it works with read() and write(). That
makes it easier to use with existing libraries like OpenSSL
without having to recode them to use aio_read() and aio_write().

Furthermore, epoll is nice because it delivers one-shot readiness change
notification (I used to think that was a drawback, but coding
nonblocking OpenSSL apps has convinced me otherwise).
I may be confused, but I suspect the async poll being proposed by
Ben only delivers absolute readiness, not changes in readiness.

I think epoll is worth having, even if Ben's AIO already handled
networking properly.
- Dan
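
The difference between "changes in readiness" and "absolute readiness" can be
illustrated with a small C sketch. It does not use the sys_epoll patch
discussed in this thread; it assumes the epoll_ctl()/epoll_wait() calls and
the EPOLLET flag documented in epoll(7) on current Linux systems. The helper
names watch_edge_triggered() and pending_events() are hypothetical.

```c
#include <assert.h>
#include <sys/epoll.h>
#include <unistd.h>

/* Register fd for edge-triggered read readiness; 0 on success, -1 on error. */
int watch_edge_triggered(int epfd, int fd)
{
    struct epoll_event ev = {0};

    assert(epfd >= 0 && fd >= 0);
    ev.events = EPOLLIN | EPOLLET;   /* report readiness changes only */
    ev.data.fd = fd;
    return epoll_ctl(epfd, EPOLL_CTL_ADD, fd, &ev);
}

/* Number of events available right now (zero timeout), or -1 on error. */
int pending_events(int epfd)
{
    struct epoll_event evs[8];
    return epoll_wait(epfd, evs, 8, 0);
}
```

With this registration, one write to an idle pipe or socket yields one event.
If the application does not read the data, a second pending_events() call
reports nothing, whereas plain poll() would keep reporting the fd as readable.
That is why the application has to read until EAGAIN after each event.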

2002-10-23 17:32:51

by Benjamin LaHaise

[permalink] [raw]
Subject: Re: epoll (was Re: [PATCH] async poll for 2.5)

On Wed, Oct 23, 2002 at 09:49:54AM -0700, Dan Kegel wrote:
> Furthermore, epoll is nice because it delivers one-shot readiness change
> notification (I used to think that was a drawback, but coding
> nonblocking OpenSSL apps has convinced me otherwise).
> I may be confused, but I suspect the async poll being proposed by
> Ben only delivers absolute readiness, not changes in readiness.
>
> I think epoll is worth having, even if Ben's AIO already handled
> networking properly.

That depends on how it compares to async read/write, which hasn't
been looked into yet. The way the pipe code worked involved walking
the page tables, which is still quite expensive for small data sizes.
With the new code, the CPU's tlb will be used, which will make a big
difference, especially for the case where only a single address space
is in use on the system.

-ben
--
"Do you seek knowledge in time travel?"

2002-10-23 17:43:20

by Charlie Krasic

[permalink] [raw]
Subject: Re: epoll (was Re: [PATCH] async poll for 2.5)


Dan Kegel <[email protected]> writes:

> Davide Libenzi <[email protected]> wrote:

> I may be confused, but I suspect the async poll being proposed by
> Ben only delivers absolute readiness, not changes in readiness.

> I think epoll is worth having, even if Ben's AIO already handled
> networking properly.

> - Dan

Can someone remind me why poll is needed in the AIO api at all?

How would it be used?

-- Buck








2002-10-23 17:59:48

by Davide Libenzi

[permalink] [raw]
Subject: Re: epoll (was Re: [PATCH] async poll for 2.5)

On 23 Oct 2002, Charles 'Buck' Krasic wrote:

>
> Dan Kegel <[email protected]> writes:
>
> > Davide Libenzi <[email protected]> wrote:
>
> > I may be confused, but I suspect the async poll being proposed by
> > Ben only delivers absolute readiness, not changes in readiness.
>
> > I think epoll is worth having, even if Ben's AIO already handled
> > networking properly.
>
> > - Dan
>
> Can someone remind me why poll is needed in the AIO api at all?
>
> How would it be used?

Maybe my understanding of AIO on Linux is limited but how would you do
async accept/connect ? Will you be using std poll/select for that, and
then you'll switch to AIO for read/write requests ?



- Davide






2002-10-23 18:26:28

by Charlie Krasic

[permalink] [raw]
Subject: Re: epoll (was Re: [PATCH] async poll for 2.5)


I see. Adding async accept/connect would seem to make more sense to me.

-- Buck

Davide Libenzi <[email protected]> writes:

> Maybe my understanding of AIO on Linux is limited but how would you do
> async accept/connect ? Will you be using std poll/select for that, and
> then you'll switch to AIO for read/write requests ?

> - Davide

2002-10-23 18:32:27

by Davide Libenzi

[permalink] [raw]
Subject: Re: epoll (was Re: [PATCH] async poll for 2.5)

On Wed, 23 Oct 2002, Benjamin LaHaise wrote:

> On Wed, Oct 23, 2002 at 09:49:54AM -0700, Dan Kegel wrote:
> > Furthermore, epoll is nice because it delivers one-shot readiness change
> > notification (I used to think that was a drawback, but coding
> > nonblocking OpenSSL apps has convinced me otherwise).
> > I may be confused, but I suspect the async poll being proposed by
> > Ben only delivers absolute readiness, not changes in readiness.
> >
> > I think epoll is worth having, even if Ben's AIO already handled
> > networking properly.
>
> That depends on how it compares to async read/write, which hasn't
> been looked into yet. The way the pipe code worked involved walking
> the page tables, which is still quite expensive for small data sizes.
> With the new code, the CPU's tlb will be used, which will make a big
> difference, especially for the case where only a single address space
> is in use on the system.

Ben, do read/write requests on sockets currently work at all ? I
would like to test AIO on networking using my test http server, and I was
thinking about using poll() for async accept and AIO for read/write. The
poll() should be pretty fast because there's only one fd in the set and
the remaining code will use AIO for read/write. Might this work currently ?



- Davide


2002-10-23 20:30:21

by John Myers

[permalink] [raw]
Subject: async poll



Davide Libenzi wrote:

>Maybe my understanding of AIO on Linux is limited but how would you do
>async accept/connect ? Will you be using std poll/select for that, and
>then you'll switch to AIO for read/write requests ?
>
If a connection is likely to be idle, one would want to use an async
read poll instead of an async read in order to avoid having to allocate
input buffers to idle connections. (What one really wants is a variant
of async read that allocates an input buffer to an fd at completion
time, not submission time).

Sometimes one wants to use a library which only has a nonblocking
interface, so when the library says WOULDBLOCK you have to do an async
write poll.

Sometimes one wants to use a kernel interface (e.g. sendfile) that does
not yet have an async equivalent. Accept/connect are in this
class--there should be nothing to prevent us from creating async
versions of accept/connect.


Attachments:
smime.p7s (3.62 kB)
S/MIME Cryptographic Signature
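
The first point above, not tying up input buffers for idle connections, can
be sketched with ordinary poll(). This is only an illustration under stated
assumptions: struct conn and service_if_ready() are hypothetical names, the
4096-byte buffer size is arbitrary, and a synchronous poll() stands in for
the async read poll being discussed.

```c
#include <assert.h>
#include <poll.h>
#include <stdlib.h>
#include <unistd.h>

/* Hypothetical per-connection state: no buffer is held while idle. */
struct conn {
    int fd;
    char *buf;      /* NULL until the fd is first reported readable */
    size_t cap;
    ssize_t len;    /* bytes read by the last service_if_ready() call */
};

/*
 * Poll one connection with a zero timeout. Only when it is readable do we
 * allocate an input buffer and read into it. Returns 1 if data was read,
 * 0 if the connection is still idle, -1 on error.
 */
int service_if_ready(struct conn *c)
{
    struct pollfd pfd;
    int n;

    assert(c != NULL);
    pfd.fd = c->fd;
    pfd.events = POLLIN;
    pfd.revents = 0;

    n = poll(&pfd, 1, 0);
    if (n < 0)
        return -1;
    if (n == 0 || !(pfd.revents & POLLIN))
        return 0;                       /* idle: still no buffer allocated */

    if (c->buf == NULL) {
        c->cap = 4096;
        c->buf = malloc(c->cap);        /* allocate at readiness time */
        if (c->buf == NULL)
            return -1;
    }
    c->len = read(c->fd, c->buf, c->cap);
    return c->len < 0 ? -1 : 1;
}
```

With thousands of idle connections, only the ones that have actually
received data hold a buffer.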

2002-10-23 20:39:22

by Dan Kegel

[permalink] [raw]
Subject: Re: async poll

John Myers wrote:
> Davide Libenzi wrote:
>
>> Maybe my understanding of AIO on Linux is limited but how would you do
>> async accept/connect ? Will you be using std poll/select for that, and
>> then you'll switch to AIO for read/write requests ?
>>
> If a connection is likely to be idle, one would want to use an async
> read poll instead of an async read in order to avoid having to allocate
> input buffers to idle connections. (What one really wants is a variant
> of async read that allocates an input buffer to an fd at completion
> time, not submission time).

In that situation, why not just add the fd to an epoll, and have the
epoll deliver events through Ben's interface? That way you'd get
notified of changes in readability, and wouldn't have to issue
the read poll call over and over.

- Dan

2002-10-23 21:11:22

by Charlie Krasic

[permalink] [raw]
Subject: Re: async poll



[email protected] (John Myers) writes:

> If a connection is likely to be idle, one would want to use an async
> read poll instead of an async read in order to avoid having to
> allocate input buffers to idle connections. (What one really wants
> is a variant of async read that allocates an input buffer to an fd
> at completion time, not submission time).

Right. This does make sense for a server with thousands of idle
connections. You could have lots of unused memory in that case.

> Sometimes one wants to use a library which only has a nonblocking
> interface, so when the library says WOULDBLOCK you have to do an
> async write poll.

Right. I can see this too. This might be thorny for epoll though,
since it's entirely conceivable that such libraries would be expecting
level style poll semantics.

> Sometimes one wants to use a kernel interface (e.g. sendfile) that
> does not yet have an async equivalent. Accept/connect are in this
> class--there should be nothing to prevent us from creating async
> versions of accept/connect.

Right. However in this case, I think using poll is a stop gap
solution.

It would be better to give accept, connect, and sendfile (and
sendfile64) native status in AIO. Even if the implementations are not
going to be ready for a while, I think their spots in the API should
be reserved now.

That said, the first two points above convince me that there are valid
reasons why poll is also needed in AIO.

-- Buck

2002-10-23 21:15:23

by Benjamin LaHaise

[permalink] [raw]
Subject: Re: epoll (was Re: [PATCH] async poll for 2.5)

On Wed, Oct 23, 2002 at 11:47:33AM -0700, Davide Libenzi wrote:
> Ben, do read/write requests on sockets currently work at all ? I
> would like to test AIO on networking using my test http server, and I was
> thinking about using poll() for async accept and AIO for read/write. The
> poll() should be pretty fast because there's only one fd in the set and
> the remaining code will use AIO for read/write. Might this work currently ?

The socket async read/write code is not yet in the kernel.

-ben
--
"Do you seek knowledge in time travel?"

2002-10-23 21:17:53

by John Myers

[permalink] [raw]
Subject: Re: async poll



Dan Kegel wrote:

> In that situation, why not just add the fd to an epoll, and have the
> epoll deliver events through Ben's interface?

Because you might need to use the aio_data facility of the iocb
interface. Because you might want to keep the kernel from
simultaneously delivering two events for the same fd to two different
threads.


Attachments:
smime.p7s (3.62 kB)
S/MIME Cryptographic Signature

2002-10-23 21:20:26

by Davide Libenzi

[permalink] [raw]
Subject: Re: epoll (was Re: [PATCH] async poll for 2.5)

On Wed, 23 Oct 2002, Benjamin LaHaise wrote:

> On Wed, Oct 23, 2002 at 11:47:33AM -0700, Davide Libenzi wrote:
> > Ben, do read/write requests on sockets currently work at all ? I
> > would like to test AIO on networking using my test http server, and I was
> > thinking about using poll() for async accept and AIO for read/write. The
> > poll() should be pretty fast because there's only one fd in the set and
> > the remaining code will use AIO for read/write. Might this work currently ?
>
> The socket async read/write code is not yet in the kernel.

Ok, this pretty much stops every attempt to test/compare AIO with sys_epoll ...
ETA ?



- Davide


2002-10-23 21:33:36

by John Myers

[permalink] [raw]
Subject: Re: epoll (was Re: [PATCH] async poll for 2.5)



Davide Libenzi wrote:

>Ok, this pretty much stops every attempt to test/compare AIO with sys_epoll ...
>
It would be useful to compare async poll (the patch that started this
thread) with sys_epoll. sys_epoll is expected to perform better since
it ignores multithreading issues and amortizes registration across
multiple events, but it would be interesting to know by how much.


Attachments:
smime.p7s (3.62 kB)
S/MIME Cryptographic Signature

2002-10-23 21:36:14

by Davide Libenzi

[permalink] [raw]
Subject: Re: async poll

On Wed, 23 Oct 2002, John Gardiner Myers wrote:

> > In that situation, why not just add the fd to an epoll, and have the
> > epoll deliver events through Ben's interface?
>
> Because you might need to use the aio_data facility of the iocb
> interface. Because you might want to keep the kernel from
> simultaneously delivering two events for the same fd to two different
> threads.

Why would you want to have a single fd simultaneously handled by two
different threads with all the locking issues that would arise ? I can
understand loving threads but this seems to be too much :)



- Davide



2002-10-23 21:39:21

by Davide Libenzi

[permalink] [raw]
Subject: Re: epoll (was Re: [PATCH] async poll for 2.5)

On Wed, 23 Oct 2002, John Gardiner Myers wrote:

>
>
> Davide Libenzi wrote:
>
> >Ok, this pretty much stops every attempt to test/compare AIO with sys_epoll ...
> >
> It would be useful to compare async poll (the patch that started this
> thread) with sys_epoll. sys_epoll is expected to perform better since
> it ignores multithreading issues and amortizes registration across
> multiple events, but it would be interesting to know by how much.

I think this can be done. David Stevens ( IBM ) has already offered me the
patch to test.



- Davide


2002-10-23 21:45:03

by bert hubert

[permalink] [raw]
Subject: Re: async poll

On Wed, Oct 23, 2002 at 02:51:21PM -0700, Davide Libenzi wrote:

> Why would you want to have a single fd simultaneously handled by two
> different threads with all the locking issues that would arise ? I can
> understand loving threads but this seems to be too much :)

We in fact tried to do this, and for good reason. Our nameserver software gets
great benefit when two processes listen to the same socket on an SMP system.
In some cases, this means 70% more packets/second, which is close to the
theoretical maximum benefit.

We would heavily prefer to have two *threads* listening to the same socket
instead of two processes. The two processes do not share caching information
now because that expects to live in the same memory.

Right now, we can't do that because of very weird locking behaviour, which
is documented here: http://www.mysql.com/doc/en/Linux.html and leads to
250.000 context switches/second and dismal performance.

I expect NPTL to fix this situation and I would just love to be able to call
select() or poll() or recvfrom() on the same fd(s) from different threads.

Regards,

bert hubert

--
http://www.PowerDNS.com Versatile DNS Software & Services
http://lartc.org Linux Advanced Routing & Traffic Control HOWTO

2002-10-23 21:48:25

by John Myers

[permalink] [raw]
Subject: Re: async poll



Davide Libenzi wrote:

>Why would you want to have a single fd simultaneously handled by two
>different threads with all the locking issues that would arise ?
>
You would not want this to happen. Thus you would want the poll
facility to somehow prevent returning event N+1 until after the thread
that got event N has somehow indicated that it has finished handling the
event.



Attachments:
smime.p7s (3.62 kB)
S/MIME Cryptographic Signature
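
The behaviour asked for above (no event N+1 for an fd until the handler of
event N says it is done) can be sketched with the EPOLLONESHOT flag that
epoll(7) documents on current Linux systems. That flag is not part of the
interface debated in this thread, so treat this as an assumption-laden
illustration; the helper names watch_oneshot(), rearm() and poll_once() are
hypothetical.

```c
#include <assert.h>
#include <sys/epoll.h>
#include <unistd.h>

/* Register fd so that it is disarmed after delivering a single event. */
int watch_oneshot(int epfd, int fd)
{
    struct epoll_event ev = {0};

    assert(epfd >= 0 && fd >= 0);
    ev.events = EPOLLIN | EPOLLONESHOT;
    ev.data.fd = fd;
    return epoll_ctl(epfd, EPOLL_CTL_ADD, fd, &ev);
}

/* Called by the handling thread once it has finished with the fd. */
int rearm(int epfd, int fd)
{
    struct epoll_event ev = {0};

    ev.events = EPOLLIN | EPOLLONESHOT;
    ev.data.fd = fd;
    return epoll_ctl(epfd, EPOLL_CTL_MOD, fd, &ev);
}

/* Number of events available right now (zero timeout), or -1 on error. */
int poll_once(int epfd)
{
    struct epoll_event evs[8];
    return epoll_wait(epfd, evs, 8, 0);
}
```

Between the first event and the rearm() call, no other thread waiting on the
same epoll fd is handed an event for that descriptor.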

2002-10-23 21:54:58

by Davide Libenzi

[permalink] [raw]
Subject: Re: async poll

On Wed, 23 Oct 2002, bert hubert wrote:

> On Wed, Oct 23, 2002 at 02:51:21PM -0700, Davide Libenzi wrote:
>
> > Why would you want to have a single fd simultaneously handled by two
> > different threads with all the locking issues that would arise ? I can
> > understand loving threads but this seems to be too much :)
>
> We in fact tried to do this, and for good reason. Our nameserver software gets
> great benefit when two processes listen to the same socket on an SMP system.
> In some cases, this means 70% more packets/second, which is close to the
> theoretical maximum benefit.
>
> We would heavily prefer to have two *threads* listening to the same socket
> instead of two processes. The two processes do not share caching information
> now because that expects to live in the same memory.
>
> Right now, we can't do that because of very weird locking behaviour, which
> is documented here: http://www.mysql.com/doc/en/Linux.html and leads to
> 250.000 context switches/second and dismal performance.
>
> I expect NPTL to fix this situation and I would just love to be able to call
> select() or poll() or recvfrom() on the same fd(s) from different threads.

I feel this topic is going somewhere else ;) ... but if you need your
processes to share memory, and hence you would like them to be threads,
you're very likely going to have some form of synchronization mechanism to
access the shared area, aren't you? So, isn't it better to have two
separate tasks, that can freely access the memory w/out locks instead of N
threads?



- Davide



2002-10-23 22:06:21

by Dan Kegel

[permalink] [raw]
Subject: Re: async poll

John Gardiner Myers wrote:
>
> Dan Kegel wrote:
>
>> In that situation, why not just add the fd to an epoll, and have the
>> epoll deliver events through Ben's interface?
>
>
> Because you might need to use the aio_data facility of the iocb
> interface.

Presumably epoll_add could be enhanced to let user specify a user data
word.

> Because you might want to keep the kernel from
> simultaneously delivering two events for the same fd to two different
> threads.

You might want to use aio_write() for writes, and read() for reads,
in which case you could tell epoll you're not interested in
write readiness events. Then there'd be no double notification
for reads or writes on same fd.
It's a bit contrived, but I can imagine it being useful.

- Dan



2002-10-23 22:07:15

by Davide Libenzi

[permalink] [raw]
Subject: Re: async poll

On Wed, 23 Oct 2002, John Gardiner Myers wrote:

> Davide Libenzi wrote:
>
> >Why would you want to have a single fd simultaneously handled by two
> >different threads with all the locking issues that would arise ?
> >
> You would not want this to happen. Thus you would want the poll
> facility to somehow prevent returning event N+1 until after the thread
> that got event N has somehow indicated that it has finished handling the
> event.

We're again looping talking about threads and fd being bounced between
threads. It seems that we've very different opinions about the use of
threads and how server applications should be designed. IMHO if you're
thinking of bouncing fds among threads for their handling you're doing
something somehow wrong, but this is just my opinion ...



- Davide




2002-10-23 22:16:26

by Davide Libenzi

[permalink] [raw]
Subject: Re: async poll

On Wed, 23 Oct 2002, Dan Kegel wrote:

> John Gardiner Myers wrote:
> >
> > Dan Kegel wrote:
> >
> >> In that situation, why not just add the fd to an epoll, and have the
> >> epoll deliver events through Ben's interface?
> >
> >
> > Because you might need to use the aio_data facility of the iocb
> > interface.
>
> Presumably epoll_add could be enhanced to let user specify a user data
> word.

It'll take 2 minutes to do such a thing. Actually the pollfd struct
contains the "events" field that is wasted when returning events and it
could be used for something more useful.



- Davide
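
The "user data word" idea can be sketched with the data member of
struct epoll_event as documented in epoll(7) on current Linux systems. The
epoll_add call named above is not used, since its final signature is not
given in this thread. struct session, watch_with_cookie() and next_ready()
are hypothetical names introduced for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/epoll.h>
#include <unistd.h>

/* Hypothetical application state attached to each watched descriptor. */
struct session {
    int fd;
    int id;
};

/* Register s->fd and ask the kernel to hand back `s` with every event. */
int watch_with_cookie(int epfd, struct session *s)
{
    struct epoll_event ev = {0};

    assert(s != NULL);
    ev.events = EPOLLIN;
    ev.data.ptr = s;            /* the user data word */
    return epoll_ctl(epfd, EPOLL_CTL_ADD, s->fd, &ev);
}

/* Return the session of one ready descriptor, or NULL if none is ready. */
struct session *next_ready(int epfd)
{
    struct epoll_event ev;
    int n = epoll_wait(epfd, &ev, 1, 0);

    return n == 1 ? (struct session *)ev.data.ptr : NULL;
}
```

The event loop then reaches its per-connection state directly from the
returned event, without a separate fd-to-state lookup table.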


2002-10-23 22:23:44

by John Myers

[permalink] [raw]
Subject: Re: async poll



Davide Libenzi wrote:

>We're again looping talking about threads and fd being bounced between
>threads. It seems that we've very different opinions about the use of
>threads and how server applications should be designed. IMHO if you're
>thinking of bouncing fds among threads for their handling you're doing
>something somehow wrong, but this is just my opinion ...
>
Again, it comes down to whether or not one has the luxury of being able
to (re)write one's application and supporting libraries from scratch.
If one can ensure that the code for handling an event will never block
for any significant amount of time, then single-threaded process-per-CPU
will most likely perform best. If, on the other hand, the code for
handling an event can occasionally block, then one needs a thread pool
in order to have reasonable latency.

A thread pool based server that is released will trivially outperform a
single threaded server that needs a few more years development to
convert all the blocking calls to use the event subsystem.


Attachments:
smime.p7s (3.62 kB)
S/MIME Cryptographic Signature

2002-10-23 22:38:28

by Davide Libenzi

[permalink] [raw]
Subject: Re: async poll

On Wed, 23 Oct 2002, Davide Libenzi wrote:

> It'll take 2 minutes to do such a thing. Actually the pollfd struct
> contains the "events" field that is wasted when returning events and it
> could be used for something more useful.

Also, I was just wondering if this might be useful :

asmlinkage int sys_epoll_wait(int epfd, int minevents, struct pollfd **events, int timeout);

Where "minevents" represents the minimum number of events returned by
sys_epoll ...



- Davide


2002-10-23 22:35:43

by Davide Libenzi

[permalink] [raw]
Subject: Re: async poll

On Wed, 23 Oct 2002, John Gardiner Myers wrote:

> Again, it comes down to whether or not one has the luxury of being able
> to (re)write one's application and supporting libraries from scratch.
> If one can ensure that the code for handling an event will never block
> for any significant amount of time, then single-threaded process-per-CPU
> will most likely perform best. If, on the other hand, the code for
> handling an event can occasionally block, then one needs a thread pool
> in order to have reasonable latency.
>
> A thread pool based server that is released will trivially outperform a
> single threaded server that needs a few more years development to
> convert all the blocking calls to use the event subsystem.

I beg your pardon, but what would an application possibly be waiting for ?
Couldn't it be waiting for something somehow identifiable as a file ? So,
supposing that an interface like sys_epoll ( or AIO, or whatever )
delivers you events for all the file descriptors your application is
waiting for, why would you need threads ? In fact, I personally find that
coroutines make the threaded->single-task transition very easy. Your virtual
threads share everything by default w/out having a single lock inside
your application.



- Davide



2002-10-24 07:26:33

by Eduardo Pérez

[permalink] [raw]
Subject: Re: async poll

On 2002-10-23 15:50:43 -0700, Davide Libenzi wrote:
> On Wed, 23 Oct 2002, John Gardiner Myers wrote:
> > Again, it comes down to whether or not one has the luxury of being able
> > to (re)write one's application and supporting libraries from scratch.
> > If one can ensure that the code for handling an event will never block
> > for any significant amount of time, then single-threaded process-per-CPU
> > will most likely perform best. If, on the other hand, the code for
> > handling an event can occasionally block, then one needs a thread pool
> > in order to have reasonable latency.
> >
> > A thread pool based server that is released will trivially outperform a
> > single threaded server that needs a few more years development to
> > convert all the blocking calls to use the event subsystem.
>
> I beg your pardon, but what would an application possibly be waiting for ?
> Couldn't it be waiting for something somehow identifiable as a file ? So,
> supposing that an interface like sys_epoll ( or AIO, or whatever )
> delivers you events for all the file descriptors your application is
> waiting for, why would you need threads ? In fact, I personally find that
> coroutines make the threaded->single-task transition very easy. Your virtual
> threads share everything by default w/out having a single lock inside
> your application.

The only uses of threads in a full aio application are task independence
(or interactivity) and process context separation.

Example from GUI side:
Suppose your web (http) client is fully ported to aio (thus only one
thread). If you have two windows and one window receives a big
complicated html page that needs much CPU time to render, this window
can block the other one. If you have a thread for each window, once the
html parser has used up its timeslice, the other window can continue
parsing or displaying its (tiny html) page.
(In fact you should use two (or more) threads per window, as html parsing
shouldn't block widget redrawing (like menus and toolbars))

2002-10-24 14:59:42

by Charlie Krasic

[permalink] [raw]
Subject: Re: async poll

Eduardo Pérez <[email protected]> writes:

> The only uses of threads in a full aio application are task
> independence (or interactivity) and process context separation

> Example from GUI side: Suppose your web (http) client is fully
> ported to aio (thus only one thread), if you have two windows and
> one window receives a big complicated html page that needs much CPU
> time to render, this window can block the other one. If you have a
> thread for each window, once the html parser has used up its
> timeslice, the other window can continue parsing or displaying its
> (tiny html) page. (In fact you should use two (or more) threads per
> window, as html parsing shouldn't block widget redrawing (like menus
> and toolbars))

It's not strictly necessary to use threads here. At least not with
gtk+. I don't know whether other toolkits are the same.

Anyway with gtk+, you can install "idle callbacks". These are
functions to be called whenever the GUI code has nothing to do. If
you can transform your html parser(s) into idle function(s), then it
won't necessarily disrupt the GUI. You just have to make sure that
the parser yields periodically (returns to gtk+). Since parsers are
loop oriented, this shouldn't be too hard, you just make each call to
the function do 1 iteration of the top level loop.

My traffic generator, mxtraf, works this way. It has a software
oscilloscope (gscope) that graphically displays signals in real time.
At the same time it does lots of IO work to generate synthetic network
traffic. mxtraf is single threaded.

-- Buck
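
The idle-callback approach described above can be sketched in plain C,
without any toolkit. This is an illustration only: struct parser and
parser_idle_step() are hypothetical, and counting '<' characters stands in
for real html parsing. The return convention (non-zero means "call me
again") mirrors the one GLib uses for functions installed with g_idle_add().

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical parser state kept between idle calls. */
struct parser {
    const char *input;
    size_t pos;
    size_t len;
    unsigned tags;      /* number of '<' characters seen so far */
};

/*
 * One slice of work, shaped like an idle callback: process at most
 * `budget` bytes, then return 1 if more input remains (call again)
 * or 0 when parsing is finished.
 */
int parser_idle_step(struct parser *p, size_t budget)
{
    size_t i;

    assert(p != NULL && budget > 0);
    for (i = 0; i < budget && p->pos < p->len; i++) {
        if (p->input[p->pos] == '<')
            p->tags++;
        p->pos++;
    }
    return p->pos < p->len;
}
```

Because each call returns after a bounded amount of work, the main loop gets
a chance to redraw widgets or service I/O between slices, which is what
keeps a single-threaded program responsive.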
