2000-10-26 16:14:31

by Dan Kegel

[permalink] [raw]
Subject: Re: Linux's implementation of poll() not scalable?

"Eric W. Biederman" wrote:
>
> Dan Kegel <[email protected]> writes:
> > It's harder to write correct programs that use edge-triggered events.
>
> Huh? The race between when an event is reported, and when you take action
> on it effectively means all events are edge triggered.

Nope. With any of these interfaces, whether level or edge, the app must
treat all the events as hints, and be prepared for them to be wrong.
That's why code that uses poll() tends to use nonblocking sockets.
No matter what you do, you can't get away from that. Consider
accepting a connection. Poll or whatever tells you a listening socket
is ready for you to call accept(). Before you do, the other end sends
an RST. Consequence: app must be prepared for a readiness event to be wrong.
cf. http://boudicca.tux.org/hypermail/linux-kernel/2000week44/0616.html

This is orthogonal to the question of whether edge or level triggering
is easier to write code for, I think.

> So making the interface clearly edge triggered seems to be a win for
> correctness.

With level-triggered interfaces like poll(), your chances
of writing a correctly functioning program are higher because you
can throw events away (on purpose or accidentally) with no consequences;
the next time around the loop, poll() will happily tell you the current
state of all the fd's.

With edge-triggered interfaces, the programmer must be much more careful
to avoid ever dropping a single event; if he does, he's screwed.
This gives rise to complicated code in apps to remember the current
state of fd's in those cases where the programmer has to drop events.
And there are many cases like that; see
http://boudicca.tux.org/hypermail/linux-kernel/2000week44/0529.html and
http://boudicca.tux.org/hypermail/linux-kernel/2000week44/0592.html
for examples.

Better to let apps ask the kernel for the current state of each socket if
they want, is what I say. This reduces the amount of code and effort needed
to write many apps, *and makes migrating legacy apps to high-performance
interfaces easier*.
For new apps that are willing to maintain the state
themselves, by all means, provide an edge-oriented interface, too.
It's probably better if your code is manly enough to handle it.
- Dan


2000-10-26 16:47:49

by Linus Torvalds

[permalink] [raw]
Subject: Re: Linux's implementation of poll() not scalable?



On Thu, 26 Oct 2000, Dan Kegel wrote:
>
> With level-triggered interfaces like poll(), your chances
> of writing a correctly functioning program are higher because you
> can throw events away (on purpose or accidentally) with no consequences;
> the next time around the loop, poll() will happily tell you the current
> state of all the fd's.

Agreed.

However, we also need to remember what got us to this discussion in the
first place. One of the reasons why poll() is such a piggy interface is
exactly because it tries to be "nice" to the programmer.

I'd much rather have an event interface that is documented to be edge-
triggered and is really _lightweight_, than have another interface that
starts out with some piggy features.

I do understand that level to some degree is "nicer", but everybody pretty
much agrees that apart from requireing more care, edge-triggered certainly
does everything the level-triggered interfaces do.

For example, if you want to only partially handle an event (one of the
more valid arguments I've seen, although I do not agree with it actually
being all that common or important thing to do), the edge-triggered
interface _does_ allow for that. It's fairly easy, in fact: you just
re-bind the thing.

(My suggested "bind()" interface would be to just always add a newly bound
fd to the event queue, but I would not object to a "one-time test for
active" kind of "bind()" interface either - which would basically allow
for a poll() interface - and the existing Linux internal "poll()"
implementation actually already allows for exactly this so it wouldn't
even add any complexity).

> With edge-triggered interfaces, the programmer must be much more careful
> to avoid ever dropping a single event; if he does, he's screwed.
> This gives rise to complicated code in apps to remember the current
> state of fd's in those cases where the programmer has to drop events.

No, the "re-bind()" approach works very simply, and means that the
overhead of testing whether the event is still active is not a generic
thing that _always_ has to be done, but something where the application
can basically give the kernel the information that "this time we're
leaving the event possibly half-done, please re-test now".

Advantage: performance again. The common case doesn't take the performance
hit.

Linus

2000-10-26 19:55:28

by jg

[permalink] [raw]
Subject: Re: Linux's implementation of poll() not scalable?


Note that there is another aspect to the efficiency / performance of the
select/poll style of interfaces not immediately obvious, but which occurs
as a result of how some (streaming/batching) protocols work.

An X server does not call select all that often (probably one of the two items most frequently used that care;
though I believe getting the Apache case right is more important).

X is such a streaming protocol: it is a feature that I don't have to do
extra reads or system calls to deal with more data arriving from a client.
An X server doesn't want one event generated for each incoming TCP segment:
it merely needs to know there is data available on a file descriptor as
a binary condition. I really don't want to have to do one operation per
segment; this is less efficient than the current situation.

Similarly, it is a feature that with one call I can find out that there
is work to do on multiple file descriptors.

In short, the X server does a select, and then loops across all the file
descriptors with work to do before doing another select: the system call
overhead gets amortized across multiple clients and buffers received from
the client. As the server gets busier, it is more and more likely
that there is more than one client with work to do, and/or multiple
TCP segments have arrived to process (in the common single client
is busy case). So we make the system call less and less often
as a fraction of work done.

This has the happy consequence that the select caused overhead DROPS as
a fraction of total work as the X server gets busier, and X is most efficient
at the point in time you care the most: when you have the most work to
do. The system call is returning more information each time it is called,
and some of that information is aggregated as well (additional data arriving).
It doesn't practically matter how efficient the X server is when
you aren't busy, after all.

This aggregation property is therefore important, and there needs to be
some way to achieve this, IMHO.

Web servers often have similar behavior, though since most current
HTTP clients don't implement streaming behavior, the benefit is currently
much lower (would that HTTP clients start driving HTTP servers the
way the HTTP/1.1 protocol allows... Sigh...). Right now, scaling
to large numbers of descriptors is most urgent for big web servers.

So I want an interface in which I can get as many events as possible
at once, and one in which the events themselves can have appropriate
aggregation behavior. It isn't quite clear to me if the proposed interface
would have this property.

As I said in early talks about X: "X is an exercise in avoiding system
calls"....

- Jim Gettys
--
Jim Gettys
Technology and Corporate Development
Compaq Computer Corporation
[email protected]

2000-10-26 21:43:56

by Dan Kegel

[permalink] [raw]
Subject: Re: Linux's implementation of poll() not scalable?

Jim Gettys wrote:
> So I want an interface in which I can get as many events as possible
> at once, and one in which the events themselves can have appropriate
> aggregation behavior. It isn't quite clear to me if the proposed interface
> would have this property.

I believe get_event, /dev/poll, and kqueue all share this property.
e.g. none of them will present multiple POLLIN events per fd per call.
Is that what you meant?

- Dan

2000-10-27 00:47:42

by Dan Kegel

[permalink] [raw]
Subject: Re: Linux's implementation of poll() not scalable?

Linus Torvalds wrote:
> However, we also need to remember what got us to this discussion in the
> first place. One of the reasons why poll() is such a piggy interface is
> exactly because it tries to be "nice" to the programmer.

poll() is a piggy interface because it is O(n) in polled file
descriptors, not because of its level-triggered nature.

> I'd much rather have an event interface that is documented to be edge-
> triggered and is really _lightweight_, than have another interface that
> starts out with some piggy features.

Agreed (except for that 'edge-triggered' part), but I don't think
'level-triggered' implies piggy. I haven't benchmarked whether
kqueue() slows down the networking layer of FreeBSD yet; do you
suspect maintaining the level-triggered structures actually is
a bottleneck for them?

> I do understand that level to some degree is "nicer", but everybody pretty
> much agrees that apart from requireing more care, edge-triggered certainly
> does everything the level-triggered interfaces do.

Agreed.

> For example, if you want to only partially handle an event (one of the
> more valid arguments I've seen, although I do not agree with it actually
> being all that common or important thing to do), the edge-triggered
> interface _does_ allow for that. It's fairly easy, in fact: you just
> re-bind the thing.
>
> (My suggested "bind()" interface would be to just always add a newly bound
> fd to the event queue, but I would not object to a "one-time test for
> active" kind of "bind()" interface either - which would basically allow
> for a poll() interface - and the existing Linux internal "poll()"
> implementation actually already allows for exactly this so it wouldn't
> even add any complexity).
> ... the "re-bind()" approach works very simply, and means that the
> overhead of testing whether the event is still active is not a generic
> thing that _always_ has to be done, but something where the application
> can basically give the kernel the information that "this time we're
> leaving the event possibly half-done, please re-test now".

Hmm. I don't like the extra system call, though. Any chance you'd be
willing to make get_events() take a vector of bind requests, so we can
avoid the system call overhead of re-binding? (Or is that too close
to kqueue for you?)

And are you sure apps will always know whether they need to rebind?
Sometimes you're faced with a protocol stack which may or may not
read the requests fully, and which you aren't allowed to change.
It'd be nice to still have a high-performance interface that can deal with
that situation.
- Dan

2000-10-27 13:53:34

by Chris Swiedler

[permalink] [raw]
Subject: RE: Linux's implementation of poll() not scalable?

> It doesn't practically matter how efficient the X server is when
> you aren't busy, after all.

A simple polling scheme (i.e. not using poll() or select(), just looping
through all fd's trying nonblocking reads) is perfectly efficient when the
server is 100% busy, and perfectly inefficient when there is nothing to do.
I'm not saying that your statements are wrong--in your example, X is calling
select() which is not wasting as much time as a hard-polling loop--but it's
wrong to say that high-load efficiency is the primary concern. I would be
horrified if X took a signifigant portion of the CPU time when many clients
were connected, but none were actually doing anything.

chris