LinuxLists.cc - [patch, rfc] lt-epoll ( level triggered epoll ) ...

2003-03-10 19:56:09

Subject: [patch, rfc] lt-epoll ( level triggered epoll ) ...

During the last three months I spent considerable amount of time explainig
developers how edge triggered APIs works, and how to use epoll inside
their apps. It's time for me to face the reality, that is that :

1) developers not quite understand ET APIs
2) most existing apps are written using LT APIs

To briefly explain the difference between an Edge Triggered and an Level
Triggered interface, suppose this sequence :

1) Pipe writer writes 2Kb of data on the write side
2) Pipe reader read 1Kb
3) Pipe reader calls epoll_wait()

With an ET API, since the I/O space is not exhausted, the application will
hang on 3) because to trasition 0->1 won't be possible. With an LT API the
application will receive the EPOLLIN event, like if it was using poll().
Thanks to Niels and Marius to have convinced me to try to implement epoll
as LT API during the last weekend. This is the patch that implements the
epoll interface as LT API, against the 2.5.64 kernel. Because of the way
the original design was architected, it really took a few bits to be
changed. Tests on my machine ( UP ) looks fine and performance are aligned
to the ET epoll ( on UP ). IMHO there are two major values in adoptiong an
LT epoll :

1) Developers uderstand it better than ET APIs and will very likely use it
in a more widespread way
2) Existing apps using poll/select can easily be ported usinf LT epoll

The LT epoll is by all means the fastest poll available and can be used
wherever poll can be used. To test it I also ported thttpd to LT
epoll and, so far, it didn't puke on my face. Niels and Marius also wrote
a nice event library :

http://monkey.org/~provos/libevent/

that uses LT epoll, as long as poll and select. The usage pattern of an LT
epoll is slightly different from the ET epoll. Where with the ET epoll you
very likely registered the EPOLLIN|EPOLLOUT during the insertion of the fd
inside the interface, and you left the edge behaviour of the interface
itself to filter events, with LT epoll you simply use epoll_ctl(EPOLL_CTL_MOD)
to switch between EPOLLIN and EPOLLOUT. In front of this considerations we
have three options that I can think :

1) We leave epoll as is ( ET )
2) We apply the patch that will make epoll LT
3) We add a parameter to epoll_create() to fix the interface behaviour at
creation time ( small change on the current patch )

With 2) and 3) there are also man pages to be reviewed to be posted to
Andries.
Comments ?

- Davide

fs/eventpoll.c | 216 +++++++++++++++++++++++++++++++---------------
include/linux/eventpoll.h | 2
2 files changed, 150 insertions, 68 deletions

diff -Nru linux-2.5.64.vanilla/fs/eventpoll.c linux-2.5.64.epoll/fs/eventpoll.c
--- linux-2.5.64.vanilla/fs/eventpoll.c Tue Mar 4 19:29:16 2003
+++ linux-2.5.64.epoll/fs/eventpoll.c Sun Mar 9 13:59:00 2003
@@ -1,6 +1,6 @@
/*
* fs/eventpoll.c ( Efficent event polling implementation )
- * Copyright (C) 2001,...,2002 Davide Libenzi
+ * Copyright (C) 2001,...,2003 Davide Libenzi
*
* This program is free software; you can redistribute it and/or modify
* it under the terms of the GNU General Public License as published by
@@ -117,11 +117,6 @@
*/
#define EP_MAX_BUF_EVENTS 32

-/*
- * Used to optimize ready items collection by reducing the irqlock/irqunlock
- * switching rate. This is kept in stack too, so do not go wild with this number.
- */
-#define EP_MAX_COLLECT_ITEMS 64

/*
@@ -223,6 +218,15 @@

/* List header used to link this item to the "struct file" items list */
struct list_head fllink;
+
+ /* List header used to link the item to the transfer list */
+ struct list_head txlink;
+
+ /*
+ * This is used during the collection/transfer of events to userspace
+ * to pin items empty events set.
+ */
+ unsigned int revents;
};

/* Wrapper struct used by poll queueing */
@@ -256,9 +260,10 @@
static int ep_poll_callback(wait_queue_t *wait, unsigned mode, int sync);
static int ep_eventpoll_close(struct inode *inode, struct file *file);
static unsigned int ep_eventpoll_poll(struct file *file, poll_table *wait);
-static int ep_collect_ready_items(struct eventpoll *ep, struct epitem **aepi, int maxepi);
-static int ep_send_events(struct eventpoll *ep, struct epitem **aepi, int nepi,
+static int ep_collect_ready_items(struct eventpoll *ep, struct list_head *txlist, int maxevents);
+static int ep_send_events(struct eventpoll *ep, struct list_head *txlist,
struct epoll_event *events);
+static void ep_reinject_items(struct eventpoll *ep, struct list_head *txlist);
static int ep_events_transfer(struct eventpoll *ep, struct epoll_event *events, int maxevents);
static int ep_poll(struct eventpoll *ep, struct epoll_event *events, int maxevents,
long timeout);
@@ -340,13 +345,14 @@
unsigned long flags;
task_t *this_task = current;
struct list_head *lsthead = &psw->wake_task_list, *lnk;
+ struct wake_task_node *tncur;
struct wake_task_node tnode;

spin_lock_irqsave(&psw->lock, flags);

/* Try to see if the current task is already inside this wakeup call */
list_for_each(lnk, lsthead) {
- struct wake_task_node *tncur = list_entry(lnk, struct wake_task_node, llink);
+ tncur = list_entry(lnk, struct wake_task_node, llink);

if (tncur->task == this_task) {
if (tncur->wq == wq || ++wake_nests > EP_MAX_POLLWAKE_NESTS) {
@@ -830,6 +836,7 @@
{
unsigned int i, hsize;
struct list_head *lsthead, *lnk;
+ struct epitem *epi;

/*
* We need to lock this because we could be hit by
@@ -844,7 +851,7 @@
lsthead = ep_hash_entry(ep, i);

list_for_each(lnk, lsthead) {
- struct epitem *epi = list_entry(lnk, struct epitem, llink);
+ epi = list_entry(lnk, struct epitem, llink);

ep_unregister_pollwait(ep, epi);
}
@@ -860,7 +867,7 @@
lsthead = ep_hash_entry(ep, i);

while (!list_empty(lsthead)) {
- struct epitem *epi = list_entry(lsthead->next, struct epitem, llink);
+ epi = list_entry(lsthead->next, struct epitem, llink);

ep_remove(ep, epi);
}
@@ -971,6 +978,7 @@
INIT_LIST_HEAD(&epi->llink);
INIT_LIST_HEAD(&epi->rdllink);
INIT_LIST_HEAD(&epi->fllink);
+ INIT_LIST_HEAD(&epi->txlink);
INIT_LIST_HEAD(&epi->pwqlist);
epi->ep = ep;
epi->file = tfile;
@@ -1077,16 +1085,28 @@
/* Copy the data member from inside the lock */
epi->event.data = event->data;

- /* If the file is already "ready" we drop it inside the ready list */
- if ((revents & event->events) && EP_IS_LINKED(&epi->llink) &&
- !EP_IS_LINKED(&epi->rdllink)) {
- list_add_tail(&epi->rdllink, &ep->rdllist);
-
- /* Notify waiting tasks that events are available */
- if (waitqueue_active(&ep->wq))
- wake_up(&ep->wq);
- if (waitqueue_active(&ep->poll_wait))
- pwake++;
+ /*
+ * If the item is not linked to the hash it means that it's on its
+ * way toward the removal. Do nothing in this case.
+ */
+ if (EP_IS_LINKED(&epi->llink)) {
+ /*
+ * If the item is "hot" and it is not registered inside the ready
+ * list, push it inside. If the item is not "hot" and it is currently
+ * registered inside the ready list, unlink it.
+ */
+ if (revents & event->events) {
+ if (!EP_IS_LINKED(&epi->rdllink)) {
+ list_add_tail(&epi->rdllink, &ep->rdllist);
+
+ /* Notify waiting tasks that events are available */
+ if (waitqueue_active(&ep->wq))
+ wake_up(&ep->wq);
+ if (waitqueue_active(&ep->poll_wait))
+ pwake++;
+ }
+ } else if (EP_IS_LINKED(&epi->rdllink))
+ EP_LIST_DEL(&epi->rdllink);
}

write_unlock_irqrestore(&ep->lock, flags);
@@ -1113,8 +1133,7 @@
/* This is called without locks, so we need the atomic exchange */
nwait = xchg(&epi->nwait, 0);

- if (nwait)
- {
+ if (nwait) {
while (!list_empty(lsthead)) {
pwq = list_entry(lsthead->next, struct eppoll_entry, llink);

@@ -1295,28 +1314,45 @@
* during the f_op->poll() call, we try to collect the maximum number of items
* by reducing the irqlock/irqunlock switching rate.
*/
-static int ep_collect_ready_items(struct eventpoll *ep, struct epitem **aepi, int maxepi)
+static int ep_collect_ready_items(struct eventpoll *ep, struct list_head *txlist, int maxevents)
{
int nepi;
unsigned long flags;
- struct list_head *lsthead = &ep->rdllist;
+ struct list_head *lsthead = &ep->rdllist, *lnk;
+ struct epitem *epi;

write_lock_irqsave(&ep->lock, flags);

- for (nepi = 0; nepi < maxepi && !list_empty(lsthead);) {
- struct epitem *epi = list_entry(lsthead->next, struct epitem, rdllink);
+ for (nepi = 0, lnk = lsthead->next; lnk != lsthead && nepi < maxevents;) {
+ epi = list_entry(lnk, struct epitem, rdllink);

- /* Remove the item from the ready list */
- EP_LIST_DEL(&epi->rdllink);
+ lnk = lnk->next;

- /*
- * We need to increase the usage count of the "struct epitem" because
- * another thread might call EPOLL_CTL_DEL on this target and make the
- * object to vanish underneath our nose.
- */
- ep_use_epitem(epi);
+ /* If this file is already in the ready list we exit soon */
+ if (!EP_IS_LINKED(&epi->txlink)) {
+ /*
+ * We need to increase the usage count of the "struct epitem" because
+ * another thread might call EPOLL_CTL_DEL on this target and make the
+ * object to vanish underneath our nose.
+ */
+ ep_use_epitem(epi);

- aepi[nepi++] = epi;
+ /*
+ * This is initialized in this way so that the default
+ * behaviour of the reinjecting code will be to push back
+ * the item inside the ready list.
+ */
+ epi->revents = epi->event.events;
+
+ /* Link the ready item into the transfer list */
+ list_add(&epi->txlink, txlist);
+ nepi++;
+
+ /*
+ * Unlink the item from the ready list.
+ */
+ EP_LIST_DEL(&epi->rdllink);
+ }
}

write_unlock_irqrestore(&ep->lock, flags);
@@ -1330,36 +1366,40 @@
* __copy_to_user() might sleep, and also f_op->poll() might reenable the IRQ
* because of the way poll() is traditionally implemented in Linux.
*/
-static int ep_send_events(struct eventpoll *ep, struct epitem **aepi, int nepi,
+static int ep_send_events(struct eventpoll *ep, struct list_head *txlist,
struct epoll_event *events)
{
- int i, eventcnt, eventbuf, revents;
+ int eventcnt = 0, eventbuf = 0;
+ unsigned int revents;
+ struct list_head *lnk;
struct epitem *epi;
struct epoll_event event[EP_MAX_BUF_EVENTS];

- for (i = 0, eventcnt = 0, eventbuf = 0; i < nepi; i++, aepi++) {
- epi = *aepi;
+ list_for_each(lnk, txlist) {
+ epi = list_entry(lnk, struct epitem, txlink);

/* Get the ready file event set */
revents = epi->file->f_op->poll(epi->file, NULL);

- if (revents & epi->event.events) {
+ /*
+ * Set the return event set for the current file descriptor.
+ * Note that only the task task was successfully able to link
+ * the item to its "txlist" will write this field.
+ */
+ epi->revents = revents & epi->event.events;
+
+ if (epi->revents) {
event[eventbuf] = epi->event;
event[eventbuf].events &= revents;
eventbuf++;
if (eventbuf == EP_MAX_BUF_EVENTS) {
if (__copy_to_user(&events[eventcnt], event,
- eventbuf * sizeof(struct epoll_event))) {
- for (; i < nepi; i++, aepi++)
- ep_release_epitem(*aepi);
+ eventbuf * sizeof(struct epoll_event)))
return -EFAULT;
- }
eventcnt += eventbuf;
eventbuf = 0;
}
}
-
- ep_release_epitem(epi);
}

if (eventbuf) {
@@ -1374,12 +1414,66 @@

/*
+ * Walk through the transfer list we collected with ep_collect_ready_items()
+ * and, if 1) the item is still "alive" 2) its event set is not empty 3) it's
+ * not already linked, links it to the ready list.
+ */
+static void ep_reinject_items(struct eventpoll *ep, struct list_head *txlist)
+{
+ int ricnt = 0, pwake = 0;
+ unsigned long flags;
+ struct epitem *epi;
+
+ write_lock_irqsave(&ep->lock, flags);
+
+ while (!list_empty(txlist)) {
+ epi = list_entry(txlist->next, struct epitem, txlink);
+
+ /* Unlink the current item from the transfer list */
+ EP_LIST_DEL(&epi->txlink);
+
+ /*
+ * If the item is no more linked to the interest set, we don't
+ * have to push it inside the ready list because the following
+ * ep_release_epitem() is going to drop it.
+ */
+ if (EP_IS_LINKED(&epi->llink) && (epi->revents & epi->event.events) &&
+ !EP_IS_LINKED(&epi->rdllink)) {
+ list_add_tail(&epi->rdllink, &ep->rdllist);
+ ricnt++;
+ }
+
+ ep_release_epitem(epi);
+ }
+
+ if (ricnt) {
+ /*
+ * Wake up ( if active ) both the eventpoll wait list and the ->poll()
+ * wait list.
+ */
+ if (waitqueue_active(&ep->wq))
+ wake_up(&ep->wq);
+ if (waitqueue_active(&ep->poll_wait))
+ pwake++;
+ }
+
+ write_unlock_irqrestore(&ep->lock, flags);
+
+ /* We have to call this outside the lock */
+ if (pwake)
+ ep_poll_safewake(&psw, &ep->poll_wait);
+}
+
+
+/*
* Perform the transfer of events to user space.
*/
static int ep_events_transfer(struct eventpoll *ep, struct epoll_event *events, int maxevents)
{
- int eventcnt, nepi, sepi, maxepi;
- struct epitem *aepi[EP_MAX_COLLECT_ITEMS];
+ int eventcnt = 0;
+ struct list_head txlist;
+
+ INIT_LIST_HEAD(&txlist);

/*
* We need to lock this because we could be hit by
@@ -1392,25 +1486,13 @@
*/
down_read(&epsem);

- for (eventcnt = 0; eventcnt < maxevents;) {
- /* Maximum items we can extract this time */
- maxepi = min(EP_MAX_COLLECT_ITEMS, maxevents - eventcnt);
-
- /* Collect/extract ready items */
- nepi = ep_collect_ready_items(ep, aepi, maxepi);
-
- if (nepi) {
- /* Send events to userspace */
- sepi = ep_send_events(ep, aepi, nepi, &events[eventcnt]);
- if (sepi < 0) {
- up_read(&epsem);
- return sepi;
- }
- eventcnt += sepi;
- }
+ /* Collect/extract ready items */
+ if (ep_collect_ready_items(ep, &txlist, maxevents)) {
+ /* Build result set in userspace */
+ eventcnt = ep_send_events(ep, &txlist, events);

- if (nepi < maxepi)
- break;
+ /* Reinject ready items into the ready list */
+ ep_reinject_items(ep, &txlist);
}

up_read(&epsem);
diff -Nru linux-2.5.64.vanilla/include/linux/eventpoll.h linux-2.5.64.epoll/include/linux/eventpoll.h
--- linux-2.5.64.vanilla/include/linux/eventpoll.h Tue Mar 4 19:29:17 2003
+++ linux-2.5.64.epoll/include/linux/eventpoll.h Sun Mar 9 09:09:23 2003
@@ -1,6 +1,6 @@
/*
* include/linux/eventpoll.h ( Efficent event polling implementation )
- * Copyright (C) 2001,...,2002 Davide Libenzi
+ * Copyright (C) 2001,...,2003 Davide Libenzi
*
* This program is free software; you can redistribute it and/or modify
* it under the terms of the GNU General Public License as published by

2003-03-11 04:21:22

by Niels Provos

[permalink] [raw]

Subject: Re: [patch, rfc] lt-epoll ( level triggered epoll ) ...

On Mon, Mar 10, 2003 at 12:15:25PM -0800, Davide Libenzi wrote:
> The LT epoll is by all means the fastest poll available and can be used
> wherever poll can be used. To test it I also ported thttpd to LT
> epoll and, so far, it didn't puke on my face. Niels and Marius also wrote
> a nice event library :
[...]
> that uses LT epoll, as long as poll and select. The usage pattern of an LT
I compared the performance of LT epoll using libevent against other
event mechanisms: select, poll and kqueue.

You can find the results at

http://www.monkey.org/~provos/libevent/

As kqueue and epoll are not available on the same operating system,
you have to take the results qualitatively. However, kqueue and epoll
are in the same ballpark. Both outperform select and poll.

It seems that option 3) which implements both "edge" and "level"
triggered behavior is the best solution. This is similar to kqueue
which supports both triggering modes.

Niels.

2003-03-11 09:23:47

by bert hubert

[permalink] [raw]

Subject: Re: [patch, rfc] lt-epoll ( level triggered epoll ) ...

On Mon, Mar 10, 2003 at 12:15:25PM -0800, Davide Libenzi wrote:

> 2) Existing apps using poll/select can easily be ported usinf LT epoll

This is a big thing. I created a webserver based on MTasker
(ds9a.nl/mtasker) that used select, poll or epoll and it was very hard to
abstract this properly as level and edge semantics differ so wildly.

Most programs will not abandon 'legacy' interfaces like poll and select and
will only want to offer epoll in addition. Right now that is hard to do.

I implemented this by 'relevelling' the edgeness of epoll, which is double
work.

> 1) We leave epoll as is ( ET )
> 2) We apply the patch that will make epoll LT
> 3) We add a parameter to epoll_create() to fix the interface behaviour at
> creation time ( small change on the current patch )
>
> With 2) and 3) there are also man pages to be reviewed to be posted to
> Andries. Comments ?

I'd vote for 2.

Regards,

bert

--
http://www.PowerDNS.com Open source, database driven DNS Software
http://lartc.org Linux Advanced Routing & Traffic Control HOWTO
http://netherlabs.nl Consulting

2003-03-11 14:14:14

by Jamie Lokier

[permalink] [raw]

Subject: Re: [patch, rfc] lt-epoll ( level triggered epoll ) ...

Davide Libenzi wrote:
> To briefly explain the difference between an Edge Triggered and an Level
> Triggered interface, suppose this sequence :
>
> 1) Pipe writer writes 2Kb of data on the write side
> 2) Pipe reader read 1Kb
> 3) Pipe reader calls epoll_wait()
> [...]
> The LT epoll is by all means the fastest poll available and can be used
> wherever poll can be used. To test it I also ported thttpd to LT
> epoll and, so far, it didn't puke on my face. Niels and Marius also wrote
> a nice event library :

I'm wondering about performance in terms of number of system calls.

If I do not completely read from a pipe or socket fd and then decide
to service some other fds (to avoid starvation when one fd is flooded
with data), with LT epoll I will need to issue an extra two
epoll_ctl() system calls. These are to (a) unregister my interest in
the first fd because it's moved out of my "live" set in userspace,
while I spend a timeslice or so servicing other fdss; (b) reregister
my interest when it is time to reactivate that fd's handler. These
are not required with ET epoll.

That is a con.

On the other hand, with LT poll I do not have to keep trying to read()
until I see an EAGAIN: I can "pause" a userspace handler when it sees
a short read(), guessing that there is no more data right now. If
therre actually is more data, I can be confident the event loop will
report it. With UDP-style fds, I can simply read one packet. In both
scenarios, LT epoll saves one read() system call.

I am not sure; perhaps there is a similar saving of one accept() call
per event on a listening socket?

This is a pro.

So there you go. I was about to complain that LT epoll would increase
the number of system calls in some cases, but I correct myself
already. I think it will decrease the number of system calls on
average due to the removal of extraneous read() and maybe accept() calls.

> LT epoll you simply use epoll_ctl(EPOLL_CTL_MOD) to switch between
> EPOLLIN and EPOLLOUT.

?? Is this poorly worded? EPOLLIN and EPOLLOUT are independent events,
aren't they?

> In front of this considerations we
> have three options that I can think :
>
> 1) We leave epoll as is ( ET )
> 2) We apply the patch that will make epoll LT
> 3) We add a parameter to epoll_create() to fix the interface behaviour at
> creation time ( small change on the current patch )

Is it not better to (4) select the behaviour when an fd interest is
registered? I think this is cleanest, if the code is not too
horrible.

Actually I think _this_ is cleanest: A three-way flag per registered
fd interest saying whether to:

1. Report 0->1 edges for this interest. (Initial 1 counts as an event).
2. Continually report 1 levels for this interest.
3. One-shot, report the first time 1 is noted and unregister.

ET poll is equivalent to 1. LT poll is equivalent to 2. dnotify's
one-shot mode is equivalent to 3.

I don't know whether it would make the epoll code messy to do this. I
suspect it would be quite clean.

cheers,
-- Jamie

2003-03-11 16:00:23

by Giuliano Pochini

[permalink] [raw]

Subject: RE: [patch, rfc] lt-epoll ( level triggered epoll ) ...

On 10-Mar-2003 Davide Libenzi wrote:
>
> During the last three months I spent considerable amount of time explainig
> developers how edge triggered APIs works, and how to use epoll inside
> their apps. It's time for me to face the reality, that is that:
>
> 1) developers not quite understand ET APIs
> 2) most existing apps are written using LT APIs

1) is easily solvable writing a good FAQ. "Developers" who really
can't undestand edge triggered APIs will continue to use poll().
If ET il faster than LT, tell people to stop whining and to learn
the API. Otherwise choose LT, mainly because of 2), but also
because ET API is more subtle bug prone.

Bye.

2003-03-11 17:55:34

by Davide Libenzi

[permalink] [raw]

Subject: Re: [patch, rfc] lt-epoll ( level triggered epoll ) ...

On Mon, 10 Mar 2003, Niels Provos wrote:

> On Mon, Mar 10, 2003 at 12:15:25PM -0800, Davide Libenzi wrote:
> > The LT epoll is by all means the fastest poll available and can be used
> > wherever poll can be used. To test it I also ported thttpd to LT
> > epoll and, so far, it didn't puke on my face. Niels and Marius also wrote
> > a nice event library :
> [...]
> > that uses LT epoll, as long as poll and select. The usage pattern of an LT
> I compared the performance of LT epoll using libevent against other
> event mechanisms: select, poll and kqueue.
>
> You can find the results at
>
> http://www.monkey.org/~provos/libevent/

Niels, can you publish inside the libevent web site the library that has
epoll inside ( possibly with the changes I made to epoll_create() ) ? I'm
including the test application, so other can repeat tests if they want.

- Davide

/*
* Copyright 2003 Niels Provos <[email protected]>
* All rights reserved.
*
*
* Mon 03/10/2003 - Modified by Davide Libenzi <[email protected]>
*
* Added chain event propagation to improve the sensitivity of
* the measure respect to the event loop efficency.
*
*
* Redistribution and use in source and binary forms, with or without
* modification, are permitted provided that the following conditions
* are met:
* 1. Redistributions of source code must retain the above copyright
* notice, this list of conditions and the following disclaimer.
* 2. Redistributions in binary form must reproduce the above copyright
* notice, this list of conditions and the following disclaimer in the
* documentation and/or other materials provided with the distribution.
* 3. All advertising materials mentioning features or use of this software
* must display the following acknowledgement:
* This product includes software developed by Niels Provos.
* 4. The name of the author may not be used to endorse or promote products
* derived from this software without specific prior written permission.
*
* THIS SOFTWARE IS PROVIDED BY THE AUTHOR ``AS IS'' AND ANY EXPRESS OR
* IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES
* OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED.
* IN NO EVENT SHALL THE AUTHOR BE LIABLE FOR ANY DIRECT, INDIRECT,
* INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT
* NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
* DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
* THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
* (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF
* THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
*/

#ifdef HAVE_CONFIG_H
#include "config.h"
#endif

#include <sys/types.h>
#include <sys/stat.h>
#include <sys/time.h>
#include <sys/socket.h>
#include <sys/signal.h>
#include <sys/resource.h>
#include <fcntl.h>
#include <stdlib.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <errno.h>

#include <event.h>

static int count, writes, fired;
static int *pipes;
static int num_pipes, num_active, num_writes;
static struct event *events;

void
read_cb(int fd, short which, void *arg)
{
int idx = (int) arg, widx = idx + 1;
struct event *ev = &events[idx];
u_char ch;

count += read(fd, &ch, sizeof(ch));
if (writes) {
if (widx >= num_pipes)
widx -= num_pipes;
write(pipes[2 * widx + 1], "e", 1);
writes--;
fired++;
}
}

struct timeval *
run_once(void)
{
int *cp, i, space;
static struct timeval ts, te;

for (cp = pipes, i = 0; i < num_pipes; i++, cp += 2) {
event_del(&events[i]);
event_set(&events[i], cp[0], EV_READ | EV_PERSIST, read_cb, (void *) i);
event_add(&events[i], NULL);
}

event_loop(EVLOOP_ONCE | EVLOOP_NONBLOCK);

fired = 0;
space = num_pipes / num_active;
space = space * 2;
for (i = 0; i < num_active; i++, fired++)
write(pipes[i * space + 1], "e", 1);

count = 0;
writes = num_writes;

gettimeofday(&ts, NULL);
do {
event_loop(EVLOOP_ONCE | EVLOOP_NONBLOCK);
} while (count != fired);
gettimeofday(&te, NULL);

timersub(&te, &ts, &te);

return (&te);
}

int
main (int argc, char **argv)
{
struct rlimit rl;
int i, c;
struct timeval *tv;
int *cp;
extern char *optarg;

num_pipes = 100;
num_active = 1;
num_writes = num_pipes;
while ((c = getopt(argc, argv, "n:a:w:")) != -1) {
switch (c) {
case 'n':
num_pipes = atoi(optarg);
break;
case 'a':
num_active = atoi(optarg);
break;
case 'w':
num_writes = atoi(optarg);
break;
default:
fprintf(stderr, "Illegal argument \"%c\"\n", c);
exit(1);
}
}

rl.rlim_cur = rl.rlim_max = num_pipes * 2 + 50;
if (setrlimit(RLIMIT_NOFILE, &rl) == -1) {
perror("setrlimit");
exit(1);
}

events = calloc(num_pipes, sizeof(struct event));
pipes = calloc(num_pipes * 2, sizeof(int));
if (events == NULL || pipes == NULL) {
perror("malloc");
exit(1);
}

memset(events, 0, num_pipes * sizeof(struct event));

event_init();

for (cp = pipes, i = 0; i < num_pipes; i++, cp += 2) {
if (pipe(cp) == -1) {
perror("pipe");
exit(1);
}
}

for (i = 0; i < 25; i++) {
tv = run_once();
if (tv == NULL)
exit(1);
fprintf(stdout, "%ld\n",
tv->tv_sec * 1000000L + tv->tv_usec);
}

exit(0);
}

2003-03-11 18:00:54

by Davide Libenzi

[permalink] [raw]

Subject: Re: [patch, rfc] lt-epoll ( level triggered epoll ) ...

On Tue, 11 Mar 2003, bert hubert wrote:

> On Mon, Mar 10, 2003 at 12:15:25PM -0800, Davide Libenzi wrote:
>
> > 2) Existing apps using poll/select can easily be ported usinf LT epoll
>
> This is a big thing. I created a webserver based on MTasker
> (ds9a.nl/mtasker) that used select, poll or epoll and it was very hard to
> abstract this properly as level and edge semantics differ so wildly.
>
> Most programs will not abandon 'legacy' interfaces like poll and select and
> will only want to offer epoll in addition. Right now that is hard to do.

I agree here. It took 15 minutes to port thttpd to LT epoll.

> > 1) We leave epoll as is ( ET )
> > 2) We apply the patch that will make epoll LT
> > 3) We add a parameter to epoll_create() to fix the interface behaviour at
> > creation time ( small change on the current patch )
> >
> > With 2) and 3) there are also man pages to be reviewed to be posted to
> > Andries. Comments ?
>
> I'd vote for 2.

I received a bunch of private emails ( ppl that are using ET epoll )
asking me to have both behaviours. The code require no more than 10 lines
of code to give both possibilities. We have two options in doing that :

1)
We add a parameter to epoll_create() that will set the interface behaviour
at creation time :

#define EPOLL_ET (1 << 0)

int epoll_create(int size, int flags);

2)
We can go at fd granularity by leaving the API the same, and we define :

#define EPOLLET (1 << 31)

...
events = EPOLLIN | EPOLLET;

What do you think ?

- Davide

2003-03-11 18:08:07

by Davide Libenzi

[permalink] [raw]

Subject: Re: [patch, rfc] lt-epoll ( level triggered epoll ) ...

On Tue, 11 Mar 2003, Jamie Lokier wrote:

> > LT epoll you simply use epoll_ctl(EPOLL_CTL_MOD) to switch between
> > EPOLLIN and EPOLLOUT.
>
> ?? Is this poorly worded? EPOLLIN and EPOLLOUT are independent events,
> aren't they?

Yes, they're. But since ET epoll has the edge feature that filter events,
you can safely register both events and insertion time *if you want*.

> > In front of this considerations we
> > have three options that I can think :
> >
> > 1) We leave epoll as is ( ET )
> > 2) We apply the patch that will make epoll LT
> > 3) We add a parameter to epoll_create() to fix the interface behaviour at
> > creation time ( small change on the current patch )
>
> Is it not better to (4) select the behaviour when an fd interest is
> registered? I think this is cleanest, if the code is not too
> horrible.

The code does not change much ( I think about 10 lines of code :) ), and
I'm for this option. That has also the other advantage to not change the
API parameters. Obviously man pages will have to be reviewed, but this is
another story :)

- Davide

2003-03-11 18:15:00

by Davide Libenzi

[permalink] [raw]

Subject: RE: [patch, rfc] lt-epoll ( level triggered epoll ) ...

On Tue, 11 Mar 2003, Giuliano Pochini wrote:

>
> On 10-Mar-2003 Davide Libenzi wrote:
> >
> > During the last three months I spent considerable amount of time explainig
> > developers how edge triggered APIs works, and how to use epoll inside
> > their apps. It's time for me to face the reality, that is that:
> >
> > 1) developers not quite understand ET APIs
> > 2) most existing apps are written using LT APIs
>
> 1) is easily solvable writing a good FAQ. "Developers" who really
> can't undestand edge triggered APIs will continue to use poll().
> If ET il faster than LT, tell people to stop whining and to learn
> the API. Otherwise choose LT, mainly because of 2), but also
> because ET API is more subtle bug prone.

Believe me, many don't understand ET APIs. The epoll(4) man page tries to
answer to many questions but still it's not easy to wipe LT behaviour from
developers brain. On the other side, the merging of epoll inside the
kernel will have to have the goal to be actually and actively used by
developers. Due to the initial epoll architecture it was extremely easy to
have LT epoll and, right now, it has almost no cost to support both
behaviours. My current thought is to have the epoll user to choose the
behaviour on a per-fd basis, using a new EPOLLET event flag. I'm obviously
open to better suggestions ...

- Davide

2003-03-11 19:54:29

by Niels Provos

[permalink] [raw]

Subject: Re: [patch, rfc] lt-epoll ( level triggered epoll ) ...

On Tue, Mar 11, 2003 at 10:15:10AM -0800, Davide Libenzi wrote:
> Niels, can you publish inside the libevent web site the library that has
> epoll inside ( possibly with the changes I made to epoll_create() ) ? I'm
> including the test application, so other can repeat tests if they want.
I just put up a snapshot at

http://naughty.monkey.org/~provos/libevent-snapshot.tar.gz

that includes the benchmark.

Niels.

2003-03-12 11:54:56

by bert hubert

[permalink] [raw]

Subject: Re: [patch, rfc] lt-epoll ( level triggered epoll ) ...

On Tue, Mar 11, 2003 at 10:20:38AM -0800, Davide Libenzi wrote:

> > Most programs will not abandon 'legacy' interfaces like poll and select and
> > will only want to offer epoll in addition. Right now that is hard to do.
>
> I agree here. It took 15 minutes to port thttpd to LT epoll.

Having level ability will massively speed up epoll adoption. By the way, was
there a reason to go to edge in the first place?

> I received a bunch of private emails ( ppl that are using ET epoll )
> asking me to have both behaviours. The code require no more than 10 lines
> of code to give both possibilities. We have two options in doing that :

Impressive.

> We add a parameter to epoll_create() that will set the interface behaviour
> at creation time :
...
> We can go at fd granularity by leaving the API the same, and we define :
> #define EPOLLET (1 << 31)

This last option would retain the current ABI *and* semantics for unchanged
programs. I do wonder if there is a case where you'd want to run in mixed
mode, however. But if the code to support mixed operation is truly trivial,
I think we should not set policy from the kernel ('only do epoll in one
mode') and leave it up to userspace to discover if there is a use for this.

Anyhow, as a member of the kCowSay [1] association of userspace people
meddling in the affairs of kernel coders, I vote strongly for having level
triggered epoll on the kernel, with the ability to do mixed mode.

Regards,

bert

[1] Founded by John Levon on the assumption that kernel coders assume all
userspace code to be trivial, so we should have a massively trivial
name. kCowSay tries to educate kernel deities about the needs of us
userspace dwellers.

--
http://www.PowerDNS.com Open source, database driven DNS Software
http://lartc.org Linux Advanced Routing & Traffic Control HOWTO
http://netherlabs.nl Consulting

2003-03-12 16:58:05

by Davide Libenzi

[permalink] [raw]

Subject: Re: [patch, rfc] lt-epoll ( level triggered epoll ) ...

On Wed, 12 Mar 2003, bert hubert wrote:

> On Tue, Mar 11, 2003 at 10:20:38AM -0800, Davide Libenzi wrote:
>
> > > Most programs will not abandon 'legacy' interfaces like poll and select and
> > > will only want to offer epoll in addition. Right now that is hard to do.
> >
> > I agree here. It took 15 minutes to port thttpd to LT epoll.
>
> Having level ability will massively speed up epoll adoption. By the way, was
> there a reason to go to edge in the first place?

Well, the first version of /dev/epoll was truly ET and it wasn't using
poll kernel hooks. With the usage of poll hook it became more affordable
to consider to have both ET and LT behaviours. ET would be my pick if I
have to start a new application from scratch anyway.

> > We add a parameter to epoll_create() that will set the interface behaviour
> > at creation time :
> ...
> > We can go at fd granularity by leaving the API the same, and we define :
> > #define EPOLLET (1 << 31)
>
> This last option would retain the current ABI *and* semantics for unchanged
> programs. I do wonder if there is a case where you'd want to run in mixed
> mode, however. But if the code to support mixed operation is truly trivial,
> I think we should not set policy from the kernel ('only do epoll in one
> mode') and leave it up to userspace to discover if there is a use for this.
>
> Anyhow, as a member of the kCowSay [1] association of userspace people
> meddling in the affairs of kernel coders, I vote strongly for having level
> triggered epoll on the kernel, with the ability to do mixed mode.

The patch I posted yesterday does the per-fd selectable ET/LT behaviour.
It ran fine on UP and 2SMP tonight, so I'm going to ping Linus for a merge
later today.

- Davide

2003-03-12 17:55:22

by Martin Waitz

[permalink] [raw]

Subject: Re: [patch, rfc] lt-epoll ( level triggered epoll ) ...

hi :)

On Mon, Mar 10, 2003 at 11:32:02PM -0500, Niels Provos wrote:
> It seems that option 3) which implements both "edge" and "level"
> triggered behavior is the best solution. This is similar to kqueue
> which supports both triggering modes.
imho the kqueue api is a lot nicer anyway.

what about simply implementing kqueue?
it's already available in other OS's,
so it's easier for application developers to adopt it, too.

--
CU, / Friedrich-Alexander University Erlangen, Germany
Martin Waitz // [Tali on IRCnet] [tali.home.pages.de] _________
______________/// - - - - - - - - - - - - - - - - - - - - ///
dies ist eine manuell generierte mail, sie beinhaltet //
tippfehler und ist auch ohne grossbuchstaben gueltig. /
-
Wer bereit ist, grundlegende Freiheiten aufzugeben, um sich
kurzfristige Sicherheit zu verschaffen, der hat weder Freiheit
noch Sicherheit verdient. Benjamin Franklin (1706 - 1790)

2003-03-12 17:57:46

by Martin Waitz

[permalink] [raw]

Subject: Re: [patch, rfc] lt-epoll ( level triggered epoll ) ...

On Tue, Mar 11, 2003 at 05:10:56PM +0100, Giuliano Pochini wrote:
> If ET il faster than LT, tell people to stop whining and to learn
> the API. Otherwise choose LT, mainly because of 2), but also
> because ET API is more subtle bug prone.

in some situations, ET simply has wrong semantics.

--
CU, / Friedrich-Alexander University Erlangen, Germany
Martin Waitz // [Tali on IRCnet] [tali.home.pages.de] _________
______________/// - - - - - - - - - - - - - - - - - - - - ///
dies ist eine manuell generierte mail, sie beinhaltet //
tippfehler und ist auch ohne grossbuchstaben gueltig. /
-
Wer bereit ist, grundlegende Freiheiten aufzugeben, um sich
kurzfristige Sicherheit zu verschaffen, der hat weder Freiheit
noch Sicherheit verdient. Benjamin Franklin (1706 - 1790)

2003-03-12 18:11:20

by Davide Libenzi

[permalink] [raw]

Subject: Re: [patch, rfc] lt-epoll ( level triggered epoll ) ...

On Wed, 12 Mar 2003, Martin Waitz wrote:

> On Tue, Mar 11, 2003 at 05:10:56PM +0100, Giuliano Pochini wrote:
> > If ET il faster than LT, tell people to stop whining and to learn
> > the API. Otherwise choose LT, mainly because of 2), but also
> > because ET API is more subtle bug prone.
>
> in some situations, ET simply has wrong semantics.

IMO ET has perfectly nice semantics. The fact that ppl fail to understand
it does not make it automatically wrong. If things not understood would
have been flagged as wrong, we would be still living in caves.

- Davide

2003-03-12 18:16:48

by Davide Libenzi

[permalink] [raw]

Subject: Re: [patch, rfc] lt-epoll ( level triggered epoll ) ...

On Wed, 12 Mar 2003, Martin Waitz wrote:

> On Mon, Mar 10, 2003 at 11:32:02PM -0500, Niels Provos wrote:
> > It seems that option 3) which implements both "edge" and "level"
> > triggered behavior is the best solution. This is similar to kqueue
> > which supports both triggering modes.
> imho the kqueue api is a lot nicer anyway.
>
> what about simply implementing kqueue?
> it's already available in other OS's,
> so it's easier for application developers to adopt it, too.

See opinions about APIs are strictly personal. IMO kqueue is overbloated
for example. The epoll API is extremely easy to use and very much remember
the poll one, that many developers are used to. If you want to make your
software completely abstract, you can use Niels's libevent library for
example, that supports poll/select/epoll/kqueue.

- Davide

2003-03-12 19:10:58

by Martin Waitz

[permalink] [raw]

Subject: Re: [patch, rfc] lt-epoll ( level triggered epoll ) ...

hi :)

On Wed, Mar 12, 2003 at 10:30:54AM -0800, Davide Libenzi wrote:
> > in some situations, ET simply has wrong semantics.
>
> IMO ET has perfectly nice semantics. The fact that ppl fail to understand
> it does not make it automatically wrong. If things not understood would
> have been flagged as wrong, we would be still living in caves.

if ppl don't understand an API, it usually is flawed.

but that's not the point here, i just wanted to point out that there are
situations that are easier to solve with one or the other semantics.
and there /is/ a need for level-based events.

--
CU, / Friedrich-Alexander University Erlangen, Germany
Martin Waitz // [Tali on IRCnet] [tali.home.pages.de] _________
______________/// - - - - - - - - - - - - - - - - - - - - ///
dies ist eine manuell generierte mail, sie beinhaltet //
tippfehler und ist auch ohne grossbuchstaben gueltig. /
-
Wer bereit ist, grundlegende Freiheiten aufzugeben, um sich
kurzfristige Sicherheit zu verschaffen, der hat weder Freiheit
noch Sicherheit verdient. Benjamin Franklin (1706 - 1790)

2003-03-12 19:19:38

by Davide Libenzi

[permalink] [raw]

Subject: Re: [patch, rfc] lt-epoll ( level triggered epoll ) ...

On Wed, 12 Mar 2003, Martin Waitz wrote:

> On Wed, Mar 12, 2003 at 10:30:54AM -0800, Davide Libenzi wrote:
> > > in some situations, ET simply has wrong semantics.
> >
> > IMO ET has perfectly nice semantics. The fact that ppl fail to understand
> > it does not make it automatically wrong. If things not understood would
> > have been flagged as wrong, we would be still living in caves.
>
> if ppl don't understand an API, it usually is flawed.

It's not the API that ppl does not understand, the API is the same. It's
the Edge Triggered event distribution architecture. The equation :

"not understood architecture" == "flawed architecture"

is false in all my books.

> but that's not the point here, i just wanted to point out that there are
> situations that are easier to solve with one or the other semantics.
> and there /is/ a need for level-based events.

That's a completely different thing. The new epoll gives you both
behaviours on a per-fd basis.

- Davide

2003-03-12 19:46:47

by Martin Waitz

[permalink] [raw]

Subject: Re: [patch, rfc] lt-epoll ( level triggered epoll ) ...

hi :)

On Wed, Mar 12, 2003 at 11:39:31AM -0800, Davide Libenzi wrote:
> It's not the API that ppl does not understand, the API is the same. It's
> the Edge Triggered event distribution architecture.
which is part of the API ;)

> The equation :
>
> "not understood architecture" == "flawed architecture"
>
> is false in all my books.

that is right.
however, the probability of being flawed is much higher for things
that are not being understood...

but anyway, as you say APIs are subjective and that's perfectly fine.
if anyone wants to have a different api, he can create it on his own.

> > but that's not the point here, i just wanted to point out that there are
> > situations that are easier to solve with one or the other semantics.
> > and there /is/ a need for level-based events.
>
> That's a completely different thing. The new epoll gives you both
> behaviours on a per-fd basis.
which is a good thing

thanks for the great work!

--
CU, / Friedrich-Alexander University Erlangen, Germany
Martin Waitz // [Tali on IRCnet] [tali.home.pages.de] _________
______________/// - - - - - - - - - - - - - - - - - - - - ///
dies ist eine manuell generierte mail, sie beinhaltet //
tippfehler und ist auch ohne grossbuchstaben gueltig. /
-
Wer bereit ist, grundlegende Freiheiten aufzugeben, um sich
kurzfristige Sicherheit zu verschaffen, der hat weder Freiheit
noch Sicherheit verdient. Benjamin Franklin (1706 - 1790)

Attachments:

(No filename) (1.43 kB)
(No filename) (189.00 B)
Download all attachments

2003-03-14 15:49:02

by Valentin Nechayev

[permalink] [raw]

Subject: Re: [patch, rfc] lt-epoll ( level triggered epoll ) ...

Tue, Mar 11, 2003 at 14:27:50, jamie wrote about "Re: [patch, rfc] lt-epoll ( level triggered epoll ) ...":

> Actually I think _this_ is cleanest: A three-way flag per registered
> fd interest saying whether to:
>
> 1. Report 0->1 edges for this interest. (Initial 1 counts as an event).
> 2. Continually report 1 levels for this interest.
> 3. One-shot, report the first time 1 is noted and unregister.
>
> ET poll is equivalent to 1. LT poll is equivalent to 2. dnotify's
> one-shot mode is equivalent to 3.

kqueue can do all three variants (1st with EV_CLEAR, 3rd with EV_ONESHOT).

So, result of this whole epoll work is trivially predictable - Linux will have
analog of "overbloated" and "poorly designed" kqueue, but more poor
and with incompatible interface, adding its own stone to hell of
different APIs. Congratulations.

Linus was true: Linux is just for fun, not for work.

I say nothing bad for the real work to implement it. But the person who said
"Do poor, do incompatible, but do our own" should be blamed. You know him.

-netch-

2003-03-14 16:33:48

by Davide Libenzi

[permalink] [raw]

Subject: Re: [patch, rfc] lt-epoll ( level triggered epoll ) ...

On Fri, 14 Mar 2003, Valentin Nechayev wrote:

> Tue, Mar 11, 2003 at 14:27:50, jamie wrote about "Re: [patch, rfc] lt-epoll ( level triggered epoll ) ...":
>
> > Actually I think _this_ is cleanest: A three-way flag per registered
> > fd interest saying whether to:
> >
> > 1. Report 0->1 edges for this interest. (Initial 1 counts as an event).
> > 2. Continually report 1 levels for this interest.
> > 3. One-shot, report the first time 1 is noted and unregister.
> >
> > ET poll is equivalent to 1. LT poll is equivalent to 2. dnotify's
> > one-shot mode is equivalent to 3.
>
> kqueue can do all three variants (1st with EV_CLEAR, 3rd with EV_ONESHOT).
>
> So, result of this whole epoll work is trivially predictable - Linux will have
> analog of "overbloated" and "poorly designed" kqueue, but more poor
> and with incompatible interface, adding its own stone to hell of
> different APIs. Congratulations.

See, this is a free world, and I very much respect your opinion. On the
other side you might want to actually *read* the kqueue man page and find
out of its 24590 flags, where 99% of its users will use only 1% of its
functionality. Talking about overbloating. You might also want to know
that quite a few kqueue users currently running on your favourite OS, are
moving to Linux+epoll. The reason is still unclear to me, but I can leave
you to discover it as exercise.

> Linus was true: Linux is just for fun, not for work.

You're right here. We're having *a lot* of fun listening this kind of
statements.

- Davide

2003-03-14 17:46:10

by Tim Smith

[permalink] [raw]

Subject: Re: [patch, rfc] lt-epoll ( level triggered epoll ) ...

On Fri, 14 Mar 2003, Davide Libenzi wrote:
> See, this is a free world, and I very much respect your opinion. On the
> other side you might want to actually *read* the kqueue man page and find
> out of its 24590 flags, where 99% of its users will use only 1% of its
> functionality. Talking about overbloating. You might also want to know

Wow...that does sound overbloated. Simpler is usually better in this kind
of thing, because 99% of the users will be doing the same thing: a lot of
TCP connections. From what I've seen so far, I'm very much looking forward
to your epoll stuff.

However, just for the heck of it, let me throw out a (probably stupid) idea
for the ultimate in non-overbloated interfaces for handling a ton of TCP
connections in the (probably most) common case of those connections all
being to the same port. I've not looked into the kernel at all to see if
this would actually be feasible...just speculating based on what I'd like
as someone writing a server that I'd like to have handle 100k TCP
connections on commodity hardware.

How about an option to put a bound socket in a mode I'll call TCP Datagram
Mode (TDM). You can listen() on a TDM socket. When you accept() on a TDM
socket, you get a socket for the new connection, just like now. However,
that socket is only used for writing to the connection.

When data is available to read on the connection, instead of getting POLLIN
on the connection socket, you get a new event on the listen socket: POLLSDG
(SDG == "stream datagram"...generalization of "TCP Datagram"). You can then
use recvmsg on the listen socket, and that gives you a chunk of data from
one of the connections. The ancillary data tells you what connection the
data is from.

With this interface, plain old poll() should be good enough. For reading,
you are only poll()ing on the listen socket. You only need to poll() on the
write sockets if you fill up output buffers. So, most of the time, poll()
would only be used on one socket. Even plain old poll() scales well
to 1. :-)

(Actually...it might even be reasonable to use sendmsg() on the listen
socket to send data, too, and then get rid of the whole accept() thing for
TDM sockets. Basically, turn multiple TCP connections into a reliable form
of UDP from the application's point of view)

--Tim Smith

2003-03-14 18:40:50

by Davide Libenzi

[permalink] [raw]

Subject: Re: [patch, rfc] lt-epoll ( level triggered epoll ) ...

On Fri, 14 Mar 2003, Tim Smith wrote:

> On Fri, 14 Mar 2003, Davide Libenzi wrote:
> > See, this is a free world, and I very much respect your opinion. On the
> > other side you might want to actually *read* the kqueue man page and find
> > out of its 24590 flags, where 99% of its users will use only 1% of its
> > functionality. Talking about overbloating. You might also want to know
>
> Wow...that does sound overbloated. Simpler is usually better in this kind
> of thing, because 99% of the users will be doing the same thing: a lot of
> TCP connections. From what I've seen so far, I'm very much looking forward
> to your epoll stuff.
>
> However, just for the heck of it, let me throw out a (probably stupid) idea
> for the ultimate in non-overbloated interfaces for handling a ton of TCP
> connections in the (probably most) common case of those connections all
> being to the same port. I've not looked into the kernel at all to see if
> this would actually be feasible...just speculating based on what I'd like
> as someone writing a server that I'd like to have handle 100k TCP
> connections on commodity hardware.

The only problem having 100K connections using epoll is RAM. You need to
have 100K socket buffers, 100K connection "status", ...

> How about an option to put a bound socket in a mode I'll call TCP Datagram
> Mode (TDM). You can listen() on a TDM socket. When you accept() on a TDM
> socket, you get a socket for the new connection, just like now. However,
> that socket is only used for writing to the connection.
>
> When data is available to read on the connection, instead of getting POLLIN
> on the connection socket, you get a new event on the listen socket: POLLSDG
> (SDG == "stream datagram"...generalization of "TCP Datagram"). You can then
> use recvmsg on the listen socket, and that gives you a chunk of data from
> one of the connections. The ancillary data tells you what connection the
> data is from.
>
> With this interface, plain old poll() should be good enough. For reading,
> you are only poll()ing on the listen socket. You only need to poll() on the
> write sockets if you fill up output buffers. So, most of the time, poll()
> would only be used on one socket. Even plain old poll() scales well
> to 1. :-)
>
> (Actually...it might even be reasonable to use sendmsg() on the listen
> socket to send data, too, and then get rid of the whole accept() thing for
> TDM sockets. Basically, turn multiple TCP connections into a reliable form
> of UDP from the application's point of view)

This will work too. Sadly you have to abbandon POSIX semantics for those
sockets. The idea of detaching a new socket through an accept in to have
the abstraction on one file/socket per connection/client.

- Davide