Generic event handling mechanism.
Consider for inclusion.
Changes from 'take21' patchset:
* minor cleanups (different return values, removed unneeded variables, whitespace fixes and so on)
* fixed a bug in kevent removal in the case when the kevent being removed
is the same as the overflow_kevent (spotted by Eric Dumazet)
Changes from 'take20' patchset:
* new ring buffer implementation
* removed artificial limit on possible number of kevents
With this release and a fixed userspace web server it was possible to
achieve 3960+ req/s with a client connection rate of 4000 con/s
over a 100 Mbit LAN; data IO over the network was about 10582.7 KB/s, which
is quite close to wire speed if we take into account headers and the like.
Changes from 'take19' patchset:
* use __init instead of __devinit
* removed 'default N' from config for user statistic
* removed kevent_user_fini() since kevent can not be unloaded
* use KERN_INFO for statistic output
Changes from 'take18' patchset:
* use __init instead of __devinit
* removed 'default N' from config for user statistic
* removed kevent_user_fini() since kevent can not be unloaded
* use KERN_INFO for statistic output
Changes from 'take17' patchset:
* Use RB tree instead of hash table.
At least for a web server, the frequency of addition/deletion of new kevents
is comparable to the number of search accesses, i.e. most of the time events
are added, accessed only a couple of times and then removed, which justifies
RB tree usage over an AVL tree, since the latter has a much slower deletion
(up to O(log(N)) rotations compared to at most 3),
although a faster search (height 1.44*log(N) vs. 2*log(N)).
So for kevents I use an RB tree for now; later, when my AVL tree implementation
is ready, it will be possible to compare them.
* Changed readiness check for socket notifications.
With both of the above changes it is possible to achieve more than 3380 req/second, compared
to 2200 (sometimes 2500) req/second for epoll(), with a trivial web server and an httperf
client on the same hardware.
It is possible that the kevent figure above is capped by the maximum number of kevents
allowed at a time, which is 4096 events.
Changes from 'take16' patchset:
* misc cleanups (__read_mostly, const ...)
* created special macro which is used for mmap size (number of pages) calculation
* export kevent_socket_notify(), since it is used in network protocols which can be
built as modules (IPv6 for example)
Changes from 'take15' patchset:
* converted kevent_timer to high-resolution timers; this forces a timer API update at
http://linux-net.osdl.org/index.php/Kevent
* use struct ukevent* instead of void * in syscalls (documentation has been updated)
* added warning in kevent_add_ukevent() if ring has broken index (for testing)
Changes from 'take14' patchset:
* added kevent_wait()
This syscall waits until either the timeout expires or at least one event
becomes ready. It also commits that @num events starting from @start have been
processed by userspace and thus can be removed or rearmed (depending on their flags).
It can be used to commit events read by userspace through the mmap interface.
Example userspace code (evtest.c) can be found on the project's homepage.
* added socket notifications (send/recv/accept)
Changes from 'take13' patchset:
* do not take the lock around the user data check in __kevent_search()
* fail early if there were no registered callbacks for given type of kevent
* trailing whitespace cleanup
Changes from 'take12' patchset:
* remove non-chardev interface for initialization
* use pointer to kevent_mring instead of unsigned longs
* use aligned 64bit type in raw user data (can be used by high-res timer if needed)
* simplified enqueue/dequeue callbacks and kevent initialization
* use nanoseconds for timeout
* put number of milliseconds into timer's return data
* move some definitions into user-visible header
* removed filenames from comments
Changes from 'take11' patchset:
* include missing headers into patchset
* some trivial code cleanups (use goto instead of if/else games and so on)
* some whitespace cleanups
* check for ready_callback() callback before main loop which should save us some ticks
Changes from 'take10' patchset:
* removed non-existent prototypes
* added helper function for kevent_registered_callbacks
* fixed 80 lines comments issues
* added a header shared between userspace and kernelspace instead of embedding the definitions in one
* core restructuring to remove forward declarations
* some whitespace and coding style cleanups
* use vm_insert_page() instead of remap_pfn_range()
Changes from 'take9' patchset:
* fixed ->nopage method
Changes from 'take8' patchset:
* fixed mmap release bug
* use module_init() instead of late_initcall()
* use better structures for timer notifications
Changes from 'take7' patchset:
* new mmap interface (not tested, waiting for other changes to be acked)
- use nopage() method to dynamically substitute pages
- allocate a new page for events only when a newly added kevent requires it
- do not use ugly index dereferencing, use structure instead
- reduced amount of data in the ring (id and flags),
maximum 12 pages on x86 per kevent fd
Changes from 'take6' patchset:
* a lot of comments!
* do not use list poisoning for detection of the fact, that entry is in the list
* return number of ready kevents even if copy*user() fails
* strict check for number of kevents in syscall
* use ARRAY_SIZE for array size calculation
* changed superblock magic number
* use SLAB_PANIC instead of direct panic() call
* changed -E* return values
* a lot of small cleanups and indent fixes
Changes from 'take5' patchset:
* removed compilation warnings about unused variables when lockdep is not turned on
* do not use internal socket structures, use appropriate (exported) wrappers instead
* removed default 1 second timeout
* removed AIO stuff from patchset
Changes from 'take4' patchset:
* use miscdevice instead of chardevice
* comments fixes
Changes from 'take3' patchset:
* removed serializing mutex from kevent_user_wait()
* moved storage list processing to RCU
* silenced lockdep warnings - all storage locks are initialized in the same function, so lockdep
was taught to differentiate between the various cases
* remove kevent from storage if it is marked as broken after callback
* fixed a typo in the mmapped buffer implementation which would end up in wrong index calculation
Changes from 'take2' patchset:
* split kevent_finish_user() to locked and unlocked variants
* do not use KEVENT_STAT ifdefs, use inline functions instead
* use array of callbacks of each type instead of each kevent callback initialization
* changed name of ukevent guarding lock
* use only one kevent lock in kevent_user for all hash buckets instead of per-bucket locks
* do not use kevent_user_ctl structure; instead provide the needed arguments as syscall parameters
* various indent cleanups
* added optimisation, which is aimed to help when a lot of kevents are being copied from
userspace
* mapped buffer (initial) implementation (no userspace yet)
Changes from 'take1' patchset:
- rebased against 2.6.18-git tree
- removed ioctl controlling
- added new syscall kevent_get_events(int fd, unsigned int min_nr, unsigned int max_nr,
unsigned int timeout, void __user *buf, unsigned flags)
- use old syscall kevent_ctl for creation/removing, modification and initial kevent
initialization
- use mutexes instead of semaphores
- added file descriptor check and return error if provided descriptor does not match
kevent file operations
- various indent fixes
- removed aio_sendfile() declarations.
Thank you.
Signed-off-by: Evgeniy Polyakov <[email protected]>
Timer notifications.
Timer notifications can be used for fine-grained per-process time
management, since interval timers are very inconvenient to use
and rather limited.
This subsystem uses high-resolution timers.
id.raw[0] is used as number of seconds
id.raw[1] is used as number of nanoseconds
Signed-off-by: Evgeniy Polyakov <[email protected]>
diff --git a/kernel/kevent/kevent_timer.c b/kernel/kevent/kevent_timer.c
new file mode 100644
index 0000000..04acc46
--- /dev/null
+++ b/kernel/kevent/kevent_timer.c
@@ -0,0 +1,113 @@
+/*
+ * 2006 Copyright (c) Evgeniy Polyakov <[email protected]>
+ * All rights reserved.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software
+ * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA
+ */
+
+#include <linux/kernel.h>
+#include <linux/types.h>
+#include <linux/list.h>
+#include <linux/slab.h>
+#include <linux/spinlock.h>
+#include <linux/hrtimer.h>
+#include <linux/jiffies.h>
+#include <linux/kevent.h>
+
+struct kevent_timer
+{
+ struct hrtimer ktimer;
+ struct kevent_storage ktimer_storage;
+ struct kevent *ktimer_event;
+};
+
+static int kevent_timer_func(struct hrtimer *timer)
+{
+ struct kevent_timer *t = container_of(timer, struct kevent_timer, ktimer);
+ struct kevent *k = t->ktimer_event;
+
+ kevent_storage_ready(&t->ktimer_storage, NULL, KEVENT_MASK_ALL);
+ hrtimer_forward(timer, timer->base->softirq_time,
+ ktime_set(k->event.id.raw[0], k->event.id.raw[1]));
+ return HRTIMER_RESTART;
+}
+
+static struct lock_class_key kevent_timer_key;
+
+static int kevent_timer_enqueue(struct kevent *k)
+{
+ int err;
+ struct kevent_timer *t;
+
+ t = kmalloc(sizeof(struct kevent_timer), GFP_KERNEL);
+ if (!t)
+ return -ENOMEM;
+
+ hrtimer_init(&t->ktimer, CLOCK_MONOTONIC, HRTIMER_REL);
+ t->ktimer.expires = ktime_set(k->event.id.raw[0], k->event.id.raw[1]);
+ t->ktimer.function = kevent_timer_func;
+ t->ktimer_event = k;
+
+ err = kevent_storage_init(&t->ktimer, &t->ktimer_storage);
+ if (err)
+ goto err_out_free;
+ lockdep_set_class(&t->ktimer_storage.lock, &kevent_timer_key);
+
+ err = kevent_storage_enqueue(&t->ktimer_storage, k);
+ if (err)
+ goto err_out_st_fini;
+
+ printk("%s: jiffies: %lu, timer: %p.\n", __func__, jiffies, &t->ktimer);
+ hrtimer_start(&t->ktimer, t->ktimer.expires, HRTIMER_REL);
+
+ return 0;
+
+err_out_st_fini:
+ kevent_storage_fini(&t->ktimer_storage);
+err_out_free:
+ kfree(t);
+
+ return err;
+}
+
+static int kevent_timer_dequeue(struct kevent *k)
+{
+ struct kevent_storage *st = k->st;
+ struct kevent_timer *t = container_of(st, struct kevent_timer, ktimer_storage);
+
+ hrtimer_cancel(&t->ktimer);
+ kevent_storage_dequeue(st, k);
+ kfree(t);
+
+ return 0;
+}
+
+static int kevent_timer_callback(struct kevent *k)
+{
+ k->event.ret_data[0] = jiffies_to_msecs(jiffies);
+ return 1;
+}
+
+static int __init kevent_init_timer(void)
+{
+ struct kevent_callbacks tc = {
+ .callback = &kevent_timer_callback,
+ .enqueue = &kevent_timer_enqueue,
+ .dequeue = &kevent_timer_dequeue};
+
+ return kevent_add_callbacks(&tc, KEVENT_TIMER);
+}
+module_init(kevent_init_timer);
+
poll/select() notifications.
This patch includes generic poll/select notifications.
kevent_poll works similarly to epoll and has the same issues (the callback
is invoked not from the internal state machine of the caller, but through
a process wakeup, with a lot of allocations and so on).
Signed-off-by: Evgeniy Polyakov <[email protected]>
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 5baf3a1..f81299f 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -276,6 +276,7 @@ #include <linux/prio_tree.h>
#include <linux/init.h>
#include <linux/sched.h>
#include <linux/mutex.h>
+#include <linux/kevent.h>
#include <asm/atomic.h>
#include <asm/semaphore.h>
@@ -586,6 +587,10 @@ #ifdef CONFIG_INOTIFY
struct mutex inotify_mutex; /* protects the watches list */
#endif
+#ifdef CONFIG_KEVENT_SOCKET
+ struct kevent_storage st;
+#endif
+
unsigned long i_state;
unsigned long dirtied_when; /* jiffies of first dirtying */
@@ -739,6 +744,9 @@ #ifdef CONFIG_EPOLL
struct list_head f_ep_links;
spinlock_t f_ep_lock;
#endif /* #ifdef CONFIG_EPOLL */
+#ifdef CONFIG_KEVENT_POLL
+ struct kevent_storage st;
+#endif
struct address_space *f_mapping;
};
extern spinlock_t files_lock;
diff --git a/kernel/kevent/kevent_poll.c b/kernel/kevent/kevent_poll.c
new file mode 100644
index 0000000..94facbb
--- /dev/null
+++ b/kernel/kevent/kevent_poll.c
@@ -0,0 +1,222 @@
+/*
+ * 2006 Copyright (c) Evgeniy Polyakov <[email protected]>
+ * All rights reserved.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ */
+
+#include <linux/kernel.h>
+#include <linux/types.h>
+#include <linux/list.h>
+#include <linux/slab.h>
+#include <linux/spinlock.h>
+#include <linux/timer.h>
+#include <linux/file.h>
+#include <linux/kevent.h>
+#include <linux/poll.h>
+#include <linux/fs.h>
+
+static kmem_cache_t *kevent_poll_container_cache;
+static kmem_cache_t *kevent_poll_priv_cache;
+
+struct kevent_poll_ctl
+{
+ struct poll_table_struct pt;
+ struct kevent *k;
+};
+
+struct kevent_poll_wait_container
+{
+ struct list_head container_entry;
+ wait_queue_head_t *whead;
+ wait_queue_t wait;
+ struct kevent *k;
+};
+
+struct kevent_poll_private
+{
+ struct list_head container_list;
+ spinlock_t container_lock;
+};
+
+static int kevent_poll_enqueue(struct kevent *k);
+static int kevent_poll_dequeue(struct kevent *k);
+static int kevent_poll_callback(struct kevent *k);
+
+static int kevent_poll_wait_callback(wait_queue_t *wait,
+ unsigned mode, int sync, void *key)
+{
+ struct kevent_poll_wait_container *cont =
+ container_of(wait, struct kevent_poll_wait_container, wait);
+ struct kevent *k = cont->k;
+ struct file *file = k->st->origin;
+ u32 revents;
+
+ revents = file->f_op->poll(file, NULL);
+
+ kevent_storage_ready(k->st, NULL, revents);
+
+ return 0;
+}
+
+static void kevent_poll_qproc(struct file *file, wait_queue_head_t *whead,
+ struct poll_table_struct *poll_table)
+{
+ struct kevent *k =
+ container_of(poll_table, struct kevent_poll_ctl, pt)->k;
+ struct kevent_poll_private *priv = k->priv;
+ struct kevent_poll_wait_container *cont;
+ unsigned long flags;
+
+ cont = kmem_cache_alloc(kevent_poll_container_cache, SLAB_KERNEL);
+ if (!cont) {
+ kevent_break(k);
+ return;
+ }
+
+ cont->k = k;
+ init_waitqueue_func_entry(&cont->wait, kevent_poll_wait_callback);
+ cont->whead = whead;
+
+ spin_lock_irqsave(&priv->container_lock, flags);
+ list_add_tail(&cont->container_entry, &priv->container_list);
+ spin_unlock_irqrestore(&priv->container_lock, flags);
+
+ add_wait_queue(whead, &cont->wait);
+}
+
+static int kevent_poll_enqueue(struct kevent *k)
+{
+ struct file *file;
+ int err, ready = 0;
+ unsigned int revents;
+ struct kevent_poll_ctl ctl;
+ struct kevent_poll_private *priv;
+
+ file = fget(k->event.id.raw[0]);
+ if (!file)
+ return -EBADF;
+
+ err = -EINVAL;
+ if (!file->f_op || !file->f_op->poll)
+ goto err_out_fput;
+
+ err = -ENOMEM;
+ priv = kmem_cache_alloc(kevent_poll_priv_cache, SLAB_KERNEL);
+ if (!priv)
+ goto err_out_fput;
+
+ spin_lock_init(&priv->container_lock);
+ INIT_LIST_HEAD(&priv->container_list);
+
+ k->priv = priv;
+
+ ctl.k = k;
+ init_poll_funcptr(&ctl.pt, &kevent_poll_qproc);
+
+ err = kevent_storage_enqueue(&file->st, k);
+ if (err)
+ goto err_out_free;
+
+ revents = file->f_op->poll(file, &ctl.pt);
+ if (revents & k->event.event) {
+ ready = 1;
+ kevent_poll_dequeue(k);
+ }
+
+ return ready;
+
+err_out_free:
+ kmem_cache_free(kevent_poll_priv_cache, priv);
+err_out_fput:
+ fput(file);
+ return err;
+}
+
+static int kevent_poll_dequeue(struct kevent *k)
+{
+ struct file *file = k->st->origin;
+ struct kevent_poll_private *priv = k->priv;
+ struct kevent_poll_wait_container *w, *n;
+ unsigned long flags;
+
+ kevent_storage_dequeue(k->st, k);
+
+ spin_lock_irqsave(&priv->container_lock, flags);
+ list_for_each_entry_safe(w, n, &priv->container_list, container_entry) {
+ list_del(&w->container_entry);
+ remove_wait_queue(w->whead, &w->wait);
+ kmem_cache_free(kevent_poll_container_cache, w);
+ }
+ spin_unlock_irqrestore(&priv->container_lock, flags);
+
+ kmem_cache_free(kevent_poll_priv_cache, priv);
+ k->priv = NULL;
+
+ fput(file);
+
+ return 0;
+}
+
+static int kevent_poll_callback(struct kevent *k)
+{
+ struct file *file = k->st->origin;
+ unsigned int revents = file->f_op->poll(file, NULL);
+
+ k->event.ret_data[0] = revents & k->event.event;
+
+ return (revents & k->event.event);
+}
+
+static int __init kevent_poll_sys_init(void)
+{
+ struct kevent_callbacks pc = {
+ .callback = &kevent_poll_callback,
+ .enqueue = &kevent_poll_enqueue,
+ .dequeue = &kevent_poll_dequeue};
+
+ kevent_poll_container_cache = kmem_cache_create("kevent_poll_container_cache",
+ sizeof(struct kevent_poll_wait_container), 0, 0, NULL, NULL);
+ if (!kevent_poll_container_cache) {
+ printk(KERN_ERR "Failed to create kevent poll container cache.\n");
+ return -ENOMEM;
+ }
+
+ kevent_poll_priv_cache = kmem_cache_create("kevent_poll_priv_cache",
+ sizeof(struct kevent_poll_private), 0, 0, NULL, NULL);
+ if (!kevent_poll_priv_cache) {
+ printk(KERN_ERR "Failed to create kevent poll private data cache.\n");
+ kmem_cache_destroy(kevent_poll_container_cache);
+ kevent_poll_container_cache = NULL;
+ return -ENOMEM;
+ }
+
+ kevent_add_callbacks(&pc, KEVENT_POLL);
+
+ printk(KERN_INFO "Kevent poll()/select() subsystem has been initialized.\n");
+ return 0;
+}
+
+static struct lock_class_key kevent_poll_key;
+
+void kevent_poll_reinit(struct file *file)
+{
+ lockdep_set_class(&file->st.lock, &kevent_poll_key);
+}
+
+static void __exit kevent_poll_sys_fini(void)
+{
+ kmem_cache_destroy(kevent_poll_priv_cache);
+ kmem_cache_destroy(kevent_poll_container_cache);
+}
+
+module_init(kevent_poll_sys_init);
+module_exit(kevent_poll_sys_fini);
Socket notifications.
This patch includes socket send/recv/accept notifications.
Using a trivial web server based on kevent and these features
instead of epoll, its performance increased more than noticeably.
More details about various benchmarks and server itself
(evserver_kevent.c) can be found on project's homepage.
Signed-off-by: Evgeniy Polyakov <[email protected]>
diff --git a/fs/inode.c b/fs/inode.c
index ada7643..ff1b129 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -21,6 +21,7 @@ #include <linux/pagemap.h>
#include <linux/cdev.h>
#include <linux/bootmem.h>
#include <linux/inotify.h>
+#include <linux/kevent.h>
#include <linux/mount.h>
/*
@@ -164,12 +165,18 @@ #endif
}
inode->i_private = 0;
inode->i_mapping = mapping;
+#if defined CONFIG_KEVENT_SOCKET
+ kevent_storage_init(inode, &inode->st);
+#endif
}
return inode;
}
void destroy_inode(struct inode *inode)
{
+#if defined CONFIG_KEVENT_SOCKET
+ kevent_storage_fini(&inode->st);
+#endif
BUG_ON(inode_has_buffers(inode));
security_inode_free(inode);
if (inode->i_sb->s_op->destroy_inode)
diff --git a/include/net/sock.h b/include/net/sock.h
index edd4d73..d48ded8 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -48,6 +48,7 @@ #include <linux/lockdep.h>
#include <linux/netdevice.h>
#include <linux/skbuff.h> /* struct sk_buff */
#include <linux/security.h>
+#include <linux/kevent.h>
#include <linux/filter.h>
@@ -450,6 +451,21 @@ static inline int sk_stream_memory_free(
extern void sk_stream_rfree(struct sk_buff *skb);
+struct socket_alloc {
+ struct socket socket;
+ struct inode vfs_inode;
+};
+
+static inline struct socket *SOCKET_I(struct inode *inode)
+{
+ return &container_of(inode, struct socket_alloc, vfs_inode)->socket;
+}
+
+static inline struct inode *SOCK_INODE(struct socket *socket)
+{
+ return &container_of(socket, struct socket_alloc, socket)->vfs_inode;
+}
+
static inline void sk_stream_set_owner_r(struct sk_buff *skb, struct sock *sk)
{
skb->sk = sk;
@@ -477,6 +493,7 @@ static inline void sk_add_backlog(struct
sk->sk_backlog.tail = skb;
}
skb->next = NULL;
+ kevent_socket_notify(sk, KEVENT_SOCKET_RECV);
}
#define sk_wait_event(__sk, __timeo, __condition) \
@@ -679,21 +696,6 @@ static inline struct kiocb *siocb_to_kio
return si->kiocb;
}
-struct socket_alloc {
- struct socket socket;
- struct inode vfs_inode;
-};
-
-static inline struct socket *SOCKET_I(struct inode *inode)
-{
- return &container_of(inode, struct socket_alloc, vfs_inode)->socket;
-}
-
-static inline struct inode *SOCK_INODE(struct socket *socket)
-{
- return &container_of(socket, struct socket_alloc, socket)->vfs_inode;
-}
-
extern void __sk_stream_mem_reclaim(struct sock *sk);
extern int sk_stream_mem_schedule(struct sock *sk, int size, int kind);
diff --git a/include/net/tcp.h b/include/net/tcp.h
index 7a093d0..69f4ad2 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -857,6 +857,7 @@ static inline int tcp_prequeue(struct so
tp->ucopy.memory = 0;
} else if (skb_queue_len(&tp->ucopy.prequeue) == 1) {
wake_up_interruptible(sk->sk_sleep);
+ kevent_socket_notify(sk, KEVENT_SOCKET_RECV|KEVENT_SOCKET_SEND);
if (!inet_csk_ack_scheduled(sk))
inet_csk_reset_xmit_timer(sk, ICSK_TIME_DACK,
(3 * TCP_RTO_MIN) / 4,
diff --git a/kernel/kevent/kevent_socket.c b/kernel/kevent/kevent_socket.c
new file mode 100644
index 0000000..5040b4c
--- /dev/null
+++ b/kernel/kevent/kevent_socket.c
@@ -0,0 +1,129 @@
+/*
+ * kevent_socket.c
+ *
+ * 2006 Copyright (c) Evgeniy Polyakov <[email protected]>
+ * All rights reserved.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software
+ * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA
+ */
+
+#include <linux/kernel.h>
+#include <linux/types.h>
+#include <linux/list.h>
+#include <linux/slab.h>
+#include <linux/spinlock.h>
+#include <linux/timer.h>
+#include <linux/file.h>
+#include <linux/tcp.h>
+#include <linux/kevent.h>
+
+#include <net/sock.h>
+#include <net/request_sock.h>
+#include <net/inet_connection_sock.h>
+
+static int kevent_socket_callback(struct kevent *k)
+{
+ struct inode *inode = k->st->origin;
+ return SOCKET_I(inode)->ops->poll(SOCKET_I(inode)->file, SOCKET_I(inode), NULL);
+}
+
+int kevent_socket_enqueue(struct kevent *k)
+{
+ struct inode *inode;
+ struct socket *sock;
+ int err = -EBADF;
+
+ sock = sockfd_lookup(k->event.id.raw[0], &err);
+ if (!sock)
+ goto err_out_exit;
+
+ inode = igrab(SOCK_INODE(sock));
+ if (!inode)
+ goto err_out_fput;
+
+ err = kevent_storage_enqueue(&inode->st, k);
+ if (err)
+ goto err_out_iput;
+
+ err = k->callbacks.callback(k);
+ if (err)
+ goto err_out_dequeue;
+
+ return err;
+
+err_out_dequeue:
+ kevent_storage_dequeue(k->st, k);
+err_out_iput:
+ iput(inode);
+err_out_fput:
+ sockfd_put(sock);
+err_out_exit:
+ return err;
+}
+
+int kevent_socket_dequeue(struct kevent *k)
+{
+ struct inode *inode = k->st->origin;
+ struct socket *sock;
+
+ kevent_storage_dequeue(k->st, k);
+
+ sock = SOCKET_I(inode);
+ iput(inode);
+ sockfd_put(sock);
+
+ return 0;
+}
+
+void kevent_socket_notify(struct sock *sk, u32 event)
+{
+ if (sk->sk_socket)
+ kevent_storage_ready(&SOCK_INODE(sk->sk_socket)->st, NULL, event);
+}
+
+/*
+ * It is required for network protocols compiled as modules, like IPv6.
+ */
+EXPORT_SYMBOL_GPL(kevent_socket_notify);
+
+#ifdef CONFIG_LOCKDEP
+static struct lock_class_key kevent_sock_key;
+
+void kevent_socket_reinit(struct socket *sock)
+{
+ struct inode *inode = SOCK_INODE(sock);
+
+ lockdep_set_class(&inode->st.lock, &kevent_sock_key);
+}
+
+void kevent_sk_reinit(struct sock *sk)
+{
+ if (sk->sk_socket) {
+ struct inode *inode = SOCK_INODE(sk->sk_socket);
+
+ lockdep_set_class(&inode->st.lock, &kevent_sock_key);
+ }
+}
+#endif
+static int __init kevent_init_socket(void)
+{
+ struct kevent_callbacks sc = {
+ .callback = &kevent_socket_callback,
+ .enqueue = &kevent_socket_enqueue,
+ .dequeue = &kevent_socket_dequeue};
+
+ return kevent_add_callbacks(&sc, KEVENT_SOCKET);
+}
+module_init(kevent_init_socket);
diff --git a/net/core/sock.c b/net/core/sock.c
index b77e155..7d5fa3e 100644
--- a/net/core/sock.c
+++ b/net/core/sock.c
@@ -1402,6 +1402,7 @@ static void sock_def_wakeup(struct sock
if (sk->sk_sleep && waitqueue_active(sk->sk_sleep))
wake_up_interruptible_all(sk->sk_sleep);
read_unlock(&sk->sk_callback_lock);
+ kevent_socket_notify(sk, KEVENT_SOCKET_RECV|KEVENT_SOCKET_SEND);
}
static void sock_def_error_report(struct sock *sk)
@@ -1411,6 +1412,7 @@ static void sock_def_error_report(struct
wake_up_interruptible(sk->sk_sleep);
sk_wake_async(sk,0,POLL_ERR);
read_unlock(&sk->sk_callback_lock);
+ kevent_socket_notify(sk, KEVENT_SOCKET_RECV|KEVENT_SOCKET_SEND);
}
static void sock_def_readable(struct sock *sk, int len)
@@ -1420,6 +1422,7 @@ static void sock_def_readable(struct soc
wake_up_interruptible(sk->sk_sleep);
sk_wake_async(sk,1,POLL_IN);
read_unlock(&sk->sk_callback_lock);
+ kevent_socket_notify(sk, KEVENT_SOCKET_RECV|KEVENT_SOCKET_SEND);
}
static void sock_def_write_space(struct sock *sk)
@@ -1439,6 +1442,7 @@ static void sock_def_write_space(struct
}
read_unlock(&sk->sk_callback_lock);
+ kevent_socket_notify(sk, KEVENT_SOCKET_SEND|KEVENT_SOCKET_RECV);
}
static void sock_def_destruct(struct sock *sk)
@@ -1489,6 +1493,8 @@ #endif
sk->sk_state = TCP_CLOSE;
sk->sk_socket = sock;
+ kevent_sk_reinit(sk);
+
sock_set_flag(sk, SOCK_ZAPPED);
if(sock)
@@ -1555,8 +1561,10 @@ void fastcall release_sock(struct sock *
if (sk->sk_backlog.tail)
__release_sock(sk);
sk->sk_lock.owner = NULL;
- if (waitqueue_active(&sk->sk_lock.wq))
+ if (waitqueue_active(&sk->sk_lock.wq)) {
wake_up(&sk->sk_lock.wq);
+ kevent_socket_notify(sk, KEVENT_SOCKET_RECV|KEVENT_SOCKET_SEND);
+ }
spin_unlock_bh(&sk->sk_lock.slock);
}
EXPORT_SYMBOL(release_sock);
diff --git a/net/core/stream.c b/net/core/stream.c
index d1d7dec..2878c2a 100644
--- a/net/core/stream.c
+++ b/net/core/stream.c
@@ -36,6 +36,7 @@ void sk_stream_write_space(struct sock *
wake_up_interruptible(sk->sk_sleep);
if (sock->fasync_list && !(sk->sk_shutdown & SEND_SHUTDOWN))
sock_wake_async(sock, 2, POLL_OUT);
+ kevent_socket_notify(sk, KEVENT_SOCKET_SEND|KEVENT_SOCKET_RECV);
}
}
diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index 3f884ce..e7dd989 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -3119,6 +3119,7 @@ static void tcp_ofo_queue(struct sock *s
__skb_unlink(skb, &tp->out_of_order_queue);
__skb_queue_tail(&sk->sk_receive_queue, skb);
+ kevent_socket_notify(sk, KEVENT_SOCKET_RECV);
tp->rcv_nxt = TCP_SKB_CB(skb)->end_seq;
if(skb->h.th->fin)
tcp_fin(skb, sk, skb->h.th);
diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
index c83938b..b0dd70d 100644
--- a/net/ipv4/tcp_ipv4.c
+++ b/net/ipv4/tcp_ipv4.c
@@ -61,6 +61,7 @@ #include <linux/cache.h>
#include <linux/jhash.h>
#include <linux/init.h>
#include <linux/times.h>
+#include <linux/kevent.h>
#include <net/icmp.h>
#include <net/inet_hashtables.h>
@@ -870,6 +871,7 @@ #endif
reqsk_free(req);
} else {
inet_csk_reqsk_queue_hash_add(sk, req, TCP_TIMEOUT_INIT);
+ kevent_socket_notify(sk, KEVENT_SOCKET_ACCEPT);
}
return 0;
diff --git a/net/socket.c b/net/socket.c
index 1bc4167..5582b4a 100644
--- a/net/socket.c
+++ b/net/socket.c
@@ -85,6 +85,7 @@ #include <linux/compat.h>
#include <linux/kmod.h>
#include <linux/audit.h>
#include <linux/wireless.h>
+#include <linux/kevent.h>
#include <asm/uaccess.h>
#include <asm/unistd.h>
@@ -490,6 +491,8 @@ static struct socket *sock_alloc(void)
inode->i_uid = current->fsuid;
inode->i_gid = current->fsgid;
+ kevent_socket_reinit(sock);
+
get_cpu_var(sockets_in_use)++;
put_cpu_var(sockets_in_use);
return sock;
Core files.
This patch includes core kevent files:
* userspace controlling
* kernelspace interfaces
* initialization
* notification state machines
Some bits of documentation can be found on the project's homepage (and links from there):
http://tservice.net.ru/~s0mbre/old/?section=projects&item=kevent
Signed-off-by: Evgeniy Polyakov <[email protected]>
diff --git a/arch/i386/kernel/syscall_table.S b/arch/i386/kernel/syscall_table.S
index 7e639f7..a9560eb 100644
--- a/arch/i386/kernel/syscall_table.S
+++ b/arch/i386/kernel/syscall_table.S
@@ -318,3 +318,6 @@ ENTRY(sys_call_table)
.long sys_vmsplice
.long sys_move_pages
.long sys_getcpu
+ .long sys_kevent_get_events
+ .long sys_kevent_ctl /* 320 */
+ .long sys_kevent_wait
diff --git a/arch/x86_64/ia32/ia32entry.S b/arch/x86_64/ia32/ia32entry.S
index b4aa875..cf18955 100644
--- a/arch/x86_64/ia32/ia32entry.S
+++ b/arch/x86_64/ia32/ia32entry.S
@@ -714,8 +714,11 @@ #endif
.quad compat_sys_get_robust_list
.quad sys_splice
.quad sys_sync_file_range
- .quad sys_tee
+ .quad sys_tee /* 315 */
.quad compat_sys_vmsplice
.quad compat_sys_move_pages
.quad sys_getcpu
+ .quad sys_kevent_get_events
+ .quad sys_kevent_ctl /* 320 */
+ .quad sys_kevent_wait
ia32_syscall_end:
diff --git a/include/asm-i386/unistd.h b/include/asm-i386/unistd.h
index bd99870..f009677 100644
--- a/include/asm-i386/unistd.h
+++ b/include/asm-i386/unistd.h
@@ -324,10 +324,13 @@ #define __NR_tee 315
#define __NR_vmsplice 316
#define __NR_move_pages 317
#define __NR_getcpu 318
+#define __NR_kevent_get_events 319
+#define __NR_kevent_ctl 320
+#define __NR_kevent_wait 321
#ifdef __KERNEL__
-#define NR_syscalls 319
+#define NR_syscalls 322
#include <linux/err.h>
/*
diff --git a/include/asm-x86_64/unistd.h b/include/asm-x86_64/unistd.h
index 6137146..c53d156 100644
--- a/include/asm-x86_64/unistd.h
+++ b/include/asm-x86_64/unistd.h
@@ -619,10 +619,16 @@ #define __NR_vmsplice 278
__SYSCALL(__NR_vmsplice, sys_vmsplice)
#define __NR_move_pages 279
__SYSCALL(__NR_move_pages, sys_move_pages)
+#define __NR_kevent_get_events 280
+__SYSCALL(__NR_kevent_get_events, sys_kevent_get_events)
+#define __NR_kevent_ctl 281
+__SYSCALL(__NR_kevent_ctl, sys_kevent_ctl)
+#define __NR_kevent_wait 282
+__SYSCALL(__NR_kevent_wait, sys_kevent_wait)
#ifdef __KERNEL__
-#define __NR_syscall_max __NR_move_pages
+#define __NR_syscall_max __NR_kevent_wait
#include <linux/err.h>
#ifndef __NO_STUBS
diff --git a/include/linux/kevent.h b/include/linux/kevent.h
new file mode 100644
index 0000000..743b328
--- /dev/null
+++ b/include/linux/kevent.h
@@ -0,0 +1,205 @@
+/*
+ * 2006 Copyright (c) Evgeniy Polyakov <[email protected]>
+ * All rights reserved.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software
+ * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA
+ */
+
+#ifndef __KEVENT_H
+#define __KEVENT_H
+#include <linux/types.h>
+#include <linux/list.h>
+#include <linux/rbtree.h>
+#include <linux/spinlock.h>
+#include <linux/mutex.h>
+#include <linux/wait.h>
+#include <linux/net.h>
+#include <linux/rcupdate.h>
+#include <linux/kevent_storage.h>
+#include <linux/ukevent.h>
+
+#define KEVENT_MIN_BUFFS_ALLOC 3
+
+struct kevent;
+struct kevent_storage;
+typedef int (* kevent_callback_t)(struct kevent *);
+
+/* @callback is called each time a new event is caught,
+ * @enqueue each time a kevent is queued into an origin,
+ * @dequeue each time it is removed from one. */
+
+struct kevent_callbacks {
+ kevent_callback_t callback, enqueue, dequeue;
+};
+
+#define KEVENT_READY 0x1
+#define KEVENT_STORAGE 0x2
+#define KEVENT_USER 0x4
+
+struct kevent
+{
+ /* Used for kevent freeing.*/
+ struct rcu_head rcu_head;
+ struct ukevent event;
+ /* This lock protects ukevent manipulations, e.g. ret_flags changes. */
+ spinlock_t ulock;
+
+ /* Entry of user's tree. */
+ struct rb_node kevent_node;
+ /* Entry of origin's queue. */
+ struct list_head storage_entry;
+ /* Entry of user's ready. */
+ struct list_head ready_entry;
+
+ u32 flags;
+
+ /* User who requested this kevent. */
+ struct kevent_user *user;
+ /* Kevent container. */
+ struct kevent_storage *st;
+
+ struct kevent_callbacks callbacks;
+
+	/* Private data for different storages.
+	 * poll()/select() storage keeps a list of wait_queue_t containers
+	 * here, one for each poll_wait() call made from ->poll().
+	 */
+ void *priv;
+};
+
+struct kevent_user
+{
+ struct rb_root kevent_root;
+ spinlock_t kevent_lock;
+ /* Number of queued kevents. */
+ unsigned int kevent_num;
+
+ /* List of ready kevents. */
+ struct list_head ready_list;
+ /* Number of ready kevents. */
+ unsigned int ready_num;
+ /* Protects all manipulations with ready queue. */
+ spinlock_t ready_lock;
+
+ /* Protects against simultaneous kevent_user control manipulations. */
+ struct mutex ctl_mutex;
+ /* Wait until some events are ready. */
+ wait_queue_head_t wait;
+
+ /* Reference counter, increased for each new kevent. */
+ atomic_t refcnt;
+
+ /* First kevent which was not put into ring buffer due to overflow.
+ * It will be copied into the buffer, when first event will be removed
+ * from ready queue (and thus there will be an empty place in the
+ * ring buffer).
+ */
+ struct kevent *overflow_kevent;
+ /* Array of pages forming mapped ring buffer */
+ struct kevent_mring **pring;
+
+#ifdef CONFIG_KEVENT_USER_STAT
+ unsigned long im_num;
+ unsigned long wait_num, mmap_num;
+ unsigned long total;
+#endif
+};
+
+int kevent_enqueue(struct kevent *k);
+int kevent_dequeue(struct kevent *k);
+int kevent_init(struct kevent *k);
+void kevent_requeue(struct kevent *k);
+int kevent_break(struct kevent *k);
+
+int kevent_add_callbacks(const struct kevent_callbacks *cb, int pos);
+
+int kevent_user_ring_add_event(struct kevent *k);
+
+void kevent_storage_ready(struct kevent_storage *st,
+ kevent_callback_t ready_callback, u32 event);
+int kevent_storage_init(void *origin, struct kevent_storage *st);
+void kevent_storage_fini(struct kevent_storage *st);
+int kevent_storage_enqueue(struct kevent_storage *st, struct kevent *k);
+void kevent_storage_dequeue(struct kevent_storage *st, struct kevent *k);
+
+int kevent_user_add_ukevent(struct ukevent *uk, struct kevent_user *u);
+
+#ifdef CONFIG_KEVENT_POLL
+void kevent_poll_reinit(struct file *file);
+#else
+static inline void kevent_poll_reinit(struct file *file)
+{
+}
+#endif
+
+#ifdef CONFIG_KEVENT_USER_STAT
+static inline void kevent_stat_init(struct kevent_user *u)
+{
+ u->wait_num = u->im_num = u->total = 0;
+}
+static inline void kevent_stat_print(struct kevent_user *u)
+{
+ printk(KERN_INFO "%s: u: %p, wait: %lu, mmap: %lu, immediately: %lu, total: %lu.\n",
+ __func__, u, u->wait_num, u->mmap_num, u->im_num, u->total);
+}
+static inline void kevent_stat_im(struct kevent_user *u)
+{
+ u->im_num++;
+}
+static inline void kevent_stat_mmap(struct kevent_user *u)
+{
+ u->mmap_num++;
+}
+static inline void kevent_stat_wait(struct kevent_user *u)
+{
+ u->wait_num++;
+}
+static inline void kevent_stat_total(struct kevent_user *u)
+{
+ u->total++;
+}
+#else
+#define kevent_stat_print(u) ({ (void) u;})
+#define kevent_stat_init(u) ({ (void) u;})
+#define kevent_stat_im(u) ({ (void) u;})
+#define kevent_stat_wait(u) ({ (void) u;})
+#define kevent_stat_mmap(u) ({ (void) u;})
+#define kevent_stat_total(u) ({ (void) u;})
+#endif
+
+#ifdef CONFIG_LOCKDEP
+void kevent_socket_reinit(struct socket *sock);
+void kevent_sk_reinit(struct sock *sk);
+#else
+static inline void kevent_socket_reinit(struct socket *sock)
+{
+}
+static inline void kevent_sk_reinit(struct sock *sk)
+{
+}
+#endif
+#ifdef CONFIG_KEVENT_SOCKET
+void kevent_socket_notify(struct sock *sock, u32 event);
+int kevent_socket_dequeue(struct kevent *k);
+int kevent_socket_enqueue(struct kevent *k);
+#define sock_async(__sk) sock_flag(__sk, SOCK_ASYNC)
+#else
+static inline void kevent_socket_notify(struct sock *sock, u32 event)
+{
+}
+#define sock_async(__sk) ({ (void)__sk; 0; })
+#endif
+
+#endif /* __KEVENT_H */
diff --git a/include/linux/kevent_storage.h b/include/linux/kevent_storage.h
new file mode 100644
index 0000000..a38575d
--- /dev/null
+++ b/include/linux/kevent_storage.h
@@ -0,0 +1,11 @@
+#ifndef __KEVENT_STORAGE_H
+#define __KEVENT_STORAGE_H
+
+struct kevent_storage
+{
+ void *origin; /* Originator's pointer, e.g. struct sock or struct file. Can be NULL. */
+ struct list_head list; /* List of queued kevents. */
+ spinlock_t lock; /* Protects users queue. */
+};
+
+#endif /* __KEVENT_STORAGE_H */
diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
index 2d1c3d5..71a758f 100644
--- a/include/linux/syscalls.h
+++ b/include/linux/syscalls.h
@@ -54,6 +54,7 @@ struct compat_stat;
struct compat_timeval;
struct robust_list_head;
struct getcpu_cache;
+struct ukevent;
#include <linux/types.h>
#include <linux/aio_abi.h>
@@ -599,4 +600,8 @@ asmlinkage long sys_set_robust_list(stru
size_t len);
asmlinkage long sys_getcpu(unsigned __user *cpu, unsigned __user *node, struct getcpu_cache __user *cache);
+asmlinkage long sys_kevent_get_events(int ctl_fd, unsigned int min, unsigned int max,
+ __u64 timeout, struct ukevent __user *buf, unsigned flags);
+asmlinkage long sys_kevent_ctl(int ctl_fd, unsigned int cmd, unsigned int num, struct ukevent __user *buf);
+asmlinkage long sys_kevent_wait(int ctl_fd, unsigned int start, unsigned int num, __u64 timeout);
#endif
diff --git a/include/linux/ukevent.h b/include/linux/ukevent.h
new file mode 100644
index 0000000..daa8202
--- /dev/null
+++ b/include/linux/ukevent.h
@@ -0,0 +1,163 @@
+/*
+ * 2006 Copyright (c) Evgeniy Polyakov <[email protected]>
+ * All rights reserved.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software
+ * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA
+ */
+
+#ifndef __UKEVENT_H
+#define __UKEVENT_H
+
+/*
+ * Kevent request flags.
+ */
+
+/* Process this event only once and then dequeue. */
+#define KEVENT_REQ_ONESHOT 0x1
+
+/*
+ * Kevent return flags.
+ */
+/* Kevent is broken. */
+#define KEVENT_RET_BROKEN 0x1
+/* Kevent processing was finished successfully. */
+#define KEVENT_RET_DONE 0x2
+
+/*
+ * Kevent type set.
+ */
+#define KEVENT_SOCKET 0
+#define KEVENT_INODE 1
+#define KEVENT_TIMER 2
+#define KEVENT_POLL 3
+#define KEVENT_NAIO 4
+#define KEVENT_AIO 5
+#define KEVENT_MAX 6
+
+/*
+ * Per-type event sets.
+ * The number of event sets must match the number of kevent types above.
+ */
+
+/*
+ * Timer events.
+ */
+#define KEVENT_TIMER_FIRED 0x1
+
+/*
+ * Socket/network asynchronous IO events.
+ */
+#define KEVENT_SOCKET_RECV 0x1
+#define KEVENT_SOCKET_ACCEPT 0x2
+#define KEVENT_SOCKET_SEND 0x4
+
+/*
+ * Inode events.
+ */
+#define KEVENT_INODE_CREATE 0x1
+#define KEVENT_INODE_REMOVE 0x2
+
+/*
+ * Poll events.
+ */
+#define KEVENT_POLL_POLLIN 0x0001
+#define KEVENT_POLL_POLLPRI 0x0002
+#define KEVENT_POLL_POLLOUT 0x0004
+#define KEVENT_POLL_POLLERR 0x0008
+#define KEVENT_POLL_POLLHUP 0x0010
+#define KEVENT_POLL_POLLNVAL 0x0020
+
+#define KEVENT_POLL_POLLRDNORM 0x0040
+#define KEVENT_POLL_POLLRDBAND 0x0080
+#define KEVENT_POLL_POLLWRNORM 0x0100
+#define KEVENT_POLL_POLLWRBAND 0x0200
+#define KEVENT_POLL_POLLMSG 0x0400
+#define KEVENT_POLL_POLLREMOVE 0x1000
+
+/*
+ * Asynchronous IO events.
+ */
+#define KEVENT_AIO_BIO 0x1
+
+/* Mask of all possible event values. */
+#define KEVENT_MASK_ALL 0xffffffff
+/* Empty mask of ready events. */
+#define KEVENT_MASK_EMPTY 0x0
+
+struct kevent_id
+{
+ union {
+ __u32 raw[2];
+ __u64 raw_u64 __attribute__((aligned(8)));
+ };
+};
+
+struct ukevent
+{
+ /* Id of this request, e.g. socket number, file descriptor and so on... */
+ struct kevent_id id;
+ /* Event type, e.g. KEVENT_SOCK, KEVENT_INODE, KEVENT_TIMER and so on... */
+ __u32 type;
+ /* Event itself, e.g. SOCK_ACCEPT, INODE_CREATED, TIMER_FIRED... */
+ __u32 event;
+ /* Per-event request flags */
+ __u32 req_flags;
+ /* Per-event return flags */
+ __u32 ret_flags;
+ /* Event return data. Event originator fills it with anything it likes. */
+ __u32 ret_data[2];
+ /* User's data. It is not used, just copied to/from user.
+ * The whole structure is aligned to 8 bytes already, so the last union
+ * is aligned properly.
+ */
+ union {
+ __u32 user[2];
+ void *ptr;
+ };
+};
+
+struct mukevent
+{
+ struct kevent_id id;
+ __u32 ret_flags;
+};
+
+#define KEVENT_MAX_PAGES 2
+
+/*
+ * Note that mukevents do not exactly fill the page: the first
+ * 2*sizeof(unsigned int) bytes of each page are reserved to store the
+ * kidx/uidx indexes. Take that into account if you change struct mukevent.
+ */
+#define KEVENTS_ON_PAGE ((PAGE_SIZE-2*sizeof(unsigned int))/sizeof(struct mukevent))
+struct kevent_mring
+{
+ unsigned int kidx, uidx;
+ struct mukevent event[KEVENTS_ON_PAGE];
+};
+
+/*
+ * Used only to sanitize kevent_wait() input data - do not allow the
+ * user to specify more events than can be placed into the ring buffer.
+ * This does not limit the number of events which can be put into the
+ * kevent queue (which is unlimited).
+ */
+#define KEVENT_MAX_EVENTS (KEVENT_MAX_PAGES * KEVENTS_ON_PAGE)
+
+#define KEVENT_CTL_ADD 0
+#define KEVENT_CTL_REMOVE 1
+#define KEVENT_CTL_MODIFY 2
+
+#endif /* __UKEVENT_H */
diff --git a/init/Kconfig b/init/Kconfig
index d2eb7a8..c7d8250 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -201,6 +201,8 @@ config AUDITSYSCALL
such as SELinux. To use audit's filesystem watch feature, please
ensure that INOTIFY is configured.
+source "kernel/kevent/Kconfig"
+
config IKCONFIG
bool "Kernel .config support"
---help---
diff --git a/kernel/Makefile b/kernel/Makefile
index d62ec66..2d7a6dd 100644
--- a/kernel/Makefile
+++ b/kernel/Makefile
@@ -47,6 +47,7 @@ obj-$(CONFIG_DETECT_SOFTLOCKUP) += softl
obj-$(CONFIG_GENERIC_HARDIRQS) += irq/
obj-$(CONFIG_SECCOMP) += seccomp.o
obj-$(CONFIG_RCU_TORTURE_TEST) += rcutorture.o
+obj-$(CONFIG_KEVENT) += kevent/
obj-$(CONFIG_RELAY) += relay.o
obj-$(CONFIG_TASK_DELAY_ACCT) += delayacct.o
obj-$(CONFIG_TASKSTATS) += taskstats.o
diff --git a/kernel/kevent/Kconfig b/kernel/kevent/Kconfig
new file mode 100644
index 0000000..5ba8086
--- /dev/null
+++ b/kernel/kevent/Kconfig
@@ -0,0 +1,39 @@
+config KEVENT
+ bool "Kernel event notification mechanism"
+ help
+	  This option enables the event queue mechanism.
+	  It can be used as a replacement for poll()/select(), for AIO
+	  callback invocations, advanced timer notifications and other
+	  kernel object status changes.
+
+config KEVENT_USER_STAT
+ bool "Kevent user statistic"
+ depends on KEVENT
+ help
+	  This option turns kevent_user statistics collection on.
+	  The statistics include the total number of kevents, the number of
+	  kevents which were ready immediately at insertion time, and the
+	  number of kevents which were removed through readiness completion.
+	  They are printed each time the control kevent descriptor is closed.
+
+config KEVENT_TIMER
+ bool "Kernel event notifications for timers"
+ depends on KEVENT
+ help
+	  This option allows timers to be used through the KEVENT subsystem.
+
+config KEVENT_POLL
+ bool "Kernel event notifications for poll()/select()"
+ depends on KEVENT
+ help
+	  This option allows the kevent subsystem to be used for
+	  poll()/select() notifications.
+
+config KEVENT_SOCKET
+ bool "Kernel event notifications for sockets"
+ depends on NET && KEVENT
+ help
+	  This option enables notifications of socket operations through
+	  the KEVENT subsystem, e.g. new packet arrival conditions,
+	  ready-for-accept conditions and so on.
+
diff --git a/kernel/kevent/Makefile b/kernel/kevent/Makefile
new file mode 100644
index 0000000..9130cad
--- /dev/null
+++ b/kernel/kevent/Makefile
@@ -0,0 +1,4 @@
+obj-y := kevent.o kevent_user.o
+obj-$(CONFIG_KEVENT_TIMER) += kevent_timer.o
+obj-$(CONFIG_KEVENT_POLL) += kevent_poll.o
+obj-$(CONFIG_KEVENT_SOCKET) += kevent_socket.o
diff --git a/kernel/kevent/kevent.c b/kernel/kevent/kevent.c
new file mode 100644
index 0000000..25404d3
--- /dev/null
+++ b/kernel/kevent/kevent.c
@@ -0,0 +1,227 @@
+/*
+ * 2006 Copyright (c) Evgeniy Polyakov <[email protected]>
+ * All rights reserved.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software
+ * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA
+ */
+
+#include <linux/kernel.h>
+#include <linux/types.h>
+#include <linux/list.h>
+#include <linux/slab.h>
+#include <linux/spinlock.h>
+#include <linux/mempool.h>
+#include <linux/sched.h>
+#include <linux/wait.h>
+#include <linux/kevent.h>
+
+/*
+ * Attempts to add an event into appropriate origin's queue.
+ * Returns positive value if this event is ready immediately,
+ * negative value in case of error and zero if event has been queued.
+ * ->enqueue() callback must increase origin's reference counter.
+ */
+int kevent_enqueue(struct kevent *k)
+{
+ return k->callbacks.enqueue(k);
+}
+
+/*
+ * Remove event from the appropriate queue.
+ * ->dequeue() callback must decrease origin's reference counter.
+ */
+int kevent_dequeue(struct kevent *k)
+{
+ return k->callbacks.dequeue(k);
+}
+
+/*
+ * Mark kevent as broken.
+ */
+int kevent_break(struct kevent *k)
+{
+ unsigned long flags;
+
+ spin_lock_irqsave(&k->ulock, flags);
+ k->event.ret_flags |= KEVENT_RET_BROKEN;
+ spin_unlock_irqrestore(&k->ulock, flags);
+ return -EINVAL;
+}
+
+static struct kevent_callbacks kevent_registered_callbacks[KEVENT_MAX] __read_mostly;
+
+int kevent_add_callbacks(const struct kevent_callbacks *cb, int pos)
+{
+ struct kevent_callbacks *p;
+
+ if (pos >= KEVENT_MAX)
+ return -EINVAL;
+
+ p = &kevent_registered_callbacks[pos];
+
+ p->enqueue = (cb->enqueue) ? cb->enqueue : kevent_break;
+ p->dequeue = (cb->dequeue) ? cb->dequeue : kevent_break;
+ p->callback = (cb->callback) ? cb->callback : kevent_break;
+
+ printk(KERN_INFO "KEVENT: Added callbacks for type %d.\n", pos);
+ return 0;
+}
+
+/*
+ * Must be called before event is going to be added into some origin's queue.
+ * Initializes ->enqueue(), ->dequeue() and ->callback() callbacks.
+ * On failure the kevent must not be used: the KEVENT_RET_BROKEN flag is
+ * set in kevent->event.ret_flags and kevent_enqueue() will fail to add
+ * this kevent into its origin's queue.
+ */
+int kevent_init(struct kevent *k)
+{
+ spin_lock_init(&k->ulock);
+ k->flags = 0;
+
+ if (unlikely(k->event.type >= KEVENT_MAX ||
+ !kevent_registered_callbacks[k->event.type].callback))
+ return kevent_break(k);
+
+ k->callbacks = kevent_registered_callbacks[k->event.type];
+ if (unlikely(k->callbacks.callback == kevent_break))
+ return kevent_break(k);
+
+ return 0;
+}
+
+/*
+ * Called from ->enqueue() callback when reference counter for given
+ * origin (socket, inode...) has been increased.
+ */
+int kevent_storage_enqueue(struct kevent_storage *st, struct kevent *k)
+{
+ unsigned long flags;
+
+ k->st = st;
+ spin_lock_irqsave(&st->lock, flags);
+ list_add_tail_rcu(&k->storage_entry, &st->list);
+ k->flags |= KEVENT_STORAGE;
+ spin_unlock_irqrestore(&st->lock, flags);
+ return 0;
+}
+
+/*
+ * Dequeue kevent from origin's queue.
+ * It does not decrease origin's reference counter in any way
+ * and must be called before it, so storage itself must be valid.
+ * It is called from ->dequeue() callback.
+ */
+void kevent_storage_dequeue(struct kevent_storage *st, struct kevent *k)
+{
+ unsigned long flags;
+
+ spin_lock_irqsave(&st->lock, flags);
+ if (k->flags & KEVENT_STORAGE) {
+ list_del_rcu(&k->storage_entry);
+ k->flags &= ~KEVENT_STORAGE;
+ }
+ spin_unlock_irqrestore(&st->lock, flags);
+}
+
+/*
+ * Call kevent ready callback and queue it into ready queue if needed.
+ * If kevent is marked as one-shot, then remove it from storage queue.
+ */
+static void __kevent_requeue(struct kevent *k, u32 event)
+{
+ int ret, rem;
+ unsigned long flags;
+
+ ret = k->callbacks.callback(k);
+
+ spin_lock_irqsave(&k->ulock, flags);
+ if (ret > 0)
+ k->event.ret_flags |= KEVENT_RET_DONE;
+ else if (ret < 0)
+ k->event.ret_flags |= (KEVENT_RET_BROKEN | KEVENT_RET_DONE);
+ else
+ ret = (k->event.ret_flags & (KEVENT_RET_BROKEN|KEVENT_RET_DONE));
+ rem = (k->event.req_flags & KEVENT_REQ_ONESHOT);
+ spin_unlock_irqrestore(&k->ulock, flags);
+
+ if (ret) {
+ if ((rem || ret < 0) && (k->flags & KEVENT_STORAGE)) {
+ list_del_rcu(&k->storage_entry);
+ k->flags &= ~KEVENT_STORAGE;
+ }
+
+ spin_lock_irqsave(&k->user->ready_lock, flags);
+ if (!(k->flags & KEVENT_READY)) {
+ kevent_user_ring_add_event(k);
+ list_add_tail(&k->ready_entry, &k->user->ready_list);
+ k->flags |= KEVENT_READY;
+ k->user->ready_num++;
+ }
+ spin_unlock_irqrestore(&k->user->ready_lock, flags);
+ wake_up(&k->user->wait);
+ }
+}
+
+/*
+ * Check if kevent is ready (by invoking its callback) and requeue/remove
+ * if needed.
+ */
+void kevent_requeue(struct kevent *k)
+{
+ unsigned long flags;
+
+ spin_lock_irqsave(&k->st->lock, flags);
+ __kevent_requeue(k, 0);
+ spin_unlock_irqrestore(&k->st->lock, flags);
+}
+
+/*
+ * Called each time some activity in origin (socket, inode...) is noticed.
+ */
+void kevent_storage_ready(struct kevent_storage *st,
+ kevent_callback_t ready_callback, u32 event)
+{
+ struct kevent *k;
+
+ rcu_read_lock();
+ if (ready_callback)
+ list_for_each_entry_rcu(k, &st->list, storage_entry)
+ (*ready_callback)(k);
+
+ list_for_each_entry_rcu(k, &st->list, storage_entry)
+ if (event & k->event.event)
+ __kevent_requeue(k, event);
+ rcu_read_unlock();
+}
+
+int kevent_storage_init(void *origin, struct kevent_storage *st)
+{
+ spin_lock_init(&st->lock);
+ st->origin = origin;
+ INIT_LIST_HEAD(&st->list);
+ return 0;
+}
+
+/*
+ * Mark all events as broken, that will remove them from storage,
+ * so the storage origin (inode, socket and so on) can be safely removed.
+ * No new entries are allowed to be added into the storage at this point.
+ * (Socket is removed from file table at this point for example).
+ */
+void kevent_storage_fini(struct kevent_storage *st)
+{
+ kevent_storage_ready(st, kevent_break, KEVENT_MASK_ALL);
+}
diff --git a/kernel/kevent/kevent_user.c b/kernel/kevent/kevent_user.c
new file mode 100644
index 0000000..f3fec9b
--- /dev/null
+++ b/kernel/kevent/kevent_user.c
@@ -0,0 +1,1004 @@
+/*
+ * 2006 Copyright (c) Evgeniy Polyakov <[email protected]>
+ * All rights reserved.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software
+ * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA
+ */
+
+#include <linux/kernel.h>
+#include <linux/module.h>
+#include <linux/types.h>
+#include <linux/list.h>
+#include <linux/slab.h>
+#include <linux/spinlock.h>
+#include <linux/fs.h>
+#include <linux/file.h>
+#include <linux/mount.h>
+#include <linux/device.h>
+#include <linux/poll.h>
+#include <linux/kevent.h>
+#include <linux/miscdevice.h>
+#include <asm/io.h>
+
+static const char kevent_name[] = "kevent";
+static kmem_cache_t *kevent_cache __read_mostly;
+
+/*
+ * Kevent descriptors are pollable: return POLLIN and POLLRDNORM
+ * when there is at least one ready kevent.
+ */
+static unsigned int kevent_user_poll(struct file *file, struct poll_table_struct *wait)
+{
+ struct kevent_user *u = file->private_data;
+ unsigned int mask;
+
+ poll_wait(file, &u->wait, wait);
+ mask = 0;
+
+ if (u->ready_num)
+ mask |= POLLIN | POLLRDNORM;
+
+ return mask;
+}
+
+/*
+ * Called under kevent_user->ready_lock, so updates are always protected.
+ */
+int kevent_user_ring_add_event(struct kevent *k)
+{
+ unsigned int pidx, off;
+ struct kevent_mring *ring, *copy_ring;
+
+ ring = k->user->pring[0];
+
+ if ((ring->kidx + 1 == ring->uidx) ||
+ ((ring->kidx + 1 == KEVENT_MAX_EVENTS) && ring->uidx == 0)) {
+ if (k->user->overflow_kevent == NULL)
+ k->user->overflow_kevent = k;
+ return -EAGAIN;
+ }
+
+ pidx = ring->kidx/KEVENTS_ON_PAGE;
+ off = ring->kidx%KEVENTS_ON_PAGE;
+
+ if (unlikely(pidx >= KEVENT_MAX_PAGES)) {
+		printk(KERN_ERR "%s: kidx: %u, uidx: %u, on_page: %lu, pidx: %u.\n",
+ __func__, ring->kidx, ring->uidx, KEVENTS_ON_PAGE, pidx);
+ return -EINVAL;
+ }
+
+ copy_ring = k->user->pring[pidx];
+
+ copy_ring->event[off].id.raw[0] = k->event.id.raw[0];
+ copy_ring->event[off].id.raw[1] = k->event.id.raw[1];
+ copy_ring->event[off].ret_flags = k->event.ret_flags;
+
+ if (++ring->kidx >= KEVENT_MAX_EVENTS)
+ ring->kidx = 0;
+
+ return 0;
+}
+
+/*
+ * Initialize mmap ring buffer.
+ * It will store ready kevents, so userspace can get them directly instead
+ * of using a syscall. Essentially the syscall becomes just a waiting point.
+ * @KEVENT_MAX_PAGES is an arbitrary number of pages to store ready events.
+ */
+static int kevent_user_ring_init(struct kevent_user *u)
+{
+ int i;
+
+ u->pring = kzalloc(KEVENT_MAX_PAGES * sizeof(struct kevent_mring *), GFP_KERNEL);
+ if (!u->pring)
+ return -ENOMEM;
+
+ for (i=0; i<KEVENT_MAX_PAGES; ++i) {
+ u->pring[i] = (struct kevent_mring *)__get_free_page(GFP_KERNEL);
+ if (!u->pring[i])
+ goto err_out_free;
+ }
+
+ u->pring[0]->uidx = u->pring[0]->kidx = 0;
+
+ return 0;
+
+err_out_free:
+ for (i=0; i<KEVENT_MAX_PAGES; ++i) {
+ if (!u->pring[i])
+ break;
+
+ free_page((unsigned long)u->pring[i]);
+ }
+
+ kfree(u->pring);
+
+ return -ENOMEM;
+}
+
+static void kevent_user_ring_fini(struct kevent_user *u)
+{
+ int i;
+
+ for (i=0; i<KEVENT_MAX_PAGES; ++i)
+ free_page((unsigned long)u->pring[i]);
+
+ kfree(u->pring);
+}
+
+static int kevent_user_open(struct inode *inode, struct file *file)
+{
+ struct kevent_user *u;
+
+ u = kzalloc(sizeof(struct kevent_user), GFP_KERNEL);
+ if (!u)
+ return -ENOMEM;
+
+ INIT_LIST_HEAD(&u->ready_list);
+ spin_lock_init(&u->ready_lock);
+ kevent_stat_init(u);
+ spin_lock_init(&u->kevent_lock);
+ u->kevent_root = RB_ROOT;
+
+ mutex_init(&u->ctl_mutex);
+ init_waitqueue_head(&u->wait);
+
+ atomic_set(&u->refcnt, 1);
+
+ if (unlikely(kevent_user_ring_init(u))) {
+ kfree(u);
+ return -ENOMEM;
+ }
+
+ file->private_data = u;
+ return 0;
+}
+
+/*
+ * Kevent userspace control block reference counting.
+ * Set to 1 at creation time; when the corresponding kevent file
+ * descriptor is closed, the reference counter is decreased.
+ * When the counter hits zero, the block is freed.
+ */
+static inline void kevent_user_get(struct kevent_user *u)
+{
+ atomic_inc(&u->refcnt);
+}
+
+static inline void kevent_user_put(struct kevent_user *u)
+{
+ if (atomic_dec_and_test(&u->refcnt)) {
+ kevent_stat_print(u);
+ kevent_user_ring_fini(u);
+ kfree(u);
+ }
+}
+
+/*
+ * Mmap implementation for ring buffer, which is created as array
+ * of pages, so vm_pgoff is an offset (in pages, not in bytes) of
+ * the first page to be mapped.
+ */
+static int kevent_user_mmap(struct file *file, struct vm_area_struct *vma)
+{
+	unsigned long start = vma->vm_start, off = vma->vm_pgoff;
+ struct kevent_user *u = file->private_data;
+
+ if (off >= KEVENT_MAX_PAGES)
+ return -EINVAL;
+
+ if (vma->vm_flags & VM_WRITE)
+ return -EPERM;
+
+ vma->vm_flags |= VM_RESERVED;
+ vma->vm_file = file;
+
+ if (vm_insert_page(vma, start, virt_to_page(u->pring[off])))
+ return -EFAULT;
+
+ return 0;
+}
+
+static inline int kevent_compare_id(struct kevent_id *left, struct kevent_id *right)
+{
+ if (left->raw_u64 > right->raw_u64)
+ return -1;
+
+ if (right->raw_u64 > left->raw_u64)
+ return 1;
+
+ return 0;
+}
+
+/*
+ * RCU protects storage list (kevent->storage_entry).
+ * Free entry in RCU callback, it is dequeued from all lists at
+ * this point.
+ */
+
+static void kevent_free_rcu(struct rcu_head *rcu)
+{
+ struct kevent *kevent = container_of(rcu, struct kevent, rcu_head);
+ kmem_cache_free(kevent_cache, kevent);
+}
+
+/*
+ * Must be called under u->ready_lock.
+ * This function removes kevent from ready queue and
+ * tries to add new kevent into ring buffer.
+ */
+static void kevent_remove_ready(struct kevent *k)
+{
+ struct kevent_user *u = k->user;
+
+ if (++u->pring[0]->uidx == KEVENT_MAX_EVENTS)
+ u->pring[0]->uidx = 0;
+
+ if (u->overflow_kevent) {
+ int err;
+
+ err = kevent_user_ring_add_event(u->overflow_kevent);
+ if (!err || u->overflow_kevent == k) {
+ if (u->overflow_kevent->ready_entry.next == &u->ready_list)
+ u->overflow_kevent = NULL;
+ else
+ u->overflow_kevent =
+ list_entry(u->overflow_kevent->ready_entry.next,
+ struct kevent, ready_entry);
+ }
+ }
+ list_del(&k->ready_entry);
+ k->flags &= ~KEVENT_READY;
+ u->ready_num--;
+}
+
+/*
+ * Complete kevent removal - dequeue the kevent from its storage list
+ * if requested, remove it from the ready list, drop the userspace
+ * control block reference and schedule kevent freeing through RCU.
+ */
+static void kevent_finish_user_complete(struct kevent *k, int deq)
+{
+ struct kevent_user *u = k->user;
+ unsigned long flags;
+
+ if (deq)
+ kevent_dequeue(k);
+
+ spin_lock_irqsave(&u->ready_lock, flags);
+ if (k->flags & KEVENT_READY)
+ kevent_remove_ready(k);
+ spin_unlock_irqrestore(&u->ready_lock, flags);
+
+ kevent_user_put(u);
+ call_rcu(&k->rcu_head, kevent_free_rcu);
+}
+
+/*
+ * Remove kevent from all lists and free it.
+ * Must be called under kevent_user->kevent_lock to protect
+ * kevent->kevent_node removal.
+ */
+static void __kevent_finish_user(struct kevent *k, int deq)
+{
+ struct kevent_user *u = k->user;
+
+ rb_erase(&k->kevent_node, &u->kevent_root);
+ k->flags &= ~KEVENT_USER;
+ u->kevent_num--;
+ kevent_finish_user_complete(k, deq);
+}
+
+/*
+ * Remove kevent from the user's list of all events, dequeue it from
+ * its storage and decrease the user's reference counter, since this
+ * kevent no longer exists. That is why it is freed here.
+ */
+static void kevent_finish_user(struct kevent *k, int deq)
+{
+ struct kevent_user *u = k->user;
+ unsigned long flags;
+
+ spin_lock_irqsave(&u->kevent_lock, flags);
+ rb_erase(&k->kevent_node, &u->kevent_root);
+ k->flags &= ~KEVENT_USER;
+ u->kevent_num--;
+ spin_unlock_irqrestore(&u->kevent_lock, flags);
+ kevent_finish_user_complete(k, deq);
+}
+
+/*
+ * Dequeue one entry from user's ready queue.
+ */
+static struct kevent *kqueue_dequeue_ready(struct kevent_user *u)
+{
+ unsigned long flags;
+ struct kevent *k = NULL;
+
+ spin_lock_irqsave(&u->ready_lock, flags);
+ if (u->ready_num && !list_empty(&u->ready_list)) {
+ k = list_entry(u->ready_list.next, struct kevent, ready_entry);
+ kevent_remove_ready(k);
+ }
+ spin_unlock_irqrestore(&u->ready_lock, flags);
+
+ return k;
+}
+
+/*
+ * Search the kevent tree for a kevent matching the given id.
+ */
+static struct kevent *__kevent_search(struct kevent_id *id, struct kevent_user *u)
+{
+ struct kevent *k, *ret = NULL;
+ struct rb_node *n = u->kevent_root.rb_node;
+ int cmp;
+
+ while (n) {
+ k = rb_entry(n, struct kevent, kevent_node);
+ cmp = kevent_compare_id(&k->event.id, id);
+
+ if (cmp > 0)
+ n = n->rb_right;
+ else if (cmp < 0)
+ n = n->rb_left;
+ else {
+ ret = k;
+ break;
+ }
+ }
+
+ return ret;
+}
+
+/*
+ * Search and modify kevent according to provided ukevent.
+ */
+static int kevent_modify(struct ukevent *uk, struct kevent_user *u)
+{
+ struct kevent *k;
+ int err = -ENODEV;
+ unsigned long flags;
+
+ spin_lock_irqsave(&u->kevent_lock, flags);
+ k = __kevent_search(&uk->id, u);
+ if (k) {
+ spin_lock(&k->ulock);
+ k->event.event = uk->event;
+ k->event.req_flags = uk->req_flags;
+ k->event.ret_flags = 0;
+ spin_unlock(&k->ulock);
+ kevent_requeue(k);
+ err = 0;
+ }
+ spin_unlock_irqrestore(&u->kevent_lock, flags);
+
+ return err;
+}
+
+/*
+ * Remove kevent which matches provided ukevent.
+ */
+static int kevent_remove(struct ukevent *uk, struct kevent_user *u)
+{
+ int err = -ENODEV;
+ struct kevent *k;
+ unsigned long flags;
+
+ spin_lock_irqsave(&u->kevent_lock, flags);
+ k = __kevent_search(&uk->id, u);
+ if (k) {
+ __kevent_finish_user(k, 1);
+ err = 0;
+ }
+ spin_unlock_irqrestore(&u->kevent_lock, flags);
+
+ return err;
+}
+
+/*
+ * Detaches the userspace control block from the file descriptor
+ * and decreases its reference counter.
+ * No new kevents can be added or removed from any list at this point.
+ */
+static int kevent_user_release(struct inode *inode, struct file *file)
+{
+ struct kevent_user *u = file->private_data;
+ struct kevent *k;
+ struct rb_node *n;
+
+ for (n = rb_first(&u->kevent_root); n; n = rb_next(n)) {
+ k = rb_entry(n, struct kevent, kevent_node);
+ kevent_finish_user(k, 1);
+ }
+
+ kevent_user_put(u);
+ file->private_data = NULL;
+
+ return 0;
+}
+
+/*
+ * Read requested number of ukevents in one shot.
+ */
+static struct ukevent *kevent_get_user(unsigned int num, void __user *arg)
+{
+ struct ukevent *ukev;
+
+ ukev = kmalloc(sizeof(struct ukevent) * num, GFP_KERNEL);
+ if (!ukev)
+ return NULL;
+
+ if (copy_from_user(ukev, arg, sizeof(struct ukevent) * num)) {
+ kfree(ukev);
+ return NULL;
+ }
+
+ return ukev;
+}
+
+/*
+ * Read all ukevents from userspace and modify the matching kevents.
+ * If the provided number of ukevents exceeds the threshold, it is faster
+ * to allocate room for them and copy them in one shot instead of copying
+ * them one-by-one and then processing them.
+ */
+static int kevent_user_ctl_modify(struct kevent_user *u, unsigned int num, void __user *arg)
+{
+ int err = 0, i;
+ struct ukevent uk;
+
+ mutex_lock(&u->ctl_mutex);
+
+ if (num > u->kevent_num) {
+ err = -EINVAL;
+ goto out;
+ }
+
+ if (num > KEVENT_MIN_BUFFS_ALLOC) {
+ struct ukevent *ukev;
+
+ ukev = kevent_get_user(num, arg);
+ if (ukev) {
+ for (i = 0; i < num; ++i) {
+ if (kevent_modify(&ukev[i], u))
+ ukev[i].ret_flags |= KEVENT_RET_BROKEN;
+ ukev[i].ret_flags |= KEVENT_RET_DONE;
+ }
+ if (copy_to_user(arg, ukev, num*sizeof(struct ukevent)))
+ err = -EFAULT;
+ kfree(ukev);
+ goto out;
+ }
+ }
+
+ for (i = 0; i < num; ++i) {
+ if (copy_from_user(&uk, arg, sizeof(struct ukevent))) {
+ err = -EFAULT;
+ break;
+ }
+
+ if (kevent_modify(&uk, u))
+ uk.ret_flags |= KEVENT_RET_BROKEN;
+ uk.ret_flags |= KEVENT_RET_DONE;
+
+ if (copy_to_user(arg, &uk, sizeof(struct ukevent))) {
+ err = -EFAULT;
+ break;
+ }
+
+ arg += sizeof(struct ukevent);
+ }
+out:
+ mutex_unlock(&u->ctl_mutex);
+
+ return err;
+}
+
+/*
+ * Read all ukevents from userspace and remove the corresponding kevents.
+ * If the provided number of ukevents exceeds the threshold, it is faster
+ * to allocate room for all of them and copy them in one shot instead of
+ * copying one-by-one and then processing them.
+ */
+static int kevent_user_ctl_remove(struct kevent_user *u, unsigned int num, void __user *arg)
+{
+ int err = 0, i;
+ struct ukevent uk;
+
+ mutex_lock(&u->ctl_mutex);
+
+ if (num > u->kevent_num) {
+ err = -EINVAL;
+ goto out;
+ }
+
+ if (num > KEVENT_MIN_BUFFS_ALLOC) {
+ struct ukevent *ukev;
+
+ ukev = kevent_get_user(num, arg);
+ if (ukev) {
+ for (i = 0; i < num; ++i) {
+ if (kevent_remove(&ukev[i], u))
+ ukev[i].ret_flags |= KEVENT_RET_BROKEN;
+ ukev[i].ret_flags |= KEVENT_RET_DONE;
+ }
+ if (copy_to_user(arg, ukev, num*sizeof(struct ukevent)))
+ err = -EFAULT;
+ kfree(ukev);
+ goto out;
+ }
+ }
+
+ for (i = 0; i < num; ++i) {
+ if (copy_from_user(&uk, arg, sizeof(struct ukevent))) {
+ err = -EFAULT;
+ break;
+ }
+
+ if (kevent_remove(&uk, u))
+ uk.ret_flags |= KEVENT_RET_BROKEN;
+
+ uk.ret_flags |= KEVENT_RET_DONE;
+
+ if (copy_to_user(arg, &uk, sizeof(struct ukevent))) {
+ err = -EFAULT;
+ break;
+ }
+
+ arg += sizeof(struct ukevent);
+ }
+out:
+ mutex_unlock(&u->ctl_mutex);
+
+ return err;
+}
+
+/*
+ * Queue kevent into the userspace control block and increase
+ * its reference counter.
+ */
+static int kevent_user_enqueue(struct kevent_user *u, struct kevent *new)
+{
+ unsigned long flags;
+ struct rb_node **p = &u->kevent_root.rb_node, *parent = NULL;
+ struct kevent *k;
+ int err = 0, cmp;
+
+ spin_lock_irqsave(&u->kevent_lock, flags);
+ while (*p) {
+ parent = *p;
+ k = rb_entry(parent, struct kevent, kevent_node);
+
+ cmp = kevent_compare_id(&k->event.id, &new->event.id);
+ if (cmp > 0)
+ p = &parent->rb_right;
+ else if (cmp < 0)
+ p = &parent->rb_left;
+ else {
+ err = -EEXIST;
+ break;
+ }
+ }
+ if (likely(!err)) {
+ rb_link_node(&new->kevent_node, parent, p);
+ rb_insert_color(&new->kevent_node, &u->kevent_root);
+ new->flags |= KEVENT_USER;
+ u->kevent_num++;
+ kevent_user_get(u);
+ }
+ spin_unlock_irqrestore(&u->kevent_lock, flags);
+
+ return err;
+}
+
+/*
+ * Add a kevent from both kernel and userspace users.
+ * This function allocates and queues the kevent; it returns a negative value
+ * on error, a positive value if the kevent is ready immediately and zero
+ * if the kevent has been queued.
+ */
+int kevent_user_add_ukevent(struct ukevent *uk, struct kevent_user *u)
+{
+ struct kevent *k;
+ int err;
+
+ k = kmem_cache_alloc(kevent_cache, GFP_KERNEL);
+ if (!k) {
+ err = -ENOMEM;
+ goto err_out_exit;
+ }
+
+ memcpy(&k->event, uk, sizeof(struct ukevent));
+ INIT_RCU_HEAD(&k->rcu_head);
+
+ k->event.ret_flags = 0;
+
+ err = kevent_init(k);
+ if (err) {
+ kmem_cache_free(kevent_cache, k);
+ goto err_out_exit;
+ }
+ k->user = u;
+ kevent_stat_total(u);
+ err = kevent_user_enqueue(u, k);
+ if (err) {
+ kmem_cache_free(kevent_cache, k);
+ goto err_out_exit;
+ }
+
+ err = kevent_enqueue(k);
+ if (err) {
+ memcpy(uk, &k->event, sizeof(struct ukevent));
+ kevent_finish_user(k, 0);
+ goto err_out_exit;
+ }
+
+ return 0;
+
+err_out_exit:
+ if (err < 0) {
+ uk->ret_flags |= KEVENT_RET_BROKEN | KEVENT_RET_DONE;
+ uk->ret_data[1] = err;
+ } else if (err > 0)
+ uk->ret_flags |= KEVENT_RET_DONE;
+ return err;
+}
+
+/*
+ * Copy all ukevents from userspace, allocate a kevent for each one
+ * and add them to the appropriate kevent_storages,
+ * e.g. sockets, inodes and so on...
+ * Ready events will replace the ones provided by the user and the
+ * number of ready events is returned.
+ * The user must check the ret_flags field of each ukevent structure
+ * to determine whether it is a fired or a failed event.
+ */
+static int kevent_user_ctl_add(struct kevent_user *u, unsigned int num, void __user *arg)
+{
+ int err, cerr = 0, rnum = 0, i;
+ void __user *orig = arg;
+ struct ukevent uk;
+
+ mutex_lock(&u->ctl_mutex);
+
+ err = -EINVAL;
+ if (num > KEVENT_MIN_BUFFS_ALLOC) {
+ struct ukevent *ukev;
+
+ ukev = kevent_get_user(num, arg);
+ if (ukev) {
+ for (i = 0; i < num; ++i) {
+ err = kevent_user_add_ukevent(&ukev[i], u);
+ if (err) {
+ kevent_stat_im(u);
+ if (i != rnum)
+ memcpy(&ukev[rnum], &ukev[i], sizeof(struct ukevent));
+ rnum++;
+ }
+ }
+ if (copy_to_user(orig, ukev, rnum*sizeof(struct ukevent)))
+ cerr = -EFAULT;
+ kfree(ukev);
+ goto out_setup;
+ }
+ }
+
+ for (i = 0; i < num; ++i) {
+ if (copy_from_user(&uk, arg, sizeof(struct ukevent))) {
+ cerr = -EFAULT;
+ break;
+ }
+ arg += sizeof(struct ukevent);
+
+ err = kevent_user_add_ukevent(&uk, u);
+ if (err) {
+ kevent_stat_im(u);
+ if (copy_to_user(orig, &uk, sizeof(struct ukevent))) {
+ cerr = -EFAULT;
+ break;
+ }
+ orig += sizeof(struct ukevent);
+ rnum++;
+ }
+ }
+
+out_setup:
+ if (cerr < 0) {
+ err = cerr;
+ goto out_remove;
+ }
+
+ err = rnum;
+out_remove:
+ mutex_unlock(&u->ctl_mutex);
+
+ return err;
+}
+
+/*
+ * In nonblocking mode it returns as many events as possible, but not more than @max_nr.
+ * In blocking mode it waits until the timeout expires or at least @min_nr events are ready.
+ */
+static int kevent_user_wait(struct file *file, struct kevent_user *u,
+ unsigned int min_nr, unsigned int max_nr, __u64 timeout,
+ void __user *buf)
+{
+ struct kevent *k;
+ int num = 0;
+
+ if (!(file->f_flags & O_NONBLOCK)) {
+ wait_event_interruptible_timeout(u->wait,
+ u->ready_num >= min_nr,
+ clock_t_to_jiffies(nsec_to_clock_t(timeout)));
+ }
+
+ while (num < max_nr && ((k = kqueue_dequeue_ready(u)) != NULL)) {
+ if (copy_to_user(buf + num*sizeof(struct ukevent),
+ &k->event, sizeof(struct ukevent)))
+ break;
+
+ /*
+ * If it is a one-shot kevent, it has already been removed from
+ * the origin's queue, so we can safely free it here.
+ */
+ if (k->event.req_flags & KEVENT_REQ_ONESHOT)
+ kevent_finish_user(k, 1);
+ ++num;
+ kevent_stat_wait(u);
+ }
+
+ return num;
+}
+
+static struct file_operations kevent_user_fops = {
+ .mmap = kevent_user_mmap,
+ .open = kevent_user_open,
+ .release = kevent_user_release,
+ .poll = kevent_user_poll,
+ .owner = THIS_MODULE,
+};
+
+static struct miscdevice kevent_miscdev = {
+ .minor = MISC_DYNAMIC_MINOR,
+ .name = kevent_name,
+ .fops = &kevent_user_fops,
+};
+
+static int kevent_ctl_process(struct file *file, unsigned int cmd, unsigned int num, void __user *arg)
+{
+ int err;
+ struct kevent_user *u = file->private_data;
+
+ switch (cmd) {
+ case KEVENT_CTL_ADD:
+ err = kevent_user_ctl_add(u, num, arg);
+ break;
+ case KEVENT_CTL_REMOVE:
+ err = kevent_user_ctl_remove(u, num, arg);
+ break;
+ case KEVENT_CTL_MODIFY:
+ err = kevent_user_ctl_modify(u, num, arg);
+ break;
+ default:
+ err = -EINVAL;
+ break;
+ }
+
+ return err;
+}
+
+/*
+ * Used to get ready kevents from the queue.
+ * @ctl_fd - kevent control descriptor which must be obtained through kevent_ctl(KEVENT_CTL_INIT).
+ * @min_nr - minimum number of ready kevents.
+ * @max_nr - maximum number of ready kevents.
+ * @timeout - timeout in nanoseconds to wait until some events are ready.
+ * @buf - buffer to place ready events.
+ * @flags - unused for now (will be used for mmap implementation).
+ */
+asmlinkage long sys_kevent_get_events(int ctl_fd, unsigned int min_nr, unsigned int max_nr,
+ __u64 timeout, struct ukevent __user *buf, unsigned flags)
+{
+ int err = -EINVAL;
+ struct file *file;
+ struct kevent_user *u;
+
+ file = fget(ctl_fd);
+ if (!file)
+ return -EBADF;
+
+ if (file->f_op != &kevent_user_fops)
+ goto out_fput;
+ u = file->private_data;
+
+ err = kevent_user_wait(file, u, min_nr, max_nr, timeout, buf);
+out_fput:
+ fput(file);
+ return err;
+}
+
+/*
+ * This syscall waits until there is free space in the kevent queue
+ * and removes/requeues the requested number of events (commits them).
+ * It returns the number of actually committed events.
+ *
+ * @ctl_fd - kevent file descriptor.
+ * @start - number of first ready event.
+ * @num - number of processed kevents.
+ * @timeout - this timeout specifies number of nanoseconds to wait until there is
+ * free space in kevent queue.
+ *
+ * Ring buffer is designed in a way that first ready kevent will be at @ring->uidx
+ * position, and all other ready events will be in FIFO order after it.
+ * So when we need to commit @num events, it means we should just remove first @num
+ * kevents from ready queue and commit them. We do not use any special locking to
+ * protect this function against simultaneous running - kevent dequeueing is atomic,
+ * and we do not care about order in which events were committed.
+ * An example: thread 1 and thread 2 simultaneously call kevent_wait() to
+ * commit 2 and 3 events. It is possible that first thread will commit
+ * events 0 and 2 while second thread will commit events 1, 3 and 4.
+ * If there were only 3 ready events, then one of the calls will return a smaller
+ * number of committed events than was requested.
+ * ring->uidx update is atomic, since it is protected by u->ready_lock,
+ * which removes race with kevent_user_ring_add_event().
+ *
+ * If the user asks to commit events which have been removed by kevent_get_events() recently
+ * (for example when one thread looked into the ring indexes and started to commit events
+ * which were simultaneously committed by another thread through kevent_get_events()),
+ * kevent_wait() will not commit the unprocessed events, but will return the number of
+ * actually committed events instead.
+ *
+ * It is forbidden to try to commit events not from the start of the buffer, but from
+ * some 'further' event.
+ *
+ * An example: if ready events occupy positions 2-5,
+ * it is permitted to start committing 3 events from position 0;
+ * in this case positions 0 and 1 will be omitted, only the event in position 2 will
+ * be committed, and kevent_wait() will return 1, since only one event was actually committed.
+ * It is forbidden to try to commit from position 4; 0 will be returned.
+ * This means that if some events were committed using kevent_get_events(),
+ * they will not be counted; instead userspace should check the ring index and try to commit again.
+ */
+asmlinkage long sys_kevent_wait(int ctl_fd, unsigned int start, unsigned int num, __u64 timeout)
+{
+ int err = -EINVAL, committed = 0;
+ struct file *file;
+ struct kevent_user *u;
+ struct kevent *k;
+ struct kevent_mring *ring;
+ unsigned int i, actual;
+ unsigned long flags;
+
+ if (num >= KEVENT_MAX_EVENTS)
+ return -EINVAL;
+
+ file = fget(ctl_fd);
+ if (!file)
+ return -EBADF;
+
+ if (file->f_op != &kevent_user_fops)
+ goto out_fput;
+ u = file->private_data;
+
+ ring = u->pring[0];
+
+ spin_lock_irqsave(&u->ready_lock, flags);
+ actual = (ring->kidx > ring->uidx) ?
+ (ring->kidx - ring->uidx) :
+ (KEVENT_MAX_EVENTS - (ring->uidx - ring->kidx));
+
+ if (actual < num)
+ num = actual;
+
+ if (start < ring->uidx) {
+ /*
+ * Some events have been committed through kevent_get_events().
+ * ready events
+ * |==========|RRRRRRRRRRRRRRRRRRRRRRRRRRRRRRR|==========|
+ * ring->uidx ring->kidx
+ * | |
+ * start start+num
+ *
+ */
+ unsigned int diff = ring->uidx - start;
+
+ if (num < diff)
+ num = 0;
+ else
+ num -= diff;
+ } else if (start > ring->uidx)
+ num = 0;
+
+ spin_unlock_irqrestore(&u->ready_lock, flags);
+
+ for (i = 0; i < num; ++i) {
+ k = kqueue_dequeue_ready(u);
+ if (!k)
+ break;
+
+ if (k->event.req_flags & KEVENT_REQ_ONESHOT)
+ kevent_finish_user(k, 1);
+ kevent_stat_mmap(u);
+ committed++;
+ }
+
+ if (!(file->f_flags & O_NONBLOCK)) {
+ wait_event_interruptible_timeout(u->wait,
+ u->ready_num >= 1,
+ clock_t_to_jiffies(nsec_to_clock_t(timeout)));
+ }
+
+ fput(file);
+
+ return committed;
+out_fput:
+ fput(file);
+ return err;
+}
+
+/*
+ * This syscall is used to perform various control operations
+ * on given kevent queue, which is obtained through kevent file descriptor @fd.
+ * @cmd - type of operation.
+ * @num - number of kevents to be processed.
+ * @arg - pointer to array of struct ukevent.
+ */
+asmlinkage long sys_kevent_ctl(int fd, unsigned int cmd, unsigned int num, struct ukevent __user *arg)
+{
+ int err = -EINVAL;
+ struct file *file;
+
+ file = fget(fd);
+ if (!file)
+ return -EBADF;
+
+ if (file->f_op != &kevent_user_fops)
+ goto out_fput;
+
+ err = kevent_ctl_process(file, cmd, num, arg);
+
+out_fput:
+ fput(file);
+ return err;
+}
+
+/*
+ * Kevent subsystem initialization - create the kevent cache and register
+ * the misc device that control file descriptors are obtained from.
+ */
+static int __init kevent_user_init(void)
+{
+ int err = 0;
+
+ kevent_cache = kmem_cache_create("kevent_cache",
+ sizeof(struct kevent), 0, SLAB_PANIC, NULL, NULL);
+
+ err = misc_register(&kevent_miscdev);
+ if (err) {
+ printk(KERN_ERR "Failed to register kevent miscdev: err=%d.\n", err);
+ goto err_out_exit;
+ }
+
+ printk(KERN_INFO "KEVENT subsystem has been successfully registered.\n");
+
+ return 0;
+
+err_out_exit:
+ kmem_cache_destroy(kevent_cache);
+ return err;
+}
+
+module_init(kevent_user_init);
diff --git a/kernel/sys_ni.c b/kernel/sys_ni.c
index 7a3b2e7..bc0582b 100644
--- a/kernel/sys_ni.c
+++ b/kernel/sys_ni.c
@@ -122,6 +122,10 @@ cond_syscall(ppc_rtas);
cond_syscall(sys_spu_run);
cond_syscall(sys_spu_create);
+cond_syscall(sys_kevent_get_events);
+cond_syscall(sys_kevent_wait);
+cond_syscall(sys_kevent_ctl);
+
/* mmu depending weak syscall entries */
cond_syscall(sys_mprotect);
cond_syscall(sys_msync);
Hi!
> Generic event handling mechanism.
>
> Consider for inclusion.
>
> Changes from 'take21' patchset:
We are not interested in how many times you spammed us, nor do we want
to know what was wrong in previous versions. It would be nice to have
a short summary of what this is good for, instead.
Pavel
--
Thanks, Sharp!
On Wed, Nov 01, 2006 at 02:06:14PM +0100, Pavel Machek ([email protected]) wrote:
> Hi!
>
> > Generic event handling mechanism.
> >
> > Consider for inclusion.
> >
> > Changes from 'take21' patchset:
>
> We are not interrested in how many times you spammed us, nor we want
> to know what was wrong in previous versions. It would be nice to have
> short summary of what this is good for, instead.
Let me guess, a short explanation in subsequent emails is not enough...
If the changelog is removed, then how will people find out what happened
since the previous release?
Kevent is a generic subsystem for handling event notifications.
It supports both level- and edge-triggered events. It is similar to
poll/epoll in some cases, but it is more scalable, it is faster and
it works with essentially any kind of event.
Events are submitted to the kernel through a control syscall and can be read
back through an mmapped ring or a syscall.
Kevent updates (i.e. readiness switching) happen directly from the internals
of the appropriate state machine of the underlying subsystem (network,
filesystem, timer or any other).
I will put that text into the introduction message.
> Pavel
> --
> Thanks, Sharp!
--
Evgeniy Polyakov
Hi!
> > > Generic event handling mechanism.
> > >
> > > Consider for inclusion.
> > >
> > > Changes from 'take21' patchset:
> >
> > We are not interrested in how many times you spammed us, nor we want
> > to know what was wrong in previous versions. It would be nice to have
> > short summary of what this is good for, instead.
>
> Let me guess, short explaination in subsequent emails is not
> enough...
Yes.
> Kevent is a generic subsytem which allows to handle event notifications.
> It supports both level and edge triggered events. It is similar to
> poll/epoll in some cases, but it is more scalable, it is faster and
> allows to work with essentially eny kind of events.
Quantifying "how much more scalable" would be nice, as would be some
example where it is useful. ("It makes my webserver twice as fast on
monster 64-cpu box").
> Events are provided into kernel through control syscall and can be read
> back through mmaped ring or syscall.
> Kevent update (i.e. readiness switching) happens directly from internals
> of the appropriate state machine of the underlying subsytem (like
> network, filesystem, timer or any other).
>
> I will put that text into introduction message.
Thanks.
Pavel
--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html
On Wed, 1 Nov 2006, Pavel Machek wrote:
> Hi!
>
> > Generic event handling mechanism.
> >
> > Consider for inclusion.
> >
> > Changes from 'take21' patchset:
>
> We are not interrested in how many times you spammed us, nor we want
> to know what was wrong in previous versions. It would be nice to have
> short summary of what this is good for, instead.
I'm interested in knowing which version the patches belong to and what has
changed (geez, it's rare enough that someone actually bothers to do this
with an updated patchset, and to complain about it?)
- James
--
James Morris
<[email protected]>
On Wed, Nov 01, 2006 at 05:05:51PM +0100, Pavel Machek ([email protected]) wrote:
> Hi!
Hi Pavel.
> > Kevent is a generic subsytem which allows to handle event notifications.
> > It supports both level and edge triggered events. It is similar to
> > poll/epoll in some cases, but it is more scalable, it is faster and
> > allows to work with essentially eny kind of events.
>
> Quantifying "how much more scalable" would be nice, as would be some
> example where it is useful. ("It makes my webserver twice as fast on
> monster 64-cpu box").
A trivial kevent web-server can handle 3960+ req/sec on a 2.4GHz Xeon with
1GB of RAM; an epoll-based one handles 2200-2500 req/sec.
The 100 Mbit wire is filled almost 100% (10582.7 KB/s of data, not counting
TCP and lower-layer headers).
More benchmarks created by me and Johann Borck can be found on the project's
homepage, along with all the sources used in the tests.
--
Evgeniy Polyakov
Hallo, Evgeniy Polyakov.
On 2006-11-01, you wrote:
[]
>> Quantifying "how much more scalable" would be nice, as would be some
>> example where it is useful. ("It makes my webserver twice as fast on
>> monster 64-cpu box").
>
> Trivial kevent web-server can handle 3960+ req/sec on Xeon 2.4Ghz with
[...]
Seriously. I'm seeing these patches too. New, shiny, always ready "for
inclusion". But considering that the kernel (Linux in this case) is not a
thing unto itself, I want to ask the following question.
Where's the real-life application to configure && make && make install?
There were some comments about the lack of such programs; the answers were
"was in prev. e-mail", "need to update them", something like that.
The "trivial web server" source URL mentioned in the benchmark isn't given
in the patch advertisement. If it was, should I actually try that new
*trivial* wheel?
Having said that, I want to give you some short examples I know of.
*Linux kernel <-> userspace*:
o Alexey Kuznetsov networking <-> (excellent) iproute set of utilities;
o Maxim Krasnyansky tun net driver <-> vtun daemon application;
*Glibc with mister Drepper* has a huge set of tests; please search for
`tst*' files in the sources.
To give you a little hint, Evgeniy: why don't you find a little
animal in the open source zoo, implement a little interface to the
proposed kernel subsystem, and then show it to The Big Jury (not me)
we have here? And I cannot see how you've managed to implement
something like that with almost nothing in the test basket.
Very *suspicious* ch.
One that comes to mind is lighttpd <http://www.lighttpd.net/>.
It had a sub-interface for event systems like select, poll and epoll
when I last checked its sources. And it is mature, btw.
Cheers.
[ -*- OT -*- ]
[ I wouldn't write all this, unless saw your opinion about the ]
[ reportbug (part of the Debian Bug Tracking System) this week. ]
[ While i'm nobody here, imho, the first thing about good programmer ]
[ must be, that he is excellent user. ]
____
On Wed, Nov 01, 2006 at 06:20:43PM +0000, Oleg Verych ([email protected]) wrote:
>
> Hallo, Evgeniy Polyakov.
Hello, Oleg.
> On 2006-11-01, you wrote:
> []
> >> Quantifying "how much more scalable" would be nice, as would be some
> >> example where it is useful. ("It makes my webserver twice as fast on
> >> monster 64-cpu box").
> >
> > Trivial kevent web-server can handle 3960+ req/sec on Xeon 2.4Ghz with
> [...]
>
> Seriously. I'm seeing that patches also. New, shiny, always ready "for
> inclusion". But considering kernel (linux in this case) as not thing
> for itself, i want to ask following question.
>
> Where's real-life application to do configure && make && make install?
Your real life, or mine as a developer?
I fortunately do not know anything about your real life, but my real-life
applications can be found on the project's homepage.
There is a link to an archive there, where you can find plenty of sources.
You likely do not know, but it is a rather risky business to patch all
existing applications to show that an approach is correct while the
implementation is not complete.
You likely do not know, but since I first announced kevents in February
I have changed the interfaces 4 times - and that is just the interfaces,
not counting the numerous features added/removed at developers' requests.
> There were some comments about laking much of such programs, answers were
> "was in prev. e-mail", "need to update them", something like that.
> "Trivial web server" sources url, mentioned in benchmark isn't pointed
> in patch advertisement. If it was, should i actually try that new
> *trivial* wheel?
The answer is trivial - there is an archive where one can find the source
code (filenames are posted regularly). Should I create an rpm? For which
glibc version?
> Saying that, i want to give you some short examples, i know.
> *Linux kernel <-> userspace*:
> o Alexey Kuznetsov networking <-> (excellent) iproute set of utilities;
iproute documentation was way too bad when Alexey first presented
it :)
> o Maxim Krasnyansky tun net driver <-> vtun daemon application;
>
> *Glibc with mister Drepper* has huge set of tests, please search for
> `tst*' files in the sources.
Btw, show me a 'shiny' splice() application. Does lighttpd use it?
Or move_pages()?
> To make a little hint to you, Evgeniy, why don't you find a little
> animal in the open source zoo to implement little interface to
> proposed kernel subsystem and then show it to The Big Jury (not me),
> we have here? And i can not see, how you've managed to implement
> something like that having almost nothing on the test basket.
> Very *suspicious* ch.
There are always people who do not like something; what can I do about
it? I present the code, we discuss it, I ask for inclusion (since it is
the only way to get feedback), something requires changes, it is changed,
and so on - that is the development process.
I created a 'little animal in the open source zoo' myself to show how
simple kevents are.
> One, that comes in mind is lighthttpd <http://www.lighttpd.net/>.
> It had sub-interface for event systems like select,poll,epoll, when i
> checked its sources last time. And it is mature, btw.
As I have already said several times, I have changed just the interfaces
4 times already, since no one seems to know what we really want and how
the interface should look. You suggest patching lighttpd? Well, it is
doable, but then I will be asked to change apache and nginx. And then
someone will suggest changing the order of parameters. Will you help me
rewrite userspace? No, you will not. You ask for something without
providing anything back (not just code, but discussion, ideas, testing
time - nothing), and you do it in an ultimate manner.
Btw, kevent also supports AIO notifications - do you suggest patching a
reactor/proactor for tests?
It supports network AIO - do you suggest writing support for that into
apache?
What about timers? It is possible to rewrite all POSIX timer users to
use them instead.
There is a feature request for userspace events and signal delivery - what
to do with that?
I created trivial web servers which send a single static page and use
various event handling schemes, and I test the new subsystem with new
tools; when the tests are completed and all requested features are
implemented, it will be time to work on different, more complex users.
So let's at least complete what we have right now, so that no developer's
effort is wasted writing empty chars in various places.
> Cheers.
>
> [ -*- OT -*- ]
> [ I wouldn't write all this, unless saw your opinion about the ]
> [ reportbug (part of the Debian Bug Tracking System) this week. ]
> [ While i'm nobody here, imho, the first thing about good programmer ]
> [ must be, that he is excellent user. ]
> ____
--
Evgeniy Polyakov
On 11/1/06, Evgeniy Polyakov <[email protected]> wrote:
> On Wed, Nov 01, 2006 at 06:20:43PM +0000, Oleg Verych ([email protected]) wrote:
> [...]
> > One, that comes in mind is lighthttpd <http://www.lighttpd.net/>.
> > It had sub-interface for event systems like select,poll,epoll, when i
> > checked its sources last time. And it is mature, btw.
>
> As I already told several times, I changed only interfaces 4 times
> already, since no one seems to know what we really want and how
> interface should look like.
Indecisiveness has certainly been an issue here, but I remember akpm
and Ulrich both giving concrete suggestions. I was particularly
interested in Andrew's request to explain and justify the differences
between kevent and BSD's kqueue interface. Was there a discussion
that I missed? I am very interested to see your work on this
mechanism merged, because you've clearly emphasized performance and
shown impressive results. But it seems like we lose out on a lot by
throwing out all the applications that already use kqueue.
NATE
The performance is great, and we are excited by the result.
I want to know why there can be so much improvement - can we improve epoll too?
As for kevent, I am more interested in a universal event mechanism.
With one interface we can wait for timer events, socket events, and disk AIO
events; this can make it easier for userland applications to handle multiple events.
2006/11/2, Nate Diller <[email protected]>:
> On 11/1/06, Evgeniy Polyakov <[email protected]> wrote:
> > On Wed, Nov 01, 2006 at 06:20:43PM +0000, Oleg Verych ([email protected]) wrote:
> > >
> > > Hallo, Evgeniy Polyakov.
> >
> > Hello, Oleg.
> >
> > > On 2006-11-01, you wrote:
> > > []
> > > >> Quantifying "how much more scalable" would be nice, as would be some
> > > >> example where it is useful. ("It makes my webserver twice as fast on
> > > >> monster 64-cpu box").
> > > >
> > > > Trivial kevent web-server can handle 3960+ req/sec on Xeon 2.4Ghz with
> > > [...]
> > >
> > > Seriously. I'm seeing that patches also. New, shiny, always ready "for
> > > inclusion". But considering kernel (linux in this case) as not thing
> > > for itself, i want to ask following question.
> > >
> > > Where's real-life application to do configure && make && make install?
> >
> > Your real life or mine as developer?
> > I fortunately do not know anything about your real life, but my real life
> > applications can be found on project's homepage.
> > There is a link to archive there, where you can find plenty of sources.
> > You likely do not know, but it is a bit risky business to patch all
> > existing applications to show that approach is correct, if
> > implementation is not completed.
> > You likely do not know, but after I first time announced kevents in
> > February I changed interfaces 4 times - and it is just interfaces, not
> > including numerous features added/removed by developer's requests.
> >
> > > There were some comments about laking much of such programs, answers were
> > > "was in prev. e-mail", "need to update them", something like that.
> > > "Trivial web server" sources url, mentioned in benchmark isn't pointed
> > > in patch advertisement. If it was, should i actually try that new
> > > *trivial* wheel?
> >
> > Answer is trivial - there is archive where one can find a source code
> > (filenames are posted regulary). Should I create a rpm? For what glibc
> > version?
> >
> > > Saying that, i want to give you some short examples, i know.
> > > *Linux kernel <-> userspace*:
> > > o Alexey Kuznetsov networking <-> (excellent) iproute set of utilities;
> >
> > iproute documentation was way too bad when Alexey presented it first
> > time :)
> >
> > > o Maxim Krasnyansky tun net driver <-> vtun daemon application;
> > >
> > > *Glibc with mister Drepper* has huge set of tests, please search for
> > > `tst*' files in the sources.
> >
> > Btw, show me splice() 'shiny' application? Does lighttpd use it?
> > Or move_pages().
> >
> > > To make a little hint to you, Evgeniy, why don't you find a little
> > > animal in the open source zoo to implement little interface to
> > > proposed kernel subsystem and then show it to The Big Jury (not me),
> > > we have here? And i can not see, how you've managed to implement
> > > something like that having almost nothing on the test basket.
> > > Very *suspicious* ch.
> >
> > There are always people who do not like something, what can I do with
> > it? I present the code, we discuss it, I ask for inclusion (since it is
> > the only way to get feedback), something requires changes, it is changed
> > and so on - it is development process.
> > I created 'little animal in the open source zoo' by myself to show how
> > simple kevents are.
> >
> > > One, that comes in mind is lighthttpd <http://www.lighttpd.net/>.
> > > It had sub-interface for event systems like select,poll,epoll, when i
> > > checked its sources last time. And it is mature, btw.
> >
> > As I already told several times, I changed only interfaces 4 times
> > already, since no one seems to know what we really want and how
> > interface should look like.
>
> > Indecisiveness has certainly been an issue here, but I remember akpm
> and Ulrich both giving concrete suggestions. I was particularly
> interested in Andrew's request to explain and justify the differences
> between kevent and BSD's kqueue interface. Was there a discussion
> that I missed? I am very interested to see your work on this
> mechanism merged, because you've clearly emphasized performance and
> shown impressive results. But it seems like we lose out on a lot by
> throwing out all the applications that already use kqueue.
>
> NATE
On Wed, Nov 01, 2006 at 06:12:41PM -0800, Nate Diller ([email protected]) wrote:
> Indesiciveness has certainly been an issue here, but I remember akpm
> and Ulrich both giving concrete suggestions. I was particularly
> interested in Andrew's request to explain and justify the differences
> between kevent and BSD's kqueue interface. Was there a discussion
> that I missed? I am very interested to see your work on this
> mechanism merged, because you've clearly emphasized performance and
> shown impressive results. But it seems like we lose out on a lot by
> throwing out all the applications that already use kqueue.
It looks like you missed that discussion - FreeBSD's kqueue has fields in the
kevent structure which have different sizes in 32-bit and 64-bit environments.
> NATE
--
Evgeniy Polyakov
/*
* How to stress epoll
*
* This program uses many pipes and two threads.
* First we open as many pipes we can. (see ulimit -n)
* Then we create a worker thread.
* The worker thread will send bytes to random pipes.
* The main thread uses epoll to collect ready pipes and read them.
* Each second, a number of collected bytes is printed on stderr
*
* Usage : epoll_bench [-n X]
*/
#include <pthread.h>
#include <stdlib.h>
#include <errno.h>
#include <stdio.h>
#include <stdint.h>
#include <string.h>
#include <sys/epoll.h>
#include <signal.h>
#include <unistd.h>
#include <sys/time.h>

int nbpipes = 1024;

struct pipefd {
	int fd[2];
} *tab;

int epoll_fd;

static int alloc_pipes(void)
{
	int i;

	epoll_fd = epoll_create(nbpipes);
	if (epoll_fd == -1) {
		perror("epoll_create");
		return -1;
	}
	tab = malloc(sizeof(struct pipefd) * nbpipes);
	if (tab == NULL) {
		perror("malloc");
		return -1;
	}
	for (i = 0; i < nbpipes; i++) {
		struct epoll_event ev;

		if (pipe(tab[i].fd) == -1)
			break;
		ev.events = EPOLLIN | EPOLLOUT | EPOLLHUP | EPOLLPRI | EPOLLET;
		ev.data.u64 = (uint64_t)i;
		epoll_ctl(epoll_fd, EPOLL_CTL_ADD, tab[i].fd[0], &ev);
	}
	nbpipes = i;
	printf("%d pipes setup\n", nbpipes);
	return 0;
}

unsigned long nbhandled;

static void timer_func(int sig)
{
	char buffer[32];
	size_t len;
	static unsigned long old;
	unsigned long delta = nbhandled - old;

	(void)sig;
	old = nbhandled;
	len = sprintf(buffer, "%lu\n", delta);
	write(2, buffer, len);
}

static void timer_setup(void)
{
	struct itimerval it;
	struct sigaction sg;

	memset(&sg, 0, sizeof(sg));
	sg.sa_handler = timer_func;
	sigaction(SIGALRM, &sg, 0);
	it.it_interval.tv_sec = 1;
	it.it_interval.tv_usec = 0;
	it.it_value.tv_sec = 1;
	it.it_value.tv_usec = 0;
	if (setitimer(ITIMER_REAL, &it, 0))
		perror("setitimer");
}

static void *worker_thread_func(void *arg)
{
	int fd;
	char c = 1;

	(void)arg;
	for (;;) {
		fd = rand() % nbpipes;
		write(tab[fd].fd[1], &c, 1);
	}
}

int main(int argc, char *argv[])
{
	char buff[1024];
	pthread_t tid;
	int c;

	while ((c = getopt(argc, argv, "n:")) != EOF) {
		if (c == 'n')
			nbpipes = atoi(optarg);
	}
	alloc_pipes();
	pthread_create(&tid, NULL, worker_thread_func, (void *)0);
	timer_setup();

	for (;;) {
		struct epoll_event events[128];
		int nb = epoll_wait(epoll_fd, events, 128, 10000);
		int i, fd;

		for (i = 0; i < nb; i++) {
			fd = tab[events[i].data.u64].fd[0];
			if (read(fd, buff, 1024) > 0)
				nbhandled++;
		}
	}
}
On Thu, Nov 02, 2006 at 08:46:41AM +0100, Eric Dumazet ([email protected]) wrote:
> zhou drangon wrote:
> >performance is great, and we are exciting at the result.
> >
> >I want to know why there can be so much improvement, can we improve
> >epoll too ?
>
> Why did you remove most of CC addresses but lkml ?
> Dont do that please...
Sure, since for example I'm not subscribed (fortunately) to lkml, and I
think you want me to answer the question too...
> Good question :)
>
> Hum, I think I can look into epoll and see how it can be improved (if
> necessary)
epoll cannot be improved much, since the whole polling machinery is designed to
have several layers of dereferencing; kevent simplifies that chain noticeably.
> This is not to say we dont need kevent ! Please Evgeniy continue your work !
I will :)
> Just to remind you that according to
> http://www.xmailserver.org/linux-patches/nio-improve.html David Libenzi had
> to wait 18 months before epoll being officialy added into kernel.
kevent has existed for about 10 months. We have plenty of time :)
> At that time, many applications were using epoll, and we were patching our
> kernels for that.
>
>
> I cooked a very simple program (attached in this mail), using pipes and
> epoll, and got 250.000 events received per second on an otherwise lightly
> loaded machine (dual opteron 246 , 2GHz, 1MB cache per cpu) with 10.000
> pipes (20.000 handles)
pipes will work with kevent's poll mechanism only, so there will not be
any performance gain at all, since it is essentially the same as the epoll
design with waiting and rescheduling (all my measurements of
epoll vs. kevent_poll always showed the same rates); pipes require the same
notifications as sockets for maximum performance.
I've put it into my todo list.
> It could be nice to add support for other event providers in this program
> (AF_INET & AF_UNIX sockets for example), and also add support for kevent,
> so that we really can compare epoll/kevent without a complex setup.
> I should extend the program to also add/remove sources during lifetime, not
> only insert at setup time.
If there would exist sockets support, then I could patch it to work with
kevents.
--
Evgeniy Polyakov
Evgeniy Polyakov wrote:
> pipes will work with kevent's poll mechanisms only, so there will not be
> any performance gain at all since it is essentially the same as epoll
> design with waiting and rescheduling (all my measurements with
> epoll vs. kevent_poll always showed the same rates), pipes require the same
> notifications as sockets for maximum perfomance.
> I've put it into todo list.
Evgeniy, I think this part is *important*. I think most readers of lkml are not
aware of the exact mechanisms used in epoll, kevent poll, and 'kevent'.
I don't understand why epoll is bad for you, since for me, ep_poll_callback()
is fast enough, even if we could make it touch fewer cache lines by reordering
'struct epitem' correctly. My epoll_pipe_bench doesn't change the rescheduling
rate of the test machine.
Could you add to your home page some documentation that clearly shows the path
taken by those 3 mechanisms for different event sources (at least sockets)?
Eric
On Thu, Nov 02, 2006 at 09:18:55AM +0100, Eric Dumazet ([email protected]) wrote:
> Evgeniy Polyakov wrote:
> >pipes will work with kevent's poll mechanisms only, so there will not be
> >any performance gain at all since it is essentially the same as epoll
> >design with waiting and rescheduling (all my measurements with
> >epoll vs. kevent_poll always showed the same rates), pipes require the same
> >notifications as sockets for maximum perfomance.
> >I've put it into todo list.
>
> Evgeniy I think this part is *important*. I think most readers of lkml are
> not aware of exact mechanisms used in epoll, kevent poll, and 'kevent'
>
> I dont understand why epoll is bad for you, since for me,
> ep_poll_callback() is fast enough, even if we can make it touch less cache
> lines if reoredering 'struct epitem' correctly. My epoll_pipe_bench doesnt
> change the rescheduling rate of the test machine.
>
> Could you in your home page add some doc that clearly show the path taken
> for those 3 mechanisms and different events sources (At least sockets)
It is.
"It [kevent] supports socket notifications (accept, sending and receiving),
network AIO (aio_send(), aio_recv() and aio_sendfile()), inode
notifications (create/remove), generic poll()/select() notifications and
timer notifications."
In each patch I give a short description, and the socket notification patch.
By poll design, we have to set up the following data:
a poll_table_struct, which contains a callback;
that callback will be called in each
sys_poll()->drivers_poll()->poll_wait();
the callback will allocate a new private structure, which must contain a
wait_queue_t (its callback will be invoked each time wake_up() is
called for the given wait_queue_head) and which is then linked to the given
wait_queue_head.
Kevent has a different approach: so-called origins (files, inodes,
sockets and so on) have queues of userspace requests; for example, a
socket origin can only have a queue which will contain one of the
following events ($type.$event): socket.send, socket.recv,
socket.accept. So when new data arrives, the appropriate event is marked
as ready and moved into the ready queue (a very short operation), and the
requesting thread is awakened; it can then get the ready events and
requeue them back (or remove them, depending on flags). There are no
allocations in kevent_get_events() (epoll_wait() does not have any either),
no potentially long lists of wait_queue entries linked to the same
wait_queue_head_t which are traversed each time we call wake_up(),
and it has a much smaller memory footprint compared to epoll (there is only
one kevent compared to an epitem plus an eppoll_entry).
> Eric
--
Evgeniy Polyakov
On Thursday 02 November 2006 09:46, Evgeniy Polyakov wrote:
> By poll design we have to setup following data:
> poll_table_struct, which contains a callback
> that callback will be called in each
> sys_poll()->drivers_poll()->poll_wait(),
> callback will allocate new private structure, which must have
> wait_queue_t (it's callback will be invoked each time wake_up() is
> called for given wait_queue_head), which should be linked to the given
> wait_queue_head.
In the epoll case, the setup is done at epoll_ctl() time, not for each event
received by the consumer as with poll()/select().
As for the wake_up() overhead, I feel it's necessary if we want to wake up a
consumer (tell him some events are available).
I suspect benchmark results might depend to a large part on some 'features'
of the scheduler, or on the number of event providers and the CPU cache size,
not on inherent limitations of epoll.
I changed a litle bit epoll_pipe_bench and we can see effects of scheduling on
the overall rate of events per second.
I can now receive 350.000 events per second, instead of 230.000
Tests done on my laptop : a UP machine (Intel(R) Pentium(R) M processor
1.60GHz) (Dell latitude D610)
# ./epoll_pipe_bench -n 1000 -l 1
1000 pipes setup
223065 evts/sec 1.00563 samples per call
228508 evts/sec 1.00277 samples per call
229076 evts/sec 1.00184 samples per call
227860 evts/sec 1.00191 samples per call
229498 evts/sec 1.00153 samples per call
228027 evts/sec 1.00136 samples per call
229465 evts/sec 1.00122 samples per call
227845 evts/sec 1.00132 samples per call
227456 evts/sec 1.00118 samples per call
228355 evts/sec 1.00113 samples per call
# ./epoll_pipe_bench -n 1000 -l 10
1000 pipes setup
328599 evts/sec 314.75 samples per call
344947 evts/sec 314.447 samples per call
342844 evts/sec 314.185 samples per call
345486 evts/sec 314.013 samples per call
345144 evts/sec 314.079 samples per call
344270 evts/sec 313.989 samples per call
344249 evts/sec 314.004 samples per call
320577 evts/sec 326.967 samples per call
313990 evts/sec 343.926 samples per call
313578 evts/sec 359.034 samples per call
In oprofile, it's not obvious that epoll is responsible for the CPU costs...
CPU: PIII, speed 1600 MHz (estimated)
Counted CPU_CLK_UNHALTED events (clocks processor is not halted) with a unit
mask of 0x00 (No unit mask) count 50000
samples % symbol name
206966 7.9385 sysenter_past_esp
158337 6.0733 _spin_lock_irqsave
143339 5.4980 pipe_writev
137434 5.2715 fget_light
132293 5.0743 _spin_unlock_irqrestore
109593 4.2036 pipe_readv
102669 3.9380 ep_poll_callback
99801 3.8280 dnotify_parent
99388 3.8122 sys_epoll_wait
98710 3.7862 vfs_write
94560 3.6270 _write_lock_irqsave
91774 3.5201 current_fs_time
90541 3.4728 pipe_poll
89153 3.4196 _spin_lock
85150 3.2661 __wake_up
83213 3.1918 __wake_up_common
72913 2.7967 try_to_wake_up
67996 2.6081 rw_verify_area
64243 2.4641 file_update_time
54211 2.0793 vfs_read
But but but! If I add more pipes, the results are reversed, because the CPU
cache is not large enough: it's better to deliver events as fast as possible to
the consumer to keep the caches hot (but only on a UP machine, and because my
test prog is threaded. If it was using two processes, context switches would be
more expensive).
# ./epoll_pipe_bench -n 10000 -l 1
10000 pipes setup
171444 evts/sec 1 samples per call
174556 evts/sec 1 samples per call
173976 evts/sec 1 samples per call
174715 evts/sec 1.00003 samples per call
173215 evts/sec 1.00478 samples per call
174930 evts/sec 1.00397 samples per call
# ./epoll_pipe_bench -n 10000 -l 10
10000 pipes setup
149701 evts/sec 759.904 samples per call
153476 evts/sec 767.537 samples per call
149217 evts/sec 767.585 samples per call
152396 evts/sec 763.624 samples per call
153517 evts/sec 762.215 samples per call
148026 evts/sec 763.434 samples per call
CPU: PIII, speed 1600 MHz (estimated)
Counted CPU_CLK_UNHALTED events (clocks processor is not halted) with a unit
mask of 0x00 (No unit mask) count 50000
samples % symbol name
404924 12.4647 pipe_poll
263399 8.1081 pipe_writev
259230 7.9798 rw_verify_area
246368 7.5839 fget_light
242360 7.4605 sys_epoll_wait
174913 5.3843 kmap_atomic
168623 5.1907 __wake_up_common
147661 4.5454 ep_poll_callback
137003 4.2173 pipe_readv
128799 3.9648 _spin_lock_irqsave
100015 3.0787 sysenter_past_esp
74031 2.2789 __copy_to_user_ll
72857 2.2427 file_update_time
69523 2.1401 vfs_write
69444 2.1377 mutex_lock
I also added a -f flag, to bypass epoll completely and measure the
pipe/read/write overhead/performance.
# ./epoll_pipe_bench -n 10000 -f
10000 pipes setup
300230 evts/sec
309770 evts/sec
264426 evts/sec
265842 evts/sec
265551 evts/sec
266814 evts/sec
266551 evts/sec
264415 evts/sec
On 11/1/06, Evgeniy Polyakov <[email protected]> wrote:
> On Wed, Nov 01, 2006 at 06:12:41PM -0800, Nate Diller ([email protected]) wrote:
> > Indesiciveness has certainly been an issue here, but I remember akpm
> > and Ulrich both giving concrete suggestions. I was particularly
> > interested in Andrew's request to explain and justify the differences
> > between kevent and BSD's kqueue interface. Was there a discussion
> > that I missed? I am very interested to see your work on this
> > mechanism merged, because you've clearly emphasized performance and
> > shown impressive results. But it seems like we lose out on a lot by
> > throwing out all the applications that already use kqueue.
>
> It looks you missed that discussion - freebsd kqueue has fields in the
> kevent structure which have diffent sizes in 32 and 64 bit environments.
Are you saying that the *only* reason we choose not to be
source-compatible with BSD is the 32 bit userland on 64 bit arch
problem? I've followed every thread that gmail 'kqueue' search
returns, which thread are you referring to? Nicholas Miell, in "The
Proposed Linux kevent API" thread, seems to think that there are no
advantages over kqueue to justify the incompatibility, an argument you
made no effort to refute. I've also read the Kevent wiki at
linux-net.osdl.org, but it too is lacking in any direct comparisons
(even theoretical, let alone benchmarks) of the flexibility,
performance, etc. between the two.
I'm not arguing that you've done a bad design, I'm asking you to brag
about the things you improved on vs. kqueue. Your emphasis on
unifying all the different event types into one interface is really
cool; fill me in on why that can't be effectively done with kqueue
compatibility and I also will advocate for kevent inclusion.
NATE
2006/11/2, Eric Dumazet <[email protected]>:
> zhou drangon wrote:
> > performance is great, and we are exciting at the result.
> >
> > I want to know why there can be so much improvement, can we improve
> > epoll too ?
>
> Why did you remove most of CC addresses but lkml ?
> Dont do that please...
I seldom reply to the mailing list, Sorry for this.
>
> Good question :)
>
> Hum, I think I can look into epoll and see how it can be improved (if necessary)
>
I have another question.
As for the VFS, when we introduced the AIO mechanism, we added aio_read,
aio_write, etc. to the file ops, and then made the read and write ops
call aio_read and aio_write, so that only one implementation remains in
the kernel.
Can we do the event mechanism the same way?
When kevent is robust enough, can we implement epoll/select/io_submit etc.
based on kevent?
In this way we can simplify the kernel, and epoll can gain
improvements from kevent.
> This is not to say we dont need kevent ! Please Evgeniy continue your work !
Yes! We are looking forward to your great work.
I created a userland event-driven framework for my application, but I have
to use multiple threads to receive events: epoll to wait for most events,
and io_getevents to wait for disk AIO events. I hope we can get a universal
event mechanism to make the code elegant.
>
> Just to remind you that according to
> http://www.xmailserver.org/linux-patches/nio-improve.html David Libenzi had to
> wait 18 months before epoll being officialy added into kernel.
>
> At that time, many applications were using epoll, and we were patching our
> kernels for that.
>
>
> I cooked a very simple program (attached in this mail), using pipes and epoll,
> and got 250.000 events received per second on an otherwise lightly loaded
> machine (dual opteron 246 , 2GHz, 1MB cache per cpu) with 10.000 pipes (20.000
> handles)
>
> It could be nice to add support for other event providers in this program
> (AF_INET & AF_UNIX sockets for example), and also add support for kevent, so
> that we really can compare epoll/kevent without a complex setup.
> I should extend the program to also add/remove sources during lifetime, not
> only insert at setup time.
>
> # gcc -O2 -o epoll_pipe_bench epoll_pipe_bench.c -lpthread
> # ulimit -n 1000000
> # epoll_pipe_bench -n 10000
> ^C after a while...
>
> oprofile results say that ep_poll_callback() and sys_epoll_wait() use 20% of
> cpu time.
> Even if we gain a two factor in cpu time or cache usage, we wont eliminate
> other costs...
>
> oprofile results gave :
>
> Counted CPU_CLK_UNHALTED events (Cycles outside of halt state) with a unit
> mask of 0x00 (No unit mask) count 50000
> samples % symbol name
> 2015420 11.1309 ep_poll_callback
> 1867431 10.3136 pipe_writev
> 1791872 9.8963 sys_epoll_wait
> 1357297 7.4962 fget_light
> 1277515 7.0556 pipe_readv
> 998447 5.5143 current_fs_time
> 801597 4.4271 __mark_inode_dirty
> 755268 4.1713 __wake_up
> 587065 3.2423 __write_lock_failed
> 582931 3.2195 system_call
> 297132 1.6410 iov_fault_in_pages_read
> 296136 1.6355 sys_write
> 290106 1.6022 __wake_up_common
> 270692 1.4950 bad_pipe_w
> 261516 1.4443 do_pipe
> 257208 1.4205 tg3_start_xmit_dma_bug
> 254917 1.4079 pipe_poll
> 252925 1.3969 copy_user_generic_c
> 234212 1.2935 generic_pipe_buf_map
> 228659 1.2629 ret_from_sys_call
> 212541 1.1738 sysret_check
> 166529 0.9197 sys_read
> 160038 0.8839 vfs_write
> 151091 0.8345 pipe_ioctl
> 136301 0.7528 file_update_time
> 107173 0.5919 tg3_poll
> 77846 0.4299 ipt_do_table
> 75081 0.4147 schedule
> 73059 0.4035 vfs_read
> 69787 0.3854 get_task_comm
> 63923 0.3530 memcpy
> 60019 0.3315 touch_atime
> 57490 0.3175 eventpoll_release_file
> 56152 0.3101 tg3_write_flush_reg32
> 54468 0.3008 rw_verify_area
> 47833 0.2642 generic_pipe_buf_unmap
> 47777 0.2639 __switch_to
> 44106 0.2436 bad_pipe_r
> 41824 0.2310 proc_nr_files
> 41319 0.2282 pipe_iov_copy_from_user
>
>
> Eric
>
>
>
> [epoll_pipe_bench.c source snipped - it appears in full earlier in the thread]
On Thu, Nov 02, 2006 at 11:40:43AM -0800, Nate Diller ([email protected]) wrote:
> Are you saying that the *only* reason we choose not to be
> source-compatible with BSD is the 32 bit userland on 64 bit arch
> problem? I've followed every thread that gmail 'kqueue' search
I.e. do you want a generic event handling mechanism that would not work on
x86_64? I doubt you do.
> returns, which thread are you referring to? Nicholas Miell, in "The
> Proposed Linux kevent API" thread, seems to think that there are no
> advantages over kqueue to justify the incompatibility, an argument you
> made no effort to refute. I've also read the Kevent wiki at
> linux-net.osdl.org, but it too is lacking in any direct comparisons
> (even theoretical, let alone benchmarks) of the flexibility,
> performance, etc. between the two.
>
> I'm not arguing that you've done a bad design, I'm asking you to brag
> about the things you improved on vs. kqueue. Your emphasis on
> unifying all the different event types into one interface is really
> cool, fill me in on why that can't be effectively done with the kqueue
> compatability and I also will advocate for kevent inclusion.
kqueue just cannot be used as-is in Linux (_maybe_ *BSD has different
types, not those which I found in /usr/include in my FC5 and Debian
distros). It will not work on x86_64, for example. Any kind of pointer
or unsigned long in structures which are transferred between kernelspace
and userspace is so questionable that it is much better not to have it
there at all... (if I were not being politically correct, I would
describe it in much different words).
So, the kqueue API and structures cannot be used in Linux.
> NATE
--
Evgeniy Polyakov
Hi!
> > returns, which thread are you referring to? Nicholas Miell, in "The
> > Proposed Linux kevent API" thread, seems to think that there are no
> > advantages over kqueue to justify the incompatibility, an argument you
> > made no effort to refute. I've also read the Kevent wiki at
> > linux-net.osdl.org, but it too is lacking in any direct comparisons
> > (even theoretical, let alone benchmarks) of the flexibility,
> > performance, etc. between the two.
> >
> > I'm not arguing that you've done a bad design, I'm asking you to brag
> > about the things you improved on vs. kqueue. Your emphasis on
> > unifying all the different event types into one interface is really
> > cool, fill me in on why that can't be effectively done with the kqueue
> > compatability and I also will advocate for kevent inclusion.
>
> kqueue just can not be used as is in Linux (_maybe_ *bsd has different
> types, not those which I found in /usr/include in my FC5 and Debian
> distro). It will not work on x86_64 for example. Some kind of a pointer
> or unsigned long in structures which are transferred between kernelspace
> and userspace is so much questionable, than it is much better even do
> not see there... (if I would not have so political correctness, I would
> describe it in a much different words actually).
> So, kqueue API and structures can not be usd in Linux.
Not sure what you are smoking, but "there's an unsigned long in the *BSD
version, let's rewrite it from scratch" sounds like a very bad idea. What
about fixing that one bit you don't like?
Pavel
--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html
From: Pavel Machek <[email protected]>
Date: Fri, 3 Nov 2006 09:57:12 +0100
> Not sure what you are smoking, but "there's unsigned long in *bsd
> version, lets rewrite it from scratch" sounds like very bad idea. What
> about fixing that one bit you don't like?
I disagree, it's more like since we have to be structure incompatible
anyways, let's design something superior if we can.
On Fri, Nov 03, 2006 at 09:57:12AM +0100, Pavel Machek ([email protected]) wrote:
> > So, kqueue API and structures can not be usd in Linux.
>
> Not sure what you are smoking, but "there's unsigned long in *bsd
> version, lets rewrite it from scratch" sounds like very bad idea. What
> about fixing that one bit you don't like?
It is not about what I dislike, but about what is broken and what is not.
Putting a u64 in place of a long or something like that _is_ incompatible
already, so why should we even use it?
And, btw, what are we talking about? Is it about the whole of kevent
compared to kqueue in kernelspace, or just about which structure is
transferred between kernelspace and userspace?
I'm sure it was some kind of joke to 'not rewrite *BSD from scratch
and use kqueue in the Linux kernel as-is'.
> Pavel
> --
> (english) http://www.livejournal.com/~pavelmachek
> (cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html
--
Evgeniy Polyakov
On Fri, Nov 03, 2006 at 10:42:04AM +0800, zhou drangon ([email protected]) wrote:
> As for the VFS system, when we introduce the AIO machinism, we add aio_read,
> aio_write, etc... to file ops, and then we make the read, write op to
> call aio_read,
> aio_write, so that we only remain one implement in kernel.
> Can we do event machinism the same way?
> when kevent is robust enough, can we implement epoll/select/io_submit etc...
> base on kevent ??
> In this way, we can simplified the kernel, and epoll can gain
> improvement from kevent.
There is an AIO implementation on top of kevent; although it was confirmed
that it has a good design (except for minor API layering changes), it was
postponed for a while.
--
Evgeniy Polyakov
In article <local.mail.linux-kernel/[email protected]>,
Evgeniy Polyakov <[email protected]> wrote:
>On Thu, Nov 02, 2006 at 11:40:43AM -0800, Nate Diller
>([email protected]) wrote:
>> Are you saying that the *only* reason we choose not to be
>> source-compatible with BSD is the 32 bit userland on 64 bit arch
>> problem? I've followed every thread that gmail 'kqueue' search
>
>I.e. do you want that generic event handling mechanism would not work on
>x86_64? I doubt you do.
>
>> returns, which thread are you referring to? Nicholas Miell, in "The
>> Proposed Linux kevent API" thread, seems to think that there are no
>> advantages over kqueue to justify the incompatibility, an argument you
>> made no effort to refute. I've also read the Kevent wiki at
>> linux-net.osdl.org, but it too is lacking in any direct comparisons
>> (even theoretical, let alone benchmarks) of the flexibility,
>> performance, etc. between the two.
>>
>> I'm not arguing that you've done a bad design, I'm asking you to brag
>> about the things you improved on vs. kqueue. Your emphasis on
>> unifying all the different event types into one interface is really
>> cool, fill me in on why that can't be effectively done with the kqueue
>> compatability and I also will advocate for kevent inclusion.
>
>kqueue just can not be used as is in Linux (_maybe_ *bsd has different
>types, not those which I found in /usr/include in my FC5 and Debian
>distro). It will not work on x86_64 for example. Some kind of a pointer
>or unsigned long in structures which are transferred between kernelspace
>and userspace is so much questionable, than it is much better even do
>not see there... (if I would not have so political correctness, I would
>describe it in a much different words actually).
>So, kqueue API and structures can not be usd in Linux.
Let me be a little blunt here: that is just so much bullshit.
Yes, I understand the problem that 32-bit userspace on a 64-bit kernel has.
Mea culpa - I didn't foresee this years ago, and none of my many reviewers
caught it either. It was designed for 32/32 and 64/64, not 32/64.
However, this is trivially fixed by adding a union to the structure, as
pointed out earlier on this list. Code would still be source compatible
with any kqueue apps, which is what counts. Even NetBSD and FreeBSD have
differing definitions of the kq constants, and nobody notices.
I really have no stake in this matter, so if you want to go invent a
better mousetrap, more power to you. But don't claim that "kqueue can
not be used on Linux"; this just makes you look foolish - I have code
running on x86_64 that trivially disproves your statement.
--
Jonathan
On Wed, Nov 01, 2006 at 09:57:46PM +0300, Evgeniy Polyakov wrote:
> On Wed, Nov 01, 2006 at 06:20:43PM +0000, Oleg Verych ([email protected]) wrote:
[]
> > Where's real-life application to do configure && make && make install?
>
> Your real life or mine as developer?
> I fortunately do not know anything about your real life, but my real life
To avoid shifting the conversation further in a non-technical direction,
take my sentence both as a question *and* as a definition.
> applications can be found on project's homepage.
> There is a link to archive there, where you can find plenty of sources.
But no single makefile. Or do the CC and options really not matter?
You can easily find, in your server's Apache logs, my visit to that
archive on the day of my message (today I just confirmed my assertions):
browser lynx, host flower.upol.cz.
> You likely do not know, but it is a bit risky business to patch all
> existing applications to show that approach is correct, if
> implementation is not completed.
Fortunately for me, `lighttpd' is real-life *and* in the benchmark
area as well. Just see on that site how much has been measured: different
OSes, special tuning. *That* is what I'm talking about. The epoll _wrapper_
there is 3461 bytes long; your answer to _me_ was 2580. People are bringing
you a test bed, all set up and ready to use; if you need less code, go on,
comment the needless parts out!
> You likely do not know, but since I first announced kevents in
> February I have changed the interfaces 4 times - and that is just the
> interfaces, not counting numerous features added/removed at developers' requests.
I think that's called open source - the Linux kernel case.
> > There were some comments about lacking much of such programs; the answers were
> > "was in prev. e-mail", "need to update them", something like that.
> > The "trivial web server" sources URL mentioned in the benchmark isn't pointed
> > to in the patch advertisement. If it were, should I actually try that new
> > *trivial* wheel?
>
> The answer is trivial - there is an archive where one can find the source code
> (filenames are posted regularly). Should I create an rpm? For which glibc
> version?
Hmm. Let me answer that "dup" with material from the LKML archive. It
will reveal that my guesses had already been put to you by The Big Jury:
[^0] Message-ID: [email protected]
[^1] Message-ID: [email protected],
Message-ID: [email protected]
more than 10 takes ago.
> > Saying that, i want to give you some short examples, i know.
> > *Linux kernel <-> userspace*:
> > o Alexey Kuznetsov networking <-> (excellent) iproute set of utilities;
>
> iproute documentation was way too bad when Alexey presented it first
> time :)
As an example: after having read some books on TCP/IP and Ethernet, the
internal help of `ip' was all I needed to know.
> Btw, show me splice() 'shiny' application? Does lighttpd use it?
> Or move_pages().
You know who proposed that, and you know how many (few) releases ago.
> > To make a little hint to you, Evgeniy, why don't you find a little
> > animal in the open source zoo to implement little interface to
> > proposed kernel subsystem and then show it to The Big Jury (not me),
> > we have here? And i can not see, how you've managed to implement
> > something like that having almost nothing on the test basket.
> > Very *suspicious* ch.
>
> There are always people who do not like something, what can I do with
I didn't think that my message was offensive. Also, I didn't even say
that you hadn't bothered to feed your code to "scripts/Lindent".
[]
> I created trivial web servers which send a single static page and use
> various event handling schemes, and I test the new subsystem with new tools;
> when tests are completed and all requested features are implemented, it
> will be time to work on different, more complex users.
Please, see [^0],
> So let's at least complete what we have right now, so no developer's
> efforts could be wasted writing empty chars in various places.
and [^1].
[ Please do not answer just to answer, cc list is big, no one from ]
[ The Big Jury seems to care. (well, Jonathan does, but he wasn't in cc) ]
Friendly, Oleg.
____
On Fri, Nov 03, 2006 at 07:49:16PM +0100, Oleg Verych ([email protected]) wrote:
> > applications can be found on project's homepage.
> > There is a link to archive there, where you can find plenty of sources.
>
> But no single makefile. Or do the CC and options really not matter?
> You can easily find, in your server's Apache logs, my visit to that
> archive on the day of my message (today I just confirmed my assertions):
> browser lynx, host flower.upol.cz.
If you cannot compile those sources, then you should not use kevent for
a while. Definitely.
The options are pretty simple: -W -Wall -I$(path_to_kernel_tree)/include
> > You likely do not know, but it is a bit risky business to patch all
> > existing applications to show that approach is correct, if
> > implementation is not completed.
>
> Fortunately for me, `lighttpd' is real-life *and* in the benchmark
> area as well. Just see on that site how much has been measured: different
> OSes, special tuning. *That* is what I'm talking about. The epoll _wrapper_
> there is 3461 bytes long; your answer to _me_ was 2580. People are bringing
> you a test bed, all set up and ready to use; if you need less code, go on,
> comment the needless parts out!
So what?
People bring me tons of various stuff, and I prefer to use my own for
tests. If _you_ need it, _you_ can always patch any sources you like.
> > You likely do not know, but since I first announced kevents in
> > February I have changed the interfaces 4 times - and that is just the
> > interfaces, not counting numerous features added/removed at developers' requests.
>
> I think that's called open source - the Linux kernel case.
You missed the point - I'm not going to patch tons of existing
applications when I'm asked to change an interface once per month.
When all requested features are implemented, I will definitely patch some
popular web server to show how kevent is used.
> > > There were some comments about lacking much of such programs; the answers were
> > > "was in prev. e-mail", "need to update them", something like that.
> > > The "trivial web server" sources URL mentioned in the benchmark isn't pointed
> > > to in the patch advertisement. If it were, should I actually try that new
> > > *trivial* wheel?
> >
> > The answer is trivial - there is an archive where one can find the source code
> > (filenames are posted regularly). Should I create an rpm? For which glibc
> > version?
>
> Hmm. Let me answer that "dup" with material from the LKML archive. It
> will reveal that my guesses had already been put to you by The Big Jury:
>
> [^0] Message-ID: [email protected]
> [^1] Message-ID: [email protected],
> Message-ID: [email protected]
>
> more than 10 takes ago.
And? Please provide a link to the archive.
> > > Saying that, i want to give you some short examples, i know.
> > > *Linux kernel <-> userspace*:
> > > o Alexey Kuznetsov networking <-> (excellent) iproute set of utilities;
> >
> > iproute documentation was way too bad when Alexey presented it first
> > time :)
>
> As an example: after having read some books on TCP/IP and Ethernet, the
> internal help of `ip' was all I needed to know.
:)) I.e. it is ok for you to 'read some books on TCP/IP and Ethernet' to
understand how a utility works, but it is not ok to determine how to
compile my sources? Then do not compile my sources.
> > Btw, show me splice() 'shiny' application? Does lighttpd use it?
> > Or move_pages().
>
> You know who proposed that, and you know how many (few) releases ago.
And why does lighttpd still not use it?
You should start by blaming the authors of splice() for that.
You will not? Then I cannot take your words in my direction
seriously.
> > > To make a little hint to you, Evgeniy, why don't you find a little
> > > animal in the open source zoo to implement little interface to
> > > proposed kernel subsystem and then show it to The Big Jury (not me),
> > > we have here? And i can not see, how you've managed to implement
> > > something like that having almost nothing on the test basket.
> > > Very *suspicious* ch.
> >
> > There are always people who do not like something, what can I do with
>
> I didn't think that my message was offensive. Also, I didn't even say
> that you hadn't bothered to feed your code to "scripts/Lindent".
You do not use kevent; why do you care about the indentation of the
userspace tools?
> []
> > I created trivial web servers which send a single static page and use
> > various event handling schemes, and I test the new subsystem with new tools;
> > when tests are completed and all requested features are implemented, it
> > will be time to work on different, more complex users.
>
> Please, see [^0],
>
> > So let's at least complete what we have right now, so no developer's
> > efforts could be wasted writing empty chars in various places.
>
> and [^1].
>
> [ Please do not answer just to answer, cc list is big, no one from ]
> [ The Big Jury seems to care. (well, Jonathan does, but he wasn't in cc) ]
This thread is just answers for the sake of answers - there is
completely no sense in it.
You blame me for not creating some benchmarks you like, but I do not
care about that. I created a useful patch and test it in the way I like,
because that is much more productive than spending a lot of time
determining how different sources work under the appropriate loads.
When there is a strong requirement to perform additional tests, I
will do them.
> Friendly, Oleg.
> ____
--
Evgeniy Polyakov
On Fri, Nov 03, 2006 at 07:49:16PM +0100, Oleg Verych ([email protected]) wrote:
> [ Please do not answer just to answer, cc list is big, no one from ]
> [ The Big Jury seems to care. (well, Jonathan does, but he wasn't in cc) ]
>
> Friendly, Oleg.
Just in case some misunderstanding happened: I do not want to insult
anyone who is against kevent; I just do not understand cases where
people rudely require me to do something to convince them.
--
Evgeniy Polyakov
Hi!
On Fri 2006-11-03 12:13:02, Evgeniy Polyakov wrote:
> On Fri, Nov 03, 2006 at 09:57:12AM +0100, Pavel Machek ([email protected]) wrote:
> > > So, the kqueue API and structures cannot be used in Linux.
> >
> > Not sure what you are smoking, but "there's unsigned long in *bsd
> > version, lets rewrite it from scratch" sounds like very bad idea. What
> > about fixing that one bit you don't like?
>
> It is not about what I dislike, but about what is broken and what is not.
> Putting a u64 instead of a long or some such _is_ already incompatible,
> so why should we even use it?
Well.. u64 vs unsigned long *is* binary incompatible, but it is
similar enough that it is going to be compatible at source level, or
maybe userland app will need *minor* ifdefs... That's better than two
completely different versions...
> And, btw, what are we talking about? Is it about the whole kevent
> compared to kqueue in kernelspace, or just about which structure is being
> transferred between kernelspace and userspace?
> I'm sure it was some kind of a joke to 'not rewrite *bsd from scratch
> and use kqueue in the Linux kernel as is'.
No, it is probably not possible to take code from BSD kernel and "just
port it". But keeping same/similar userland interface would be nice.
Pavel
--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html
On Sun, Nov 05, 2006 at 12:19:33PM +0100, Pavel Machek ([email protected]) wrote:
> Hi!
>
> On Fri 2006-11-03 12:13:02, Evgeniy Polyakov wrote:
> > On Fri, Nov 03, 2006 at 09:57:12AM +0100, Pavel Machek ([email protected]) wrote:
> > > > So, the kqueue API and structures cannot be used in Linux.
> > >
> > > Not sure what you are smoking, but "there's unsigned long in *bsd
> > > version, lets rewrite it from scratch" sounds like very bad idea. What
> > > about fixing that one bit you don't like?
> >
> > It is not about what I dislike, but about what is broken and what is not.
> > Putting a u64 instead of a long or some such _is_ already incompatible,
> > so why should we even use it?
>
> Well.. u64 vs unsigned long *is* binary incompatible, but it is
> similar enough that it is going to be compatible at source level, or
> maybe userland app will need *minor* ifdefs... That's better than two
> completely different versions...
>
> > And, btw, what are we talking about? Is it about the whole kevent
> > compared to kqueue in kernelspace, or just about which structure is being
> > transferred between kernelspace and userspace?
> > I'm sure it was some kind of a joke to 'not rewrite *bsd from scratch
> > and use kqueue in the Linux kernel as is'.
>
> No, it is probably not possible to take code from BSD kernel and "just
> port it". But keeping same/similar userland interface would be nice.
It is not just "probably" - it is simply impossible to take the FreeBSD
kqueue code and port it: such a port would be a completely different
system.
It is impossible to have the same event structure; one would have to write
#if defined kqueue
fill all members of the structure
#elif defined kevent
fill different member names, since Linux does not even have some of the types
#endif
*BSD kevent (structure transferred between userspace and kernelspace)
struct kevent {
uintptr_t ident; /* identifier for this event */
short filter; /* filter for event */
u_short flags; /* action flags for kqueue */
u_int fflags; /* filter flag value */
intptr_t data; /* filter data value */
void *udata; /* opaque user data identifier */
};
You must fill all the fields differently because of the above.
Just one example: the Linux kevent has an extended ID field which is grouped
into type.event, while kqueue has a pointer-sized ident and a short filter.
Linux kevent does not have filters; instead it has generic storages
of events which can be processed in any way the origin of the storage wants
(this, for example, allows creating aio_sendfile() (which is dropped from
the patchset currently), which no other system in the wild has).
There are too many differences. They are just different systems.
If both can be described by the sentence "a system which handles events", it
does not mean that they are the same and can share structures, or even
have a similar design.
Kevent is not kqueue in any way (although there are certain
similarities), so they cannot share anything.
--
Evgeniy Polyakov
Hi!
On Fri 2006-11-03 10:30:12, Jonathan Lemon wrote:
> In article <local.mail.linux-kernel/[email protected]>,
> Evgeniy Polyakov <[email protected]> wrote:
> >On Thu, Nov 02, 2006 at 11:40:43AM -0800, Nate Diller
> >kqueue just cannot be used as-is in Linux (_maybe_ *BSD has different
> >types, not those which I found in /usr/include on my FC5 and Debian
> >distros). It will not work on x86_64, for example. A pointer or an
> >unsigned long in structures which are transferred between kernelspace
> >and userspace is so questionable that it is better not to have it
> >there at all... (if I were less politically correct, I would describe
> >it in very different words).
> >So, the kqueue API and structures cannot be used in Linux.
>
> Let me be a little blunt here: that is just so much bullshit.
>
> Yes, I understand the problem that 32-bit userspace on a 64-bit kernel has.
> Mea culpa - I didn't foresee this years ago, and none of my many reviewers
> caught it either. It was designed for 32/32 and 64/64, not 32/64.
>
> However, this is trivially fixed by adding a union to the structure, as
> pointed out earlier on this list. Code would still be source compatible
> with any kqueue apps, which is what counts. Even NetBSD and FreeBSD have
> differing definitions of the kq constants, and nobody notices.
It has been shown in this thread that kevent is too different from kqueue
as is... but what are the advantages of kevent? Perhaps we should use
kqueue on Linux, too (even if it means one more rewrite for you...?)
Pavel
--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html
On Sun, Nov 05, 2006 at 09:47:41PM +0100, Pavel Machek ([email protected]) wrote:
> It has been shown in this thread that kevent is too different from kqueue
> as is... but what are the advantages of kevent? Perhaps we should use
> kqueue on Linux, too (even if it means one more rewrite for you...?)
Should we use the *BSD VMM system when we have a superior Linux one?
P.S. Do not drop Cc: list.
--
Evgeniy Polyakov
On Mon 2006-11-06 13:13:29, Evgeniy Polyakov wrote:
> On Sun, Nov 05, 2006 at 09:47:41PM +0100, Pavel Machek ([email protected]) wrote:
> > It has been shown in this thread that kevent is too different from kqueue
> > as is... but what are the advantages of kevent? Perhaps we should use
> > kqueue on Linux, too (even if it means one more rewrite for you...?)
>
> Should we use the *BSD VMM system when we have a superior Linux one?
Very different question; VMM system is not something that has userland
API.
Can you explain why kevent is better than kqueue?
> P.S. Do not drop Cc: list.
It was not me who dropped cc list.
Pavel
--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html
On Mon, Nov 06, 2006 at 11:16:33AM +0100, Pavel Machek ([email protected]) wrote:
> On Mon 2006-11-06 13:13:29, Evgeniy Polyakov wrote:
> > On Sun, Nov 05, 2006 at 09:47:41PM +0100, Pavel Machek ([email protected]) wrote:
> > > It has been shown in this thread that kevent is too different from kqueue
> > > as is... but what are the advantages of kevent? Perhaps we should use
> > > kqueue on Linux, too (even if it means one more rewrite for you...?)
> >
> > Should we use the *BSD VMM system when we have a superior Linux one?
>
> Very different question; VMM system is not something that has userland
> API.
So what? We still create new things which work better than the old ones,
even if that requires 'reinventing the wheel'.
It has been shown too many times already why the kqueue API cannot be used
in Linux.
Btw, if you want someone to rewrite something, you can start with an
mmap-based malloc, for example. Why don't you want to do it - although the
API is the same, the underlying logic is different.
> Can you explain why kevent is better than kqueue?
According to my tests, kevent is noticeably faster.
That is already a big enough flag that the old system should not be used.
And half of my previous mail to you shows why kevent is better than/different
from kqueue.
--
Evgeniy Polyakov
Hi!
> > Can you explain why kevent is better than kqueue?
>
> According to my tests, kevent is noticeably faster.
> That is already a big enough flag that the old system should not be used.
> And half of my previous mail to you shows why kevent is better than/different
> from kqueue.
You have shown why it is _different_. How much faster is "noticeably
faster"?
Pavel
--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html
On Mon, Nov 06, 2006 at 01:58:18PM +0100, Pavel Machek ([email protected]) wrote:
> Hi!
>
> > > Can you explain why kevent is better than kqueue?
> >
> > According to my tests, kevent is noticeably faster.
> > That is already a big enough flag that the old system should not be used.
> > And half of my previous mail to you shows why kevent is better than/different
> > from kqueue.
>
> You have shown why it is _different_. How much faster is "noticeably
> faster"?
It is different on purpose, don't you think?
If I put all the benchmark results in every mail, no one would read even
half of it.
Here is the conclusion section from the kevent homepage, where FreeBSD
kqueue is compared with kevent (a different NIC than in the recent Linux
kevent tests, but there are links to the old kevent benchmarks there):
"After various sysctls had been changed (the sysctl -a output is available
here) things became slightly better (btw, a default FreeBSD installation
does not allow such tests at all due to default network parameters), but
the number of "connection reset" errors is still very high.
FreeBSD drops too many connections due to either misconfiguration or a
lack of resources.
According to the FreeBSD and Linux comparison, the number of connection
errors in Linux is much smaller than in FreeBSD, with a comparable or
higher request rate."
Briefly: FreeBSD kqueue behaves like Linux epoll, sometimes better (with a
small request rate), sometimes worse (with a rate of 3k simultaneous
connections), and the latter was shown to behave worse than kevent.
Actually, Pavel, I do not understand your point. Why do you want to use a
*BSD subsystem even if it is impossible to have the same API? You want
me to rewrite kevent so it looks like kqueue, but you did not know
what it looks like; you likely did not know its API (it uses switches of
commands, which are very much frowned upon in the Linux kernel); you did
not know what features kevent provides, or what is present and what does
not exist in kqueue.
So please point me to the magic Bodhi way which can enlighten me to think
that a completely different system, which works with a completely different
OS with completely different API, ABI and kernel internals, should be
ported to Linux instead of creating a new and superior system.
When I become as luminous as you, I will go and create a new sendfile()
system call which has the same parameters as BSD's. Or not - I will ask
you to do it (actually no: why should we create something new, when
there is a BSD system which already has everything we want?).
> Pavel
--
Evgeniy Polyakov
/*
* How to stress epoll
*
* This program uses many pipes|sockets and two threads.
* First we open as many pipes|sockets we can. (see ulimit -n)
* Then we create a worker thread.
* The worker thread will send bytes to random streams.
* The main thread uses epoll to collect ready events and clear them, reading streams.
* Each second, a number of collected events is printed on stderr
* After one minute, program prints an average value and stops.
*
* Usage : epoll_bench [-f] [-{u|i}] [-n X]
* -f : No epoll loop, just feed streams in a cyclic manner
* -u : Use AF_UNIX sockets (instead of pipes)
* -i : Use AF_INET sockets
*/
#define _GNU_SOURCE	/* for pthread_yield() */
#include <pthread.h>
#include <stdlib.h>
#include <stdint.h>
#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <sys/epoll.h>
#include <signal.h>
#include <unistd.h>
#include <sys/time.h>
#include <sys/types.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <fcntl.h>
#include <sys/ioctl.h>
int nbhandles = 1024;
int time_test = 15;
unsigned long nbhandled;
unsigned long epw_samples;
unsigned long epw_samples_cnt;
struct pipefd {
int fd[2];
} *tab;
int epoll_fd;
int fflag;
int afunix;
int afinet;
static int alloc_streams()
{
int i;
int listen_sock;
struct sockaddr_in me, to;
socklen_t namelen;
int on = 1;
int off = 0;
if (!fflag) {
epoll_fd = epoll_create(nbhandles);
if (epoll_fd == -1) {
perror("epoll_create");
return -1;
}
}
tab = malloc(sizeof(struct pipefd) * nbhandles);
if (tab == NULL) {
perror("malloc");
return -1;
}
if (afinet) {
listen_sock = socket(AF_INET, SOCK_STREAM, 0);
if (listen_sock == -1) {
perror("socket");
return -1;
}
if (listen(listen_sock, 256) == -1) {
perror("listen");
return -1;
}
namelen = sizeof(me);
getsockname(listen_sock, (struct sockaddr *)&me, &namelen);
}
for (i = 0 ; i < nbhandles ; i++) {
if (afinet) {
tab[i].fd[0] = socket(AF_INET, SOCK_STREAM, 0);
if (tab[i].fd[0] == -1)
break;
to = me;
ioctl(tab[i].fd[0], FIONBIO, &on);
if (connect(tab[i].fd[0], (struct sockaddr *)&to, sizeof(to)) != -1 || errno != EINPROGRESS)
break;
tab[i].fd[1] = accept(listen_sock, (struct sockaddr *)&to, &namelen);
if (tab[i].fd[1] == -1)
break;
ioctl(tab[i].fd[0], FIONBIO, &off);
}
else if (afunix) {
if (socketpair(AF_UNIX, SOCK_STREAM, 0, tab[i].fd) == -1)
break;
} else {
if (pipe(tab[i].fd) == -1)
break;
}
if (!fflag) {
struct epoll_event ev;
ev.events = EPOLLIN | EPOLLET;
ev.data.u64 = (uint64_t)i;
epoll_ctl(epoll_fd, EPOLL_CTL_ADD, tab[i].fd[0], &ev);
}
}
nbhandles = i;
printf("%d handles setup\n", nbhandles);
return 0;
}
static int sample_proc_stat(unsigned long *ctxt)
{
int fd = open("/proc/stat", O_RDONLY);
char buffer[4096+1], *p;
int lu;
*ctxt = 0;
if (fd == -1) {
perror("/proc/stat");
return -1;
}
lu = read(fd, buffer, sizeof(buffer) - 1); /* leave room for the NUL */
close(fd);
if (lu < 10)
return -1;
buffer[lu] = 0;
p = strstr(buffer, "ctxt");
if (p)
*ctxt = atol(p + 4);
return 0;
}
static void timer_func(int sig)
{
char buffer[128];
size_t len;
static unsigned long old;
static unsigned long oldctxt = 0;
unsigned long ctxt;
unsigned long delta = nbhandled - old;
static int alarm_events = 0;
(void)sig; /* SIGALRM handler; the signal number is unused */
old = nbhandled;
len = sprintf(buffer, "%lu evts/sec", delta);
sample_proc_stat(&ctxt);
delta = ctxt - oldctxt;
if (delta && oldctxt)
len += sprintf(buffer + len, " %lu ctxt/sec", delta);
oldctxt = ctxt;
if (epw_samples)
len += sprintf(buffer + len, " %g samples per call", (double)epw_samples_cnt/(double)epw_samples);
buffer[len++] = '\n';
write(2, buffer, len);
if (++alarm_events >= time_test) {
delta = nbhandled/alarm_events;
len = sprintf(buffer, "Avg: %lu evts/sec\n", delta);
write(2, buffer, len);
exit(0);
}
}
static void timer_setup()
{
struct itimerval it;
struct sigaction sg;
memset(&sg, 0, sizeof(sg));
sg.sa_handler = timer_func;
sigaction(SIGALRM, &sg, 0);
it.it_interval.tv_sec = 1;
it.it_interval.tv_usec = 0;
it.it_value.tv_sec = 1;
it.it_value.tv_usec = 0;
if (setitimer(ITIMER_REAL, &it, 0))
perror("setitimer");
}
static void * worker_thread_func(void *arg)
{
int fd = -1;
char c = 1;
int cnt = 0;
nice(10);
for (;;) {
if (fflag)
fd = (fd + 1) % nbhandles;
else
fd = rand() % nbhandles;
write(tab[fd].fd[1], &c, 1);
if (++cnt >= nbhandles) {
cnt = 0 ;
pthread_yield(); /* relax :) */
}
}
}
void usage(int code)
{
fprintf(stderr, "Usage : epoll_bench [-n num] [-{u|i}] [-f] [-t duration] [-l limit] [-e maxepoll]\n");
exit(code);
}
int main(int argc, char *argv[])
{
char buff[1024];
pthread_t tid;
int c, fd;
int limit = 1000;
int max_epoll = 1024;
while ((c = getopt(argc, argv, "fuin:l:e:t:")) != EOF) {
if (c == 'n') nbhandles = atoi(optarg);
else if (c == 'f') fflag++;
else if (c == 'l') limit = atoi(optarg);
else if (c == 'e') max_epoll = atoi(optarg);
else if (c == 't') time_test = atoi(optarg);
else if (c == 'u') afunix++;
else if (c == 'i') afinet++;
else usage(1);
}
alloc_streams();
pthread_create(&tid, NULL, worker_thread_func, (void *)0);
timer_setup();
if (fflag) {
for (fd = 0;;fd = (fd + 1) % nbhandles) {
if (read(tab[fd].fd[0], buff, 1024) > 0)
nbhandled++;
}
}
else {
struct epoll_event *events;
events = malloc(sizeof(struct epoll_event) * max_epoll) ;
for (;;) {
int nb = epoll_wait(epoll_fd, events, max_epoll, -1);
int i;
epw_samples++;
epw_samples_cnt += nb;
for (i = 0 ; i < nb ; i++) {
fd = tab[events[i].data.u64].fd[0];
if (read(fd, buff, 1024) > 0)
nbhandled++;
}
if (nb < limit)
pthread_yield();
}
}
}
On Mon, Nov 06, 2006 at 10:17:37PM +0100, Eric Dumazet ([email protected]) wrote:
> AF_INET
> # ./epoll_bench -n 2000 -i
> 2000 handles setup
> 69210 evts/sec 2.97224 samples per call
> 59436 evts/sec 12876 ctxt/sec 5.48675 samples per call
> 60722 evts/sec 12093 ctxt/sec 8.03185 samples per call
> 60583 evts/sec 14582 ctxt/sec 10.5644 samples per call
> 58192 evts/sec 12066 ctxt/sec 12.999 samples per call
> 54291 evts/sec 10613 ctxt/sec 15.2398 samples per call
> 47978 evts/sec 10942 ctxt/sec 17.2222 samples per call
> 59009 evts/sec 13692 ctxt/sec 19.6426 samples per call
> 58248 evts/sec 15099 ctxt/sec 22.0306 samples per call
> 58708 evts/sec 15118 ctxt/sec 24.4497 samples per call
> 58613 evts/sec 14608 ctxt/sec 26.816 samples per call
> 58490 evts/sec 13593 ctxt/sec 29.1708 samples per call
> 59108 evts/sec 15078 ctxt/sec 31.5557 samples per call
> 59636 evts/sec 15053 ctxt/sec 33.9292 samples per call
> 59355 evts/sec 15531 ctxt/sec 36.2914 samples per call
> Avg: 58771 evts/sec
>
> The last sample shows that epoll overhead is very small indeed, since
> disabling it doesn't boost AF_INET perf at all.
> AF_INET + no epoll
> # ./epoll_bench -n 2000 -i -f
> 2000 handles setup
> 79939 evts/sec
> 78468 evts/sec 9989 ctxt/sec
> 73153 evts/sec 10207 ctxt/sec
> 73668 evts/sec 10163 ctxt/sec
> 73667 evts/sec 20084 ctxt/sec
> 74106 evts/sec 10068 ctxt/sec
> 73442 evts/sec 10119 ctxt/sec
> 74220 evts/sec 10122 ctxt/sec
> 74367 evts/sec 10097 ctxt/sec
> 64402 evts/sec 47873 ctxt/sec
> 53555 evts/sec 58733 ctxt/sec
> 46000 evts/sec 48984 ctxt/sec
> 67052 evts/sec 21006 ctxt/sec
> 68460 evts/sec 12344 ctxt/sec
> 67629 evts/sec 10655 ctxt/sec
> Avg: 69475 evts/sec
Without epoll, the number of events/sec is about 18% higher - 58k vs 69k.
--
Evgeniy Polyakov
On Mon, Nov 06, 2006 at 10:17:37PM +0100, Eric Dumazet ([email protected]) wrote:
> Evgeniy Polyakov wrote:
> >
> >If sockets support existed, then I could patch it to work with
> >kevents.
> >
>
> OK I post here my last version of epoll_bench.
My results with AF_INET are inlined.
Hardware: 2.4 GHz Xeon (1 CPU, HT enabled) with 1 GB of RAM.
[root@pcix event]# ./epoll_bench -n 2000 -i
2000 handles setup
49758 evts/sec 1.56177 samples per call
38999 evts/sec 95 ctxt/sec 2.77247 samples per call
54042 evts/sec 130 ctxt/sec 4.19909 samples per call
60155 evts/sec 188 ctxt/sec 5.38024 samples per call
59588 evts/sec 178 ctxt/sec 6.38112 samples per call
60023 evts/sec 188 ctxt/sec 7.19564 samples per call
59694 evts/sec 186 ctxt/sec 7.93067 samples per call
60182 evts/sec 190 ctxt/sec 8.52397 samples per call
59750 evts/sec 182 ctxt/sec 9.08015 samples per call
60158 evts/sec 192 ctxt/sec 9.53548 samples per call
59739 evts/sec 188 ctxt/sec 9.97013 samples per call
60054 evts/sec 216 ctxt/sec 10.32 samples per call
59820 evts/sec 206 ctxt/sec 10.6641 samples per call
60095 evts/sec 218 ctxt/sec 10.9289 samples per call
59376 evts/sec 158 ctxt/sec 11.3231 samples per call
Avg: 57428 evts/sec
[root@pcix event]# ./kevent_bench -n2000 -i
2000 handles setup
57960 evts/sec 0.276702 samples per call
59802 evts/sec 75 ctxt/sec 0.462737 samples per call
59864 evts/sec 71 ctxt/sec 0.623457 samples per call
59651 evts/sec 72 ctxt/sec 0.721579 samples per call
59504 evts/sec 84 ctxt/sec 0.804311 samples per call
61019 evts/sec 72 ctxt/sec 0.904817 samples per call
59846 evts/sec 72 ctxt/sec 0.949439 samples per call
60550 evts/sec 74 ctxt/sec 1.00416 samples per call
59421 evts/sec 66 ctxt/sec 1.04133 samples per call
60334 evts/sec 75 ctxt/sec 1.06845 samples per call
60000 evts/sec 67 ctxt/sec 1.09594 samples per call
59429 evts/sec 74 ctxt/sec 1.11404 samples per call
60508 evts/sec 77 ctxt/sec 1.14482 samples per call
59530 evts/sec 66 ctxt/sec 1.15454 samples per call
59506 evts/sec 73 ctxt/sec 1.17937 samples per call
Avg: 59794 evts/sec
[root@pcix event]# ./kevent_bench -n2000 -i -f
2000 handles setup
82893 evts/sec
88624 evts/sec 390 ctxt/sec
88751 evts/sec 475 ctxt/sec
88784 evts/sec 488 ctxt/sec
88918 evts/sec 458 ctxt/sec
88866 evts/sec 504 ctxt/sec
88950 evts/sec 458 ctxt/sec
88883 evts/sec 472 ctxt/sec
88915 evts/sec 404 ctxt/sec
88836 evts/sec 368 ctxt/sec
89065 evts/sec 442 ctxt/sec
88859 evts/sec 398 ctxt/sec
89070 evts/sec 446 ctxt/sec
88809 evts/sec 428 ctxt/sec
89012 evts/sec 542 ctxt/sec
Avg: 88482 evts/sec
epoll: 57428
kevent: 59794
max: 88482
BUT!
Kevent does not support an analogue of EPOLLET, i.e. the case when the
same event is reused; instead kevent must modify the existing one (i.e.
behave exactly like epoll without EPOLLET), so I modified epoll_bench to
work without EPOLLET, like kevent.
epoll with EPOLLET shows up to 71k events/sec.
The lack of such a feature is indeed a minus for kevent.
I will add it to the todo list behind the (implemented) new ring buffer
and the (implemented) wake-up-one-thread flag.
Hopefully I will include it in the next kevent release soon, but do not
expect it today/tomorrow; there are some problems unrelated to hacking.
--
Evgeniy Polyakov
Nate Diller wrote:
> Indecisiveness has certainly been an issue here, but I remember akpm
> and Ulrich both giving concrete suggestions. I was particularly
> interested in Andrew's request to explain and justify the differences
> between kevent and BSD's kqueue interface. Was there a discussion
> that I missed? I am very interested to see your work on this
> mechanism merged, because you've clearly emphasized performance and
> shown impressive results. But it seems like we lose out on a lot by
> throwing out all the applications that already use kqueue.
kqueue looks pretty nice, the filter/note models in particular. I don't
see anything about ring buffers though.
I also wonder about the asynchronous event side (send), not just the
event reception side.
Jeff
David Miller wrote:
> From: Pavel Machek <[email protected]>
> Date: Fri, 3 Nov 2006 09:57:12 +0100
>
>> Not sure what you are smoking, but "there's unsigned long in *bsd
>> version, let's rewrite it from scratch" sounds like a very bad idea. What
>> about fixing that one bit you don't like?
>
> I disagree, it's more like since we have to be structure incompatible
> anyways, let's design something superior if we can.
Definitely agreed.
Jeff
On Tue, Nov 07, 2006 at 12:18:43PM +0300, Evgeniy Polyakov ([email protected]) wrote:
> On Mon, Nov 06, 2006 at 10:17:37PM +0100, Eric Dumazet ([email protected]) wrote:
> > Evgeniy Polyakov wrote:
> > >
> > >If there would exist sockets support, then I could patch it to work with
> > >kevents.
> > >
> >
> > OK I post here my last version of epoll_bench.
>
> My results with AF_INET are inlined.
> Hardware: 2.4 GHz Xeon (1 CPU, HT enabled) with 1 GB of RAM.
>
> [root@pcix event]# ./epoll_bench -n 2000 -i
> 2000 handles setup
> 49758 evts/sec 1.56177 samples per call
> 38999 evts/sec 95 ctxt/sec 2.77247 samples per call
> 54042 evts/sec 130 ctxt/sec 4.19909 samples per call
> 60155 evts/sec 188 ctxt/sec 5.38024 samples per call
> 59588 evts/sec 178 ctxt/sec 6.38112 samples per call
> 60023 evts/sec 188 ctxt/sec 7.19564 samples per call
> 59694 evts/sec 186 ctxt/sec 7.93067 samples per call
> 60182 evts/sec 190 ctxt/sec 8.52397 samples per call
> 59750 evts/sec 182 ctxt/sec 9.08015 samples per call
> 60158 evts/sec 192 ctxt/sec 9.53548 samples per call
> 59739 evts/sec 188 ctxt/sec 9.97013 samples per call
> 60054 evts/sec 216 ctxt/sec 10.32 samples per call
> 59820 evts/sec 206 ctxt/sec 10.6641 samples per call
> 60095 evts/sec 218 ctxt/sec 10.9289 samples per call
> 59376 evts/sec 158 ctxt/sec 11.3231 samples per call
> Avg: 57428 evts/sec
> [root@pcix event]# ./kevent_bench -n2000 -i
> 2000 handles setup
> 57960 evts/sec 0.276702 samples per call
> 59802 evts/sec 75 ctxt/sec 0.462737 samples per call
> 59864 evts/sec 71 ctxt/sec 0.623457 samples per call
> 59651 evts/sec 72 ctxt/sec 0.721579 samples per call
> 59504 evts/sec 84 ctxt/sec 0.804311 samples per call
> 61019 evts/sec 72 ctxt/sec 0.904817 samples per call
> 59846 evts/sec 72 ctxt/sec 0.949439 samples per call
> 60550 evts/sec 74 ctxt/sec 1.00416 samples per call
> 59421 evts/sec 66 ctxt/sec 1.04133 samples per call
> 60334 evts/sec 75 ctxt/sec 1.06845 samples per call
> 60000 evts/sec 67 ctxt/sec 1.09594 samples per call
> 59429 evts/sec 74 ctxt/sec 1.11404 samples per call
> 60508 evts/sec 77 ctxt/sec 1.14482 samples per call
> 59530 evts/sec 66 ctxt/sec 1.15454 samples per call
> 59506 evts/sec 73 ctxt/sec 1.17937 samples per call
> Avg: 59794 evts/sec
> [root@pcix event]# ./kevent_bench -n2000 -i -f
> 2000 handles setup
> 82893 evts/sec
> 88624 evts/sec 390 ctxt/sec
> 88751 evts/sec 475 ctxt/sec
> 88784 evts/sec 488 ctxt/sec
> 88918 evts/sec 458 ctxt/sec
> 88866 evts/sec 504 ctxt/sec
> 88950 evts/sec 458 ctxt/sec
> 88883 evts/sec 472 ctxt/sec
> 88915 evts/sec 404 ctxt/sec
> 88836 evts/sec 368 ctxt/sec
> 89065 evts/sec 442 ctxt/sec
> 88859 evts/sec 398 ctxt/sec
> 89070 evts/sec 446 ctxt/sec
> 88809 evts/sec 428 ctxt/sec
> 89012 evts/sec 542 ctxt/sec
> Avg: 88482 evts/sec
>
> epoll: 57428
> kevent: 59794
> max: 88482
>
> BUT!
> Kevent does not support an analogue of EPOLLET, i.e. the case when the
> same event is reused; instead kevent must modify the existing one (i.e.
> behave exactly like epoll without EPOLLET), so I modified epoll_bench to
> work without EPOLLET, like kevent.
> epoll with EPOLLET shows up to 71k events/sec.
>
> The lack of such a feature is indeed a minus for kevent.
> I will add it to the todo list behind the (implemented) new ring buffer
> and the (implemented) wake-up-one-thread flag.
> Hopefully I will include it in the next kevent release soon, but do not
> expect it today or tomorrow; there are some problems unrelated to hacking.
Here is edge-triggered behavior of kevent:
[root@pcix event]# ./kevent_bench -n2000 -i
2000 handles setup
67057 evts/sec 1.18746 samples per call
79239 evts/sec 68 ctxt/sec 1.30531 samples per call
78877 evts/sec 140 ctxt/sec 1.34172 samples per call
79017 evts/sec 82 ctxt/sec 1.35835 samples per call
78957 evts/sec 115 ctxt/sec 1.36885 samples per call
79084 evts/sec 70 ctxt/sec 1.37419 samples per call
79083 evts/sec 98 ctxt/sec 1.38 samples per call
79083 evts/sec 72 ctxt/sec 1.38194 samples per call
79025 evts/sec 111 ctxt/sec 1.38426 samples per call
79139 evts/sec 78 ctxt/sec 1.38554 samples per call
79055 evts/sec 112 ctxt/sec 1.38701 samples per call
79118 evts/sec 72 ctxt/sec 1.39 samples per call
79040 evts/sec 94 ctxt/sec 1.39108 samples per call
79098 evts/sec 81 ctxt/sec 1.39136 samples per call
79104 evts/sec 90 ctxt/sec 1.39269 samples per call
Avg: 78265 evts/sec
So, kevent is faster than epoll.
This was demonstrated by three independent benchmarks (my
evserver_kevent.c, Johann Borck's web server, and Eric's epoll_bench).
I plan to release a new version with all the additional goodies today.
--
Evgeniy Polyakov
Kevent pipe benchmark (kevent_pipe, kernel kevent part):
epoll (edge-triggered): 248408 events/sec
kevent (edge-triggered): 269282 events/sec
Busy reading loop: 269519 events/sec
Kevent is definitely the winner here, with extremely small overhead.
I will add kevent_pipe to the next kevent release, which will be
available soon.
--
Evgeniy Polyakov