2006-09-20 09:10:39

by Evgeniy Polyakov

Subject: [take19 0/4] kevent: Generic event handling mechanism.


Generic event handling mechanism.

Consider for inclusion.

Changes from 'take18' patchset:
* use __init instead of __devinit
* removed 'default N' from config for user statistic
* removed kevent_user_fini() since kevent can not be unloaded
* use KERN_INFO for statistic output

Changes from 'take17' patchset:
* Use RB tree instead of hash table.
At least for a web server, the frequency of kevent addition/deletion is comparable
to the number of search accesses, i.e. most of the time events
are added, accessed only a couple of times and then removed, which justifies
RB tree usage over an AVL tree, since the latter has much slower deletion
(up to O(log(N)) rotations compared to at most 3),
although faster search (1.44*log(N) vs. 2*log(N) comparisons).
So for kevents I use an RB tree for now; later, when my AVL tree implementation
is ready, it will be possible to compare them. A minimal insertion sketch is shown below.
* Changed readiness check for socket notifications.

With both changes above it is possible to achieve more than 3380 req/second, compared to 2200,
sometimes 2500, req/second for epoll(), using a trivial web server and an httperf client on the same hardware.
The kevent ceiling above may be due to the maximum number of kevents allowed at a time, which is 4096 events.
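
For illustration, a minimal sketch (not part of the patch; the helper name
kevent_user_tree_insert is hypothetical) of how a kevent can be linked into
the per-user RB tree with the kernel rbtree API, using the kevent_node,
kevent_root and kevent_compare_id names from patch 1/4:

#include <linux/rbtree.h>

static int kevent_user_tree_insert(struct kevent_user *u, struct kevent *k)
{
	struct rb_node **p = &u->kevent_root.rb_node, *parent = NULL;

	while (*p) {
		struct kevent *cur = rb_entry(*p, struct kevent, kevent_node);
		int cmp = kevent_compare_id(&k->event.id, &cur->event.id);

		parent = *p;
		if (cmp < 0)
			p = &(*p)->rb_left;
		else if (cmp > 0)
			p = &(*p)->rb_right;
		else
			return -EEXIST;	/* an event with this id is already queued */
	}
	rb_link_node(&k->kevent_node, parent, p);
	rb_insert_color(&k->kevent_node, &u->kevent_root);
	return 0;
}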

Changes from 'take16' patchset:
* misc cleanups (__read_mostly, const ...)
* created special macro which is used for mmap size (number of pages) calculation
* export kevent_socket_notify(), since it is used in network protocols which can be
built as modules (IPv6 for example)

Changes from 'take15' patchset:
* converted kevent_timer to high-resolution timers; this forces a timer API update at
http://linux-net.osdl.org/index.php/Kevent
* use struct ukevent* instead of void * in syscalls (documentation has been updated)
* added warning in kevent_add_ukevent() if ring has broken index (for testing)

Changes from 'take14' patchset:
* added kevent_wait()
This syscall waits until either the timeout expires or at least one event
becomes ready. It also commits that @num events from @start have been processed
by userspace and thus can be removed or rearmed (depending on their flags).
It can be used to commit events read by userspace through the mmap interface.
Example userspace code (evtest.c) can be found on the project's homepage; a
minimal sketch is shown after this list.
* added socket notifications (send/recv/accept)
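
A hedged userspace sketch of the mmap + kevent_wait() flow described above
(assumptions: the wrapper below, a single ring page, no error handling;
ring layout and syscall numbers as in patch 1/4; the real example is
evtest.c on the homepage):

#include <sys/mman.h>
#include <sys/syscall.h>
#include <unistd.h>
#include <linux/ukevent.h>

static long kevent_wait(int fd, unsigned int start, unsigned int num, __u64 timeout_ns)
{
	return syscall(__NR_kevent_wait, fd, start, num, timeout_ns);
}

void wait_and_commit_one(int kevent_fd)
{
	struct kevent_mring *ring;

	/* Map the first page of the ready-event ring (read-only). */
	ring = mmap(NULL, 4096, PROT_READ, MAP_SHARED, kevent_fd, 0);

	/* Sleep until at least one event is ready (timeout in nanoseconds). */
	kevent_wait(kevent_fd, 0, 0, 1000000000ULL);

	/* ring->index is the next free slot; process entries before it. */
	/* ... read ring->event[...] (id and ret_flags) here ... */

	/* Commit one processed event from slot 0 so the kernel can
	 * remove it or rearm it, depending on its flags. */
	kevent_wait(kevent_fd, 0, 1, 0);
}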

Changes from 'take13' patchset:
* do not take the lock around the user data check in __kevent_search()
* fail early if there were no registered callbacks for given type of kevent
* trailing whitespace cleanup

Changes from 'take12' patchset:
* remove non-chardev interface for initialization
* use pointer to kevent_mring instead of unsigned longs
* use aligned 64bit type in raw user data (can be used by high-res timer if needed)
* simplified enqueue/dequeue callbacks and kevent initialization
* use nanoseconds for timeout
* put number of milliseconds into timer's return data
* move some definitions into user-visible header
* removed filenames from comments

Changes from 'take11' patchset:
* include missing headers into patchset
* some trivial code cleanups (use goto instead of if/else games and so on)
* some whitespace cleanups
* check for the ready_callback() callback before the main loop, which should save us some ticks

Changes from 'take10' patchset:
* removed non-existent prototypes
* added a helper function for kevent_registered_callbacks
* fixed comments exceeding the 80-column limit
* added a header shared between userspace and kernelspace instead of embedding the definitions in one
* core restructuring to remove forward declarations
* some whitespace/coding-style cleanups
* use vm_insert_page() instead of remap_pfn_range()

Changes from 'take9' patchset:
* fixed ->nopage method

Changes from 'take8' patchset:
* fixed mmap release bug
* use module_init() instead of late_initcall()
* use better structures for timer notifications

Changes from 'take7' patchset:
* new mmap interface (not tested, waiting for other changes to be acked)
- use the nopage() method to dynamically substitute pages
- allocate a new page for events only when a newly added kevent requires it
- do not use ugly index dereferencing, use a structure instead
- reduced the amount of data in the ring (id and flags);
maximum 12 pages on x86 per kevent fd

Changes from 'take6' patchset:
* a lot of comments!
* do not use list poisoning to detect whether an entry is in the list
* return the number of ready kevents even if copy*user() fails
* strict check of the number of kevents in the syscall
* use ARRAY_SIZE for array size calculation
* changed superblock magic number
* use SLAB_PANIC instead of direct panic() call
* changed -E* return values
* a lot of small cleanups and indent fixes

Changes from 'take5' patchset:
* removed compilation warnings about unused variables when lockdep is not turned on
* do not use internal socket structures, use appropriate (exported) wrappers instead
* removed default 1 second timeout
* removed AIO stuff from patchset

Changes from 'take4' patchset:
* use miscdevice instead of chardevice
* comments fixes

Changes from 'take3' patchset:
* removed serializing mutex from kevent_user_wait()
* moved storage list processing to RCU
* silenced lockdep warnings: all storage locks are initialized in the same function, so lockdep was taught
to differentiate between the various cases
* remove kevent from storage if it is marked as broken after the callback
* fixed a typo in the mmapped buffer implementation which would result in wrong index calculation

Changes from 'take2' patchset:
* split kevent_finish_user() to locked and unlocked variants
* do not use KEVENT_STAT ifdefs, use inline functions instead
* use an array of callbacks for each type instead of per-kevent callback initialization
* changed the name of the ukevent guarding lock
* use only one kevent lock in kevent_user for all hash buckets instead of per-bucket locks
* do not use the kevent_user_ctl structure; instead provide the needed arguments as syscall parameters
* various indent cleanups
* added an optimisation aimed at the case when a lot of kevents are copied from userspace
* mapped buffer (initial) implementation (no userspace yet)

Changes from 'take1' patchset:
- rebased against 2.6.18-git tree
- removed ioctl controlling
- added new syscall kevent_get_events(int fd, unsigned int min_nr, unsigned int max_nr,
unsigned int timeout, void __user *buf, unsigned flags)
- use the old syscall kevent_ctl for creation/removal, modification and initial kevent
initialization
- use mutexes instead of semaphores
- added a file descriptor check which returns an error if the provided descriptor does not match
the kevent file operations
- various indent fixes
- removed aio_sendfile() declarations.

Thank you.

Signed-off-by: Evgeniy Polyakov <[email protected]>



2006-09-20 09:10:36

by Evgeniy Polyakov

Subject: [take19 2/4] kevent: poll/select() notifications.


poll/select() notifications.

This patch includes generic poll/select notifications.

kevent_poll works similarly to epoll and has the same issues (the callback
is invoked not from the caller's internal state machine but through a
process wakeup, a lot of allocations, and so on).
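
As a rough usage sketch (not from the patch): registering a one-shot
poll-style readiness event through kevent_ctl(), with the ukevent layout and
constants from patch 1/4; the wrapper and descriptor names are assumptions:

#include <string.h>
#include <sys/syscall.h>
#include <unistd.h>
#include <linux/ukevent.h>

static long kevent_ctl(int fd, unsigned int cmd, unsigned int num, struct ukevent *buf)
{
	return syscall(__NR_kevent_ctl, fd, cmd, num, buf);
}

int add_poll_event(int kevent_fd, int target_fd)
{
	struct ukevent uk;

	memset(&uk, 0, sizeof(uk));
	uk.id.raw[0] = target_fd;		/* kevent_poll_enqueue() does fget() on this */
	uk.type = KEVENT_POLL;
	uk.event = KEVENT_POLL_POLLIN;
	uk.req_flags = KEVENT_REQ_ONESHOT;	/* dequeue after first firing */

	return kevent_ctl(kevent_fd, KEVENT_CTL_ADD, 1, &uk);
}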

Signed-off-by: Evgeniy Polyakov <[email protected]>

diff --git a/include/linux/fs.h b/include/linux/fs.h
index 2561020..a697930 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -236,6 +236,7 @@ #include <linux/prio_tree.h>
#include <linux/init.h>
#include <linux/sched.h>
#include <linux/mutex.h>
+#include <linux/kevent.h>

#include <asm/atomic.h>
#include <asm/semaphore.h>
@@ -546,6 +547,10 @@ #ifdef CONFIG_INOTIFY
struct mutex inotify_mutex; /* protects the watches list */
#endif

+#ifdef CONFIG_KEVENT_SOCKET
+ struct kevent_storage st;
+#endif
+
unsigned long i_state;
unsigned long dirtied_when; /* jiffies of first dirtying */

@@ -698,6 +703,9 @@ #ifdef CONFIG_EPOLL
struct list_head f_ep_links;
spinlock_t f_ep_lock;
#endif /* #ifdef CONFIG_EPOLL */
+#ifdef CONFIG_KEVENT_POLL
+ struct kevent_storage st;
+#endif
struct address_space *f_mapping;
};
extern spinlock_t files_lock;
diff --git a/kernel/kevent/kevent_poll.c b/kernel/kevent/kevent_poll.c
new file mode 100644
index 0000000..fb74e0f
--- /dev/null
+++ b/kernel/kevent/kevent_poll.c
@@ -0,0 +1,222 @@
+/*
+ * 2006 Copyright (c) Evgeniy Polyakov <[email protected]>
+ * All rights reserved.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ */
+
+#include <linux/kernel.h>
+#include <linux/types.h>
+#include <linux/list.h>
+#include <linux/slab.h>
+#include <linux/spinlock.h>
+#include <linux/timer.h>
+#include <linux/file.h>
+#include <linux/kevent.h>
+#include <linux/poll.h>
+#include <linux/fs.h>
+
+static kmem_cache_t *kevent_poll_container_cache;
+static kmem_cache_t *kevent_poll_priv_cache;
+
+struct kevent_poll_ctl
+{
+ struct poll_table_struct pt;
+ struct kevent *k;
+};
+
+struct kevent_poll_wait_container
+{
+ struct list_head container_entry;
+ wait_queue_head_t *whead;
+ wait_queue_t wait;
+ struct kevent *k;
+};
+
+struct kevent_poll_private
+{
+ struct list_head container_list;
+ spinlock_t container_lock;
+};
+
+static int kevent_poll_enqueue(struct kevent *k);
+static int kevent_poll_dequeue(struct kevent *k);
+static int kevent_poll_callback(struct kevent *k);
+
+static int kevent_poll_wait_callback(wait_queue_t *wait,
+ unsigned mode, int sync, void *key)
+{
+ struct kevent_poll_wait_container *cont =
+ container_of(wait, struct kevent_poll_wait_container, wait);
+ struct kevent *k = cont->k;
+ struct file *file = k->st->origin;
+ u32 revents;
+
+ revents = file->f_op->poll(file, NULL);
+
+ kevent_storage_ready(k->st, NULL, revents);
+
+ return 0;
+}
+
+static void kevent_poll_qproc(struct file *file, wait_queue_head_t *whead,
+ struct poll_table_struct *poll_table)
+{
+ struct kevent *k =
+ container_of(poll_table, struct kevent_poll_ctl, pt)->k;
+ struct kevent_poll_private *priv = k->priv;
+ struct kevent_poll_wait_container *cont;
+ unsigned long flags;
+
+ cont = kmem_cache_alloc(kevent_poll_container_cache, SLAB_KERNEL);
+ if (!cont) {
+ kevent_break(k);
+ return;
+ }
+
+ cont->k = k;
+ init_waitqueue_func_entry(&cont->wait, kevent_poll_wait_callback);
+ cont->whead = whead;
+
+ spin_lock_irqsave(&priv->container_lock, flags);
+ list_add_tail(&cont->container_entry, &priv->container_list);
+ spin_unlock_irqrestore(&priv->container_lock, flags);
+
+ add_wait_queue(whead, &cont->wait);
+}
+
+static int kevent_poll_enqueue(struct kevent *k)
+{
+ struct file *file;
+ int err, ready = 0;
+ unsigned int revents;
+ struct kevent_poll_ctl ctl;
+ struct kevent_poll_private *priv;
+
+ file = fget(k->event.id.raw[0]);
+ if (!file)
+ return -ENODEV;
+
+ err = -EINVAL;
+ if (!file->f_op || !file->f_op->poll)
+ goto err_out_fput;
+
+ err = -ENOMEM;
+ priv = kmem_cache_alloc(kevent_poll_priv_cache, SLAB_KERNEL);
+ if (!priv)
+ goto err_out_fput;
+
+ spin_lock_init(&priv->container_lock);
+ INIT_LIST_HEAD(&priv->container_list);
+
+ k->priv = priv;
+
+ ctl.k = k;
+ init_poll_funcptr(&ctl.pt, &kevent_poll_qproc);
+
+ err = kevent_storage_enqueue(&file->st, k);
+ if (err)
+ goto err_out_free;
+
+ revents = file->f_op->poll(file, &ctl.pt);
+ if (revents & k->event.event) {
+ ready = 1;
+ kevent_poll_dequeue(k);
+ }
+
+ return ready;
+
+err_out_free:
+ kmem_cache_free(kevent_poll_priv_cache, priv);
+err_out_fput:
+ fput(file);
+ return err;
+}
+
+static int kevent_poll_dequeue(struct kevent *k)
+{
+ struct file *file = k->st->origin;
+ struct kevent_poll_private *priv = k->priv;
+ struct kevent_poll_wait_container *w, *n;
+ unsigned long flags;
+
+ kevent_storage_dequeue(k->st, k);
+
+ spin_lock_irqsave(&priv->container_lock, flags);
+ list_for_each_entry_safe(w, n, &priv->container_list, container_entry) {
+ list_del(&w->container_entry);
+ remove_wait_queue(w->whead, &w->wait);
+ kmem_cache_free(kevent_poll_container_cache, w);
+ }
+ spin_unlock_irqrestore(&priv->container_lock, flags);
+
+ kmem_cache_free(kevent_poll_priv_cache, priv);
+ k->priv = NULL;
+
+ fput(file);
+
+ return 0;
+}
+
+static int kevent_poll_callback(struct kevent *k)
+{
+ struct file *file = k->st->origin;
+ unsigned int revents = file->f_op->poll(file, NULL);
+
+ k->event.ret_data[0] = revents & k->event.event;
+
+ return (revents & k->event.event);
+}
+
+static int __init kevent_poll_sys_init(void)
+{
+ struct kevent_callbacks pc = {
+ .callback = &kevent_poll_callback,
+ .enqueue = &kevent_poll_enqueue,
+ .dequeue = &kevent_poll_dequeue};
+
+ kevent_poll_container_cache = kmem_cache_create("kevent_poll_container_cache",
+ sizeof(struct kevent_poll_wait_container), 0, 0, NULL, NULL);
+ if (!kevent_poll_container_cache) {
+ printk(KERN_ERR "Failed to create kevent poll container cache.\n");
+ return -ENOMEM;
+ }
+
+ kevent_poll_priv_cache = kmem_cache_create("kevent_poll_priv_cache",
+ sizeof(struct kevent_poll_private), 0, 0, NULL, NULL);
+ if (!kevent_poll_priv_cache) {
+ printk(KERN_ERR "Failed to create kevent poll private data cache.\n");
+ kmem_cache_destroy(kevent_poll_container_cache);
+ kevent_poll_container_cache = NULL;
+ return -ENOMEM;
+ }
+
+ kevent_add_callbacks(&pc, KEVENT_POLL);
+
+ printk(KERN_INFO "Kevent poll()/select() subsystem has been initialized.\n");
+ return 0;
+}
+
+static struct lock_class_key kevent_poll_key;
+
+void kevent_poll_reinit(struct file *file)
+{
+ lockdep_set_class(&file->st.lock, &kevent_poll_key);
+}
+
+static void __exit kevent_poll_sys_fini(void)
+{
+ kmem_cache_destroy(kevent_poll_priv_cache);
+ kmem_cache_destroy(kevent_poll_container_cache);
+}
+
+module_init(kevent_poll_sys_init);
+module_exit(kevent_poll_sys_fini);

2006-09-20 09:11:07

by Evgeniy Polyakov

Subject: [take19 4/4] kevent: Timer notifications.


Timer notifications.

Timer notifications can be used for fine-grained per-process time
management, since interval timers are very inconvenient to use
and limited.
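
A hedged sketch of arming such a timer from userspace, reusing the
kevent_ctl() wrapper assumed in the poll sketch above; kevent_timer_enqueue()
below reads id.raw[0] as seconds and id.raw[1] as nanoseconds and rearms the
timer with the same period:

	struct ukevent uk;

	memset(&uk, 0, sizeof(uk));
	uk.id.raw[0] = 0;			/* seconds */
	uk.id.raw[1] = 500 * 1000 * 1000;	/* nanoseconds: fire every 500 ms */
	uk.type = KEVENT_TIMER;
	uk.event = KEVENT_TIMER_FIRED;
	kevent_ctl(kevent_fd, KEVENT_CTL_ADD, 1, &uk);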

Signed-off-by: Evgeniy Polyakov <[email protected]>

diff --git a/kernel/kevent/kevent_timer.c b/kernel/kevent/kevent_timer.c
new file mode 100644
index 0000000..04acc46
--- /dev/null
+++ b/kernel/kevent/kevent_timer.c
@@ -0,0 +1,113 @@
+/*
+ * 2006 Copyright (c) Evgeniy Polyakov <[email protected]>
+ * All rights reserved.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software
+ * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA
+ */
+
+#include <linux/kernel.h>
+#include <linux/types.h>
+#include <linux/list.h>
+#include <linux/slab.h>
+#include <linux/spinlock.h>
+#include <linux/hrtimer.h>
+#include <linux/jiffies.h>
+#include <linux/kevent.h>
+
+struct kevent_timer
+{
+ struct hrtimer ktimer;
+ struct kevent_storage ktimer_storage;
+ struct kevent *ktimer_event;
+};
+
+static int kevent_timer_func(struct hrtimer *timer)
+{
+ struct kevent_timer *t = container_of(timer, struct kevent_timer, ktimer);
+ struct kevent *k = t->ktimer_event;
+
+ kevent_storage_ready(&t->ktimer_storage, NULL, KEVENT_MASK_ALL);
+ hrtimer_forward(timer, timer->base->softirq_time,
+ ktime_set(k->event.id.raw[0], k->event.id.raw[1]));
+ return HRTIMER_RESTART;
+}
+
+static struct lock_class_key kevent_timer_key;
+
+static int kevent_timer_enqueue(struct kevent *k)
+{
+ int err;
+ struct kevent_timer *t;
+
+ t = kmalloc(sizeof(struct kevent_timer), GFP_KERNEL);
+ if (!t)
+ return -ENOMEM;
+
+ hrtimer_init(&t->ktimer, CLOCK_MONOTONIC, HRTIMER_REL);
+ t->ktimer.expires = ktime_set(k->event.id.raw[0], k->event.id.raw[1]);
+ t->ktimer.function = kevent_timer_func;
+ t->ktimer_event = k;
+
+ err = kevent_storage_init(&t->ktimer, &t->ktimer_storage);
+ if (err)
+ goto err_out_free;
+ lockdep_set_class(&t->ktimer_storage.lock, &kevent_timer_key);
+
+ err = kevent_storage_enqueue(&t->ktimer_storage, k);
+ if (err)
+ goto err_out_st_fini;
+
+ printk("%s: jiffies: %lu, timer: %p.\n", __func__, jiffies, &t->ktimer);
+ hrtimer_start(&t->ktimer, t->ktimer.expires, HRTIMER_REL);
+
+ return 0;
+
+err_out_st_fini:
+ kevent_storage_fini(&t->ktimer_storage);
+err_out_free:
+ kfree(t);
+
+ return err;
+}
+
+static int kevent_timer_dequeue(struct kevent *k)
+{
+ struct kevent_storage *st = k->st;
+ struct kevent_timer *t = container_of(st, struct kevent_timer, ktimer_storage);
+
+ hrtimer_cancel(&t->ktimer);
+ kevent_storage_dequeue(st, k);
+ kfree(t);
+
+ return 0;
+}
+
+static int kevent_timer_callback(struct kevent *k)
+{
+ k->event.ret_data[0] = jiffies_to_msecs(jiffies);
+ return 1;
+}
+
+static int __init kevent_init_timer(void)
+{
+ struct kevent_callbacks tc = {
+ .callback = &kevent_timer_callback,
+ .enqueue = &kevent_timer_enqueue,
+ .dequeue = &kevent_timer_dequeue};
+
+ return kevent_add_callbacks(&tc, KEVENT_TIMER);
+}
+module_init(kevent_init_timer);
+

2006-09-20 09:11:12

by Evgeniy Polyakov

Subject: [take19 3/4] kevent: Socket notifications.


Socket notifications.

This patch includes socket send/recv/accept notifications.
Using a trivial web server based on kevent with these features
instead of epoll, its performance increased more than noticeably.
More details about the benchmark and the server itself (evserver_kevent.c)
can be found on the project's homepage.
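
A hedged sketch of subscribing to accept readiness on a listening socket
(kevent_ctl() wrapper as in the earlier poll sketch; kevent_socket_enqueue()
below resolves id.raw[0] via sockfd_lookup()):

	struct ukevent uk;

	memset(&uk, 0, sizeof(uk));
	uk.id.raw[0] = listen_fd;	/* listening socket descriptor */
	uk.type = KEVENT_SOCKET;
	uk.event = KEVENT_SOCKET_ACCEPT;
	kevent_ctl(kevent_fd, KEVENT_CTL_ADD, 1, &uk);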

Signed-off-by: Evgeniy Polyakov <[email protected]>

diff --git a/fs/inode.c b/fs/inode.c
index 0bf9f04..181521d 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -21,6 +21,7 @@ #include <linux/pagemap.h>
#include <linux/cdev.h>
#include <linux/bootmem.h>
#include <linux/inotify.h>
+#include <linux/kevent.h>
#include <linux/mount.h>

/*
@@ -165,12 +166,18 @@ #endif
}
memset(&inode->u, 0, sizeof(inode->u));
inode->i_mapping = mapping;
+#if defined CONFIG_KEVENT_SOCKET
+ kevent_storage_init(inode, &inode->st);
+#endif
}
return inode;
}

void destroy_inode(struct inode *inode)
{
+#if defined CONFIG_KEVENT_SOCKET
+ kevent_storage_fini(&inode->st);
+#endif
BUG_ON(inode_has_buffers(inode));
security_inode_free(inode);
if (inode->i_sb->s_op->destroy_inode)
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 2561020..a697930 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -236,6 +236,7 @@ #include <linux/prio_tree.h>
#include <linux/init.h>
#include <linux/sched.h>
#include <linux/mutex.h>
+#include <linux/kevent.h>

#include <asm/atomic.h>
#include <asm/semaphore.h>
@@ -546,6 +547,10 @@ #ifdef CONFIG_INOTIFY
struct mutex inotify_mutex; /* protects the watches list */
#endif

+#ifdef CONFIG_KEVENT_SOCKET
+ struct kevent_storage st;
+#endif
+
unsigned long i_state;
unsigned long dirtied_when; /* jiffies of first dirtying */

@@ -698,6 +703,9 @@ #ifdef CONFIG_EPOLL
struct list_head f_ep_links;
spinlock_t f_ep_lock;
#endif /* #ifdef CONFIG_EPOLL */
+#ifdef CONFIG_KEVENT_POLL
+ struct kevent_storage st;
+#endif
struct address_space *f_mapping;
};
extern spinlock_t files_lock;
diff --git a/include/net/sock.h b/include/net/sock.h
index 324b3ea..5d71ed7 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -48,6 +48,7 @@ #include <linux/lockdep.h>
#include <linux/netdevice.h>
#include <linux/skbuff.h> /* struct sk_buff */
#include <linux/security.h>
+#include <linux/kevent.h>

#include <linux/filter.h>

@@ -450,6 +451,21 @@ static inline int sk_stream_memory_free(

extern void sk_stream_rfree(struct sk_buff *skb);

+struct socket_alloc {
+ struct socket socket;
+ struct inode vfs_inode;
+};
+
+static inline struct socket *SOCKET_I(struct inode *inode)
+{
+ return &container_of(inode, struct socket_alloc, vfs_inode)->socket;
+}
+
+static inline struct inode *SOCK_INODE(struct socket *socket)
+{
+ return &container_of(socket, struct socket_alloc, socket)->vfs_inode;
+}
+
static inline void sk_stream_set_owner_r(struct sk_buff *skb, struct sock *sk)
{
skb->sk = sk;
@@ -477,6 +493,7 @@ static inline void sk_add_backlog(struct
sk->sk_backlog.tail = skb;
}
skb->next = NULL;
+ kevent_socket_notify(sk, KEVENT_SOCKET_RECV);
}

#define sk_wait_event(__sk, __timeo, __condition) \
@@ -679,21 +696,6 @@ static inline struct kiocb *siocb_to_kio
return si->kiocb;
}

-struct socket_alloc {
- struct socket socket;
- struct inode vfs_inode;
-};
-
-static inline struct socket *SOCKET_I(struct inode *inode)
-{
- return &container_of(inode, struct socket_alloc, vfs_inode)->socket;
-}
-
-static inline struct inode *SOCK_INODE(struct socket *socket)
-{
- return &container_of(socket, struct socket_alloc, socket)->vfs_inode;
-}
-
extern void __sk_stream_mem_reclaim(struct sock *sk);
extern int sk_stream_mem_schedule(struct sock *sk, int size, int kind);

diff --git a/include/net/tcp.h b/include/net/tcp.h
index 7a093d0..69f4ad2 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -857,6 +857,7 @@ static inline int tcp_prequeue(struct so
tp->ucopy.memory = 0;
} else if (skb_queue_len(&tp->ucopy.prequeue) == 1) {
wake_up_interruptible(sk->sk_sleep);
+ kevent_socket_notify(sk, KEVENT_SOCKET_RECV|KEVENT_SOCKET_SEND);
if (!inet_csk_ack_scheduled(sk))
inet_csk_reset_xmit_timer(sk, ICSK_TIME_DACK,
(3 * TCP_RTO_MIN) / 4,
diff --git a/kernel/kevent/kevent_socket.c b/kernel/kevent/kevent_socket.c
new file mode 100644
index 0000000..1ddd2a1
--- /dev/null
+++ b/kernel/kevent/kevent_socket.c
@@ -0,0 +1,126 @@
+/*
+ * kevent_socket.c
+ *
+ * 2006 Copyright (c) Evgeniy Polyakov <[email protected]>
+ * All rights reserved.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software
+ * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA
+ */
+
+#include <linux/kernel.h>
+#include <linux/types.h>
+#include <linux/list.h>
+#include <linux/slab.h>
+#include <linux/spinlock.h>
+#include <linux/timer.h>
+#include <linux/file.h>
+#include <linux/tcp.h>
+#include <linux/kevent.h>
+
+#include <net/sock.h>
+#include <net/request_sock.h>
+#include <net/inet_connection_sock.h>
+
+static int kevent_socket_callback(struct kevent *k)
+{
+ struct inode *inode = k->st->origin;
+ return SOCKET_I(inode)->ops->poll(SOCKET_I(inode)->file, SOCKET_I(inode), NULL);
+}
+
+int kevent_socket_enqueue(struct kevent *k)
+{
+ struct inode *inode;
+ struct socket *sock;
+ int err = -ENODEV;
+
+ sock = sockfd_lookup(k->event.id.raw[0], &err);
+ if (!sock)
+ goto err_out_exit;
+
+ inode = igrab(SOCK_INODE(sock));
+ if (!inode)
+ goto err_out_fput;
+
+ err = kevent_storage_enqueue(&inode->st, k);
+ if (err)
+ goto err_out_iput;
+
+ err = k->callbacks.callback(k);
+ if (err)
+ goto err_out_dequeue;
+
+ sockfd_put(sock);
+ return err;
+
+err_out_dequeue:
+ kevent_storage_dequeue(k->st, k);
+err_out_iput:
+ iput(inode);
+err_out_fput:
+ sockfd_put(sock);
+err_out_exit:
+ return err;
+}
+
+int kevent_socket_dequeue(struct kevent *k)
+{
+ struct inode *inode = k->st->origin;
+
+ kevent_storage_dequeue(k->st, k);
+ iput(inode);
+
+ return 0;
+}
+
+void kevent_socket_notify(struct sock *sk, u32 event)
+{
+ if (sk->sk_socket)
+ kevent_storage_ready(&SOCK_INODE(sk->sk_socket)->st, NULL, event);
+}
+
+/*
+ * It is required for network protocols compiled as modules, like IPv6.
+ */
+EXPORT_SYMBOL_GPL(kevent_socket_notify);
+
+#ifdef CONFIG_LOCKDEP
+static struct lock_class_key kevent_sock_key;
+
+void kevent_socket_reinit(struct socket *sock)
+{
+ struct inode *inode = SOCK_INODE(sock);
+
+ lockdep_set_class(&inode->st.lock, &kevent_sock_key);
+}
+
+void kevent_sk_reinit(struct sock *sk)
+{
+ if (sk->sk_socket) {
+ struct inode *inode = SOCK_INODE(sk->sk_socket);
+
+ lockdep_set_class(&inode->st.lock, &kevent_sock_key);
+ }
+}
+#endif
+static int __init kevent_init_socket(void)
+{
+ struct kevent_callbacks sc = {
+ .callback = &kevent_socket_callback,
+ .enqueue = &kevent_socket_enqueue,
+ .dequeue = &kevent_socket_dequeue};
+
+ return kevent_add_callbacks(&sc, KEVENT_SOCKET);
+}
+module_init(kevent_init_socket);
diff --git a/net/core/sock.c b/net/core/sock.c
index 51fcfbc..4f91615 100644
--- a/net/core/sock.c
+++ b/net/core/sock.c
@@ -1406,6 +1406,7 @@ static void sock_def_wakeup(struct sock
if (sk->sk_sleep && waitqueue_active(sk->sk_sleep))
wake_up_interruptible_all(sk->sk_sleep);
read_unlock(&sk->sk_callback_lock);
+ kevent_socket_notify(sk, KEVENT_SOCKET_RECV|KEVENT_SOCKET_SEND);
}

static void sock_def_error_report(struct sock *sk)
@@ -1415,6 +1416,7 @@ static void sock_def_error_report(struct
wake_up_interruptible(sk->sk_sleep);
sk_wake_async(sk,0,POLL_ERR);
read_unlock(&sk->sk_callback_lock);
+ kevent_socket_notify(sk, KEVENT_SOCKET_RECV|KEVENT_SOCKET_SEND);
}

static void sock_def_readable(struct sock *sk, int len)
@@ -1424,6 +1426,7 @@ static void sock_def_readable(struct soc
wake_up_interruptible(sk->sk_sleep);
sk_wake_async(sk,1,POLL_IN);
read_unlock(&sk->sk_callback_lock);
+ kevent_socket_notify(sk, KEVENT_SOCKET_RECV|KEVENT_SOCKET_SEND);
}

static void sock_def_write_space(struct sock *sk)
@@ -1443,6 +1446,7 @@ static void sock_def_write_space(struct
}

read_unlock(&sk->sk_callback_lock);
+ kevent_socket_notify(sk, KEVENT_SOCKET_SEND|KEVENT_SOCKET_RECV);
}

static void sock_def_destruct(struct sock *sk)
@@ -1493,6 +1497,8 @@ #endif
sk->sk_state = TCP_CLOSE;
sk->sk_socket = sock;

+ kevent_sk_reinit(sk);
+
sock_set_flag(sk, SOCK_ZAPPED);

if(sock)
@@ -1559,8 +1565,10 @@ void fastcall release_sock(struct sock *
if (sk->sk_backlog.tail)
__release_sock(sk);
sk->sk_lock.owner = NULL;
- if (waitqueue_active(&sk->sk_lock.wq))
+ if (waitqueue_active(&sk->sk_lock.wq)) {
wake_up(&sk->sk_lock.wq);
+ kevent_socket_notify(sk, KEVENT_SOCKET_RECV|KEVENT_SOCKET_SEND);
+ }
spin_unlock_bh(&sk->sk_lock.slock);
}
EXPORT_SYMBOL(release_sock);
diff --git a/net/core/stream.c b/net/core/stream.c
index d1d7dec..2878c2a 100644
--- a/net/core/stream.c
+++ b/net/core/stream.c
@@ -36,6 +36,7 @@ void sk_stream_write_space(struct sock *
wake_up_interruptible(sk->sk_sleep);
if (sock->fasync_list && !(sk->sk_shutdown & SEND_SHUTDOWN))
sock_wake_async(sock, 2, POLL_OUT);
+ kevent_socket_notify(sk, KEVENT_SOCKET_SEND|KEVENT_SOCKET_RECV);
}
}

diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index 104af5d..14cee12 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -3112,6 +3112,7 @@ static void tcp_ofo_queue(struct sock *s

__skb_unlink(skb, &tp->out_of_order_queue);
__skb_queue_tail(&sk->sk_receive_queue, skb);
+ kevent_socket_notify(sk, KEVENT_SOCKET_RECV);
tp->rcv_nxt = TCP_SKB_CB(skb)->end_seq;
if(skb->h.th->fin)
tcp_fin(skb, sk, skb->h.th);
diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
index 4b04c3e..cda1500 100644
--- a/net/ipv4/tcp_ipv4.c
+++ b/net/ipv4/tcp_ipv4.c
@@ -61,6 +61,7 @@ #include <linux/cache.h>
#include <linux/jhash.h>
#include <linux/init.h>
#include <linux/times.h>
+#include <linux/kevent.h>

#include <net/icmp.h>
#include <net/inet_hashtables.h>
@@ -867,6 +868,7 @@ #endif
reqsk_free(req);
} else {
inet_csk_reqsk_queue_hash_add(sk, req, TCP_TIMEOUT_INIT);
+ kevent_socket_notify(sk, KEVENT_SOCKET_ACCEPT);
}
return 0;

diff --git a/net/socket.c b/net/socket.c
index b4848ce..42e19e2 100644
--- a/net/socket.c
+++ b/net/socket.c
@@ -85,6 +85,7 @@ #include <linux/compat.h>
#include <linux/kmod.h>
#include <linux/audit.h>
#include <linux/wireless.h>
+#include <linux/kevent.h>

#include <asm/uaccess.h>
#include <asm/unistd.h>
@@ -526,6 +527,8 @@ static struct socket *sock_alloc(void)
inode->i_uid = current->fsuid;
inode->i_gid = current->fsgid;

+ kevent_socket_reinit(sock);
+
get_cpu_var(sockets_in_use)++;
put_cpu_var(sockets_in_use);
return sock;

2006-09-20 09:11:50

by Evgeniy Polyakov

Subject: [take19 1/4] kevent: Core files.


Core files.

This patch includes core kevent files:
- userspace controlling
- kernelspace interfaces
- initialization
- notification state machines
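
As a rough end-to-end sketch of the userspace controlling interface
(assumptions: the /dev/kevent node implied by the miscdevice registered as
"kevent", and direct syscall() invocation; syscall signatures as declared in
include/linux/syscalls.h below):

#include <fcntl.h>
#include <string.h>
#include <sys/syscall.h>
#include <unistd.h>
#include <linux/ukevent.h>

int example(void)
{
	struct ukevent uk, ret;
	int fd, n;

	/* The patch registers a miscdevice named "kevent"; /dev/kevent
	 * is the expected node, but that is an assumption. */
	fd = open("/dev/kevent", O_RDWR);

	memset(&uk, 0, sizeof(uk));
	/* ... fill uk as in the per-type patches ... */
	syscall(__NR_kevent_ctl, fd, KEVENT_CTL_ADD, 1, &uk);

	/* Wait up to 1 s (timeout in nanoseconds) for at least 1,
	 * at most 1, ready event and copy it back. */
	n = syscall(__NR_kevent_get_events, fd, 1, 1, 1000000000ULL, &ret, 0);
	return n;
}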

Signed-off-by: Evgeniy Polyakov <[email protected]>

diff --git a/arch/i386/kernel/syscall_table.S b/arch/i386/kernel/syscall_table.S
index dd63d47..c10698e 100644
--- a/arch/i386/kernel/syscall_table.S
+++ b/arch/i386/kernel/syscall_table.S
@@ -317,3 +317,6 @@ ENTRY(sys_call_table)
.long sys_tee /* 315 */
.long sys_vmsplice
.long sys_move_pages
+ .long sys_kevent_get_events
+ .long sys_kevent_ctl
+ .long sys_kevent_wait /* 320 */
diff --git a/arch/x86_64/ia32/ia32entry.S b/arch/x86_64/ia32/ia32entry.S
index 5d4a7d1..a06b76f 100644
--- a/arch/x86_64/ia32/ia32entry.S
+++ b/arch/x86_64/ia32/ia32entry.S
@@ -710,7 +710,10 @@ #endif
.quad compat_sys_get_robust_list
.quad sys_splice
.quad sys_sync_file_range
- .quad sys_tee
+ .quad sys_tee /* 315 */
.quad compat_sys_vmsplice
.quad compat_sys_move_pages
+ .quad sys_kevent_get_events
+ .quad sys_kevent_ctl
+ .quad sys_kevent_wait /* 320 */
ia32_syscall_end:
diff --git a/include/asm-i386/unistd.h b/include/asm-i386/unistd.h
index fc1c8dd..68072b5 100644
--- a/include/asm-i386/unistd.h
+++ b/include/asm-i386/unistd.h
@@ -323,10 +323,13 @@ #define __NR_sync_file_range 314
#define __NR_tee 315
#define __NR_vmsplice 316
#define __NR_move_pages 317
+#define __NR_kevent_get_events 318
+#define __NR_kevent_ctl 319
+#define __NR_kevent_wait 320

#ifdef __KERNEL__

-#define NR_syscalls 318
+#define NR_syscalls 321

/*
* user-visible error numbers are in the range -1 - -128: see
diff --git a/include/asm-x86_64/unistd.h b/include/asm-x86_64/unistd.h
index 94387c9..ee907ad 100644
--- a/include/asm-x86_64/unistd.h
+++ b/include/asm-x86_64/unistd.h
@@ -619,10 +619,16 @@ #define __NR_vmsplice 278
__SYSCALL(__NR_vmsplice, sys_vmsplice)
#define __NR_move_pages 279
__SYSCALL(__NR_move_pages, sys_move_pages)
+#define __NR_kevent_get_events 280
+__SYSCALL(__NR_kevent_get_events, sys_kevent_get_events)
+#define __NR_kevent_ctl 281
+__SYSCALL(__NR_kevent_ctl, sys_kevent_ctl)
+#define __NR_kevent_wait 282
+__SYSCALL(__NR_kevent_wait, sys_kevent_wait)

#ifdef __KERNEL__

-#define __NR_syscall_max __NR_move_pages
+#define __NR_syscall_max __NR_kevent_wait

#ifndef __NO_STUBS

diff --git a/include/linux/kevent.h b/include/linux/kevent.h
new file mode 100644
index 0000000..24ced10
--- /dev/null
+++ b/include/linux/kevent.h
@@ -0,0 +1,195 @@
+/*
+ * 2006 Copyright (c) Evgeniy Polyakov <[email protected]>
+ * All rights reserved.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software
+ * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA
+ */
+
+#ifndef __KEVENT_H
+#define __KEVENT_H
+#include <linux/types.h>
+#include <linux/list.h>
+#include <linux/rbtree.h>
+#include <linux/spinlock.h>
+#include <linux/mutex.h>
+#include <linux/wait.h>
+#include <linux/net.h>
+#include <linux/rcupdate.h>
+#include <linux/kevent_storage.h>
+#include <linux/ukevent.h>
+
+#define KEVENT_MIN_BUFFS_ALLOC 3
+
+struct kevent;
+struct kevent_storage;
+typedef int (* kevent_callback_t)(struct kevent *);
+
+/* @callback is called each time new event has been caught. */
+/* @enqueue is called each time new event is queued. */
+/* @dequeue is called each time event is dequeued. */
+
+struct kevent_callbacks {
+ kevent_callback_t callback, enqueue, dequeue;
+};
+
+#define KEVENT_READY 0x1
+#define KEVENT_STORAGE 0x2
+#define KEVENT_USER 0x4
+
+struct kevent
+{
+ /* Used for kevent freeing.*/
+ struct rcu_head rcu_head;
+ struct ukevent event;
+ /* This lock protects ukevent manipulations, e.g. ret_flags changes. */
+ spinlock_t ulock;
+
+ /* Entry of user's tree. */
+ struct rb_node kevent_node;
+ /* Entry of origin's queue. */
+ struct list_head storage_entry;
+ /* Entry of user's ready. */
+ struct list_head ready_entry;
+
+ u32 flags;
+
+ /* User who requested this kevent. */
+ struct kevent_user *user;
+ /* Kevent container. */
+ struct kevent_storage *st;
+
+ struct kevent_callbacks callbacks;
+
+ /* Private data for different storages.
+ * poll()/select storage has a list of wait_queue_t containers
+ * for each ->poll() { poll_wait(); } call here.
+ */
+ void *priv;
+};
+
+struct kevent_user
+{
+ struct rb_root kevent_root;
+ spinlock_t kevent_lock;
+ /* Number of queued kevents. */
+ unsigned int kevent_num;
+
+ /* List of ready kevents. */
+ struct list_head ready_list;
+ /* Number of ready kevents. */
+ unsigned int ready_num;
+ /* Protects all manipulations with ready queue. */
+ spinlock_t ready_lock;
+
+ /* Protects against simultaneous kevent_user control manipulations. */
+ struct mutex ctl_mutex;
+ /* Wait until some events are ready. */
+ wait_queue_head_t wait;
+
+ /* Reference counter, increased for each new kevent. */
+ atomic_t refcnt;
+
+ unsigned int pages_in_use;
+ /* Array of pages forming mapped ring buffer */
+ struct kevent_mring **pring;
+
+#ifdef CONFIG_KEVENT_USER_STAT
+ unsigned long im_num;
+ unsigned long wait_num;
+ unsigned long total;
+#endif
+};
+
+int kevent_enqueue(struct kevent *k);
+int kevent_dequeue(struct kevent *k);
+int kevent_init(struct kevent *k);
+void kevent_requeue(struct kevent *k);
+int kevent_break(struct kevent *k);
+
+int kevent_add_callbacks(const struct kevent_callbacks *cb, int pos);
+
+void kevent_user_ring_add_event(struct kevent *k);
+
+void kevent_storage_ready(struct kevent_storage *st,
+ kevent_callback_t ready_callback, u32 event);
+int kevent_storage_init(void *origin, struct kevent_storage *st);
+void kevent_storage_fini(struct kevent_storage *st);
+int kevent_storage_enqueue(struct kevent_storage *st, struct kevent *k);
+void kevent_storage_dequeue(struct kevent_storage *st, struct kevent *k);
+
+int kevent_user_add_ukevent(struct ukevent *uk, struct kevent_user *u);
+
+#ifdef CONFIG_KEVENT_POLL
+void kevent_poll_reinit(struct file *file);
+#else
+static inline void kevent_poll_reinit(struct file *file)
+{
+}
+#endif
+
+#ifdef CONFIG_KEVENT_USER_STAT
+static inline void kevent_stat_init(struct kevent_user *u)
+{
+ u->wait_num = u->im_num = u->total = 0;
+}
+static inline void kevent_stat_print(struct kevent_user *u)
+{
+ printk(KERN_INFO "%s: u=%p, wait=%lu, immediately=%lu, total=%lu.\n",
+ __func__, u, u->wait_num, u->im_num, u->total);
+}
+static inline void kevent_stat_im(struct kevent_user *u)
+{
+ u->im_num++;
+}
+static inline void kevent_stat_wait(struct kevent_user *u)
+{
+ u->wait_num++;
+}
+static inline void kevent_stat_total(struct kevent_user *u)
+{
+ u->total++;
+}
+#else
+#define kevent_stat_print(u) ({ (void) u;})
+#define kevent_stat_init(u) ({ (void) u;})
+#define kevent_stat_im(u) ({ (void) u;})
+#define kevent_stat_wait(u) ({ (void) u;})
+#define kevent_stat_total(u) ({ (void) u;})
+#endif
+
+#ifdef CONFIG_KEVENT_SOCKET
+#ifdef CONFIG_LOCKDEP
+void kevent_socket_reinit(struct socket *sock);
+void kevent_sk_reinit(struct sock *sk);
+#else
+static inline void kevent_socket_reinit(struct socket *sock)
+{
+}
+static inline void kevent_sk_reinit(struct sock *sk)
+{
+}
+#endif
+void kevent_socket_notify(struct sock *sock, u32 event);
+int kevent_socket_dequeue(struct kevent *k);
+int kevent_socket_enqueue(struct kevent *k);
+#define sock_async(__sk) sock_flag(__sk, SOCK_ASYNC)
+#else
+static inline void kevent_socket_notify(struct sock *sock, u32 event)
+{
+}
+#define sock_async(__sk) ({ (void)__sk; 0; })
+#endif
+
+#endif /* __KEVENT_H */
diff --git a/include/linux/kevent_storage.h b/include/linux/kevent_storage.h
new file mode 100644
index 0000000..a38575d
--- /dev/null
+++ b/include/linux/kevent_storage.h
@@ -0,0 +1,11 @@
+#ifndef __KEVENT_STORAGE_H
+#define __KEVENT_STORAGE_H
+
+struct kevent_storage
+{
+ void *origin; /* Originator's pointer, e.g. struct sock or struct file. Can be NULL. */
+ struct list_head list; /* List of queued kevents. */
+ spinlock_t lock; /* Protects users queue. */
+};
+
+#endif /* __KEVENT_STORAGE_H */
diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
index 008f04c..9d4690f 100644
--- a/include/linux/syscalls.h
+++ b/include/linux/syscalls.h
@@ -597,4 +597,8 @@ asmlinkage long sys_get_robust_list(int
asmlinkage long sys_set_robust_list(struct robust_list_head __user *head,
size_t len);

+asmlinkage long sys_kevent_get_events(int ctl_fd, unsigned int min, unsigned int max,
+ __u64 timeout, struct ukevent __user *buf, unsigned flags);
+asmlinkage long sys_kevent_ctl(int ctl_fd, unsigned int cmd, unsigned int num, struct ukevent __user *buf);
+asmlinkage long sys_kevent_wait(int ctl_fd, unsigned int start, unsigned int num, __u64 timeout);
#endif
diff --git a/include/linux/ukevent.h b/include/linux/ukevent.h
new file mode 100644
index 0000000..e38801f
--- /dev/null
+++ b/include/linux/ukevent.h
@@ -0,0 +1,159 @@
+/*
+ * 2006 Copyright (c) Evgeniy Polyakov <[email protected]>
+ * All rights reserved.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software
+ * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA
+ */
+
+#ifndef __UKEVENT_H
+#define __UKEVENT_H
+
+/*
+ * Kevent request flags.
+ */
+
+/* Process this event only once and then dequeue. */
+#define KEVENT_REQ_ONESHOT 0x1
+
+/*
+ * Kevent return flags.
+ */
+/* Kevent is broken. */
+#define KEVENT_RET_BROKEN 0x1
+/* Kevent processing was finished successfully. */
+#define KEVENT_RET_DONE 0x2
+
+/*
+ * Kevent type set.
+ */
+#define KEVENT_SOCKET 0
+#define KEVENT_INODE 1
+#define KEVENT_TIMER 2
+#define KEVENT_POLL 3
+#define KEVENT_NAIO 4
+#define KEVENT_AIO 5
+#define KEVENT_MAX 6
+
+/*
+ * Per-type event sets.
+ * The number of per-event sets must exactly match the number of kevent types.
+ */
+
+/*
+ * Timer events.
+ */
+#define KEVENT_TIMER_FIRED 0x1
+
+/*
+ * Socket/network asynchronous IO events.
+ */
+#define KEVENT_SOCKET_RECV 0x1
+#define KEVENT_SOCKET_ACCEPT 0x2
+#define KEVENT_SOCKET_SEND 0x4
+
+/*
+ * Inode events.
+ */
+#define KEVENT_INODE_CREATE 0x1
+#define KEVENT_INODE_REMOVE 0x2
+
+/*
+ * Poll events.
+ */
+#define KEVENT_POLL_POLLIN 0x0001
+#define KEVENT_POLL_POLLPRI 0x0002
+#define KEVENT_POLL_POLLOUT 0x0004
+#define KEVENT_POLL_POLLERR 0x0008
+#define KEVENT_POLL_POLLHUP 0x0010
+#define KEVENT_POLL_POLLNVAL 0x0020
+
+#define KEVENT_POLL_POLLRDNORM 0x0040
+#define KEVENT_POLL_POLLRDBAND 0x0080
+#define KEVENT_POLL_POLLWRNORM 0x0100
+#define KEVENT_POLL_POLLWRBAND 0x0200
+#define KEVENT_POLL_POLLMSG 0x0400
+#define KEVENT_POLL_POLLREMOVE 0x1000
+
+/*
+ * Asynchronous IO events.
+ */
+#define KEVENT_AIO_BIO 0x1
+
+#define KEVENT_MASK_ALL 0xffffffff
+/* Mask of all possible event values. */
+#define KEVENT_MASK_EMPTY 0x0
+/* Empty mask of ready events. */
+
+struct kevent_id
+{
+ union {
+ __u32 raw[2];
+ __u64 raw_u64 __attribute__((aligned(8)));
+ };
+};
+
+struct ukevent
+{
+ /* Id of this request, e.g. socket number, file descriptor and so on... */
+ struct kevent_id id;
+ /* Event type, e.g. KEVENT_SOCK, KEVENT_INODE, KEVENT_TIMER and so on... */
+ __u32 type;
+ /* Event itself, e.g. SOCK_ACCEPT, INODE_CREATED, TIMER_FIRED... */
+ __u32 event;
+ /* Per-event request flags */
+ __u32 req_flags;
+ /* Per-event return flags */
+ __u32 ret_flags;
+ /* Event return data. Event originator fills it with anything it likes. */
+ __u32 ret_data[2];
+ /* User's data. It is not used, just copied to/from user.
+ * The whole structure is aligned to 8 bytes already, so the last union
+ * is aligned properly.
+ */
+ union {
+ __u32 user[2];
+ void *ptr;
+ };
+};
+
+struct mukevent
+{
+ struct kevent_id id;
+ __u32 ret_flags;
+};
+
+#define KEVENT_MAX_EVENTS 4096
+
+/*
+ * Note that kevents do not exactly fill the page (each mukevent is 12 bytes),
+ * so we reuse 4 bytes at the beginning of the page to store the index.
+ * Take that into account if you want to change the size of struct mukevent.
+ */
+#define KEVENTS_ON_PAGE ((PAGE_SIZE-sizeof(unsigned int))/sizeof(struct mukevent))
+struct kevent_mring
+{
+ unsigned int index;
+ struct mukevent event[KEVENTS_ON_PAGE];
+};
+
+#define KEVENT_MAX_PAGES ((KEVENT_MAX_EVENTS%KEVENTS_ON_PAGE)?\
+ (KEVENT_MAX_EVENTS/KEVENTS_ON_PAGE+1):\
+ (KEVENT_MAX_EVENTS/KEVENTS_ON_PAGE))
+
+#define KEVENT_CTL_ADD 0
+#define KEVENT_CTL_REMOVE 1
+#define KEVENT_CTL_MODIFY 2
+
+#endif /* __UKEVENT_H */
diff --git a/init/Kconfig b/init/Kconfig
index a099fc6..c550fcc 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -218,6 +218,8 @@ config AUDITSYSCALL
such as SELinux. To use audit's filesystem watch feature, please
ensure that INOTIFY is configured.

+source "kernel/kevent/Kconfig"
+
config IKCONFIG
bool "Kernel .config support"
---help---
diff --git a/kernel/Makefile b/kernel/Makefile
index d62ec66..2d7a6dd 100644
--- a/kernel/Makefile
+++ b/kernel/Makefile
@@ -47,6 +47,7 @@ obj-$(CONFIG_DETECT_SOFTLOCKUP) += softl
obj-$(CONFIG_GENERIC_HARDIRQS) += irq/
obj-$(CONFIG_SECCOMP) += seccomp.o
obj-$(CONFIG_RCU_TORTURE_TEST) += rcutorture.o
+obj-$(CONFIG_KEVENT) += kevent/
obj-$(CONFIG_RELAY) += relay.o
obj-$(CONFIG_TASK_DELAY_ACCT) += delayacct.o
obj-$(CONFIG_TASKSTATS) += taskstats.o
diff --git a/kernel/kevent/Kconfig b/kernel/kevent/Kconfig
new file mode 100644
index 0000000..5ba8086
--- /dev/null
+++ b/kernel/kevent/Kconfig
@@ -0,0 +1,39 @@
+config KEVENT
+ bool "Kernel event notification mechanism"
+ help
+ This option enables event queue mechanism.
+ It can be used as replacement for poll()/select(), AIO callback
+ invocations, advanced timer notifications and other kernel
+ object status changes.
+
+config KEVENT_USER_STAT
+ bool "Kevent user statistic"
+ depends on KEVENT
+ help
+ This option will turn kevent_user statistic collection on.
+ Statistic data includes total number of kevent, number of kevents
+ which are ready immediately at insertion time and number of kevents
+ which were removed through readiness completion.
+ It will be printed each time control kevent descriptor is closed.
+
+config KEVENT_TIMER
+ bool "Kernel event notifications for timers"
+ depends on KEVENT
+ help
+ This option allows to use timers through KEVENT subsystem.
+
+config KEVENT_POLL
+ bool "Kernel event notifications for poll()/select()"
+ depends on KEVENT
+ help
+ This option allows to use kevent subsystem for poll()/select()
+ notifications.
+
+config KEVENT_SOCKET
+ bool "Kernel event notifications for sockets"
+ depends on NET && KEVENT
+ help
+ This option enables notifications through KEVENT subsystem of
+ sockets operations, like new packet receiving conditions,
+ ready for accept conditions and so on.
+
diff --git a/kernel/kevent/Makefile b/kernel/kevent/Makefile
new file mode 100644
index 0000000..9130cad
--- /dev/null
+++ b/kernel/kevent/Makefile
@@ -0,0 +1,4 @@
+obj-y := kevent.o kevent_user.o
+obj-$(CONFIG_KEVENT_TIMER) += kevent_timer.o
+obj-$(CONFIG_KEVENT_POLL) += kevent_poll.o
+obj-$(CONFIG_KEVENT_SOCKET) += kevent_socket.o
diff --git a/kernel/kevent/kevent.c b/kernel/kevent/kevent.c
new file mode 100644
index 0000000..25404d3
--- /dev/null
+++ b/kernel/kevent/kevent.c
@@ -0,0 +1,227 @@
+/*
+ * 2006 Copyright (c) Evgeniy Polyakov <[email protected]>
+ * All rights reserved.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software
+ * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA
+ */
+
+#include <linux/kernel.h>
+#include <linux/types.h>
+#include <linux/list.h>
+#include <linux/slab.h>
+#include <linux/spinlock.h>
+#include <linux/mempool.h>
+#include <linux/sched.h>
+#include <linux/wait.h>
+#include <linux/kevent.h>
+
+/*
+ * Attempts to add an event into appropriate origin's queue.
+ * Returns positive value if this event is ready immediately,
+ * negative value in case of error and zero if event has been queued.
+ * ->enqueue() callback must increase origin's reference counter.
+ */
+int kevent_enqueue(struct kevent *k)
+{
+ return k->callbacks.enqueue(k);
+}
+
+/*
+ * Remove event from the appropriate queue.
+ * ->dequeue() callback must decrease origin's reference counter.
+ */
+int kevent_dequeue(struct kevent *k)
+{
+ return k->callbacks.dequeue(k);
+}
+
+/*
+ * Mark kevent as broken.
+ */
+int kevent_break(struct kevent *k)
+{
+ unsigned long flags;
+
+ spin_lock_irqsave(&k->ulock, flags);
+ k->event.ret_flags |= KEVENT_RET_BROKEN;
+ spin_unlock_irqrestore(&k->ulock, flags);
+ return -EINVAL;
+}
+
+static struct kevent_callbacks kevent_registered_callbacks[KEVENT_MAX] __read_mostly;
+
+int kevent_add_callbacks(const struct kevent_callbacks *cb, int pos)
+{
+ struct kevent_callbacks *p;
+
+ if (pos >= KEVENT_MAX)
+ return -EINVAL;
+
+ p = &kevent_registered_callbacks[pos];
+
+ p->enqueue = (cb->enqueue) ? cb->enqueue : kevent_break;
+ p->dequeue = (cb->dequeue) ? cb->dequeue : kevent_break;
+ p->callback = (cb->callback) ? cb->callback : kevent_break;
+
+ printk(KERN_INFO "KEVENT: Added callbacks for type %d.\n", pos);
+ return 0;
+}
+
+/*
+ * Must be called before event is going to be added into some origin's queue.
+ * Initializes ->enqueue(), ->dequeue() and ->callback() callbacks.
+ * If failed, kevent should not be used or kevent_enqueue() will fail to add
+ * this kevent into origin's queue with setting
+ * KEVENT_RET_BROKEN flag in kevent->event.ret_flags.
+ */
+int kevent_init(struct kevent *k)
+{
+ spin_lock_init(&k->ulock);
+ k->flags = 0;
+
+ if (unlikely(k->event.type >= KEVENT_MAX ||
+ !kevent_registered_callbacks[k->event.type].callback))
+ return kevent_break(k);
+
+ k->callbacks = kevent_registered_callbacks[k->event.type];
+ if (unlikely(k->callbacks.callback == kevent_break))
+ return kevent_break(k);
+
+ return 0;
+}
+
+/*
+ * Called from ->enqueue() callback when reference counter for given
+ * origin (socket, inode...) has been increased.
+ */
+int kevent_storage_enqueue(struct kevent_storage *st, struct kevent *k)
+{
+ unsigned long flags;
+
+ k->st = st;
+ spin_lock_irqsave(&st->lock, flags);
+ list_add_tail_rcu(&k->storage_entry, &st->list);
+ k->flags |= KEVENT_STORAGE;
+ spin_unlock_irqrestore(&st->lock, flags);
+ return 0;
+}
+
+/*
+ * Dequeue kevent from origin's queue.
+ * It does not decrease origin's reference counter in any way
+ * and must be called before it, so storage itself must be valid.
+ * It is called from ->dequeue() callback.
+ */
+void kevent_storage_dequeue(struct kevent_storage *st, struct kevent *k)
+{
+ unsigned long flags;
+
+ spin_lock_irqsave(&st->lock, flags);
+ if (k->flags & KEVENT_STORAGE) {
+ list_del_rcu(&k->storage_entry);
+ k->flags &= ~KEVENT_STORAGE;
+ }
+ spin_unlock_irqrestore(&st->lock, flags);
+}
+
+/*
+ * Call kevent ready callback and queue it into ready queue if needed.
+ * If kevent is marked as one-shot, then remove it from storage queue.
+ */
+static void __kevent_requeue(struct kevent *k, u32 event)
+{
+ int ret, rem;
+ unsigned long flags;
+
+ ret = k->callbacks.callback(k);
+
+ spin_lock_irqsave(&k->ulock, flags);
+ if (ret > 0)
+ k->event.ret_flags |= KEVENT_RET_DONE;
+ else if (ret < 0)
+ k->event.ret_flags |= (KEVENT_RET_BROKEN | KEVENT_RET_DONE);
+ else
+ ret = (k->event.ret_flags & (KEVENT_RET_BROKEN|KEVENT_RET_DONE));
+ rem = (k->event.req_flags & KEVENT_REQ_ONESHOT);
+ spin_unlock_irqrestore(&k->ulock, flags);
+
+ if (ret) {
+ if ((rem || ret < 0) && (k->flags & KEVENT_STORAGE)) {
+ list_del_rcu(&k->storage_entry);
+ k->flags &= ~KEVENT_STORAGE;
+ }
+
+ spin_lock_irqsave(&k->user->ready_lock, flags);
+ if (!(k->flags & KEVENT_READY)) {
+ kevent_user_ring_add_event(k);
+ list_add_tail(&k->ready_entry, &k->user->ready_list);
+ k->flags |= KEVENT_READY;
+ k->user->ready_num++;
+ }
+ spin_unlock_irqrestore(&k->user->ready_lock, flags);
+ wake_up(&k->user->wait);
+ }
+}
+
+/*
+ * Check if kevent is ready (by invoking it's callback) and requeue/remove
+ * if needed.
+ */
+void kevent_requeue(struct kevent *k)
+{
+ unsigned long flags;
+
+ spin_lock_irqsave(&k->st->lock, flags);
+ __kevent_requeue(k, 0);
+ spin_unlock_irqrestore(&k->st->lock, flags);
+}
+
+/*
+ * Called each time some activity in origin (socket, inode...) is noticed.
+ */
+void kevent_storage_ready(struct kevent_storage *st,
+ kevent_callback_t ready_callback, u32 event)
+{
+ struct kevent *k;
+
+ rcu_read_lock();
+ if (ready_callback)
+ list_for_each_entry_rcu(k, &st->list, storage_entry)
+ (*ready_callback)(k);
+
+ list_for_each_entry_rcu(k, &st->list, storage_entry)
+ if (event & k->event.event)
+ __kevent_requeue(k, event);
+ rcu_read_unlock();
+}
+
+int kevent_storage_init(void *origin, struct kevent_storage *st)
+{
+ spin_lock_init(&st->lock);
+ st->origin = origin;
+ INIT_LIST_HEAD(&st->list);
+ return 0;
+}
+
+/*
+ * Mark all events as broken; that will remove them from storage,
+ * so the storage origin (inode, socket and so on) can be safely removed.
+ * No new entries are allowed to be added into the storage at this point.
+ * (Socket is removed from file table at this point for example).
+ */
+void kevent_storage_fini(struct kevent_storage *st)
+{
+ kevent_storage_ready(st, kevent_break, KEVENT_MASK_ALL);
+}
diff --git a/kernel/kevent/kevent_user.c b/kernel/kevent/kevent_user.c
new file mode 100644
index 0000000..fbe54da
--- /dev/null
+++ b/kernel/kevent/kevent_user.c
@@ -0,0 +1,954 @@
+/*
+ * 2006 Copyright (c) Evgeniy Polyakov <[email protected]>
+ * All rights reserved.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software
+ * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA
+ */
+
+#include <linux/kernel.h>
+#include <linux/module.h>
+#include <linux/types.h>
+#include <linux/list.h>
+#include <linux/slab.h>
+#include <linux/spinlock.h>
+#include <linux/fs.h>
+#include <linux/file.h>
+#include <linux/mount.h>
+#include <linux/device.h>
+#include <linux/poll.h>
+#include <linux/kevent.h>
+#include <linux/miscdevice.h>
+#include <asm/io.h>
+
+static const char kevent_name[] = "kevent";
+static kmem_cache_t *kevent_cache __read_mostly;
+
+/*
+ * kevents are pollable, return POLLIN and POLLRDNORM
+ * when there is at least one ready kevent.
+ */
+static unsigned int kevent_user_poll(struct file *file, struct poll_table_struct *wait)
+{
+ struct kevent_user *u = file->private_data;
+ unsigned int mask;
+
+ poll_wait(file, &u->wait, wait);
+ mask = 0;
+
+ if (u->ready_num)
+ mask |= POLLIN | POLLRDNORM;
+
+ return mask;
+}
+
+static inline void kevent_user_ring_set(struct kevent_user *u, unsigned int num)
+{
+ u->pring[0]->index = num;
+}
+
+static int kevent_user_ring_grow(struct kevent_user *u)
+{
+ unsigned int idx;
+
+ idx = (u->kevent_num + u->pring[0]->index + 1) / KEVENTS_ON_PAGE;
+ if (idx >= u->pages_in_use) {
+ u->pring[idx] = (void *)__get_free_page(GFP_KERNEL);
+ if (!u->pring[idx])
+ return -ENOMEM;
+ u->pages_in_use++;
+ }
+ return 0;
+}
+
+/*
+ * Called under kevent_user->ready_lock, so updates are always protected.
+ */
+void kevent_user_ring_add_event(struct kevent *k)
+{
+ unsigned int pidx, off;
+ struct kevent_mring *ring, *copy_ring;
+
+ ring = k->user->pring[0];
+
+ pidx = ring->index/KEVENTS_ON_PAGE;
+ off = ring->index%KEVENTS_ON_PAGE;
+
+ if (unlikely(pidx >= k->user->pages_in_use)) {
+ printk("%s: ring->index: %u, on_page: %lu, pidx: %u, pages_in_use: %u.\n",
+ __func__, ring->index, KEVENTS_ON_PAGE, pidx, k->user->pages_in_use);
+ return;
+ }
+
+ copy_ring = k->user->pring[pidx];
+
+ copy_ring->event[off].id.raw[0] = k->event.id.raw[0];
+ copy_ring->event[off].id.raw[1] = k->event.id.raw[1];
+ copy_ring->event[off].ret_flags = k->event.ret_flags;
+
+ if (++ring->index >= KEVENT_MAX_EVENTS)
+ ring->index = 0;
+}
+
+/*
+ * Initialize mmap ring buffer.
+ * It will store ready kevents, so userspace can get them directly instead
+ * of using a syscall. Essentially the syscall becomes just a waiting point.
+ */
+static int kevent_user_ring_init(struct kevent_user *u)
+{
+ u->pring = kzalloc(KEVENT_MAX_PAGES * sizeof(struct kevent_mring *), GFP_KERNEL);
+ if (!u->pring)
+ return -ENOMEM;
+
+ u->pring[0] = (struct kevent_mring *)__get_free_page(GFP_KERNEL);
+ if (!u->pring[0])
+ goto err_out_free;
+
+ u->pages_in_use = 1;
+ kevent_user_ring_set(u, 0);
+
+ return 0;
+
+err_out_free:
+ kfree(u->pring);
+
+ return -ENOMEM;
+}
+
+static void kevent_user_ring_fini(struct kevent_user *u)
+{
+ int i;
+
+ for (i = 0; i < u->pages_in_use; ++i)
+ free_page((unsigned long)u->pring[i]);
+
+ kfree(u->pring);
+}
+
+static int kevent_user_open(struct inode *inode, struct file *file)
+{
+ struct kevent_user *u;
+
+ u = kzalloc(sizeof(struct kevent_user), GFP_KERNEL);
+ if (!u)
+ return -ENOMEM;
+
+ INIT_LIST_HEAD(&u->ready_list);
+ spin_lock_init(&u->ready_lock);
+ kevent_stat_init(u);
+ spin_lock_init(&u->kevent_lock);
+ u->kevent_root = RB_ROOT;
+
+ mutex_init(&u->ctl_mutex);
+ init_waitqueue_head(&u->wait);
+
+ atomic_set(&u->refcnt, 1);
+
+ if (unlikely(kevent_user_ring_init(u))) {
+ kfree(u);
+ return -ENOMEM;
+ }
+
+ file->private_data = u;
+ return 0;
+}
+
+/*
+ * Kevent userspace control block reference counting.
+ * Set to 1 at creation time, when appropriate kevent file descriptor
+ * is closed, that reference counter is decreased.
+ * When counter hits zero block is freed.
+ */
+static inline void kevent_user_get(struct kevent_user *u)
+{
+ atomic_inc(&u->refcnt);
+}
+
+static inline void kevent_user_put(struct kevent_user *u)
+{
+ if (atomic_dec_and_test(&u->refcnt)) {
+ kevent_stat_print(u);
+ kevent_user_ring_fini(u);
+ kfree(u);
+ }
+}
+
+static struct page *kevent_user_nopage(struct vm_area_struct *vma, unsigned long addr, int *type)
+{
+ struct kevent_user *u = vma->vm_file->private_data;
+ unsigned long off = (addr - vma->vm_start)/PAGE_SIZE;
+
+ if (type)
+ *type = VM_FAULT_MINOR;
+
+ if (off >= u->pages_in_use)
+ goto err_out_sigbus;
+
+ return virt_to_page(u->pring[off]);
+
+err_out_sigbus:
+ return NOPAGE_SIGBUS;
+}
+
+static struct vm_operations_struct kevent_user_vm_ops = {
+ .nopage = &kevent_user_nopage,
+};
+
+/*
+ * Mmap implementation for ring buffer, which is created as array
+ * of pages, so vm_pgoff is an offset (in pages, not in bytes) of
+ * the first page to be mapped.
+ */
+static int kevent_user_mmap(struct file *file, struct vm_area_struct *vma)
+{
+ unsigned long start = vma->vm_start;
+ struct kevent_user *u = file->private_data;
+
+ if (vma->vm_flags & VM_WRITE)
+ return -EPERM;
+
+ vma->vm_page_prot = pgprot_noncached(vma->vm_page_prot);
+ vma->vm_ops = &kevent_user_vm_ops;
+ vma->vm_flags |= VM_RESERVED;
+ vma->vm_file = file;
+
+ if (vm_insert_page(vma, start, virt_to_page(u->pring[0])))
+ return -EFAULT;
+
+ return 0;
+}
+
+static inline int kevent_compare_id(struct kevent_id *left, struct kevent_id *right)
+{
+	if (left->raw_u64 > right->raw_u64)
+		return -1;
+	if (left->raw_u64 < right->raw_u64)
+		return 1;
+	return 0;
+}
+
+/*
+ * RCU protects storage list (kevent->storage_entry).
+ * The entry is freed in the RCU callback; it has been dequeued from
+ * all lists at this point.
+ */
+
+static void kevent_free_rcu(struct rcu_head *rcu)
+{
+ struct kevent *kevent = container_of(rcu, struct kevent, rcu_head);
+ kmem_cache_free(kevent_cache, kevent);
+}
+
+/*
+ * Complete kevent removal - it dequeues the kevent from the storage list
+ * if requested, removes the kevent from the ready list, drops the userspace
+ * control block reference counter and schedules kevent freeing through RCU.
+ */
+static void kevent_finish_user_complete(struct kevent *k, int deq)
+{
+ struct kevent_user *u = k->user;
+ unsigned long flags;
+
+ if (deq)
+ kevent_dequeue(k);
+
+ spin_lock_irqsave(&u->ready_lock, flags);
+ if (k->flags & KEVENT_READY) {
+ list_del(&k->ready_entry);
+ k->flags &= ~KEVENT_READY;
+ u->ready_num--;
+ }
+ spin_unlock_irqrestore(&u->ready_lock, flags);
+
+ kevent_user_put(u);
+ call_rcu(&k->rcu_head, kevent_free_rcu);
+}
+
+/*
+ * Remove from all lists and free kevent.
+ * Must be called under kevent_user->kevent_lock to protect
+ * kevent->kevent_node removal.
+ */
+static void __kevent_finish_user(struct kevent *k, int deq)
+{
+ struct kevent_user *u = k->user;
+
+ rb_erase(&k->kevent_node, &u->kevent_root);
+ k->flags &= ~KEVENT_USER;
+ u->kevent_num--;
+ kevent_finish_user_complete(k, deq);
+}
+
+/*
+ * Remove kevent from user's list of all events,
+ * dequeue it from storage and decrease the user's reference counter,
+ * since this kevent no longer exists; that is why it is freed here.
+ */
+static void kevent_finish_user(struct kevent *k, int deq)
+{
+ struct kevent_user *u = k->user;
+ unsigned long flags;
+
+ spin_lock_irqsave(&u->kevent_lock, flags);
+ rb_erase(&k->kevent_node, &u->kevent_root);
+ k->flags &= ~KEVENT_USER;
+ u->kevent_num--;
+ spin_unlock_irqrestore(&u->kevent_lock, flags);
+ kevent_finish_user_complete(k, deq);
+}
+
+/*
+ * Dequeue one entry from user's ready queue.
+ */
+static struct kevent *kqueue_dequeue_ready(struct kevent_user *u)
+{
+ unsigned long flags;
+ struct kevent *k = NULL;
+
+ spin_lock_irqsave(&u->ready_lock, flags);
+ if (u->ready_num && !list_empty(&u->ready_list)) {
+ k = list_entry(u->ready_list.next, struct kevent, ready_entry);
+ list_del(&k->ready_entry);
+ k->flags &= ~KEVENT_READY;
+ u->ready_num--;
+ }
+ spin_unlock_irqrestore(&u->ready_lock, flags);
+
+ return k;
+}
+
+/*
+ * Search a kevent inside kevent tree for given ukevent.
+ */
+static struct kevent *__kevent_search(struct kevent_id *id, struct kevent_user *u)
+{
+ struct kevent *k, *ret = NULL;
+ struct rb_node *n = u->kevent_root.rb_node;
+ int cmp;
+
+ while (n) {
+ k = rb_entry(n, struct kevent, kevent_node);
+ cmp = kevent_compare_id(&k->event.id, id);
+
+ if (cmp > 0)
+ n = n->rb_right;
+ else if (cmp < 0)
+ n = n->rb_left;
+ else {
+ ret = k;
+ break;
+ }
+ }
+
+ return ret;
+}
+
+/*
+ * Search and modify kevent according to provided ukevent.
+ */
+static int kevent_modify(struct ukevent *uk, struct kevent_user *u)
+{
+ struct kevent *k;
+ int err = -ENODEV;
+ unsigned long flags;
+
+ spin_lock_irqsave(&u->kevent_lock, flags);
+ k = __kevent_search(&uk->id, u);
+ if (k) {
+ spin_lock(&k->ulock);
+ k->event.event = uk->event;
+ k->event.req_flags = uk->req_flags;
+ k->event.ret_flags = 0;
+ spin_unlock(&k->ulock);
+ kevent_requeue(k);
+ err = 0;
+ }
+ spin_unlock_irqrestore(&u->kevent_lock, flags);
+
+ return err;
+}
+
+/*
+ * Remove kevent which matches provided ukevent.
+ */
+static int kevent_remove(struct ukevent *uk, struct kevent_user *u)
+{
+ int err = -ENODEV;
+ struct kevent *k;
+ unsigned long flags;
+
+ spin_lock_irqsave(&u->kevent_lock, flags);
+ k = __kevent_search(&uk->id, u);
+ if (k) {
+ __kevent_finish_user(k, 1);
+ err = 0;
+ }
+ spin_unlock_irqrestore(&u->kevent_lock, flags);
+
+ return err;
+}
+
+/*
+ * Detaches the userspace control block from the file descriptor
+ * and decreases its reference counter.
+ * No new kevents can be added or removed from any list at this point.
+ */
+static int kevent_user_release(struct inode *inode, struct file *file)
+{
+ struct kevent_user *u = file->private_data;
+ struct kevent *k;
+ struct rb_node *n;
+
+ for (n = rb_first(&u->kevent_root); n; n = rb_next(n)) {
+ k = rb_entry(n, struct kevent, kevent_node);
+ kevent_finish_user(k, 1);
+ }
+
+ kevent_user_put(u);
+ file->private_data = NULL;
+
+ return 0;
+}
+
+/*
+ * Read requested number of ukevents in one shot.
+ */
+static struct ukevent *kevent_get_user(unsigned int num, void __user *arg)
+{
+ struct ukevent *ukev;
+
+ ukev = kmalloc(sizeof(struct ukevent) * num, GFP_KERNEL);
+ if (!ukev)
+ return NULL;
+
+ if (copy_from_user(ukev, arg, sizeof(struct ukevent) * num)) {
+ kfree(ukev);
+ return NULL;
+ }
+
+ return ukev;
+}
+
+/*
+ * Read from userspace all ukevents and modify appropriate kevents.
+ * If the provided number of ukevents is more than the threshold, it is
+ * faster to allocate room for them and copy them in one shot instead of
+ * copying one-by-one and then processing them.
+ */
+static int kevent_user_ctl_modify(struct kevent_user *u, unsigned int num, void __user *arg)
+{
+ int err = 0, i;
+ struct ukevent uk;
+
+ mutex_lock(&u->ctl_mutex);
+
+ if (num > u->kevent_num) {
+ err = -EINVAL;
+ goto out;
+ }
+
+ if (num > KEVENT_MIN_BUFFS_ALLOC) {
+ struct ukevent *ukev;
+
+ ukev = kevent_get_user(num, arg);
+ if (ukev) {
+ for (i = 0; i < num; ++i) {
+ if (kevent_modify(&ukev[i], u))
+ ukev[i].ret_flags |= KEVENT_RET_BROKEN;
+ ukev[i].ret_flags |= KEVENT_RET_DONE;
+ }
+ if (copy_to_user(arg, ukev, num*sizeof(struct ukevent)))
+ err = -EFAULT;
+ kfree(ukev);
+ goto out;
+ }
+ }
+
+ for (i = 0; i < num; ++i) {
+ if (copy_from_user(&uk, arg, sizeof(struct ukevent))) {
+ err = -EFAULT;
+ break;
+ }
+
+ if (kevent_modify(&uk, u))
+ uk.ret_flags |= KEVENT_RET_BROKEN;
+ uk.ret_flags |= KEVENT_RET_DONE;
+
+ if (copy_to_user(arg, &uk, sizeof(struct ukevent))) {
+ err = -EFAULT;
+ break;
+ }
+
+ arg += sizeof(struct ukevent);
+ }
+out:
+ mutex_unlock(&u->ctl_mutex);
+
+ return err;
+}
+
+/*
+ * Read from userspace all ukevents and remove appropriate kevents.
+ * If the provided number of ukevents is more than the threshold, it is
+ * faster to allocate room for them and copy them in one shot instead of
+ * copying one-by-one and then processing them.
+ */
+static int kevent_user_ctl_remove(struct kevent_user *u, unsigned int num, void __user *arg)
+{
+ int err = 0, i;
+ struct ukevent uk;
+
+ mutex_lock(&u->ctl_mutex);
+
+ if (num > u->kevent_num) {
+ err = -EINVAL;
+ goto out;
+ }
+
+ if (num > KEVENT_MIN_BUFFS_ALLOC) {
+ struct ukevent *ukev;
+
+ ukev = kevent_get_user(num, arg);
+ if (ukev) {
+ for (i = 0; i < num; ++i) {
+ if (kevent_remove(&ukev[i], u))
+ ukev[i].ret_flags |= KEVENT_RET_BROKEN;
+ ukev[i].ret_flags |= KEVENT_RET_DONE;
+ }
+ if (copy_to_user(arg, ukev, num*sizeof(struct ukevent)))
+ err = -EFAULT;
+ kfree(ukev);
+ goto out;
+ }
+ }
+
+ for (i = 0; i < num; ++i) {
+ if (copy_from_user(&uk, arg, sizeof(struct ukevent))) {
+ err = -EFAULT;
+ break;
+ }
+
+ if (kevent_remove(&uk, u))
+ uk.ret_flags |= KEVENT_RET_BROKEN;
+
+ uk.ret_flags |= KEVENT_RET_DONE;
+
+ if (copy_to_user(arg, &uk, sizeof(struct ukevent))) {
+ err = -EFAULT;
+ break;
+ }
+
+ arg += sizeof(struct ukevent);
+ }
+out:
+ mutex_unlock(&u->ctl_mutex);
+
+ return err;
+}
+
+/*
+ * Queue kevent into userspace control block and increase
+ * its reference counter.
+ */
+static int kevent_user_enqueue(struct kevent_user *u, struct kevent *new)
+{
+ unsigned long flags;
+ struct rb_node **p = &u->kevent_root.rb_node, *parent = NULL;
+ struct kevent *k;
+ int err = 0, cmp;
+
+ spin_lock_irqsave(&u->kevent_lock, flags);
+ while (*p) {
+ parent = *p;
+ k = rb_entry(parent, struct kevent, kevent_node);
+
+ cmp = kevent_compare_id(&k->event.id, &new->event.id);
+ if (cmp > 0)
+ p = &parent->rb_right;
+ else if (cmp < 0)
+ p = &parent->rb_left;
+ else {
+ err = -EEXIST;
+ break;
+ }
+ }
+ if (likely(!err)) {
+ rb_link_node(&new->kevent_node, parent, p);
+ rb_insert_color(&new->kevent_node, &u->kevent_root);
+ new->flags |= KEVENT_USER;
+ u->kevent_num++;
+ kevent_user_get(u);
+ }
+ spin_unlock_irqrestore(&u->kevent_lock, flags);
+
+ return err;
+}
+
+/*
+ * Add kevent from both kernel and userspace users.
+ * This function allocates and queues kevent, returns negative value
+ * on error, positive if kevent is ready immediately and zero
+ * if kevent has been queued.
+ */
+int kevent_user_add_ukevent(struct ukevent *uk, struct kevent_user *u)
+{
+ struct kevent *k;
+ int err;
+
+ if (kevent_user_ring_grow(u)) {
+ err = -ENOMEM;
+ goto err_out_exit;
+ }
+
+ k = kmem_cache_alloc(kevent_cache, GFP_KERNEL);
+ if (!k) {
+ err = -ENOMEM;
+ goto err_out_exit;
+ }
+
+ memcpy(&k->event, uk, sizeof(struct ukevent));
+ INIT_RCU_HEAD(&k->rcu_head);
+
+ k->event.ret_flags = 0;
+
+ err = kevent_init(k);
+ if (err) {
+ kmem_cache_free(kevent_cache, k);
+ goto err_out_exit;
+ }
+ k->user = u;
+ kevent_stat_total(u);
+ err = kevent_user_enqueue(u, k);
+ if (err) {
+ kmem_cache_free(kevent_cache, k);
+ goto err_out_exit;
+ }
+
+ err = kevent_enqueue(k);
+ if (err) {
+ memcpy(uk, &k->event, sizeof(struct ukevent));
+ kevent_finish_user(k, 0);
+ goto err_out_exit;
+ }
+
+ return 0;
+
+err_out_exit:
+ if (err < 0) {
+ uk->ret_flags |= KEVENT_RET_BROKEN | KEVENT_RET_DONE;
+ uk->ret_data[1] = err;
+ } else if (err > 0)
+ uk->ret_flags |= KEVENT_RET_DONE;
+ return err;
+}
+
+/*
+ * Copy all ukevents from userspace, allocate kevent for each one
+ * and add them into appropriate kevent_storages,
+ * e.g. sockets, inodes and so on...
+ * Ready events will replace the ones provided by the user, and the
+ * number of ready events is returned.
+ * User must check ret_flags field of each ukevent structure
+ * to determine if it is fired or failed event.
+ */
+static int kevent_user_ctl_add(struct kevent_user *u, unsigned int num, void __user *arg)
+{
+ int err, cerr = 0, knum = 0, rnum = 0, i;
+ void __user *orig = arg;
+ struct ukevent uk;
+
+ mutex_lock(&u->ctl_mutex);
+
+ err = -EINVAL;
+ if (u->kevent_num + num >= KEVENT_MAX_EVENTS)
+ goto out_remove;
+
+ if (num > KEVENT_MIN_BUFFS_ALLOC) {
+ struct ukevent *ukev;
+
+ ukev = kevent_get_user(num, arg);
+ if (ukev) {
+ for (i = 0; i < num; ++i) {
+ err = kevent_user_add_ukevent(&ukev[i], u);
+ if (err) {
+ kevent_stat_im(u);
+ if (i != rnum)
+ memcpy(&ukev[rnum], &ukev[i], sizeof(struct ukevent));
+ rnum++;
+ } else
+ knum++;
+ }
+ if (copy_to_user(orig, ukev, rnum*sizeof(struct ukevent)))
+ cerr = -EFAULT;
+ kfree(ukev);
+ goto out_setup;
+ }
+ }
+
+ for (i = 0; i < num; ++i) {
+ if (copy_from_user(&uk, arg, sizeof(struct ukevent))) {
+ cerr = -EFAULT;
+ break;
+ }
+ arg += sizeof(struct ukevent);
+
+ err = kevent_user_add_ukevent(&uk, u);
+ if (err) {
+ kevent_stat_im(u);
+ if (copy_to_user(orig, &uk, sizeof(struct ukevent))) {
+ cerr = -EFAULT;
+ break;
+ }
+ orig += sizeof(struct ukevent);
+ rnum++;
+ } else
+ knum++;
+ }
+
+out_setup:
+ if (cerr < 0) {
+ err = cerr;
+ goto out_remove;
+ }
+
+ err = rnum;
+out_remove:
+ mutex_unlock(&u->ctl_mutex);
+
+ return err;
+}
+
+/*
+ * In nonblocking mode it returns as many events as possible, but not more than @max_nr.
+ * In blocking mode it waits until the timeout expires or at least @min_nr events are ready.
+ */
+static int kevent_user_wait(struct file *file, struct kevent_user *u,
+ unsigned int min_nr, unsigned int max_nr, __u64 timeout,
+ void __user *buf)
+{
+ struct kevent *k;
+ int num = 0;
+
+ if (!(file->f_flags & O_NONBLOCK)) {
+ wait_event_interruptible_timeout(u->wait,
+ u->ready_num >= min_nr,
+ clock_t_to_jiffies(nsec_to_clock_t(timeout)));
+ }
+
+ while (num < max_nr && ((k = kqueue_dequeue_ready(u)) != NULL)) {
+ if (copy_to_user(buf + num*sizeof(struct ukevent),
+ &k->event, sizeof(struct ukevent)))
+ break;
+
+ /*
+ * If it is one-shot kevent, it has been removed already from
+ * origin's queue, so we can easily free it here.
+ */
+ if (k->event.req_flags & KEVENT_REQ_ONESHOT)
+ kevent_finish_user(k, 1);
+ ++num;
+ kevent_stat_wait(u);
+ }
+
+ return num;
+}
+
+static struct file_operations kevent_user_fops = {
+ .mmap = kevent_user_mmap,
+ .open = kevent_user_open,
+ .release = kevent_user_release,
+ .poll = kevent_user_poll,
+ .owner = THIS_MODULE,
+};
+
+static struct miscdevice kevent_miscdev = {
+ .minor = MISC_DYNAMIC_MINOR,
+ .name = kevent_name,
+ .fops = &kevent_user_fops,
+};
+
+static int kevent_ctl_process(struct file *file, unsigned int cmd, unsigned int num, void __user *arg)
+{
+ int err;
+ struct kevent_user *u = file->private_data;
+
+ if (!u || num > KEVENT_MAX_EVENTS)
+ return -EINVAL;
+
+ switch (cmd) {
+ case KEVENT_CTL_ADD:
+ err = kevent_user_ctl_add(u, num, arg);
+ break;
+ case KEVENT_CTL_REMOVE:
+ err = kevent_user_ctl_remove(u, num, arg);
+ break;
+ case KEVENT_CTL_MODIFY:
+ err = kevent_user_ctl_modify(u, num, arg);
+ break;
+ default:
+ err = -EINVAL;
+ break;
+ }
+
+ return err;
+}
+
+/*
+ * Used to get ready kevents from queue.
+ * @ctl_fd - kevent control descriptor, obtained by opening the kevent control device.
+ * @min_nr - minimum number of ready kevents.
+ * @max_nr - maximum number of ready kevents.
+ * @timeout - timeout in nanoseconds to wait until some events are ready.
+ * @buf - buffer to place ready events.
+ * @flags - unused for now (will be used for mmap implementation).
+ */
+asmlinkage long sys_kevent_get_events(int ctl_fd, unsigned int min_nr, unsigned int max_nr,
+ __u64 timeout, struct ukevent __user *buf, unsigned flags)
+{
+ int err = -EINVAL;
+ struct file *file;
+ struct kevent_user *u;
+
+ file = fget(ctl_fd);
+ if (!file)
+ return -ENODEV;
+
+ if (file->f_op != &kevent_user_fops)
+ goto out_fput;
+ u = file->private_data;
+
+ err = kevent_user_wait(file, u, min_nr, max_nr, timeout, buf);
+out_fput:
+ fput(file);
+ return err;
+}
+
+/*
+ * This syscall waits until either the timeout expires or at least one
+ * event becomes ready; it also commits @num kevents starting at @start
+ * as processed by userspace, so they can be removed or rearmed.
+ * @ctl_fd - kevent file descriptor.
+ * @start - index of the first kevent processed by userspace.
+ * @num - number of processed kevents.
+ * @timeout - number of nanoseconds to wait until there is free space
+ * in the kevent queue.
+ */
+asmlinkage long sys_kevent_wait(int ctl_fd, unsigned int start, unsigned int num, __u64 timeout)
+{
+ int err = -EINVAL;
+ struct file *file;
+ struct kevent_user *u;
+ struct kevent *k;
+ struct mukevent *muk;
+ unsigned int idx, off;
+ unsigned long flags;
+
+ if (start + num >= KEVENT_MAX_EVENTS ||
+ start >= KEVENT_MAX_EVENTS ||
+ num >= KEVENT_MAX_EVENTS)
+ return -EINVAL;
+
+ file = fget(ctl_fd);
+ if (!file)
+ return -ENODEV;
+
+ if (file->f_op != &kevent_user_fops)
+ goto out_fput;
+ u = file->private_data;
+
+ if (((start + num) / KEVENTS_ON_PAGE) >= u->pages_in_use ||
+ (start / KEVENTS_ON_PAGE) >= u->pages_in_use)
+ goto out_fput;
+
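+	/* Verify that every entry being committed still refers to a live kevent. */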
+ spin_lock_irqsave(&u->kevent_lock, flags);
+ while (num > 0) {
+ idx = start / KEVENTS_ON_PAGE;
+ off = start % KEVENTS_ON_PAGE;
+
+ muk = &u->pring[idx]->event[off];
+ k = __kevent_search(&muk->id, u);
+ if (unlikely(!k)) {
+ spin_unlock_irqrestore(&u->kevent_lock, flags);
+ goto out_fput;
+ }
+
+ if (++start >= KEVENT_MAX_EVENTS)
+ start = 0;
+ num--;
+ }
+ spin_unlock_irqrestore(&u->kevent_lock, flags);
+
+ if (!(file->f_flags & O_NONBLOCK)) {
+ wait_event_interruptible_timeout(u->wait,
+ u->ready_num >= 1,
+ clock_t_to_jiffies(nsec_to_clock_t(timeout)));
+ }
+
+ fput(file);
+
+ return (u->ready_num >= 1)?0:-EAGAIN;
+out_fput:
+ fput(file);
+ return err;
+}
+
+/*
+ * This syscall is used to perform various control operations
+ * on given kevent queue, which is obtained through kevent file descriptor @fd.
+ * @cmd - type of operation.
+ * @num - number of kevents to be processed.
+ * @arg - pointer to array of struct ukevent.
+ */
+asmlinkage long sys_kevent_ctl(int fd, unsigned int cmd, unsigned int num, struct ukevent __user *arg)
+{
+ int err = -EINVAL;
+ struct file *file;
+
+ file = fget(fd);
+ if (!file)
+ return -ENODEV;
+
+ if (file->f_op != &kevent_user_fops)
+ goto out_fput;
+
+ err = kevent_ctl_process(file, cmd, num, arg);
+
+out_fput:
+ fput(file);
+ return err;
+}
+
+/*
+ * Kevent subsystem initialization - create kevent cache and register
+ * the misc device to get control file descriptors from.
+ */
+static int __init kevent_user_init(void)
+{
+ int err = 0;
+
+ kevent_cache = kmem_cache_create("kevent_cache",
+ sizeof(struct kevent), 0, SLAB_PANIC, NULL, NULL);
+
+ err = misc_register(&kevent_miscdev);
+ if (err) {
+ printk(KERN_ERR "Failed to register kevent miscdev: err=%d.\n", err);
+ goto err_out_exit;
+ }
+
+	printk(KERN_INFO "KEVENT subsystem has been successfully registered.\n");
+
+ return 0;
+
+err_out_exit:
+ kmem_cache_destroy(kevent_cache);
+ return err;
+}
+
+module_init(kevent_user_init);
diff --git a/kernel/sys_ni.c b/kernel/sys_ni.c
index 6991bec..564e618 100644
--- a/kernel/sys_ni.c
+++ b/kernel/sys_ni.c
@@ -122,6 +122,10 @@ cond_syscall(ppc_rtas);
cond_syscall(sys_spu_run);
cond_syscall(sys_spu_create);

+cond_syscall(sys_kevent_get_events);
+cond_syscall(sys_kevent_wait);
+cond_syscall(sys_kevent_ctl);
+
/* mmu depending weak syscall entries */
cond_syscall(sys_mprotect);
cond_syscall(sys_msync);
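For readers skimming the patch, here is a minimal userspace sketch of how the
pieces above fit together. It is illustrative only: the syscall numbers are
placeholders, and the ukevent layout and the /dev/kevent path are assumptions
based on other parts of the patchset, not definitions from this patch (see
evtest.c on the project homepage for the real thing).

/* Hypothetical usage sketch, not part of the patch. */
#include <stdio.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/syscall.h>
#include <linux/types.h>

struct kevent_id { __u32 raw[2]; };            /* assumed layout */
struct ukevent {
	struct kevent_id id;                   /* e.g. fd or timer id */
	__u32 type;                            /* event origin (timer, socket, ...) */
	__u32 event;                           /* requested event mask */
	__u32 req_flags;                       /* e.g. KEVENT_REQ_ONESHOT */
	__u32 ret_flags;                       /* KEVENT_RET_DONE/BROKEN on return */
	__u32 ret_data[2];                     /* return data, see err_out_exit above */
	union { __u32 user[2]; void *ptr; };   /* opaque user data */
};

#define KEVENT_CTL_ADD          0              /* placeholder value */
#define __NR_kevent_ctl        -1              /* fill in for your arch */
#define __NR_kevent_get_events -1

int main(void)
{
	struct ukevent uk = { .type = 0 /* fill in a real type and id */ };
	int fd, n;

	fd = open("/dev/kevent", O_RDWR);      /* the miscdev registered above */
	if (fd < 0)
		return 1;

	/* queue one event source */
	if (syscall(__NR_kevent_ctl, fd, KEVENT_CTL_ADD, 1, &uk) < 0)
		return 1;

	/* block up to 2s (timeout is in nanoseconds) for at least one event */
	n = syscall(__NR_kevent_get_events, fd, 1, 1, 2000000000ULL, &uk, 0);
	if (n > 0)
		printf("ready, ret_flags=0x%x\n", uk.ret_flags);
	return 0;
}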

2006-09-22 19:23:19

by Andrew Morton

[permalink] [raw]
Subject: Re: [take19 0/4] kevent: Generic event handling mechanism.

On Wed, 20 Sep 2006 13:35:47 +0400
Evgeniy Polyakov <[email protected]> wrote:

> Generic event handling mechanism.
>
> Consider for inclusion.

Ulrich's objections sounded substantial, and afaik remain largely
unresolved. How do we sort this out?

2006-09-23 04:26:13

by Evgeniy Polyakov

[permalink] [raw]
Subject: Re: [take19 0/4] kevent: Generic event handling mechanism.

On Fri, Sep 22, 2006 at 12:22:07PM -0700, Andrew Morton ([email protected]) wrote:
> On Wed, 20 Sep 2006 13:35:47 +0400
> Evgeniy Polyakov <[email protected]> wrote:
>
> > Generic event handling mechanism.
> >
> > Consider for inclusion.
>
> Ulrich's objections sounded substantial, and afaik remain largely
> unresolved. How do we sort this out?

There are no objections, only a request for an additional interface.

The only two things missing from the patchset after his suggestions are
a new POSIX-like interface, which I personally consider very inconvenient
but which can in any case be implemented as an addon, and the signal mask
change. Ulrich has not answered how the latter differs from blocking in
userspace and then calling the appropriate syscall; I expect the difference
is only a reduced number of syscalls.

--
Evgeniy Polyakov

2006-09-26 15:54:27

by Christoph Hellwig

[permalink] [raw]
Subject: Re: [take19 0/4] kevent: Generic event handling mechanism.

On Fri, Sep 22, 2006 at 12:22:07PM -0700, Andrew Morton wrote:
> On Wed, 20 Sep 2006 13:35:47 +0400
> Evgeniy Polyakov <[email protected]> wrote:
>
> > Generic event handling mechanism.
> >
> > Consider for inclusion.
>
> Ulrich's objections sounded substantial, and afaik remain largely
> unresolved. How do we sort this out?

I haven't seen any of Ulrich's points (which are mostly a large subset of
my objections) being addressed.

2006-09-27 04:46:57

by Evgeniy Polyakov

[permalink] [raw]
Subject: Re: [take19 0/4] kevent: Generic event handling mechanism.

On Tue, Sep 26, 2006 at 04:54:16PM +0100, Christoph Hellwig ([email protected]) wrote:
> > > Generic event handling mechanism.
> > >
> > > Consider for inclusion.
> >
> > Ulrich's objections sounded substantial, and afaik remain largely
> > unresolved. How do we sort this out?
>
> I haven't seen any of Ulrich's points (which are mostly a large subset of
> my objections) being addressed.

Could you please be more specific?

As far as I can see, I addressed all the suggestions made by Christoph and
am still waiting for comments on the points I made in my reply to Ulrich's.

--
Evgeniy Polyakov

2006-09-27 15:10:50

by Evgeniy Polyakov

[permalink] [raw]
Subject: Re: [take19 0/4] kevent: Generic event handling mechanism.

On Wed, Sep 20, 2006 at 01:35:47PM +0400, Evgeniy Polyakov ([email protected]) wrote:
>
> Generic event handling mechanism.
>
> Consider for inclusion.

I have been told in private what signal masks are about - just waiting
until either a signal or a given condition is ready. But in that case just
add an additional kevent user, like AIO completion or network notification,
and wait until either the requested events are ready or the signal is triggered.

--
Evgeniy Polyakov

2006-10-04 04:50:13

by Ulrich Drepper

[permalink] [raw]
Subject: Re: [take19 0/4] kevent: Generic event handling mechanism.

On 9/27/06, Evgeniy Polyakov <[email protected]> wrote:
> I have been told in private what signal masks are about - just waiting
> until either a signal or a given condition is ready. But in that case just
> add an additional kevent user, like AIO completion or network notification,
> and wait until either the requested events are ready or the signal is triggered.

No, this won't work. Yes, I want signal notification as part of the
event handling. But there are situations when this is not suitable.
Only if the signal is expected in the same code using the event
handling can you do this. But this is not always possible.
Especially when the signal handling code is used in other parts of the
code than the event handling. E.g., signal handling in a library,
event handling in the main code. You cannot assume that all the code
is completely integrated.

2006-10-04 04:55:58

by Evgeniy Polyakov

[permalink] [raw]
Subject: Re: [take19 0/4] kevent: Generic event handling mechanism.

On Tue, Oct 03, 2006 at 09:50:09PM -0700, Ulrich Drepper ([email protected]) wrote:
> On 9/27/06, Evgeniy Polyakov <[email protected]> wrote:
> > I have been told in private what signal masks are about - just waiting
> > until either a signal or a given condition is ready. But in that case just
> > add an additional kevent user, like AIO completion or network notification,
> > and wait until either the requested events are ready or the signal is triggered.
>
> No, this won't work. Yes, I want signal notification as part of the
> event handling. But there are situations when this is not suitable.
> Only if the signal is expected in the same code using the event
> handling can you do this. But this is not always possible.
> Especially when the signal handling code is used in other parts of the
> code than the event handling. E.g., signal handling in a library,
> event handling in the main code. You cannot assume that all the code
> is completely integrated.

Signals can still be delivered in the usual way too.

When we enter sys_ppoll() we specify the needed signals as a syscall
parameter; with kevents we would add them into the queue.

--
Evgeniy Polyakov

2006-10-04 06:09:17

by Ulrich Drepper

[permalink] [raw]
Subject: Re: [take19 0/4] kevent: Generic event handling mechanism.

On 9/22/06, Evgeniy Polyakov <[email protected]> wrote:
> The only two things missing from the patchset after his suggestions are
> a new POSIX-like interface, which I personally consider very inconvenient,

This means you really do not know at all what this is about. We
already have these interfaces. Several of them and there will likely
be more. These are interfaces for functionality which needs the new
event notification. There is *NO* reason whatsoever to not make this

2006-10-04 06:10:54

by Ulrich Drepper

[permalink] [raw]
Subject: Re: [take19 0/4] kevent: Generic event handling mechanism.

[Bah, sent too early]

On 9/22/06, Evgeniy Polyakov <[email protected]> wrote:
> The only two things missing from the patchset after his suggestions are
> a new POSIX-like interface, which I personally consider very inconvenient,

This means you really do not know at all what this is about. We
already have these interfaces. Several of them and there will likely
be more. These are interfaces for functionality which needs the new
event notification. There is *NO* reason whatsoever not to add this
extension and instead invent new interfaces to have notification
sent to the event queue.

2006-10-04 06:25:36

by Evgeniy Polyakov

[permalink] [raw]
Subject: Re: [take19 0/4] kevent: Generic event handling mechanism.

On Tue, Oct 03, 2006 at 11:09:15PM -0700, Ulrich Drepper ([email protected]) wrote:
> On 9/22/06, Evgeniy Polyakov <[email protected]> wrote:
> >The only two things missing from the patchset after his suggestions are
> >a new POSIX-like interface, which I personally consider very inconvenient,
>
> This means you really do not know at all what this is about. We
> already have these interfaces. Several of them and there will likely
> be more. These are interfaces for functionality which needs the new
> event notification. There is *NO* reason whatsoever to not make this

It looks like I'm a bit puzzled...

Let me clarify my position.
Kevent, as a generic event handling mechanism, should not know how
events were added. It was designed to be quite flexible, so one can
add events from essentially any possible context.
One of the most common cases is userspace requests - they are added
through the set of created syscalls. There can exist tons of other
interfaces; I even created a special helper function for kernel
subsystems (existing and new ones) which might want to create events
using their own syscalls and parameters. For example, network AIO works
that way - it has its own syscalls, which parse parameters, create a
ukevent structure and pass it into the kevent core, which in turn calls
the appropriate callbacks back into network AIO.

Everyone can add new interfaces in any way he likes; it would be quite
silly to create a new subsystem which required a strict API and failed
to work with a different set of interfaces.

So from my point of view, the question is not 'we need only this API',
but 'why is the new API better than the old one'.
It is possible to create a new API which adds events from _existing_
syscalls - it is just one function call from the given syscall - and I
completely agree with that.
I'm just objecting to removing the existing interface in favour of a
new one.

People who need the POSIX timer API - feel free to call
kevent_user_add_ukevent() from your favorite posix_timer_create().
Whoever needs signal queueing can do it even in the signal() syscall -
the kevent callback for that subsystem can, for example, update the
process' signal mask and add kevents.

--
Evgeniy Polyakov

2006-10-04 06:28:07

by Evgeniy Polyakov

[permalink] [raw]
Subject: Re: [take19 0/4] kevent: Generic event handling mechanism.

On Tue, Oct 03, 2006 at 11:10:51PM -0700, Ulrich Drepper ([email protected]) wrote:
> [Bah, sent too early]
>
> On 9/22/06, Evgeniy Polyakov <[email protected]> wrote:
> >The only two things missing from the patchset after his suggestions are
> >a new POSIX-like interface, which I personally consider very inconvenient,
>
> This means you really do not know at all what this is about. We
> already have these interfaces. Several of them and there will likely
> be more. These are interfaces for functionality which needs the new
> event notification. There is *NO* reason whatsoever not to add this
> extension and instead invent new interfaces to have notification
> sent to the event queue.

As I described in the previous e-mail, there are completely _no_ limitations
on interfaces - it is possible to queue events from any place, no matter
whether it is the new interface (which I prefer to use) or any old one which
is more convenient for someone. There is a special helper function for that.
One can check the network AIO implementation to see how it was done in
practice - network AIO has its own syscalls (aio_send(), aio_recv() and
aio_sendfile()), which create a kevent queue and put their own events there;
it is completely transparent to userspace, which does not even know that
network AIO is based on kevent.

--
Evgeniy Polyakov

2006-10-04 06:34:06

by Ulrich Drepper

[permalink] [raw]
Subject: Re: [take19 1/4] kevent: Core files.

On 9/20/06, Evgeniy Polyakov <[email protected]> wrote:
> This patch includes core kevent files:
> [...]

I tried to look at the example programs before and failed. I tried
again. Where can I find up-to-date example code?

Some other points:

- I really would prefer not to rush all this into the upstream kernel.
The main problem is that the ring buffer interface is a shared data
structure. These are always tricky. We need to find the right
combination between size (as small as possible) and supporting all the
interfaces.

- so far only the timer and aio notifications are speced out. What
about the rest? Are we sure all aspects can be expressed? I am not
yet.

- we need an interface to add an event from userlevel. I.e., we need
to be able to synthesize events. There are events (like, for instance
the async DNS functionality) which come from userlevel code.

I would very much prefer we look at the other events before setting
the data structures in stone.

2006-10-04 06:49:29

by Evgeniy Polyakov

[permalink] [raw]
Subject: Re: [take19 1/4] kevent: Core files.

On Tue, Oct 03, 2006 at 11:34:02PM -0700, Ulrich Drepper ([email protected]) wrote:
> On 9/20/06, Evgeniy Polyakov <[email protected]> wrote:
> >This patch includes core kevent files:
> >[...]
>
> I tried to look at the example programs before and failed. I tried
> again. Where can I find up-to-date example code?

http://tservice.net.ru/~s0mbre/archive/kevent/evserver_kevent.c
http://tservice.net.ru/~s0mbre/archive/kevent/evtest.c

The structures have not changed since the beginning of the kevent project.

> Some other points:
>
> - I really would prefer not to rush all this into the upstream kernel.
> The main problem is that the ring buffer interface is a shared data
> structure. These are always tricky. We need to find the right
> combination between size (as small as possible) and supporting all the
> interfaces.

The mmap interface itself is in question, since it allows a DoS: there
are no rlimits for pinned memory.

> - so far only the timer and aio notification is speced out. What
> about the rest? Are we sure all aspects can be expressed? I am not
> yet.

AIO was removed from the patchset at Christoph's request.
Timers, network AIO, fs AIO, socket notifications and poll/select
events work well with the existing structures.

> - we need an interface to add an event from userlevel. I.e., we need
> to be able to synthesize events. There are events (like, for instance
> the async DNS functionality) which come from userlevel code.
>
> I would very much prefer we look at the other events before setting
> the data structures in stone.

Signals and userspace events (hello Solaris) easily fit into the existing
structures.

It is even possible to create variable-sized kevents - each kevent
contains a pointer to the user's data, which can be treated as a pointer
to an additional area (the kernel implementation for a given kevent type
can determine its size from other parameters, or use a predefined one,
and fetch the additional data in the ->enqueue() callback).

--
Evgeniy Polyakov

2006-10-04 07:32:12

by Ulrich Drepper

[permalink] [raw]
Subject: Re: [take19 0/4] kevent: Generic event handling mechanism.

Evgeniy Polyakov wrote:
> When we enter sys_ppoll() we specify the needed signals as a syscall
> parameter; with kevents we would add them into the queue.

No, this is not sufficient, as I said in the last mail. Why do you
completely ignore what others say? The code which depends on the signal
does not have to have access to the event queue. If a library sets up
an interrupt handler then it expects the signal to be delivered this way.
In such situations ppoll etc. allow the signal to be generally blocked
and enabled only, and *ATOMICALLY*, around the delays. This is not
possible with the current wait interface. We need the signal mask
interface and the appropriate setup code.

Being able to get signal notifications does not mean this is always the
way it can and must happen.

--
➧ Ulrich Drepper ➧ Red Hat, Inc. ➧ 444 Castro St ➧ Mountain View, CA ❖



2006-10-04 07:48:54

by Evgeniy Polyakov

[permalink] [raw]
Subject: Re: [take19 0/4] kevent: Generic event handling mechanism.

On Wed, Oct 04, 2006 at 12:33:25AM -0700, Ulrich Drepper ([email protected]) wrote:
> Evgeniy Polyakov wrote:
> > When we enter sys_ppoll() we specify needed signals as syscall
> > parameter, with kevents we will add them into the queue.
>
> No, this is not sufficient, as I said in the last mail. Why do you
> completely ignore what others say? The code which depends on the signal
> does not have to have access to the event queue. If a library sets up
> an interrupt handler then it expects the signal to be delivered this way.
> In such situations ppoll etc. allow the signal to be generally blocked
> and enabled only, and *ATOMICALLY*, around the delays. This is not
> possible with the current wait interface. We need the signal mask
> interface and the appropriate setup code.
>
> Being able to get signal notifications does not mean this is always the
> way it can and must happen.

It is completely possible to do what you describe without special
syscall parameters. Just add the interesting signals to the queue (and
optionally block them globally) and wait on that queue.
When the signal's event is generated and the corresponding kevent is
removed, that signal is restored in the global signal mask (there are
enqueue/dequeue callbacks which can perform operations on the signal
mask for a given process).
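As a sketch, reusing the ukevent declarations from the example after the
patch earlier in this thread (plus <signal.h>), the idea is simply the
following; KEVENT_SIGNAL, KEVENT_SOCKET and the semantics attached to them
are invented names here, since no signal type is defined in this patchset:

/* Hypothetical: one queue waits on both a socket and SIGUSR1. */
static int wait_sock_or_signal(int kev_fd, int sock_fd,
			       unsigned long long timeout_ns)
{
	struct ukevent uk[2] = {};

	uk[0].type = KEVENT_SOCKET;            /* assumed socket type name */
	uk[0].id.raw[0] = sock_fd;

	uk[1].type = KEVENT_SIGNAL;            /* hypothetical signal user */
	uk[1].id.raw[0] = SIGUSR1;             /* ->enqueue() could block it globally */

	if (syscall(__NR_kevent_ctl, kev_fd, KEVENT_CTL_ADD, 2, uk) < 0)
		return -1;

	/* one wait covers both sources; dequeueing the signal kevent would
	 * restore SIGUSR1 in the global mask, as described above */
	return syscall(__NR_kevent_get_events, kev_fd, 1, 2, timeout_ns, uk, 0);
}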

My main concern is not to add special cases to generic code, especially
when the generic code can easily handle such situations.


--
Evgeniy Polyakov

2006-10-04 17:23:24

by Ulrich Drepper

[permalink] [raw]
Subject: Re: [take19 0/4] kevent: Generic event handling mechanism.

Evgeniy Polyakov wrote:
> It is completely possible to do what you describe without special
> syscall parameters.

First of all, I don't see how this is efficiently possible. The mask
might change from call to call.

Second, hasn't it sunk in that inventing new ways to pass parameters is
bad? Programmers don't want to learn new ways for every new interface.
Reuse is good!

This applies to the signal mask here.

But there is another parameter falling into that category which I meant to
mention before: the timeout value. All other calls except poll, and
especially all modern interfaces, use a timespec pointer. This is the
way times are kept in userland code. Don't try to force people to do
something else.

Using a timespec also has the advantage that we can add an absolute
timeout value mode (optional) instead of the relative timeout value.

In this context, we should/must be able to specify which clock the
timeout is for (not as part of the wait call, but another control
operation perhaps). It's important to distinguish between
CLOCK_REALTIME and CLOCK_MONOTONIC. Both have their uses.
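For concreteness, the shape being argued for here is something like the
following hypothetical variant of the wait call; this is not what the
patchset currently implements, and the flag name is invented:

/* Hypothetical interface shape: timespec timeout plus a flag selecting
 * relative vs. absolute interpretation; the clock would be chosen by a
 * separate control operation on the queue. */
#define KEVENT_TIMEOUT_ABS 0x1                 /* invented flag */

int kevent_get_events_ts(int ctl_fd, unsigned int min_nr, unsigned int max_nr,
			 const struct timespec *timeout, /* NULL = block forever */
			 struct ukevent *buf, unsigned int flags);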

--
➧ Ulrich Drepper ➧ Red Hat, Inc. ➧ 444 Castro St ➧ Mountain View, CA ❖


2006-10-04 17:57:43

by Ulrich Drepper

[permalink] [raw]
Subject: Re: [take19 1/4] kevent: Core files.

On 10/3/06, Evgeniy Polyakov <[email protected]> wrote:
> http://tservice.net.ru/~s0mbre/archive/kevent/evserver_kevent.c
> http://tservice.net.ru/~s0mbre/archive/kevent/evtest.c

These are simple programs which by themselves have problems. For
instance, I consider it a very bad idea to hardcode the size of the ring
buffer. Specifying macros in the header file counts as hardcoding.
Systems grow over time and so will the demand of connections. I have
no problem with the kernel hardcoding the value internally (or having
a /proc entry to select it) but programs should be able to dynamically
learn about the value so they don't have to be recompiled.

But more problematic is that I don't see how the interfaces can be
efficiently used in multi-threaded (or multi-process) programs. How
would multiple threads using the same kevent queue and running in the
same kevent_get_events() loop work out? How do they guarantee that
each request is only handled once?

From what I see now this means a second data structure is needed to
keep track of the state of each entry. But even then, how do we even
recognized used ring buffer entries?

For instance, assume two threads. Both call get_events, one event is
reported, both threads are woken up (which is another thing to
consider, more later). One thread uses ring buffer entry, the other
goes back to sleep in get_events. Now, how does the kernel know when
the other thread is done working on the ring buffer entry? There
might be lots of entries coming in overflowing the entire buffer.
Heck, you don't even need two threads for this scenario.

When I was thinking about this (and discussing it in Ottawa) I was
always assuming that we have a status field in the ring buffer entry
which lets the userlevel code indicate whether the entry is free again
or not. This requires a writable mapping, yes, and potentially causes
cache line ping-pong. I think Zach mentioned he has some ideas about
this.


As for the multiple thread wakeup, I mentioned this before. We have
to avoid the thundering herd problem. We cannot wake up all waiters.
But we also cannot assume that, without protocols, waking up just one
for each available entry is sufficient. So the first question is:
what is the current policy?


> AIO was removed from the patchset at Christoph's request.
> Timers, network AIO, fs AIO, socket notifications and poll/select
> events work well with the existing structures.

Well, excuse me if I don't take your word for it. I agree, the AIO
code should not be submitted along with this. The same for any other
code using the event handling. But we need to check whether the
interface is generic enough to accomodate them in a way which actually
makes sense. Again, think highly threaded processes or multiple
processes sharing the same event queue.


> It is even possible to create variable-sized kevents - each kevent
> contains a pointer to the user's data, which can be treated as a pointer
> to an additional area (the kernel implementation for a given kevent type
> can determine its size from other parameters, or use a predefined one,
> and fetch the additional data in the ->enqueue() callback).

That sounds interesting and certainly helps with securing the
interface for the future. But if there is anything we can do to avoid
unnecessary costs we should do it, even if this means investigating
all of this further.

2006-10-05 08:58:36

by Evgeniy Polyakov

[permalink] [raw]
Subject: Re: [take19 1/4] kevent: Core files.

On Wed, Oct 04, 2006 at 10:57:32AM -0700, Ulrich Drepper ([email protected]) wrote:
> On 10/3/06, Evgeniy Polyakov <[email protected]> wrote:
> >http://tservice.net.ru/~s0mbre/archive/kevent/evserver_kevent.c
> >http://tservice.net.ru/~s0mbre/archive/kevent/evtest.c
>
> These are simple programs which by themselves have problems. For
> instance, I consider a very bad idea to hardcode the size of the ring
> buffer. Specifying macros in the header file counts as hardcoding.
> Systems grow over time and so will the demand of connections. I have
> no problem with the kernel hardcoding the value internally (or having
> a /proc entry to select it) but programs should be able to dynamically
> learn about the value so they don't have to be recompiled.

Well, it is possible to create a /sys/proc entry for that, and even now
userspace can grow the mapping ring until it is forbidden by the kernel,
which means the limit has been reached.

Actually the whole idea of a global limit on kevents does not sound very
good to me, but it is required to prevent overflow of the mapped buffer.

> But more problematic is that I don't see how the interfaces can be
> efficiently used in multi-threaded (or multi-process) programs. How
> would multiple threads using the same kevent queue and running in the
> same kevent_get_events() loop work out? How do they guarantee that
> each request is only handled once?

kqueue_dequeue_ready() is atomic; it removes the kevent from the ready
queue, so another thread cannot get it.

> From what I see now this means a second data structure is needed to
> keep track of the state of each entry. But even then, how do we even
> recognized used ring buffer entries?
>
> For instance, assume two threads. Both call get_events, one event is
> reported, both threads are woken up (which is another thing to
> consider, more later). One thread uses ring buffer entry, the other
> goes back to sleep in get_events. Now, how does the kernel know when
> the other thread is done working on the ring buffer entry? There
> might be lots of entries coming in overflowing the entire buffer.
> Heck, you don't even need two threads for this scenario.

Are you talking about the mapped buffer or the syscall interface?
The former has a special syscall, kevent_wait(), which reports the number
of 'processed' events and the first processed index, so the kernel can
remove all the appropriate events. The latter is described above -
kqueue_dequeue_ready() is atomic, so the event will be removed from the
ready queue and optionally from the whole kevent tree.

It is possible to work with both interfaces at the same time, since the
mapped buffer contains a copy of the event, which may be freed and
processed by another thread.
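Concretely, the mapped-buffer consumer loop would look roughly like this.
It is a sketch only: kevent_mring, mukevent and the KEVENT_* constants come
from the patchset headers, handle_event() is a stand-in, __NR_kevent_wait
is a placeholder, and pring[] is the mmap()ed ring, page by page:

extern void handle_event(struct mukevent *e);

static struct mukevent *ring_slot(struct kevent_mring **pring, unsigned int idx)
{
	/* same page/offset arithmetic as kevent_user_ring_add_event() */
	return &pring[idx / KEVENTS_ON_PAGE]->event[idx % KEVENTS_ON_PAGE];
}

static void consume_ring(int kev_fd, struct kevent_mring **pring,
			 unsigned long long timeout_ns)
{
	unsigned int start = 0;                 /* our committed position */

	for (;;) {
		unsigned int first = start, num = 0;

		while (start != pring[0]->index) {  /* entries the kernel wrote */
			handle_event(ring_slot(pring, start));
			start = (start + 1) % KEVENT_MAX_EVENTS;
			num++;
		}
		/* commit [first, first+num) as processed; blocks until more
		 * events are ready or the timeout expires */
		syscall(__NR_kevent_wait, kev_fd, first, num, timeout_ns);
	}
}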

Actually I do not like the idea of the mapped ring anyway: if an
application uses a lot of events, it will batch them into big chunks, so
the syscall overhead is negligible; if an application uses a small number
of events, syscalls will be rare and will not hurt performance.

> When I was thinking about this (and discussing it in Ottawa) I was
> always assuming that we have a status field in the ring buffer entry
> which lets the userlevel code indicate whether the entry is free again
> or not. This requires a writable mapping, yes, and potentially causes
> cache line ping-pong. I think Zach mentioned he has some ideas about
> this.

As far as I can see, there are no other ideas on how to implement the
ring buffer, so I did it the way I wanted. It has some limitations
indeed, but since I do not see any other code, how can I say what is
better or worse?

> As for the multiple thread wakeup, I mentioned this before. We have
> to avoid the thundering herd problem. We cannot wake up all waiters.
> But we also cannot assume that, without protocols, waking up just one
> for each available entry is sufficient. So the first question is:
> what is the current policy?

It is good practice _not_ to share the same queue between a lot of
threads. Currently all waiters are awakened.

> >AIO was removed from the patchset at Christoph's request.
> >Timers, network AIO, fs AIO, socket notifications and poll/select
> >events work well with the existing structures.
>
> Well, excuse me if I don't take your word for it. I agree, the AIO
> code should not be submitted along with this. The same for any other
> code using the event handling. But we need to check whether the
> interface is generic enough to accomodate them in a way which actually
> makes sense. Again, think highly threaded processes or multiple
> processes sharing the same event queue.

You missed the point.
I implemented _all_ of the above and it does work, although it was
removed from the submission patchset.
You can find all the patches on the kevent homepage; they were posted to
lkml@ and netdev@ too many times to miss.

> >It is even possible to create variable-sized kevents - each kevent
> >contains a pointer to the user's data, which can be treated as a pointer
> >to an additional area (the kernel implementation for a given kevent type
> >can determine its size from other parameters, or use a predefined one,
> >and fetch the additional data in the ->enqueue() callback).
>
> That sounds interesting and certainly helps with securing the
> interface for the future. But if there is anything we can do to avoid
> unnecessary costs we should do it, even if this means investigation
> all this further.

Ulrich, do _you_ have any ideas on how to change the data structures?
Not talk about investigations and the like, but a real design which
covers today's and tomorrow's needs?

The existing structures were not pulled out of thin air; they are the
result of quite long thought about the requirements of AIO/epoll and
networking development...

--
Evgeniy Polyakov

2006-10-05 09:02:47

by Evgeniy Polyakov

[permalink] [raw]
Subject: Re: [take19 0/4] kevent: Generic event handling mechanism.

On Wed, Oct 04, 2006 at 10:20:44AM -0700, Ulrich Drepper ([email protected]) wrote:
> Evgeniy Polyakov wrote:
> > It is completely possible to do what you describe without special
> > syscall parameters.
>
> First of all, I don't see how this is efficiently possible. The mask
> might change from call to call.

And you can add/remove signal events using the existing kevent API
between calls.

> Second, hasn't it sunk in that inventing new ways to pass parameters is
> bad? Programmers don't want to learn new ways for every new interface.
> Reuse is good!

And creating special cases for ordinary events is bad.
There is a unified way to deal with events in kevent -
add/remove/modify/wait - and signals are just ordinary events.

> This applies to the signal mask here.
>
> But there is another parameter falling into that category and I meant to
> mention it before: the timeout value. All other calls except poll and
> especially all modern interfaces use a timespec pointer. This is the
> way times are kept in userland code. Don't try to force people to do
> something else.
>
> Using a timespec also has the advantage that we can add an absolute
> timeout value mode (optional) instead of the relative timeout value.
>
> In this context, we should/must be able to specify which clock the
> timeout is for (not as part of the wait call, but another control
> operation perhaps). It's important to distinguish between
> CLOCK_REALTIME and CLOCK_MONOTONIC. Both have their uses.

I think you wanted to say that 'all event mechanisms except the most
commonly used poll/select/epoll use timespec'.
I designed it to be similar to poll(), which is a really good interface.
The nature of waiting is to wait for some amount of time, so that 'some
time' is what I put there.


--
Evgeniy Polyakov

2006-10-05 09:56:29

by Eric Dumazet

[permalink] [raw]
Subject: Re: [take19 1/4] kevent: Core files.

On Thursday 05 October 2006 10:57, Evgeniy Polyakov wrote:

> Well, it is possible to create a /sys/proc entry for that, and even now
> userspace can grow the mapping ring until it is forbidden by the kernel,
> which means the limit has been reached.

No need for yet another /sys/proc entry.

Right now, I (for example) may have a use for Generic event handling, but for
a program that needs XXX.XXX handles, and about XX.XXX events per second.

Right now, this program uses epoll, and reaches no limit at all, once you pass
the "ulimit -n", and other kernel wide tunes of course, not related to epoll.

With your current kevent, I cannot switch to it, because of hardcoded limits.

I may be wrong, but what is currently missing for me is :

- No hardcoded limit on the max number of events. (A process that can open
XXX.XXX files should be allowed to open a kevent queue with at least XXX.XXX
events.) Right now it's not clear what happens IF the current limit is
reached.

- In order to avoid touching the whole ring buffer, it might be good to be
able to reset the indexes to the beginning when the ring buffer is empty.
(So if the user land is responsive enough to consume events, only the first
pages of the mapping would be used: that saves L1/L2 cpu caches.)

A plus would be

- A working/usable mmap ring buffer implementation, but I think it's not
mandatory. System calls are not that expensive, especially if you can batch
XX events per syscall (like epoll). The nice thing with a ring buffer is that
we touch fewer cache lines than, say, epoll, which has a lot of linked
structures.

About mmap, I think you might want a hybrid thing:

One writable page where userland can write its index (and which holds one or
more futexes shared with the kernel, with appropriate thread locking in case
multiple threads want to dequeue events). In the fast path, no syscalls are
needed to maintain this user index.

XXX read-only pages (for user, but r/w for kernel), where the kernel writes
its own index, and the events of course.

Using separate cache lines avoids false sharing: the kernel can update its
own index and events without having to pay the price of cache line
ping-pong. It could use the futex infrastructure to wake up 'only' one
thread, instead of all threads waiting on an event.
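Laid out as data structures, the proposal amounts to something like this;
the names are invented here, and struct mukevent is borrowed from the
patchset:

/* Sketch of the hybrid mapping: page 0 is written by userland, the rest
 * only by the kernel, so the two indexes live on different cache lines. */
struct kevent_user_page {                     /* userland-writable page */
	volatile __u32 user_index;            /* last entry consumed */
	__u32 futex;                          /* kernel wakes one waiter here */
	/* rest of the page is padding */
};

struct kevent_kernel_page {                   /* read-only for userland */
	volatile __u32 kernel_index;          /* last entry produced */
	struct mukevent event[];              /* the events themselves */
};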


Eric

2006-10-05 10:21:52

by Evgeniy Polyakov

[permalink] [raw]
Subject: Re: [take19 1/4] kevent: Core files.

On Thu, Oct 05, 2006 at 11:56:24AM +0200, Eric Dumazet ([email protected]) wrote:
> On Thursday 05 October 2006 10:57, Evgeniy Polyakov wrote:
>
> > Well, it is possible to create a /sys/proc entry for that, and even now
> > userspace can grow the mapping ring until it is forbidden by the kernel,
> > which means the limit has been reached.
>
> No need for yet another /sys/proc entry.
>
> Right now, I (for example) may have a use for Generic event handling, but for
> a program that needs XXX.XXX handles, and about XX.XXX events per second.
>
> Right now, this program uses epoll, and reaches no limit at all, once you pass
> the "ulimit -n", and other kernel wide tunes of course, not related to epoll.
>
> With your current kevent, I cannot switch to it, because of hardcoded limits.
>
> I may be wrong, but what is currently missing for me is :
>
> - No hardcoded limit on the max number of events. (A process that can open
> XXX.XXX files should be allowed to open a kevent queue with at least XXX.XXX
> events). Right now thats not clear what happens IF the current limit is
> reached.

This would force overflows in the fixed-size memory-mapped buffer.
If we remove the memory-mapped buffer, or allow overflows (and thus
skipped entries), kevent can easily scale to those limits (tested with
xx.xxx events though).

> - In order to avoid touching the whole ring buffer, it might be good to be
> able to reset the indexes to the beginning when ring buffer is empty. (So if
> the user land is responsive enough to consume events, only first pages of the
> mapping would be used : that saves L1/L2 cpu caches)

And what happens when there are 3 empty slots at the beginning and we
need to put 4 ready events there?

> A plus would be
>
> - A working/usable mmap ring buffer implementation, but I think its not
> mandatory. System calls are not that expensive, especially if you can batch
> XX events per syscall (like epoll). Nice thing with a ring buffer is that we
> touch less cache lines than say epoll that have lot of linked structures.
>
> About mmap, I think you might want a hybrid thing :
>
> One writable page where userland can write its index, (and hold one or more
> futex shared by kernel) (with appropriate thread locking in case multiple
> threads want to dequeue events). In fast path, no syscalls are needed to
> maintain this user index.
>
> XXX readonly pages (for user, but r/w for kernel), where kernel write its own
> index, and events of course.

The problem is those xxx pages - how many can we eat per kevent
descriptor? It is pinned memory, and thus a DoS is possible.
If the xxx above is not enough to store all events, we will have
yet another broken behaviour, like rt-signal queue overflow.

> Using separate cache lines avoid false sharing : kernel can update its own
> index and events without having to pay the price of cache line ping pongs.
> It could use futex infrastructure to wakeup one thread 'only' instead of all
> threads waiting an event.
>
>
> Eric

--
Evgeniy Polyakov

2006-10-05 10:45:09

by Eric Dumazet

[permalink] [raw]
Subject: Re: [take19 1/4] kevent: Core files.

On Thursday 05 October 2006 12:21, Evgeniy Polyakov wrote:
> On Thu, Oct 05, 2006 at 11:56:24AM +0200, Eric Dumazet ([email protected])
> > I may be wrong, but what is currently missing for me is :
> >
> > - No hardcoded limit on the max number of events. (A process that can
> > open XXX.XXX files should be allowed to open a kevent queue with at least
> > XXX.XXX events.) Right now it's not clear what happens IF the current
> > limit is reached.
>
> This forces to overflows in fixed sized memory mapped buffer.
> If we remove memory mapped buffer or will allow to have overflows (and
> thus skipped entries) keven can easily scale to that limits (tested with
> xx.xxx events though).

What is missing or not obvious is: if events are skipped because of
overflows, what happens? Connections stuck forever? Hope that everything
will restore itself? Is the kernel able to SIGNAL this problem to user land?


>
> > - In order to avoid touching the whole ring buffer, it might be good to
> > be able to reset the indexes to the beginning when ring buffer is empty.
> > (So if the user land is responsive enough to consume events, only first
> > pages of the mapping would be used : that saves L1/L2 cpu caches)
>
> And what happens when there are 3 empty slots at the beginning and we
> need to put 4 ready events there?

Re-read what I said: when the ring buffer is empty.

When the ring buffer is empty, the kernel can reset the index right before
adding XX new events. You read '3 events consumed'; I said: when the whole
ring buffer is empty, because all previous events were consumed by user
land, then we can reset the indexes to 0.

>
> > A plus would be
> >
> > - A working/usable mmap ring buffer implementation, but I think its not
> > mandatory. System calls are not that expensive, especially if you can
> > batch XX events per syscall (like epoll). Nice thing with a ring buffer
> > is that we touch less cache lines than say epoll that have lot of linked
> > structures.
> >
> > About mmap, I think you might want a hybrid thing :
> >
> > One writable page where userland can write its index, (and hold one or
> > more futex shared by kernel) (with appropriate thread locking in case
> > multiple threads want to dequeue events). In fast path, no syscalls are
> > needed to maintain this user index.
> >
> > XXX readonly pages (for user, but r/w for kernel), where kernel write its
> > own index, and events of course.
>
> The problem is in that xxx pages - how many can we eat per kevent
> descriptor? It is pinned memory and thus it is possible to have a DoS.
> If xxx above is not enough to store all events, we will have
> yet-another-broken behaviour like rt-signal queue overflow.
>

Re-read: I have a process that has the right to open XXX.XXX handles,
allocating XXX.XXX tcp sockets, dentries, file structures, inodes, epoll
events; it's obviously already a DoS risk, but one controlled by 'ulimit -n'.

Allocating XXX.XXX * (32 or 64) bytes is a win if I can zap the epoll
structures (currently more than 256 bytes per event).

epoll structures are pinned too... what's wrong with that?

# egrep "filp|poll|TCP|dentries|sock_inode" /proc/slabinfo |cut -c1-50
tw_sock_TCP 1302 2200 192 20 1 :
request_sock_TCP 2046 4260 128 30 1 :
TCP 151509 196910 1472 5 2 :
eventpoll_pwq 146718 199439 72 53 1 :
eventpoll_epi 146718 199360 192 20 1 :
sock_inode_cache 149182 197940 640 6 1 :
filp 149537 202515 256 15 1 :

If you want to protect from DOS, just use ulimit -n 100

Eric

2006-10-05 10:56:11

by Evgeniy Polyakov

[permalink] [raw]
Subject: Re: [take19 1/4] kevent: Core files.

On Thu, Oct 05, 2006 at 12:45:03PM +0200, Eric Dumazet ([email protected]) wrote:
> On Thursday 05 October 2006 12:21, Evgeniy Polyakov wrote:
> > On Thu, Oct 05, 2006 at 11:56:24AM +0200, Eric Dumazet ([email protected])
> > > I may be wrong, but what is currently missing for me is :
> > >
> > > - No hardcoded limit on the max number of events. (A process that can
> > > open XXX.XXX files should be allowed to open a kevent queue with at least
> > > XXX.XXX events.) Right now it's not clear what happens IF the current
> > > limit is reached.
> >
> > This forces overflows in a fixed-size memory mapped buffer.
> > If we remove the memory mapped buffer or allow overflows (and
> > thus skipped entries) kevent can easily scale to those limits (tested with
> > xx.xxx events though).
>
> What is missing or not obvious is: if events are skipped because of
> overflows, what happens? Connections stuck forever? Hope that everything
> will restore itself? Is the kernel able to SIGNAL this problem to user land?

Existing code does not overflow by design, but can consume a lot of
memory. I talked about the case when there will be some limit on the
number of entries put into the mapped buffer.

> > > - In order to avoid touching the whole ring buffer, it might be good to
> > > be able to reset the indexes to the beginning when ring buffer is empty.
> > > (So if the user land is responsive enough to consume events, only first
> > > pages of the mapping would be used : that saves L1/L2 cpu caches)
> >
> > And what happens when there are 3 empty entries at the beginning and we
> > need to put there 4 ready events?
>
> Re-read what I said: when the ring buffer is empty.
>
> When the ring buffer is empty, the kernel can reset the index right before
> adding XX new events. You read "3 events consumed"; I said: when the whole
> ring buffer is empty, because all previous events were consumed by user land,
> then we can reset the indexes to 0.

It is the same.
What if the ring buffer was grown up to 3 entries, and is now empty, and we
need to put there 4 entries? Grow it again?
It can be done easily, but it looks like a workaround, not a solution.
And it is highly unlikely that in a situation when there are a lot of
events, the ring can be empty.

> >
> > > A plus would be
> > >
> > > - A working/usable mmap ring buffer implementation, but I think it's not
> > > mandatory. System calls are not that expensive, especially if you can
> > > batch XX events per syscall (like epoll). The nice thing with a ring
> > > buffer is that we touch fewer cache lines than, say, epoll, which has
> > > lots of linked structures.
> > >
> > > About mmap, I think you might want a hybrid thing:
> > >
> > > One writable page where userland can write its index (and hold one or
> > > more futexes shared with the kernel) (with appropriate thread locking in
> > > case multiple threads want to dequeue events). In the fast path, no
> > > syscalls are needed to maintain this user index.
> > >
> > > XXX readonly pages (for user, but r/w for kernel), where the kernel
> > > writes its own index, and events of course.
> >
> > The problem is in those xxx pages - how many can we eat per kevent
> > descriptor? It is pinned memory and thus it is possible to have a DoS.
> > If xxx above is not enough to store all events, we will have
> > yet another broken behaviour like the rt-signal queue overflow.
> >
>
> Re-read: I have a process that has the right to open XXX.XXX handles,
> allocating XXX.XXX tcp sockets, dentries, file structures, inodes, epoll
> events; it's obviously already a DOS risk, but controlled by 'ulimit -n'.
>
> Allocating XXX.XXX * (32 or 64) bytes is a win if I can zap the epoll
> structures (currently more than 256 bytes per event).
>
> epoll structures are pinned too... what's wrong with that ?
>
> # egrep "filp|poll|TCP|dentries|sock_inode" /proc/slabinfo |cut -c1-50
> tw_sock_TCP          1302   2200   192  20  1 :
> request_sock_TCP     2046   4260   128  30  1 :
> TCP                151509 196910  1472   5  2 :
> eventpoll_pwq      146718 199439    72  53  1 :
> eventpoll_epi      146718 199360   192  20  1 :
> sock_inode_cache   149182 197940   640   6  1 :
> filp               149537 202515   256  15  1 :
>
> If you want to protect from DOS, just use ulimit -n 100

epoll() does not have mmap.
The problem is not about how many events can be put into the kernel, but how
many of them can be put into the mapped buffer.
There is no problem if mmap is turned off.

> Eric

--
Evgeniy Polyakov

2006-10-05 12:09:35

by Eric Dumazet

[permalink] [raw]
Subject: Re: [take19 1/4] kevent: Core files.

On Thursday 05 October 2006 12:55, Evgeniy Polyakov wrote:
> On Thu, Oct 05, 2006 at 12:45:03PM +0200, Eric Dumazet ([email protected])
> >
> > What is missing or not obvious is: if events are skipped because of
> > overflows, what happens? Connections stuck forever? Hope that
> > everything will restore itself? Is the kernel able to SIGNAL this problem
> > to user land?
>
> Existing code does not overflow by design, but can consume a lot of
> memory. I talked about the case when there will be some limit on the
> number of entries put into the mapped buffer.

You still don't answer my question. Please answer the question.
Recap: you have a max of XXXX events queued. A network message comes and the
kernel wants to add another event. It cannot because the limit is reached.
How does the user program know that this problem was hit?


> It is the same.
> What if the ring buffer was grown up to 3 entries, and is now empty, and we
> need to put there 4 entries? Grow it again?
> It can be done easily, but it looks like a workaround, not a solution.
> And it is highly unlikely that in a situation when there are a lot of
> events, the ring can be empty.

I don't speak of re-allocation of the ring buffer. I don't mind allocating a
big enough buffer at startup.

Say you have allocated a ring buffer of 1024*1024 entries.
Then you queue 100 events per second, and dequeue them immediately.
No need to blindly use all 1024*1024 slots in the ring buffer, doing
index = (index+1)%(1024*1024)
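
For illustration, a minimal sketch of this reset-on-empty idea (all names
here - kring, kidx, uidx - are hypothetical, not taken from the kevent
patches):

    struct kring {
        unsigned int kidx;              /* kernel write index */
        unsigned int uidx;              /* user read index */
        struct ukevent ev[1024*1024];   /* ring slots */
    };

    /* Kernel side, called before enqueueing a batch of ready events:
     * if userland has consumed everything, restart at slot 0 so a slow,
     * steady workload keeps touching only the first pages (and cache
     * lines) of the mapping instead of cycling through all of them. */
    static void kring_maybe_reset(struct kring *r)
    {
        if (r->uidx == r->kidx)
            r->uidx = r->kidx = 0;
    }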



> epoll() does not have mmap.
> The problem is not about how many events can be put into the kernel, but how
> many of them can be put into the mapped buffer.
> There is no problem if mmap is turned off.

So zap mmap() support completely, since it is not usable at all. We won't
discuss it.

2006-10-05 12:37:59

by Evgeniy Polyakov

[permalink] [raw]
Subject: Re: [take19 1/4] kevent: Core files.

On Thu, Oct 05, 2006 at 02:09:31PM +0200, Eric Dumazet ([email protected]) wrote:
> On Thursday 05 October 2006 12:55, Evgeniy Polyakov wrote:
> > On Thu, Oct 05, 2006 at 12:45:03PM +0200, Eric Dumazet ([email protected])
> > >
> > > What is missing or not obvious is: if events are skipped because of
> > > overflows, what happens? Connections stuck forever? Hope that
> > > everything will restore itself? Is the kernel able to SIGNAL this
> > > problem to user land?
> >
> > Existing code does not overflow by design, but can consume a lot of
> > memory. I talked about the case when there will be some limit on the
> > number of entries put into the mapped buffer.
>
> You still don't answer my question. Please answer the question.
> Recap: you have a max of XXXX events queued. A network message comes and the
> kernel wants to add another event. It cannot because the limit is reached.
> How does the user program know that this problem was hit?

Existing design does not allow overflow.
If an event was added into the queue (like a user-requested notification
that new data has arrived), it is guaranteed that there will be a place to
put that event into the mapped buffer when it is ready.

If the user wants to add another event (for example, after accept() the user
wants to add another socket with a request for notification about data
arrival on that socket), that can fail though. This limit is introduced
only because of the mmap buffer.
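
For illustration, a rough sketch of how userspace would see such a failure
when adding a new kevent (this assumes the kevent_ctl(fd, KEVENT_CTL_ADD,
num, ukev) entry point and the per-event ret_flags error reporting of this
patchset; exact names may differ between takes):

    struct ukevent uk;
    memset(&uk, 0, sizeof(uk));
    uk.type = KEVENT_SOCKET;           /* socket notification... */
    uk.event = KEVENT_SOCKET_RECV;     /* ...for data arrival */
    /* ... fill in uk.id with the new socket ... */

    if (kevent_ctl(kfd, KEVENT_CTL_ADD, 1, &uk) < 0 ||
        (uk.ret_flags & KEVENT_RET_BROKEN)) {
        /* the add failed - e.g. because of the mmap buffer limit -
         * so the caller must back off or drop the connection */
    }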

> > It is the same.
> > What if the ring buffer was grown up to 3 entries, and is now empty, and
> > we need to put there 4 entries? Grow it again?
> > It can be done easily, but it looks like a workaround, not a solution.
> > And it is highly unlikely that in a situation when there are a lot of
> > events, the ring can be empty.
>
> I don't speak of re-allocation of the ring buffer. I don't mind allocating
> a big enough buffer at startup.
>
> Say you have allocated a ring buffer of 1024*1024 entries.
> Then you queue 100 events per second, and dequeue them immediately.
> No need to blindly use all 1024*1024 slots in the ring buffer, doing
> index = (index+1)%(1024*1024)

But what if they are not dequeued immediately? What if the rate is high and,
while one tries to dequeue, the system adds more events?

> > epoll() does not have mmap.
> > The problem is not about how many events can be put into the kernel, but
> > how many of them can be put into the mapped buffer.
> > There is no problem if mmap is turned off.
>
> So zap mmap() support completely, since it is not usable at all. We won't
> discuss it.

Initial implementation did not have it.
But I was requested to do it, and it is ready now.
No one likes it, but no one provides an alternative implementation.
We are stuck.

--
Evgeniy Polyakov

2006-10-05 14:01:27

by Hans Henrik Happe

[permalink] [raw]
Subject: Re: [take19 1/4] kevent: Core files.

On Thursday 05 October 2006 12:21, Evgeniy Polyakov wrote:
> On Thu, Oct 05, 2006 at 11:56:24AM +0200, Eric Dumazet ([email protected]) wrote:
> > On Thursday 05 October 2006 10:57, Evgeniy Polyakov wrote:
> >
> > > Well, it is possible to create /sys/proc entry for that, and even now
> > > userspace can grow mapping ring until it is forbidden by kernel, which
> > > means limit is reached.
> >
> > No need for yet another /sys/proc entry.
> >
> > Right now, I (for example) may have a use for Generic event handling, but
> > for a program that needs XXX.XXX handles, and about XX.XXX events per
> > second.
> >
> > Right now, this program uses epoll, and reaches no limit at all, once you
> > pass the "ulimit -n", and other kernel wide tunes of course, not related
> > to epoll.
> >
> > With your current kevent, I cannot switch to it, because of hardcoded
> > limits.
> >
> > I may be wrong, but what is currently missing for me is :
> >
> > - No hardcoded limit on the max number of events. (A process that can open
> > XXX.XXX files should be allowed to open a kevent queue with at least
> > XXX.XXX events). Right now that's not clear what happens IF the current
> > limit is reached.
>
> This forces overflows in a fixed-size memory mapped buffer.
> If we remove the memory mapped buffer or allow overflows (and
> thus skipped entries) kevent can easily scale to those limits (tested with
> xx.xxx events though).
>
> > - In order to avoid touching the whole ring buffer, it might be good to be
> > able to reset the indexes to the beginning when ring buffer is empty. (So
> > if the user land is responsive enough to consume events, only first pages
> > of the mapping would be used : that saves L1/L2 cpu caches)
>
> And what happens when there are 3 empty entries at the beginning and we
> need to put there 4 ready events?

Couldn't there be 3 areas in the mmap buffer:

- Unused: entries that the kernel can alloc from.
- Alloced: entries alloced by the kernel but not yet used by the user. The
kernel can update these if new events require that.
- Consumed: entries that the user is processing.

The user takes a set of alloced entries and makes them consumed. Then it
processes the events, after which it makes them unused.

If there are no unused entries and the kernel needs some, it has to wait for
free entries. The user has to notify when unused entries become available. It
could set a flag in the mmap'ed area to avoid unnecessary wakeups.
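
For illustration, a minimal sketch of such a three-area ring (struct and
field names are hypothetical, not taken from the kevent patches):

    struct ring3 {
        unsigned int alloc_idx;     /* kernel-owned: end of Alloced area */
        unsigned int consume_idx;   /* user-owned: end of Consumed area */
        unsigned int release_idx;   /* user-owned: end of Unused area */
        struct ukevent ev[RING_SIZE];
    };

    /*
     * Areas, as slots wrap around the ring:
     *   Alloced:  [consume_idx, alloc_idx)   written by kernel, not yet taken
     *   Consumed: [release_idx, consume_idx) being processed by the user
     *   Unused:   [alloc_idx, release_idx)   free for the kernel to fill
     *
     * The kernel advances alloc_idx when it publishes events; the user
     * advances consume_idx to take events and release_idx when done, so
     * two indexes live in the user-writable page and one in the kernel's.
     */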

There are some details with indexing and wakeup notification that I have left
out, but I hope my idea is clear. I could give a more detailed description if
requested. Also, I'm a user-level programmer so I might not get the whole
picture.

Hans Henrik Happe

2006-10-05 14:21:39

by Evgeniy Polyakov

[permalink] [raw]
Subject: Re: [take19 1/4] kevent: Core files.

On Thu, Oct 05, 2006 at 04:01:19PM +0200, Hans Henrik Happe ([email protected]) wrote:
> > And what happens when there are 3 empty entries at the beginning and we
> > need to put there 4 ready events?
>
> Couldn't there be 3 areas in the mmap buffer:
>
> - Unused: entries that the kernel can alloc from.
> - Alloced: entries alloced by the kernel but not yet used by the user. The
> kernel can update these if new events require that.
> - Consumed: entries that the user is processing.
>
> The user takes a set of alloced entries and makes them consumed. Then it
> processes the events, after which it makes them unused.
>
> If there are no unused entries and the kernel needs some, it has to wait for
> free entries. The user has to notify when unused entries become available. It
> could set a flag in the mmap'ed area to avoid unnecessary wakeups.
>
> There are some details with indexing and wakeup notification that I have left
> out, but I hope my idea is clear. I could give a more detailed description if
> requested. Also, I'm a user-level programmer so I might not get the whole
> picture.

This looks good on paper, but how can you put it into page-based
storage without major and complex shared structures, which should be
properly locked between kernelspace and userspace?

> Hans Henrik Happe

--
Evgeniy Polyakov

2006-10-05 14:44:36

by Ulrich Drepper

[permalink] [raw]
Subject: Re: [take19 0/4] kevent: Generic event handling mechanism.

Evgeniy Polyakov wrote:
> And you can add/remove signal events using existing kevent api between
> calls.

That's far more expensive than using a mask under control of the program.


> And creating special cases for usual events is bad.
> There is unified way to deal with events in kevent -
> add/remove/modify/wait on them, signals are just usual events.

How can this be unified? The installment of the temporary signal mask
is unlike the handling of signals for the purpose of reporting them
through the signal queue. It is equally new functionality.
Don't kid yourself into thinking that because this is signal stuff, too,
you're "unifying" something. The way this signal mask is used has
nothing whatsoever to do with delivering signals via the event
queue. For the latter the signals always must be blocked (similar to
sigwait's requirement).

As a result it means you want to introduce a new mechanism for the event
queue instead of using the well known and often used method of
optionally passing a signal mask to the syscall. That's just insane.


> I think you wanted to say, that 'all event mechanism except the most
> commonly used poll/select/epoll use timespec'.

Get your facts straight. select uses timeval, which is just the
predecessor of timespec. And epoll is just (badly) designed after
poll. The fact is therefore that poll plus its spawn is the only interface
using such a timeout method.


> I designed it to be similar to poll(), it is really good interface.

Not many people agree. All the interfaces designed (not derived) in the
last years take a timespec parameter.

Plus, you chose to ignore all the nice things using a timespec allows you,
like absolute timeout modes etc. See the clock_nanosleep() interface
for a way this can be useful.

--
➧ Ulrich Drepper ➧ Red Hat, Inc. ➧ 444 Castro St ➧ Mountain View, CA ❖



2006-10-05 15:07:55

by Hans Henrik Happe

[permalink] [raw]
Subject: Re: [take19 1/4] kevent: Core files.

On Thursday 05 October 2006 16:15, Evgeniy Polyakov wrote:
> On Thu, Oct 05, 2006 at 04:01:19PM +0200, Hans Henrik Happe ([email protected]) wrote:
> > > And what happens when there are 3 empty entries at the beginning and we
> > > need to put there 4 ready events?
> >
> > Couldn't there be 3 areas in the mmap buffer:
> >
> > - Unused: entries that the kernel can alloc from.
> > - Alloced: entries alloced by the kernel but not yet used by the user. The
> > kernel can update these if new events require that.
> > - Consumed: entries that the user is processing.
> >
> > The user takes a set of alloced entries and makes them consumed. Then it
> > processes the events, after which it makes them unused.
> >
> > If there are no unused entries and the kernel needs some, it has to wait
> > for free entries. The user has to notify when unused entries become
> > available. It could set a flag in the mmap'ed area to avoid unnecessary
> > wakeups.
> >
> > There are some details with indexing and wakeup notification that I have
> > left out, but I hope my idea is clear. I could give a more detailed
> > description if requested. Also, I'm a user-level programmer so I might not
> > get the whole picture.
>
> This looks good on paper, but how can you put it into page-based
> storage without major and complex shared structures, which should be
> properly locked between kernelspace and userspace?

I wasn't clear about the structure. I meant a ring buffer with 3 areas. So
it's basically the same model as Eric Dumazet described, only with 3 indexes:
2 in the user-writable page and 1 in the kernel's.

When the kernel has alloced an entry it should store it in a way that makes
it invalid after user consumption, which is simply an increment of an index.
Sliding-window-like schemes should solve this.

Hans Henrik Happe

2006-10-06 08:36:54

by Evgeniy Polyakov

[permalink] [raw]
Subject: Re: [take19 0/4] kevent: Generic event handling mechanism.

On Thu, Oct 05, 2006 at 07:45:23AM -0700, Ulrich Drepper ([email protected]) wrote:
> Evgeniy Polyakov wrote:
> > And you can add/remove signal events using existing kevent api between
> > calls.
>
> That's far more expensive than using a mask under control of the program.

In the context you have cut, one updates the signal mask between calls to the
event delivery mechanism (using for example signal()), so it has exactly the
same price.

> > And creating special cases for usual events is bad.
> > There is unified way to deal with events in kevent -
> > add/remove/modify/wait on them, signals are just usual events.
>
> How can this be unified? The installment of the temporary signal mask
> is unlike the handling of signals for the purpose of reporting them
> through the signal queue. It is equally new functionality.
> Don't kid yourself into thinking that because this is signal stuff, too,
> you're "unifying" something. The way this signal mask is used has
> nothing whatsoever to do with delivering signals via the event
> queue. For the latter the signals always must be blocked (similar to
> sigwait's requirement).
>
> As a result it means you want to introduce a new mechanism for the event
> queue instead of using the well known and often used method of
> optionally passing a signal mask to the syscall. That's just insane.

I created it just because I think that the POSIX workaround of adding signals
into the syscall parameters is not good enough.
With exactly the same logic I created kevent to drive AIO completion,
to work with socket notifications and timers and so on.
All of the above (listio(), poll(), setitimer() and so on) are known and good
interfaces, but practice shows that having one interface which can
easily work with all existing cases is much more convenient than having
tons of them.

So, yes, I do introduce a new mechanism to solve the signal problem here,
just because I think all existing ones have some limitations or problems.

> > I think you wanted to say, that 'all event mechanism except the most
> > commonly used poll/select/epoll use timespec'.
>
> Get your facts straight. select uses timeval which is just the
> predecessor of of timespec. And epoll is just (badly) designed after
> poll. Fact is therefore that poll plus its spawn is the only interface
> using such a timeout method.

And it was designed very well, although it looks like we disagree a bit
here...

> > I designed it to be similar to poll(), it is really good interface.
>
> Not many people agree. All the interfaces designed (not derived) in the
> last years take a timespec parameter.
>
> Plus, you chose to ignore all the nice things using a timespec allow you
> like absolute timeout modes etc. See the clock_nanosleep() interface
> for a way this can be useful.

You again cut my explanation of why just a pure timeout is used.
We start a syscall, which can block forever, so we want to limit its
time, and we add a special parameter to show how long this syscall should
run. The timeout is not about how long we should sleep (which indeed can be
absolute), but how long the syscall should run - which is relative to the
time the syscall started.

> --
> ➧ Ulrich Drepper ➧ Red Hat, Inc. ➧ 444 Castro St ➧ Mountain View, CA ❖

--
Evgeniy Polyakov

2006-10-15 22:45:08

by Ulrich Drepper

[permalink] [raw]
Subject: Re: [take19 0/4] kevent: Generic event handling mechanism.

Evgeniy Polyakov wrote:
> In the context you have cut, one updates the signal mask between calls to
> the event delivery mechanism (using for example signal()), so it has exactly
> the same price.

No, it does not. If the signal mask is recomputed by the program for
each new wait call then you have a lot more work to do when the signal
mask is implicitly specified.


> I created it just because I think that the POSIX workaround of adding
> signals into the syscall parameters is not good enough.

Not good enough? It does exactly what it is supposed to do. What can
there be "not good enough"?


> You again cut my explanation of why just a pure timeout is used.
> We start a syscall, which can block forever, so we want to limit its
> time, and we add a special parameter to show how long this syscall should
> run. The timeout is not about how long we should sleep (which indeed can be
> absolute), but how long the syscall should run - which is relative to the
> time the syscall started.

I know very well what a timeout is. But the way the timeout can be
specified can vary. It is often useful (as for select, poll) to specify
relative timeouts.

But there are equally useful uses where the timeout is needed at a
specific point in time. Without a syscall interface which can have an
absolute timeout parameter we'd have to write, as a poor approximation at
userlevel:

struct timespec ts;
clock_gettime (CLOCK_REALTIME, &ts);
struct timespec rel;
rel.tv_sec = abstmo.tv_sec - ts.tv_sec;
rel.tv_nsec = abstmo.tv_nsec - ts.tv_nsec;
if (rel.tv_nsec < 0) {
    rel.tv_nsec += 1000000000;
    --rel.tv_sec;
}
if (rel.tv_sec < 0)
    inttmo = -1; // or whatever is used for return immediately
else
    inttmo = rel.tv_sec * UINT64_C(1000000000) + rel.tv_nsec;

wait(..., inttmo, ...)


Not only is this much more expensive to do at userlevel, it is also
inadequate because calls to settimeofday() do not cause a recomputation
of the timeout.

See Ingo's RT futex stuff as an example for a kernel interface which
does it right.

--
➧ Ulrich Drepper ➧ Red Hat, Inc. ➧ 444 Castro St ➧ Mountain View, CA ❖

2006-10-15 23:24:24

by Ulrich Drepper

[permalink] [raw]
Subject: Re: [take19 1/4] kevent: Core files.

Evgeniy Polyakov wrote:
> Existing design does not allow overflow.

And I've pointed out a number of times that this is not practical at
best. There are event sources which can create events which cannot be
coalesced into one single event as it would be required with your design.

Signals are one example, specifically realtime signals. If we do not
want the design to be limited from the start this approach has to be
thought over.


>> So zap mmap() support completely, since it is not usable at all. We won't
>> discuss it.
>
> Initial implementation did not have it.
> But I was requested to do it, and it is ready now.
> No one likes it, but no one provides an alternative implementation.
> We are stuck.

We need the mapped ring buffer. The current design (before it was
removed) was broken but this does not mean it shouldn't be implemented.
We just need more time to figure out how to implement it correctly.

--
➧ Ulrich Drepper ➧ Red Hat, Inc. ➧ 444 Castro St ➧ Mountain View, CA ❖

2006-10-16 07:23:47

by Evgeniy Polyakov

[permalink] [raw]
Subject: Re: [take19 0/4] kevent: Generic event handling mechanism.

On Sun, Oct 15, 2006 at 03:43:39PM -0700, Ulrich Drepper ([email protected]) wrote:
> Evgeniy Polyakov wrote:
> >In the context you have cut, one updates the signal mask between calls to
> >the event delivery mechanism (using for example signal()), so it has
> >exactly the same price.
>
> No, it does not. If the signal mask is recomputed by the program for
> each new wait call then you have a lot more work to do when the signal
> mask is implicitly specified.

One can set up a number of events before the syscall and not remove them
after the syscall. They can be updated if there is a need for that.

> >I created it just because I think that the POSIX workaround of adding
> >signals into the syscall parameters is not good enough.
>
> Not good enough? It does exactly what it is supposed to do. What can
> there be "not good enough"?

Not to move signals into a special case of events. If poll() can not work
with them, it does not mean that they need to be specified as an additional
syscall parameter; instead change poll() to work with them, which can be
easily done with kevents.

> >You again cut my explanation of why just a pure timeout is used.
> >We start a syscall, which can block forever, so we want to limit its
> >time, and we add a special parameter to show how long this syscall should
> >run. The timeout is not about how long we should sleep (which indeed can
> >be absolute), but how long the syscall should run - which is relative to
> >the time the syscall started.
>
> I know very well what a timeout is. But the way the timeout can be
> specified can vary. It is often useful (as for select, poll) to specify
> relative timeouts.
>
> But there are equally useful uses where the timeout is needed at a
> specific point in time. Without a syscall interface which can have an
> absolute timeout parameter we'd have to write, as a poor approximation at
> userlevel:
>
> struct timespec ts;
> clock_gettime (CLOCK_REALTIME, &ts);
> struct timespec rel;
> rel.tv_sec = abstmo.tv_sec - ts.tv_sec;
> rel.tv_nsec = abstmo.tv_nsec - ts.tv_nsec;
> if (rel.tv_nsec < 0) {
>     rel.tv_nsec += 1000000000;
>     --rel.tv_sec;
> }
> if (rel.tv_sec < 0)
>     inttmo = -1; // or whatever is used for return immediately
> else
>     inttmo = rel.tv_sec * UINT64_C(1000000000) + rel.tv_nsec;
>
> wait(..., inttmo, ...)

Do not mix warm and soft - waiting for some period is not equal to a
syscall timeout. Waiting is possible with the timer kevent user (although
only with a relative timeout; it can be changed to support both, not a big
problem).

> Not only is this much more expensive to do at userlevel, it is also
> inadequate because calls to settimeofday() do not cause a recomputation
> of the timeout.
>
> See Ingo's RT futex stuff as an example for a kernel interface which
> does it right.

I'm quite sure that absolute timeouts are very useful, but not in
the case of waiting for syscall completion. In any case, kevent can be
extended to support absolute timeouts in its timer notifications.

> --
> ➧ Ulrich Drepper ➧ Red Hat, Inc. ➧ 444 Castro St ➧ Mountain View,
> CA ❖

--
Evgeniy Polyakov

2006-10-16 07:34:34

by Evgeniy Polyakov

[permalink] [raw]
Subject: Re: [take19 1/4] kevent: Core files.

On Sun, Oct 15, 2006 at 04:22:45PM -0700, Ulrich Drepper ([email protected]) wrote:
> Evgeniy Polyakov wrote:
> >Existing design does not allow overflow.
>
> And I've pointed out a number of times that this is not practical at
> best. There are event sources which can create events which cannot be
> coalesced into one single event as it would be required with your design.
>
> Signals are one example, specifically realtime signals. If we do not
> want the design to be limited from the start this approach has to be
> thought over.

The whole idea of the mmap buffer seems to be broken, since those who asked
for its creation do not like the existing design and do not show theirs...

Regarding signals and the possibility of overflow in the existing ring
buffer implementation:
You seem to not have checked the code - each event can be marked as ready
only once, which means only one copy and so on.
It was done _specially_. And it is not a limitation, but a "new" approach.
A queue of the same signals or any other events has a fundamental flaw
(as does any other ring buffer implementation which has a queue size) -
the size of the queue and the extremely bad case of overflow.
So the same event may not be ready several times. Any design which
allows creating an infinite number of events generated for the same case
is broken, since the consumer can end up in a situation where it can not
handle that flow. That is why poll() returns only POLLIN when data is ready
in the network stack, and does not try to generate some kind of signal for
each byte/packet/MTU/MSS received.
RT signals have design problems, and I will not repeat the same error
with similar limits in kevent.

> >>So zap mmap() support completely, since it is not usable at all. We
> >>won't discuss it.
> >
> >Initial implementation did not have it.
> >But I was requested to do it, and it is ready now.
> >No one likes it, but no one provides an alternative implementation.
> >We are stuck.
>
> We need the mapped ring buffer. The current design (before it was
> removed) was broken but this does not mean it shouldn't be implemented.
> We just need more time to figure out how to implement it correctly.

In the latest patchset it was removed. I'm waiting for your code.

The mmap implementation can be added separately, since it does not affect
the kevent core.

> --
> ➧ Ulrich Drepper ➧ Red Hat, Inc. ➧ 444 Castro St ➧ Mountain View,
> CA ❖

--
Evgeniy Polyakov

2006-10-16 10:00:37

by Ulrich Drepper

[permalink] [raw]
Subject: Re: [take19 0/4] kevent: Generic event handling mechanism.

Evgeniy Polyakov wrote:
> One can set up a number of events before the syscall and not remove them
> after the syscall. They can be updated if there is a need for that.

Nobody doubts that it is possible. But it is

a) potentially much more expensive

and

b) an alien concept

to have the signal mask set during the wait call implicitly.
Conceptually it doesn't even make sense. This is no event to wait for.
It is a parameter for the specific wait call, just like the timeout. And
I fortunately haven't seen you proposing to pass the timeout value
implicitly.


>> Not good enough? It does exactly what it is supposed to do. What can
>> there be "not good enough"?
>
> Not to move signals into a special case of events. If poll() can not work
> with them, it does not mean that they need to be specified as an additional
> syscall parameter; instead change poll() to work with them, which can be
> easily done with kevents.

You still seem to be completely missing the point. The signal mask is
no event to wait for. The fact that ppoll() takes the signal mask as a
parameter has nothing to do with this. The signal mask is a parameter for
the wait call just like the timeout, not more and not less.


> Do not mix warm and soft - waiting for some period is not equal to a
> syscall timeout. Waiting is possible with the timer kevent user (although
> only with a relative timeout; it can be changed to support both, not a big
> problem).

That's what I'm saying all the time. Of course it can be supported.
But for this the timeout parameter must be a timespec pointer. Whatever
you could possibly mean by "do not mix warm and soft" I cannot possibly
imagine. Fact is that both relative and absolute timeouts are useful.
And that for absolute timeouts the change of the clock has to be taken
into account.


> I'm quite sure that absolute timeouts are very useful, but not in
> the case of waiting for syscall completion. In any case, kevent can be
> extended to support absolute timeouts in its timer notifications.

That's not the same. If you argue that, then the syscall should have no
timeout parameter at all. The fact is that setting up a timer is not for
free. Since the timeout is used all the time, having a timeout parameter
is the right answer. And if you do this then do it right, just like
every other syscall other than poll: use a timespec object. This gives
flexibility without measurable cost.

--
➧ Ulrich Drepper ➧ Red Hat, Inc. ➧ 444 Castro St ➧ Mountain View, CA ❖

2006-10-16 10:17:15

by Ulrich Drepper

[permalink] [raw]
Subject: Re: [take19 1/4] kevent: Core files.

Evgeniy Polyakov wrote:
> The whole idea of the mmap buffer seems to be broken, since those who asked
> for its creation do not like the existing design and do not show theirs...

What kind of argumentation is that?

"Because my attempt to implement it doesn't work and nobody right
away has a better suggestion this means the idea is broken."

Nonsense.

It just means that time should be spent thinking about this. You cut
all this short by rushing out your attempt without any discussions.
Unfortunately nobody else really looked at the approach so it lingered
around for some weeks. Well, now it is clear that it is not the right
approach and we can start thinking about it again.


> You seem to not have checked the code - each event can be marked as ready
> only once, which means only one copy and so on.
> It was done _specially_. And it is not a limitation, but a "new" approach.

I know that it is done deliberately and I tell you that this is wrong
and unacceptable. Realtime signals are one event type which needs to have
more than one event queued. This is no description of what you have
implemented, it's a description of the reality of realtime signals.

RT signals are queued. They carry a data value (the sigval_t object)
which can be unique for each signal delivery. Coalescing the signal
events therefore leads to information loss.

Therefore, at the very least for signals we need to have the ability to
queue more than one event for each event source. Not having this
functionality means that signals and likely other types of events cannot
be implemented using kevent queues.


> A queue of the same signals or any other events has a fundamental flaw
> (as does any other ring buffer implementation which has a queue size) -
> the size of the queue and the extremely bad case of overflow.

Of course there are additional problems. Overflows need to be handled.
But this is nothing which is unsolvable.


> So the same event may not be ready several times. Any design which
> allows creating an infinite number of events generated for the same case
> is broken, since the consumer can end up in a situation where it can not
> handle that flow.

That's complete nonsense. Again, for RT signals it is very reasonable
and not "broken" to have multiple outstanding signals.


> That is why poll() returns only POLLIN when data is ready in
> the network stack, and does not try to generate some kind of signal for
> each byte/packet/MTU/MSS received.

It makes no sense to drag poll() into this discussion. poll() is a very
limited interface. The new event handling is supposed to be the
opposite, namely usable for all kinds of events. Arguing that it must be
this way because poll() does it like this just means you don't see what a
big step is needed to get to the goal of unified event handling. The
shackles of poll() must be left behind.


> RT signals have design problems, and I will not repeat the same error
> with similar limits in kevent.

I don't know what to say. You claim to be the source of all wisdom in
OS design. Maybe you should design your own OS, from the ground up. I
wonder how many people would like that, since all your arguments are
squarely geared towards optimizing the implementation. But: the
implementation is irrelevant without users. The functionality users (=
programmers) want and need is what must drive the implementation. And
RT signals are definitely heavily used and liked by programmers. You
have to accept that you are trying to modify an OS which has that
functionality, regardless of how much you hate it and want to fight it.


> The mmap implementation can be added separately, since it does not affect
> the kevent core.

That I doubt very much, and it is why I would not want the kevent stuff
to go into any released kernel until that "detail" is resolved.

--
➧ Ulrich Drepper ➧ Red Hat, Inc. ➧ 444 Castro St ➧ Mountain View, CA ❖

2006-10-16 10:39:51

by Evgeniy Polyakov

[permalink] [raw]
Subject: Re: [take19 0/4] kevent: Generic event handling mechanism.

On Mon, Oct 16, 2006 at 02:59:48AM -0700, Ulrich Drepper ([email protected]) wrote:
> Evgeniy Polyakov wrote:
> >One can set up a number of events before the syscall and not remove them
> >after the syscall. They can be updated if there is a need for that.
>
> Nobody doubts that it is possible. But it is
>
> a) potentially much more expensive
>
> and
>
> b) an alien concept
>
> to have the signal mask set during the wait call implicitly.
> Conceptually it doesn't even make sense. This is no event to wait for.
> It is a parameter for the specific wait call, just like the timeout. And
> I fortunately haven't seen you proposing to pass the timeout value
> implicitly.

Because the timeout has its meaning for syscall processing, but signals are
completely separate objects. Why do you want to allow queuing signals
_and_ adding a "temporary" signal mask for the syscall? Just use one way -
queue them all.

> >>Not good enough? It does exactly what it is supposed to do. What can
> >>there be "not good enough"?
> >
> >Not to move signals into a special case of events. If poll() can not work
> >with them, it does not mean that they need to be specified as an
> >additional syscall parameter; instead change poll() to work with them,
> >which can be easily done with kevents.
>
> You still seem to be completely missing the point. The signal mask is
> no event to wait for. The fact that ppoll() takes the signal mask as a
> parameter has nothing to do with this. The signal mask is a parameter for
> the wait call just like the timeout, not more and not less.

That's where we have different opinions (among other places :) - I do
not agree that signals are parameters for the syscall, I insist that they
are usual events. ppoll() shows us that there is no difference between a
signal reported as a usual event - the syscall returns and we can check if
something was changed (a signal was delivered or even fired) - and the case
when the syscall returns and we check what event it reports first - a ready
signal or some other event.

> >Do not mix warm and soft - waiting for some period is not equal to a
> >syscall timeout. Waiting is possible with the timer kevent user (although
> >only with a relative timeout; it can be changed to support both, not a
> >big problem).
>
> That's what I'm saying all the time. Of course it can be supported.
> But for this the timeout parameter must be a timespec pointer. Whatever
> you could possibly mean by "do not mix warm and soft" I cannot possibly
> imagine. Fact is that both relative and absolute timeouts are useful.
> And that for absolute timeouts the change of the clock has to be taken
> into account.

They are useful for special waiting, but not for waiting when a syscall
is called. The former is supported by timer notifications, the latter by a
syscall parameter. We can add support for absolute timer
notifications as an addon to relative ones. But using the timeval
structure there is not acceptable, since it has different sizes on different
arches, so there will be problems with 32/64 arches like x86_64.
Instead it is possible to use a u32/u32 structure for sec/nsec, like what
is used for relative timeouts.
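
For illustration, a minimal sketch of such a fixed-size structure (the name
is hypothetical, not from the patchset):

    /* Both fields are 32-bit on every arch, so 32-bit userspace and a
     * 64-bit kernel (e.g. x86_64 compat mode) see the same layout,
     * unlike struct timeval/timespec whose members are longs. */
    struct kevent_timeout {
        __u32 sec;    /* seconds */
        __u32 nsec;   /* nanoseconds, 0..999999999 */
    };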

> >I'm quite sure that absolute timeouts are very useful, but not in
> >the case of waiting for syscall completion. In any case, kevent can be
> >extended to support absolute timeouts in its timer notifications.
>
> That's not the same. If you argue that then the syscall should have no
> timeout parameter at all. Fact is that setting up a timer is not for
> free. Since the timeout is used all the time having a timeout parameter
> is the right answer. And if you do this then do it right just like
> every other syscall other than poll: use a timespec object. This gives
> flexibility without measurable cost.

It does not introduce any flexibility, since the syscall does not have a
parameter to specify whether an absolute or relative timeout has been
provided. That's one.
I do argue that the syscall must have a timeout parameter, since it is
related to syscall behaviour but not to the events the syscall is working
with - which are completely different things: the syscall must be
interrupted after some time to allow the operation to fail or other tasks
to be performed, but a timer event can fire at any time in the future; the
syscall should not care about the underlying events. That's two.
You say "every other syscall other than poll" - but even aio_suspend()
and friends use relative timeouts (although glibc converts them into
absolute ones to be used with pthread_cond_timedwait), so why do you propose
to use a variable-sized structure (even if it is transferred almost for
free in a syscall) instead of a usual timeout specified in
seconds/nanoseconds/anything? That's three.

> --
> ➧ Ulrich Drepper ➧ Red Hat, Inc. ➧ 444 Castro St ➧ Mountain View,
> CA ❖

--
Evgeniy Polyakov

2006-10-16 11:24:09

by Evgeniy Polyakov

[permalink] [raw]
Subject: Re: [take19 1/4] kevent: Core files.

On Mon, Oct 16, 2006 at 03:16:15AM -0700, Ulrich Drepper ([email protected]) wrote:
> Evgeniy Polyakov wrote:
> >The whole idea of the mmap buffer seems to be broken, since those who
> >asked for its creation do not like the existing design and do not show
> >theirs...
>
> What kind of argumentation is that?
>
> "Because my attempt to implement it doesn't work and nobody right
> away has a better suggestion this means the idea is broken."
>
> Nonsense.

Ok, let's reformulate:
My attempt works, but nobody around likes it, so I removed it and will wait
until someone else implements it.

> It just means that time should be spent thinking about this. You cut
> all this short by rushing out your attempt without any discussions.
> Unfortunately nobody else really looked at the approach so it lingered
> around for some weeks. Well, now it is clear that it is not the right
> approach and we can start thinking about it again.

I talked about it in the last 13 releases of kevent, and _no one_
offered any comments. And now I get 'it is broken, it does not
work, there are problems, we do not want it' and the like. I tried
hard to show that it does work and that the problems shown can not happen,
but still no one hears me. Since I think it is not an interface which is
100% required for correct functionality, I removed it. When there are
better suggestions and an implementation we can return to them, of course.

> >You seem to not have checked the code - each event can be marked as ready
> >only once, which means only one copy and so on.
> >It was done _specially_. And it is not a limitation, but a "new" approach.
>
> I know that it is done deliberately and I tell you that this is wrong
> and unacceptable. Realtime signals are one event type which needs to have
> more than one event queued. This is no description of what you have
> implemented, it's a description of the reality of realtime signals.
>
> RT signals are queued. They carry a data value (the sigval_t object)
> which can be unique for each signal delivery. Coalescing the signal
> events therefore leads to information loss.
>
> Therefore, at the very least for signals we need to have the ability to
> queue more than one event for each event source. Not having this
> functionality means that signals and likely other types of events cannot
> be implemented using kevent queues.

Well, my point about rt-signals is that they do not deserve to be
resurrected, but it is only my point :)
In case they are still used, each signal setup should create an event - many
signals mean many events; each signal can be sent with different
parameters, and each event should correspond to one unique case.

> >A queue of the same signals or any other events has a fundamental flaw
> >(as does any other ring buffer implementation which has a queue size) -
> >the size of the queue and the extremely bad case of overflow.
>
> Of course there are additional problems. Overflows need to be handled.
> But this is nothing which is unsolvable.

I strongly disagree that having a design which allows overflows is
acceptable - do we really want rt-signal queue overflow problems in a new
place? Instead some complex allocation scheme can be created.

> >So the same event may not be ready several times. Any design which
> >allows creating an infinite number of events generated for the same case
> >is broken, since the consumer can end up in a situation where it can not
> >handle that flow.
>
> That's complete nonsense. Again, for RT signals it is very reasonable
> and not "broken" to have multiple outstanding signals.

The same signal with a different payload is acceptable, but when the number
of them exceeds the ulimit and they start to be forgotten - that's what
I call broken design.

> >That is why poll() returns only POLLIN when data is ready in
> >the network stack, and does not try to generate some kind of signal for
> >each byte/packet/MTU/MSS received.
>
> It makes no sense to drag poll() into this discussion. poll() is a very
> limited interface. The new event handling is supposed to be the
> opposite, namely usable for all kinds of events. Arguing that it must be
> this way because poll() does it like this just means you don't see what a
> big step is needed to get to the goal of unified event handling. The
> shackles of poll() must be left behind.

Kevent is that subsystem, and for now it works quite well.

> >RT signals have design problems, and I will not repeat the same error
> >with similar limits in kevent.
>
> I don't know what to say. You claim to be the source of all wisdom in
> OS design. Maybe you should design your own OS, from the ground up. I
> wonder how many people would like that, since all your arguments are
> squarely geared towards optimizing the implementation. But: the
> implementation is irrelevant without users. The functionality users (=
> programmers) want and need is what must drive the implementation. And
> RT signals are definitely heavily used and liked by programmers. You
> have to accept that you are trying to modify an OS which has that
> functionality, regardless of how much you hate it and want to fight it.

No problem, but I hope you agree that they have a major problem related to
queue length? And I want to design an interface which will not have that
problem, so I do not introduce a situation which allows creating an infinite
number of events when the receiving side can not handle them.

> >The mmap implementation can be added separately, since it does not affect
> >the kevent core.
>
> That I doubt very much, and it is why I would not want the kevent stuff
> to go into any released kernel until that "detail" is resolved.

I see you point :)

But talk is cheap, and no code has been released by the people who argue
against kevent, only the existing ring buffer implementation.
I have only two arms and one brain, which unfortunately is not capable
of remotely reading mental waves about a possible design of the ring buffer,
so I'm waiting.

I expect no one will release new code (soon), so it is possible that
kevent will wait forever...
If you do argue for that, I can only say that we are on different
sides - one on the ship, and the other on the coast.

> --
> ➧ Ulrich Drepper ➧ Red Hat, Inc. ➧ 444 Castro St ➧ Mountain View,
> CA ❖

--
Evgeniy Polyakov

2006-10-17 05:06:38

by Johann Borck

[permalink] [raw]
Subject: Re: [take19 1/4] kevent: Core files.

Ulrich Drepper wrote:
> Evgeniy Polyakov wrote:
>> Existing design does not allow overflow.
>
> And I've pointed out a number of times that this is not practical at
> best. There are event sources which can create events which cannot be
> coalesced into one single event as it would be required with your design.
>
> Signals are one example, specifically realtime signals. If we do not
> want the design to be limited from the start this approach has to be
> thought over.
>
>
>>> So zap mmap() support completely, since it is not usable at all. We
>>> won't discuss it.
>>
>> Initial implementation did not have it.
>> But I was requested to do it, and it is ready now.
>> No one likes it, but no one provides an alternative implementation.
>> We are stuck.
>
> We need the mapped ring buffer. The current design (before it was
> removed) was broken but this does not mean it shouldn't be
> implemented. We just need more time to figure out how to implement it
> correctly.
>
Considering the if-at-all and if-then-how of the ring buffer implementation,
I'd like to throw in some ideas I had when reading the discussion and the
respective code. If I understood Ulrich Drepper right, his notion of a
generic event handling interface is that it has to be flexible enough
to transport additional info from the origin to userspace, and to support
queuing of events from the same origin, so that additional
per-event-occurrence data doesn't get lost, which would happen when
coalescing multiple events into one until delivery. From what I read he
says the ring buffer is broken because of insufficient space for additional
data (mukevent) and the limited number of events that can be put into the
ring buffer. Another argument is the missing notification of userspace
about dropped events in case the ring buffer limit is reached. (Is that
right?)
I see no reason why kevent couldn't be modified to fit (all) these
needs. While modifying the server example and writing a client using
kevent I came across the coalescing problem: there were more incoming
connections than accept events, and I had to work around that. In this
case the pure number of coalesced events would suffice, while it
wouldn't for the example of RT-signals that Ulrich Drepper gave. So
whether coalescing can be done at all, or whether it is impossible,
depends on the type of event. The same goes for additional data delivered
with the events. There might be no panacea for all possible scenarios with
one fixed design. Either performance suffers for 'lightweight' events
which don't need additional data and/or coalescing is not problematic
and/or a ring buffer, or kevent is not usable for other types of events.
Why not treat different things differently, and let the (kernel-)user
decide?
I don't know if I got all this right, but if so, then a ring buffer is
needed especially for cases where coalescing is not possible and
additional data has to be delivered for each triggered notification (so
the pure number of events is not enough; other reasons? performance?). To
me it doesn't make sense to have kevent fill memory and use processor
time if the buffer is not used at all, which is the case when using
kevent_getevents.
So here are my ideas:
Make usage of the ring buffer optional; if not required for a specific
event-type it might be chosen by userspace code.
Make the limit of events in the ring buffer optional and controllable from
userspace.
Regarding mukevent I'm thinking of an event-type-specific struct, which is
filled by the originating code and placed into a per-event-type ring
buffer (which requires modification of kevent_wait). To my limited
understanding it seems that alternative or modified versions of
kevent_storage_ready, (__)kevent_requeue and kevent_user_ring_add_event
could return a void pointer to the position in the buffer, and all kevent
has to know about is the size of the struct.
If coalescing doesn't hurt for a specific event-type it might just be
modified to notify userspace about the number of coalesced events. Make
it depend on the type of event.

I know this doesn't address all the objections that have been made, and
Evgeniy, big sorry for this being just talk again, and maybe not even
applicable for some reasons I do not see, but maybe it's worth
consideration. I'll gladly try to put that into code, and see where it
leads. I think kevent is great, and if things can be done to increase
its genericity without sacrificing performance, why not?
Sorry for the length of the post and the repetitions,

Johann

2006-10-17 06:00:24

by Chase Venters

[permalink] [raw]
Subject: Re: [take19 1/4] kevent: Core files.

On Tuesday 17 October 2006 00:09, Johann Borck wrote:
> Regarding mukevent I'm thinking of an event-type-specific struct, which is
> filled by the originating code and placed into a per-event-type ring
> buffer (which requires modification of kevent_wait).

I'd personally worry about an implementation that used a per-event-type ring
buffer, because you're still left having to hack around starvation issues in
user-space. It is of course possible under the current model for anyone who
wants per-event-type ring buffers to have them - just make separate kevent
sets.

I haven't thought this through all the way yet, but why not have
variable-length event structures and have the kernel fill in a "next"
pointer in each one? This could even be used to keep backwards binary
compatibility while adding additional fields to the structures over time,
though no space would be wasted on modern programs. You still end up with
the question of what to do in case of overflow, but I'm thinking the thing
to do in that case might be to start pushing overflow events onto a linked
list which can be written back into the ring buffer when space becomes
available. The appropriate behavior would be to throw new events on the
linked list if the linked list had any events, so that things are delivered
in order, but write to the mapped buffer directly otherwise.
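
For illustration, a minimal sketch of this scheme (all names are
hypothetical, not from the kevent patches):

    /* Variable-length event record living in the mapped ring. */
    struct vukevent {
        __u32 next;       /* ring offset of the next record, 0 = none */
        __u32 type;       /* event type; determines the payload layout */
        __u8  data[];     /* type-specific payload, variable length */
    };

    /* Overflow entry parked on a kernel-side list until ring space
     * frees up; draining this list before writing new events directly
     * into the ring is what keeps delivery in order. */
    struct vukevent_overflow {
        struct list_head entry;
        __u32 len;              /* total record length */
        struct vukevent ev;     /* record copied here temporarily */
    };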

Deciding when to do that is tricky, and I haven't thought through the
implications fully when I say this, but what about activating a bottom half
when more space becomes available, and let that drain overflowed events back
into the mapped buffer? Or perhaps the time to do it would be in the next
blocking wait, when the queue emptied?

I think it is very important to avoid any limits that can not be adjusted on
the fly at run-time by CAP_SYS_ADMIN or what have you. Doing it this way may
have other problems I've ignored but at least the big one - compile-time
capacity limits in the year 2006 - would be largely avoided :P

Nothing real solid yet, just some electrical storms in the grey matter...

Thanks,
Chase

2006-10-17 10:40:24

by Evgeniy Polyakov

[permalink] [raw]
Subject: Re: [take19 1/4] kevent: Core files.

On Tue, Oct 17, 2006 at 07:10:14AM +0200, Johann Borck ([email protected]) wrote:
> Ulrich Drepper wrote:
> > Evgeniy Polyakov wrote:
> >> Existing design does not allow overflow.
> >
> > And I've pointed out a number of times that this is not practical at
> > best. There are event sources which can create events which cannot be
> > coalesced into one single event as it would be required with your design.
> >
> > Signals are one example, specifically realtime signals. If we do not
> > want the design to be limited from the start this approach has to be
> > thought over.
> >
> >
> >>> So zap mmap() support completely, since it is not usable at all. We
> >>> won't discuss it.
> >>
> >> Initial implementation did not have it.
> >> But I was requested to do it, and it is ready now.
> >> No one likes it, but no one provides an alternative implementation.
> >> We are stuck.
> >
> > We need the mapped ring buffer. The current design (before it was
> > removed) was broken but this does not mean it shouldn't be
> > implemented. We just need more time to figure out how to implement it
> > correctly.
> >
> Considering the if-at-all and if-then-how of the ring buffer
> implementation, I'd like to throw in some ideas I had when reading the
> discussion and the respective code. If I understood Ulrich Drepper right,
> his notion of a generic event handling interface is that it has to be
> flexible enough to transport additional info from the origin to userspace,
> and to support queuing of events from the same origin, so that additional
> per-event-occurrence data doesn't get lost, which would happen when
> coalescing multiple events into one until delivery. From what I read he
> says the ring buffer is broken because of insufficient space for additional
> data (mukevent) and the limited number of events that can be put into the
> ring buffer. Another argument is the missing notification of userspace
> about dropped events in case the ring buffer limit is reached. (Is that
> right?)

I can add such a notification, but its existence _is_ the broken design.
After such a condition happens, all new events will disappear from the
mapped buffer (although they are still accessible through the usual queue).

While writing this I have come to an idea of how to improve the case of
the size of the mapped buffer - we can make it limited in size, and when
it is full, a bit will be set in the shared area and obviously no new
events can be added there; but when the user commits some events from that
buffer (i.e. tells the kernel that the appropriate kevents can be freed or
requeued according to their flags), new ready events from the ready queue
can be copied into the mapped buffer.
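
For illustration, a minimal sketch of this commit-and-refill scheme (names,
layout and the helpers are hypothetical, not from the patchset):

    struct mring_hdr {
        __u32 full;     /* set by kernel when no free slot is left */
        __u32 kidx;     /* next slot the kernel will fill */
    };

    /* Kernel side, on commit of 'num' events by userspace: free or
     * requeue the committed kevents, then move entries from the ready
     * queue into the freed slots and clear the full bit if any slot
     * remains. */
    void mring_commit(struct mring_hdr *hdr, unsigned int start, unsigned int num)
    {
        release_kevents(start, num);      /* hypothetical helper */
        refill_from_ready_queue(hdr);     /* hypothetical helper */
        if (slots_available(hdr))         /* hypothetical helper */
            hdr->full = 0;
    }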

It still does not solve (and I do insist that it is broken behaviour)
the case when the kernel is going to generate an infinite number of events
for one requested by userspace (as in the case of generating a new
'data_has_arrived' event when each new byte has been received).

Userspace events are only marked as ready, they are not generated - it
is a high-performance _feature_ of the new design, not some kind of a bug.

> I see no reason why kevent couldn't be modified to fit (all) these
> needs. While modifying the server example and writing a client using
> kevent I came across the coalescing problem: there were more incoming
> connections than accept events, and I had to work around that. In this

Btw, the accept() issue is exactly the same as with usual poll() - repeated
insertion of the same kevent will fire immediately, which requires the event
to be one-shot. One of the initial implementations contained the number of
ready-for-accept sockets as one of the returned parameters though.

> case the pure number of coalesced events would suffice, while it
> wouldn't for the example of RT-signals that Ulrich Drepper gave. So if
> coalescing can be done at all or if it is impossible depends on the type
> of event. The same goes for additional data delivered with the events.
> There might be no panacea for all possible scenarios with one fixed
> design. Either performance suffers for 'lightweight' events which don't
> need additional data and/or coalescing is not problematic and/or ring
> buffer, or kevent is not usable for other types of events. Why not treat
> different things differently, and let the (kernel-)user decide.
> I don't know if I got all this right, but if, then ring buffer is needed
> especially for cases where coalescing is not possible and additional
> data has to be delivered for each triggered notification (so the pure
> number of events is not enough; other reasons? performance? ). To me it
> doesn't make sense to have kevent fill memory and use processor-time if
> buffer is not used at all, which is the case when using kevent_getevents.
> So here are my Ideas:
> Make usage of ring buffer optional, if not required for specific
> event-type it might be chosen by userspace-code.
> Make limit of events in ring buffer optional and controllable from
> userspace.

It is of course possible; the main problem is that the existing design of
the mapped buffer is not sufficient, and there are no other propositions
except 'it sucks'.

> Regarding mukevent I'm thinking of a event-type specific struct, that is
> filled by the originating code, and placed into a per-event-type ring
> buffer (which requires modification of kevent_wait). To my limited
> understanding it seems that alternative or modified versions of
> kevent_storage_ready, (__)kevent_requeue and kevent_user_ring_add_event
> could return a void pointer to the position in buffer, and all kevent
> has to know about is the size of the struct.
> If coalescing doesn't hurt for a specific event-type it might just be
> modified to notify userspace about the number of coalesced events. Make
> it depend on type of event.

It is perfectly OK to add the number of times the same event fired while
it sat in the ready queue; it was even done for accept notifications some
time ago. It depends on the in-kernel kevent user - each one can add
anything it wants into the appropriate returned data, that is why it was
added.

> I know this doesn't address all objections that have been made, and
> Evgeniy, big sorry for this being just talk again, and maybe not even
> applicable for some reasons I do not overlook, but maybe it's worth
> consideration. I'll gladly try to put that into code, and see where it
> leads. I think kevent is great, and if things can be done to increase
> its genericity without sacrificing performance, why not.
> Sorry for the length of post and repetitions,

I greatly appreciate your work with kevents, Johann - thank you.

> Johann

--
Evgeniy Polyakov

2006-10-17 10:43:30

by Evgeniy Polyakov

[permalink] [raw]
Subject: Re: [take19 1/4] kevent: Core files.

On Tue, Oct 17, 2006 at 12:59:47AM -0500, Chase Venters ([email protected]) wrote:
> On Tuesday 17 October 2006 00:09, Johann Borck wrote:
> > Regarding mukevent I'm thinking of a event-type specific struct, that is
> > filled by the originating code, and placed into a per-event-type ring
> > buffer (which requires modification of kevent_wait).
>
> I'd personally worry about an implementation that used a per-event-type ring
> buffer, because you're still left having to hack around starvation issues in
> user-space. It is of course possible under the current model for anyone who
> wants per-event-type ring buffers to have them - just make separate kevent
> sets.
>
> I haven't thought this through all the way yet, but why not have variable
> length event structures and have the kernel fill in a "next" pointer in each
> one? This could even be used to keep backwards binary compatibility while

Why do we want variable size structures in mmap ring buffer?

> adding additional fields to the structures over time, though no space would
> be wasted on modern programs. You still end up with a question of what to do
> in case of overflow, but I'm thinking the thing to do in that case might be
> to start pushing overflow events onto a linked list which can be written back
> into the ring buffer when space becomes available. The appropriate behavior
> would be to throw new events on the linked list if the linked list had any
> events, so that things are delivered in order, but write to the mapped buffer
> directly otherwise.

I think in a similar way.
Kevent actually does not require such a list, since it already has a queue
of ready events.

--
Evgeniy Polyakov

2006-10-17 13:12:42

by Chase Venters

[permalink] [raw]
Subject: Re: [take19 1/4] kevent: Core files.

On Tuesday 17 October 2006 05:42, Evgeniy Polyakov wrote:
> On Tue, Oct 17, 2006 at 12:59:47AM -0500, Chase Venters ([email protected]) wrote:
> > On Tuesday 17 October 2006 00:09, Johann Borck wrote:
> > > Regarding mukevent I'm thinking of a event-type specific struct, that
> > > is filled by the originating code, and placed into a per-event-type
> > > ring buffer (which requires modification of kevent_wait).
> >
> > I'd personally worry about an implementation that used a per-event-type
> > ring buffer, because you're still left having to hack around starvation
> > issues in user-space. It is of course possible under the current model
> > for anyone who wants per-event-type ring buffers to have them - just make
> > separate kevent sets.
> >
> > I haven't thought this through all the way yet, but why not have variable
> > length event structures and have the kernel fill in a "next" pointer in
> > each one? This could even be used to keep backwards binary compatibility
> > while
>
> Why do we want variable size structures in mmap ring buffer?

Flexibility primarily. So when we all decide to add a new event type six
months from now, or add more information to an existing one, we don't run the
risk that the existing mukevent isn't big enough.

> > adding additional fields to the structures over time, though no space
> > would be wasted on modern programs. You still end up with a question of
> > what to do in case of overflow, but I'm thinking the thing to do in that
> > case might be to start pushing overflow events onto a linked list which
> > can be written back into the ring buffer when space becomes available.
> > The appropriate behavior would be to throw new events on the linked list
> > if the linked list had any events, so that things are delivered in order,
> > but write to the mapped buffer directly otherwise.
>
> I think in a similar way.
> Kevent actually do not require such list, since it has already queue of
> the ready events.

The current event types coalesce if there are multiple events, correct? It
sounds like there may be other event types where coalescing multiple events
is not the correct approach.

Thanks,
Chase

2006-10-17 13:19:38

by Eric Dumazet

[permalink] [raw]
Subject: Re: [take19 1/4] kevent: Core files.

On Tuesday 17 October 2006 12:39, Evgeniy Polyakov wrote:

> I can add such a notification, but its existence _is_ the broken design.
> After such a condition happens, all new events disappear from the mapped
> buffer (although they are still accessible through the usual queue).
>
> While writing this I have come to an idea on how to improve the case of
> the size of the mapped buffer - we can make it of limited size, and when
> it is full, a bit will be set in the shared area and obviously no new
> events can be added there; but when the user commits some events from that
> buffer (i.e. tells the kernel that the appropriate kevents can be freed or
> requeued according to their flags), new ready events from the ready queue
> can be copied into the mapped buffer.
>
> It still does not solve (and I do insist that it is broken behaviour)
> the case when the kernel is going to generate an infinite number of events
> for one requested by userspace (as in the case of generating a new
> 'data_has_arrived' event each time a new byte has been received).

Behavior is not broken. It's quite useful and works 99.9999% of the time.

I was trying to suggest this to you, but you missed my point.

You don't want to use a bit, but a full 32-bit sequence counter.

A program may handle XXX.XXX handles, but use 'only' a 4096-entry ring
buffer.

The user program keeps a local copy of a special word named
'ring_buffer_full_counter'.

Each time the kernel cannot queue an event in the ring buffer, it
increases the 'ring_buffer_full_counter' (exported to the user app in the
mmap view).

When the user application notices that the kernel changed
'ring_buffer_full_counter', it does a full scan of all file handles
(preferably using poll() to get all relevant info in one syscall):

do {
        if (read_event_from_mmap()) { handle_event(fd); continue; }
        /* ring buffer is empty, check if we missed some events */
        if (unlikely(mmap->ring_buffer_full_counter !=
                     my_ring_buffer_full_counter)) {
                my_ring_buffer_full_counter = mmap->ring_buffer_full_counter;
                /* slow path */
                /* can use a big poll() for example, or just a loop without poll() */
                for_all_file_desc_do() {
                        /* check if some event/data is waiting on THIS fd */
                }
        } else {
                syscall_wait_for_one_available_kevent(queue);
        }
} while (1);

This is how a program can recover. If the ring buffer has a reasonable size,
this kind of event should not happen very frequently. If it does (because
events continue to fill the ring_buffer during recovery and might hit FULL
again), maybe a smart program is able to resize the ring_buffer, and start
using it after yet another recovery pass.
If not, we don't care, because a big poll() gives us many ready
file-descriptors in one syscall, and maybe this is much better than
kevent/epoll when XX.XXX events are ready.

Eric

2006-10-17 13:36:25

by Evgeniy Polyakov

[permalink] [raw]
Subject: Re: [take19 1/4] kevent: Core files.

On Tue, Oct 17, 2006 at 08:12:04AM -0500, Chase Venters ([email protected]) wrote:
> > > > Regarding mukevent I'm thinking of a event-type specific struct, that
> > > > is filled by the originating code, and placed into a per-event-type
> > > > ring buffer (which requires modification of kevent_wait).
> > >
> > > I'd personally worry about an implementation that used a per-event-type
> > > ring buffer, because you're still left having to hack around starvation
> > > issues in user-space. It is of course possible under the current model
> > > for anyone who wants per-event-type ring buffers to have them - just make
> > > separate kevent sets.
> > >
> > > I haven't thought this through all the way yet, but why not have variable
> > > length event structures and have the kernel fill in a "next" pointer in
> > > each one? This could even be used to keep backwards binary compatibility
> > > while
> >
> > Why do we want variable size structures in mmap ring buffer?
>
> Flexibility primarily. So when we all decide to add a new event type six
> months from now, or add more information to an existing one, we don't run the
> risk that the existing mukevent isn't big enough.

Do we need such flexibility, when we have a unique id attached to each
event? The user can store any information in his own buffers, which are
indexed by that id.
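
In other words, the usual pattern would be to index per-request state by
that id, roughly like this (sketch; event_id() is a hypothetical accessor
for the unique id carried in the returned event):

        struct conn_state *state[MAX_EVENTS];   /* user's own per-event data */

        /* on readiness, everything needed is found via the unique id */
        struct conn_state *c = state[event_id(ev)];
        process(c);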

> > > adding additional fields to the structures over time, though no space
> > > would be wasted on modern programs. You still end up with a question of
> > > what to do in case of overflow, but I'm thinking the thing to do in that
> > > case might be to start pushing overflow events onto a linked list which
> > > can be written back into the ring buffer when space becomes available.
> > > The appropriate behavior would be to throw new events on the linked list
> > > if the linked list had any events, so that things are delivered in order,
> > > but write to the mapped buffer directly otherwise.
> >
> > I think in a similar way.
> > Kevent actually do not require such list, since it has already queue of
> > the ready events.
>
> The current event types coalesce if there are multiple events, correct? It
> sounds like there may be other event types where coalescing multiple events
> is not the correct approach.

There is no event coalescing; I think it is even incorrect to say that
something is being coalesced in kevents.

There is a 'new' (which is a well-forgotten old) approach - the user _asks_
the kernel about some information, and the kernel says when it is ready.
The kernel does not say: part of the info is ready, part of the info is
ready and so on; it just marks the user's request as ready - that means it
is possible that there were zillions of events, each of which could mark
the _same_ userspace request as ready, and exactly what the user requested
is transferred back. Thus it is very fast and is the correct way to deal
with the problem of pipes of different diameters.

The kernel does not generate events - only the user creates requests,
which are then marked as ready.

I made that decision to remove _any_ kind of possible overflow from the
kernel side - whether the user was scheduled away, has insufficient space
or a bad mood - and to not introduce any kind of ugly priorities (a higher
one could fill the whole pipe while a lower one could not even send a
single event). Instead the kernel does just what it was requested to do,
and it can provide some hints on how that happened (for example how many
sockets are ready for accept(), or how many bytes are in the receive queue).
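
For example, with a one-shot accept notification the consumer simply
drains the backlog itself, so one readiness mark can stand for any number
of queued connections (plain-socket sketch, nothing kevent-specific):

        /* listen_fd is non-blocking; one 'ready' mark, many connections */
        for (;;) {
                int client = accept(listen_fd, NULL, NULL);
                if (client < 0)
                        break;          /* EAGAIN: backlog drained */
                handle_client(client);
        }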

And that approach does solve the problem in cases where it looks logical
to _generate_ an event - for example the inotify case, where a new event
is _generated_ each time the requested condition happens. Take the case
when new files are created in a directory: if many files were created, a
queue overflow is possible (btw, a watch for each file in the kernel source
tree takes about 2gb of kernel memory), so userspace must rescan the whole
directory to check for missed files anyway. Why then generate info about
the first two or ten files at all? Instead userspace asks the kernel to
notify it when the directory has changed or new files were created, and
the kernel answers when the directory has been changed or new files were
created (with a hint about the number of them).

Most likely, requesting generation of events in the kernel is a workaround
for some other problem, which in the long term will hit us with new
troubles - queue length and overflows.

> Thanks,
> Chase

--
Evgeniy Polyakov

2006-10-17 13:42:44

by Evgeniy Polyakov

[permalink] [raw]
Subject: Re: [take19 1/4] kevent: Core files.

On Tue, Oct 17, 2006 at 03:19:36PM +0200, Eric Dumazet ([email protected]) wrote:
> On Tuesday 17 October 2006 12:39, Evgeniy Polyakov wrote:
>
> > I can add such a notification, but its existence _is_ the broken design.
> > After such a condition happens, all new events disappear from the mapped
> > buffer (although they are still accessible through the usual queue).
> >
> > While writing this I have come to an idea on how to improve the case of
> > the size of the mapped buffer - we can make it of limited size, and when
> > it is full, a bit will be set in the shared area and obviously no new
> > events can be added there; but when the user commits some events from
> > that buffer (i.e. tells the kernel that the appropriate kevents can be
> > freed or requeued according to their flags), new ready events from the
> > ready queue can be copied into the mapped buffer.
> >
> > It still does not solve (and I do insist that it is broken behaviour)
> > the case when the kernel is going to generate an infinite number of
> > events for one requested by userspace (as in the case of generating a
> > new 'data_has_arrived' event each time a new byte has been received).
>
> Behavior is not broken. It's quite usefull and works 99.9999% of time.
>
> I was trying to suggest you but you missed my point.
>
> You dont want to use a bit, but a full sequence counter, 32bits.
>
> A program may handle XXX.XXX handles, but use a 4096 entries ring
> buffer 'only'.
>
> The user program keeps a local copy of a special word
> named 'ring_buffer_full_counter'
>
> Each time the kernel cannot queue an event in the ring buffer, it increase
> the "ring_buffer_was_full_counter" (exported to user app in the mmap view)
>
> When the user application notice the kernel
> changed "ring_buffer_was_full_counter" it does a full scan of all file
> handles (preferably using poll() to get all relevant info in one syscall) :

I.e. to scan the rest of the xxx.xxx events?

> do {
>         if (read_event_from_mmap()) { handle_event(fd); continue; }
>         /* ring buffer is empty, check if we missed some events */
>         if (unlikely(mmap->ring_buffer_full_counter !=
>                      my_ring_buffer_full_counter)) {
>                 my_ring_buffer_full_counter = mmap->ring_buffer_full_counter;
>                 /* slow path */
>                 /* can use a big poll() for example, or just a loop without poll() */
>                 for_all_file_desc_do() {
>                         /* check if some event/data is waiting on THIS fd */
>                 }
>         } else {
>                 syscall_wait_for_one_available_kevent(queue);
>         }
> } while (1);
>
> This is how a program can recover. If ring buffer has a reasonable size, this
> kind of event should not happen very frequently. If it does (because events
> continue to fill ring_buffer during recovery and might hit FULL again), maybe
> a smart program is able to resize the ring_buffer, and start using it after
> yet another recovery pass.
> If not, we dont care, because a big poll() give us many ready file-descriptors
> in one syscall, and maybe this is much better than kevent/epoll when XX.XXX
> events are ready.

What about the case, which I described in the other e-mail, where with a
full ring buffer no new events are written there, and when userspace
commits (i.e. marks as ready to be freed or requeued by the kernel) some
events, new ones are copied from the ready queue into the buffer?

> Eric

--
Evgeniy Polyakov

2006-10-17 13:52:37

by Eric Dumazet

[permalink] [raw]
Subject: Re: [take19 1/4] kevent: Core files.

On Tuesday 17 October 2006 15:42, Evgeniy Polyakov wrote:
> On Tue, Oct 17, 2006 at 03:19:36PM +0200, Eric Dumazet ([email protected]) wrote:
> > On Tuesday 17 October 2006 12:39, Evgeniy Polyakov wrote:
> > > I can add such a notification, but its existence _is_ the broken design.
> > > After such a condition happens, all new events disappear from the
> > > mapped buffer (although they are still accessible through the usual
> > > queue).
> > >
> > > While writing this I have come to an idea on how to improve the case of
> > > the size of the mapped buffer - we can make it of limited size, and when
> > > it is full, a bit will be set in the shared area and obviously no
> > > new events can be added there; but when the user commits some events from
> > > that buffer (i.e. tells the kernel that the appropriate kevents can be freed
> > > or requeued according to their flags), new ready events from the ready
> > > queue can be copied into the mapped buffer.
> > >
> > > It still does not solve (and I do insist that it is broken behaviour)
> > > the case when the kernel is going to generate an infinite number of events
> > > for one requested by userspace (as in the case of generating a new
> > > 'data_has_arrived' event each time a new byte has been received).
> >
> > Behavior is not broken. It's quite usefull and works 99.9999% of time.
> >
> > I was trying to suggest you but you missed my point.
> >
> > You dont want to use a bit, but a full sequence counter, 32bits.
> >
> > A program may handle XXX.XXX handles, but use a 4096 entries ring
> > buffer 'only'.
> >
> > The user program keeps a local copy of a special word
> > named 'ring_buffer_full_counter'
> >
> > Each time the kernel cannot queue an event in the ring buffer, it
> > increase the "ring_buffer_was_full_counter" (exported to user app in the
> > mmap view)
> >
> > When the user application notice the kernel
> > changed "ring_buffer_was_full_counter" it does a full scan of all file
> > handles (preferably using poll() to get all relevant info in one syscall)
> > :
>
> I.e. to scan the rest of the xxx.xxx events?
>
> > do {
> >         if (read_event_from_mmap()) { handle_event(fd); continue; }
> >         /* ring buffer is empty, check if we missed some events */
> >         if (unlikely(mmap->ring_buffer_full_counter !=
> >                      my_ring_buffer_full_counter)) {
> >                 my_ring_buffer_full_counter = mmap->ring_buffer_full_counter;
> >                 /* slow path */
> >                 /* can use a big poll() for example, or just a loop without poll() */
> >                 for_all_file_desc_do() {
> >                         /* check if some event/data is waiting on THIS fd */
> >                 }
> >         } else {
> >                 syscall_wait_for_one_available_kevent(queue);
> >         }
> > } while (1);
> >
> > This is how a program can recover. If ring buffer has a reasonable size,
> > this kind of event should not happen very frequently. If it does (because
> > events continue to fill ring_buffer during recovery and might hit FULL
> > again), maybe a smart program is able to resize the ring_buffer, and
> > start using it after yet another recovery pass.
> > If not, we dont care, because a big poll() give us many ready
> > file-descriptors in one syscall, and maybe this is much better than
> > kevent/epoll when XX.XXX events are ready.
>
> What about the case, which I described in other e-mail, when in case of
> the full ring buffer, no new events are written there, and when
> userspace commits (i.e. marks as ready to be freed or requeued by kernel)
> some events, new ones will be copied from ready queue into the buffer?

Then, the user might receive 'false events', exactly like
poll()/select()/epoll() can do sometimes. I.e. a 'ready' indication while
there is no current event available on a particular fd / event_source.

This should be safe, since those programs already handle read() returning
-EAGAIN and other similar things.

A programmer prefers to receive two 'event available' indications rather
than ZERO (and be stuck for an infinite time). Of course, the hot path
(normal case) should return one 'event' only.

In other words, being ultra fast 99.99% of the time but able to block
forever once in a while is not an option.

Eric

2006-10-17 14:08:12

by Evgeniy Polyakov

[permalink] [raw]
Subject: Re: [take19 1/4] kevent: Core files.

On Tue, Oct 17, 2006 at 03:52:34PM +0200, Eric Dumazet ([email protected]) wrote:
> > What about the case, which I described in other e-mail, when in case of
> > the full ring buffer, no new events are written there, and when
> > userspace commits (i.e. marks as ready to be freed or requeued by kernel)
> > some events, new ones will be copied from ready queue into the buffer?
>
> Then, user might receive 'false events', exactly like poll()/select()/epoll()
> can do sometime. IE a 'ready' indication while there is no current event
> available on a particular fd / event_source.

Only if the user simultaneously uses both interfaces and removes an event
from the queue while its copy was in the mapped buffer, but in that case
it's the user's problem (and if we do want, we can store a pointer/index
of the ring buffer entry, so that when an event is removed from the ready
queue (using kevent_get_events()), the appropriate entry in the ring
buffer will be updated to show that it is no longer valid).

> This should be safe, since those programs already ignore read()
> returns -EAGAIN and other similar things.
>
> Programmer prefers to receive two 'event available' indications than ZERO (and
> be stuck for infinite time). Of course, hot path (normal cases) should return
> one 'event' only.
>
> In order words, being ultra fast 99.99 % of the time, but being able to block
> forever once in a while is not an option.

Have I missed something? It looks like the only problematic situation is
the one described above, when the user simultaneously uses both interfaces.

> Eric

--
Evgeniy Polyakov

2006-10-17 14:25:06

by Eric Dumazet

[permalink] [raw]
Subject: Re: [take19 1/4] kevent: Core files.

On Tuesday 17 October 2006 16:07, Evgeniy Polyakov wrote:
> On Tue, Oct 17, 2006 at 03:52:34PM +0200, Eric Dumazet ([email protected]) wrote:
> > > What about the case, which I described in other e-mail, when in case of
> > > the full ring buffer, no new events are written there, and when
> > > userspace commits (i.e. marks as ready to be freed or requeued by
> > > kernel) some events, new ones will be copied from ready queue into the
> > > buffer?
> >
> > Then, user might receive 'false events', exactly like
> > poll()/select()/epoll() can do sometime. IE a 'ready' indication while
> > there is no current event available on a particular fd / event_source.
>
> Only if user simultaneously uses oth interfaces and remove even from the
> queue when it's copy was in mapped buffer, but in that case it's user's
> problem (and if we do want, we can store pointer/index of the ring
> buffer entry, so when event is removed from the ready queue (using
> kevent_get_events()), appropriate entry in the ring buffer will be
> updated to show that it is no longer valid.
>
> > This should be safe, since those programs already ignore read()
> > returns -EAGAIN and other similar things.
> >
> > Programmer prefers to receive two 'event available' indications than ZERO
> > (and be stuck for infinite time). Of course, hot path (normal cases)
> > should return one 'event' only.
> >
> > In order words, being ultra fast 99.99 % of the time, but being able to
> > block forever once in a while is not an option.
>
> Have I missed something? It looks like the only problematic situation is
> described above when user simultaneously uses both interfaces.

From my point of view, a user of the 'mmaped ring buffer' should be
prepared to use both interfaces. Or else you are forced to presize the
ring buffer to insane limits.

That is:
- Most of the time, we expect to consume events via the mmaped ring buffer
and no syscalls.
- In case we notice a 'mmaped ring buffer overflow', syscalls to
get/consume events that could not be stored in the mmaped buffer (but were
queued by the kevent subsystem). If not stored by the kevent subsystem
(memory failure?), revert to poll() to fetch all 'missed fds' in one go.
Go back to normal mode.

- In case of an empty ring buffer (or no mmap support at all, because this
app doesn't expect a lot of events per time unit, or because kevent doesn't
have mmap support): be able to syscall and wait for an event.

Eric

2006-10-17 15:10:19

by Evgeniy Polyakov

[permalink] [raw]
Subject: Re: [take19 1/4] kevent: Core files.

On Tue, Oct 17, 2006 at 04:25:00PM +0200, Eric Dumazet ([email protected]) wrote:
> On Tuesday 17 October 2006 16:07, Evgeniy Polyakov wrote:
> > On Tue, Oct 17, 2006 at 03:52:34PM +0200, Eric Dumazet ([email protected]) wrote:
> > > > What about the case, which I described in other e-mail, when in case of
> > > > the full ring buffer, no new events are written there, and when
> > > > userspace commits (i.e. marks as ready to be freed or requeued by
> > > > kernel) some events, new ones will be copied from ready queue into the
> > > > buffer?
> > >
> > > Then, user might receive 'false events', exactly like
> > > poll()/select()/epoll() can do sometime. IE a 'ready' indication while
> > > there is no current event available on a particular fd / event_source.
> >
> > Only if user simultaneously uses oth interfaces and remove even from the
> > queue when it's copy was in mapped buffer, but in that case it's user's
> > problem (and if we do want, we can store pointer/index of the ring
> > buffer entry, so when event is removed from the ready queue (using
> > kevent_get_events()), appropriate entry in the ring buffer will be
> > updated to show that it is no longer valid.
> >
> > > This should be safe, since those programs already ignore read()
> > > returns -EAGAIN and other similar things.
> > >
> > > Programmer prefers to receive two 'event available' indications than ZERO
> > > (and be stuck for infinite time). Of course, hot path (normal cases)
> > > should return one 'event' only.
> > >
> > > In order words, being ultra fast 99.99 % of the time, but being able to
> > > block forever once in a while is not an option.
> >
> > Have I missed something? It looks like the only problematic situation is
> > described above when user simultaneously uses both interfaces.
>
> In my point of view, user of the 'mmaped ring buffer' should be prepared to
> use both interfaces. Or else you are forced to presize the ring buffer to
> insane limits.
>
> That is :
> - Most of the time, we expect consuming events via mmaped ring buffer and no
> syscalls.
> - In case we notice a 'mmaped ring buffer overflow', syscalls to get/consume
> events that could not be stored in mmaped buffer (but queued by kevent
> subsystem). If not stored by kevent subsystem (memory failure ?), revert to
> poll() to fetch all 'missed fds' in one row. Go back to normal mode.

kevent uses a smaller amount of memory per event than epoll(), so it is
very unlikely that it will be impossible to store a new event there while
epoll() would succeed. The same applies to poll(), which allocates the
whole table in the syscall.

> - In case of empty ring buffer (or no mmap support at all, because this app
> doesnt expect lot of events per time unit, or because kevent dont have mmap
> support) : Be able to syscall and wait for an event.

So the most complex case is when the user is going to use both interfaces,
and the steps taken when the mapped ring buffer has overflowed.
In that case the user can read and mark some events as ready in the ring
buffer (the latter is done through a special syscall), so the kevent core
will put new ready events there.
The user can also get events using the usual syscall; in that case events
in the ring buffer must be updated - and actually I implemented the mapped
buffer in a way which allows removing events from the queue - the queue is
a FIFO, and the first entry to be obtained through the syscall is _always_
the first entry in the ring buffer.

So when the user reads an event through the syscall (no matter whether we
are in the overflow case or not), the event being read is easily
accessible in the ring buffer.

So I propose the following design for the ring buffer (quite simple):
kernelspace maintains two indexes - to the first and the last events in
the ring buffer (and the maximum size of the buffer, of course).
When a new event is marked as ready, some info is copied into the ring
buffer and the index of the last entry is increased.
When an event is read through the syscall it is _guaranteed_ that this
event will be at the position pointed to by the index of the first
element; that index is then increased (thus opening a new slot in the
buffer).
If the index of the last entry reaches (with possible wrapping) the index
of the first entry, that means an overflow has happened. In this case no
new events can be copied into the ring buffer, so they are only placed
into the ready queue (accessible through the syscall kevent_get_events()).

When the user calls kevent_get_events() it will obtain the first element
(pointed to by the index of the first element in the ring buffer), and if
there is a ready event which is not placed into the ring buffer, it is
copied in (with appropriate update of the last index and of the overflow
condition).

When userspace calls kevent_wait(num), it means that userspace marks as
ready the first $num elements (from the index of the first element), which
can thus be removed (or requeued) and replaced by pending ready events.
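
In code, the bookkeeping described above would look roughly like this
(a sketch with illustrative names, using the classic one-slot-reserved
convention for the full test - not the proposed kernel structures
themselves):

        struct ring_index {
                unsigned int first;     /* oldest uncommitted ready event */
                unsigned int last;      /* next free slot */
                unsigned int size;      /* maximum number of entries */
        };

        /* kernel side: a new event became ready */
        int ring_add(struct ring_index *r)
        {
                if ((r->last + 1) % r->size == r->first)
                        return -1;      /* overflow: keep event in ready queue only */
                r->last = (r->last + 1) % r->size;
                return 0;
        }

        /* kevent_wait(num): userspace commits num consumed entries */
        void ring_commit(struct ring_index *r, unsigned int num)
        {
                r->first = (r->first + num) % r->size;
                /* freed slots can now be refilled from the ready queue */
        }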

Does it still sound like clawing over broken glass, or much better?

> Eric
> -
> To unsubscribe from this list: send the line "unsubscribe netdev" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html

--
Evgeniy Polyakov

2006-10-17 15:32:30

by Eric Dumazet

[permalink] [raw]
Subject: Re: [take19 1/4] kevent: Core files.

On Tuesday 17 October 2006 17:09, Evgeniy Polyakov wrote:
> On Tue, Oct 17, 2006 at 04:25:00PM +0200, Eric Dumazet ([email protected]) wrote:
> > On Tuesday 17 October 2006 16:07, Evgeniy Polyakov wrote:
> > > On Tue, Oct 17, 2006 at 03:52:34PM +0200, Eric Dumazet ([email protected]) wrote:
> > > > > What about the case, which I described in other e-mail, when in
> > > > > case of the full ring buffer, no new events are written there, and
> > > > > when userspace commits (i.e. marks as ready to be freed or requeued
> > > > > by kernel) some events, new ones will be copied from ready queue
> > > > > into the buffer?
> > > >
> > > > Then, user might receive 'false events', exactly like
> > > > poll()/select()/epoll() can do sometime. IE a 'ready' indication
> > > > while there is no current event available on a particular fd /
> > > > event_source.
> > >
> > > Only if user simultaneously uses oth interfaces and remove even from
> > > the queue when it's copy was in mapped buffer, but in that case it's
> > > user's problem (and if we do want, we can store pointer/index of the
> > > ring buffer entry, so when event is removed from the ready queue (using
> > > kevent_get_events()), appropriate entry in the ring buffer will be
> > > updated to show that it is no longer valid.
> > >
> > > > This should be safe, since those programs already ignore read()
> > > > returns -EAGAIN and other similar things.
> > > >
> > > > Programmer prefers to receive two 'event available' indications than
> > > > ZERO (and be stuck for infinite time). Of course, hot path (normal
> > > > cases) should return one 'event' only.
> > > >
> > > > In order words, being ultra fast 99.99 % of the time, but being able
> > > > to block forever once in a while is not an option.
> > >
> > > Have I missed something? It looks like the only problematic situation
> > > is described above when user simultaneously uses both interfaces.
> >
> > In my point of view, user of the 'mmaped ring buffer' should be prepared
> > to use both interfaces. Or else you are forced to presize the ring buffer
> > to insane limits.
> >
> > That is :
> > - Most of the time, we expect consuming events via mmaped ring buffer and
> > no syscalls.
> > - In case we notice a 'mmaped ring buffer overflow', syscalls to
> > get/consume events that could not be stored in mmaped buffer (but queued
> > by kevent subsystem). If not stored by kevent subsystem (memory failure
> > ?), revert to poll() to fetch all 'missed fds' in one row. Go back to
> > normal mode.
>
> kevent uses smaller amount of memory than epoll() per event, so it is very
> unlikely that it will be impossible to store new event there and epoll()
> will succeed. The same can be applied to poll(), which allocates the
> whole table in syscall.
>
> > - In case of empty ring buffer (or no mmap support at all, because this
> > app doesnt expect lot of events per time unit, or because kevent dont
> > have mmap support) : Be able to syscall and wait for an event.
>
> So the most complex case is when user is going to use both interfaces,
> and it's steps when mapped ring buffer has overflow.
> In that case user can either read and mark some events as ready in ring
> buffer (the latter is being done through special syscall), so kevent
> core will put there new ready events.
> User can also get events using usual syscall, in that case events in
> ring buffer must be updated - and actually I implemented mapped buffer
> in the way which allows to remove events from the queue - queue is a
> FIFO, and the first entry to be obtained through syscall is _always_ the
> first entry in the ring buffer.
>
> So when user reads event through syscall (no matter if we are in overflow
> case or not), even being read is easily accessible in the ring buffer.
>
> So I propose following design for ring buffer (quite simple):
> kernelspace maintains two indexes - to the first and the last events in
> the ring buffer (and maximum size of the buffer of course).
> When new event is marked as ready, some info is being copied into ring
> buffer and index of the last entry is increased.
> When event is being read through syscall it is _guaranteed_ that that
> event will be at the position pointed by the index of the first
> element, that index is then increased (thus opening new slot in the
> buffer).
> If index of the last entry reaches (with possible wrapping) index of the
> first entry, that means that overflow has happend. In this case no new
> events can be copied into ring buffer, so they are only placed into
> ready queue (accessible through syscall kevent_get_events()).
>
> When user calls kevent_get_events() it will obtain the first element
> (pointed by index of the first element in the ring buffer), and if there
> is ready event, which is not placed into the ring buffer, it is
> copied (with appropriate update of the last index and new overflow
> condition).

Well, I'm not sure it's good to do this 'move one event from the ready
list to slot X' one by one, because that event will likely be flushed out
of the processor cache (we will have to consume 4096 events before
reaching it). I think it's better to batch this as 'push XX events' later,
XX being small enough not to waste CPU cache, and once the ring buffer is
empty again.

The mmap buffer is good for latency and minimal synchronization between
the user thread and the kernel producer. But once we hit an 'overflow', it
is better to revert to a mode feeding XX events per syscall, to be sure it
fits the CPU caches: the user thread does the copy from kernel memory to
user memory, and this thread will shortly use those events in user land.

BTW, maintaining coherency on the mmap buffer is expensive: once an event
is copied to the mmap buffer, the kernel has to issue an smp_mb() before
updating the index, so that a user thread won't start to consume an event
with random values because its CPU sees the update of the index before the
updates of the data.
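
The ordering requirement pairs up roughly like this (a generic
producer/consumer sketch, not kevent code; on the producer side a write
barrier is the minimum needed):

        /* kernel (producer) */
        copy_event(&ring->event[idx]);  /* 1. fill the slot */
        smp_wmb();                      /* 2. data visible before index */
        ring->last = idx + 1;           /* 3. publish */

        /* userspace (consumer) */
        last = ring->last;              /* 1. read the index */
        rmb();                          /* 2. userspace read barrier */
        consume(&ring->event[first]);   /* 3. only now read the data */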

Once the whole queue has been flushed in an efficient way, we can switch
to mmap mode again.

Eric

2006-10-17 15:33:39

by Hans Henrik Happe

[permalink] [raw]
Subject: Re: [take19 1/4] kevent: Core files.

On Tuesday 17 October 2006 16:25, Eric Dumazet wrote:
> On Tuesday 17 October 2006 16:07, Evgeniy Polyakov wrote:
> > On Tue, Oct 17, 2006 at 03:52:34PM +0200, Eric Dumazet ([email protected]) wrote:
> > > > What about the case, which I described in other e-mail, when in case of
> > > > the full ring buffer, no new events are written there, and when
> > > > userspace commits (i.e. marks as ready to be freed or requeued by
> > > > kernel) some events, new ones will be copied from ready queue into the
> > > > buffer?
> > >
> > > Then, user might receive 'false events', exactly like
> > > poll()/select()/epoll() can do sometime. IE a 'ready' indication while
> > > there is no current event available on a particular fd / event_source.
> >
> > Only if user simultaneously uses oth interfaces and remove even from the
> > queue when it's copy was in mapped buffer, but in that case it's user's
> > problem (and if we do want, we can store pointer/index of the ring
> > buffer entry, so when event is removed from the ready queue (using
> > kevent_get_events()), appropriate entry in the ring buffer will be
> > updated to show that it is no longer valid.
> >
> > > This should be safe, since those programs already ignore read()
> > > returns -EAGAIN and other similar things.
> > >
> > > Programmer prefers to receive two 'event available' indications than ZERO
> > > (and be stuck for infinite time). Of course, hot path (normal cases)
> > > should return one 'event' only.
> > >
> > > In order words, being ultra fast 99.99 % of the time, but being able to
> > > block forever once in a while is not an option.
> >
> > Have I missed something? It looks like the only problematic situation is
> > described above when user simultaneously uses both interfaces.
>
> In my point of view, user of the 'mmaped ring buffer' should be prepared to
> use both interfaces. Or else you are forced to presize the ring buffer to
> insane limits.

I don't see why overflow couldn't be handled by a syscall telling the
kernel that the buffer is ready for new events. As mentioned, most of the
time overflow should not happen, and if it does the syscall should be
amortized nicely by the number of events.

> That is :
> - Most of the time, we expect consuming events via mmaped ring buffer and no
> syscalls.
> - In case we notice a 'mmaped ring buffer overflow', syscalls to get/consume
> events that could not be stored in mmaped buffer (but queued by kevent
> subsystem). If not stored by kevent subsystem (memory failure ?), revert to
> poll() to fetch all 'missed fds' in one row. Go back to normal mode.
>
> - In case of empty ring buffer (or no mmap support at all, because this app
> doesnt expect lot of events per time unit, or because kevent dont have mmap
> support) : Be able to syscall and wait for an event.

As I see it there are two main problems with a mmapped ring buffer (correct me
if I'm wrong):

1. Overflow.
2. Handling multiple kernel events that need only one user event, i.e.
multiple packets arriving at the same socket. The user should only see one
IN event at the time he is ready to handle it.

In an earlier post I suggested a scheme that solves these issues. It was
based on the assumption that kernel and user-space share index variables
and can read/update them atomically without much overhead. Only in the
cases where the buffer is empty or full would a system call be required.
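
Under that assumption the user-side loop would look roughly like this
(illustrative sketch only; C11-style atomics stand in for whatever
primitives are actually used, and the names are hypothetical):

        for (;;) {
                unsigned int head = atomic_load(&shared->head); /* kernel writes */
                unsigned int tail = shared->tail;               /* user writes */

                if (head == tail) {
                        kevent_wait_syscall();  /* empty: only now enter the kernel */
                        continue;
                }
                handle_event(&shared->ring[tail % RING_SIZE]);
                atomic_store(&shared->tail, tail + 1);  /* may reopen a full ring */
        }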

Hans Henrik Happe

2006-10-17 16:03:10

by Evgeniy Polyakov

[permalink] [raw]
Subject: Re: [take19 1/4] kevent: Core files.

On Tue, Oct 17, 2006 at 05:32:28PM +0200, Eric Dumazet ([email protected]) wrote:
> > So the most complex case is when user is going to use both interfaces,
> > and it's steps when mapped ring buffer has overflow.
> > In that case user can either read and mark some events as ready in ring
> > buffer (the latter is being done through special syscall), so kevent
> > core will put there new ready events.
> > User can also get events using usual syscall, in that case events in
> > ring buffer must be updated - and actually I implemented mapped buffer
> > in the way which allows to remove events from the queue - queue is a
> > FIFO, and the first entry to be obtained through syscall is _always_ the
> > first entry in the ring buffer.
> >
> > So when user reads event through syscall (no matter if we are in overflow
> > case or not), even being read is easily accessible in the ring buffer.
> >
> > So I propose following design for ring buffer (quite simple):
> > kernelspace maintains two indexes - to the first and the last events in
> > the ring buffer (and maximum size of the buffer of course).
> > When new event is marked as ready, some info is being copied into ring
> > buffer and index of the last entry is increased.
> > When event is being read through syscall it is _guaranteed_ that that
> > event will be at the position pointed by the index of the first
> > element, that index is then increased (thus opening new slot in the
> > buffer).
> > If index of the last entry reaches (with possible wrapping) index of the
> > first entry, that means that overflow has happend. In this case no new
> > events can be copied into ring buffer, so they are only placed into
> > ready queue (accessible through syscall kevent_get_events()).
> >
> > When user calls kevent_get_events() it will obtain the first element
> > (pointed by index of the first element in the ring buffer), and if there
> > is ready event, which is not placed into the ring buffer, it is
> > copied (with appropriate update of the last index and new overflow
> > condition).
>
> Well, I'm not sure its good to do this 'move one event from ready list to slot
> X', one by one, because this event will likely be flushed out of processor
> cache (because we will have to consume 4096 events before reaching this one).
> I think its better to batch this kind of 'push XX events' later, XX being
> small enough not to waste CPU cache, and when ring buffer is empty again.

Ok, that's possible.

> mmap buffer is good for latency and minimum synchro between user thread and
> kernel producer. But once we hit an 'overflow', it is better to revert to a
> mode feeding XX events per syscall, to be sure it fits CPU caches : The user
> thread will do the copy between kernel memory to user memory, and this thread
> will shortly use those events in user land.

The user can do both - either get events through the syscall, or get them
from the mapped ring buffer when it is refilled.

> BTW, maintaining coherency on mmap buffer is expensive : once a event is
> copied to mmap buffer, kernel has to issue a smp_mb() before updating the
> index, so that a user thread wont start to consume an event with random
> values because its CPU see the update on index before updates on data.

There will be some tricks with barriers indeed.

> Once all the queue is flushed in efficient way, we can switch to mmap mode
> again.
>
> Eric

Ok, there is one apologist for the mmap buffer implementation, who forced
me to create the first implementation, which was dropped due to the
absence of remote mind-reading abilities.
Ulrich, does the above approach sound good to you?
I actually do not want to reimplement something that will again be pointed
at with the words 'no matter what you say, it is broken and I do not want
it' :).

--
Evgeniy Polyakov

2006-10-17 16:26:07

by Eric Dumazet

[permalink] [raw]
Subject: Re: [take19 1/4] kevent: Core files.

On Tuesday 17 October 2006 18:01, Evgeniy Polyakov wrote:

> Ok, there is one apologist for mmap buffer implementation, who forced me
> to create first implementation, which was dropped due to absense of
> remote mental reading abilities.
> Ulrich, does above approach sound good for you?
> I actually do not want to reimplement something, that will be
> pointed to with words 'no matter what you say, it is broken and I do not
> want it' again :).

In my humble opinion, you should first write a 'real application', to show
how the mmap buffer and kevent syscalls would be used (fast path and
slow/recovery paths). I am sure it would be easier for everybody to agree
on the API *before* you start coding a *lot* of hard (kernel) stuff: it
would certainly save your mental CPU cycles (and ours too :) )

This 'real application' could be the event loop of a simple HTTP server,
or a basic 'echo all' server. Adding the bits about timer events and
signals should be done too.

Eric

2006-10-17 16:36:29

by Evgeniy Polyakov

[permalink] [raw]
Subject: Re: [take19 1/4] kevent: Core files.

On Tue, Oct 17, 2006 at 06:26:04PM +0200, Eric Dumazet ([email protected]) wrote:
> On Tuesday 17 October 2006 18:01, Evgeniy Polyakov wrote:
>
> > Ok, there is one apologist for mmap buffer implementation, who forced me
> > to create first implementation, which was dropped due to absense of
> > remote mental reading abilities.
> > Ulrich, does above approach sound good for you?
> > I actually do not want to reimplement something, that will be
> > pointed to with words 'no matter what you say, it is broken and I do not
> > want it' again :).
>
> In my humble opinion, you should first write a 'real application', to show how
> the mmap buffer and kevent syscalls would be used (fast path and
> slow/recovery paths). I am sure it would be easier for everybody to agree on
> the API *before* you start coding a *lot* of hard (kernel) stuff : It would
> certainly save your mental CPU cycles (and ours too :) )
>
> This 'real application' could be the event loop of a simple HTTP server, or a
> basic 'echo all' server. Adding the bits about timers events and signals
> should be done too.

I wrote one with the previous ring buffer implementation - it used timers
and echoed when they fired; it was even described in detail in one of the
lwn.net articles.

I'm not going to waste others' time and my own implementing feature
requests without at least _some_ feedback from those who asked for them.
In case the person who originally requested some feature does not answer
and there are other opinions, only those will be taken into account, of
course.

> Eric

--
Evgeniy Polyakov

2006-10-17 16:45:57

by Eric Dumazet

[permalink] [raw]
Subject: Re: [take19 1/4] kevent: Core files.

On Tuesday 17 October 2006 18:35, Evgeniy Polyakov wrote:
> On Tue, Oct 17, 2006 at 06:26:04PM +0200, Eric Dumazet ([email protected]) wrote:
> > On Tuesday 17 October 2006 18:01, Evgeniy Polyakov wrote:
> > > Ok, there is one apologist for mmap buffer implementation, who forced
> > > me to create first implementation, which was dropped due to absense of
> > > remote mental reading abilities.
> > > Ulrich, does above approach sound good for you?
> > > I actually do not want to reimplement something, that will be
> > > pointed to with words 'no matter what you say, it is broken and I do
> > > not want it' again :).
> >
> > In my humble opinion, you should first write a 'real application', to
> > show how the mmap buffer and kevent syscalls would be used (fast path and
> > slow/recovery paths). I am sure it would be easier for everybody to agree
> > on the API *before* you start coding a *lot* of hard (kernel) stuff : It
> > would certainly save your mental CPU cycles (and ours too :) )
> >
> > This 'real application' could be the event loop of a simple HTTP server,
> > or a basic 'echo all' server. Adding the bits about timers events and
> > signals should be done too.
>
> I wrote one with previous ring buffer implementation - it used timers
> and echoed when they fired, it was even described in details in one of the
> lwn.net articles.
>
> I'm not going to waste others and my time implementing feature requests
> without at least _some_ feedback from those who asked them.
> In case when person, originally requested some feature, does not answer
> and there are other opinions, only they will be get into account of
> course.

I am not sure I understand what you wrote; English is not our native
language.

I think many people gave you feedback. I feel that all feedback on this
mailing list is constructive. Many posts/patches on this list are never
commented on at all.

Eric

2006-10-18 04:11:12

by Evgeniy Polyakov

[permalink] [raw]
Subject: Re: [take19 1/4] kevent: Core files.

On Tue, Oct 17, 2006 at 06:45:54PM +0200, Eric Dumazet ([email protected]) wrote:
> On Tuesday 17 October 2006 18:35, Evgeniy Polyakov wrote:
> > On Tue, Oct 17, 2006 at 06:26:04PM +0200, Eric Dumazet ([email protected]) wrote:
> > > On Tuesday 17 October 2006 18:01, Evgeniy Polyakov wrote:
> > > > Ok, there is one apologist for mmap buffer implementation, who forced
> > > > me to create first implementation, which was dropped due to absense of
> > > > remote mental reading abilities.
> > > > Ulrich, does above approach sound good for you?
> > > > I actually do not want to reimplement something, that will be
> > > > pointed to with words 'no matter what you say, it is broken and I do
> > > > not want it' again :).
> > >
> > > In my humble opinion, you should first write a 'real application', to
> > > show how the mmap buffer and kevent syscalls would be used (fast path and
> > > slow/recovery paths). I am sure it would be easier for everybody to agree
> > > on the API *before* you start coding a *lot* of hard (kernel) stuff : It
> > > would certainly save your mental CPU cycles (and ours too :) )
> > >
> > > This 'real application' could be the event loop of a simple HTTP server,
> > > or a basic 'echo all' server. Adding the bits about timers events and
> > > signals should be done too.
> >
> > I wrote one with previous ring buffer implementation - it used timers
> > and echoed when they fired, it was even described in details in one of the
> > lwn.net articles.
> >
> > I'm not going to waste others and my time implementing feature requests
> > without at least _some_ feedback from those who asked them.
> > In case when person, originally requested some feature, does not answer
> > and there are other opinions, only they will be get into account of
> > course.
>
> I am not sure I understand what you wrote, English is not our native language.
>
> I think many people gave you feedbacks. I feel that all feedback on this
> mailing list is constructive. Many posts/patches on this list are never
> commented at all.

And I do greatly appreciate feedback from those people!

But I do not understand why I never got feedback on the initial design
and implementation (and, as far as I recall, on at least 10 subsequent
releases) from Ulrich, who first asked for such a feature.
So right now I'm waiting for his opinion on that problem, even if it will
be 'it sucks' again - at least in that case I will not waste people's
time.

Ulrich, could you please comment on the design notes sent a couple of
mails above?

> Eric

--
Evgeniy Polyakov

2006-10-18 04:45:37

by Eric Dumazet

[permalink] [raw]
Subject: Re: [take19 1/4] kevent: Core files.

Evgeniy Polyakov wrote:
> On Tue, Oct 17, 2006 at 06:45:54PM +0200, Eric Dumazet ([email protected]) wrote:
>> I am not sure I understand what you wrote, English is not our native language.
>>
>> I think many people gave you feedbacks. I feel that all feedback on this
>> mailing list is constructive. Many posts/patches on this list are never
>> commented at all.
>
> And I do greatly appreciate feedback from those people!
>
> But I do not understand why I never got feedback on initial design and
> implementation (and then created as far as I recall at least 10
> releases) from Ulrich, who first asked for such a feture.
> So right now I'm waiting for his opinion on that problem, even if it will
> be 'it sucks' again, but at least in that case I will not waste people's time.
>
> Ulrich, could you please comment on design notes sent couple of mail
> above?


Ulrich is a very busy man. We have to live with that.

<rant_mode>
For example, I *complained* one day that each glibc
fopen()/fread()/fclose() pass does an mmap()/munmap() to obtain a single
4KB of memory, without any caching mechanism. This badly hurts the
performance of multi-threaded programs, as we know mmap()/munmap() has to
down_write(&mm->mmap_sem) and play VM games.

So to avoid this, I manually call setvbuf() in my own programs to provide
a suitable buffer to glibc, because of its suboptimal default allocation,
a vestige of an old epoch...
</rant_mode>
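
For reference, the workaround is just standard stdio (nothing
kevent-specific; the filename and buffer size are arbitrary):

        #include <stdio.h>

        static char buf[64 * 1024];     /* reused buffer instead of a per-FILE mmap()ed 4KB */

        void open_with_buffer(void)
        {
                FILE *f = fopen("data", "r");
                if (f)
                        setvbuf(f, buf, _IOFBF, sizeof(buf));   /* before the first read */
        }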

Eric