2006-10-27 16:11:31

by Evgeniy Polyakov

Subject: [take21 0/4] kevent: Generic event handling mechanism.


Generic event handling mechanism.

Consider for inclusion.

Changes from 'take20' patchset:
* new ring buffer implementation
* removed artificial limit on possible number of kevents
With this release and a fixed userspace web server it was possible to
achieve 3960+ req/s with a client connection rate of 4000 con/s
over a 100 Mbit LAN; data IO over the network was about 10582.7 KB/s, which
is very close to wire speed if we take headers and the like into account.

Changes from 'take19' patchset:
* use __init instead of __devinit
* removed 'default N' from config for user statistic
* removed kevent_user_fini() since kevent can not be unloaded
* use KERN_INFO for statistic output

Changes from 'take18' patchset:
* use __init instead of __devinit
* removed 'default N' from config for user statistic
* removed kevent_user_fini() since kevent can not be unloaded
* use KERN_INFO for statistic output

Changes from 'take17' patchset:
* Use RB tree instead of hash table.
At least for a web sever, frequency of addition/deletion of new kevent
is comparable with number of search access, i.e. most of the time events
are added, accesed only couple of times and then removed, so it justifies
RB tree usage over AVL tree, since the latter does have much slower deletion
time (max O(log(N)) compared to 3 ops),
although faster search time (1.44*O(log(N)) vs. 2*O(log(N))).
So for kevents I use RB tree for now and later, when my AVL tree implementation
is ready, it will be possible to compare them.
* Changed readiness check for socket notifications.

With both of the above changes it is possible to achieve more than 3380 req/second, compared to 2200
(sometimes 2500) req/second for epoll(), for a trivial web server and httperf client on the same
hardware.
It is possible that the above kevent limit is due to the maximum number of kevents allowed at a time,
which is 4096 events.

Changes from 'take16' patchset:
* misc cleanups (__read_mostly, const ...)
* created a special macro used for mmap size (number of pages) calculation
* export kevent_socket_notify(), since it is used in network protocols which can be
built as modules (IPv6 for example)

Changes from 'take15' patchset:
* converted kevent_timer to high-resolution timers; this forces a timer API update at
http://linux-net.osdl.org/index.php/Kevent
* use struct ukevent* instead of void * in syscalls (documentation has been updated)
* added warning in kevent_add_ukevent() if ring has broken index (for testing)

Changes from 'take14' patchset:
* added kevent_wait()
This syscall waits until either the timeout expires or at least one event
becomes ready. It also commits that @num events starting from @start have been processed
by userspace and thus can be removed or rearmed (depending on their flags).
It can be used to commit events read by userspace through the mmap interface;
a minimal usage sketch follows this list.
Example userspace code (evtest.c) can be found on the project's homepage.
* added socket notifications (send/recv/accept)
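
The following is a rough, untested sketch of how the mmap ring and kevent_wait() are
meant to be used together (evtest.c on the homepage is the authoritative example).
It assumes <linux/ukevent.h> is visible to userspace, that the kevent control
descriptor was opened beforehand, and it only handles the first ring page,
ignoring wraparound and the multi-page case:

#include <string.h>
#include <sys/mman.h>
#include <sys/syscall.h>
#include <unistd.h>
#include <linux/unistd.h>
#include <linux/ukevent.h>

#ifndef PAGE_SIZE
#define PAGE_SIZE 4096UL /* assumption for this sketch */
#endif

/* Hypothetical wrapper; the patchset adds the syscall but no libc stub. */
static long do_kevent_wait(int fd, unsigned int start, unsigned int num,
                           unsigned long long timeout)
{
        return syscall(__NR_kevent_wait, fd, start, num, timeout);
}

static int consume_ring(int fd)
{
        struct kevent_mring *ring;
        unsigned int idx, start, num = 0;

        /* Map the first ring page read-only; it carries the kidx/uidx indexes. */
        ring = mmap(NULL, PAGE_SIZE, PROT_READ, MAP_SHARED, fd, 0);
        if (ring == MAP_FAILED)
                return -1;

        /* Block until at least one event is ready; the timeout is in nanoseconds. */
        do_kevent_wait(fd, ring->uidx, 0, 1000000000ULL);

        /* Events between uidx (user index) and kidx (kernel index) are ready. */
        start = ring->uidx;
        for (idx = start; idx != ring->kidx && idx < KEVENTS_ON_PAGE; ++idx, ++num) {
                struct mukevent *e = &ring->event[idx];
                /* e->id and e->ret_flags identify the fired kevent. */
                (void)e;
        }

        /* Commit @num processed events starting at @start so their slots can be reused. */
        do_kevent_wait(fd, start, num, 0);

        munmap(ring, PAGE_SIZE);
        return num;
}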

Changes from 'take13' patchset:
* do not take the lock around the user data check in __kevent_search()
* fail early if there were no registered callbacks for given type of kevent
* trailing whitespace cleanup

Changes from 'take12' patchset:
* remove non-chardev interface for initialization
* use pointer to kevent_mring instead of unsigned longs
* use aligned 64bit type in raw user data (can be used by high-res timer if needed)
* simplified enqueue/dequeue callbacks and kevent initialization
* use nanoseconds for timeout
* put number of milliseconds into timer's return data
* move some definitions into user-visible header
* removed filenames from comments

Changes from 'take11' patchset:
* include missing headers into patchset
* some trivial code cleanups (use goto instead of if/else games and so on)
* some whitespace cleanups
* check for ready_callback() callback before main loop which should save us some ticks

Changes from 'take10' patchset:
* removed non-existent prototypes
* added helper function for kevent_registered_callbacks
* fixed comments exceeding 80 columns
* added a header shared between userspace and kernelspace instead of embedding the definitions in one place
* core restructuring to remove forward declarations
* some whitespace/coding style cleanups
* use vm_insert_page() instead of remap_pfn_range()

Changes from 'take9' patchset:
* fixed ->nopage method

Changes from 'take8' patchset:
* fixed mmap release bug
* use module_init() instead of late_initcall()
* use better structures for timer notifications

Changes from 'take7' patchset:
* new mmap interface (not tested, waiting for other changes to be acked)
- use the nopage() method to dynamically substitute pages
- allocate a new page for events only when a newly added kevent requires it
- do not use ugly index dereferencing, use a structure instead
- reduced amount of data in the ring (only id and flags),
maximum 12 pages on x86 per kevent fd

Changes from 'take6' patchset:
* a lot of comments!
* do not use list poisoning to detect whether an entry is in the list
* return number of ready kevents even if copy*user() fails
* strict check for number of kevents in syscall
* use ARRAY_SIZE for array size calculation
* changed superblock magic number
* use SLAB_PANIC instead of direct panic() call
* changed -E* return values
* a lot of small cleanups and indent fixes

Changes from 'take5' patchset:
* removed compilation warnings about unused variables when lockdep is not turned on
* do not use internal socket structures, use appropriate (exported) wrappers instead
* removed default 1 second timeout
* removed AIO stuff from patchset

Changes from 'take4' patchset:
* use miscdevice instead of chardevice
* comments fixes

Changes from 'take3' patchset:
* removed serializing mutex from kevent_user_wait()
* moved storage list processing to RCU
* removed lockdep warnings - all storage locks are initialized in the same function, so lockdep
has been taught to differentiate between the various cases
* remove kevent from storage if it is marked as broken after callback
* fixed a typo in the mmapped buffer implementation which would result in wrong index calculation

Changes from 'take2' patchset:
* split kevent_finish_user() to locked and unlocked variants
* do not use KEVENT_STAT ifdefs, use inline functions instead
* use array of callbacks of each type instead of each kevent callback initialization
* changed name of ukevent guarding lock
* use only one kevent lock in kevent_user for all hash buckets instead of per-bucket locks
* do not use the kevent_user_ctl structure; instead provide the needed arguments as syscall parameters
* various indent cleanups
* added an optimisation aimed at the case when a lot of kevents are being copied from
userspace
* mapped buffer (initial) implementation (no userspace yet)

Changes from 'take1' patchset:
- rebased against 2.6.18-git tree
- removed ioctl controlling
- added new syscall kevent_get_events(int fd, unsigned int min_nr, unsigned int max_nr,
unsigned int timeout, void __user *buf, unsigned flags)
- use the old kevent_ctl syscall for creation/removal, modification and initial kevent
initialization
- use mutexes instead of semaphores
- added a file descriptor check; return an error if the provided descriptor does not match
kevent file operations
- various indent fixes
- removed aio_sendfile() declarations.

Thank you.

Signed-off-by: Evgeniy Polyakov <[email protected]>



2006-10-27 16:10:45

by Evgeniy Polyakov

Subject: [take21 2/4] kevent: poll/select() notifications.


poll/select() notifications.

This patch includes generic poll/select notifications.
kevent_poll works similarly to epoll and has the same issues (the callback
is invoked not from the internal state machine of the caller, but through
a process wakeup, with a lot of allocations and so on).
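
As a usage illustration only (not part of this patch), registering such a
notification from userspace might look like the sketch below; kev_fd is an
assumed kevent control descriptor, and the ukevent layout and syscall numbers
come from the core patch of this series:

#include <string.h>
#include <sys/syscall.h>
#include <unistd.h>
#include <linux/unistd.h>
#include <linux/ukevent.h>

/* Ask for a one-shot POLLIN readiness notification for @fd via the @kev_fd queue. */
static int add_pollin_kevent(int kev_fd, int fd)
{
        struct ukevent uk;

        memset(&uk, 0, sizeof(uk));
        uk.id.raw[0] = fd;                 /* kevent_poll_enqueue() does fget(id.raw[0]) */
        uk.type = KEVENT_POLL;
        uk.event = KEVENT_POLL_POLLIN;
        uk.req_flags = KEVENT_REQ_ONESHOT; /* dequeue after the first delivery */

        /* Returns the number of immediately ready events or a negative error. */
        return syscall(__NR_kevent_ctl, kev_fd, KEVENT_CTL_ADD, 1, &uk);
}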

Signed-off-by: Evgeniy Polyakov <[email protected]>

diff --git a/include/linux/fs.h b/include/linux/fs.h
index 5baf3a1..f81299f 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -276,6 +276,7 @@ #include <linux/prio_tree.h>
#include <linux/init.h>
#include <linux/sched.h>
#include <linux/mutex.h>
+#include <linux/kevent.h>

#include <asm/atomic.h>
#include <asm/semaphore.h>
@@ -586,6 +587,10 @@ #ifdef CONFIG_INOTIFY
struct mutex inotify_mutex; /* protects the watches list */
#endif

+#ifdef CONFIG_KEVENT_SOCKET
+ struct kevent_storage st;
+#endif
+
unsigned long i_state;
unsigned long dirtied_when; /* jiffies of first dirtying */

@@ -739,6 +744,9 @@ #ifdef CONFIG_EPOLL
struct list_head f_ep_links;
spinlock_t f_ep_lock;
#endif /* #ifdef CONFIG_EPOLL */
+#ifdef CONFIG_KEVENT_POLL
+ struct kevent_storage st;
+#endif
struct address_space *f_mapping;
};
extern spinlock_t files_lock;
diff --git a/kernel/kevent/kevent_poll.c b/kernel/kevent/kevent_poll.c
new file mode 100644
index 0000000..fb74e0f
--- /dev/null
+++ b/kernel/kevent/kevent_poll.c
@@ -0,0 +1,222 @@
+/*
+ * 2006 Copyright (c) Evgeniy Polyakov <[email protected]>
+ * All rights reserved.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ */
+
+#include <linux/kernel.h>
+#include <linux/types.h>
+#include <linux/list.h>
+#include <linux/slab.h>
+#include <linux/spinlock.h>
+#include <linux/timer.h>
+#include <linux/file.h>
+#include <linux/kevent.h>
+#include <linux/poll.h>
+#include <linux/fs.h>
+
+static kmem_cache_t *kevent_poll_container_cache;
+static kmem_cache_t *kevent_poll_priv_cache;
+
+struct kevent_poll_ctl
+{
+ struct poll_table_struct pt;
+ struct kevent *k;
+};
+
+struct kevent_poll_wait_container
+{
+ struct list_head container_entry;
+ wait_queue_head_t *whead;
+ wait_queue_t wait;
+ struct kevent *k;
+};
+
+struct kevent_poll_private
+{
+ struct list_head container_list;
+ spinlock_t container_lock;
+};
+
+static int kevent_poll_enqueue(struct kevent *k);
+static int kevent_poll_dequeue(struct kevent *k);
+static int kevent_poll_callback(struct kevent *k);
+
+static int kevent_poll_wait_callback(wait_queue_t *wait,
+ unsigned mode, int sync, void *key)
+{
+ struct kevent_poll_wait_container *cont =
+ container_of(wait, struct kevent_poll_wait_container, wait);
+ struct kevent *k = cont->k;
+ struct file *file = k->st->origin;
+ u32 revents;
+
+ revents = file->f_op->poll(file, NULL);
+
+ kevent_storage_ready(k->st, NULL, revents);
+
+ return 0;
+}
+
+static void kevent_poll_qproc(struct file *file, wait_queue_head_t *whead,
+ struct poll_table_struct *poll_table)
+{
+ struct kevent *k =
+ container_of(poll_table, struct kevent_poll_ctl, pt)->k;
+ struct kevent_poll_private *priv = k->priv;
+ struct kevent_poll_wait_container *cont;
+ unsigned long flags;
+
+ cont = kmem_cache_alloc(kevent_poll_container_cache, SLAB_KERNEL);
+ if (!cont) {
+ kevent_break(k);
+ return;
+ }
+
+ cont->k = k;
+ init_waitqueue_func_entry(&cont->wait, kevent_poll_wait_callback);
+ cont->whead = whead;
+
+ spin_lock_irqsave(&priv->container_lock, flags);
+ list_add_tail(&cont->container_entry, &priv->container_list);
+ spin_unlock_irqrestore(&priv->container_lock, flags);
+
+ add_wait_queue(whead, &cont->wait);
+}
+
+static int kevent_poll_enqueue(struct kevent *k)
+{
+ struct file *file;
+ int err, ready = 0;
+ unsigned int revents;
+ struct kevent_poll_ctl ctl;
+ struct kevent_poll_private *priv;
+
+ file = fget(k->event.id.raw[0]);
+ if (!file)
+ return -ENODEV;
+
+ err = -EINVAL;
+ if (!file->f_op || !file->f_op->poll)
+ goto err_out_fput;
+
+ err = -ENOMEM;
+ priv = kmem_cache_alloc(kevent_poll_priv_cache, SLAB_KERNEL);
+ if (!priv)
+ goto err_out_fput;
+
+ spin_lock_init(&priv->container_lock);
+ INIT_LIST_HEAD(&priv->container_list);
+
+ k->priv = priv;
+
+ ctl.k = k;
+ init_poll_funcptr(&ctl.pt, &kevent_poll_qproc);
+
+ err = kevent_storage_enqueue(&file->st, k);
+ if (err)
+ goto err_out_free;
+
+ revents = file->f_op->poll(file, &ctl.pt);
+ if (revents & k->event.event) {
+ ready = 1;
+ kevent_poll_dequeue(k);
+ }
+
+ return ready;
+
+err_out_free:
+ kmem_cache_free(kevent_poll_priv_cache, priv);
+err_out_fput:
+ fput(file);
+ return err;
+}
+
+static int kevent_poll_dequeue(struct kevent *k)
+{
+ struct file *file = k->st->origin;
+ struct kevent_poll_private *priv = k->priv;
+ struct kevent_poll_wait_container *w, *n;
+ unsigned long flags;
+
+ kevent_storage_dequeue(k->st, k);
+
+ spin_lock_irqsave(&priv->container_lock, flags);
+ list_for_each_entry_safe(w, n, &priv->container_list, container_entry) {
+ list_del(&w->container_entry);
+ remove_wait_queue(w->whead, &w->wait);
+ kmem_cache_free(kevent_poll_container_cache, w);
+ }
+ spin_unlock_irqrestore(&priv->container_lock, flags);
+
+ kmem_cache_free(kevent_poll_priv_cache, priv);
+ k->priv = NULL;
+
+ fput(file);
+
+ return 0;
+}
+
+static int kevent_poll_callback(struct kevent *k)
+{
+ struct file *file = k->st->origin;
+ unsigned int revents = file->f_op->poll(file, NULL);
+
+ k->event.ret_data[0] = revents & k->event.event;
+
+ return (revents & k->event.event);
+}
+
+static int __init kevent_poll_sys_init(void)
+{
+ struct kevent_callbacks pc = {
+ .callback = &kevent_poll_callback,
+ .enqueue = &kevent_poll_enqueue,
+ .dequeue = &kevent_poll_dequeue};
+
+ kevent_poll_container_cache = kmem_cache_create("kevent_poll_container_cache",
+ sizeof(struct kevent_poll_wait_container), 0, 0, NULL, NULL);
+ if (!kevent_poll_container_cache) {
+ printk(KERN_ERR "Failed to create kevent poll container cache.\n");
+ return -ENOMEM;
+ }
+
+ kevent_poll_priv_cache = kmem_cache_create("kevent_poll_priv_cache",
+ sizeof(struct kevent_poll_private), 0, 0, NULL, NULL);
+ if (!kevent_poll_priv_cache) {
+ printk(KERN_ERR "Failed to create kevent poll private data cache.\n");
+ kmem_cache_destroy(kevent_poll_container_cache);
+ kevent_poll_container_cache = NULL;
+ return -ENOMEM;
+ }
+
+ kevent_add_callbacks(&pc, KEVENT_POLL);
+
+ printk(KERN_INFO "Kevent poll()/select() subsystem has been initialized.\n");
+ return 0;
+}
+
+static struct lock_class_key kevent_poll_key;
+
+void kevent_poll_reinit(struct file *file)
+{
+ lockdep_set_class(&file->st.lock, &kevent_poll_key);
+}
+
+static void __exit kevent_poll_sys_fini(void)
+{
+ kmem_cache_destroy(kevent_poll_priv_cache);
+ kmem_cache_destroy(kevent_poll_container_cache);
+}
+
+module_init(kevent_poll_sys_init);
+module_exit(kevent_poll_sys_fini);

2006-10-27 16:10:49

by Evgeniy Polyakov

Subject: [take21 4/4] kevent: Timer notifications.


Timer notifications.

Timer notifications can be used for fine-grained per-process time
management, since interval timers are very inconvenient to use
and limited.

This subsystem uses high-resolution timers.
id.raw[0] is used as the number of seconds
id.raw[1] is used as the number of nanoseconds
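
For illustration only (not part of the patch), arming such a timer from
userspace might look like the sketch below; kev_fd is an assumed kevent
control descriptor, and the syscall numbers come from the core patch:

#include <string.h>
#include <sys/syscall.h>
#include <unistd.h>
#include <linux/unistd.h>
#include <linux/ukevent.h>

/* Request a 250 ms periodic timer notification on the @kev_fd queue. */
static int add_timer_kevent(int kev_fd)
{
        struct ukevent uk;

        memset(&uk, 0, sizeof(uk));
        uk.id.raw[0] = 0;                  /* seconds */
        uk.id.raw[1] = 250 * 1000 * 1000;  /* nanoseconds */
        uk.type = KEVENT_TIMER;
        uk.event = KEVENT_TIMER_FIRED;
        /* Without KEVENT_REQ_ONESHOT the timer rearms itself until the kevent is removed. */

        return syscall(__NR_kevent_ctl, kev_fd, KEVENT_CTL_ADD, 1, &uk);
}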

Signed-off-by: Evgeniy Polyakov <[email protected]>

diff --git a/kernel/kevent/kevent_timer.c b/kernel/kevent/kevent_timer.c
new file mode 100644
index 0000000..04acc46
--- /dev/null
+++ b/kernel/kevent/kevent_timer.c
@@ -0,0 +1,113 @@
+/*
+ * 2006 Copyright (c) Evgeniy Polyakov <[email protected]>
+ * All rights reserved.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software
+ * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA
+ */
+
+#include <linux/kernel.h>
+#include <linux/types.h>
+#include <linux/list.h>
+#include <linux/slab.h>
+#include <linux/spinlock.h>
+#include <linux/hrtimer.h>
+#include <linux/jiffies.h>
+#include <linux/kevent.h>
+
+struct kevent_timer
+{
+ struct hrtimer ktimer;
+ struct kevent_storage ktimer_storage;
+ struct kevent *ktimer_event;
+};
+
+static int kevent_timer_func(struct hrtimer *timer)
+{
+ struct kevent_timer *t = container_of(timer, struct kevent_timer, ktimer);
+ struct kevent *k = t->ktimer_event;
+
+ kevent_storage_ready(&t->ktimer_storage, NULL, KEVENT_MASK_ALL);
+ hrtimer_forward(timer, timer->base->softirq_time,
+ ktime_set(k->event.id.raw[0], k->event.id.raw[1]));
+ return HRTIMER_RESTART;
+}
+
+static struct lock_class_key kevent_timer_key;
+
+static int kevent_timer_enqueue(struct kevent *k)
+{
+ int err;
+ struct kevent_timer *t;
+
+ t = kmalloc(sizeof(struct kevent_timer), GFP_KERNEL);
+ if (!t)
+ return -ENOMEM;
+
+ hrtimer_init(&t->ktimer, CLOCK_MONOTONIC, HRTIMER_REL);
+ t->ktimer.expires = ktime_set(k->event.id.raw[0], k->event.id.raw[1]);
+ t->ktimer.function = kevent_timer_func;
+ t->ktimer_event = k;
+
+ err = kevent_storage_init(&t->ktimer, &t->ktimer_storage);
+ if (err)
+ goto err_out_free;
+ lockdep_set_class(&t->ktimer_storage.lock, &kevent_timer_key);
+
+ err = kevent_storage_enqueue(&t->ktimer_storage, k);
+ if (err)
+ goto err_out_st_fini;
+
+ printk("%s: jiffies: %lu, timer: %p.\n", __func__, jiffies, &t->ktimer);
+ hrtimer_start(&t->ktimer, t->ktimer.expires, HRTIMER_REL);
+
+ return 0;
+
+err_out_st_fini:
+ kevent_storage_fini(&t->ktimer_storage);
+err_out_free:
+ kfree(t);
+
+ return err;
+}
+
+static int kevent_timer_dequeue(struct kevent *k)
+{
+ struct kevent_storage *st = k->st;
+ struct kevent_timer *t = container_of(st, struct kevent_timer, ktimer_storage);
+
+ hrtimer_cancel(&t->ktimer);
+ kevent_storage_dequeue(st, k);
+ kfree(t);
+
+ return 0;
+}
+
+static int kevent_timer_callback(struct kevent *k)
+{
+ k->event.ret_data[0] = jiffies_to_msecs(jiffies);
+ return 1;
+}
+
+static int __init kevent_init_timer(void)
+{
+ struct kevent_callbacks tc = {
+ .callback = &kevent_timer_callback,
+ .enqueue = &kevent_timer_enqueue,
+ .dequeue = &kevent_timer_dequeue};
+
+ return kevent_add_callbacks(&tc, KEVENT_TIMER);
+}
+module_init(kevent_init_timer);
+

2006-10-27 16:11:25

by Evgeniy Polyakov

Subject: [take21 1/4] kevent: Core files.


Core files.

This patch includes core kevent files:
* userspace control interface
* kernelspace interfaces
* initialization
* notification state machines

Some bits of documentation can be found on the project's homepage (and links from there):
http://tservice.net.ru/~s0mbre/old/?section=projects&item=kevent
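
For orientation only, a minimal round trip through the new syscalls might look
like the sketch below. It assumes the control descriptor comes from opening the
kevent miscdevice (taken to be /dev/kevent here) and uses a one-shot timer as the
payload; the timer notification itself is added by another patch in this series:

#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/syscall.h>
#include <unistd.h>
#include <linux/unistd.h>
#include <linux/ukevent.h>

int main(void)
{
        struct ukevent uk, ready[8];
        int fd, err, i;

        fd = open("/dev/kevent", O_RDWR);    /* assumed device node name */
        if (fd < 0)
                return 1;

        /* One-shot timer firing after one second. */
        memset(&uk, 0, sizeof(uk));
        uk.id.raw[0] = 1;                    /* seconds */
        uk.type = KEVENT_TIMER;
        uk.event = KEVENT_TIMER_FIRED;
        uk.req_flags = KEVENT_REQ_ONESHOT;

        /* ADD returns the number of immediately ready events; errors are negative. */
        err = syscall(__NR_kevent_ctl, fd, KEVENT_CTL_ADD, 1, &uk);
        if (err < 0)
                return 1;

        /* Wait up to two seconds (timeout in nanoseconds) for 1..8 ready events. */
        err = syscall(__NR_kevent_get_events, fd, 1, 8, 2000000000ULL, ready, 0);
        for (i = 0; i < err; ++i)
                printf("type %u event %u ret_flags 0x%x\n",
                       ready[i].type, ready[i].event, ready[i].ret_flags);

        close(fd);
        return 0;
}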

Signed-off-by: Evgeniy Polyakov <[email protected]>

diff --git a/arch/i386/kernel/syscall_table.S b/arch/i386/kernel/syscall_table.S
index 7e639f7..a9560eb 100644
--- a/arch/i386/kernel/syscall_table.S
+++ b/arch/i386/kernel/syscall_table.S
@@ -318,3 +318,6 @@ ENTRY(sys_call_table)
.long sys_vmsplice
.long sys_move_pages
.long sys_getcpu
+ .long sys_kevent_get_events
+ .long sys_kevent_ctl /* 320 */
+ .long sys_kevent_wait
diff --git a/arch/x86_64/ia32/ia32entry.S b/arch/x86_64/ia32/ia32entry.S
index b4aa875..cf18955 100644
--- a/arch/x86_64/ia32/ia32entry.S
+++ b/arch/x86_64/ia32/ia32entry.S
@@ -714,8 +714,11 @@ #endif
.quad compat_sys_get_robust_list
.quad sys_splice
.quad sys_sync_file_range
- .quad sys_tee
+ .quad sys_tee /* 315 */
.quad compat_sys_vmsplice
.quad compat_sys_move_pages
.quad sys_getcpu
+ .quad sys_kevent_get_events
+ .quad sys_kevent_ctl /* 320 */
+ .quad sys_kevent_wait
ia32_syscall_end:
diff --git a/include/asm-i386/unistd.h b/include/asm-i386/unistd.h
index bd99870..f009677 100644
--- a/include/asm-i386/unistd.h
+++ b/include/asm-i386/unistd.h
@@ -324,10 +324,13 @@ #define __NR_tee 315
#define __NR_vmsplice 316
#define __NR_move_pages 317
#define __NR_getcpu 318
+#define __NR_kevent_get_events 319
+#define __NR_kevent_ctl 320
+#define __NR_kevent_wait 321

#ifdef __KERNEL__

-#define NR_syscalls 319
+#define NR_syscalls 322
#include <linux/err.h>

/*
diff --git a/include/asm-x86_64/unistd.h b/include/asm-x86_64/unistd.h
index 6137146..c53d156 100644
--- a/include/asm-x86_64/unistd.h
+++ b/include/asm-x86_64/unistd.h
@@ -619,10 +619,16 @@ #define __NR_vmsplice 278
__SYSCALL(__NR_vmsplice, sys_vmsplice)
#define __NR_move_pages 279
__SYSCALL(__NR_move_pages, sys_move_pages)
+#define __NR_kevent_get_events 280
+__SYSCALL(__NR_kevent_get_events, sys_kevent_get_events)
+#define __NR_kevent_ctl 281
+__SYSCALL(__NR_kevent_ctl, sys_kevent_ctl)
+#define __NR_kevent_wait 282
+__SYSCALL(__NR_kevent_wait, sys_kevent_wait)

#ifdef __KERNEL__

-#define __NR_syscall_max __NR_move_pages
+#define __NR_syscall_max __NR_kevent_wait
#include <linux/err.h>

#ifndef __NO_STUBS
diff --git a/include/linux/kevent.h b/include/linux/kevent.h
new file mode 100644
index 0000000..125414c
--- /dev/null
+++ b/include/linux/kevent.h
@@ -0,0 +1,205 @@
+/*
+ * 2006 Copyright (c) Evgeniy Polyakov <[email protected]>
+ * All rights reserved.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software
+ * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA
+ */
+
+#ifndef __KEVENT_H
+#define __KEVENT_H
+#include <linux/types.h>
+#include <linux/list.h>
+#include <linux/rbtree.h>
+#include <linux/spinlock.h>
+#include <linux/mutex.h>
+#include <linux/wait.h>
+#include <linux/net.h>
+#include <linux/rcupdate.h>
+#include <linux/kevent_storage.h>
+#include <linux/ukevent.h>
+
+#define KEVENT_MIN_BUFFS_ALLOC 3
+
+struct kevent;
+struct kevent_storage;
+typedef int (* kevent_callback_t)(struct kevent *);
+
+/* @callback is called each time new event has been caught. */
+/* @enqueue is called each time new event is queued. */
+/* @dequeue is called each time event is dequeued. */
+
+struct kevent_callbacks {
+ kevent_callback_t callback, enqueue, dequeue;
+};
+
+#define KEVENT_READY 0x1
+#define KEVENT_STORAGE 0x2
+#define KEVENT_USER 0x4
+
+struct kevent
+{
+ /* Used for kevent freeing.*/
+ struct rcu_head rcu_head;
+ struct ukevent event;
+ /* This lock protects ukevent manipulations, e.g. ret_flags changes. */
+ spinlock_t ulock;
+
+ /* Entry of user's tree. */
+ struct rb_node kevent_node;
+ /* Entry of origin's queue. */
+ struct list_head storage_entry;
+ /* Entry of user's ready. */
+ struct list_head ready_entry;
+
+ u32 flags;
+
+ /* User who requested this kevent. */
+ struct kevent_user *user;
+ /* Kevent container. */
+ struct kevent_storage *st;
+
+ struct kevent_callbacks callbacks;
+
+ /* Private data for different storages.
+ * poll()/select storage has a list of wait_queue_t containers
+ * for each ->poll() { poll_wait()' } here.
+ */
+ void *priv;
+};
+
+struct kevent_user
+{
+ struct rb_root kevent_root;
+ spinlock_t kevent_lock;
+ /* Number of queued kevents. */
+ unsigned int kevent_num;
+
+ /* List of ready kevents. */
+ struct list_head ready_list;
+ /* Number of ready kevents. */
+ unsigned int ready_num;
+ /* Protects all manipulations with ready queue. */
+ spinlock_t ready_lock;
+
+ /* Protects against simultaneous kevent_user control manipulations. */
+ struct mutex ctl_mutex;
+ /* Wait until some events are ready. */
+ wait_queue_head_t wait;
+
+ /* Reference counter, increased for each new kevent. */
+ atomic_t refcnt;
+
+ /* First kevent which was not put into ring buffer due to overflow.
+ * It will be copied into the buffer, when first event will be removed
+ * from ready queue (and thus there will be an empty place in the
+ * ring buffer).
+ */
+ struct kevent *overflow_kevent;
+ /* Array of pages forming mapped ring buffer */
+ struct kevent_mring **pring;
+
+#ifdef CONFIG_KEVENT_USER_STAT
+ unsigned long im_num;
+ unsigned long wait_num, mmap_num;
+ unsigned long total;
+#endif
+};
+
+int kevent_enqueue(struct kevent *k);
+int kevent_dequeue(struct kevent *k);
+int kevent_init(struct kevent *k);
+void kevent_requeue(struct kevent *k);
+int kevent_break(struct kevent *k);
+
+int kevent_add_callbacks(const struct kevent_callbacks *cb, int pos);
+
+int kevent_user_ring_add_event(struct kevent *k);
+
+void kevent_storage_ready(struct kevent_storage *st,
+ kevent_callback_t ready_callback, u32 event);
+int kevent_storage_init(void *origin, struct kevent_storage *st);
+void kevent_storage_fini(struct kevent_storage *st);
+int kevent_storage_enqueue(struct kevent_storage *st, struct kevent *k);
+void kevent_storage_dequeue(struct kevent_storage *st, struct kevent *k);
+
+int kevent_user_add_ukevent(struct ukevent *uk, struct kevent_user *u);
+
+#ifdef CONFIG_KEVENT_POLL
+void kevent_poll_reinit(struct file *file);
+#else
+static inline void kevent_poll_reinit(struct file *file)
+{
+}
+#endif
+
+#ifdef CONFIG_KEVENT_USER_STAT
+static inline void kevent_stat_init(struct kevent_user *u)
+{
+ u->wait_num = u->im_num = u->total = 0;
+}
+static inline void kevent_stat_print(struct kevent_user *u)
+{
+ printk(KERN_INFO "%s: u: %p, wait: %lu, mmap: %lu, immediately: %lu, total: %lu.\n",
+ __func__, u, u->wait_num, u->mmap_num, u->im_num, u->total);
+}
+static inline void kevent_stat_im(struct kevent_user *u)
+{
+ u->im_num++;
+}
+static inline void kevent_stat_mmap(struct kevent_user *u)
+{
+ u->mmap_num++;
+}
+static inline void kevent_stat_wait(struct kevent_user *u)
+{
+ u->wait_num++;
+}
+static inline void kevent_stat_total(struct kevent_user *u)
+{
+ u->total++;
+}
+#else
+#define kevent_stat_print(u) ({ (void) u;})
+#define kevent_stat_init(u) ({ (void) u;})
+#define kevent_stat_im(u) ({ (void) u;})
+#define kevent_stat_wait(u) ({ (void) u;})
+#define kevent_stat_mmap(u) ({ (void) u;})
+#define kevent_stat_total(u) ({ (void) u;})
+#endif
+
+#ifdef CONFIG_KEVENT_SOCKET
+#ifdef CONFIG_LOCKDEP
+void kevent_socket_reinit(struct socket *sock);
+void kevent_sk_reinit(struct sock *sk);
+#else
+static inline void kevent_socket_reinit(struct socket *sock)
+{
+}
+static inline void kevent_sk_reinit(struct sock *sk)
+{
+}
+#endif
+void kevent_socket_notify(struct sock *sock, u32 event);
+int kevent_socket_dequeue(struct kevent *k);
+int kevent_socket_enqueue(struct kevent *k);
+#define sock_async(__sk) sock_flag(__sk, SOCK_ASYNC)
+#else
+static inline void kevent_socket_notify(struct sock *sock, u32 event)
+{
+}
+#define sock_async(__sk) ({ (void)__sk; 0; })
+#endif
+
+#endif /* __KEVENT_H */
diff --git a/include/linux/kevent_storage.h b/include/linux/kevent_storage.h
new file mode 100644
index 0000000..a38575d
--- /dev/null
+++ b/include/linux/kevent_storage.h
@@ -0,0 +1,11 @@
+#ifndef __KEVENT_STORAGE_H
+#define __KEVENT_STORAGE_H
+
+struct kevent_storage
+{
+ void *origin; /* Originator's pointer, e.g. struct sock or struct file. Can be NULL. */
+ struct list_head list; /* List of queued kevents. */
+ spinlock_t lock; /* Protects users queue. */
+};
+
+#endif /* __KEVENT_STORAGE_H */
diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
index 2d1c3d5..71a758f 100644
--- a/include/linux/syscalls.h
+++ b/include/linux/syscalls.h
@@ -54,6 +54,7 @@ struct compat_stat;
struct compat_timeval;
struct robust_list_head;
struct getcpu_cache;
+struct ukevent;

#include <linux/types.h>
#include <linux/aio_abi.h>
@@ -599,4 +600,8 @@ asmlinkage long sys_set_robust_list(stru
size_t len);
asmlinkage long sys_getcpu(unsigned __user *cpu, unsigned __user *node, struct getcpu_cache __user *cache);

+asmlinkage long sys_kevent_get_events(int ctl_fd, unsigned int min, unsigned int max,
+ __u64 timeout, struct ukevent __user *buf, unsigned flags);
+asmlinkage long sys_kevent_ctl(int ctl_fd, unsigned int cmd, unsigned int num, struct ukevent __user *buf);
+asmlinkage long sys_kevent_wait(int ctl_fd, unsigned int start, unsigned int num, __u64 timeout);
#endif
diff --git a/include/linux/ukevent.h b/include/linux/ukevent.h
new file mode 100644
index 0000000..daa8202
--- /dev/null
+++ b/include/linux/ukevent.h
@@ -0,0 +1,163 @@
+/*
+ * 2006 Copyright (c) Evgeniy Polyakov <[email protected]>
+ * All rights reserved.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software
+ * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA
+ */
+
+#ifndef __UKEVENT_H
+#define __UKEVENT_H
+
+/*
+ * Kevent request flags.
+ */
+
+/* Process this event only once and then dequeue. */
+#define KEVENT_REQ_ONESHOT 0x1
+
+/*
+ * Kevent return flags.
+ */
+/* Kevent is broken. */
+#define KEVENT_RET_BROKEN 0x1
+/* Kevent processing was finished successfully. */
+#define KEVENT_RET_DONE 0x2
+
+/*
+ * Kevent type set.
+ */
+#define KEVENT_SOCKET 0
+#define KEVENT_INODE 1
+#define KEVENT_TIMER 2
+#define KEVENT_POLL 3
+#define KEVENT_NAIO 4
+#define KEVENT_AIO 5
+#define KEVENT_MAX 6
+
+/*
+ * Per-type event sets.
+ * Number of per-event sets should be exactly as number of kevent types.
+ */
+
+/*
+ * Timer events.
+ */
+#define KEVENT_TIMER_FIRED 0x1
+
+/*
+ * Socket/network asynchronous IO events.
+ */
+#define KEVENT_SOCKET_RECV 0x1
+#define KEVENT_SOCKET_ACCEPT 0x2
+#define KEVENT_SOCKET_SEND 0x4
+
+/*
+ * Inode events.
+ */
+#define KEVENT_INODE_CREATE 0x1
+#define KEVENT_INODE_REMOVE 0x2
+
+/*
+ * Poll events.
+ */
+#define KEVENT_POLL_POLLIN 0x0001
+#define KEVENT_POLL_POLLPRI 0x0002
+#define KEVENT_POLL_POLLOUT 0x0004
+#define KEVENT_POLL_POLLERR 0x0008
+#define KEVENT_POLL_POLLHUP 0x0010
+#define KEVENT_POLL_POLLNVAL 0x0020
+
+#define KEVENT_POLL_POLLRDNORM 0x0040
+#define KEVENT_POLL_POLLRDBAND 0x0080
+#define KEVENT_POLL_POLLWRNORM 0x0100
+#define KEVENT_POLL_POLLWRBAND 0x0200
+#define KEVENT_POLL_POLLMSG 0x0400
+#define KEVENT_POLL_POLLREMOVE 0x1000
+
+/*
+ * Asynchronous IO events.
+ */
+#define KEVENT_AIO_BIO 0x1
+
+#define KEVENT_MASK_ALL 0xffffffff
+/* Mask of all possible event values. */
+#define KEVENT_MASK_EMPTY 0x0
+/* Empty mask of ready events. */
+
+struct kevent_id
+{
+ union {
+ __u32 raw[2];
+ __u64 raw_u64 __attribute__((aligned(8)));
+ };
+};
+
+struct ukevent
+{
+ /* Id of this request, e.g. socket number, file descriptor and so on... */
+ struct kevent_id id;
+ /* Event type, e.g. KEVENT_SOCK, KEVENT_INODE, KEVENT_TIMER and so on... */
+ __u32 type;
+ /* Event itself, e.g. SOCK_ACCEPT, INODE_CREATED, TIMER_FIRED... */
+ __u32 event;
+ /* Per-event request flags */
+ __u32 req_flags;
+ /* Per-event return flags */
+ __u32 ret_flags;
+ /* Event return data. Event originator fills it with anything it likes. */
+ __u32 ret_data[2];
+ /* User's data. It is not used, just copied to/from user.
+ * The whole structure is aligned to 8 bytes already, so the last union
+ * is aligned properly.
+ */
+ union {
+ __u32 user[2];
+ void *ptr;
+ };
+};
+
+struct mukevent
+{
+ struct kevent_id id;
+ __u32 ret_flags;
+};
+
+#define KEVENT_MAX_PAGES 2
+
+/*
+ * Note that kevents do not exactly fill the page (each mukevent is 12 bytes),
+ * so we reuse 8 bytes at the beginning of the page to store the kidx/uidx indexes.
+ * Take that into account if you want to change the size of struct mukevent.
+ */
+#define KEVENTS_ON_PAGE ((PAGE_SIZE-2*sizeof(unsigned int))/sizeof(struct mukevent))
+struct kevent_mring
+{
+ unsigned int kidx, uidx;
+ struct mukevent event[KEVENTS_ON_PAGE];
+};
+
+/*
+ * Used only for sanitizing of the kevent_wait() input data - do not
+ * allow user to specify number of events more than it is possible to place
+ * into ring buffer. This does not limit number of events which can be
+ * put into kevent queue (which is unlimited).
+ */
+#define KEVENT_MAX_EVENTS (KEVENT_MAX_PAGES * KEVENTS_ON_PAGE)
+
+#define KEVENT_CTL_ADD 0
+#define KEVENT_CTL_REMOVE 1
+#define KEVENT_CTL_MODIFY 2
+
+#endif /* __UKEVENT_H */
diff --git a/init/Kconfig b/init/Kconfig
index d2eb7a8..c7d8250 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -201,6 +201,8 @@ config AUDITSYSCALL
such as SELinux. To use audit's filesystem watch feature, please
ensure that INOTIFY is configured.

+source "kernel/kevent/Kconfig"
+
config IKCONFIG
bool "Kernel .config support"
---help---
diff --git a/kernel/Makefile b/kernel/Makefile
index d62ec66..2d7a6dd 100644
--- a/kernel/Makefile
+++ b/kernel/Makefile
@@ -47,6 +47,7 @@ obj-$(CONFIG_DETECT_SOFTLOCKUP) += softl
obj-$(CONFIG_GENERIC_HARDIRQS) += irq/
obj-$(CONFIG_SECCOMP) += seccomp.o
obj-$(CONFIG_RCU_TORTURE_TEST) += rcutorture.o
+obj-$(CONFIG_KEVENT) += kevent/
obj-$(CONFIG_RELAY) += relay.o
obj-$(CONFIG_TASK_DELAY_ACCT) += delayacct.o
obj-$(CONFIG_TASKSTATS) += taskstats.o
diff --git a/kernel/kevent/Kconfig b/kernel/kevent/Kconfig
new file mode 100644
index 0000000..5ba8086
--- /dev/null
+++ b/kernel/kevent/Kconfig
@@ -0,0 +1,39 @@
+config KEVENT
+ bool "Kernel event notification mechanism"
+ help
+ This option enables event queue mechanism.
+ It can be used as replacement for poll()/select(), AIO callback
+ invocations, advanced timer notifications and other kernel
+ object status changes.
+
+config KEVENT_USER_STAT
+ bool "Kevent user statistic"
+ depends on KEVENT
+ help
+ This option will turn kevent_user statistic collection on.
+ Statistic data includes total number of kevent, number of kevents
+ which are ready immediately at insertion time and number of kevents
+ which were removed through readiness completion.
+ It will be printed each time control kevent descriptor is closed.
+
+config KEVENT_TIMER
+ bool "Kernel event notifications for timers"
+ depends on KEVENT
+ help
+ This option allows to use timers through KEVENT subsystem.
+
+config KEVENT_POLL
+ bool "Kernel event notifications for poll()/select()"
+ depends on KEVENT
+ help
+ This option allows to use kevent subsystem for poll()/select()
+ notifications.
+
+config KEVENT_SOCKET
+ bool "Kernel event notifications for sockets"
+ depends on NET && KEVENT
+ help
+ This option enables notifications through KEVENT subsystem of
+ sockets operations, like new packet receiving conditions,
+ ready for accept conditions and so on.
+
diff --git a/kernel/kevent/Makefile b/kernel/kevent/Makefile
new file mode 100644
index 0000000..9130cad
--- /dev/null
+++ b/kernel/kevent/Makefile
@@ -0,0 +1,4 @@
+obj-y := kevent.o kevent_user.o
+obj-$(CONFIG_KEVENT_TIMER) += kevent_timer.o
+obj-$(CONFIG_KEVENT_POLL) += kevent_poll.o
+obj-$(CONFIG_KEVENT_SOCKET) += kevent_socket.o
diff --git a/kernel/kevent/kevent.c b/kernel/kevent/kevent.c
new file mode 100644
index 0000000..25404d3
--- /dev/null
+++ b/kernel/kevent/kevent.c
@@ -0,0 +1,227 @@
+/*
+ * 2006 Copyright (c) Evgeniy Polyakov <[email protected]>
+ * All rights reserved.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software
+ * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA
+ */
+
+#include <linux/kernel.h>
+#include <linux/types.h>
+#include <linux/list.h>
+#include <linux/slab.h>
+#include <linux/spinlock.h>
+#include <linux/mempool.h>
+#include <linux/sched.h>
+#include <linux/wait.h>
+#include <linux/kevent.h>
+
+/*
+ * Attempts to add an event into appropriate origin's queue.
+ * Returns positive value if this event is ready immediately,
+ * negative value in case of error and zero if event has been queued.
+ * ->enqueue() callback must increase origin's reference counter.
+ */
+int kevent_enqueue(struct kevent *k)
+{
+ return k->callbacks.enqueue(k);
+}
+
+/*
+ * Remove event from the appropriate queue.
+ * ->dequeue() callback must decrease origin's reference counter.
+ */
+int kevent_dequeue(struct kevent *k)
+{
+ return k->callbacks.dequeue(k);
+}
+
+/*
+ * Mark kevent as broken.
+ */
+int kevent_break(struct kevent *k)
+{
+ unsigned long flags;
+
+ spin_lock_irqsave(&k->ulock, flags);
+ k->event.ret_flags |= KEVENT_RET_BROKEN;
+ spin_unlock_irqrestore(&k->ulock, flags);
+ return -EINVAL;
+}
+
+static struct kevent_callbacks kevent_registered_callbacks[KEVENT_MAX] __read_mostly;
+
+int kevent_add_callbacks(const struct kevent_callbacks *cb, int pos)
+{
+ struct kevent_callbacks *p;
+
+ if (pos >= KEVENT_MAX)
+ return -EINVAL;
+
+ p = &kevent_registered_callbacks[pos];
+
+ p->enqueue = (cb->enqueue) ? cb->enqueue : kevent_break;
+ p->dequeue = (cb->dequeue) ? cb->dequeue : kevent_break;
+ p->callback = (cb->callback) ? cb->callback : kevent_break;
+
+ printk(KERN_INFO "KEVENT: Added callbacks for type %d.\n", pos);
+ return 0;
+}
+
+/*
+ * Must be called before event is going to be added into some origin's queue.
+ * Initializes ->enqueue(), ->dequeue() and ->callback() callbacks.
+ * If failed, kevent should not be used or kevent_enqueue() will fail to add
+ * this kevent into origin's queue with setting
+ * KEVENT_RET_BROKEN flag in kevent->event.ret_flags.
+ */
+int kevent_init(struct kevent *k)
+{
+ spin_lock_init(&k->ulock);
+ k->flags = 0;
+
+ if (unlikely(k->event.type >= KEVENT_MAX ||
+ !kevent_registered_callbacks[k->event.type].callback))
+ return kevent_break(k);
+
+ k->callbacks = kevent_registered_callbacks[k->event.type];
+ if (unlikely(k->callbacks.callback == kevent_break))
+ return kevent_break(k);
+
+ return 0;
+}
+
+/*
+ * Called from ->enqueue() callback when reference counter for given
+ * origin (socket, inode...) has been increased.
+ */
+int kevent_storage_enqueue(struct kevent_storage *st, struct kevent *k)
+{
+ unsigned long flags;
+
+ k->st = st;
+ spin_lock_irqsave(&st->lock, flags);
+ list_add_tail_rcu(&k->storage_entry, &st->list);
+ k->flags |= KEVENT_STORAGE;
+ spin_unlock_irqrestore(&st->lock, flags);
+ return 0;
+}
+
+/*
+ * Dequeue kevent from origin's queue.
+ * It does not decrease origin's reference counter in any way
+ * and must be called before it, so storage itself must be valid.
+ * It is called from ->dequeue() callback.
+ */
+void kevent_storage_dequeue(struct kevent_storage *st, struct kevent *k)
+{
+ unsigned long flags;
+
+ spin_lock_irqsave(&st->lock, flags);
+ if (k->flags & KEVENT_STORAGE) {
+ list_del_rcu(&k->storage_entry);
+ k->flags &= ~KEVENT_STORAGE;
+ }
+ spin_unlock_irqrestore(&st->lock, flags);
+}
+
+/*
+ * Call kevent ready callback and queue it into ready queue if needed.
+ * If kevent is marked as one-shot, then remove it from storage queue.
+ */
+static void __kevent_requeue(struct kevent *k, u32 event)
+{
+ int ret, rem;
+ unsigned long flags;
+
+ ret = k->callbacks.callback(k);
+
+ spin_lock_irqsave(&k->ulock, flags);
+ if (ret > 0)
+ k->event.ret_flags |= KEVENT_RET_DONE;
+ else if (ret < 0)
+ k->event.ret_flags |= (KEVENT_RET_BROKEN | KEVENT_RET_DONE);
+ else
+ ret = (k->event.ret_flags & (KEVENT_RET_BROKEN|KEVENT_RET_DONE));
+ rem = (k->event.req_flags & KEVENT_REQ_ONESHOT);
+ spin_unlock_irqrestore(&k->ulock, flags);
+
+ if (ret) {
+ if ((rem || ret < 0) && (k->flags & KEVENT_STORAGE)) {
+ list_del_rcu(&k->storage_entry);
+ k->flags &= ~KEVENT_STORAGE;
+ }
+
+ spin_lock_irqsave(&k->user->ready_lock, flags);
+ if (!(k->flags & KEVENT_READY)) {
+ kevent_user_ring_add_event(k);
+ list_add_tail(&k->ready_entry, &k->user->ready_list);
+ k->flags |= KEVENT_READY;
+ k->user->ready_num++;
+ }
+ spin_unlock_irqrestore(&k->user->ready_lock, flags);
+ wake_up(&k->user->wait);
+ }
+}
+
+/*
+ * Check if kevent is ready (by invoking it's callback) and requeue/remove
+ * if needed.
+ */
+void kevent_requeue(struct kevent *k)
+{
+ unsigned long flags;
+
+ spin_lock_irqsave(&k->st->lock, flags);
+ __kevent_requeue(k, 0);
+ spin_unlock_irqrestore(&k->st->lock, flags);
+}
+
+/*
+ * Called each time some activity in origin (socket, inode...) is noticed.
+ */
+void kevent_storage_ready(struct kevent_storage *st,
+ kevent_callback_t ready_callback, u32 event)
+{
+ struct kevent *k;
+
+ rcu_read_lock();
+ if (ready_callback)
+ list_for_each_entry_rcu(k, &st->list, storage_entry)
+ (*ready_callback)(k);
+
+ list_for_each_entry_rcu(k, &st->list, storage_entry)
+ if (event & k->event.event)
+ __kevent_requeue(k, event);
+ rcu_read_unlock();
+}
+
+int kevent_storage_init(void *origin, struct kevent_storage *st)
+{
+ spin_lock_init(&st->lock);
+ st->origin = origin;
+ INIT_LIST_HEAD(&st->list);
+ return 0;
+}
+
+/*
+ * Mark all events as broken; that will remove them from the storage,
+ * so the storage origin (inode, socket and so on) can be safely removed.
+ * No new entries are allowed to be added into the storage at this point.
+ * (Socket is removed from file table at this point for example).
+ */
+void kevent_storage_fini(struct kevent_storage *st)
+{
+ kevent_storage_ready(st, kevent_break, KEVENT_MASK_ALL);
+}
diff --git a/kernel/kevent/kevent_user.c b/kernel/kevent/kevent_user.c
new file mode 100644
index 0000000..e92a1dc
--- /dev/null
+++ b/kernel/kevent/kevent_user.c
@@ -0,0 +1,1000 @@
+/*
+ * 2006 Copyright (c) Evgeniy Polyakov <[email protected]>
+ * All rights reserved.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software
+ * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA
+ */
+
+#include <linux/kernel.h>
+#include <linux/module.h>
+#include <linux/types.h>
+#include <linux/list.h>
+#include <linux/slab.h>
+#include <linux/spinlock.h>
+#include <linux/fs.h>
+#include <linux/file.h>
+#include <linux/mount.h>
+#include <linux/device.h>
+#include <linux/poll.h>
+#include <linux/kevent.h>
+#include <linux/miscdevice.h>
+#include <asm/io.h>
+
+static const char kevent_name[] = "kevent";
+static kmem_cache_t *kevent_cache __read_mostly;
+
+/*
+ * kevents are pollable, return POLLIN and POLLRDNORM
+ * when there is at least one ready kevent.
+ */
+static unsigned int kevent_user_poll(struct file *file, struct poll_table_struct *wait)
+{
+ struct kevent_user *u = file->private_data;
+ unsigned int mask;
+
+ poll_wait(file, &u->wait, wait);
+ mask = 0;
+
+ if (u->ready_num)
+ mask |= POLLIN | POLLRDNORM;
+
+ return mask;
+}
+
+/*
+ * Called under kevent_user->ready_lock, so updates are always protected.
+ */
+int kevent_user_ring_add_event(struct kevent *k)
+{
+ unsigned int pidx, off;
+ struct kevent_mring *ring, *copy_ring;
+
+ ring = k->user->pring[0];
+
+ if ((ring->kidx + 1 == ring->uidx) ||
+ ((ring->kidx + 1 == KEVENT_MAX_EVENTS) && ring->uidx == 0)) {
+ if (k->user->overflow_kevent == NULL)
+ k->user->overflow_kevent = k;
+ return -EAGAIN;
+ }
+
+ pidx = ring->kidx/KEVENTS_ON_PAGE;
+ off = ring->kidx%KEVENTS_ON_PAGE;
+
+ if (unlikely(pidx >= KEVENT_MAX_PAGES)) {
+ printk(KERN_ERR "%s: kidx: %u, pidx: %u, on_page: %lu, pidx: %u.\n",
+ __func__, ring->kidx, ring->uidx, KEVENTS_ON_PAGE, pidx);
+ return -EINVAL;
+ }
+
+ copy_ring = k->user->pring[pidx];
+
+ copy_ring->event[off].id.raw[0] = k->event.id.raw[0];
+ copy_ring->event[off].id.raw[1] = k->event.id.raw[1];
+ copy_ring->event[off].ret_flags = k->event.ret_flags;
+
+ if (++ring->kidx >= KEVENT_MAX_EVENTS)
+ ring->kidx = 0;
+
+ return 0;
+}
+
+/*
+ * Initialize mmap ring buffer.
+ * It will store ready kevents, so userspace can get them directly instead
+ * of using a syscall. Essentially the syscall becomes just a waiting point.
+ * @KEVENT_MAX_PAGES is an arbitrary number of pages to store ready events.
+ */
+static int kevent_user_ring_init(struct kevent_user *u)
+{
+ int i;
+
+ u->pring = kzalloc(KEVENT_MAX_PAGES * sizeof(struct kevent_mring *), GFP_KERNEL);
+ if (!u->pring)
+ return -ENOMEM;
+
+ for (i=0; i<KEVENT_MAX_PAGES; ++i) {
+ u->pring[i] = (struct kevent_mring *)__get_free_page(GFP_KERNEL);
+ if (!u->pring[i])
+ break;
+ }
+
+ if (i != KEVENT_MAX_PAGES)
+ goto err_out_free;
+
+ u->pring[0]->uidx = u->pring[0]->kidx = 0;
+
+ return 0;
+
+err_out_free:
+ for (i=0; i<KEVENT_MAX_PAGES; ++i) {
+ if (!u->pring[i])
+ break;
+
+ free_page((unsigned long)u->pring[i]);
+ }
+
+ kfree(u->pring);
+
+ return -ENOMEM;
+}
+
+static void kevent_user_ring_fini(struct kevent_user *u)
+{
+ int i;
+
+ for (i=0; i<KEVENT_MAX_PAGES; ++i)
+ free_page((unsigned long)u->pring[i]);
+
+ kfree(u->pring);
+}
+
+static int kevent_user_open(struct inode *inode, struct file *file)
+{
+ struct kevent_user *u;
+
+ u = kzalloc(sizeof(struct kevent_user), GFP_KERNEL);
+ if (!u)
+ return -ENOMEM;
+
+ INIT_LIST_HEAD(&u->ready_list);
+ spin_lock_init(&u->ready_lock);
+ kevent_stat_init(u);
+ spin_lock_init(&u->kevent_lock);
+ u->kevent_root = RB_ROOT;
+
+ mutex_init(&u->ctl_mutex);
+ init_waitqueue_head(&u->wait);
+
+ atomic_set(&u->refcnt, 1);
+
+ if (unlikely(kevent_user_ring_init(u))) {
+ kfree(u);
+ return -ENOMEM;
+ }
+
+ file->private_data = u;
+ return 0;
+}
+
+/*
+ * Kevent userspace control block reference counting.
+ * Set to 1 at creation time, when appropriate kevent file descriptor
+ * is closed, that reference counter is decreased.
+ * When counter hits zero block is freed.
+ */
+static inline void kevent_user_get(struct kevent_user *u)
+{
+ atomic_inc(&u->refcnt);
+}
+
+static inline void kevent_user_put(struct kevent_user *u)
+{
+ if (atomic_dec_and_test(&u->refcnt)) {
+ kevent_stat_print(u);
+ kevent_user_ring_fini(u);
+ kfree(u);
+ }
+}
+
+/*
+ * Mmap implementation for ring buffer, which is created as array
+ * of pages, so vm_pgoff is an offset (in pages, not in bytes) of
+ * the first page to be mapped.
+ */
+static int kevent_user_mmap(struct file *file, struct vm_area_struct *vma)
+{
+ unsigned long start = vma->vm_start, off = vma->vm_pgoff / PAGE_SIZE;
+ struct kevent_user *u = file->private_data;
+
+ if (off >= KEVENT_MAX_PAGES)
+ return -EINVAL;
+
+ if (vma->vm_flags & VM_WRITE)
+ return -EPERM;
+
+ vma->vm_flags |= VM_RESERVED;
+ vma->vm_file = file;
+
+ if (vm_insert_page(vma, start, virt_to_page(u->pring[off])))
+ return -EFAULT;
+
+ return 0;
+}
+
+static inline int kevent_compare_id(struct kevent_id *left, struct kevent_id *right)
+{
+ if (left->raw_u64 > right->raw_u64)
+ return -1;
+
+ if (right->raw_u64 > left->raw_u64)
+ return 1;
+
+ return 0;
+}
+
+/*
+ * RCU protects storage list (kevent->storage_entry).
+ * Free entry in RCU callback, it is dequeued from all lists at
+ * this point.
+ */
+
+static void kevent_free_rcu(struct rcu_head *rcu)
+{
+ struct kevent *kevent = container_of(rcu, struct kevent, rcu_head);
+ kmem_cache_free(kevent_cache, kevent);
+}
+
+/*
+ * Complete kevent removing - it dequeues kevent from storage list
+ * if it is requested, removes kevent from ready list, drops userspace
+ * control block reference counter and schedules kevent freeing through RCU.
+ */
+static void kevent_finish_user_complete(struct kevent *k, int deq)
+{
+ struct kevent_user *u = k->user;
+ unsigned long flags;
+
+ if (deq)
+ kevent_dequeue(k);
+
+ spin_lock_irqsave(&u->ready_lock, flags);
+ if (k->flags & KEVENT_READY) {
+ list_del(&k->ready_entry);
+ k->flags &= ~KEVENT_READY;
+ u->ready_num--;
+ }
+ spin_unlock_irqrestore(&u->ready_lock, flags);
+
+ kevent_user_put(u);
+ call_rcu(&k->rcu_head, kevent_free_rcu);
+}
+
+/*
+ * Remove from all lists and free kevent.
+ * Must be called under kevent_user->kevent_lock to protect
+ * kevent->kevent_entry removing.
+ */
+static void __kevent_finish_user(struct kevent *k, int deq)
+{
+ struct kevent_user *u = k->user;
+
+ rb_erase(&k->kevent_node, &u->kevent_root);
+ k->flags &= ~KEVENT_USER;
+ u->kevent_num--;
+ kevent_finish_user_complete(k, deq);
+}
+
+/*
+ * Remove kevent from user's list of all events,
+ * dequeue it from storage and decrease user's reference counter,
+ * since this kevent does not exist anymore. That is why it is freed here.
+ */
+static void kevent_finish_user(struct kevent *k, int deq)
+{
+ struct kevent_user *u = k->user;
+ unsigned long flags;
+
+ spin_lock_irqsave(&u->kevent_lock, flags);
+ rb_erase(&k->kevent_node, &u->kevent_root);
+ k->flags &= ~KEVENT_USER;
+ u->kevent_num--;
+ spin_unlock_irqrestore(&u->kevent_lock, flags);
+ kevent_finish_user_complete(k, deq);
+}
+
+/*
+ * Dequeue one entry from user's ready queue.
+ */
+static struct kevent *kqueue_dequeue_ready(struct kevent_user *u)
+{
+ unsigned long flags;
+ struct kevent *k = NULL;
+
+ spin_lock_irqsave(&u->ready_lock, flags);
+ if (u->ready_num && !list_empty(&u->ready_list)) {
+ k = list_entry(u->ready_list.next, struct kevent, ready_entry);
+ list_del(&k->ready_entry);
+ k->flags &= ~KEVENT_READY;
+ u->ready_num--;
+ if (++u->pring[0]->uidx == KEVENT_MAX_EVENTS)
+ u->pring[0]->uidx = 0;
+
+ if (u->overflow_kevent) {
+ int err;
+
+ err = kevent_user_ring_add_event(u->overflow_kevent);
+ if (!err) {
+ if (u->overflow_kevent->ready_entry.next == &u->ready_list)
+ u->overflow_kevent = NULL;
+ else
+ u->overflow_kevent =
+ list_entry(u->overflow_kevent->ready_entry.next,
+ struct kevent, ready_entry);
+ }
+ }
+ }
+ spin_unlock_irqrestore(&u->ready_lock, flags);
+
+ return k;
+}
+
+/*
+ * Search a kevent inside kevent tree for given ukevent.
+ */
+static struct kevent *__kevent_search(struct kevent_id *id, struct kevent_user *u)
+{
+ struct kevent *k, *ret = NULL;
+ struct rb_node *n = u->kevent_root.rb_node;
+ int cmp;
+
+ while (n) {
+ k = rb_entry(n, struct kevent, kevent_node);
+ cmp = kevent_compare_id(&k->event.id, id);
+
+ if (cmp > 0)
+ n = n->rb_right;
+ else if (cmp < 0)
+ n = n->rb_left;
+ else {
+ ret = k;
+ break;
+ }
+ }
+
+ return ret;
+}
+
+/*
+ * Search and modify kevent according to provided ukevent.
+ */
+static int kevent_modify(struct ukevent *uk, struct kevent_user *u)
+{
+ struct kevent *k;
+ int err = -ENODEV;
+ unsigned long flags;
+
+ spin_lock_irqsave(&u->kevent_lock, flags);
+ k = __kevent_search(&uk->id, u);
+ if (k) {
+ spin_lock(&k->ulock);
+ k->event.event = uk->event;
+ k->event.req_flags = uk->req_flags;
+ k->event.ret_flags = 0;
+ spin_unlock(&k->ulock);
+ kevent_requeue(k);
+ err = 0;
+ }
+ spin_unlock_irqrestore(&u->kevent_lock, flags);
+
+ return err;
+}
+
+/*
+ * Remove kevent which matches provided ukevent.
+ */
+static int kevent_remove(struct ukevent *uk, struct kevent_user *u)
+{
+ int err = -ENODEV;
+ struct kevent *k;
+ unsigned long flags;
+
+ spin_lock_irqsave(&u->kevent_lock, flags);
+ k = __kevent_search(&uk->id, u);
+ if (k) {
+ __kevent_finish_user(k, 1);
+ err = 0;
+ }
+ spin_unlock_irqrestore(&u->kevent_lock, flags);
+
+ return err;
+}
+
+/*
+ * Detaches userspace control block from file descriptor
+ * and decreases its reference counter.
+ * No new kevents can be added or removed from any list at this point.
+ */
+static int kevent_user_release(struct inode *inode, struct file *file)
+{
+ struct kevent_user *u = file->private_data;
+ struct kevent *k;
+ struct rb_node *n;
+
+ for (n = rb_first(&u->kevent_root); n; n = rb_next(n)) {
+ k = rb_entry(n, struct kevent, kevent_node);
+ kevent_finish_user(k, 1);
+ }
+
+ kevent_user_put(u);
+ file->private_data = NULL;
+
+ return 0;
+}
+
+/*
+ * Read requested number of ukevents in one shot.
+ */
+static struct ukevent *kevent_get_user(unsigned int num, void __user *arg)
+{
+ struct ukevent *ukev;
+
+ ukev = kmalloc(sizeof(struct ukevent) * num, GFP_KERNEL);
+ if (!ukev)
+ return NULL;
+
+ if (copy_from_user(ukev, arg, sizeof(struct ukevent) * num)) {
+ kfree(ukev);
+ return NULL;
+ }
+
+ return ukev;
+}
+
+/*
+ * Read from userspace all ukevents and modify appropriate kevents.
+ * If the provided number of ukevents is more than the threshold, it is faster
+ * to allocate room for them and copy them in one shot instead of copying
+ * one-by-one and then processing them.
+ */
+static int kevent_user_ctl_modify(struct kevent_user *u, unsigned int num, void __user *arg)
+{
+ int err = 0, i;
+ struct ukevent uk;
+
+ mutex_lock(&u->ctl_mutex);
+
+ if (num > u->kevent_num) {
+ err = -EINVAL;
+ goto out;
+ }
+
+ if (num > KEVENT_MIN_BUFFS_ALLOC) {
+ struct ukevent *ukev;
+
+ ukev = kevent_get_user(num, arg);
+ if (ukev) {
+ for (i = 0; i < num; ++i) {
+ if (kevent_modify(&ukev[i], u))
+ ukev[i].ret_flags |= KEVENT_RET_BROKEN;
+ ukev[i].ret_flags |= KEVENT_RET_DONE;
+ }
+ if (copy_to_user(arg, ukev, num*sizeof(struct ukevent)))
+ err = -EFAULT;
+ kfree(ukev);
+ goto out;
+ }
+ }
+
+ for (i = 0; i < num; ++i) {
+ if (copy_from_user(&uk, arg, sizeof(struct ukevent))) {
+ err = -EFAULT;
+ break;
+ }
+
+ if (kevent_modify(&uk, u))
+ uk.ret_flags |= KEVENT_RET_BROKEN;
+ uk.ret_flags |= KEVENT_RET_DONE;
+
+ if (copy_to_user(arg, &uk, sizeof(struct ukevent))) {
+ err = -EFAULT;
+ break;
+ }
+
+ arg += sizeof(struct ukevent);
+ }
+out:
+ mutex_unlock(&u->ctl_mutex);
+
+ return err;
+}
+
+/*
+ * Read from userspace all ukevents and remove appropriate kevents.
+ * If the provided number of ukevents is more than the threshold, it is faster
+ * to allocate room for them and copy them in one shot instead of copying
+ * one-by-one and then processing them.
+ */
+static int kevent_user_ctl_remove(struct kevent_user *u, unsigned int num, void __user *arg)
+{
+ int err = 0, i;
+ struct ukevent uk;
+
+ mutex_lock(&u->ctl_mutex);
+
+ if (num > u->kevent_num) {
+ err = -EINVAL;
+ goto out;
+ }
+
+ if (num > KEVENT_MIN_BUFFS_ALLOC) {
+ struct ukevent *ukev;
+
+ ukev = kevent_get_user(num, arg);
+ if (ukev) {
+ for (i = 0; i < num; ++i) {
+ if (kevent_remove(&ukev[i], u))
+ ukev[i].ret_flags |= KEVENT_RET_BROKEN;
+ ukev[i].ret_flags |= KEVENT_RET_DONE;
+ }
+ if (copy_to_user(arg, ukev, num*sizeof(struct ukevent)))
+ err = -EFAULT;
+ kfree(ukev);
+ goto out;
+ }
+ }
+
+ for (i = 0; i < num; ++i) {
+ if (copy_from_user(&uk, arg, sizeof(struct ukevent))) {
+ err = -EFAULT;
+ break;
+ }
+
+ if (kevent_remove(&uk, u))
+ uk.ret_flags |= KEVENT_RET_BROKEN;
+
+ uk.ret_flags |= KEVENT_RET_DONE;
+
+ if (copy_to_user(arg, &uk, sizeof(struct ukevent))) {
+ err = -EFAULT;
+ break;
+ }
+
+ arg += sizeof(struct ukevent);
+ }
+out:
+ mutex_unlock(&u->ctl_mutex);
+
+ return err;
+}
+
+/*
+ * Queue kevent into userspace control block and increase
+ * its reference counter.
+ */
+static int kevent_user_enqueue(struct kevent_user *u, struct kevent *new)
+{
+ unsigned long flags;
+ struct rb_node **p = &u->kevent_root.rb_node, *parent = NULL;
+ struct kevent *k;
+ int err = 0, cmp;
+
+ spin_lock_irqsave(&u->kevent_lock, flags);
+ while (*p) {
+ parent = *p;
+ k = rb_entry(parent, struct kevent, kevent_node);
+
+ cmp = kevent_compare_id(&k->event.id, &new->event.id);
+ if (cmp > 0)
+ p = &parent->rb_right;
+ else if (cmp < 0)
+ p = &parent->rb_left;
+ else {
+ err = -EEXIST;
+ break;
+ }
+ }
+ if (likely(!err)) {
+ rb_link_node(&new->kevent_node, parent, p);
+ rb_insert_color(&new->kevent_node, &u->kevent_root);
+ new->flags |= KEVENT_USER;
+ u->kevent_num++;
+ kevent_user_get(u);
+ }
+ spin_unlock_irqrestore(&u->kevent_lock, flags);
+
+ return err;
+}
+
+/*
+ * Add kevent from both kernel and userspace users.
+ * This function allocates and queues kevent, returns negative value
+ * on error, positive if kevent is ready immediately and zero
+ * if kevent has been queued.
+ */
+int kevent_user_add_ukevent(struct ukevent *uk, struct kevent_user *u)
+{
+ struct kevent *k;
+ int err;
+
+ k = kmem_cache_alloc(kevent_cache, GFP_KERNEL);
+ if (!k) {
+ err = -ENOMEM;
+ goto err_out_exit;
+ }
+
+ memcpy(&k->event, uk, sizeof(struct ukevent));
+ INIT_RCU_HEAD(&k->rcu_head);
+
+ k->event.ret_flags = 0;
+
+ err = kevent_init(k);
+ if (err) {
+ kmem_cache_free(kevent_cache, k);
+ goto err_out_exit;
+ }
+ k->user = u;
+ kevent_stat_total(u);
+ err = kevent_user_enqueue(u, k);
+ if (err) {
+ kmem_cache_free(kevent_cache, k);
+ goto err_out_exit;
+ }
+
+ err = kevent_enqueue(k);
+ if (err) {
+ memcpy(uk, &k->event, sizeof(struct ukevent));
+ kevent_finish_user(k, 0);
+ goto err_out_exit;
+ }
+
+ return 0;
+
+err_out_exit:
+ if (err < 0) {
+ uk->ret_flags |= KEVENT_RET_BROKEN | KEVENT_RET_DONE;
+ uk->ret_data[1] = err;
+ } else if (err > 0)
+ uk->ret_flags |= KEVENT_RET_DONE;
+ return err;
+}
+
+/*
+ * Copy all ukevents from userspace, allocate kevent for each one
+ * and add them into appropriate kevent_storages,
+ * e.g. sockets, inodes and so on...
+ * Ready events will replace the ones provided by the user, and the number
+ * of ready events is returned.
+ * The user must check the ret_flags field of each ukevent structure
+ * to determine if it is a fired or a failed event.
+ */
+static int kevent_user_ctl_add(struct kevent_user *u, unsigned int num, void __user *arg)
+{
+ int err, cerr = 0, knum = 0, rnum = 0, i;
+ void __user *orig = arg;
+ struct ukevent uk;
+
+ mutex_lock(&u->ctl_mutex);
+
+ err = -EINVAL;
+ if (num > KEVENT_MIN_BUFFS_ALLOC) {
+ struct ukevent *ukev;
+
+ ukev = kevent_get_user(num, arg);
+ if (ukev) {
+ for (i = 0; i < num; ++i) {
+ err = kevent_user_add_ukevent(&ukev[i], u);
+ if (err) {
+ kevent_stat_im(u);
+ if (i != rnum)
+ memcpy(&ukev[rnum], &ukev[i], sizeof(struct ukevent));
+ rnum++;
+ } else
+ knum++;
+ }
+ if (copy_to_user(orig, ukev, rnum*sizeof(struct ukevent)))
+ cerr = -EFAULT;
+ kfree(ukev);
+ goto out_setup;
+ }
+ }
+
+ for (i = 0; i < num; ++i) {
+ if (copy_from_user(&uk, arg, sizeof(struct ukevent))) {
+ cerr = -EFAULT;
+ break;
+ }
+ arg += sizeof(struct ukevent);
+
+ err = kevent_user_add_ukevent(&uk, u);
+ if (err) {
+ kevent_stat_im(u);
+ if (copy_to_user(orig, &uk, sizeof(struct ukevent))) {
+ cerr = -EFAULT;
+ break;
+ }
+ orig += sizeof(struct ukevent);
+ rnum++;
+ } else
+ knum++;
+ }
+
+out_setup:
+ if (cerr < 0) {
+ err = cerr;
+ goto out_remove;
+ }
+
+ err = rnum;
+out_remove:
+ mutex_unlock(&u->ctl_mutex);
+
+ return err;
+}
+
+/*
+ * In nonblocking mode it returns as many events as possible, but not more than @max_nr.
+ * In blocking mode it waits until the timeout expires or at least @min_nr events are ready.
+ */
+static int kevent_user_wait(struct file *file, struct kevent_user *u,
+ unsigned int min_nr, unsigned int max_nr, __u64 timeout,
+ void __user *buf)
+{
+ struct kevent *k;
+ int num = 0;
+
+ if (!(file->f_flags & O_NONBLOCK)) {
+ wait_event_interruptible_timeout(u->wait,
+ u->ready_num >= min_nr,
+ clock_t_to_jiffies(nsec_to_clock_t(timeout)));
+ }
+
+ while (num < max_nr && ((k = kqueue_dequeue_ready(u)) != NULL)) {
+ if (copy_to_user(buf + num*sizeof(struct ukevent),
+ &k->event, sizeof(struct ukevent)))
+ break;
+
+ /*
+ * If it is a one-shot kevent, it has already been removed from the
+ * origin's queue, so we can safely free it here.
+ */
+ if (k->event.req_flags & KEVENT_REQ_ONESHOT)
+ kevent_finish_user(k, 1);
+ ++num;
+ kevent_stat_wait(u);
+ }
+
+ return num;
+}
+
+static struct file_operations kevent_user_fops = {
+ .mmap = kevent_user_mmap,
+ .open = kevent_user_open,
+ .release = kevent_user_release,
+ .poll = kevent_user_poll,
+ .owner = THIS_MODULE,
+};
+
+static struct miscdevice kevent_miscdev = {
+ .minor = MISC_DYNAMIC_MINOR,
+ .name = kevent_name,
+ .fops = &kevent_user_fops,
+};
+
+static int kevent_ctl_process(struct file *file, unsigned int cmd, unsigned int num, void __user *arg)
+{
+ int err;
+ struct kevent_user *u = file->private_data;
+
+ switch (cmd) {
+ case KEVENT_CTL_ADD:
+ err = kevent_user_ctl_add(u, num, arg);
+ break;
+ case KEVENT_CTL_REMOVE:
+ err = kevent_user_ctl_remove(u, num, arg);
+ break;
+ case KEVENT_CTL_MODIFY:
+ err = kevent_user_ctl_modify(u, num, arg);
+ break;
+ default:
+ err = -EINVAL;
+ break;
+ }
+
+ return err;
+}
+
+/*
+ * Used to get ready kevents from queue.
+ * @ctl_fd - kevent control descriptor which is obtained by opening the kevent character device.
+ * @min_nr - minimum number of ready kevents.
+ * @max_nr - maximum number of ready kevents.
+ * @timeout - timeout in nanoseconds to wait until some events are ready.
+ * @buf - buffer to place ready events.
+ * @flags - unused for now (will be used for mmap implementation).
+ */
+asmlinkage long sys_kevent_get_events(int ctl_fd, unsigned int min_nr, unsigned int max_nr,
+ __u64 timeout, struct ukevent __user *buf, unsigned flags)
+{
+ int err = -EINVAL;
+ struct file *file;
+ struct kevent_user *u;
+
+ file = fget(ctl_fd);
+ if (!file)
+ return -ENODEV;
+
+ if (file->f_op != &kevent_user_fops)
+ goto out_fput;
+ u = file->private_data;
+
+ err = kevent_user_wait(file, u, min_nr, max_nr, timeout, buf);
+out_fput:
+ fput(file);
+ return err;
+}
+
+/*
+ * This syscall is used to perform waiting until there is free space in kevent queue
+ * and removes/requeues requested number of events (commits them). Function returns
+ * number of actually committed events.
+ *
+ * @ctl_fd - kevent file descriptor.
+ * @start - number of first ready event.
+ * @num - number of processed kevents.
+ * @timeout - this timeout specifies the number of nanoseconds to wait until there is
+ * free space in the kevent queue.
+ *
+ * Ring buffer is designed in a way that first ready kevent will be at @ring->uidx
+ * position, and all other ready events will be in FIFO order after it.
+ * So when we need to commit @num events, it means we should just remove first @num
+ * kevents from ready queue and commit them. We do not use any special locking to
+ * protect this function against simultaneous running - kevent dequeueing is atomic,
+ * and we do not care about order in which events were committed.
+ * An example: thread 1 and thread 2 simultaneously call kevent_wait() to
+ * commit 2 and 3 events. It is possible that first thread will commit
+ * events 0 and 2 while second thread will commit events 1, 3 and 4.
+ * If there were only 3 ready events, then one of the calls will return a smaller number
+ * of committed events than was requested.
+ * ring->uidx update is atomic, since it is protected by u->ready_lock,
+ * which removes race with kevent_user_ring_add_event().
+ *
+ * If the user asks to commit events which have been removed by kevent_get_events() recently
+ * (for example when one thread looked into the ring indexes and started to commit events
+ * which were simultaneously committed by another thread through kevent_get_events()),
+ * kevent_wait() will not commit unprocessed events, but will return the number of actually
+ * committed events instead.
+ *
+ * It is forbidden to try to commit events not from the start of the buffer, but from
+ * some 'further' event.
+ *
+ * An example: if ready events use positions 2-5,
+ * it is permitted to start to commit 3 events from position 0,
+ * in this case positions 0 and 1 will be omitted and only the event in position 2 will
+ * be committed, and kevent_wait() will return 1, since only one event was actually committed.
+ * It is forbidden to try to commit from position 4; 0 will be returned.
+ * This means that if some events were committed using kevent_get_events(),
+ * they will not be counted; instead userspace should check the ring index and
+ */
+asmlinkage long sys_kevent_wait(int ctl_fd, unsigned int start, unsigned int num, __u64 timeout)
+{
+ int err = -EINVAL, committed = 0;
+ struct file *file;
+ struct kevent_user *u;
+ struct kevent *k;
+ struct kevent_mring *ring;
+ unsigned int i, actual;
+ unsigned long flags;
+
+ if (num >= KEVENT_MAX_EVENTS)
+ return -EINVAL;
+
+ file = fget(ctl_fd);
+ if (!file)
+ return -ENODEV;
+
+ if (file->f_op != &kevent_user_fops)
+ goto out_fput;
+ u = file->private_data;
+
+ ring = u->pring[0];
+
+ spin_lock_irqsave(&u->ready_lock, flags);
+ actual = (ring->kidx > ring->uidx)?
+ (ring->kidx - ring->uidx):
+ (KEVENT_MAX_EVENTS - (ring->uidx - ring->kidx));
+
+ if (actual < num)
+ num = actual;
+
+ if (start < ring->uidx) {
+ /*
+ * Some events have been committed through kevent_get_events().
+ * ready events
+ * |==========|RRRRRRRRRRRRRRRRRRRRRRRRRRRRRRR|==========|
+ * ring->uidx ring->kidx
+ * | |
+ * start start+num
+ *
+ */
+ unsigned int diff = ring->uidx - start;
+
+ if (num < diff)
+ num = 0;
+ else
+ num -= diff;
+ } else if (start > ring->uidx)
+ num = 0;
+
+ spin_unlock_irqrestore(&u->ready_lock, flags);
+
+ for (i=0; i<num; ++i) {
+ k = kqueue_dequeue_ready(u);
+ if (!k)
+ break;
+
+ if (k->event.req_flags & KEVENT_REQ_ONESHOT)
+ kevent_finish_user(k, 1);
+ kevent_stat_mmap(u);
+ committed++;
+ }
+
+ if (!(file->f_flags & O_NONBLOCK)) {
+ wait_event_interruptible_timeout(u->wait,
+ u->ready_num >= 1,
+ clock_t_to_jiffies(nsec_to_clock_t(timeout)));
+ }
+
+ fput(file);
+
+ return committed;
+out_fput:
+ fput(file);
+ return err;
+}
+
+/*
+ * This syscall is used to perform various control operations
+ * on given kevent queue, which is obtained through kevent file descriptor @fd.
+ * @cmd - type of operation.
+ * @num - number of kevents to be processed.
+ * @arg - pointer to array of struct ukevent.
+ */
+asmlinkage long sys_kevent_ctl(int fd, unsigned int cmd, unsigned int num, struct ukevent __user *arg)
+{
+ int err = -EINVAL;
+ struct file *file;
+
+ file = fget(fd);
+ if (!file)
+ return -ENODEV;
+
+ if (file->f_op != &kevent_user_fops)
+ goto out_fput;
+
+ err = kevent_ctl_process(file, cmd, num, arg);
+
+out_fput:
+ fput(file);
+ return err;
+}
+
+/*
+ * Kevent subsystem initialization - create the kevent cache and register
+ * the misc device which control file descriptors are obtained from.
+ */
+static int __init kevent_user_init(void)
+{
+ int err = 0;
+
+ kevent_cache = kmem_cache_create("kevent_cache",
+ sizeof(struct kevent), 0, SLAB_PANIC, NULL, NULL);
+
+ err = misc_register(&kevent_miscdev);
+ if (err) {
+ printk(KERN_ERR "Failed to register kevent miscdev: err=%d.\n", err);
+ goto err_out_exit;
+ }
+
+ printk(KERN_INFO "KEVENT subsystem has been successfully registered.\n");
+
+ return 0;
+
+err_out_exit:
+ kmem_cache_destroy(kevent_cache);
+ return err;
+}
+
+module_init(kevent_user_init);
diff --git a/kernel/sys_ni.c b/kernel/sys_ni.c
index 7a3b2e7..bc0582b 100644
--- a/kernel/sys_ni.c
+++ b/kernel/sys_ni.c
@@ -122,6 +122,10 @@ cond_syscall(ppc_rtas);
cond_syscall(sys_spu_run);
cond_syscall(sys_spu_create);

+cond_syscall(sys_kevent_get_events);
+cond_syscall(sys_kevent_wait);
+cond_syscall(sys_kevent_ctl);
+
/* mmu depending weak syscall entries */
cond_syscall(sys_mprotect);
cond_syscall(sys_msync);
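
For illustration only: a minimal userspace sketch of driving the syscalls added above (kevent_ctl() to add/remove/modify events, kevent_get_events() to dequeue ready events). The syscall numbers, the /dev/kevent device path, the header location and the ukevent field names 'type' and 'event' are assumptions borrowed from the project's userspace examples, not guaranteed by this patch:

#include <sys/syscall.h>
#include <unistd.h>
#include <fcntl.h>
#include <string.h>
#include <linux/types.h>
#include <linux/kevent.h>	/* struct ukevent, KEVENT_* constants - header path assumed */

/* thin wrappers; the __NR_kevent_* numbers are arch-specific and assumed here */
static long kevent_ctl(int fd, unsigned int cmd, unsigned int num, struct ukevent *arg)
{
	return syscall(__NR_kevent_ctl, fd, cmd, num, arg);
}

static long kevent_get_events(int fd, unsigned int min_nr, unsigned int max_nr,
		__u64 timeout, struct ukevent *buf, unsigned int flags)
{
	return syscall(__NR_kevent_get_events, fd, min_nr, max_nr, timeout, buf, flags);
}

static void example(int sock)
{
	int kfd = open("/dev/kevent", O_RDWR);	/* control descriptor; device name assumed */
	struct ukevent uk, ready[128];
	long n;
	int i;

	memset(&uk, 0, sizeof(uk));
	uk.id.raw[0] = sock;		/* object to monitor, as used by kevent_socket_enqueue() */
	uk.type = KEVENT_SOCKET;	/* field names 'type'/'event' assumed */
	uk.event = KEVENT_SOCKET_RECV;

	/* KEVENT_CTL_ADD returns the number of immediately ready/broken events
	 * (their ret_flags get KEVENT_RET_DONE/KEVENT_RET_BROKEN) or a negative error */
	if (kevent_ctl(kfd, KEVENT_CTL_ADD, 1, &uk) < 0)
		return;

	/* wait up to 1s (timeout is in nanoseconds) for at least one ready event */
	n = kevent_get_events(kfd, 1, 128, 1000000000ULL, ready, 0);
	for (i = 0; i < n; ++i)
		handle_event(&ready[i]);	/* hypothetical handler */
}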

2006-10-27 16:11:17

by Evgeniy Polyakov

[permalink] [raw]
Subject: [take21 3/4] kevent: Socket notifications.


Socket notifications.

This patch includes socket send/recv/accept notifications.
Using a trivial web server based on kevent and these features
instead of epoll, its performance increased noticeably.
More details about various benchmarks and the server itself
(evserver_kevent.c) can be found on the project's homepage.

Signed-off-by: Evgeniy Polyakov <[email protected]>

diff --git a/fs/inode.c b/fs/inode.c
index ada7643..ff1b129 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -21,6 +21,7 @@ #include <linux/pagemap.h>
#include <linux/cdev.h>
#include <linux/bootmem.h>
#include <linux/inotify.h>
+#include <linux/kevent.h>
#include <linux/mount.h>

/*
@@ -164,12 +165,18 @@ #endif
}
inode->i_private = 0;
inode->i_mapping = mapping;
+#if defined CONFIG_KEVENT_SOCKET
+ kevent_storage_init(inode, &inode->st);
+#endif
}
return inode;
}

void destroy_inode(struct inode *inode)
{
+#if defined CONFIG_KEVENT_SOCKET
+ kevent_storage_fini(&inode->st);
+#endif
BUG_ON(inode_has_buffers(inode));
security_inode_free(inode);
if (inode->i_sb->s_op->destroy_inode)
diff --git a/include/net/sock.h b/include/net/sock.h
index edd4d73..d48ded8 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -48,6 +48,7 @@ #include <linux/lockdep.h>
#include <linux/netdevice.h>
#include <linux/skbuff.h> /* struct sk_buff */
#include <linux/security.h>
+#include <linux/kevent.h>

#include <linux/filter.h>

@@ -450,6 +451,21 @@ static inline int sk_stream_memory_free(

extern void sk_stream_rfree(struct sk_buff *skb);

+struct socket_alloc {
+ struct socket socket;
+ struct inode vfs_inode;
+};
+
+static inline struct socket *SOCKET_I(struct inode *inode)
+{
+ return &container_of(inode, struct socket_alloc, vfs_inode)->socket;
+}
+
+static inline struct inode *SOCK_INODE(struct socket *socket)
+{
+ return &container_of(socket, struct socket_alloc, socket)->vfs_inode;
+}
+
static inline void sk_stream_set_owner_r(struct sk_buff *skb, struct sock *sk)
{
skb->sk = sk;
@@ -477,6 +493,7 @@ static inline void sk_add_backlog(struct
sk->sk_backlog.tail = skb;
}
skb->next = NULL;
+ kevent_socket_notify(sk, KEVENT_SOCKET_RECV);
}

#define sk_wait_event(__sk, __timeo, __condition) \
@@ -679,21 +696,6 @@ static inline struct kiocb *siocb_to_kio
return si->kiocb;
}

-struct socket_alloc {
- struct socket socket;
- struct inode vfs_inode;
-};
-
-static inline struct socket *SOCKET_I(struct inode *inode)
-{
- return &container_of(inode, struct socket_alloc, vfs_inode)->socket;
-}
-
-static inline struct inode *SOCK_INODE(struct socket *socket)
-{
- return &container_of(socket, struct socket_alloc, socket)->vfs_inode;
-}
-
extern void __sk_stream_mem_reclaim(struct sock *sk);
extern int sk_stream_mem_schedule(struct sock *sk, int size, int kind);

diff --git a/include/net/tcp.h b/include/net/tcp.h
index 7a093d0..69f4ad2 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -857,6 +857,7 @@ static inline int tcp_prequeue(struct so
tp->ucopy.memory = 0;
} else if (skb_queue_len(&tp->ucopy.prequeue) == 1) {
wake_up_interruptible(sk->sk_sleep);
+ kevent_socket_notify(sk, KEVENT_SOCKET_RECV|KEVENT_SOCKET_SEND);
if (!inet_csk_ack_scheduled(sk))
inet_csk_reset_xmit_timer(sk, ICSK_TIME_DACK,
(3 * TCP_RTO_MIN) / 4,
diff --git a/kernel/kevent/kevent_socket.c b/kernel/kevent/kevent_socket.c
new file mode 100644
index 0000000..c865b3e
--- /dev/null
+++ b/kernel/kevent/kevent_socket.c
@@ -0,0 +1,129 @@
+/*
+ * kevent_socket.c
+ *
+ * 2006 Copyright (c) Evgeniy Polyakov <[email protected]>
+ * All rights reserved.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software
+ * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA
+ */
+
+#include <linux/kernel.h>
+#include <linux/types.h>
+#include <linux/list.h>
+#include <linux/slab.h>
+#include <linux/spinlock.h>
+#include <linux/timer.h>
+#include <linux/file.h>
+#include <linux/tcp.h>
+#include <linux/kevent.h>
+
+#include <net/sock.h>
+#include <net/request_sock.h>
+#include <net/inet_connection_sock.h>
+
+static int kevent_socket_callback(struct kevent *k)
+{
+ struct inode *inode = k->st->origin;
+ return SOCKET_I(inode)->ops->poll(SOCKET_I(inode)->file, SOCKET_I(inode), NULL);
+}
+
+int kevent_socket_enqueue(struct kevent *k)
+{
+ struct inode *inode;
+ struct socket *sock;
+ int err = -ENODEV;
+
+ sock = sockfd_lookup(k->event.id.raw[0], &err);
+ if (!sock)
+ goto err_out_exit;
+
+ inode = igrab(SOCK_INODE(sock));
+ if (!inode)
+ goto err_out_fput;
+
+ err = kevent_storage_enqueue(&inode->st, k);
+ if (err)
+ goto err_out_iput;
+
+ err = k->callbacks.callback(k);
+ if (err)
+ goto err_out_dequeue;
+
+ return err;
+
+err_out_dequeue:
+ kevent_storage_dequeue(k->st, k);
+err_out_iput:
+ iput(inode);
+err_out_fput:
+ sockfd_put(sock);
+err_out_exit:
+ return err;
+}
+
+int kevent_socket_dequeue(struct kevent *k)
+{
+ struct inode *inode = k->st->origin;
+ struct socket *sock;
+
+ kevent_storage_dequeue(k->st, k);
+
+ sock = SOCKET_I(inode);
+ iput(inode);
+ sockfd_put(sock);
+
+ return 0;
+}
+
+void kevent_socket_notify(struct sock *sk, u32 event)
+{
+ if (sk->sk_socket)
+ kevent_storage_ready(&SOCK_INODE(sk->sk_socket)->st, NULL, event);
+}
+
+/*
+ * It is required for network protocols compiled as modules, like IPv6.
+ */
+EXPORT_SYMBOL_GPL(kevent_socket_notify);
+
+#ifdef CONFIG_LOCKDEP
+static struct lock_class_key kevent_sock_key;
+
+void kevent_socket_reinit(struct socket *sock)
+{
+ struct inode *inode = SOCK_INODE(sock);
+
+ lockdep_set_class(&inode->st.lock, &kevent_sock_key);
+}
+
+void kevent_sk_reinit(struct sock *sk)
+{
+ if (sk->sk_socket) {
+ struct inode *inode = SOCK_INODE(sk->sk_socket);
+
+ lockdep_set_class(&inode->st.lock, &kevent_sock_key);
+ }
+}
+#endif
+static int __init kevent_init_socket(void)
+{
+ struct kevent_callbacks sc = {
+ .callback = &kevent_socket_callback,
+ .enqueue = &kevent_socket_enqueue,
+ .dequeue = &kevent_socket_dequeue};
+
+ return kevent_add_callbacks(&sc, KEVENT_SOCKET);
+}
+module_init(kevent_init_socket);
diff --git a/net/core/sock.c b/net/core/sock.c
index b77e155..7d5fa3e 100644
--- a/net/core/sock.c
+++ b/net/core/sock.c
@@ -1402,6 +1402,7 @@ static void sock_def_wakeup(struct sock
if (sk->sk_sleep && waitqueue_active(sk->sk_sleep))
wake_up_interruptible_all(sk->sk_sleep);
read_unlock(&sk->sk_callback_lock);
+ kevent_socket_notify(sk, KEVENT_SOCKET_RECV|KEVENT_SOCKET_SEND);
}

static void sock_def_error_report(struct sock *sk)
@@ -1411,6 +1412,7 @@ static void sock_def_error_report(struct
wake_up_interruptible(sk->sk_sleep);
sk_wake_async(sk,0,POLL_ERR);
read_unlock(&sk->sk_callback_lock);
+ kevent_socket_notify(sk, KEVENT_SOCKET_RECV|KEVENT_SOCKET_SEND);
}

static void sock_def_readable(struct sock *sk, int len)
@@ -1420,6 +1422,7 @@ static void sock_def_readable(struct soc
wake_up_interruptible(sk->sk_sleep);
sk_wake_async(sk,1,POLL_IN);
read_unlock(&sk->sk_callback_lock);
+ kevent_socket_notify(sk, KEVENT_SOCKET_RECV|KEVENT_SOCKET_SEND);
}

static void sock_def_write_space(struct sock *sk)
@@ -1439,6 +1442,7 @@ static void sock_def_write_space(struct
}

read_unlock(&sk->sk_callback_lock);
+ kevent_socket_notify(sk, KEVENT_SOCKET_SEND|KEVENT_SOCKET_RECV);
}

static void sock_def_destruct(struct sock *sk)
@@ -1489,6 +1493,8 @@ #endif
sk->sk_state = TCP_CLOSE;
sk->sk_socket = sock;

+ kevent_sk_reinit(sk);
+
sock_set_flag(sk, SOCK_ZAPPED);

if(sock)
@@ -1555,8 +1561,10 @@ void fastcall release_sock(struct sock *
if (sk->sk_backlog.tail)
__release_sock(sk);
sk->sk_lock.owner = NULL;
- if (waitqueue_active(&sk->sk_lock.wq))
+ if (waitqueue_active(&sk->sk_lock.wq)) {
wake_up(&sk->sk_lock.wq);
+ kevent_socket_notify(sk, KEVENT_SOCKET_RECV|KEVENT_SOCKET_SEND);
+ }
spin_unlock_bh(&sk->sk_lock.slock);
}
EXPORT_SYMBOL(release_sock);
diff --git a/net/core/stream.c b/net/core/stream.c
index d1d7dec..2878c2a 100644
--- a/net/core/stream.c
+++ b/net/core/stream.c
@@ -36,6 +36,7 @@ void sk_stream_write_space(struct sock *
wake_up_interruptible(sk->sk_sleep);
if (sock->fasync_list && !(sk->sk_shutdown & SEND_SHUTDOWN))
sock_wake_async(sock, 2, POLL_OUT);
+ kevent_socket_notify(sk, KEVENT_SOCKET_SEND|KEVENT_SOCKET_RECV);
}
}

diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index 3f884ce..e7dd989 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -3119,6 +3119,7 @@ static void tcp_ofo_queue(struct sock *s

__skb_unlink(skb, &tp->out_of_order_queue);
__skb_queue_tail(&sk->sk_receive_queue, skb);
+ kevent_socket_notify(sk, KEVENT_SOCKET_RECV);
tp->rcv_nxt = TCP_SKB_CB(skb)->end_seq;
if(skb->h.th->fin)
tcp_fin(skb, sk, skb->h.th);
diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
index c83938b..b0dd70d 100644
--- a/net/ipv4/tcp_ipv4.c
+++ b/net/ipv4/tcp_ipv4.c
@@ -61,6 +61,7 @@ #include <linux/cache.h>
#include <linux/jhash.h>
#include <linux/init.h>
#include <linux/times.h>
+#include <linux/kevent.h>

#include <net/icmp.h>
#include <net/inet_hashtables.h>
@@ -870,6 +871,7 @@ #endif
reqsk_free(req);
} else {
inet_csk_reqsk_queue_hash_add(sk, req, TCP_TIMEOUT_INIT);
+ kevent_socket_notify(sk, KEVENT_SOCKET_ACCEPT);
}
return 0;

diff --git a/net/socket.c b/net/socket.c
index 1bc4167..5582b4a 100644
--- a/net/socket.c
+++ b/net/socket.c
@@ -85,6 +85,7 @@ #include <linux/compat.h>
#include <linux/kmod.h>
#include <linux/audit.h>
#include <linux/wireless.h>
+#include <linux/kevent.h>

#include <asm/uaccess.h>
#include <asm/unistd.h>
@@ -490,6 +491,8 @@ static struct socket *sock_alloc(void)
inode->i_uid = current->fsuid;
inode->i_gid = current->fsgid;

+ kevent_socket_reinit(sock);
+
get_cpu_var(sockets_in_use)++;
put_cpu_var(sockets_in_use);
return sock;

2006-10-27 16:43:04

by Evgeniy Polyakov

[permalink] [raw]
Subject: Re: [take21 0/4] kevent: Generic event handling mechanism.

On Fri, Oct 27, 2006 at 08:10:01PM +0400, Evgeniy Polyakov ([email protected]) wrote:
>
> Generic event handling mechanism.
>
> Consider for inclusion.
>
> Changes from 'take20' patchset:
> * new ring buffer implementation

A test userspace application can be found in the archive on the project's
homepage. It is also attached to this mail.

Short design notes about ring buffer implementation.

Ring buffer is designed in a way that first ready kevent will be at
ring->uidx position, and all other ready events will be in FIFO order
after it. So when we need to commit num events, it means we should just
remove first num kevents from ready queue and commit them. We do not use
any special locking to protect this function against simultaneous
running - kevent dequeueing is atomic, and we do not care about order in
which events were committed.
An example: thread 1 and thread 2 simultaneously call kevent_wait() to
commit 2 and 3 events. It is possible that first thread will commit
events 0 and 2 while second thread will commit events 1, 3 and 4. If
there were only 3 ready events, then one of the calls will return a smaller
number of committed events than was requested.
ring->uidx update is atomic, since it is protected by u->ready_lock,
which removes race with kevent_user_ring_add_event().

If the user asks to commit events which have been removed by
kevent_get_events() recently (for example when one thread looked into
the ring indexes and started to commit events which were simultaneously
committed by another thread through kevent_get_events()), kevent_wait()
will not commit unprocessed events, but will return the number of actually
committed events instead.

It is forbidden to try to commit events not from the start of the
buffer, but from some 'further' event.

An example: if ready events use positions 2-5, it is permitted to start
to commit 3 events from position 0; in this case positions 0 and 1 will
be omitted and only the event in position 2 will be committed, and
kevent_wait() will return 1, since only one event was actually
committed.
It is forbidden to try to commit from position 4; 0 will be returned.
This means that if some events were committed using kevent_get_events(),
they will not be counted; instead userspace should check the ring index and
try to commit again.
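
For illustration, the commit path described above might look roughly like this from userspace (a sketch only: the mmap details are simplified, the wrapper around sys_kevent_wait() and the struct layout order are assumptions; only the kidx/uidx semantics come from the patchset):

/* 'ring' points to the first mmapped page of the kevent descriptor, which is
 * assumed to expose the struct kevent_mring indexes kidx (kernel) and uidx (user) */
static int commit_ready(int kevent_fd, struct kevent_mring *ring, __u64 timeout_ns)
{
	unsigned int start = ring->uidx;
	unsigned int ready;

	if (ring->kidx >= start)
		ready = ring->kidx - start;
	else
		ready = KEVENT_MAX_EVENTS - (start - ring->kidx);

	/* ... process the corresponding ring entries here ... */

	/* the return value is the number actually committed, which may be
	 * smaller than 'ready' if another thread raced with us */
	return syscall(__NR_kevent_wait, kevent_fd, start, ready, timeout_ns);
}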

--
Evgeniy Polyakov


Attachments:
(No filename) (2.25 kB)
evtest.c (4.95 kB)

2006-10-28 10:04:12

by Eric Dumazet

[permalink] [raw]
Subject: Re: [take21 2/4] kevent: poll/select() notifications.

Evgeniy Polyakov wrote:

> + file = fget(k->event.id.raw[0]);
> + if (!file)
> + return -ENODEV;

Please, do us a favor, and use EBADF instead of ENODEV.

EBADF : /* Bad file number */

ENODEV : /* No such device */

You have many ENODEV uses in your patches and that really hurts.

Eric

2006-10-28 10:08:52

by Evgeniy Polyakov

[permalink] [raw]
Subject: Re: [take21 2/4] kevent: poll/select() notifications.

On Sat, Oct 28, 2006 at 12:04:10PM +0200, Eric Dumazet ([email protected]) wrote:
> Evgeniy Polyakov wrote:
>
> >+ file = fget(k->event.id.raw[0]);
> >+ if (!file)
> >+ return -ENODEV;
>
> Please, do us a favor, and use EBADF instead of ENODEV.
>
> EBADF : /* Bad file number */
>
> ENODEV : /* No such device */
>
> You have many ENODEV uses in your patches and that really hurts.

Ok :)

> Eric

--
Evgeniy Polyakov

2006-10-28 10:28:14

by Eric Dumazet

[permalink] [raw]
Subject: Re: [take21 1/4] kevent: Core files.

+/*
+ * Called under kevent_user->ready_lock, so updates are always protected.
+ */
+int kevent_user_ring_add_event(struct kevent *k)
+{
+ unsigned int pidx, off;
+ struct kevent_mring *ring, *copy_ring;
+
+ ring = k->user->pring[0];
+
+ if ((ring->kidx + 1 == ring->uidx) ||
+ ((ring->kidx + 1 == KEVENT_MAX_EVENTS) && ring->uidx == 0)) {
+ if (k->user->overflow_kevent == NULL)
+ k->user->overflow_kevent = k;
+ return -EAGAIN;
+ }
+


I really don't understand how you manage to queue multiple kevents in the
'overflow list'. You just queue one kevent at most. What am I missing?



> +
> + for (i=0; i<KEVENT_MAX_PAGES; ++i) {
> + u->pring[i] = (struct kevent_mring *)__get_free_page(GFP_KERNEL);
> + if (!u->pring[i])
> + break;
> + }
> +
> + if (i != KEVENT_MAX_PAGES)
> + goto err_out_free;

Why don't you use goto directly?

if (!u->pring[i])
goto err_out_free;




> +
> + u->pring[0]->uidx = u->pring[0]->kidx = 0;
> +
> + return 0;
> +
> +err_out_free:
> + for (i=0; i<KEVENT_MAX_PAGES; ++i) {
> + if (!u->pring[i])
> + break;
> +
> + free_page((unsigned long)u->pring[i]);
> + }
> + return k;
> +}
> +




> +static int kevent_user_ctl_add(struct kevent_user *u, unsigned int num, void __user *arg)
> +{
> + int err, cerr = 0, knum = 0, rnum = 0, i;
> + void __user *orig = arg;
> + struct ukevent uk;
> +
> + mutex_lock(&u->ctl_mutex);
> +
> + err = -EINVAL;
> + if (num > KEVENT_MIN_BUFFS_ALLOC) {
> + struct ukevent *ukev;
> +
> + ukev = kevent_get_user(num, arg);
> + if (ukev) {
> + for (i = 0; i < num; ++i) {
> + err = kevent_user_add_ukevent(&ukev[i], u);
> + if (err) {
> + kevent_stat_im(u);
> + if (i != rnum)
> + memcpy(&ukev[rnum], &ukev[i], sizeof(struct ukevent));
> + rnum++;
> + } else
> + knum++;


Why are you using/counting knum ?



> + }
> + if (copy_to_user(orig, ukev, rnum*sizeof(struct ukevent)))
> + cerr = -EFAULT;
> + kfree(ukev);
> + goto out_setup;
> + }
> + }
> +
> + for (i = 0; i < num; ++i) {
> + if (copy_from_user(&uk, arg, sizeof(struct ukevent))) {
> + cerr = -EFAULT;
> + break;
> + }
> + arg += sizeof(struct ukevent);
> +
> + err = kevent_user_add_ukevent(&uk, u);
> + if (err) {
> + kevent_stat_im(u);
> + if (copy_to_user(orig, &uk, sizeof(struct ukevent))) {
> + cerr = -EFAULT;
> + break;
> + }
> + orig += sizeof(struct ukevent);
> + rnum++;
> + } else
> + knum++;
> + }
> +
> +out_setup:
> + if (cerr < 0) {
> + err = cerr;
> + goto out_remove;
> + }
> +
> + err = rnum;
> +out_remove:
> + mutex_unlock(&u->ctl_mutex);
> +
> + return err;
> +}

2006-10-28 10:54:17

by Evgeniy Polyakov

[permalink] [raw]
Subject: Re: [take21 1/4] kevent: Core files.

On Sat, Oct 28, 2006 at 12:28:12PM +0200, Eric Dumazet ([email protected]) wrote:
> +/*
> + * Called under kevent_user->ready_lock, so updates are always protected.
> + */
> +int kevent_user_ring_add_event(struct kevent *k)
> +{
> + unsigned int pidx, off;
> + struct kevent_mring *ring, *copy_ring;
> +
> + ring = k->user->pring[0];
> +
> + if ((ring->kidx + 1 == ring->uidx) ||
> + ((ring->kidx + 1 == KEVENT_MAX_EVENTS) && ring->uidx == 0)) {
> + if (k->user->overflow_kevent == NULL)
> + k->user->overflow_kevent = k;
> + return -EAGAIN;
> + }
> +
>
>
> I really dont understand how you manage to queue multiple kevents in the
> 'overflow list'. You just queue one kevent at most. What am I missing ?

There is no overflow list - it is a pointer to the first kevent in the
ready queue which was not put into the ring buffer. It is an optimisation
which allows us not to search for that position each time a new event should
be placed into the buffer once it gets an empty slot.

>
> >+
> >+ for (i=0; i<KEVENT_MAX_PAGES; ++i) {
> >+ u->pring[i] = (struct kevent_mring *)__get_free_page(GFP_KERNEL);
> >+ if (!u->pring[i])
> >+ break;
> >+ }
> >+
> >+ if (i != KEVENT_MAX_PAGES)
> >+ goto err_out_free;
>
> Why dont you use goto directly ?
>
> if (!u->pring[i])
> goto err_out_free;
>

I used a fallback mode here which allowed using a smaller number of pages
for the kevent ring buffer, but then decided to drop it.
So it is possible to use goto directly.
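
With that, the simplified allocation path would look roughly like this (a sketch only; returning -ENOMEM on failure is an assumption made here):

	for (i = 0; i < KEVENT_MAX_PAGES; ++i) {
		u->pring[i] = (struct kevent_mring *)__get_free_page(GFP_KERNEL);
		if (!u->pring[i])
			goto err_out_free;
	}

	u->pring[0]->uidx = u->pring[0]->kidx = 0;

	return 0;

err_out_free:
	/* only pages 0..i-1 were allocated when page i failed */
	while (i-- > 0)
		free_page((unsigned long)u->pring[i]);
	return -ENOMEM;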

> >+
> >+ u->pring[0]->uidx = u->pring[0]->kidx = 0;
> >+
> >+ return 0;
> >+
> >+err_out_free:
> >+ for (i=0; i<KEVENT_MAX_PAGES; ++i) {
> >+ if (!u->pring[i])
> >+ break;
> >+
> >+ free_page((unsigned long)u->pring[i]);
> >+ }
> >+ return k;
> >+}
> >+
>
>
>
>
> >+static int kevent_user_ctl_add(struct kevent_user *u, unsigned int num,
> >void __user *arg)
> >+{
> >+ int err, cerr = 0, knum = 0, rnum = 0, i;
> >+ void __user *orig = arg;
> >+ struct ukevent uk;
> >+
> >+ mutex_lock(&u->ctl_mutex);
> >+
> >+ err = -EINVAL;
> >+ if (num > KEVENT_MIN_BUFFS_ALLOC) {
> >+ struct ukevent *ukev;
> >+
> >+ ukev = kevent_get_user(num, arg);
> >+ if (ukev) {
> >+ for (i = 0; i < num; ++i) {
> >+ err = kevent_user_add_ukevent(&ukev[i], u);
> >+ if (err) {
> >+ kevent_stat_im(u);
> >+ if (i != rnum)
> >+ memcpy(&ukev[rnum],
> >&ukev[i], sizeof(struct ukevent));
> >+ rnum++;
> >+ } else
> >+ knum++;
>
>
> Why are you using/counting knum ?

It should go away.

> >+ }
> >+ if (copy_to_user(orig, ukev, rnum*sizeof(struct
> >ukevent)))
> >+ cerr = -EFAULT;
> >+ kfree(ukev);
> >+ goto out_setup;
> >+ }
> >+ }
> >+
> >+ for (i = 0; i < num; ++i) {
> >+ if (copy_from_user(&uk, arg, sizeof(struct ukevent))) {
> >+ cerr = -EFAULT;
> >+ break;
> >+ }
> >+ arg += sizeof(struct ukevent);
> >+
> >+ err = kevent_user_add_ukevent(&uk, u);
> >+ if (err) {
> >+ kevent_stat_im(u);
> >+ if (copy_to_user(orig, &uk, sizeof(struct ukevent)))
> >{
> >+ cerr = -EFAULT;
> >+ break;
> >+ }
> >+ orig += sizeof(struct ukevent);
> >+ rnum++;
> >+ } else
> >+ knum++;
> >+ }
> >+
> >+out_setup:
> >+ if (cerr < 0) {
> >+ err = cerr;
> >+ goto out_remove;
> >+ }
> >+
> >+ err = rnum;
> >+out_remove:
> >+ mutex_unlock(&u->ctl_mutex);
> >+
> >+ return err;
> >+}
> -
> To unsubscribe from this list: send the line "unsubscribe netdev" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html

--
Evgeniy Polyakov

2006-10-28 12:36:36

by Eric Dumazet

[permalink] [raw]
Subject: Re: [take21 1/4] kevent: Core files.

Evgeniy Polyakov wrote:
> On Sat, Oct 28, 2006 at 12:28:12PM +0200, Eric Dumazet ([email protected]) wrote:
>>
>> I really dont understand how you manage to queue multiple kevents in the
>> 'overflow list'. You just queue one kevent at most. What am I missing ?
>
> There is no overflow list - it is a pointer to the first kevent in the
> ready queue, which was not put into ring buffer. It is an optimisation,
> which allows to not search for that position each time new event should
> be placed into the buffer, when it starts to have an empty slot.

This overflow list (you may call it differently, but still it IS a list) is
not complete. I feel you added it just to make me happy, but I am not (yet :) )

For example, you make no test at kevent_finish_user_complete() time.

Obviously, you can have a dangling pointer, and crash your box in certain
conditions.

static void kevent_finish_user_complete(struct kevent *k, int deq)
{
struct kevent_user *u = k->user;
unsigned long flags;

if (deq)
kevent_dequeue(k);

spin_lock_irqsave(&u->ready_lock, flags);
if (k->flags & KEVENT_READY) {
+ if (u->overflow_event == k) {
+ /* MUST do something to change u->overflow_kevent */
+ }
list_del(&k->ready_entry);
k->flags &= ~KEVENT_READY;
u->ready_num--;
}
spin_unlock_irqrestore(&u->ready_lock, flags);

kevent_user_put(u);
call_rcu(&k->rcu_head, kevent_free_rcu);
}

Eric

2006-10-28 13:05:49

by Evgeniy Polyakov

[permalink] [raw]
Subject: Re: [take21 1/4] kevent: Core files.

On Sat, Oct 28, 2006 at 02:36:31PM +0200, Eric Dumazet ([email protected]) wrote:
> Evgeniy Polyakov wrote:
> >On Sat, Oct 28, 2006 at 12:28:12PM +0200, Eric Dumazet
> >([email protected]) wrote:
> >>
> >>I really dont understand how you manage to queue multiple kevents in the
> >>'overflow list'. You just queue one kevent at most. What am I missing ?
> >
> >There is no overflow list - it is a pointer to the first kevent in the
> >ready queue, which was not put into ring buffer. It is an optimisation,
> >which allows to not search for that position each time new event should
> >be placed into the buffer, when it starts to have an empty slot.
>
> This overflow list (you may call it differently, but still it IS a list),
> is not complete. I feel you add it just to make me happy, but I am not (yet
> :) )

There is no overflow list.
There is a ready queue, part of which (the first several entries) is copied
into the ring buffer; overflow_kevent is a pointer to the first kevent which
was not copied.

> For example, you make no test at kevent_finish_user_complete() time.
>
> Obviously, you can have a dangling pointer, and crash your box in certain
> conditions.

You are right, I did not put the overflow_kevent check into all places which
can remove a kevent.

Here is a patch I am about to commit into the kevent tree:

diff --git a/kernel/kevent/kevent_user.c b/kernel/kevent/kevent_user.c
index 711a8a8..ecee668 100644
--- a/kernel/kevent/kevent_user.c
+++ b/kernel/kevent/kevent_user.c
@@ -235,6 +235,36 @@ static void kevent_free_rcu(struct rcu_h
}

/*
+ * Must be called under u->ready_lock.
+ * This function removes kevent from ready queue and
+ * tries to add new kevent into ring buffer.
+ */
+static void kevent_remove_ready(struct kevent *k)
+{
+ struct kevent_user *u = k->user;
+
+ list_del(&k->ready_entry);
+ k->flags &= ~KEVENT_READY;
+ u->ready_num--;
+ if (++u->pring[0]->uidx == KEVENT_MAX_EVENTS)
+ u->pring[0]->uidx = 0;
+
+ if (u->overflow_kevent) {
+ int err;
+
+ err = kevent_user_ring_add_event(u->overflow_kevent);
+ if (!err || u->overflow_kevent == k) {
+ if (u->overflow_kevent->ready_entry.next == &u->ready_list)
+ u->overflow_kevent = NULL;
+ else
+ u->overflow_kevent =
+ list_entry(u->overflow_kevent->ready_entry.next,
+ struct kevent, ready_entry);
+ }
+ }
+}
+
+/*
* Complete kevent removing - it dequeues kevent from storage list
* if it is requested, removes kevent from ready list, drops userspace
* control block reference counter and schedules kevent freeing through RCU.
@@ -248,11 +278,8 @@ static void kevent_finish_user_complete(
kevent_dequeue(k);

spin_lock_irqsave(&u->ready_lock, flags);
- if (k->flags & KEVENT_READY) {
- list_del(&k->ready_entry);
- k->flags &= ~KEVENT_READY;
- u->ready_num--;
- }
+ if (k->flags & KEVENT_READY)
+ kevent_remove_ready(k);
spin_unlock_irqrestore(&u->ready_lock, flags);

kevent_user_put(u);
@@ -303,25 +330,7 @@ static struct kevent *kqueue_dequeue_rea
spin_lock_irqsave(&u->ready_lock, flags);
if (u->ready_num && !list_empty(&u->ready_list)) {
k = list_entry(u->ready_list.next, struct kevent, ready_entry);
- list_del(&k->ready_entry);
- k->flags &= ~KEVENT_READY;
- u->ready_num--;
- if (++u->pring[0]->uidx == KEVENT_MAX_EVENTS)
- u->pring[0]->uidx = 0;
-
- if (u->overflow_kevent) {
- int err;
-
- err = kevent_user_ring_add_event(u->overflow_kevent);
- if (!err) {
- if (u->overflow_kevent->ready_entry.next == &u->ready_list)
- u->overflow_kevent = NULL;
- else
- u->overflow_kevent =
- list_entry(u->overflow_kevent->ready_entry.next,
- struct kevent, ready_entry);
- }
- }
+ kevent_remove_ready(k);
}
spin_unlock_irqrestore(&u->ready_lock, flags);


It tries to put the next kevent into the ring and thus update
overflow_kevent if a new kevent has been put into the
buffer or the kevent being removed is the overflow kevent.
The patch depends on the committed changes of returned error numbers and
unused-variable cleanups; it will be included in the next patchset if there are
no problems with it.

--
Evgeniy Polyakov

2006-10-28 13:23:43

by Eric Dumazet

[permalink] [raw]
Subject: Re: [take21 1/4] kevent: Core files.

Evgeniy Polyakov wrote:
> On Sat, Oct 28, 2006 at 02:36:31PM +0200, Eric Dumazet ([email protected]) wrote:
>> Evgeniy Polyakov wrote:
>>> On Sat, Oct 28, 2006 at 12:28:12PM +0200, Eric Dumazet
>>> ([email protected]) wrote:
>>>> I really dont understand how you manage to queue multiple kevents in the
>>>> 'overflow list'. You just queue one kevent at most. What am I missing ?
>>> There is no overflow list - it is a pointer to the first kevent in the
>>> ready queue, which was not put into ring buffer. It is an optimisation,
>>> which allows to not search for that position each time new event should
>>> be placed into the buffer, when it starts to have an empty slot.
>> This overflow list (you may call it differently, but still it IS a list),
>> is not complete. I feel you add it just to make me happy, but I am not (yet
>> :) )
>
> There is no overflow list.
> There is ready queue, part of which (first several entries) is copied
> into the ring buffer, overflow_kevent is a pointer to the first kevent which
> was not copied.
>
>> For example, you make no test at kevent_finish_user_complete() time.
>>
>> Obviously, you can have a dangling pointer, and crash your box in certain
>> conditions.
>
> You are right, I did not put overflow_kevent check into all places which
> can remove kevent.
>
> Here is a patch I am about to commit into the kevent tree:
>
> diff --git a/kernel/kevent/kevent_user.c b/kernel/kevent/kevent_user.c
> index 711a8a8..ecee668 100644
> --- a/kernel/kevent/kevent_user.c
> +++ b/kernel/kevent/kevent_user.c
> @@ -235,6 +235,36 @@ static void kevent_free_rcu(struct rcu_h
> }
>
> /*
> + * Must be called under u->ready_lock.
> + * This function removes kevent from ready queue and
> + * tries to add new kevent into ring buffer.
> + */
> +static void kevent_remove_ready(struct kevent *k)
> +{
> + struct kevent_user *u = k->user;
> +
> + list_del(&k->ready_entry);

Arg... no

You cannot call list_del(), then check overflow_kevent.

If you call list_del on what happens to be the kevent pointed to by
overflow_kevent, you lose...

> + k->flags &= ~KEVENT_READY;
> + u->ready_num--;
> + if (++u->pring[0]->uidx == KEVENT_MAX_EVENTS)
> + u->pring[0]->uidx = 0;
> +
> + if (u->overflow_kevent) {
> + int err;
> +
> + err = kevent_user_ring_add_event(u->overflow_kevent);
> + if (!err || u->overflow_kevent == k) {
> + if (u->overflow_kevent->ready_entry.next == &u->ready_list)
> + u->overflow_kevent = NULL;
> + else
> + u->overflow_kevent =
> + list_entry(u->overflow_kevent->ready_entry.next,
> + struct kevent, ready_entry);
> + }
> + }
> +}
> +
> +/*
> * Complete kevent removing - it dequeues kevent from storage list
> * if it is requested, removes kevent from ready list, drops userspace
> * control block reference counter and schedules kevent freeing through RCU.
> @@ -248,11 +278,8 @@ static void kevent_finish_user_complete(
> kevent_dequeue(k);
>
> spin_lock_irqsave(&u->ready_lock, flags);
> - if (k->flags & KEVENT_READY) {
> - list_del(&k->ready_entry);
> - k->flags &= ~KEVENT_READY;
> - u->ready_num--;
> - }
> + if (k->flags & KEVENT_READY)
> + kevent_remove_ready(k);
> spin_unlock_irqrestore(&u->ready_lock, flags);
>
> kevent_user_put(u);
> @@ -303,25 +330,7 @@ static struct kevent *kqueue_dequeue_rea
> spin_lock_irqsave(&u->ready_lock, flags);
> if (u->ready_num && !list_empty(&u->ready_list)) {
> k = list_entry(u->ready_list.next, struct kevent, ready_entry);
> - list_del(&k->ready_entry);
> - k->flags &= ~KEVENT_READY;
> - u->ready_num--;
> - if (++u->pring[0]->uidx == KEVENT_MAX_EVENTS)
> - u->pring[0]->uidx = 0;
> -
> - if (u->overflow_kevent) {
> - int err;
> -
> - err = kevent_user_ring_add_event(u->overflow_kevent);
> - if (!err) {
> - if (u->overflow_kevent->ready_entry.next == &u->ready_list)
> - u->overflow_kevent = NULL;
> - else
> - u->overflow_kevent =
> - list_entry(u->overflow_kevent->ready_entry.next,
> - struct kevent, ready_entry);
> - }
> - }
> + kevent_remove_ready(k);
> }
> spin_unlock_irqrestore(&u->ready_lock, flags);
>


2006-10-28 13:28:52

by Evgeniy Polyakov

[permalink] [raw]
Subject: Re: [take21 1/4] kevent: Core files.

On Sat, Oct 28, 2006 at 03:23:40PM +0200, Eric Dumazet ([email protected]) wrote:
> >diff --git a/kernel/kevent/kevent_user.c b/kernel/kevent/kevent_user.c
> >index 711a8a8..ecee668 100644
> >--- a/kernel/kevent/kevent_user.c
> >+++ b/kernel/kevent/kevent_user.c
> >@@ -235,6 +235,36 @@ static void kevent_free_rcu(struct rcu_h
> > }
> >
> > /*
> >+ * Must be called under u->ready_lock.
> >+ * This function removes kevent from ready queue and
> >+ * tries to add new kevent into ring buffer.
> >+ */
> >+static void kevent_remove_ready(struct kevent *k)
> >+{
> >+ struct kevent_user *u = k->user;
> >+
> >+ list_del(&k->ready_entry);
>
> Arg... no
>
> You cannot call list_del() , then check overflow_kevent.
>
> I you call list_del on what happens to be the kevent pointed by
> overflow_kevent, you loose...

This function is always called from an appropriate context, where it is
guaranteed that it is safe to call list_del:
1. when a kevent is removed - it is called after the check that the given kevent
is in the ready queue;
2. when a kevent is dequeued from the ready queue, which means that it can be removed
from that queue.

--
Evgeniy Polyakov

2006-10-28 13:34:55

by Eric Dumazet

[permalink] [raw]
Subject: Re: [take21 1/4] kevent: Core files.

Evgeniy Polyakov wrote:
> On Sat, Oct 28, 2006 at 03:23:40PM +0200, Eric Dumazet ([email protected]) wrote:
>>> diff --git a/kernel/kevent/kevent_user.c b/kernel/kevent/kevent_user.c
>>> index 711a8a8..ecee668 100644
>>> --- a/kernel/kevent/kevent_user.c
>>> +++ b/kernel/kevent/kevent_user.c
>>> @@ -235,6 +235,36 @@ static void kevent_free_rcu(struct rcu_h
>>> }
>>>
>>> /*
>>> + * Must be called under u->ready_lock.
>>> + * This function removes kevent from ready queue and
>>> + * tries to add new kevent into ring buffer.
>>> + */
>>> +static void kevent_remove_ready(struct kevent *k)
>>> +{
>>> + struct kevent_user *u = k->user;
>>> +
>>> + list_del(&k->ready_entry);
>> Arg... no
>>
>> You cannot call list_del() , then check overflow_kevent.
>>
>> I you call list_del on what happens to be the kevent pointed by
>> overflow_kevent, you loose...
>
> This function is always called from appropriate context, where it is
> guaranteed that it is safe to call list_del:
> 1. when kevent is removed. It is called after check, that given kevent
> is in the ready queue.
> 2. when dequeued from ready queue, which means that it can be removed
> from that queue.
>

Could you please check the list_del() function ?

file include/linux/list.h

static inline void list_del(struct list_head *entry)
{
__list_del(entry->prev, entry->next);
entry->next = LIST_POISON1;
entry->prev = LIST_POISON2;
}

So, after calling list_del(&k->ready_entry);
next and prev are basically destroyed.

So when you write later :

+ if (!err || u->overflow_kevent == k) {
+ if (u->overflow_kevent->ready_entry.next == &u->ready_list)
+ u->overflow_kevent = NULL;
+ else
+ u->overflow_kevent =
+ list_entry(u->overflow_kevent->ready_entry.next,
+ struct kevent, ready_entry);
+ }


then you have a problem, since

list_entry(k->ready_entry.next, struct kevent, ready_entry);

will give you garbage.

Eric

2006-10-28 13:48:13

by Evgeniy Polyakov

[permalink] [raw]
Subject: Re: [take21 1/4] kevent: Core files.

On Sat, Oct 28, 2006 at 03:34:52PM +0200, Eric Dumazet ([email protected]) wrote:
> >>>+ list_del(&k->ready_entry);
> >>Arg... no
> >>
> >>You cannot call list_del() , then check overflow_kevent.
> >>
> >>I you call list_del on what happens to be the kevent pointed by
> >>overflow_kevent, you loose...
> >
> >This function is always called from appropriate context, where it is
> >guaranteed that it is safe to call list_del:
> >1. when kevent is removed. It is called after check, that given kevent
> >is in the ready queue.
> >2. when dequeued from ready queue, which means that it can be removed
> >from that queue.
> >
>
> Could you please check the list_del() function ?
>
> file include/linux/list.h
>
> static inline void list_del(struct list_head *entry)
> {
> __list_del(entry->prev, entry->next);
> entry->next = LIST_POISON1;
> entry->prev = LIST_POISON2;
> }
>
> So, after calling list_del(&k->read_entry);
> next and prev are basically destroyed.
>
> So when you write later :
>
> + if (!err || u->overflow_kevent == k) {
> + if (u->overflow_kevent->ready_entry.next == &u->ready_list)
> + u->overflow_kevent = NULL;
> + else
> + u->overflow_kevent =
> + list_entry(u->overflow_kevent->ready_entry.next,
> + struct kevent, ready_entry);
> + }
>
>
> then you have a problem, since
>
> list_entry(k->ready_entry.next, struct kevent, ready_entry);
>
> will give you garbage.

Ok, I understand you now.
To fix this issue we can delete the entry from the list after all checks
with the overflow_kevent pointer are completed, i.e. have something like
this:

diff --git a/kernel/kevent/kevent_user.c b/kernel/kevent/kevent_user.c
index 711a8a8..f3fec9b 100644
--- a/kernel/kevent/kevent_user.c
+++ b/kernel/kevent/kevent_user.c
@@ -235,6 +235,36 @@ static void kevent_free_rcu(struct rcu_h
}

/*
+ * Must be called under u->ready_lock.
+ * This function removes kevent from ready queue and
+ * tries to add new kevent into ring buffer.
+ */
+static void kevent_remove_ready(struct kevent *k)
+{
+ struct kevent_user *u = k->user;
+
+ if (++u->pring[0]->uidx == KEVENT_MAX_EVENTS)
+ u->pring[0]->uidx = 0;
+
+ if (u->overflow_kevent) {
+ int err;
+
+ err = kevent_user_ring_add_event(u->overflow_kevent);
+ if (!err || u->overflow_kevent == k) {
+ if (u->overflow_kevent->ready_entry.next == &u->ready_list)
+ u->overflow_kevent = NULL;
+ else
+ u->overflow_kevent =
+ list_entry(u->overflow_kevent->ready_entry.next,
+ struct kevent, ready_entry);
+ }
+ }
+ list_del(&k->ready_entry);
+ k->flags &= ~KEVENT_READY;
+ u->ready_num--;
+}
+
+/*
* Complete kevent removing - it dequeues kevent from storage list
* if it is requested, removes kevent from ready list, drops userspace
* control block reference counter and schedules kevent freeing through RCU.
@@ -248,11 +278,8 @@ static void kevent_finish_user_complete(
kevent_dequeue(k);

spin_lock_irqsave(&u->ready_lock, flags);
- if (k->flags & KEVENT_READY) {
- list_del(&k->ready_entry);
- k->flags &= ~KEVENT_READY;
- u->ready_num--;
- }
+ if (k->flags & KEVENT_READY)
+ kevent_remove_ready(k);
spin_unlock_irqrestore(&u->ready_lock, flags);

kevent_user_put(u);
@@ -303,25 +330,7 @@ static struct kevent *kqueue_dequeue_rea
spin_lock_irqsave(&u->ready_lock, flags);
if (u->ready_num && !list_empty(&u->ready_list)) {
k = list_entry(u->ready_list.next, struct kevent, ready_entry);
- list_del(&k->ready_entry);
- k->flags &= ~KEVENT_READY;
- u->ready_num--;
- if (++u->pring[0]->uidx == KEVENT_MAX_EVENTS)
- u->pring[0]->uidx = 0;
-
- if (u->overflow_kevent) {
- int err;
-
- err = kevent_user_ring_add_event(u->overflow_kevent);
- if (!err) {
- if (u->overflow_kevent->ready_entry.next == &u->ready_list)
- u->overflow_kevent = NULL;
- else
- u->overflow_kevent =
- list_entry(u->overflow_kevent->ready_entry.next,
- struct kevent, ready_entry);
- }
- }
+ kevent_remove_ready(k);
}
spin_unlock_irqrestore(&u->ready_lock, flags);


Thanks.

> Eric

--
Evgeniy Polyakov

2006-11-07 11:26:30

by Jeff Garzik

[permalink] [raw]
Subject: Re: [take21 0/4] kevent: Generic event handling mechanism.

Evgeniy Polyakov wrote:
> Generic event handling mechanism.
>
> Consider for inclusion.
>
> Changes from 'take20' patchset:
> * new ring buffer implementation
> * removed artificial limit on possible number of kevents
> With this release and fixed userspace web server it was possible to
> achive 3960+ req/s with client connection rate of 4000 con/s
> over 100 Mbit lan, data IO over network was about 10582.7 KB/s, which
> is too close to wire speed if we get into account headers and the like.

OK, now that ring buffer is here, I definitely like the direction this
code is taking. I just committed the patches to a local repo for a good
in-depth review.

Could you write up a simple text file, documenting (a) your proposed
syscalls and (b) your ring buffer design?


Overall I have a Linux "design wish", that I hope kevent can fulfill:

To develop completely async applications (generally network servers, in
Linux-land) and increase the chance of zero-copy I/O, network and file
I/O submission and completion should be as async as possible.

As such, syscalls themselves have become a serializing bottleneck that
isn't strictly necessary. A fully-async application should be able to
submit file read, file write, and network write requests
asynchronously... in batches. Network reads, and file I/O completions
should be received asynchronously, potentially in batches.

Even with epoll and AIO syscalls, Linux isn't quite up to the task.

So to me, the design of the userspace interface that solves this problem
is a fundamental issue.

My best guess at a solution would be two classes of mmap'd ring buffers,
request and response. Let the app allocate one or more. Then have two
hooks, (a) kick the kernel to read the request ring, and (b) kick the
app when one or more events have arrived on a ring.

But that's just thinking out loud. I welcome any solution that gives
userspace a fully-async submission/completion interface for both network
and file I/O.

Setting the standard for a good interface here means Linux will kick ass
for decades more to come ;-) This is IMO a Big Deal(tm).

Jeff


2006-11-07 11:47:13

by Jeff Garzik

[permalink] [raw]
Subject: Re: [take21 0/4] kevent: Generic event handling mechanism.

As an aside... This may be useful. Or not.

Al Viro had an interesting idea about kernel<->userspace data passing
interfaces. He had suggested creating a task-specific filesystem
derived from ramfs. Through the normal VFS/VM codepaths, the user can
easily create [subject to resource/priv checks] a buffer that is locked
into the pagecache. Using mmap, read, write, whatever they prefer.
Derive from tmpfs, and the buffers are swappable.

Then it would be a simple matter to associate a file stored in
"keventfs" with a ring buffer guaranteed to be pagecache-friendly.

Heck, that might make zero-copy easier in some cases, too. And using a
filesystem would mean that you could do all this without adding
syscalls, by using special (poll-able!) files in the filesystem for
control and notification purposes.

Jeff



2006-11-07 11:59:34

by Evgeniy Polyakov

[permalink] [raw]
Subject: Re: [take21 0/4] kevent: Generic event handling mechanism.

On Tue, Nov 07, 2006 at 06:26:09AM -0500, Jeff Garzik ([email protected]) wrote:
> Evgeniy Polyakov wrote:
> >Generic event handling mechanism.
> >
> >Consider for inclusion.
> >
> >Changes from 'take20' patchset:
> > * new ring buffer implementation
> > * removed artificial limit on possible number of kevents
> >With this release and fixed userspace web server it was possible to
> >achive 3960+ req/s with client connection rate of 4000 con/s
> >over 100 Mbit lan, data IO over network was about 10582.7 KB/s, which
> >is too close to wire speed if we get into account headers and the like.
>
> OK, now that ring buffer is here, I definitely like the direction this
> code is taking. I just committed the patches to a local repo for a good
> in-depth review.

It is the third ring buffer; the fourth one will be in the next release,
which should satisfy everyone.

> Could you write up a simple text file, documenting (a) your proposed
> syscalls and (b) your ring buffer design?

An initial draft about the supported syscalls can be found on the documentation page at
http://linux-net.osdl.org/index.php/Kevent

Ring buffer background bits are pasted below (quotations from my blog; do not
pay too much attention if something is occasionally out of sync).

The new ring buffer is implemented fully in userspace in the process' memory,
which means that no memory is pinned, its size can be almost
arbitrary, and several threads and processes can access it simultaneously.
There is a new system call

int kevent_ring_init(int ctl_fd, struct ring_buffer *ring, unsigned int
num);

which initializes kevent's ring buffer (int ctl_fd is a kevent file
descriptor, struct ring_buffer *ring is a userspace-allocated ring
buffer, and unsigned int num is the maximum number of events (struct
ukevent) which can be placed into that buffer).
Ring buffer is described with following structure:

struct kevent_ring
{
unsigned int ring_kidx, ring_uidx;
struct ukevent event[0];
};

where unsigned int ring_kidx and ring_uidx are the last kernel position (i.e.
the position which points to the first place after the last kevent put by
the kernel into the ring buffer) and the last userspace commit position (i.e.
the position where the first unread kevent lives) respectively.
I will release an appropriate userspace test application when tests are
completed.
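
For illustration, a minimal userspace setup for such a ring might look like the sketch below; the syscall wrapper, the index initialization and the error handling are assumptions, while kevent_ring_init() and struct kevent_ring follow the description above:

static struct kevent_ring *setup_ring(int ctl_fd, unsigned int num)
{
	struct kevent_ring *ring;

	/* the ring lives in ordinary, swappable process memory */
	ring = malloc(sizeof(struct kevent_ring) + num * sizeof(struct ukevent));
	if (!ring)
		return NULL;
	ring->ring_kidx = ring->ring_uidx = 0;	/* initial index values assumed */

	/* a thin wrapper around the new syscall is assumed here */
	if (kevent_ring_init(ctl_fd, ring, num)) {
		free(ring);
		return NULL;
	}

	/* from here on the kernel fills ring->event[] and advances ring_kidx;
	 * ring_uidx marks the position up to which events have been committed */
	return ring;
}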

When kevent is removed (not dequeued when it is ready, but just
removed), even if it was ready, it is not copied into ring buffer, since
if it is removed, no one cares about it (otherwise user would wait until
it becomes ready and got it through usual way using kevent_get_events()
or kevent_wait()) and thus no need to copy it to the ring buffer.
Dequeueing of the kevent (calling kevent_get_events()) means that user
has processed previously dequeued kevent and is ready to process new
one, which means that the position in the ring buffer previously occupied by
that event can be reused by the currently dequeued event. In the world where
only one type of syscalls to get events is used (either usual way and
kevent_get_events() or ring buffer and kevent_wait()) it should not be a
problem, since kevent_wait() only allows to mark number of events as
processed by userspace starting from the beginning (i.e. from the last
processed event), but if several threads use different models, that
can raise some questions: for example, one thread can start to read events
from the ring buffer, and at that time another thread can call
kevent_get_events(), which can rewrite those events. Actually, another
thread can call kevent_wait() to commit those events (i.e. mark them as
processed by userspace so the kernel can free or requeue them), so
appropriate locking is required in userspace in any case.

So I want to repeat that, with the userspace ring buffer, it is possible
that events in the ring buffer can be replaced without the knowledge of the
thread currently reading them (when another thread calls
kevent_get_events() or kevent_wait()), so appropriate locking between
threads or processes which can simultaneously access the same ring
buffer is required.

Having a userspace ring buffer allows glibc to make all kevent syscalls
so-called 'cancellation points', i.e. when a thread has been
cancelled in a kevent syscall, the thread can be safely removed and no events
will be lost, since each syscall will copy events into the special ring
buffer, accessible from other threads or even processes (if shared
memory is used).


>
> Overall I have a Linux "design wish", that I hope kevent can fulfill:
>
> To develop completely async applications (generally network servers, in
> Linux-land) and increase the chance of zero-copy I/O, network and file
> I/O submission and completion should be as async as possible.
>
> As such, syscalls themselves have come a serializing bottleneck that
> isn't strictly necessary. A fully-async application should be able to
> submit file read, file write, and network write requests
> asynchronously... in batches. Network reads, and file I/O completions
> should be received asynchronously, potentially in batches.
>
> Even with epoll and AIO syscalls, Linux isn't quite up to the task.
>
> So to me, the design of the userspace interface that solves this problem
> is a fundamental issue.
>
> My best guess at a solution would be two classes of mmap'd ring buffers,
> request and response. Let the app allocate one or more. Then have two
> hooks, (a) kick the kernel to read the request ring, and (b) kick the
> app when one or more events have arrived on a ring.

The mmap ring buffer implementation was stopped by Andrew Morton and Ulrich
Drepper; the process' memory is used instead. copy_to_user() is slower (and
sometimes noticeably so), but there are major advantages to such an approach.

> But that's just thinking out loud. I welcome any solution that gives
> userspace a fully-async submission/completion interface for both network
> and file I/O.

Well, kevent network and FS AIO are suspended for now (although first
patches included them all).

> Setting the standard for a good interface here means Linux will kick ass
> for decades more to come ;-) This is IMO a Big Deal(tm).
>
> Jeff
>

--
Evgeniy Polyakov

2006-11-07 11:59:13

by Evgeniy Polyakov

[permalink] [raw]
Subject: Re: [take21 0/4] kevent: Generic event handling mechanism.

On Tue, Nov 07, 2006 at 06:46:58AM -0500, Jeff Garzik ([email protected]) wrote:
> As an aside... This may be useful. Or not.
>
> Al Viro had an interesting idea about kernel<->userspace data passing
> interfaces. He had suggested creating a task-specific filesystem
> derived from ramfs. Through the normal VFS/VM codepaths, the user can
> easily create [subject to resource/priv checks] a buffer that is locked
> into the pagecache. Using mmap, read, write, whatever they prefer.
> Derive from tmpfs, and the buffers are swappable.

It looks like Al likes filesystems more than any other part of the
kernel tree...
The existing ring buffer is created in the process' memory, so it is
swappable too (which is probably the most significant part of this
ring buffer version), but in theory the kevent file descriptor could
be obtained not from the char device but from a special filesystem
(it was done that way in the first releases, but then I was asked to
remove that functionality).

> Then it would be a simple matter to associate a file stored in
> "keventfs" with a ring buffer guaranteed to be pagecache-friendly.
>
> Heck, that might make zero-copy easier in some cases, too. And using a
> filesystem would mean that you could do all this without adding
> syscalls, by using special (poll-able!) files in the filesystem for
> control and notification purposes.

There are too many ideas about networking zero-copy, both sending and
receiving, and some of them are even implemented at different layers
(from a special allocator down to splice() with an additional single
allocation/copy).

> Jeff

--
Evgeniy Polyakov

2006-11-07 12:17:10

by Jeff Garzik

[permalink] [raw]
Subject: Re: [take21 0/4] kevent: Generic event handling mechanism.

Evgeniy Polyakov wrote:
> Well, kevent network and FS AIO are suspended for now (although first

Why?

IMO, getting async event submission right is important. It should be
designed in parallel with async event reception.

Jeff


2006-11-07 12:32:33

by Jeff Garzik

[permalink] [raw]
Subject: Re: [take21 0/4] kevent: Generic event handling mechanism.

Evgeniy Polyakov wrote:
> The mmap ring buffer implementation was stopped by Andrew Morton and
> Ulrich Drepper; the process' own memory is used instead. copy_to_user()
> is slower (sometimes noticeably), but there are major advantages to
> such an approach.


hmmmm. I say there are advantages to both.

Perhaps create a "kevent_direct_limit" resource limit for each thread.
By default, each thread could mmap $n pinned pagecache pages. Sysadmin
can tune certain app resource limits to permit more.

I would think that retaining the option to avoid copy_to_user()
-somehow- in -some- cases would be wise.

Jeff


2006-11-07 12:33:44

by Evgeniy Polyakov

[permalink] [raw]
Subject: Re: [take21 0/4] kevent: Generic event handling mechanism.

On Tue, Nov 07, 2006 at 07:17:03AM -0500, Jeff Garzik ([email protected]) wrote:
> Evgeniy Polyakov wrote:
> >Well, kevent network and FS AIO are suspended for now (although first
>
> Why?
>
> IMO, getting async event submission right is important. It should be
> designed in parallel with async event reception.

It was not only designed but also implemented, but...

FS AIO was confirmed to have a correct design, but there were some
minor (from my point of view) layering problems
(it was almost suggested that I get a lobotomy after I put a
get_block() callback into address_space_operations; there was also
some duplication of mpage_readpages() code, done asynchronously, in
kevent/kevent_aio.c - I did that to keep kevent as separate as
possible, and both changes could live in fs/ with the appropriate
callbacks exported).

Network AIO I postponed for a while; looking at how hard it is to get
core changes accepted, that seems like the better decision...
Using Ulrich's DMA allocation API (if it existed as more than a
proposal) it would be possible to speed up NAIO a bit more too.

Kevent based FS AIO patch can be found for example here (it contains
full kevent subsystem with network aio and fs aio):
http://tservice.net.ru/~s0mbre/archive/kevent/kevent_full.diff.3

Network aio homepage:
http://tservice.net.ru/~s0mbre/old/?section=projects&item=naio

> Jeff
>

--
Evgeniy Polyakov

2006-11-07 19:35:34

by Andrew Morton

[permalink] [raw]
Subject: Re: [take21 0/4] kevent: Generic event handling mechanism.

On Tue, 07 Nov 2006 07:32:20 -0500
Jeff Garzik <[email protected]> wrote:

> Evgeniy Polyakov wrote:
> > The mmap ring buffer implementation was stopped by Andrew Morton and
> > Ulrich Drepper; the process' own memory is used instead. copy_to_user()
> > is slower (sometimes noticeably), but there are major advantages to
> > such an approach.
>
>
> hmmmm. I say there are advantages to both.

My problem with the old mmapped ringbuffer was that it permitted each user
to pin (typically) 48MB of unswappable memory. Plus this pinned-memory
problem would put upper bounds on the ring size.

> Perhaps create a "kevent_direct_limit" resource limit for each thread.
> By default, each thread could mmap $n pinned pagecache pages. Sysadmin
> can tune certain app resource limits to permit more.
>
> I would think that retaining the option to avoid copy_to_user()
> -somehow- in -some- cases would be wise.

What Evgeniy means here is that copy_to_user() is slower than memcpy() (on
his machine, with his kernel config, at least).

Which is kinda weird and unexpected and is something which we should
investigate independently from this project. (Rather than simply going
and bypassing it!)

2006-11-07 20:52:42

by David Miller

[permalink] [raw]
Subject: Re: [take21 0/4] kevent: Generic event handling mechanism.

From: Andrew Morton <[email protected]>
Date: Tue, 7 Nov 2006 11:34:00 -0800

> What Evgeniy means here is that copy_to_user() is slower than memcpy() (on
> his machine, with his kernel config, at least).
>
> Which is kinda weird and unexpected and is something which we should
> investigate independently from this project. (Rather than simply going
> and bypassing it!)

It's straightforward to me. :-)

If the kernel memcpy()'s, it uses those nice 4MB PTE mappings to
the kernel pages. With copy_to_user() you run through tiny
4K or 8K PTE mappings, which thrash the TLB.

The TLB is therefore able to hold more of the accessed state at
a time if you touch the pages on the kernel side.
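
(As a rough illustration with assumed sizes, not a measurement: a 4MB
copy through 4K user PTEs can need up to 4MB / 4KB = 1024 TLB entries
for the destination alone, while the same region behind a single 4MB
kernel mapping takes one entry.)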

2006-11-07 21:39:13

by Andrew Morton

[permalink] [raw]
Subject: Re: [take21 0/4] kevent: Generic event handling mechanism.

On Tue, 07 Nov 2006 12:52:41 -0800 (PST)
David Miller <[email protected]> wrote:

> From: Andrew Morton <[email protected]>
> Date: Tue, 7 Nov 2006 11:34:00 -0800
>
> > What Evgeniy means here is that copy_to_user() is slower than memcpy() (on
> > his machine, with his kernel config, at least).
> >
> > Which is kinda weird and unexpected and is something which we should
> > investigate independently from this project. (Rather than simply going
> > and bypassing it!)
>
> It's straightforward to me. :-)
>
> If the kerne memcpy()'s, it uses those nice 4MB PTE mappings to
> the kernel pages. With copy_to_user() you run through tiny
> 4K or 8K PTE mappings which thrash the TLB.
>
> The TLB is therefore able to hold more of the accessed state at
> a time if you touch the pages on the kernel side.

Maybe. Evgeniy tends to favour teeny microbenchmarks. I'd also be
suspecting the considerable setup code in the x86 uaccess functions. That
would show up in a tight loop doing large numbers of small copies.