Generic event handling mechanism.
Kevent is a generic subsystem for handling event notifications.
It supports both level-triggered and edge-triggered events. It is similar
to poll/epoll in some cases, but it is more scalable, faster, and
works with essentially any kind of event.
Events are passed to the kernel through a control syscall and can be read
back through a ring buffer or with the usual syscalls.
Kevent updates (i.e. readiness switching) happen directly from the internals
of the appropriate state machine of the underlying subsystem (such as
network, filesystem, timer or any other).
Homepage:
http://tservice.net.ru/~s0mbre/old/?section=projects&item=kevent
Documentation page:
http://linux-net.osdl.org/index.php/Kevent
I installed slightly used, but still functional (bought on ebay) remote
mind reader, and set it up to read Ulrich's alpha brain waves (I hope he
agrees that it is a good decision), which took me the whole week.
So I think the last ring buffer implementation is what we all wanted.
Details in documentation part.
Changes from 'take24' patchset:
* new (old (new)) ring buffer implementation with kernel and user indexes.
* added initialization syscall instead of opening /dev/kevent
* kevent_commit() syscall to commit ring buffer entries
* changed KEVENT_REQ_WAKEUP_ONE flag to KEVENT_REQ_WAKEUP_ALL; kevent always
wakes only the first thread if that flag is not set
* KEVENT_REQ_ALWAYS_QUEUE flag. If set, a kevent that is already ready when
it is added is queued into the ready queue instead of being copied back
to userspace.
* lighttpd patch (Hail! Although nothing really outstanding compared to epoll)
Changes from 'take23' patchset:
* kevent PIPE notifications
* KEVENT_REQ_LAST_CHECK flag, which requests a last readiness check at dequeueing time
* fixed poll/select notifications (were broken due to tree manipulations)
* made Documentation/kevent.txt look nice in 80-col terminal
* fix for copy_to_user() failure report for the first kevent (Andrew Morton)
* minor function renames
Changes from 'take22' patchset:
* new ring buffer implementation in process' memory
* wakeup-one-thread flag
* edge-triggered behaviour
Changes from 'take21' patchset:
* minor cleanups (different return values, removed unneeded variables, whitespace and so on)
* fixed bug in kevent removal in case when kevent being removed
is the same as overflow_kevent (spotted by Eric Dumazet)
Changes from 'take20' patchset:
* new ring buffer implementation
* removed artificial limit on possible number of kevents
Changes from 'take19' patchset:
* use __init instead of __devinit
* removed 'default N' from config for user statistic
* removed kevent_user_fini() since kevent can not be unloaded
* use KERN_INFO for statistic output
Changes from 'take18' patchset:
* use __init instead of __devinit
* removed 'default N' from config for user statistic
* removed kevent_user_fini() since kevent can not be unloaded
* use KERN_INFO for statistic output
Changes from 'take17' patchset:
* Use RB tree instead of hash table.
At least for a web server, the frequency of kevent addition/deletion
is comparable to the number of searches, i.e. most of the time events
are added, accessed only a couple of times and then removed, which justifies
using an RB tree over an AVL tree, since the latter has much slower deletion
(up to O(log(N)) rotations compared to at most 3),
although faster search (1.44*O(log(N)) vs. 2*O(log(N))).
So for kevents I use an RB tree for now, and later, when my AVL tree implementation
is ready, it will be possible to compare them.
* Changed readiness check for socket notifications.
With both of the above changes it is possible to achieve more than 3380 req/second, compared to 2200
(sometimes 2500) req/second for epoll(), for a trivial web server and httperf client on the same
hardware.
It is possible that the above kevent limit is due to the maximum number of kevents allowed at a time, which is
4096 events.
Changes from 'take16' patchset:
* misc cleanups (__read_mostly, const ...)
* created special macro which is used for mmap size (number of pages) calculation
* export kevent_socket_notify(), since it is used in network protocols which can be
built as modules (IPv6 for example)
Changes from 'take15' patchset:
* converted kevent_timer to high-resolution timers, this forces timer API update at
http://linux-net.osdl.org/index.php/Kevent
* use struct ukevent* instead of void * in syscalls (documentation has been updated)
* added warning in kevent_add_ukevent() if ring has broken index (for testing)
Changes from 'take14' patchset:
* added kevent_wait()
This syscall waits until either the timeout expires or at least one event
becomes ready. It also commits that @num events from @start have been processed
by userspace and thus can be removed or rearmed (depending on their flags).
It can be used to commit events read by userspace through the mmap interface.
Example userspace code (evtest.c) can be found on the project's homepage.
* added socket notifications (send/recv/accept)
Changes from 'take13' patchset:
* do not take the lock around user data check in __kevent_search()
* fail early if there were no registered callbacks for given type of kevent
* trailing whitespace cleanup
Changes from 'take12' patchset:
* remove non-chardev interface for initialization
* use pointer to kevent_mring instead of unsigned longs
* use aligned 64bit type in raw user data (can be used by high-res timer if needed)
* simplified enqueue/dequeue callbacks and kevent initialization
* use nanoseconds for timeout
* put number of milliseconds into timer's return data
* move some definitions into user-visible header
* removed filenames from comments
Changes from 'take11' patchset:
* include missing headers into patchset
* some trivial code cleanups (use goto instead of if/else games and so on)
* some whitespace cleanups
* check for ready_callback() callback before main loop which should save us some ticks
Changes from 'take10' patchset:
* removed non-existent prototypes
* added helper function for kevent_registered_callbacks
* fixed 80 lines comments issues
* added a header shared between userspace and kernelspace instead of embedding them in one
* core restructuring to remove forward declarations
* s o m e w h i t e s p a c e c o d y n g s t y l e c l e a n u p
* use vm_insert_page() instead of remap_pfn_range()
Changes from 'take9' patchset:
* fixed ->nopage method
Changes from 'take8' patchset:
* fixed mmap release bug
* use module_init() instead of late_initcall()
* use better structures for timer notifications
Changes from 'take7' patchset:
* new mmap interface (not tested, waiting for other changes to be acked)
- use nopage() method to dynamically substitute pages
- allocate new page for events only when a newly added kevent requires it
- do not use ugly index dereferencing, use structure instead
- reduced amount of data in the ring (id and flags),
maximum 12 pages on x86 per kevent fd
Changes from 'take6' patchset:
* a lot of comments!
* do not use list poisoning to detect that an entry is in the list
* return number of ready kevents even if copy*user() fails
* strict check for number of kevents in syscall
* use ARRAY_SIZE for array size calculation
* changed superblock magic number
* use SLAB_PANIC instead of direct panic() call
* changed -E* return values
* a lot of small cleanups and indent fixes
Changes from 'take5' patchset:
* removed compilation warnings about unused variables when lockdep is not turned on
* do not use internal socket structures, use appropriate (exported) wrappers instead
* removed default 1 second timeout
* removed AIO stuff from patchset
Changes from 'take4' patchset:
* use miscdevice instead of chardevice
* comments fixes
Changes from 'take3' patchset:
* removed serializing mutex from kevent_user_wait()
* moved storage list processing to RCU
* removed lockdep screaming - all storage locks are initialized in the same function, so it was
taught to differentiate between various cases
* remove kevent from storage if it is marked as broken after callback
* fixed a typo in mmaped buffer implementation which would end up in wrong index calculation
Changes from 'take2' patchset:
* split kevent_finish_user() to locked and unlocked variants
* do not use KEVENT_STAT ifdefs, use inline functions instead
* use array of callbacks of each type instead of each kevent callback initialization
* changed name of ukevent guarding lock
* use only one kevent lock in kevent_user for all hash buckets instead of per-bucket locks
* do not use kevent_user_ctl structure, provide the needed arguments as syscall parameters instead
* various indent cleanups
* added optimisation, which is aimed to help when a lot of kevents are being copied from
userspace
* mapped buffer (initial) implementation (no userspace yet)
Changes from 'take1' patchset:
- rebased against 2.6.18-git tree
- removed ioctl controlling
- added new syscall kevent_get_events(int fd, unsigned int min_nr, unsigned int max_nr,
unsigned int timeout, void __user *buf, unsigned flags)
- use old syscall kevent_ctl for creation/removing, modification and initial kevent
initialization
- use mutexes instead of semaphores
- added file descriptor check and return error if provided descriptor does not match
kevent file operations
- various indent fixes
- removed aio_sendfile() declarations.
Thank you.
Signed-off-by: Evgeniy Polyakov <[email protected]>
poll/select() notifications.
This patch includes generic poll/select notifications.
kevent_poll works similarly to epoll and has the same issues (the callback
is invoked not from the internal state machine of the caller but through
process wakeup, there are a lot of allocations, and so on).
Signed-off-by: Evgeniy Polyakov <[email protected]>
diff --git a/fs/file_table.c b/fs/file_table.c
index bc35a40..0805547 100644
--- a/fs/file_table.c
+++ b/fs/file_table.c
@@ -20,6 +20,7 @@
#include <linux/cdev.h>
#include <linux/fsnotify.h>
#include <linux/sysctl.h>
+#include <linux/kevent.h>
#include <linux/percpu_counter.h>
#include <asm/atomic.h>
@@ -119,6 +120,7 @@ struct file *get_empty_filp(void)
f->f_uid = tsk->fsuid;
f->f_gid = tsk->fsgid;
eventpoll_init_file(f);
+ kevent_init_file(f);
/* f->f_version: 0 */
return f;
@@ -164,6 +166,7 @@ void fastcall __fput(struct file *file)
* in the file cleanup chain.
*/
eventpoll_release(file);
+ kevent_cleanup_file(file);
locks_remove_flock(file);
if (file->f_op && file->f_op->release)
diff --git a/fs/inode.c b/fs/inode.c
index ada7643..2740617 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -21,6 +21,7 @@
#include <linux/cdev.h>
#include <linux/bootmem.h>
#include <linux/inotify.h>
+#include <linux/kevent.h>
#include <linux/mount.h>
/*
@@ -164,12 +165,18 @@ static struct inode *alloc_inode(struct
}
inode->i_private = 0;
inode->i_mapping = mapping;
+#if defined CONFIG_KEVENT_SOCKET || defined CONFIG_KEVENT_PIPE
+ kevent_storage_init(inode, &inode->st);
+#endif
}
return inode;
}
void destroy_inode(struct inode *inode)
{
+#if defined CONFIG_KEVENT_SOCKET || defined CONFIG_KEVENT_PIPE
+ kevent_storage_fini(&inode->st);
+#endif
BUG_ON(inode_has_buffers(inode));
security_inode_free(inode);
if (inode->i_sb->s_op->destroy_inode)
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 5baf3a1..8bbf3a5 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -276,6 +276,7 @@ extern int dir_notify_enable;
#include <linux/init.h>
#include <linux/sched.h>
#include <linux/mutex.h>
+#include <linux/kevent_storage.h>
#include <asm/atomic.h>
#include <asm/semaphore.h>
@@ -586,6 +587,10 @@ struct inode {
struct mutex inotify_mutex; /* protects the watches list */
#endif
+#if defined CONFIG_KEVENT_SOCKET || defined CONFIG_KEVENT_PIPE
+ struct kevent_storage st;
+#endif
+
unsigned long i_state;
unsigned long dirtied_when; /* jiffies of first dirtying */
@@ -739,6 +744,9 @@ struct file {
struct list_head f_ep_links;
spinlock_t f_ep_lock;
#endif /* #ifdef CONFIG_EPOLL */
+#ifdef CONFIG_KEVENT_POLL
+ struct kevent_storage st;
+#endif
struct address_space *f_mapping;
};
extern spinlock_t files_lock;
diff --git a/kernel/kevent/kevent_poll.c b/kernel/kevent/kevent_poll.c
new file mode 100644
index 0000000..11dbe25
--- /dev/null
+++ b/kernel/kevent/kevent_poll.c
@@ -0,0 +1,232 @@
+/*
+ * 2006 Copyright (c) Evgeniy Polyakov <[email protected]>
+ * All rights reserved.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ */
+
+#include <linux/kernel.h>
+#include <linux/types.h>
+#include <linux/list.h>
+#include <linux/slab.h>
+#include <linux/spinlock.h>
+#include <linux/timer.h>
+#include <linux/file.h>
+#include <linux/kevent.h>
+#include <linux/poll.h>
+#include <linux/fs.h>
+
+static kmem_cache_t *kevent_poll_container_cache;
+static kmem_cache_t *kevent_poll_priv_cache;
+
+struct kevent_poll_ctl
+{
+ struct poll_table_struct pt;
+ struct kevent *k;
+};
+
+struct kevent_poll_wait_container
+{
+ struct list_head container_entry;
+ wait_queue_head_t *whead;
+ wait_queue_t wait;
+ struct kevent *k;
+};
+
+struct kevent_poll_private
+{
+ struct list_head container_list;
+ spinlock_t container_lock;
+};
+
+static int kevent_poll_enqueue(struct kevent *k);
+static int kevent_poll_dequeue(struct kevent *k);
+static int kevent_poll_callback(struct kevent *k);
+
+static int kevent_poll_wait_callback(wait_queue_t *wait,
+ unsigned mode, int sync, void *key)
+{
+ struct kevent_poll_wait_container *cont =
+ container_of(wait, struct kevent_poll_wait_container, wait);
+ struct kevent *k = cont->k;
+
+ kevent_storage_ready(k->st, NULL, KEVENT_MASK_ALL);
+ return 0;
+}
+
+static void kevent_poll_qproc(struct file *file, wait_queue_head_t *whead,
+ struct poll_table_struct *poll_table)
+{
+ struct kevent *k =
+ container_of(poll_table, struct kevent_poll_ctl, pt)->k;
+ struct kevent_poll_private *priv = k->priv;
+ struct kevent_poll_wait_container *cont;
+ unsigned long flags;
+
+ cont = kmem_cache_alloc(kevent_poll_container_cache, GFP_KERNEL);
+ if (!cont) {
+ kevent_break(k);
+ return;
+ }
+
+ cont->k = k;
+ init_waitqueue_func_entry(&cont->wait, kevent_poll_wait_callback);
+ cont->whead = whead;
+
+ spin_lock_irqsave(&priv->container_lock, flags);
+ list_add_tail(&cont->container_entry, &priv->container_list);
+ spin_unlock_irqrestore(&priv->container_lock, flags);
+
+ add_wait_queue(whead, &cont->wait);
+}
+
+static int kevent_poll_enqueue(struct kevent *k)
+{
+ struct file *file;
+ int err;
+ unsigned int revents;
+ unsigned long flags;
+ struct kevent_poll_ctl ctl;
+ struct kevent_poll_private *priv;
+
+ file = fget(k->event.id.raw[0]);
+ if (!file)
+ return -EBADF;
+
+ err = -EINVAL;
+ if (!file->f_op || !file->f_op->poll)
+ goto err_out_fput;
+
+ err = -ENOMEM;
+ priv = kmem_cache_alloc(kevent_poll_priv_cache, GFP_KERNEL);
+ if (!priv)
+ goto err_out_fput;
+
+ spin_lock_init(&priv->container_lock);
+ INIT_LIST_HEAD(&priv->container_list);
+
+ k->priv = priv;
+
+ ctl.k = k;
+ init_poll_funcptr(&ctl.pt, &kevent_poll_qproc);
+
+ err = kevent_storage_enqueue(&file->st, k);
+ if (err)
+ goto err_out_free;
+
+ if (k->event.req_flags & KEVENT_REQ_ALWAYS_QUEUE) {
+ kevent_requeue(k);
+ } else {
+ revents = file->f_op->poll(file, &ctl.pt);
+ if (revents & k->event.event) {
+ err = 1;
+ goto out_dequeue;
+ }
+ }
+
+ spin_lock_irqsave(&k->ulock, flags);
+ k->event.req_flags |= KEVENT_REQ_LAST_CHECK;
+ spin_unlock_irqrestore(&k->ulock, flags);
+
+ return 0;
+
+out_dequeue:
+ kevent_storage_dequeue(k->st, k);
+err_out_free:
+ kmem_cache_free(kevent_poll_priv_cache, priv);
+err_out_fput:
+ fput(file);
+ return err;
+}
+
+static int kevent_poll_dequeue(struct kevent *k)
+{
+ struct file *file = k->st->origin;
+ struct kevent_poll_private *priv = k->priv;
+ struct kevent_poll_wait_container *w, *n;
+ unsigned long flags;
+
+ kevent_storage_dequeue(k->st, k);
+
+ spin_lock_irqsave(&priv->container_lock, flags);
+ list_for_each_entry_safe(w, n, &priv->container_list, container_entry) {
+ list_del(&w->container_entry);
+ remove_wait_queue(w->whead, &w->wait);
+ kmem_cache_free(kevent_poll_container_cache, w);
+ }
+ spin_unlock_irqrestore(&priv->container_lock, flags);
+
+ kmem_cache_free(kevent_poll_priv_cache, priv);
+ k->priv = NULL;
+
+ fput(file);
+
+ return 0;
+}
+
+static int kevent_poll_callback(struct kevent *k)
+{
+ if (k->event.req_flags & KEVENT_REQ_LAST_CHECK) {
+ return 1;
+ } else {
+ struct file *file = k->st->origin;
+ unsigned int revents = file->f_op->poll(file, NULL);
+
+ k->event.ret_data[0] = revents & k->event.event;
+
+ return (revents & k->event.event);
+ }
+}
+
+static int __init kevent_poll_sys_init(void)
+{
+ struct kevent_callbacks pc = {
+ .callback = &kevent_poll_callback,
+ .enqueue = &kevent_poll_enqueue,
+ .dequeue = &kevent_poll_dequeue};
+
+ kevent_poll_container_cache = kmem_cache_create("kevent_poll_container_cache",
+ sizeof(struct kevent_poll_wait_container), 0, 0, NULL, NULL);
+ if (!kevent_poll_container_cache) {
+ printk(KERN_ERR "Failed to create kevent poll container cache.\n");
+ return -ENOMEM;
+ }
+
+ kevent_poll_priv_cache = kmem_cache_create("kevent_poll_priv_cache",
+ sizeof(struct kevent_poll_private), 0, 0, NULL, NULL);
+ if (!kevent_poll_priv_cache) {
+ printk(KERN_ERR "Failed to create kevent poll private data cache.\n");
+ kmem_cache_destroy(kevent_poll_container_cache);
+ kevent_poll_container_cache = NULL;
+ return -ENOMEM;
+ }
+
+ kevent_add_callbacks(&pc, KEVENT_POLL);
+
+ printk(KERN_INFO "Kevent poll()/select() subsystem has been initialized.\n");
+ return 0;
+}
+
+static struct lock_class_key kevent_poll_key;
+
+void kevent_poll_reinit(struct file *file)
+{
+ lockdep_set_class(&file->st.lock, &kevent_poll_key);
+}
+
+static void __exit kevent_poll_sys_fini(void)
+{
+ kmem_cache_destroy(kevent_poll_priv_cache);
+ kmem_cache_destroy(kevent_poll_container_cache);
+}
+
+module_init(kevent_poll_sys_init);
+module_exit(kevent_poll_sys_fini);
Core files.
This patch includes core kevent files:
* userspace controlling
* kernelspace interfaces
* initialization
* notification state machines
Some bits of documentation can be found on project's homepage (and links from there):
http://tservice.net.ru/~s0mbre/old/?section=projects&item=kevent
Signed-off-by: Evgeniy Polyakov <[email protected]>
diff --git a/arch/i386/kernel/syscall_table.S b/arch/i386/kernel/syscall_table.S
index 7e639f7..a6221c2 100644
--- a/arch/i386/kernel/syscall_table.S
+++ b/arch/i386/kernel/syscall_table.S
@@ -318,3 +318,8 @@ ENTRY(sys_call_table)
.long sys_vmsplice
.long sys_move_pages
.long sys_getcpu
+ .long sys_kevent_get_events
+ .long sys_kevent_ctl /* 320 */
+ .long sys_kevent_wait
+ .long sys_kevent_commit
+ .long sys_kevent_init
diff --git a/arch/x86_64/ia32/ia32entry.S b/arch/x86_64/ia32/ia32entry.S
index b4aa875..dda2168 100644
--- a/arch/x86_64/ia32/ia32entry.S
+++ b/arch/x86_64/ia32/ia32entry.S
@@ -714,8 +714,13 @@ ia32_sys_call_table:
.quad compat_sys_get_robust_list
.quad sys_splice
.quad sys_sync_file_range
- .quad sys_tee
+ .quad sys_tee /* 315 */
.quad compat_sys_vmsplice
.quad compat_sys_move_pages
.quad sys_getcpu
+ .quad sys_kevent_get_events
+ .quad sys_kevent_ctl /* 320 */
+ .quad sys_kevent_wait
+ .quad sys_kevent_commit
+ .quad sys_kevent_init
ia32_syscall_end:
diff --git a/include/asm-i386/unistd.h b/include/asm-i386/unistd.h
index bd99870..57a6b8c 100644
--- a/include/asm-i386/unistd.h
+++ b/include/asm-i386/unistd.h
@@ -324,10 +324,15 @@
#define __NR_vmsplice 316
#define __NR_move_pages 317
#define __NR_getcpu 318
+#define __NR_kevent_get_events 319
+#define __NR_kevent_ctl 320
+#define __NR_kevent_wait 321
+#define __NR_kevent_commit 322
+#define __NR_kevent_init 323
#ifdef __KERNEL__
-#define NR_syscalls 319
+#define NR_syscalls 324
#include <linux/err.h>
/*
diff --git a/include/asm-x86_64/unistd.h b/include/asm-x86_64/unistd.h
index 6137146..17d750d 100644
--- a/include/asm-x86_64/unistd.h
+++ b/include/asm-x86_64/unistd.h
@@ -619,10 +619,20 @@ __SYSCALL(__NR_sync_file_range, sys_sync
__SYSCALL(__NR_vmsplice, sys_vmsplice)
#define __NR_move_pages 279
__SYSCALL(__NR_move_pages, sys_move_pages)
+#define __NR_kevent_get_events 280
+__SYSCALL(__NR_kevent_get_events, sys_kevent_get_events)
+#define __NR_kevent_ctl 281
+__SYSCALL(__NR_kevent_ctl, sys_kevent_ctl)
+#define __NR_kevent_wait 282
+__SYSCALL(__NR_kevent_wait, sys_kevent_wait)
+#define __NR_kevent_commit 283
+__SYSCALL(__NR_kevent_commit, sys_kevent_commit)
+#define __NR_kevent_init 284
+__SYSCALL(__NR_kevent_init, sys_kevent_init)
#ifdef __KERNEL__
-#define __NR_syscall_max __NR_move_pages
+#define __NR_syscall_max __NR_kevent_init
#include <linux/err.h>
#ifndef __NO_STUBS
diff --git a/include/linux/kevent.h b/include/linux/kevent.h
new file mode 100644
index 0000000..c909c62
--- /dev/null
+++ b/include/linux/kevent.h
@@ -0,0 +1,230 @@
+/*
+ * 2006 Copyright (c) Evgeniy Polyakov <[email protected]>
+ * All rights reserved.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software
+ * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA
+ */
+
+#ifndef __KEVENT_H
+#define __KEVENT_H
+#include <linux/types.h>
+#include <linux/list.h>
+#include <linux/rbtree.h>
+#include <linux/spinlock.h>
+#include <linux/mutex.h>
+#include <linux/wait.h>
+#include <linux/net.h>
+#include <linux/rcupdate.h>
+#include <linux/fs.h>
+#include <linux/sched.h>
+#include <linux/kevent_storage.h>
+#include <linux/ukevent.h>
+
+#define KEVENT_MIN_BUFFS_ALLOC 3
+
+struct kevent;
+struct kevent_storage;
+typedef int (* kevent_callback_t)(struct kevent *);
+
+/* @callback is called each time new event has been caught. */
+/* @enqueue is called each time new event is queued. */
+/* @dequeue is called each time event is dequeued. */
+
+struct kevent_callbacks {
+ kevent_callback_t callback, enqueue, dequeue;
+};
+
+#define KEVENT_READY 0x1
+#define KEVENT_STORAGE 0x2
+#define KEVENT_USER 0x4
+
+struct kevent
+{
+ /* Used for kevent freeing.*/
+ struct rcu_head rcu_head;
+ struct ukevent event;
+ /* This lock protects ukevent manipulations, e.g. ret_flags changes. */
+ spinlock_t ulock;
+
+ /* Entry of user's tree. */
+ struct rb_node kevent_node;
+ /* Entry of origin's queue. */
+ struct list_head storage_entry;
+ /* Entry of user's ready. */
+ struct list_head ready_entry;
+
+ u32 flags;
+
+ /* User who requested this kevent. */
+ struct kevent_user *user;
+ /* Kevent container. */
+ struct kevent_storage *st;
+
+ struct kevent_callbacks callbacks;
+
+ /* Private data for different storages.
+ * poll()/select storage has a list of wait_queue_t containers
+ * for each ->poll() { poll_wait()' } here.
+ */
+ void *priv;
+};
+
+struct kevent_user
+{
+ struct rb_root kevent_root;
+ spinlock_t kevent_lock;
+ /* Number of queued kevents. */
+ unsigned int kevent_num;
+
+ /* List of ready kevents. */
+ struct list_head ready_list;
+ /* Number of ready kevents. */
+ unsigned int ready_num;
+ /* Protects all manipulations with ready queue. */
+ spinlock_t ready_lock;
+
+ /* Protects against simultaneous kevent_user control manipulations. */
+ struct mutex ctl_mutex;
+ /* Wait until some events are ready. */
+ wait_queue_head_t wait;
+
+ /* Reference counter, increased for each new kevent. */
+ atomic_t refcnt;
+
+ /* Mutex protecting userspace ring buffer. */
+ struct mutex ring_lock;
+ /* Kernel index and size of the userspace ring buffer. */
+ unsigned int kidx, uidx, ring_size, ring_over, full;
+ /* Pointer to userspace ring buffer. */
+ struct kevent_ring __user *pring;
+
+#ifdef CONFIG_KEVENT_USER_STAT
+ unsigned long im_num;
+ unsigned long wait_num, ring_num;
+ unsigned long total;
+#endif
+};
+
+int kevent_enqueue(struct kevent *k);
+int kevent_dequeue(struct kevent *k);
+int kevent_init(struct kevent *k);
+void kevent_requeue(struct kevent *k);
+int kevent_break(struct kevent *k);
+
+int kevent_add_callbacks(const struct kevent_callbacks *cb, int pos);
+
+void kevent_storage_ready(struct kevent_storage *st,
+ kevent_callback_t ready_callback, u32 event);
+int kevent_storage_init(void *origin, struct kevent_storage *st);
+void kevent_storage_fini(struct kevent_storage *st);
+int kevent_storage_enqueue(struct kevent_storage *st, struct kevent *k);
+void kevent_storage_dequeue(struct kevent_storage *st, struct kevent *k);
+
+int kevent_user_add_ukevent(struct ukevent *uk, struct kevent_user *u);
+
+#ifdef CONFIG_KEVENT_POLL
+void kevent_poll_reinit(struct file *file);
+#else
+static inline void kevent_poll_reinit(struct file *file)
+{
+}
+#endif
+
+#ifdef CONFIG_KEVENT_USER_STAT
+static inline void kevent_stat_init(struct kevent_user *u)
+{
+ u->wait_num = u->ring_num = u->im_num = u->total = 0;
+}
+static inline void kevent_stat_print(struct kevent_user *u)
+{
+ printk(KERN_INFO "%s: u: %p, wait: %lu, ring: %lu, immediately: %lu, total: %lu.\n",
+ __func__, u, u->wait_num, u->ring_num, u->im_num, u->total);
+}
+static inline void kevent_stat_im(struct kevent_user *u)
+{
+ u->im_num++;
+}
+static inline void kevent_stat_ring(struct kevent_user *u)
+{
+ u->ring_num++;
+}
+static inline void kevent_stat_wait(struct kevent_user *u)
+{
+ u->wait_num++;
+}
+static inline void kevent_stat_total(struct kevent_user *u)
+{
+ u->total++;
+}
+#else
+#define kevent_stat_print(u) ({ (void) u;})
+#define kevent_stat_init(u) ({ (void) u;})
+#define kevent_stat_im(u) ({ (void) u;})
+#define kevent_stat_wait(u) ({ (void) u;})
+#define kevent_stat_ring(u) ({ (void) u;})
+#define kevent_stat_total(u) ({ (void) u;})
+#endif
+
+#ifdef CONFIG_LOCKDEP
+void kevent_socket_reinit(struct socket *sock);
+void kevent_sk_reinit(struct sock *sk);
+#else
+static inline void kevent_socket_reinit(struct socket *sock)
+{
+}
+static inline void kevent_sk_reinit(struct sock *sk)
+{
+}
+#endif
+#ifdef CONFIG_KEVENT_SOCKET
+void kevent_socket_notify(struct sock *sock, u32 event);
+int kevent_socket_dequeue(struct kevent *k);
+int kevent_socket_enqueue(struct kevent *k);
+#define sock_async(__sk) sock_flag(__sk, SOCK_ASYNC)
+#else
+static inline void kevent_socket_notify(struct sock *sock, u32 event)
+{
+}
+#define sock_async(__sk) ({ (void)__sk; 0; })
+#endif
+
+#ifdef CONFIG_KEVENT_POLL
+static inline void kevent_init_file(struct file *file)
+{
+ kevent_storage_init(file, &file->st);
+}
+
+static inline void kevent_cleanup_file(struct file *file)
+{
+ kevent_storage_fini(&file->st);
+}
+#else
+static inline void kevent_init_file(struct file *file) {}
+static inline void kevent_cleanup_file(struct file *file) {}
+#endif
+
+#ifdef CONFIG_KEVENT_PIPE
+extern void kevent_pipe_notify(struct inode *inode, u32 events);
+#else
+static inline void kevent_pipe_notify(struct inode *inode, u32 events) {}
+#endif
+
+#ifdef CONFIG_KEVENT_SIGNAL
+extern int kevent_signal_notify(struct task_struct *tsk, int sig);
+#else
+static inline int kevent_signal_notify(struct task_struct *tsk, int sig) {return 0;}
+#endif
+
+#endif /* __KEVENT_H */
diff --git a/include/linux/kevent_storage.h b/include/linux/kevent_storage.h
new file mode 100644
index 0000000..a38575d
--- /dev/null
+++ b/include/linux/kevent_storage.h
@@ -0,0 +1,11 @@
+#ifndef __KEVENT_STORAGE_H
+#define __KEVENT_STORAGE_H
+
+struct kevent_storage
+{
+ void *origin; /* Originator's pointer, e.g. struct sock or struct file. Can be NULL. */
+ struct list_head list; /* List of queued kevents. */
+ spinlock_t lock; /* Protects users queue. */
+};
+
+#endif /* __KEVENT_STORAGE_H */
diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
index 2d1c3d5..1317a18 100644
--- a/include/linux/syscalls.h
+++ b/include/linux/syscalls.h
@@ -54,6 +54,8 @@ struct compat_stat;
struct compat_timeval;
struct robust_list_head;
struct getcpu_cache;
+struct ukevent;
+struct kevent_ring;
#include <linux/types.h>
#include <linux/aio_abi.h>
@@ -599,4 +601,10 @@ asmlinkage long sys_set_robust_list(stru
size_t len);
asmlinkage long sys_getcpu(unsigned __user *cpu, unsigned __user *node, struct getcpu_cache __user *cache);
+asmlinkage long sys_kevent_get_events(int ctl_fd, unsigned int min, unsigned int max,
+ __u64 timeout, struct ukevent __user *buf, unsigned flags);
+asmlinkage long sys_kevent_ctl(int ctl_fd, unsigned int cmd, unsigned int num, struct ukevent __user *buf);
+asmlinkage long sys_kevent_wait(int ctl_fd, unsigned int num, __u64 timeout);
+asmlinkage long sys_kevent_commit(int ctl_fd, unsigned int start, unsigned int num, unsigned int over);
+asmlinkage long sys_kevent_init(int ctl_fd, struct kevent_ring __user *ring, unsigned int num);
#endif
diff --git a/include/linux/ukevent.h b/include/linux/ukevent.h
new file mode 100644
index 0000000..0680fdf
--- /dev/null
+++ b/include/linux/ukevent.h
@@ -0,0 +1,178 @@
+/*
+ * 2006 Copyright (c) Evgeniy Polyakov <[email protected]>
+ * All rights reserved.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software
+ * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA
+ */
+
+#ifndef __UKEVENT_H
+#define __UKEVENT_H
+
+#include <linux/types.h>
+
+/*
+ * Kevent request flags.
+ */
+
+/* Process this event only once and then remove it. */
+#define KEVENT_REQ_ONESHOT 0x1
+/* Kevent wakes up only first thread interested in given event,
+ * or all threads if this flag is set.
+ */
+#define KEVENT_REQ_WAKEUP_ALL 0x2
+/* Edge Triggered behaviour. */
+#define KEVENT_REQ_ET 0x4
+/* Perform the last check on kevent (call appropriate callback) when
+ * kevent is marked as ready and has been removed from ready queue.
+ * If it will be confirmed that kevent is ready
+ * (k->callbacks.callback(k) returns true) then kevent will be copied
+ * to userspace, otherwise it will be requeued back to storage.
+ * Second (checking) call is performed with this bit _cleared_ so
+ * callback can detect when it was called from
+ * kevent_storage_ready() - bit is set, or
+ * kevent_dequeue_ready() - bit is cleared.
+ * If kevent will be requeued, bit will be set again. */
+#define KEVENT_REQ_LAST_CHECK 0x8
+/*
+ * Always queue kevent even if it is immediately ready.
+ */
+#define KEVENT_REQ_ALWAYS_QUEUE 0x10
+
+/*
+ * Kevent return flags.
+ */
+/* Kevent is broken. */
+#define KEVENT_RET_BROKEN 0x1
+/* Kevent processing was finished successfully. */
+#define KEVENT_RET_DONE 0x2
+/* Kevent was not copied into ring buffer due to some error conditions. */
+#define KEVENT_RET_COPY_FAILED 0x4
+
+/*
+ * Kevent type set.
+ */
+#define KEVENT_SOCKET 0
+#define KEVENT_INODE 1
+#define KEVENT_TIMER 2
+#define KEVENT_POLL 3
+#define KEVENT_NAIO 4
+#define KEVENT_AIO 5
+#define KEVENT_PIPE 6
+#define KEVENT_SIGNAL 7
+#define KEVENT_MAX 8
+
+/*
+ * Per-type event sets.
+ * There should be exactly one event set per kevent type.
+ */
+
+/*
+ * Timer events.
+ */
+#define KEVENT_TIMER_FIRED 0x1
+
+/*
+ * Socket/network asynchronous IO and PIPE events.
+ */
+#define KEVENT_SOCKET_RECV 0x1
+#define KEVENT_SOCKET_ACCEPT 0x2
+#define KEVENT_SOCKET_SEND 0x4
+
+/*
+ * Inode events.
+ */
+#define KEVENT_INODE_CREATE 0x1
+#define KEVENT_INODE_REMOVE 0x2
+
+/*
+ * Poll events.
+ */
+#define KEVENT_POLL_POLLIN 0x0001
+#define KEVENT_POLL_POLLPRI 0x0002
+#define KEVENT_POLL_POLLOUT 0x0004
+#define KEVENT_POLL_POLLERR 0x0008
+#define KEVENT_POLL_POLLHUP 0x0010
+#define KEVENT_POLL_POLLNVAL 0x0020
+
+#define KEVENT_POLL_POLLRDNORM 0x0040
+#define KEVENT_POLL_POLLRDBAND 0x0080
+#define KEVENT_POLL_POLLWRNORM 0x0100
+#define KEVENT_POLL_POLLWRBAND 0x0200
+#define KEVENT_POLL_POLLMSG 0x0400
+#define KEVENT_POLL_POLLREMOVE 0x1000
+
+/*
+ * Asynchronous IO events.
+ */
+#define KEVENT_AIO_BIO 0x1
+
+/*
+ * Signal events.
+ */
+#define KEVENT_SIGNAL_DELIVERY 0x1
+
+/* If set in raw_u64, then given signals will not be delivered
+ * in the usual way through sigmask update and signal callback
+ * invocation. */
+#define KEVENT_SIGNAL_NOMASK 0x8000000000000000ULL
+
+/* Mask of all possible event values. */
+#define KEVENT_MASK_ALL 0xffffffff
+/* Empty mask of ready events. */
+#define KEVENT_MASK_EMPTY 0x0
+
+struct kevent_id
+{
+ union {
+ __u32 raw[2];
+ __u64 raw_u64 __attribute__((aligned(8)));
+ };
+};
+
+struct ukevent
+{
+ /* Id of this request, e.g. socket number, file descriptor and so on... */
+ struct kevent_id id;
+ /* Event type, e.g. KEVENT_SOCKET, KEVENT_INODE, KEVENT_TIMER and so on... */
+ __u32 type;
+ /* Event itself, e.g. KEVENT_SOCKET_ACCEPT, KEVENT_INODE_CREATE, KEVENT_TIMER_FIRED... */
+ __u32 event;
+ /* Per-event request flags */
+ __u32 req_flags;
+ /* Per-event return flags */
+ __u32 ret_flags;
+ /* Event return data. Event originator fills it with anything it likes. */
+ __u32 ret_data[2];
+ /* User's data. It is not used, just copied to/from user.
+ * The whole structure is aligned to 8 bytes already, so the last union
+ * is aligned properly.
+ */
+ union {
+ __u32 user[2];
+ void *ptr;
+ };
+};
+
+struct kevent_ring
+{
+ unsigned int ring_kidx, ring_uidx, ring_over;
+ struct ukevent event[0];
+};
+
+#define KEVENT_CTL_ADD 0
+#define KEVENT_CTL_REMOVE 1
+#define KEVENT_CTL_MODIFY 2
+
+#endif /* __UKEVENT_H */
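
A note on using the header above from userspace. The following is only a minimal sketch, not part of the patch: it mirrors a few constants and the layout of struct ukevent locally so that it builds without the patched kernel headers. The names ukevent_sketch, sketch_prepare_accept() and sketch_classify() are hypothetical, and placing the socket descriptor into id.raw[0] is an assumption based on the "socket number, file descriptor" comment in struct ukevent. The ret_flags interpretation follows kevent_user_add_ukevent(), which sets KEVENT_RET_BROKEN | KEVENT_RET_DONE on error and KEVENT_RET_DONE alone when the event is ready.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Local copies of values from ukevent.h, so the sketch builds
 * without the patched kernel headers. */
#define KEVENT_REQ_ONESHOT	0x1
#define KEVENT_RET_BROKEN	0x1
#define KEVENT_RET_DONE		0x2
#define KEVENT_SOCKET		0
#define KEVENT_SOCKET_ACCEPT	0x2

/* Simplified mirror of struct ukevent (same field order). */
struct ukevent_sketch {
	union { uint32_t raw[2]; uint64_t raw_u64; } id;
	uint32_t type, event, req_flags, ret_flags;
	uint32_t ret_data[2];
	union { uint32_t user[2]; void *ptr; } u;
};

/* Hypothetical helper: describe a one-shot "ready for accept" request.
 * Assumption: the socket descriptor is stored in id.raw[0]. */
static void sketch_prepare_accept(struct ukevent_sketch *uk, int fd, void *cookie)
{
	assert(fd >= 0);
	memset(uk, 0, sizeof(*uk));
	uk->id.raw[0] = (uint32_t)fd;
	uk->type = KEVENT_SOCKET;
	uk->event = KEVENT_SOCKET_ACCEPT;
	uk->req_flags = KEVENT_REQ_ONESHOT;
	uk->u.ptr = cookie;	/* opaque user data, only copied to/from user */
}

/* Hypothetical helper: classify ret_flags of a returned ukevent.
 * Returns -1 if the kevent is broken, 1 if it completed, 0 otherwise.
 * BROKEN is tested first because errors set both BROKEN and DONE. */
static int sketch_classify(const struct ukevent_sketch *uk)
{
	if (uk->ret_flags & KEVENT_RET_BROKEN)
		return -1;
	if (uk->ret_flags & KEVENT_RET_DONE)
		return 1;
	return 0;
}
```

A caller would fill an array of such structures, hand it to the control syscall with KEVENT_CTL_ADD and then run sketch_classify() over whatever entries are copied back.
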
diff --git a/init/Kconfig b/init/Kconfig
index d2eb7a8..c7d8250 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -201,6 +201,8 @@ config AUDITSYSCALL
such as SELinux. To use audit's filesystem watch feature, please
ensure that INOTIFY is configured.
+source "kernel/kevent/Kconfig"
+
config IKCONFIG
bool "Kernel .config support"
---help---
diff --git a/kernel/Makefile b/kernel/Makefile
index d62ec66..2d7a6dd 100644
--- a/kernel/Makefile
+++ b/kernel/Makefile
@@ -47,6 +47,7 @@ obj-$(CONFIG_DETECT_SOFTLOCKUP) += softl
obj-$(CONFIG_GENERIC_HARDIRQS) += irq/
obj-$(CONFIG_SECCOMP) += seccomp.o
obj-$(CONFIG_RCU_TORTURE_TEST) += rcutorture.o
+obj-$(CONFIG_KEVENT) += kevent/
obj-$(CONFIG_RELAY) += relay.o
obj-$(CONFIG_TASK_DELAY_ACCT) += delayacct.o
obj-$(CONFIG_TASKSTATS) += taskstats.o
diff --git a/kernel/kevent/Kconfig b/kernel/kevent/Kconfig
new file mode 100644
index 0000000..4b137ee
--- /dev/null
+++ b/kernel/kevent/Kconfig
@@ -0,0 +1,60 @@
+config KEVENT
+ bool "Kernel event notification mechanism"
+ help
+ This option enables the event queue mechanism.
+ It can be used as a replacement for poll()/select(), AIO callback
+ invocations, advanced timer notifications and other kernel
+ object status changes.
+
+config KEVENT_USER_STAT
+ bool "Kevent user statistic"
+ depends on KEVENT
+ help
+ This option turns kevent_user statistics collection on.
+ Statistics include the total number of kevents, the number of kevents
+ which were ready immediately at insertion time and the number of kevents
+ which were removed through readiness completion.
+ They are printed each time a control kevent descriptor is closed.
+
+config KEVENT_TIMER
+ bool "Kernel event notifications for timers"
+ depends on KEVENT
+ help
+ This option allows timers to be used through the KEVENT subsystem.
+
+config KEVENT_POLL
+ bool "Kernel event notifications for poll()/select()"
+ depends on KEVENT
+ help
+ This option allows the kevent subsystem to be used for poll()/select()
+ notifications.
+
+config KEVENT_SOCKET
+ bool "Kernel event notifications for sockets"
+ depends on NET && KEVENT
+ help
+ This option enables notifications through the KEVENT subsystem of
+ socket operations, such as new packet reception, readiness
+ for accept and so on.
+
+config KEVENT_PIPE
+ bool "Kernel event notifications for pipes"
+ depends on KEVENT
+ help
+ This option enables notifications through KEVENT subsystem of
+ pipe read/write operations.
+
+config KEVENT_SIGNAL
+ bool "Kernel event notifications for signals"
+ depends on KEVENT
+ help
+ This option enables signal delivery through the KEVENT subsystem.
+ Signals which are to be delivered through the kevent subsystem
+ must still be registered through the usual signal() and related
+ syscalls; this option only provides an alternative delivery path.
+ If the KEVENT_SIGNAL_NOMASK flag is set in a kevent for a set of
+ signals, they will not be delivered in the usual way.
+ Kevents for signals are not copied when a process forks;
+ the new process must add new kevents after fork(). The signal mask
+ is copied as before.
+
diff --git a/kernel/kevent/Makefile b/kernel/kevent/Makefile
new file mode 100644
index 0000000..f98e0c8
--- /dev/null
+++ b/kernel/kevent/Makefile
@@ -0,0 +1,6 @@
+obj-y := kevent.o kevent_user.o
+obj-$(CONFIG_KEVENT_TIMER) += kevent_timer.o
+obj-$(CONFIG_KEVENT_POLL) += kevent_poll.o
+obj-$(CONFIG_KEVENT_SOCKET) += kevent_socket.o
+obj-$(CONFIG_KEVENT_PIPE) += kevent_pipe.o
+obj-$(CONFIG_KEVENT_SIGNAL) += kevent_signal.o
diff --git a/kernel/kevent/kevent.c b/kernel/kevent/kevent.c
new file mode 100644
index 0000000..8cf756c
--- /dev/null
+++ b/kernel/kevent/kevent.c
@@ -0,0 +1,232 @@
+/*
+ * 2006 Copyright (c) Evgeniy Polyakov <[email protected]>
+ * All rights reserved.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software
+ * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA
+ */
+
+#include <linux/kernel.h>
+#include <linux/types.h>
+#include <linux/list.h>
+#include <linux/slab.h>
+#include <linux/spinlock.h>
+#include <linux/mempool.h>
+#include <linux/sched.h>
+#include <linux/wait.h>
+#include <linux/kevent.h>
+
+/*
+ * Attempts to add an event into appropriate origin's queue.
+ * Returns positive value if this event is ready immediately,
+ * negative value in case of error and zero if event has been queued.
+ * ->enqueue() callback must increase origin's reference counter.
+ */
+int kevent_enqueue(struct kevent *k)
+{
+ return k->callbacks.enqueue(k);
+}
+
+/*
+ * Remove event from the appropriate queue.
+ * ->dequeue() callback must decrease origin's reference counter.
+ */
+int kevent_dequeue(struct kevent *k)
+{
+ return k->callbacks.dequeue(k);
+}
+
+/*
+ * Mark kevent as broken.
+ */
+int kevent_break(struct kevent *k)
+{
+ unsigned long flags;
+
+ spin_lock_irqsave(&k->ulock, flags);
+ k->event.ret_flags |= KEVENT_RET_BROKEN;
+ spin_unlock_irqrestore(&k->ulock, flags);
+ return -EINVAL;
+}
+
+static struct kevent_callbacks kevent_registered_callbacks[KEVENT_MAX] __read_mostly;
+
+int kevent_add_callbacks(const struct kevent_callbacks *cb, int pos)
+{
+ struct kevent_callbacks *p;
+
+ if (pos < 0 || pos >= KEVENT_MAX)
+ return -EINVAL;
+
+ p = &kevent_registered_callbacks[pos];
+
+ p->enqueue = (cb->enqueue) ? cb->enqueue : kevent_break;
+ p->dequeue = (cb->dequeue) ? cb->dequeue : kevent_break;
+ p->callback = (cb->callback) ? cb->callback : kevent_break;
+
+ printk(KERN_INFO "KEVENT: Added callbacks for type %d.\n", pos);
+ return 0;
+}
+
+/*
+ * Must be called before event is going to be added into some origin's queue.
+ * Initializes ->enqueue(), ->dequeue() and ->callback() callbacks.
+ * If it fails, the kevent should not be used; otherwise kevent_enqueue()
+ * will fail to add it into the origin's queue and set the
+ * KEVENT_RET_BROKEN flag in kevent->event.ret_flags.
+ */
+int kevent_init(struct kevent *k)
+{
+ spin_lock_init(&k->ulock);
+ k->flags = 0;
+
+ if (unlikely(k->event.type >= KEVENT_MAX ||
+ !kevent_registered_callbacks[k->event.type].callback))
+ return kevent_break(k);
+
+ k->callbacks = kevent_registered_callbacks[k->event.type];
+ if (unlikely(k->callbacks.callback == kevent_break))
+ return kevent_break(k);
+
+ return 0;
+}
+
+/*
+ * Called from ->enqueue() callback when reference counter for given
+ * origin (socket, inode...) has been increased.
+ */
+int kevent_storage_enqueue(struct kevent_storage *st, struct kevent *k)
+{
+ unsigned long flags;
+
+ k->st = st;
+ spin_lock_irqsave(&st->lock, flags);
+ list_add_tail_rcu(&k->storage_entry, &st->list);
+ k->flags |= KEVENT_STORAGE;
+ spin_unlock_irqrestore(&st->lock, flags);
+ return 0;
+}
+
+/*
+ * Dequeue kevent from origin's queue.
+ * It does not decrease origin's reference counter in any way
+ * and must be called before it, so storage itself must be valid.
+ * It is called from ->dequeue() callback.
+ */
+void kevent_storage_dequeue(struct kevent_storage *st, struct kevent *k)
+{
+ unsigned long flags;
+
+ spin_lock_irqsave(&st->lock, flags);
+ if (k->flags & KEVENT_STORAGE) {
+ list_del_rcu(&k->storage_entry);
+ k->flags &= ~KEVENT_STORAGE;
+ }
+ spin_unlock_irqrestore(&st->lock, flags);
+}
+
+/*
+ * Call kevent ready callback and queue it into ready queue if needed.
+ * If kevent is marked as one-shot, then remove it from storage queue.
+ */
+static int __kevent_requeue(struct kevent *k, u32 event)
+{
+ int ret, rem;
+ unsigned long flags;
+
+ ret = k->callbacks.callback(k);
+
+ spin_lock_irqsave(&k->ulock, flags);
+ if (ret > 0)
+ k->event.ret_flags |= KEVENT_RET_DONE;
+ else if (ret < 0)
+ k->event.ret_flags |= (KEVENT_RET_BROKEN | KEVENT_RET_DONE);
+ else
+ ret = (k->event.ret_flags & (KEVENT_RET_BROKEN|KEVENT_RET_DONE));
+ rem = (k->event.req_flags & KEVENT_REQ_ONESHOT);
+ spin_unlock_irqrestore(&k->ulock, flags);
+
+ if (ret) {
+ if ((rem || ret < 0) && (k->flags & KEVENT_STORAGE)) {
+ list_del_rcu(&k->storage_entry);
+ k->flags &= ~KEVENT_STORAGE;
+ }
+
+ spin_lock_irqsave(&k->user->ready_lock, flags);
+ if (!(k->flags & KEVENT_READY)) {
+ list_add_tail(&k->ready_entry, &k->user->ready_list);
+ k->flags |= KEVENT_READY;
+ k->user->ready_num++;
+ }
+ spin_unlock_irqrestore(&k->user->ready_lock, flags);
+ wake_up(&k->user->wait);
+ }
+
+ return ret;
+}
+
+/*
+ * Check if kevent is ready (by invoking its callback) and requeue/remove
+ * if needed.
+ */
+void kevent_requeue(struct kevent *k)
+{
+ unsigned long flags;
+
+ spin_lock_irqsave(&k->st->lock, flags);
+ __kevent_requeue(k, 0);
+ spin_unlock_irqrestore(&k->st->lock, flags);
+}
+
+/*
+ * Called each time some activity in origin (socket, inode...) is noticed.
+ */
+void kevent_storage_ready(struct kevent_storage *st,
+ kevent_callback_t ready_callback, u32 event)
+{
+ struct kevent *k;
+ int wake_num = 0;
+
+ rcu_read_lock();
+ if (unlikely(ready_callback))
+ list_for_each_entry_rcu(k, &st->list, storage_entry)
+ (*ready_callback)(k);
+
+ list_for_each_entry_rcu(k, &st->list, storage_entry) {
+ if (event & k->event.event)
+ if ((k->event.req_flags & KEVENT_REQ_WAKEUP_ALL) || wake_num == 0)
+ if (__kevent_requeue(k, event))
+ wake_num++;
+ }
+ rcu_read_unlock();
+}
+
+int kevent_storage_init(void *origin, struct kevent_storage *st)
+{
+ spin_lock_init(&st->lock);
+ st->origin = origin;
+ INIT_LIST_HEAD(&st->list);
+ return 0;
+}
+
+/*
+ * Mark all events as broken. That removes them from storage,
+ * so the storage origin (inode, socket and so on) can be safely removed.
+ * No new entries are allowed to be added into the storage at this point.
+ * (For example, a socket is removed from the file table at this point.)
+ */
+void kevent_storage_fini(struct kevent_storage *st)
+{
+ kevent_storage_ready(st, kevent_break, KEVENT_MASK_ALL);
+}
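
The wakeup policy implemented by kevent_storage_ready() above can be illustrated with a small userspace model. This is only a sketch under stated assumptions, not code from the patch: struct model_kevent and model_storage_ready() are hypothetical names, the "ready" field stands in for a non-zero return from __kevent_requeue(), and the storage list is modelled as a plain array. It shows that only the first matching kevent is queued unless later ones carry KEVENT_REQ_WAKEUP_ALL.

```c
#include <stddef.h>
#include <stdint.h>

#define KEVENT_REQ_WAKEUP_ALL	0x2	/* same value as in ukevent.h */

/* Hypothetical stand-in for struct kevent, reduced to what the policy needs. */
struct model_kevent {
	uint32_t event;		/* events this kevent is interested in */
	uint32_t req_flags;	/* request flags, e.g. KEVENT_REQ_WAKEUP_ALL */
	int ready;		/* models a non-zero __kevent_requeue() result */
	int queued;		/* set by the model when the kevent is queued */
};

/* Walk the list the way kevent_storage_ready() does and return the number
 * of kevents that were moved to the (modelled) ready queue. */
static int model_storage_ready(struct model_kevent *list, size_t n, uint32_t event)
{
	int wake_num = 0;
	size_t i;

	for (i = 0; i < n; ++i) {
		struct model_kevent *k = &list[i];

		/* Skip kevents not interested in this event. */
		if (!(event & k->event))
			continue;
		/* Without WAKEUP_ALL only the first ready kevent is handled. */
		if (!(k->req_flags & KEVENT_REQ_WAKEUP_ALL) && wake_num != 0)
			continue;
		if (k->ready) {
			k->queued = 1;
			wake_num++;
		}
	}
	return wake_num;
}
```

With three ready kevents and no flags set, the model queues only the first one. If the second one sets KEVENT_REQ_WAKEUP_ALL, the first two are queued and the third is still skipped.
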
diff --git a/kernel/kevent/kevent_user.c b/kernel/kevent/kevent_user.c
new file mode 100644
index 0000000..2cd8c99
--- /dev/null
+++ b/kernel/kevent/kevent_user.c
@@ -0,0 +1,1181 @@
+/*
+ * 2006 Copyright (c) Evgeniy Polyakov <[email protected]>
+ * All rights reserved.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software
+ * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA
+ */
+
+#include <linux/kernel.h>
+#include <linux/module.h>
+#include <linux/types.h>
+#include <linux/list.h>
+#include <linux/slab.h>
+#include <linux/spinlock.h>
+#include <linux/fs.h>
+#include <linux/file.h>
+#include <linux/mount.h>
+#include <linux/device.h>
+#include <linux/poll.h>
+#include <linux/kevent.h>
+#include <linux/miscdevice.h>
+#include <asm/io.h>
+
+static kmem_cache_t *kevent_cache __read_mostly;
+static kmem_cache_t *kevent_user_cache __read_mostly;
+
+/*
+ * kevents are pollable, return POLLIN and POLLRDNORM
+ * when there is at least one ready kevent.
+ */
+static unsigned int kevent_user_poll(struct file *file, struct poll_table_struct *wait)
+{
+ struct kevent_user *u = file->private_data;
+ unsigned int mask;
+
+ poll_wait(file, &u->wait, wait);
+ mask = 0;
+
+ if (u->ready_num)
+ mask |= POLLIN | POLLRDNORM;
+
+ return mask;
+}
+
+static inline unsigned int kevent_ring_space(struct kevent_user *u)
+{
+ if (u->full)
+ return 0;
+
+ return (u->uidx > u->kidx)?
+ (u->uidx - u->kidx):
+ (u->ring_size - (u->kidx - u->uidx));
+}
+
+static inline int kevent_ring_index_inc(unsigned int *pidx, unsigned int size)
+{
+ unsigned int idx = *pidx;
+
+ if (++idx >= size)
+ idx = 0;
+ *pidx = idx;
+ return (idx == 0);
+}
+
+/*
+ * Copies kevent into the userspace ring buffer if it was initialized.
+ * Returns
+ * 0 on success or if ring buffer is not used
+ * -EAGAIN if there was no room for that kevent
+ * -EFAULT if copy_to_user() failed.
+ *
+ * Must be called with kevent_user->ring_lock held.
+ */
+static int kevent_copy_ring_buffer(struct kevent *k)
+{
+ struct kevent_ring __user *ring;
+ struct kevent_user *u = k->user;
+ unsigned long flags;
+ int err;
+
+ ring = u->pring;
+ if (!ring)
+ return 0;
+
+ if (!kevent_ring_space(u))
+ return -EAGAIN;
+
+ if (copy_to_user(&ring->event[u->kidx], &k->event, sizeof(struct ukevent))) {
+ err = -EFAULT;
+ goto err_out_exit;
+ }
+
+ kevent_ring_index_inc(&u->kidx, u->ring_size);
+
+ if (u->kidx == u->uidx)
+ u->full = 1;
+
+ if (put_user(u->kidx, &ring->ring_kidx)) {
+ err = -EFAULT;
+ goto err_out_exit;
+ }
+
+ return 0;
+
+err_out_exit:
+ spin_lock_irqsave(&k->ulock, flags);
+ k->event.ret_flags |= KEVENT_RET_COPY_FAILED;
+ spin_unlock_irqrestore(&k->ulock, flags);
+ return err;
+}
+
+static struct kevent_user *kevent_user_alloc(struct kevent_ring __user *ring, unsigned int num)
+{
+ struct kevent_user *u;
+
+ u = kmem_cache_alloc(kevent_user_cache, GFP_KERNEL);
+ if (!u)
+ return NULL;
+
+ INIT_LIST_HEAD(&u->ready_list);
+ spin_lock_init(&u->ready_lock);
+ kevent_stat_init(u);
+ spin_lock_init(&u->kevent_lock);
+ u->kevent_root = RB_ROOT;
+
+ mutex_init(&u->ctl_mutex);
+ init_waitqueue_head(&u->wait);
+
+ atomic_set(&u->refcnt, 1);
+
+ mutex_init(&u->ring_lock);
+ u->kidx = u->uidx = u->ring_over = u->full = 0;
+
+ u->pring = ring;
+ u->ring_size = num;
+
+ return u;
+}
+
+/*
+ * Kevent userspace control block reference counting.
+ * The counter is set to 1 at creation time and is decreased when the
+ * appropriate kevent file descriptor is closed.
+ * When the counter hits zero the block is freed.
+ */
+static inline void kevent_user_get(struct kevent_user *u)
+{
+ atomic_inc(&u->refcnt);
+}
+
+static inline void kevent_user_put(struct kevent_user *u)
+{
+ if (atomic_dec_and_test(&u->refcnt)) {
+ kevent_stat_print(u);
+ kmem_cache_free(kevent_user_cache, u);
+ }
+}
+
+static inline int kevent_compare_id(struct kevent_id *left, struct kevent_id *right)
+{
+ if (left->raw_u64 > right->raw_u64)
+ return -1;
+
+ if (right->raw_u64 > left->raw_u64)
+ return 1;
+
+ return 0;
+}
+
+/*
+ * RCU protects storage list (kevent->storage_entry).
+ * Free entry in RCU callback, it is dequeued from all lists at
+ * this point.
+ */
+
+static void kevent_free_rcu(struct rcu_head *rcu)
+{
+ struct kevent *kevent = container_of(rcu, struct kevent, rcu_head);
+ kmem_cache_free(kevent_cache, kevent);
+}
+
+/*
+ * Must be called under u->ready_lock.
+ * This function unlinks kevent from ready queue.
+ */
+static inline void kevent_unlink_ready(struct kevent *k)
+{
+ list_del(&k->ready_entry);
+ k->flags &= ~KEVENT_READY;
+ k->user->ready_num--;
+}
+
+static void kevent_remove_ready(struct kevent *k)
+{
+ struct kevent_user *u = k->user;
+ unsigned long flags;
+
+ spin_lock_irqsave(&u->ready_lock, flags);
+ if (k->flags & KEVENT_READY)
+ kevent_unlink_ready(k);
+ spin_unlock_irqrestore(&u->ready_lock, flags);
+}
+
+/*
+ * Complete kevent removing - it dequeues kevent from storage list
+ * if it is requested, removes kevent from ready list, drops userspace
+ * control block reference counter and schedules kevent freeing through RCU.
+ */
+static void kevent_finish_user_complete(struct kevent *k, int deq)
+{
+ if (deq)
+ kevent_dequeue(k);
+
+ kevent_remove_ready(k);
+
+ kevent_user_put(k->user);
+ call_rcu(&k->rcu_head, kevent_free_rcu);
+}
+
+/*
+ * Remove from all lists and free kevent.
+ * Must be called under kevent_user->kevent_lock to protect
+ * removal of kevent->kevent_node.
+ */
+static void __kevent_finish_user(struct kevent *k, int deq)
+{
+ struct kevent_user *u = k->user;
+
+ rb_erase(&k->kevent_node, &u->kevent_root);
+ k->flags &= ~KEVENT_USER;
+ u->kevent_num--;
+ kevent_finish_user_complete(k, deq);
+}
+
+/*
+ * Remove kevent from user's list of all events,
+ * dequeue it from storage and decrease user's reference counter,
+ * since this kevent does not exist anymore. That is why it is freed here.
+ */
+static void kevent_finish_user(struct kevent *k, int deq)
+{
+ struct kevent_user *u = k->user;
+ unsigned long flags;
+
+ spin_lock_irqsave(&u->kevent_lock, flags);
+ rb_erase(&k->kevent_node, &u->kevent_root);
+ k->flags &= ~KEVENT_USER;
+ u->kevent_num--;
+ spin_unlock_irqrestore(&u->kevent_lock, flags);
+ kevent_finish_user_complete(k, deq);
+}
+
+static struct kevent *__kevent_dequeue_ready_one(struct kevent_user *u)
+{
+ unsigned long flags;
+ struct kevent *k = NULL;
+
+ if (u->ready_num) {
+ spin_lock_irqsave(&u->ready_lock, flags);
+ if (u->ready_num && !list_empty(&u->ready_list)) {
+ k = list_entry(u->ready_list.next, struct kevent, ready_entry);
+ kevent_unlink_ready(k);
+ }
+ spin_unlock_irqrestore(&u->ready_lock, flags);
+ }
+
+ return k;
+}
+
+static struct kevent *kevent_dequeue_ready_one(struct kevent_user *u)
+{
+ struct kevent *k = NULL;
+
+ while (u->ready_num && !k) {
+ k = __kevent_dequeue_ready_one(u);
+
+ if (k && (k->event.req_flags & KEVENT_REQ_LAST_CHECK)) {
+ unsigned long flags;
+
+ spin_lock_irqsave(&k->ulock, flags);
+ k->event.req_flags &= ~KEVENT_REQ_LAST_CHECK;
+ spin_unlock_irqrestore(&k->ulock, flags);
+
+ if (!k->callbacks.callback(k)) {
+ spin_lock_irqsave(&k->ulock, flags);
+ k->event.req_flags |= KEVENT_REQ_LAST_CHECK;
+ k->event.ret_flags = 0;
+ k->event.ret_data[0] = k->event.ret_data[1] = 0;
+ spin_unlock_irqrestore(&k->ulock, flags);
+ k = NULL;
+ }
+ } else
+ break;
+ }
+
+ return k;
+}
+
+static inline void kevent_copy_ring(struct kevent *k)
+{
+ unsigned long flags;
+
+ if (!k)
+ return;
+
+ if (kevent_copy_ring_buffer(k)) {
+ spin_lock_irqsave(&k->ulock, flags);
+ k->event.ret_flags |= KEVENT_RET_COPY_FAILED;
+ spin_unlock_irqrestore(&k->ulock, flags);
+ }
+}
+
+/*
+ * Dequeue one entry from user's ready queue.
+ */
+static struct kevent *kevent_dequeue_ready(struct kevent_user *u)
+{
+ struct kevent *k;
+
+ mutex_lock(&u->ring_lock);
+ k = kevent_dequeue_ready_one(u);
+ kevent_copy_ring(k);
+ mutex_unlock(&u->ring_lock);
+
+ return k;
+}
+
+/*
+ * Dequeue one entry from user's ready queue if there is space in ring buffer.
+ */
+static struct kevent *kevent_dequeue_ready_ring(struct kevent_user *u)
+{
+ struct kevent *k = NULL;
+
+ mutex_lock(&u->ring_lock);
+ if (kevent_ring_space(u)) {
+ k = kevent_dequeue_ready_one(u);
+ kevent_copy_ring(k);
+ }
+ mutex_unlock(&u->ring_lock);
+
+ return k;
+}
+
+static void kevent_complete_ready(struct kevent *k)
+{
+ if (k->event.req_flags & KEVENT_REQ_ONESHOT)
+ /*
+ * If it is one-shot kevent, it has been removed already from
+ * origin's queue, so we can easily free it here.
+ */
+ kevent_finish_user(k, 1);
+ else if (k->event.req_flags & KEVENT_REQ_ET) {
+ unsigned long flags;
+
+ /*
+ * Edge-triggered behaviour: reset the event so it looks like a new one.
+ */
+
+ spin_lock_irqsave(&k->ulock, flags);
+ k->event.ret_flags = 0;
+ k->event.ret_data[0] = k->event.ret_data[1] = 0;
+ spin_unlock_irqrestore(&k->ulock, flags);
+ }
+}
+
+/*
+ * Search a kevent inside kevent tree for given ukevent.
+ */
+static struct kevent *__kevent_search(struct kevent_id *id, struct kevent_user *u)
+{
+ struct kevent *k, *ret = NULL;
+ struct rb_node *n = u->kevent_root.rb_node;
+ int cmp;
+
+ while (n) {
+ k = rb_entry(n, struct kevent, kevent_node);
+ cmp = kevent_compare_id(&k->event.id, id);
+
+ if (cmp > 0)
+ n = n->rb_right;
+ else if (cmp < 0)
+ n = n->rb_left;
+ else {
+ ret = k;
+ break;
+ }
+ }
+
+ return ret;
+}
+
+/*
+ * Search and modify kevent according to provided ukevent.
+ */
+static int kevent_modify(struct ukevent *uk, struct kevent_user *u)
+{
+ struct kevent *k;
+ int err = -ENODEV;
+ unsigned long flags;
+
+ spin_lock_irqsave(&u->kevent_lock, flags);
+ k = __kevent_search(&uk->id, u);
+ if (k) {
+ spin_lock(&k->ulock);
+ k->event.event = uk->event;
+ k->event.req_flags = uk->req_flags;
+ k->event.ret_flags = 0;
+ spin_unlock(&k->ulock);
+ kevent_requeue(k);
+ err = 0;
+ }
+ spin_unlock_irqrestore(&u->kevent_lock, flags);
+
+ return err;
+}
+
+/*
+ * Remove kevent which matches provided ukevent.
+ */
+static int kevent_remove(struct ukevent *uk, struct kevent_user *u)
+{
+ int err = -ENODEV;
+ struct kevent *k;
+ unsigned long flags;
+
+ spin_lock_irqsave(&u->kevent_lock, flags);
+ k = __kevent_search(&uk->id, u);
+ if (k) {
+ __kevent_finish_user(k, 1);
+ err = 0;
+ }
+ spin_unlock_irqrestore(&u->kevent_lock, flags);
+
+ return err;
+}
+
+/*
+ * Detaches userspace control block from file descriptor
+ * and decreases its reference counter.
+ * No new kevents can be added or removed from any list at this point.
+ */
+static int kevent_user_release(struct inode *inode, struct file *file)
+{
+ struct kevent_user *u = file->private_data;
+ struct kevent *k;
+ struct rb_node *n;
+
+ for (n = rb_first(&u->kevent_root); n; n = rb_next(n)) {
+ k = rb_entry(n, struct kevent, kevent_node);
+ kevent_finish_user(k, 1);
+ }
+
+ kevent_user_put(u);
+ file->private_data = NULL;
+
+ return 0;
+}
+
+/*
+ * Read requested number of ukevents in one shot.
+ */
+static struct ukevent *kevent_get_user(unsigned int num, void __user *arg)
+{
+ struct ukevent *ukev;
+
+ ukev = kmalloc(sizeof(struct ukevent) * num, GFP_KERNEL);
+ if (!ukev)
+ return NULL;
+
+ if (copy_from_user(ukev, arg, sizeof(struct ukevent) * num)) {
+ kfree(ukev);
+ return NULL;
+ }
+
+ return ukev;
+}
+
+/*
+ * Read from userspace all ukevents and modify appropriate kevents.
+ * If the provided number of ukevents is more than the threshold, it is
+ * faster to allocate room for them and copy in one shot instead of copying
+ * one-by-one and then processing them.
+ */
+static int kevent_user_ctl_modify(struct kevent_user *u, unsigned int num, void __user *arg)
+{
+ int err = 0, i;
+ struct ukevent uk;
+
+ mutex_lock(&u->ctl_mutex);
+
+ if (num > u->kevent_num) {
+ err = -EINVAL;
+ goto out;
+ }
+
+ if (num > KEVENT_MIN_BUFFS_ALLOC) {
+ struct ukevent *ukev;
+
+ ukev = kevent_get_user(num, arg);
+ if (ukev) {
+ for (i = 0; i < num; ++i) {
+ if (kevent_modify(&ukev[i], u))
+ ukev[i].ret_flags |= KEVENT_RET_BROKEN;
+ ukev[i].ret_flags |= KEVENT_RET_DONE;
+ }
+ if (copy_to_user(arg, ukev, num*sizeof(struct ukevent)))
+ err = -EFAULT;
+ kfree(ukev);
+ goto out;
+ }
+ }
+
+ for (i = 0; i < num; ++i) {
+ if (copy_from_user(&uk, arg, sizeof(struct ukevent))) {
+ err = -EFAULT;
+ break;
+ }
+
+ if (kevent_modify(&uk, u))
+ uk.ret_flags |= KEVENT_RET_BROKEN;
+ uk.ret_flags |= KEVENT_RET_DONE;
+
+ if (copy_to_user(arg, &uk, sizeof(struct ukevent))) {
+ err = -EFAULT;
+ break;
+ }
+
+ arg += sizeof(struct ukevent);
+ }
+out:
+ mutex_unlock(&u->ctl_mutex);
+
+ return err;
+}
+
+/*
+ * Read from userspace all ukevents and remove appropriate kevents.
+ * If the provided number of ukevents is more than the threshold, it is
+ * faster to allocate room for them and copy in one shot instead of copying
+ * one-by-one and then processing them.
+ */
+static int kevent_user_ctl_remove(struct kevent_user *u, unsigned int num, void __user *arg)
+{
+ int err = 0, i;
+ struct ukevent uk;
+
+ mutex_lock(&u->ctl_mutex);
+
+ if (num > u->kevent_num) {
+ err = -EINVAL;
+ goto out;
+ }
+
+ if (num > KEVENT_MIN_BUFFS_ALLOC) {
+ struct ukevent *ukev;
+
+ ukev = kevent_get_user(num, arg);
+ if (ukev) {
+ for (i = 0; i < num; ++i) {
+ if (kevent_remove(&ukev[i], u))
+ ukev[i].ret_flags |= KEVENT_RET_BROKEN;
+ ukev[i].ret_flags |= KEVENT_RET_DONE;
+ }
+ if (copy_to_user(arg, ukev, num*sizeof(struct ukevent)))
+ err = -EFAULT;
+ kfree(ukev);
+ goto out;
+ }
+ }
+
+ for (i = 0; i < num; ++i) {
+ if (copy_from_user(&uk, arg, sizeof(struct ukevent))) {
+ err = -EFAULT;
+ break;
+ }
+
+ if (kevent_remove(&uk, u))
+ uk.ret_flags |= KEVENT_RET_BROKEN;
+
+ uk.ret_flags |= KEVENT_RET_DONE;
+
+ if (copy_to_user(arg, &uk, sizeof(struct ukevent))) {
+ err = -EFAULT;
+ break;
+ }
+
+ arg += sizeof(struct ukevent);
+ }
+out:
+ mutex_unlock(&u->ctl_mutex);
+
+ return err;
+}
+
+/*
+ * Queue kevent into userspace control block and increase
+ * its reference counter.
+ */
+static int kevent_user_enqueue(struct kevent_user *u, struct kevent *new)
+{
+ unsigned long flags;
+ struct rb_node **p = &u->kevent_root.rb_node, *parent = NULL;
+ struct kevent *k;
+ int err = 0, cmp;
+
+ spin_lock_irqsave(&u->kevent_lock, flags);
+ while (*p) {
+ parent = *p;
+ k = rb_entry(parent, struct kevent, kevent_node);
+
+ cmp = kevent_compare_id(&k->event.id, &new->event.id);
+ if (cmp > 0)
+ p = &parent->rb_right;
+ else if (cmp < 0)
+ p = &parent->rb_left;
+ else {
+ err = -EEXIST;
+ break;
+ }
+ }
+ if (likely(!err)) {
+ rb_link_node(&new->kevent_node, parent, p);
+ rb_insert_color(&new->kevent_node, &u->kevent_root);
+ new->flags |= KEVENT_USER;
+ u->kevent_num++;
+ kevent_user_get(u);
+ }
+ spin_unlock_irqrestore(&u->kevent_lock, flags);
+
+ return err;
+}
+
+/*
+ * Add kevent from both kernel and userspace users.
+ * This function allocates and queues kevent, returns negative value
+ * on error, positive if kevent is ready immediately and zero
+ * if kevent has been queued.
+ */
+int kevent_user_add_ukevent(struct ukevent *uk, struct kevent_user *u)
+{
+ struct kevent *k;
+ int err;
+
+ k = kmem_cache_alloc(kevent_cache, GFP_KERNEL);
+ if (!k) {
+ err = -ENOMEM;
+ goto err_out_exit;
+ }
+
+ memcpy(&k->event, uk, sizeof(struct ukevent));
+ INIT_RCU_HEAD(&k->rcu_head);
+
+ k->event.ret_flags = 0;
+
+ err = kevent_init(k);
+ if (err) {
+ kmem_cache_free(kevent_cache, k);
+ goto err_out_exit;
+ }
+ k->user = u;
+ kevent_stat_total(u);
+ err = kevent_user_enqueue(u, k);
+ if (err) {
+ kmem_cache_free(kevent_cache, k);
+ goto err_out_exit;
+ }
+
+ err = kevent_enqueue(k);
+ if (err) {
+ memcpy(uk, &k->event, sizeof(struct ukevent));
+ kevent_finish_user(k, 0);
+ goto err_out_exit;
+ }
+
+ return 0;
+
+err_out_exit:
+ if (err < 0) {
+ uk->ret_flags |= KEVENT_RET_BROKEN | KEVENT_RET_DONE;
+ uk->ret_data[1] = err;
+ } else if (err > 0)
+ uk->ret_flags |= KEVENT_RET_DONE;
+ return err;
+}
+
+/*
+ * Copy all ukevents from userspace, allocate kevent for each one
+ * and add them into appropriate kevent_storages,
+ * e.g. sockets, inodes and so on...
+ * Ready events will replace the ones provided by the user, and the number
+ * of ready events is returned.
+ * User must check the ret_flags field of each ukevent structure
+ * to determine whether the event has fired or failed.
+ */
+static int kevent_user_ctl_add(struct kevent_user *u, unsigned int num, void __user *arg)
+{
+ int err, cerr = 0, rnum = 0, i;
+ void __user *orig = arg;
+ struct ukevent uk;
+
+ mutex_lock(&u->ctl_mutex);
+
+ err = -EINVAL;
+ if (num > KEVENT_MIN_BUFFS_ALLOC) {
+ struct ukevent *ukev;
+
+ ukev = kevent_get_user(num, arg);
+ if (ukev) {
+ for (i = 0; i < num; ++i) {
+ err = kevent_user_add_ukevent(&ukev[i], u);
+ if (err) {
+ kevent_stat_im(u);
+ if (i != rnum)
+ memcpy(&ukev[rnum], &ukev[i], sizeof(struct ukevent));
+ rnum++;
+ }
+ }
+ if (copy_to_user(orig, ukev, rnum*sizeof(struct ukevent)))
+ cerr = -EFAULT;
+ kfree(ukev);
+ goto out_setup;
+ }
+ }
+
+ for (i = 0; i < num; ++i) {
+ if (copy_from_user(&uk, arg, sizeof(struct ukevent))) {
+ cerr = -EFAULT;
+ break;
+ }
+ arg += sizeof(struct ukevent);
+
+ err = kevent_user_add_ukevent(&uk, u);
+ if (err) {
+ kevent_stat_im(u);
+ if (copy_to_user(orig, &uk, sizeof(struct ukevent))) {
+ cerr = -EFAULT;
+ break;
+ }
+ orig += sizeof(struct ukevent);
+ rnum++;
+ }
+ }
+
+out_setup:
+ if (cerr < 0) {
+ err = cerr;
+ goto out_remove;
+ }
+
+ err = rnum;
+out_remove:
+ mutex_unlock(&u->ctl_mutex);
+
+ return err;
+}
+
+/*
+ * In nonblocking mode it returns as many events as possible, but not more than @max_nr.
+ * In blocking mode it waits until the timeout expires or at least @min_nr events are ready.
+ */
+static int kevent_user_wait(struct file *file, struct kevent_user *u,
+ unsigned int min_nr, unsigned int max_nr, __u64 timeout,
+ void __user *buf)
+{
+ struct kevent *k;
+ int num = 0;
+
+ if (!(file->f_flags & O_NONBLOCK)) {
+ wait_event_interruptible_timeout(u->wait,
+ u->ready_num >= min_nr,
+ clock_t_to_jiffies(nsec_to_clock_t(timeout)));
+ }
+
+ while (num < max_nr && ((k = kevent_dequeue_ready(u)) != NULL)) {
+ if (copy_to_user(buf + num*sizeof(struct ukevent),
+ &k->event, sizeof(struct ukevent))) {
+ if (num == 0)
+ num = -EFAULT;
+ break;
+ }
+ kevent_complete_ready(k);
+ ++num;
+ kevent_stat_wait(u);
+ }
+
+ return num;
+}
+
+static struct file_operations kevent_user_fops = {
+ .release = kevent_user_release,
+ .poll = kevent_user_poll,
+ .owner = THIS_MODULE,
+};
+
+static int kevent_ctl_process(struct file *file, unsigned int cmd, unsigned int num, void __user *arg)
+{
+ int err;
+ struct kevent_user *u = file->private_data;
+
+ switch (cmd) {
+ case KEVENT_CTL_ADD:
+ err = kevent_user_ctl_add(u, num, arg);
+ break;
+ case KEVENT_CTL_REMOVE:
+ err = kevent_user_ctl_remove(u, num, arg);
+ break;
+ case KEVENT_CTL_MODIFY:
+ err = kevent_user_ctl_modify(u, num, arg);
+ break;
+ default:
+ err = -EINVAL;
+ break;
+ }
+
+ return err;
+}
+
+/*
+ * Used to get ready kevents from queue.
+ * @ctl_fd - kevent control descriptor which must be obtained through kevent_init().
+ * @min_nr - minimum number of ready kevents.
+ * @max_nr - maximum number of ready kevents.
+ * @timeout - timeout in nanoseconds to wait until some events are ready.
+ * @buf - buffer to place ready events.
+ * @flags - unused for now (will be used for mmap implementation).
+ */
+asmlinkage long sys_kevent_get_events(int ctl_fd, unsigned int min_nr, unsigned int max_nr,
+ __u64 timeout, struct ukevent __user *buf, unsigned flags)
+{
+ int err = -EINVAL;
+ struct file *file;
+ struct kevent_user *u;
+
+ file = fget(ctl_fd);
+ if (!file)
+ return -EBADF;
+
+ if (file->f_op != &kevent_user_fops)
+ goto out_fput;
+ u = file->private_data;
+
+ err = kevent_user_wait(file, u, min_nr, max_nr, timeout, buf);
+out_fput:
+ fput(file);
+ return err;
+}
+
+static struct vfsmount *kevent_mnt __read_mostly;
+
+static int kevent_get_sb(struct file_system_type *fs_type, int flags,
+ const char *dev_name, void *data, struct vfsmount *mnt)
+{
+ return get_sb_pseudo(fs_type, "kevent", NULL, 0xaabbccdd, mnt);
+}
+
+static struct file_system_type kevent_fs_type = {
+ .name = "keventfs",
+ .get_sb = kevent_get_sb,
+ .kill_sb = kill_anon_super,
+};
+
+static int keventfs_delete_dentry(struct dentry *dentry)
+{
+ return 1;
+}
+
+static struct dentry_operations keventfs_dentry_operations = {
+ .d_delete = keventfs_delete_dentry,
+};
+
+asmlinkage long sys_kevent_init(struct kevent_ring __user *ring, unsigned int num)
+{
+ struct qstr this;
+ char name[32];
+ struct dentry *dentry;
+ struct inode *inode;
+ struct file *file;
+ int err = -ENFILE, fd;
+ struct kevent_user *u;
+
+ if ((ring && !num) || (!ring && num) || (num == 1))
+ return -EINVAL;
+
+ file = get_empty_filp();
+ if (!file)
+ goto err_out_exit;
+
+ inode = new_inode(kevent_mnt->mnt_sb);
+ if (!inode)
+ goto err_out_fput;
+
+ inode->i_fop = &kevent_user_fops;
+
+ inode->i_state = I_DIRTY;
+ inode->i_mode = S_IRUSR | S_IWUSR;
+ inode->i_uid = current->fsuid;
+ inode->i_gid = current->fsgid;
+ inode->i_atime = inode->i_mtime = inode->i_ctime = CURRENT_TIME;
+
+ err = get_unused_fd();
+ if (err < 0)
+ goto err_out_iput;
+ fd = err;
+
+ err = -ENOMEM;
+ u = kevent_user_alloc(ring, num);
+ if (!u)
+ goto err_out_put_fd;
+
+ sprintf(name, "[%lu]", inode->i_ino);
+ this.name = name;
+ this.len = strlen(name);
+ this.hash = inode->i_ino;
+ dentry = d_alloc(kevent_mnt->mnt_sb->s_root, &this);
+ if (!dentry)
+ goto err_out_free;
+ dentry->d_op = &keventfs_dentry_operations;
+ d_add(dentry, inode);
+ file->f_vfsmnt = mntget(kevent_mnt);
+ file->f_dentry = dentry;
+ file->f_mapping = inode->i_mapping;
+ file->f_pos = 0;
+ file->f_flags = O_RDONLY;
+ file->f_op = &kevent_user_fops;
+ file->f_mode = FMODE_READ;
+ file->f_version = 0;
+ file->private_data = u;
+
+ fd_install(fd, file);
+
+ return fd;
+
+err_out_free:
+ kmem_cache_free(kevent_user_cache, u);
+err_out_put_fd:
+ put_unused_fd(fd);
+err_out_iput:
+ iput(inode);
+err_out_fput:
+ put_filp(file);
+err_out_exit:
+ return err;
+}
+
+/*
+ * This syscall waits until there is at least one ready event and free space in
+ * the ring buffer; ready events are then copied there.
+ * Returns the number of ready events actually copied into the ring buffer.
+ * After this function completes, userspace ring->ring_kidx will be updated.
+ *
+ * @ctl_fd - kevent file descriptor.
+ * @num - number of kevents to process.
+ * @timeout - this timeout specifies number of nanoseconds to wait until there is
+ * free space in kevent queue.
+ *
+ * Committing @num events means removing the first @num kevents from the
+ * ready queue and copying them into the ring buffer, in the order they
+ * were placed into the ready queue.
+ * One-shot kevents will be removed here, since there is no way they can be reused.
+ * Edge-triggered events will be requeued here for better performance.
+ */
+asmlinkage long sys_kevent_wait(int ctl_fd, unsigned int num, __u64 timeout)
+{
+ int err = -EINVAL, copied = 0;
+ struct file *file;
+ struct kevent_user *u;
+ struct kevent *k;
+ struct kevent_ring __user *ring;
+ unsigned int i;
+
+ file = fget(ctl_fd);
+ if (!file)
+ return -EBADF;
+
+ if (file->f_op != &kevent_user_fops)
+ goto out_fput;
+ u = file->private_data;
+
+ ring = u->pring;
+ if (!ring || num > u->ring_size)
+ goto out_fput;
+
+ if (!(file->f_flags & O_NONBLOCK)) {
+ wait_event_interruptible_timeout(u->wait,
+ ((u->ready_num >= 1) && (kevent_ring_space(u))),
+ clock_t_to_jiffies(nsec_to_clock_t(timeout)));
+ }
+
+ for (i=0; i<num; ++i) {
+ k = kevent_dequeue_ready_ring(u);
+ if (!k)
+ break;
+ kevent_complete_ready(k);
+
+ if (k->event.ret_flags & KEVENT_RET_COPY_FAILED)
+ break;
+ kevent_stat_ring(u);
+ copied++;
+ }
+
+ fput(file);
+
+ return copied;
+out_fput:
+ fput(file);
+ return err;
+}
+
+/*
+ * This syscall is used to commit events in ring buffer, i.e. mark appropriate
+ * entries as unused by userspace so subsequent kevent_wait() could overwrite them.
+ * This function returns the actual number of kevents that were committed.
+ * After this function is completed userspace ring->ring_uidx will be updated.
+ *
+ * @ctl_fd - kevent file descriptor.
+ * @start - index of the first kevent to be committed.
+ * @num - number of kevents to commit.
+ * @over - number of overflows given queue had.
+ *
+ * If several threads are going to commit the same events, and one of them
+ * has committed events while another was scheduled away for so long that the
+ * ring indexes have wrapped, incorrect entries could be committed; the @over
+ * overflow counter is checked to reject such stale commits.
+asmlinkage long sys_kevent_commit(int ctl_fd, unsigned int start, unsigned int num, unsigned int over)
+{
+ int err = -EINVAL, comm = 0, i, over_changed = 0;
+ struct file *file;
+ struct kevent_user *u;
+ struct kevent_ring __user *ring;
+
+ file = fget(ctl_fd);
+ if (!file)
+ return -EBADF;
+
+ if (file->f_op != &kevent_user_fops)
+ goto out_fput;
+ u = file->private_data;
+ ring = u->pring;
+
+ if (!ring || num > u->ring_size)
+ goto out_fput;
+
+ err = -EOVERFLOW;
+ mutex_lock(&u->ring_lock);
+ if (over != u->ring_over+1 && over != u->ring_over)
+ goto err_out_unlock;
+
+ if (start > u->uidx) {
+ if (over != u->ring_over+1) {
+ if (over == u->ring_over)
+ err = -EINVAL;
+ goto err_out_unlock;
+ } else {
+ /*
+ * To be or not to be, that is a question:
+ * Whether it is nobler in the mind to suffer...
+ * Stop. Not.
+ * To optimize 'the modulo' or not, that is a question:
+ * Are there many CPUs, which still being in the world production
+ * And suffer badly from that stuff in it.
+ */
+ unsigned int mod = (start + num) % u->ring_size;
+
+ if (mod >= u->uidx)
+ comm = mod - u->uidx;
+ }
+ } else {
+ if (over != u->ring_over)
+ goto err_out_unlock;
+
+ if (start + num >= u->uidx)
+ comm = start + num - u->uidx;
+ }
+
+ if (comm)
+ u->full = 0;
+
+ for (i=0; i<comm; ++i) {
+ if (kevent_ring_index_inc(&u->uidx, u->ring_size)) {
+ u->ring_over++;
+ over_changed = 1;
+ }
+ }
+
+ if (over_changed) {
+ if (put_user(u->ring_over, &ring->ring_over)) {
+ err = -EFAULT;
+ goto err_out_unlock;
+ }
+ }
+
+ if (put_user(u->uidx, &ring->ring_uidx)) {
+ err = -EFAULT;
+ goto err_out_unlock;
+ }
+ mutex_unlock(&u->ring_lock);
+
+ fput(file);
+
+ return comm;
+
+err_out_unlock:
+ mutex_unlock(&u->ring_lock);
+out_fput:
+ fput(file);
+ return err;
+}
+
+/*
+ * This syscall is used to perform various control operations
+ * on given kevent queue, which is obtained through kevent file descriptor @fd.
+ * @cmd - type of operation.
+ * @num - number of kevents to be processed.
+ * @arg - pointer to array of struct ukevent.
+ */
+asmlinkage long sys_kevent_ctl(int fd, unsigned int cmd, unsigned int num, struct ukevent __user *arg)
+{
+ int err = -EINVAL;
+ struct file *file;
+
+ file = fget(fd);
+ if (!file)
+ return -EBADF;
+
+ if (file->f_op != &kevent_user_fops)
+ goto out_fput;
+
+ err = kevent_ctl_process(file, cmd, num, arg);
+
+out_fput:
+ fput(file);
+ return err;
+}
+
+/*
+ * Kevent subsystem initialization - create caches and register
+ * filesystem to get control file descriptors from.
+ */
+static int __init kevent_user_init(void)
+{
+ int err = 0;
+
+ kevent_cache = kmem_cache_create("kevent_cache",
+ sizeof(struct kevent), 0, SLAB_PANIC, NULL, NULL);
+
+ kevent_user_cache = kmem_cache_create("kevent_user_cache",
+ sizeof(struct kevent_user), 0, SLAB_PANIC, NULL, NULL);
+
+ err = register_filesystem(&kevent_fs_type);
+ if (err)
+ goto err_out_exit;
+
+ kevent_mnt = kern_mount(&kevent_fs_type);
+ err = PTR_ERR(kevent_mnt);
+ if (IS_ERR(kevent_mnt))
+ goto err_out_unreg;
+
+ printk(KERN_INFO "KEVENT subsystem has been successfully registered.\n");
+
+ return 0;
+
+err_out_unreg:
+ unregister_filesystem(&kevent_fs_type);
+err_out_exit:
+ kmem_cache_destroy(kevent_cache);
+ return err;
+}
+
+module_init(kevent_user_init);
diff --git a/kernel/sys_ni.c b/kernel/sys_ni.c
index 7a3b2e7..3b7d35f 100644
--- a/kernel/sys_ni.c
+++ b/kernel/sys_ni.c
@@ -122,6 +122,12 @@ cond_syscall(ppc_rtas);
cond_syscall(sys_spu_run);
cond_syscall(sys_spu_create);
+cond_syscall(sys_kevent_get_events);
+cond_syscall(sys_kevent_ctl);
+cond_syscall(sys_kevent_wait);
+cond_syscall(sys_kevent_commit);
+cond_syscall(sys_kevent_init);
+
/* mmu depending weak syscall entries */
cond_syscall(sys_mprotect);
cond_syscall(sys_msync);
Description.
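The ring buffer described in this documentation is driven by two indexes
and an overflow counter. A minimal userspace sketch of how a commit
advances ring_uidx and counts wraps is given below. It is an illustration
only: struct example_ring and both helpers are hypothetical names, and the
wrap behaviour of the kernel's kevent_ring_index_inc() is assumed from how
sys_kevent_commit() uses its return value.

```c
#include <assert.h>

/* Hypothetical userspace model of the indexes kept for one ring. */
struct example_ring {
	unsigned int uidx;	/* first entry userspace has not committed yet */
	unsigned int over;	/* how many times uidx has wrapped */
	unsigned int size;	/* number of slots in the ring */
};

/*
 * Assumed behaviour of kevent_ring_index_inc(): advance the index by one
 * slot and return 1 if it wrapped back to the start of the ring.
 */
static int example_ring_index_inc(unsigned int *idx, unsigned int size)
{
	assert(size > 0);

	if (++*idx >= size) {
		*idx = 0;
		return 1;
	}
	return 0;
}

/* Commit @num slots: advance uidx and bump the overflow counter on wrap. */
static unsigned int example_ring_commit(struct example_ring *r, unsigned int num)
{
	unsigned int i;

	for (i = 0; i < num; ++i)
		if (example_ring_index_inc(&r->uidx, r->size))
			r->over++;

	return num;
}
```

A thread would remember the overflow count it saw when it read the entries
and pass it along with its commit, so that a commit issued after the
indexes have wrapped can be told apart from a current one.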
diff --git a/Documentation/kevent.txt b/Documentation/kevent.txt
new file mode 100644
index 0000000..49e1cc2
--- /dev/null
+++ b/Documentation/kevent.txt
@@ -0,0 +1,230 @@
+Description.
+
+int kevent_init(struct kevent_ring *ring, unsigned int ring_size);
+
+ring - pointer to allocated ring buffer
+ring_size - size of the ring buffer in events
+
+Return value: kevent control file descriptor or negative error value.
+
+ struct kevent_ring
+ {
+ unsigned int ring_kidx, ring_uidx, ring_over;
+ struct ukevent event[0];
+ };
+
+ring_kidx - index in the ring buffer where kernel will put new events
+ when kevent_wait() or kevent_get_events() is called
+ring_uidx - index of the first entry userspace can start reading from
+ring_over - number of times ring_uidx has overflowed (wrapped) since the start.
+ The overflow counter prevents the situation where two threads
+ are going to free the same events, but one of them was scheduled
+ away for so long that the ring indexes wrapped; when that
+ thread is awakened, it would otherwise free events other than
+ those it was supposed to free.
+
+Example userspace code (ring_buffer.c) can be found on project's homepage.
+
+Each kevent syscall can be a so-called cancellation point in glibc, i.e. when
+a thread has been cancelled in a kevent syscall, the thread can be safely
+removed and no events will be lost, since each syscall (kevent_wait() or
+kevent_get_events()) copies events into a special ring buffer, accessible
+from other threads or even processes (if shared memory is used).
+
+When a kevent is removed (not dequeued when it is ready, but just removed),
+it is not copied into the ring buffer even if it was ready: if it is
+removed, no one cares about it (otherwise the user would wait until it became
+ready and get it the usual way using kevent_get_events() or kevent_wait()),
+and thus there is no need to copy it to the ring buffer.
+
+-------------------------------------------------------------------------------
+
+
+int kevent_ctl(int fd, unsigned int cmd, unsigned int num, struct ukevent *arg);
+
+fd - is the file descriptor referring to the kevent queue to manipulate.
+It is returned by the kevent_init() syscall described above, which replaced
+opening the "/dev/kevent" char device used by earlier patchsets.
+
+cmd - is the requested operation. It can be one of the following:
+ KEVENT_CTL_ADD - add event notification
+ KEVENT_CTL_REMOVE - remove event notification
+ KEVENT_CTL_MODIFY - modify existing notification
+
+num - number of struct ukevent in the array pointed to by arg
+arg - array of struct ukevent
+
+Return value:
+ number of events processed or negative error value.
+
+When called, kevent_ctl will carry out the operation specified in the
+cmd parameter.
+-------------------------------------------------------------------------------
+
+ int kevent_get_events(int ctl_fd, unsigned int min_nr, unsigned int max_nr,
+ __u64 timeout, struct ukevent *buf, unsigned flags);
+
+ctl_fd - file descriptor referring to the kevent queue
+min_nr - minimum number of completed events that kevent_get_events will block
+ waiting for
+max_nr - number of struct ukevent in buf
+timeout - number of nanoseconds to wait before returning less than min_nr
+ events. If this is -1, then wait forever.
+buf - pointer to an array of struct ukevent.
+flags - unused
+
+Return value:
+ number of events copied or negative error value.
+
+kevent_get_events will wait timeout nanoseconds for at least min_nr completed
+events, copying completed struct ukevents to buf and deleting any
+KEVENT_REQ_ONESHOT event requests. In nonblocking mode it returns as many
+events as possible, but not more than max_nr. In blocking mode it waits until
+the timeout expires or at least min_nr events are ready.
+
+This function copies events into the ring buffer if it was initialized; if the
+ring buffer is full, the KEVENT_RET_COPY_FAILED flag is set in ret_flags.
+-------------------------------------------------------------------------------
+
+ int kevent_wait(int ctl_fd, unsigned int num, __u64 timeout);
+
+ctl_fd - file descriptor referring to the kevent queue
+num - number of kevents to process
+timeout - this timeout specifies number of nanoseconds to wait until there is
+ free space in kevent queue
+
+Return value:
+ number of events copied into ring buffer or negative error value.
+
+This syscall waits until either the timeout expires or at least one event
+becomes ready and there is free space in the ring buffer. It then copies ready
+events into the ring buffer and returns.
+If a kevent is one-shot, it is removed in this syscall.
+If a kevent is edge-triggered (KEVENT_REQ_ET flag is set in 'req_flags'), it is
+requeued in this syscall for performance reasons.
+-------------------------------------------------------------------------------
+
+ int kevent_commit(int ctl_fd, unsigned int start,
+ unsigned int num, unsigned int over);
+
+ctl_fd - file descriptor referring to the kevent queue
+start - index of the first entry in the ring buffer to commit
+num - number of kevents to commit
+over - overflow count for the given start value
+
+Return value:
+ number of committed kevents or negative error value.
+
+This function commits, i.e. marks as empty, slots in the ring buffer, so
+they can be reused once userspace has finished processing those entries.
+
+The overflow counter prevents the situation where two threads are going
+to free the same events, but one of them was scheduled away for so long
+that the ring indexes wrapped; when that thread is awakened, it would
+otherwise free events other than those it was supposed to free.
+
+The returned number of committed events can be smaller than the requested
+number; this happens when several threads try to commit the same
+events.
+-------------------------------------------------------------------------------
+
+The bulk of the interface is entirely done through the ukevent struct.
+It is used to add event requests, modify existing event requests,
+specify which event requests to remove, and return completed events.
+
+struct ukevent contains the following members:
+
+struct kevent_id id
+ Id of this request, e.g. socket number, file descriptor and so on
+__u32 type
+ Event type, e.g. KEVENT_SOCK, KEVENT_INODE, KEVENT_TIMER and so on
+__u32 event
+ Event itself, e.g. SOCK_ACCEPT, INODE_CREATED, TIMER_FIRED
+__u32 req_flags
+ Per-event request flags,
+
+ KEVENT_REQ_ONESHOT
+ event will be removed when it is ready
+
+ KEVENT_REQ_WAKEUP_ALL
+ Kevent wakes up only first thread interested in given event,
+ or all threads if this flag is set.
+
+ KEVENT_REQ_ET
+ Edge Triggered behaviour. It is an optimisation which allows a ready
+ and dequeued (i.e. copied to userspace) event to be moved back into the
+ set of interest for the given storage (socket, inode and so on). It is
+ very useful for cases when the same event should be used many times
+ (like reading from pipe). It is similar to epoll()'s EPOLLET flag.
+
+ KEVENT_REQ_LAST_CHECK
+ If set, a last check is performed on the kevent (the appropriate
+ callback is called) when the kevent is marked as ready and has been
+ removed from the ready queue. If the kevent is confirmed to be ready
+ (k->callbacks.callback(k) returns true), it is copied to userspace;
+ otherwise it is requeued back to storage. The second (checking) call
+ is performed with this bit cleared, so the callback can detect whether
+ it was called from kevent_storage_ready() (bit is set) or from
+ kevent_dequeue_ready() (bit is cleared). If the kevent is requeued,
+ the bit is set again.
+
+ KEVENT_REQ_ALWAYS_QUEUE
+ If this flag is set, a kevent that is already ready at enqueue time is
+ queued into the ready queue; without the flag such a kevent is copied
+ back to userspace and is not queued into the storage.
+
+__u32 ret_flags
+ Per-event return flags
+
+ KEVENT_RET_BROKEN
+ Kevent is broken
+
+ KEVENT_RET_DONE
+ Kevent processing was finished successfully
+
+ KEVENT_RET_COPY_FAILED
+ Kevent was not copied into ring buffer due to some error conditions.
+
+__u32 ret_data
+ Event return data. Event originator fills it with anything it likes
+ (for example timer notifications put number of milliseconds when timer
+ has fired).
+union { __u32 user[2]; void *ptr; }
+ User's data. It is not used, just copied to/from user. The whole structure
+ is aligned to 8 bytes already, so the last union is aligned properly.
+
+-------------------------------------------------------------------------------
+
+Usage
+
+For KEVENT_CTL_ADD, all fields relevant to the event type must be filled
+(id, type, event, req_flags).
+After kevent_ctl(..., KEVENT_CTL_ADD, ...) returns each struct's ret_flags
+should be checked to see if the event is already broken or done.
+
+For KEVENT_CTL_MODIFY, the id, req_flags, and user and event fields must be
+set and an existing kevent request must have matching id and user fields. If
+match is found, req_flags and event are replaced with the newly supplied
+values and requeueing is started, so modified kevent can be checked and
+probably marked as ready immediately. If a match can't be found, the
+passed in ukevent's ret_flags has KEVENT_RET_BROKEN set. KEVENT_RET_DONE is
+always set.
+
+For KEVENT_CTL_REMOVE, the id and user fields must be set and an existing
+kevent request must have matching id and user fields. If a match is found,
+the kevent request is removed. If a match can't be found, the passed in
+ukevent's ret_flags has KEVENT_RET_BROKEN set. KEVENT_RET_DONE is always set.
+
+For kevent_get_events, the entire structure is returned.
+
+-------------------------------------------------------------------------------
+
+Usage cases
+
+kevent_timer
+struct ukevent should contain following fields:
+ type - KEVENT_TIMER
+ event - KEVENT_TIMER_FIRED
+ req_flags - KEVENT_REQ_ONESHOT if you want to fire that timer only once
+ id.raw[0] - number of seconds after commit when this timer should expire
+ id.raw[1] - additional number of nanoseconds on top of the seconds
Timer notifications.
Timer notifications can be used for fine grained per-process time
management, since interval timers are very inconvenient to use,
and they are limited.
This subsystem uses high-resolution timers.
id.raw[0] is used as number of seconds
id.raw[1] is used as number of nanoseconds
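As a rough illustration of the id.raw[] layout above, the following
userspace sketch splits a timeout given in nanoseconds into the two
fields. struct example_kevent_id and example_timer_id_from_ns() are
hypothetical stand-ins, not part of this patch; the real struct kevent_id
comes from the kevent headers.

```c
#include <assert.h>
#include <stdint.h>

#define EXAMPLE_NSEC_PER_SEC 1000000000ULL

/* Hypothetical stand-in for the id field of struct ukevent. */
struct example_kevent_id {
	uint32_t raw[2];
};

/* Split @ns into whole seconds (raw[0]) and leftover nanoseconds (raw[1]). */
static struct example_kevent_id example_timer_id_from_ns(uint64_t ns)
{
	struct example_kevent_id id;
	uint64_t sec = ns / EXAMPLE_NSEC_PER_SEC;

	/* raw[0] is only 32 bits wide in this sketch. */
	assert(sec <= UINT32_MAX);

	id.raw[0] = (uint32_t)sec;
	id.raw[1] = (uint32_t)(ns % EXAMPLE_NSEC_PER_SEC);
	return id;
}
```

A 1.5 second timer would thus be requested with id.raw[0] = 1 and
id.raw[1] = 500000000.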
Signed-off-by: Evgeniy Polyakov <[email protected]>
diff --git a/kernel/kevent/kevent_timer.c b/kernel/kevent/kevent_timer.c
new file mode 100644
index 0000000..df93049
--- /dev/null
+++ b/kernel/kevent/kevent_timer.c
@@ -0,0 +1,112 @@
+/*
+ * 2006 Copyright (c) Evgeniy Polyakov <[email protected]>
+ * All rights reserved.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software
+ * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA
+ */
+
+#include <linux/kernel.h>
+#include <linux/types.h>
+#include <linux/list.h>
+#include <linux/slab.h>
+#include <linux/spinlock.h>
+#include <linux/hrtimer.h>
+#include <linux/jiffies.h>
+#include <linux/kevent.h>
+
+struct kevent_timer
+{
+ struct hrtimer ktimer;
+ struct kevent_storage ktimer_storage;
+ struct kevent *ktimer_event;
+};
+
+static int kevent_timer_func(struct hrtimer *timer)
+{
+ struct kevent_timer *t = container_of(timer, struct kevent_timer, ktimer);
+ struct kevent *k = t->ktimer_event;
+
+ kevent_storage_ready(&t->ktimer_storage, NULL, KEVENT_MASK_ALL);
+ hrtimer_forward(timer, timer->base->softirq_time,
+ ktime_set(k->event.id.raw[0], k->event.id.raw[1]));
+ return HRTIMER_RESTART;
+}
+
+static struct lock_class_key kevent_timer_key;
+
+static int kevent_timer_enqueue(struct kevent *k)
+{
+ int err;
+ struct kevent_timer *t;
+
+ t = kmalloc(sizeof(struct kevent_timer), GFP_KERNEL);
+ if (!t)
+ return -ENOMEM;
+
+ hrtimer_init(&t->ktimer, CLOCK_MONOTONIC, HRTIMER_REL);
+ t->ktimer.expires = ktime_set(k->event.id.raw[0], k->event.id.raw[1]);
+ t->ktimer.function = kevent_timer_func;
+ t->ktimer_event = k;
+
+ err = kevent_storage_init(&t->ktimer, &t->ktimer_storage);
+ if (err)
+ goto err_out_free;
+ lockdep_set_class(&t->ktimer_storage.lock, &kevent_timer_key);
+
+ err = kevent_storage_enqueue(&t->ktimer_storage, k);
+ if (err)
+ goto err_out_st_fini;
+
+ hrtimer_start(&t->ktimer, t->ktimer.expires, HRTIMER_REL);
+
+ return 0;
+
+err_out_st_fini:
+ kevent_storage_fini(&t->ktimer_storage);
+err_out_free:
+ kfree(t);
+
+ return err;
+}
+
+static int kevent_timer_dequeue(struct kevent *k)
+{
+ struct kevent_storage *st = k->st;
+ struct kevent_timer *t = container_of(st, struct kevent_timer, ktimer_storage);
+
+ hrtimer_cancel(&t->ktimer);
+ kevent_storage_dequeue(st, k);
+ kfree(t);
+
+ return 0;
+}
+
+static int kevent_timer_callback(struct kevent *k)
+{
+ k->event.ret_data[0] = jiffies_to_msecs(jiffies);
+ return 1;
+}
+
+static int __init kevent_init_timer(void)
+{
+ struct kevent_callbacks tc = {
+ .callback = &kevent_timer_callback,
+ .enqueue = &kevent_timer_enqueue,
+ .dequeue = &kevent_timer_dequeue};
+
+ return kevent_add_callbacks(&tc, KEVENT_TIMER);
+}
+module_init(kevent_init_timer);
+
Socket notifications.
This patch includes socket send/recv/accept notifications.
With a trivial web server based on kevent and these features
instead of epoll, its performance increased more than noticeably.
More details about various benchmarks and server itself
(evserver_kevent.c) can be found on project's homepage.
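The readiness check performed by the socket callback in this patch can be
modelled in userspace roughly as follows. This is a sketch under
assumptions: the EXAMPLE_SOCKET_* bit values are made up for illustration
(the real KEVENT_SOCKET_* values live in the kevent headers), and
example_socket_ready() is a hypothetical name.

```c
#include <assert.h>
#include <poll.h>

/* Illustrative bit values only; not taken from the kevent headers. */
#define EXAMPLE_SOCKET_RECV	0x1
#define EXAMPLE_SOCKET_ACCEPT	0x2
#define EXAMPLE_SOCKET_SEND	0x4
#define EXAMPLE_SOCKET_ALL	(EXAMPLE_SOCKET_RECV | EXAMPLE_SOCKET_ACCEPT | \
				 EXAMPLE_SOCKET_SEND)

/*
 * Returns 1 if a requested event is ready, -1 on error or hangup,
 * and 0 if the kevent should keep waiting. The kernel callback also
 * tests POLLRDNORM and POLLWRNORM; they are left out here for brevity.
 */
static int example_socket_ready(unsigned int revents, unsigned int requested)
{
	/* Only the three event bits defined above are understood. */
	assert((requested & ~EXAMPLE_SOCKET_ALL) == 0);

	if ((revents & POLLIN) &&
	    (requested & (EXAMPLE_SOCKET_RECV | EXAMPLE_SOCKET_ACCEPT)))
		return 1;
	if ((revents & POLLOUT) && (requested & EXAMPLE_SOCKET_SEND))
		return 1;
	if (revents & (POLLERR | POLLHUP))
		return -1;
	return 0;
}
```

Readiness is tested before the error bits, so a socket that is both
readable and hung up is still reported as ready to a receiver.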
Signed-off-by: Evgeniy Polyakov <[email protected]>
diff --git a/fs/inode.c b/fs/inode.c
index ada7643..2740617 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -21,6 +21,7 @@
#include <linux/cdev.h>
#include <linux/bootmem.h>
#include <linux/inotify.h>
+#include <linux/kevent.h>
#include <linux/mount.h>
/*
@@ -164,12 +165,18 @@ static struct inode *alloc_inode(struct
}
inode->i_private = 0;
inode->i_mapping = mapping;
+#if defined CONFIG_KEVENT_SOCKET || defined CONFIG_KEVENT_PIPE
+ kevent_storage_init(inode, &inode->st);
+#endif
}
return inode;
}
void destroy_inode(struct inode *inode)
{
+#if defined CONFIG_KEVENT_SOCKET || defined CONFIG_KEVENT_PIPE
+ kevent_storage_fini(&inode->st);
+#endif
BUG_ON(inode_has_buffers(inode));
security_inode_free(inode);
if (inode->i_sb->s_op->destroy_inode)
diff --git a/include/net/sock.h b/include/net/sock.h
index edd4d73..d48ded8 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -48,6 +48,7 @@
#include <linux/netdevice.h>
#include <linux/skbuff.h> /* struct sk_buff */
#include <linux/security.h>
+#include <linux/kevent.h>
#include <linux/filter.h>
@@ -450,6 +451,21 @@ static inline int sk_stream_memory_free(
extern void sk_stream_rfree(struct sk_buff *skb);
+struct socket_alloc {
+ struct socket socket;
+ struct inode vfs_inode;
+};
+
+static inline struct socket *SOCKET_I(struct inode *inode)
+{
+ return &container_of(inode, struct socket_alloc, vfs_inode)->socket;
+}
+
+static inline struct inode *SOCK_INODE(struct socket *socket)
+{
+ return &container_of(socket, struct socket_alloc, socket)->vfs_inode;
+}
+
static inline void sk_stream_set_owner_r(struct sk_buff *skb, struct sock *sk)
{
skb->sk = sk;
@@ -477,6 +493,7 @@ static inline void sk_add_backlog(struct
sk->sk_backlog.tail = skb;
}
skb->next = NULL;
+ kevent_socket_notify(sk, KEVENT_SOCKET_RECV);
}
#define sk_wait_event(__sk, __timeo, __condition) \
@@ -679,21 +696,6 @@ static inline struct kiocb *siocb_to_kio
return si->kiocb;
}
-struct socket_alloc {
- struct socket socket;
- struct inode vfs_inode;
-};
-
-static inline struct socket *SOCKET_I(struct inode *inode)
-{
- return &container_of(inode, struct socket_alloc, vfs_inode)->socket;
-}
-
-static inline struct inode *SOCK_INODE(struct socket *socket)
-{
- return &container_of(socket, struct socket_alloc, socket)->vfs_inode;
-}
-
extern void __sk_stream_mem_reclaim(struct sock *sk);
extern int sk_stream_mem_schedule(struct sock *sk, int size, int kind);
diff --git a/include/net/tcp.h b/include/net/tcp.h
index 7a093d0..69f4ad2 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -857,6 +857,7 @@ static inline int tcp_prequeue(struct so
tp->ucopy.memory = 0;
} else if (skb_queue_len(&tp->ucopy.prequeue) == 1) {
wake_up_interruptible(sk->sk_sleep);
+ kevent_socket_notify(sk, KEVENT_SOCKET_RECV|KEVENT_SOCKET_SEND);
if (!inet_csk_ack_scheduled(sk))
inet_csk_reset_xmit_timer(sk, ICSK_TIME_DACK,
(3 * TCP_RTO_MIN) / 4,
diff --git a/kernel/kevent/kevent_socket.c b/kernel/kevent/kevent_socket.c
new file mode 100644
index 0000000..9c24b5b
--- /dev/null
+++ b/kernel/kevent/kevent_socket.c
@@ -0,0 +1,142 @@
+/*
+ * kevent_socket.c
+ *
+ * 2006 Copyright (c) Evgeniy Polyakov <[email protected]>
+ * All rights reserved.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software
+ * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA
+ */
+
+#include <linux/kernel.h>
+#include <linux/types.h>
+#include <linux/list.h>
+#include <linux/slab.h>
+#include <linux/spinlock.h>
+#include <linux/timer.h>
+#include <linux/file.h>
+#include <linux/tcp.h>
+#include <linux/kevent.h>
+
+#include <net/sock.h>
+#include <net/request_sock.h>
+#include <net/inet_connection_sock.h>
+
+static int kevent_socket_callback(struct kevent *k)
+{
+ struct inode *inode = k->st->origin;
+ unsigned int events = SOCKET_I(inode)->ops->poll(SOCKET_I(inode)->file, SOCKET_I(inode), NULL);
+
+ if ((events & (POLLIN | POLLRDNORM)) && (k->event.event & (KEVENT_SOCKET_RECV | KEVENT_SOCKET_ACCEPT)))
+ return 1;
+ if ((events & (POLLOUT | POLLWRNORM)) && (k->event.event & KEVENT_SOCKET_SEND))
+ return 1;
+ if (events & (POLLERR | POLLHUP))
+ return -1;
+ return 0;
+}
+
+int kevent_socket_enqueue(struct kevent *k)
+{
+ struct inode *inode;
+ struct socket *sock;
+ int err = -EBADF;
+
+ sock = sockfd_lookup(k->event.id.raw[0], &err);
+ if (!sock)
+ goto err_out_exit;
+
+ inode = igrab(SOCK_INODE(sock));
+ if (!inode)
+ goto err_out_fput;
+
+ err = kevent_storage_enqueue(&inode->st, k);
+ if (err)
+ goto err_out_iput;
+
+ if (k->event.req_flags & KEVENT_REQ_ALWAYS_QUEUE) {
+ kevent_requeue(k);
+ err = 0;
+ } else {
+ err = k->callbacks.callback(k);
+ if (err)
+ goto err_out_dequeue;
+ }
+
+ return err;
+
+err_out_dequeue:
+ kevent_storage_dequeue(k->st, k);
+err_out_iput:
+ iput(inode);
+err_out_fput:
+ sockfd_put(sock);
+err_out_exit:
+ return err;
+}
+
+int kevent_socket_dequeue(struct kevent *k)
+{
+ struct inode *inode = k->st->origin;
+ struct socket *sock;
+
+ kevent_storage_dequeue(k->st, k);
+
+ sock = SOCKET_I(inode);
+ iput(inode);
+ sockfd_put(sock);
+
+ return 0;
+}
+
+void kevent_socket_notify(struct sock *sk, u32 event)
+{
+ if (sk->sk_socket)
+ kevent_storage_ready(&SOCK_INODE(sk->sk_socket)->st, NULL, event);
+}
+
+/*
+ * It is required for network protocols compiled as modules, like IPv6.
+ */
+EXPORT_SYMBOL_GPL(kevent_socket_notify);
+
+#ifdef CONFIG_LOCKDEP
+static struct lock_class_key kevent_sock_key;
+
+void kevent_socket_reinit(struct socket *sock)
+{
+ struct inode *inode = SOCK_INODE(sock);
+
+ lockdep_set_class(&inode->st.lock, &kevent_sock_key);
+}
+
+void kevent_sk_reinit(struct sock *sk)
+{
+ if (sk->sk_socket) {
+ struct inode *inode = SOCK_INODE(sk->sk_socket);
+
+ lockdep_set_class(&inode->st.lock, &kevent_sock_key);
+ }
+}
+#endif
+static int __init kevent_init_socket(void)
+{
+ struct kevent_callbacks sc = {
+ .callback = &kevent_socket_callback,
+ .enqueue = &kevent_socket_enqueue,
+ .dequeue = &kevent_socket_dequeue};
+
+ return kevent_add_callbacks(&sc, KEVENT_SOCKET);
+}
+module_init(kevent_init_socket);
diff --git a/net/core/sock.c b/net/core/sock.c
index b77e155..7d5fa3e 100644
--- a/net/core/sock.c
+++ b/net/core/sock.c
@@ -1402,6 +1402,7 @@ static void sock_def_wakeup(struct sock
if (sk->sk_sleep && waitqueue_active(sk->sk_sleep))
wake_up_interruptible_all(sk->sk_sleep);
read_unlock(&sk->sk_callback_lock);
+ kevent_socket_notify(sk, KEVENT_SOCKET_RECV|KEVENT_SOCKET_SEND);
}
static void sock_def_error_report(struct sock *sk)
@@ -1411,6 +1412,7 @@ static void sock_def_error_report(struct
wake_up_interruptible(sk->sk_sleep);
sk_wake_async(sk,0,POLL_ERR);
read_unlock(&sk->sk_callback_lock);
+ kevent_socket_notify(sk, KEVENT_SOCKET_RECV|KEVENT_SOCKET_SEND);
}
static void sock_def_readable(struct sock *sk, int len)
@@ -1420,6 +1422,7 @@ static void sock_def_readable(struct soc
wake_up_interruptible(sk->sk_sleep);
sk_wake_async(sk,1,POLL_IN);
read_unlock(&sk->sk_callback_lock);
+ kevent_socket_notify(sk, KEVENT_SOCKET_RECV|KEVENT_SOCKET_SEND);
}
static void sock_def_write_space(struct sock *sk)
@@ -1439,6 +1442,7 @@ static void sock_def_write_space(struct
}
read_unlock(&sk->sk_callback_lock);
+ kevent_socket_notify(sk, KEVENT_SOCKET_SEND|KEVENT_SOCKET_RECV);
}
static void sock_def_destruct(struct sock *sk)
@@ -1489,6 +1493,8 @@ void sock_init_data(struct socket *sock,
sk->sk_state = TCP_CLOSE;
sk->sk_socket = sock;
+ kevent_sk_reinit(sk);
+
sock_set_flag(sk, SOCK_ZAPPED);
if(sock)
@@ -1555,8 +1561,10 @@ void fastcall release_sock(struct sock *
if (sk->sk_backlog.tail)
__release_sock(sk);
sk->sk_lock.owner = NULL;
- if (waitqueue_active(&sk->sk_lock.wq))
+ if (waitqueue_active(&sk->sk_lock.wq)) {
wake_up(&sk->sk_lock.wq);
+ kevent_socket_notify(sk, KEVENT_SOCKET_RECV|KEVENT_SOCKET_SEND);
+ }
spin_unlock_bh(&sk->sk_lock.slock);
}
EXPORT_SYMBOL(release_sock);
diff --git a/net/core/stream.c b/net/core/stream.c
index d1d7dec..2878c2a 100644
--- a/net/core/stream.c
+++ b/net/core/stream.c
@@ -36,6 +36,7 @@ void sk_stream_write_space(struct sock *
wake_up_interruptible(sk->sk_sleep);
if (sock->fasync_list && !(sk->sk_shutdown & SEND_SHUTDOWN))
sock_wake_async(sock, 2, POLL_OUT);
+ kevent_socket_notify(sk, KEVENT_SOCKET_SEND|KEVENT_SOCKET_RECV);
}
}
diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index 3f884ce..e7dd989 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -3119,6 +3119,7 @@ static void tcp_ofo_queue(struct sock *s
__skb_unlink(skb, &tp->out_of_order_queue);
__skb_queue_tail(&sk->sk_receive_queue, skb);
+ kevent_socket_notify(sk, KEVENT_SOCKET_RECV);
tp->rcv_nxt = TCP_SKB_CB(skb)->end_seq;
if(skb->h.th->fin)
tcp_fin(skb, sk, skb->h.th);
diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
index c83938b..b0dd70d 100644
--- a/net/ipv4/tcp_ipv4.c
+++ b/net/ipv4/tcp_ipv4.c
@@ -61,6 +61,7 @@
#include <linux/jhash.h>
#include <linux/init.h>
#include <linux/times.h>
+#include <linux/kevent.h>
#include <net/icmp.h>
#include <net/inet_hashtables.h>
@@ -870,6 +871,7 @@ int tcp_v4_conn_request(struct sock *sk,
reqsk_free(req);
} else {
inet_csk_reqsk_queue_hash_add(sk, req, TCP_TIMEOUT_INIT);
+ kevent_socket_notify(sk, KEVENT_SOCKET_ACCEPT);
}
return 0;
diff --git a/net/socket.c b/net/socket.c
index 1bc4167..5582b4a 100644
--- a/net/socket.c
+++ b/net/socket.c
@@ -85,6 +85,7 @@
#include <linux/kmod.h>
#include <linux/audit.h>
#include <linux/wireless.h>
+#include <linux/kevent.h>
#include <asm/uaccess.h>
#include <asm/unistd.h>
@@ -490,6 +491,8 @@ static struct socket *sock_alloc(void)
inode->i_uid = current->fsuid;
inode->i_gid = current->fsgid;
+ kevent_socket_reinit(sock);
+
get_cpu_var(sockets_in_use)++;
put_cpu_var(sockets_in_use);
return sock;
Pipe notifications.
diff --git a/fs/pipe.c b/fs/pipe.c
index f3b6f71..aeaee9c 100644
--- a/fs/pipe.c
+++ b/fs/pipe.c
@@ -16,6 +16,7 @@
#include <linux/uio.h>
#include <linux/highmem.h>
#include <linux/pagemap.h>
+#include <linux/kevent.h>
#include <asm/uaccess.h>
#include <asm/ioctls.h>
@@ -312,6 +313,7 @@ redo:
break;
}
if (do_wakeup) {
+ kevent_pipe_notify(inode, KEVENT_SOCKET_SEND);
wake_up_interruptible_sync(&pipe->wait);
kill_fasync(&pipe->fasync_writers, SIGIO, POLL_OUT);
}
@@ -321,6 +323,7 @@ redo:
/* Signal writers asynchronously that there is more room. */
if (do_wakeup) {
+ kevent_pipe_notify(inode, KEVENT_SOCKET_SEND);
wake_up_interruptible(&pipe->wait);
kill_fasync(&pipe->fasync_writers, SIGIO, POLL_OUT);
}
@@ -490,6 +493,7 @@ redo2:
break;
}
if (do_wakeup) {
+ kevent_pipe_notify(inode, KEVENT_SOCKET_RECV);
wake_up_interruptible_sync(&pipe->wait);
kill_fasync(&pipe->fasync_readers, SIGIO, POLL_IN);
do_wakeup = 0;
@@ -501,6 +505,7 @@ redo2:
out:
mutex_unlock(&inode->i_mutex);
if (do_wakeup) {
+ kevent_pipe_notify(inode, KEVENT_SOCKET_RECV);
wake_up_interruptible(&pipe->wait);
kill_fasync(&pipe->fasync_readers, SIGIO, POLL_IN);
}
@@ -605,6 +610,7 @@ pipe_release(struct inode *inode, int de
free_pipe_info(inode);
} else {
wake_up_interruptible(&pipe->wait);
+ kevent_pipe_notify(inode, KEVENT_SOCKET_SEND|KEVENT_SOCKET_RECV);
kill_fasync(&pipe->fasync_readers, SIGIO, POLL_IN);
kill_fasync(&pipe->fasync_writers, SIGIO, POLL_OUT);
}
diff --git a/kernel/kevent/kevent_pipe.c b/kernel/kevent/kevent_pipe.c
new file mode 100644
index 0000000..5080642
--- /dev/null
+++ b/kernel/kevent/kevent_pipe.c
@@ -0,0 +1,117 @@
+/*
+ * kevent_pipe.c
+ *
+ * 2006 Copyright (c) Evgeniy Polyakov <[email protected]>
+ * All rights reserved.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software
+ * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA
+ */
+
+#include <linux/kernel.h>
+#include <linux/types.h>
+#include <linux/slab.h>
+#include <linux/spinlock.h>
+#include <linux/file.h>
+#include <linux/fs.h>
+#include <linux/kevent.h>
+#include <linux/pipe_fs_i.h>
+
+static int kevent_pipe_callback(struct kevent *k)
+{
+ struct inode *inode = k->st->origin;
+ struct pipe_inode_info *pipe = inode->i_pipe;
+ int nrbufs = pipe->nrbufs;
+
+ if (k->event.event & KEVENT_SOCKET_RECV && nrbufs > 0) {
+ if (!pipe->writers)
+ return -1;
+ return 1;
+ }
+
+ if (k->event.event & KEVENT_SOCKET_SEND && nrbufs < PIPE_BUFFERS) {
+ if (!pipe->readers)
+ return -1;
+ return 1;
+ }
+
+ return 0;
+}
+
+int kevent_pipe_enqueue(struct kevent *k)
+{
+ struct file *pipe;
+ int err = -EBADF;
+ struct inode *inode;
+
+ pipe = fget(k->event.id.raw[0]);
+ if (!pipe)
+ goto err_out_exit;
+
+ inode = igrab(pipe->f_dentry->d_inode);
+ if (!inode)
+ goto err_out_fput;
+
+ err = kevent_storage_enqueue(&inode->st, k);
+ if (err)
+ goto err_out_iput;
+
+ if (k->event.req_flags & KEVENT_REQ_ALWAYS_QUEUE) {
+ kevent_requeue(k);
+ err = 0;
+ } else {
+ err = k->callbacks.callback(k);
+ if (err)
+ goto err_out_dequeue;
+ }
+
+ fput(pipe);
+
+ return err;
+
+err_out_dequeue:
+ kevent_storage_dequeue(k->st, k);
+err_out_iput:
+ iput(inode);
+err_out_fput:
+ fput(pipe);
+err_out_exit:
+ return err;
+}
+
+int kevent_pipe_dequeue(struct kevent *k)
+{
+ struct inode *inode = k->st->origin;
+
+ kevent_storage_dequeue(k->st, k);
+ iput(inode);
+
+ return 0;
+}
+
+void kevent_pipe_notify(struct inode *inode, u32 event)
+{
+ kevent_storage_ready(&inode->st, NULL, event);
+}
+
+static int __init kevent_init_pipe(void)
+{
+ struct kevent_callbacks sc = {
+ .callback = &kevent_pipe_callback,
+ .enqueue = &kevent_pipe_enqueue,
+ .dequeue = &kevent_pipe_dequeue};
+
+ return kevent_add_callbacks(&sc, KEVENT_PIPE);
+}
+module_init(kevent_init_pipe);
On Tuesday 21 November 2006 17:29, Evgeniy Polyakov wrote:
> Pipe notifications.
> +int kevent_pipe_enqueue(struct kevent *k)
> +{
> + struct file *pipe;
> + int err = -EBADF;
> + struct inode *inode;
> +
> + pipe = fget(k->event.id.raw[0]);
> + if (!pipe)
> + goto err_out_exit;
> +
> + inode = igrab(pipe->f_dentry->d_inode);
> + if (!inode)
> + goto err_out_fput;
> +
Well...
How can you be sure 'pipe/inode' really refers to a pipe/fifo here ?
Hint : i_pipe <> NULL is not sufficient because i_pipe, i_bdev, i_cdev share
the same location. (check pipe_info() in fs/splice.c)
So I guess you need :
err = -EINVAL;
if (!S_ISFIFO(inode->i_mode))
goto err_out_iput;
Eric
On Wed, Nov 22, 2006 at 12:20:50PM +0100, Eric Dumazet ([email protected]) wrote:
> On Tuesday 21 November 2006 17:29, Evgeniy Polyakov wrote:
> > Pipe notifications.
>
> > +int kevent_pipe_enqueue(struct kevent *k)
> > +{
> > + struct file *pipe;
> > + int err = -EBADF;
> > + struct inode *inode;
> > +
> > + pipe = fget(k->event.id.raw[0]);
> > + if (!pipe)
> > + goto err_out_exit;
> > +
> > + inode = igrab(pipe->f_dentry->d_inode);
> > + if (!inode)
> > + goto err_out_fput;
> > +
>
> Well...
>
> How can you be sure 'pipe/inode' really refers to a pipe/fifo here ?
>
> Hint : i_pipe <> NULL is not sufficient because i_pipe, i_bdev, i_cdev share
> the same location. (check pipe_info() in fs/splice.c)
>
> So I guess you need :
>
> err = -EINVAL;
> if (!S_ISFIFO(inode->i_mode))
> goto err_out_iput;
You are correct. I did not perform that check because all pipe open
functions rely on i_pipe, which cannot be a block device at that
point. With kevent, however, the file descriptor can be anything, so the
check must be performed.
I will put it into the tree, thanks Eric.
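
For readers who want to see what the S_ISFIFO() test distinguishes, here
is a minimal userspace sketch. It is not the kernel-side fix discussed
above; fd_is_fifo() is a hypothetical helper name, and it applies the
same mode test to the result of fstat() instead of inode->i_mode.

```c
#include <sys/stat.h>

/*
 * Hypothetical helper: returns 1 if fd refers to a pipe or FIFO,
 * 0 if it refers to any other kind of file, and -1 if fstat() fails
 * (for example, because fd is not an open descriptor).
 */
int fd_is_fifo(int fd)
{
    struct stat st;

    if (fstat(fd, &st) < 0)
        return -1;
    /* Same mode test that Eric suggests applying to inode->i_mode. */
    return S_ISFIFO(st.st_mode) ? 1 : 0;
}
```

Both ends of a descriptor pair created with pipe() should report 1 here,
while a socket, regular file or device descriptor reports 0.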
> Eric
--
Evgeniy Polyakov
Evgeniy Polyakov wrote:
> + int kevent_wait(int ctl_fd, unsigned int num, __u64 timeout);
> +
> +ctl_fd - file descriptor referring to the kevent queue
> +num - number of processed kevents
> +timeout - this timeout specifies number of nanoseconds to wait until there is
> + free space in kevent queue
> +
> +Return value:
> + number of events copied into ring buffer or negative error value.
This is not quite sufficient. What we also need is a parameter which
specifies which ring buffer the code assumes is currently active. This
is just like the EWOULDBLOCK error in the futex. I.e., the kernel
doesn't move the thread on the wait list if the index has changed.
Otherwise asynchronous ring buffer filling is impossible. Assume this

    thread                              kernel

    get current ring buffer idx
    front and tail pointer the same
                                        add new entry to ring buffer
                                        bump front pointer
    call kevent_wait()

With the interface above this leads to a deadlock. The kernel delivered
the event and is done with it.
If the kevent_wait() syscall gets an additional parameter which
specifies the expected front pointer the kernel wouldn't put the thread
to sleep since, in this case, the front pointer changed since last checked.
The kernel cannot and should not check whether the ring buffer is empty.
Userlevel should maintain the tail pointer all by itself. And even if
the tail pointer is available to the kernel, the program might want to
handle the queued events differently.
The above also comes to bear without asynchronous queuing if a thread
waits for more than one event and it is possible to handle both events
concurrently in two threads.
Passing in the expected front pointer value is flexible and efficient.
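
The futex-like check described above can be modelled in userspace as
follows. This is only a sketch under stated assumptions: struct
ring_model and model_wait_check() are hypothetical names, not part of
the kevent patch, and the real decision would be made inside the kernel
before the thread is put on the wait list.

```c
#include <errno.h>
#include <stdatomic.h>

/* Hypothetical model of the ring header: only the front index matters here. */
struct ring_model {
    atomic_uint kidx;    /* front (producer) index, advanced by the "kernel" */
};

/*
 * Model of the test a wait call could perform before sleeping.
 * Returns 0 if the caller may sleep (the front index still has the
 * value the caller sampled) and -EWOULDBLOCK if entries were added
 * in the meantime, so sleeping could miss them.
 */
int model_wait_check(struct ring_model *r, unsigned int expected_kidx)
{
    if (atomic_load(&r->kidx) != expected_kidx)
        return -EWOULDBLOCK;
    return 0;
}
```

A caller would sample kidx, find nothing to read, and pass the sampled
value to the wait call. If the producer bumped kidx in between, the
call returns immediately instead of sleeping.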
--
➧ Ulrich Drepper ➧ Red Hat, Inc. ➧ 444 Castro St ➧ Mountain View, CA ❖
Evgeniy Polyakov wrote:
> + struct kevent_ring
> + {
> + unsigned int ring_kidx, ring_uidx, ring_over;
> + struct ukevent event[0];
> + }
> + [...]
> +ring_uidx - index of the first entry userspace can start reading from
Do we need this value in the structure? Userlevel cannot and should not
be able to modify it. So, userland has in any case to track the tail
pointer itself. Why then have this value at all?
After kevent_init() the tail pointer is implicitly assumed to be 0.
Since the front pointer (well index) is also zero nothing is available
for reading.
--
➧ Ulrich Drepper ➧ Red Hat, Inc. ➧ 444 Castro St ➧ Mountain View, CA ❖
On Wed, Nov 22, 2006 at 03:46:42PM -0800, Ulrich Drepper ([email protected]) wrote:
> Evgeniy Polyakov wrote:
> >+ int kevent_wait(int ctl_fd, unsigned int num, __u64 timeout);
> >+
> >+ctl_fd - file descriptor referring to the kevent queue
> >+num - number of processed kevents
> >+timeout - this timeout specifies number of nanoseconds to wait until there is
> >+		free space in kevent queue
> >+
> >+Return value:
> >+ number of events copied into ring buffer or negative error value.
>
> This is not quite sufficient. What we also need is a parameter which
> specifies which ring buffer the code assumes is currently active. This
> is just like the EWOULDBLOCK error in the futex. I.e., the kernel
> doesn't move the thread on the wait list if the index has changed.
> Otherwise asynchronous ring buffer filling is impossible. Assume this
>
>     thread                              kernel
>
>     get current ring buffer idx
>     front and tail pointer the same
>                                         add new entry to ring buffer
>                                         bump front pointer
>     call kevent_wait()
>
> With the interface above this leads to a deadlock. The kernel delivered
> the event and is done with it.
The kernel does not put a new entry there on its own; that is only done
inside kevent_wait(). Entries are put into the queue (in any context), from
where they can only be obtained through kevent_wait() or kevent_get_events().
--
Evgeniy Polyakov
On Wed, Nov 22, 2006 at 03:52:11PM -0800, Ulrich Drepper ([email protected]) wrote:
> Evgeniy Polyakov wrote:
> >+ struct kevent_ring
> >+ {
> >+ unsigned int ring_kidx, ring_uidx, ring_over;
> >+ struct ukevent event[0];
> >+ }
> >+ [...]
> >+ring_uidx - index of the first entry userspace can start reading from
>
> Do we need this value in the structure? Userlevel cannot and should not
> be able to modify it. So, userland has in any case to track the tail
> pointer itself. Why then have this value at all?
>
> After kevent_init() the tail pointer is implicitly assumed to be 0.
> Since the front pointer (well index) is also zero nothing is available
> for reading.
uidx is the index from which unread entries start. It is updated by
userspace when it commits entries, so it is the 'consumer' index, while
kidx is the index where the kernel will put new entries, i.e. the
'producer' index. We definitely need them both.
Userspace can only update uidx, and only implicitly, by calling
kevent_commit().
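
To make the producer/consumer relationship concrete, here is a small
sketch of the index arithmetic. It assumes, for illustration only, that
both indexes are kept in the range [0, size) and that the ring never
fills completely; the helper names are hypothetical and the ring_over
field is ignored.

```c
/*
 * Number of entries between the consumer index (uidx) and the producer
 * index (kidx). Assumes 0 <= kidx, uidx < size and size > 0.
 */
unsigned int ring_unread(unsigned int kidx, unsigned int uidx,
                         unsigned int size)
{
    return (kidx + size - uidx) % size;
}

/* Advance an index by n entries, wrapping at the end of the ring. */
unsigned int ring_advance(unsigned int idx, unsigned int n,
                          unsigned int size)
{
    return (idx + n) % size;
}
```

With an 8-entry ring, kidx == 1 and uidx == 6 mean three unread entries
(slots 6, 7 and 0).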
> --
> ➧ Ulrich Drepper ➧ Red Hat, Inc. ➧ 444 Castro St ➧ Mountain View,
> CA ❖
--
Evgeniy Polyakov
Evgeniy Polyakov wrote:
> Kernel does not put there a new entry, it is only done inside
> kevent_wait(). Entries are put into queue (in any context), where they can be obtained
> from only kevent_wait() or kevent_get_events().
I know this is how it's done now. But it is not where it has to end.
IMO we have to get to a solution where new events are posted to the ring
buffer asynchronously, i.e., without a thread calling kevent_wait. And
then you need the extra parameter and verification. Even if it's today
not needed we have to future-proof the interface since it cannot be
changed once in use.
--
➧ Ulrich Drepper ➧ Red Hat, Inc. ➧ 444 Castro St ➧ Mountain View, CA ❖
Evgeniy Polyakov wrote:
> uidx is an index, starting from which there are unread entries. It is
> updated by userspace when it commits entries, so it is 'consumer'
> pointer, while kidx is an index where kernel will put new entries, i.e.
> 'producer' index. We definitely need them both.
> Userspace can only update (implicitly by calling kevent_commit()) uidx.
Right, which is why exporting this entry is not needed. Keep the
interface as small as possible.
Userlevel has to maintain its own index. Just assume kevent_wait
returns 10 new entries and you have multiple threads. In this case all
threads take their turns and pick an entry from the ring buffer. This
basically has to be done with something like this (I ignore wrap-arounds
here to simplify the example):
  int getidx() {
    while (uidx < kidx)
      if (atomic_cmpxchg(uidx, uidx + 1, uidx) == 0)
        return uidx;
    return -1;
  }
Very much simplified but it should show that we need a writable copy of
the uidx. And this value at any time must be consistent with the index
the kernel assumes.
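
The getidx() idea above can be written with C11 atomics roughly as
follows. This is a sketch, not code from the patchset: claim_index() is
a hypothetical name, the consumer index is passed in explicitly, and
wrap-around is still ignored, as in the simplified example.

```c
#include <stdatomic.h>

/*
 * Try to claim the next unread slot for the calling thread.
 * uidx is the userspace-owned consumer index shared by all workers;
 * kidx is the producer index last observed by the caller.
 * Returns the claimed index, or -1 if nothing is left to claim.
 */
int claim_index(atomic_uint *uidx, unsigned int kidx)
{
    unsigned int cur = atomic_load(uidx);

    while (cur < kidx) {
        /*
         * Succeeds only if no other thread moved uidx since we read it.
         * On failure, cur is refreshed with the current value and the
         * loop re-checks it against kidx.
         */
        if (atomic_compare_exchange_weak(uidx, &cur, cur + 1))
            return (int)cur;
    }
    return -1;
}
```

Note that the claimed value is returned from the local copy, not re-read
from the shared variable, so two threads can never return the same
index. Claiming a slot is separate from telling the kernel that the
slot may be reused.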
The current ring_uidx value can at best be used to reinitialize the
userlevel uidx value after each kevent_wait call but this is unnecessary
at best (since uidx must already have this value) and racy in problem
cases (what if more than one thread gets woken concurrently with uidx
having the same value and one thread stores the uidx value and
immediately increments it to get an index; the second store would
overwrite the increment).
I can assure you that any implementation I write would not use the
ring_uidx value. Only trivial, single-threaded examples like your
ring_buffer.c could ever take advantage of this value. It's not worth it.
--
➧ Ulrich Drepper ➧ Red Hat, Inc. ➧ 444 Castro St ➧ Mountain View, CA ❖
On Thursday 23 November 2006 21:00, Ulrich Drepper wrote:
> Evgeniy Polyakov wrote:
> > uidx is an index, starting from which there are unread entries. It is
> > updated by userspace when it commits entries, so it is 'consumer'
> > pointer, while kidx is an index where kernel will put new entries, i.e.
> > 'producer' index. We definitely need them both.
> > Userspace can only update (implicitly by calling kevent_commit()) uidx.
>
> Right, which is why exporting this entry is not needed. Keep the
> interface as small as possible.
>
> Userlevel has to maintain its own index. Just assume kevent_wait
> returns 10 new entries and you have multiple threads. In this case all
> threads take their turns and pick an entry from the ring buffer. This
> basically has to be done with something like this (I ignore wrap-arounds
> here to simplify the example):
>
> int getidx() {
> while (uidx < kidx)
> if (atomic_cmpxchg(uidx, uidx + 1, uidx) == 0)
> return uidx;
> return -1;
> }
I don't know if this falls under the simplification, but wouldn't there be a
race when reading/copying the event data? I guess this could be solved with
an extra user index.
--
Hans Henrik Happe
Evgeniy Polyakov wrote:
> + int kevent_commit(int ctl_fd, unsigned int start,
> + unsigned int num, unsigned int over);
I think we can simplify this interface:
int kevent_commit(int ctl_fd, unsigned int new_tail,
unsigned int over);
The kernel sets the ring_uidx value to the 'new_tail' value if the tail
pointer would be incremented (modulo wrap-around) and is not higher than
the current front pointer. The test will be a bit complicated but not
more so than what the current code has to do to check for mistakes.
This approach has the advantage that the commit calls don't have to be
synchronized. If one thread sets the tail pointer to, say, 10 and
another to 12, then it does not matter whether the first thread is
delayed. If it is eventually executed, the result is simply a no-op,
since the second thread's action supersedes it.
Maybe the current form is even impossible to use with explicit locking
at userlevel. What if one thread, which is about to call kevent_commit,
is indefinitely delayed? Then this commit request's value is never
taken into account and the tail pointer is always short of what it
should be.
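
The acceptance test for 'new_tail' could look roughly like the sketch
below. It is a userspace model under stated assumptions: struct
ring_idx and model_commit() are hypothetical names, indexes are kept in
[0, size), the 'over' argument is ignored, and the locking a real
implementation would need is omitted.

```c
/* Hypothetical model of the ring indexes. */
struct ring_idx {
    unsigned int kidx;    /* front: next slot the producer will fill */
    unsigned int uidx;    /* tail: first slot not yet committed */
    unsigned int size;    /* number of slots, must be > 0 */
};

/*
 * Move the tail to new_tail only if that advances it and does not pass
 * the front index, both taken modulo the ring size.
 * Returns 1 if the tail moved and 0 if the request was a no-op.
 */
int model_commit(struct ring_idx *r, unsigned int new_tail)
{
    unsigned int unread, step;

    if (new_tail >= r->size)
        return 0;
    unread = (r->kidx + r->size - r->uidx) % r->size;
    step = (new_tail + r->size - r->uidx) % r->size;
    if (step == 0 || step > unread)
        return 0;
    r->uidx = new_tail;
    return 1;
}
```

With the tail at 8 and the front at 12, a commit to 12 followed by a
delayed commit to 10 leaves the tail at 12: the late request would move
the tail backwards, so it is treated as a no-op.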
There is one more thing to consider. Oftentimes the commit request will
be immediately followed by a kevent_wait call. It would be good to
merge this pair of calls. The two parameters new_tail and over could
also be passed to the kevent_wait call and the commit can happen before
the thread looks for new events and eventually goes to sleep. If this
can be implemented then the kevent_commit syscall by itself might not be
needed at all. Instead you'd call kevent_wait() and make the maximum
number of events which can be returned zero.
--
➧ Ulrich Drepper ➧ Red Hat, Inc. ➧ 444 Castro St ➧ Mountain View, CA ❖
Hans Henrik Happe wrote:
> I don't know if this falls under the simplification, but wouldn't there be a
> race when reading/copying the event data? I guess this could be solved with
> an extra user index.
That's what I said, reading the value from the ring buffer structure's
head would be racy. All this can only work for single threaded code.
--
➧ Ulrich Drepper ➧ Red Hat, Inc. ➧ 444 Castro St ➧ Mountain View, CA ❖
Ulrich Drepper wrote:
> Evgeniy Polyakov wrote:
>> + int kevent_commit(int ctl_fd, unsigned int start,
>> + 		unsigned int num, unsigned int over);
>
> I think we can simplify this interface:
>
> int kevent_commit(int ctl_fd, unsigned int new_tail,
> unsigned int over);
>
> The kernel sets the ring_uidx value to the 'new_tail' value if the tail
> pointer would be incremented (module wrap around) and is not higher then
> the current front pointer. The test will be a bit complicated but not
> more so than what the current code has to do to check for mistakes.
>
> This approach has the advantage that the commit calls don't have to be
> synchronized. If one thread sets the tail pointer to, say, 10 and
> another to 12, then it does not matter whether the first thread is
> delayed. If it will eventually be executed the result is simply a no-op
> and since second thread's action supersedes it.
>
> Maybe the current form is even impossible to use with explicit locking
> at userlevel. What if one thread, which is about to call kevent_commit,
> if indefinitely delayed. Then this commit request's value is never
> taken into account and the tail pointer is always short of what it
> should be.
I'm really wondering whether designing for N-threads-to-1-ring is the
wisest choice.
Considering current designs, it seems more likely that a single thread
polls for socket activity, then dispatches work. How often do you
really see in userland multiple threads polling the same set of fds,
then fighting to decide who will handle raised events?
More likely, you will see "prefork" (start N threads, each with its own
ring) or a worker pool (single thread receives events, then dispatches
to multiple threads for execution) or even one-thread-per-fd (single
thread receives events, then starts new thread for handling).
If you have multiple threads accessing the same ring -- a poor design
choice -- I would think the burden should be on the application, to
provide proper synchronization.
If the desire is to have the kernel distribute events directly to
multiple threads, then the app should dup(2) the fd to be watched, and
create a ring buffer for each separate thread.
Jeff
Jeff Garzik wrote:
> Considering current designs, it seems more likely that a single thread
> polls for socket activity, then dispatches work. How often do you
> really see in userland multiple threads polling the same set of fds,
> then fighting to decide who will handle raised events?
>
> More likely, you will see "prefork" (start N threads, each with its own
> ring) or a worker pool (single thread receives events, then dispatches
> to multiple threads for execution) or even one-thread-per-fd (single
> thread receives events, then starts new thread for handling).
No, absolutely not. This is exactly not what should/is/will happen.
You create worker threads to handle the work for the entire program.
Look at something like a web server. When creating several queues, how
do you distribute all the connections to the different queues? To
ensure every connection is handled as quickly as possible you stuff them
all in the same queue and then have all threads use this one queue.
Whenever an event is posted a thread is woken. _One_ thread. If two
events are posted, two threads are woken. In this situation we have a
few atomic ops at userlevel to make sure that the two threads don't pick
the same event but that's all there is wrt "fighting".
The alternative is the sorry state we have now. In nscd, for instance,
we have one single thread waiting for incoming connections and it then
has to wake up a worker thread to handle the processing. This is done
because we cannot "park" all threads in the accept() call since when a
new connection is announced _all_ the threads are woken. With the new
event handling this wouldn't be the case, one thread only is woken and
we don't have to wake worker threads. All threads can be worker threads.
> If you have multiple threads accessing the same ring -- a poor design
> choice
To the contrary. It is the perfect means to distribute the workload to
multiple threads. Besides, how would you implement asynchronous filling
of the ring buffer to avoid unnecessary syscalls if you have many
different queues?
> -- I would think the burden should be on the application, to
> provide proper synchronization.
Sure, as much as possible. But there is no reason to design the commit
interface in the way which requires expensive synchronization when there
is another design which can do exactly the same work but does not
require synchronization. The currently proposed kevent_commit and my
proposed variant are functionally equivalent.
> If the desire is to have the kernel distributes events directly to
> multiple threads, then the app should dup(2) the fd to be watched, and
> create a ring buffer for each separate thread.
And how would you synchronize the file descriptor use across the
threads? The event would be sent to all the event queues so that you
would a) unnecessarily wake all threads and b) have all but one thread
see the operation (say, read or write on a socket) fail with
EWOULDBLOCK. That's just silly, we can have that today and continue to
waste precious CPU cycles.
If you say that you post exactly one event per file description (not
handle) then what do you do if the programmer wants the opposite? And
again, what do you do for asynchronous ring buffer filling? Which queue
do you pick? Pick the wrong one and the event might sit in the ring
buffer for a long time while another thread handling another queue is ready.
Using a single central queue is the perfect means to distribute the load
to a number of threads. Nobody is forcing you to do it, you're free to
use separate queues if you want. But the model should not enforce this.
Overall, I cannot see at all where your problem is. I agree that the
synchronization of the access to the ring buffer must be done at
userlevel. This is why the uidx exposure isn't needed. The wakeup in
any case has to take threads into account. The only change I proposed
to enable better multi-thread handling is the revised commit interface
and this change in no way hinders single-threaded users. The interface
is not hindered in any way or form by the use of threads.
Oh, and when I say "threads" I should have said "threads or processes".
The whole thing also applies to multi-process applications. They can share
event queues by placing them in shared memory. And I hope that everyone
agrees that programs have to move in the direction of having more than
one execution context to take advantage of increased CPU power in the
future. CMP is only becoming more and more important.
--
➧ Ulrich Drepper ➧ Red Hat, Inc. ➧ 444 Castro St ➧ Mountain View, CA ❖
On Thursday 23 November 2006 23:48, Jeff Garzik wrote:
> I'm really wondering is designing for N-threads-to-1-ring is the wisest
> choice?
>
> Considering current designs, it seems more likely that a single thread
> polls for socket activity, then dispatches work. How often do you
> really see in userland multiple threads polling the same set of fds,
> then fighting to decide who will handle raised events?
They should not fight, but gently divide event handling work.
> More likely, you will see "prefork" (start N threads, each with its own
> ring)
One ring could be more busy than others, leaving all the work to one thread.
> or a worker pool (single thread receives events, then dispatches
> to multiple threads for execution) or even one-thread-per-fd (single
> thread receives events, then starts new thread for handling).
This is more like fighting :-)
It adds context switches and therefore extra latency for event handling.
> If you have multiple threads accessing the same ring -- a poor design
> choice -- I would think the burden should be on the application, to
> provide proper synchronization.
Coming from the HPC world I do not agree. Context switches should be avoided.
This paper is a good example from the HPC world:
http://cobweb.ecn.purdue.edu/~vpai/Publications/majumder-lacsi04.pdf.
The latency problems introduced by context switches in this work call for
even more functionality in event handling. I will not go into details now.
There are enough problems with kevent's current feature set and I believe
these extra features can be added later without breaking the API.
--
Hans Henrik Happe
Ulrich Drepper wrote:
>
> You create worker threads to handle to work for the entire program. Look
> at something like a web server. When creating several queues, how do
> you distribute all the connections to the different queues? To ensure
> every connection is handled as quickly as possible you stuff them all in
> the same queue and then have all threads use this one queue. Whenever an
> event is posted a thread is woken. _One_ thread. If two events are
> posted, two threads are woken. In this situation we have a few atomic
> ops at userlevel to make sure that the two threads don't pick the same
> event but that's all there is wrt "fighting".
>
> The alternative is the sorry state we have now. In nscd, for instance,
> we have one single thread waiting for incoming connections and it then
> has to wake up a worker thread to handle the processing. This is done
> because we cannot "park" all threads in the accept() call since when a
> new connection is announced _all_ the threads are woken. With the new
> event handling this wouldn't be the case, one thread only is woken and
> we don't have to wake worker threads. All threads can be worker threads.
Having one specialized thread handling the distribution of work to worker
threads is better most of the time. This thread can be a worker thread
itself (to avoid context switches), but can decide to wake up 'slave threads'
if it believes it has to (for example, if it notices that a *lot* of
requests are pending).
This is because with moderate load, it's better to have only one CPU running
80% of its time, keeping its cache hot, than to 'distribute' the work over
four CPUs that would each be used 25% of the time, but with a lot of cache
line ping-pong and poor cache reuse.
If you let 'kevent'/'dumb kernel dispatcher'/'futex'/'whatever' decide to wake
up one thread for each new event, you *may* have lower performance because of
higher system overhead (system means: system scheduler/internals, but also
bus traffic).
Only the application writer can have a clue of average use of its worker
threads, and can decide to dynamically adjust parameters if needed to handle
load spikes.
SMP machines are nice, but for many workloads, it's better to avoid spreading
a working set over several CPUs that fight for common resources (memory).
Back to 'kevent':
-----------------
I think that having a syscall to commit events should not be mandatory. A
syscall is needed only to wait for new events if the ring is empty. But then
maybe we don't even need a new syscall to perform a wait:
We already have nice synchronisation primitives (futex, for example).
A user program should be able to update a 'uidx' in user space (using atomic
ops only if multi-threaded), and could just use the futex infrastructure if
the ring buffer is empty (uidx == kidx), calling
FUTEX_WAIT(&kidx, current value = uidx).
I think I already gave my opinion on a ring buffer, but let me rephrase it:
One part should be read/write for the application (to be able to change uidx)
(or the user app just gives the kernel, at init time, the address of a futex
in its vm space).
One part could be read-only for the application (but could be read/write: we
don't care if the user application is stupid): the kernel writes its kidx (or
a copy of it) and events there.
For best performance, uidx and kidx should be on different cache lines (basic
isolation of producer / consumer).
When the kernel wants to queue a new event in a ring buffer, it can:
* See if the user program consumed some events since the last invocation
  (the kernel fetches uidx and compares it with its own uidx value: no
  syscall needed).
* Check if a slot is available in the ring buffer.
* Copy the event into the ring buffer, perform a memory barrier, then
  increment kidx.
* Call futex_wake(&kidx, 1 thread).
The user application is free to have one thread/process or several
threads/processes waiting for new events (or even no thread at all :) )
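
The queueing steps above can be sketched in userspace with C11 atomics.
Everything here is an illustrative assumption, not kevent code: the
names ring_q, event_model, ring_put() and ring_get() are hypothetical,
the indexes are free-running counters reduced modulo the ring size, only
one producer and one consumer are assumed, and the futex wake/wait calls
are only marked by comments.

```c
#include <stdalign.h>
#include <stdatomic.h>

#define RING_SLOTS 8    /* hypothetical ring size, a power of two */

struct event_model {
    unsigned int id;    /* stand-in for the real event payload */
};

struct ring_q {
    alignas(64) atomic_uint kidx;   /* written by the producer only */
    alignas(64) atomic_uint uidx;   /* written by the consumer only */
    struct event_model slot[RING_SLOTS];
};

void ring_init(struct ring_q *r)
{
    atomic_init(&r->kidx, 0);
    atomic_init(&r->uidx, 0);
}

/* Producer side: returns 0 on success, -1 if no slot is free. */
int ring_put(struct ring_q *r, struct event_model ev)
{
    unsigned int k = atomic_load_explicit(&r->kidx, memory_order_relaxed);
    /* See how far the consumer has progressed; no syscall involved. */
    unsigned int u = atomic_load_explicit(&r->uidx, memory_order_acquire);

    if (k - u >= RING_SLOTS)
        return -1;
    r->slot[k % RING_SLOTS] = ev;
    /* Release store: the slot contents become visible before kidx moves. */
    atomic_store_explicit(&r->kidx, k + 1, memory_order_release);
    /* A real implementation would wake one waiter on kidx here. */
    return 0;
}

/* Consumer side: returns 0 on success, -1 if the ring is empty. */
int ring_get(struct ring_q *r, struct event_model *out)
{
    unsigned int u = atomic_load_explicit(&r->uidx, memory_order_relaxed);
    unsigned int k = atomic_load_explicit(&r->kidx, memory_order_acquire);

    if (u == k)
        return -1;      /* empty: this is where a waiter would block */
    *out = r->slot[u % RING_SLOTS];
    atomic_store_explicit(&r->uidx, u + 1, memory_order_release);
    return 0;
}
```

Placing kidx and uidx on separate 64-byte boundaries follows the remark
about keeping producer and consumer on different cache lines; 64 is an
assumed line size. Several consumers would additionally need an atomic
claim step on uidx.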
Eric
On Fri, 24 Nov 2006 01:48:32 +0100
Eric Dumazet <[email protected]> wrote:
> > The alternative is the sorry state we have now. In nscd, for instance,
> > we have one single thread waiting for incoming connections and it then
> > has to wake up a worker thread to handle the processing. This is done
> > because we cannot "park" all threads in the accept() call since when a
> > new connection is announced _all_ the threads are woken. With the new
> > event handling this wouldn't be the case, one thread only is woken and
> > we don't have to wake worker threads. All threads can be worker threads.
>
> Having one specialized thread handling the distribution of work to worker
> threads is better most of the time.
It might be now. Think "commodity 128-way". Your single distribution thread
will run out of steam.
What Ulrich is proposing is faster. This is a new interface. Let's design
it to be fast.
Andrew Morton wrote:
> On Fri, 24 Nov 2006 01:48:32 +0100
> Eric Dumazet <[email protected]> wrote:
>
>>> The alternative is the sorry state we have now. In nscd, for instance,
>>> we have one single thread waiting for incoming connections and it then
>>> has to wake up a worker thread to handle the processing. This is done
>>> because we cannot "park" all threads in the accept() call since when a
>>> new connection is announced _all_ the threads are woken. With the new
>>> event handling this wouldn't be the case, one thread only is woken and
>>> we don't have to wake worker threads. All threads can be worker threads.
>> Having one specialized thread handling the distribution of work to worker
>> threads is better most of the time.
>
> It might be now. Think "commodity 128-way". Your single distribution thread
> will run out of steam.
>
> What Ulrich is proposing is faster. This is a new interface. Let's design
> it to be fast.
Hum... I guess you didn't read my mail... I basically agree with Ulrich.
I just wanted to say that a fast application cannot rely only on "let's park
N threads waiting for a single event in this queue" and hope the kernel will
be smart for us.
Even with 128-way machines, you still hit a central point of coordination (it
can be a mutex in kevent code, an atomic uidx in userland, or whatever) for a
'kevent queue'. Once you have paid for the cache line ping-pong, you won't be
*fast*.
I wish *you* wouldn't think of kevent as only dispatching trivial HTTP 1.0 web
requests.
Being able to direct a particular request on a particular CPU is certainly
something that cannot be hardcoded in 'the new kevent interface'.
Eric
On Thu, Nov 23, 2006 at 11:45:36AM -0800, Ulrich Drepper ([email protected]) wrote:
> Evgeniy Polyakov wrote:
> >Kernel does not put there a new entry, it is only done inside
> >kevent_wait(). Entries are put into queue (in any context), where they can
> >be obtained from only kevent_wait() or kevent_get_events().
>
> I know this is how it's done now. But it is not where it has to end.
> IMO we have to get to a solution where new events are posted to the ring
> buffer asynchronously, i.e., without a thread calling kevent_wait. And
> then you need the extra parameter and verification. Even if it's today
> not needed we have to future-proof the interface since it cannot be
> changed once in use.
There is a special flag in kevent_user to wake it if there are no ready
events - kernel thread which has added new events will set it and thus
subsequent kevent_wait() will return with updated indexes - userspace
must check indexes after kevent_wait().
> --
> ➧ Ulrich Drepper ➧ Red Hat, Inc. ➧ 444 Castro St ➧ Mountain View, CA ❖
--
Evgeniy Polyakov
On Thu, Nov 23, 2006 at 12:00:45PM -0800, Ulrich Drepper ([email protected]) wrote:
> Evgeniy Polyakov wrote:
> >uidx is an index, starting from which there are unread entries. It is
> >updated by userspace when it commits entries, so it is 'consumer'
> >pointer, while kidx is an index where kernel will put new entries, i.e.
> >'producer' index. We definitely need them both.
> >Userspace can only update (implicitly by calling kevent_commit()) uidx.
>
> Right, which is why exporting this entry is not needed. Keep the
> interface as small as possible.
If there are several callers of kevent_commit(), uidx can be moved further
than the first user expects, so there should be a way to check that
value. It is therefore exported in the shared ring buffer structure.
> Userlevel has to maintain its own index. Just assume kevent_wait
> returns 10 new entries and you have multiple threads. In this case all
> threads take their turns and pick an entry from the ring buffer. This
> basically has to be done with something like this (I ignore wrap-arounds
> here to simplify the example):
>
> int getidx() {
>   while (uidx < kidx)
>     if (atomic_cmpxchg(uidx, uidx + 1, uidx) == 0)
>       return uidx;
>   return -1;
> }
>
> Very much simplified but it should show that we need a writable copy of
> the uidx. And this value at any time must be consistent with the index
> the kernel assumes.
I seriously doubt it is simpler than having index provided by kernel.
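For reference, the claim loop Ulrich outlines might look like this with C11 atomics. It is a sketch under assumptions, not code from either author: claim is a hypothetical userlevel index private to the process, kidx stands in for the producer index published by the kernel, and wrap-around is ignored as in the quoted example.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical state: 'claim' is the writable userlevel copy of the index
 * that the threads share; 'kidx' models the kernel's producer index. */
static _Atomic uint32_t claim;
static _Atomic uint32_t kidx;

/* Returns the index this thread now owns, or -1 if every posted entry
 * has already been claimed. */
static int64_t getidx(void)
{
    uint32_t cur = atomic_load(&claim);

    while (cur < atomic_load(&kidx)) {
        /* On failure, 'cur' is refreshed with the value another thread
         * stored (or left unchanged on a spurious failure), and the loop
         * re-checks it against kidx. */
        if (atomic_compare_exchange_weak(&claim, &cur, cur + 1))
            return cur;
    }
    return -1;
}
```

Each successful compare-and-swap hands exactly one index to exactly one thread, which is the property the quoted example is after.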
> The current ring_uidx value can at best be used to reinitialize the
> userlevel uidx value after each kevent_wait call but this is unnecessary
> at best (since uidx must already have this value) and racy in problem
> cases (what if more than one thread gets woken concurrently with uidx
> having the same value and one thread stores the uidx value and
> immediately increments it to get an index; the second store would
> overwrite the increment).
>
> I can assure you that any implementation I write would not use the
> ring_uidx value. Only trivial, single-threaded examples like your
> ring_buffer.c could ever take advantage of this value. It's not worth it.
You propose to make uidx shared local variable - it is doable, but it
is not required - userspace can use kernel's variable, since it is
updated exactly in the places where that index is changed.
> --
> ➧ Ulrich Drepper ➧ Red Hat, Inc. ➧ 444 Castro St ➧ Mountain View, CA ❖
--
Evgeniy Polyakov
On Thu, Nov 23, 2006 at 02:34:46PM -0800, Ulrich Drepper ([email protected]) wrote:
> Hans Henrik Happe wrote:
> >I don't know if this falls under the simplification, but wouldn't there be
> >a race when reading/copying the event data? I guess this could be solved
> >with an extra user index.
>
> That's what I said, reading the value from the ring buffer structure's
> head would be racy. All this can only work for single threaded code.
The value in the userspace ring is updated each time it is changed in the
kernel (when userspace calls kevent_commit()). Once userspace has read its old
value, it is guaranteed that the requested number of events _is_ there
(although there may be more than that).
Ulrich, why didn't you comment on the previous interface, which had exactly
_one_ index exported to userspace? It would only be required to add an
implicit uidx and (if you prefer that way) an additional syscall, since in the
previous interface both waiting and commit were handled by kevent_wait() with
different parameters.
> --
> ➧ Ulrich Drepper ➧ Red Hat, Inc. ➧ 444 Castro St ➧ Mountain View, CA ❖
--
Evgeniy Polyakov
On Thu, Nov 23, 2006 at 02:33:16PM -0800, Ulrich Drepper ([email protected]) wrote:
> Evgeniy Polyakov wrote:
> >+ int kevent_commit(int ctl_fd, unsigned int start,
> >+ unsigned int num, unsigned int over);
>
> I think we can simplify this interface:
>
> int kevent_commit(int ctl_fd, unsigned int new_tail,
> unsigned int over);
>
> The kernel sets the ring_uidx value to the 'new_tail' value if the tail
> pointer would be incremented (modulo wrap-around) and is not higher than
> the current front pointer. The test will be a bit complicated but not
> more so than what the current code has to do to check for mistakes.
>
> This approach has the advantage that the commit calls don't have to be
> synchronized. If one thread sets the tail pointer to, say, 10 and
> another to 12, then it does not matter whether the first thread is
> delayed. If it is eventually executed, the result is simply a no-op,
> since the second thread's action supersedes it.
>
> Maybe the current form is even impossible to use with explicit locking
> at userlevel. What if one thread, which is about to call kevent_commit,
> is indefinitely delayed? Then this commit request's value is never
> taken into account and the tail pointer is always short of what it
> should be.
I like this interface, although current one does not allow special
synchronization in userspace, since it calculates if new commit is in
the area where previous commit was.
Will change for the next release.
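The acceptance test for the proposed kevent_commit(ctl_fd, new_tail, over) could be sketched like this. Everything here is an assumption made for illustration: ring_state and commit_tail are hypothetical names, the indexes are modelled as free-running 32-bit counters (so the wrap-around case falls out of unsigned arithmetic), and the 'over' parameter is left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical kernel-side view of the ring. Both values are free-running
 * counters; they would be reduced modulo the ring size only when a slot
 * is actually addressed. */
struct ring_state {
    uint32_t uidx;   /* tail: first entry not yet committed */
    uint32_t kidx;   /* front: where the next entry will be stored */
};

/* Move the tail to new_tail only if that is a forward step which does not
 * pass the front. Returns true if the tail moved, false for a no-op. */
static bool commit_tail(struct ring_state *s, uint32_t new_tail)
{
    uint32_t step    = new_tail - s->uidx;   /* wraps modulo 2^32 */
    uint32_t pending = s->kidx - s->uidx;    /* entries awaiting commit */

    /* A stale request (behind the tail) wraps to a huge 'step' and is
     * rejected together with requests beyond the front. */
    if (step == 0 || step > pending)
        return false;

    s->uidx = new_tail;
    return true;
}
```

With the tail at 0 and the front at 15, committing 12 and then a delayed 10 leaves the tail at 12: the late request is simply ignored, which matches the unsynchronized behaviour described in the quoted proposal.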
> There is one more thing to consider. Oftentimes the commit request will
> be immediately followed by a kevent_wait call. It would be good to
> merge this pair of calls. The two parameters new_tail and over could
> also be passed to the kevent_wait call and the commit can happen before
> the thread looks for new events and eventually goes to sleep. If this
> can be implemented then the kevent_commit syscall by itself might not be
> needed at all. Instead you'd call kevent_wait() and make the maximum
> number of events which can be returned zero.
It _IS_ how previous interface worked.
EXACTLY!
There was one syscall which committed requested number of events and
waited when there are new ready events. The only thing it missed, was
userspace index (it assumed that if userspace waits for something, then
all previous work is done).
Ulrich, I'm not going to think for other people all over the world and
blindly implement ideas which in a day or two will be dismissed as
redundant because the flow of mind has changed and there was not enough time
to check the previous version.
I will wait for some time until you and other people have made your comments
on the interfaces, and release the final version in about a week. For now I
will go hack on netchannels.
NO INTERFACE CHANGES AFTER THAT DAY.
COMPLETELY.
So, feel free to think about perfect interface anyone will be happy
with. But please release your thoughts not in form of abstract words,
but more precisely, at least like in this e-mail, so I could understand
what _you_ want from _your_ interface.
> --
> ➧ Ulrich Drepper ➧ Red Hat, Inc. ➧ 444 Castro St ➧ Mountain View, CA ❖
--
Evgeniy Polyakov
On Fri, Nov 24, 2006 at 03:05:31PM +0300, Evgeniy Polyakov ([email protected]) wrote:
> On Thu, Nov 23, 2006 at 02:33:16PM -0800, Ulrich Drepper ([email protected]) wrote:
> > Evgeniy Polyakov wrote:
> > >+ int kevent_commit(int ctl_fd, unsigned int start,
> > >+ unsigned int num, unsigned int over);
> >
> > I think we can simplify this interface:
> >
> > int kevent_commit(int ctl_fd, unsigned int new_tail,
> > unsigned int over);
> >
> > The kernel sets the ring_uidx value to the 'new_tail' value if the tail
> > pointer would be incremented (modulo wrap-around) and is not higher than
> > the current front pointer. The test will be a bit complicated but not
> > more so than what the current code has to do to check for mistakes.
> >
> > This approach has the advantage that the commit calls don't have to be
> > synchronized. If one thread sets the tail pointer to, say, 10 and
> > another to 12, then it does not matter whether the first thread is
> > delayed. If it is eventually executed, the result is simply a no-op,
> > since the second thread's action supersedes it.
> >
> > Maybe the current form is even impossible to use with explicit locking
> > at userlevel. What if one thread, which is about to call kevent_commit,
> > is indefinitely delayed? Then this commit request's value is never
> > taken into account and the tail pointer is always short of what it
> > should be.
>
> I like this interface, although current one does not allow special
...does not require...
> synchronization in userspace, since it calculates if new commit is in
> the area where previous commit was.
> Will change for the next release.
--
Evgeniy Polyakov
In article <[email protected]>,
Ulrich Drepper <[email protected]> wrote:
>Jeff Garzik wrote:
>> Considering current designs, it seems more likely that a single thread
>> polls for socket activity, then dispatches work. How often do you
>> really see in userland multiple threads polling the same set of fds,
>> then fighting to decide who will handle raised events?
>>
>> More likely, you will see "prefork" (start N threads, each with its own
>> ring) or a worker pool (single thread receives events, then dispatches
>> to multiple threads for execution) or even one-thread-per-fd (single
>> thread receives events, then starts new thread for handling).
>
>No, absolutely not. This is exactly not what should/is/will happen.
>
>You create worker threads to handle the work for the entire program.
>Look at something like a web server. When creating several queues, how
>do you distribute all the connections to the different queues? To
>ensure every connection is handled as quickly as possible you stuff them
>all in the same queue and then have all threads use this one queue.
>Whenever an event is posted a thread is woken. _One_ thread. If two
>events are posted, two threads are woken. In this situation we have a
>few atomic ops at userlevel to make sure that the two threads don't pick
>the same event but that's all there is wrt "fighting".
What you really want is if one thread is able to do all the work,
only keep that one thread busy. Only wake up other threads when
the currently running threads cannot handle the load.
Say you have 8 threads blocked in kevent_wait(). One or more events
become available. You wake one thread, and let it run.
If the one thread has done its work and returns to kevent_wait()
before its timeslice has run out, deliver the next event(s) (which
may already be outstanding) to the same thread, don't wake another one.
If the running thread blocks on say disk i/o, or its timeslice
runs out, the scheduler runs and wakes another thread that is
waiting in kevent_wait().
Mike.
Eric Dumazet wrote:
> Being able to direct a particular request on a particular CPU is
> certainly something that cannot be hardcoded in 'the new kevent interface'.
Nobody is proposing this. Although I have proposed that if the kernel
knows which CPU can best service a request it might hint as much.
But in general, you're free to decentralize as much as you want. But
this does not mean it should not also be possible to use a number of
threads in the same loop and the same kevent queue. That's the part
which needs designing, the separate queues will always be possible.
--
➧ Ulrich Drepper ➧ Red Hat, Inc. ➧ 444 Castro St ➧ Mountain View, CA ❖
Evgeniy Polyakov wrote:
>> I know this is how it's done now. But it is not where it has to end.
>> IMO we have to get to a solution where new events are posted to the ring
>> buffer asynchronously, i.e., without a thread calling kevent_wait. And
>> then you need the extra parameter and verification. Even if it's today
>> not needed we have to future-proof the interface since it cannot be
>> changed once in use.
>
> There is a special flag in kevent_user to wake it if there are no ready
> events - kernel thread which has added new events will set it and thus
> subsequent kevent_wait() will return with updated indexes - userspace
> must check indexes after kevent_wait().
You misunderstand. I don't want to return without waiting unconditionally.
There is a race which has to be closed. It's exactly the same as in the
futex syscall. I've shown the interaction between the kernel and the
thread in the previous mail. There is inevitably a time difference
between the thread checking whether the ring buffer is empty and the
kernel putting the thread to sleep in the kevent_wait call.
This is no problem with the current kevent_wait implementation since the
ring buffer is not filled asynchronously. But if/when it will be the
kernel might add something to the ring buffer _after_ the thread checks
for an empty ring buffer and _before_ it enters the kernel in the
kevent_wait syscall.
The kevent_wait syscall will only wake the thread when a new event is
posted. We do not in general want it to be woken when the ring buffer
is non-empty. This would create far too many unnecessary wakeups if
there is more than one thread working on the queue.
With the additional parameters for kevent_wait indicating when the calling
thread last checked the ring buffer the kernel can find out whether the
decision to call kevent_wait was made based on outdated information or
not. Outdated in the case a new event has been posted. In this case
the thread is not put to sleep but instead returns.
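The futex-style check Ulrich describes can be modelled in userspace with a mutex and a condition variable. This is only a sketch under assumptions: queue, queue_post and queue_wait are hypothetical names, the mutex stands in for the kernel's queue lock, and a real futex would return an error on mismatch rather than the current index.

```c
#include <assert.h>
#include <pthread.h>
#include <stdint.h>

/* Hypothetical model of the event queue. */
struct queue {
    pthread_mutex_t lock;
    pthread_cond_t  posted;
    uint32_t        kidx;    /* number of events posted so far */
};

/* Asynchronous producer: publish one event and wake one waiter. */
static void queue_post(struct queue *q)
{
    pthread_mutex_lock(&q->lock);
    q->kidx++;
    pthread_cond_signal(&q->posted);
    pthread_mutex_unlock(&q->lock);
}

/* The caller passes the index it last looked at. The comparison and the
 * decision to sleep happen under the same lock, so an event posted between
 * the caller's own check and this call cannot be missed: in that case
 * kidx != seen and the function returns without sleeping. */
static uint32_t queue_wait(struct queue *q, uint32_t seen)
{
    pthread_mutex_lock(&q->lock);
    while (q->kidx == seen)
        pthread_cond_wait(&q->posted, &q->lock);
    uint32_t now = q->kidx;
    pthread_mutex_unlock(&q->lock);
    return now;
}
```

A thread that read kidx as 0, found the ring empty, and was then overtaken by a queue_post() gets 1 back from queue_wait(&q, 0) immediately, while a thread whose view is current sleeps until something new arrives.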
--
➧ Ulrich Drepper ➧ Red Hat, Inc. ➧ 444 Castro St ➧ Mountain View, CA ❖
On Fri, Nov 24, 2006 at 08:06:59AM -0800, Ulrich Drepper ([email protected]) wrote:
> Evgeniy Polyakov wrote:
> >>I know this is how it's done now. But it is not where it has to end.
> >>IMO we have to get to a solution where new events are posted to the ring
> >>buffer asynchronously, i.e., without a thread calling kevent_wait. And
> >>then you need the extra parameter and verification. Even if it's today
> >>not needed we have to future-proof the interface since it cannot be
> >>changed once in use.
> >
> >There is a special flag in kevent_user to wake it if there are no ready
> >events - kernel thread which has added new events will set it and thus
> >subsequent kevent_wait() will return with updated indexes - userspace
> >must check indexes after kevent_wait().
>
> You misunderstand. I don't want to return without waiting unconditionally.
>
> There is a race which has to be closed. It's exactly the same as in the
> futex syscall. I've shown the interaction between the kernel and the
> thread in the previous mail. There is inevitably a time difference
> between the thread checking whether the ring buffer is empty and the
> kernel putting the thread to sleep in the kevent_wait call.
>
> This is no problem with the current kevent_wait implementation since the
> ring buffer is not filled asynchronously. But if/when it will be the
> kernel might add something to the ring buffer _after_ the thread checks
> for an empty ring buffer and _before_ it enters the kernel in the
> kevent_wait syscall.
>
> The kevent_wait syscall will only wake the thread when a new event is
> posted. We do not in general want it to be woken when the ring buffer
> is non-empty. This would create far too many unnecessary wakeups if
> there is more than one thread working on the queue.
>
> With the additional parameters for kevent_wait indicating when the calling
> thread last checked the ring buffer the kernel can find out whether the
> decision to call kevent_wait was made based on outdated information or
> not. Outdated in the case a new event has been posted. In this case
> the thread is not put to sleep but instead returns.
Read my mail again.
If the kernel has put data asynchronously it will set a special flag, so
kevent_wait() will not sleep and will return, and the thread will check for
new entries and process them.
> --
> ➧ Ulrich Drepper ➧ Red Hat, Inc. ➧ 444 Castro St ➧ Mountain View, CA ❖
--
Evgeniy Polyakov
Evgeniy Polyakov wrote:
> Ulrich, why didn't you comment on previous interface, which had exactly
> _one_ index exported to userspace - it is only required to add implicit
> uidx and (if you prefer that way) additional syscall, since in previous
> interface both waiting and commit was handled by kevent_wait() with
> different parameters.
If you read my old mails you'll find that I'm pretty consistent wrt
the ring buffer interface. The old code had other problems, not the
missing exposure of the uidx value.
There is really not much disagreement here. I just don't like making the
interface unnecessarily and misleadingly large by exposing the uidx
value, which is not useful to the userlevel code. Just remove the
element and stuff it into a kernel-internal struct for the queue and
you're done.
--
➧ Ulrich Drepper ➧ Red Hat, Inc. ➧ 444 Castro St ➧ Mountain View, CA ❖
Evgeniy Polyakov wrote:
>> Very much simplified but it should show that we need a writable copy of
>> the uidx. And this value at any time must be consistent with the index
>> the kernel assumes.
>
> I seriously doubt it is simpler than having index provided by kernel.
What has simpler to do with it? The userlevel code should not modify
the ring buffer structure at all. If we'd do this then all operations,
at least on the uidx field, would have to be atomic operations. This is
currently not the case for the kernel side since it's protected by a
lock for the event queue. Using the uidx field from userlevel would
therefore just make things slower.
And for what? Changing the uidx value would make the commit syscall
unnecessary. This might be an argument but it sounds too dangerous.
IMO the value should be protected by the kernel.
And in any case, the uidx value cannot be updated until the event
actually has been processed. But the threads still need to coordinate
distributing the events from the ring buffer amongst themselves. This
will in any case require a second variable.
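The need for a second variable can be illustrated with a small sketch. All names here (consumer, claim_next, mark_done, commit_point, NSLOTS) are hypothetical, and the sketch assumes the caller has already verified that an entry is available and that commit_point() is called by one thread at a time.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

#define NSLOTS 8u   /* hypothetical ring size */

struct consumer {
    _Atomic uint32_t claim;       /* next entry to hand to a worker thread */
    uint32_t committed;           /* mirrors the kernel's uidx */
    atomic_bool done[NSLOTS];     /* entry processed but not yet committed */
};

/* Distribution: each caller gets a distinct entry index. */
static uint32_t claim_next(struct consumer *c)
{
    return atomic_fetch_add(&c->claim, 1);
}

/* Called by a worker once it has finished processing entry idx. */
static void mark_done(struct consumer *c, uint32_t idx)
{
    atomic_store(&c->done[idx % NSLOTS], true);
}

/* Advance the commit point over the contiguous run of processed entries.
 * The result is what could safely be passed to a commit call: entries
 * that are claimed but still being processed are never released. */
static uint32_t commit_point(struct consumer *c)
{
    while (c->committed != atomic_load(&c->claim) &&
           atomic_load(&c->done[c->committed % NSLOTS])) {
        atomic_store(&c->done[c->committed % NSLOTS], false);
        c->committed++;
    }
    return c->committed;
}
```

If entries 0 and 1 are claimed and entry 1 finishes first, commit_point() stays at 0 until entry 0 is done as well, so the claim index runs ahead of the commit index exactly as the argument above requires.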
So, if you want to do away with the commit syscall, keep the uidx value.
This also requires that the ring buffer head will always be writable
(something I'd like to avoid making part of the interface but I'm
flexible on this). Otherwise, the ring_uidx element can go away, it's
not needed and will only make people think about wrong approaches to use it.
> You propose to make uidx shared local variable - it is doable, but it
> is not required - userspace can use kernel's variable, since it is
> updated exactly in the places where that index is changed.
As said above, we always need another variable and uidx is only a
replacement for the commit call. Until the event is processed the uidx
cannot be incremented since otherwise the ring buffer entry might be
overwritten.
And kernel people, of all people, should be happy to limit the exposure of
the implementation. So, leave the problem of keeping track of the tail
pointer to the userlevel code.
--
➧ Ulrich Drepper ➧ Red Hat, Inc. ➧ 444 Castro St ➧ Mountain View, CA ❖
On Fri, Nov 24, 2006 at 07:14:06PM +0300, Evgeniy Polyakov ([email protected]) wrote:
> If the kernel has put data asynchronously it will set a special flag, so
> kevent_wait() will not sleep and will return, and the thread will check for
> new entries and process them.
To clarify: only kevent_wait() updates the index, so userspace will not
detect that it has changed after a thread has put new data there.
If a kernel thread updates the index too, you are correct:
kevent_wait() should get the index as a parameter.
--
Evgeniy Polyakov
On Fri, Nov 24, 2006 at 08:30:14AM -0800, Ulrich Drepper ([email protected]) wrote:
> Evgeniy Polyakov wrote:
> >>Very much simplified but it should show that we need a writable copy of
> >>the uidx. And this value at any time must be consistent with the index
> >>the kernel assumes.
> >
> >I seriously doubt it is simpler than having index provided by kernel.
>
> What has simpler to do with it? The userlevel code should not modify
> the ring buffer structure at all. If we'd do this then all operations,
> at least on the uidx field, would have to be atomic operations. This is
> currently not the case for the kernel side since it's protected by a
> lock for the event queue. Using the uidx field from userlevel would
> therefore just make things slower.
That index is provided by the kernel so that userspace can determine where
the indexes are. Of course userspace can maintain it itself, but it can also
use the one provided by the kernel. It is not written explicitly, but only
through kevent_commit().
> And for what? Changing the uidx value would make the commit syscall
> unnecessary. This might be an argument but it sounds too dangerous.
> IMO the value should be protected by the kernel.
>
> And in any case, the uidx value cannot be updated until the event
> actually has been processed. But the threads still need to coordinate
> distributing the events from the ring buffer amongst themselves. This
> will in any case require a second variable.
>
> So, if you want to do away with the commit syscall, keep the uidx value.
> This also requires that the ring buffer head will always be writable
> (something I'd like to avoid making part of the interface but I'm
> flexible on this). Otherwise, the ring_uidx element can go away, it's
> not needed and will only make people think about wrong approaches to use it.
No, head will not be writeable - it is absolutely.
I do not care actually about that index, but as you have probably noticed,
there was such an interface already, and I changed it. So, this will be the
last change of the interface. You think it should not be exported -
fine, it will not be.
--
Evgeniy Polyakov
Evgeniy Polyakov wrote:
> If the kernel has put data asynchronously it will set a special flag, so
> kevent_wait() will not sleep and will return, and the thread will check for
> new entries and process them.
This is not sufficient.
The userlevel code does not commit the events until they are processed.
So assume two threads at userlevel, and one event is asynchronously
posted. The first thread picks it up, the second calls kevent_wait.
With your scheme it will not be put to sleep and unnecessarily returns
to userlevel.
What I propose, and what has been proven to work in many situations, is to
pass as part of the kevent_wait syscall the information "I am aware
of all events up to XX; wake me only if anything beyond that is added".
Please take a look at how futexes work, it's really the same concept.
And it's really also simpler for the implementation. Having such a flag
is much more complicated than adding a simple index comparison before
going to sleep.
--
➧ Ulrich Drepper ➧ Red Hat, Inc. ➧ 444 Castro St ➧ Mountain View, CA ❖
Evgeniy Polyakov wrote:
> That index is provided by the kernel so that userspace can determine where
> the indexes are. Of course userspace can maintain it itself, but it can also
> use the one provided by the kernel.
Indeed. That's what I said. But I also pointed out that the field is
only useful in simple minded programs and certainly not in the wrappers
the runtime (glibc) will provide.
As you said yourself, there is no real need for the value being there,
userland can keep track of it by itself. So, let's reduce the interface.
> I do not care actually about that index, but as you have probably noticed,
> there was such an interface already, and I changed it. So, this will be the
> last change of the interface. You think it should not be exported -
> fine, it will not be.
Thanks.
--
➧ Ulrich Drepper ➧ Red Hat, Inc. ➧ 444 Castro St ➧ Mountain View, CA ❖
Evgeniy Polyakov wrote:
> It _IS_ how previous interface worked.
>
> EXACTLY!
No, the old interface committed everything not only up to a given index.
This is the huge difference which makes or breaks it.
--
➧ Ulrich Drepper ➧ Red Hat, Inc. ➧ 444 Castro St ➧ Mountain View, CA ❖
On Mon, Nov 27, 2006 at 11:43:46AM -0800, Ulrich Drepper ([email protected]) wrote:
> Evgeniy Polyakov wrote:
> >It _IS_ how previous interface worked.
> >
> > EXACTLY!
>
> No, the old interface committed everything not only up to a given index.
> This is the huge difference which makes or breaks it.
The interface was the same; the logic behind it was different. The only thing
required was to add the consumer's index. That is all: no need to change a
lot of declarations, userspace and so on, just use the existing interface
and extend its functionality.
But it does not matter anymore. Later this week I will collect all
proposed changes and implement the (hopefully) last release, which will
close most of the questions regarding userspace interfaces (except the
signal mask, which is still in flux), so we can concentrate on
internals and/or new kernel users.
> --
> ➧ Ulrich Drepper ➧ Red Hat, Inc. ➧ 444 Castro St ➧ Mountain View, CA ❖
--
Evgeniy Polyakov