2006-08-25 09:31:43

by Evgeniy Polyakov

Subject: [take14 0/3] kevent: Generic event handling mechanism.


Generic event handling mechanism.

Changes from 'take13' patchset:
* do not take lock around user data check in __kevent_search()
* fail early if there are no registered callbacks for the given kevent type
* trailing whitespace cleanup

Changes from 'take12' patchset:
* remove non-chardev interface for initialization
* use pointer to kevent_mring instead of unsigned longs
* use aligned 64bit type in raw user data (can be used by high-res timer if needed)
* simplified enqueue/dequeue callbacks and kevent initialization
* use nanoseconds for timeout
* put number of milliseconds into timer's return data
* move some definitions into user-visible header
* removed filenames from comments

Changes from 'take11' patchset:
* include missing headers into patchset
* some trivial code cleanups (use goto instead of if/else games and so on)
* some whitespace cleanups
* check for ready_callback() callback before main loop which should save us some ticks

Changes from 'take10' patchset:
* removed non-existent prototypes
* added helper function for kevent_registered_callbacks
* fixed 80 lines comments issues
* added a header shared between userspace and kernelspace instead of embedding everything in one
* core restructuring to remove forward declarations
* some whitespace and coding style cleanups
* use vm_insert_page() instead of remap_pfn_range()

Changes from 'take9' patchset:
* fixed ->nopage method

Changes from 'take8' patchset:
* fixed mmap release bug
* use module_init() instead of late_initcall()
* use better structures for timer notifications

Changes from 'take7' patchset:
* new mmap interface (not tested, waiting for other changes to be acked)
- use nopage() method to dynamically substitute pages
- allocate a new page for events only when a newly added kevent requires it
- do not use ugly index dereferencing, use structure instead
- reduced amount of data in the ring (id and flags),
maximum 12 pages on x86 per kevent fd

Changes from 'take6' patchset:
* a lot of comments!
* do not use list poisoning to detect whether an entry is in the list
* return number of ready kevents even if copy*user() fails
* strict check for number of kevents in syscall
* use ARRAY_SIZE for array size calculation
* changed superblock magic number
* use SLAB_PANIC instead of direct panic() call
* changed -E* return values
* a lot of small cleanups and indent fixes

Changes from 'take5' patchset:
* removed compilation warnings about unused variables when lockdep is not turned on
* do not use internal socket structures, use appropriate (exported) wrappers instead
* removed default 1 second timeout
* removed AIO stuff from patchset

Changes from 'take4' patchset:
* use miscdevice instead of chardevice
* comments fixes

Changes from 'take3' patchset:
* removed serializing mutex from kevent_user_wait()
* moved storage list processing to RCU
* silenced lockdep warnings - all storage locks are initialized in the same function, so lockdep had to be taught
to differentiate between the various cases
* remove kevent from storage if is marked as broken after callback
* fixed a typo in the mmaped buffer implementation which would result in wrong index calculation

Changes from 'take2' patchset:
* split kevent_finish_user() to locked and unlocked variants
* do not use KEVENT_STAT ifdefs, use inline functions instead
* use array of callbacks of each type instead of each kevent callback initialization
* changed name of ukevent guarding lock
* use only one kevent lock in kevent_user for all hash buckets instead of per-bucket locks
* do not use kevent_user_ctl structure; instead provide the needed arguments as syscall parameters
* various indent cleanups
* added optimisation, which is aimed to help when a lot of kevents are being copied from userspace
* mapped buffer (initial) implementation (no userspace yet)

Changes from 'take1' patchset:
- rebased against 2.6.18-git tree
- removed ioctl controlling
- added new syscall kevent_get_events(int fd, unsigned int min_nr, unsigned int max_nr,
unsigned int timeout, void __user *buf, unsigned flags)
- use old syscall kevent_ctl for creation/removal, modification and initial kevent
initialization
- use mutexes instead of semaphores
- added file descriptor check and return error if provided descriptor does not match
kevent file operations
- various indent fixes
- removed aio_sendfile() declarations.

Thank you.

Signed-off-by: Evgeniy Polyakov <[email protected]>



2006-08-25 09:31:54

by Evgeniy Polyakov

Subject: [take14 3/3] kevent: Timer notifications.


Timer notifications.

Timer notifications can be used for fine-grained per-process time
management, since interval timers are very inconvenient to use
and are limited.
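
To make the intended usage concrete, here is a hedged userspace sketch
(illustration only, not part of the patch). It assumes the ukevent layout,
the KEVENT_CTL_ADD semantics and the i386 syscall numbers from the core
patch in this series, plus a /dev/kevent node for the registered
miscdevice; the timer period in milliseconds goes into id.raw[0], and
ret_data[0] carries the jiffies_to_msecs() stamp when the timer fires.

#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <sys/syscall.h>

#define PAGE_SIZE 4096			/* ukevent.h uses it for KEVENTS_ON_PAGE */
#include <linux/ukevent.h>

#define __NR_kevent_get_events 318	/* i386 numbers from the core patch */
#define __NR_kevent_ctl        319

int main(void)
{
	struct ukevent uk;
	long err;

	/* Device node name is an assumption based on the "kevent" miscdevice. */
	int fd = open("/dev/kevent", O_RDWR);
	if (fd == -1)
		return 1;

	memset(&uk, 0, sizeof(uk));
	uk.type = KEVENT_TIMER;
	uk.event = KEVENT_TIMER_FIRED;
	uk.id.raw[0] = 1000;	/* period in msecs, see kevent_timer_enqueue() */

	err = syscall(__NR_kevent_ctl, fd, KEVENT_CTL_ADD, 1, &uk);
	if (err < 0)
		return 1;

	/*
	 * Wait for one event for up to two seconds; the timeout is in
	 * nanoseconds (a real 32-bit wrapper would have to pass the 64-bit
	 * argument according to the syscall ABI).
	 */
	err = syscall(__NR_kevent_get_events, fd, 1, 1, 2000000000ULL, &uk, 0);
	if (err > 0)
		printf("timer fired, msec stamp %u\n", uk.ret_data[0]);

	close(fd);
	return 0;
}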

Signed-off-by: Evgeniy Polyakov <[email protected]>

diff --git a/kernel/kevent/kevent_timer.c b/kernel/kevent/kevent_timer.c
new file mode 100644
index 0000000..b2fee61
--- /dev/null
+++ b/kernel/kevent/kevent_timer.c
@@ -0,0 +1,105 @@
+/*
+ * 2006 Copyright (c) Evgeniy Polyakov <[email protected]>
+ * All rights reserved.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software
+ * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA
+ */
+
+#include <linux/kernel.h>
+#include <linux/types.h>
+#include <linux/list.h>
+#include <linux/slab.h>
+#include <linux/spinlock.h>
+#include <linux/timer.h>
+#include <linux/jiffies.h>
+#include <linux/kevent.h>
+
+struct kevent_timer
+{
+ struct timer_list ktimer;
+ struct kevent_storage ktimer_storage;
+};
+
+static void kevent_timer_func(unsigned long data)
+{
+ struct kevent *k = (struct kevent *)data;
+ struct timer_list *t = k->st->origin;
+
+ kevent_storage_ready(k->st, NULL, KEVENT_MASK_ALL);
+ mod_timer(t, jiffies + msecs_to_jiffies(k->event.id.raw[0]));
+}
+
+static struct lock_class_key kevent_timer_key;
+
+static int kevent_timer_enqueue(struct kevent *k)
+{
+ int err;
+ struct kevent_timer *t;
+
+ t = kmalloc(sizeof(struct kevent_timer), GFP_KERNEL);
+ if (!t)
+ return -ENOMEM;
+
+ setup_timer(&t->ktimer, &kevent_timer_func, (unsigned long)k);
+
+ err = kevent_storage_init(&t->ktimer, &t->ktimer_storage);
+ if (err)
+ goto err_out_free;
+ lockdep_set_class(&t->ktimer_storage.lock, &kevent_timer_key);
+
+ err = kevent_storage_enqueue(&t->ktimer_storage, k);
+ if (err)
+ goto err_out_st_fini;
+
+ mod_timer(&t->ktimer, jiffies + msecs_to_jiffies(k->event.id.raw[0]));
+
+ return 0;
+
+err_out_st_fini:
+ kevent_storage_fini(&t->ktimer_storage);
+err_out_free:
+ kfree(t);
+
+ return err;
+}
+
+static int kevent_timer_dequeue(struct kevent *k)
+{
+ struct kevent_storage *st = k->st;
+ struct kevent_timer *t = container_of(st, struct kevent_timer, ktimer_storage);
+
+ del_timer_sync(&t->ktimer);
+ kevent_storage_dequeue(st, k);
+ kfree(t);
+
+ return 0;
+}
+
+static int kevent_timer_callback(struct kevent *k)
+{
+ k->event.ret_data[0] = jiffies_to_msecs(jiffies);
+ return 1;
+}
+
+static int __init kevent_init_timer(void)
+{
+ struct kevent_callbacks tc = {
+ .callback = &kevent_timer_callback,
+ .enqueue = &kevent_timer_enqueue,
+ .dequeue = &kevent_timer_dequeue};
+
+ return kevent_add_callbacks(&tc, KEVENT_TIMER);
+}
+module_init(kevent_init_timer);

2006-08-25 09:32:30

by Evgeniy Polyakov

Subject: [take14 1/3] kevent: Core files.


Core files.

This patch includes core kevent files:
- userspace controlling
- kernelspace interfaces
- initialization
- notification state machines
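
Besides the two syscalls, ready events are also exported through an mmap
ring (see kevent_user_mmap() and struct kevent_mring below). As a hedged
sketch of the userspace side (illustration only, not part of the patch),
assuming only the first ring page is mapped and PAGE_SIZE is defined by
hand for the shared header:

#include <stdio.h>
#include <sys/mman.h>

#define PAGE_SIZE 4096			/* ukevent.h uses it for KEVENTS_ON_PAGE */
#include <linux/ukevent.h>

/*
 * Dump mukevents from the first ring page. A complete consumer would
 * map all pages in use and split the index into page/offset exactly as
 * kevent_user_ring_add_event() does on the kernel side.
 */
static void dump_first_page(int kevent_fd)
{
	struct kevent_mring *ring;
	unsigned int i, n;

	ring = mmap(NULL, PAGE_SIZE, PROT_READ, MAP_SHARED, kevent_fd, 0);
	if (ring == MAP_FAILED)
		return;

	/* ring->index is the next slot the kernel will fill. */
	n = ring->index < KEVENTS_ON_PAGE ? ring->index : KEVENTS_ON_PAGE;
	for (i = 0; i < n; ++i)
		printf("id %u/%u, ret_flags 0x%x\n",
		       ring->event[i].id.raw[0], ring->event[i].id.raw[1],
		       ring->event[i].ret_flags);

	munmap(ring, PAGE_SIZE);
}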

Signed-off-by: Evgeniy Polyakov <[email protected]>

diff --git a/arch/i386/kernel/syscall_table.S b/arch/i386/kernel/syscall_table.S
index dd63d47..091ff42 100644
--- a/arch/i386/kernel/syscall_table.S
+++ b/arch/i386/kernel/syscall_table.S
@@ -317,3 +317,5 @@ ENTRY(sys_call_table)
.long sys_tee /* 315 */
.long sys_vmsplice
.long sys_move_pages
+ .long sys_kevent_get_events
+ .long sys_kevent_ctl
diff --git a/arch/x86_64/ia32/ia32entry.S b/arch/x86_64/ia32/ia32entry.S
index 5d4a7d1..b2af4a8 100644
--- a/arch/x86_64/ia32/ia32entry.S
+++ b/arch/x86_64/ia32/ia32entry.S
@@ -713,4 +713,6 @@ #endif
.quad sys_tee
.quad compat_sys_vmsplice
.quad compat_sys_move_pages
+ .quad sys_kevent_get_events
+ .quad sys_kevent_ctl
ia32_syscall_end:
diff --git a/include/asm-i386/unistd.h b/include/asm-i386/unistd.h
index fc1c8dd..c9dde13 100644
--- a/include/asm-i386/unistd.h
+++ b/include/asm-i386/unistd.h
@@ -323,10 +323,12 @@ #define __NR_sync_file_range 314
#define __NR_tee 315
#define __NR_vmsplice 316
#define __NR_move_pages 317
+#define __NR_kevent_get_events 318
+#define __NR_kevent_ctl 319

#ifdef __KERNEL__

-#define NR_syscalls 318
+#define NR_syscalls 320

/*
* user-visible error numbers are in the range -1 - -128: see
diff --git a/include/asm-x86_64/unistd.h b/include/asm-x86_64/unistd.h
index 94387c9..61363e0 100644
--- a/include/asm-x86_64/unistd.h
+++ b/include/asm-x86_64/unistd.h
@@ -619,10 +619,14 @@ #define __NR_vmsplice 278
__SYSCALL(__NR_vmsplice, sys_vmsplice)
#define __NR_move_pages 279
__SYSCALL(__NR_move_pages, sys_move_pages)
+#define __NR_kevent_get_events 280
+__SYSCALL(__NR_kevent_get_events, sys_kevent_get_events)
+#define __NR_kevent_ctl 281
+__SYSCALL(__NR_kevent_ctl, sys_kevent_ctl)

#ifdef __KERNEL__

-#define __NR_syscall_max __NR_move_pages
+#define __NR_syscall_max __NR_kevent_ctl

#ifndef __NO_STUBS

diff --git a/include/linux/kevent.h b/include/linux/kevent.h
new file mode 100644
index 0000000..de33ec7
--- /dev/null
+++ b/include/linux/kevent.h
@@ -0,0 +1,173 @@
+/*
+ * 2006 Copyright (c) Evgeniy Polyakov <[email protected]>
+ * All rights reserved.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software
+ * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA
+ */
+
+#ifndef __KEVENT_H
+#define __KEVENT_H
+#include <linux/types.h>
+#include <linux/list.h>
+#include <linux/spinlock.h>
+#include <linux/mutex.h>
+#include <linux/wait.h>
+#include <linux/net.h>
+#include <linux/rcupdate.h>
+#include <linux/kevent_storage.h>
+#include <linux/ukevent.h>
+
+#define KEVENT_MIN_BUFFS_ALLOC 3
+
+struct kevent;
+struct kevent_storage;
+typedef int (* kevent_callback_t)(struct kevent *);
+
+/* @callback is called each time a new event is caught. */
+/* @enqueue is called each time a new event is queued. */
+/* @dequeue is called each time an event is dequeued. */
+
+struct kevent_callbacks {
+ kevent_callback_t callback, enqueue, dequeue;
+};
+
+#define KEVENT_READY 0x1
+#define KEVENT_STORAGE 0x2
+#define KEVENT_USER 0x4
+
+struct kevent
+{
+ /* Used for kevent freeing.*/
+ struct rcu_head rcu_head;
+ struct ukevent event;
+ /* This lock protects ukevent manipulations, e.g. ret_flags changes. */
+ spinlock_t ulock;
+
+ /* Entry of user's queue. */
+ struct list_head kevent_entry;
+ /* Entry of origin's queue. */
+ struct list_head storage_entry;
+ /* Entry of user's ready. */
+ struct list_head ready_entry;
+
+ u32 flags;
+
+ /* User who requested this kevent. */
+ struct kevent_user *user;
+ /* Kevent container. */
+ struct kevent_storage *st;
+
+ struct kevent_callbacks callbacks;
+
+ /* Private data for different storages.
+ * poll()/select storage has a list of wait_queue_t containers
+ * for each ->poll() { poll_wait()' } here.
+ */
+ void *priv;
+};
+
+#define KEVENT_HASH_MASK 0xff
+
+struct kevent_user
+{
+ struct list_head kevent_list[KEVENT_HASH_MASK+1];
+ spinlock_t kevent_lock;
+ /* Number of queued kevents. */
+ unsigned int kevent_num;
+
+ /* List of ready kevents. */
+ struct list_head ready_list;
+ /* Number of ready kevents. */
+ unsigned int ready_num;
+ /* Protects all manipulations with ready queue. */
+ spinlock_t ready_lock;
+
+ /* Protects against simultaneous kevent_user control manipulations. */
+ struct mutex ctl_mutex;
+ /* Wait until some events are ready. */
+ wait_queue_head_t wait;
+
+ /* Reference counter, increased for each new kevent. */
+ atomic_t refcnt;
+
+ unsigned int pages_in_use;
+ /* Array of pages forming mapped ring buffer */
+ struct kevent_mring **pring;
+
+#ifdef CONFIG_KEVENT_USER_STAT
+ unsigned long im_num;
+ unsigned long wait_num;
+ unsigned long total;
+#endif
+};
+
+int kevent_enqueue(struct kevent *k);
+int kevent_dequeue(struct kevent *k);
+int kevent_init(struct kevent *k);
+void kevent_requeue(struct kevent *k);
+int kevent_break(struct kevent *k);
+
+int kevent_add_callbacks(const struct kevent_callbacks *cb, int pos);
+
+void kevent_user_ring_add_event(struct kevent *k);
+
+void kevent_storage_ready(struct kevent_storage *st,
+ kevent_callback_t ready_callback, u32 event);
+int kevent_storage_init(void *origin, struct kevent_storage *st);
+void kevent_storage_fini(struct kevent_storage *st);
+int kevent_storage_enqueue(struct kevent_storage *st, struct kevent *k);
+void kevent_storage_dequeue(struct kevent_storage *st, struct kevent *k);
+
+int kevent_user_add_ukevent(struct ukevent *uk, struct kevent_user *u);
+
+#ifdef CONFIG_KEVENT_POLL
+void kevent_poll_reinit(struct file *file);
+#else
+static inline void kevent_poll_reinit(struct file *file)
+{
+}
+#endif
+
+#ifdef CONFIG_KEVENT_USER_STAT
+static inline void kevent_stat_init(struct kevent_user *u)
+{
+ u->wait_num = u->im_num = u->total = 0;
+}
+static inline void kevent_stat_print(struct kevent_user *u)
+{
+ pr_debug("%s: u=%p, wait=%lu, immediately=%lu, total=%lu.\n",
+ __func__, u, u->wait_num, u->im_num, u->total);
+}
+static inline void kevent_stat_im(struct kevent_user *u)
+{
+ u->im_num++;
+}
+static inline void kevent_stat_wait(struct kevent_user *u)
+{
+ u->wait_num++;
+}
+static inline void kevent_stat_total(struct kevent_user *u)
+{
+ u->total++;
+}
+#else
+#define kevent_stat_print(u) ({ (void) u;})
+#define kevent_stat_init(u) ({ (void) u;})
+#define kevent_stat_im(u) ({ (void) u;})
+#define kevent_stat_wait(u) ({ (void) u;})
+#define kevent_stat_total(u) ({ (void) u;})
+#endif
+
+#endif /* __KEVENT_H */
diff --git a/include/linux/kevent_storage.h b/include/linux/kevent_storage.h
new file mode 100644
index 0000000..a38575d
--- /dev/null
+++ b/include/linux/kevent_storage.h
@@ -0,0 +1,11 @@
+#ifndef __KEVENT_STORAGE_H
+#define __KEVENT_STORAGE_H
+
+struct kevent_storage
+{
+ void *origin; /* Originator's pointer, e.g. struct sock or struct file. Can be NULL. */
+ struct list_head list; /* List of queued kevents. */
+ spinlock_t lock; /* Protects users queue. */
+};
+
+#endif /* __KEVENT_STORAGE_H */
diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
index 008f04c..4d72286 100644
--- a/include/linux/syscalls.h
+++ b/include/linux/syscalls.h
@@ -597,4 +597,7 @@ asmlinkage long sys_get_robust_list(int
asmlinkage long sys_set_robust_list(struct robust_list_head __user *head,
size_t len);

+asmlinkage long sys_kevent_get_events(int ctl_fd, unsigned int min, unsigned int max,
+ __u64 timeout, void __user *buf, unsigned flags);
+asmlinkage long sys_kevent_ctl(int ctl_fd, unsigned int cmd, unsigned int num, void __user *buf);
#endif
diff --git a/include/linux/ukevent.h b/include/linux/ukevent.h
new file mode 100644
index 0000000..f8ff3a2
--- /dev/null
+++ b/include/linux/ukevent.h
@@ -0,0 +1,155 @@
+/*
+ * 2006 Copyright (c) Evgeniy Polyakov <[email protected]>
+ * All rights reserved.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software
+ * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA
+ */
+
+#ifndef __UKEVENT_H
+#define __UKEVENT_H
+
+/*
+ * Kevent request flags.
+ */
+
+/* Process this event only once and then dequeue. */
+#define KEVENT_REQ_ONESHOT 0x1
+
+/*
+ * Kevent return flags.
+ */
+/* Kevent is broken. */
+#define KEVENT_RET_BROKEN 0x1
+/* Kevent processing was finished successfully. */
+#define KEVENT_RET_DONE 0x2
+
+/*
+ * Kevent type set.
+ */
+#define KEVENT_SOCKET 0
+#define KEVENT_INODE 1
+#define KEVENT_TIMER 2
+#define KEVENT_POLL 3
+#define KEVENT_NAIO 4
+#define KEVENT_AIO 5
+#define KEVENT_MAX 6
+
+/*
+ * Per-type event sets.
+ * The number of per-event sets must match the number of kevent types.
+ */
+
+/*
+ * Timer events.
+ */
+#define KEVENT_TIMER_FIRED 0x1
+
+/*
+ * Socket/network asynchronous IO events.
+ */
+#define KEVENT_SOCKET_RECV 0x1
+#define KEVENT_SOCKET_ACCEPT 0x2
+#define KEVENT_SOCKET_SEND 0x4
+
+/*
+ * Inode events.
+ */
+#define KEVENT_INODE_CREATE 0x1
+#define KEVENT_INODE_REMOVE 0x2
+
+/*
+ * Poll events.
+ */
+#define KEVENT_POLL_POLLIN 0x0001
+#define KEVENT_POLL_POLLPRI 0x0002
+#define KEVENT_POLL_POLLOUT 0x0004
+#define KEVENT_POLL_POLLERR 0x0008
+#define KEVENT_POLL_POLLHUP 0x0010
+#define KEVENT_POLL_POLLNVAL 0x0020
+
+#define KEVENT_POLL_POLLRDNORM 0x0040
+#define KEVENT_POLL_POLLRDBAND 0x0080
+#define KEVENT_POLL_POLLWRNORM 0x0100
+#define KEVENT_POLL_POLLWRBAND 0x0200
+#define KEVENT_POLL_POLLMSG 0x0400
+#define KEVENT_POLL_POLLREMOVE 0x1000
+
+/*
+ * Asynchronous IO events.
+ */
+#define KEVENT_AIO_BIO 0x1
+
+/* Mask of all possible event values. */
+#define KEVENT_MASK_ALL 0xffffffff
+/* Empty mask of ready events. */
+#define KEVENT_MASK_EMPTY 0x0
+
+struct kevent_id
+{
+ union {
+ __u32 raw[2];
+ __u64 raw_u64 __attribute__((aligned(8)));
+ };
+};
+
+struct ukevent
+{
+ /* Id of this request, e.g. socket number, file descriptor and so on... */
+ struct kevent_id id;
+ /* Event type, e.g. KEVENT_SOCKET, KEVENT_INODE, KEVENT_TIMER and so on... */
+ __u32 type;
+ /* Event itself, e.g. KEVENT_SOCKET_ACCEPT, KEVENT_INODE_CREATE, KEVENT_TIMER_FIRED... */
+ __u32 event;
+ /* Per-event request flags */
+ __u32 req_flags;
+ /* Per-event return flags */
+ __u32 ret_flags;
+ /* Event return data. Event originator fills it with anything it likes. */
+ __u32 ret_data[2];
+ /* User's data. It is not used, just copied to/from user.
+ * The whole structure is aligned to 8 bytes already, so the last union
+ * is aligned properly.
+ */
+ union {
+ __u32 user[2];
+ void *ptr;
+ };
+};
+
+struct mukevent
+{
+ struct kevent_id id;
+ __u32 ret_flags;
+};
+
+#define KEVENT_MAX_EVENTS 4096
+
+/*
+ * Note that kevents do not exactly fill the page (each mukevent is 12 bytes),
+ * so we reuse 4 bytes at the beginning of the first page to store the index.
+ * Take that into account if you want to change size of struct mukevent.
+ */
+#define KEVENTS_ON_PAGE ((PAGE_SIZE-sizeof(unsigned int))/sizeof(struct mukevent))
+struct kevent_mring
+{
+ unsigned int index;
+ struct mukevent event[KEVENTS_ON_PAGE];
+};
+
+#define KEVENT_CTL_ADD 0
+#define KEVENT_CTL_REMOVE 1
+#define KEVENT_CTL_MODIFY 2
+
+#endif /* __UKEVENT_H */
diff --git a/init/Kconfig b/init/Kconfig
index a099fc6..c550fcc 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -218,6 +218,8 @@ config AUDITSYSCALL
such as SELinux. To use audit's filesystem watch feature, please
ensure that INOTIFY is configured.

+source "kernel/kevent/Kconfig"
+
config IKCONFIG
bool "Kernel .config support"
---help---
diff --git a/kernel/Makefile b/kernel/Makefile
index d62ec66..2d7a6dd 100644
--- a/kernel/Makefile
+++ b/kernel/Makefile
@@ -47,6 +47,7 @@ obj-$(CONFIG_DETECT_SOFTLOCKUP) += softl
obj-$(CONFIG_GENERIC_HARDIRQS) += irq/
obj-$(CONFIG_SECCOMP) += seccomp.o
obj-$(CONFIG_RCU_TORTURE_TEST) += rcutorture.o
+obj-$(CONFIG_KEVENT) += kevent/
obj-$(CONFIG_RELAY) += relay.o
obj-$(CONFIG_TASK_DELAY_ACCT) += delayacct.o
obj-$(CONFIG_TASKSTATS) += taskstats.o
diff --git a/kernel/kevent/Kconfig b/kernel/kevent/Kconfig
new file mode 100644
index 0000000..977699c
--- /dev/null
+++ b/kernel/kevent/Kconfig
@@ -0,0 +1,31 @@
+config KEVENT
+ bool "Kernel event notification mechanism"
+ help
+ This option enables the event queue mechanism.
+ It can be used as a replacement for poll()/select(), AIO callback
+ invocations, advanced timer notifications and other kernel
+ object status changes.
+
+config KEVENT_USER_STAT
+ bool "Kevent user statistic"
+ depends on KEVENT
+ default n
+ help
+ This option will turn kevent_user statistic collection on.
+ Statistic data includes the total number of kevents, the number of kevents
+ which are ready immediately at insertion time and the number of kevents
+ which were removed through readiness completion.
+ It will be printed each time a control kevent descriptor is closed.
+
+config KEVENT_TIMER
+ bool "Kernel event notifications for timers"
+ depends on KEVENT
+ help
+ This option allows using timers through the KEVENT subsystem.
+
+config KEVENT_POLL
+ bool "Kernel event notifications for poll()/select()"
+ depends on KEVENT
+ help
+ This option allows using the kevent subsystem for poll()/select()
+ notifications.
diff --git a/kernel/kevent/Makefile b/kernel/kevent/Makefile
new file mode 100644
index 0000000..ab6bca0
--- /dev/null
+++ b/kernel/kevent/Makefile
@@ -0,0 +1,3 @@
+obj-y := kevent.o kevent_user.o
+obj-$(CONFIG_KEVENT_TIMER) += kevent_timer.o
+obj-$(CONFIG_KEVENT_POLL) += kevent_poll.o
diff --git a/kernel/kevent/kevent.c b/kernel/kevent/kevent.c
new file mode 100644
index 0000000..422f585
--- /dev/null
+++ b/kernel/kevent/kevent.c
@@ -0,0 +1,227 @@
+/*
+ * 2006 Copyright (c) Evgeniy Polyakov <[email protected]>
+ * All rights reserved.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software
+ * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA
+ */
+
+#include <linux/kernel.h>
+#include <linux/types.h>
+#include <linux/list.h>
+#include <linux/slab.h>
+#include <linux/spinlock.h>
+#include <linux/mempool.h>
+#include <linux/sched.h>
+#include <linux/wait.h>
+#include <linux/kevent.h>
+
+/*
+ * Attempts to add an event into appropriate origin's queue.
+ * Returns positive value if this event is ready immediately,
+ * negative value in case of error and zero if event has been queued.
+ * ->enqueue() callback must increase origin's reference counter.
+ */
+int kevent_enqueue(struct kevent *k)
+{
+ return k->callbacks.enqueue(k);
+}
+
+/*
+ * Remove event from the appropriate queue.
+ * ->dequeue() callback must decrease origin's reference counter.
+ */
+int kevent_dequeue(struct kevent *k)
+{
+ return k->callbacks.dequeue(k);
+}
+
+/*
+ * Mark kevent as broken.
+ */
+int kevent_break(struct kevent *k)
+{
+ unsigned long flags;
+
+ spin_lock_irqsave(&k->ulock, flags);
+ k->event.ret_flags |= KEVENT_RET_BROKEN;
+ spin_unlock_irqrestore(&k->ulock, flags);
+ return -EINVAL;
+}
+
+static struct kevent_callbacks kevent_registered_callbacks[KEVENT_MAX];
+
+int kevent_add_callbacks(const struct kevent_callbacks *cb, int pos)
+{
+ struct kevent_callbacks *p;
+
+ if (pos >= KEVENT_MAX)
+ return -EINVAL;
+
+ p = &kevent_registered_callbacks[pos];
+
+ p->enqueue = (cb->enqueue) ? cb->enqueue : kevent_break;
+ p->dequeue = (cb->dequeue) ? cb->dequeue : kevent_break;
+ p->callback = (cb->callback) ? cb->callback : kevent_break;
+
+ printk(KERN_INFO "KEVENT: Added callbacks for type %d.\n", pos);
+ return 0;
+}
+
+/*
+ * Must be called before event is going to be added into some origin's queue.
+ * Initializes ->enqueue(), ->dequeue() and ->callback() callbacks.
+ * If it fails, the kevent must not be used, since kevent_enqueue() will refuse
+ * to add this kevent into the origin's queue, setting the
+ * KEVENT_RET_BROKEN flag in kevent->event.ret_flags.
+ */
+int kevent_init(struct kevent *k)
+{
+ spin_lock_init(&k->ulock);
+ k->flags = 0;
+
+ if (unlikely(k->event.type >= KEVENT_MAX ||
+ !kevent_registered_callbacks[k->event.type].callback))
+ return kevent_break(k);
+
+ k->callbacks = kevent_registered_callbacks[k->event.type];
+ if (unlikely(k->callbacks.callback == kevent_break))
+ return kevent_break(k);
+
+ return 0;
+}
+
+/*
+ * Called from ->enqueue() callback when reference counter for given
+ * origin (socket, inode...) has been increased.
+ */
+int kevent_storage_enqueue(struct kevent_storage *st, struct kevent *k)
+{
+ unsigned long flags;
+
+ k->st = st;
+ spin_lock_irqsave(&st->lock, flags);
+ list_add_tail_rcu(&k->storage_entry, &st->list);
+ k->flags |= KEVENT_STORAGE;
+ spin_unlock_irqrestore(&st->lock, flags);
+ return 0;
+}
+
+/*
+ * Dequeue kevent from origin's queue.
+ * It does not decrease the origin's reference counter in any way
+ * and must be called before that counter is dropped, so the storage itself is still valid.
+ * It is called from ->dequeue() callback.
+ */
+void kevent_storage_dequeue(struct kevent_storage *st, struct kevent *k)
+{
+ unsigned long flags;
+
+ spin_lock_irqsave(&st->lock, flags);
+ if (k->flags & KEVENT_STORAGE) {
+ list_del_rcu(&k->storage_entry);
+ k->flags &= ~KEVENT_STORAGE;
+ }
+ spin_unlock_irqrestore(&st->lock, flags);
+}
+
+/*
+ * Call kevent ready callback and queue it into ready queue if needed.
+ * If kevent is marked as one-shot, then remove it from storage queue.
+ */
+static void __kevent_requeue(struct kevent *k, u32 event)
+{
+ int ret, rem;
+ unsigned long flags;
+
+ ret = k->callbacks.callback(k);
+
+ spin_lock_irqsave(&k->ulock, flags);
+ if (ret > 0)
+ k->event.ret_flags |= KEVENT_RET_DONE;
+ else if (ret < 0)
+ k->event.ret_flags |= (KEVENT_RET_BROKEN | KEVENT_RET_DONE);
+ else
+ ret = (k->event.ret_flags & (KEVENT_RET_BROKEN|KEVENT_RET_DONE));
+ rem = (k->event.req_flags & KEVENT_REQ_ONESHOT);
+ spin_unlock_irqrestore(&k->ulock, flags);
+
+ if (ret) {
+ if ((rem || ret < 0) && (k->flags & KEVENT_STORAGE)) {
+ list_del_rcu(&k->storage_entry);
+ k->flags &= ~KEVENT_STORAGE;
+ }
+
+ spin_lock_irqsave(&k->user->ready_lock, flags);
+ if (!(k->flags & KEVENT_READY)) {
+ kevent_user_ring_add_event(k);
+ list_add_tail(&k->ready_entry, &k->user->ready_list);
+ k->flags |= KEVENT_READY;
+ k->user->ready_num++;
+ }
+ spin_unlock_irqrestore(&k->user->ready_lock, flags);
+ wake_up(&k->user->wait);
+ }
+}
+
+/*
+ * Check if kevent is ready (by invoking its callback) and requeue/remove
+ * if needed.
+ */
+void kevent_requeue(struct kevent *k)
+{
+ unsigned long flags;
+
+ spin_lock_irqsave(&k->st->lock, flags);
+ __kevent_requeue(k, 0);
+ spin_unlock_irqrestore(&k->st->lock, flags);
+}
+
+/*
+ * Called each time some activity in origin (socket, inode...) is noticed.
+ */
+void kevent_storage_ready(struct kevent_storage *st,
+ kevent_callback_t ready_callback, u32 event)
+{
+ struct kevent *k;
+
+ rcu_read_lock();
+ if (ready_callback)
+ list_for_each_entry_rcu(k, &st->list, storage_entry)
+ (*ready_callback)(k);
+
+ list_for_each_entry_rcu(k, &st->list, storage_entry)
+ if (event & k->event.event)
+ __kevent_requeue(k, event);
+ rcu_read_unlock();
+}
+
+int kevent_storage_init(void *origin, struct kevent_storage *st)
+{
+ spin_lock_init(&st->lock);
+ st->origin = origin;
+ INIT_LIST_HEAD(&st->list);
+ return 0;
+}
+
+/*
+ * Mark all events as broken, which removes them from the storage,
+ * so the storage origin (inode, socket and so on) can be safely removed.
+ * No new entries are allowed to be added into the storage at this point.
+ * (Socket is removed from file table at this point for example).
+ */
+void kevent_storage_fini(struct kevent_storage *st)
+{
+ kevent_storage_ready(st, kevent_break, KEVENT_MASK_ALL);
+}
diff --git a/kernel/kevent/kevent_user.c b/kernel/kevent/kevent_user.c
new file mode 100644
index 0000000..8e01ec3
--- /dev/null
+++ b/kernel/kevent/kevent_user.c
@@ -0,0 +1,863 @@
+/*
+ * 2006 Copyright (c) Evgeniy Polyakov <[email protected]>
+ * All rights reserved.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software
+ * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA
+ */
+
+#include <linux/kernel.h>
+#include <linux/module.h>
+#include <linux/types.h>
+#include <linux/list.h>
+#include <linux/slab.h>
+#include <linux/spinlock.h>
+#include <linux/fs.h>
+#include <linux/file.h>
+#include <linux/mount.h>
+#include <linux/device.h>
+#include <linux/poll.h>
+#include <linux/kevent.h>
+#include <linux/jhash.h>
+#include <linux/miscdevice.h>
+#include <asm/io.h>
+
+static char kevent_name[] = "kevent";
+static kmem_cache_t *kevent_cache;
+
+/*
+ * kevents are pollable, return POLLIN and POLLRDNORM
+ * when there is at least one ready kevent.
+ */
+static unsigned int kevent_user_poll(struct file *file, struct poll_table_struct *wait)
+{
+ struct kevent_user *u = file->private_data;
+ unsigned int mask;
+
+ poll_wait(file, &u->wait, wait);
+ mask = 0;
+
+ if (u->ready_num)
+ mask |= POLLIN | POLLRDNORM;
+
+ return mask;
+}
+
+static inline void kevent_user_ring_set(struct kevent_user *u, unsigned int num)
+{
+ u->pring[0]->index = num;
+}
+
+static int kevent_user_ring_grow(struct kevent_user *u)
+{
+ unsigned int idx;
+
+ idx = (u->pring[0]->index + 1) / KEVENTS_ON_PAGE;
+ if (idx >= u->pages_in_use) {
+ u->pring[idx] = (void *)__get_free_page(GFP_KERNEL);
+ if (!u->pring[idx])
+ return -ENOMEM;
+ u->pages_in_use++;
+ }
+ return 0;
+}
+
+/*
+ * Called under kevent_user->ready_lock, so updates are always protected.
+ */
+void kevent_user_ring_add_event(struct kevent *k)
+{
+ unsigned int pidx, off;
+ struct kevent_mring *ring, *copy_ring;
+
+ ring = k->user->pring[0];
+
+ pidx = ring->index/KEVENTS_ON_PAGE;
+ off = ring->index%KEVENTS_ON_PAGE;
+
+ copy_ring = k->user->pring[pidx];
+
+ copy_ring->event[off].id.raw[0] = k->event.id.raw[0];
+ copy_ring->event[off].id.raw[1] = k->event.id.raw[1];
+ copy_ring->event[off].ret_flags = k->event.ret_flags;
+
+ if (++ring->index >= KEVENT_MAX_EVENTS)
+ ring->index = 0;
+}
+
+/*
+ * Initialize mmap ring buffer.
+ * It will store ready kevents, so userspace can fetch them directly instead
+ * of using a syscall. Essentially the syscall becomes just a waiting point.
+ */
+static int kevent_user_ring_init(struct kevent_user *u)
+{
+ int pnum;
+
+ pnum = ALIGN(KEVENT_MAX_EVENTS*sizeof(struct mukevent) + sizeof(unsigned int), PAGE_SIZE)/PAGE_SIZE;
+
+ u->pring = kmalloc(pnum * sizeof(struct kevent_mring *), GFP_KERNEL);
+ if (!u->pring)
+ return -ENOMEM;
+
+ u->pring[0] = (struct kevent_mring *)__get_free_page(GFP_KERNEL);
+ if (!u->pring[0])
+ goto err_out_free;
+
+ u->pages_in_use = 1;
+ kevent_user_ring_set(u, 0);
+
+ return 0;
+
+err_out_free:
+ kfree(u->pring);
+
+ return -ENOMEM;
+}
+
+static void kevent_user_ring_fini(struct kevent_user *u)
+{
+ int i;
+
+ for (i = 0; i < u->pages_in_use; ++i)
+ free_page((unsigned long)u->pring[i]);
+
+ kfree(u->pring);
+}
+
+static int kevent_user_open(struct inode *inode, struct file *file)
+{
+ struct kevent_user *u;
+ int i;
+
+ u = kzalloc(sizeof(struct kevent_user), GFP_KERNEL);
+ if (!u)
+ return -ENOMEM;
+
+ INIT_LIST_HEAD(&u->ready_list);
+ spin_lock_init(&u->ready_lock);
+ kevent_stat_init(u);
+ spin_lock_init(&u->kevent_lock);
+ for (i = 0; i < ARRAY_SIZE(u->kevent_list); ++i)
+ INIT_LIST_HEAD(&u->kevent_list[i]);
+
+ mutex_init(&u->ctl_mutex);
+ init_waitqueue_head(&u->wait);
+
+ atomic_set(&u->refcnt, 1);
+
+ if (unlikely(kevent_user_ring_init(u))) {
+ kfree(u);
+ return -ENOMEM;
+ }
+
+ file->private_data = u;
+ return 0;
+}
+
+/*
+ * Kevent userspace control block reference counting.
+ * Set to 1 at creation time, when appropriate kevent file descriptor
+ * is closed, that reference counter is decreased.
+ * When counter hits zero block is freed.
+ */
+static inline void kevent_user_get(struct kevent_user *u)
+{
+ atomic_inc(&u->refcnt);
+}
+
+static inline void kevent_user_put(struct kevent_user *u)
+{
+ if (atomic_dec_and_test(&u->refcnt)) {
+ kevent_stat_print(u);
+ kevent_user_ring_fini(u);
+ kfree(u);
+ }
+}
+
+static struct page *kevent_user_nopage(struct vm_area_struct *vma, unsigned long addr, int *type)
+{
+ struct kevent_user *u = vma->vm_file->private_data;
+ unsigned long off = (addr - vma->vm_start)/PAGE_SIZE;
+
+ if (type)
+ *type = VM_FAULT_MINOR;
+
+ if (off >= u->pages_in_use)
+ goto err_out_sigbus;
+
+ return virt_to_page(u->pring[off]);
+
+err_out_sigbus:
+ return NOPAGE_SIGBUS;
+}
+
+static struct vm_operations_struct kevent_user_vm_ops = {
+ .nopage = &kevent_user_nopage,
+};
+
+/*
+ * Mmap implementation for ring buffer, which is created as array
+ * of pages, so vm_pgoff is an offset (in pages, not in bytes) of
+ * the first page to be mapped.
+ */
+static int kevent_user_mmap(struct file *file, struct vm_area_struct *vma)
+{
+ unsigned long start = vma->vm_start;
+ struct kevent_user *u = file->private_data;
+
+ if (vma->vm_flags & VM_WRITE)
+ return -EPERM;
+
+ vma->vm_page_prot = pgprot_noncached(vma->vm_page_prot);
+ vma->vm_ops = &kevent_user_vm_ops;
+ vma->vm_flags |= VM_RESERVED;
+ vma->vm_file = file;
+
+ if (vm_insert_page(vma, start, virt_to_page(u->pring[0])))
+ return -EFAULT;
+
+ return 0;
+}
+
+static inline unsigned int kevent_user_hash(struct ukevent *uk)
+{
+ return jhash_1word(uk->id.raw[0], 0) & KEVENT_HASH_MASK;
+}
+
+/*
+ * RCU protects storage list (kevent->storage_entry).
+ * Free entry in RCU callback, it is dequeued from all lists at
+ * this point.
+ */
+
+static void kevent_free_rcu(struct rcu_head *rcu)
+{
+ struct kevent *kevent = container_of(rcu, struct kevent, rcu_head);
+ kmem_cache_free(kevent_cache, kevent);
+}
+
+/*
+ * Complete kevent removing - it dequeues kevent from storage list
+ * if it is requested, removes kevent from ready list, drops userspace
+ * control block reference counter and schedules kevent freeing through RCU.
+ */
+static void kevent_finish_user_complete(struct kevent *k, int deq)
+{
+ struct kevent_user *u = k->user;
+ unsigned long flags;
+
+ if (deq)
+ kevent_dequeue(k);
+
+ spin_lock_irqsave(&u->ready_lock, flags);
+ if (k->flags & KEVENT_READY) {
+ list_del(&k->ready_entry);
+ k->flags &= ~KEVENT_READY;
+ u->ready_num--;
+ }
+ spin_unlock_irqrestore(&u->ready_lock, flags);
+
+ kevent_user_put(u);
+ call_rcu(&k->rcu_head, kevent_free_rcu);
+}
+
+/*
+ * Remove from all lists and free kevent.
+ * Must be called under kevent_user->kevent_lock to protect
+ * kevent->kevent_entry removing.
+ */
+static void __kevent_finish_user(struct kevent *k, int deq)
+{
+ struct kevent_user *u = k->user;
+
+ list_del(&k->kevent_entry);
+ k->flags &= ~KEVENT_USER;
+ u->kevent_num--;
+ kevent_finish_user_complete(k, deq);
+}
+
+/*
+ * Remove kevent from user's list of all events,
+ * dequeue it from storage and decrease user's reference counter,
+ * since this kevent does not exist anymore. That is why it is freed here.
+ */
+static void kevent_finish_user(struct kevent *k, int deq)
+{
+ struct kevent_user *u = k->user;
+ unsigned long flags;
+
+ spin_lock_irqsave(&u->kevent_lock, flags);
+ list_del(&k->kevent_entry);
+ k->flags &= ~KEVENT_USER;
+ u->kevent_num--;
+ spin_unlock_irqrestore(&u->kevent_lock, flags);
+ kevent_finish_user_complete(k, deq);
+}
+
+/*
+ * Dequeue one entry from user's ready queue.
+ */
+static struct kevent *kqueue_dequeue_ready(struct kevent_user *u)
+{
+ unsigned long flags;
+ struct kevent *k = NULL;
+
+ spin_lock_irqsave(&u->ready_lock, flags);
+ if (u->ready_num && !list_empty(&u->ready_list)) {
+ k = list_entry(u->ready_list.next, struct kevent, ready_entry);
+ list_del(&k->ready_entry);
+ k->flags &= ~KEVENT_READY;
+ u->ready_num--;
+ }
+ spin_unlock_irqrestore(&u->ready_lock, flags);
+
+ return k;
+}
+
+/*
+ * Search a kevent inside hash bucket for given ukevent.
+ */
+static struct kevent *__kevent_search(struct list_head *head, struct ukevent *uk,
+ struct kevent_user *u)
+{
+ struct kevent *k, *ret = NULL;
+
+ list_for_each_entry(k, head, kevent_entry) {
+ if (k->event.user[0] == uk->user[0] && k->event.user[1] == uk->user[1] &&
+ k->event.id.raw[0] == uk->id.raw[0] &&
+ k->event.id.raw[1] == uk->id.raw[1]) {
+ ret = k;
+ break;
+ }
+ }
+
+ return ret;
+}
+
+/*
+ * Search and modify kevent according to provided ukevent.
+ */
+static int kevent_modify(struct ukevent *uk, struct kevent_user *u)
+{
+ struct kevent *k;
+ unsigned int hash = kevent_user_hash(uk);
+ int err = -ENODEV;
+ unsigned long flags;
+
+ spin_lock_irqsave(&u->kevent_lock, flags);
+ k = __kevent_search(&u->kevent_list[hash], uk, u);
+ if (k) {
+ spin_lock(&k->ulock);
+ k->event.event = uk->event;
+ k->event.req_flags = uk->req_flags;
+ k->event.ret_flags = 0;
+ spin_unlock(&k->ulock);
+ kevent_requeue(k);
+ err = 0;
+ }
+ spin_unlock_irqrestore(&u->kevent_lock, flags);
+
+ return err;
+}
+
+/*
+ * Remove kevent which matches provided ukevent.
+ */
+static int kevent_remove(struct ukevent *uk, struct kevent_user *u)
+{
+ int err = -ENODEV;
+ struct kevent *k;
+ unsigned int hash = kevent_user_hash(uk);
+ unsigned long flags;
+
+ spin_lock_irqsave(&u->kevent_lock, flags);
+ k = __kevent_search(&u->kevent_list[hash], uk, u);
+ if (k) {
+ __kevent_finish_user(k, 1);
+ err = 0;
+ }
+ spin_unlock_irqrestore(&u->kevent_lock, flags);
+
+ return err;
+}
+
+/*
+ * Detaches userspace control block from file descriptor
+ * and decreases its reference counter.
+ * No new kevents can be added or removed from any list at this point.
+ */
+static int kevent_user_release(struct inode *inode, struct file *file)
+{
+ struct kevent_user *u = file->private_data;
+ struct kevent *k, *n;
+ int i;
+
+ for (i = 0; i < ARRAY_SIZE(u->kevent_list); ++i) {
+ list_for_each_entry_safe(k, n, &u->kevent_list[i], kevent_entry)
+ kevent_finish_user(k, 1);
+ }
+
+ kevent_user_put(u);
+ file->private_data = NULL;
+
+ return 0;
+}
+
+/*
+ * Read requested number of ukevents in one shot.
+ */
+static struct ukevent *kevent_get_user(unsigned int num, void __user *arg)
+{
+ struct ukevent *ukev;
+
+ ukev = kmalloc(sizeof(struct ukevent) * num, GFP_KERNEL);
+ if (!ukev)
+ return NULL;
+
+ if (copy_from_user(ukev, arg, sizeof(struct ukevent) * num)) {
+ kfree(ukev);
+ return NULL;
+ }
+
+ return ukev;
+}
+
+/*
+ * Read from userspace all ukevents and modify appropriate kevents.
+ * If the provided number of ukevents is larger than the threshold, it is faster
+ * to allocate room for them and copy in one shot instead of copying
+ * one-by-one and then processing them.
+ */
+static int kevent_user_ctl_modify(struct kevent_user *u, unsigned int num, void __user *arg)
+{
+ int err = 0, i;
+ struct ukevent uk;
+
+ mutex_lock(&u->ctl_mutex);
+
+ if (num > u->kevent_num) {
+ err = -EINVAL;
+ goto out;
+ }
+
+ if (num > KEVENT_MIN_BUFFS_ALLOC) {
+ struct ukevent *ukev;
+
+ ukev = kevent_get_user(num, arg);
+ if (ukev) {
+ for (i = 0; i < num; ++i) {
+ if (kevent_modify(&ukev[i], u))
+ ukev[i].ret_flags |= KEVENT_RET_BROKEN;
+ ukev[i].ret_flags |= KEVENT_RET_DONE;
+ }
+ if (copy_to_user(arg, ukev, num*sizeof(struct ukevent)))
+ err = -EFAULT;
+ kfree(ukev);
+ goto out;
+ }
+ }
+
+ for (i = 0; i < num; ++i) {
+ if (copy_from_user(&uk, arg, sizeof(struct ukevent))) {
+ err = -EFAULT;
+ break;
+ }
+
+ if (kevent_modify(&uk, u))
+ uk.ret_flags |= KEVENT_RET_BROKEN;
+ uk.ret_flags |= KEVENT_RET_DONE;
+
+ if (copy_to_user(arg, &uk, sizeof(struct ukevent))) {
+ err = -EFAULT;
+ break;
+ }
+
+ arg += sizeof(struct ukevent);
+ }
+out:
+ mutex_unlock(&u->ctl_mutex);
+
+ return err;
+}
+
+/*
+ * Read from userspace all ukevents and remove appropriate kevents.
+ * If the provided number of ukevents is larger than the threshold, it is faster
+ * to allocate room for them and copy in one shot instead of copying
+ * one-by-one and then processing them.
+ */
+static int kevent_user_ctl_remove(struct kevent_user *u, unsigned int num, void __user *arg)
+{
+ int err = 0, i;
+ struct ukevent uk;
+
+ mutex_lock(&u->ctl_mutex);
+
+ if (num > u->kevent_num) {
+ err = -EINVAL;
+ goto out;
+ }
+
+ if (num > KEVENT_MIN_BUFFS_ALLOC) {
+ struct ukevent *ukev;
+
+ ukev = kevent_get_user(num, arg);
+ if (ukev) {
+ for (i = 0; i < num; ++i) {
+ if (kevent_remove(&ukev[i], u))
+ ukev[i].ret_flags |= KEVENT_RET_BROKEN;
+ ukev[i].ret_flags |= KEVENT_RET_DONE;
+ }
+ if (copy_to_user(arg, ukev, num*sizeof(struct ukevent)))
+ err = -EFAULT;
+ kfree(ukev);
+ goto out;
+ }
+ }
+
+ for (i = 0; i < num; ++i) {
+ if (copy_from_user(&uk, arg, sizeof(struct ukevent))) {
+ err = -EFAULT;
+ break;
+ }
+
+ if (kevent_remove(&uk, u))
+ uk.ret_flags |= KEVENT_RET_BROKEN;
+
+ uk.ret_flags |= KEVENT_RET_DONE;
+
+ if (copy_to_user(arg, &uk, sizeof(struct ukevent))) {
+ err = -EFAULT;
+ break;
+ }
+
+ arg += sizeof(struct ukevent);
+ }
+out:
+ mutex_unlock(&u->ctl_mutex);
+
+ return err;
+}
+
+/*
+ * Queue kevent into userspace control block and increase
+ * its reference counter.
+ */
+static void kevent_user_enqueue(struct kevent_user *u, struct kevent *k)
+{
+ unsigned long flags;
+ unsigned int hash = kevent_user_hash(&k->event);
+
+ spin_lock_irqsave(&u->kevent_lock, flags);
+ list_add_tail(&k->kevent_entry, &u->kevent_list[hash]);
+ k->flags |= KEVENT_USER;
+ u->kevent_num++;
+ kevent_user_get(u);
+ spin_unlock_irqrestore(&u->kevent_lock, flags);
+}
+
+/*
+ * Add kevent from both kernel and userspace users.
+ * This function allocates and queues kevent, returns negative value
+ * on error, positive if kevent is ready immediately and zero
+ * if kevent has been queued.
+ */
+int kevent_user_add_ukevent(struct ukevent *uk, struct kevent_user *u)
+{
+ struct kevent *k;
+ int err;
+
+ if (kevent_user_ring_grow(u)) {
+ err = -ENOMEM;
+ goto err_out_exit;
+ }
+
+ k = kmem_cache_alloc(kevent_cache, GFP_KERNEL);
+ if (!k) {
+ err = -ENOMEM;
+ goto err_out_exit;
+ }
+
+ memcpy(&k->event, uk, sizeof(struct ukevent));
+ INIT_RCU_HEAD(&k->rcu_head);
+
+ k->event.ret_flags = 0;
+
+ err = kevent_init(k);
+ if (err) {
+ kmem_cache_free(kevent_cache, k);
+ goto err_out_exit;
+ }
+ k->user = u;
+ kevent_stat_total(u);
+ kevent_user_enqueue(u, k);
+
+ err = kevent_enqueue(k);
+ if (err) {
+ memcpy(uk, &k->event, sizeof(struct ukevent));
+ kevent_finish_user(k, 0);
+ goto err_out_exit;
+ }
+
+ return 0;
+
+err_out_exit:
+ if (err < 0) {
+ uk->ret_flags |= KEVENT_RET_BROKEN | KEVENT_RET_DONE;
+ uk->ret_data[1] = err;
+ } else if (err > 0)
+ uk->ret_flags |= KEVENT_RET_DONE;
+ return err;
+}
+
+/*
+ * Copy all ukevents from userspace, allocate kevent for each one
+ * and add them into appropriate kevent_storages,
+ * e.g. sockets, inodes and so on...
+ * Ready events will replace the ones provided by the user, and the number
+ * of ready events is returned.
+ * The user must check the ret_flags field of each ukevent structure
+ * to determine whether it is a fired or a failed event.
+ */
+static int kevent_user_ctl_add(struct kevent_user *u, unsigned int num, void __user *arg)
+{
+ int err, cerr = 0, knum = 0, rnum = 0, i;
+ void __user *orig = arg;
+ struct ukevent uk;
+
+ mutex_lock(&u->ctl_mutex);
+
+ err = -EINVAL;
+ if (u->kevent_num + num >= KEVENT_MAX_EVENTS)
+ goto out_remove;
+
+ if (num > KEVENT_MIN_BUFFS_ALLOC) {
+ struct ukevent *ukev;
+
+ ukev = kevent_get_user(num, arg);
+ if (ukev) {
+ for (i = 0; i < num; ++i) {
+ err = kevent_user_add_ukevent(&ukev[i], u);
+ if (err) {
+ kevent_stat_im(u);
+ if (i != rnum)
+ memcpy(&ukev[rnum], &ukev[i], sizeof(struct ukevent));
+ rnum++;
+ } else
+ knum++;
+ }
+ if (copy_to_user(orig, ukev, rnum*sizeof(struct ukevent)))
+ cerr = -EFAULT;
+ kfree(ukev);
+ goto out_setup;
+ }
+ }
+
+ for (i = 0; i < num; ++i) {
+ if (copy_from_user(&uk, arg, sizeof(struct ukevent))) {
+ cerr = -EFAULT;
+ break;
+ }
+ arg += sizeof(struct ukevent);
+
+ err = kevent_user_add_ukevent(&uk, u);
+ if (err) {
+ kevent_stat_im(u);
+ if (copy_to_user(orig, &uk, sizeof(struct ukevent))) {
+ cerr = -EFAULT;
+ break;
+ }
+ orig += sizeof(struct ukevent);
+ rnum++;
+ } else
+ knum++;
+ }
+
+out_setup:
+ if (cerr < 0) {
+ err = cerr;
+ goto out_remove;
+ }
+
+ err = rnum;
+out_remove:
+ mutex_unlock(&u->ctl_mutex);
+
+ return err;
+}
+
+/*
+ * In nonblocking mode it returns as many events as possible, but not more than @max_nr.
+ * In blocking mode it waits until the timeout expires or at least @min_nr events are ready.
+ */
+static int kevent_user_wait(struct file *file, struct kevent_user *u,
+ unsigned int min_nr, unsigned int max_nr, __u64 timeout,
+ void __user *buf)
+{
+ struct kevent *k;
+ int num = 0;
+
+ if (!(file->f_flags & O_NONBLOCK)) {
+ wait_event_interruptible_timeout(u->wait,
+ u->ready_num >= min_nr,
+ clock_t_to_jiffies(nsec_to_clock_t(timeout)));
+ }
+
+ while (num < max_nr && ((k = kqueue_dequeue_ready(u)) != NULL)) {
+ if (copy_to_user(buf + num*sizeof(struct ukevent),
+ &k->event, sizeof(struct ukevent)))
+ break;
+
+ /*
+ * If it is one-shot kevent, it has been removed already from
+ * origin's queue, so we can easily free it here.
+ */
+ if (k->event.req_flags & KEVENT_REQ_ONESHOT)
+ kevent_finish_user(k, 1);
+ ++num;
+ kevent_stat_wait(u);
+ }
+
+ return num;
+}
+
+static struct file_operations kevent_user_fops = {
+ .mmap = kevent_user_mmap,
+ .open = kevent_user_open,
+ .release = kevent_user_release,
+ .poll = kevent_user_poll,
+ .owner = THIS_MODULE,
+};
+
+static struct miscdevice kevent_miscdev = {
+ .minor = MISC_DYNAMIC_MINOR,
+ .name = kevent_name,
+ .fops = &kevent_user_fops,
+};
+
+static int kevent_ctl_process(struct file *file, unsigned int cmd, unsigned int num, void __user *arg)
+{
+ int err;
+ struct kevent_user *u = file->private_data;
+
+ if (!u || num > KEVENT_MAX_EVENTS)
+ return -EINVAL;
+
+ switch (cmd) {
+ case KEVENT_CTL_ADD:
+ err = kevent_user_ctl_add(u, num, arg);
+ break;
+ case KEVENT_CTL_REMOVE:
+ err = kevent_user_ctl_remove(u, num, arg);
+ break;
+ case KEVENT_CTL_MODIFY:
+ err = kevent_user_ctl_modify(u, num, arg);
+ break;
+ default:
+ err = -EINVAL;
+ break;
+ }
+
+ return err;
+}
+
+/*
+ * Used to get ready kevents from queue.
+ * @ctl_fd - kevent control descriptor, obtained by opening the kevent character device.
+ * @min_nr - minimum number of ready kevents.
+ * @max_nr - maximum number of ready kevents.
+ * @timeout - timeout in nanoseconds to wait until some events are ready.
+ * @buf - buffer to place ready events.
+ * @flags - unused for now (will be used for the mmap implementation).
+ */
+asmlinkage long sys_kevent_get_events(int ctl_fd, unsigned int min_nr, unsigned int max_nr,
+ __u64 timeout, void __user *buf, unsigned flags)
+{
+ int err = -EINVAL;
+ struct file *file;
+ struct kevent_user *u;
+
+ file = fget(ctl_fd);
+ if (!file)
+ return -ENODEV;
+
+ if (file->f_op != &kevent_user_fops)
+ goto out_fput;
+ u = file->private_data;
+
+ err = kevent_user_wait(file, u, min_nr, max_nr, timeout, buf);
+out_fput:
+ fput(file);
+ return err;
+}
+
+/*
+ * This syscall is used to perform various control operations
+ * on given kevent queue, which is obtained through kevent file descriptor @fd.
+ * @cmd - type of operation.
+ * @num - number of kevents to be processed.
+ * @arg - pointer to array of struct ukevent.
+ */
+asmlinkage long sys_kevent_ctl(int fd, unsigned int cmd, unsigned int num, void __user *arg)
+{
+ int err = -EINVAL;
+ struct file *file;
+
+ file = fget(fd);
+ if (!file)
+ return -ENODEV;
+
+ if (file->f_op != &kevent_user_fops)
+ goto out_fput;
+
+ err = kevent_ctl_process(file, cmd, num, arg);
+
+out_fput:
+ fput(file);
+ return err;
+}
+
+/*
+ * Kevent subsystem initialization - create kevent cache and register
+ * the miscdevice to get control file descriptors from.
+ */
+static int __devinit kevent_user_init(void)
+{
+ int err = 0;
+
+ kevent_cache = kmem_cache_create("kevent_cache",
+ sizeof(struct kevent), 0, SLAB_PANIC, NULL, NULL);
+
+ err = misc_register(&kevent_miscdev);
+ if (err) {
+ printk(KERN_ERR "Failed to register kevent miscdev: err=%d.\n", err);
+ goto err_out_exit;
+ }
+
+ printk("KEVENT subsystem has been successfully registered.\n");
+
+ return 0;
+
+err_out_exit:
+ return err;
+}
+
+static void __devexit kevent_user_fini(void)
+{
+ misc_deregister(&kevent_miscdev);
+}
+
+module_init(kevent_user_init);
+module_exit(kevent_user_fini);
diff --git a/kernel/sys_ni.c b/kernel/sys_ni.c
index 6991bec..8d3769b 100644
--- a/kernel/sys_ni.c
+++ b/kernel/sys_ni.c
@@ -122,6 +122,9 @@ cond_syscall(ppc_rtas);
cond_syscall(sys_spu_run);
cond_syscall(sys_spu_create);

+cond_syscall(sys_kevent_get_events);
+cond_syscall(sys_kevent_ctl);
+
/* mmu depending weak syscall entries */
cond_syscall(sys_mprotect);
cond_syscall(sys_msync);

2006-08-25 09:32:11

by Evgeniy Polyakov

Subject: [take14 2/3] kevent: poll/select() notifications.


poll/select() notifications.

This patch includes generic poll/select() notifications.

kevent_poll works similarly to epoll and has the same issues (the callback
is invoked not from the caller's internal state machine, but through
a process wakeup).
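
For comparison with epoll, a hedged sketch of the userspace side
(illustration only, not part of the patch; it assumes the ukevent layout,
KEVENT_CTL_ADD and the i386 syscall number from the core patch). The
descriptor to watch goes into id.raw[0], which kevent_poll_enqueue()
hands to fget():

#include <string.h>
#include <unistd.h>
#include <sys/syscall.h>

#define PAGE_SIZE 4096			/* ukevent.h uses it for KEVENTS_ON_PAGE */
#include <linux/ukevent.h>

#define __NR_kevent_ctl 319		/* i386 number from the core patch */

/* One-shot readability watch on @watched_fd through @kevent_fd. */
static int watch_readable(int kevent_fd, int watched_fd)
{
	struct ukevent uk;

	memset(&uk, 0, sizeof(uk));
	uk.type = KEVENT_POLL;
	uk.event = KEVENT_POLL_POLLIN;
	uk.id.raw[0] = watched_fd;	/* fget()'ed in kevent_poll_enqueue() */
	uk.req_flags = KEVENT_REQ_ONESHOT;

	return syscall(__NR_kevent_ctl, kevent_fd, KEVENT_CTL_ADD, 1, &uk);
}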

Signed-off-by: Evgeniy Polyakov <[email protected]>

diff --git a/include/linux/fs.h b/include/linux/fs.h
index 2561020..76b3039 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -236,6 +236,7 @@ #include <linux/prio_tree.h>
#include <linux/init.h>
#include <linux/sched.h>
#include <linux/mutex.h>
+#include <linux/kevent.h>

#include <asm/atomic.h>
#include <asm/semaphore.h>
@@ -698,6 +699,9 @@ #ifdef CONFIG_EPOLL
struct list_head f_ep_links;
spinlock_t f_ep_lock;
#endif /* #ifdef CONFIG_EPOLL */
+#ifdef CONFIG_KEVENT_POLL
+ struct kevent_storage st;
+#endif
struct address_space *f_mapping;
};
extern spinlock_t files_lock;
diff --git a/kernel/kevent/kevent_poll.c b/kernel/kevent/kevent_poll.c
new file mode 100644
index 0000000..fb74e0f
--- /dev/null
+++ b/kernel/kevent/kevent_poll.c
@@ -0,0 +1,222 @@
+/*
+ * 2006 Copyright (c) Evgeniy Polyakov <[email protected]>
+ * All rights reserved.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ */
+
+#include <linux/kernel.h>
+#include <linux/types.h>
+#include <linux/list.h>
+#include <linux/slab.h>
+#include <linux/spinlock.h>
+#include <linux/timer.h>
+#include <linux/file.h>
+#include <linux/kevent.h>
+#include <linux/poll.h>
+#include <linux/fs.h>
+
+static kmem_cache_t *kevent_poll_container_cache;
+static kmem_cache_t *kevent_poll_priv_cache;
+
+struct kevent_poll_ctl
+{
+ struct poll_table_struct pt;
+ struct kevent *k;
+};
+
+struct kevent_poll_wait_container
+{
+ struct list_head container_entry;
+ wait_queue_head_t *whead;
+ wait_queue_t wait;
+ struct kevent *k;
+};
+
+struct kevent_poll_private
+{
+ struct list_head container_list;
+ spinlock_t container_lock;
+};
+
+static int kevent_poll_enqueue(struct kevent *k);
+static int kevent_poll_dequeue(struct kevent *k);
+static int kevent_poll_callback(struct kevent *k);
+
+static int kevent_poll_wait_callback(wait_queue_t *wait,
+ unsigned mode, int sync, void *key)
+{
+ struct kevent_poll_wait_container *cont =
+ container_of(wait, struct kevent_poll_wait_container, wait);
+ struct kevent *k = cont->k;
+ struct file *file = k->st->origin;
+ u32 revents;
+
+ revents = file->f_op->poll(file, NULL);
+
+ kevent_storage_ready(k->st, NULL, revents);
+
+ return 0;
+}
+
+static void kevent_poll_qproc(struct file *file, wait_queue_head_t *whead,
+ struct poll_table_struct *poll_table)
+{
+ struct kevent *k =
+ container_of(poll_table, struct kevent_poll_ctl, pt)->k;
+ struct kevent_poll_private *priv = k->priv;
+ struct kevent_poll_wait_container *cont;
+ unsigned long flags;
+
+ cont = kmem_cache_alloc(kevent_poll_container_cache, SLAB_KERNEL);
+ if (!cont) {
+ kevent_break(k);
+ return;
+ }
+
+ cont->k = k;
+ init_waitqueue_func_entry(&cont->wait, kevent_poll_wait_callback);
+ cont->whead = whead;
+
+ spin_lock_irqsave(&priv->container_lock, flags);
+ list_add_tail(&cont->container_entry, &priv->container_list);
+ spin_unlock_irqrestore(&priv->container_lock, flags);
+
+ add_wait_queue(whead, &cont->wait);
+}
+
+static int kevent_poll_enqueue(struct kevent *k)
+{
+ struct file *file;
+ int err, ready = 0;
+ unsigned int revents;
+ struct kevent_poll_ctl ctl;
+ struct kevent_poll_private *priv;
+
+ file = fget(k->event.id.raw[0]);
+ if (!file)
+ return -ENODEV;
+
+ err = -EINVAL;
+ if (!file->f_op || !file->f_op->poll)
+ goto err_out_fput;
+
+ err = -ENOMEM;
+ priv = kmem_cache_alloc(kevent_poll_priv_cache, SLAB_KERNEL);
+ if (!priv)
+ goto err_out_fput;
+
+ spin_lock_init(&priv->container_lock);
+ INIT_LIST_HEAD(&priv->container_list);
+
+ k->priv = priv;
+
+ ctl.k = k;
+ init_poll_funcptr(&ctl.pt, &kevent_poll_qproc);
+
+ err = kevent_storage_enqueue(&file->st, k);
+ if (err)
+ goto err_out_free;
+
+ revents = file->f_op->poll(file, &ctl.pt);
+ if (revents & k->event.event) {
+ ready = 1;
+ kevent_poll_dequeue(k);
+ }
+
+ return ready;
+
+err_out_free:
+ kmem_cache_free(kevent_poll_priv_cache, priv);
+err_out_fput:
+ fput(file);
+ return err;
+}
+
+static int kevent_poll_dequeue(struct kevent *k)
+{
+ struct file *file = k->st->origin;
+ struct kevent_poll_private *priv = k->priv;
+ struct kevent_poll_wait_container *w, *n;
+ unsigned long flags;
+
+ kevent_storage_dequeue(k->st, k);
+
+ spin_lock_irqsave(&priv->container_lock, flags);
+ list_for_each_entry_safe(w, n, &priv->container_list, container_entry) {
+ list_del(&w->container_entry);
+ remove_wait_queue(w->whead, &w->wait);
+ kmem_cache_free(kevent_poll_container_cache, w);
+ }
+ spin_unlock_irqrestore(&priv->container_lock, flags);
+
+ kmem_cache_free(kevent_poll_priv_cache, priv);
+ k->priv = NULL;
+
+ fput(file);
+
+ return 0;
+}
+
+static int kevent_poll_callback(struct kevent *k)
+{
+ struct file *file = k->st->origin;
+ unsigned int revents = file->f_op->poll(file, NULL);
+
+ k->event.ret_data[0] = revents & k->event.event;
+
+ return (revents & k->event.event);
+}
+
+static int __init kevent_poll_sys_init(void)
+{
+ struct kevent_callbacks pc = {
+ .callback = &kevent_poll_callback,
+ .enqueue = &kevent_poll_enqueue,
+ .dequeue = &kevent_poll_dequeue};
+
+ kevent_poll_container_cache = kmem_cache_create("kevent_poll_container_cache",
+ sizeof(struct kevent_poll_wait_container), 0, 0, NULL, NULL);
+ if (!kevent_poll_container_cache) {
+ printk(KERN_ERR "Failed to create kevent poll container cache.\n");
+ return -ENOMEM;
+ }
+
+ kevent_poll_priv_cache = kmem_cache_create("kevent_poll_priv_cache",
+ sizeof(struct kevent_poll_private), 0, 0, NULL, NULL);
+ if (!kevent_poll_priv_cache) {
+ printk(KERN_ERR "Failed to create kevent poll private data cache.\n");
+ kmem_cache_destroy(kevent_poll_container_cache);
+ kevent_poll_container_cache = NULL;
+ return -ENOMEM;
+ }
+
+ kevent_add_callbacks(&pc, KEVENT_POLL);
+
+ printk(KERN_INFO "Kevent poll()/select() subsystem has been initialized.\n");
+ return 0;
+}
+
+static struct lock_class_key kevent_poll_key;
+
+void kevent_poll_reinit(struct file *file)
+{
+ lockdep_set_class(&file->st.lock, &kevent_poll_key);
+}
+
+static void __exit kevent_poll_sys_fini(void)
+{
+ kmem_cache_destroy(kevent_poll_priv_cache);
+ kmem_cache_destroy(kevent_poll_container_cache);
+}
+
+module_init(kevent_poll_sys_init);
+module_exit(kevent_poll_sys_fini);

2006-08-27 21:04:13

by Ulrich Drepper

[permalink] [raw]
Subject: Re: [take14 0/3] kevent: Generic event handling mechanism.

[Sorry for the length, but I want to be clear.]

As promised, before Monday (at least my time), here are my thoughts on
the proposed kevent interfaces. Not necessarily well ordered:


- one point of critique which applied to many proposals over the years:
multiplexer syscalls are bad, really bad. They are more complicated
to use at userlevel and in the kernel. We've seen more than once that
unimplemented functions are not reported correctly with ENOSYS. Just
use individual syscalls. Adding them is cheap and probably overall
less expensive than the multiplexer.



Events to wait for are basically all those with syscalls which can
potentially block indefinitely:

- file descriptor
- POSIX message queues (these are in fact file descriptors but
let's make it legitimate)
- timer expiration
- signals (just as sigwait, not normal delivery instead of a handler)
- futexes (needs a lot more investigation)
- SysV message queues
- SysV semaphores
- bind socket operations (Alan brought this up in a different context)
- delays (nanosleep/clock_nanosleep, could be done using timers but the
overhead would likely be too high)
- process state change (waitpid, wait4, waitid etc)
- file locking (flock, lockf)
-

We might also want to think about

- msync/fsync: Today's wait/no-wait option doesn't allow us to work on
other things if the sync takes time and we need a real notification
(i.e., if no-wait cannot be used)


The reporting must of course provide the userlevel code with enough
information to identify the request. For submitting requests we need
such identification, too, so having unique identifiers for all the
different event types is necessary. To some extent this is what the
KEVENT_TIMER_FIRED, KEVENT_SOCKET_RECV, etc constants do. But they
should be more generic in their names since we need to use them also
when registering the event. I.e., KEVENT_EVENT_TIMER or so is more
appropriate.

Often (most of the time) this ID and the actual descriptor (file
descriptor, message queue descriptor, signal number, etc) are not
sufficient. In the POSIX API we therefore usually have a cookie value
which the userlevel code can provide and which is returned unchanged as
part of the notification. See the sigev_value member of struct
sigevent. I think this is the best approach: it is compact and it gives
all the flexibility needed. Userlevel code will store a value or more
often a pointer in the cookie and can then access additional information
based on the cookie.

I know there is a controversy around using pointer-sized values in
kernel structures which are exposed to userlevel. It should be possible
to work around this. We can simply always use 64-bit values and when
the data structure is exposed to 32-bit userland code only the first or
second 32-bit word of the structure is exposed with the name. The other
word is padding. If planned in from the beginning this should not cause
any problems at all.

Looking at the current struct mukevent, I don't think it is sufficient.
We need more room for the various types of events. And we shouldn't
prevent future innovative uses. I suggest to create records of a fixed
size with sufficient room. Maybe 32 bytes are sufficient but I'd leave
this open until the very end. Members of the structure must be
- ID of the type of event; type int
- descriptor (file descriptor, SysV msg descriptors etc); type int
- user-provided cookie; type uint64_t
That's only 16 bytes so far but we'll likely need more for some uses.
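
To illustrate, such a fixed-size record could look roughly like this
(names and the amount of padding are purely illustrative, not a proposed
ABI):

    struct kevent_ring_entry {
            int type;           /* ID of the type of event, e.g. KEVENT_EVENT_TIMER */
            int descriptor;     /* file descriptor, SysV msg descriptor etc */
            uint64_t cookie;    /* user-provided, returned unchanged */
            uint64_t pad[2];    /* spare room for future uses: 32 bytes total */
    };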


Next, the current interfaces once again fail to learn from a mistake we
made and which got corrected for the other interfaces. We need to be
able to change the signal mask around the delay atomically. Just like
we have ppoll for poll, pselect for select (and hopefully soon also
epoll_pwait for epoll_wait) we need to have this feature in the new
interfaces.


I read the description Nicholas Miell produced (the example programs
aren't available, accessing the URL fails for me) and looked over the
last patch (take 14).

The biggest problem I see so far is the integration into the existing
interfaces. kevent notification *really* should be usable as a new
sigevent type. Whether the POSIX interfaces are liked by kernel folks
or not, they are what the majority of the userlevel programmers use.
The mechanism is easily extensible. I've described this in my paper. I
cannot comment on the complexity of the kernel side but I'd imagine it's
not much more difficult, just different from what is implemented now.
Let's learn for a change from the mistakes of the past. The new and
innovative AIO interfaces never took off because their implementation
differs so much from the POSIX interfaces. People are interested in
portable code. So, please, let's introduce SIGEV_KEVENT. Then we
magically get timer notification etc for free.


The ring buffer interface is not described in Nicholas' description.
I'm looking at the sources and am a bit baffled. For instance, the
kevent_user_ring_add_event function simply adds an event without
determining whether this overwrites an undelivered entry. One single
index into the buffer isn't sufficient for this anyway. So let me ask
some questions:

- how is userlevel code supposed to locate events in the buffer? We
can maintain a separate pointer for the ring buffer (in a separate
location, which might actually be good for CPU cache reasons). But
this cannot solve all problems. E.g., if the read pointer is
initialized to zero (as is the write pointer), the ring buffer fits N
entries, if now N+1 entries arrive before the first event is handled
by the userlevel code, how does the userland code know that all ring
buffer entries are valid? Is the code supposed to always scan the
entire buffer?

- we need to signal the ring buffer overflow in some form to the
userlevel code. What proposals have been made for this? Signals
are the old and tried mechanism. I.e., one would be allowed to
associate a signal with each kevent descriptor and receive overflow
notifications this way. When rt signals are used we even can get
the kevent descriptor and possibly a user cookie delivered.
Something like this is needed in case such a kevent queue is used
in library code where we cannot rely on being the only user for an
event.

I must admit I haven't spent too much time thinking about the ideal ring
buffer interface. At OLS there were quite a few people (like Zach) who
said they did. So, let's solicit advice. I think the kernel AIO
interface can also provide some info on what not to do.


One aspect of the interface I did think about: the delay syscall. I
already mentioned the signal mask issue above. The interface already
has a timeout value (good!). But we need to specify the semantics in
quite some detail to avoid problems.

What I mean by that is the problem we are facing if there is more than
one thread waiting for events. If no event is available all threads use
the delay syscall. If now an event becomes available, what do we do?
Do we want exactly one thread? This is a problem. The thread might not
be working on the event after it gets woken (e.g., because the thread
gets canceled). The result is that there is an event available and no
other thread gets woken. This can be avoided by requiring that if a
thread, which got woken from a delay syscall, doesn't use the event, it
has to wake another thread. But how do we do this?

One possibility I could see is that the delay syscall returns the event
which caused the thread to be woken. This event is _not_ also reported
in the ring buffer. Then, if the thread does not use the event, it
simply requeues it. This will then implicitly wake another delayed thread.
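
In code, such a cancellation-safe consumer could look roughly like this
(kevent_wait_event being a hypothetical variant of the delay syscall
which hands back the waking event; kevent_submit is the submission
interface proposed below):

    struct kevent_event ev;

    if (kevent_wait_event (kfd, &ev, timeout, sigmask) == 0) {
            if (thread_canceled ())     /* cannot work on the event */
                    kevent_submit (kfd, &ev, 0, NULL);  /* requeue, waking another waiter */
            else
                    process_event (&ev);
    }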

Which brings me to the second point about the current kevent_get_events
syscall. I don't think the min_nr parameter is useful. Probably we
should not even allow the kevent queue to be used with different max_nr
parameters in different threads. If you'd allow this, how would the
event notification be handled? A waiter with a smaller required number
of events would always be woken first. I think the number of required
events should be a property of the kevent object. Then the code would
create a different kevent object if the requirement is different. At the
very least I'd declare it an error if at any time there are two or more
threads delayed which have different requirements on the number of
events. This could provide all the flexibility needed while preventing
some of the mistakes one can make.



In summary, I don't think we're at the point where the current
interfaces are usable. I'd like to see them redesigned and
reimplemented. The bad news is that I'll not be able to help with the
coding. The somewhat good news is that I can give some more
recommendations. In general I still think the text from my OLS paper
applies:


- one syscall to create a kevent queue. Using a special filesystem like
take 14 does is OK. But how do you pass parameters like the maximum
number of expected outstanding events? I think a dedicated syscall is
better. It also works more reliably since /proc might not yet be
mounted when the first user of the interface is started. The result
should be a file descriptor. At least an object which can be handled
like a file descriptor when it comes to transmitting it over Unix
domain sockets. Questions to answer: what happens if you use the
descriptor with any other interface but the kevent interfaces (I think
all such calls like dup, read, write, ... should fail).

int kevent_init (int num);


- one system call to create the userlevel ring buffer. Simply
overloading the mmap operation for the special kevent filesystem can
work so no separate syscall is needed in that case. We need to
nail down the semantics, though. What happens if more than one mmap
call is made? Does only the last one count? Does the second one
fail? Will mremap() work to increase/decrease the size? Will
mremap() be allowed to be called with MREMAP_MAYMOVE? What if mmap()
is called from different processes (in the POSIX sense, i.e., from
different address spaces)?

Either

mmap(...)

Or

int kevent_map_ringbuf (int kfd, size_t num)


- one interface to set additional parameters. This is likely mostly to
make the interfaces safe for the future. Perhaps the number of events
needed per delay call should be set this way.

int kevent_ctl (int kfd, int cmd, ...)


- one interface to shut the kevent down. This might be overkill. We
should be able to use munmap() and close(). If a real interface for
this would be created it should look like this

int kevent_destroy (int kfd, void *ringbuf, size_t num)

I find this rather more cumbersome. Just use close and munmap.


- one interface to submit requests.

int kevent_submit (int kfd, struct kevent_event *ev, int flags,
struct timespec *timeout)

Maybe the flags parameter isn't needed, it's just another way to make
sure we won't regret the design later. If the ring buffer can fill up
and this is detected by the kernel (unlike what happens in take 14)
then the calling thread could be delayed indefinitely. Maybe we even
have a deadlock if there is only one thread. If only a wait/no-wait
mode is needed, then use only a flags parameter and no timeout
parameter.

A special variant should be if ev == NULL the call is taken as a
request to wake one or more delayed threads.


- one interface to delay threads until the next event becomes available.
No data is transfered along with the call. The event data must be
read from the ring buffer:

int kevent_wait (int kfd, unsigned ringstate,
const struct timespec *timeout,
const sigset_t *sigmask)

Wait-mode can be implemented by recognizing timeout==NULL. no-wait
mode is implemented using timeout->tv_sec==timeout->tv_nsec==0. If
sigset_t is NULL the signal mask is not changed.

The ringstate parameter is also not present in the take 14 proposal.
Something like it is necessary to prevent the thread from going to
sleep while there are events in the ring buffer. It would be very
wasteful if the kernel would have to keep track of outstanding
events. This would also mean that handling events would require
a system call, exactly what the ring buffer approach should prevent.

I think the sequence for waiting for an event should be like this:

+ get current ring state
+ check whether any outstanding event in ring buffer
+ if yes, copy data out of ring buffer, mark ring buffer record
as unused (atomically).
+ if no, call kevent_wait with ring state value

When the kernel delivers a new event it does:

+ find place to store event
+ change ring state (might be a simple counter)

The kevent_wait implementation in the kernel would then as the first
thing determine whether the ring state changed. If yes, the syscall
returns immediately with -EWOULDBLOCK. Otherwise it is queued for
waiting.

With these steps and the requirement that all ring buffer entries are
processed FIFO we can
a) avoid syscalls for freeing ring buffer entries
b) detect overflows in the ring buffer
c) maintain the read pointer at userlevel while the kernel
maintains the write pointer into the buffer
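
Put together, the userlevel loop could look roughly like this
(kevent_wait as prototyped above; the ring helpers and the entry type are
the hypothetical ones sketched earlier):

    unsigned ringstate;
    struct kevent_ring_entry ev;

    for (;;) {
            ringstate = ring_read_state (ring);   /* get current ring state */
            if (ring_consume (ring, &ev)) {       /* atomically take one record */
                    process_event (&ev);
                    continue;
            }
            /* Returns -EWOULDBLOCK at once if the state changed meanwhile. */
            kevent_wait (kfd, ringstate, NULL, NULL);
    }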


--
➧ Ulrich Drepper ➧ Red Hat, Inc. ➧ 444 Castro St ➧ Mountain View, CA ❖


Attachments:
signature.asc (251.00 B)
OpenPGP digital signature

2006-08-28 01:57:43

by David Miller

[permalink] [raw]
Subject: Re: [take14 0/3] kevent: Generic event handling mechanism.

From: Ulrich Drepper <[email protected]>
Date: Sun, 27 Aug 2006 14:03:33 -0700

> The biggest problem I see so far is the integration into the existing
> interfaces. kevent notification *really* should be usable as a new
> sigevent type. Whether the POSIX interfaces are liked by kernel folks
> or not, they are what the majority of the userlevel programmers use.
> The mechanism is easily extensible. I've described this in my paper. I
> cannot comment on the complexity of the kernel side but I'd imagine it's
> not much more difficult, just different from what is implemented now.
> Let's learn for a change from the mistakes of the past. The new and
> innovative AIO interfaces never took off because their implementation
> differs so much from the POSIX interfaces. People are interested in
> portable code. So, please, let's introduce SIGEV_KEVENT. Then we
> magically get timer notification etc for free.

I have to disagree with this.

SigEvent, and signals in general, are crap. They are complex
and userland gets it wrong more often than not. Interfaces
for userland should be simple, signals are not simple. A core
loop that says "give me events to process", on the other hand,
is. And this is what is most natural for userspace.

The user can say when he wants to process events. In fact,
ripping out the complex signal handling will be a welcome
change for most server applications.

We are going to require the use of a new interface to register
the events anyways, why keep holding onto the delivery baggage
as well when we can break free of those limitations?

2006-08-28 02:12:18

by Ulrich Drepper

[permalink] [raw]
Subject: Re: [take14 0/3] kevent: Generic event handling mechanism.

David Miller wrote:
> SigEvent, and signals in general, are crap. They are complex
> and userland gets it wrong more often than not. Interfaces
> for userland should be simple, signals are not simple.

You miss the point.

sigevent has nothing necessarily to do with signals. I don't want
signals. I just want the same interface to specify the action to be used.

If I'm using

struct sigevent sigev;
int kfd;

kfd = kevent_create (...);

sigev.sigev_notify = SIGEV_KEVENT;
sigev.sigev_kfd = kfd;
sigev.sigev_value.sival_ptr = &some_data;


then I can use this sigev variable in an unmodified timer_create call.
The kernel would see SIGEV_KEVENT (as opposed to SIGEV_SIGNAL etc) and
**not** generate a signal but instead create the event in the kevent queue.
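
The snippet would then continue with an unmodified timer_create call,
e.g. (SIGEV_KEVENT and sigev_kfd being the proposed, not yet existing,
extensions):

    timer_t timerid;

    /* Standard POSIX call; with SIGEV_KEVENT the kernel would queue
       expirations to the kevent queue instead of raising a signal.  */
    timer_create (CLOCK_REALTIME, &sigev, &timerid);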


The proposal to use sigevent has nothing to do with signals. It's just
about the interface and to have smooth integration with existing
functionality.

--
➧ Ulrich Drepper ➧ Red Hat, Inc. ➧ 444 Castro St ➧ Mountain View, CA ❖


Attachments:
signature.asc (251.00 B)
OpenPGP digital signature

2006-08-28 02:40:53

by Nicholas Miell

[permalink] [raw]
Subject: Re: [take14 0/3] kevent: Generic event handling mechanism.

On Sun, 2006-08-27 at 18:57 -0700, David Miller wrote:
> From: Ulrich Drepper <[email protected]>
> Date: Sun, 27 Aug 2006 14:03:33 -0700
>
> > The biggest problem I see so far is the integration into the existing
> > interfaces. kevent notification *really* should be usable as a new
> > sigevent type. Whether the POSIX interfaces are liked by kernel folks
> > or not, they are what the majority of the userlevel programmers use.
> > The mechanism is easily extensible. I've described this in my paper. I
> > cannot comment on the complexity of the kernel side but I'd imagine it's
> > not much more difficult, just different from what is implemented now.
> > Let's learn for a change from the mistakes of the past. The new and
> > innovative AIO interfaces never took off because their implementation
> > differs so much from the POSIX interfaces. People are interested in
> > portable code. So, please, let's introduce SIGEV_KEVENT. Then we
> > magically get timer notification etc for free.
>
> I have to disagree with this.
>
> SigEvent, and signals in general, are crap. They are complex
> and userland gets it wrong more often than not. Interfaces
> for userland should be simple, signals are not simple. A core
> loop that says "give me events to process", on the other hand,
> is. And this is what is most natural for userspace.
>
> The user can say when he wants to process events. In fact,
> ripping out the complex signal handling will be a welcome
> change for most server applications.
>
> We are going to require the use of a new interface to register
> the events anyways, why keep holding onto the delivery baggage
> as well when we can break free of those limitations?

struct sigevent is the POSIX method for describing how event
notifications are delivered.

Two methods are specified in POSIX -- SIGEV_SIGNAL, which delivers a
signal to the process and SIGEV_THREAD which creates a new thread in the
process and calls a user-supplied function. In addition to these two
methods, Linux also implements SIGEV_THREAD_ID, which sends a signal to
a specific thread (this is used internally by glibc to implement
SIGEV_THREAD, but I imagine that would change on the addition of
SIGEV_KEVENT).

Ulrich is suggesting the addition of SIGEV_KEVENT, which causes the
event notification to be delivered to a specific kevent queue. This
would allow for event delivery to kevent queues from POSIX AIO
completions, POSIX message queues, POSIX timers, glibc's async name
resolution interface and anything else that might use a struct sigevent
in the future.

--
Nicholas Miell <[email protected]>

2006-08-28 02:59:57

by Nicholas Miell

[permalink] [raw]
Subject: Re: [take14 0/3] kevent: Generic event handling mechanism.

On Sun, 2006-08-27 at 14:03 -0700, Ulrich Drepper wrote:

[ note: there was lots of good stuff that I cut out because it was a
long email and I'm only replying to some of its points ]

> Events to wait for are basically all those with syscalls which can
> potentially block indefinitely:
>
> - file descriptor
> - POSIX message queues (these are in fact file descriptors but
> let's make it legitimate)
> - timer expiration
> - signals (just as sigwait, not normal delivery instead of a handler)

For some of them (like SIGTERM), delivery to a kevent queue would
actually make sense.

> The ring buffer interface is not described in Nicholas' description.

I wasn't even aware there was a ring-buffer interface in the proposed
patches. Another reason why the onus of documenting a patch is on the
originator: the random nobody who ends up doing the documenting may
screw it up.

> Which brings me to the second point about the current kevent_get_events
> syscall. I don't think the min_nr parameter is useful. Probably we
> should not even allow the kevent queue to be used with different max_nr
> parameters in different threads. If you'd allow this, how would the
> event notification be handled? A waiter with a smaller required number
> of events would always be woken first. I think the number of required
> events should be a property of the kevent object. Then the code would
> create a different kevent object if the requirement is different. At the
> very least I'd declare it an error if at any time there are two or more
> threads delayed which have different requirements on the number of
> events. This could provide all the flexibility needed while preventing
> some of the mistakes one can make.

I was thinking about this, and it's even worse in the case where a
kevent fd is shared by different processes (either by forking or by
passing it via PF_UNIX sockets).

What happens when you queue an AIO completion to a shared kevent queue?
(The AIO read only happened in one address space, or did it? What if the
read was to a shared memory region? What if the memory region is shared,
but mapped at different addresses? What if not all of the processes
involved have that AIO fd open?)

Also complicated is the case where waiting threads have different
priorities, different timeouts, and different minimum event counts --
how do you decide which thread gets events first? What if the decisions
are different depending on whether you want to maximize throughput or
interactivity?

--
Nicholas Miell <[email protected]>

2006-08-28 11:47:52

by Jari Sundell

[permalink] [raw]
Subject: Re: [take14 0/3] kevent: Generic event handling mechanism.

On 8/28/06, Nicholas Miell <[email protected]> wrote:
> Also complicated is the case where waiting threads have different
> priorities, different timeouts, and different minimum event counts --
> how do you decide which thread gets events first? What if the decisions
> are different depending on whether you want to maximize throughput or
> interactivity?

BTW, what is the intended use of the min event count parameter? The
obvious reason I can see, avoiding waking up a thread too often with
few queued events, would imo be handled cleaner by just passing a
parameter telling the kernel to try to queue more events.

With a min event count you'd have to use a rather low timeout to
ensure that events get handled within a resonable time.

Rakshasa

2006-08-31 08:02:36

by Evgeniy Polyakov

[permalink] [raw]
Subject: Re: [take14 0/3] kevent: Generic event handling mechanism.

Hello.

Sorry for the long delay - I was on a small vacation.

On Sun, Aug 27, 2006 at 02:03:33PM -0700, Ulrich Drepper ([email protected]) wrote:
> [Sorry for the length, but I want to be clear.]
>
> As promised, before Monday (at least my time), here are my thoughts on
> the proposed kevent interfaces. Not necessarily well ordered:
>
>
> - one point of critique which applied to many proposals over the years:
> multiplexer syscalls are bad, really bad. They are more complicated
> to use at userlevel and in the kernel. We've seen more than once that
> unimplemented functions are not reported correctly with ENOSYS. Just
> use individual syscalls. Adding them is cheap and probably overall
> less expensive than the multiplexer.

Can you convince Christoph?
I do not care about interfaces, but until several people agree on it, I
will not change anything.

> Events to wait for are basically all those with syscalls which can
> potentially block indefinitely:
>
> - file descriptor
> - POSIX message queues (these are in fact file descriptors but
> let's make it legitimate)
> - timer expiration
> - signals (just as sigwait, not normal delivery instead of a handler)
> - futexes (needs a lot more investigation)
> - SysV message queues
> - SysV semaphores
> - bind socket operations (Alan brought this up in a different context)
> - delays (nanosleep/clock_nanosleep, could be done using timers but the
> overhead would likely be too high)
> - process state change (waitpid, wait4, waitid etc)
> - file locking (flock, lockf)
> -

You completely miss AIO here (I am not talking about POSIX AIO).

> We might also want to think about
>
> - msync/fsync: Today's wait/no-wait option doesn't allow us to work on
> other things if the sync takes time and we need a real notification
> (i.e., if no-wait cannot be used)
>
>
> The reporting must of course provide the userlevel code with enough
> information to identify the request. For submitting requests we need
> such identification, too, so having unique identifiers for all the
> different event types is necessary. To some extent this is what the
> KEVENT_TIMER_FIRED, KEVENT_SOCKET_RECV, etc constants do. But they
> should be more generic in their names since we need to use them also
> when registering the event. I.e., KEVENT_EVENT_TIMER or so is more
> appropriate.

There are such identifiers.
We have _two_ levels of id:
- the event type (KEVENT_EVENT_TIMER,
KEVENT_EVENT_POLL, KEVENT_EVENT_AIO and so on, but they are called
without _EVENT_ inside), which is the type of origin for given events
- the events themselves - timer fired, data received, client accepted
and so on.

> Often (most of the time) this ID and the actual descriptor (file
> descriptor, message queue descriptor, signal number, etc) are not
> sufficient. In the POSIX API we therefore usually have a cookie value
> which the userlevel code can provide and which is returned unchanged as
> part of the notification. See the sigev_value member of struct
> sigevent. I think this is the best approach: it is compact and it gives
> all the flexibility needed. Userlevel code will store a value or more
> often a pointer in the cookie and can then access additional information
> based on the cookie.

kevents have such "cookies".

> I know there is a controversy around using pointer-sized values in
> kernel structures which are exposed to userlevel. It should be possible
> to work around this. We can simply always use 64-bit values and when
> the data structure is exposed to 32-bit userland code only the first or
> second 32-bit word of the structure is exposed with the name. The other
> word is padding. If planned in from the beginning this should not cause
> any problems at all.

I use a union of two 32bit values and a pointer to simplify userspace.
It was planned and implemented already.

> Looking at the current struct mukevent, I don't think it is sufficient.
> We need more room for the various types of events. And we shouldn't
> prevent future innovative uses. I suggest to create records of a fixed
> size with sufficient room. Maybe 32 bytes are sufficient but I'd leave
> this open until the very end. Members of the structure must be
> - ID of the type of event; type int
> - descriptor (file descriptor, SysV msg descriptors etc); type int
> - user-provided cookie; type uint64_t
> That's only 16 bytes so far but we'll likely need more for some uses.

I only use the id provided by the user there; it is not his cookie, but
it was done to make the structure as small as possible.
Think about the size of the mapped buffer when there are several kevent
queues - it is all mapped and thus pinned memory.
It can of course be extended.

> Next, the current interfaces once again fail to learn from a mistake we
> made and which got corrected for the other interfaces. We need to be
> able to change the signal mask around the delay atomically. Just like
> we have ppoll for poll, pselect for select (and hopefully soon also
> epoll_pwait for epoll_wait) we need to have this feature in the new
> interfaces.

We are able to change kevents atomically.

> I read the description Nicholas Miell produced (the example programs
> aren't available, accessing the URL fails for me) and looked over the
> last patch (take 14).
>
> The biggest problem I see so far is the integration into the existing
> interfaces. kevent notification *really* should be usable as a new
> sigevent type. Whether the POSIX interfaces are liked by kernel folks
> or not, they are what the majority of the userlevel programmers use.
> The mechanism is easily extensible. I've described this in my paper. I
> cannot comment on the complexity of the kernel side but I'd imagine it's
> not much more difficult, just different from what is implemented now.
> Let's learn for a change from the mistakes of the past. The new and
> innovative AIO interfaces never took off because their implementation
> differs so much from the POSIX interfaces. People are interested in
> portable code. So, please, let's introduce SIGEV_KEVENT. Then we
> magically get timer notification etc for free.

Well, I rarely talk about what other people want, but if you strongly
feel that all posix crap is better than the epoll interface, then I
cannot agree with you.

It is possible to create an additional one using any POSIX API you like,
but I strongly insist on having the possibility to use a lightweight
syscall interface too.

> The ring buffer interface is not described in Nicholas' description.
> I'm looking at the sources and am a bit baffled. For instance, the
> kevent_user_ring_add_event function simply adds an event without
> determining whether this overwrites an undelivered entry. One single
> index into the buffer isn't sufficient for this anyway. So let me ask
> some questions:
>
> - how is userlevel code supposed to locate events in the buffer? We
> can maintain a separate pointer for the ring buffer (in a separate
> location, which might actually be good for CPU cache reasons). But
> this cannot solve all problems. E.g., if the read pointer is
> initialized to zero (as is the write pointer), the ring buffer fits N
> entries, if now N+1 entries arrive before the first event is handled
> by the userlevel code, how does the userland code know that all ring
> buffer entries are valid? Is the code supposed to always scan the
> entire buffer?

The ring buffer _always_ has space for new events as long as the queue is
not full. So if userspace does not read its events for too long and
eventually tries to add a new one, it will fail early.

> - we need to signal the ring buffer overflow in some form to the
> userlevel code. What proposals have been made for this? Signals
> are the old and tried mechanism. I.e., one would be allowed to
> associate a signal with each kevent descriptor and receive overflow
> notifications this way. When rt signals are used we even can get
> the kevent descriptor and possibly a user cookie delivered.
> Something like this is needed in case such a kevent queue is used
> in library code where we cannot rely on being the only user for an
> event.

There is no overflow - I do not want to introduce another signal queue
overflow crap here.
And once again - no signals.

> I must admit I haven't spent too much time thinking about the ideal ring
> buffer interface. At OLS there were quite a few people (like Zach) who
> said they did. So, let's solicit advice. I think the kernel AIO
> interface can also provide some info on what not to do.

Sure, I would like to see a different design if it is ready.

> One aspect of the interface I did think about: the delay syscall. I
> already mentioned the signal mask issue above. The interface already
> has a timeout value (good!). But we need to specify the semantics in
> quite some detail to avoid problems.
>
> What I mean by that is the problem we are facing if there is more than
> one thread waiting for events. If no event is available all threads use
> the delay syscall. If now an event becomes available, what do we do?
> Do we want exactly one thread? This is a problem. The thread might not
> be working on the event after it gets woken (e.g., because the thread
> gets canceled). The result is that there is an event available and no
> other thread gets woken. This can be avoided by requiring that if a
> thread, which got woken from a delay syscall, doesn't use the event, it
> has to wake another thread. But how do we do this?

I can reformulate your words in a different manner. Please correct me if
I'm wrong.

You basically want to deliver the same event to several users.
But how do you want to achieve it with network buffers, for example?
When several threads read from the same socket, they do not obtain the
same data.
So I disagree that we need to deliver some events to several threads. If
you need to wake up several of them when one network socket is ready (I
seriously doubt you do), create a per-thread kevent queue and put that
socket into each of them.
If you want to wake several threads on a timeout - create a per-thread
queue and put an event there.

The simpler the interface is, the fewer problems we will catch when some
tricky configuration is used.

> One possibility I could see is that the delay syscall returns the event
> which caused the thread to be woken. This event is _not_ also reported
> in the ring buffer. Then, if the thread does not use the event, it
> simply requeues it. This will then implicitly wake another delayed thread.
>
> Which brings me to the second point about the current kevent_get_events
> syscall. I don't think the min_nr parameter is useful. Probably we
> should not even allow the kevent queue to be used with different max_nr
> parameters in different threads. If you'd allow this, how would the
> event notification be handled? A waiter with a smaller required number
> of events would always be woken first. I think the number of required
> events should be a property of the kevent object. Then the code would
> create a different kevent object if the requirement is different. At the
> very least I'd declare it an error if at any time there are two or more
> threads delayed which have different requirements on the number of
> events. This could provide all the flexibility needed while preventing
> some of the mistakes one can make.

min_nr is used to specify the special case "wake up when at least one event
is ready and get all ready ones".

> In summary, I don't think we're at the point where the current
> interfaces are usable. I'd like to see them redesigned and
> reimplemented. The bad news is that I'll not be able to help with the
> coding. The somewhat good news is that I can given some more
> recommendations. In general I still think the text from my OLS paper
> applies:

I can do it.
But I will not, until other core developers ack your proposals.
As I described above I disagree with most of them.

> - one syscall to create a kevent queue. Using a special filesystem like
> take 14 does is OK. But how do you pass parameters like the maximum
> number of expected outstanding events? I think a dedicated syscall is
> better. It also works more reliably since /proc might not yet be

There are no "expected outstanding events"; I think that can be a problem.
Currently there is an absolute maximum of events, which cannot be
increased at runtime.

> mounted when the first user of the interface is started. The result
> should be a file descriptor. At least an object which can be handled
> like a file descriptor when it comes to transmitting it over Unix
> domain sockets. Questions to answer: what happens if you use the
> descriptor with any other interface but the kevent interfaces (I think
> all such calls like dup, read, write, ... should fail).
>
> int kevent_init (int num);

Kevent always provides a file descriptor (which is "poll"able) as the
result of either opening a special file (as in the latest patchset) or
using a special filesystem (which was removed by Christoph).

> - one system call to create the userlevel ring buffer. Simply
> overloading the mmap operation for the special kevent filesystem can
> work so no separate syscall is needed in that case. We need to
> nail down the semantics, though. What happens if more than one mmap
> call is made? Does only the last one count? Does the second one
> fail? Will mremap() work to increase/decrease the size? Will
> mremap() be allowed to be called with MREMAP_MAYMOVE? What if mmap()
> is called from different processes (in the POSIX sense, i.e., from
> different address spaces)?
>
> Either
>
> mmap(...)
>
> Or
>
> int kevent_map_ringbuf (int kfd, size_t num)

Each subsequent mmap will map the existing buffers; the first mmap
creates that buffer.

> - one interface to set additional parameters. This is likely mostly to
> make the interfaces safe for the future. Perhaps the number of events
> needed per delay call should be set this way.
>
> int kevent_ctl (int kfd, int cmd, ...)
>
>
> - one interface to shut the kevent down. This might be overkill. We
> should be able to use munmap() and close(). If a real interface for
> this would be created it should look like this
>
> int kevent_destroy (int kfd, void *ringbuf, size_t num)
>
> I find this rather more cumbersome. Just use close and munmap.
>
>
> - one interface to submit requests.
>
> int kevent_submit (int kfd, struct kevent_event *ev, int flags,
> struct timespec *timeout)
>
> Maybe the flags parameter isn't needed, it's just another way to make
> sure we won't regret the design later. If the ring buffer can fill up
> and this is detected by the kernel (unlike what happens in take 14)

Just to repeat - with the current buffer implementation it cannot happen -
the maximum queue length is the limit for the buffer size.

> then the calling thread could be delayed indefinitely. Maybe we even
> have a deadlock if there is only one thread. If only a wait/no-wait
> mode is needed, then use only a flags parameter and no timeout
> parameter.
>
> A special variant should be if ev == NULL the call is taken as a
> request to wake one or more delayed threads.

Well, you propose three different syscalls for three operations. I use
one with a multiplexer. I do not have a strong opinion on how it must be
done, but I created a policy for such changes - until other developers
ack such changes, nothing will be done.

> - one interface to delay threads until the next event becomes available.
> No data is transfered along with the call. The event data must be
> read from the ring buffer:
>
> int kevent_wait (int kfd, unsigned ringstate,
> const struct timespec *timeout,
> const sigset_t *sigmask)

Yes, I agree, this is a good syscall.
Except for signals (no signals, that's the rule) and the variable sized
timespec structure. What about putting a u64 number of nanoseconds there?

> Wait-mode can be implemented by recognizing timeout==NULL. no-wait
> mode is implemented using timeout->tv_sec==timeout->tv_nsec==0. If
> sigset_t is NULL the signal mask is not changed.
>
> The ringstate parameter is also not present in the take 14 proposal.
> Something like it is necessary to prevent the thread from going to
> sleep while there are events in the ring buffer. It would be very
> wasteful if the kernel would have to keep track of outstanding
> events. This would also mean that handling events would require
> a system call, exactly what the ring buffer approach should prevent.

It is possible to put the number of the last "acked" kevent there, so the
kernel will remove all events which were placed into the buffer up to and
including that one.

> I think the sequence for waiting for an event should be like this:
>
> + get current ring state
> + check whether any outstanding event in ring buffer
> + if yes, copy data out of ring buffer, mark ring buffer record
> as unused (atomically).
> + if no, call kevent_wait with ring state value
>
> When the kernel delivers a new event it does:
>
> + find place to store event
> + change ring state (might be a simple counter)

What about the following:
userspace:
- check the ring index; if it differs from the one stored in userspace,
there are events between the old stored index and the new one just read.
- copy the events
- call kevent_wait() or another method to show the kernel that all events
up to the number provided in the syscall are processed, so the kernel can
remove them and put new ones there.

kernelspace:
- when a new kevent is added, it is guaranteed that there is a place for
it in the kernel ring buffer
- when an event is ready it is copied into the mapped buffer and the index
of the "last ready" one is increased (a fully atomic operation)
- when userspace calls kevent_wait() the kernel gets the ring index from
the syscall, searches for all events up to the provided number and frees
them (or rearms them)


Except for the kevent_wait() implementation, this is how it is
implemented right now.
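
In rough code, assuming a kernel-maintained "last ready" index in the
mapped buffer (all names here are illustrative):

    /* userspace */
    while (my_idx != ring->ready_idx) {     /* the kernel advanced the index */
            process_event (&ring->event[my_idx % ring->size]);
            my_idx++;
    }
    kevent_wait (kfd, my_idx, timeout);     /* "everything up to my_idx is done" */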

> The kevent_wait implementation in the kernel would then as the first
> thing determine whether the ring state changed. If yes, the syscall
> returns immediately with -EWOULDBLOCK. Otherwise it is queued for
> waiting.
>
> With these steps and the requirement that all ring buffer entries are
> processed FIFO we can
> a) avoid syscalls for freeing ring buffer entries
> b) detect overflows in the ring buffer
> c) maintain the read pointer at userlevel while the kernel
> maintains the write pointer into the buffer

As shown above it is already implemented.

> --
> ➧ Ulrich Drepper ➧ Red Hat, Inc. ➧ 444 Castro St ➧ Mountain View, CA ❖
>



--
Evgeniy Polyakov

2006-09-09 16:10:40

by Ulrich Drepper

[permalink] [raw]
Subject: Re: [take14 0/3] kevent: Generic event handling mechanism.

On 8/31/06, Evgeniy Polyakov <[email protected]> wrote:
> Sorry for the long delay - I was on a small vacation.

No vacation here, but travel nonetheless.

> > - one point of critique which applied to many proposals over the years:
> > multiplexer syscalls are bad, really bad. [...]
>
> Can you convince Christoph?
> I do not care about interfaces, but until several people agree on it, I
> will not change anything.

I hope that Linus and/or Andrew simply decree that multiplexers are
bad. glibc and probably strace are the two most affected programs so
their maintainers should have a say. My opinion is clear. Also for
analysis tools the multiplexers are bad since different numbers of
parameters are used and maybe even with different types.


> You completely miss AIO here (I am not talking about POSIX AIO).

Sure, I should have mentioned it. But I was assuming this all along.


> I only use the id provided by the user there; it is not his cookie, but
> it was done to make the structure as small as possible.
> Think about the size of the mapped buffer when there are several kevent
> queues - it is all mapped and thus pinned memory.
> It can of course be extended.

"It" being what? The problem is that the structure of the ring buffer
elements cannot easily be changed later. So we have to get it right
now which means being a bit pessimistic about future requirements.
Add padding, there will certainly be future uses which need more
space.


> > Next, the current interfaces once again fail to learn from a mistake we
> > made and which got corrected for the other interfaces. We need to be
> > able to change the signal mask around the delay atomically. Just like
> > we have ppoll for poll, pselect for select (and hopefully soon also
> > epoll_pwait for epoll_wait) we need to have this feature in the new
> > interfaces.
>
> We are able to change kevents atomically.

I don't understand. Or you don't understand. I was talking about
changing the signal mask atomically around the wait call. I.e., the
call needs an additional optional parameter specifying the signal mask
to use (for the kernel: two parameters, pointer and length). This
parameter is not available in the version of the patch I looked at and
should be added if it's still missing in the latest version of the
patch. Again, look at the difference between poll() and ppoll() and
do the same.


> Well, I rarely talk about what other people want, but if you strongly
> feel that all posix crap is better than the epoll interface, then I
> cannot agree with you.

You miss the point entirely like DaveM before you. What I ask for is
simply a uniform and well established form to tell an interface to use
the kevent notification mechanism and not use signals etc. Look at
the mail I sent in reply to DaveM's mail.


> It is possible to create an additional one using any POSIX API you like,
> but I strongly insist on having the possibility to use a lightweight
> syscall interface too.

Again, missing the point. We can without any significant change
enable POSIX interfaces and GNU extensions like the timer, AIO, the
async DNS code, etc to use kevents. For the latter, which is entirely
implemented at userlevel, we need interfaces to queue kevents from
userlevel. I think this is already supported. The other two
definitely benefit from using kevent notification and since they
are/will be handled in the kernel the completion events should be
queued in a kevent queue as specified in the sigevent structure passed
to the system call.


> The ring buffer _always_ has space for new events as long as the queue
> is not full. So if userspace does not read its events for too long and
> eventually tries to add a new one, it will fail early.

Sorry, I don't understand this at all.

If the ring buffer always has enough room then events must be
preregistered. Is this the case? Seems very inflexible and how would
this work with event sources like timers which can trigger many times?

I hope you don't mean that ring buffers probably won't overflow since
programs have to handle events fast enough. That's not acceptable.


> There is no overflow - I do not want to introduce another signal queue
> overflow crap here.
> And once again - no signals.

Well, signals are the only asynchronous notification mechanism we
have. But more to the point: why cannot there be overflows?


> You basically want to deliver the same event to several users.
> But how do you want to achieve it with network buffers, for example?
> When several threads read from the same socket, they do not obtain the
> same data.

That's not what I am after. I'm perfectly fine with waking only one
thread. In fact, this is how it must be to avoid the thundering herd
effect. But there is the problem that if the woken thread is not
working on the issue for which it was woken (e.g., if the thread got
canceled) then it must be able to wake another thread. In effect,
there should be a syscall which causes a given number of other waiters
(make the number a parameter to the syscall) to be woken. They would
start running and if nothing is to be done go back to sleep. The
wakeup interface is what is needed.


> min_nr is used to specify the special case "wake up when at least one event
> is ready and get all ready ones".

I understand but when is this really necessary? The nature of the
event queue will find many different types of events being reported
via them. In such a situation a minimum count is not really useful.
I would argue this is unnecessary complexity which can easily and more
flexibly be handled at userlevel.

> There are no "expected outstanding events"; I think that can be a problem.
> Currently there is an absolute maximum of events, which cannot be
> increased at runtime.

That is a problem. If we succeed in having a unified event mechanism
the number of outstanding events can be unbounded, only limited by the
system's capabilities.


> Each subsequent mmap will map the existing buffers; the first mmap
> creates that buffer.

OK, so you have magic in mmap() calls using the kevent file
descriptor? Seems OK but I will not export this as the interface
glibc exports. All this should be abstracted out.


> > Maybe the flags parameter isn't needed, it's just another way to make
> > sure we won't regret the design later. If the ring buffer can fill up
> > and this is detected by the kernel (unlike what happens in take 14)
>
> Just to repeat - with the current buffer implementation it cannot happen -
> the maximum queue length is the limit for the buffer size.

How can the buffer not fill up? Where is the information stored in
case the userlevel code did not process the ring buffer entries in
time?


> > int kevent_wait (int kfd, unsigned ringstate,
> > const struct timespec *timeout,
> > const sigset_t *sigmask)
>
> Yes, I agree, this is a good syscall.
> Except for signals (no signals, that's the rule) and the variable sized
> timespec structure. What about putting a u64 number of nanoseconds there?

Well, I've explained it already above and repeated during the
pselect/ppoll discussions. The sigmask parameter is not in any way
a signal that events should be sent using signals. It is simply a way
to set the signal mask atomically around the delay to some other
value. This is functionality which cannot be implemented at
userlevel. Hence we now have pselect and ppoll system call. The
kevent_wait syscall will need the same.


> What about the following:
> userspace:
> - check the ring index; if it differs from the one stored in userspace,
> there are events between the old stored index and the new one just read.
> - copy the events
> - call kevent_wait() or another method to show the kernel that all events
> up to the number provided in the syscall are processed, so the kernel can
> remove them and put new ones there.

This would require a system call to free ring buffer entries. And
delaying the ack of an event (to avoid syscall overhead) means that
the ring buffer might fill up.

Having userlevel-writable fields which indicate whether an entry in
the ring buffer is free would help to prevent these syscalls and allow
freeing up the entries. These fields could be in the form of a bitmap
outside the actual ring buffer.

If a ring buffer is not wanted, then a simple writable buffer index
should be used. This will require that all entries in the ring buffer
are processed in sequence but I don't consider this too much of a
limitation. The kernel only ever reads this buffer index field.
Instead of making this field part of the mapping (which could be
read-only) the field index position could be passed to the kernel in
the syscall to create a kevent queue.
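
A minimal sketch of that variant, with the kernel only ever reading the
consumer's index (layout and names purely illustrative):

    struct kevent_ring {
            volatile unsigned write_idx;    /* advanced by the kernel only */
            volatile unsigned read_idx;     /* advanced by userlevel only */
            struct mukevent entry[RING_SIZE];
    };

    /* Consumer: process entries strictly in sequence, then publish progress. */
    while (ring->read_idx != ring->write_idx) {
            process_event (&ring->entry[ring->read_idx % RING_SIZE]);
            ring->read_idx++;               /* the kernel only reads this field */
    }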


> kernelspace:
> - when a new kevent is added, it is guaranteed that there is a place
> for it in the kernel ring buffer

How? Unless you severely want to limit the usefulness of kevents this
is not possible. One example, already given above, are periodic
timers.


> - when an event is ready it is copied into the mapped buffer and the
> index of the "last ready" one is increased (a fully atomic operation)
> - when userspace calls kevent_wait() the kernel gets the ring index from
> the syscall, searches for all events up to the provided number and frees
> them (or rearms them)

Yes, that's OK. But in the fast path no kevent_wait syscall should be
needed. If the index variable is exposed in the memory region
containing the ring buffer no syscall is needed in case the ring
buffer is not empty.

> As shown above it is already implemented.

How can you say that? Just before you said the kevent_wait syscall is
not implemented. This paragraph was all about how to use kevent_wait.
I'll have to look at the latest code to see how the _wait syscall is
now implemented.

2006-09-11 05:43:07

by Evgeniy Polyakov

[permalink] [raw]
Subject: Re: [take14 0/3] kevent: Generic event handling mechanism.

On Sat, Sep 09, 2006 at 09:10:35AM -0700, Ulrich Drepper ([email protected]) wrote:
> >> - one point of critique which applied to many proposals over the years:
> >> multiplexer syscalls are bad, really bad. [...]
> >
> >Can you convince Christoph?
> >I do not care about interfaces, but until several people agree on it, I
> >will not change anything.
>
> I hope that Linus and/or Andrew simply decree that multiplexers are
> bad. glibc and probably strace are the two most affected programs so
> their maintainers should have a say. My opinion is clear. Also for
> analysis tools the multiplexers are bad since different numbers of
> parameters are used and maybe even with different types.

The types are exactly the same; actually the whole set of operations
multiplexed in kevents is add/remove/modify. They really look and work
very similarly, so it is not that bad to multiplex them in one syscall.
But yes, we can extend it to 3 differently named ones, which will just
end up wasting space in the syscall tables.

> >I use there only id provided by user, it is not his cookie, but it was
> >done to make strucutre as small as possible.
> >Think about size of the mapped buffer when there are several kevent
> >queues - it is all mapped and thus pinned memory.
> >It of course can be extended.
>
> "It" being what? The problem is that the structure of the ring buffer
> elements cannot easily be changed later. So we have to get it right
> now which means being a bit pessimistic about future requirements.
> Add padding, there will certainly be future uses which need more
> space.

"It" was/is a whole situation about mmaped buffer - we can extend it, no
problem, what fields you think needs to be added?

> >> Next, the current interfaces once again fail to learn from a mistake we
> >> made and which got corrected for the other interfaces. We need to be
> >> able to change the signal mask around the delay atomically. Just like
> >> we have ppoll for poll, pselect for select (and hopefully soon also
> >> epoll_pwait for epoll_wait) we need to have this feature in the new
> >> interfaces.
> >
> >We are able to change kevents atomically.
>
> I don't understand. Or you don't understand. I was talking about
> changing the signal mask atomically around the wait call. I.e., the
> call needs an additional optional parameter specifying the signal mask
> to use (for the kernel: two parameters, pointer and length). This
> parameter is not available in the version of the patch I looked at and
> should be added if it's still missing in the latest version of the
> patch. Again, look at the difference between poll() and ppoll() and
> do the same.

You meant "atomically" with respect to signals; I meant atomic with
respect to simultaneous access.
Looking into ppoll() I wonder what the difference is compared to doing
the same in userspace? There are no special locks, nothing special except
the TIF_RESTORE_SIGMASK bit being set, so what's the point of it not
being done in userspace?

> >Well, I rarely talk about what other people want, but if you strongly
> >feel that all posix crap is better than the epoll interface, then I
> >cannot agree with you.
>
> You miss the point entirely like DaveM before you. What I ask for is
> simply a uniform and well established form to tell an interface to use
> the kevent notification mechanism and not use signals etc. Look at
> the mail I sent in reply to DaveM's mail.

There is a special function in kevents which is used for kevent addition
and which can be called from anywhere (except modules, since it is not
exported right now), so one can create _any_ interface he likes.
A POSIX timer-like API is not what a lot of people want, since
epoll/poll/select is a completely different thing and exactly _that_ is
what the majority of people use. So I created a similar interface.
But there is no problem implementing any additional one; it is simple.

> >It is possible to create an additional one using any POSIX API you like,
> >but I strongly insist on having the possibility to use a lightweight
> >syscall interface too.
>
> Again, missing the point. We can without any significant change
> enable POSIX interfaces and GNU extensions like the timer, AIO, the
> async DNS code, etc to use kevents. For the latter, which is entirely
> implemented at userlevel, we need interfaces to queue kevents from
> userlevel. I think this is already supported. The other two
> definitely benefit from using kevent notification and since they
> are/will be handled in the kernel the completion events should be
> queued in a kevent queue as specified in the sigevent structure passed
> to the system call.

I do not object to additional interfaces, no problem, the
implementation is really simple. But I strongly object to removing the
existing interface; it is not there as furniture, but because it is
the most convenient way (in my opinion) to use the existing event
notifications (supported by kevent). If we need additional interfaces, it
is really simple to add them: just use kevent_user_add_ukevent(), which
takes a struct ukevent, which describes the requested notification, and a
struct kevent_user, which is the queue where you want to put your
events and which will be checked when events are ready.

> >The ring buffer _always_ has space for new events as long as the queue
> >is not full. So if userspace does not read its events for too long and
> >eventually tries to add a new one, it will fail early.
>
> Sorry, I don't understand this at all.
>
> If the ring buffer always has enough room then events must be
> preregistered. Is this the case? Seems very inflexible, and how would
> this work with event sources like timers which can trigger many times?

A ready event is placed into the buffer only once, even if the timer
has fired many times. How would it look if we put a notification there
each time new data arrived from the network, instead of marking the
KEVENT_SOCKET_RECV event as ready? It could eat all memory if for each
one-byte packet we put 12 bytes of event there.
There is a limit on the maximum number of events allowed in one kevent
queue; when this limit is reached, no new events can be added from
userspace until all previously committed ones are removed, so that
limit also bounds the mapped buffer - it can only grow until the
maximum allowed number of events fits there.
Such a method can look inconvenient, but I doubt that the buffer
overflow scenario (what happens with rt-signals) is really much
nicer...

> I hope you don't mean that ring buffers probably won't overflow since
> programs have to handle events fast enough. That's not acceptable.

:)

> >There is no overflow - I do not want to introduce another signal queue
> >overflow crap here.
> >And once again - no signals.
>
> Well, signals are the only asynchronous notification mechanism we
> have. But more to the point: why cannot there be overflows?

The kevent queue is limited (for the purpose of the mapped buffer), so
the mapped buffer will grow until it can host the maximum number of
events (4096 right now). When that situation happens (i.e. the queue
is full), no new event can be added, so no events can be put into the
mapped buffer, and it cannot overflow.
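
In other words, overflow is prevented at enqueue time, roughly like
this (a sketch only: the limit's spelling, the kevent_num field and
the error value are assumptions, not the real code):

/* Illustrative: fail early once the per-queue limit is hit, so
 * the mapped ring, sized for that limit, can never overflow. */
#define KEVENT_MAX_EVENTS 4096

static int kevent_enqueue_sketch(struct kevent_user *u)
{
        if (u->kevent_num >= KEVENT_MAX_EVENTS)
                return -ENOSPC;  /* userspace sees the failure early */
        /* ... allocate and queue the new kevent here ... */
        u->kevent_num++;
        return 0;
}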

> >You basically want to deliver the same event to several users.
> >But how do you want to achieve it with network buffers, for example?
> >When several threads read from the same socket, they do not obtain the
> >same data.
>
> That's not what I am after. I'm perfectly fine with waking only one
> thread. In fact, this is how it must be to avoid thundering herd
> effects. But there is the problem that if the woken thread is not
> working on the issue for which it was woken (e.g., if the thread got
> canceled) then it must be able to wake another thread. In effect,
> there should be a syscall which causes a given number of other waiters
> (make the number a parameter to the syscall) to be woken. They would
> start running and if nothing is to be done go back to sleep. The
> wakeup interface is what is needed.

You look at the problem from a strange and, it seems to me, wrong
angle. There is one queue of events, and that queue does not and
cannot know who will read it. It just exists and hosts ready events;
if there are several threads which can access it, how can it detect
which one will do so? How would the recv() syscall wake up exactly the
thread which is supposed to receive the data, but not the one which is
supposed to print info into syslog that data has arrived?

> >min_nr is used to specify the special case "wake up when at least one
> >event is ready and get all ready ones".
>
> I understand but when is this really necessary? The nature of the
> event queue means many different types of events will be reported
> via it. In such a situation a minimum count is not really useful.
> I would argue this is unnecessary complexity which can be handled
> more easily and more flexibly at userlevel.

Consider the situation when you have a web server. A connected user
does not want to wait until 10 other users have connected (or some
timeout has expired) before the server is woken up and starts to
process them.
From the other side, consider someone who writes data asynchronously:
it is much better to wake him up when several writes are completed,
not each time a single write is ready.
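
Assuming the wait-and-fetch syscall looks roughly like
kevent_get_events(kfd, min_nr, max_nr, timeout_ns, buf, flags) - the
name and signature here are assumptions for illustration - the two
cases differ only in min_nr:

struct ukevent buf[128];
__u64 timeout_ns = 1000000000ULL;  /* 1 second */

/* Web server: wake up as soon as one connection is ready. */
int n = kevent_get_events(kfd, 1, 128, timeout_ns, buf, 0);

/* Async writer: batch wakeups, e.g. only after 16 completions. */
int m = kevent_get_events(kfd, 16, 128, timeout_ns, buf, 0);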

> >There are no "expected outstanding events", I think it can be a problem.
> >Currently there is absolute maximum of events, which can not be
> >increased in real-time.
>
> That is a problem. If we succeed in having a unified event mechanism
> the number of outstanding events can be unbounded, limited only by the
> system's capabilities.

Then I will remove the mapped buffer implementation, since unbounded
pinned memory is not what we want. Letting the buffer overflow is not
an option - recall rt-signal queue overflow and its recovery.

> >Each subsequent mmap will map the existing buffers; only the first
> >mmap can create the buffer.
>
> OK, so you have magic in mmap() calls using the kevent file
> descriptor? Seems OK but I will not export this as the interface
> glibc exports. All this should be abstracted out.

Yes, I use a private area created when the kevent file descriptor was
allocated.

> >> Maybe the flags parameter isn't needed, it's just another way to make
> >> sure we won't regret the design later. If the ring buffer can fill up
> >> and this is detected by the kernel (unlike what happens in take 14)
> >
> >Just to repeat - with the current buffer implementation it cannot
> >happen - the maximum queue length is the limit for the buffer size.
>
> How can the buffer not fill up? Where is the information stored in
> case the userlevel code did not process the ring buffer entries in
> time?

The buffer can be filled completely, but there is no possibility of
overflow, since the maximum number of events is the limiting factor
for the buffer size.

> >> int kevent_wait (int kfd, unsigned ringstate,
> >> const struct timespec *timeout,
> >> const sigset_t *sigmask)
> >
> >Yes, I agree, this is a good syscall.
> >Except for signals (no signals, that's the rule) and the variable-sized
> >timespec structure. What about putting a u64 number of nanoseconds
> >there?
>
> Well, I've explained it already above and repeated it during the
> pselect/ppoll discussions. The sigmask parameter is not in any way
> a signal that events should be sent using signals. It is simply a way
> to set the signal mask atomically around the delay to some other
> value. This is functionality which cannot be implemented at
> userlevel. Hence we now have the pselect and ppoll system calls. The
> kevent_wait syscall will need the same.

What I see in sys_ppoll() is just a change of the mask and a call to
the usual poll(); there are no locks and no special tricks.
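
The pattern in question, paraphrased and simplified (this is not the
literal kernel source): the mask is swapped before sleeping and, on
-EINTR, restored only after signal delivery via TIF_RESTORE_SIGMASK:

sigset_t saved;

if (sigmask)
        sigprocmask(SIG_SETMASK, &ksigmask, &saved);

ret = do_sys_poll(ufds, nfds, &timeout);

if (ret == -EINTR && sigmask) {
        /* Defer restoring the mask until the signal handler
         * has actually run. */
        current->saved_sigmask = saved;
        set_thread_flag(TIF_RESTORE_SIGMASK);
} else if (sigmask)
        sigprocmask(SIG_SETMASK, &saved, NULL);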

> >What about the following:
> >userspace:
> > - check the ring index; if it differs from the one stored in
> > userspace, then there are events between the old stored index and
> > the new one just read.
> > - copy events
> > - call kevent_wait() or another method to show the kernel that all
> > events up to the number provided in the syscall are processed, and
> > thus the kernel can remove them and put new ones there.
>
> This would require a system call to free ring buffer entries. And
> delaying the ack of an event (to avoid syscall overhead) means that
> the ring buffer might fill up.
>
> Having userlevel-writable fields which indicate whether an entry in
> the ring buffer is free would help to avoid these syscalls and allow
> freeing up the entries. These fields could be in the form of a bitmap
> outside the actual ring buffer.

I added kevent_wait() exactly for that.
Its parameters allow completing events, although it is not possible to
complete an event in the middle of the set of ready events, only from
the beginning.
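
A sketch of that loop, under heavy assumptions: the kevent_mring
layout (kidx, event[]), MAX_EVENTS, map_ring(), process_event() and
the exact kevent_wait() argument list are placeholders for
illustration; the take14 patches define the real ones.

struct kevent_mring *ring = map_ring(kfd);  /* hypothetical mmap() wrapper */
unsigned int done = 0;

for (;;) {
        unsigned int ready = ring->kidx;  /* assumed "last ready" index */

        while (done != ready) {
                process_event(&ring->event[done % MAX_EVENTS]);
                done++;
        }

        /* Commit everything processed so far (always from the
         * beginning of the ready set) and sleep for more. */
        kevent_wait(kfd, done, timeout_ns);
}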

> If a ring buffer is not wanted, then a simple writable buffer index
> should be used. This will require that all entries in the ring buffer
> are processed in sequence but I don't consider this too much of a
> limitation. The kernel only ever reads this buffer index field.
> Instead of making this field part of the mapping (which could be
> read-only) the field index position could be passed to the kernel in
> the syscall to create a kevent queue.
>
>
> >kernelspace:
> > - when a new kevent is added, it guarantees that there is a place
> > for it in the kernel ring buffer
>
> How? Unless you severely want to limit the usefulness of kevents this
> is not possible. One example, already given above, is periodic
> timers.

A periodic timer is added only once from userspace, and it is marked
as ready when it fires the first time. If userspace missed the fact
that it fired several times before being read, then that is
userspace's problem - I put the timer's last ready time into ret_data,
so userspace can check how many times this event would have been
marked as ready. The same applies to other events triggered this way.
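
For illustration, recovering the missed-fire count could look like
this; the ret_data layout (milliseconds of the last ready time) and
the field name are assumptions, not the patchset's definitions:

/* Hypothetical: how many times did the periodic timer fire since
 * userspace last saw it? */
static unsigned int missed_fires(const struct ukevent *uk,
                                 unsigned int last_seen_ms,
                                 unsigned int period_ms)
{
        unsigned int last_ready_ms = uk->ret_data[0];

        return (last_ready_ms - last_seen_ms) / period_ms;
}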

> > - when an event is ready it is copied into the mapped buffer and the
> > "last ready" index is increased (it is a fully atomic operation)
> > - when userspace calls kevent_wait(), the kernel gets the ring index
> > from the syscall, searches for all events up to the provided number
> > and frees them (or rearms them)
>
> Yes, that's OK. But in the fast path no kevent_wait syscall should be
> needed. If the index variable is exposed in the memory region
> containing the ring buffer no syscall is needed in case the ring
> buffer is not empty.

We need to inform the kernel that some events have been processed by
userspace and thus can be rearmed (i.e. marked as not ready, so the
rearming work can mark them as ready again: like received data or a
timer timeout) or freed - kevent_wait() both waits and commits (or it
does not wait, if events are already ready, or does not commit, if the
provided number of events is zero).

Committing through a writable mapping is not the best way, I think; it
introduces a lot of problems with events damaged by errors, with the
inability to sleep on that variable, and so on.

> >As shown above it is already implemented.
>
> How can you say that? Just before, you said the kevent_wait syscall is
> not implemented. This paragraph was all about how to use kevent_wait.
> I'll have to look at the latest code to see how the _wait syscall is
> now implemented.

Kevent development is quite fast :)
kevent_wait() is already implemented in the take14 patchset.

--
Evgeniy Polyakov