2006-08-21 09:56:54

by Evgeniy Polyakov

[permalink] [raw]
Subject: [take12 0/3] kevent: Generic event handling mechanism.


Generic event handling mechanism.

Changes from 'take11' patchset:
* include missing headers into patchset
* some trivial code cleanups (use goto instead if if/else games and so on)
* some whitespace cleanups
* check for ready_callback() callback before main loop which should save us some ticks

Changes from 'take10' patchset:
* removed non-existent prototypes
* added helper function for kevent_registered_callbacks
* fixed 80 lines comments issues
* added shared between userspace and kernelspace header instead of embedd them in one
* core restructuring to remove forward declarations
* s o m e w h i t e s p a c e c o d y n g s t y l e c l e a n u p
* use vm_insert_page() instead of remap_pfn_range()

Changes from 'take9' patchset:
* fixed ->nopage method

Changes from 'take8' patchset:
* fixed mmap release bug
* use module_init() instead of late_initcall()
* use better structures for timer notifications

Changes from 'take7' patchset:
* new mmap interface (not tested, waiting for other changes to be acked)
- use nopage() method to dynamically substitue pages
- allocate new page for events only when new added kevent requres it
- do not use ugly index dereferencing, use structure instead
- reduced amount of data in the ring (id and flags),
maximum 12 pages on x86 per kevent fd

Changes from 'take6' patchset:
* a lot of comments!
* do not use list poisoning for detection of the fact, that entry is in the list
* return number of ready kevents even if copy*user() fails
* strict check for number of kevents in syscall
* use ARRAY_SIZE for array size calculation
* changed superblock magic number
* use SLAB_PANIC instead of direct panic() call
* changed -E* return values
* a lot of small cleanups and indent fixes

Changes from 'take5' patchset:
* removed compilation warnings about unused wariables when lockdep is not turned on
* do not use internal socket structures, use appropriate (exported) wrappers instead
* removed default 1 second timeout
* removed AIO stuff from patchset

Changes from 'take4' patchset:
* use miscdevice instead of chardevice
* comments fixes

Changes from 'take3' patchset:
* removed serializing mutex from kevent_user_wait()
* moved storage list processing to RCU
* removed lockdep screaming - all storage locks are initialized in the same function, so it was learned
to differentiate between various cases
* remove kevent from storage if is marked as broken after callback
* fixed a typo in mmaped buffer implementation which would end up in wrong index calcualtion

Changes from 'take2' patchset:
* split kevent_finish_user() to locked and unlocked variants
* do not use KEVENT_STAT ifdefs, use inline functions instead
* use array of callbacks of each type instead of each kevent callback initialization
* changed name of ukevent guarding lock
* use only one kevent lock in kevent_user for all hash buckets instead of per-bucket locks
* do not use kevent_user_ctl structure instead provide needed arguments as syscall parameters
* various indent cleanups
* added optimisation, which is aimed to help when a lot of kevents are being copied from userspace
* mapped buffer (initial) implementation (no userspace yet)

Changes from 'take1' patchset:
- rebased against 2.6.18-git tree
- removed ioctl controlling
- added new syscall kevent_get_events(int fd, unsigned int min_nr, unsigned int max_nr,
unsigned int timeout, void __user *buf, unsigned flags)
- use old syscall kevent_ctl for creation/removing, modification and initial kevent
initialization
- use mutuxes instead of semaphores
- added file descriptor check and return error if provided descriptor does not match
kevent file operations
- various indent fixes
- removed aio_sendfile() declarations.

Thank you.

Signed-off-by: Evgeniy Polyakov <[email protected]>



2006-08-21 09:57:21

by Evgeniy Polyakov

[permalink] [raw]
Subject: [take12 2/3] kevent: poll/select() notifications.


poll/select() notifications.

This patch includes generic poll/select and timer notifications.

kevent_poll works simialr to epoll and has the same issues (callback
is invoked not from internal state machine of the caller, but through
process awake).

Signed-off-by: Evgeniy Polyakov <[email protected]>

diff --git a/include/linux/fs.h b/include/linux/fs.h
index 2561020..76b3039 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -236,6 +236,7 @@ #include <linux/prio_tree.h>
#include <linux/init.h>
#include <linux/sched.h>
#include <linux/mutex.h>
+#include <linux/kevent.h>

#include <asm/atomic.h>
#include <asm/semaphore.h>
@@ -698,6 +699,9 @@ #ifdef CONFIG_EPOLL
struct list_head f_ep_links;
spinlock_t f_ep_lock;
#endif /* #ifdef CONFIG_EPOLL */
+#ifdef CONFIG_KEVENT_POLL
+ struct kevent_storage st;
+#endif
struct address_space *f_mapping;
};
extern spinlock_t files_lock;
diff --git a/kernel/kevent/kevent_poll.c b/kernel/kevent/kevent_poll.c
new file mode 100644
index 0000000..75a75d1
--- /dev/null
+++ b/kernel/kevent/kevent_poll.c
@@ -0,0 +1,221 @@
+/*
+ * kevent_poll.c
+ *
+ * 2006 Copyright (c) Evgeniy Polyakov <[email protected]>
+ * All rights reserved.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ */
+
+#include <linux/kernel.h>
+#include <linux/types.h>
+#include <linux/list.h>
+#include <linux/slab.h>
+#include <linux/spinlock.h>
+#include <linux/timer.h>
+#include <linux/file.h>
+#include <linux/kevent.h>
+#include <linux/poll.h>
+#include <linux/fs.h>
+
+static kmem_cache_t *kevent_poll_container_cache;
+static kmem_cache_t *kevent_poll_priv_cache;
+
+struct kevent_poll_ctl
+{
+ struct poll_table_struct pt;
+ struct kevent *k;
+};
+
+struct kevent_poll_wait_container
+{
+ struct list_head container_entry;
+ wait_queue_head_t *whead;
+ wait_queue_t wait;
+ struct kevent *k;
+};
+
+struct kevent_poll_private
+{
+ struct list_head container_list;
+ spinlock_t container_lock;
+};
+
+static int kevent_poll_enqueue(struct kevent *k);
+static int kevent_poll_dequeue(struct kevent *k);
+static int kevent_poll_callback(struct kevent *k);
+
+static int kevent_poll_wait_callback(wait_queue_t *wait,
+ unsigned mode, int sync, void *key)
+{
+ struct kevent_poll_wait_container *cont =
+ container_of(wait, struct kevent_poll_wait_container, wait);
+ struct kevent *k = cont->k;
+ struct file *file = k->st->origin;
+ u32 revents;
+
+ revents = file->f_op->poll(file, NULL);
+
+ kevent_storage_ready(k->st, NULL, revents);
+
+ return 0;
+}
+
+static void kevent_poll_qproc(struct file *file, wait_queue_head_t *whead,
+ struct poll_table_struct *poll_table)
+{
+ struct kevent *k =
+ container_of(poll_table, struct kevent_poll_ctl, pt)->k;
+ struct kevent_poll_private *priv = k->priv;
+ struct kevent_poll_wait_container *cont;
+ unsigned long flags;
+
+ cont = kmem_cache_alloc(kevent_poll_container_cache, SLAB_KERNEL);
+ if (!cont) {
+ kevent_break(k);
+ return;
+ }
+
+ cont->k = k;
+ init_waitqueue_func_entry(&cont->wait, kevent_poll_wait_callback);
+ cont->whead = whead;
+
+ spin_lock_irqsave(&priv->container_lock, flags);
+ list_add_tail(&cont->container_entry, &priv->container_list);
+ spin_unlock_irqrestore(&priv->container_lock, flags);
+
+ add_wait_queue(whead, &cont->wait);
+}
+
+static int kevent_poll_enqueue(struct kevent *k)
+{
+ struct file *file;
+ int err, ready = 0;
+ unsigned int revents;
+ struct kevent_poll_ctl ctl;
+ struct kevent_poll_private *priv;
+
+ file = fget(k->event.id.raw[0]);
+ if (!file)
+ return -ENODEV;
+
+ err = -EINVAL;
+ if (!file->f_op || !file->f_op->poll)
+ goto err_out_fput;
+
+ err = -ENOMEM;
+ priv = kmem_cache_alloc(kevent_poll_priv_cache, SLAB_KERNEL);
+ if (!priv)
+ goto err_out_fput;
+
+ spin_lock_init(&priv->container_lock);
+ INIT_LIST_HEAD(&priv->container_list);
+
+ k->priv = priv;
+
+ ctl.k = k;
+ init_poll_funcptr(&ctl.pt, &kevent_poll_qproc);
+
+ err = kevent_storage_enqueue(&file->st, k);
+ if (err)
+ goto err_out_free;
+
+ revents = file->f_op->poll(file, &ctl.pt);
+ if (revents & k->event.event) {
+ ready = 1;
+ kevent_poll_dequeue(k);
+ }
+
+ return ready;
+
+err_out_free:
+ kmem_cache_free(kevent_poll_priv_cache, priv);
+err_out_fput:
+ fput(file);
+ return err;
+}
+
+static int kevent_poll_dequeue(struct kevent *k)
+{
+ struct file *file = k->st->origin;
+ struct kevent_poll_private *priv = k->priv;
+ struct kevent_poll_wait_container *w, *n;
+ unsigned long flags;
+
+ kevent_storage_dequeue(k->st, k);
+
+ spin_lock_irqsave(&priv->container_lock, flags);
+ list_for_each_entry_safe(w, n, &priv->container_list, container_entry) {
+ list_del(&w->container_entry);
+ remove_wait_queue(w->whead, &w->wait);
+ kmem_cache_free(kevent_poll_container_cache, w);
+ }
+ spin_unlock_irqrestore(&priv->container_lock, flags);
+
+ kmem_cache_free(kevent_poll_priv_cache, priv);
+ k->priv = NULL;
+
+ fput(file);
+
+ return 0;
+}
+
+static int kevent_poll_callback(struct kevent *k)
+{
+ struct file *file = k->st->origin;
+ unsigned int revents = file->f_op->poll(file, NULL);
+ return (revents & k->event.event);
+}
+
+static int __init kevent_poll_sys_init(void)
+{
+ struct kevent_callbacks pc = {
+ .callback = &kevent_poll_callback,
+ .enqueue = &kevent_poll_enqueue,
+ .dequeue = &kevent_poll_dequeue};
+
+ kevent_poll_container_cache = kmem_cache_create("kevent_poll_container_cache",
+ sizeof(struct kevent_poll_wait_container), 0, 0, NULL, NULL);
+ if (!kevent_poll_container_cache) {
+ printk(KERN_ERR "Failed to create kevent poll container cache.\n");
+ return -ENOMEM;
+ }
+
+ kevent_poll_priv_cache = kmem_cache_create("kevent_poll_priv_cache",
+ sizeof(struct kevent_poll_private), 0, 0, NULL, NULL);
+ if (!kevent_poll_priv_cache) {
+ printk(KERN_ERR "Failed to create kevent poll private data cache.\n");
+ kmem_cache_destroy(kevent_poll_container_cache);
+ kevent_poll_container_cache = NULL;
+ return -ENOMEM;
+ }
+
+ kevent_add_callbacks(&pc, KEVENT_POLL);
+
+ printk(KERN_INFO "Kevent poll()/select() subsystem has been initialized.\n");
+ return 0;
+}
+
+static struct lock_class_key kevent_poll_key;
+
+void kevent_poll_reinit(struct file *file)
+{
+ lockdep_set_class(&file->st.lock, &kevent_poll_key);
+}
+
+static void __exit kevent_poll_sys_fini(void)
+{
+ kmem_cache_destroy(kevent_poll_priv_cache);
+ kmem_cache_destroy(kevent_poll_container_cache);
+}
+
+module_init(kevent_poll_sys_init);
+module_exit(kevent_poll_sys_fini);

2006-08-21 09:57:27

by Evgeniy Polyakov

[permalink] [raw]
Subject: [take12 1/3] kevent: Core files.


Core files.

This patch includes core kevent files:
- userspace controlling
- kernelspace interfaces
- initialization
- notification state machines

Signed-off-by: Evgeniy Polyakov <[email protected]>

diff --git a/arch/i386/kernel/syscall_table.S b/arch/i386/kernel/syscall_table.S
index dd63d47..091ff42 100644
--- a/arch/i386/kernel/syscall_table.S
+++ b/arch/i386/kernel/syscall_table.S
@@ -317,3 +317,5 @@ ENTRY(sys_call_table)
.long sys_tee /* 315 */
.long sys_vmsplice
.long sys_move_pages
+ .long sys_kevent_get_events
+ .long sys_kevent_ctl
diff --git a/arch/x86_64/ia32/ia32entry.S b/arch/x86_64/ia32/ia32entry.S
index 5d4a7d1..b2af4a8 100644
--- a/arch/x86_64/ia32/ia32entry.S
+++ b/arch/x86_64/ia32/ia32entry.S
@@ -713,4 +713,6 @@ #endif
.quad sys_tee
.quad compat_sys_vmsplice
.quad compat_sys_move_pages
+ .quad sys_kevent_get_events
+ .quad sys_kevent_ctl
ia32_syscall_end:
diff --git a/include/asm-i386/unistd.h b/include/asm-i386/unistd.h
index fc1c8dd..c9dde13 100644
--- a/include/asm-i386/unistd.h
+++ b/include/asm-i386/unistd.h
@@ -323,10 +323,12 @@ #define __NR_sync_file_range 314
#define __NR_tee 315
#define __NR_vmsplice 316
#define __NR_move_pages 317
+#define __NR_kevent_get_events 318
+#define __NR_kevent_ctl 319

#ifdef __KERNEL__

-#define NR_syscalls 318
+#define NR_syscalls 320

/*
* user-visible error numbers are in the range -1 - -128: see
diff --git a/include/asm-x86_64/unistd.h b/include/asm-x86_64/unistd.h
index 94387c9..61363e0 100644
--- a/include/asm-x86_64/unistd.h
+++ b/include/asm-x86_64/unistd.h
@@ -619,10 +619,14 @@ #define __NR_vmsplice 278
__SYSCALL(__NR_vmsplice, sys_vmsplice)
#define __NR_move_pages 279
__SYSCALL(__NR_move_pages, sys_move_pages)
+#define __NR_kevent_get_events 280
+__SYSCALL(__NR_kevent_get_events, sys_kevent_get_events)
+#define __NR_kevent_ctl 281
+__SYSCALL(__NR_kevent_ctl, sys_kevent_ctl)

#ifdef __KERNEL__

-#define __NR_syscall_max __NR_move_pages
+#define __NR_syscall_max __NR_kevent_ctl

#ifndef __NO_STUBS

diff --git a/include/linux/kevent.h b/include/linux/kevent.h
new file mode 100644
index 0000000..eef9709
--- /dev/null
+++ b/include/linux/kevent.h
@@ -0,0 +1,174 @@
+/*
+ * 2006 Copyright (c) Evgeniy Polyakov <[email protected]>
+ * All rights reserved.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software
+ * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA
+ */
+
+#ifndef __KEVENT_H
+#define __KEVENT_H
+#include <linux/types.h>
+#include <linux/list.h>
+#include <linux/spinlock.h>
+#include <linux/mutex.h>
+#include <linux/wait.h>
+#include <linux/net.h>
+#include <linux/rcupdate.h>
+#include <linux/kevent_storage.h>
+#include <linux/ukevent.h>
+
+#define KEVENT_MAX_EVENTS 4096
+#define KEVENT_MIN_BUFFS_ALLOC 3
+
+struct kevent;
+struct kevent_storage;
+typedef int (* kevent_callback_t)(struct kevent *);
+
+/* @callback is called each time new event has been caught. */
+/* @enqueue is called each time new event is queued. */
+/* @dequeue is called each time event is dequeued. */
+
+struct kevent_callbacks {
+ kevent_callback_t callback, enqueue, dequeue;
+};
+
+#define KEVENT_READY 0x1
+#define KEVENT_STORAGE 0x2
+#define KEVENT_USER 0x4
+
+struct kevent
+{
+ /* Used for kevent freeing.*/
+ struct rcu_head rcu_head;
+ struct ukevent event;
+ /* This lock protects ukevent manipulations, e.g. ret_flags changes. */
+ spinlock_t ulock;
+
+ /* Entry of user's queue. */
+ struct list_head kevent_entry;
+ /* Entry of origin's queue. */
+ struct list_head storage_entry;
+ /* Entry of user's ready. */
+ struct list_head ready_entry;
+
+ u32 flags;
+
+ /* User who requested this kevent. */
+ struct kevent_user *user;
+ /* Kevent container. */
+ struct kevent_storage *st;
+
+ struct kevent_callbacks callbacks;
+
+ /* Private data for different storages.
+ * poll()/select storage has a list of wait_queue_t containers
+ * for each ->poll() { poll_wait()' } here.
+ */
+ void *priv;
+};
+
+#define KEVENT_HASH_MASK 0xff
+
+struct kevent_user
+{
+ struct list_head kevent_list[KEVENT_HASH_MASK+1];
+ spinlock_t kevent_lock;
+ /* Number of queued kevents. */
+ unsigned int kevent_num;
+
+ /* List of ready kevents. */
+ struct list_head ready_list;
+ /* Number of ready kevents. */
+ unsigned int ready_num;
+ /* Protects all manipulations with ready queue. */
+ spinlock_t ready_lock;
+
+ /* Protects against simultaneous kevent_user control manipulations. */
+ struct mutex ctl_mutex;
+ /* Wait until some events are ready. */
+ wait_queue_head_t wait;
+
+ /* Reference counter, increased for each new kevent. */
+ atomic_t refcnt;
+
+ unsigned int pages_in_use;
+ /* Array of pages forming mapped ring buffer */
+ unsigned long *pring;
+
+#ifdef CONFIG_KEVENT_USER_STAT
+ unsigned long im_num;
+ unsigned long wait_num;
+ unsigned long total;
+#endif
+};
+
+int kevent_enqueue(struct kevent *k);
+int kevent_dequeue(struct kevent *k);
+int kevent_init(struct kevent *k);
+void kevent_requeue(struct kevent *k);
+int kevent_break(struct kevent *k);
+
+int kevent_add_callbacks(struct kevent_callbacks *cb, int pos);
+
+void kevent_user_ring_add_event(struct kevent *k);
+
+void kevent_storage_ready(struct kevent_storage *st,
+ kevent_callback_t ready_callback, u32 event);
+int kevent_storage_init(void *origin, struct kevent_storage *st);
+void kevent_storage_fini(struct kevent_storage *st);
+int kevent_storage_enqueue(struct kevent_storage *st, struct kevent *k);
+void kevent_storage_dequeue(struct kevent_storage *st, struct kevent *k);
+
+int kevent_user_add_ukevent(struct ukevent *uk, struct kevent_user *u);
+
+#ifdef CONFIG_KEVENT_POLL
+void kevent_poll_reinit(struct file *file);
+#else
+static inline void kevent_poll_reinit(struct file *file)
+{
+}
+#endif
+
+#ifdef CONFIG_KEVENT_USER_STAT
+static inline void kevent_stat_init(struct kevent_user *u)
+{
+ u->wait_num = u->im_num = u->total = 0;
+}
+static inline void kevent_stat_print(struct kevent_user *u)
+{
+ pr_debug("%s: u=%p, wait=%lu, immediately=%lu, total=%lu.\n",
+ __func__, u, u->wait_num, u->im_num, u->total);
+}
+static inline void kevent_stat_im(struct kevent_user *u)
+{
+ u->im_num++;
+}
+static inline void kevent_stat_wait(struct kevent_user *u)
+{
+ u->wait_num++;
+}
+static inline void kevent_stat_total(struct kevent_user *u)
+{
+ u->total++;
+}
+#else
+#define kevent_stat_print(u) ({ (void) u;})
+#define kevent_stat_init(u) ({ (void) u;})
+#define kevent_stat_im(u) ({ (void) u;})
+#define kevent_stat_wait(u) ({ (void) u;})
+#define kevent_stat_total(u) ({ (void) u;})
+#endif
+
+#endif /* __KEVENT_H */
diff --git a/include/linux/kevent_storage.h b/include/linux/kevent_storage.h
new file mode 100644
index 0000000..a38575d
--- /dev/null
+++ b/include/linux/kevent_storage.h
@@ -0,0 +1,11 @@
+#ifndef __KEVENT_STORAGE_H
+#define __KEVENT_STORAGE_H
+
+struct kevent_storage
+{
+ void *origin; /* Originator's pointer, e.g. struct sock or struct file. Can be NULL. */
+ struct list_head list; /* List of queued kevents. */
+ spinlock_t lock; /* Protects users queue. */
+};
+
+#endif /* __KEVENT_STORAGE_H */
diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
index 008f04c..8609910 100644
--- a/include/linux/syscalls.h
+++ b/include/linux/syscalls.h
@@ -597,4 +597,7 @@ asmlinkage long sys_get_robust_list(int
asmlinkage long sys_set_robust_list(struct robust_list_head __user *head,
size_t len);

+asmlinkage long sys_kevent_get_events(int ctl_fd, unsigned int min, unsigned int max,
+ unsigned int timeout, void __user *buf, unsigned flags);
+asmlinkage long sys_kevent_ctl(int ctl_fd, unsigned int cmd, unsigned int num, void __user *buf);
#endif
diff --git a/include/linux/ukevent.h b/include/linux/ukevent.h
new file mode 100644
index 0000000..4282793
--- /dev/null
+++ b/include/linux/ukevent.h
@@ -0,0 +1,136 @@
+/*
+ * 2006 Copyright (c) Evgeniy Polyakov <[email protected]>
+ * All rights reserved.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software
+ * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA
+ */
+
+#ifndef __UKEVENT_H
+#define __UKEVENT_H
+
+/*
+ * Kevent request flags.
+ */
+
+/* Process this event only once and then dequeue. */
+#define KEVENT_REQ_ONESHOT 0x1
+
+/*
+ * Kevent return flags.
+ */
+/* Kevent is broken. */
+#define KEVENT_RET_BROKEN 0x1
+/* Kevent processing was finished successfully. */
+#define KEVENT_RET_DONE 0x2
+
+/*
+ * Kevent type set.
+ */
+#define KEVENT_SOCKET 0
+#define KEVENT_INODE 1
+#define KEVENT_TIMER 2
+#define KEVENT_POLL 3
+#define KEVENT_NAIO 4
+#define KEVENT_AIO 5
+#define KEVENT_MAX 6
+
+/*
+ * Per-type event sets.
+ * Number of per-event sets should be exactly as number of kevent types.
+ */
+
+/*
+ * Timer events.
+ */
+#define KEVENT_TIMER_FIRED 0x1
+
+/*
+ * Socket/network asynchronous IO events.
+ */
+#define KEVENT_SOCKET_RECV 0x1
+#define KEVENT_SOCKET_ACCEPT 0x2
+#define KEVENT_SOCKET_SEND 0x4
+
+/*
+ * Inode events.
+ */
+#define KEVENT_INODE_CREATE 0x1
+#define KEVENT_INODE_REMOVE 0x2
+
+/*
+ * Poll events.
+ */
+#define KEVENT_POLL_POLLIN 0x0001
+#define KEVENT_POLL_POLLPRI 0x0002
+#define KEVENT_POLL_POLLOUT 0x0004
+#define KEVENT_POLL_POLLERR 0x0008
+#define KEVENT_POLL_POLLHUP 0x0010
+#define KEVENT_POLL_POLLNVAL 0x0020
+
+#define KEVENT_POLL_POLLRDNORM 0x0040
+#define KEVENT_POLL_POLLRDBAND 0x0080
+#define KEVENT_POLL_POLLWRNORM 0x0100
+#define KEVENT_POLL_POLLWRBAND 0x0200
+#define KEVENT_POLL_POLLMSG 0x0400
+#define KEVENT_POLL_POLLREMOVE 0x1000
+
+/*
+ * Asynchronous IO events.
+ */
+#define KEVENT_AIO_BIO 0x1
+
+#define KEVENT_MASK_ALL 0xffffffff
+/* Mask of all possible event values. */
+#define KEVENT_MASK_EMPTY 0x0
+/* Empty mask of ready events. */
+
+struct kevent_id
+{
+ __u32 raw[2];
+};
+
+struct ukevent
+{
+ /* Id of this request, e.g. socket number, file descriptor and so on... */
+ struct kevent_id id;
+ /* Event type, e.g. KEVENT_SOCK, KEVENT_INODE, KEVENT_TIMER and so on... */
+ __u32 type;
+ /* Event itself, e.g. SOCK_ACCEPT, INODE_CREATED, TIMER_FIRED... */
+ __u32 event;
+ /* Per-event request flags */
+ __u32 req_flags;
+ /* Per-event return flags */
+ __u32 ret_flags;
+ /* Event return data. Event originator fills it with anything it likes. */
+ __u32 ret_data[2];
+ /* User's data. It is not used, just copied to/from user. */
+ union {
+ __u32 user[2];
+ void *ptr;
+ };
+};
+
+struct mukevent
+{
+ struct kevent_id id;
+ __u32 ret_flags;
+};
+
+#define KEVENT_CTL_ADD 0
+#define KEVENT_CTL_REMOVE 1
+#define KEVENT_CTL_MODIFY 2
+#define KEVENT_CTL_INIT 3
+
+#endif /* __UKEVENT_H */
diff --git a/init/Kconfig b/init/Kconfig
index a099fc6..c550fcc 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -218,6 +218,8 @@ config AUDITSYSCALL
such as SELinux. To use audit's filesystem watch feature, please
ensure that INOTIFY is configured.

+source "kernel/kevent/Kconfig"
+
config IKCONFIG
bool "Kernel .config support"
---help---
diff --git a/kernel/Makefile b/kernel/Makefile
index d62ec66..2d7a6dd 100644
--- a/kernel/Makefile
+++ b/kernel/Makefile
@@ -47,6 +47,7 @@ obj-$(CONFIG_DETECT_SOFTLOCKUP) += softl
obj-$(CONFIG_GENERIC_HARDIRQS) += irq/
obj-$(CONFIG_SECCOMP) += seccomp.o
obj-$(CONFIG_RCU_TORTURE_TEST) += rcutorture.o
+obj-$(CONFIG_KEVENT) += kevent/
obj-$(CONFIG_RELAY) += relay.o
obj-$(CONFIG_TASK_DELAY_ACCT) += delayacct.o
obj-$(CONFIG_TASKSTATS) += taskstats.o
diff --git a/kernel/kevent/Kconfig b/kernel/kevent/Kconfig
new file mode 100644
index 0000000..a756e85
--- /dev/null
+++ b/kernel/kevent/Kconfig
@@ -0,0 +1,31 @@
+config KEVENT
+ bool "Kernel event notification mechanism"
+ help
+ This option enables event queue mechanism.
+ It can be used as replacement for poll()/select(), AIO callback
+ invocations, advanced timer notifications and other kernel
+ object status changes.
+
+config KEVENT_USER_STAT
+ bool "Kevent user statistic"
+ depends on KEVENT
+ default N
+ help
+ This option will turn kevent_user statistic collection on.
+ Statistic data includes total number of kevent, number of kevents
+ which are ready immediately at insertion time and number of kevents
+ which were removed through readiness completion.
+ It will be printed each time control kevent descriptor is closed.
+
+config KEVENT_TIMER
+ bool "Kernel event notifications for timers"
+ depends on KEVENT
+ help
+ This option allows to use timers through KEVENT subsystem.
+
+config KEVENT_POLL
+ bool "Kernel event notifications for poll()/select()"
+ depends on KEVENT
+ help
+ This option allows to use kevent subsystem for poll()/select()
+ notifications.
diff --git a/kernel/kevent/Makefile b/kernel/kevent/Makefile
new file mode 100644
index 0000000..ab6bca0
--- /dev/null
+++ b/kernel/kevent/Makefile
@@ -0,0 +1,3 @@
+obj-y := kevent.o kevent_user.o
+obj-$(CONFIG_KEVENT_TIMER) += kevent_timer.o
+obj-$(CONFIG_KEVENT_POLL) += kevent_poll.o
diff --git a/kernel/kevent/kevent.c b/kernel/kevent/kevent.c
new file mode 100644
index 0000000..2872aa2
--- /dev/null
+++ b/kernel/kevent/kevent.c
@@ -0,0 +1,238 @@
+/*
+ * kevent.c
+ *
+ * 2006 Copyright (c) Evgeniy Polyakov <[email protected]>
+ * All rights reserved.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software
+ * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA
+ */
+
+#include <linux/kernel.h>
+#include <linux/types.h>
+#include <linux/list.h>
+#include <linux/slab.h>
+#include <linux/spinlock.h>
+#include <linux/mempool.h>
+#include <linux/sched.h>
+#include <linux/wait.h>
+#include <linux/kevent.h>
+
+/*
+ * Attempts to add an event into appropriate origin's queue.
+ * Returns positive value if this event is ready immediately,
+ * negative value in case of error and zero if event has been queued.
+ * ->enqueue() callback must increase origin's reference counter.
+ */
+int kevent_enqueue(struct kevent *k)
+{
+ if (k->event.type >= KEVENT_MAX)
+ return -EINVAL;
+
+ if (!k->callbacks.enqueue) {
+ kevent_break(k);
+ return -EINVAL;
+ }
+
+ return k->callbacks.enqueue(k);
+}
+
+/*
+ * Remove event from the appropriate queue.
+ * ->dequeue() callback must decrease origin's reference counter.
+ */
+int kevent_dequeue(struct kevent *k)
+{
+ if (k->event.type >= KEVENT_MAX)
+ return -EINVAL;
+
+ if (!k->callbacks.dequeue) {
+ kevent_break(k);
+ return -EINVAL;
+ }
+
+ return k->callbacks.dequeue(k);
+}
+
+/*
+ * Mark kevent as broken.
+ */
+int kevent_break(struct kevent *k)
+{
+ unsigned long flags;
+
+ spin_lock_irqsave(&k->ulock, flags);
+ k->event.ret_flags |= KEVENT_RET_BROKEN;
+ spin_unlock_irqrestore(&k->ulock, flags);
+ return 0;
+}
+
+static struct kevent_callbacks kevent_registered_callbacks[KEVENT_MAX];
+
+int kevent_add_callbacks(struct kevent_callbacks *cb, int pos)
+{
+ if (pos >= KEVENT_MAX)
+ return -EINVAL;
+ kevent_registered_callbacks[pos] = *cb;
+ printk(KERN_INFO "KEVENT: Added callbacks for type %d.\n", pos);
+ return 0;
+}
+
+/*
+ * Must be called before event is going to be added into some origin's queue.
+ * Initializes ->enqueue(), ->dequeue() and ->callback() callbacks.
+ * If failed, kevent should not be used or kevent_enqueue() will fail to add
+ * this kevent into origin's queue with setting
+ * KEVENT_RET_BROKEN flag in kevent->event.ret_flags.
+ */
+int kevent_init(struct kevent *k)
+{
+ spin_lock_init(&k->ulock);
+ k->flags = 0;
+
+ if (k->event.type >= KEVENT_MAX)
+ return -EINVAL;
+
+ k->callbacks = kevent_registered_callbacks[k->event.type];
+ if (!k->callbacks.callback) {
+ kevent_break(k);
+ return -EINVAL;
+ }
+
+ return 0;
+}
+
+/*
+ * Called from ->enqueue() callback when reference counter for given
+ * origin (socket, inode...) has been increased.
+ */
+int kevent_storage_enqueue(struct kevent_storage *st, struct kevent *k)
+{
+ unsigned long flags;
+
+ k->st = st;
+ spin_lock_irqsave(&st->lock, flags);
+ list_add_tail_rcu(&k->storage_entry, &st->list);
+ k->flags |= KEVENT_STORAGE;
+ spin_unlock_irqrestore(&st->lock, flags);
+ return 0;
+}
+
+/*
+ * Dequeue kevent from origin's queue.
+ * It does not decrease origin's reference counter in any way
+ * and must be called before it, so storage itself must be valid.
+ * It is called from ->dequeue() callback.
+ */
+void kevent_storage_dequeue(struct kevent_storage *st, struct kevent *k)
+{
+ unsigned long flags;
+
+ spin_lock_irqsave(&st->lock, flags);
+ if (k->flags & KEVENT_STORAGE) {
+ list_del_rcu(&k->storage_entry);
+ k->flags &= ~KEVENT_STORAGE;
+ }
+ spin_unlock_irqrestore(&st->lock, flags);
+}
+
+/*
+ * Call kevent ready callback and queue it into ready queue if needed.
+ * If kevent is marked as one-shot, then remove it from storage queue.
+ */
+static void __kevent_requeue(struct kevent *k, u32 event)
+{
+ int ret, rem;
+ unsigned long flags;
+
+ ret = k->callbacks.callback(k);
+
+ spin_lock_irqsave(&k->ulock, flags);
+ if (ret > 0)
+ k->event.ret_flags |= KEVENT_RET_DONE;
+ else if (ret < 0)
+ k->event.ret_flags |= (KEVENT_RET_BROKEN | KEVENT_RET_DONE);
+ else
+ ret = (k->event.ret_flags & (KEVENT_RET_BROKEN|KEVENT_RET_DONE));
+ rem = (k->event.req_flags & KEVENT_REQ_ONESHOT);
+ spin_unlock_irqrestore(&k->ulock, flags);
+
+ if (ret) {
+ if ((rem || ret < 0) && (k->flags & KEVENT_STORAGE)) {
+ list_del_rcu(&k->storage_entry);
+ k->flags &= ~KEVENT_STORAGE;
+ }
+
+ spin_lock_irqsave(&k->user->ready_lock, flags);
+ if (!(k->flags & KEVENT_READY)) {
+ kevent_user_ring_add_event(k);
+ list_add_tail(&k->ready_entry, &k->user->ready_list);
+ k->flags |= KEVENT_READY;
+ k->user->ready_num++;
+ }
+ spin_unlock_irqrestore(&k->user->ready_lock, flags);
+ wake_up(&k->user->wait);
+ }
+}
+
+/*
+ * Check if kevent is ready (by invoking it's callback) and requeue/remove
+ * if needed.
+ */
+void kevent_requeue(struct kevent *k)
+{
+ unsigned long flags;
+
+ spin_lock_irqsave(&k->st->lock, flags);
+ __kevent_requeue(k, 0);
+ spin_unlock_irqrestore(&k->st->lock, flags);
+}
+
+/*
+ * Called each time some activity in origin (socket, inode...) is noticed.
+ */
+void kevent_storage_ready(struct kevent_storage *st,
+ kevent_callback_t ready_callback, u32 event)
+{
+ struct kevent *k;
+
+ rcu_read_lock();
+ if (ready_callback)
+ list_for_each_entry_rcu(k, &st->list, storage_entry)
+ (*ready_callback)(k);
+
+ list_for_each_entry_rcu(k, &st->list, storage_entry)
+ if (event & k->event.event)
+ __kevent_requeue(k, event);
+ rcu_read_unlock();
+}
+
+int kevent_storage_init(void *origin, struct kevent_storage *st)
+{
+ spin_lock_init(&st->lock);
+ st->origin = origin;
+ INIT_LIST_HEAD(&st->list);
+ return 0;
+}
+
+/*
+ * Mark all events as broken, that will remove them from storage,
+ * so storage origin (inode, sockt and so on) can be safely removed.
+ * No new entries are allowed to be added into the storage at this point.
+ * (Socket is removed from file table at this point for example).
+ */
+void kevent_storage_fini(struct kevent_storage *st)
+{
+ kevent_storage_ready(st, kevent_break, KEVENT_MASK_ALL);
+}
diff --git a/kernel/kevent/kevent_user.c b/kernel/kevent/kevent_user.c
new file mode 100644
index 0000000..6e1bf3a
--- /dev/null
+++ b/kernel/kevent/kevent_user.c
@@ -0,0 +1,986 @@
+/*
+ * 2006 Copyright (c) Evgeniy Polyakov <[email protected]>
+ * All rights reserved.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software
+ * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA
+ */
+
+#include <linux/kernel.h>
+#include <linux/module.h>
+#include <linux/types.h>
+#include <linux/list.h>
+#include <linux/slab.h>
+#include <linux/spinlock.h>
+#include <linux/fs.h>
+#include <linux/file.h>
+#include <linux/mount.h>
+#include <linux/device.h>
+#include <linux/poll.h>
+#include <linux/kevent.h>
+#include <linux/jhash.h>
+#include <linux/miscdevice.h>
+#include <asm/io.h>
+
+static char kevent_name[] = "kevent";
+static kmem_cache_t *kevent_cache;
+
+static int kevent_get_sb(struct file_system_type *fs_type,
+ int flags, const char *dev_name, void *data, struct vfsmount *mnt)
+{
+ /* So original magic... */
+ return get_sb_pseudo(fs_type, kevent_name, NULL, 0xbcdbcdul, mnt);
+}
+
+static struct file_system_type kevent_fs_type = {
+ .name = kevent_name,
+ .get_sb = kevent_get_sb,
+ .kill_sb = kill_anon_super,
+};
+
+static struct vfsmount *kevent_mnt;
+
+/*
+ * kevents are pollable, return POLLIN and POLLRDNORM
+ * when there is at least one ready kevent.
+ */
+static unsigned int kevent_user_poll(struct file *file, struct poll_table_struct *wait)
+{
+ struct kevent_user *u = file->private_data;
+ unsigned int mask;
+
+ poll_wait(file, &u->wait, wait);
+ mask = 0;
+
+ if (u->ready_num)
+ mask |= POLLIN | POLLRDNORM;
+
+ return mask;
+}
+
+/*
+ * Note that kevents does not exactly fill the page (each mukevent is 40 bytes),
+ * so we reuse 4 bytes at the begining of the first page to store index.
+ * Take that into account if you want to change size of struct ukevent.
+ */
+#define KEVENTS_ON_PAGE ((PAGE_SIZE-sizeof(unsigned int))/sizeof(struct mukevent))
+struct kevent_mring
+{
+ unsigned int index;
+ struct mukevent event[KEVENTS_ON_PAGE];
+};
+
+static inline void kevent_user_ring_set(struct kevent_user *u, unsigned int num)
+{
+ struct kevent_mring *ring;
+
+ ring = (struct kevent_mring *)u->pring[0];
+ ring->index = num;
+}
+
+static inline void kevent_user_ring_inc(struct kevent_user *u)
+{
+ struct kevent_mring *ring;
+
+ ring = (struct kevent_mring *)u->pring[0];
+ ring->index++;
+}
+
+static int kevent_user_ring_grow(struct kevent_user *u)
+{
+ struct kevent_mring *ring;
+ unsigned int idx;
+
+ ring = (struct kevent_mring *)u->pring[0];
+
+ idx = (ring->index + 1) / KEVENTS_ON_PAGE;
+ if (idx >= u->pages_in_use) {
+ u->pring[idx] = __get_free_page(GFP_KERNEL);
+ if (!u->pring[idx])
+ return -ENOMEM;
+ u->pages_in_use++;
+ }
+ return 0;
+}
+
+/*
+ * Called under kevent_user->ready_lock, so updates are always protected.
+ */
+void kevent_user_ring_add_event(struct kevent *k)
+{
+ unsigned int pidx, off;
+ struct kevent_mring *ring, *copy_ring;
+
+ ring = (struct kevent_mring *)k->user->pring[0];
+
+ pidx = ring->index/KEVENTS_ON_PAGE;
+ off = ring->index%KEVENTS_ON_PAGE;
+
+ copy_ring = (struct kevent_mring *)k->user->pring[pidx];
+
+ copy_ring->event[off].id.raw[0] = k->event.id.raw[0];
+ copy_ring->event[off].id.raw[1] = k->event.id.raw[1];
+ copy_ring->event[off].ret_flags = k->event.ret_flags;
+
+ if (++ring->index >= KEVENT_MAX_EVENTS)
+ ring->index = 0;
+}
+
+/*
+ * Initialize mmap ring buffer.
+ * It will store ready kevents, so userspace could get them directly instead
+ * of using syscall. Esentially syscall becomes just a waiting point.
+ */
+static int kevent_user_ring_init(struct kevent_user *u)
+{
+ int pnum;
+
+ pnum = ALIGN(KEVENT_MAX_EVENTS*sizeof(struct mukevent) + sizeof(unsigned int), PAGE_SIZE)/PAGE_SIZE;
+
+ u->pring = kmalloc(pnum * sizeof(unsigned long), GFP_KERNEL);
+ if (!u->pring)
+ return -ENOMEM;
+
+ u->pring[0] = __get_free_page(GFP_KERNEL);
+ if (!u->pring[0])
+ goto err_out_free;
+
+ u->pages_in_use = 1;
+ kevent_user_ring_set(u, 0);
+
+ return 0;
+
+err_out_free:
+ kfree(u->pring);
+
+ return -ENOMEM;
+}
+
+static void kevent_user_ring_fini(struct kevent_user *u)
+{
+ int i;
+
+ for (i = 0; i < u->pages_in_use; ++i)
+ free_page(u->pring[i]);
+
+ kfree(u->pring);
+}
+
+
+/*
+ * Allocate new kevent userspace control entry.
+ */
+static struct kevent_user *kevent_user_alloc(void)
+{
+ struct kevent_user *u;
+ int i;
+
+ u = kzalloc(sizeof(struct kevent_user), GFP_KERNEL);
+ if (!u)
+ return NULL;
+
+ INIT_LIST_HEAD(&u->ready_list);
+ spin_lock_init(&u->ready_lock);
+ kevent_stat_init(u);
+ spin_lock_init(&u->kevent_lock);
+ for (i = 0; i < ARRAY_SIZE(u->kevent_list); ++i)
+ INIT_LIST_HEAD(&u->kevent_list[i]);
+
+ mutex_init(&u->ctl_mutex);
+ init_waitqueue_head(&u->wait);
+
+ atomic_set(&u->refcnt, 1);
+
+ if (kevent_user_ring_init(u)) {
+ kfree(u);
+ u = NULL;
+ }
+
+ return u;
+}
+
+static int kevent_user_open(struct inode *inode, struct file *file)
+{
+ struct kevent_user *u = kevent_user_alloc();
+
+ if (!u)
+ return -ENOMEM;
+
+ file->private_data = u;
+
+ return 0;
+}
+
+
+/*
+ * Kevent userspace control block reference counting.
+ * Set to 1 at creation time, when appropriate kevent file descriptor
+ * is closed, that reference counter is decreased.
+ * When counter hits zero block is freed.
+ */
+static inline void kevent_user_get(struct kevent_user *u)
+{
+ atomic_inc(&u->refcnt);
+}
+
+static inline void kevent_user_put(struct kevent_user *u)
+{
+ if (atomic_dec_and_test(&u->refcnt)) {
+ kevent_stat_print(u);
+ kevent_user_ring_fini(u);
+ kfree(u);
+ }
+}
+
+static struct page *kevent_user_nopage(struct vm_area_struct *vma, unsigned long addr, int *type)
+{
+ struct kevent_user *u = vma->vm_file->private_data;
+ unsigned long off = (addr - vma->vm_start)/PAGE_SIZE;
+
+ if (type)
+ *type = VM_FAULT_MINOR;
+
+ if (off >= u->pages_in_use)
+ goto err_out_sigbus;
+
+ return virt_to_page(u->pring[off]);
+
+err_out_sigbus:
+ return NOPAGE_SIGBUS;
+}
+
+static struct vm_operations_struct kevent_user_vm_ops = {
+ .nopage = &kevent_user_nopage,
+};
+
+/*
+ * Mmap implementation for ring buffer, which is created as array
+ * of pages, so vm_pgoff is an offset (in pages, not in bytes) of
+ * the first page to be mapped.
+ */
+static int kevent_user_mmap(struct file *file, struct vm_area_struct *vma)
+{
+ unsigned long start = vma->vm_start;
+ struct kevent_user *u = file->private_data;
+
+ if (vma->vm_flags & VM_WRITE)
+ return -EPERM;
+
+ vma->vm_page_prot = pgprot_noncached(vma->vm_page_prot);
+ vma->vm_ops = &kevent_user_vm_ops;
+ vma->vm_flags |= VM_RESERVED;
+ vma->vm_file = file;
+
+ if (vm_insert_page(vma, start, virt_to_page((void *)u->pring[0])))
+ return -EFAULT;
+
+ return 0;
+}
+
+static inline unsigned int kevent_user_hash(struct ukevent *uk)
+{
+ return jhash_1word(uk->id.raw[0], 0) & KEVENT_HASH_MASK;
+}
+
+/*
+ * RCU protects storage list (kevent->storage_entry).
+ * Free entry in RCU callback, it is dequeued from all lists at
+ * this point.
+ */
+
+static void kevent_free_rcu(struct rcu_head *rcu)
+{
+ struct kevent *kevent = container_of(rcu, struct kevent, rcu_head);
+ kmem_cache_free(kevent_cache, kevent);
+}
+
+/*
+ * Complete kevent removing - it dequeues kevent from storage list
+ * if it is requested, removes kevent from ready list, drops userspace
+ * control block reference counter and schedules kevent freeing through RCU.
+ */
+static void kevent_finish_user_complete(struct kevent *k, int deq)
+{
+ struct kevent_user *u = k->user;
+ unsigned long flags;
+
+ if (deq)
+ kevent_dequeue(k);
+
+ spin_lock_irqsave(&u->ready_lock, flags);
+ if (k->flags & KEVENT_READY) {
+ list_del(&k->ready_entry);
+ k->flags &= ~KEVENT_READY;
+ u->ready_num--;
+ }
+ spin_unlock_irqrestore(&u->ready_lock, flags);
+
+ kevent_user_put(u);
+ call_rcu(&k->rcu_head, kevent_free_rcu);
+}
+
+/*
+ * Remove from all lists and free kevent.
+ * Must be called under kevent_user->kevent_lock to protect
+ * kevent->kevent_entry removing.
+ */
+static void __kevent_finish_user(struct kevent *k, int deq)
+{
+ struct kevent_user *u = k->user;
+
+ list_del(&k->kevent_entry);
+ k->flags &= ~KEVENT_USER;
+ u->kevent_num--;
+ kevent_finish_user_complete(k, deq);
+}
+
+/*
+ * Remove kevent from user's list of all events,
+ * dequeue it from storage and decrease user's reference counter,
+ * since this kevent does not exist anymore. That is why it is freed here.
+ */
+static void kevent_finish_user(struct kevent *k, int deq)
+{
+ struct kevent_user *u = k->user;
+ unsigned long flags;
+
+ spin_lock_irqsave(&u->kevent_lock, flags);
+ list_del(&k->kevent_entry);
+ k->flags &= ~KEVENT_USER;
+ u->kevent_num--;
+ spin_unlock_irqrestore(&u->kevent_lock, flags);
+ kevent_finish_user_complete(k, deq);
+}
+
+/*
+ * Dequeue one entry from user's ready queue.
+ */
+static struct kevent *kqueue_dequeue_ready(struct kevent_user *u)
+{
+ unsigned long flags;
+ struct kevent *k = NULL;
+
+ spin_lock_irqsave(&u->ready_lock, flags);
+ if (u->ready_num && !list_empty(&u->ready_list)) {
+ k = list_entry(u->ready_list.next, struct kevent, ready_entry);
+ list_del(&k->ready_entry);
+ k->flags &= ~KEVENT_READY;
+ u->ready_num--;
+ }
+ spin_unlock_irqrestore(&u->ready_lock, flags);
+
+ return k;
+}
+
+/*
+ * Search a kevent inside hash bucket for given ukevent.
+ */
+static struct kevent *__kevent_search(struct list_head *head, struct ukevent *uk,
+ struct kevent_user *u)
+{
+ struct kevent *k, *ret = NULL;
+
+ list_for_each_entry(k, head, kevent_entry) {
+ spin_lock(&k->ulock);
+ if (k->event.user[0] == uk->user[0] && k->event.user[1] == uk->user[1] &&
+ k->event.id.raw[0] == uk->id.raw[0] &&
+ k->event.id.raw[1] == uk->id.raw[1]) {
+ ret = k;
+ spin_unlock(&k->ulock);
+ break;
+ }
+ spin_unlock(&k->ulock);
+ }
+
+ return ret;
+}
+
+/*
+ * Search and modify kevent according to provided ukevent.
+ */
+static int kevent_modify(struct ukevent *uk, struct kevent_user *u)
+{
+ struct kevent *k;
+ unsigned int hash = kevent_user_hash(uk);
+ int err = -ENODEV;
+ unsigned long flags;
+
+ spin_lock_irqsave(&u->kevent_lock, flags);
+ k = __kevent_search(&u->kevent_list[hash], uk, u);
+ if (k) {
+ spin_lock(&k->ulock);
+ k->event.event = uk->event;
+ k->event.req_flags = uk->req_flags;
+ k->event.ret_flags = 0;
+ spin_unlock(&k->ulock);
+ kevent_requeue(k);
+ err = 0;
+ }
+ spin_unlock_irqrestore(&u->kevent_lock, flags);
+
+ return err;
+}
+
+/*
+ * Remove kevent which matches provided ukevent.
+ */
+static int kevent_remove(struct ukevent *uk, struct kevent_user *u)
+{
+ int err = -ENODEV;
+ struct kevent *k;
+ unsigned int hash = kevent_user_hash(uk);
+ unsigned long flags;
+
+ spin_lock_irqsave(&u->kevent_lock, flags);
+ k = __kevent_search(&u->kevent_list[hash], uk, u);
+ if (k) {
+ __kevent_finish_user(k, 1);
+ err = 0;
+ }
+ spin_unlock_irqrestore(&u->kevent_lock, flags);
+
+ return err;
+}
+
+/*
+ * Detaches userspace control block from file descriptor
+ * and decrease it's reference counter.
+ * No new kevents can be added or removed from any list at this point.
+ */
+static int kevent_user_release(struct inode *inode, struct file *file)
+{
+ struct kevent_user *u = file->private_data;
+ struct kevent *k, *n;
+ int i;
+
+ for (i = 0; i < ARRAY_SIZE(u->kevent_list); ++i) {
+ list_for_each_entry_safe(k, n, &u->kevent_list[i], kevent_entry)
+ kevent_finish_user(k, 1);
+ }
+
+ kevent_user_put(u);
+ file->private_data = NULL;
+
+ return 0;
+}
+
+/*
+ * Read requested number of ukevents in one shot.
+ */
+static struct ukevent *kevent_get_user(unsigned int num, void __user *arg)
+{
+ struct ukevent *ukev;
+
+ ukev = kmalloc(sizeof(struct ukevent) * num, GFP_KERNEL);
+ if (!ukev)
+ return NULL;
+
+ if (copy_from_user(ukev, arg, sizeof(struct ukevent) * num)) {
+ kfree(ukev);
+ return NULL;
+ }
+
+ return ukev;
+}
+
+/*
+ * Read from userspace all ukevents and modify appropriate kevents.
+ * If provided number of ukevents is more that threshold, it is faster
+ * to allocate a room for them and copy in one shot instead of copy
+ * one-by-one and then process them.
+ */
+static int kevent_user_ctl_modify(struct kevent_user *u, unsigned int num, void __user *arg)
+{
+ int err = 0, i;
+ struct ukevent uk;
+
+ mutex_lock(&u->ctl_mutex);
+
+ if (num > u->kevent_num) {
+ err = -EINVAL;
+ goto out;
+ }
+
+ if (num > KEVENT_MIN_BUFFS_ALLOC) {
+ struct ukevent *ukev;
+
+ ukev = kevent_get_user(num, arg);
+ if (ukev) {
+ for (i = 0; i < num; ++i) {
+ if (kevent_modify(&ukev[i], u))
+ ukev[i].ret_flags |= KEVENT_RET_BROKEN;
+ ukev[i].ret_flags |= KEVENT_RET_DONE;
+ }
+ if (copy_to_user(arg, ukev, num*sizeof(struct ukevent)))
+ err = -EFAULT;
+ kfree(ukev);
+ goto out;
+ }
+ }
+
+ for (i = 0; i < num; ++i) {
+ if (copy_from_user(&uk, arg, sizeof(struct ukevent))) {
+ err = -EFAULT;
+ break;
+ }
+
+ if (kevent_modify(&uk, u))
+ uk.ret_flags |= KEVENT_RET_BROKEN;
+ uk.ret_flags |= KEVENT_RET_DONE;
+
+ if (copy_to_user(arg, &uk, sizeof(struct ukevent))) {
+ err = -EFAULT;
+ break;
+ }
+
+ arg += sizeof(struct ukevent);
+ }
+out:
+ mutex_unlock(&u->ctl_mutex);
+
+ return err;
+}
+
+/*
+ * Read from userspace all ukevents and remove appropriate kevents.
+ * If provided number of ukevents is more that threshold, it is faster
+ * to allocate a room for them and copy in one shot instead of copy
+ * one-by-one and then process them.
+ */
+static int kevent_user_ctl_remove(struct kevent_user *u, unsigned int num, void __user *arg)
+{
+ int err = 0, i;
+ struct ukevent uk;
+
+ mutex_lock(&u->ctl_mutex);
+
+ if (num > u->kevent_num) {
+ err = -EINVAL;
+ goto out;
+ }
+
+ if (num > KEVENT_MIN_BUFFS_ALLOC) {
+ struct ukevent *ukev;
+
+ ukev = kevent_get_user(num, arg);
+ if (ukev) {
+ for (i = 0; i < num; ++i) {
+ if (kevent_remove(&ukev[i], u))
+ ukev[i].ret_flags |= KEVENT_RET_BROKEN;
+ ukev[i].ret_flags |= KEVENT_RET_DONE;
+ }
+ if (copy_to_user(arg, ukev, num*sizeof(struct ukevent)))
+ err = -EFAULT;
+ kfree(ukev);
+ goto out;
+ }
+ }
+
+ for (i = 0; i < num; ++i) {
+ if (copy_from_user(&uk, arg, sizeof(struct ukevent))) {
+ err = -EFAULT;
+ break;
+ }
+
+ if (kevent_remove(&uk, u))
+ uk.ret_flags |= KEVENT_RET_BROKEN;
+
+ uk.ret_flags |= KEVENT_RET_DONE;
+
+ if (copy_to_user(arg, &uk, sizeof(struct ukevent))) {
+ err = -EFAULT;
+ break;
+ }
+
+ arg += sizeof(struct ukevent);
+ }
+out:
+ mutex_unlock(&u->ctl_mutex);
+
+ return err;
+}
+
+/*
+ * Queue kevent into userspace control block and increase
+ * it's reference counter.
+ */
+static void kevent_user_enqueue(struct kevent_user *u, struct kevent *k)
+{
+ unsigned long flags;
+ unsigned int hash = kevent_user_hash(&k->event);
+
+ spin_lock_irqsave(&u->kevent_lock, flags);
+ list_add_tail(&k->kevent_entry, &u->kevent_list[hash]);
+ k->flags |= KEVENT_USER;
+ u->kevent_num++;
+ kevent_user_get(u);
+ spin_unlock_irqrestore(&u->kevent_lock, flags);
+}
+
+/*
+ * Add kevent from both kernel and userspace users.
+ * This function allocates and queues kevent, returns negative value
+ * on error, positive if kevent is ready immediately and zero
+ * if kevent has been queued.
+ */
+int kevent_user_add_ukevent(struct ukevent *uk, struct kevent_user *u)
+{
+ struct kevent *k;
+ int err;
+
+ if (kevent_user_ring_grow(u)) {
+ err = -ENOMEM;
+ goto err_out_exit;
+ }
+
+ k = kmem_cache_alloc(kevent_cache, GFP_KERNEL);
+ if (!k) {
+ err = -ENOMEM;
+ goto err_out_exit;
+ }
+
+ memcpy(&k->event, uk, sizeof(struct ukevent));
+ INIT_RCU_HEAD(&k->rcu_head);
+
+ k->event.ret_flags = 0;
+
+ err = kevent_init(k);
+ if (err) {
+ kmem_cache_free(kevent_cache, k);
+ goto err_out_exit;
+ }
+ k->user = u;
+ kevent_stat_total(u);
+ kevent_user_enqueue(u, k);
+
+ err = kevent_enqueue(k);
+ if (err) {
+ memcpy(uk, &k->event, sizeof(struct ukevent));
+ kevent_finish_user(k, 0);
+ goto err_out_exit;
+ }
+
+ kevent_user_ring_inc(u);
+ return 0;
+
+err_out_exit:
+ if (err < 0) {
+ uk->ret_flags |= KEVENT_RET_BROKEN | KEVENT_RET_DONE;
+ uk->ret_data[1] = err;
+ } else if (err > 0)
+ uk->ret_flags |= KEVENT_RET_DONE;
+ return err;
+}
+
+/*
+ * Copy all ukevents from userspace, allocate kevent for each one
+ * and add them into appropriate kevent_storages,
+ * e.g. sockets, inodes and so on...
+ * Ready events will replace ones provided by used and number
+ * of ready events is returned.
+ * User must check ret_flags field of each ukevent structure
+ * to determine if it is fired or failed event.
+ */
+static int kevent_user_ctl_add(struct kevent_user *u, unsigned int num, void __user *arg)
+{
+ int err, cerr = 0, knum = 0, rnum = 0, i;
+ void __user *orig = arg;
+ struct ukevent uk;
+
+ mutex_lock(&u->ctl_mutex);
+
+ err = -EINVAL;
+ if (u->kevent_num + num >= KEVENT_MAX_EVENTS)
+ goto out_remove;
+
+ if (num > KEVENT_MIN_BUFFS_ALLOC) {
+ struct ukevent *ukev;
+
+ ukev = kevent_get_user(num, arg);
+ if (ukev) {
+ for (i = 0; i < num; ++i) {
+ err = kevent_user_add_ukevent(&ukev[i], u);
+ if (err) {
+ kevent_stat_im(u);
+ if (i != rnum)
+ memcpy(&ukev[rnum], &ukev[i], sizeof(struct ukevent));
+ rnum++;
+ } else
+ knum++;
+ }
+ if (copy_to_user(orig, ukev, rnum*sizeof(struct ukevent)))
+ cerr = -EFAULT;
+ kfree(ukev);
+ goto out_setup;
+ }
+ }
+
+ for (i = 0; i < num; ++i) {
+ if (copy_from_user(&uk, arg, sizeof(struct ukevent))) {
+ cerr = -EFAULT;
+ break;
+ }
+ arg += sizeof(struct ukevent);
+
+ err = kevent_user_add_ukevent(&uk, u);
+ if (err) {
+ kevent_stat_im(u);
+ if (copy_to_user(orig, &uk, sizeof(struct ukevent))) {
+ cerr = -EFAULT;
+ break;
+ }
+ orig += sizeof(struct ukevent);
+ rnum++;
+ } else
+ knum++;
+ }
+
+out_setup:
+ if (cerr < 0) {
+ err = cerr;
+ goto out_remove;
+ }
+
+ err = rnum;
+out_remove:
+ mutex_unlock(&u->ctl_mutex);
+
+ return err;
+}
+
+/*
+ * In nonblocking mode it returns as many events as possible, but not more than @max_nr.
+ * In blocking mode it waits until timeout or if at least @min_nr events are ready.
+ */
+static int kevent_user_wait(struct file *file, struct kevent_user *u,
+ unsigned int min_nr, unsigned int max_nr, unsigned int timeout,
+ void __user *buf)
+{
+ struct kevent *k;
+ int num = 0;
+
+ if (!(file->f_flags & O_NONBLOCK)) {
+ wait_event_interruptible_timeout(u->wait,
+ u->ready_num >= min_nr, msecs_to_jiffies(timeout));
+ }
+
+ while (num < max_nr && ((k = kqueue_dequeue_ready(u)) != NULL)) {
+ if (copy_to_user(buf + num*sizeof(struct ukevent),
+ &k->event, sizeof(struct ukevent)))
+ break;
+
+ /*
+ * If it is one-shot kevent, it has been removed already from
+ * origin's queue, so we can easily free it here.
+ */
+ if (k->event.req_flags & KEVENT_REQ_ONESHOT)
+ kevent_finish_user(k, 1);
+ ++num;
+ kevent_stat_wait(u);
+ }
+
+ return num;
+}
+
+static struct file_operations kevent_user_fops = {
+ .mmap = kevent_user_mmap,
+ .open = kevent_user_open,
+ .release = kevent_user_release,
+ .poll = kevent_user_poll,
+ .owner = THIS_MODULE,
+};
+
+static struct miscdevice kevent_miscdev = {
+ .minor = MISC_DYNAMIC_MINOR,
+ .name = kevent_name,
+ .fops = &kevent_user_fops,
+};
+
+
+/*
+ * Userspace control block creation and initialization.
+ */
+static int kevent_ctl_init(void)
+{
+ struct kevent_user *u;
+ struct file *file;
+ int fd, ret;
+
+ fd = get_unused_fd();
+ if (fd < 0)
+ return fd;
+
+ file = get_empty_filp();
+ if (!file) {
+ ret = -ENFILE;
+ goto out_put_fd;
+ }
+
+ u = kevent_user_alloc();
+ if (unlikely(!u)) {
+ ret = -ENOMEM;
+ goto out_put_file;
+ }
+
+ file->f_op = &kevent_user_fops;
+ file->f_vfsmnt = mntget(kevent_mnt);
+ file->f_dentry = dget(kevent_mnt->mnt_root);
+ file->f_mapping = file->f_dentry->d_inode->i_mapping;
+ file->f_mode = FMODE_READ;
+ file->f_flags = O_RDONLY;
+ file->private_data = u;
+
+ fd_install(fd, file);
+
+ return fd;
+
+out_put_file:
+ put_filp(file);
+out_put_fd:
+ put_unused_fd(fd);
+ return ret;
+}
+
+static int kevent_ctl_process(struct file *file, unsigned int cmd, unsigned int num, void __user *arg)
+{
+ int err;
+ struct kevent_user *u = file->private_data;
+
+ if (!u || num > KEVENT_MAX_EVENTS)
+ return -EINVAL;
+
+ switch (cmd) {
+ case KEVENT_CTL_ADD:
+ err = kevent_user_ctl_add(u, num, arg);
+ break;
+ case KEVENT_CTL_REMOVE:
+ err = kevent_user_ctl_remove(u, num, arg);
+ break;
+ case KEVENT_CTL_MODIFY:
+ err = kevent_user_ctl_modify(u, num, arg);
+ break;
+ default:
+ err = -EINVAL;
+ break;
+ }
+
+ return err;
+}
+
+/*
+ * Used to get ready kevents from queue.
+ * @ctl_fd - kevent control descriptor which must be obtained through kevent_ctl(KEVENT_CTL_INIT).
+ * @min_nr - minimum number of ready kevents.
+ * @max_nr - maximum number of ready kevents.
+ * @timeout - timeout in milliseconds to wait until some events are ready.
+ * @buf - buffer to place ready events.
+ * @flags - ununsed for now (will be used for mmap implementation).
+ */
+asmlinkage long sys_kevent_get_events(int ctl_fd, unsigned int min_nr, unsigned int max_nr,
+ unsigned int timeout, void __user *buf, unsigned flags)
+{
+ int err = -EINVAL;
+ struct file *file;
+ struct kevent_user *u;
+
+ file = fget(ctl_fd);
+ if (!file)
+ return -ENODEV;
+
+ if (file->f_op != &kevent_user_fops)
+ goto out_fput;
+ u = file->private_data;
+
+ err = kevent_user_wait(file, u, min_nr, max_nr, timeout, buf);
+out_fput:
+ fput(file);
+ return err;
+}
+
+/*
+ * This syscall is used to perform various control operations
+ * on given kevent queue, which is obtained through kevent file descriptor @fd.
+ * @cmd - type of operation.
+ * @num - number of kevents to be processed.
+ * @arg - pointer to array of struct ukevent.
+ */
+asmlinkage long sys_kevent_ctl(int fd, unsigned int cmd, unsigned int num, void __user *arg)
+{
+ int err = -EINVAL;
+ struct file *file;
+
+ if (cmd == KEVENT_CTL_INIT)
+ return kevent_ctl_init();
+
+ file = fget(fd);
+ if (!file)
+ return -ENODEV;
+
+ if (file->f_op != &kevent_user_fops)
+ goto out_fput;
+
+ err = kevent_ctl_process(file, cmd, num, arg);
+
+out_fput:
+ fput(file);
+ return err;
+}
+
+/*
+ * Kevent subsystem initialization - create kevent cache and register
+ * filesystem to get control file descriptors from.
+ */
+static int __devinit kevent_user_init(void)
+{
+ int err = 0;
+
+ kevent_cache = kmem_cache_create("kevent_cache",
+ sizeof(struct kevent), 0, SLAB_PANIC, NULL, NULL);
+
+ err = register_filesystem(&kevent_fs_type);
+ if (err)
+ panic("%s: failed to register filesystem: err=%d.\n",
+ kevent_name, err);
+
+ kevent_mnt = kern_mount(&kevent_fs_type);
+ if (IS_ERR(kevent_mnt))
+ panic("%s: failed to mount silesystem: err=%ld.\n",
+ kevent_name, PTR_ERR(kevent_mnt));
+
+ err = misc_register(&kevent_miscdev);
+ if (err) {
+ printk(KERN_ERR "Failed to register kevent miscdev: err=%d.\n", err);
+ goto err_out_exit;
+ }
+
+ printk("KEVENT subsystem has been successfully registered.\n");
+
+ return 0;
+
+err_out_exit:
+ mntput(kevent_mnt);
+ unregister_filesystem(&kevent_fs_type);
+
+ return err;
+}
+
+static void __devexit kevent_user_fini(void)
+{
+ misc_deregister(&kevent_miscdev);
+ mntput(kevent_mnt);
+ unregister_filesystem(&kevent_fs_type);
+}
+
+module_init(kevent_user_init);
+module_exit(kevent_user_fini);
diff --git a/kernel/sys_ni.c b/kernel/sys_ni.c
index 6991bec..8d3769b 100644
--- a/kernel/sys_ni.c
+++ b/kernel/sys_ni.c
@@ -122,6 +122,9 @@ cond_syscall(ppc_rtas);
cond_syscall(sys_spu_run);
cond_syscall(sys_spu_create);

+cond_syscall(sys_kevent_get_events);
+cond_syscall(sys_kevent_ctl);
+
/* mmu depending weak syscall entries */
cond_syscall(sys_mprotect);
cond_syscall(sys_msync);

2006-08-21 09:56:49

by Evgeniy Polyakov

[permalink] [raw]
Subject: [take12 3/3] kevent: Timer notifications.



Timer notifications.

Timer notifications can be used for fine grained per-process time
management, since interval timers are very inconvenient to use,
and they are limited.

Signed-off-by: Evgeniy Polyakov <[email protected]>

diff --git a/kernel/kevent/kevent_timer.c b/kernel/kevent/kevent_timer.c
new file mode 100644
index 0000000..5217cd1
--- /dev/null
+++ b/kernel/kevent/kevent_timer.c
@@ -0,0 +1,107 @@
+/*
+ * kevent_timer.c
+ *
+ * 2006 Copyright (c) Evgeniy Polyakov <[email protected]>
+ * All rights reserved.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software
+ * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA
+ */
+
+#include <linux/kernel.h>
+#include <linux/types.h>
+#include <linux/list.h>
+#include <linux/slab.h>
+#include <linux/spinlock.h>
+#include <linux/timer.h>
+#include <linux/jiffies.h>
+#include <linux/kevent.h>
+
+struct kevent_timer
+{
+ struct timer_list ktimer;
+ struct kevent_storage ktimer_storage;
+};
+
+static void kevent_timer_func(unsigned long data)
+{
+ struct kevent *k = (struct kevent *)data;
+ struct timer_list *t = k->st->origin;
+
+ kevent_storage_ready(k->st, NULL, KEVENT_MASK_ALL);
+ mod_timer(t, jiffies + msecs_to_jiffies(k->event.id.raw[0]));
+}
+
+static struct lock_class_key kevent_timer_key;
+
+static int kevent_timer_enqueue(struct kevent *k)
+{
+ int err;
+ struct kevent_timer *t;
+
+ t = kmalloc(sizeof(struct kevent_timer), GFP_KERNEL);
+ if (!t)
+ return -ENOMEM;
+
+ setup_timer(&t->ktimer, &kevent_timer_func, (unsigned long)k);
+
+ err = kevent_storage_init(&t->ktimer, &t->ktimer_storage);
+ if (err)
+ goto err_out_free;
+ lockdep_set_class(&t->ktimer_storage.lock, &kevent_timer_key);
+
+ err = kevent_storage_enqueue(&t->ktimer_storage, k);
+ if (err)
+ goto err_out_st_fini;
+
+ mod_timer(&t->ktimer, jiffies + msecs_to_jiffies(k->event.id.raw[0]));
+
+ return 0;
+
+err_out_st_fini:
+ kevent_storage_fini(&t->ktimer_storage);
+err_out_free:
+ kfree(t);
+
+ return err;
+}
+
+static int kevent_timer_dequeue(struct kevent *k)
+{
+ struct kevent_storage *st = k->st;
+ struct kevent_timer *t = container_of(st, struct kevent_timer, ktimer_storage);
+
+ del_timer_sync(&t->ktimer);
+ kevent_storage_dequeue(st, k);
+ kfree(t);
+
+ return 0;
+}
+
+static int kevent_timer_callback(struct kevent *k)
+{
+ k->event.ret_data[0] = (__u32)jiffies;
+ return 1;
+}
+
+static int __init kevent_init_timer(void)
+{
+ struct kevent_callbacks tc = {
+ .callback = &kevent_timer_callback,
+ .enqueue = &kevent_timer_enqueue,
+ .dequeue = &kevent_timer_dequeue};
+
+ return kevent_add_callbacks(&tc, KEVENT_TIMER);
+}
+module_init(kevent_init_timer);

2006-08-21 11:12:42

by Christoph Hellwig

[permalink] [raw]
Subject: Re: [take12 3/3] kevent: Timer notifications.

On Mon, Aug 21, 2006 at 02:19:49PM +0400, Evgeniy Polyakov wrote:
>
>
> Timer notifications.
>
> Timer notifications can be used for fine grained per-process time
> management, since interval timers are very inconvenient to use,
> and they are limited.

Shouldn't this at leat use a hrtimer?

> new file mode 100644
> index 0000000..5217cd1
> --- /dev/null
> +++ b/kernel/kevent/kevent_timer.c
> @@ -0,0 +1,107 @@
> +/*
> + * kevent_timer.c

You still include those sill filename ontop of file comments..

> +static struct lock_class_key kevent_timer_key;
> +
> +static int kevent_timer_enqueue(struct kevent *k)
> +{
> + int err;
> + struct kevent_timer *t;
> +
> + t = kmalloc(sizeof(struct kevent_timer), GFP_KERNEL);
> + if (!t)
> + return -ENOMEM;
> +
> + setup_timer(&t->ktimer, &kevent_timer_func, (unsigned long)k);
> +
> + err = kevent_storage_init(&t->ktimer, &t->ktimer_storage);
> + if (err)
> + goto err_out_free;
> + lockdep_set_class(&t->ktimer_storage.lock, &kevent_timer_key);

When looking at the kevent_storage_init callers most need to do
those lockdep_set_class class. Shouldn't kevent_storage_init just
get a "struct lock_class_key *" argument?

> +static int kevent_timer_callback(struct kevent *k)
> +{
> + k->event.ret_data[0] = (__u32)jiffies;

This is returned to userspace, isn't it? raw jiffies should never be
user-visible. Please convert this to an unit that actually makes sense
for userspace (probably nanoseconds)

> +static int __init kevent_init_timer(void)
> +{
> + struct kevent_callbacks tc = {
> + .callback = &kevent_timer_callback,
> + .enqueue = &kevent_timer_enqueue,
> + .dequeue = &kevent_timer_dequeue};

I think this should be static, and the normal style to write it would be:

static struct kevent_callbacks tc = {
.callback = kevent_timer_callback,
.enqueue = kevent_timer_enqueue,
.dequeue = kevent_timer_dequeue,
};

also please consider makring all the kevent_callbacks structs const
to avoid false cacheline sharing and accidental modification, similar
to what we did to various other operation vectors.

2006-08-21 11:20:34

by Evgeniy Polyakov

[permalink] [raw]
Subject: Re: [take12 3/3] kevent: Timer notifications.

On Mon, Aug 21, 2006 at 12:12:39PM +0100, Christoph Hellwig ([email protected]) wrote:
> On Mon, Aug 21, 2006 at 02:19:49PM +0400, Evgeniy Polyakov wrote:
> >
> >
> > Timer notifications.
> >
> > Timer notifications can be used for fine grained per-process time
> > management, since interval timers are very inconvenient to use,
> > and they are limited.
>
> Shouldn't this at leat use a hrtimer?

Not everymachine has them and getting into account possibility that
userspace can be scheduled away, it will be overkill.

> > new file mode 100644
> > index 0000000..5217cd1
> > --- /dev/null
> > +++ b/kernel/kevent/kevent_timer.c
> > @@ -0,0 +1,107 @@
> > +/*
> > + * kevent_timer.c
>
> You still include those sill filename ontop of file comments..

Sorry, it looks like I updated not every file.

> > +static struct lock_class_key kevent_timer_key;
> > +
> > +static int kevent_timer_enqueue(struct kevent *k)
> > +{
> > + int err;
> > + struct kevent_timer *t;
> > +
> > + t = kmalloc(sizeof(struct kevent_timer), GFP_KERNEL);
> > + if (!t)
> > + return -ENOMEM;
> > +
> > + setup_timer(&t->ktimer, &kevent_timer_func, (unsigned long)k);
> > +
> > + err = kevent_storage_init(&t->ktimer, &t->ktimer_storage);
> > + if (err)
> > + goto err_out_free;
> > + lockdep_set_class(&t->ktimer_storage.lock, &kevent_timer_key);
>
> When looking at the kevent_storage_init callers most need to do
> those lockdep_set_class class. Shouldn't kevent_storage_init just
> get a "struct lock_class_key *" argument?

It will not work, since inode is used for both socket and inode
notifications (to save some space in struct sock), lockdep initalization
is performed on the highest level, so I put it alone.

> > +static int kevent_timer_callback(struct kevent *k)
> > +{
> > + k->event.ret_data[0] = (__u32)jiffies;
>
> This is returned to userspace, isn't it? raw jiffies should never be
> user-visible. Please convert this to an unit that actually makes sense
> for userspace (probably nanoseconds)

It is just to show something, my userspace application just prints it to
stdout to show kernelspace difference between events.
Andrew pointed to it too, I think it is easier to just remove this string.

> > +static int __init kevent_init_timer(void)
> > +{
> > + struct kevent_callbacks tc = {
> > + .callback = &kevent_timer_callback,
> > + .enqueue = &kevent_timer_enqueue,
> > + .dequeue = &kevent_timer_dequeue};
>
> I think this should be static, and the normal style to write it would be:
>
> static struct kevent_callbacks tc = {
> .callback = kevent_timer_callback,
> .enqueue = kevent_timer_enqueue,
> .dequeue = kevent_timer_dequeue,
> };
>
> also please consider makring all the kevent_callbacks structs const
> to avoid false cacheline sharing and accidental modification, similar
> to what we did to various other operation vectors.

Ok.

--
Evgeniy Polyakov

2006-08-21 11:27:43

by Arjan van de Ven

[permalink] [raw]
Subject: Re: [take12 3/3] kevent: Timer notifications.

On Mon, 2006-08-21 at 15:18 +0400, Evgeniy Polyakov wrote:
> ]> > + lockdep_set_class(&t->ktimer_storage.lock, &kevent_timer_key);
> >
> > When looking at the kevent_storage_init callers most need to do
> > those lockdep_set_class class. Shouldn't kevent_storage_init just
> > get a "struct lock_class_key *" argument?
>
> It will not work, since inode is used for both socket and inode
> notifications (to save some space in struct sock), lockdep initalization
> is performed on the highest level, so I put it alone.

Call me a cynic, but I'm always a bit sceptical about needing lockdep
annotations like this... Can you explain why you need it in this case,
including the proof that it's safe?


2006-08-21 12:01:04

by Evgeniy Polyakov

[permalink] [raw]
Subject: Re: [take12 3/3] kevent: Timer notifications.

On Mon, Aug 21, 2006 at 01:27:22PM +0200, Arjan van de Ven ([email protected]) wrote:
> On Mon, 2006-08-21 at 15:18 +0400, Evgeniy Polyakov wrote:
> > ]> > + lockdep_set_class(&t->ktimer_storage.lock, &kevent_timer_key);
> > >
> > > When looking at the kevent_storage_init callers most need to do
> > > those lockdep_set_class class. Shouldn't kevent_storage_init just
> > > get a "struct lock_class_key *" argument?
> >
> > It will not work, since inode is used for both socket and inode
> > notifications (to save some space in struct sock), lockdep initalization
> > is performed on the highest level, so I put it alone.
>
> Call me a cynic, but I'm always a bit sceptical about needing lockdep
> annotations like this... Can you explain why you need it in this case,
> including the proof that it's safe?

Ok, again :)
Kevent uses meaning of storage of kevents without any special knowledge
what is is (inode, socket, file, timer - anything), so it's
initalization function among other things calls spin_lock_init().
Lockdep inserts static variable just before real spinlock
initialization, and since all locks are initialized in the same place,
all of them get the same static magic.
Later those locks are used in different context (for example inode
notificatins only in process context, but socket can be called from BH
context), since lockdep thinks they are the same, it screams.
Obviously the same inode can not be used for sockets and files, so I
added above lockdep initialization.

--
Evgeniy Polyakov

2006-08-21 12:10:15

by Evgeniy Polyakov

[permalink] [raw]
Subject: Re: [take12 3/3] kevent: Timer notifications.

On Mon, Aug 21, 2006 at 12:12:39PM +0100, Christoph Hellwig ([email protected]) wrote:
> > +static int __init kevent_init_timer(void)
> > +{
> > + struct kevent_callbacks tc = {
> > + .callback = &kevent_timer_callback,
> > + .enqueue = &kevent_timer_enqueue,
> > + .dequeue = &kevent_timer_dequeue};
>
> I think this should be static, and the normal style to write it would be:
>
> static struct kevent_callbacks tc = {
> .callback = kevent_timer_callback,
> .enqueue = kevent_timer_enqueue,
> .dequeue = kevent_timer_dequeue,
> };
>
> also please consider makring all the kevent_callbacks structs const
> to avoid false cacheline sharing and accidental modification, similar
> to what we did to various other operation vectors.

Actually I do not think it should be static, since it is only used for
initialization and it's members are copied into main structure.

--
Evgeniy Polyakov

2006-08-21 12:14:16

by Arjan van de Ven

[permalink] [raw]
Subject: Re: [take12 3/3] kevent: Timer notifications.

On Mon, 2006-08-21 at 15:59 +0400, Evgeniy Polyakov wrote:
> On Mon, Aug 21, 2006 at 01:27:22PM +0200, Arjan van de Ven ([email protected]) wrote:
> > On Mon, 2006-08-21 at 15:18 +0400, Evgeniy Polyakov wrote:
> > > ]> > + lockdep_set_class(&t->ktimer_storage.lock, &kevent_timer_key);
> > > >
> > > > When looking at the kevent_storage_init callers most need to do
> > > > those lockdep_set_class class. Shouldn't kevent_storage_init just
> > > > get a "struct lock_class_key *" argument?
> > >
> > > It will not work, since inode is used for both socket and inode
> > > notifications (to save some space in struct sock), lockdep initalization
> > > is performed on the highest level, so I put it alone.
> >
> > Call me a cynic, but I'm always a bit sceptical about needing lockdep
> > annotations like this... Can you explain why you need it in this case,
> > including the proof that it's safe?
>
> Ok, again :)
> Kevent uses meaning of storage of kevents without any special knowledge
> what is is (inode, socket, file, timer - anything), so it's
> initalization function among other things calls spin_lock_init().
> Lockdep inserts static variable just before real spinlock
> initialization, and since all locks are initialized in the same place,
> all of them get the same static magic.
> Later those locks are used in different context (for example inode
> notificatins only in process context, but socket can be called from BH
> context), since lockdep thinks they are the same, it screams.
> Obviously the same inode can not be used for sockets and files, so I
> added above lockdep initialization.

ok... but since kevent doesn't know what is in it, wouldn't the locking
rules need to be such that it can deal with the "worst case" event? Eg
do you really have both no knowledge of what is inside, and specific
locking implementations for the different types of content??? That
sounds rather error prone.....
(if you had consistent locking rules lockdep would be perfectly fine
with that)


--
if you want to mail me at work (you don't), use arjan (at) linux.intel.com

2006-08-21 12:26:57

by Evgeniy Polyakov

[permalink] [raw]
Subject: Re: [take12 3/3] kevent: Timer notifications.

On Mon, Aug 21, 2006 at 02:13:49PM +0200, Arjan van de Ven ([email protected]) wrote:
> > > Call me a cynic, but I'm always a bit sceptical about needing lockdep
> > > annotations like this... Can you explain why you need it in this case,
> > > including the proof that it's safe?
> >
> > Ok, again :)
> > Kevent uses meaning of storage of kevents without any special knowledge
> > what is is (inode, socket, file, timer - anything), so it's
> > initalization function among other things calls spin_lock_init().
> > Lockdep inserts static variable just before real spinlock
> > initialization, and since all locks are initialized in the same place,
> > all of them get the same static magic.
> > Later those locks are used in different context (for example inode
> > notificatins only in process context, but socket can be called from BH
> > context), since lockdep thinks they are the same, it screams.
> > Obviously the same inode can not be used for sockets and files, so I
> > added above lockdep initialization.
>
> ok... but since kevent doesn't know what is in it, wouldn't the locking
> rules need to be such that it can deal with the "worst case" event? Eg
> do you really have both no knowledge of what is inside, and specific
> locking implementations for the different types of content??? That
> sounds rather error prone.....
> (if you had consistent locking rules lockdep would be perfectly fine
> with that)

It is tricky - currently there is RCU protection for storage list
traversal from each origin, i.e. the that path takes a lock when
kevent is really ready (to put it into ready list) and to check kevent's
flags (actually event that could be replced by more strict callback's
return values).
No existing origins call kevent_storage_ready() from context where it
can be reentered (it guarantees from other side that locks will not be
reentered), so in theory that lockdep tricks are not needed right now
(they were added before I moved list traversal to RCU).

In the current code storage->lock is safe and correct from lockdep point
of view (since different origins are never crossed), so I have not
changed reinit.

> --
> if you want to mail me at work (you don't), use arjan (at) linux.intel.com

--
Evgeniy Polyakov

2006-08-21 12:41:59

by Evgeniy Polyakov

[permalink] [raw]
Subject: [take12 4/3] kevent: Comment cleanup.

Remove file name from comments.

dda1ae6fe306b485a91ebb5873eeee4bba06aebf
diff --git a/kernel/kevent/kevent.c b/kernel/kevent/kevent.c
index 2872aa2..02ecf30 100644
--- a/kernel/kevent/kevent.c
+++ b/kernel/kevent/kevent.c
@@ -1,6 +1,4 @@
/*
- * kevent.c
- *
* 2006 Copyright (c) Evgeniy Polyakov <[email protected]>
* All rights reserved.
*
diff --git a/kernel/kevent/kevent_poll.c b/kernel/kevent/kevent_poll.c
index 75a75d1..0233a4d 100644
--- a/kernel/kevent/kevent_poll.c
+++ b/kernel/kevent/kevent_poll.c
@@ -1,6 +1,4 @@
/*
- * kevent_poll.c
- *
* 2006 Copyright (c) Evgeniy Polyakov <[email protected]>
* All rights reserved.
*
diff --git a/kernel/kevent/kevent_timer.c b/kernel/kevent/kevent_timer.c
index 5217cd1..08dfc55 100644
--- a/kernel/kevent/kevent_timer.c
+++ b/kernel/kevent/kevent_timer.c
@@ -1,6 +1,4 @@
/*
- * kevent_timer.c
- *
* 2006 Copyright (c) Evgeniy Polyakov <[email protected]>
* All rights reserved.
*

--
Evgeniy Polyakov

2006-08-22 04:37:37

by Andrew Morton

[permalink] [raw]
Subject: Re: [take12 3/3] kevent: Timer notifications.

On Mon, 21 Aug 2006 16:09:34 +0400
Evgeniy Polyakov <[email protected]> wrote:

> On Mon, Aug 21, 2006 at 12:12:39PM +0100, Christoph Hellwig ([email protected]) wrote:
> > > +static int __init kevent_init_timer(void)
> > > +{
> > > + struct kevent_callbacks tc = {
> > > + .callback = &kevent_timer_callback,
> > > + .enqueue = &kevent_timer_enqueue,
> > > + .dequeue = &kevent_timer_dequeue};
> >
> > I think this should be static, and the normal style to write it would be:
> >
> > static struct kevent_callbacks tc = {
> > .callback = kevent_timer_callback,
> > .enqueue = kevent_timer_enqueue,
> > .dequeue = kevent_timer_dequeue,
> > };
> >
> > also please consider makring all the kevent_callbacks structs const
> > to avoid false cacheline sharing and accidental modification, similar
> > to what we did to various other operation vectors.
>
> Actually I do not think it should be static, since it is only used for
> initialization and it's members are copied into main structure.
>

It should be static __initdata a) so we don't need to construct it at
runtime and b) so it gets dropped from memory after initcalls have run.

(But given that kevent_init_timer() also gets dropped from memory after initcalls
it hardly matters).

2006-08-22 05:49:35

by Evgeniy Polyakov

[permalink] [raw]
Subject: Re: [take12 3/3] kevent: Timer notifications.

On Mon, Aug 21, 2006 at 09:36:50PM -0700, Andrew Morton ([email protected]) wrote:
> On Mon, 21 Aug 2006 16:09:34 +0400
> Evgeniy Polyakov <[email protected]> wrote:
>
> > On Mon, Aug 21, 2006 at 12:12:39PM +0100, Christoph Hellwig ([email protected]) wrote:
> > > > +static int __init kevent_init_timer(void)
> > > > +{
> > > > + struct kevent_callbacks tc = {
> > > > + .callback = &kevent_timer_callback,
> > > > + .enqueue = &kevent_timer_enqueue,
> > > > + .dequeue = &kevent_timer_dequeue};
> > >
> > > I think this should be static, and the normal style to write it would be:
> > >
> > > static struct kevent_callbacks tc = {
> > > .callback = kevent_timer_callback,
> > > .enqueue = kevent_timer_enqueue,
> > > .dequeue = kevent_timer_dequeue,
> > > };
> > >
> > > also please consider makring all the kevent_callbacks structs const
> > > to avoid false cacheline sharing and accidental modification, similar
> > > to what we did to various other operation vectors.
> >
> > Actually I do not think it should be static, since it is only used for
> > initialization and it's members are copied into main structure.
> >
>
> It should be static __initdata a) so we don't need to construct it at
> runtime and b) so it gets dropped from memory after initcalls have run.
>
> (But given that kevent_init_timer() also gets dropped from memory after initcalls
> it hardly matters).

That's what I'm talking about.

--
Evgeniy Polyakov

2006-08-22 07:01:33

by Nicholas Miell

[permalink] [raw]
Subject: Re: [take12 0/3] kevent: Generic event handling mechanism.

On Mon, 2006-08-21 at 14:19 +0400, Evgeniy Polyakov wrote:
> Generic event handling mechanism.

Since this is the sixth[1] event notification system that's getting
added to the kernel, could somebody please convince me that the
userspace API is right this time? (Evidently, the others weren't and are
now just backward compatibility bloat.)

Just looking at the proposed kevent API, it appears that the timer event
queuing mechanism can't be used for the queuing of POSIX.1b interval
timer events (i.e. via a SIGEV_KEVENT notification value in a struct
sigevent) because (being a very thin veneer over the internal kernel
timer system) you can't specify a clockid, the time value doesn't have
the flexibility of a struct itimerspec (no re-arm timeout or absolute
times), and there's no way to alter, disable or query a pending timer or
query a timer overrun count.

Overall, kevent timers appear to be inconvenient to use and limited
compared to POSIX interval timers (excepting the fact you can read their
expiray events out of a queue, of course).



[1] Previously: select, poll, AIO, epoll, and inotify. Did I miss any?

--
Nicholas Miell <[email protected]>

2006-08-22 07:25:33

by Evgeniy Polyakov

[permalink] [raw]
Subject: Re: [take12 0/3] kevent: Generic event handling mechanism.

On Tue, Aug 22, 2006 at 12:00:51AM -0700, Nicholas Miell ([email protected]) wrote:
> On Mon, 2006-08-21 at 14:19 +0400, Evgeniy Polyakov wrote:
> > Generic event handling mechanism.
>
> Since this is the sixth[1] event notification system that's getting
> added to the kernel, could somebody please convince me that the
> userspace API is right this time? (Evidently, the others weren't and are
> now just backward compatibility bloat.)
>
> Just looking at the proposed kevent API, it appears that the timer event
> queuing mechanism can't be used for the queuing of POSIX.1b interval
> timer events (i.e. via a SIGEV_KEVENT notification value in a struct
> sigevent) because (being a very thin veneer over the internal kernel
> timer system) you can't specify a clockid, the time value doesn't have
> the flexibility of a struct itimerspec (no re-arm timeout or absolute
> times), and there's no way to alter, disable or query a pending timer or
> query a timer overrun count.
>
> Overall, kevent timers appear to be inconvenient to use and limited
> compared to POSIX interval timers (excepting the fact you can read their
> expiray events out of a queue, of course).

Kevent timers are just trivial kevent user.
But even as is it is not that bad solution.
I, as user, do not want to know which timer is used - I only need to
get some signal when interval completed, especially I do not want to
have some troubles when timer with given clockid has disappeared.
Kevent timer can be trivially rearmed (acutally it is always rearmed
until one-shot flag is set).
Of course it can be disabled by removing requested kevent.
I can add possibility to alter timeout without removing kevent if there
is strong requirement for that.

Timer notifications were designed not from committee point of view, when
theoretical discussions end up in multi-megabyte documentation 99.9% of
which can not be used without major brain surgery.

I just implemented what I use, if you want more - say what you need.

> [1] Previously: select, poll, AIO, epoll, and inotify. Did I miss any?

Let me guess - kevent, which can do all above and a lot of other things?
And you forget netlink-based notificators - netlink, rtnetlink,
gennetlink, connector and tons of accounting application based on them,
kobject, kobject_uevent.
There also filessytem based ones - sysfs, procfs, debugfs, relayfs.

> --
> Nicholas Miell <[email protected]>

--
Evgeniy Polyakov

2006-08-22 08:18:16

by Nicholas Miell

[permalink] [raw]
Subject: Re: [take12 0/3] kevent: Generic event handling mechanism.

On Tue, 2006-08-22 at 11:24 +0400, Evgeniy Polyakov wrote:
> On Tue, Aug 22, 2006 at 12:00:51AM -0700, Nicholas Miell ([email protected]) wrote:
> > On Mon, 2006-08-21 at 14:19 +0400, Evgeniy Polyakov wrote:
> > > Generic event handling mechanism.
> >
> > Since this is the sixth[1] event notification system that's getting
> > added to the kernel, could somebody please convince me that the
> > userspace API is right this time? (Evidently, the others weren't and are
> > now just backward compatibility bloat.)
> >
> > Just looking at the proposed kevent API, it appears that the timer event
> > queuing mechanism can't be used for the queuing of POSIX.1b interval
> > timer events (i.e. via a SIGEV_KEVENT notification value in a struct
> > sigevent) because (being a very thin veneer over the internal kernel
> > timer system) you can't specify a clockid, the time value doesn't have
> > the flexibility of a struct itimerspec (no re-arm timeout or absolute
> > times), and there's no way to alter, disable or query a pending timer or
> > query a timer overrun count.
> >
> > Overall, kevent timers appear to be inconvenient to use and limited
> > compared to POSIX interval timers (excepting the fact you can read their
> > expiry events out of a queue, of course).
>
> Kevent timers are just trivial kevent user.
> But even as is it is not that bad solution.
> I, as user, do not want to know which timer is used - I only need to
> get some signal when interval completed, especially I do not want to
> have some troubles when timer with given clockid has disappeared.
> Kevent timer can be trivially rearmed (acutally it is always rearmed
> until one-shot flag is set).
> Of course it can be disabled by removing requested kevent.
> I can add possibility to alter timeout without removing kevent if there
> is strong requirement for that.
>

Is any of this documented anywhere? I'd think that any new userspace
interfaces should have man pages explaining their use and some example
code before getting merged into the kernel to shake out any interface
problems.


> Timer notifications were designed not from committee point of view, when
> theoretical discussions end up in multi-megabyte documentation 99.9% of
> which can not be used without major brain surgery.

Do you have any technical objections to the POSIX.1b interval timer
design to back up your insults?

> I just implemented what I use, if you want more - say what you need.

I don't know what I need, I just know what POSIX already has, and your
extensions don't appear to be compatible with that model and
deliberately designing something that has no hope of ever getting into
the POSIX standard or serving as the basis for whatever comes out of the
standard committee seems rather stupid. (Especially considering that
Linux's only viable competitor has already shipped a unified event
queuing API that does fit into the existing POSIX design.)

Ulrich Drepper is probably better able to speak on this than I am,
considering that he's involved with POSIX and is probably going to be
involved in the Linux libc work, whatever it may be.

>
> > [1] Previously: select, poll, AIO, epoll, and inotify. Did I miss any?
>
> Let me guess - kevent, which can do all above and a lot of other things?
> And you forget netlink-based notificators - netlink, rtnetlink,
> gennetlink, connector and tons of accounting application based on them,
> kobject, kobject_uevent.
> There also filessytem based ones - sysfs, procfs, debugfs, relayfs.

OK, so with literally a dozen different interfaces to queue events to
userspace, all of which are apparently inadequate and in need of
replacement by kevent, don't you want to slow down a bit and make sure
that the kevent API is correct before it becomes permanent and then just
has to be replaced *again* ?


--
Nicholas Miell <[email protected]>

2006-08-22 08:23:39

by David Miller

[permalink] [raw]
Subject: Re: [take12 0/3] kevent: Generic event handling mechanism.

From: Nicholas Miell <[email protected]>
Date: Tue, 22 Aug 2006 01:17:52 -0700

> Is any of this documented anywhere? I'd think that any new userspace
> interfaces should have man pages explaining their use and some example
> code before getting merged into the kernel to shake out any interface
> problems.

Get real.

Nobody made this requirement for things like splice() et al.

I think people are being mostly very unreasonable in the
demands they are making upon Evgeniy. It will only serve
to discourage the one person who is doing work to solve
these problems.

2006-08-22 08:38:08

by Evgeniy Polyakov

[permalink] [raw]
Subject: Re: [take12 0/3] kevent: Generic event handling mechanism.

On Tue, Aug 22, 2006 at 01:17:52AM -0700, Nicholas Miell ([email protected]) wrote:
> On Tue, 2006-08-22 at 11:24 +0400, Evgeniy Polyakov wrote:
> > On Tue, Aug 22, 2006 at 12:00:51AM -0700, Nicholas Miell ([email protected]) wrote:
> > > On Mon, 2006-08-21 at 14:19 +0400, Evgeniy Polyakov wrote:
> > > > Generic event handling mechanism.
> > >
> > > Since this is the sixth[1] event notification system that's getting
> > > added to the kernel, could somebody please convince me that the
> > > userspace API is right this time? (Evidently, the others weren't and are
> > > now just backward compatibility bloat.)
> > >
> > > Just looking at the proposed kevent API, it appears that the timer event
> > > queuing mechanism can't be used for the queuing of POSIX.1b interval
> > > timer events (i.e. via a SIGEV_KEVENT notification value in a struct
> > > sigevent) because (being a very thin veneer over the internal kernel
> > > timer system) you can't specify a clockid, the time value doesn't have
> > > the flexibility of a struct itimerspec (no re-arm timeout or absolute
> > > times), and there's no way to alter, disable or query a pending timer or
> > > query a timer overrun count.
> > >
> > > Overall, kevent timers appear to be inconvenient to use and limited
> > > compared to POSIX interval timers (excepting the fact you can read their
> > > expiry events out of a queue, of course).
> >
> > Kevent timers are just trivial kevent user.
> > But even as is it is not that bad solution.
> > I, as user, do not want to know which timer is used - I only need to
> > get some signal when interval completed, especially I do not want to
> > have some troubles when timer with given clockid has disappeared.
> > Kevent timer can be trivially rearmed (acutally it is always rearmed
> > until one-shot flag is set).
> > Of course it can be disabled by removing requested kevent.
> > I can add possibility to alter timeout without removing kevent if there
> > is strong requirement for that.
> >
>
> Is any of this documented anywhere? I'd think that any new userspace
> interfaces should have man pages explaining their use and some example
> code before getting merged into the kernel to shake out any interface
> problems.

There are two excellent articles on lwn.net

> > Timer notifications were designed not from committee point of view, when
> > theoretical discussions end up in multi-megabyte documentation 99.9% of
> > which can not be used without major brain surgery.
>
> Do you have any technical objections to the POSIX.1b interval timer
> design to back up your insults?

POSIX timers can have any design, but do not force others to use the
same.

> > I just implemented what I use, if you want more - say what you need.
>
> I don't know what I need, I just know what POSIX already has, and your

And I do know what I need, that is why I do it.

> extensions don't appear to be compatible with that model and
> deliberately designing something that has no hope of ever getting into
> the POSIX standard or serving as the basis for whatever comes out of the
> standard committee seems rather stupid. (Especially considering that
> Linux's only viable competitor has already shipped a unified event
> queuing API that does fit into the existing POSIX design.)

I think I even know what it is :)

> Ulrich Drepper is probably better able to speak on this than I am,
> considering that he's involved with POSIX and is probably going to be
> involved in the Linux libc work, whatever it may be.

Feel free to use POSIX timers, but do not force others to it too.

> >
> > > [1] Previously: select, poll, AIO, epoll, and inotify. Did I miss any?
> >
> > Let me guess - kevent, which can do all above and a lot of other things?
> > And you forget netlink-based notificators - netlink, rtnetlink,
> > gennetlink, connector and tons of accounting application based on them,
> > kobject, kobject_uevent.
> > There also filessytem based ones - sysfs, procfs, debugfs, relayfs.
>
> OK, so with literally a dozen different interfaces to queue events to
> userspace, all of which are apparently inadequate and in need of
> replacement by kevent, don't you want to slow down a bit and make sure
> that the kevent API is correct before it becomes permanent and then just
> has to be replaced *again* ?

I only though that license issues remains unresolved in that
linux-kernel@ flood, but not, I was wrong :)

I will ask just one question, do _you_ propose anything here?

> --
> Nicholas Miell <[email protected]>

--
Evgeniy Polyakov

2006-08-22 09:00:15

by Nicholas Miell

[permalink] [raw]
Subject: Re: [take12 0/3] kevent: Generic event handling mechanism.

On Tue, 2006-08-22 at 01:23 -0700, David Miller wrote:
> From: Nicholas Miell <[email protected]>
> Date: Tue, 22 Aug 2006 01:17:52 -0700
>
> > Is any of this documented anywhere? I'd think that any new userspace
> > interfaces should have man pages explaining their use and some example
> > code before getting merged into the kernel to shake out any interface
> > problems.
>
> Get real.
>
> Nobody made this requirement for things like splice() et al.
>
> I think people are being mostly very unreasonable in the
> demands they are making upon Evgeniy. It will only serve
> to discourage the one person who is doing work to solve
> these problems.

splice() is a single synchronous function call, and it's signature still
managed to change wildly during it's development.

In this brave new world of always stable kernel development, the time a
new interface has for public testing before a new kernel release is
drastically shorter than the old unstable development series, and if
nobody is documenting how this stuff is supposed to work and
demonstrating how it will be used, then mistakes are bound to slip
through.

--
Nicholas Miell <[email protected]>

2006-08-22 09:30:18

by Nicholas Miell

[permalink] [raw]
Subject: Re: [take12 0/3] kevent: Generic event handling mechanism.

On Tue, 2006-08-22 at 12:37 +0400, Evgeniy Polyakov wrote:
> On Tue, Aug 22, 2006 at 01:17:52AM -0700, Nicholas Miell ([email protected]) wrote:
> > On Tue, 2006-08-22 at 11:24 +0400, Evgeniy Polyakov wrote:
> > > On Tue, Aug 22, 2006 at 12:00:51AM -0700, Nicholas Miell ([email protected]) wrote:
> > > > On Mon, 2006-08-21 at 14:19 +0400, Evgeniy Polyakov wrote:
> > > > > Generic event handling mechanism.
> > > >
> > > > Since this is the sixth[1] event notification system that's getting
> > > > added to the kernel, could somebody please convince me that the
> > > > userspace API is right this time? (Evidently, the others weren't and are
> > > > now just backward compatibility bloat.)
> > > >
> > > > Just looking at the proposed kevent API, it appears that the timer event
> > > > queuing mechanism can't be used for the queuing of POSIX.1b interval
> > > > timer events (i.e. via a SIGEV_KEVENT notification value in a struct
> > > > sigevent) because (being a very thin veneer over the internal kernel
> > > > timer system) you can't specify a clockid, the time value doesn't have
> > > > the flexibility of a struct itimerspec (no re-arm timeout or absolute
> > > > times), and there's no way to alter, disable or query a pending timer or
> > > > query a timer overrun count.
> > > >
> > > > Overall, kevent timers appear to be inconvenient to use and limited
> > > > compared to POSIX interval timers (excepting the fact you can read their
> > > > expiry events out of a queue, of course).
> > >
> > > Kevent timers are just trivial kevent user.
> > > But even as is it is not that bad solution.
> > > I, as user, do not want to know which timer is used - I only need to
> > > get some signal when interval completed, especially I do not want to
> > > have some troubles when timer with given clockid has disappeared.
> > > Kevent timer can be trivially rearmed (acutally it is always rearmed
> > > until one-shot flag is set).
> > > Of course it can be disabled by removing requested kevent.
> > > I can add possibility to alter timeout without removing kevent if there
> > > is strong requirement for that.
> > >
> >
> > Is any of this documented anywhere? I'd think that any new userspace
> > interfaces should have man pages explaining their use and some example
> > code before getting merged into the kernel to shake out any interface
> > problems.
>
> There are two excellent articles on lwn.net

Google knows of one and it doesn't actually explain how to use kevents.


> >
> > OK, so with literally a dozen different interfaces to queue events to
> > userspace, all of which are apparently inadequate and in need of
> > replacement by kevent, don't you want to slow down a bit and make sure
> > that the kevent API is correct before it becomes permanent and then just
> > has to be replaced *again* ?
>
> I only though that license issues remains unresolved in that
> linux-kernel@ flood, but not, I was wrong :)
>
> I will ask just one question, do _you_ propose anything here?
>

struct sigevent sigev = {
.sigev_notify = SIGEV_KEVENT,
.sigev_kevent_fd = kev_fd,
.sigev_value.sival_ptr = &MyCookie
};

struct itimerspec its = {
.it_value = { ... },
.it_interval = { ... }
};

struct timespec timeout = { .. };

struct ukevent events[max];

timer_t timer;

timer_create(CLOCK_MONOTONIC, &sigev, &timer);
timer_settime(timer, 0, &its, NULL);

/* ... */

kevent_get_events(kev_fd, min, max, &timeout, events, 0);



Which isn't all that different from what Ulrich Drepper suggested and
Solaris does right now. (timer_create would probably end up calling
kevent_ctl itself, but it obviously can't do that unless kevents
actually support real interval timers).

--
Nicholas Miell <[email protected]>

2006-08-22 10:04:05

by Evgeniy Polyakov

[permalink] [raw]
Subject: Re: [take12 0/3] kevent: Generic event handling mechanism.

On Tue, Aug 22, 2006 at 02:29:48AM -0700, Nicholas Miell ([email protected]) wrote:
> > > Is any of this documented anywhere? I'd think that any new userspace
> > > interfaces should have man pages explaining their use and some example
> > > code before getting merged into the kernel to shake out any interface
> > > problems.
> >
> > There are two excellent articles on lwn.net
>
> Google knows of one and it doesn't actually explain how to use kevents.

http://lwn.net/Articles/192964/
http://lwn.net/Articles/172844/

In the thread there were enough links to homepage where you can find
several examples of how to use kevents (and timers among others) with
old interfaces and new ones.

> > I will ask just one question, do _you_ propose anything here?
> >
>
> struct sigevent sigev = {
> .sigev_notify = SIGEV_KEVENT,
> .sigev_kevent_fd = kev_fd,
> .sigev_value.sival_ptr = &MyCookie
> };
>
> struct itimerspec its = {
> .it_value = { ... },
> .it_interval = { ... }
> };
>
> struct timespec timeout = { .. };
>
> struct ukevent events[max];
>
> timer_t timer;
>
> timer_create(CLOCK_MONOTONIC, &sigev, &timer);
> timer_settime(timer, 0, &its, NULL);
>
> /* ... */
>
> kevent_get_events(kev_fd, min, max, &timeout, events, 0);
>
>
>
> Which isn't all that different from what Ulrich Drepper suggested and
> Solaris does right now. (timer_create would probably end up calling
> kevent_ctl itself, but it obviously can't do that unless kevents
> actually support real interval timers).

Ugh, rtsignals... Their's problems forced me to not implement
"interrupt"-like mechanism for kevents in addition to dequeueing.

Anyway, it seems you did not read the whole thread, homepage, lwn and
userpsace examples, so you do not understand what kevents are.

They are userspace requests which are returned back when they are ready.
It means that userspace must provide something to kernel and ask it to
notify when that "something" is ready. For example it can provide a
timeout value and ask kernel to fire a timer with it and inform
userspace when timeout has expired.
It does not matter what timer is used there - feel free to use
high-resolution one, usual timer, busyloop or anything else. Main issue
that userspace request must be completed.

What you are trying to do is to put kevents under POSIX API.
That means that those kevents can not be read using
kevent_get_events(), basicaly because there are no user-known kevents,
i.e. user has not requested timer, so it should not receive it's
notifications (otherwise it will receive everything requested by other
threads and other issues, i.e. how to differentiate timer request made
by timer_create(), which is not supposed to be caught by
kevent_get_events()).

You could implement POSIX timer _fully_ on top of kevents, i.e. both
create and read, for example network AIO is implemented in that way -
there is a system calls aio_send()/aio_recv() and aio_sendfile() which
create kevent internally and then get it's readiness notifications over
provided callback, process data and finally remove kevent,
so POSIX timers could create timer kevent, wait until it is ready, in
completeness callback it would call signal delivering mechanism...

But there are no reading mechanism in POSIX timers (I mean not reading
pending timeout values or remaining time), they use signals for
completeness delivering... So where do you want to put kevent's
userspace there?

What you are trying to achive is not POSIX timers in any way, you want
completely new machanism which has similar to POSIX API, and I give it to
you (well, with API which can be used not only with timers, but with any
other type of notifications you like).
You need clockid_t? Put it in raw.id[0] and make kevent_timer_enqueue()
callback select different type of timers.
What else?

> --
> Nicholas Miell <[email protected]>

--
Evgeniy Polyakov

2006-08-22 11:55:05

by Christoph Hellwig

[permalink] [raw]
Subject: [PATCH] kevent_user: remove non-chardev interface

Currently a user can create a user kevents in two ways:

a) simply open() the kevent chardevice
b) use sys_kevent_ctl with the KEVENT_CTL_INIT cmd type

both are equally easy to use for the user, but to support type b) a lot
of code in kernelspace is required. remove type b to save lots of code
without functionality loss.


include/linux/ukevent.h | 1
kernel/kevent/kevent_user.c | 99 +-------------------------------------------
2 files changed, 4 insertions(+), 96 deletions(-)

Signed-off-by: Christoph Hellwig <[email protected]>

Index: linux-2.6/kernel/kevent/kevent_user.c
===================================================================
--- linux-2.6.orig/kernel/kevent/kevent_user.c 2006-08-22 13:26:25.000000000 +0200
+++ linux-2.6/kernel/kevent/kevent_user.c 2006-08-22 13:46:08.000000000 +0200
@@ -36,20 +36,6 @@
static char kevent_name[] = "kevent";
static kmem_cache_t *kevent_cache;

-static int kevent_get_sb(struct file_system_type *fs_type,
- int flags, const char *dev_name, void *data, struct vfsmount *mnt)
-{
- /* So original magic... */
- return get_sb_pseudo(fs_type, kevent_name, NULL, 0xbcdbcdul, mnt);
-}
-
-static struct file_system_type kevent_fs_type = {
- .name = kevent_name,
- .get_sb = kevent_get_sb,
- .kill_sb = kill_anon_super,
-};
-
-static struct vfsmount *kevent_mnt;

/*
* kevents are pollable, return POLLIN and POLLRDNORM
@@ -178,17 +164,14 @@
}


-/*
- * Allocate new kevent userspace control entry.
- */
-static struct kevent_user *kevent_user_alloc(void)
+static int kevent_user_open(struct inode *inode, struct file *file)
{
struct kevent_user *u;
int i;

u = kzalloc(sizeof(struct kevent_user), GFP_KERNEL);
if (!u)
- return NULL;
+ return -ENOMEM;

INIT_LIST_HEAD(&u->ready_list);
spin_lock_init(&u->ready_lock);
@@ -202,23 +185,12 @@

atomic_set(&u->refcnt, 1);

- if (kevent_user_ring_init(u)) {
+ if (unlikely(kevent_user_ring_init(u))) {
kfree(u);
- u = NULL;
- }
-
- return u;
-}
-
-static int kevent_user_open(struct inode *inode, struct file *file)
-{
- struct kevent_user *u = kevent_user_alloc();
-
- if (!u)
return -ENOMEM;
+ }

file->private_data = u;
-
return 0;
}

@@ -807,51 +779,6 @@
.fops = &kevent_user_fops,
};

-
-/*
- * Userspace control block creation and initialization.
- */
-static int kevent_ctl_init(void)
-{
- struct kevent_user *u;
- struct file *file;
- int fd, ret;
-
- fd = get_unused_fd();
- if (fd < 0)
- return fd;
-
- file = get_empty_filp();
- if (!file) {
- ret = -ENFILE;
- goto out_put_fd;
- }
-
- u = kevent_user_alloc();
- if (unlikely(!u)) {
- ret = -ENOMEM;
- goto out_put_file;
- }
-
- file->f_op = &kevent_user_fops;
- file->f_vfsmnt = mntget(kevent_mnt);
- file->f_dentry = dget(kevent_mnt->mnt_root);
- file->f_mapping = file->f_dentry->d_inode->i_mapping;
- file->f_mode = FMODE_READ;
- file->f_flags = O_RDONLY;
- file->private_data = u;
-
- fd_install(fd, file);
-
- return fd;
-
-out_put_file:
- put_filp(file);
-out_put_fd:
- put_unused_fd(fd);
- return ret;
-}
-
static int kevent_ctl_process(struct file *file, unsigned int cmd, unsigned int num, void __user *arg)
{
int err;
@@ -920,9 +847,6 @@
int err = -EINVAL;
struct file *file;

- if (cmd == KEVENT_CTL_INIT)
- return kevent_ctl_init();
-
file = fget(fd);
if (!file)
return -ENODEV;
@@ -948,16 +872,6 @@
kevent_cache = kmem_cache_create("kevent_cache",
sizeof(struct kevent), 0, SLAB_PANIC, NULL, NULL);

- err = register_filesystem(&kevent_fs_type);
- if (err)
- panic("%s: failed to register filesystem: err=%d.\n",
- kevent_name, err);
-
- kevent_mnt = kern_mount(&kevent_fs_type);
- if (IS_ERR(kevent_mnt))
- panic("%s: failed to mount silesystem: err=%ld.\n",
- kevent_name, PTR_ERR(kevent_mnt));
-
err = misc_register(&kevent_miscdev);
if (err) {
printk(KERN_ERR "Failed to register kevent miscdev: err=%d.\n", err);
@@ -969,17 +883,12 @@
return 0;

err_out_exit:
- mntput(kevent_mnt);
- unregister_filesystem(&kevent_fs_type);
-
return err;
}

static void __devexit kevent_user_fini(void)
{
misc_deregister(&kevent_miscdev);
- mntput(kevent_mnt);
- unregister_filesystem(&kevent_fs_type);
}

module_init(kevent_user_init);
Index: linux-2.6/include/linux/ukevent.h
===================================================================
--- linux-2.6.orig/include/linux/ukevent.h 2006-08-22 12:10:24.000000000 +0200
+++ linux-2.6/include/linux/ukevent.h 2006-08-22 13:48:05.000000000 +0200
@@ -131,6 +131,5 @@
#define KEVENT_CTL_ADD 0
#define KEVENT_CTL_REMOVE 1
#define KEVENT_CTL_MODIFY 2
-#define KEVENT_CTL_INIT 3

#endif /* __UKEVENT_H */

2006-08-22 11:55:29

by Christoph Hellwig

[permalink] [raw]
Subject: [PATCH] kevent_user: use struct kevent_mring for the page ring

Currently struct kevent_user.print is an array of unsigned long, but it's used
as an array of pointers to struct kevent_mring everyewhere in the code.

Switch it to use the real type and cast the return value from __get_free_page /
argument to free_page.


include/linux/kevent.h | 2 +-
kernel/kevent/kevent_user.c | 31 +++++++++++--------------------
2 files changed, 12 insertions(+), 21 deletions(-)

Signed-off-by: Christoph Hellwig <[email protected]>

Index: linux-2.6/include/linux/kevent.h
===================================================================
--- linux-2.6.orig/include/linux/kevent.h 2006-08-22 12:10:24.000000000 +0200
+++ linux-2.6/include/linux/kevent.h 2006-08-22 13:30:48.000000000 +0200
@@ -105,7 +105,7 @@

unsigned int pages_in_use;
/* Array of pages forming mapped ring buffer */
- unsigned long *pring;
+ struct kevent_mring **pring;

#ifdef CONFIG_KEVENT_USER_STAT
unsigned long im_num;
Index: linux-2.6/kernel/kevent/kevent_user.c
===================================================================
--- linux-2.6.orig/kernel/kevent/kevent_user.c 2006-08-22 13:28:06.000000000 +0200
+++ linux-2.6/kernel/kevent/kevent_user.c 2006-08-22 13:43:46.000000000 +0200
@@ -69,30 +69,21 @@

static inline void kevent_user_ring_set(struct kevent_user *u, unsigned int num)
{
- struct kevent_mring *ring;
-
- ring = (struct kevent_mring *)u->pring[0];
- ring->index = num;
+ u->pring[0]->index = num;
}

static inline void kevent_user_ring_inc(struct kevent_user *u)
{
- struct kevent_mring *ring;
-
- ring = (struct kevent_mring *)u->pring[0];
- ring->index++;
+ u->pring[0]->index++;
}

static int kevent_user_ring_grow(struct kevent_user *u)
{
- struct kevent_mring *ring;
unsigned int idx;

- ring = (struct kevent_mring *)u->pring[0];
-
- idx = (ring->index + 1) / KEVENTS_ON_PAGE;
+ idx = (u->pring[0]->index + 1) / KEVENTS_ON_PAGE;
if (idx >= u->pages_in_use) {
- u->pring[idx] = __get_free_page(GFP_KERNEL);
+ u->pring[idx] = (void *)__get_free_page(GFP_KERNEL);
if (!u->pring[idx])
return -ENOMEM;
u->pages_in_use++;
@@ -108,12 +99,12 @@
unsigned int pidx, off;
struct kevent_mring *ring, *copy_ring;

- ring = (struct kevent_mring *)k->user->pring[0];
-
+ ring = k->user->pring[0];
+
pidx = ring->index/KEVENTS_ON_PAGE;
off = ring->index%KEVENTS_ON_PAGE;

- copy_ring = (struct kevent_mring *)k->user->pring[pidx];
+ copy_ring = k->user->pring[pidx];

copy_ring->event[off].id.raw[0] = k->event.id.raw[0];
copy_ring->event[off].id.raw[1] = k->event.id.raw[1];
@@ -134,11 +125,11 @@

pnum = ALIGN(KEVENT_MAX_EVENTS*sizeof(struct mukevent) + sizeof(unsigned int), PAGE_SIZE)/PAGE_SIZE;

- u->pring = kmalloc(pnum * sizeof(unsigned long), GFP_KERNEL);
+ u->pring = kmalloc(pnum * sizeof(struct kevent_mring *), GFP_KERNEL);
if (!u->pring)
return -ENOMEM;

- u->pring[0] = __get_free_page(GFP_KERNEL);
+ u->pring[0] = (struct kevent_mring *)__get_free_page(GFP_KERNEL);
if (!u->pring[0])
goto err_out_free;

@@ -158,7 +149,7 @@
int i;

for (i = 0; i < u->pages_in_use; ++i)
- free_page(u->pring[i]);
+ free_page((unsigned long)u->pring[i]);

kfree(u->pring);
}
@@ -254,7 +245,7 @@
vma->vm_flags |= VM_RESERVED;
vma->vm_file = file;

- if (vm_insert_page(vma, start, virt_to_page((void *)u->pring[0])))
+ if (vm_insert_page(vma, start, virt_to_page(u->pring[0])))
return -EFAULT;

return 0;

2006-08-22 12:17:49

by Evgeniy Polyakov

[permalink] [raw]
Subject: Re: [PATCH] kevent_user: remove non-chardev interface

On Tue, Aug 22, 2006 at 12:54:59PM +0100, Christoph Hellwig ([email protected]) wrote:
> Currently a user can create a user kevents in two ways:
>
> a) simply open() the kevent chardevice
> b) use sys_kevent_ctl with the KEVENT_CTL_INIT cmd type
>
> both are equally easy to use for the user, but to support type b) a lot
> of code in kernelspace is required. remove type b to save lots of code
> without functionality loss.

I personally do not have objections against it, but it introduces
additional complexies - one needs to open /dev/kevent and then perform
syscalls on top of returuned file descriptor.

If there are no objections, I will apply this patch.

--
Evgeniy Polyakov

2006-08-22 12:21:32

by Evgeniy Polyakov

[permalink] [raw]
Subject: Re: [PATCH] kevent_user: use struct kevent_mring for the page ring

On Tue, Aug 22, 2006 at 12:55:04PM +0100, Christoph Hellwig ([email protected]) wrote:
> Currently struct kevent_user.print is an array of unsigned long, but it's used
> as an array of pointers to struct kevent_mring everyewhere in the code.
>
> Switch it to use the real type and cast the return value from __get_free_page /
> argument to free_page.
>
>
> include/linux/kevent.h | 2 +-
> kernel/kevent/kevent_user.c | 31 +++++++++++--------------------
> 2 files changed, 12 insertions(+), 21 deletions(-)
>
> Signed-off-by: Christoph Hellwig <[email protected]>

I will apply this patch with small minor nit below.

> if (idx >= u->pages_in_use) {
> - u->pring[idx] = __get_free_page(GFP_KERNEL);
> + u->pring[idx] = (void *)__get_free_page(GFP_KERNEL);

Better cast it directly to (struct kevent_mring *).


If there will not be any objectsion about syscall-based initialization,
I will release new patchset tomorrow with your changes.

Thank you, Christoph.

--
Evgeniy Polyakov

2006-08-22 12:27:55

by Christoph Hellwig

[permalink] [raw]
Subject: Re: [PATCH] kevent_user: remove non-chardev interface

On Tue, Aug 22, 2006 at 04:17:10PM +0400, Evgeniy Polyakov wrote:
> I personally do not have objections against it, but it introduces
> additional complexies - one needs to open /dev/kevent and then perform
> syscalls on top of returuned file descriptor.

it disalllows

int fd = sys_kevent_ctl(<random>, KEVENT_CTL_INIT, <random>, <random>);

in favour of only

int fd = open("/dev/kevent", O_SOMETHING);

which doesn't seem like a problem, especially as I really badly hope
no one will use the syscalls but some library instead.

In addition to that I'm researching whether there's a better way to
implement the other functionality instead of the two syscalls. But I'd
rather let code speak, so wait for some patches from me on that.

2006-08-22 12:40:21

by Evgeniy Polyakov

[permalink] [raw]
Subject: Re: [PATCH] kevent_user: remove non-chardev interface

On Tue, Aug 22, 2006 at 01:27:31PM +0100, Christoph Hellwig ([email protected]) wrote:
> On Tue, Aug 22, 2006 at 04:17:10PM +0400, Evgeniy Polyakov wrote:
> > I personally do not have objections against it, but it introduces
> > additional complexies - one needs to open /dev/kevent and then perform
> > syscalls on top of returuned file descriptor.
>
> it disalllows
>
> int fd = sys_kevent_ctl(<random>, KEVENT_CTL_INIT, <random>, <random>);
>
> in favour of only
>
> int fd = open("/dev/kevent", O_SOMETHING);
>
> which doesn't seem like a problem, especially as I really badly hope
> no one will use the syscalls but some library instead.

Yep, exactly about above open/kevent_ctl I'm talking.
I still have a system which has ioctl() based kevent setup, and it
works - I really do not want to rise another flamewar about which
approach is better. If no one will complain until tomorrow I will commit
it.

> In addition to that I'm researching whether there's a better way to
> implement the other functionality instead of the two syscalls. But I'd
> rather let code speak, so wait for some patches from me on that.

There were implementation with pure ioctl() and with one syscall for all
oprations (and control block embedded in it), all were rejected in
favour of two syscalls, so I'm waiting for your patches.

--
Evgeniy Polyakov

2006-08-22 15:00:05

by James Morris

[permalink] [raw]
Subject: Re: [take12 0/3] kevent: Generic event handling mechanism.

On Tue, 22 Aug 2006, Nicholas Miell wrote:

> In this brave new world of always stable kernel development, the time a
> new interface has for public testing before a new kernel release is
> drastically shorter than the old unstable development series, and if
> nobody is documenting how this stuff is supposed to work and
> demonstrating how it will be used, then mistakes are bound to slip
> through.

Feel free to provide the documentation. Perhaps, even as much as you've
written so far in these emails would be enough.



- James
--
James Morris
<[email protected]>

2006-08-22 17:02:26

by Thomas Gleixner

[permalink] [raw]
Subject: Re: [take12 3/3] kevent: Timer notifications.

On Mon, 2006-08-21 at 15:18 +0400, Evgeniy Polyakov wrote:
> On Mon, Aug 21, 2006 at 12:12:39PM +0100, Christoph Hellwig ([email protected]) wrote:
> > On Mon, Aug 21, 2006 at 02:19:49PM +0400, Evgeniy Polyakov wrote:
> > >
> > >
> > > Timer notifications.
> > >
> > > Timer notifications can be used for fine grained per-process time
> > > management, since interval timers are very inconvenient to use,
> > > and they are limited.
>
> > Shouldn't this at leat use a hrtimer?
>
> Not everymachine has them

Every machine has hrtimers - not necessarily with high resolution timer
support, but the core code is there in any case and it is designed to
provide fine grained timers.

In case of high resolution time support one would expect that the "fine
grained" timer event is actually fine grained.

> and getting into account possibility that
> userspace can be scheduled away, it will be overkill.

If you think out your argument then everything which is fine grained or
high responsive should be removed from userspace access for the very
same reason. Please look at the existing users of the hrtimer subsystem
- all of them are exposed to userspace.

tglx


2006-08-22 17:09:48

by Jari Sundell

[permalink] [raw]
Subject: Re: [take12 0/3] kevent: Generic event handling mechanism.

Not to mention the name used causes (at least me) some confusion with
BSD's kqueue implementation. Skimming over the patches it actually
looks somewhat like kqueue with the more interesting features removed,
like the ability to pass the filter changes simultaneously with
polling.

Maybe this is a topic that will singe my fur, but what is wrong with
the kqueue API? Will I really have to implement support for yet
another event API in my program.

Rakshasa

2006-08-22 18:02:40

by Evgeniy Polyakov

[permalink] [raw]
Subject: Re: [take12 0/3] kevent: Generic event handling mechanism.

On Tue, Aug 22, 2006 at 06:57:05PM +0200, Jari Sundell ([email protected]) wrote:
> On 8/22/06, Nicholas Miell <[email protected]> wrote:
> >
> >
> >OK, so with literally a dozen different interfaces to queue events to
> >userspace, all of which are apparently inadequate and in need of
> >replacement by kevent, don't you want to slow down a bit and make sure
> >that the kevent API is correct before it becomes permanent and then just
> >has to be replaced *again* ?
> >
>
> Not to mention the name used causes (at least me) some confusion with BSD's
> kqueue implementation. Skimming over the patches it actually looks somewhat
> like kqueue with the more interesting features removed, like the ability to
> pass the filter changes simultaneously with polling.

I do not understand, what do you mean?
It is obviously allowed to poll and change kevents at the same time.

> Maybe this is a topic that will singe my fur, but what is wrong with the
> kqueue API? Will I really have to implement support for yet another event
> API in my program.

Why did I not implemented it like Solaris did?
Or FreeBSD did?
It was designed with features mention on AIO homepage in mind, but not
to be compatible with some other implementation.
And why should it be?

> Rakshasa

--
Evgeniy Polyakov

2006-08-22 18:26:41

by Evgeniy Polyakov

[permalink] [raw]
Subject: Re: [take12 3/3] kevent: Timer notifications.

On Mon, Aug 21, 2006 at 04:25:49PM +0200, Thomas Gleixner ([email protected]) wrote:
> > Not everymachine has them
>
> Every machine has hrtimers - not necessarily with high resolution timer
> support, but the core code is there in any case and it is designed to
> provide fine grained timers.
>
> In case of high resolution time support one would expect that the "fine
> grained" timer event is actually fine grained.

Ok, I should reformulate, that currently not every machine has support
in kernel. Obviously each machine has a clock which runs faster than
jiffies.
And as a side note - kevents were created half a year ago - there were
no hrtimers in kernel in that time, btw, does kernel have high-resolutin
clock engine already in?

> > and getting into account possibility that
> > userspace can be scheduled away, it will be overkill.
>
> If you think out your argument then everything which is fine grained or
> high responsive should be removed from userspace access for the very
> same reason. Please look at the existing users of the hrtimer subsystem
> - all of them are exposed to userspace.

Getting into account that system call gets more than 100 nsec, and one
should create kevent and then read it (with at least three rescheduling
- after two syscalls and wake up), it is not exactly the best way to
obtain nanoseconds resolution. And even one usec is good one for
userspace, and I can create an interface through kevents, but let's get
it real - if we still can not agree on other issues, should we do it
right now? I would like kevent core's issues are resolved and everyone
become happy with it before adding new kevent users.

If everyone says "yes, replace usual timers with high-resolution ones",
then ok, I will schedule it for the next patchset.

> tglx
>

--
Evgeniy Polyakov

2006-08-22 19:14:33

by Jari Sundell

[permalink] [raw]
Subject: Re: [take12 0/3] kevent: Generic event handling mechanism.

On 8/22/06, Evgeniy Polyakov <[email protected]> wrote:
> > Not to mention the name used causes (at least me) some confusion with BSD's
> > kqueue implementation. Skimming over the patches it actually looks somewhat
> > like kqueue with the more interesting features removed, like the ability to
> > pass the filter changes simultaneously with polling.
>
> I do not understand, what do you mean?
> It is obviously allowed to poll and change kevents at the same time.

Changing kevents are done with a separate system call from polling
afaics, thus every change requires a context switch. This in contrast
to BSD's kqueue which allows user-space to pass the changes when
kevent (polling) is called.

It may also choose to update the filters immediately with the same call.

> > Maybe this is a topic that will singe my fur, but what is wrong with the
> > kqueue API? Will I really have to implement support for yet another event
> > API in my program.
>
> Why did I not implemented it like Solaris did?
> Or FreeBSD did?
> It was designed with features mention on AIO homepage in mind, but not
> to be compatible with some other implementation.
> And why should it be?

If it can be, why should it not be? At least, if you reinvent the
wheel its advantages should be obvious.

Considering that kqueue is available on more popular OSes like darwin
it would ease portability greatly if there was a shared event API.
That is, unless you think there's something fundamentally wrong with
their design.

Your interface:

+asmlinkage long sys_kevent_get_events(int ctl_fd, unsigned int min,
unsigned int max,
+ unsigned int timeout, void __user *buf, unsigned flags);
+asmlinkage long sys_kevent_ctl(int ctl_fd, unsigned int cmd, unsigned
int num, void __user *buf);

BSD's kqueue:

struct kevent {
uintptr_t ident; /* identifier for this event */
short filter; /* filter for event */
u_short flags; /* action flags for kqueue */
u_int fflags; /* filter flag value */
intptr_t data; /* filter data value */
void *udata; /* opaque user data identifier */
};

int kevent(int kq, const struct kevent *changelist, int nchanges,
struct kevent *eventlist, int nevents, const struct timespec
*timeout);

The only thing missing in BSD's kevent is the min/max parameters, the
various filters in kevent_get_events either have equivalent filters or
could be added as extensions. (I didn't look too carefully through
them)

On the other hand, your API lacks the ability to pass changes when
polling, as mentioned above. It would be preferable if the timeout
parameter was either timespec or timeval.

Rakshasa

2006-08-22 19:48:36

by Evgeniy Polyakov

[permalink] [raw]
Subject: Re: [take12 0/3] kevent: Generic event handling mechanism.

On Tue, Aug 22, 2006 at 09:14:30PM +0200, Jari Sundell ([email protected]) wrote:
> Changing kevents are done with a separate system call from polling
> afaics, thus every change requires a context switch. This in contrast
> to BSD's kqueue which allows user-space to pass the changes when
> kevent (polling) is called.
>
> It may also choose to update the filters immediately with the same call.

Word "polling" really confuses me here, but now I understand you.
Such approach actually has unresolved issues - consider for
example a situation when all provided events are ready immediately - what
should be returned (as far as I recall they are always added into kqueue in
BSDs before started to be checked, so old events will be returned
first)? And currently ready events can be read through mapped buffer
without any syscall at all.
And Linux syscall is much cheaper than BSD's one.
Consider (especially apped buffer) that issues, it really does not cost
interface complexity.

> >> Maybe this is a topic that will singe my fur, but what is wrong with the
> >> kqueue API? Will I really have to implement support for yet another event
> >> API in my program.
> >
> >Why did I not implemented it like Solaris did?
> >Or FreeBSD did?
> >It was designed with features mention on AIO homepage in mind, but not
> >to be compatible with some other implementation.
> >And why should it be?
>
> If it can be, why should it not be? At least, if you reinvent the
> wheel its advantages should be obvious.
>
> Considering that kqueue is available on more popular OSes like darwin
> it would ease portability greatly if there was a shared event API.
> That is, unless you think there's something fundamentally wrong with
> their design.

First of all, there are completely different types.
Design of the in-kernel part is very different too.

> Your interface:
>
> +asmlinkage long sys_kevent_get_events(int ctl_fd, unsigned int min,
> unsigned int max,
> + unsigned int timeout, void __user *buf, unsigned flags);
> +asmlinkage long sys_kevent_ctl(int ctl_fd, unsigned int cmd, unsigned
> int num, void __user *buf);
>
> BSD's kqueue:
>
> struct kevent {
> uintptr_t ident; /* identifier for this event */
> short filter; /* filter for event */
> u_short flags; /* action flags for kqueue */
> u_int fflags; /* filter flag value */
> intptr_t data; /* filter data value */
> void *udata; /* opaque user data identifier */
> };


>From your description there is a serious problem with arches which
supports different width of the pointer. I do not have sources of ny BSD
right now, but if it is really like you've described, it can not be used
in Linux at all.

> int kevent(int kq, const struct kevent *changelist, int nchanges,
> struct kevent *eventlist, int nevents, const struct timespec
> *timeout);
>
> The only thing missing in BSD's kevent is the min/max parameters, the
> various filters in kevent_get_events either have equivalent filters or
> could be added as extensions. (I didn't look too carefully through
> them)
>
> On the other hand, your API lacks the ability to pass changes when
> polling, as mentioned above. It would be preferable if the timeout
> parameter was either timespec or timeval.

No way - timespec uses long.

> Rakshasa

--
Evgeniy Polyakov

2006-08-22 19:57:58

by Nicholas Miell

[permalink] [raw]
Subject: Re: [take12 0/3] kevent: Generic event handling mechanism.

On Tue, 2006-08-22 at 14:03 +0400, Evgeniy Polyakov wrote:
> On Tue, Aug 22, 2006 at 02:29:48AM -0700, Nicholas Miell ([email protected]) wrote:
> > > > Is any of this documented anywhere? I'd think that any new userspace
> > > > interfaces should have man pages explaining their use and some example
> > > > code before getting merged into the kernel to shake out any interface
> > > > problems.
> > >
> > > There are two excellent articles on lwn.net
> >
> > Google knows of one and it doesn't actually explain how to use kevents.
>
> http://lwn.net/Articles/192964/
> http://lwn.net/Articles/172844/
>
> In the thread there were enough links to homepage where you can find
> several examples of how to use kevents (and timers among others) with
> old interfaces and new ones.
>

Oh, I found both of those. Neither of them told me what values I could
use in a struct kevent_user_control or what they meant or what any of
the fields in a struct ukevent or struct kevent_id meant or what I'm
supposed to pass in kevent_get_event's "void* buf", or many other things
that I don't remember now.

In short, I'm stuck trying to reverse engineer from the source what the
API is supposed to be (which might not even be what is actually
implemented due to the as of yet unfound bug).

Of course, since you already know how all this stuff is supposed to
work, you could maybe write it down somewhere?


> > > I will ask just one question, do _you_ propose anything here?
> > >
> >
> > struct sigevent sigev = {
> > .sigev_notify = SIGEV_KEVENT,
> > .sigev_kevent_fd = kev_fd,
> > .sigev_value.sival_ptr = &MyCookie
> > };
> >
> > struct itimerspec its = {
> > .it_value = { ... },
> > .it_interval = { ... }
> > };
> >
> > struct timespec timeout = { .. };
> >
> > struct ukevent events[max];
> >
> > timer_t timer;
> >
> > timer_create(CLOCK_MONOTONIC, &sigev, &timer);
> > timer_settime(timer, 0, &its, NULL);
> >
> > /* ... */
> >
> > kevent_get_events(kev_fd, min, max, &timeout, events, 0);
> >
> >
> >
> > Which isn't all that different from what Ulrich Drepper suggested and
> > Solaris does right now. (timer_create would probably end up calling
> > kevent_ctl itself, but it obviously can't do that unless kevents
> > actually support real interval timers).
>
> Ugh, rtsignals... Their's problems forced me to not implement
> "interrupt"-like mechanism for kevents in addition to dequeueing.
>
> Anyway, it seems you did not read the whole thread, homepage, lwn and
> userpsace examples, so you do not understand what kevents are.
>
> They are userspace requests which are returned back when they are ready.
> It means that userspace must provide something to kernel and ask it to
> notify when that "something" is ready. For example it can provide a
> timeout value and ask kernel to fire a timer with it and inform
> userspace when timeout has expired.
> It does not matter what timer is used there - feel free to use
> high-resolution one, usual timer, busyloop or anything else. Main issue
> that userspace request must be completed.
>
> What you are trying to do is to put kevents under POSIX API.
> That means that those kevents can not be read using
> kevent_get_events(), basicaly because there are no user-known kevents,
> i.e. user has not requested timer, so it should not receive it's
> notifications (otherwise it will receive everything requested by other
> threads and other issues, i.e. how to differentiate timer request made
> by timer_create(), which is not supposed to be caught by
> kevent_get_events()).
>

I have no idea what you're trying to say here. I've created a timer,
specified which kevent queue I want it's expiry notification delivered
to, and armed it. Where have I not specified enough information to
request the reception of timer notifications?

Also, differentiating timers made by timer_create() that aren't supposed
to deliver events via kevent_get_events() is easy -- their .sigev_notify
isn't SIGEV_KEVENT.

> You could implement POSIX timer _fully_ on top of kevents, i.e. both
> create and read, for example network AIO is implemented in that way -
> there is a system calls aio_send()/aio_recv() and aio_sendfile() which
> create kevent internally and then get it's readiness notifications over
> provided callback, process data and finally remove kevent,
> so POSIX timers could create timer kevent, wait until it is ready, in
> completeness callback it would call signal delivering mechanism...
>

Yes, but that would be stupid. The kernel already has a fully functional
POSIX timer implementation, so throwing it out to reimplement it using
kevents would be a waste of effort, especially considering that your
kevent timers can't fully express a POSIX interval timer.

Now, if there were some way for me to ask that an interval timer queue
it's expiry notices into a kevent queue, that would combine the best of
both worlds.

> But there are no reading mechanism in POSIX timers (I mean not reading
> pending timeout values or remaining time), they use signals for
> completeness delivering... So where do you want to put kevent's
> userspace there?
>

The goal of this proposal is to extend sigevent completions to include
kevent queues along with signals and created threads, exactly because
thread creation is too heavy and signals are a pain to use.

> What you are trying to achive is not POSIX timers in any way, you want
> completely new machanism which has similar to POSIX API, and I give it to
> you (well, with API which can be used not only with timers, but with any
> other type of notifications you like).
> You need clockid_t? Put it in raw.id[0] and make kevent_timer_enqueue()
> callback select different type of timers.
> What else?

No, it's still POSIX timers -- the vast majority of the API is the same,
they just report their completion differently.

--
Nicholas Miell <[email protected]>

2006-08-22 20:00:41

by Nicholas Miell

[permalink] [raw]
Subject: Re: [take12 0/3] kevent: Generic event handling mechanism.

On Tue, 2006-08-22 at 10:59 -0400, James Morris wrote:
> On Tue, 22 Aug 2006, Nicholas Miell wrote:
>
> > In this brave new world of always stable kernel development, the time a
> > new interface has for public testing before a new kernel release is
> > drastically shorter than the old unstable development series, and if
> > nobody is documenting how this stuff is supposed to work and
> > demonstrating how it will be used, then mistakes are bound to slip
> > through.
>
> Feel free to provide the documentation. Perhaps, even as much as you've
> written so far in these emails would be enough.
>

I'm not the one proposing the new (potentially wrong) interface. The
onus isn't on me.

--
Nicholas Miell <[email protected]>

2006-08-22 20:17:23

by Evgeniy Polyakov

[permalink] [raw]
Subject: Re: [take12 0/3] kevent: Generic event handling mechanism.

On Tue, Aug 22, 2006 at 12:57:38PM -0700, Nicholas Miell ([email protected]) wrote:
> On Tue, 2006-08-22 at 14:03 +0400, Evgeniy Polyakov wrote:
> > On Tue, Aug 22, 2006 at 02:29:48AM -0700, Nicholas Miell ([email protected]) wrote:
> > > > > Is any of this documented anywhere? I'd think that any new userspace
> > > > > interfaces should have man pages explaining their use and some example
> > > > > code before getting merged into the kernel to shake out any interface
> > > > > problems.
> > > >
> > > > There are two excellent articles on lwn.net
> > >
> > > Google knows of one and it doesn't actually explain how to use kevents.
> >
> > http://lwn.net/Articles/192964/
> > http://lwn.net/Articles/172844/
> >
> > In the thread there were enough links to homepage where you can find
> > several examples of how to use kevents (and timers among others) with
> > old interfaces and new ones.
> >
>
> Oh, I found both of those. Neither of them told me what values I could
> use in a struct kevent_user_control or what they meant or what any of
> the fields in a struct ukevent or struct kevent_id meant or what I'm
> supposed to pass in kevent_get_event's "void* buf", or many other things
> that I don't remember now.

Well, I think LWN has very good explaination of what all parameters
mean, but it is possible that there can be some white areas.
No one forbids to look into userspace examples, link to them was posted
a lot of times.

> In short, I'm stuck trying to reverse engineer from the source what the
> API is supposed to be (which might not even be what is actually
> implemented due to the as of yet unfound bug).
>
> Of course, since you already know how all this stuff is supposed to
> work, you could maybe write it down somewhere?

I will write documantation, but as you can see some interfaces are
changed.

>
> > > > I will ask just one question, do _you_ propose anything here?
> > > >
> > >
> > > struct sigevent sigev = {
> > > .sigev_notify = SIGEV_KEVENT,
> > > .sigev_kevent_fd = kev_fd,
> > > .sigev_value.sival_ptr = &MyCookie
> > > };
> > >
> > > struct itimerspec its = {
> > > .it_value = { ... },
> > > .it_interval = { ... }
> > > };
> > >
> > > struct timespec timeout = { .. };
> > >
> > > struct ukevent events[max];
> > >
> > > timer_t timer;
> > >
> > > timer_create(CLOCK_MONOTONIC, &sigev, &timer);
> > > timer_settime(timer, 0, &its, NULL);
> > >
> > > /* ... */
> > >
> > > kevent_get_events(kev_fd, min, max, &timeout, events, 0);
> > >
> > >
> > >
> > > Which isn't all that different from what Ulrich Drepper suggested and
> > > Solaris does right now. (timer_create would probably end up calling
> > > kevent_ctl itself, but it obviously can't do that unless kevents
> > > actually support real interval timers).
> >
> > Ugh, rtsignals... Their's problems forced me to not implement
> > "interrupt"-like mechanism for kevents in addition to dequeueing.
> >
> > Anyway, it seems you did not read the whole thread, homepage, lwn and
> > userpsace examples, so you do not understand what kevents are.
> >
> > They are userspace requests which are returned back when they are ready.
> > It means that userspace must provide something to kernel and ask it to
> > notify when that "something" is ready. For example it can provide a
> > timeout value and ask kernel to fire a timer with it and inform
> > userspace when timeout has expired.
> > It does not matter what timer is used there - feel free to use
> > high-resolution one, usual timer, busyloop or anything else. Main issue
> > that userspace request must be completed.
> >
> > What you are trying to do is to put kevents under POSIX API.
> > That means that those kevents can not be read using
> > kevent_get_events(), basicaly because there are no user-known kevents,
> > i.e. user has not requested timer, so it should not receive it's
> > notifications (otherwise it will receive everything requested by other
> > threads and other issues, i.e. how to differentiate timer request made
> > by timer_create(), which is not supposed to be caught by
> > kevent_get_events()).
> >
>
> I have no idea what you're trying to say here. I've created a timer,
> specified which kevent queue I want it's expiry notification delivered
> to, and armed it. Where have I not specified enough information to
> request the reception of timer notifications?

You can do it with kevent timer notifications. Easily.
I've even attached simple program for that.

> Also, differentiating timers made by timer_create() that aren't supposed
> to deliver events via kevent_get_events() is easy -- their .sigev_notify
> isn't SIGEV_KEVENT.

What should be returned to user? What should be placed into user's data,
into id? How user can determine that given event fires after which
initial value?
Finally, if you think that kevents should use different API for
different events, think about complicated userspace code which must know
tons of syscalls for the same task.

> > You could implement POSIX timer _fully_ on top of kevents, i.e. both
> > create and read, for example network AIO is implemented in that way -
> > there is a system calls aio_send()/aio_recv() and aio_sendfile() which
> > create kevent internally and then get it's readiness notifications over
> > provided callback, process data and finally remove kevent,
> > so POSIX timers could create timer kevent, wait until it is ready, in
> > completeness callback it would call signal delivering mechanism...
> >
>
> Yes, but that would be stupid. The kernel already has a fully functional
> POSIX timer implementation, so throwing it out to reimplement it using
> kevents would be a waste of effort, especially considering that your
> kevent timers can't fully express a POSIX interval timer.
>
> Now, if there were some way for me to ask that an interval timer queue
> it's expiry notices into a kevent queue, that would combine the best of
> both worlds.

Just use kevents directly without POSIX timers at all.
It is possible to add there high-resolution timers.

> > But there are no reading mechanism in POSIX timers (I mean not reading
> > pending timeout values or remaining time), they use signals for
> > completeness delivering... So where do you want to put kevent's
> > userspace there?
> >
>
> The goal of this proposal is to extend sigevent completions to include
> kevent queues along with signals and created threads, exactly because
> thread creation is too heavy and signals are a pain to use.

What you propose is completely new mechanism - it is implemented inside
kevent timer notifications expect that API does not match POSIX one.

> > What you are trying to achive is not POSIX timers in any way, you want
> > completely new machanism which has similar to POSIX API, and I give it to
> > you (well, with API which can be used not only with timers, but with any
> > other type of notifications you like).
> > You need clockid_t? Put it in raw.id[0] and make kevent_timer_enqueue()
> > callback select different type of timers.
> > What else?
>
> No, it's still POSIX timers -- the vast majority of the API is the same,
> they just report their completion differently.

POSIX timers are just a API over in-kernel timers.
Kevent provides different and much more convenient API (since you want
to not use signals but kevent's queue), so where are those similar
things?
How will you change a timer from POSIX syscall, when it completely does
not know about kevents, there will be racess, so you need to
change POSIX API (it's internal part which works with timers).
If you are going to change internal part of POSIX timers implementatin
and add new syscall into your userspace program, you can just switch to
the new API entirely, since right now I do not see any major problems in
kevent timer implementation (expect that it does not use high-res
timers, which were not included into kernel when kevents were created).

> --
> Nicholas Miell <[email protected]>

--
Evgeniy Polyakov

2006-08-22 20:35:53

by David Miller

[permalink] [raw]
Subject: Re: [take12 0/3] kevent: Generic event handling mechanism.

From: Nicholas Miell <[email protected]>
Date: Tue, 22 Aug 2006 13:00:23 -0700

> I'm not the one proposing the new (potentially wrong) interface. The
> onus isn't on me.

You can't demand a volunteer to do work, period.

If it matters to you, you have the option of doing the work.
Otherwise you can't complain.

2006-08-22 21:13:40

by Nicholas Miell

[permalink] [raw]
Subject: Re: [take12 0/3] kevent: Generic event handling mechanism.

On Wed, 2006-08-23 at 00:16 +0400, Evgeniy Polyakov wrote:
> On Tue, Aug 22, 2006 at 12:57:38PM -0700, Nicholas Miell ([email protected]) wrote:
> > On Tue, 2006-08-22 at 14:03 +0400, Evgeniy Polyakov wrote:
> > Of course, since you already know how all this stuff is supposed to
> > work, you could maybe write it down somewhere?
>
> I will write documantation, but as you can see some interfaces are
> changed.

Thanks; rapidly changing interfaces need good documentation even more
than stable interfaces simply because reverse engineering the intended
API from a changing implementation becomes even more difficult.

> > > > > I will ask just one question, do _you_ propose anything here?
> > > >
> > > > struct sigevent sigev = {
> > > > .sigev_notify = SIGEV_KEVENT,
> > > > .sigev_kevent_fd = kev_fd,
> > > > .sigev_value.sival_ptr = &MyCookie
> > > > };
> > > >
> > > > struct itimerspec its = {
> > > > .it_value = { ... },
> > > > .it_interval = { ... }
> > > > };
> > > >
> > > > struct timespec timeout = { .. };
> > > >
> > > > struct ukevent events[max];
> > > >
> > > > timer_t timer;
> > > >
> > > > timer_create(CLOCK_MONOTONIC, &sigev, &timer);
> > > > timer_settime(timer, 0, &its, NULL);
> > > >
> > > > /* ... */
> > > >
> > > > kevent_get_events(kev_fd, min, max, &timeout, events, 0);
> > > >
> > > >
> > > >
> > > > Which isn't all that different from what Ulrich Drepper suggested and
> > > > Solaris does right now. (timer_create would probably end up calling
> > > > kevent_ctl itself, but it obviously can't do that unless kevents
> > > > actually support real interval timers).
> > >
> > > Ugh, rtsignals... Their's problems forced me to not implement
> > > "interrupt"-like mechanism for kevents in addition to dequeueing.
> > >
> > > Anyway, it seems you did not read the whole thread, homepage, lwn and
> > > userpsace examples, so you do not understand what kevents are.
> > >
> > > They are userspace requests which are returned back when they are ready.
> > > It means that userspace must provide something to kernel and ask it to
> > > notify when that "something" is ready. For example it can provide a
> > > timeout value and ask kernel to fire a timer with it and inform
> > > userspace when timeout has expired.
> > > It does not matter what timer is used there - feel free to use
> > > high-resolution one, usual timer, busyloop or anything else. Main issue
> > > that userspace request must be completed.
> > >
> > > What you are trying to do is to put kevents under POSIX API.
> > > That means that those kevents can not be read using
> > > kevent_get_events(), basicaly because there are no user-known kevents,
> > > i.e. user has not requested timer, so it should not receive it's
> > > notifications (otherwise it will receive everything requested by other
> > > threads and other issues, i.e. how to differentiate timer request made
> > > by timer_create(), which is not supposed to be caught by
> > > kevent_get_events()).
> >
> > I have no idea what you're trying to say here. I've created a timer,
> > specified which kevent queue I want it's expiry notification delivered
> > to, and armed it. Where have I not specified enough information to
> > request the reception of timer notifications?
>
> You can do it with kevent timer notifications. Easily.
> I've even attached simple program for that.

You forgot to attach the program.

> > Also, differentiating timers made by timer_create() that aren't supposed
> > to deliver events via kevent_get_events() is easy -- their .sigev_notify
> > isn't SIGEV_KEVENT.
>
> What should be returned to user?
> What should be placed into user's data, into id?

The cookie I passed in -- in this example, it was &MyCookie.

> How user can determine that given event fires after which
> initial value?

I don't know what this means.

> Finally, if you think that kevents should use different API for
> different events, think about complicated userspace code which must know
> tons of syscalls for the same task.

I don't think cramming everything together into the same syscall is any
better. In fact, a series of discrete, easy-to-understand function calls
is a hell of a lot easier to deal with than a single call that takes an
array of large multi-purpose structures, especially when most of those
function calls have standard specified behavior.

In fact, I doubt anything will *ever* use kevents directly -- it's
either going to be something like libevent which wraps this stuff
portably or the app's own portability layer or GLib's event loop or
something else that abstracts away the fact that nobody can agree on
what the primitives for a unified event loop should be. There's nothing
like another layer of indirection to solve your problems.

> > > You could implement POSIX timer _fully_ on top of kevents, i.e. both
> > > create and read, for example network AIO is implemented in that way -
> > > there is a system calls aio_send()/aio_recv() and aio_sendfile() which
> > > create kevent internally and then get it's readiness notifications over
> > > provided callback, process data and finally remove kevent,
> > > so POSIX timers could create timer kevent, wait until it is ready, in
> > > completeness callback it would call signal delivering mechanism...
> >
> > Yes, but that would be stupid. The kernel already has a fully functional
> > POSIX timer implementation, so throwing it out to reimplement it using
> > kevents would be a waste of effort, especially considering that your
> > kevent timers can't fully express a POSIX interval timer.
> >
> > Now, if there were some way for me to ask that an interval timer queue
> > it's expiry notices into a kevent queue, that would combine the best of
> > both worlds.
>
> Just use kevents directly without POSIX timers at all.
> It is possible to add there high-resolution timers.

So the existing kevent API is currently incomplete?

> > > But there are no reading mechanism in POSIX timers (I mean not reading
> > > pending timeout values or remaining time), they use signals for
> > > completeness delivering... So where do you want to put kevent's
> > > userspace there?
> >
> > The goal of this proposal is to extend sigevent completions to include
> > kevent queues along with signals and created threads, exactly because
> > thread creation is too heavy and signals are a pain to use.
>
> What you propose is completely new mechanism - it is implemented inside
> kevent timer notifications expect that API does not match POSIX one.

A completely new delivery mechanism, yes. The rest of the API for timer
creation, arming, query, destruction, etc. remains the same.

This is opposed to the completely new mechanism for delivery, creation,
arming, query, destruction, etc. that is the currently proposed kevents
timer interface.

> > > What you are trying to achive is not POSIX timers in any way, you want
> > > completely new machanism which has similar to POSIX API, and I give it to
> > > you (well, with API which can be used not only with timers, but with any
> > > other type of notifications you like).
> > > You need clockid_t? Put it in raw.id[0] and make kevent_timer_enqueue()
> > > callback select different type of timers.
> > > What else?
> >
> > No, it's still POSIX timers -- the vast majority of the API is the same,
> > they just report their completion differently.
>
> POSIX timers are just a API over in-kernel timers.
> Kevent provides different and much more convenient API (since you want
> to not use signals but kevent's queue), so where are those similar
> things?

I don't think you have established the kevent timer API's convenience
yet (beyond the fact that it doesn't use signals, which everybody
wants).

> How will you change a timer from POSIX syscall, when it completely does
> not know about kevents, there will be racess, so you need to
> change POSIX API (it's internal part which works with timers).

Yes, the kernel's POSIX timer implementation will need to be altered so
that it can queue timer completion events to a kevent queue.

> If you are going to change internal part of POSIX timers implementatin
> and add new syscall into your userspace program, you can just switch to
> the new API entirely, since right now I do not see any major problems in
> kevent timer implementation (expect that it does not use high-res
> timers, which were not included into kernel when kevents were created).

Switching an entire program away from POSIX interval timers would be
more work then the modifications necessary to switch only it's timer
delivery mechanism, especially when the new timer system doesn't have
documentation and isn't as functional as the old.

And of course you don't see any major problems in the kevent timer
implementation -- if you did, you would have fixed them already.
However, that doesn't mean that they don't exist.

--
Nicholas Miell <[email protected]>

2006-08-22 21:14:25

by Nicholas Miell

[permalink] [raw]
Subject: Re: [take12 0/3] kevent: Generic event handling mechanism.

On Tue, 2006-08-22 at 13:36 -0700, David Miller wrote:
> From: Nicholas Miell <[email protected]>
> Date: Tue, 22 Aug 2006 13:00:23 -0700
>
> > I'm not the one proposing the new (potentially wrong) interface. The
> > onus isn't on me.
>
> You can't demand a volunteer to do work, period.
>
> If it matters to you, you have the option of doing the work.
> Otherwise you can't complain.

So if a volunteer does bad work, I'm obligated to accept it just because
I haven't done better?

Alternately, if a volunteer does bad work, must it be merged into the
kernel because there's isn't a better implementation? (I believe that
was tried at least once with devfs.)

And how is the quality of the work to be judged if the work isn't
commented, documented and explained, especially the userland-visible
parts that *cannot* *ever* *be* *changed* *or* *removed* once they're in
a stable kernel release?

--
Nicholas Miell <[email protected]>

2006-08-22 21:24:46

by David Miller

[permalink] [raw]
Subject: Re: [take12 0/3] kevent: Generic event handling mechanism.

From: Nicholas Miell <[email protected]>
Date: Tue, 22 Aug 2006 14:13:40 -0700

> And how is the quality of the work to be judged if the work isn't
> commented, documented and explained, especially the userland-visible
> parts that *cannot* *ever* *be* *changed* *or* *removed* once they're in
> a stable kernel release?

Are you even willing to look at the collection of example applications
Evgeniy wrote against this API?

That is the true test of a set of interfaces, what happens when you
try to actually use them in real programs.

Everything else is fluff, including standards and "documentation".

He even bothered to benchmark things, and post assosciated graphs and
performance analysis during the course of development.

2006-08-22 21:34:41

by Randy Dunlap

[permalink] [raw]
Subject: Re: [take12 0/3] kevent: Generic event handling mechanism.

On Tue, 22 Aug 2006 14:13:02 -0700 Nicholas Miell wrote:

> On Wed, 2006-08-23 at 00:16 +0400, Evgeniy Polyakov wrote:
> > On Tue, Aug 22, 2006 at 12:57:38PM -0700, Nicholas Miell ([email protected]) wrote:
> > > On Tue, 2006-08-22 at 14:03 +0400, Evgeniy Polyakov wrote:
> > > Of course, since you already know how all this stuff is supposed to
> > > work, you could maybe write it down somewhere?
> >
> > I will write documantation, but as you can see some interfaces are
> > changed.
>
> Thanks; rapidly changing interfaces need good documentation even more
> than stable interfaces simply because reverse engineering the intended
> API from a changing implementation becomes even more difficult.

OK, I don't quite get it.
Can you be precise about what you would like?

a. good documentation
b. a POSIX API
c. a Windows-compatible API
d. other?

and we won't make you use any of this code.

---
~Randy

2006-08-22 22:02:11

by Andrew Morton

[permalink] [raw]
Subject: Re: [take12 0/3] kevent: Generic event handling mechanism.

On Tue, 22 Aug 2006 14:37:47 -0700
"Randy.Dunlap" <[email protected]> wrote:

> On Tue, 22 Aug 2006 14:13:02 -0700 Nicholas Miell wrote:
>
> > On Wed, 2006-08-23 at 00:16 +0400, Evgeniy Polyakov wrote:
> > > On Tue, Aug 22, 2006 at 12:57:38PM -0700, Nicholas Miell ([email protected]) wrote:
> > > > On Tue, 2006-08-22 at 14:03 +0400, Evgeniy Polyakov wrote:
> > > > Of course, since you already know how all this stuff is supposed to
> > > > work, you could maybe write it down somewhere?
> > >
> > > I will write documantation, but as you can see some interfaces are
> > > changed.
> >
> > Thanks; rapidly changing interfaces need good documentation even more
> > than stable interfaces simply because reverse engineering the intended
> > API from a changing implementation becomes even more difficult.
>
> OK, I don't quite get it.
> Can you be precise about what you would like?
>
> a. good documentation
> b. a POSIX API
> c. a Windows-compatible API
> d. other?
>
> and we won't make you use any of this code.
>

Today seems to be beat-up-Nick day?

This is a major, major new addition to the kernel API. It's a big deal.
Getting it documented prior to committing ourselves is a useful part of the
review process. It certainly can't hurt, and it might help. It is a
little too soon to spend too much time on that though. (It's actually
_better_ if someone other than the developer writes the documentation,
too).


And the "why not emulate kqueue" question strikes me as an excellent one.
Presumably a lot of developer thought and in-field experience has gone into
kqueue. It would benefit us to use that knowledge as much as we can.

I mean, if there's nothing wrong with kqueue then let's minimise app
developer pain and copy it exactly. If there _is_ something wrong with
kqueue then let us identify those weaknesses and then diverge. Doing
something which looks the same and works the same and does the same thing
but has a different API doesn't benefit anyone.

2006-08-22 22:17:36

by David Miller

[permalink] [raw]
Subject: Re: [take12 0/3] kevent: Generic event handling mechanism.

From: Andrew Morton <[email protected]>
Date: Tue, 22 Aug 2006 15:01:44 -0700

> If there _is_ something wrong with kqueue then let us identify those
> weaknesses and then diverge.

Evgeniy already enumerated this, both on his web site and in the
current thread.

Unlike some people seem to imply, Evgeniy did research all the other
implementations of event queueing out there, including kqueue.
He took the best of that survey, adding some of his own ideas,
and that's what kevent is. It's not like he's some kind of
charlatan and made arbitrary decisions in his design without any
regard for what's out there already.

Again, the proof is in the pudding, he wrote applications against his
interfaces and tested them. That's what people need to really do if
they want to judge his interface, try to write programs against it and
report back any problems they run into.

2006-08-22 22:51:13

by Jari Sundell

[permalink] [raw]
Subject: Re: [take12 0/3] kevent: Generic event handling mechanism.

On 8/22/06, Evgeniy Polyakov <[email protected]> wrote:
> Word "polling" really confuses me here, but now I understand you.
> Such approach actually has unresolved issues - consider for
> example a situation when all provided events are ready immediately - what
> should be returned (as far as I recall they are always added into kqueue in
> BSDs before started to be checked, so old events will be returned
> first)? And currently ready events can be read through mapped buffer
> without any syscall at all.
> And Linux syscall is much cheaper than BSD's one.
> Consider (especially apped buffer) that issues, it really does not cost
> interface complexity.

There's no reason I can see that kqueue's kevent should not be able to
check an mmaped buffer as in your implementation, after having passed
any filter changes to the kernel.

I'm not sure if I read you correctly, but the situation where all
events are ready immediately is not a problem. Only the delta is
passed with the kevent call, so old events will still be first in the
queue. And as long as the user doesn't randomize the order of the
changelist and passes the changedlist with each kevent call, the
resulting order in which changes are received will be no different
from using individual system calls.

If there's some very specific reason the user needs to retain the
order in which events happen in the interval between adding it to the
changelist and calling kevent, he may decide to call kevent
immediately without asking for any events.

> First of all, there are completely different types.
> Design of the in-kernel part is very different too.

The question I'm asking is not whet ever kqueue can fit this
implementation, but rather if it is possible to make the
implementation fit kqueue. I can't really see any fundemental
differences, merely implementation details. Maybe I'm just unfamiliar
with the requirements.

> > BSD's kqueue:
> >
> > struct kevent {
> > uintptr_t ident; /* identifier for this event */
> > short filter; /* filter for event */
> > u_short flags; /* action flags for kqueue */
> > u_int fflags; /* filter flag value */
> > intptr_t data; /* filter data value */
> > void *udata; /* opaque user data identifier */
> > };
>
>
> From your description there is a serious problem with arches which
> supports different width of the pointer. I do not have sources of ny BSD
> right now, but if it is really like you've described, it can not be used
> in Linux at all.

Are you referring to udata or data? I'll assume the latter as the
former is more of a restriction on user-space. intptr_t is required to
be safely convertible to a void*, so I don't see what the problem
would be.

> No way - timespec uses long.

I must have missed that discussion. Please enlighten me in what regard
using an opaque type with lower resolution is preferable to a type
defined in POSIX for this sort of purpose. Considering the extra code
I need to write to properly handle having just ms resolution, it
better be something fundamentally broken. ;)

Rakshasa

2006-08-22 22:58:47

by Nicholas Miell

[permalink] [raw]
Subject: Re: [take12 0/3] kevent: Generic event handling mechanism.

On Tue, 2006-08-22 at 14:37 -0700, Randy.Dunlap wrote:
> On Tue, 22 Aug 2006 14:13:02 -0700 Nicholas Miell wrote:
>
> > On Wed, 2006-08-23 at 00:16 +0400, Evgeniy Polyakov wrote:
> > > On Tue, Aug 22, 2006 at 12:57:38PM -0700, Nicholas Miell ([email protected]) wrote:
> > > > On Tue, 2006-08-22 at 14:03 +0400, Evgeniy Polyakov wrote:
> > > > Of course, since you already know how all this stuff is supposed to
> > > > work, you could maybe write it down somewhere?
> > >
> > > I will write documantation, but as you can see some interfaces are
> > > changed.
> >
> > Thanks; rapidly changing interfaces need good documentation even more
> > than stable interfaces simply because reverse engineering the intended
> > API from a changing implementation becomes even more difficult.
>
> OK, I don't quite get it.
> Can you be precise about what you would like?
>
> a. good documentation
> b. a POSIX API
> c. a Windows-compatible API
> d. other?
>
> and we won't make you use any of this code.

I want something that I can be confident won't be replaced again in two
years because nobody noticed problems with the old API design or they're
just feeling very NIH with their snazzy new feature.

Maybe then we won't end up with another in the { signal/sigaction,
waitpid/wait4, select/pselect, poll/ppol, msgrcv, mq_receive,
io_getevents, aio_suspend/aio_return, epoll_wait, inotify read,
kevent_get_events } collection -- or do you like having a maze of
twisted interfaces, all subtly different and none supporting the
complete feature set?

Good documentation giving enough detail to judge the design and an API
that fits with the current POSIX API (at least, the parts that everybody
agrees don't suck) goes a long way toward assuaging my fears that this
won't just be another waste of effort, doomed to be replaced by the Next
Great Thing (We Really Mean It This Time!) in unified event loop API
design or whatever other interface somebody happens to be working on.

---

This is made extraordinarily difficult by the fact kernel people don't
even agree themselves on what APIs should look like anyway and Linus
won't take a stand on the issue -- people with influence are
simultaneously arguing things like:

- ioctls are bad because they aren't typesafe and you should use
syscalls instead because they are typesafe

- ioctls are good, because they're much easier to add than syscalls,
type safety can be supplied by the library wrapper, and syscalls are a
(relatively) scarce resource, harder to wire up in the first place, and
are more difficult to make optional or remove entirely if you decide
they were a stupid idea.

- multiplexors are bad because they're too complex or not typesafe

- multiplexors are good because they save syscall slots or ioctl numbers
and the library wrapper provides the typesafety anyway.

- instead of syscalls or ioctls, you should create a whole new
filesystem that has a bunch of magic files that you read from and write
to in order to talk to the kernel

- filesystem interfaces are bad, because they're take more effort to
write than a syscall or a ioctl and nobody seems to know how to maintain
and evolve a filesystem-based ABI or make them easy to use outside of a
fragile shell script (see: sysfs)

- that everything in those custom filesystems should ASCII strings and
nobody needs an actual grammar describing how to parse them, we can just
break userspace whenever we feel like it

- that everything in those custom filesystems should be C structs, and
screw the shell scripts

- new filesystem metadata should be exposed by:
- xattrs
- ioctls
- new syscalls
or
- named streams/forks/not-xattrs
and three out of four of these suggestions are completely wrong for
some critical reason

- meanwhile, the networking folks are doing everything via AF_NETLINK
sockets instead of syscalls or ioctl or whatever, I guess because the
network stack is what's most familiar to them

- and there's the usual arguments about typedefs verses bare struct
names, #defines verses enums, returning 0 on success vs. 0 on failure,
and lots of other piddly stupid stuff that somebody just needs to say
"this is how it's done and no arguing" about.

Honestly, somebody with enough clout to make it stick needs to write out
a spec describing what new kernel interfaces should look like and how
they should fit in with existing interfaces.

It'd probably make Evgeniy's life easier if you could just point at the
interface guidelines and say "you did this wrong" instead of random
people telling him to change his design and random other people telling
him to change it back.

--
Nicholas Miell <[email protected]>

2006-08-22 22:59:23

by Nicholas Miell

[permalink] [raw]
Subject: Re: [take12 0/3] kevent: Generic event handling mechanism.

On Tue, 2006-08-22 at 14:25 -0700, David Miller wrote:
> From: Nicholas Miell <[email protected]>
> Date: Tue, 22 Aug 2006 14:13:40 -0700
>
> > And how is the quality of the work to be judged if the work isn't
> > commented, documented and explained, especially the userland-visible
> > parts that *cannot* *ever* *be* *changed* *or* *removed* once they're in
> > a stable kernel release?
>
> Are you even willing to look at the collection of example applications
> Evgeniy wrote against this API?
>
> That is the true test of a set of interfaces, what happens when you
> try to actually use them in real programs.
>
> Everything else is fluff, including standards and "documentation".
>
> He even bothered to benchmark things, and post assosciated graphs and
> performance analysis during the course of development.

I wasn't aware that any of these existed, he didn't mention them in this
patch series. Having now looked, all I've managed to find are a series
of simple example apps that no longer work because of API changes.

Also, if you've been paying attention, you'll note that I've never
criticized the performance or quality of the underlying kevent
implementation -- as best I can tell, aside from some lockdep complaints
(which, afaik, are the result of lockdep's limitations rather than
problems with kevent), the internals of kevent are excellent.

--
Nicholas Miell <[email protected]>

2006-08-22 23:06:07

by David Miller

[permalink] [raw]
Subject: Re: [take12 0/3] kevent: Generic event handling mechanism.

From: Nicholas Miell <[email protected]>
Date: Tue, 22 Aug 2006 15:58:12 -0700

> Honestly, somebody with enough clout to make it stick needs to write out
> a spec describing what new kernel interfaces should look like and how
> they should fit in with existing interfaces.

With the time you spent writing this long email alone you could have
worked on either documenting Evgeniy's interfaces or trying to write
test applications against kevent to validate how useful the interfaces
are and if there are any problems with them.

You choose to rant and complain instead of participate.

Therefore, many of us cannot take you seriously.

2006-08-22 23:12:32

by Alexey Kuznetsov

[permalink] [raw]
Subject: Re: [take12 0/3] kevent: Generic event handling mechanism.

Hello!

> >No way - timespec uses long.
>
> I must have missed that discussion. Please enlighten me in what regard
> using an opaque type with lower resolution is preferable to a type
> defined in POSIX for this sort of purpose.

Let me explain, as a person who did this mistake and deeply
regrets about this.

F.e. in this case you just cannot use kevents in 32bit application
on x86_64, unless you add the whole translation layer inside kevent core.
Even when you deal with plain syscall, translation is a big pain,
but when you use mmapped buffer, it can be simply impossible.

F.e. my mistake was "unsigned long" in struct tpacket_hdr in linux/if_packet.h.
It makes use of mmapped packet socket essentially impossible by 32bit
applications on 64bit archs.

Alexey

2006-08-22 23:19:00

by Randy Dunlap

[permalink] [raw]
Subject: Re: [take12 0/3] kevent: Generic event handling mechanism.

On Tue, 22 Aug 2006 15:58:12 -0700 Nicholas Miell wrote:

> On Tue, 2006-08-22 at 14:37 -0700, Randy.Dunlap wrote:
> > On Tue, 22 Aug 2006 14:13:02 -0700 Nicholas Miell wrote:
> >
> > > On Wed, 2006-08-23 at 00:16 +0400, Evgeniy Polyakov wrote:
> > > > On Tue, Aug 22, 2006 at 12:57:38PM -0700, Nicholas Miell ([email protected]) wrote:
> > > > > On Tue, 2006-08-22 at 14:03 +0400, Evgeniy Polyakov wrote:
> > > > > Of course, since you already know how all this stuff is supposed to
> > > > > work, you could maybe write it down somewhere?
> > > >
> > > > I will write documantation, but as you can see some interfaces are
> > > > changed.
> > >
> > > Thanks; rapidly changing interfaces need good documentation even more
> > > than stable interfaces simply because reverse engineering the intended
> > > API from a changing implementation becomes even more difficult.
> >
> > OK, I don't quite get it.
> > Can you be precise about what you would like?
> >
> > a. good documentation
> > b. a POSIX API
> > c. a Windows-compatible API
> > d. other?
> >
> > and we won't make you use any of this code.
>
> I want something that I can be confident won't be replaced again in two
> years because nobody noticed problems with the old API design or they're
> just feeling very NIH with their snazzy new feature.
>
> Maybe then we won't end up with another in the { signal/sigaction,
> waitpid/wait4, select/pselect, poll/ppol, msgrcv, mq_receive,
> io_getevents, aio_suspend/aio_return, epoll_wait, inotify read,
> kevent_get_events } collection -- or do you like having a maze of
> twisted interfaces, all subtly different and none supporting the
> complete feature set?
>
> Good documentation giving enough detail to judge the design and an API
> that fits with the current POSIX API (at least, the parts that everybody
> agrees don't suck) goes a long way toward assuaging my fears that this
> won't just be another waste of effort, doomed to be replaced by the Next
> Great Thing (We Really Mean It This Time!) in unified event loop API
> design or whatever other interface somebody happens to be working on.
>
> ---

OK, thank you for elaborating.

I suppose that I am more <choose one> {cynical, sarcastic,
practial, pragmatic}. I don't have a crystal ball for 2 years
out and I don't know anyone who does.

IMO we do the best that we can given some human constraints
and probably some marketplace constraints (like ship something
instead of playing with it for 5 years before shipping it).


> This is made extraordinarily difficult by the fact kernel people don't
> even agree themselves on what APIs should look like anyway and Linus
> won't take a stand on the issue -- people with influence are
> simultaneously arguing things like:
>
> - ioctls are bad because they aren't typesafe and you should use
> syscalls instead because they are typesafe
>
> - ioctls are good, because they're much easier to add than syscalls,
> type safety can be supplied by the library wrapper, and syscalls are a
> (relatively) scarce resource, harder to wire up in the first place, and
> are more difficult to make optional or remove entirely if you decide
> they were a stupid idea.

Yes, I was recently part of that argument in Ottawa.

> - multiplexors are bad because they're too complex or not typesafe
>
> - multiplexors are good because they save syscall slots or ioctl numbers
> and the library wrapper provides the typesafety anyway.

Multiplexors have already lost AFAIK. Unless someone changes their
mind. Which happens and will continue to happen.

> - instead of syscalls or ioctls, you should create a whole new
> filesystem that has a bunch of magic files that you read from and write
> to in order to talk to the kernel

Yep. Some people like that one. Not everyone.

> - filesystem interfaces are bad, because they're take more effort to
> write than a syscall or a ioctl and nobody seems to know how to maintain
> and evolve a filesystem-based ABI or make them easy to use outside of a
> fragile shell script (see: sysfs)

Ack.

> - that everything in those custom filesystems should ASCII strings and
> nobody needs an actual grammar describing how to parse them, we can just
> break userspace whenever we feel like it

sysfs requires one value per file. Little parsing required.
But I don't know how to capture atomic values from N files with sysfs.

> - that everything in those custom filesystems should be C structs, and
> screw the shell scripts

Hm, I don't recall that one.

> - new filesystem metadata should be exposed by:
> - xattrs
> - ioctls
> - new syscalls
> or
> - named streams/forks/not-xattrs
> and three out of four of these suggestions are completely wrong for
> some critical reason
>
> - meanwhile, the networking folks are doing everything via AF_NETLINK
> sockets instead of syscalls or ioctl or whatever, I guess because the
> network stack is what's most familiar to them

I sympathize with you on that one. I don't care for netlink much
either. To me it's still an ioctl, even though it is routable
and extensible.

> - and there's the usual arguments about typedefs verses bare struct
> names, #defines verses enums, returning 0 on success vs. 0 on failure,
> and lots of other piddly stupid stuff that somebody just needs to say
> "this is how it's done and no arguing" about.

That's all part of the open-source development process. Sorry
you dislike it.

> Honestly, somebody with enough clout to make it stick needs to write out
> a spec describing what new kernel interfaces should look like and how
> they should fit in with existing interfaces.

I only know 2 people who could make it stick.

> It'd probably make Evgeniy's life easier if you could just point at the
> interface guidelines and say "you did this wrong" instead of random
> people telling him to change his design and random other people telling
> him to change it back.

I agree. We went thru some of this at the kernel summit.
The ioctls dissent topic didn't really have many dissenters.
(That was a surprise to me.)

Anyway, thanks again for the additional details.

---
~Randy

2006-08-22 23:36:12

by Andrew Morton

[permalink] [raw]
Subject: Re: [take12 0/3] kevent: Generic event handling mechanism.

On Tue, 22 Aug 2006 15:17:47 -0700 (PDT)
David Miller <[email protected]> wrote:

> From: Andrew Morton <[email protected]>
> Date: Tue, 22 Aug 2006 15:01:44 -0700
>
> > If there _is_ something wrong with kqueue then let us identify those
> > weaknesses and then diverge.
>
> Evgeniy already enumerated this, both on his web site and in the
> current thread.

<googles, spends a few minutes clicking around on
http://tservice.net.ru/~s0mbre/, fails. Looks in changelogs, also fails>

Best I can find is
http://tservice.net.ru/~s0mbre/blog/devel/kevent/index.html, and that's
doesn't cover these things.

At some stage we're going to need to tell Linus (for example) what we've
done and why we did it. I don't know how to do that.

2006-08-22 23:47:08

by Ulrich Drepper

[permalink] [raw]
Subject: Re: [take12 0/3] kevent: Generic event handling mechanism.

I so far also haven't taken the time to look exactly at the interface.
I plan to do it asap since this is IMO our big chance to get it right.
I want to have a unifying interface which can handle all the different
events we need and which come up today and tomorrow. We have to be able
to handle not only file descriptors and AIO but also timers, signals,
message queues (OK, they are file descriptors but let's make it
official), futexes. I'm probably missing the one or the other thing now.

DaveM says there are example programs for the current interfaces. I
must admit I haven't seen those either. So if possible, point the world
to them again. If you do that now I'll review everything and write up
my recommendations re the interface before Monday.

--
➧ Ulrich Drepper ➧ Red Hat, Inc. ➧ 444 Castro St ➧ Mountain View, CA ❖


Attachments:
signature.asc (251.00 B)
OpenPGP digital signature

2006-08-23 00:28:34

by Jari Sundell

[permalink] [raw]
Subject: Re: [take12 0/3] kevent: Generic event handling mechanism.

On 8/23/06, Alexey Kuznetsov <[email protected]> wrote:
> Let me explain, as a person who did this mistake and deeply
> regrets about this.
>
> F.e. in this case you just cannot use kevents in 32bit application
> on x86_64, unless you add the whole translation layer inside kevent core.
> Even when you deal with plain syscall, translation is a big pain,
> but when you use mmapped buffer, it can be simply impossible.
>
> F.e. my mistake was "unsigned long" in struct tpacket_hdr in linux/if_packet.h.
> It makes use of mmapped packet socket essentially impossible by 32bit
> applications on 64bit archs.

There are system calls that take timespec, so I assume the magic is
already available for handling the timeout argument of kevent.
Although I'm not entirely sure about the kqueue timer interface, there
isn't any reason timespec would need to be written to the mmaped
buffer for the rest.

AFAICS, only struct ukevent is visible to the user, same would go for
kqueue's struct kevent.

Rakshasa

2006-08-23 00:32:05

by David Miller

[permalink] [raw]
Subject: Re: [take12 0/3] kevent: Generic event handling mechanism.

From: "Jari Sundell" <[email protected]>
Date: Wed, 23 Aug 2006 02:28:32 +0200

> There are system calls that take timespec, so I assume the magic is
> already available for handling the timeout argument of kevent.

System calls are one thing, they can be translated for these
kinds of situations. But this doesn't help, and nothing at
all can be done, for datastructures exposed to userspace via
mmap()'d buffers, which is what kevent will be doing.

This is what Alexey is trying to explain to you.

2006-08-23 00:43:52

by Jari Sundell

[permalink] [raw]
Subject: Re: [take12 0/3] kevent: Generic event handling mechanism.

On 8/23/06, David Miller <[email protected]> wrote:
> > There are system calls that take timespec, so I assume the magic is
> > already available for handling the timeout argument of kevent.
>
> System calls are one thing, they can be translated for these
> kinds of situations. But this doesn't help, and nothing at
> all can be done, for datastructures exposed to userspace via
> mmap()'d buffers, which is what kevent will be doing.
>
> This is what Alexey is trying to explain to you.

Actually, I didn't miss that, it is an orthogonal issue. A timespec
timeout parameter for the syscall does not imply the use of timespec
in any timer event, etc. Nor is there any timespec timer in kqueue's
struct kevent, which is the only (interface related) thing that will
be exposed.

Rakshasa

2006-08-23 01:36:27

by Nicholas Miell

[permalink] [raw]
Subject: The Proposed Linux kevent API (was: Re: [take12 0/3] kevent: Generic event handling mechanism.)

On Tue, 2006-08-22 at 16:06 -0700, David Miller wrote:
> With the time you spent writing this long email alone you could have
> worked on either documenting Evgeniy's interfaces or trying to write
> test applications against kevent to validate how useful the interfaces
> are and if there are any problems with them.
>
> You choose to rant and complain instead of participate.
>
> Therefore, many of us cannot take you seriously.

== The Proposed Linux kevent API ==

The proposed Linux kevent API is a new unified event handling
interface, similar in spirit to Windows completion ports and Solaris
completion ports and similar in fact to the FreeBSD/OS X kqueue
interface.

Using a single kernel call, a thread can wait for all possible event
types that the kernel can generate, instead of past interfaces that
only allow you to wait for specific subsets of events (e.g. POSIX
sigevent completions are limited only to AIO completion, timer expiry,
and the arrival of new messages to a message queue, while epoll_wait
is just a more efficient method of doing a traditional Unix select or
poll).

Instead of evolving the struct sigevent notification methods to allow
you to continue using standard POSIX interfaces like lio_listio(),
mq_notify() or timer_create() while queuing completion notifications
to a kevent completion queue (much the way the Solaris port API is
designed, or the the API proposed by Ulrich Drepper in "The
Need for Asynchronous, Zero-Copy Network I/O" as found at
http://people.redhat.com/drepper/newni.pdf ), kevent choooses to
follow the FreeBSD route and introduce an entirely new and
incompatible method of requesting and reporting event notifications
(while also managing to be incompatible with FreeBSD's kqueue).

This is done through the introduction of two new syscalls and a
variety of supporting datatypes. The first function, kevent_ctl(), is
used to create and manipulate kevent queues, while the second,
kevent_get_events(), is use to wait for new events.


They operate as follows:

int kevent_ctl(int fd, unsigned int cmd, unsigned int num, void *arg);

fd is the file descriptor referring to the kevent queue to
manipulate. It is ignored if the cmd parameter is KEVENT_CTL_INIT.

cmd is the requested operation. It can be one of the following:

KEVENT_CTL_INIT - create a new kevent queue and return it's file
descriptor. The fd, num, and arg parameters are ignored.

KEVENT_CTL_ADD, KEVENT_CTL_MODIFY, KEVENT_CTL_REMOVE - add new,
modify existing, or remove existing event notification
requests.

num is the number of struct ukevent in the array pointed to by arg

arg is an array of struct ukevent. Why it is of type void* and not
struct ukevent* is a mystery.

When called, kevent_ctl will carry out the operation specified in the
cmd parameter.


int kevent_get_events(int ctl_fd, unsigned int min_nr,
unsigned int max_nr, unsigned int timeout,
void *buf, unsigned flags)

ctl_fd is the file descriptor referring to the kevent queue.

min_nr is the minimum number of completed events that
kevent_get_events will block waiting for.

max_nr is the number of struct ukevent in buf.

timeout is the number of milliseconds to wait before returning less
than min_nr events. If this is -1, I *think* it'll wait
indefinitely, but I'm not sure that msecs_to_jiffies(-1) ends
up being MAX_SCHEDULE_TIMEOUT

buf is a pointer an array of struct ukevent. Why it is of type void*
and not struct ukevent* is a mystery.

flags is unused.

When called, kevent_get_events will wait timeout milliseconds for at
least min_nr completed events, copying completed struct ukevents to
buf and deleting any KEVENT_REQ_ONESHOT event requests.


The bulk of the interface is entirely done through the ukevent struct.
It is used to add event requests, modify existing event requests,
specify which event requests to remove, and return completed events.

struct ukevent contains the following members:

struct kevent_id id
This is described as containing the "socket number, file
descriptor and so on", which I take to mean it contains an fd,
however for some mysterious reason struct kevent_id contains
__u32 raw[2] and (for KEVENT_POLL events) the actual fd is
placed in raw[0] and raw[1] is never mentioned except to
faithfully copy it around.

For KEVENT_TIMER events, raw[0] contains a relative time in
milliseconds and raw[1] is still not used.

Why the struct member is called "raw" remains a mystery.

__u32 type
The actual event type, either KEVENT_POLL for fd polling or
KEVENT_TIMER for timers.

__u32 event
For events of type KEVENT_POLL, event contains the polling flags
of interest (i.e. POLLIN, POLLPRI, POLLOUT, POLLERR, POLLHUP,
POLLNVAL).

For events of type KEVENT_TIMER, event is ignored.

__u32 req_flags
Per-event request flags. Currently, this may be 0 or
KEVENT_REQ_ONESHOT to specify that the event be removed after it
is fired.

__u32 ret_flags
Per-event return flags. This may be 0 or a combination of
KEVENT_RET_DONE if the event has completed or
KVENT_RET_BROKEN if "the event is broken", which I take to mean
any sort of error condition. DONE|BROKEN is a valid state, but I
don't really know what it means.

__u32 ret_data[2]
Event return data. This is unused by KEVENT_POLL events, while
KEVENT_TIMER inexplicably places jiffies in ret_data[0]. If the
event is broken, an error code is placed in ret_data[1].

union { __u32 user[2]; void *ptr; }
An anonymous union (which is a fairly recent C addition)
containing data saved for the user and otherwise ignored by the
kernel.

For KEVENT_CTL_ADD, all fields relevant to the event type must be
filled (id, type, possibly event, req_flags). After kevent_ctl(...,
KEVENT_CTL_ADD, ...) returns each struct's ret_flags should be
checked to see if the event is already broken or done.

For KEVENT_CTL_MODIFY, the id, req_flags, and user and event fields
must be set and an existing kevent request must have matching id and
user fields. If a match is found, req_flags and event are replaced
with the newly supplied values. If a match can't be found, the passed
in ukevent's ret_flags has KEVENT_RET_BROKEN set. KEVENT_RET_DONE is
always set.

For KEVENT_CTL_REMOVE, the id and user fields must be set and an
existing kevent request must have matching id and user fields. If a
match is found, the kevent request is removed. If a match can't be
found, the passed in ukevent's ret_flags has KEVENT_RET_BROKEN
set. KEVENT_RET_DONE is always set.

For kevent_get_events, the entire structure is returned with ret_data[0]
modified to contain jiffies for KEVENT_TIMER events.

--

Having looked all this over to figure out what it actually does, I can
make the following comments:

- there's a distinct lack of any sort of commenting beyond brief
descriptions of what the occasional function is supposed to do

- the kevent interface is all the horror of the BSD kqueue interface,
but with no compatibility with the BSD kqueue interface.

- lots of parameters from userspace go unsanitized, although I'm not
sure if this will actually cause problems. At the very least, there
should be checks for unknown flags and use of reserved fields, lest
somebody start using them for their own purposes and then their app
breaks when a newer version of the kernel starts using them itself.

- timeouts are specified as int instead of struct timespec.

- kevent_ctl() and kevent_get_events() take void* for no discernible
reason.

- KEVENT_POLL is less functional than epoll (no return of which events
were actually signalled) and KEVENT_TIMER isn't as flexible as POSIX
interval timers (no clocks, only millisecond resolution, timers don't
have separate start and interval values).

- kevent_get_events() looks a whole lot like io_getevents() and a kevent
fd looks a whole lot like an io_context_t.

- struct ukevent has problems/inconsistencies -- id is wrapped in it's
own member struct, while user and ret_data aren't; id's single member is
named raw which does nothing to describe it's purpose; the user data is
an anonymous union, which is C99 only and might not be widely supported,
req_flags and ret_flags are more difficult to visually distinguish than
they have to be; and every member name could use a ukev_ prefix.

--

P.S.

Dear DaveM,

Go fuck yourself.

Love,
Nicholas

--
Nicholas Miell <[email protected]>

2006-08-23 01:51:39

by Nicholas Miell

[permalink] [raw]
Subject: Re: [take12 0/3] kevent: Generic event handling mechanism.

On Tue, 2006-08-22 at 16:46 -0700, Ulrich Drepper wrote:
> I so far also haven't taken the time to look exactly at the interface.
> I plan to do it asap since this is IMO our big chance to get it right.
> I want to have a unifying interface which can handle all the different
> events we need and which come up today and tomorrow. We have to be able
> to handle not only file descriptors and AIO but also timers, signals,
> message queues (OK, they are file descriptors but let's make it
> official), futexes. I'm probably missing the one or the other thing now.

Are you sure about signals? I thought about that, but they generally
fall into two categories: signals that have to be signals (i.e. SIGILL,
SIGABRT, SIGFPE, SIGSEGV, etc.) and signals that should be replaced a
queued event notification (SIGALRM, SIGRTMIN-SIGRTMAX).

Of course, that leaves things like SIGTERM, SIGINT, SIGQUIT, etc. so,
uh, nevermind then. Signal redirection to event queues is definitely
needed.

> DaveM says there are example programs for the current interfaces. I
> must admit I haven't seen those either. So if possible, point the world
> to them again. If you do that now I'll review everything and write up
> my recommendations re the interface before Monday.

There's a handful of little test apps at
http://tservice.net.ru/~s0mbre/archive/kevent/ , but they don't work
with the current iteration of the interface. I don't know if there are
others somewhere else.

--
Nicholas Miell <[email protected]>

2006-08-23 02:02:46

by Howard Chu

[permalink] [raw]
Subject: Re: The Proposed Linux kevent API

Nicholas Miell wrote:
> Having looked all this over to figure out what it actually does, I can
> make the following comments:
>
> - there's a distinct lack of any sort of commenting beyond brief
> descriptions of what the occasional function is supposed to do
>
> - the kevent interface is all the horror of the BSD kqueue interface,
> but with no compatibility with the BSD kqueue interface.
>
> - lots of parameters from userspace go unsanitized, although I'm not
> sure if this will actually cause problems. At the very least, there
> should be checks for unknown flags and use of reserved fields, lest
> somebody start using them for their own purposes and then their app
> breaks when a newer version of the kernel starts using them itself.
>


Which reminds me, why go through the trouble of copying the structs back
and forth between userspace and kernel space? Why not map the struct
array and leave it in place, as I proposed back here?
http://groups.google.com/group/linux.kernel/browse_frm/
thread/57847cfedb61bdd5/8d02afa60a8f83af?lnk=gst&q=equeue&rnum=
1#8d02afa60a8f83af

--
-- Howard Chu
Chief Architect, Symas Corp. http://www.symas.com
Director, Highland Sun http://highlandsun.com/hyc
OpenLDAP Core Team http://www.openldap.org/project/

2006-08-23 03:31:16

by David Miller

[permalink] [raw]
Subject: Re: The Proposed Linux kevent API

From: Nicholas Miell <[email protected]>
Date: Tue, 22 Aug 2006 18:36:07 -0700

> Dear DaveM,
>
> Go fuck yourself.

I guess this is the bit that's supposed to make me take you seriously
:-)

2006-08-23 03:47:54

by Nicholas Miell

[permalink] [raw]
Subject: Re: The Proposed Linux kevent API

On Tue, 2006-08-22 at 20:31 -0700, David Miller wrote:
> From: Nicholas Miell <[email protected]>
> Date: Tue, 22 Aug 2006 18:36:07 -0700
>
> > Dear DaveM,
> >
> > Go fuck yourself.
>
> I guess this is the bit that's supposed to make me take you seriously
> :-)

Of course. ^_^

--
Nicholas Miell <[email protected]>

2006-08-23 04:23:40

by Nicholas Miell

[permalink] [raw]
Subject: Re: The Proposed Linux kevent API

On Tue, 2006-08-22 at 20:47 -0700, Nicholas Miell wrote:
> On Tue, 2006-08-22 at 20:31 -0700, David Miller wrote:
> > From: Nicholas Miell <[email protected]>
> > Date: Tue, 22 Aug 2006 18:36:07 -0700
> >
> > > Dear DaveM,
> > >
> > > Go fuck yourself.
> >
> > I guess this is the bit that's supposed to make me take you seriously
> > :-)
>
> Of course. ^_^
>

Note that when I made this suggestion, I was not literally instructing
you to perform sexual acts upon yourself, especially if such a thing
would be illegal in your jurisdiction (although, IIRC, you moved to
Seattle recently and I'm pretty sure we allow that kind of thing here,
but we don't generally talk about it in public). So, my apologies to
you, Dave, for making such metaphorical instructions.

However, your choice to characterize my technical criticism as "rants"
and "complaints" and your continuous variations on "let's see you do
something better" as if it were a valid response to my objections did
get on my nerves and made it very hard for me to take you seriously.

--
Nicholas Miell <[email protected]>

2006-08-23 04:35:28

by Albert Cahalan

[permalink] [raw]
Subject: Re: [take12 0/3] kevent: Generic event handling mechanism.

Ulrich Drepper writes:

> I so far also haven't taken the time to look exactly at the
> interface. I plan to do it asap since this is IMO our big chance
> to get it right. I want to have a unifying interface which can
> handle all the different events we need and which come up today
> and tomorrow. We have to be able to handle not only file
> descriptors and AIO but also timers, signals, message queues
> (OK, they are file descriptors but let's make it official),
> futexes. I'm probably missing the one or the other thing now.

Yeah, you're missing one. I must warn you, it's tasteless.
You forgot ptrace events. (everybody now: EEEEEEEW!)

The wait-related functions in general are interesting.
People like to use a general event mechanism to deal with
threads exiting. Seriously, it would really help with
porting code from that other OS.

How about SysV semaphores?

2006-08-23 06:25:13

by Evgeniy Polyakov

[permalink] [raw]
Subject: Re: The Proposed Linux kevent API (was: Re: [take12 0/3] kevent: Generic event handling mechanism.)

On Tue, Aug 22, 2006 at 06:36:07PM -0700, Nicholas Miell ([email protected]) wrote:
> == The Proposed Linux kevent API ==
>
> The proposed Linux kevent API is a new unified event handling
> interface, similar in spirit to Windows completion ports and Solaris
> completion ports and similar in fact to the FreeBSD/OS X kqueue
> interface.
>
> Using a single kernel call, a thread can wait for all possible event
> types that the kernel can generate, instead of past interfaces that
> only allow you to wait for specific subsets of events (e.g. POSIX
> sigevent completions are limited only to AIO completion, timer expiry,
> and the arrival of new messages to a message queue, while epoll_wait
> is just a more efficient method of doing a traditional Unix select or
> poll).
>
> Instead of evolving the struct sigevent notification methods to allow
> you to continue using standard POSIX interfaces like lio_listio(),
> mq_notify() or timer_create() while queuing completion notifications
> to a kevent completion queue (much the way the Solaris port API is
> designed, or the the API proposed by Ulrich Drepper in "The
> Need for Asynchronous, Zero-Copy Network I/O" as found at
> http://people.redhat.com/drepper/newni.pdf ), kevent choooses to
> follow the FreeBSD route and introduce an entirely new and
> incompatible method of requesting and reporting event notifications
> (while also managing to be incompatible with FreeBSD's kqueue).
>
> This is done through the introduction of two new syscalls and a
> variety of supporting datatypes. The first function, kevent_ctl(), is
> used to create and manipulate kevent queues, while the second,
> kevent_get_events(), is use to wait for new events.
>
>
> They operate as follows:
>
> int kevent_ctl(int fd, unsigned int cmd, unsigned int num, void *arg);
>
> fd is the file descriptor referring to the kevent queue to
> manipulate. It is ignored if the cmd parameter is KEVENT_CTL_INIT.
>
> cmd is the requested operation. It can be one of the following:
>
> KEVENT_CTL_INIT - create a new kevent queue and return it's file
> descriptor. The fd, num, and arg parameters are ignored.
>
> KEVENT_CTL_ADD, KEVENT_CTL_MODIFY, KEVENT_CTL_REMOVE - add new,
> modify existing, or remove existing event notification
> requests.
>
> num is the number of struct ukevent in the array pointed to by arg
>
> arg is an array of struct ukevent. Why it is of type void* and not
> struct ukevent* is a mystery.
>
> When called, kevent_ctl will carry out the operation specified in the
> cmd parameter.
>
>
> int kevent_get_events(int ctl_fd, unsigned int min_nr,
> unsigned int max_nr, unsigned int timeout,
> void *buf, unsigned flags)
>
> ctl_fd is the file descriptor referring to the kevent queue.
>
> min_nr is the minimum number of completed events that
> kevent_get_events will block waiting for.
>
> max_nr is the number of struct ukevent in buf.
>
> timeout is the number of milliseconds to wait before returning less
> than min_nr events. If this is -1, I *think* it'll wait
> indefinitely, but I'm not sure that msecs_to_jiffies(-1) ends
> up being MAX_SCHEDULE_TIMEOUT

You forget the case for non-blocked file descriptor.
Here is comment from the code:

* In nonblocking mode it returns as many events as possible, but not more than @max_nr.
* In blocking mode it waits until timeout or if at least @min_nr events are ready.

> buf is a pointer an array of struct ukevent. Why it is of type void*
> and not struct ukevent* is a mystery.
>
> flags is unused.
>
> When called, kevent_get_events will wait timeout milliseconds for at
> least min_nr completed events, copying completed struct ukevents to
> buf and deleting any KEVENT_REQ_ONESHOT event requests.
>
>
> The bulk of the interface is entirely done through the ukevent struct.
> It is used to add event requests, modify existing event requests,
> specify which event requests to remove, and return completed events.
>
> struct ukevent contains the following members:
>
> struct kevent_id id
> This is described as containing the "socket number, file
> descriptor and so on", which I take to mean it contains an fd,
> however for some mysterious reason struct kevent_id contains
> __u32 raw[2] and (for KEVENT_POLL events) the actual fd is
> placed in raw[0] and raw[1] is never mentioned except to
> faithfully copy it around.
>
> For KEVENT_TIMER events, raw[0] contains a relative time in
> milliseconds and raw[1] is still not used.
>
> Why the struct member is called "raw" remains a mystery.

If you followed previous patchsets you could find, that there were
network AIO, fs IO and fs-inotify-like notifications.
Some of them use that fields.
I got two u32 numbers to be "union"ed with pointer like user data is.
That pointer should be obtained through Ulrich's dma_alloc() and
friends.

> __u32 type
> The actual event type, either KEVENT_POLL for fd polling or
> KEVENT_TIMER for timers.
>
> __u32 event
> For events of type KEVENT_POLL, event contains the polling flags
> of interest (i.e. POLLIN, POLLPRI, POLLOUT, POLLERR, POLLHUP,
> POLLNVAL).
>
> For events of type KEVENT_TIMER, event is ignored.
>
> __u32 req_flags
> Per-event request flags. Currently, this may be 0 or
> KEVENT_REQ_ONESHOT to specify that the event be removed after it
> is fired.
>
> __u32 ret_flags
> Per-event return flags. This may be 0 or a combination of
> KEVENT_RET_DONE if the event has completed or
> KVENT_RET_BROKEN if "the event is broken", which I take to mean
> any sort of error condition. DONE|BROKEN is a valid state, but I
> don't really know what it means.

DONE means that event processing is completed and it can be read back to
userspace, if in addition it contains BROKEN it means that kevent is
broken.

> __u32 ret_data[2]
> Event return data. This is unused by KEVENT_POLL events, while
> KEVENT_TIMER inexplicably places jiffies in ret_data[0]. If the
> event is broken, an error code is placed in ret_data[1].

Each kevent user can place here any hints it wants, for example network
socket notifications place there length of the accept queue and so on.
In error condition error is placed there too.

> union { __u32 user[2]; void *ptr; }
> An anonymous union (which is a fairly recent C addition)
> containing data saved for the user and otherwise ignored by the
> kernel.
>
> For KEVENT_CTL_ADD, all fields relevant to the event type must be
> filled (id, type, possibly event, req_flags). After kevent_ctl(...,
> KEVENT_CTL_ADD, ...) returns each struct's ret_flags should be
> checked to see if the event is already broken or done.
>
> For KEVENT_CTL_MODIFY, the id, req_flags, and user and event fields
> must be set and an existing kevent request must have matching id and
> user fields. If a match is found, req_flags and event are replaced
> with the newly supplied values. If a match can't be found, the passed
> in ukevent's ret_flags has KEVENT_RET_BROKEN set. KEVENT_RET_DONE is
> always set.

DONE means that user's request is completed.
I.e. it was copied from userspace, watched, analyzed and somehow
processed.

> For KEVENT_CTL_REMOVE, the id and user fields must be set and an
> existing kevent request must have matching id and user fields. If a
> match is found, the kevent request is removed. If a match can't be
> found, the passed in ukevent's ret_flags has KEVENT_RET_BROKEN
> set. KEVENT_RET_DONE is always set.
>
> For kevent_get_events, the entire structure is returned with ret_data[0]
> modified to contain jiffies for KEVENT_TIMER events.

ret_data can contain any hint kernel wants to put there.
It can contain 0.

> --
>
> Having looked all this over to figure out what it actually does, I can
> make the following comments:
>
> - there's a distinct lack of any sort of commenting beyond brief
> descriptions of what the occasional function is supposed to do
>
> - the kevent interface is all the horror of the BSD kqueue interface,
> but with no compatibility with the BSD kqueue interface.
>
> - lots of parameters from userspace go unsanitized, although I'm not
> sure if this will actually cause problems. At the very least, there
> should be checks for unknown flags and use of reserved fields, lest
> somebody start using them for their own purposes and then their app
> breaks when a newer version of the kernel starts using them itself.

All parameters which are not checked are not used.
If user puts own flags where it is not allowed it to do (like ret_flags)
he creates problems for himself. No one complains when arbitrary number
is placed into file dsecriptor and write() fails.

> - timeouts are specified as int instead of struct timespec.

timespec uses long, which is wrong.
I can put there any other structure which has strict types - no longs,
that's the rule, no matter if there is a wrappers in per-arch syscall
code.
poll alwasy ued millisecods and all are happy.

> - kevent_ctl() and kevent_get_events() take void* for no discernible
> reason.

Because interfaces are changed - they used contorl block before, and now
they do not. There is an opinion from Cristoph that syscall there is
wrong too and better to use ioctls(), so I do not change it right now,
since it can be changed in future (again).

> - KEVENT_POLL is less functional than epoll (no return of which events
> were actually signalled) and KEVENT_TIMER isn't as flexible as POSIX
> interval timers (no clocks, only millisecond resolution, timers don't
> have separate start and interval values).

That's nonsence - kevent returns fired events, POSIX timer API can only
use timers. When you can put network AIO into timer API call me, I will
buy your a t-shrt.
Your meaning of "separate start and interval values" is not correct,
please see how both timers work.

The only thing correct is that it only support millisecond resolution -
I use poll quite for a while and it is really good interface, so it was
copied from there.

> - kevent_get_events() looks a whole lot like io_getevents() and a kevent
> fd looks a whole lot like an io_context_t.
>
> - struct ukevent has problems/inconsistencies -- id is wrapped in it's
> own member struct, while user and ret_data aren't; id's single member is
> named raw which does nothing to describe it's purpose; the user data is
> an anonymous union, which is C99 only and might not be widely supported,
> req_flags and ret_flags are more difficult to visually distinguish than
> they have to be; and every member name could use a ukev_ prefix.

I described what id is and why it is placed into u32[2] - it must be
union'ed with pointer, when such interface will be created.
How can you describe id for inode notification and user pointer?

As you can see there are no problems with understanding how it works -
I'm sure it has not took you too much time, I think writing previous
messages took much longer.

Now my point:
1. unified interface - since there are many types of different event
mechanisms (already implemented, but not theoretical handwaving), I
created unified interface, which does not know about what the event is
provided, it just routes it into appropriate storage and start
processing. Anyone who thinks that kevents must have separate interface
for each type of events just do not see, how many types there are.
It is simple to wrap it into epoll and POSIX timers, but there are quite
a few others - inotify, socket notifications, various AIO
implementatinos. Who will create new API for them?
If you think that kevents are going to be used through wrapper library -
implement there any interface you like. If you do not, consider how many
syscalls are required, and finally the same function will be called.

2. Wrong documantation and examples.
For the last two weeks interface were changed at least three (!) times.
Do you really think that I have some slaves in the cellar?
When interface is ready I will write docs and change examples.
But even with old applicatinos, it is _really_ trivial to understand
what parameter is used and where, especially with excellent LWN
articles.


And actually I do not see that this process comes to it's end -

NO FSCKING ONE knows what we want!

So I will say as author what _I_ want.

Until there is strong objection on API, nothing will be changed.

Something will be changed only when there are several people who acks
the change.

This can end up with declining of merge - do not care, I hack not for
the entry in MAINTAINERS, but because I like the process,
and I can use it with external patches easily.

Nick, you want POSIX timers API? Ok, I can change it, if several core
developers ack this. If they do not, I will not even discuss it.
You can implement it as addon, no problem.

Dixi.

> --
>
> P.S.
>
> Dear DaveM,
>
> Go fuck yourself.
>
> Love,
> Nicholas

In a decent society you would have your nose broken...
But in virtual one you just can not be considered as serious person.

> --
> Nicholas Miell <[email protected]>

--
Evgeniy Polyakov

2006-08-23 06:55:14

by Evgeniy Polyakov

[permalink] [raw]
Subject: Re: [take12 0/3] kevent: Generic event handling mechanism.

On Tue, Aug 22, 2006 at 04:46:19PM -0700, Ulrich Drepper ([email protected]) wrote:
> DaveM says there are example programs for the current interfaces. I
> must admit I haven't seen those either. So if possible, point the world
> to them again. If you do that now I'll review everything and write up
> my recommendations re the interface before Monday.

Attached typical usage for inode and timer events.
Network AIO was implemented as separated syscalls.

> --
> ➧ Ulrich Drepper ➧ Red Hat, Inc. ➧ 444 Castro St ➧ Mountain View, CA ❖
>



--
Evgeniy Polyakov


Attachments:
(No filename) (578.00 B)
evtest.c (4.01 kB)
Download all attachments

2006-08-23 06:58:19

by Evgeniy Polyakov

[permalink] [raw]
Subject: Re: [take12 0/3] kevent: Generic event handling mechanism.

On Wed, Aug 23, 2006 at 02:43:50AM +0200, Jari Sundell ([email protected]) wrote:
> Actually, I didn't miss that, it is an orthogonal issue. A timespec
> timeout parameter for the syscall does not imply the use of timespec
> in any timer event, etc. Nor is there any timespec timer in kqueue's
> struct kevent, which is the only (interface related) thing that will
> be exposed.

void * in structure exported to userspace is forbidden.
long in syscall requires wrapper in per-arch code (although that
workaround _is_ there, it does not mean that broken interface should
be used).
poll uses millisecods - it is perfectly ok.

> Rakshasa

--
Evgeniy Polyakov

2006-08-23 07:09:09

by Andrew Morton

[permalink] [raw]
Subject: Re: [take12 0/3] kevent: Generic event handling mechanism.

On Wed, 23 Aug 2006 10:56:59 +0400
Evgeniy Polyakov <[email protected]> wrote:

> On Wed, Aug 23, 2006 at 02:43:50AM +0200, Jari Sundell ([email protected]) wrote:
> > Actually, I didn't miss that, it is an orthogonal issue. A timespec
> > timeout parameter for the syscall does not imply the use of timespec
> > in any timer event, etc. Nor is there any timespec timer in kqueue's
> > struct kevent, which is the only (interface related) thing that will
> > be exposed.
>
> void * in structure exported to userspace is forbidden.
> long in syscall requires wrapper in per-arch code (although that
> workaround _is_ there, it does not mean that broken interface should
> be used).
> poll uses millisecods - it is perfectly ok.

I wonder whether designing-in a millisecond granularity is the right thing
to do. If in a few years the kernel is running tickless with high-res clock
interrupt sources, that might look a bit lumpy.

Switching it to a __u64 nanosecond counter would be basically free on
64-bit machines, and not very expensive on 32-bit, no?

2006-08-23 07:11:23

by Evgeniy Polyakov

[permalink] [raw]
Subject: Re: [take12 0/3] kevent: Generic event handling mechanism.

On Wed, Aug 23, 2006 at 12:07:58AM -0700, Andrew Morton ([email protected]) wrote:
> On Wed, 23 Aug 2006 10:56:59 +0400
> Evgeniy Polyakov <[email protected]> wrote:
>
> > On Wed, Aug 23, 2006 at 02:43:50AM +0200, Jari Sundell ([email protected]) wrote:
> > > Actually, I didn't miss that, it is an orthogonal issue. A timespec
> > > timeout parameter for the syscall does not imply the use of timespec
> > > in any timer event, etc. Nor is there any timespec timer in kqueue's
> > > struct kevent, which is the only (interface related) thing that will
> > > be exposed.
> >
> > void * in structure exported to userspace is forbidden.
> > long in syscall requires wrapper in per-arch code (although that
> > workaround _is_ there, it does not mean that broken interface should
> > be used).
> > poll uses millisecods - it is perfectly ok.
>
> I wonder whether designing-in a millisecond granularity is the right thing
> to do. If in a few years the kernel is running tickless with high-res clock
> interrupt sources, that might look a bit lumpy.
>
> Switching it to a __u64 nanosecond counter would be basically free on
> 64-bit machines, and not very expensive on 32-bit, no?

Let's then place there a structure with 64bit seconds and nanoseconds,
similar to timspec, but without longs there.
What do you think?

--
Evgeniy Polyakov

2006-08-23 07:35:21

by David Miller

[permalink] [raw]
Subject: Re: [take12 0/3] kevent: Generic event handling mechanism.

From: Andrew Morton <[email protected]>
Date: Wed, 23 Aug 2006 00:07:58 -0700

> I wonder whether designing-in a millisecond granularity is the right thing
> to do. If in a few years the kernel is running tickless with high-res clock
> interrupt sources, that might look a bit lumpy.
>
> Switching it to a __u64 nanosecond counter would be basically free on
> 64-bit machines, and not very expensive on 32-bit, no?

If it ends up in a structure we'll need to use the "aligned_u64" type
in order to avoid problems with 32-bit x86 binaries running on 64-bit
kernels.

2006-08-23 07:43:54

by Ian McDonald

[permalink] [raw]
Subject: Re: [take12 0/3] kevent: Generic event handling mechanism.

> I wonder whether designing-in a millisecond granularity is the right thing
> to do. If in a few years the kernel is running tickless with high-res clock
> interrupt sources, that might look a bit lumpy.
>
I'd second that - when working on DCCP I've done a lot of the work in
microseconds and it made quite a difference instead of milliseconds
because of it's design.

I haven't followed kevents in great detail but it sounds like
something that could be useful for me with higher resolution timers
than milliseconds.
--
Ian McDonald
Web: http://wand.net.nz/~iam4
Blog: http://imcdnzl.blogspot.com
WAND Network Research Group
Department of Computer Science
University of Waikato
New Zealand

2006-08-23 07:52:12

by Evgeniy Polyakov

[permalink] [raw]
Subject: Re: [take12 0/3] kevent: Generic event handling mechanism.

On Wed, Aug 23, 2006 at 12:07:58AM -0700, Andrew Morton ([email protected]) wrote:
> On Wed, 23 Aug 2006 10:56:59 +0400
> Evgeniy Polyakov <[email protected]> wrote:
>
> > On Wed, Aug 23, 2006 at 02:43:50AM +0200, Jari Sundell ([email protected]) wrote:
> > > Actually, I didn't miss that, it is an orthogonal issue. A timespec
> > > timeout parameter for the syscall does not imply the use of timespec
> > > in any timer event, etc. Nor is there any timespec timer in kqueue's
> > > struct kevent, which is the only (interface related) thing that will
> > > be exposed.
> >
> > void * in structure exported to userspace is forbidden.
> > long in syscall requires wrapper in per-arch code (although that
> > workaround _is_ there, it does not mean that broken interface should
> > be used).
> > poll uses millisecods - it is perfectly ok.
>
> I wonder whether designing-in a millisecond granularity is the right thing
> to do. If in a few years the kernel is running tickless with high-res clock
> interrupt sources, that might look a bit lumpy.
>
> Switching it to a __u64 nanosecond counter would be basically free on
> 64-bit machines, and not very expensive on 32-bit, no?

I can put nanoseconds as timer interval too (with aligned_u64 as David
mentioned), and put it for timeout value too - 64 bit nanosecods ends up
with 58 years, probably enough.
Structures with u64 a really not so good idea.

--
Evgeniy Polyakov

2006-08-23 08:02:24

by Nicholas Miell

[permalink] [raw]
Subject: Re: The Proposed Linux kevent API (was: Re: [take12 0/3] kevent: Generic event handling mechanism.)

On Wed, 2006-08-23 at 10:22 +0400, Evgeniy Polyakov wrote:
> On Tue, Aug 22, 2006 at 06:36:07PM -0700, Nicholas Miell ([email protected]) wrote:
> > int kevent_get_events(int ctl_fd, unsigned int min_nr,
> > unsigned int max_nr, unsigned int timeout,
> > void *buf, unsigned flags)
> >
> > ctl_fd is the file descriptor referring to the kevent queue.
> >
> > min_nr is the minimum number of completed events that
> > kevent_get_events will block waiting for.
> >
> > max_nr is the number of struct ukevent in buf.
> >
> > timeout is the number of milliseconds to wait before returning less
> > than min_nr events. If this is -1, I *think* it'll wait
> > indefinitely, but I'm not sure that msecs_to_jiffies(-1) ends
> > up being MAX_SCHEDULE_TIMEOUT
>
> You forget the case for non-blocked file descriptor.
> Here is comment from the code:
>
> * In nonblocking mode it returns as many events as possible, but not more than @max_nr.
> * In blocking mode it waits until timeout or if at least @min_nr events are ready.

I missed that, but why bother with O_NONBLOCK? It appears to to make the
timeout parameter completely unnecessary, which means you could just
make timeout = 0 give you the nonblocking behavior, and non-zero the
blocking behavior (leaving -1 as wait forever).

> > buf is a pointer an array of struct ukevent. Why it is of type void*
> > and not struct ukevent* is a mystery.
> >
> > flags is unused.
> >
> > When called, kevent_get_events will wait timeout milliseconds for at
> > least min_nr completed events, copying completed struct ukevents to
> > buf and deleting any KEVENT_REQ_ONESHOT event requests.
> >
> >
> > The bulk of the interface is entirely done through the ukevent struct.
> > It is used to add event requests, modify existing event requests,
> > specify which event requests to remove, and return completed events.
> >
> > struct ukevent contains the following members:
> >
> > struct kevent_id id
> > This is described as containing the "socket number, file
> > descriptor and so on", which I take to mean it contains an fd,
> > however for some mysterious reason struct kevent_id contains
> > __u32 raw[2] and (for KEVENT_POLL events) the actual fd is
> > placed in raw[0] and raw[1] is never mentioned except to
> > faithfully copy it around.
> >
> > For KEVENT_TIMER events, raw[0] contains a relative time in
> > milliseconds and raw[1] is still not used.
> >
> > Why the struct member is called "raw" remains a mystery.
>
> If you followed previous patchsets you could find, that there were
> network AIO, fs IO and fs-inotify-like notifications.
> Some of them use that fields.
> I got two u32 numbers to be "union"ed with pointer like user data is.
> That pointer should be obtained through Ulrich's dma_alloc() and
> friends.
>
> > __u32 type
> > The actual event type, either KEVENT_POLL for fd polling or
> > KEVENT_TIMER for timers.
> >
> > __u32 event
> > For events of type KEVENT_POLL, event contains the polling flags
> > of interest (i.e. POLLIN, POLLPRI, POLLOUT, POLLERR, POLLHUP,
> > POLLNVAL).
> >
> > For events of type KEVENT_TIMER, event is ignored.
> >
> > __u32 req_flags
> > Per-event request flags. Currently, this may be 0 or
> > KEVENT_REQ_ONESHOT to specify that the event be removed after it
> > is fired.
> >
> > __u32 ret_flags
> > Per-event return flags. This may be 0 or a combination of
> > KEVENT_RET_DONE if the event has completed or
> > KVENT_RET_BROKEN if "the event is broken", which I take to mean
> > any sort of error condition. DONE|BROKEN is a valid state, but I
> > don't really know what it means.
>
> DONE means that event processing is completed and it can be read back to
> userspace, if in addition it contains BROKEN it means that kevent is
> broken.

So KEVENT_RET_DONE is purely an internal thing? And what does
KEVENT_RET_BROKEN mean, exactly?

> > __u32 ret_data[2]
> > Event return data. This is unused by KEVENT_POLL events, while
> > KEVENT_TIMER inexplicably places jiffies in ret_data[0]. If the
> > event is broken, an error code is placed in ret_data[1].
>
> Each kevent user can place here any hints it wants, for example network
> socket notifications place there length of the accept queue and so on.

I didn't document what it could theoretically be used for, just what it
is actually used for.

> In error condition error is placed there too.
>
> > union { __u32 user[2]; void *ptr; }
> > An anonymous union (which is a fairly recent C addition)
> > containing data saved for the user and otherwise ignored by the
> > kernel.
> >
> > For KEVENT_CTL_ADD, all fields relevant to the event type must be
> > filled (id, type, possibly event, req_flags). After kevent_ctl(...,
> > KEVENT_CTL_ADD, ...) returns each struct's ret_flags should be
> > checked to see if the event is already broken or done.
> >
> > For KEVENT_CTL_MODIFY, the id, req_flags, and user and event fields
> > must be set and an existing kevent request must have matching id and
> > user fields. If a match is found, req_flags and event are replaced
> > with the newly supplied values. If a match can't be found, the passed
> > in ukevent's ret_flags has KEVENT_RET_BROKEN set. KEVENT_RET_DONE is
> > always set.
>
> DONE means that user's request is completed.
> I.e. it was copied from userspace, watched, analyzed and somehow
> processed.

KEVENT_RET_DONE certainly looks like a purely internal flag, which
shouldn't be exported to userspace. (Except for it's usage with
KEVENT_CTL_ADD, I guess.)

> > For KEVENT_CTL_REMOVE, the id and user fields must be set and an
> > existing kevent request must have matching id and user fields. If a
> > match is found, the kevent request is removed. If a match can't be
> > found, the passed in ukevent's ret_flags has KEVENT_RET_BROKEN
> > set. KEVENT_RET_DONE is always set.
> >
> > For kevent_get_events, the entire structure is returned with ret_data[0]
> > modified to contain jiffies for KEVENT_TIMER events.
>
> ret_data can contain any hint kernel wants to put there.
> It can contain 0.

Again, I didn't document theoretical usage, just what's actually done.

>
> > --
> >
> > Having looked all this over to figure out what it actually does, I can
> > make the following comments:
> >
> > - there's a distinct lack of any sort of commenting beyond brief
> > descriptions of what the occasional function is supposed to do
> >
> > - the kevent interface is all the horror of the BSD kqueue interface,
> > but with no compatibility with the BSD kqueue interface.
> >
> > - lots of parameters from userspace go unsanitized, although I'm not
> > sure if this will actually cause problems. At the very least, there
> > should be checks for unknown flags and use of reserved fields, lest
> > somebody start using them for their own purposes and then their app
> > breaks when a newer version of the kernel starts using them itself.
>
> All parameters which are not checked are not used.

Not used currently.

> If user puts own flags where it is not allowed it to do (like ret_flags)
> he creates problems for himself. No one complains when arbitrary number
> is placed into file dsecriptor and write() fails.

So prevent the user from causing future problems -- reject all invalid
uses.

> > - timeouts are specified as int instead of struct timespec.
>
> timespec uses long, which is wrong.
> I can put there any other structure which has strict types - no longs,
> that's the rule, no matter if there is a wrappers in per-arch syscall
> code.

I don't understand this -- you're saying that you can't use a long
because of compat tasks on 64-bit architectures?

> poll alwasy ued millisecods and all are happy.
>
> > - kevent_ctl() and kevent_get_events() take void* for no discernible
> > reason.
>
> Because interfaces are changed - they used contorl block before, and now
> they do not. There is an opinion from Cristoph that syscall there is
> wrong too and better to use ioctls(), so I do not change it right now,
> since it can be changed in future (again).

OK, that makes sense, but it still has to be fixed assuming this is the
final form of the interface.

> > - KEVENT_POLL is less functional than epoll (no return of which events
> > were actually signalled) and KEVENT_TIMER isn't as flexible as POSIX
> > interval timers (no clocks, only millisecond resolution, timers don't
> > have separate start and interval values).
>
> That's nonsence - kevent returns fired events,

Yes, but why did the event fire? poll/epoll return the
POLLIN/POLLPRI/POLLOUT/POLLERR/etc. bitmask when they return events.

> POSIX timer API can only
> use timers. When you can put network AIO into timer API call me, I will
> buy your a t-shrt.

I was talking about POSIX timers verses KEVENT_TIMER in isolation.
Ignoring the event delivery mechanism, POSIX timers are more capable
than kevent timers.

> Your meaning of "separate start and interval values" is not correct,
> please see how both timers work.

With POSIX timers, I can create a timer that starts periodically firing
itself at some point in the future, where the period isn't equal to the
different between now and it's first expiry (i.e. 10 seconds from now,
start firing every 2 seconds). I don't think I can do this using
KEVENT_TIMER.

>
> The only thing correct is that it only support millisecond resolution -
> I use poll quite for a while and it is really good interface, so it was
> copied from there.
>
> > - kevent_get_events() looks a whole lot like io_getevents() and a kevent
> > fd looks a whole lot like an io_context_t.
> >
> > - struct ukevent has problems/inconsistencies -- id is wrapped in it's
> > own member struct, while user and ret_data aren't; id's single member is
> > named raw which does nothing to describe it's purpose; the user data is
> > an anonymous union, which is C99 only and might not be widely supported,
> > req_flags and ret_flags are more difficult to visually distinguish than
> > they have to be; and every member name could use a ukev_ prefix.
>
> I described what id is and why it is placed into u32[2] - it must be
> union'ed with pointer, when such interface will be created.
> How can you describe id for inode notification and user pointer?
>
> As you can see there are no problems with understanding how it works -
> I'm sure it has not took you too much time, I think writing previous
> messages took much longer.

The previous message which DaveM was kind enough to characterize as a
rant took 10 minutes. The kevent docs took ~2 hours. You're welcome.

>
> Now my point:
> 1. unified interface - since there are many types of different event
> mechanisms (already implemented, but not theoretical handwaving), I
> created unified interface, which does not know about what the event is
> provided, it just routes it into appropriate storage and start
> processing. Anyone who thinks that kevents must have separate interface
> for each type of events just do not see, how many types there are.
> It is simple to wrap it into epoll and POSIX timers, but there are quite
> a few others - inotify, socket notifications, various AIO
> implementatinos. Who will create new API for them?

Ah, now there's your problem. The joy of this is that you don't have to
implement any of it. All you really need to do is implement the
userspace interface for creating kevent queues and dequeuing events and
the kernel interface for adding events to the queue and provide a
general enough event struct. (All you really need is the event type,
event specific data (a uintptr_t would probably be sufficient), and a
user cookie pointer)

Once you've done that, you can say "Hey, POSIX timers people, you should
make a way for people to queue timer completion events" and "Hey, AIO
people, you should make a way for people to queue AIO completion events"
and wait for them to do the work. (And if the AIO people complain that
they already have a way for people to queue AIO completions, well, then
you really need to justify why yours is better.)

In fact, I'm beginning to wonder if kevent_ctl() should exist at all --
it's only real use in this scenario would be for the odds-and-ends that
don't already have mechanisms for specifying how event completions
should be handled, in which case it becomes much more like
port_associate() and port_disassociate().

> If you think that kevents are going to be used through wrapper library -
> implement there any interface you like. If you do not, consider how many
> syscalls are required, and finally the same function will be called.
>
> 2. Wrong documantation and examples.
> For the last two weeks interface were changed at least three (!) times.
> Do you really think that I have some slaves in the cellar?
> When interface is ready I will write docs and change examples.
> But even with old applicatinos, it is _really_ trivial to understand
> what parameter is used and where, especially with excellent LWN
> articles.

The LWN articles weren't really that excellent; more of a pair of
cursory overviews really. And yes, I'm pretty sure that it is understood
that you don't have infinite time to work on this.

>
> And actually I do not see that this process comes to it's end -
>
> NO FSCKING ONE knows what we want!
>
> So I will say as author what _I_ want.
>
> Until there is strong objection on API, nothing will be changed.
>
> Something will be changed only when there are several people who acks
> the change.
>
> This can end up with declining of merge - do not care, I hack not for
> the entry in MAINTAINERS, but because I like the process,
> and I can use it with external patches easily.
>
> Nick, you want POSIX timers API? Ok, I can change it, if several core
> developers ack this. If they do not, I will not even discuss it.
> You can implement it as addon, no problem.
>
> Dixi.
>
> > --
> >
> > P.S.
> >
> > Dear DaveM,
> >
> > Go fuck yourself.
> >
> > Love,
> > Nicholas
>
> In a decent society you would have your nose broken...
> But in virtual one you just can not be considered as serious person.

I'm fairly certain DaveM is still the reigning champion of the
F-bomb-to-lines-of-kernel-code ratio, although I think Rusty may be
catching up.

Wait, no, somebody in MIPS land (I think maybe Christoph Hellwig) really
really hates the IOC3 very very much.

--
Nicholas Miell <[email protected]>

2006-08-23 08:19:37

by Nicholas Miell

[permalink] [raw]
Subject: Re: [take12 0/3] kevent: Generic event handling mechanism.

On Wed, 2006-08-23 at 00:35 -0700, David Miller wrote:
> From: Andrew Morton <[email protected]>
> Date: Wed, 23 Aug 2006 00:07:58 -0700
>
> > I wonder whether designing-in a millisecond granularity is the right thing
> > to do. If in a few years the kernel is running tickless with high-res clock
> > interrupt sources, that might look a bit lumpy.
> >
> > Switching it to a __u64 nanosecond counter would be basically free on
> > 64-bit machines, and not very expensive on 32-bit, no?
>
> If it ends up in a structure we'll need to use the "aligned_u64" type
> in order to avoid problems with 32-bit x86 binaries running on 64-bit
> kernels.

Perhaps

struct timespec64
{
uint64_t tv_sec __attribute__((aligned(8)));
uint32_t tv_nsec;
}

with a snide remark about gcc in the comments?

--
Nicholas Miell <[email protected]>

2006-08-23 08:22:19

by Jari Sundell

[permalink] [raw]
Subject: Re: [take12 0/3] kevent: Generic event handling mechanism.

On 8/23/06, Evgeniy Polyakov <[email protected]> wrote:
> void * in structure exported to userspace is forbidden.

Only void * I'm seeing belongs to the user, (udata) perhaps you are
talking of something different?

> long in syscall requires wrapper in per-arch code (although that
> workaround _is_ there, it does not mean that broken interface should
> be used).
> poll uses millisecods - it is perfectly ok.

The kernel is there to hide those ugly implementation details from the
user, so I don't care that much about a workaround being required in
some cases. More important, IMHO is consistency with the POSIX system
calls.

I guess as long as you use usec, at least it won't be a pain to use.

Rakshasa

2006-08-23 08:45:07

by Evgeniy Polyakov

[permalink] [raw]
Subject: Re: [take12 0/3] kevent: Generic event handling mechanism.

On Wed, Aug 23, 2006 at 10:22:06AM +0200, Jari Sundell ([email protected]) wrote:
> On 8/23/06, Evgeniy Polyakov <[email protected]> wrote:
> >void * in structure exported to userspace is forbidden.
>
> Only void * I'm seeing belongs to the user, (udata) perhaps you are
> talking of something different?

Yes, exactly about it.

I put union {
u32 a[2];
void *b;
}
epcially to eliminate that problem.

And I'm not that sure aboit stuff like uptr_t or how they call pointers
in userspace and kernelspace.

> >long in syscall requires wrapper in per-arch code (although that
> >workaround _is_ there, it does not mean that broken interface should
> >be used).
> >poll uses millisecods - it is perfectly ok.
>
> The kernel is there to hide those ugly implementation details from the
> user, so I don't care that much about a workaround being required in
> some cases. More important, IMHO is consistency with the POSIX system
> calls.
>
> I guess as long as you use usec, at least it won't be a pain to use.

Andrew suggested to use nanoseconds there in u64 variable.
I think it is ok.

> Rakshasa

--
Evgeniy Polyakov

2006-08-23 08:51:37

by Eric Dumazet

[permalink] [raw]
Subject: Re: [take12 1/3] kevent: Core files.

Hello Evgeniy

I have one comment/suggestion (minor detail, your work is very good)

I suggest to add one item in kevent_registered_callbacks[], so that
kevent_registered_callbacks[KEVENT_MAX] is valid and can act as a fallback.

In kevent_add_callbacks() you could replace the eventual NULL pointers by
kevent_break() in
kevent_registered_callbacks[pos].{callback, enqueue, dequeue}
like :

+int kevent_add_callbacks(const struct kevent_callbacks *cb, unsigned int pos)
+{
+ struct kevent_callbacks *p = &kevent_registered_callbacks[pos];
+ if (pos >= KEVENT_MAX)
+ return -EINVAL;
+ p->enqueue = (cb->enqueue) ? cb->enqueue : kevent_break;
+ p->dequeue = (cb->dequeue) ? cb->dequeue : kevent_break;
+ p->callback = (cb->callback) ? cb->callback : kevent_break;
+ printk(KERN_INFO "KEVENT: Added callbacks for type %u.\n", pos);
+ return 0;
+}

(I also added a const qualifier in first function argument, and unsigned int
pos so that the "if (pos >= KEVENT_MAX)" test catches 'negative' values)

Then you change kevent_break() to return -EINVAL instead of 0.

+int kevent_break(struct kevent *k)
+{
+???????unsigned long flags;
+
+???????spin_lock_irqsave(&k->ulock, flags);
+???????k->event.ret_flags |= KEVENT_RET_BROKEN;
+???????spin_unlock_irqrestore(&k->ulock, flags);
+???????return -EINVAL;
+}

Then avoid the tests in kevent_enqueue()

+int kevent_enqueue(struct kevent *k)
+{
+???????return k->callbacks.enqueue(k);
+}

And avoid the tests in kevent_dequeue()

+int kevent_dequeue(struct kevent *k)
+{
+???????return k->callbacks.dequeue(k);
+}

And change kevent_init() to

+int kevent_init(struct kevent *k)
+{
+???????spin_lock_init(&k->ulock);
+???????k->flags = 0;
+
+???????if (unlikely(k->event.type >= KEVENT_MAX))
+ k->event.type = KEVENT_MAX;
+
+
+???????k->callbacks = kevent_registered_callbacks[k->event.type];
+???????if (unlikely(k->callbacks.callback == kevent_break))
+???????????????return kevent_break(k);
+
+???????return 0;
+}



Eric Dumazet

2006-08-23 09:19:08

by Evgeniy Polyakov

[permalink] [raw]
Subject: Re: [take12 1/3] kevent: Core files.

On Wed, Aug 23, 2006 at 10:51:36AM +0200, Eric Dumazet ([email protected]) wrote:
> Hello Evgeniy

Hi Eric.

> I have one comment/suggestion (minor detail, your work is very good)
>
> I suggest to add one item in kevent_registered_callbacks[], so that
> kevent_registered_callbacks[KEVENT_MAX] is valid and can act as a fallback.

Sounds good, could you please send appliable patch with proper
signed-off line?

> Eric Dumazet

--
Evgeniy Polyakov

2006-08-23 09:23:51

by Eric Dumazet

[permalink] [raw]
Subject: Re: [take12 1/3] kevent: Core files.

On Wednesday 23 August 2006 11:18, Evgeniy Polyakov wrote:
> On Wed, Aug 23, 2006 at 10:51:36AM +0200, Eric Dumazet ([email protected])
wrote:
> > Hello Evgeniy
>
> Hi Eric.
>
> > I have one comment/suggestion (minor detail, your work is very good)
> >
> > I suggest to add one item in kevent_registered_callbacks[], so that
> > kevent_registered_callbacks[KEVENT_MAX] is valid and can act as a
> > fallback.
>
> Sounds good, could you please send appliable patch with proper
> signed-off line?

Unfortunately not at this moment, I'm quite busy at work, my boss will kill
me :( .
If you find this good, please add it to your next patch submission or forget
it.

Thank you
Eric

2006-08-23 09:30:00

by Evgeniy Polyakov

[permalink] [raw]
Subject: Re: [take12 1/3] kevent: Core files.

On Wed, Aug 23, 2006 at 11:23:52AM +0200, Eric Dumazet ([email protected]) wrote:
> > Sounds good, could you please send appliable patch with proper
> > signed-off line?
>
> Unfortunately not at this moment, I'm quite busy at work, my boss will kill
> me :( .
> If you find this good, please add it to your next patch submission or forget
> it.

Ok, I will try to get it from pieces in e-mail.

> Thank you
> Eric

--
Evgeniy Polyakov

2006-08-23 09:49:25

by Jari Sundell

[permalink] [raw]
Subject: Re: [take12 0/3] kevent: Generic event handling mechanism.

On 8/23/06, Evgeniy Polyakov <[email protected]> wrote:
> On Wed, Aug 23, 2006 at 10:22:06AM +0200, Jari Sundell ([email protected]) wrote:
> > On 8/23/06, Evgeniy Polyakov <[email protected]> wrote:
> > >void * in structure exported to userspace is forbidden.
> >
> > Only void * I'm seeing belongs to the user, (udata) perhaps you are
> > talking of something different?
>
> Yes, exactly about it.
>
> I put union {
> u32 a[2];
> void *b;
> }
> epcially to eliminate that problem.

It's just random data of a known maximum size appended to the struct,
I'm sure you can find a clean way to handle it. If you mangle the
first variable name in your union, you'll end up with something that
should be usable instead of udata.

> And I'm not that sure aboit stuff like uptr_t or how they call pointers
> in userspace and kernelspace.

Well, I can't find any use of pointers in your struct ukevent, nor in
any of the kqueue events in my man page. So if this is a deficit it
applies to both, I guess?

> ukevent is aligned to 8 bytes already (it's size selected to be 40 bytes),
> so it should not be a problem.
>
> > Eric

Even if it is so, wouldn't it be better to be explicit about it?

Rakshasa

2006-08-23 09:58:32

by Andi Kleen

[permalink] [raw]
Subject: Re: [take12 0/3] kevent: Generic event handling mechanism.

Evgeniy Polyakov <[email protected]> writes:
>
> Let's then place there a structure with 64bit seconds and nanoseconds,
> similar to timspec, but without longs there.

You need 64bit (or at least more than 32bit) for the seconds,
otherwise you add a y2038 problem which would be sad in new code.
Remember you might be still alive then ;-)

Ok one could argue that on 32bit architectures 2038 is so deeply
embedded that it doesn't make much difference, but I still
think it would be better to not readd it to new interfaces there.

64bit longs on 32bit is fine, as long as you use aligned_u64,
never long long or u64 (which has varying alignment between i386 and x86-64)

-Andi

2006-08-23 10:05:23

by Evgeniy Polyakov

[permalink] [raw]
Subject: Re: [take12 0/3] kevent: Generic event handling mechanism.

On Wed, Aug 23, 2006 at 11:58:20AM +0200, Andi Kleen ([email protected]) wrote:
> Evgeniy Polyakov <[email protected]> writes:
> >
> > Let's then place there a structure with 64bit seconds and nanoseconds,
> > similar to timspec, but without longs there.
>
> You need 64bit (or at least more than 32bit) for the seconds,
> otherwise you add a y2038 problem which would be sad in new code.
> Remember you might be still alive then ;-)

I hope so :)

> Ok one could argue that on 32bit architectures 2038 is so deeply
> embedded that it doesn't make much difference, but I still
> think it would be better to not readd it to new interfaces there.
>
> 64bit longs on 32bit is fine, as long as you use aligned_u64,
> never long long or u64 (which has varying alignment between i386 and x86-64)

Btw, aligned_u64 is not exported to userspace.
I commited a change with __u64 nanoseconds without any strucutres.
Do we really need a structure?

> -Andi

--
Evgeniy Polyakov

2006-08-23 10:21:49

by Evgeniy Polyakov

[permalink] [raw]
Subject: Re: [take12 0/3] kevent: Generic event handling mechanism.

On Wed, Aug 23, 2006 at 11:49:22AM +0200, Jari Sundell ([email protected]) wrote:
> >> Only void * I'm seeing belongs to the user, (udata) perhaps you are
> >> talking of something different?
> >
> >Yes, exactly about it.
> >
> >I put union {
> > u32 a[2];
> > void *b;
> >}
> >epcially to eliminate that problem.
>
> It's just random data of a known maximum size appended to the struct,
> I'm sure you can find a clean way to handle it. If you mangle the
> first variable name in your union, you'll end up with something that
> should be usable instead of udata.

If there will be usual pointer, size of the whole structure will be
different in kernel and userspace.

> >And I'm not that sure aboit stuff like uptr_t or how they call pointers
> >in userspace and kernelspace.
>
> Well, I can't find any use of pointers in your struct ukevent, nor in
> any of the kqueue events in my man page. So if this is a deficit it
> applies to both, I guess?

No, it will change sizes of the structure in kernelspace and userspace,
so they just can not communicate.

> >ukevent is aligned to 8 bytes already (it's size selected to be 40 bytes),
> >so it should not be a problem.
> >
> >> Eric
>
> Even if it is so, wouldn't it be better to be explicit about it?

Ok, I will add a comment about it.

> Rakshasa

--
Evgeniy Polyakov

2006-08-23 10:34:28

by Jari Sundell

[permalink] [raw]
Subject: Re: [take12 0/3] kevent: Generic event handling mechanism.

On 8/23/06, Evgeniy Polyakov <[email protected]> wrote:
>
> No, it will change sizes of the structure in kernelspace and userspace,
> so they just can not communicate.

struct kevent {
uintptr_t ident; /* identifier for this event */
short filter; /* filter for event */
u_short flags; /* action flags for kqueue */
u_int fflags; /* filter flag value */

union {
u32 _data_padding[2];
intptr_t data; /* filter data value */
};

union {
u32 _udata_padding[2];
void *udata; /* opaque user data identifier */
};
};

I'm not missing anything obvious here, I hope.

Rakshasa

2006-08-23 10:53:19

by Evgeniy Polyakov

[permalink] [raw]
Subject: Re: [take12 0/3] kevent: Generic event handling mechanism.

On Wed, Aug 23, 2006 at 12:34:25PM +0200, Jari Sundell ([email protected]) wrote:
> On 8/23/06, Evgeniy Polyakov <[email protected]> wrote:
> >
> >No, it will change sizes of the structure in kernelspace and userspace,
> >so they just can not communicate.
>
> struct kevent {
> uintptr_t ident; /* identifier for this event */
> short filter; /* filter for event */
> u_short flags; /* action flags for kqueue */
> u_int fflags; /* filter flag value */
>
> union {
> u32 _data_padding[2];
> intptr_t data; /* filter data value */
> };

As Eric pointed it must be aligned.

> union {
> u32 _udata_padding[2];
> void *udata; /* opaque user data identifier */
> };
> };
>
> I'm not missing anything obvious here, I hope.

We still do not know what uintptr_t is, and it looks like it is a pointer,
which is forbidden. Those numbers are not enough to make network AIO.
And actually is not compatible with kqueue already, so you will need to
write your own parser to convert your parameters into above structure.

> Rakshasa
> -
> To unsubscribe from this list: send the line "unsubscribe netdev" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html

--
Evgeniy Polyakov

2006-08-23 12:55:52

by Jari Sundell

[permalink] [raw]
Subject: Re: [take12 0/3] kevent: Generic event handling mechanism.

On 8/23/06, Evgeniy Polyakov <[email protected]> wrote:
> We still do not know what uintptr_t is, and it looks like it is a pointer,
> which is forbidden. Those numbers are not enough to make network AIO.
> And actually is not compatible with kqueue already, so you will need to
> write your own parser to convert your parameters into above structure.

7.18.1.4 Integertypes capable of holding object pointers

"1 The following type designates a signed integer type with the
property that any valid
pointer to void can be converted to this type, then converted back to
pointer to void,
and the result will compare equal to the original pointer:"

Dunno if this means that x86-64 needs yet another typedef, or if using
long for intptr_t is incorrect. But assuming a different integer type
was used instead of intptr_t, that is known to be able to hold a
pointer, would there still be any problems?

I'm unable to see anything specific about AIO in your kevent patch
that these modifications wouldn't support.

Rakshasa

2006-08-23 13:12:47

by Evgeniy Polyakov

[permalink] [raw]
Subject: Re: [take12 0/3] kevent: Generic event handling mechanism.

On Wed, Aug 23, 2006 at 02:55:47PM +0200, Jari Sundell ([email protected]) wrote:
> On 8/23/06, Evgeniy Polyakov <[email protected]> wrote:
> >We still do not know what uintptr_t is, and it looks like it is a pointer,
> >which is forbidden. Those numbers are not enough to make network AIO.
> >And actually is not compatible with kqueue already, so you will need to
> >write your own parser to convert your parameters into above structure.
>
> 7.18.1.4 Integertypes capable of holding object pointers
>
> "1 The following type designates a signed integer type with the
> property that any valid
> pointer to void can be converted to this type, then converted back to
> pointer to void,
> and the result will compare equal to the original pointer:"
>
> Dunno if this means that x86-64 needs yet another typedef, or if using
> long for intptr_t is incorrect. But assuming a different integer type
> was used instead of intptr_t, that is known to be able to hold a
> pointer, would there still be any problems?

stdint.h

/* Types for `void *' pointers. */
#if __WORDSIZE == 64
# ifndef __intptr_t_defined
typedef long int intptr_t;
# define __intptr_t_defined
# endif
typedef unsigned long int uintptr_t;
#else
# ifndef __intptr_t_defined
typedef int intptr_t;
# define __intptr_t_defined
# endif
typedef unsigned int uintptr_t;
#endif

which means that with 32bit userspace it will be equal to 32bit only.

> I'm unable to see anything specific about AIO in your kevent patch
> that these modifications wouldn't support.

I was asked to postpone AIO stuff for now, you can find it in previous
patchsets sent about week or two ago.

> Rakshasa

--
Evgeniy Polyakov

2006-08-23 16:10:42

by Andrew Morton

[permalink] [raw]
Subject: Re: [take12 0/3] kevent: Generic event handling mechanism.

On Wed, 23 Aug 2006 11:50:56 +0400
Evgeniy Polyakov <[email protected]> wrote:

> On Wed, Aug 23, 2006 at 12:07:58AM -0700, Andrew Morton ([email protected]) wrote:
> > On Wed, 23 Aug 2006 10:56:59 +0400
> > Evgeniy Polyakov <[email protected]> wrote:
> >
> > > On Wed, Aug 23, 2006 at 02:43:50AM +0200, Jari Sundell ([email protected]) wrote:
> > > > Actually, I didn't miss that, it is an orthogonal issue. A timespec
> > > > timeout parameter for the syscall does not imply the use of timespec
> > > > in any timer event, etc. Nor is there any timespec timer in kqueue's
> > > > struct kevent, which is the only (interface related) thing that will
> > > > be exposed.
> > >
> > > void * in structure exported to userspace is forbidden.
> > > long in syscall requires wrapper in per-arch code (although that
> > > workaround _is_ there, it does not mean that broken interface should
> > > be used).
> > > poll uses millisecods - it is perfectly ok.
> >
> > I wonder whether designing-in a millisecond granularity is the right thing
> > to do. If in a few years the kernel is running tickless with high-res clock
> > interrupt sources, that might look a bit lumpy.
> >
> > Switching it to a __u64 nanosecond counter would be basically free on
> > 64-bit machines, and not very expensive on 32-bit, no?
>
> I can put nanoseconds as timer interval too (with aligned_u64 as David
> mentioned), and put it for timeout value too - 64 bit nanosecods ends up
> with 58 years, probably enough.
> Structures with u64 a really not so good idea.
>

OK. One could do u32 seconds/u32 nsecs, but a simple aligned_u64 will be
better for 64-bit machines, and OK for 32-bit.

2006-08-23 16:23:22

by Evgeniy Polyakov

[permalink] [raw]
Subject: Re: [take12 0/3] kevent: Generic event handling mechanism.

On Wed, Aug 23, 2006 at 09:09:20AM -0700, Andrew Morton ([email protected]) wrote:
> > I can put nanoseconds as timer interval too (with aligned_u64 as David
> > mentioned), and put it for timeout value too - 64 bit nanosecods ends up
> > with 58 years, probably enough.
> > Structures with u64 a really not so good idea.
> >
>
> OK. One could do u32 seconds/u32 nsecs, but a simple aligned_u64 will be
> better for 64-bit machines, and OK for 32-bit.

aligned_u64 is not exported to userspace, so in the last patchset I just
use __u64 as syscall parameter.

--
Evgeniy Polyakov

2006-08-23 18:25:33

by Stephen Hemminger

[permalink] [raw]
Subject: Re: The Proposed Linux kevent API

Could we try to real documentation rather than pissy little arguing.
I setup a page to start
http://linux-net.osdl.org/index.php/Kevent