2006-11-09 08:25:51

by Evgeniy Polyakov

[permalink] [raw]
Subject: [take24 0/6] kevent: Generic event handling mechanism.


Generic event handling mechanism.

Kevent is a generic subsystem for handling event notifications.
It supports both level- and edge-triggered events. It is similar to
poll/epoll in some respects, but it is more scalable, faster and
allows working with essentially any kind of event.

Events are registered with the kernel through a control syscall and can be
read back through a ring buffer or via the usual syscalls.
Kevent updates (i.e. readiness switching) happen directly from the
internals of the appropriate state machine of the underlying subsystem
(network, filesystem, timer or any other).

Homepage:
http://tservice.net.ru/~s0mbre/old/?section=projects&item=kevent

Documentation page:
http://linux-net.osdl.org/index.php/Kevent

Consider for inclusion.

Changes from 'take23' patchset:
* kevent PIPE notifications
* KEVENT_REQ_LAST_CHECK flag, which allows a final readiness check to be performed at dequeue time
* fixed poll/select notifications (were broken due to tree manipulations)
* made Documentation/kevent.txt look nice in 80-col terminal
* fix for copy_to_user() failure report for the first kevent (Andrew Morton)
* minor function renames
Here are pipe results for the kevent_pipe kernel part with 2000 pipes
(Eric Dumazet's application):
epoll (edge-triggered): 248408 events/sec
kevent (edge-triggered): 269282 events/sec
Busy reading loop: 269519 events/sec

Changes from 'take22' patchset:
* new ring buffer implementation in process' memory
* wakeup-one-thread flag
* edge-triggered behaviour
With this release an additional independent benchmark shows kevent speed compared to epoll:
Eric Dumazet created a special benchmark which creates a set of AF_INET sockets; two threads
then simultaneously read and write data from/into them.
Here are the results:
epoll (no EPOLLET): 57428 events/sec
kevent (no ET): 59794 events/sec
epoll (with EPOLLET): 71000 events/sec
kevent (with ET): 78265 events/sec
Maximum (busy loop reading events): 88482 events/sec

Changes from 'take21' patchset:
* minor cleanups (different return values, removed unneeded variables, whitespace and so on)
* fixed a bug in kevent removal for the case when the kevent being removed
is the same as overflow_kevent (spotted by Eric Dumazet)

Changes from 'take20' patchset:
* new ring buffer implementation
* removed artificial limit on possible number of kevents
With this release and a fixed userspace web server it was possible to
achieve 3960+ req/s at a client connection rate of 4000 con/s
over a 100 Mbit LAN; data IO over the network was about 10582.7 KB/s, which
is close to wire speed once headers and the like are taken into account.

Changes from 'take19' patchset:
* use __init instead of __devinit
* removed 'default N' from config for user statistic
* removed kevent_user_fini() since kevent can not be unloaded
* use KERN_INFO for statistic output

Changes from 'take18' patchset:
* use __init instead of __devinit
* removed 'default N' from config for user statistic
* removed kevent_user_fini() since kevent can not be unloaded
* use KERN_INFO for statistic output

Changes from 'take17' patchset:
* Use RB tree instead of hash table.
At least for a web server, the frequency of addition/deletion of new kevents
is comparable to the number of search accesses, i.e. most of the time events
are added, accessed only a couple of times and then removed, which justifies
RB tree usage over an AVL tree: the latter has much slower deletion
time (max O(log(N)) compared to 3 ops),
although faster search time (1.44*O(log(N)) vs. 2*O(log(N))).
So for kevents I use an RB tree for now; later, when my AVL tree implementation
is ready, it will be possible to compare them.
* Changed readiness check for socket notifications.

With both of the above changes it is possible to achieve more than 3380 req/second, compared
to 2200 (sometimes 2500) req/second for epoll(), with a trivial web server and httperf client
on the same hardware.
It is possible that the kevent figure above is bounded by the maximum number of kevents
allowed at a time, which is 4096 events.

Changes from 'take16' patchset:
* misc cleanups (__read_mostly, const ...)
* created a special macro used for mmap size (number of pages) calculation
* export kevent_socket_notify(), since it is used by network protocols which can be
built as modules (IPv6, for example)

Changes from 'take15' patchset:
* converted kevent_timer to high-resolution timers, this forces timer API update at
http://linux-net.osdl.org/index.php/Kevent
* use struct ukevent* instead of void * in syscalls (documentation has been updated)
* added warning in kevent_add_ukevent() if ring has broken index (for testing)

Changes from 'take14' patchset:
* added kevent_wait()
This syscall waits until either the timeout expires or at least one event
becomes ready. It also commits that @num events from @start have been processed
by userspace and thus can be removed or rearmed (depending on their flags).
It can be used to commit events read by userspace through the mmap interface.
Example userspace code (evtest.c) can be found on project's homepage.
* added socket notifications (send/recv/accept)

Changes from 'take13' patchset:
* do not take the lock around the user data check in __kevent_search()
* fail early if there were no registered callbacks for given type of kevent
* trailing whitespace cleanup

Changes from 'take12' patchset:
* remove non-chardev interface for initialization
* use pointer to kevent_mring instead of unsigned longs
* use aligned 64bit type in raw user data (can be used by high-res timer if needed)
* simplified enqueue/dequeue callbacks and kevent initialization
* use nanoseconds for timeout
* put number of milliseconds into timer's return data
* move some definitions into user-visible header
* removed filenames from comments

Changes from 'take11' patchset:
* include missing headers into patchset
* some trivial code cleanups (use goto instead of if/else games and so on)
* some whitespace cleanups
* check for ready_callback() callback before main loop which should save us some ticks

Changes from 'take10' patchset:
* removed non-existent prototypes
* added helper function for kevent_registered_callbacks
* fixed 80 lines comments issues
* added a header shared between userspace and kernelspace instead of embedding the definitions in each
* core restructuring to remove forward declarations
* some whitespace and coding style cleanups
* use vm_insert_page() instead of remap_pfn_range()

Changes from 'take9' patchset:
* fixed ->nopage method

Changes from 'take8' patchset:
* fixed mmap release bug
* use module_init() instead of late_initcall()
* use better structures for timer notifications

Changes from 'take7' patchset:
* new mmap interface (not tested, waiting for other changes to be acked)
- use nopage() method to dynamically substitute pages
- allocate a new page for events only when a newly added kevent requires it
- do not use ugly index dereferencing, use structure instead
- reduced amount of data in the ring (id and flags),
maximum 12 pages on x86 per kevent fd

Changes from 'take6' patchset:
* a lot of comments!
* do not use list poisoning to detect whether an entry is in the list
* return number of ready kevents even if copy*user() fails
* strict check for number of kevents in syscall
* use ARRAY_SIZE for array size calculation
* changed superblock magic number
* use SLAB_PANIC instead of direct panic() call
* changed -E* return values
* a lot of small cleanups and indent fixes

Changes from 'take5' patchset:
* removed compilation warnings about unused variables when lockdep is not turned on
* do not use internal socket structures, use appropriate (exported) wrappers instead
* removed default 1 second timeout
* removed AIO stuff from patchset

Changes from 'take4' patchset:
* use miscdevice instead of chardevice
* comments fixes

Changes from 'take3' patchset:
* removed serializing mutex from kevent_user_wait()
* moved storage list processing to RCU
* silenced lockdep complaints - all storage locks are initialized in the same function, so
lockdep was taught to differentiate between the various cases
* remove kevent from storage if it is marked as broken after callback
* fixed a typo in the mmapped buffer implementation which would end up in wrong index calculation

Changes from 'take2' patchset:
* split kevent_finish_user() to locked and unlocked variants
* do not use KEVENT_STAT ifdefs, use inline functions instead
* use array of callbacks of each type instead of each kevent callback initialization
* changed name of ukevent guarding lock
* use only one kevent lock in kevent_user for all hash buckets instead of per-bucket locks
* do not use kevent_user_ctl structure instead provide needed arguments as syscall parameters
* various indent cleanups
* added an optimisation aimed to help when a lot of kevents are being copied from
userspace
* mapped buffer (initial) implementation (no userspace yet)

Changes from 'take1' patchset:
- rebased against 2.6.18-git tree
- removed ioctl controlling
- added new syscall kevent_get_events(int fd, unsigned int min_nr, unsigned int max_nr,
unsigned int timeout, void __user *buf, unsigned flags)
- use old syscall kevent_ctl for creation/removing, modification and initial kevent
initialization
- use mutexes instead of semaphores
- added file descriptor check and return error if provided descriptor does not match
kevent file operations
- various indent fixes
- removed aio_sendfile() declarations.

Thank you.

Signed-off-by: Evgeniy Polyakov <[email protected]>



2006-11-09 08:24:32

by Evgeniy Polyakov

[permalink] [raw]
Subject: [take24 5/6] kevent: Timer notifications.


Timer notifications.

Timer notifications can be used for fine-grained per-process time
management, since interval timers are very inconvenient to use,
and they are limited.

This subsystem uses high-resolution timers.
id.raw[0] is used as number of seconds
id.raw[1] is used as number of nanoseconds

Signed-off-by: Evgeniy Polyakov <[email protected]>

diff --git a/kernel/kevent/kevent_timer.c b/kernel/kevent/kevent_timer.c
new file mode 100644
index 0000000..df93049
--- /dev/null
+++ b/kernel/kevent/kevent_timer.c
@@ -0,0 +1,112 @@
+/*
+ * 2006 Copyright (c) Evgeniy Polyakov <[email protected]>
+ * All rights reserved.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software
+ * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA
+ */
+
+#include <linux/kernel.h>
+#include <linux/types.h>
+#include <linux/list.h>
+#include <linux/slab.h>
+#include <linux/spinlock.h>
+#include <linux/hrtimer.h>
+#include <linux/jiffies.h>
+#include <linux/kevent.h>
+
+struct kevent_timer
+{
+ struct hrtimer ktimer;
+ struct kevent_storage ktimer_storage;
+ struct kevent *ktimer_event;
+};
+
+static int kevent_timer_func(struct hrtimer *timer)
+{
+ struct kevent_timer *t = container_of(timer, struct kevent_timer, ktimer);
+ struct kevent *k = t->ktimer_event;
+
+ kevent_storage_ready(&t->ktimer_storage, NULL, KEVENT_MASK_ALL);
+ hrtimer_forward(timer, timer->base->softirq_time,
+ ktime_set(k->event.id.raw[0], k->event.id.raw[1]));
+ return HRTIMER_RESTART;
+}
+
+static struct lock_class_key kevent_timer_key;
+
+static int kevent_timer_enqueue(struct kevent *k)
+{
+ int err;
+ struct kevent_timer *t;
+
+ t = kmalloc(sizeof(struct kevent_timer), GFP_KERNEL);
+ if (!t)
+ return -ENOMEM;
+
+ hrtimer_init(&t->ktimer, CLOCK_MONOTONIC, HRTIMER_REL);
+ t->ktimer.expires = ktime_set(k->event.id.raw[0], k->event.id.raw[1]);
+ t->ktimer.function = kevent_timer_func;
+ t->ktimer_event = k;
+
+ err = kevent_storage_init(&t->ktimer, &t->ktimer_storage);
+ if (err)
+ goto err_out_free;
+ lockdep_set_class(&t->ktimer_storage.lock, &kevent_timer_key);
+
+ err = kevent_storage_enqueue(&t->ktimer_storage, k);
+ if (err)
+ goto err_out_st_fini;
+
+ hrtimer_start(&t->ktimer, t->ktimer.expires, HRTIMER_REL);
+
+ return 0;
+
+err_out_st_fini:
+ kevent_storage_fini(&t->ktimer_storage);
+err_out_free:
+ kfree(t);
+
+ return err;
+}
+
+static int kevent_timer_dequeue(struct kevent *k)
+{
+ struct kevent_storage *st = k->st;
+ struct kevent_timer *t = container_of(st, struct kevent_timer, ktimer_storage);
+
+ hrtimer_cancel(&t->ktimer);
+ kevent_storage_dequeue(st, k);
+ kfree(t);
+
+ return 0;
+}
+
+static int kevent_timer_callback(struct kevent *k)
+{
+ k->event.ret_data[0] = jiffies_to_msecs(jiffies);
+ return 1;
+}
+
+static int __init kevent_init_timer(void)
+{
+ struct kevent_callbacks tc = {
+ .callback = &kevent_timer_callback,
+ .enqueue = &kevent_timer_enqueue,
+ .dequeue = &kevent_timer_dequeue};
+
+ return kevent_add_callbacks(&tc, KEVENT_TIMER);
+}
+module_init(kevent_init_timer);
+

2006-11-09 08:25:05

by Evgeniy Polyakov

[permalink] [raw]
Subject: [take24 1/6] kevent: Description.


Description.


diff --git a/Documentation/kevent.txt b/Documentation/kevent.txt
new file mode 100644
index 0000000..ca49e4b
--- /dev/null
+++ b/Documentation/kevent.txt
@@ -0,0 +1,186 @@
+Description.
+
+int kevent_ctl(int fd, unsigned int cmd, unsigned int num, struct ukevent *arg);
+
+fd - the file descriptor referring to the kevent queue to manipulate.
+It is obtained by opening the "/dev/kevent" char device, which is created
+with a dynamic minor number and the major number assigned for misc devices.
+
+cmd - is the requested operation. It can be one of the following:
+ KEVENT_CTL_ADD - add event notification
+ KEVENT_CTL_REMOVE - remove event notification
+ KEVENT_CTL_MODIFY - modify existing notification
+
+num - number of struct ukevent in the array pointed to by arg
+arg - array of struct ukevent
+
+When called, kevent_ctl will carry out the operation specified in the
+cmd parameter.
+-------------------------------------------------------------------------------
+
+ int kevent_get_events(int ctl_fd, unsigned int min_nr, unsigned int max_nr,
+ __u64 timeout, struct ukevent *buf, unsigned flags)
+
+ctl_fd - file descriptor referring to the kevent queue
+min_nr - minimum number of completed events that kevent_get_events will block
+ waiting for
+max_nr - number of struct ukevent in buf
+timeout - number of nanoseconds to wait before returning less than min_nr
+ events. If this is -1, then wait forever.
+buf - pointer to an array of struct ukevent.
+flags - unused
+
+kevent_get_events will wait up to timeout nanoseconds for at least min_nr
+completed events, copying completed struct ukevents to buf and deleting any
+KEVENT_REQ_ONESHOT event requests. In nonblocking mode it returns as many
+events as possible, but not more than max_nr. In blocking mode it waits until
+the timeout expires or at least min_nr events are ready.
+-------------------------------------------------------------------------------
+
+ int kevent_wait(int ctl_fd, unsigned int num, __u64 timeout)
+
+ctl_fd - file descriptor referring to the kevent queue
+num - number of processed kevents
+timeout - number of nanoseconds to wait until there is
+ free space in the kevent queue
+
+This syscall waits until either the timeout expires or at least one event
+becomes ready. It also copies num processed events into the ring buffer and
+requeues them (or removes them, depending on their flags).
+-------------------------------------------------------------------------------
+
+ int kevent_ring_init(int ctl_fd, struct kevent_ring *ring, unsigned int num)
+
+ctl_fd - file descriptor referring to the kevent queue
+num - size of the ring buffer in events
+
+ struct kevent_ring
+ {
+ unsigned int ring_kidx;
+ struct ukevent event[0];
+ }
+
+ring_kidx - index in the ring buffer where the kernel will put new events
+ when kevent_wait() or kevent_get_events() is called
+
+Example userspace code (ring_buffer.c) can be found on project's homepage.
+
+Each kevent syscall can be a so-called cancellation point in glibc, i.e. when
+a thread has been cancelled in a kevent syscall, the thread can be safely
+removed and no events will be lost, since each syscall (kevent_wait() or
+kevent_get_events()) will copy the event into a special ring buffer, accessible
+from other threads or even processes (if shared memory is used).
+
+When a kevent is removed (not dequeued when it is ready, but just removed),
+it is not copied into the ring buffer even if it was ready: if it is removed,
+no one cares about it (otherwise the user would have waited until it became
+ready and fetched it the usual way using kevent_get_events() or kevent_wait()),
+so there is no need to copy it to the ring buffer.
+
+With the userspace ring buffer it is possible that events in the ring buffer
+are replaced without the knowledge of the thread currently reading them
+(when another thread calls kevent_get_events() or kevent_wait()), so
+appropriate locking between threads or processes which can simultaneously
+access the same ring buffer is required.
+-------------------------------------------------------------------------------
+
+The bulk of the interface is entirely done through the ukevent struct.
+It is used to add event requests, modify existing event requests,
+specify which event requests to remove, and return completed events.
+
+struct ukevent contains the following members:
+
+struct kevent_id id
+ Id of this request, e.g. socket number, file descriptor and so on
+__u32 type
+ Event type, e.g. KEVENT_SOCK, KEVENT_INODE, KEVENT_TIMER and so on
+__u32 event
+ Event itself, e.g. SOCK_ACCEPT, INODE_CREATED, TIMER_FIRED
+__u32 req_flags
+ Per-event request flags,
+
+ KEVENT_REQ_ONESHOT
+ event will be removed when it is ready
+
+ KEVENT_REQ_WAKEUP_ONE
+ When several threads wait on the same kevent queue and have requested the
+ same event, for example 'wake me up when a new client has connected,
+ so I can call accept()', then all threads would be awakened when a new
+ client connects, but only one of them can process the data. This
+ problem is known as the thundering herd problem. Events which have this
+ flag set will not be marked as ready (and the corresponding threads will
+ not be awakened) if at least one event has already been marked.
+
+ KEVENT_REQ_ET
+ Edge-triggered behaviour. It is an optimisation which allows a ready
+ and dequeued (i.e. copied to userspace) event to be moved back into the
+ set of interest for the given storage (socket, inode and so on). It is
+ very useful for cases when the same event should be used many times
+ (like reading from a pipe). It is similar to epoll()'s EPOLLET flag.
+
+ KEVENT_REQ_LAST_CHECK
+ If set, allows the last check to be performed on the kevent (calling the
+ appropriate callback) when the kevent is marked as ready and has been
+ removed from the ready queue. If it is confirmed that the kevent is ready
+ (k->callbacks.callback(k) returns true) then the kevent is copied
+ to userspace, otherwise it is requeued back to its storage.
+ The second (checking) call is performed with this bit cleared, so the
+ callback can detect whether it was called from kevent_storage_ready()
+ (bit is set) or kevent_dequeue_ready() (bit is cleared). If the kevent
+ is requeued, the bit is set again.
+
+__u32 ret_flags
+ Per-event return flags
+
+ KEVENT_RET_BROKEN
+ Kevent is broken
+
+ KEVENT_RET_DONE
+ Kevent processing was finished successfully
+
+ KEVENT_RET_COPY_FAILED
+ Kevent was not copied into ring buffer due to some error conditions.
+
+__u32 ret_data
+ Event return data. The event originator fills it with anything it likes
+ (for example, timer notifications store the number of milliseconds at
+ the moment the timer fired).
+union { __u32 user[2]; void *ptr; }
+ User's data. It is not used, just copied to/from user. The whole structure
+ is aligned to 8 bytes already, so the last union is aligned properly.
+
+-------------------------------------------------------------------------------
+
+Usage
+
+For KEVENT_CTL_ADD, all fields relevant to the event type must be filled
+(id, type, possibly event, req_flags).
+After kevent_ctl(..., KEVENT_CTL_ADD, ...) returns each struct's ret_flags
+should be checked to see if the event is already broken or done.
+
+For KEVENT_CTL_MODIFY, the id, req_flags, and user and event fields must be
+set, and an existing kevent request must have matching id and user fields. If a
+match is found, req_flags and event are replaced with the newly supplied
+values and requeueing is started, so the modified kevent can be checked and
+possibly marked as ready immediately. If a match can't be found, the
+passed in ukevent's ret_flags has KEVENT_RET_BROKEN set. KEVENT_RET_DONE is
+always set.
+
+For KEVENT_CTL_REMOVE, the id and user fields must be set and an existing
+kevent request must have matching id and user fields. If a match is found,
+the kevent request is removed. If a match can't be found, the passed in
+ukevent's ret_flags has KEVENT_RET_BROKEN set. KEVENT_RET_DONE is always set.
+
+For kevent_get_events, the entire structure is returned.
+
+-------------------------------------------------------------------------------
+
+Usage cases
+
+kevent_timer
+struct ukevent should contain the following fields:
+ type - KEVENT_TIMER
+ event - KEVENT_TIMER_FIRED
+ req_flags - KEVENT_REQ_ONESHOT if you want to fire that timer only once
+ id.raw[0] - number of seconds after commit when this timer should expire
+ id.raw[1] - number of nanoseconds in addition to the seconds

2006-11-09 08:25:06

by Evgeniy Polyakov

[permalink] [raw]
Subject: [take24 6/6] kevent: Pipe notifications.


Pipe notifications.


diff --git a/fs/pipe.c b/fs/pipe.c
index f3b6f71..aeaee9c 100644
--- a/fs/pipe.c
+++ b/fs/pipe.c
@@ -16,6 +16,7 @@ #include <linux/pipe_fs_i.h>
#include <linux/uio.h>
#include <linux/highmem.h>
#include <linux/pagemap.h>
+#include <linux/kevent.h>

#include <asm/uaccess.h>
#include <asm/ioctls.h>
@@ -312,6 +313,7 @@ redo:
break;
}
if (do_wakeup) {
+ kevent_pipe_notify(inode, KEVENT_SOCKET_SEND);
wake_up_interruptible_sync(&pipe->wait);
kill_fasync(&pipe->fasync_writers, SIGIO, POLL_OUT);
}
@@ -321,6 +323,7 @@ redo:

/* Signal writers asynchronously that there is more room. */
if (do_wakeup) {
+ kevent_pipe_notify(inode, KEVENT_SOCKET_SEND);
wake_up_interruptible(&pipe->wait);
kill_fasync(&pipe->fasync_writers, SIGIO, POLL_OUT);
}
@@ -490,6 +493,7 @@ redo2:
break;
}
if (do_wakeup) {
+ kevent_pipe_notify(inode, KEVENT_SOCKET_RECV);
wake_up_interruptible_sync(&pipe->wait);
kill_fasync(&pipe->fasync_readers, SIGIO, POLL_IN);
do_wakeup = 0;
@@ -501,6 +505,7 @@ redo2:
out:
mutex_unlock(&inode->i_mutex);
if (do_wakeup) {
+ kevent_pipe_notify(inode, KEVENT_SOCKET_RECV);
wake_up_interruptible(&pipe->wait);
kill_fasync(&pipe->fasync_readers, SIGIO, POLL_IN);
}
@@ -605,6 +610,7 @@ pipe_release(struct inode *inode, int de
free_pipe_info(inode);
} else {
wake_up_interruptible(&pipe->wait);
+ kevent_pipe_notify(inode, KEVENT_SOCKET_SEND|KEVENT_SOCKET_RECV);
kill_fasync(&pipe->fasync_readers, SIGIO, POLL_IN);
kill_fasync(&pipe->fasync_writers, SIGIO, POLL_OUT);
}
diff --git a/kernel/kevent/kevent_pipe.c b/kernel/kevent/kevent_pipe.c
new file mode 100644
index 0000000..32c6f19
--- /dev/null
+++ b/kernel/kevent/kevent_pipe.c
@@ -0,0 +1,112 @@
+/*
+ * kevent_pipe.c
+ *
+ * 2006 Copyright (c) Evgeniy Polyakov <[email protected]>
+ * All rights reserved.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software
+ * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA
+ */
+
+#include <linux/kernel.h>
+#include <linux/types.h>
+#include <linux/slab.h>
+#include <linux/spinlock.h>
+#include <linux/file.h>
+#include <linux/fs.h>
+#include <linux/kevent.h>
+#include <linux/pipe_fs_i.h>
+
+static int kevent_pipe_callback(struct kevent *k)
+{
+ struct inode *inode = k->st->origin;
+ struct pipe_inode_info *pipe = inode->i_pipe;
+ int nrbufs = pipe->nrbufs;
+
+ if (k->event.event & KEVENT_SOCKET_RECV && nrbufs > 0) {
+ if (!pipe->writers)
+ return -1;
+ return 1;
+ }
+
+ if (k->event.event & KEVENT_SOCKET_SEND && nrbufs < PIPE_BUFFERS) {
+ if (!pipe->readers)
+ return -1;
+ return 1;
+ }
+
+ return 0;
+}
+
+int kevent_pipe_enqueue(struct kevent *k)
+{
+ struct file *pipe;
+ int err = -EBADF;
+ struct inode *inode;
+
+ pipe = fget(k->event.id.raw[0]);
+ if (!pipe)
+ goto err_out_exit;
+
+ inode = igrab(pipe->f_dentry->d_inode);
+ if (!inode)
+ goto err_out_fput;
+
+ err = kevent_storage_enqueue(&inode->st, k);
+ if (err)
+ goto err_out_iput;
+
+ err = k->callbacks.callback(k);
+ if (err)
+ goto err_out_dequeue;
+
+ fput(pipe);
+
+ return err;
+
+err_out_dequeue:
+ kevent_storage_dequeue(k->st, k);
+err_out_iput:
+ iput(inode);
+err_out_fput:
+ fput(pipe);
+err_out_exit:
+ return err;
+}
+
+int kevent_pipe_dequeue(struct kevent *k)
+{
+ struct inode *inode = k->st->origin;
+
+ kevent_storage_dequeue(k->st, k);
+ iput(inode);
+
+ return 0;
+}
+
+void kevent_pipe_notify(struct inode *inode, u32 event)
+{
+ kevent_storage_ready(&inode->st, NULL, event);
+}
+
+static int __init kevent_init_pipe(void)
+{
+ struct kevent_callbacks sc = {
+ .callback = &kevent_pipe_callback,
+ .enqueue = &kevent_pipe_enqueue,
+ .dequeue = &kevent_pipe_dequeue};
+
+ return kevent_add_callbacks(&sc, KEVENT_PIPE);
+}
+module_init(kevent_init_pipe);

2006-11-09 08:24:35

by Evgeniy Polyakov

[permalink] [raw]
Subject: [take24 3/6] kevent: poll/select() notifications.


poll/select() notifications.

This patch includes generic poll/select notifications.
kevent_poll works similarly to epoll and has the same issues (the callback
is invoked not from the internal state machine of the caller, but through
a process wakeup, with a lot of allocations and so on).

Signed-off-by: Evgeniy Polyakov <[email protected]>

diff --git a/fs/file_table.c b/fs/file_table.c
index bc35a40..0805547 100644
--- a/fs/file_table.c
+++ b/fs/file_table.c
@@ -20,6 +20,7 @@ #include <linux/capability.h>
#include <linux/cdev.h>
#include <linux/fsnotify.h>
#include <linux/sysctl.h>
+#include <linux/kevent.h>
#include <linux/percpu_counter.h>

#include <asm/atomic.h>
@@ -119,6 +120,7 @@ struct file *get_empty_filp(void)
f->f_uid = tsk->fsuid;
f->f_gid = tsk->fsgid;
eventpoll_init_file(f);
+ kevent_init_file(f);
/* f->f_version: 0 */
return f;

@@ -164,6 +166,7 @@ void fastcall __fput(struct file *file)
* in the file cleanup chain.
*/
eventpoll_release(file);
+ kevent_cleanup_file(file);
locks_remove_flock(file);

if (file->f_op && file->f_op->release)
diff --git a/fs/inode.c b/fs/inode.c
index ada7643..6745c00 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -21,6 +21,7 @@ #include <linux/pagemap.h>
#include <linux/cdev.h>
#include <linux/bootmem.h>
#include <linux/inotify.h>
+#include <linux/kevent.h>
#include <linux/mount.h>

/*
@@ -164,12 +165,18 @@ #endif
}
inode->i_private = 0;
inode->i_mapping = mapping;
+#if defined CONFIG_KEVENT_SOCKET || defined CONFIG_KEVENT_PIPE
+ kevent_storage_init(inode, &inode->st);
+#endif
}
return inode;
}

void destroy_inode(struct inode *inode)
{
+#if defined CONFIG_KEVENT_SOCKET
+ kevent_storage_fini(&inode->st);
+#endif
BUG_ON(inode_has_buffers(inode));
security_inode_free(inode);
if (inode->i_sb->s_op->destroy_inode)
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 5baf3a1..c529723 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -276,6 +276,7 @@ #include <linux/prio_tree.h>
#include <linux/init.h>
#include <linux/sched.h>
#include <linux/mutex.h>
+#include <linux/kevent_storage.h>

#include <asm/atomic.h>
#include <asm/semaphore.h>
@@ -586,6 +587,10 @@ #ifdef CONFIG_INOTIFY
struct mutex inotify_mutex; /* protects the watches list */
#endif

+#ifdef CONFIG_KEVENT_SOCKET
+ struct kevent_storage st;
+#endif
+
unsigned long i_state;
unsigned long dirtied_when; /* jiffies of first dirtying */

@@ -739,6 +744,9 @@ #ifdef CONFIG_EPOLL
struct list_head f_ep_links;
spinlock_t f_ep_lock;
#endif /* #ifdef CONFIG_EPOLL */
+#ifdef CONFIG_KEVENT_POLL
+ struct kevent_storage st;
+#endif
struct address_space *f_mapping;
};
extern spinlock_t files_lock;
diff --git a/kernel/kevent/kevent_poll.c b/kernel/kevent/kevent_poll.c
new file mode 100644
index 0000000..7030d21
--- /dev/null
+++ b/kernel/kevent/kevent_poll.c
@@ -0,0 +1,228 @@
+/*
+ * 2006 Copyright (c) Evgeniy Polyakov <[email protected]>
+ * All rights reserved.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ */
+
+#include <linux/kernel.h>
+#include <linux/types.h>
+#include <linux/list.h>
+#include <linux/slab.h>
+#include <linux/spinlock.h>
+#include <linux/timer.h>
+#include <linux/file.h>
+#include <linux/kevent.h>
+#include <linux/poll.h>
+#include <linux/fs.h>
+
+static kmem_cache_t *kevent_poll_container_cache;
+static kmem_cache_t *kevent_poll_priv_cache;
+
+struct kevent_poll_ctl
+{
+ struct poll_table_struct pt;
+ struct kevent *k;
+};
+
+struct kevent_poll_wait_container
+{
+ struct list_head container_entry;
+ wait_queue_head_t *whead;
+ wait_queue_t wait;
+ struct kevent *k;
+};
+
+struct kevent_poll_private
+{
+ struct list_head container_list;
+ spinlock_t container_lock;
+};
+
+static int kevent_poll_enqueue(struct kevent *k);
+static int kevent_poll_dequeue(struct kevent *k);
+static int kevent_poll_callback(struct kevent *k);
+
+static int kevent_poll_wait_callback(wait_queue_t *wait,
+ unsigned mode, int sync, void *key)
+{
+ struct kevent_poll_wait_container *cont =
+ container_of(wait, struct kevent_poll_wait_container, wait);
+ struct kevent *k = cont->k;
+
+ kevent_storage_ready(k->st, NULL, KEVENT_MASK_ALL);
+ return 0;
+}
+
+static void kevent_poll_qproc(struct file *file, wait_queue_head_t *whead,
+ struct poll_table_struct *poll_table)
+{
+ struct kevent *k =
+ container_of(poll_table, struct kevent_poll_ctl, pt)->k;
+ struct kevent_poll_private *priv = k->priv;
+ struct kevent_poll_wait_container *cont;
+ unsigned long flags;
+
+ cont = kmem_cache_alloc(kevent_poll_container_cache, GFP_KERNEL);
+ if (!cont) {
+ kevent_break(k);
+ return;
+ }
+
+ cont->k = k;
+ init_waitqueue_func_entry(&cont->wait, kevent_poll_wait_callback);
+ cont->whead = whead;
+
+ spin_lock_irqsave(&priv->container_lock, flags);
+ list_add_tail(&cont->container_entry, &priv->container_list);
+ spin_unlock_irqrestore(&priv->container_lock, flags);
+
+ add_wait_queue(whead, &cont->wait);
+}
+
+static int kevent_poll_enqueue(struct kevent *k)
+{
+ struct file *file;
+ int err;
+ unsigned int revents;
+ unsigned long flags;
+ struct kevent_poll_ctl ctl;
+ struct kevent_poll_private *priv;
+
+ file = fget(k->event.id.raw[0]);
+ if (!file)
+ return -EBADF;
+
+ err = -EINVAL;
+ if (!file->f_op || !file->f_op->poll)
+ goto err_out_fput;
+
+ err = -ENOMEM;
+ priv = kmem_cache_alloc(kevent_poll_priv_cache, GFP_KERNEL);
+ if (!priv)
+ goto err_out_fput;
+
+ spin_lock_init(&priv->container_lock);
+ INIT_LIST_HEAD(&priv->container_list);
+
+ k->priv = priv;
+
+ ctl.k = k;
+ init_poll_funcptr(&ctl.pt, &kevent_poll_qproc);
+
+ err = kevent_storage_enqueue(&file->st, k);
+ if (err)
+ goto err_out_free;
+
+ revents = file->f_op->poll(file, &ctl.pt);
+ if (revents & k->event.event) {
+ err = 1;
+ goto out_dequeue;
+ }
+
+ spin_lock_irqsave(&k->ulock, flags);
+ k->event.req_flags |= KEVENT_REQ_LAST_CHECK;
+ spin_unlock_irqrestore(&k->ulock, flags);
+
+ return 0;
+
+out_dequeue:
+ kevent_storage_dequeue(k->st, k);
+err_out_free:
+ kmem_cache_free(kevent_poll_priv_cache, priv);
+err_out_fput:
+ fput(file);
+ return err;
+}
+
+static int kevent_poll_dequeue(struct kevent *k)
+{
+ struct file *file = k->st->origin;
+ struct kevent_poll_private *priv = k->priv;
+ struct kevent_poll_wait_container *w, *n;
+ unsigned long flags;
+
+ kevent_storage_dequeue(k->st, k);
+
+ spin_lock_irqsave(&priv->container_lock, flags);
+ list_for_each_entry_safe(w, n, &priv->container_list, container_entry) {
+ list_del(&w->container_entry);
+ remove_wait_queue(w->whead, &w->wait);
+ kmem_cache_free(kevent_poll_container_cache, w);
+ }
+ spin_unlock_irqrestore(&priv->container_lock, flags);
+
+ kmem_cache_free(kevent_poll_priv_cache, priv);
+ k->priv = NULL;
+
+ fput(file);
+
+ return 0;
+}
+
+static int kevent_poll_callback(struct kevent *k)
+{
+ if (k->event.req_flags & KEVENT_REQ_LAST_CHECK) {
+ return 1;
+ } else {
+ struct file *file = k->st->origin;
+ unsigned int revents = file->f_op->poll(file, NULL);
+
+ k->event.ret_data[0] = revents & k->event.event;
+
+ return (revents & k->event.event);
+ }
+}
+
+static int __init kevent_poll_sys_init(void)
+{
+ struct kevent_callbacks pc = {
+ .callback = &kevent_poll_callback,
+ .enqueue = &kevent_poll_enqueue,
+ .dequeue = &kevent_poll_dequeue};
+
+ kevent_poll_container_cache = kmem_cache_create("kevent_poll_container_cache",
+ sizeof(struct kevent_poll_wait_container), 0, 0, NULL, NULL);
+ if (!kevent_poll_container_cache) {
+ printk(KERN_ERR "Failed to create kevent poll container cache.\n");
+ return -ENOMEM;
+ }
+
+ kevent_poll_priv_cache = kmem_cache_create("kevent_poll_priv_cache",
+ sizeof(struct kevent_poll_private), 0, 0, NULL, NULL);
+ if (!kevent_poll_priv_cache) {
+ printk(KERN_ERR "Failed to create kevent poll private data cache.\n");
+ kmem_cache_destroy(kevent_poll_container_cache);
+ kevent_poll_container_cache = NULL;
+ return -ENOMEM;
+ }
+
+ kevent_add_callbacks(&pc, KEVENT_POLL);
+
+ printk(KERN_INFO "Kevent poll()/select() subsystem has been initialized.\n");
+ return 0;
+}
+
+static struct lock_class_key kevent_poll_key;
+
+void kevent_poll_reinit(struct file *file)
+{
+ lockdep_set_class(&file->st.lock, &kevent_poll_key);
+}
+
+static void __exit kevent_poll_sys_fini(void)
+{
+ kmem_cache_destroy(kevent_poll_priv_cache);
+ kmem_cache_destroy(kevent_poll_container_cache);
+}
+
+module_init(kevent_poll_sys_init);
+module_exit(kevent_poll_sys_fini);

2006-11-09 08:26:04

by Evgeniy Polyakov

Subject: [take24 2/6] kevent: Core files.


Core files.

This patch includes core kevent files:
* userspace controlling
* kernelspace interfaces
* initialization
* notification state machines

Some bits of documentation can be found on the project's homepage (and links from there):
http://tservice.net.ru/~s0mbre/old/?section=projects&item=kevent

Signed-off-by: Evgeniy Polyakov <[email protected]>

diff --git a/arch/i386/kernel/syscall_table.S b/arch/i386/kernel/syscall_table.S
index 7e639f7..fa8075b 100644
--- a/arch/i386/kernel/syscall_table.S
+++ b/arch/i386/kernel/syscall_table.S
@@ -318,3 +318,7 @@ ENTRY(sys_call_table)
.long sys_vmsplice
.long sys_move_pages
.long sys_getcpu
+ .long sys_kevent_get_events
+ .long sys_kevent_ctl /* 320 */
+ .long sys_kevent_wait
+ .long sys_kevent_ring_init
diff --git a/arch/x86_64/ia32/ia32entry.S b/arch/x86_64/ia32/ia32entry.S
index b4aa875..95fb252 100644
--- a/arch/x86_64/ia32/ia32entry.S
+++ b/arch/x86_64/ia32/ia32entry.S
@@ -714,8 +714,12 @@ #endif
.quad compat_sys_get_robust_list
.quad sys_splice
.quad sys_sync_file_range
- .quad sys_tee
+ .quad sys_tee /* 315 */
.quad compat_sys_vmsplice
.quad compat_sys_move_pages
.quad sys_getcpu
+ .quad sys_kevent_get_events
+ .quad sys_kevent_ctl /* 320 */
+ .quad sys_kevent_wait
+ .quad sys_kevent_ring_init
ia32_syscall_end:
diff --git a/include/asm-i386/unistd.h b/include/asm-i386/unistd.h
index bd99870..2161ef2 100644
--- a/include/asm-i386/unistd.h
+++ b/include/asm-i386/unistd.h
@@ -324,10 +324,14 @@ #define __NR_tee 315
#define __NR_vmsplice 316
#define __NR_move_pages 317
#define __NR_getcpu 318
+#define __NR_kevent_get_events 319
+#define __NR_kevent_ctl 320
+#define __NR_kevent_wait 321
+#define __NR_kevent_ring_init 322

#ifdef __KERNEL__

-#define NR_syscalls 319
+#define NR_syscalls 323
#include <linux/err.h>

/*
diff --git a/include/asm-x86_64/unistd.h b/include/asm-x86_64/unistd.h
index 6137146..3669c0f 100644
--- a/include/asm-x86_64/unistd.h
+++ b/include/asm-x86_64/unistd.h
@@ -619,10 +619,18 @@ #define __NR_vmsplice 278
__SYSCALL(__NR_vmsplice, sys_vmsplice)
#define __NR_move_pages 279
__SYSCALL(__NR_move_pages, sys_move_pages)
+#define __NR_kevent_get_events 280
+__SYSCALL(__NR_kevent_get_events, sys_kevent_get_events)
+#define __NR_kevent_ctl 281
+__SYSCALL(__NR_kevent_ctl, sys_kevent_ctl)
+#define __NR_kevent_wait 282
+__SYSCALL(__NR_kevent_wait, sys_kevent_wait)
+#define __NR_kevent_ring_init 283
+__SYSCALL(__NR_kevent_ring_init, sys_kevent_ring_init)

#ifdef __KERNEL__

-#define __NR_syscall_max __NR_move_pages
+#define __NR_syscall_max __NR_kevent_ring_init
#include <linux/err.h>

#ifndef __NO_STUBS
diff --git a/include/linux/kevent.h b/include/linux/kevent.h
new file mode 100644
index 0000000..f7cbf6b
--- /dev/null
+++ b/include/linux/kevent.h
@@ -0,0 +1,223 @@
+/*
+ * 2006 Copyright (c) Evgeniy Polyakov <[email protected]>
+ * All rights reserved.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software
+ * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA
+ */
+
+#ifndef __KEVENT_H
+#define __KEVENT_H
+#include <linux/types.h>
+#include <linux/list.h>
+#include <linux/rbtree.h>
+#include <linux/spinlock.h>
+#include <linux/mutex.h>
+#include <linux/wait.h>
+#include <linux/net.h>
+#include <linux/rcupdate.h>
+#include <linux/fs.h>
+#include <linux/kevent_storage.h>
+#include <linux/ukevent.h>
+
+#define KEVENT_MIN_BUFFS_ALLOC 3
+
+struct kevent;
+struct kevent_storage;
+typedef int (* kevent_callback_t)(struct kevent *);
+
+/* @callback is called each time a new event has been caught. */
+/* @enqueue is called each time a new event is queued. */
+/* @dequeue is called each time an event is dequeued. */
+
+struct kevent_callbacks {
+ kevent_callback_t callback, enqueue, dequeue;
+};
+
+#define KEVENT_READY 0x1
+#define KEVENT_STORAGE 0x2
+#define KEVENT_USER 0x4
+
+struct kevent
+{
+ /* Used for kevent freeing.*/
+ struct rcu_head rcu_head;
+ struct ukevent event;
+ /* This lock protects ukevent manipulations, e.g. ret_flags changes. */
+ spinlock_t ulock;
+
+ /* Entry of user's tree. */
+ struct rb_node kevent_node;
+ /* Entry of origin's queue. */
+ struct list_head storage_entry;
+ /* Entry of user's ready queue. */
+ struct list_head ready_entry;
+
+ u32 flags;
+
+ /* User who requested this kevent. */
+ struct kevent_user *user;
+ /* Kevent container. */
+ struct kevent_storage *st;
+
+ struct kevent_callbacks callbacks;
+
+ /* Private data for different storages.
+ * The poll()/select() storage keeps a list of wait_queue_t
+ * containers here, one for each poll_wait() call made from ->poll().
+ */
+ void *priv;
+};
+
+struct kevent_user
+{
+ struct rb_root kevent_root;
+ spinlock_t kevent_lock;
+ /* Number of queued kevents. */
+ unsigned int kevent_num;
+
+ /* List of ready kevents. */
+ struct list_head ready_list;
+ /* Number of ready kevents. */
+ unsigned int ready_num;
+ /* Protects all manipulations with ready queue. */
+ spinlock_t ready_lock;
+
+ /* Protects against simultaneous kevent_user control manipulations. */
+ struct mutex ctl_mutex;
+ /* Wait until some events are ready. */
+ wait_queue_head_t wait;
+
+ /* Reference counter, increased for each new kevent. */
+ atomic_t refcnt;
+
+ /* Mutex protecting userspace ring buffer. */
+ struct mutex ring_lock;
+ /* Kernel index and size of the userspace ring buffer. */
+ unsigned int kidx, ring_size;
+ /* Pointer to userspace ring buffer. */
+ struct kevent_ring __user *pring;
+
+#ifdef CONFIG_KEVENT_USER_STAT
+ unsigned long im_num;
+ unsigned long wait_num, ring_num;
+ unsigned long total;
+#endif
+};
+
+int kevent_enqueue(struct kevent *k);
+int kevent_dequeue(struct kevent *k);
+int kevent_init(struct kevent *k);
+void kevent_requeue(struct kevent *k);
+int kevent_break(struct kevent *k);
+
+int kevent_add_callbacks(const struct kevent_callbacks *cb, int pos);
+
+void kevent_storage_ready(struct kevent_storage *st,
+ kevent_callback_t ready_callback, u32 event);
+int kevent_storage_init(void *origin, struct kevent_storage *st);
+void kevent_storage_fini(struct kevent_storage *st);
+int kevent_storage_enqueue(struct kevent_storage *st, struct kevent *k);
+void kevent_storage_dequeue(struct kevent_storage *st, struct kevent *k);
+
+int kevent_user_add_ukevent(struct ukevent *uk, struct kevent_user *u);
+
+#ifdef CONFIG_KEVENT_POLL
+void kevent_poll_reinit(struct file *file);
+#else
+static inline void kevent_poll_reinit(struct file *file)
+{
+}
+#endif
+
+#ifdef CONFIG_KEVENT_USER_STAT
+static inline void kevent_stat_init(struct kevent_user *u)
+{
+ u->wait_num = u->im_num = u->total = 0;
+}
+static inline void kevent_stat_print(struct kevent_user *u)
+{
+ printk(KERN_INFO "%s: u: %p, wait: %lu, ring: %lu, immediately: %lu, total: %lu.\n",
+ __func__, u, u->wait_num, u->ring_num, u->im_num, u->total);
+}
+static inline void kevent_stat_im(struct kevent_user *u)
+{
+ u->im_num++;
+}
+static inline void kevent_stat_ring(struct kevent_user *u)
+{
+ u->ring_num++;
+}
+static inline void kevent_stat_wait(struct kevent_user *u)
+{
+ u->wait_num++;
+}
+static inline void kevent_stat_total(struct kevent_user *u)
+{
+ u->total++;
+}
+#else
+#define kevent_stat_print(u) ({ (void) u;})
+#define kevent_stat_init(u) ({ (void) u;})
+#define kevent_stat_im(u) ({ (void) u;})
+#define kevent_stat_wait(u) ({ (void) u;})
+#define kevent_stat_ring(u) ({ (void) u;})
+#define kevent_stat_total(u) ({ (void) u;})
+#endif
+
+#ifdef CONFIG_LOCKDEP
+void kevent_socket_reinit(struct socket *sock);
+void kevent_sk_reinit(struct sock *sk);
+#else
+static inline void kevent_socket_reinit(struct socket *sock)
+{
+}
+static inline void kevent_sk_reinit(struct sock *sk)
+{
+}
+#endif
+#ifdef CONFIG_KEVENT_SOCKET
+void kevent_socket_notify(struct sock *sock, u32 event);
+int kevent_socket_dequeue(struct kevent *k);
+int kevent_socket_enqueue(struct kevent *k);
+#define sock_async(__sk) sock_flag(__sk, SOCK_ASYNC)
+#else
+static inline void kevent_socket_notify(struct sock *sock, u32 event)
+{
+}
+#define sock_async(__sk) ({ (void)__sk; 0; })
+#endif
+
+#ifdef CONFIG_KEVENT_POLL
+static inline void kevent_init_file(struct file *file)
+{
+ kevent_storage_init(file, &file->st);
+}
+
+static inline void kevent_cleanup_file(struct file *file)
+{
+ kevent_storage_fini(&file->st);
+}
+#else
+static inline void kevent_init_file(struct file *file) {}
+static inline void kevent_cleanup_file(struct file *file) {}
+#endif
+
+#ifdef CONFIG_KEVENT_PIPE
+extern void kevent_pipe_notify(struct inode *inode, u32 events);
+#else
+static inline void kevent_pipe_notify(struct inode *inode, u32 events) {}
+#endif
+
+#endif /* __KEVENT_H */
diff --git a/include/linux/kevent_storage.h b/include/linux/kevent_storage.h
new file mode 100644
index 0000000..a38575d
--- /dev/null
+++ b/include/linux/kevent_storage.h
@@ -0,0 +1,11 @@
+#ifndef __KEVENT_STORAGE_H
+#define __KEVENT_STORAGE_H
+
+struct kevent_storage
+{
+ void *origin; /* Originator's pointer, e.g. struct sock or struct file. Can be NULL. */
+ struct list_head list; /* List of queued kevents. */
+ spinlock_t lock; /* Protects users queue. */
+};
+
+#endif /* __KEVENT_STORAGE_H */
diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
index 2d1c3d5..471a685 100644
--- a/include/linux/syscalls.h
+++ b/include/linux/syscalls.h
@@ -54,6 +54,8 @@ struct compat_stat;
struct compat_timeval;
struct robust_list_head;
struct getcpu_cache;
+struct ukevent;
+struct kevent_ring;

#include <linux/types.h>
#include <linux/aio_abi.h>
@@ -599,4 +601,9 @@ asmlinkage long sys_set_robust_list(stru
size_t len);
asmlinkage long sys_getcpu(unsigned __user *cpu, unsigned __user *node, struct getcpu_cache __user *cache);

+asmlinkage long sys_kevent_get_events(int ctl_fd, unsigned int min, unsigned int max,
+ __u64 timeout, struct ukevent __user *buf, unsigned flags);
+asmlinkage long sys_kevent_ctl(int ctl_fd, unsigned int cmd, unsigned int num, struct ukevent __user *buf);
+asmlinkage long sys_kevent_wait(int ctl_fd, unsigned int num, __u64 timeout);
+asmlinkage long sys_kevent_ring_init(int ctl_fd, struct kevent_ring __user *ring, unsigned int num);
#endif
diff --git a/include/linux/ukevent.h b/include/linux/ukevent.h
new file mode 100644
index 0000000..b14e14e
--- /dev/null
+++ b/include/linux/ukevent.h
@@ -0,0 +1,165 @@
+/*
+ * 2006 Copyright (c) Evgeniy Polyakov <[email protected]>
+ * All rights reserved.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software
+ * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA
+ */
+
+#ifndef __UKEVENT_H
+#define __UKEVENT_H
+
+/*
+ * Kevent request flags.
+ */
+
+/* Process this event only once and then remove it. */
+#define KEVENT_REQ_ONESHOT 0x1
+/* Wake up only one thread, i.e. the event exclusively belongs to the
+ * thread that was woken. For example, when several threads wait for
+ * a new client connection in order to accept() it, it is a good idea
+ * to set this flag so that only one of the threads with this flag set
+ * will be woken up.
+ * Threads waiting on events without this flag will still be woken up
+ * as usual. */
+#define KEVENT_REQ_WAKEUP_ONE 0x2
+/* Edge Triggered behaviour. */
+#define KEVENT_REQ_ET 0x4
+/* Perform a last check on the kevent (call the appropriate callback)
+ * after it has been marked ready and removed from the ready queue.
+ * If the callback confirms that the kevent is ready
+ * (k->callbacks.callback(k) returns true), the kevent is copied
+ * to userspace, otherwise it is requeued back to its storage.
+ * The second (checking) call is performed with this bit _cleared_,
+ * so the callback can detect whether it was called from
+ * kevent_storage_ready() - bit is set, or
+ * kevent_dequeue_ready() - bit is cleared.
+ * If the kevent is requeued, the bit is set again. */
+#define KEVENT_REQ_LAST_CHECK 0x8
+
+/*
+ * Kevent return flags.
+ */
+/* Kevent is broken. */
+#define KEVENT_RET_BROKEN 0x1
+/* Kevent processing was finished successfully. */
+#define KEVENT_RET_DONE 0x2
+/* Kevent was not copied into ring buffer due to some error conditions. */
+#define KEVENT_RET_COPY_FAILED 0x4
+
+/*
+ * Kevent type set.
+ */
+#define KEVENT_SOCKET 0
+#define KEVENT_INODE 1
+#define KEVENT_TIMER 2
+#define KEVENT_POLL 3
+#define KEVENT_NAIO 4
+#define KEVENT_AIO 5
+#define KEVENT_PIPE 6
+#define KEVENT_MAX 7
+
+/*
+ * Per-type event sets.
+ * The number of per-type event sets must match the number of kevent types.
+ */
+
+/*
+ * Timer events.
+ */
+#define KEVENT_TIMER_FIRED 0x1
+
+/*
+ * Socket/network asynchronous IO events.
+ */
+#define KEVENT_SOCKET_RECV 0x1
+#define KEVENT_SOCKET_ACCEPT 0x2
+#define KEVENT_SOCKET_SEND 0x4
+
+/*
+ * Inode events.
+ */
+#define KEVENT_INODE_CREATE 0x1
+#define KEVENT_INODE_REMOVE 0x2
+
+/*
+ * Poll events.
+ */
+#define KEVENT_POLL_POLLIN 0x0001
+#define KEVENT_POLL_POLLPRI 0x0002
+#define KEVENT_POLL_POLLOUT 0x0004
+#define KEVENT_POLL_POLLERR 0x0008
+#define KEVENT_POLL_POLLHUP 0x0010
+#define KEVENT_POLL_POLLNVAL 0x0020
+
+#define KEVENT_POLL_POLLRDNORM 0x0040
+#define KEVENT_POLL_POLLRDBAND 0x0080
+#define KEVENT_POLL_POLLWRNORM 0x0100
+#define KEVENT_POLL_POLLWRBAND 0x0200
+#define KEVENT_POLL_POLLMSG 0x0400
+#define KEVENT_POLL_POLLREMOVE 0x1000
+
+/*
+ * Asynchronous IO events.
+ */
+#define KEVENT_AIO_BIO 0x1
+
+/* Mask of all possible event values. */
+#define KEVENT_MASK_ALL 0xffffffff
+/* Empty mask of ready events. */
+#define KEVENT_MASK_EMPTY 0x0
+
+struct kevent_id
+{
+ union {
+ __u32 raw[2];
+ __u64 raw_u64 __attribute__((aligned(8)));
+ };
+};
+
+struct ukevent
+{
+ /* Id of this request, e.g. socket number, file descriptor and so on... */
+ struct kevent_id id;
+ /* Event type, e.g. KEVENT_SOCK, KEVENT_INODE, KEVENT_TIMER and so on... */
+ __u32 type;
+ /* Event itself, e.g. SOCK_ACCEPT, INODE_CREATED, TIMER_FIRED... */
+ __u32 event;
+ /* Per-event request flags */
+ __u32 req_flags;
+ /* Per-event return flags */
+ __u32 ret_flags;
+ /* Event return data. Event originator fills it with anything it likes. */
+ __u32 ret_data[2];
+ /* User's data. It is not used, just copied to/from user.
+ * The whole structure is aligned to 8 bytes already, so the last union
+ * is aligned properly.
+ */
+ union {
+ __u32 user[2];
+ void *ptr;
+ };
+};
+
+struct kevent_ring
+{
+ unsigned int ring_kidx;
+ struct ukevent event[0];
+};
+
+#define KEVENT_CTL_ADD 0
+#define KEVENT_CTL_REMOVE 1
+#define KEVENT_CTL_MODIFY 2
+
+#endif /* __UKEVENT_H */
diff --git a/init/Kconfig b/init/Kconfig
index d2eb7a8..c7d8250 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -201,6 +201,8 @@ config AUDITSYSCALL
such as SELinux. To use audit's filesystem watch feature, please
ensure that INOTIFY is configured.

+source "kernel/kevent/Kconfig"
+
config IKCONFIG
bool "Kernel .config support"
---help---
diff --git a/kernel/Makefile b/kernel/Makefile
index d62ec66..2d7a6dd 100644
--- a/kernel/Makefile
+++ b/kernel/Makefile
@@ -47,6 +47,7 @@ obj-$(CONFIG_DETECT_SOFTLOCKUP) += softl
obj-$(CONFIG_GENERIC_HARDIRQS) += irq/
obj-$(CONFIG_SECCOMP) += seccomp.o
obj-$(CONFIG_RCU_TORTURE_TEST) += rcutorture.o
+obj-$(CONFIG_KEVENT) += kevent/
obj-$(CONFIG_RELAY) += relay.o
obj-$(CONFIG_TASK_DELAY_ACCT) += delayacct.o
obj-$(CONFIG_TASKSTATS) += taskstats.o
diff --git a/kernel/kevent/Kconfig b/kernel/kevent/Kconfig
new file mode 100644
index 0000000..267fc53
--- /dev/null
+++ b/kernel/kevent/Kconfig
@@ -0,0 +1,45 @@
+config KEVENT
+ bool "Kernel event notification mechanism"
+ help
+ This option enables the kevent queue mechanism.
+ It can be used as a replacement for poll()/select(), for AIO
+ callback invocation, advanced timer notifications and other
+ kernel object status changes.
+
+config KEVENT_USER_STAT
+ bool "Kevent user statistic"
+ depends on KEVENT
+ help
+ This option turns on kevent_user statistics collection.
+ Collected data includes the total number of kevents, the number of
+ kevents which were ready immediately at insertion time, and the
+ number of kevents which were removed after readiness completion.
+ Statistics are printed each time a kevent control descriptor is closed.
+
+config KEVENT_TIMER
+ bool "Kernel event notifications for timers"
+ depends on KEVENT
+ help
+ This option allows timers to be used through the kevent subsystem.
+
+config KEVENT_POLL
+ bool "Kernel event notifications for poll()/select()"
+ depends on KEVENT
+ help
+ This option allows the kevent subsystem to be used for
+ poll()/select() notifications.
+
+config KEVENT_SOCKET
+ bool "Kernel event notifications for sockets"
+ depends on NET && KEVENT
+ help
+ This option enables notifications of socket operations through
+ the kevent subsystem, such as new packet arrival, readiness for
+ accept() and so on.
+
+config KEVENT_PIPE
+ bool "Kernel event notifications for pipes"
+ depends on KEVENT
+ help
+ This option enables notifications of pipe read/write operations
+ through the kevent subsystem.
diff --git a/kernel/kevent/Makefile b/kernel/kevent/Makefile
new file mode 100644
index 0000000..d4d6b68
--- /dev/null
+++ b/kernel/kevent/Makefile
@@ -0,0 +1,5 @@
+obj-y := kevent.o kevent_user.o
+obj-$(CONFIG_KEVENT_TIMER) += kevent_timer.o
+obj-$(CONFIG_KEVENT_POLL) += kevent_poll.o
+obj-$(CONFIG_KEVENT_SOCKET) += kevent_socket.o
+obj-$(CONFIG_KEVENT_PIPE) += kevent_pipe.o
diff --git a/kernel/kevent/kevent.c b/kernel/kevent/kevent.c
new file mode 100644
index 0000000..24ee44a
--- /dev/null
+++ b/kernel/kevent/kevent.c
@@ -0,0 +1,232 @@
+/*
+ * 2006 Copyright (c) Evgeniy Polyakov <[email protected]>
+ * All rights reserved.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software
+ * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA
+ */
+
+#include <linux/kernel.h>
+#include <linux/types.h>
+#include <linux/list.h>
+#include <linux/slab.h>
+#include <linux/spinlock.h>
+#include <linux/mempool.h>
+#include <linux/sched.h>
+#include <linux/wait.h>
+#include <linux/kevent.h>
+
+/*
+ * Attempts to add an event to the appropriate origin's queue.
+ * Returns a positive value if the event is ready immediately,
+ * a negative value on error, and zero if the event has been queued.
+ * The ->enqueue() callback must increase the origin's reference counter.
+ */
+int kevent_enqueue(struct kevent *k)
+{
+ return k->callbacks.enqueue(k);
+}
+
+/*
+ * Remove event from the appropriate queue.
+ * ->dequeue() callback must decrease origin's reference counter.
+ */
+int kevent_dequeue(struct kevent *k)
+{
+ return k->callbacks.dequeue(k);
+}
+
+/*
+ * Mark kevent as broken.
+ */
+int kevent_break(struct kevent *k)
+{
+ unsigned long flags;
+
+ spin_lock_irqsave(&k->ulock, flags);
+ k->event.ret_flags |= KEVENT_RET_BROKEN;
+ spin_unlock_irqrestore(&k->ulock, flags);
+ return -EINVAL;
+}
+
+static struct kevent_callbacks kevent_registered_callbacks[KEVENT_MAX] __read_mostly;
+
+int kevent_add_callbacks(const struct kevent_callbacks *cb, int pos)
+{
+ struct kevent_callbacks *p;
+
+ if (pos < 0 || pos >= KEVENT_MAX)
+ return -EINVAL;
+
+ p = &kevent_registered_callbacks[pos];
+
+ p->enqueue = (cb->enqueue) ? cb->enqueue : kevent_break;
+ p->dequeue = (cb->dequeue) ? cb->dequeue : kevent_break;
+ p->callback = (cb->callback) ? cb->callback : kevent_break;
+
+ printk(KERN_INFO "KEVENT: Added callbacks for type %d.\n", pos);
+ return 0;
+}
+
+/*
+ * Must be called before the event is added to any origin's queue.
+ * Initializes the ->enqueue(), ->dequeue() and ->callback() callbacks.
+ * On failure the kevent must not be used: kevent_enqueue() would fail
+ * to add it to the origin's queue, setting the KEVENT_RET_BROKEN flag
+ * in kevent->event.ret_flags.
+ */
+int kevent_init(struct kevent *k)
+{
+ spin_lock_init(&k->ulock);
+ k->flags = 0;
+
+ if (unlikely(k->event.type >= KEVENT_MAX ||
+ !kevent_registered_callbacks[k->event.type].callback))
+ return kevent_break(k);
+
+ k->callbacks = kevent_registered_callbacks[k->event.type];
+ if (unlikely(k->callbacks.callback == kevent_break))
+ return kevent_break(k);
+
+ return 0;
+}
+
+/*
+ * Called from ->enqueue() callback when reference counter for given
+ * origin (socket, inode...) has been increased.
+ */
+int kevent_storage_enqueue(struct kevent_storage *st, struct kevent *k)
+{
+ unsigned long flags;
+
+ k->st = st;
+ spin_lock_irqsave(&st->lock, flags);
+ list_add_tail_rcu(&k->storage_entry, &st->list);
+ k->flags |= KEVENT_STORAGE;
+ spin_unlock_irqrestore(&st->lock, flags);
+ return 0;
+}
+
+/*
+ * Dequeue kevent from origin's queue.
+ * It does not decrease the origin's reference counter in any way;
+ * it must be called before that counter is dropped, so that the
+ * storage itself is still valid.
+ * It is called from ->dequeue() callback.
+ */
+void kevent_storage_dequeue(struct kevent_storage *st, struct kevent *k)
+{
+ unsigned long flags;
+
+ spin_lock_irqsave(&st->lock, flags);
+ if (k->flags & KEVENT_STORAGE) {
+ list_del_rcu(&k->storage_entry);
+ k->flags &= ~KEVENT_STORAGE;
+ }
+ spin_unlock_irqrestore(&st->lock, flags);
+}
+
+/*
+ * Call kevent ready callback and queue it into ready queue if needed.
+ * If kevent is marked as one-shot, then remove it from storage queue.
+ */
+static int __kevent_requeue(struct kevent *k, u32 event)
+{
+ int ret, rem;
+ unsigned long flags;
+
+ ret = k->callbacks.callback(k);
+
+ spin_lock_irqsave(&k->ulock, flags);
+ if (ret > 0)
+ k->event.ret_flags |= KEVENT_RET_DONE;
+ else if (ret < 0)
+ k->event.ret_flags |= (KEVENT_RET_BROKEN | KEVENT_RET_DONE);
+ else
+ ret = (k->event.ret_flags & (KEVENT_RET_BROKEN|KEVENT_RET_DONE));
+ rem = (k->event.req_flags & KEVENT_REQ_ONESHOT);
+ spin_unlock_irqrestore(&k->ulock, flags);
+
+ if (ret) {
+ if ((rem || ret < 0) && (k->flags & KEVENT_STORAGE)) {
+ list_del_rcu(&k->storage_entry);
+ k->flags &= ~KEVENT_STORAGE;
+ }
+
+ spin_lock_irqsave(&k->user->ready_lock, flags);
+ if (!(k->flags & KEVENT_READY)) {
+ list_add_tail(&k->ready_entry, &k->user->ready_list);
+ k->flags |= KEVENT_READY;
+ k->user->ready_num++;
+ }
+ spin_unlock_irqrestore(&k->user->ready_lock, flags);
+ wake_up(&k->user->wait);
+ }
+
+ return ret;
+}
+
+/*
+ * Check if kevent is ready (by invoking its callback) and requeue/remove
+ * if needed.
+ */
+void kevent_requeue(struct kevent *k)
+{
+ unsigned long flags;
+
+ spin_lock_irqsave(&k->st->lock, flags);
+ __kevent_requeue(k, 0);
+ spin_unlock_irqrestore(&k->st->lock, flags);
+}
+
+/*
+ * Called each time some activity in origin (socket, inode...) is noticed.
+ */
+void kevent_storage_ready(struct kevent_storage *st,
+ kevent_callback_t ready_callback, u32 event)
+{
+ struct kevent *k;
+ int wake_num = 0;
+
+ rcu_read_lock();
+ if (ready_callback)
+ list_for_each_entry_rcu(k, &st->list, storage_entry)
+ (*ready_callback)(k);
+
+ list_for_each_entry_rcu(k, &st->list, storage_entry) {
+ if (event & k->event.event)
+ if (!(k->event.req_flags & KEVENT_REQ_WAKEUP_ONE) || wake_num == 0)
+ if (__kevent_requeue(k, event))
+ wake_num++;
+ }
+ rcu_read_unlock();
+}
+
+int kevent_storage_init(void *origin, struct kevent_storage *st)
+{
+ spin_lock_init(&st->lock);
+ st->origin = origin;
+ INIT_LIST_HEAD(&st->list);
+ return 0;
+}
+
+/*
+ * Mark all events as broken, which removes them from the storage,
+ * so the storage origin (inode, socket and so on) can be safely removed.
+ * No new entries are allowed to be added to the storage at this point
+ * (the socket has already been removed from the file table, for example).
+ */
+void kevent_storage_fini(struct kevent_storage *st)
+{
+ kevent_storage_ready(st, kevent_break, KEVENT_MASK_ALL);
+}
diff --git a/kernel/kevent/kevent_user.c b/kernel/kevent/kevent_user.c
new file mode 100644
index 0000000..00d942a
--- /dev/null
+++ b/kernel/kevent/kevent_user.c
@@ -0,0 +1,936 @@
+/*
+ * 2006 Copyright (c) Evgeniy Polyakov <[email protected]>
+ * All rights reserved.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software
+ * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA
+ */
+
+#include <linux/kernel.h>
+#include <linux/module.h>
+#include <linux/types.h>
+#include <linux/list.h>
+#include <linux/slab.h>
+#include <linux/spinlock.h>
+#include <linux/fs.h>
+#include <linux/file.h>
+#include <linux/mount.h>
+#include <linux/device.h>
+#include <linux/poll.h>
+#include <linux/kevent.h>
+#include <linux/miscdevice.h>
+#include <asm/io.h>
+
+static const char kevent_name[] = "kevent";
+static kmem_cache_t *kevent_cache __read_mostly;
+
+/*
+ * kevents are pollable, return POLLIN and POLLRDNORM
+ * when there is at least one ready kevent.
+ */
+static unsigned int kevent_user_poll(struct file *file, struct poll_table_struct *wait)
+{
+ struct kevent_user *u = file->private_data;
+ unsigned int mask;
+
+ poll_wait(file, &u->wait, wait);
+ mask = 0;
+
+ if (u->ready_num)
+ mask |= POLLIN | POLLRDNORM;
+
+ return mask;
+}
+
+/*
+ * Copies kevent into the userspace ring buffer, if one was initialized.
+ * Returns
+ * 0 on success,
+ * -EAGAIN if there was no space for the kevent (should be impossible),
+ * -EFAULT if copy_to_user() failed.
+ *
+ * Must be called with kevent_user->ring_lock held.
+ */
+static int kevent_copy_ring_buffer(struct kevent *k)
+{
+ struct kevent_ring __user *ring;
+ struct kevent_user *u = k->user;
+ unsigned long flags;
+ int err;
+
+ ring = u->pring;
+ if (!ring)
+ return 0;
+
+ if (copy_to_user(&ring->event[u->kidx], &k->event, sizeof(struct ukevent))) {
+ err = -EFAULT;
+ goto err_out_exit;
+ }
+
+ if (put_user(u->kidx, &ring->ring_kidx)) {
+ err = -EFAULT;
+ goto err_out_exit;
+ }
+
+ if (++u->kidx >= u->ring_size)
+ u->kidx = 0;
+
+ return 0;
+
+err_out_exit:
+ spin_lock_irqsave(&k->ulock, flags);
+ k->event.ret_flags |= KEVENT_RET_COPY_FAILED;
+ spin_unlock_irqrestore(&k->ulock, flags);
+ return err;
+}
+
+static int kevent_user_open(struct inode *inode, struct file *file)
+{
+ struct kevent_user *u;
+
+ u = kzalloc(sizeof(struct kevent_user), GFP_KERNEL);
+ if (!u)
+ return -ENOMEM;
+
+ INIT_LIST_HEAD(&u->ready_list);
+ spin_lock_init(&u->ready_lock);
+ kevent_stat_init(u);
+ spin_lock_init(&u->kevent_lock);
+ u->kevent_root = RB_ROOT;
+
+ mutex_init(&u->ctl_mutex);
+ init_waitqueue_head(&u->wait);
+
+ atomic_set(&u->refcnt, 1);
+
+ mutex_init(&u->ring_lock);
+ u->kidx = u->ring_size = 0;
+ u->pring = NULL;
+
+ file->private_data = u;
+ return 0;
+}
+
+/*
+ * Kevent userspace control block reference counting.
+ * Set to 1 at creation time; when the corresponding kevent file
+ * descriptor is closed, the reference counter is decreased.
+ * When the counter hits zero, the block is freed.
+ */
+static inline void kevent_user_get(struct kevent_user *u)
+{
+ atomic_inc(&u->refcnt);
+}
+
+static inline void kevent_user_put(struct kevent_user *u)
+{
+ if (atomic_dec_and_test(&u->refcnt)) {
+ kevent_stat_print(u);
+ kfree(u);
+ }
+}
+
+static inline int kevent_compare_id(struct kevent_id *left, struct kevent_id *right)
+{
+ if (left->raw_u64 > right->raw_u64)
+ return -1;
+
+ if (right->raw_u64 > left->raw_u64)
+ return 1;
+
+ return 0;
+}
+
+/*
+ * RCU protects the storage list (kevent->storage_entry).
+ * The entry is freed in the RCU callback; it has been dequeued
+ * from all lists by that point.
+ */
+
+static void kevent_free_rcu(struct rcu_head *rcu)
+{
+ struct kevent *kevent = container_of(rcu, struct kevent, rcu_head);
+ kmem_cache_free(kevent_cache, kevent);
+}
+
+/*
+ * Must be called with u->ready_lock held.
+ * Unlinks the kevent from the ready queue.
+ */
+static inline void kevent_unlink_ready(struct kevent *k)
+{
+ list_del(&k->ready_entry);
+ k->flags &= ~KEVENT_READY;
+ k->user->ready_num--;
+}
+
+static void kevent_remove_ready(struct kevent *k)
+{
+ struct kevent_user *u = k->user;
+ unsigned long flags;
+
+ spin_lock_irqsave(&u->ready_lock, flags);
+ if (k->flags & KEVENT_READY)
+ kevent_unlink_ready(k);
+ spin_unlock_irqrestore(&u->ready_lock, flags);
+}
+
+/*
+ * Complete kevent removal: dequeue the kevent from its storage list
+ * if requested, remove it from the ready list, drop the userspace
+ * control block reference and schedule kevent freeing via RCU.
+ */
+static void kevent_finish_user_complete(struct kevent *k, int deq)
+{
+ if (deq)
+ kevent_dequeue(k);
+
+ kevent_remove_ready(k);
+
+ kevent_user_put(k->user);
+ call_rcu(&k->rcu_head, kevent_free_rcu);
+}
+
+/*
+ * Remove the kevent from all lists and free it.
+ * Must be called with kevent_user->kevent_lock held to protect
+ * removal from the kevent tree.
+ */
+static void __kevent_finish_user(struct kevent *k, int deq)
+{
+ struct kevent_user *u = k->user;
+
+ rb_erase(&k->kevent_node, &u->kevent_root);
+ k->flags &= ~KEVENT_USER;
+ u->kevent_num--;
+ kevent_finish_user_complete(k, deq);
+}
+
+/*
+ * Removes the kevent from the user's tree of all events,
+ * dequeues it from its storage and drops the user's reference counter,
+ * since this kevent no longer exists. That is why it is freed here.
+ */
+static void kevent_finish_user(struct kevent *k, int deq)
+{
+ struct kevent_user *u = k->user;
+ unsigned long flags;
+
+ spin_lock_irqsave(&u->kevent_lock, flags);
+ rb_erase(&k->kevent_node, &u->kevent_root);
+ k->flags &= ~KEVENT_USER;
+ u->kevent_num--;
+ spin_unlock_irqrestore(&u->kevent_lock, flags);
+ kevent_finish_user_complete(k, deq);
+}
+
+/*
+ * Dequeue one entry from user's ready queue.
+ */
+static struct kevent *kevent_dequeue_ready(struct kevent_user *u)
+{
+ unsigned long flags;
+ struct kevent *k = NULL;
+
+ mutex_lock(&u->ring_lock);
+ while (u->ready_num && !k) {
+ spin_lock_irqsave(&u->ready_lock, flags);
+ if (u->ready_num && !list_empty(&u->ready_list)) {
+ k = list_entry(u->ready_list.next, struct kevent, ready_entry);
+ kevent_unlink_ready(k);
+ }
+ spin_unlock_irqrestore(&u->ready_lock, flags);
+
+ if (k && (k->event.req_flags & KEVENT_REQ_LAST_CHECK)) {
+ unsigned long flags;
+
+ spin_lock_irqsave(&k->ulock, flags);
+ k->event.req_flags &= ~KEVENT_REQ_LAST_CHECK;
+ spin_unlock_irqrestore(&k->ulock, flags);
+
+ if (!k->callbacks.callback(k)) {
+ spin_lock_irqsave(&k->ulock, flags);
+ k->event.req_flags |= KEVENT_REQ_LAST_CHECK;
+ k->event.ret_flags = 0;
+ k->event.ret_data[0] = k->event.ret_data[1] = 0;
+ spin_unlock_irqrestore(&k->ulock, flags);
+ k = NULL;
+ }
+ } else
+ break;
+ }
+
+ if (k)
+ kevent_copy_ring_buffer(k);
+ mutex_unlock(&u->ring_lock);
+
+ return k;
+}
+
+static void kevent_complete_ready(struct kevent *k)
+{
+ if (k->event.req_flags & KEVENT_REQ_ONESHOT)
+ /*
+ * If it is one-shot kevent, it has been removed already from
+ * origin's queue, so we can easily free it here.
+ */
+ kevent_finish_user(k, 1);
+ else if (k->event.req_flags & KEVENT_REQ_ET) {
+ unsigned long flags;
+
+ /*
+ * Edge-triggered behaviour: clear the returned status so the
+ * event is reported as a fresh one the next time it fires.
+ */
+
+ spin_lock_irqsave(&k->ulock, flags);
+ k->event.ret_flags = 0;
+ k->event.ret_data[0] = k->event.ret_data[1] = 0;
+ spin_unlock_irqrestore(&k->ulock, flags);
+ }
+}
+
+/*
+ * Search a kevent inside kevent tree for given ukevent.
+ */
+static struct kevent *__kevent_search(struct kevent_id *id, struct kevent_user *u)
+{
+ struct kevent *k, *ret = NULL;
+ struct rb_node *n = u->kevent_root.rb_node;
+ int cmp;
+
+ while (n) {
+ k = rb_entry(n, struct kevent, kevent_node);
+ cmp = kevent_compare_id(&k->event.id, id);
+
+ if (cmp > 0)
+ n = n->rb_right;
+ else if (cmp < 0)
+ n = n->rb_left;
+ else {
+ ret = k;
+ break;
+ }
+ }
+
+ return ret;
+}
+
+/*
+ * Search and modify kevent according to provided ukevent.
+ */
+static int kevent_modify(struct ukevent *uk, struct kevent_user *u)
+{
+ struct kevent *k;
+ int err = -ENODEV;
+ unsigned long flags;
+
+ spin_lock_irqsave(&u->kevent_lock, flags);
+ k = __kevent_search(&uk->id, u);
+ if (k) {
+ spin_lock(&k->ulock);
+ k->event.event = uk->event;
+ k->event.req_flags = uk->req_flags;
+ k->event.ret_flags = 0;
+ spin_unlock(&k->ulock);
+ kevent_requeue(k);
+ err = 0;
+ }
+ spin_unlock_irqrestore(&u->kevent_lock, flags);
+
+ return err;
+}
+
+/*
+ * Remove kevent which matches provided ukevent.
+ */
+static int kevent_remove(struct ukevent *uk, struct kevent_user *u)
+{
+ int err = -ENODEV;
+ struct kevent *k;
+ unsigned long flags;
+
+ spin_lock_irqsave(&u->kevent_lock, flags);
+ k = __kevent_search(&uk->id, u);
+ if (k) {
+ __kevent_finish_user(k, 1);
+ err = 0;
+ }
+ spin_unlock_irqrestore(&u->kevent_lock, flags);
+
+ return err;
+}
+
+/*
+ * Detaches the userspace control block from the file descriptor
+ * and decreases its reference counter.
+ * No new kevents can be added or removed from any list at this point.
+ */
+static int kevent_user_release(struct inode *inode, struct file *file)
+{
+ struct kevent_user *u = file->private_data;
+ struct kevent *k;
+ struct rb_node *n;
+
+ /*
+ * kevent_finish_user() erases the node and frees the kevent, so
+ * calling rb_next() on it would be a use-after-free; restart from
+ * rb_first() on every iteration instead.
+ */
+ while ((n = rb_first(&u->kevent_root)) != NULL) {
+ k = rb_entry(n, struct kevent, kevent_node);
+ kevent_finish_user(k, 1);
+ }
+
+ kevent_user_put(u);
+ file->private_data = NULL;
+
+ return 0;
+}
+
+/*
+ * Read requested number of ukevents in one shot.
+ */
+static struct ukevent *kevent_get_user(unsigned int num, void __user *arg)
+{
+ struct ukevent *ukev;
+
+ /* Guard against multiplication overflow for very large @num. */
+ if (num > ULONG_MAX / sizeof(struct ukevent))
+ return NULL;
+
+ ukev = kmalloc(sizeof(struct ukevent) * num, GFP_KERNEL);
+ if (!ukev)
+ return NULL;
+
+ if (copy_from_user(ukev, arg, sizeof(struct ukevent) * num)) {
+ kfree(ukev);
+ return NULL;
+ }
+
+ return ukev;
+}
+
+/*
+ * Read all ukevents from userspace and modify the matching kevents.
+ * If the provided number of ukevents is above the threshold, it is
+ * faster to allocate room for all of them and copy them in one shot
+ * than to copy and process them one by one.
+ */
+static int kevent_user_ctl_modify(struct kevent_user *u, unsigned int num, void __user *arg)
+{
+ int err = 0, i;
+ struct ukevent uk;
+
+ mutex_lock(&u->ctl_mutex);
+
+ if (num > u->kevent_num) {
+ err = -EINVAL;
+ goto out;
+ }
+
+ if (num > KEVENT_MIN_BUFFS_ALLOC) {
+ struct ukevent *ukev;
+
+ ukev = kevent_get_user(num, arg);
+ if (ukev) {
+ for (i = 0; i < num; ++i) {
+ if (kevent_modify(&ukev[i], u))
+ ukev[i].ret_flags |= KEVENT_RET_BROKEN;
+ ukev[i].ret_flags |= KEVENT_RET_DONE;
+ }
+ if (copy_to_user(arg, ukev, num*sizeof(struct ukevent)))
+ err = -EFAULT;
+ kfree(ukev);
+ goto out;
+ }
+ }
+
+ for (i = 0; i < num; ++i) {
+ if (copy_from_user(&uk, arg, sizeof(struct ukevent))) {
+ err = -EFAULT;
+ break;
+ }
+
+ if (kevent_modify(&uk, u))
+ uk.ret_flags |= KEVENT_RET_BROKEN;
+ uk.ret_flags |= KEVENT_RET_DONE;
+
+ if (copy_to_user(arg, &uk, sizeof(struct ukevent))) {
+ err = -EFAULT;
+ break;
+ }
+
+ arg += sizeof(struct ukevent);
+ }
+out:
+ mutex_unlock(&u->ctl_mutex);
+
+ return err;
+}
+
+/*
+ * Read all ukevents from userspace and remove the matching kevents.
+ * If the provided number of ukevents is above the threshold, it is
+ * faster to allocate room for all of them and copy them in one shot
+ * than to copy and process them one by one.
+ */
+static int kevent_user_ctl_remove(struct kevent_user *u, unsigned int num, void __user *arg)
+{
+ int err = 0, i;
+ struct ukevent uk;
+
+ mutex_lock(&u->ctl_mutex);
+
+ if (num > u->kevent_num) {
+ err = -EINVAL;
+ goto out;
+ }
+
+ if (num > KEVENT_MIN_BUFFS_ALLOC) {
+ struct ukevent *ukev;
+
+ ukev = kevent_get_user(num, arg);
+ if (ukev) {
+ for (i = 0; i < num; ++i) {
+ if (kevent_remove(&ukev[i], u))
+ ukev[i].ret_flags |= KEVENT_RET_BROKEN;
+ ukev[i].ret_flags |= KEVENT_RET_DONE;
+ }
+ if (copy_to_user(arg, ukev, num*sizeof(struct ukevent)))
+ err = -EFAULT;
+ kfree(ukev);
+ goto out;
+ }
+ }
+
+ for (i = 0; i < num; ++i) {
+ if (copy_from_user(&uk, arg, sizeof(struct ukevent))) {
+ err = -EFAULT;
+ break;
+ }
+
+ if (kevent_remove(&uk, u))
+ uk.ret_flags |= KEVENT_RET_BROKEN;
+
+ uk.ret_flags |= KEVENT_RET_DONE;
+
+ if (copy_to_user(arg, &uk, sizeof(struct ukevent))) {
+ err = -EFAULT;
+ break;
+ }
+
+ arg += sizeof(struct ukevent);
+ }
+out:
+ mutex_unlock(&u->ctl_mutex);
+
+ return err;
+}
+
+/*
+ * Queues the kevent into the userspace control block and increases
+ * the block's reference counter.
+ */
+static int kevent_user_enqueue(struct kevent_user *u, struct kevent *new)
+{
+ unsigned long flags;
+ struct rb_node **p = &u->kevent_root.rb_node, *parent = NULL;
+ struct kevent *k;
+ int err = 0, cmp;
+
+ spin_lock_irqsave(&u->kevent_lock, flags);
+ while (*p) {
+ parent = *p;
+ k = rb_entry(parent, struct kevent, kevent_node);
+
+ cmp = kevent_compare_id(&k->event.id, &new->event.id);
+ if (cmp > 0)
+ p = &parent->rb_right;
+ else if (cmp < 0)
+ p = &parent->rb_left;
+ else {
+ err = -EEXIST;
+ break;
+ }
+ }
+ if (likely(!err)) {
+ rb_link_node(&new->kevent_node, parent, p);
+ rb_insert_color(&new->kevent_node, &u->kevent_root);
+ new->flags |= KEVENT_USER;
+ u->kevent_num++;
+ kevent_user_get(u);
+ }
+ spin_unlock_irqrestore(&u->kevent_lock, flags);
+
+ return err;
+}
+
+/*
+ * Adds a kevent on behalf of both kernel and userspace users.
+ * This function allocates and queues the kevent; it returns a negative
+ * value on error, a positive value if the kevent is ready immediately,
+ * and zero if the kevent has been queued.
+ */
+int kevent_user_add_ukevent(struct ukevent *uk, struct kevent_user *u)
+{
+ struct kevent *k;
+ int err;
+
+ k = kmem_cache_alloc(kevent_cache, GFP_KERNEL);
+ if (!k) {
+ err = -ENOMEM;
+ goto err_out_exit;
+ }
+
+ memcpy(&k->event, uk, sizeof(struct ukevent));
+ INIT_RCU_HEAD(&k->rcu_head);
+
+ k->event.ret_flags = 0;
+
+ err = kevent_init(k);
+ if (err) {
+ kmem_cache_free(kevent_cache, k);
+ goto err_out_exit;
+ }
+ k->user = u;
+ kevent_stat_total(u);
+ err = kevent_user_enqueue(u, k);
+ if (err) {
+ kmem_cache_free(kevent_cache, k);
+ goto err_out_exit;
+ }
+
+ err = kevent_enqueue(k);
+ if (err) {
+ memcpy(uk, &k->event, sizeof(struct ukevent));
+ kevent_finish_user(k, 0);
+ goto err_out_exit;
+ }
+
+ return 0;
+
+err_out_exit:
+ if (err < 0) {
+ uk->ret_flags |= KEVENT_RET_BROKEN | KEVENT_RET_DONE;
+ uk->ret_data[1] = err;
+ } else if (err > 0)
+ uk->ret_flags |= KEVENT_RET_DONE;
+ return err;
+}
+
+/*
+ * Copy all ukevents from userspace, allocate kevent for each one
+ * and add them into appropriate kevent_storages,
+ * e.g. sockets, inodes and so on...
+ * Ready events will replace the ones provided by the user, and the
+ * number of ready events is returned.
+ * The user must check the ret_flags field of each ukevent structure
+ * to determine whether the event fired or failed.
+ */
+static int kevent_user_ctl_add(struct kevent_user *u, unsigned int num, void __user *arg)
+{
+ int err, cerr = 0, rnum = 0, i;
+ void __user *orig = arg;
+ struct ukevent uk;
+
+ mutex_lock(&u->ctl_mutex);
+
+ err = -EINVAL;
+ if (num > KEVENT_MIN_BUFFS_ALLOC) {
+ struct ukevent *ukev;
+
+ ukev = kevent_get_user(num, arg);
+ if (ukev) {
+ for (i = 0; i < num; ++i) {
+ err = kevent_user_add_ukevent(&ukev[i], u);
+ if (err) {
+ kevent_stat_im(u);
+ if (i != rnum)
+ memcpy(&ukev[rnum], &ukev[i], sizeof(struct ukevent));
+ rnum++;
+ }
+ }
+ if (copy_to_user(orig, ukev, rnum*sizeof(struct ukevent)))
+ cerr = -EFAULT;
+ kfree(ukev);
+ goto out_setup;
+ }
+ }
+
+ for (i = 0; i < num; ++i) {
+ if (copy_from_user(&uk, arg, sizeof(struct ukevent))) {
+ cerr = -EFAULT;
+ break;
+ }
+ arg += sizeof(struct ukevent);
+
+ err = kevent_user_add_ukevent(&uk, u);
+ if (err) {
+ kevent_stat_im(u);
+ if (copy_to_user(orig, &uk, sizeof(struct ukevent))) {
+ cerr = -EFAULT;
+ break;
+ }
+ orig += sizeof(struct ukevent);
+ rnum++;
+ }
+ }
+
+out_setup:
+ if (cerr < 0) {
+ err = cerr;
+ goto out_remove;
+ }
+
+ err = rnum;
+out_remove:
+ mutex_unlock(&u->ctl_mutex);
+
+ return err;
+}
+
+/*
+ * In nonblocking mode it returns as many events as possible, but not more than @max_nr.
+ * In blocking mode it waits until the timeout expires or at least @min_nr events are ready.
+ */
+static int kevent_user_wait(struct file *file, struct kevent_user *u,
+ unsigned int min_nr, unsigned int max_nr, __u64 timeout,
+ void __user *buf)
+{
+ struct kevent *k;
+ int num = 0;
+
+ if (!(file->f_flags & O_NONBLOCK)) {
+ wait_event_interruptible_timeout(u->wait,
+ u->ready_num >= min_nr,
+ clock_t_to_jiffies(nsec_to_clock_t(timeout)));
+ }
+
+ while (num < max_nr && ((k = kevent_dequeue_ready(u)) != NULL)) {
+ if (copy_to_user(buf + num*sizeof(struct ukevent),
+ &k->event, sizeof(struct ukevent))) {
+ if (num == 0)
+ num = -EFAULT;
+ break;
+ }
+ kevent_complete_ready(k);
+ ++num;
+ kevent_stat_wait(u);
+ }
+
+ return num;
+}
+
+static struct file_operations kevent_user_fops = {
+ .open = kevent_user_open,
+ .release = kevent_user_release,
+ .poll = kevent_user_poll,
+ .owner = THIS_MODULE,
+};
+
+static struct miscdevice kevent_miscdev = {
+ .minor = MISC_DYNAMIC_MINOR,
+ .name = kevent_name,
+ .fops = &kevent_user_fops,
+};
+
+static int kevent_ctl_process(struct file *file, unsigned int cmd, unsigned int num, void __user *arg)
+{
+ int err;
+ struct kevent_user *u = file->private_data;
+
+ switch (cmd) {
+ case KEVENT_CTL_ADD:
+ err = kevent_user_ctl_add(u, num, arg);
+ break;
+ case KEVENT_CTL_REMOVE:
+ err = kevent_user_ctl_remove(u, num, arg);
+ break;
+ case KEVENT_CTL_MODIFY:
+ err = kevent_user_ctl_modify(u, num, arg);
+ break;
+ default:
+ err = -EINVAL;
+ break;
+ }
+
+ return err;
+}
+
+/*
+ * Used to get ready kevents from queue.
+ * @ctl_fd - kevent control descriptor, obtained by opening the kevent misc device.
+ * @min_nr - minimum number of ready kevents.
+ * @max_nr - maximum number of ready kevents.
+ * @timeout - timeout in nanoseconds to wait until some events are ready.
+ * @buf - buffer to place ready events.
+ * @flags - unused for now (will be used for the mmap implementation).
+ */
+asmlinkage long sys_kevent_get_events(int ctl_fd, unsigned int min_nr, unsigned int max_nr,
+ __u64 timeout, struct ukevent __user *buf, unsigned flags)
+{
+ int err = -EINVAL;
+ struct file *file;
+ struct kevent_user *u;
+
+ file = fget(ctl_fd);
+ if (!file)
+ return -EBADF;
+
+ if (file->f_op != &kevent_user_fops)
+ goto out_fput;
+ u = file->private_data;
+
+ err = kevent_user_wait(file, u, min_nr, max_nr, timeout, buf);
+out_fput:
+ fput(file);
+ return err;
+}
+
+asmlinkage long sys_kevent_ring_init(int ctl_fd, struct kevent_ring __user *ring, unsigned int num)
+{
+ int err = -EINVAL;
+ struct file *file;
+ struct kevent_user *u;
+
+ file = fget(ctl_fd);
+ if (!file)
+ return -EBADF;
+
+ if (file->f_op != &kevent_user_fops)
+ goto out_fput;
+ u = file->private_data;
+
+ mutex_lock(&u->ring_lock);
+ if (u->pring) {
+ err = -EINVAL;
+ goto err_out_exit;
+ }
+ u->pring = ring;
+ u->ring_size = num;
+ mutex_unlock(&u->ring_lock);
+
+ fput(file);
+
+ return 0;
+
+err_out_exit:
+ mutex_unlock(&u->ring_lock);
+out_fput:
+ fput(file);
+ return err;
+}
+
+/*
+ * This syscall waits until there is free space in the kevent queue, then
+ * removes/requeues the requested number of events (commits them). It returns
+ * the number of events actually committed.
+ *
+ * @ctl_fd - kevent file descriptor.
+ * @num - number of kevents to process.
+ * @timeout - number of nanoseconds to wait for free space in the kevent queue.
+ *
+ * Committing @num events means removing the first @num kevents from the
+ * ready queue and copying them into the ring buffer, in the order they
+ * were placed into the ready queue.
+ */
+asmlinkage long sys_kevent_wait(int ctl_fd, unsigned int num, __u64 timeout)
+{
+ int err = -EINVAL, committed = 0;
+ struct file *file;
+ struct kevent_user *u;
+ struct kevent *k;
+ struct kevent_ring __user *ring;
+ unsigned int i;
+
+ file = fget(ctl_fd);
+ if (!file)
+ return -EBADF;
+
+ if (file->f_op != &kevent_user_fops)
+ goto out_fput;
+ u = file->private_data;
+
+ ring = u->pring;
+ if (!ring || num >= u->ring_size)
+ goto out_fput;
+
+ if (!(file->f_flags & O_NONBLOCK)) {
+ wait_event_interruptible_timeout(u->wait,
+ u->ready_num >= 1,
+ clock_t_to_jiffies(nsec_to_clock_t(timeout)));
+ }
+
+ for (i = 0; i < num; ++i) {
+ k = kevent_dequeue_ready(u);
+ if (!k)
+ break;
+ kevent_complete_ready(k);
+ kevent_stat_ring(u);
+ committed++;
+ }
+
+ fput(file);
+
+ return committed;
+out_fput:
+ fput(file);
+ return err;
+}
+
+/*
+ * This syscall is used to perform various control operations
+ * on given kevent queue, which is obtained through kevent file descriptor @fd.
+ * @cmd - type of operation.
+ * @num - number of kevents to be processed.
+ * @arg - pointer to array of struct ukevent.
+ */
+asmlinkage long sys_kevent_ctl(int fd, unsigned int cmd, unsigned int num, struct ukevent __user *arg)
+{
+ int err = -EINVAL;
+ struct file *file;
+
+ file = fget(fd);
+ if (!file)
+ return -EBADF;
+
+ if (file->f_op != &kevent_user_fops)
+ goto out_fput;
+
+ err = kevent_ctl_process(file, cmd, num, arg);
+
+out_fput:
+ fput(file);
+ return err;
+}
+
+/*
+ * Kevent subsystem initialization - create the kevent cache and register
+ * the misc device used to obtain control file descriptors.
+ */
+static int __init kevent_user_init(void)
+{
+ int err = 0;
+
+ kevent_cache = kmem_cache_create("kevent_cache",
+ sizeof(struct kevent), 0, SLAB_PANIC, NULL, NULL);
+
+ err = misc_register(&kevent_miscdev);
+ if (err) {
+ printk(KERN_ERR "Failed to register kevent miscdev: err=%d.\n", err);
+ goto err_out_exit;
+ }
+
+ printk(KERN_INFO "KEVENT subsystem has been successfully registered.\n");
+
+ return 0;
+
+err_out_exit:
+ kmem_cache_destroy(kevent_cache);
+ return err;
+}
+
+module_init(kevent_user_init);
diff --git a/kernel/sys_ni.c b/kernel/sys_ni.c
index 7a3b2e7..5200583 100644
--- a/kernel/sys_ni.c
+++ b/kernel/sys_ni.c
@@ -122,6 +122,11 @@ cond_syscall(ppc_rtas);
cond_syscall(sys_spu_run);
cond_syscall(sys_spu_create);

+cond_syscall(sys_kevent_get_events);
+cond_syscall(sys_kevent_wait);
+cond_syscall(sys_kevent_ctl);
+cond_syscall(sys_kevent_ring_init);
+
/* mmu depending weak syscall entries */
cond_syscall(sys_mprotect);
cond_syscall(sys_msync);

2006-11-09 08:26:54

by Evgeniy Polyakov

[permalink] [raw]
Subject: [take24 4/6] kevent: Socket notifications.


Socket notifications.

This patch adds socket send/recv/accept notifications.
With a trivial web server based on kevent and these features
instead of epoll, performance increased noticeably.
More details about various benchmarks and the server itself
(evserver_kevent.c) can be found on the project's homepage.

Signed-off-by: Evgeniy Polyakov <[email protected]>

diff --git a/fs/inode.c b/fs/inode.c
index ada7643..6745c00 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -21,6 +21,7 @@ #include <linux/pagemap.h>
#include <linux/cdev.h>
#include <linux/bootmem.h>
#include <linux/inotify.h>
+#include <linux/kevent.h>
#include <linux/mount.h>

/*
@@ -164,12 +165,18 @@ #endif
}
inode->i_private = 0;
inode->i_mapping = mapping;
+#if defined CONFIG_KEVENT_SOCKET || defined CONFIG_KEVENT_PIPE
+ kevent_storage_init(inode, &inode->st);
+#endif
}
return inode;
}

void destroy_inode(struct inode *inode)
{
+#if defined CONFIG_KEVENT_SOCKET
+ kevent_storage_fini(&inode->st);
+#endif
BUG_ON(inode_has_buffers(inode));
security_inode_free(inode);
if (inode->i_sb->s_op->destroy_inode)
diff --git a/include/net/sock.h b/include/net/sock.h
index edd4d73..d48ded8 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -48,6 +48,7 @@ #include <linux/lockdep.h>
#include <linux/netdevice.h>
#include <linux/skbuff.h> /* struct sk_buff */
#include <linux/security.h>
+#include <linux/kevent.h>

#include <linux/filter.h>

@@ -450,6 +451,21 @@ static inline int sk_stream_memory_free(

extern void sk_stream_rfree(struct sk_buff *skb);

+struct socket_alloc {
+ struct socket socket;
+ struct inode vfs_inode;
+};
+
+static inline struct socket *SOCKET_I(struct inode *inode)
+{
+ return &container_of(inode, struct socket_alloc, vfs_inode)->socket;
+}
+
+static inline struct inode *SOCK_INODE(struct socket *socket)
+{
+ return &container_of(socket, struct socket_alloc, socket)->vfs_inode;
+}
+
static inline void sk_stream_set_owner_r(struct sk_buff *skb, struct sock *sk)
{
skb->sk = sk;
@@ -477,6 +493,7 @@ static inline void sk_add_backlog(struct
sk->sk_backlog.tail = skb;
}
skb->next = NULL;
+ kevent_socket_notify(sk, KEVENT_SOCKET_RECV);
}

#define sk_wait_event(__sk, __timeo, __condition) \
@@ -679,21 +696,6 @@ static inline struct kiocb *siocb_to_kio
return si->kiocb;
}

-struct socket_alloc {
- struct socket socket;
- struct inode vfs_inode;
-};
-
-static inline struct socket *SOCKET_I(struct inode *inode)
-{
- return &container_of(inode, struct socket_alloc, vfs_inode)->socket;
-}
-
-static inline struct inode *SOCK_INODE(struct socket *socket)
-{
- return &container_of(socket, struct socket_alloc, socket)->vfs_inode;
-}
-
extern void __sk_stream_mem_reclaim(struct sock *sk);
extern int sk_stream_mem_schedule(struct sock *sk, int size, int kind);

diff --git a/include/net/tcp.h b/include/net/tcp.h
index 7a093d0..69f4ad2 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -857,6 +857,7 @@ static inline int tcp_prequeue(struct so
tp->ucopy.memory = 0;
} else if (skb_queue_len(&tp->ucopy.prequeue) == 1) {
wake_up_interruptible(sk->sk_sleep);
+ kevent_socket_notify(sk, KEVENT_SOCKET_RECV|KEVENT_SOCKET_SEND);
if (!inet_csk_ack_scheduled(sk))
inet_csk_reset_xmit_timer(sk, ICSK_TIME_DACK,
(3 * TCP_RTO_MIN) / 4,
diff --git a/kernel/kevent/kevent_socket.c b/kernel/kevent/kevent_socket.c
new file mode 100644
index 0000000..7f74110
--- /dev/null
+++ b/kernel/kevent/kevent_socket.c
@@ -0,0 +1,135 @@
+/*
+ * kevent_socket.c
+ *
+ * 2006 Copyright (c) Evgeniy Polyakov <[email protected]>
+ * All rights reserved.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software
+ * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA
+ */
+
+#include <linux/kernel.h>
+#include <linux/types.h>
+#include <linux/list.h>
+#include <linux/slab.h>
+#include <linux/spinlock.h>
+#include <linux/timer.h>
+#include <linux/file.h>
+#include <linux/tcp.h>
+#include <linux/kevent.h>
+
+#include <net/sock.h>
+#include <net/request_sock.h>
+#include <net/inet_connection_sock.h>
+
+static int kevent_socket_callback(struct kevent *k)
+{
+ struct inode *inode = k->st->origin;
+ unsigned int events = SOCKET_I(inode)->ops->poll(SOCKET_I(inode)->file, SOCKET_I(inode), NULL);
+
+ if ((events & (POLLIN | POLLRDNORM)) && (k->event.event & (KEVENT_SOCKET_RECV | KEVENT_SOCKET_ACCEPT)))
+ return 1;
+ if ((events & (POLLOUT | POLLWRNORM)) && (k->event.event & KEVENT_SOCKET_SEND))
+ return 1;
+ return 0;
+}
+
+int kevent_socket_enqueue(struct kevent *k)
+{
+ struct inode *inode;
+ struct socket *sock;
+ int err = -EBADF;
+
+ sock = sockfd_lookup(k->event.id.raw[0], &err);
+ if (!sock)
+ goto err_out_exit;
+
+ inode = igrab(SOCK_INODE(sock));
+ if (!inode)
+ goto err_out_fput;
+
+ err = kevent_storage_enqueue(&inode->st, k);
+ if (err)
+ goto err_out_iput;
+
+ err = k->callbacks.callback(k);
+ if (err)
+ goto err_out_dequeue;
+
+ return err;
+
+err_out_dequeue:
+ kevent_storage_dequeue(k->st, k);
+err_out_iput:
+ iput(inode);
+err_out_fput:
+ sockfd_put(sock);
+err_out_exit:
+ return err;
+}
+
+int kevent_socket_dequeue(struct kevent *k)
+{
+ struct inode *inode = k->st->origin;
+ struct socket *sock;
+
+ kevent_storage_dequeue(k->st, k);
+
+ sock = SOCKET_I(inode);
+ iput(inode);
+ sockfd_put(sock);
+
+ return 0;
+}
+
+void kevent_socket_notify(struct sock *sk, u32 event)
+{
+ if (sk->sk_socket)
+ kevent_storage_ready(&SOCK_INODE(sk->sk_socket)->st, NULL, event);
+}
+
+/*
+ * This export is required for network protocols compiled as modules, such as IPv6.
+ */
+EXPORT_SYMBOL_GPL(kevent_socket_notify);
+
+#ifdef CONFIG_LOCKDEP
+static struct lock_class_key kevent_sock_key;
+
+void kevent_socket_reinit(struct socket *sock)
+{
+ struct inode *inode = SOCK_INODE(sock);
+
+ lockdep_set_class(&inode->st.lock, &kevent_sock_key);
+}
+
+void kevent_sk_reinit(struct sock *sk)
+{
+ if (sk->sk_socket) {
+ struct inode *inode = SOCK_INODE(sk->sk_socket);
+
+ lockdep_set_class(&inode->st.lock, &kevent_sock_key);
+ }
+}
+#endif
+
+static int __init kevent_init_socket(void)
+{
+ struct kevent_callbacks sc = {
+ .callback = &kevent_socket_callback,
+ .enqueue = &kevent_socket_enqueue,
+ .dequeue = &kevent_socket_dequeue};
+
+ return kevent_add_callbacks(&sc, KEVENT_SOCKET);
+}
+module_init(kevent_init_socket);
diff --git a/net/core/sock.c b/net/core/sock.c
index b77e155..7d5fa3e 100644
--- a/net/core/sock.c
+++ b/net/core/sock.c
@@ -1402,6 +1402,7 @@ static void sock_def_wakeup(struct sock
if (sk->sk_sleep && waitqueue_active(sk->sk_sleep))
wake_up_interruptible_all(sk->sk_sleep);
read_unlock(&sk->sk_callback_lock);
+ kevent_socket_notify(sk, KEVENT_SOCKET_RECV|KEVENT_SOCKET_SEND);
}

static void sock_def_error_report(struct sock *sk)
@@ -1411,6 +1412,7 @@ static void sock_def_error_report(struct
wake_up_interruptible(sk->sk_sleep);
sk_wake_async(sk,0,POLL_ERR);
read_unlock(&sk->sk_callback_lock);
+ kevent_socket_notify(sk, KEVENT_SOCKET_RECV|KEVENT_SOCKET_SEND);
}

static void sock_def_readable(struct sock *sk, int len)
@@ -1420,6 +1422,7 @@ static void sock_def_readable(struct soc
wake_up_interruptible(sk->sk_sleep);
sk_wake_async(sk,1,POLL_IN);
read_unlock(&sk->sk_callback_lock);
+ kevent_socket_notify(sk, KEVENT_SOCKET_RECV|KEVENT_SOCKET_SEND);
}

static void sock_def_write_space(struct sock *sk)
@@ -1439,6 +1442,7 @@ static void sock_def_write_space(struct
}

read_unlock(&sk->sk_callback_lock);
+ kevent_socket_notify(sk, KEVENT_SOCKET_SEND|KEVENT_SOCKET_RECV);
}

static void sock_def_destruct(struct sock *sk)
@@ -1489,6 +1493,8 @@ #endif
sk->sk_state = TCP_CLOSE;
sk->sk_socket = sock;

+ kevent_sk_reinit(sk);
+
sock_set_flag(sk, SOCK_ZAPPED);

if(sock)
@@ -1555,8 +1561,10 @@ void fastcall release_sock(struct sock *
if (sk->sk_backlog.tail)
__release_sock(sk);
sk->sk_lock.owner = NULL;
- if (waitqueue_active(&sk->sk_lock.wq))
+ if (waitqueue_active(&sk->sk_lock.wq)) {
wake_up(&sk->sk_lock.wq);
+ kevent_socket_notify(sk, KEVENT_SOCKET_RECV|KEVENT_SOCKET_SEND);
+ }
spin_unlock_bh(&sk->sk_lock.slock);
}
EXPORT_SYMBOL(release_sock);
diff --git a/net/core/stream.c b/net/core/stream.c
index d1d7dec..2878c2a 100644
--- a/net/core/stream.c
+++ b/net/core/stream.c
@@ -36,6 +36,7 @@ void sk_stream_write_space(struct sock *
wake_up_interruptible(sk->sk_sleep);
if (sock->fasync_list && !(sk->sk_shutdown & SEND_SHUTDOWN))
sock_wake_async(sock, 2, POLL_OUT);
+ kevent_socket_notify(sk, KEVENT_SOCKET_SEND|KEVENT_SOCKET_RECV);
}
}

diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index 3f884ce..e7dd989 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -3119,6 +3119,7 @@ static void tcp_ofo_queue(struct sock *s

__skb_unlink(skb, &tp->out_of_order_queue);
__skb_queue_tail(&sk->sk_receive_queue, skb);
+ kevent_socket_notify(sk, KEVENT_SOCKET_RECV);
tp->rcv_nxt = TCP_SKB_CB(skb)->end_seq;
if(skb->h.th->fin)
tcp_fin(skb, sk, skb->h.th);
diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
index c83938b..b0dd70d 100644
--- a/net/ipv4/tcp_ipv4.c
+++ b/net/ipv4/tcp_ipv4.c
@@ -61,6 +61,7 @@ #include <linux/cache.h>
#include <linux/jhash.h>
#include <linux/init.h>
#include <linux/times.h>
+#include <linux/kevent.h>

#include <net/icmp.h>
#include <net/inet_hashtables.h>
@@ -870,6 +871,7 @@ #endif
reqsk_free(req);
} else {
inet_csk_reqsk_queue_hash_add(sk, req, TCP_TIMEOUT_INIT);
+ kevent_socket_notify(sk, KEVENT_SOCKET_ACCEPT);
}
return 0;

diff --git a/net/socket.c b/net/socket.c
index 1bc4167..5582b4a 100644
--- a/net/socket.c
+++ b/net/socket.c
@@ -85,6 +85,7 @@ #include <linux/compat.h>
#include <linux/kmod.h>
#include <linux/audit.h>
#include <linux/wireless.h>
+#include <linux/kevent.h>

#include <asm/uaccess.h>
#include <asm/unistd.h>
@@ -490,6 +491,8 @@ static struct socket *sock_alloc(void)
inode->i_uid = current->fsuid;
inode->i_gid = current->fsgid;

+ kevent_socket_reinit(sock);
+
get_cpu_var(sockets_in_use)++;
put_cpu_var(sockets_in_use);
return sock;

2006-11-09 09:08:41

by Eric Dumazet

[permalink] [raw]
Subject: Re: [take24 3/6] kevent: poll/select() notifications.

On Thursday 09 November 2006 09:23, Evgeniy Polyakov wrote:
> poll/select() notifications.
>
> This patch includes generic poll/select notifications.
> kevent_poll works simialr to epoll and has the same issues (callback
> is invoked not from internal state machine of the caller, but through
> process awake, a lot of allocations and so on).
>
> Signed-off-by: Evgeniy Polyakov <[email protected]>
>
> diff --git a/fs/file_table.c b/fs/file_table.c
> index bc35a40..0805547 100644
> --- a/fs/file_table.c
> +++ b/fs/file_table.c
> @@ -20,6 +20,7 @@ #include <linux/capability.h>
> #include <linux/cdev.h>
> #include <linux/fsnotify.h>
> #include <linux/sysctl.h>
> +#include <linux/kevent.h>
> #include <linux/percpu_counter.h>
>
> #include <asm/atomic.h>
> @@ -119,6 +120,7 @@ struct file *get_empty_filp(void)
> f->f_uid = tsk->fsuid;
> f->f_gid = tsk->fsgid;
> eventpoll_init_file(f);
> + kevent_init_file(f);
> /* f->f_version: 0 */
> return f;
>
> @@ -164,6 +166,7 @@ void fastcall __fput(struct file *file)
> * in the file cleanup chain.
> */
> eventpoll_release(file);
> + kevent_cleanup_file(file);
> locks_remove_flock(file);
>
> if (file->f_op && file->f_op->release)
> diff --git a/fs/inode.c b/fs/inode.c
> index ada7643..6745c00 100644
> --- a/fs/inode.c
> +++ b/fs/inode.c
> @@ -21,6 +21,7 @@ #include <linux/pagemap.h>
> #include <linux/cdev.h>
> #include <linux/bootmem.h>
> #include <linux/inotify.h>
> +#include <linux/kevent.h>
> #include <linux/mount.h>
>
> /*
> @@ -164,12 +165,18 @@ #endif
> }
> inode->i_private = 0;
> inode->i_mapping = mapping;

Here you test both KEVENT_SOCKET and KEVENT_PIPE

> +#if defined CONFIG_KEVENT_SOCKET || defined CONFIG_KEVENT_PIPE
> + kevent_storage_init(inode, &inode->st);
> +#endif
> }
> return inode;
> }
>
> void destroy_inode(struct inode *inode)
> {

but here you test only KEVENT_SOCKET

> +#if defined CONFIG_KEVENT_SOCKET
> + kevent_storage_fini(&inode->st);
> +#endif
> BUG_ON(inode_has_buffers(inode));
> security_inode_free(inode);
> if (inode->i_sb->s_op->destroy_inode)
> diff --git a/include/linux/fs.h b/include/linux/fs.h
> index 5baf3a1..c529723 100644
> --- a/include/linux/fs.h
> +++ b/include/linux/fs.h
> @@ -276,6 +276,7 @@ #include <linux/prio_tree.h>
> #include <linux/init.h>
> #include <linux/sched.h>
> #include <linux/mutex.h>
> +#include <linux/kevent_storage.h>
>
> #include <asm/atomic.h>
> #include <asm/semaphore.h>
> @@ -586,6 +587,10 @@ #ifdef CONFIG_INOTIFY
> struct mutex inotify_mutex; /* protects the watches list */
> #endif
>

Here you include a kevent_storage only if KEVENT_SOCKET

> +#ifdef CONFIG_KEVENT_SOCKET
> + struct kevent_storage st;
> +#endif
> +
> unsigned long i_state;
> unsigned long dirtied_when; /* jiffies of first dirtying */
>
> @@ -739,6 +744,9 @@ #ifdef CONFIG_EPOLL
> struct list_head f_ep_links;
> spinlock_t f_ep_lock;
> #endif /* #ifdef CONFIG_EPOLL */
> +#ifdef CONFIG_KEVENT_POLL
> + struct kevent_storage st;
> +#endif
> struct address_space *f_mapping;
> };
> extern spinlock_t files_lock;

2006-11-09 09:32:45

by Evgeniy Polyakov

Subject: Re: [take24 3/6] kevent: poll/select() notifications.

On Thu, Nov 09, 2006 at 10:08:44AM +0100, Eric Dumazet ([email protected]) wrote:
> Here you test both KEVENT_SOCKET and KEVENT_PIPE
>
> > +#if defined CONFIG_KEVENT_SOCKET || defined CONFIG_KEVENT_PIPE
> > + kevent_storage_init(inode, &inode->st);
> > +#endif
> > }
> > return inode;
> > }
> >
> > void destroy_inode(struct inode *inode)
> > {
>
> but here you test only KEVENT_SOCKET
>
> > +#if defined CONFIG_KEVENT_SOCKET
> > + kevent_storage_fini(&inode->st);
> > +#endif

Indeed, it must be
#if defined CONFIG_KEVENT_SOCKET || defined CONFIG_KEVENT_PIPE

> > BUG_ON(inode_has_buffers(inode));
> > security_inode_free(inode);
> > if (inode->i_sb->s_op->destroy_inode)
> > diff --git a/include/linux/fs.h b/include/linux/fs.h
> > index 5baf3a1..c529723 100644
> > --- a/include/linux/fs.h
> > +++ b/include/linux/fs.h
> > @@ -276,6 +276,7 @@ #include <linux/prio_tree.h>
> > #include <linux/init.h>
> > #include <linux/sched.h>
> > #include <linux/mutex.h>
> > +#include <linux/kevent_storage.h>
> >
> > #include <asm/atomic.h>
> > #include <asm/semaphore.h>
> > @@ -586,6 +587,10 @@ #ifdef CONFIG_INOTIFY
> > struct mutex inotify_mutex; /* protects the watches list */
> > #endif
> >
>
> Here you include a kevent_storage only if KEVENT_SOCKET
>
> > +#ifdef CONFIG_KEVENT_SOCKET
> > + struct kevent_storage st;
> > +#endif
> > +

It must be
#if defined CONFIG_KEVENT_SOCKET || defined CONFIG_KEVENT_PIPE

--
Evgeniy Polyakov

2006-11-09 18:52:06

by Davide Libenzi

Subject: Re: [take24 3/6] kevent: poll/select() notifications.

On Thu, 9 Nov 2006, Evgeniy Polyakov wrote:

> +static int kevent_poll_callback(struct kevent *k)
> +{
> + if (k->event.req_flags & KEVENT_REQ_LAST_CHECK) {
> + return 1;
> + } else {
> + struct file *file = k->st->origin;
> + unsigned int revents = file->f_op->poll(file, NULL);
> +
> + k->event.ret_data[0] = revents & k->event.event;
> +
> + return (revents & k->event.event);
> + }
> +}

You need to be careful that file->f_op->poll is not called inside the
spin_lock_irqsave/spin_unlock_irqrestore pair, since (this also came up
during the epoll development days) file->f_op->poll might do a simple
spin_lock_irq/spin_unlock_irq. This unfortunate constraint forced epoll to
have a suboptimal double O(R) loop to handle LT events.



- Davide


2006-11-09 19:14:19

by Evgeniy Polyakov

Subject: Re: [take24 3/6] kevent: poll/select() notifications.

On Thu, Nov 09, 2006 at 10:51:56AM -0800, Davide Libenzi ([email protected]) wrote:
> On Thu, 9 Nov 2006, Evgeniy Polyakov wrote:
>
> > +static int kevent_poll_callback(struct kevent *k)
> > +{
> > + if (k->event.req_flags & KEVENT_REQ_LAST_CHECK) {
> > + return 1;
> > + } else {
> > + struct file *file = k->st->origin;
> > + unsigned int revents = file->f_op->poll(file, NULL);
> > +
> > + k->event.ret_data[0] = revents & k->event.event;
> > +
> > + return (revents & k->event.event);
> > + }
> > +}
>
> You need to be careful that file->f_op->poll is not called inside the
> spin_lock_irqsave/spin_unlock_irqrestore pair, since (this also came up
> during the epoll development days) file->f_op->poll might do a simple
> spin_lock_irq/spin_unlock_irq. This unfortunate constraint forced epoll to
> have a suboptimal double O(R) loop to handle LT events.

It is tricky - users call wake_up() from any context, which in turn ends
up calling kevent_storage_ready(), which calls kevent_poll_callback() with
the KEVENT_REQ_LAST_CHECK bit set, so the callback becomes an almost empty
call in the fast path. Since the callback returns 1, the kevent is queued
into the ready queue, which is processed on behalf of syscalls - at that
point kevent checks the flag, and since KEVENT_REQ_LAST_CHECK is set, it
calls the callback again to check that the kevent is correctly marked, but
this time without that flag (this happens in syscall context, i.e. process
context without any locks held), so the callback calls ->poll(), which can
sleep, but that is safe there. If ->poll() returns a 'ready' value, the
kevent transfers its data into userspace, otherwise it is 'requeued' (just
removed from the ready queue).
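Roughly, the two-phase check can be sketched in plain userspace C (all
structure and function names here are made-up stand-ins for the kernel
code, not the real kevent types):

```c
#include <assert.h>

#define KEVENT_REQ_LAST_CHECK 0x1

/* Hypothetical simplified kevent: just the fields the check needs. */
struct kev_sim {
	unsigned int req_flags;	/* request flags from the ukevent */
	unsigned int event;	/* mask of events the user asked for */
	unsigned int revents;	/* what ->poll() would report right now */
};

/* Stand-in for file->f_op->poll(file, NULL). */
static unsigned int fake_poll(const struct kev_sim *k)
{
	return k->revents;
}

/* Mirrors the idea of kevent_poll_callback(): in the fast path
 * (wake_up() context, storage lock held) the callback only tests the
 * flag and returns 1 so the kevent gets queued; the real ->poll() runs
 * later from syscall context with no locks held, where sleeping is safe. */
static int poll_callback(struct kev_sim *k, int fast_path)
{
	if (fast_path && (k->req_flags & KEVENT_REQ_LAST_CHECK))
		return 1;	/* just queue it, final check happens later */
	return fake_poll(k) & k->event;
}
```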

> - Davide
>

--
Evgeniy Polyakov

2006-11-09 19:42:54

by Davide Libenzi

Subject: Re: [take24 3/6] kevent: poll/select() notifications.

On Thu, 9 Nov 2006, Evgeniy Polyakov wrote:

> On Thu, Nov 09, 2006 at 10:51:56AM -0800, Davide Libenzi ([email protected]) wrote:
> > On Thu, 9 Nov 2006, Evgeniy Polyakov wrote:
> >
> > > +static int kevent_poll_callback(struct kevent *k)
> > > +{
> > > + if (k->event.req_flags & KEVENT_REQ_LAST_CHECK) {
> > > + return 1;
> > > + } else {
> > > + struct file *file = k->st->origin;
> > > + unsigned int revents = file->f_op->poll(file, NULL);
> > > +
> > > + k->event.ret_data[0] = revents & k->event.event;
> > > +
> > > + return (revents & k->event.event);
> > > + }
> > > +}
> >
> > You need to be careful that file->f_op->poll is not called inside the
> > spin_lock_irqsave/spin_unlock_irqrestore pair, since (this also came up
> > during the epoll development days) file->f_op->poll might do a simple
> > spin_lock_irq/spin_unlock_irq. This unfortunate constraint forced epoll to
> > have a suboptimal double O(R) loop to handle LT events.
>
> It is tricky - users call wake_up() from any context, which in turn ends
> up calling kevent_storage_ready(), which calls kevent_poll_callback() with
> the KEVENT_REQ_LAST_CHECK bit set, so the callback becomes an almost empty
> call in the fast path. Since the callback returns 1, the kevent is queued
> into the ready queue, which is processed on behalf of syscalls - at that
> point kevent checks the flag, and since KEVENT_REQ_LAST_CHECK is set, it
> calls the callback again to check that the kevent is correctly marked, but
> this time without that flag (this happens in syscall context, i.e. process
> context without any locks held), so the callback calls ->poll(), which can
> sleep, but that is safe there. If ->poll() returns a 'ready' value, the
> kevent transfers its data into userspace, otherwise it is 'requeued' (just
> removed from the ready queue).

Oh, mine was only a general warn. I hadn't looked at the generic code
before. But now that I poke on it, I see:

void kevent_requeue(struct kevent *k)
{
unsigned long flags;

spin_lock_irqsave(&k->st->lock, flags);
__kevent_requeue(k, 0);
spin_unlock_irqrestore(&k->st->lock, flags);
}

and then:

static int __kevent_requeue(struct kevent *k, u32 event)
{
int ret, rem;
unsigned long flags;

ret = k->callbacks.callback(k);

Couldn't k->callbacks.callback() possibly end up calling f_op->poll?



- Davide


2006-11-09 20:10:46

by Davide Libenzi

Subject: Re: [take24 3/6] kevent: poll/select() notifications.

On Thu, 9 Nov 2006, Davide Libenzi wrote:

> On Thu, 9 Nov 2006, Evgeniy Polyakov wrote:
>
> > On Thu, Nov 09, 2006 at 10:51:56AM -0800, Davide Libenzi ([email protected]) wrote:
> > > On Thu, 9 Nov 2006, Evgeniy Polyakov wrote:
> > >
> > > > +static int kevent_poll_callback(struct kevent *k)
> > > > +{
> > > > + if (k->event.req_flags & KEVENT_REQ_LAST_CHECK) {
> > > > + return 1;
> > > > + } else {
> > > > + struct file *file = k->st->origin;
> > > > + unsigned int revents = file->f_op->poll(file, NULL);
> > > > +
> > > > + k->event.ret_data[0] = revents & k->event.event;
> > > > +
> > > > + return (revents & k->event.event);
> > > > + }
> > > > +}
> > >
> > > You need to be careful that file->f_op->poll is not called inside the
> > > spin_lock_irqsave/spin_unlock_irqrestore pair, since (this also came up
> > > during the epoll development days) file->f_op->poll might do a simple
> > > spin_lock_irq/spin_unlock_irq. This unfortunate constraint forced epoll to
> > > have a suboptimal double O(R) loop to handle LT events.
> >
> > It is tricky - users call wake_up() from any context, which in turn ends
> > up calling kevent_storage_ready(), which calls kevent_poll_callback() with
> > the KEVENT_REQ_LAST_CHECK bit set, so the callback becomes an almost empty
> > call in the fast path. Since the callback returns 1, the kevent is queued
> > into the ready queue, which is processed on behalf of syscalls - at that
> > point kevent checks the flag, and since KEVENT_REQ_LAST_CHECK is set, it
> > calls the callback again to check that the kevent is correctly marked, but
> > this time without that flag (this happens in syscall context, i.e. process
> > context without any locks held), so the callback calls ->poll(), which can
> > sleep, but that is safe there. If ->poll() returns a 'ready' value, the
> > kevent transfers its data into userspace, otherwise it is 'requeued' (just
> > removed from the ready queue).
>
> Oh, mine was only a general warn. I hadn't looked at the generic code
> before. But now that I poke on it, I see:
>
> void kevent_requeue(struct kevent *k)
> {
> unsigned long flags;
>
> spin_lock_irqsave(&k->st->lock, flags);
> __kevent_requeue(k, 0);
> spin_unlock_irqrestore(&k->st->lock, flags);
> }
>
> and then:
>
> static int __kevent_requeue(struct kevent *k, u32 event)
> {
> int ret, rem;
> unsigned long flags;
>
> ret = k->callbacks.callback(k);
>
> Couldn't k->callbacks.callback() possibly end up calling f_op->poll?

Ack, there is the check for KEVENT_REQ_LAST_CHECK inside the callback.
The problem with f_op->poll was not that it can sleep (not excluded,
though) but that some f_op->poll implementations can do a simple
spin_lock_irq/spin_unlock_irq. But from a quick peek your new code seems
fine with that.



- Davide


2006-11-11 17:42:00

by Evgeniy Polyakov

Subject: [take24 7/6] kevent: signal notifications.

Signals which are requested to be delivered through the kevent
subsystem must still be registered through the usual signal() and
related syscalls; this option allows an alternative delivery path.

With the KEVENT_SIGNAL_NOMASK flag set in the kevent for a set of
signals, those signals will not be delivered in the usual way.
Kevents for the appropriate signals are not copied when a process forks;
the new process must add new kevents after fork(). The signal mask
is copied as before.

A test application which registers signal handlers for the usr1 and usr2
signals and tests their delivery through kevent (the former with both
signal-handler and kevent notifications, the latter only through kevent)
is called signal.c and can be found in the archive on the project homepage
http://tservice.net.ru/~s0mbre/old/?section=projects&item=kevent

Signed-off-by: Evgeniy Polyakov <[email protected]>

diff --git a/include/linux/kevent.h b/include/linux/kevent.h
index f7cbf6b..e588ae6 100644
--- a/include/linux/kevent.h
+++ b/include/linux/kevent.h
@@ -28,6 +28,7 @@ #include <linux/wait.h>
#include <linux/net.h>
#include <linux/rcupdate.h>
#include <linux/fs.h>
+#include <linux/sched.h>
#include <linux/kevent_storage.h>
#include <linux/ukevent.h>

@@ -220,4 +221,10 @@ #else
static inline void kevent_pipe_notify(struct inode *inode, u32 events) {}
#endif

+#ifdef CONFIG_KEVENT_SIGNAL
+extern int kevent_signal_notify(struct task_struct *tsk, int sig);
+#else
+static inline int kevent_signal_notify(struct task_struct *tsk, int sig) {return 0;}
+#endif
+
#endif /* __KEVENT_H */
diff --git a/include/linux/sched.h b/include/linux/sched.h
index fc4a987..ef38a3c 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -80,6 +80,7 @@ #include <linux/param.h>
#include <linux/resource.h>
#include <linux/timer.h>
#include <linux/hrtimer.h>
+#include <linux/kevent_storage.h>

#include <asm/processor.h>

@@ -1013,6 +1014,10 @@ #endif
#ifdef CONFIG_TASK_DELAY_ACCT
struct task_delay_info *delays;
#endif
+#ifdef CONFIG_KEVENT_SIGNAL
+ struct kevent_storage st;
+ u32 kevent_signals;
+#endif
};

static inline pid_t process_group(struct task_struct *tsk)
diff --git a/include/linux/ukevent.h b/include/linux/ukevent.h
index b14e14e..a6038eb 100644
--- a/include/linux/ukevent.h
+++ b/include/linux/ukevent.h
@@ -68,7 +68,8 @@ #define KEVENT_POLL 3
#define KEVENT_NAIO 4
#define KEVENT_AIO 5
#define KEVENT_PIPE 6
-#define KEVENT_MAX 7
+#define KEVENT_SIGNAL 7
+#define KEVENT_MAX 8

/*
* Per-type event sets.
@@ -81,7 +82,7 @@ #define KEVENT_MAX 7
#define KEVENT_TIMER_FIRED 0x1

/*
- * Socket/network asynchronous IO events.
+ * Socket/network asynchronous IO and PIPE events.
*/
#define KEVENT_SOCKET_RECV 0x1
#define KEVENT_SOCKET_ACCEPT 0x2
@@ -115,10 +116,20 @@ #define KEVENT_POLL_POLLREMOVE 0x1000
*/
#define KEVENT_AIO_BIO 0x1

-#define KEVENT_MASK_ALL 0xffffffff
+/*
+ * Signal events.
+ */
+#define KEVENT_SIGNAL_DELIVERY 0x1
+
+/* If set in raw64, then given signals will not be delivered
+ * in a usual way through sigmask update and signal callback
+ * invocation. */
+#define KEVENT_SIGNAL_NOMASK 0x8000000000000000ULL
+
/* Mask of all possible event values. */
-#define KEVENT_MASK_EMPTY 0x0
+#define KEVENT_MASK_ALL 0xffffffff
/* Empty mask of ready events. */
+#define KEVENT_MASK_EMPTY 0x0

struct kevent_id
{
diff --git a/kernel/fork.c b/kernel/fork.c
index 1c999f3..e5b5b14 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -46,6 +46,7 @@ #include <linux/cn_proc.h>
#include <linux/delayacct.h>
#include <linux/taskstats_kern.h>
#include <linux/random.h>
+#include <linux/kevent.h>

#include <asm/pgtable.h>
#include <asm/pgalloc.h>
@@ -115,6 +116,9 @@ void __put_task_struct(struct task_struc
WARN_ON(atomic_read(&tsk->usage));
WARN_ON(tsk == current);

+#ifdef CONFIG_KEVENT_SIGNAL
+ kevent_storage_fini(&tsk->st);
+#endif
security_task_free(tsk);
free_uid(tsk->user);
put_group_info(tsk->group_info);
@@ -1121,6 +1125,10 @@ #endif
if (retval)
goto bad_fork_cleanup_namespace;

+#ifdef CONFIG_KEVENT_SIGNAL
+ kevent_storage_init(p, &p->st);
+#endif
+
p->set_child_tid = (clone_flags & CLONE_CHILD_SETTID) ? child_tidptr : NULL;
/*
* Clear TID on mm_release()?
diff --git a/kernel/kevent/Kconfig b/kernel/kevent/Kconfig
index 267fc53..4b137ee 100644
--- a/kernel/kevent/Kconfig
+++ b/kernel/kevent/Kconfig
@@ -43,3 +43,18 @@ config KEVENT_PIPE
help
This option enables notifications through KEVENT subsystem of
pipe read/write operations.
+
+config KEVENT_SIGNAL
+ bool "Kernel event notifications for signals"
+ depends on KEVENT
+ help
+ This option enables signal delivery through KEVENT subsystem.
+ Signals which were requested to be delivered through kevent
+ subsystem must be registered through usual signal() and others
+ syscalls, this option allows alternative delivery.
+ With KEVENT_SIGNAL_NOMASK flag being set in kevent for set of
+ signals, they will not be delivered in a usual way.
+ Kevents for appropriate signals are not copied when process forks,
+ new process must add new kevents after fork(). Mask of signals
+ is copied as before.
+
diff --git a/kernel/kevent/Makefile b/kernel/kevent/Makefile
index d4d6b68..f98e0c8 100644
--- a/kernel/kevent/Makefile
+++ b/kernel/kevent/Makefile
@@ -3,3 +3,4 @@ obj-$(CONFIG_KEVENT_TIMER) += kevent_tim
obj-$(CONFIG_KEVENT_POLL) += kevent_poll.o
obj-$(CONFIG_KEVENT_SOCKET) += kevent_socket.o
obj-$(CONFIG_KEVENT_PIPE) += kevent_pipe.o
+obj-$(CONFIG_KEVENT_SIGNAL) += kevent_signal.o
diff --git a/kernel/kevent/kevent_signal.c b/kernel/kevent/kevent_signal.c
new file mode 100644
index 0000000..15f9d1f
--- /dev/null
+++ b/kernel/kevent/kevent_signal.c
@@ -0,0 +1,87 @@
+/*
+ * kevent_signal.c
+ *
+ * 2006 Copyright (c) Evgeniy Polyakov <[email protected]>
+ * All rights reserved.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software
+ * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA
+ */
+
+#include <linux/kernel.h>
+#include <linux/types.h>
+#include <linux/slab.h>
+#include <linux/spinlock.h>
+#include <linux/file.h>
+#include <linux/fs.h>
+#include <linux/kevent.h>
+
+static int kevent_signal_callback(struct kevent *k)
+{
+ struct task_struct *tsk = k->st->origin;
+ int sig = k->event.id.raw[0];
+ int ret = 0;
+
+ if (sig == tsk->kevent_signals)
+ ret = 1;
+
+ if (ret && (k->event.id.raw_u64 & KEVENT_SIGNAL_NOMASK))
+ tsk->kevent_signals |= 0x80000000;
+
+ return ret;
+}
+
+int kevent_signal_enqueue(struct kevent *k)
+{
+ int err;
+
+ err = kevent_storage_enqueue(&current->st, k);
+ if (err)
+ goto err_out_exit;
+
+ err = k->callbacks.callback(k);
+ if (err)
+ goto err_out_dequeue;
+
+ return err;
+
+err_out_dequeue:
+ kevent_storage_dequeue(k->st, k);
+err_out_exit:
+ return err;
+}
+
+int kevent_signal_dequeue(struct kevent *k)
+{
+ kevent_storage_dequeue(k->st, k);
+ return 0;
+}
+
+int kevent_signal_notify(struct task_struct *tsk, int sig)
+{
+ tsk->kevent_signals = sig;
+ kevent_storage_ready(&tsk->st, NULL, KEVENT_SIGNAL_DELIVERY);
+ return (tsk->kevent_signals & 0x80000000);
+}
+
+static int __init kevent_init_signal(void)
+{
+ struct kevent_callbacks sc = {
+ .callback = &kevent_signal_callback,
+ .enqueue = &kevent_signal_enqueue,
+ .dequeue = &kevent_signal_dequeue};
+
+ return kevent_add_callbacks(&sc, KEVENT_SIGNAL);
+}
+module_init(kevent_init_signal);
diff --git a/kernel/signal.c b/kernel/signal.c
index fb5da6d..d3d3594 100644
--- a/kernel/signal.c
+++ b/kernel/signal.c
@@ -23,6 +23,7 @@ #include <linux/syscalls.h>
#include <linux/ptrace.h>
#include <linux/signal.h>
#include <linux/capability.h>
+#include <linux/kevent.h>
#include <asm/param.h>
#include <asm/uaccess.h>
#include <asm/unistd.h>
@@ -703,6 +704,9 @@ static int send_signal(int sig, struct s
{
struct sigqueue * q = NULL;
int ret = 0;
+
+ if (kevent_signal_notify(t, sig))
+ return 1;

/*
* fast-pathed signals for kernel-internal things like SIGSTOP
@@ -782,6 +786,17 @@ specific_send_sig_info(int sig, struct s
ret = send_signal(sig, info, t, &t->pending);
if (!ret && !sigismember(&t->blocked, sig))
signal_wake_up(t, sig == SIGKILL);
+#ifdef CONFIG_KEVENT_SIGNAL
+ /*
+ * Kevent allows to deliver signals through kevent queue,
+ * it is possible to setup kevent to not deliver
+ * signal through the usual way, in that case send_signal()
+ * returns 1 and signal is delivered only through kevent queue.
+ * We simulate successful delivery notification through this hack:
+ */
+ if (ret == 1)
+ ret = 0;
+#endif
out:
return ret;
}
@@ -971,6 +986,17 @@ __group_send_sig_info(int sig, struct si
* to avoid several races.
*/
ret = send_signal(sig, info, p, &p->signal->shared_pending);
+#ifdef CONFIG_KEVENT_SIGNAL
+ /*
+ * Kevent allows to deliver signals through kevent queue,
+ * it is possible to setup kevent to not deliver
+ * signal through the usual way, in that case send_signal()
+ * returns 1 and signal is delivered only through kevent queue.
+ * We simulate successful delivery notification through this hack:
+ */
+ if (ret == 1)
+ ret = 0;
+#endif
if (unlikely(ret))
return ret;


--
Evgeniy Polyakov

2006-11-11 22:31:58

by Ulrich Drepper

Subject: Re: [take24 0/6] kevent: Generic event handling mechanism.

Evgeniy Polyakov wrote:
> Generic event handling mechanism.
> [...]

Sorry for the delay again. Kernel work is simply not my highest priority.

I've collected my comments on some parts of the patch. I haven't gone
through every part of the patch yet. Sorry for the length.

===================

- basic ring buffer problem: the kevent_copy_ring_buffer function stores
the event in the ring buffer without regard for the current content.

+ if more entries are dequeued than the ring buffer can hold,
events immediately get overwritten without anything being passed to
userlevel

+ as with the old approach, the ring buffer is basically unusable with
multiple threads/processes. A thread calling kevent_wait might
cause entries another thread is still working on to be overwritten.

Possible solution:

a) it would be possible to have a "used" flag in each ring buffer entry.
That's too expensive, I guess.

b) kevent_wait needs another parameter which specifies which is the
last (i.e., least recently added) entry in the ring buffer.
Everything between this entry and the current head (in ->kidx) is
occupied. If multiple threads arrive in kevent_wait the highest idx
(with wrap around possibly lowest) is used.

kevent_wait will not try to move more entries into the ring buffer
if ->kidx and the highest index passed in to any kevent_wait call
is equal (i.e., the ring buffer is full).

There is one issue, though, and that is that a system call is needed
to signal to the kernel that more entries in the ring buffer are
processed and that they can be refilled. This goes against the
kernel filling the ring buffer automatically (see below)


Threads should be able to (not necessarily forced to) use the
interfaces like this:

- by default all threads are "parked" in the kevent_wait syscall.


- If an event occurs one thread might be woken (depending on the 'num'
parameter)

- the woken thread(s) work on all the events in the ring buffer and
then call kevent_wait() again.

This requires that the threads can independently call kevent_wait()
and that they can independently retrieve events from the ring buffer
without fear the entry gets overwritten before it is retrieved.
Atomically retrieving entries from the ring buffer can be implemented
at userlevel. Either the ring buffer is writable and a field in each
ring buffer entry can be used as a 'handled' flag. Obviously this can
be done with atomic compare-and-exchange. If the ring buffer is not
writable then, as part of the userlevel wrapper around the event
handling interfaces, another array is created which contains the use
flags for each ring buffer entry. This is less elegant and probably
slower.
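As a userspace sketch, the compare-and-exchange claiming of ring buffer
entries mentioned above could look like this (the structure layout and
names are only illustrative, not a proposed ABI):

```c
#include <assert.h>
#include <stdatomic.h>

/* Each ring entry carries a 'handled' flag; threads claim an entry
 * with compare-and-exchange so no entry is ever retrieved twice. */
struct ring_entry {
	atomic_int handled;	/* 0 = unclaimed, 1 = claimed by a thread */
	int payload;		/* stand-in for the ukevent data */
};

/* Returns 1 if the calling thread won the entry, 0 if another thread
 * already claimed it (the CAS fails because handled is no longer 0). */
static int claim_entry(struct ring_entry *e)
{
	int expected = 0;
	return atomic_compare_exchange_strong(&e->handled, &expected, 1);
}
```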

===================

- implementing the kevent_wait syscall the proposed way means we are
missing out on one possible optimization. The ring buffer is
currently only filled on kevent_wait calls. I expect that in really
high traffic situations requests are coming in at a higher rate than
they can be processed, at least for periods of time. In such
situations it would be nice to not have to call into the kernel at
all. If the kernel would deliver into the ring buffer on its own
this would be possible.

If the argument against this is that kevent_get_event should be
possible the answer is...

===================

- the kevent_get_event syscall is not needed at all. All reporting
should be done using a ring buffer. There really is not reason to
keep two interfaces around which serve the same purpose. Making
the argument that kevent_get_event is so much easier to use is not
valid. The exposed interface to access the ring buffer will be easy,
too. In the OLS paper I more or less hinted at the interfaces. I
think they should be like this (names are irrelevant):

ec_t ec_create(unsigned flags);
int ec_destroy(ec_t ec);
int ec_poll_event(ec_t ec, event_data_t *d);
int ec_wait_event(ec_t ec, event_data_t *d);
int ec_timedwait_event(ec_t ec, event_data_t *d, struct timespec *to);

The latter three interfaces are the interesting ones. We have to get
the data out of the ring buffer as quickly as possible. So the
interfaces require passing in a reference to an object which can hold
the data. The 'poll' variant won't delay, the other two will.

We need separate create and destroy functions since there will always
be a userlevel component of the data structures. The create variant
can allocate the ring buffer and the other memory needed ('handled'
flags, tail pointers, ...) and destroy free all resources.

These interfaces are fast and easy to use. At least as easy as the
kevent_get_event syscall. And all transparently implemented on top of
the ring buffer. So, please let's drop the unneeded syscall.
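To make the shape of the proposed wrapper concrete, here is a tiny
userspace sketch of ec_poll_event on top of a ring buffer. The ec_t
layout and the index handling are hypothetical; the point is only that
the ring plus a bit of userlevel bookkeeping (the tail pointer that
ec_create would allocate) is all the wrapper needs:

```c
#include <assert.h>

#define EC_RING 8

/* Hypothetical userlevel handle: the ring (which would really be the
 * mmap'ed kernel buffer) plus the purely-userlevel tail pointer. */
typedef struct {
	int events[EC_RING];	/* stand-in for the event data slots */
	unsigned int head;	/* advanced by the kernel as events arrive */
	unsigned int tail;	/* userlevel consume index */
} ec_t;

/* Non-blocking variant: returns 0 and fills *d if an event is ready,
 * -1 if the ring is empty (where ec_wait_event() would block instead). */
static int ec_poll_event(ec_t *ec, int *d)
{
	if (ec->tail == ec->head)
		return -1;
	*d = ec->events[ec->tail % EC_RING];
	ec->tail++;
	return 0;
}
```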

===================

- another optimization I am thinking about is optimizing the thread
wakeup and ring buffer use for cache line use. I.e., if we know
an event was queued on a specific CPU then the wakeup function
should take this into account. I.e., if any of the threads
waiting was/will be scheduled on the same CPU it should be
preferred.

With the current simple form of a ring buffer this isn't sufficient,
though. Reading all entries in the ring buffer until finding the
one written by the CPU in question is not helpful. We'd need a
mechanism to point the thread to the entry in question. One
possibility to do this is to return the ring buffer entry as the
return value of the kevent_wait() syscall. This works fine if the
thread only works for one event (which I guess will be 99.999% of
all uses). An extension could be to extend the ukevent structure to
contain an index of the next entry written by the same CPU.

Another problem this entails is false sharing of the ring buffer
entries. This would probably require padding the ukevent structure
to 64 bytes. It's not that much more (40 bytes so far), and it's
also more future-safe. The alternative is to have per-CPU
regions in the ring buffer. With hotplug CPUs this is just plain
silly.

I think this optimization has the potential to help quite a bit,
especially for large machines.
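The padding itself is trivial; a sketch (the 40-byte payload size is
taken from the mail above, the field names are made up):

```c
#include <assert.h>

#define CACHE_LINE 64

/* Pad the ~40-byte ukevent up to one full cache line so that two
 * consumers working on adjacent ring entries on different CPUs never
 * falsely share a line. */
struct ukevent_padded {
	unsigned char payload[40];		/* current ukevent size */
	unsigned char pad[CACHE_LINE - 40];	/* pad to one cache line */
};

_Static_assert(sizeof(struct ukevent_padded) == CACHE_LINE,
	       "each ring entry must occupy exactly one cache line");
```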

===================

- we absolutely need an interface to signal the kernel that a thread,
just woken from kevent_wait, cannot handle the events. I.e., the
events are in the ring buffer but all the other threads are in the
kernel in their kevent_wait calls. The new syscall would wake up
one or more threads to handle the events.

This syscall is for instance necessary if the thread calling
kevent_wait is canceled. It might also be needed when a thread
requested more than one event and realizes processing an entry
takes a long time and that another thread might work on the other
items in the meantime.


Al Viro pointed out another possible solution which also could solve
the "handled" flag problem and concurrency in use of the ring buffer.

The idea is to require the kevent_wait() syscall to signal which entry
in the ring buffer is handled or not handled. This means:

+ the kernel knows at any time which entries in the buffer are free
and which are not

+ concurrent filling of the ring buffer is no problem anymore since
entries are not discarded until told

+ by not waiting for event (num parameter == 0) the syscall can be
used to discard entries to free up the ring buffer before continuing
to work on more entries. And, as per the requirement above, it can
be used to tell the kernel that certain entries are *NOT* handled
and need to be sent to another thread. This would be useful in the
thread cancellation case.

This seems like a nice approach.

===================

- why no syscall to create kevent queue? With dynamic /dev this might
be a problem and it's really not much additional code. What about
programs which want to use these interfaces before /dev is set up?

===================

- still: the syscall should use a struct timespec* timeout parameter
and not nanosecs. There are at least three timeout modes which
are wanted:

+ relative, unconditionally wait that long

+ relative, aborted in case of large enough settimeofday() or NTP
adjustment

+ absolute timeout. Probably even with selecting which clock to use.
This mode requires a timespec value parameter


We have all this code already in the futex syscall. It just needs to
be generalized or copied and adjusted.

===================

- still: no signal mask parameter in the kevent_wait (and get_event)
syscall. Regardless of what one thinks about signals, they are used
and integrating the kevent interface into existing code requires
this functionality. And it's not only about receiving signals.
The signal mask parameter can also be used to _prevent_ signals from
being delivered in that time.

===================

- the KEVENT_REQ_WAKEUP_ONE functionality is good and needed. But I
would reverse the default. I cannot see many places where you want
all threads to be woken. Introduce KEVENT_REQ_WAKEUP_ALL instead.

===================

- there is really no reason to invent yet another timer implementation.
We have the POSIX timers which are feature rich and nicely
implemented. All that is needed is to implement SIGEV_KEVENT as a
notification mechanism. The timer is registered as part of the
timer_create() syscall.

===================


I haven't yet looked at the other event sources. I think the above is
enough for now.


--
➧ Ulrich Drepper ➧ Red Hat, Inc. ➧ 444 Castro St ➧ Mountain View, CA ❖

2006-11-13 11:05:08

by Evgeniy Polyakov

Subject: Re: [take24 0/6] kevent: Generic event handling mechanism.

On Sat, Nov 11, 2006 at 02:28:53PM -0800, Ulrich Drepper ([email protected]) wrote:
> Evgeniy Polyakov wrote:
> >Generic event handling mechanism.
> >[...]
>
> Sorry for the delay again. Kernel work is simply not my highest priority.
>
> I've collected my comments on some parts of the patch. I haven't gone
> through every part of the patch yet. Sorry for the length.

No problem.

> ===================
>
> - basic ring buffer problem: the kevent_copy_ring_buffer function stores
> the event in the ring buffer without regard for the current content.
>
> + if more entries are dequeued than the ring buffer can hold,
> events immediately get overwritten without anything being passed to
> userlevel
>
> + as with the old approach, the ring buffer is basically unusable with
> multiple threads/processes. A thread calling kevent_wait might
> cause entries another thread is still working on to be overwritten.
>
> Possible solution:
>
> a) it would be possible to have a "used" flag in each ring buffer entry.
> That's too expensive, I guess.
>
> b) kevent_wait needs another parameter which specifies which is the
> last (i.e., least recently added) entry in the ring buffer.
> Everything between this entry and the current head (in ->kidx) is
> occupied. If multiple threads arrive in kevent_wait the highest idx
> (with wrap around possibly lowest) is used.
>
> kevent_wait will not try to move more entries into the ring buffer
> if ->kidx and the highest index passed in to any kevent_wait call
> is equal (i.e., the ring buffer is full).
>
> There is one issue, though, and that is that a system call is needed
> to signal to the kernel that more entries in the ring buffer are
> processed and that they can be refilled. This goes against the
> kernel filling the ring buffer automatically (see below)

If a thread calls kevent_wait() it means it has processed the previous
entries; one can call kevent_wait() with the $num parameter set to zero,
which means that the thread does not want any new events, so nothing
will be copied.

> Threads should be able to (not necessarily forced to) use the
> interfaces like this:
>
> - by default all threads are "parked" in the kevent_wait syscall.
>
>
> - If an event occurs one thread might be woken (depending on the 'num'
> parameter)
>
> - the woken thread(s) work on all the events in the ring buffer and
> then call kevent_wait() again.
>
> This requires that the threads can independently call kevent_wait()
> and that they can independently retrieve events from the ring buffer
> without fear the entry gets overwritten before it is retrieved.
> Atomically retrieving entries from the ring buffer can be implemented
> at userlevel. Either the ring buffer is writable and a field in each
> ring buffer entry can be used as a 'handled' flag. Obviously this can
> be done with atomic compare-and-exchange. If the ring buffer is not
> writable then, as part of the userlevel wrapper around the event
> handling interfaces, another array is created which contains the use
> flags for each ring buffer entry. This is less elegant and probably
> slower.

A writable ring buffer does not sound good to me - what if one thread
overwrites the whole ring buffer so that the kernel's indexes get corrupted?

Processing the ring buffer in non-FIFO order is the wrong idea - the
ring buffer can potentially be very big, and searching it for an entry
which was marked as 'free' by userspace is not a solution at all.
Userspace in that case must provide a ukevent so that the fast tree
search could be used (and although that is already possible), it
requires userspace to make additional syscalls, which is not what we want.

So the kevent ring buffer is designed in the following way: all entries
can be processed _only_ in FIFO order, i.e. they can be read in any
order threads want, but when one thread calls kevent_wait(num), the
$num requested entries from the beginning can be overwritten - the
kernel does not know how many users read those $num events from the
beginning, and even if they had some flag saying 'do not touch me,
someone is reading me', how and when would those entries be reused?
The kernel does not store a bitmask or any other kind of object to mark
holes in the ring buffer as free - it works in FIFO order since that is
the fastest mode.

As a solution I can create the following scheme:
there are two syscalls (or one with a switch) which get events and
commit them.

kevent_wait() becomes a syscall which waits until a number of events or
one of them becomes ready, copies them into the ring buffer and
returns. kevent_wait() will fail with a special error code when the
ring buffer is full.

kevent_commit() frees the requested number of events _from the
beginning_, i.e. from a special index visible from userspace.
Userspace can create special usage counters for events (and even put
them into the read-only ring buffer, overwriting some fields of the
kevent, especially if we increase its size) and only call
kevent_commit() when all events have a zero usage counter.

I disagree that the possibility of having holes in the ring buffer is a
good idea at all - it requires a much more complex protocol which has
to fill and reuse those holes, and the main disadvantage is that it
requires transferring much more information from userspace to
kernelspace to free a ring entry in a hole. In that case it is already
possible just to call kevent_ctl(KEVENT_REMOVE) and not bother with a
new approach at all.

> ===================
>
> - implementing the kevent_wait syscall the proposed way means we are
> missing out on one possible optimization. The ring buffer is
> currently only filled on kevent_wait calls. I expect that in really
> high traffic situations requests are coming in at a higher rate than
> they can be processed. At least for periods of time. In such
> situations it would be nice to not have to call into the kernel at
> all. If the kernel would deliver into the ring buffer on its own
> this would be possible.

Well, it can be done on behalf of a workqueue or a dedicated thread
which will bring up the appropriate mm context, although it means that
userspace cannot handle the load it requested, which is a bad sign...

> If the argument against this is that kevent_get_event should be
> possible the answer is...
>
> ===================
>
> - the kevent_get_event syscall is not needed at all. All reporting
> should be done using a ring buffer. There really is no reason to
> keep two interfaces around which serve the same purpose. Making
> the argument the kevent_get_event is so much easier to use is not
> valid. The exposed interface to access the ring buffer will be easy,
> too. In the OLS paper I more or less hinted at the interfaces. I
> think they should be like this (names are irrelevant):

Well, kevent_get_events() _is_ much easier to use. And actually, having
only that interface, it is possible to implement a ring buffer with any
kind of protocol for controlling it - userspace can have a wrapper
which calls kevent_get_events() with a pointer to the place in the
shared ring buffer where new events should be placed; that wrapper can
handle essentially any kind of flags/parameters which are suitable for
that ring buffer implementation.
But since we started to implement the ring buffer as an additional
feature of kevent, let's find a way all people will be happy with
before removing something which was proven to work correctly.

> ec_t ec_create(unsigned flags);
> int ec_destroy(ec_t ec);
> int ec_poll_event(ec_t ec, event_data_t *d);
> int ec_wait_event(ec_t ec, event_data_t *d);
> int ec_timedwait_event(ec_t ec, event_data_t *d, struct timespec *to);
>
> The latter three interfaces are the interesting ones. We have to get
> the data out of the ring buffer as quickly as possible. So the
> interfaces require passing in a reference to an object which can hold
> the data. The 'poll' variant won't delay, the other two will.

The last three are exactly kevent_get_events() with different sets of
parameters - it is possible to get events without sleeping, it is
possible to wait until at least something is ready, and it is possible
to sleep for a timeout.

> We need separate create and destroy functions since there will always
> be a userlevel component of the data structures. The create variant
> can allocate the ring buffer and the other memory needed ('handled'
> flags, tail pointers, ...) and destroy free all resources.
>
> These interfaces are fast and easy to use. At least as easy as the
> kevent_get_event syscall. And all transparently implemented on top of
> the ring buffer. So, please let's drop the unneeded syscall.

These are all already implemented - everything above, and it was done
several months ago. No need to reinvent what is already there.
Even if we decide to remove kevent_get_events() in favour of a ring
buffer-only implementation, the waiting-for-event syscall will be
essentially kevent_get_events() without a pointer to the place where to
put events.
And I will not repeat that it has been possible (from the beginning,
for about 10 months already) to implement a ring buffer using
kevent_get_events().

I agree that having a special syscall to initialize kevent is a good
idea, and the initial kevent implementation had it, but it was removed
during the API cleanup work by Christoph Hellwig.
So I again see the same problem as several months ago: there are many
people who have opposite views on the API, and I as the author do not
know who is right...

Can we all agree that initialization syscall is a good idea?

> ===================
>
> - another optimization I am thinking about is optimizing the thread
> wakeup and ring buffer use for cache line use. I.e., if we know
> an event was queued on a specific CPU then the wakeup function
> should take this into account. I.e., if any of the threads
> waiting was/will be scheduled on the same CPU it should be
> preferred.

Do you have _any_ kind of benchmark with epoll() which would show that
it is feasible? A ukevent is one cache line (well, 2 cache lines on
old CPUs), which can be set up way too far away from the time when it
is ready, and the CPU which originally set it up can be busy, so we
would lose performance waiting until that CPU becomes free instead of
waking a thread on a different CPU.

So I'm asking: is there at least some data beyond theoretical thoughts?

> With the current simple form of a ring buffer this isn't sufficient,
> though. Reading all entries in the ring buffer until finding the
> one written by the CPU in question is not helpful. We'd need a
> mechanism to point the thread to the entry in question. One
> possibility to do this is to return the ring buffer entry as the
> return value of the kevent_wait() syscall. This works fine if the
> thread only works for one event (which I guess will be 99.999% of
> all uses). An extension could be to extend the ukevent structure to
> contain an index of the next entry written the same CPU.
>
> Another problem this entails is false sharing of the ring buffer
> entries. This would probably require to pad the ukevent structure
> to 64 bytes. It's not that much more, 40 bytes so far, it's
> also more future-safe. The alternative is to allocate per-CPU
> regions in the ring buffer. With hotplug CPUs this is just plain
> silly.
>
> I think this optimization has the potential to help quite a bit,
> especially for large machines.

I think again that complete removal of the ring buffer, implementing it
instead in a userspace wrapper over kevent_get_events(), is a good
idea. But probably I'm alone thinking in that direction, so let's
think about a ring buffer in kernelspace.

It is possible to specify a CPU id in the kevent (not in the ukevent,
i.e. not in the structure shared with userspace, but in its kernel
representation) and then check whether the currently active CPU is the
same or not - but what if it is not the same CPU? Entry order is
important, since an application can take advantage of synchronization,
so the idea of skipping some entries is bad.

> ===================
>
> - we absolutely need an interface to signal the kernel that a thread,
> just woken from kevent_wait, cannot handle the events. I.e., the
> events are in the ring buffer but all the other threads are in the
> kernel in their kevent_wait calls. The new syscall would wake up
> one or more threads to handle the events.
>
> This syscall is for instance necessary if the thread calling
> kevent_wait is canceled. It might also be needed when a thread
> requested more than one event and realizes processing an entry
> takes a long time and that another thread might work on the other
> items in the meantime.

Hmm, send a signal to another thread when glibc cancels the given one...
This problem points me to the idea of a userspace thread implementation
I have in mind, but that is another story.

It is a management task - the kernel should not even know that someone
has died and cannot process the events it requested.
Userspace can open a control pipe (and set up a kevent handler for it),
and glibc will write a byte there, thus waking some other thread.
It can be done in userspace and should be done in userspace.

If you insist I will create userspace kevent handling - userspace will
be able to request kevents and mark them as ready.

> Al Viro pointed out another possible solution which also could solve
> the "handled" flag problem and concurrency in use of the ring buffer.
>
> The idea is to require the kevent_wait() syscall to signal which entry
> in the ring buffer is handled or not handled. This means:
>
> + the kernel knows at any time which entries in the buffer are free
> and which are not
>
> + concurrent filling of the ring buffer is no problem anymore since
> entries are not discarded until told
>
> + by not waiting for event (num parameter == 0) the syscall can be
> used to discard entries to free up the ring buffer before continuing
> to work on more entries. And, as per the requirement above, it can
> be used to tell the kernel that certain entries are *NOT* handled
> and need to be sent to another thread. This would be useful in the
> thread cancellation case.
>
> This seems like a nice approach.

But unfortunately theory and practice differ in the real world.
The kernel can have millions of entries in a _linear_ ring buffer; how
do you think they should be handled without a complex protocol between
userspace and kernelspace? In that protocol userspace is required to
transfer some information to kernelspace so it can find the entry
(i.e. a per-entry field!), and then it needs a tree or another
mechanism to store free and used chunks of entries...

You probably did not see my network tree allocator patches I posted to
the lkml@, netdev@ and linux-mm@ lists - it is quite a big chunk of
code which handles exactly that, but you do not want to implement it in
glibc, I think...

So, do not overdesign.

And as a side note - _all_ of the above can be implemented in userspace.

> ===================
>
> - why no syscall to create kevent queue? With dynamic /dev this might
> be a problem and it's really not much additional code. What about
> programs which want to use these interfaces before /dev is set up?

It was there - Christoph Hellwig removed it in his API cleanup patch;
so far it was not needed at all (and is not needed for now).
An application can create the /dev file by itself if it wants... Just a
thought.

> ===================
>
> - still: the syscall should use a struct timespec* timeout parameter
> and not nanosecs. There are at least three timeout modes which
> are wanted:
>
> + relative, unconditionally wait that long
>
> + relative, aborted in case of large enough settimeofday() or NTP
> adjustment
>
> + absolute timeout. Probably even with selecting which clock to use.
> This mode requires a timespec value parameter
>
>
> We have all this code already in the futex syscall. It just needs to
> be generalized or copied and adjusted.

Will we discuss this to death?

Kevent does not need to have an absolute timeout.

The timeout specified there is always relative to the start of the
syscall, since it is a timeout which specifies the maximum time frame
the syscall can live.

All such timeouts _ARE_ relative and should be relative, since that is
correct.

> ===================
>
> - still: no signal mask parameter in the kevent_wait (and get_event)
> syscall. Regardless of what one thinks about signals, they are used
> and integrating the kevent interface into existing code requires
> this functionality. And it's not only about receiving signals.
> The signal mask parameter can also be used to _prevent_ signals from
> being delivered in that time.

I created kevent_signal notifications - they allow the user to set up
any set of signals of interest before calling kevent_get_events() and
friends.

There is no need to solve a problem at the operational level when there
are tactical and strategic ones - kevent signal notification is the
approach which avoids workarounds for interfaces that cannot handle
event types other than file descriptors.

> ===================
>
> - the KEVENT_REQ_WAKEUP_ONE functionality is good and needed. But I
> would reverse the default. I cannot see many places where you want
> all threads to be woken. Introduce KEVENT_REQ_WAKEUP_ALL instead.

I.e. always wake up only the first thread, and in addition those
threads which have the specified flag set? OK, I will put it into the
TODO for the next release.

> ===================
>
> - there is really no reason to invent yet another timer implementation.
> We have the POSIX timers which are feature rich and nicely
> implemented. All that is needed is to implement SIGEV_KEVENT as a
> notification mechanism. The timer is registered as part of the
> timer_create() syscalls.

Feel free to add any interface you like - it is as simple as call for
kevent_user_add_ukevent() in userspace.

> ===================
>
>
> I haven't yet looked at the other event sources. I think the above is
> enough for now.

It looks like you generate ideas (or move them to a different
implementation layer) faster than I can implement them :)
And I almost silently stand behind the fact that it is possible to
implement _all_ of the above ring buffer things in userspace with
kevent_get_events(), and this functionality has been there for almost a
year :)

Let's solve problems in the order of their appearance - what do you
think about the above interface for the ring buffer?

> --
> ➧ Ulrich Drepper ➧ Red Hat, Inc. ➧ 444 Castro St ➧ Mountain View, CA ❖

--
Evgeniy Polyakov

2006-11-13 11:21:06

by Evgeniy Polyakov

[permalink] [raw]
Subject: Re: [take24 0/6] kevent: Generic event handling mechanism.

On Mon, Nov 13, 2006 at 01:54:58PM +0300, Evgeniy Polyakov ([email protected]) wrote:
> > ===================
> >
> > - there is really no reason to invent yet another timer implementation.
> > We have the POSIX timers which are feature rich and nicely
> > implemented. All that is needed is to implement SIGEV_KEVENT as a
> > notification mechanism. The timer is registered as part of the
> > timer_create() syscalls.
>
> Feel free to add any interface you like - it is as simple as call for
> kevent_user_add_ukevent() in userspace.

... in kernelspace I mean.

--
Evgeniy Polyakov

2006-11-20 00:03:48

by Ulrich Drepper

[permalink] [raw]
Subject: Re: [take24 0/6] kevent: Generic event handling mechanism.

Evgeniy Polyakov wrote:
>> Possible solution:
>>
>> a) it would be possible to have a "used" flag in each ring buffer entry.
>> That's too expensive, I guess.
>>
>> b) kevent_wait needs another parameter which specifies which is the
>> last (i.e., least recently added) entry in the ring buffer.
>> Everything between this entry and the current head (in ->kidx) is
>> occupied. If multiple threads arrive in kevent_wait the highest idx
>> (with wrap around possibly lowest) is used.
>>
>> kevent_wait will not try to move more entries into the ring buffer
>> if ->kidx and the highest index passed in to any kevent_wait call
>> is equal (i.e., the ring buffer is full).
>>
>> There is one issue, though, and that is that a system call is needed
>> to signal to the kernel that more entries in the ring buffer are
>> processed and that they can be refilled. This goes against the
>> kernel filling the ring buffer automatically (see below)
>
> If thread calls kevent_wait() it means it has processed previous entries,
> one can call kevent_wait() with $num parameter as zero, which
> means that thread does not want any new events, so nothing will be
> copied.

This doesn't solve the problem. You could only request new events when
all previously reported events are processed. Plus: how do you report
events if you don't allow get_event to pass them on?


> Writable ring buffer does not sound too good to me - what if one thread
> will overwrite the whole ring buffer so kernel's indexes can be screwed?

Agreed, there are problems. This is why I suggested the ring buffer
can be structured. Parts of it might be read-only, other parts
read/write. I don't necessarily think the 'used' flag is the right
way. A front/tail pointer solution seems to be better.


> Ring buffer processed not in FIFO order is wrong idea

Not necessarily, see my comments about CPU affinity in the previous mail.


> - ring buffer can
> be potentially very big and searching there for the entry which was
> marked as 'free' by userspace is not a solution at all - userspace
> in that case must provide ukevent so fast tree search would be used,
> (and although it is already possible) it requires userspace to make
> additional syscalls which is not what we want.

It is not necessary. I've proposed to only have a front and a tail
pointer. The tail pointer is maintained by the application and passed
to the kernel explicitly or via shared memory. The kernel maintains
the front pointer. No tree needed.


> As a solution I can create the following scheme:
> there are two syscalls (or one with a switch) which get events and
> commits them.
>
> kevent_wait() becomes a syscall which waits until number of events or
> one of them becomes ready and just copies them into ring buffer and
> returns. kevent_wait() will fail with special error code when ring
> buffer is full.
>
> kevent_commit() frees requested number of events _from the beginning_,
> i.e. from special index, visible from userspace. Userspace can create
> special counters for events (and even put them into read-only ring
> buffer overwriting some fields of kevent, especially if we will increase
> it's size) and only call kevent_commit() when all events have zero usage
> counter.

Right, that's basically the front/tail pointer implementation. That
would work. You just have to make sure that the kevent_wait() call
takes the current front pointer/index as a parameter. This way if the
buffer gets filled between the thread checking the ring buffer (and
finding it empty) and the syscall being handled the thread is not suspended.


> I disagree that having possibility to have holes in the ring buffer is a
> good idea at all - it requires much more complex protocol, which will
> fill and reuse that holes, and the main disavantge - it requires to
> transfer much more information from userspace to kernelspace to free the
> ring entry in the hole - in that case it is already possible just to
> call kevent_ctl(KEVENT_REMOVE) and do not wash the brain with new
> approach at all.

Well, it would require more data transport if we used writable shared
memory. But I agree, it's far too complicated and might not scale with
growing ring buffer sizes.


>> - implementing the kevent_wait syscall the proposed way means we are
>> missing out on one possible optimization. The ring buffer is
>> currently only filled on kevent_wait calls. I expect that in really
>> high traffic situations requests are coming in at a higher rate than
>> they can be processed. At least for periods of time. In such
>> situations it would be nice to not have to call into the kernel at
>> all. If the kernel would deliver into the ring buffer on its own
>> this would be possible.
>
> Well, it can be done on behalf of workqueue or dedicated thread which
> will bring up appropriate mm context,

I think it should be done. It's potentially a huge advantage.


> although it means that userspace
> can not handle the load it requested, which is a bad sign...

I don't understand. What is not supposed to work? There is nothing
which cannot work with automatic posting, since the get_event() call
does nothing but copy the event data over and wake a thread.


>> - the kevent_get_event syscall is not needed at all. All reporting
>> should be done using a ring buffer. There really is not reason to
>> keep two interfaces around which serve the same purpose. Making
>> the argument the kevent_get_event is so much easier to use is not
>> valid. The exposed interface to access the ring buffer will be easy,
>> too. In the OLS paper I more or less hinted at the interfaces. I
>> think they should be like this (names are irrelevant):
>
> Well, kevent_get_events() _is_ much easier to use. And actually having
> only that interface it is possible to implement ring buffer with any
> kind of protocol for controlling it - userspace can have a wrapper
> which will call kevent_get_events() with pointer which shows to the
> place in the shared ring buffer where to place new events, that wrapper
> can handle essentially any kind of flags/parameters which are suitable
> for that ring buffer implementation.

That's far too slow. The whole point behind the ring buffer is speed.
And emulation would defeat the purpose.


> But since we started to implement ring buffer as a additional feature of
> kevent, let's find the way all people will be happy with before removing
> something which was proven to work correctly.

The get_event interface is basically the userlevel interface the runtime
(glibc probably) would provide. Programmers don't see the complexity.

I'm concerned about the get_event interface holding the kernel
implementation back. For instance, automatic filling the ring buffer.
This would not be possible if the program is free to mix
kevent_get_event and kevent_wait calls freely. If you do away with the
get_event syscall the automatic ring buffer filling is possible and a
logical extension.


>
> The last three are exactly kevent_get_events() with different set of
> parameters - it is possible to get events without sleeping, it is
> possible to wait until at least something is ready and it is possible to
> sleep for timeout.

Exactly. But these interfaces should be implemented at userlevel, not
at the syscall level. It's not necessary. The kernel interface should
be kept as small as possible and the get_event syscall is pure duplication.


> They are all already implemented - everything above, and it was done several
> months ago already. No need to reinvent what is already there.
> Even if we will decide to remove kevent_get_events() in favour of ring
> buffer-only implementation, the waiting-for-event syscall will be
> essentially kevent_get_events() without pointer to the place where to
> put events.

Right, but this limitation of the interface is important. It means the
interface of the kernel is smaller: fewer possibilities for problems and
fewer constraints if in future something should be changed (and smaller
kernel).


> I agree that having special syscall to initialize kevent is a good idea,
> and initial kevent implementation had it, but it was removed due to API
> cleanup work by Christoph Hellwig.

Well, he is wrong. If, for instance, init or any of the programs which
start first wants to use the syscall it couldn't because /dev isn't
mounted. The program might use libraries and therefore not have any
influence on whether the kevent stuff is used or not.

Yes, the /dev interface is useful for some/many other kernel interfaces.
But this is a core interface. For the same reason epoll_create is a
syscall.


> Do you have _any_ kind of benchmarks with epoll() which would show that
> it is feasible? ukevent is one cache line (well, 2 cache lines on old
> CPUs), which can be setup way too far away from the time when it is
> ready, and the CPU which originally set that up can be busy, so we will lose
> performance waiting until CPU becomes free instead of calling other
> thread on different CPU.

If the period between the generation of the event (e.g., incoming
network traffic or sent data) and the delivery of the event by waking a
thread is too long, it makes not too much sense. But if the L2 cache
hasn't been flushed it might be a big advantage.

I think it's reasonable to only have the last queued entry for a CPU
handled special. And note, this is only ever a hint. If an event entry
was created by the kernel on one CPU but none of the threads waiting
to be woken is on that CPU, nothing has to be done.

No, I don't have a benchmark. But it is likely quite easily possible to
create a synthetic benchmark. Maybe with pipes.


> It is possible to specify CPU id in kevent (not in ukevent, i.e. not
> in the structure shared with userspace, but in its kernel representation),
> and then check if currently active CPU is the same or not, but what if
> it is not the same CPU?

Nothing special. It's up to the userlevel wrapper code. The CPU number
would only be a hint.


> Entry order is important, since application can
> take advantage of synchronization, so idea to skip some entries is bad.

That's something the application should make a call about. It's not
always (or even mostly) the case that the ordering of the notification
is important. Furthermore, this would also require the kernel to
enforce an ordering. This is expensive on SMP machines. A locally
generated event (i.e., source and the thread reporting the event) can be
delivered faster than an event created on another CPU.


> It is a management task - the kernel should not even know that someone
> has died and cannot process the events it requested.

But the kernel has to be involved.


> Userspace can open a control pipe (and setup a kevent handler for it)
> and glibc will write there a byte thus awakening some other thread.
> It can be done in userspace and should be done in userspace.

That's invasive. The problem is that no userlevel interface should have
to implicitly keep file descriptors open. This would mean the
application would be influenced since suddenly a file descriptor is not
available anymore. Yes, applications shouldn't care but they
unfortunately sometimes do.


> Will we discuss this to death?
>
> Kevent does not need to have an absolute timeout.

Of course it does. Just because you don't see a need for it for your
applications right now it doesn't mean it's not a valid use.


> Because the timeout specified there is always relative to the start
> of the syscall, since it is a timeout which specifies the maximum
> time frame the syscall can live.

That's your current implementation. There is absolutely no reason
whatsoever why this couldn't be changed.

> I created kevent_signal notifications - they allow the user to set up
> any set of signals of interest before calling kevent_get_events() and
> friends.
>
> No need to solve a problem at the operational level when there are
> tactical and strategic ones

Of course there is a need and I explained it before. Getting signal
notifications is in no way the same as changing the signal mask
temporarily. You cannot correctly emulate the case where you want to
block a signal while in the call and reenable it afterwards. Receiving
the signal as an event and then artificially raising it is not the same.
Especially timing-wise, the signal kevent might not be seen until long after
the syscall returns because other entries are worked on first.

The opposite case is equally impossible to emulate: unblocking a signal
just for the duration of the syscall. These are all possible and used
cases.


>> - the KEVENT_REQ_WAKEUP_ONE functionality is good and needed. But I
>> would reverse the default. I cannot see many places where you want
>> all threads to be woken. Introduce KEVENT_REQ_WAKEUP_ALL instead.
>
> I.e. always wake up only the first thread, and in addition those
> threads which have the specified flag set? OK, I will put it into the
> TODO for the next release.

It's a flag for an event. So the threads won't have the flag set. If
an event is delivered with the flag set, wake all threads. Otherwise
just one.


>> - there is really no reason to invent yet another timer implementation.
>> We have the POSIX timers which are feature rich and nicely
>> implemented. All that is needed is to implement SIGEV_KEVENT as a
>> notification mechanism. The timer is registered as part of the
>> timer_create() syscalls.
>
> Feel free to add any interface you like - it is as simple as call for
> kevent_user_add_ukevent() in userspace.

No, that's not what I mean. There is no need for the special
timer-related part of your patch. Instead the existing POSIX timer
syscalls should be modified to handle SIGEV_KEVENT notification. Again,
keep the interface as small as possible. Plus, the POSIX timer
interface is very flexible. You don't want to duplicate all that
functionality.


> And I almost silently stand behind the fact that it is possible to
> implement _all_ above ring buffer things in userspace with
> kevent_get_events() and this functionality is there for almost a year :)

Again, this defeats the purpose completely. The ring buffer is the
faster interface, especially when coupled with asynchronous filling of
ring buffer (i.e., without a syscall).


> Let's solve problems in the order of their appearance - what do you
> think about the above interface for the ring buffer?

Looks better, yes.

--
➧ Ulrich Drepper ➧ Red Hat, Inc. ➧ 444 Castro St ➧ Mountain View, CA ❖

2006-11-20 08:26:55

by Evgeniy Polyakov

[permalink] [raw]
Subject: Re: [take24 0/6] kevent: Generic event handling mechanism.

On Sun, Nov 19, 2006 at 04:02:03PM -0800, Ulrich Drepper ([email protected]) wrote:
> Evgeniy Polyakov wrote:
> >>Possible solution:
> >>
> >>a) it would be possible to have a "used" flag in each ring buffer entry.
> >> That's too expensive, I guess.
> >>
> >b) kevent_wait needs another parameter which specifies which is the
> >> last (i.e., least recently added) entry in the ring buffer.
> >> Everything between this entry and the current head (in ->kidx) is
> >> occupied. If multiple threads arrive in kevent_wait the highest idx
> >> (with wrap around possibly lowest) is used.
> >>
> >> kevent_wait will not try to move more entries into the ring buffer
> > if ->kidx and the highest index passed in to any kevent_wait call
> >> is equal (i.e., the ring buffer is full).
> >>
> >> There is one issue, though, and that is that a system call is needed
> >> to signal to the kernel that more entries in the ring buffer are
> >> processed and that they can be refilled. This goes against the
> >> kernel filling the ring buffer automatically (see below)
> >
> >If thread calls kevent_wait() it means it has processed previous entries,
> >one can call kevent_wait() with $num parameter as zero, which
> >means that thread does not want any new events, so nothing will be
> >copied.
>
> This doesn't solve the problem. You could only request new events when
> all previously reported events are processed. Plus: how do you report
> events if the you don't allow get_event pass them on?

In this implementation userspace itself maintains ordering and the ability
to get events; the kernel just returns the events which were requested.

> >Writable ring buffer does not sound too good to me - what if one thread
> >will overwrite the whole ring buffer so kernel's indexes can be screwed?
>
> Agreed, there are problems. This is why I suggested the ring buffer can
> be a structured. Parts of it might be read-only, other parts
> read/write. I don't necessarily think the 'used' flag is the right way.
> And front/tail pointer solution seems to be better.
>
>
> >Ring buffer processed not in FIFO order is wrong idea
>
> Not necessarily, see my comments about CPU affinity in the previous mail.
>
>
> >- ring buffer can
> >be potentially very big and searching there for the entry, which was
> >been marked as 'free' by userspace is not a solution at all - userspace
> >in that case must provide ukevent so fast tree search would be used,
> >(and although it is already possible) it requires userspace to make
> >additional syscalls which is not what we want.
>
> It is not necessary. I've proposed to only have a front and tail
> pointer. The tail pointer is maintained by the application and passed
> to the kernel explicitly or via shared memory. The kernel maintains the
> front pointer. No tree needed.

There was such an implementation (in a previous patchset) - since no one
commented, I changed it.

> >As a solution I can create folowing scheme:
> >there are two syscalls (or one with a switch) which get events and
> >commits them.
> >
> >kevent_wait() becomes a syscall which waits until number of events or
> >one of them becomes ready and just copies them into ring buffer and
> >returns. kevent_wait() will fail with special error code when ring
> >buffer is full.
> >
> >kevent_commit() frees requested number of events _from the beginning_,
> >i.e. from special index, visible from userspace. Userspace can create
> >special counters for events (and even put them into read-only ring
> >buffer overwriting some fields of kevent, especially if we will increase
> >it's size) and only call kevent_commit() when all events have zero usage
> >counter.
>
> Right, that's basically the front/tail pointer implementation. That
> would work. You just have to make sure that the kevent_wait() call
> takes the current front pointer/index as a parameter. This way if the
> buffer gets filled between the thread checking the ring buffer (and
> finding it empty) and the syscall being handled the thread is not suspended.

It is exactly how the previous ring buffer (in a mapped area though) was
implemented.

I think I need to quickly set up my slightly used (bought on ebay) but
still working mind reader; I will try to tune it to work with your brain
waves, so next time I would not spend weeks changing something which
could be reused, while others keep silent :)

> >I disagree that having possibility to have holes in the ring buffer is a
> >good idea at all - it requires much more complex protocol, which will
> >fill and reuse that holes, and the main disavantge - it requires to
> >transfer much more information from userspace to kernelspace to free the
> >ring entry in the hole - in that case it is already possible just to
> >call kevent_ctl(KEVENT_REMOVE) and do not wash the brain with new
> >approach at all.
>
> Well, it would require more data transport if we'd use writable shared
> memory. But I agree, it's far too complicated and might not scale with
> growing ring buffer sizes.
>
>
> >>- implementing the kevent_wait syscall the proposed way means we are
> >> missing out on one possible optimization. The ring buffer is
> >> currently only filled on kevent_wait calls. I expect that in really
> >> high traffic situations requests are coming in at a higher rate than
> >> the can be processed. At least for periods of time. If such
> >> situations it would be nice to not have to call into the kernel at
> >> all. If the kernel would deliver into the ring buffer on its own
> >> this would be possible.
> >
> >Well, it can be done on behalf of workqueue or dedicated thread which
> >will bring up appropriate mm context,
>
> I think it should be done. It's potentially a huge advantage.
>
>
> >although it means that userspace
> >can not handle the load it requested, which is a bad sign...
>
> I don't understand. What is not supposed to work? There is nothing
> which cannot work with automatic posting since the get_event() call does
> nothing but copying the event data over and wake a thread.

If userspace is too slow to get events, the dedicated thread or workqueue
will be busy doing unneeded work, although it can help to smooth out peaks
in the load.

> >>- the kevent_get_event syscall is not needed at all. All reporting
> >> should be done using a ring buffer. There really is not reason to
> >> keep two interfaces around which serve the same purpose. Making
> >> the argument the kevent_get_event is so much easier to use is not
> >> valid. The exposed interface to access the ring buffer will be easy,
> >  too. In the OLS paper I more or less hinted at the interfaces. I
> >> think they should be like this (names are irrelevant):
> >
> >Well, kevent_get_events() _is_ much easier to use. And actually having
> >only that interface it is possible to implement ring buffer with any
> >kind or protocol for its controlling - userspace can have a wrapper
> >which will call kevent_get_events() with pointer which shows to the
> >place in the shared ring buffer where to place new events, that wrapper
> >can handle essentially any kind of flags/parameters which are suitable
> >for that ring buffer implementation.
>
> That's far too slow. The whole point behind the ring buffer is speed.
> And emulation would defeat the purpose.

It was an example; I do not say a ring buffer maintained in kernelspace is
a bad idea. Actually it is possible to create several threads which will
only read events into the buffer, to be processed by some pool of
'working' threads. There are a lot of possibilities to work with only
one syscall and create a scalable system.

> >But since we started to implement ring buffer as a additional feature of
> >kevent, let's find the way all people will be happy with before removing
> >something which was proven to work correctly.
>
> The get_event interface is basically the userlevel interface the runtime
> (glibc probably) would provide. Programmers don't see the complexity.
>
> I'm concerned about the get_event interface holding the kernel
> implementation back. For instance, automatic filling the ring buffer.
> This would not be possible if the program is free to mix
> kevent_get_event and kevent_wait calls freely. If you do away with the
> get_event syscall the automatic ring buffer filling is possible and a
> logical extension.

Yes, that is why only one should be used.
If there are several threads, then the ring buffer implementation should be
used, otherwise just kevent_get_events().
In theory, yes, an access library like glibc can provide kevent_get_events()
which reads events from the ring buffer, but there is no such call right
now, so the kernel's kevent_get_events() looks reasonable.

> >The last three are exactly kevent_get_events() with different set of
> >parameters - it is possible to get events without sleeping, it is
> >possible to wait until at least something is ready and it is possible to
> >sleep for timeout.
>
> Exactly. But these interfaces should be implemented at userlevel, not
> at the syscall level. It's not necessary. The kernel interface should
> be kept as small as possible and the get_event syscall is pure duplication.

I would say that the ring-buffer manipulating syscalls are the duplication,
but it is just a matter of view :)

> >They all already imeplemented. Just all above, and it was done several
> >months ago already. No need to reinvent what is already there.
> >Even if we will decide to remove kevent_get_events() in favour of ring
> >buffer-only implementation, winting-for-event syscall will be
> >essentially kevent_get_events() without pointer to the place where to
> >put events.
>
> Right, but this limitation of the interface is important. It means the
> interface of the kernel is smaller: fewer possibilities for problems and
> fewer constraints if in future something should be changed (and smaller
> kernel).

Ok, let's see the ring buffer implementation right now, and then we will
decide whether to remove or to stay with the kevent_get_events() syscall.

> >I agree that having special syscall to initialize kevent is a good idea,
> >and initial kevent implementation had it, but it was removed due to API
> cleanup work by Christoph Hellwig.
>
> Well, he is wrong. If, for instance, init or any of the programs which
> start first wants to use the syscall it couldn't because /dev isn't
> mounted. The program might use libraries and therefore not have any
> influence on whether the kevent stuff is used or not.
>
> Yes, the /dev interface is useful for some/many other kernel interfaces.
> But this is a core interface. For the same reason epoll_create is a
> syscall.

Ok, I will create an initialization syscall.

> >Do you have _any_ kind of benchmarks with epoll() which would show that
> >it is feasible? ukevent is one cache line (well, 2 cache lines on old
> >CPUs), which can be setup way too far away from the time when it is
> >ready, and CPU which origianlly set that up can be busy, so we will lose
> >performance waiting until CPU becomes free instead of calling other
> >thread on different CPU.
>
> If the period between the generation of the event (e.g., incoming
> network traffic or sent data) and the delivery of the event by waking a
> thread is too long, it makes not too much sense. But if the L2 cache
> hasn't been flushed it might be a big advantage.
>
> I think it's reasonable to only have the last queued entry for a CPU
> handled special. And note, this is only ever a hint. If an event entry
> was created by the kernel in one CPU but none of the threads which wait
> to be waken is on that CPU, nothing has to be done.
>
> No, I don't have a benchmark. But it is likely quite easily possible to
> create a synthetic benchmark. Maybe with pipes.
>
>
> >It is possible to specify CPU id in kevent (not in ukevent, i.e. not
> >in shared by userspace structure, but in it's kernel representation),
> >and then check if currently active CPU is the same or not, but what if
> >it is not the same CPU?
>
> Nothing special. It's up to the userlevel wrapper code. The CPU number
> would only be a hint.
>
>
> >Entry order is important, since application can
> >take advantage of synchronization, so idea to skip some entries is bad.
>
> That's something the application should make a call about. It's not
> always (or even mostly) the case that the ordering of the notification
> is important. Furthermore, this would also require the kernel to
> enforce an ordering. This is expensive on SMP machines. A locally
> generated event (i.e., source and the thread reporting the event) can be
> delivered faster than an event created on another CPU.

How come? If a signal was delivered earlier than the data arrived, userspace
should get the signal before the data - that is the rule. Ordering is
maintained not for event insertion, but for marking events ready - it is
atomic, so whichever event is marked ready first will be read first from
the ready queue.

> >It is management task - kernel should not even know about someone has
> >died and can not process events it requested.
>
> But the kernel has to be involved.
>
>
> >Userspace can open a control pipe (and setup a kevent handler for it)
> >and glibc will write there a byte thus awakening some other thread.
> >It can be done in userspace and should be done in userspace.
>
> That's invasive. The problem is that no userlevel interface should have
> to implicitly keep file descriptors open. This would mean the
> application would be influenced since suddenly a file descriptor is not
> available anymore. Yes, applications shouldn't care but they
> unfortunately sometimes do.

Then I propose userspace notifications - each new thread can register
'wake me up when userspace event 1 is ready' and 'event 1' will be
marked as ready by glibc when it removes the thread.

> >Will we discuss it for death?
> >
> >Kevent does not need to have absolute timeout.
>
> Of course it does. Just because you don't see a need for it for your
> applications right now it doesn't mean it's not a valid use.

Please explain why glibc AIO uses relative timeouts then :)

> >Because timeout specified there is always related to the start of
> >syscall, since it is a timeout which specifies maximum time frame
> >syscall can live.
>
> That's your current implementation. There is absolutely no reason
> whatsoever why this couldn't be changed.

It has nothing to do with the implementation - it is logic. Something starts
and has its maximum lifetime; it is not that something starts and should be
stopped on Jan 1, 2008. In the latter case one can set up a timer, but that
does not allow specifying a maximum lifetime. If the glibc POSIX sleeping
functions convert relative AIO timeouts into absolute ones, it does not mean
everyone should do it. It is just not needed.

> >I created kevent_signal notifications - it allows user to setup any set
> >of interested signals before call to kevent_get_events() and friends.
> >
> >No need to solve a problem with operation way when there is tactical and
> >strategical ones
>
> Of course there is a need and I explained it before. Getting signal
> notifications is in no way the same as changing the signal mask
> temporarily. You cannot correctly emulate the case where you want to
> block a signal while in the call as reenable it afterwards. Receiving
> the signal as an event and then artificially raising it is not the same.
> Especially timing-wise, the signal kevent might not be seen long after
> the syscall returns because other entries are worked on first.
>
> The opposite case is equally impossible to emulate: unblocking a signal
> just for the duration of the syscall. These are all possible and used
> cases.

Add and remove the appropriate kevent - it is as simple as a call to one
function.

> >>- the KEVENT_REQ_WAKEUP_ONE functionality is good and needed. But I
> >> would reverse the default. I cannot see many places where you want
> >> all threads to be woken. Introduce KEVENT_REQ_WAKEUP_ALL instead.
> >
> >I.e. to wake up only the first thread always, and in addition those threads
> >which have the specified flag set? Ok, will put into the todo for the next
> >release.
>
> It's a flag for an event. So the threads won't have the flag set. If
> an event is delivered with the flag set, wake all threads. Otherwise
> just one.

Ok.

> >>- there is really no reason to invent yet another timer implementation.
> >> We have the POSIX timers which are feature rich and nicely
> >> implemented. All that is needed is to implement SIGEV_KEVENT as a
> >> notification mechanism. The timer is registered as part of the
> >> timer_create() syscalls.
> >
> >Feel free to add any interface you like - it is as simple as call for
> >kevent_user_add_ukevent() in userspace.
>
> No, that's not what I mean. There is no need for the special
> timer-related part of your patch. Instead the existing POSIX timer
> syscalls should be modified to handle SIGEV_KEVENT notification. Again,
> keep the interface as small as possible. Plus, the POSIX timer
> interface is very flexible. You don't want to duplicate all that
> functionality.

The interface is already there with kevent_ctl(KEVENT_ADD); I just created an
additional entry which describes the timer enqueue/dequeue callbacks - I
have not invented new interfaces, just reused the existing generic kevent
facilities. It is possible to add timer events from any other place.

> >And I almost silently stay behind with the fact that it is possbile to
> >implement _all_ above ring buffer things in userspace with
> >kevent_get_events() and this functionality is there for almost a year :)
>
> Again, this defeats the purpose completely. The ring buffer is the
> faster interface, especially when coupled with asynchronous filling of
> ring buffer (i.e., without a syscal).

It is still possible to have a very scalable system with it, for example
with one thread dedicated to syscall reading (with a big number of events
transferred in one shot, the syscall overhead becomes negligible) and a
pool of working threads. It is not about 'let's remove kernelspace ring
buffer management', but about the possibilities and flexibility of the
existing model.

> >Let's solve problem in order of theirs appearance - what do you think
> >about above interface for ring buffer?
>
> Looks better, yes.

Ok, I will implement this new (old) ring buffer and present it in the
next release. I will also schedule userspace notifications, the
'wake-up-one-thread' flag changes and other small updates there.

> --
> ➧ Ulrich Drepper ➧ Red Hat, Inc. ➧ 444 Castro St ➧ Mountain View,
> CA ❖

--
Evgeniy Polyakov

2006-11-20 08:44:29

by Andrew Morton

[permalink] [raw]
Subject: Re: [take24 0/6] kevent: Generic event handling mechanism.

On Mon, 20 Nov 2006 11:25:01 +0300
Evgeniy Polyakov <[email protected]> wrote:

> On Sun, Nov 19, 2006 at 04:02:03PM -0800, Ulrich Drepper ([email protected]) wrote:
> > Evgeniy Polyakov wrote:
> > >>Possible solution:
> > >>
> > >>a) it would be possible to have a "used" flag in each ring buffer entry.
> > >> That's too expensive, I guess.
> > >>
> > >>b) kevent_wait needs another parameter which specifies the which is the
> > >> last (i.e., least recently added) entry in the ring buffer.
> > >> Everything between this entry and the current head (in ->kidx) is
> > >> occupied. If multiple threads arrive in kevent_wait the highest idx
> > >> (with wrap around possibly lowest) is used.
> > >>
> > >> kevent_wait will not try to move more entries into the ring buffer
> > >> if ->kidx and the higest index passed in to any kevent_wait call
> > >> is equal (i.e., the ring buffer is full).
> > >>
> > >> There is one issue, though, and that is that a system call is needed
> > >> to signal to the kernel that more entries in the ring buffer are
> > >> processed and that they can be refilled. This goes against the
> > >> kernel filling the ring buffer automatically (see below)
> > >
> > >If thread calls kevent_wait() it means it has processed previous entries,
> > >one can call kevent_wait() with $num parameter as zero, which
> > >means that thread does not want any new events, so nothing will be
> > >copied.
> >
> > This doesn't solve the problem. You could only request new events when
> > all previously reported events are processed. Plus: how do you report
> > events if the you don't allow get_event pass them on?
>
> Userspace should itself maintain order and possibility to get event in
> this implementation, kernel just returns events which were requested.

That would mean that in a multithreaded application (or multi-processes
sharing the same MAP_SHARED ringbuffer), all threads/processes will be
slowed down to wait for the slowest one.

> > >They all already imeplemented. Just all above, and it was done several
> > >months ago already. No need to reinvent what is already there.
> > >Even if we will decide to remove kevent_get_events() in favour of ring
> > >buffer-only implementation, winting-for-event syscall will be
> > >essentially kevent_get_events() without pointer to the place where to
> > >put events.
> >
> > Right, but this limitation of the interface is important. It means the
> > interface of the kernel is smaller: fewer possibilities for problems and
> > fewer constraints if in future something should be changed (and smaller
> > kernel).
>
> Ok, lets see for ring buffer implementation right now, and then we will
> decide if we want to remove or to stay with kevent_get_events() syscall.

I agree that kevent_get_events() is duplicative and we shouldn't need it.
Better to concentrate all our development effort on the single and most
flexible means of delivery.

2006-11-20 08:53:29

by Evgeniy Polyakov

[permalink] [raw]
Subject: Re: [take24 0/6] kevent: Generic event handling mechanism.

On Mon, Nov 20, 2006 at 12:43:01AM -0800, Andrew Morton ([email protected]) wrote:
> > > >If thread calls kevent_wait() it means it has processed previous entries,
> > > >one can call kevent_wait() with $num parameter as zero, which
> > > >means that thread does not want any new events, so nothing will be
> > > >copied.
> > >
> > > This doesn't solve the problem. You could only request new events when
> > > all previously reported events are processed. Plus: how do you report
> > > events if the you don't allow get_event pass them on?
> >
> > Userspace should itself maintain order and possibility to get event in
> > this implementation, kernel just returns events which were requested.
>
> That would mean that in a multithreaded application (or multi-processes
> sharing the same MAP_SHARED ringbuffer), all threads/processes will be
> slowed down to wait for the slowest one.

Not at all - all other threads can call kevent_get_events() with their
own place in the ring buffer, so while one of them is processing an
entry, the others can fill the next entries.

> > > >They all already imeplemented. Just all above, and it was done several
> > > >months ago already. No need to reinvent what is already there.
> > > >Even if we will decide to remove kevent_get_events() in favour of ring
> > > >buffer-only implementation, winting-for-event syscall will be
> > > >essentially kevent_get_events() without pointer to the place where to
> > > >put events.
> > >
> > > Right, but this limitation of the interface is important. It means the
> > > interface of the kernel is smaller: fewer possibilities for problems and
> > > fewer constraints if in future something should be changed (and smaller
> > > kernel).
> >
> > Ok, lets see for ring buffer implementation right now, and then we will
> > decide if we want to remove or to stay with kevent_get_events() syscall.
>
> I agree that kevent_get_events() is duplicative and we shouldn't need it.
> Better to concentrate all our development effort on the single and most
> flexible means of delivery.

Let's wait for the ring buffer implementation first :)

--
Evgeniy Polyakov

2006-11-20 09:17:53

by Andrew Morton

[permalink] [raw]
Subject: Re: [take24 0/6] kevent: Generic event handling mechanism.

On Mon, 20 Nov 2006 11:51:59 +0300
Evgeniy Polyakov <[email protected]> wrote:

> On Mon, Nov 20, 2006 at 12:43:01AM -0800, Andrew Morton ([email protected]) wrote:
> > > > >If thread calls kevent_wait() it means it has processed previous entries,
> > > > >one can call kevent_wait() with $num parameter as zero, which
> > > > >means that thread does not want any new events, so nothing will be
> > > > >copied.
> > > >
> > > > This doesn't solve the problem. You could only request new events when
> > > > all previously reported events are processed. Plus: how do you report
> > > > events if the you don't allow get_event pass them on?
> > >
> > > Userspace should itself maintain order and possibility to get event in
> > > this implementation, kernel just returns events which were requested.
> >
> > That would mean that in a multithreaded application (or multi-processes
> > sharing the same MAP_SHARED ringbuffer), all threads/processes will be
> > slowed down to wait for the slowest one.
>
> Not at all - all other threads can call kevent_get_events() with theirs
> own place in the ring buffer, so while one of them is processing an
> entry, others can fill next entries.

eh? That's not a ringbuffer, and it sounds awfully complex.

I don't know if this (new?) proposal resolves the
events-gets-lost-due-to-thread-cancellation problem? Would need to see
considerably more detail.

2006-11-20 09:21:48

by Evgeniy Polyakov

[permalink] [raw]
Subject: Re: [take24 0/6] kevent: Generic event handling mechanism.

On Mon, Nov 20, 2006 at 01:15:16AM -0800, Andrew Morton ([email protected]) wrote:
> On Mon, 20 Nov 2006 11:51:59 +0300
> Evgeniy Polyakov <[email protected]> wrote:
>
> > On Mon, Nov 20, 2006 at 12:43:01AM -0800, Andrew Morton ([email protected]) wrote:
> > > > > >If thread calls kevent_wait() it means it has processed previous entries,
> > > > > >one can call kevent_wait() with $num parameter as zero, which
> > > > > >means that thread does not want any new events, so nothing will be
> > > > > >copied.
> > > > >
> > > > > This doesn't solve the problem. You could only request new events when
> > > > > all previously reported events are processed. Plus: how do you report
> > > > > events if the you don't allow get_event pass them on?
> > > >
> > > > Userspace should itself maintain order and possibility to get event in
> > > > this implementation, kernel just returns events which were requested.
> > >
> > > That would mean that in a multithreaded application (or multi-processes
> > > sharing the same MAP_SHARED ringbuffer), all threads/processes will be
> > > slowed down to wait for the slowest one.
> >
> > Not at all - all other threads can call kevent_get_events() with theirs
> > own place in the ring buffer, so while one of them is processing an
> > entry, others can fill next entries.
>
> eh? That's not a ringbuffer, and it sounds awfully complex.
>
> I don't know if this (new?) proposal resolves the
> events-gets-lost-due-to-thread-cancellation problem? Would need to see
> considerably more detail.

It does - the event is copied into the shared buffer, but the place (or index
in the ring buffer) is selected by userspace (a wrapper, glibc, anything).
It is simple and (from my point of view) elegant, but it will not be used -
I surrender and will implement kernelspace ring buffer management right now; I
just said that it is possible to implement any kind of ring buffer in
userspace with only the old kevent_get_events() syscall.

--
Evgeniy Polyakov

2006-11-20 20:31:18

by Ulrich Drepper

[permalink] [raw]
Subject: Re: [take24 0/6] kevent: Generic event handling mechanism.

Evgeniy Polyakov wrote:
> It is exactly how previous ring buffer (in mapped area though) was
> implemented.

Not any of those I saw. The one I looked at always started again at
index 0 to fill the ring buffer. I'll wait for the next implementation.


>> That's something the application should be make a call about. It's not
>> always (or even mostly) the case that the ordering of the notification
>> is important. Furthermore, this would also require the kernel to
>> enforce an ordering. This is expensive on SMP machines. A locally
>> generated event (i.e., source and the thread reporting the event) can be
>> delivered faster than an event created on another CPU.
>
> How come? If signal was delivered earlier than data arrived, userspace
> should get signal before data - that is the rule. Ordering is maintained
> not for event insertion, but for marking them ready - it is atomic, so
> who first starts to mark even ready, that event will be read first from
> the ready queue.

This is as far as the kernel is concerned. Queue them in the order they
arrive.

I'm talking about the userlevel side. *If* (and it needs to be verified
that this has an advantage) a CPU creates an event, e.g., a read
event, then a number of threads could be notified about the event.
When the kernel has to wake up a thread it'll look whether any thread is
scheduled on the same CPU which generated the event. Then the thread,
upon waking up, can be told about the entry in the ring buffer which is
best accessed first (due to caching). This entry need not be the
first available in the ring buffer, but that's a problem the userlevel
code has to worry about.


> Then I propose userspace notifications - each new thread can register
> 'wake me up when userspace event 1 is ready' and 'event 1' will be
> marked as ready by glibc when it removes the thread.

You don't want to have a channel like this. The userlevel code doesn't
know which threads are waiting in the kernel on the event queue. And it
seems to be much more complicated than simply having a kevent call which
tells the kernel "wake up N or 1 more threads since I cannot handle it".
Basically a futex_wake()-like call.


>> Of course it does. Just because you don't see a need for it for your
>> applications right now it doesn't mean it's not a valid use.
>
> Please explain why glibc AIO uses relatinve timeouts then :)

You are still completely focused on AIO. We are talking here about new
generic event handling. It is not tied to AIO. We will add all
kinds of events, e.g., hopefully futex support and many others. And
even for AIO it's relevant.

As I said, relative timeouts are unable to cope with settimeofday calls
or ntp adjustments. AIO is certainly usable in situations where
timeouts are related to wall clock time.


> It has nothing with implementation - it is logic. Something starts and
> it has its maximum lifetime, but not something starts and should be
> stopped Jan 1, 2008.

It is an implementation detail. Look at the PI futex support. It has
timeouts which can be cut short (or increased) due to wall clock changes.


>> The opposite case is equally impossible to emulate: unblocking a signal
>> just for the duration of the syscall. These are all possible and used
>> cases.
>
> Add and remove appropriate kevent - it is as simple as call for one
> function.

No, it's not. The kevent stuff handles only the kevent handler (i.e.,
the replacement for calling the signal handler). It cannot set signal
masks. I am talking about signal masks here. And don't suggest "I can
add another kevent feature where I can register signal masks". This
would be ridiculous since it's not an event source. Just add the
parameter and every base is covered and, at least equally important, we
have symmetry between the event handling interfaces.


>> No, that's not what I mean. There is no need for the special
>> timer-related part of your patch. Instead the existing POSIX timer
>> syscalls should be modified to handle SIGEV_KEVENT notification. Again,
>> keep the interface as small as possible. Plus, the POSIX timer
>> interface is very flexible. You don't want to duplicate all that
>> functionality.
>
> Interface is already there with kevent_ctl(KEVENT_ADD), I just created
> additional entry, which describes timers enqueue/dequeue callbacks

New multiplexer cases are additional syscalls. This is unnecessary
code, an increased kernel interface and such. We have the POSIX timer
interfaces, which are feature-rich and standardized *and* can be trivially
extended (at least from the userlevel interface POV) to use event
queues. If you don't want to do this, fine, I'll try to get it made.
But drop the timer part of your patches.

--
➧ Ulrich Drepper ➧ Red Hat, Inc. ➧ 444 Castro St ➧ Mountain View, CA ❖

2006-11-20 21:46:27

by Jeff Garzik

[permalink] [raw]
Subject: Re: [take24 0/6] kevent: Generic event handling mechanism.

Ulrich Drepper wrote:
> Evgeniy Polyakov wrote:
>> It is exactly how previous ring buffer (in mapped area though) was
>> implemented.
>
> Not any of those I saw. The one I looked at always started again at
> index 0 to fill the ring buffer. I'll wait for the next implementation.

I like the two-pointer ring buffer approach, one pointer for the
consumer and one for the producer.


> You don't want to have a channel like this. The userlevel code doesn't
> know which threads are waiting in the kernel on the event queue. And it

Agreed.


> You are still completely focused on AIO. We are talking here about a
> new generic event handling. It is not tied to AIO. We will add all

Agreed.


> As I said, relative timeouts are unable to cope with settimeofday calls
> or ntp adjustments. AIO is certainly usable in situations where
> timeouts are related to wall clock time.

I think we have lived with relative timeouts for so long, it would be
unusual to change now. select(2), poll(2), epoll_wait(2) all take
relative timeouts.

Jeff


2006-11-20 21:55:27

by Ulrich Drepper

[permalink] [raw]
Subject: Re: [take24 0/6] kevent: Generic event handling mechanism.

Jeff Garzik wrote:
> I think we have lived with relative timeouts for so long, it would be
> unusual to change now. select(2), poll(2), epoll_wait(2) all take
> relative timeouts.

I'm not talking about always using absolute timeouts.

I'm saying the timeout parameter should be a struct timespec* and then
the flags word could have a flag meaning "this is an absolute timeout".
I.e., enable both uses, and even make relative timeouts the default.
This is what the modern POSIX interfaces do, too, see clock_nanosleep.

--
➧ Ulrich Drepper ➧ Red Hat, Inc. ➧ 444 Castro St ➧ Mountain View, CA ❖

2006-11-21 09:10:30

by Ingo Oeser

[permalink] [raw]
Subject: Re: [take24 0/6] kevent: Generic event handling mechanism.

Hi,

Ulrich Drepper schrieb:
> Jeff Garzik wrote:
> > I think we have lived with relative timeouts for so long, it would be
> > unusual to change now. select(2), poll(2), epoll_wait(2) all take
> > relative timeouts.
>
> I'm not talking about always using absolute timeouts.
>
> I'm saying the timeout parameter should be a struct timespec* and then
> the flags word could have a flag meaning "this is an absolute timeout".
> I.e., enable both uses, and even make relative timeouts the default.
> This is what the modern POSIX interfaces do, too, see clock_nanosleep.

I agree here. And while you are at it: Have it say "not before" vs. "not after".

<rant>
And if you call "absolute timeout" an "alarm" or "deadline" everyone will agree,
that this is useful.

Timeout means "I ran OUT of TIME to do it" and this is by definition relative
to a starting point. A "deadline" is an absolute point in (wall) time where
something has to be ready, and an "alarm" is an absolute point in (wall) time
where something is triggered (e.g. a bell rings on your "ALARM clock").

I don't know which person established that nonsense nomenclature about relative
and absolute timeouts.
</rant>

Regards

Ingo Oeser

2006-11-21 10:08:07

by Evgeniy Polyakov

[permalink] [raw]
Subject: Re: [take24 0/6] kevent: Generic event handling mechanism.

On Mon, Nov 20, 2006 at 12:29:31PM -0800, Ulrich Drepper ([email protected]) wrote:
> Evgeniy Polyakov wrote:
> >It is exactly how previous ring buffer (in mapped area though) was
> >implemented.
>
> Not any of those I saw. The one I looked at always started again at
> index 0 to fill the ring buffer. I'll wait for the next implementation.

That is what I'm talking about - there are at least 4 (!) different ring
buffer implementations, and most of them were not even looked at.
But the new version is ready; I will complete the testing stage and will
release 'take25' later today.

For those who like 'real-world benchmarks and so on', I created a patch
for the latest stable lighttpd version and tested it with kevent.

> >>That's something the application should make a call about. It's not
> >>always (or even mostly) the case that the ordering of the notification
> >>is important. Furthermore, this would also require the kernel to
> >>enforce an ordering. This is expensive on SMP machines. A locally
> >>generated event (i.e., source and the thread reporting the event) can be
> >>delivered faster than an event created on another CPU.
> >
> >How come? If a signal was delivered earlier than data arrived, userspace
> >should get the signal before the data - that is the rule. Ordering is
> >maintained not for event insertion, but for marking events ready - that is
> >atomic, so whichever event is first marked ready will be read first from
> >the ready queue.
>
> This is as far as the kernel is concerned. Queue them in the order they
> arrive.
>
> I'm talking about the userlevel side. *If* (and it needs to be verified
> that this has an advantage) a CPU creates an event for, e.g., a read
> event and then a number of threads could be notified about the event.
> When the kernel has to wake up a thread it'll look whether any thread is
> scheduled on the same CPU which generated the event. Then the thread,
> upon waking up, can be told about the entry in the ring buffer which can
> be accessed first (due to caching). This entry need not be the
> first available in the ring buffer but that's a problem the userlevel
> code has to worry about.

Ok, I've understood.

> >Then I propose userspace notifications - each new thread can register
> >'wake me up when userspace event 1 is ready' and 'event 1' will be
> >marked as ready by glibc when it removes the thread.
>
> You don't want to have a channel like this. The userlevel code doesn't
> know which threads are waiting in the kernel on the event queue. And it
> seems to be much more complicated than simply having a kevent call which
> tells the kernel "wake up N or 1 more threads since I cannot handle it".
> Basically a futex_wake()-like call.

The kernel does not know about any threads which wait for events; it only
has a queue of events. It can only wake those which were parked in
kevent_get_events() or kevent_wait(), and the syscall will return only when
the condition it waits on is true, i.e. when there is a new event in the ready
queue and/or the ring buffer has empty slots; the kernel will wake them up
in any case if those conditions are true.

How should it know which syscall should be interrupted when the special syscall
is called?

> >>Of course it does. Just because you don't see a need for it for your
> >>applications right now it doesn't mean it's not a valid use.
> >
> >Please explain why glibc AIO uses relative timeouts then :)
>
> You are still completely focused on AIO. We are talking here about a
> new generic event handling. It is not tied to AIO. We will add all
> kinds of events, e.g., hopefully futex support and many others. And
> even for AIO it's relevant.
>
> As I said, relative timeouts are unable to cope with settimeofday calls
> or ntp adjustments. AIO is certainly usable in situations where
> timeouts are related to wall clock time.

No AIO, but a syscall.
Only the syscall time matters.
A syscall starts, and it should be stopped at some point. When should it
be stopped? It should be stopped some time after it was started!

I still do not understand how you will use absolute timeout values
there. Please explain.

> >It has nothing to do with the implementation - it is logic. Something starts
> >and has its maximum lifetime; it is not that something starts and should be
> >stopped on Jan 1, 2008.
>
> It is an implementation detail. Look at the PI futex support. It has
> timeouts which can be cut short (or increased) due to wall clock changes.

futex_wait() uses relative timeouts:
static int futex_wait(u32 __user *uaddr, u32 val, unsigned long time)

The kernel uses relative timeouts.

Only special syscalls which work with absolute time have absolute
timeouts (like settimeofday).

> >>The opposite case is equally impossible to emulate: unblocking a signal
> >>just for the duration of the syscall. These are all possible and used
> >>cases.
> >
> >Add and remove appropriate kevent - it is as simple as call for one
> >function.
>
> No, it's not. The kevent stuff handles only the kevent handler (i.e.,
> the replacement for calling the signal handler). It cannot set signal
> masks. I am talking about signal masks here. And don't suggest "I can
> add another kevent feature where I can register signal masks". This
> would be ridiculous since it's not an event source. Just add the
> parameter and every base is covered and, at least equally important, we
> have symmetry between the event handling interfaces.

We do not have such symmetry.
Other event handling interfaces cannot work with events which do not
have a file descriptor behind them. Kevent can, and it works.
Signals are just usual events.

You request to get events - and you get them.
You request not to get events during a syscall - you remove the events.

Btw, please point me to the discussion about the real-life usefulness of
that parameter for epoll. I read the thread where sys_pepoll() was
introduced, but except for some theoretical handwaving about possible
usefulness there are no real signs of that requirement.

What is the ground research or extended explanation about
blocking/unblocking some signals during syscall execution?

> >>No, that's not what I mean. There is no need for the special
> >>timer-related part of your patch. Instead the existing POSIX timer
> >>syscalls should be modified to handle SIGEV_KEVENT notification. Again,
> >>keep the interface as small as possible. Plus, the POSIX timer
> >>interface is very flexible. You don't want to duplicate all that
> >>functionality.
> >
> >Interface is already there with kevent_ctl(KEVENT_ADD), I just created
> >additional entry, which describes timers enqueue/dequeue callbacks
>
> New multiplexer cases are additional syscalls. This is unnecessary
> code, an increased kernel interface, and such. We have the POSIX timer
> interfaces which are feature-rich and standardized *and* can be trivially
> extended (at least from the userlevel interface POV) to use event
> queues. If you don't want to do this, fine, I'll try to get it made.
> But drop the timer part of your patches.

There are _no_ additional syscalls.
I just introduced a new case for the event type.
You _need_ it to be done, since any kernel kevent user must have
enqueue/dequeue/callback callbacks. It is just an implementation of those
callbacks.
I did the work; one can create any interface (additional syscalls or
anything else) on top of that.

Due to the fact that kevent was designed as a generic event handling
mechanism, it is possible to work with all types of events using the same
interface, which was created 10 months ago: kevent add, remove and so
on... There is nothing special for timers there - it is a separate file
which does _not_ have any interfaces accessible outside the kevent core
(i.e. syscalls or exported symbols).

Btw, how should the POSIX API be extended to allow queuing events? A queue
is required (it is created when the user calls kevent_init() or
previously opens /dev/kevent); how should it be accessed, since it is
just a file descriptor in the process task_struct?


--
Evgeniy Polyakov

2006-11-21 17:02:57

by Ulrich Drepper

[permalink] [raw]
Subject: Re: [take24 0/6] kevent: Generic event handling mechanism.

Evgeniy Polyakov wrote:
>> You don't want to have a channel like this. The userlevel code doesn't
>> know which threads are waiting in the kernel on the event queue. And it
>> seems to be much more complicated than simply having a kevent call which
>> tells the kernel "wake up N or 1 more threads since I cannot handle it".
>> Basically a futex_wake()-like call.
>
> The kernel does not know about any threads which wait for events; it only
> has a queue of events. It can only wake those which were parked in
> kevent_get_events() or kevent_wait(), and the syscall will return only when
> the condition it waits on is true, i.e. when there is a new event in the ready
> queue and/or the ring buffer has empty slots; the kernel will wake them up
> in any case if those conditions are true.
>
> How should it know which syscall should be interrupted when the special syscall
> is called?

It's not about interrupting any threads.

The issue is that the wakeup of a thread from the kevent_wait call
constitutes an "event notification". If, as it should be, only one
thread is woken, then this information mustn't get lost. If the woken
thread cannot work on the events it got notified for, then it must tell
the kernel about it so that, *if* there are other threads waiting in
kevent_wait, one of those other threads can be woken.

What is needed is a simple "wake another thread waiting on this event
queue" syscall. Yes, in theory we could open an additional pipe with
each event queue and use it for waking threads, but this is influencing
the ABI through the use of a file descriptor. It's much better to have
an explicit way to do this.


> No AIO, but a syscall.
> Only the syscall time matters.
> A syscall starts, and it should be stopped at some point. When should it
> be stopped? It should be stopped some time after it was started!
>
> I still do not understand how you will use absolute timeout values
> there. Please explain.

What is there to explain? If you are waiting for events which must
coincide with real-world events, you'll naturally want to formulate
something like "wait for X until 10:15h". You cannot formulate this
correctly with relative timeouts since the realtime clock might be adjusted.


> futex_wait() uses relative timeouts:
> static int futex_wait(u32 __user *uaddr, u32 val, unsigned long time)
>
> The kernel uses relative timeouts.

Look again. This time at the implementation. For FUTEX_LOCK_PI the
timeout is an absolute timeout.

> We do not have such symmetry.
> Other event handling interfaces cannot work with events which do not
> have a file descriptor behind them. Kevent can, and it works.
> Signals are just usual events.
>
> You request to get events - and you get them.
> You request not to get events during a syscall - you remove the events.

None of this matches what I'm talking about. If you want to block a
signal for the duration of the kevent_wait call, that is nothing you can
do by registering an event.

Registering events has nothing to do with signal masks. They are not
modified. It is the program's responsibility to set the mask up
correctly. Just like sigwaitinfo() etc expect all signals which are
waited on to be blocked.

The signal mask handling is orthogonal to all this and must be explicit.
In some cases explicit pthread_sigmask/sigprocmask calls. But this is
not atomic if a signal must be masked/unmasked for the *_wait call.
This is why we have variants like pselect/ppoll/epoll_pwait which
explicitly and *atomically* change the signal mask for the duration of
the call.


> Btw, please point me to the discussion about the real-life usefulness of
> that parameter for epoll. I read the thread where sys_pepoll() was
> introduced, but except for some theoretical handwaving about possible
> usefulness there are no real signs of that requirement.

Don't search for epoll_pwait, it's not widely used yet. Search for
pselect, which is standardized. You'll find plenty of uses of that
interface. The number is certainly depressed at the moment since until
recently there was no correct implementation on Linux. And the
interface is mostly used in real-time contexts where signals are more
commonly used.


> What is the ground research or extended explanation about
> blocking/unblocking some signals during syscall execution?

Why is this even a question? Have you done programming with signals?
Your hatred of signals makes me think this isn't the case.

You might want to unblock a signal on a *_wait call if it can be used to
interrupt the wait, but you don't want this to happen while the
thread is working on a request.

You might want to block a signal, for instance, around a sigwaitinfo
call or, in this case, a kevent_wait call where the signal might be
delivered to the queue.

There are countless possibilities. Signals are very flexible.


> There are _no_ additional syscalls.
> I just introduced a new case for the event type.

Which is a new syscall. All demultiplexer cases are new syscalls.
Which, BTW, implies that unrecognized types should actually cause an
ENOSYS return value (this affects kevent_break). We've been over this
many times. If EINVAL is returned, this case cannot be distinguished from
invalid parameters. This is crucial for future extensions where
userland (esp. glibc) needs to be able to determine whether a new feature
is supported on the system.


> You _need_ it to be done, since any kernel kevent user must have
> enqueue/dequeue/callback callbacks. It is just an implementation of those
> callbacks.

I don't question that. But there is no need to add the callback. It
extends the kernel ABI/API. And for what? A vastly inferior timer
implementation compared to the POSIX timers. And this while all that
needs to be done is to extend the POSIX timer code slightly to handle
SIGEV_KEVENT in addition to the other notification methods currently
used. If you do it right then the code can be shared with the file AIO
code which currently is circulated as well and which uses parts of the
POSIX timer infrastructure.


> Btw, how should the POSIX API be extended to allow queuing events? A queue
> is required (it is created when the user calls kevent_init() or
> previously opens /dev/kevent); how should it be accessed, since it is
> just a file descriptor in the process task_struct.

I've explained this multiple times. The struct sigevent structure needs
to be extended to get a new part in the union. Something like

struct {
    int kevent_fd;
    void *data;
} _sigev_kevent;

Then define SIGEV_KEVENT as a value distinct from the other SIGEV_
values. In the code which handles setup of timers (the timer_create
syscall), recognize SIGEV_KEVENT and handle it appropriately. I.e.,
call into the code to register the event source, just like you'd do with
the current interface. Then add the code to post an event to the event
queue where currently signals would be sent et voilà.
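A compilable sketch of the extension described here. Neither SIGEV_KEVENT nor the _sigev_kevent member exists in any released kernel or glibc, so a local stand-in struct is used instead of the real <signal.h> struct sigevent; the constant's value is likewise invented.

```c
#include <assert.h>

#define SIGEV_KEVENT 100                /* hypothetical, distinct from the
                                           existing SIGEV_* constants */

struct sketch_sigevent {
    int sigev_notify;                   /* SIGEV_SIGNAL, ..., or SIGEV_KEVENT */
    union {
        int sival_int;                  /* existing members elided */
        struct {
            int kevent_fd;              /* queue to post the expiry event to */
            void *data;                 /* opaque cookie echoed in the event */
        } _sigev_kevent;
    } u;
};

/* What a timer_create() caller would fill in under this scheme. */
static struct sketch_sigevent make_kevent_sigevent(int qfd, void *cookie)
{
    struct sketch_sigevent ev = { .sigev_notify = SIGEV_KEVENT };
    ev.u._sigev_kevent.kevent_fd = qfd; /* fd obtained from kevent_init() */
    ev.u._sigev_kevent.data = cookie;
    return ev;
}
```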

--
➧ Ulrich Drepper ➧ Red Hat, Inc. ➧ 444 Castro St ➧ Mountain View, CA ❖

2006-11-21 17:50:32

by Evgeniy Polyakov

[permalink] [raw]
Subject: Re: [take24 0/6] kevent: Generic event handling mechanism.

On Tue, Nov 21, 2006 at 08:58:49AM -0800, Ulrich Drepper ([email protected]) wrote:
> Evgeniy Polyakov wrote:
> >>You don't want to have a channel like this. The userlevel code doesn't
> >>know which threads are waiting in the kernel on the event queue. And it
> >>seems to be much more complicated then simply have an kevent call which
> >>tells the kernel "wake up N or 1 more threads since I cannot handle it".
> >> Basically a futex_wake()-like call.
> >
> >The kernel does not know about any threads which wait for events; it only
> >has a queue of events. It can only wake those which were parked in
> >kevent_get_events() or kevent_wait(), and the syscall will return only when
> >the condition it waits on is true, i.e. when there is a new event in the ready
> >queue and/or the ring buffer has empty slots; the kernel will wake them up
> >in any case if those conditions are true.
> >
> >How should it know which syscall should be interrupted when the special syscall
> >is called?
>
> It's not about interrupting any threads.
>
> The issue is that the wakeup of a thread from the kevent_wait call
> constitutes an "event notification". If, as it should be, only one
> thread is woken, then this information mustn't get lost. If the woken
> thread cannot work on the events it got notified for, then it must tell
> the kernel about it so that, *if* there are other threads waiting in
> kevent_wait, one of those other threads can be woken.
>
> What is needed is a simple "wake another thread waiting on this event
> queue" syscall. Yes, in theory we could open an additional pipe with
> each event queue and use it for waking threads, but this is influencing
> the ABI through the use of a file descriptor. It's much better to have
> an explicit way to do this.

Threads are parked in syscalls - which one should be interrupted?
And what if there were no threads waiting in syscalls?

> >No AIO, but a syscall.
> >Only the syscall time matters.
> >A syscall starts, and it should be stopped at some point. When should it
> >be stopped? It should be stopped some time after it was started!
> >
> >I still do not understand how you will use absolute timeout values
> >there. Please explain.
>
> What is there to explain? If you are waiting for events which must
> coincide with real-world events, you'll naturally want to formulate
> something like "wait for X until 10:15h". You cannot formulate this
> correctly with relative timeouts since the realtime clock might be adjusted.

It has nothing to do with the syscall.
You register a timer to wait until 10:15, that is all.

You do not ask to sleep in read() until some time, because read() has
nothing in common with that time and event.

But actually this is becoming a stupid discussion, don't you think?
What do you think about putting a timespec there and a small warning in
dmesg about absolute timeouts? When someone reports it, I will publicly
say that you were right and that it is correct to have the possibility
of absolute timeouts for syscalls. :)

> >futex_wait() uses relative timeouts:
> > static int futex_wait(u32 __user *uaddr, u32 val, unsigned long time)
> >
> >The kernel uses relative timeouts.
>
> Look again. This time at the implementation. For FUTEX_LOCK_PI the
> timeout is an absolute timeout.

How come? It just uses timespec.

> >We do not have such symmetry.
> >Other event handling interfaces cannot work with events which do not
> >have a file descriptor behind them. Kevent can, and it works.
> >Signals are just usual events.
> >
> >You request to get events - and you get them.
> >You request not to get events during a syscall - you remove the events.
>
> None of this matches what I'm talking about. If you want to block a
> signal for the duration of the kevent_wait call, that is nothing you can
> do by registering an event.
>
> Registering events has nothing to do with signal masks. They are not
> modified. It is the program's responsibility to set the mask up
> correctly. Just like sigwaitinfo() etc expect all signals which are
> waited on to be blocked.
>
> The signal mask handling is orthogonal to all this and must be explicit.
> In some cases explicit pthread_sigmask/sigprocmask calls. But this is
> not atomic if a signal must be masked/unmasked for the *_wait call.
> This is why we have variants like pselect/ppoll/epoll_pwait which
> explicitly and *atomically* change the signal mask for the duration of
> the call.

You probably missed the kevent signal patch - the signal will not be
delivered (in special cases) since it will not be copied into the signal
mask. The system just will not know that it happened. Completely. Like
putting it into the blocked mask.

> >Btw, please point me to the discussion about the real-life usefulness of
> >that parameter for epoll. I read the thread where sys_pepoll() was
> >introduced, but except for some theoretical handwaving about possible
> >usefulness there are no real signs of that requirement.
>
> Don't search for epoll_pwait, it's not widely used yet. Search for
> pselect, which is standardized. You'll find plenty of uses of that
> interface. The number is certainly depressed at the moment since until
> recently there was no correct implementation on Linux. And the
> interface is mostly used in real-time contexts where signals are more
> commonly used.

I found this:

... document a pselect() call intended to
remove the race condition that is present when one wants
to wait on either a signal or some file descriptor.
(See also Stevens, Unix Network Programming, Volume 1, 2nd Ed.,
1998, p. 168 and the pselect.2 man page released today.)
Glibc 2.0 has a bad version (wrong number of parameters)
and glibc 2.1 a better version, but the whole purpose
of pselect is to avoid the race, and glibc cannot do that,
one needs kernel support.


But it is completely irrelevant to kevent signals - there is no race
in the case when the signal is delivered through a file descriptor.

> >What is the ground research or extended explanation about
> >blocking/unblocking some signals during syscall execution?
>
> Why is this even a question? Have you done programming with signals?
> Your hatred of signals makes me think this isn't the case.

It is much better not to know how a thing works than to be unable
to understand how new things can work.

> You might want to unblock a signal on a *_wait call if it can be used to
> interrupt the wait, but you don't want this to happen while the
> thread is working on a request.

Add kevent signal and do not process that event.

> You might want to block a signal, for instance, around a sigwaitinfo
> call or, in this case, a kevent_wait call where the signal might be
> delivered to the queue.

Having a special type of kevent signal is the same as putting the signal
into the blocked mask, except that the signal event will be marked as
ready - to indicate that the condition was there.
There will not be any race in that case.

> There are countless possibilities. Signals are very flexible.

That is why we want to get them through a synchronous queue? :)

> >There are _no_ additional syscalls.
> >I just introduced a new case for the event type.
>
> Which is a new syscall. All demultiplexer cases are new syscalls.

I think I am a bit blind - probably parts of the Leonids are still getting
into my brain - but there is one syscall called kevent_ctl() which adds
different events, including timer, signal, socket and others.

> Which, BTW, implies that unrecognized types should actually cause an
> ENOSYS return value (this affects kevent_break). We've been over this
> many times. If EINVAL is returned, this case cannot be distinguished from
> invalid parameters. This is crucial for future extensions where
> userland (esp glibc) needs to be able to determine whether a new feature
> is supported on the system.

I can replace it with -ENOSYS if you like.

> >You _need_ it to be done, since any kernel kevent user must have
> >enqueue/dequeue/callback callbacks. It is just an implementation of those
> >callbacks.
>
> I don't question that. But there is no need to add the callback. It

No one asked or paid me to create kevent, but it is done.
Probably not the way some people wanted, but that is how it always
happens; it is really not that bad.

The kevent subsystem operates with structures which can be added to
completely different objects in the system - inodes, files, anything.
To tell such an object about new events there are special callbacks -
enqueue and dequeue. The callback with the extremely unusual name
'callback' is invoked when the object the event is linked to has
something to report - new data, a fired alarm or anything else; it calls
kevent's ->callback, and if the return value is positive, the kevent is
marked as ready.
This allows having events with different sets of interests for the same
type of main object - for example, a socket can have read and write
callbacks.

So you must have them.
As you probably saw, kevent_timer_callback() just returns 1.
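The callback triple described above can be sketched roughly as follows; the names mirror the description in this message, not necessarily the actual kevent patch.

```c
#include <assert.h>

struct kevent_callbacks_sketch {
    int (*enqueue)(void *kevent);   /* attach the event to its object
                                       (inode, socket, timer, ...) */
    int (*dequeue)(void *kevent);   /* detach it again */
    int (*callback)(void *kevent);  /* called by the object when it has
                                       something to report; a positive
                                       return marks the kevent ready */
};

/* The timer case: nothing left to check when the alarm fires, so the
 * report-time callback unconditionally says "ready". */
static int timer_callback_sketch(void *kevent)
{
    (void)kevent;                   /* unused in this trivial case */
    return 1;
}
```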

> extends the kernel ABI/API. And for what? A vastly inferior timer
> implementation compared to the POSIX timers. And this while all that
> needs to be done is to extend the POSIX timer code slightly to handle
> SIGEV_KEVENT in addition to the other notification methods currently
> used. If you do it right then the code can be shared with the file AIO
> code which currently is circulated as well and which uses parts of the
> POSIX timer infrastructure.

Ulrich, tell me the truth, will you kill me if I say that I have an entry
in my TODO list to implement a different AIO design (details for interested
readers can be found in my blog), and then present it to the community? :))

> >Btw, how should the POSIX API be extended to allow queuing events? A queue
> >is required (it is created when the user calls kevent_init() or
> >previously opens /dev/kevent); how should it be accessed, since it is
> >just a file descriptor in the process task_struct.
>
> I've explained this multiple times. The struct sigevent structure needs
> to be extended to get a new part in the union. Something like
>
> struct {
>     int kevent_fd;
>     void *data;
> } _sigev_kevent;
>
> Then define SIGEV_KEVENT as a value distinct from the other SIGEV_
> values. In the code which handles setup of timers (the timer_create
> syscall), recognize SIGEV_KEVENT and handle it appropriately. I.e.,
> call into the code to register the event source, just like you'd do with
> the current interface. Then add the code to post an event to the event
> queue where currently signals would be sent et voilà.

Ok, I see.
It is doable and simple.
I will try to implement it tomorrow.


--
Evgeniy Polyakov

2006-11-21 18:52:03

by Evgeniy Polyakov

[permalink] [raw]
Subject: Re: [take24 0/6] kevent: Generic event handling mechanism.

On Tue, Nov 21, 2006 at 08:43:34PM +0300, Evgeniy Polyakov ([email protected]) wrote:
> > I've explained this multiple times. The struct sigevent structure needs
> > to be extended to get a new part in the union. Something like
> >
> > struct {
> >     int kevent_fd;
> >     void *data;
> > } _sigev_kevent;
> >
> > Then define SIGEV_KEVENT as a value distinct from the other SIGEV_
> > values. In the code which handles setup of timers (the timer_create
> > syscall), recognize SIGEV_KEVENT and handle it appropriately. I.e.,
> > call into the code to register the event source, just like you'd do with
> > the current interface. Then add the code to post an event to the event
> > queue where currently signals would be sent et voilà.
>
> Ok, I see.
> It is doable and simple.
> I will try to implement it tomorrow.

I've checked the code.
Since it will be a union, it is impossible to use _sigev_thread, and it
becomes just the SIGEV_SIGNAL case with a different delivery mechanism.
Is that what you want?

--
Evgeniy Polyakov

2006-11-21 20:02:02

by Jeff Garzik

[permalink] [raw]
Subject: Re: [take24 0/6] kevent: Generic event handling mechanism.

nitpick: in ring_buffer.c (example app), I would use posix_memalign(3)
rather than malloc(3)
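The allocation Jeff suggests might look like this: page-aligned memory from posix_memalign(3) rather than malloc(3). alloc_ring() is an illustrative helper, not code from the example app, and using the page size as the alignment is an assumption here.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <unistd.h>

static void *alloc_ring(size_t len)
{
    void *ring = NULL;
    size_t align = (size_t)sysconf(_SC_PAGESIZE);

    /* posix_memalign returns an errno value on failure, not -1 */
    if (posix_memalign(&ring, align, len) != 0)
        return NULL;
    return ring;
}
```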

Jeff




2006-11-21 20:19:18

by Jeff Garzik

[permalink] [raw]
Subject: Re: [take24 0/6] kevent: Generic event handling mechanism.

Another: pass a 'flags' argument to kevent_init(2). I guarantee you
will need it eventually. It IMO would help with later binary
compatibility, if nothing else. You wouldn't need a new syscall to
introduce struct kevent_ring_v2.

Jeff



2006-11-22 07:34:19

by Ulrich Drepper

[permalink] [raw]
Subject: Re: [take24 0/6] kevent: Generic event handling mechanism.

Evgeniy Polyakov wrote:
> Threads are parked in syscalls - which one should be interrupted?

It doesn't matter, use the same policy you use when waking a thread in
case of an event. This is not about waking a specific thread, it's
about not dropping the event notification.


> And what if there were no threads waiting in syscalls?

This is fine, do nothing. It means that the other threads are about to
read the ring buffer and will pick up the event.


The case which must be avoided is that of all threads being in the
kernel, one thread gets woken, and then it is canceled. Without notifying
the kernel about the cancellation, and in the absence of further event
notifications, the process is deadlocked.

A second case which should be avoided is that there is a thread waiting
when a thread gets canceled and there are one or more additional threads
around, but not in the kernel. Those other threads might not get to
the ring buffer anytime soon, so handling the event is unnecessarily
delayed.


> It has nothing to do with the syscall.
> You register a timer to wait until 10:15, that is all.

That's a nonsense argument. In this case you would not add any timeout
parameter at all. Of course nobody would want that since it's simply
too slow. Stop thinking about the absolute timeout as an exceptional
case; it might very well not be one for some problems.

Besides, I've already mentioned another case where a struct timespec*
parameter is needed. There are even two different relative timeouts:
using the monotonic clock or using the realtime clock. The latter is
affected by gettimeofday and ntp.


>>> Kernel use relative timeouts.
>> Look again. This time at the implementation. For FUTEX_LOCK_PI the
>> timeout is an absolute timeout.
>
> How come? It just uses timespec.

Correct, it's using the value passed in.


>> The signal mask handling is orthogonal to all this and must be explicit.
>> In some cases explicit pthread_sigmask/sigprocmask calls. But this is
>> not atomic if a signal must be masked/unmasked for the *_wait call.
>> This is why we have variants like pselect/ppoll/epoll_pwait which
>> explicitly and *atomically* change the signal mask for the duration of
>> the call.
>
> You probably missed kevent signal patch - signal will not be delivered
> (in special cases) since it will not be copied into signal mask. System
> just will not know that it happened. Completely. Like putting it into
> blocked mask.


I don't really understand what you want to say here.

I looked over the patch and I don't think I miss anything. You just
deliver the signal as an event. No signal mask handling at all. This
is exactly the problem.


> But it is completely irrelevant with kevent signals - there is no race
> for that case when signal is delivered through file descriptor.

Of course there is a race. You might not want the signal delivered.
This is what the signal mask is for. Or the other way around, as I've
said before.


> It is much better to not know how a thing works than to be unable to
> understand how new things can work.

Well, this explains why you don't understand signal masks at all.


> Add kevent signal and do not process that event.

That's not only a horrible hack, it does not work. If I want to ignore
a signal for the duration of the call, while having it otherwise
blocked for the rest of the program, you would have to register the
kevent for the signal, unblock the signal, make the kevent_wait call,
reset the mask, and remove the kevent for the signal. Otherwise it
would not be delivered to be ignored. And then you have a race, the
same race pselect is designed to prevent. In fact, you have two races.

There are other scenarios like this. Fact is, signal mask handling is
necessary and it cannot be folded into the event handling, it's orthogonal.


> Having special type of kevent signal is the same as putting signal into
> blocked mask, but signal event will be marked as ready - to indicate
> that condition was there.
> There will not be any race in that case.

Nonsense on all counts.


> I think I am a bit blind, probably parts of Leonids are still getting
> into my brain, but there is one syscall called kevent_ctl() which adds
> different events, including timer, signal, socket and others.

You are searching for callbacks and if none is found you return EINVAL.
This is exactly the same as if you'd create separate syscalls.
Perhaps even worse, I really don't like demultiplexers, separate
syscalls are much cleaner.

Avoiding these callbacks would help reduce the kernel interface,
especially for this timer implementation, which is useless since it is
inferior.


> I can replace with -ENOSYS if you like.

It's necessary since we must be able to distinguish the errors.


> No one asked and paid me to create kevent, but it is done.
> Probably not the way some people wanted, but it happened anyway -
> it is really not that bad.

Nobody says that the work isn't appreciated. But if you don't want it
to be critiqued, don't publish it. If you don't want to make any more
changes, fine, say so. I'll find somebody else to do it or will do it
myself.

I claim that I know a thing or two about the interfaces of the runtime
that programs expect to use. And I know POSIX and the way the
interfaces are designed and how they interact.


> Ulrich, tell me the truth, will you kill me if I say that I have an entry
> in TODO to implement different AIO design (details for interested readers
> can be found in my blog), and then present it to community? :))

I don't care about the kernel implementation as long as the interface
is compatible with what I need for the POSIX AIO implementation. The
currently proposed code is going in that direction. Any implementation
which, like Ben's old one, does not allow POSIX AIO to be implemented I
will of course oppose.


>> Then define SIGEV_KEVENT as a value distinct from the other SIGEV_
>> values. In the code which handles setup of timers (the timer_create
>> syscall), recognize SIGEV_KEVENT and handle it appropriately. I.e.,
>> call into the code to register the event source, just like you'd do with
>> the current interface. Then add the code to post an event to the event
>> queue where currently signals would be sent et voilà.
>
> Ok, I see.
> It is doable and simple.
> I will try to implement it tomorrow.

Thanks, that's progress. And yes, I imagine it's not hard which is why
the currently proposed timer interface is so unnecessary.

--
➧ Ulrich Drepper ➧ Red Hat, Inc. ➧ 444 Castro St ➧ Mountain View, CA ❖

2006-11-22 07:38:54

by Ulrich Drepper

[permalink] [raw]
Subject: Re: [take24 0/6] kevent: Generic event handling mechanism.

Evgeniy Polyakov wrote:
> I've checked the code.
> Since it will be a union, it is impossible to use _sigev_thread and it
> becomes just SIGEV_SIGNAL case with different delivery mechanism.
> Is it what you want?

struct sigevent is defined like this:

typedef struct sigevent {
	sigval_t sigev_value;
	int sigev_signo;
	int sigev_notify;
	union {
		int _pad[SIGEV_PAD_SIZE];
		int _tid;

		struct {
			void (*_function)(sigval_t);
			void *_attribute;	/* really pthread_attr_t */
		} _sigev_thread;
	} _sigev_un;
} sigevent_t;


For the SIGEV_KEVENT case:

sigev_notify is set to SIGEV_KEVENT (obviously)

sigev_value can be used for the void* data passed along with the
signal, just like in the case of a signal delivery

Now you need a way to specify the kevent descriptor. Just add

int _kevent;

inside the union and if you want

#define sigev_kevent_descr _sigev_un._kevent

That should be all.

--
➧ Ulrich Drepper ➧ Red Hat, Inc. ➧ 444 Castro St ➧ Mountain View, CA ❖

2006-11-22 10:40:21

by Evgeniy Polyakov

[permalink] [raw]
Subject: Re: [take24 0/6] kevent: Generic event handling mechanism.

On Tue, Nov 21, 2006 at 03:19:05PM -0500, Jeff Garzik ([email protected]) wrote:
> Another: pass a 'flags' argument to kevent_init(2). I guarantee you
> will need it eventually. It IMO would help with later binary
> compatibility, if nothing else. You wouldn't need a new syscall to
> introduce struct kevent_ring_v2.

Yep, I will add there 'flags' field.

> Jeff

--
Evgeniy Polyakov

2006-11-22 10:39:34

by Evgeniy Polyakov

[permalink] [raw]
Subject: Re: [take24 0/6] kevent: Generic event handling mechanism.

On Tue, Nov 21, 2006 at 11:33:39PM -0800, Ulrich Drepper ([email protected]) wrote:
> Evgeniy Polyakov wrote:
> >Threads are parked in syscalls - which one should be interrupted?
>
> It doesn't matter, use the same policy you use when waking a thread in
> case of an event. This is not about waking a specific thread, it's
> about not dropping the event notification.

Event notification is not dropped - the thread was awakened, so the
kernel's task is complete. The kernel does not know, and should not
have to know, that the selected thread was not good enough. If you want
to wake up another thread - create another event; that is why I
proposed userspace notifications, which I actually do not like.

> >And what if there were no threads waiting in syscalls?
>
> This is fine, do nothing. It means that the other threads are about to
> read the ring buffer and will pick up the event.
>
>
> The case which must be avoided is that of all threads being in the
> kernel, one threads gets woken, and then is canceled. Without notifying
> the kernel about the cancellation and in the absence of further events
> notifications the process is deadlocked.
>
> A second case which should be avoided is that there is a thread waiting
> when a thread gets canceled and there are one or more addition threads
> around, but not in the kernel. But those other threads might not get to
> the ring buffer anytime soon, so handling the event is unnecessarily
> delayed.

If those threads are not in the kernel, the kernel cannot wake them up.
But if there is an event like 'wake me up when a thread has died', then
when new threads try to sleep in the syscall they will be immediately
awakened, since that event will be ready.

> >It has completely nothing with syscall.
> >You register a timer to wait until 10:15 that is all.
>
> That's a nonsense argument. In this case you would not add any timeout
> parameter at all. Of course nobody would want that since it's simply
> too slow. Stop thinking about the absolute timeout as an exceptional
> case, it might very well not be for some problems.

I repeat - the timeout is needed to tell the kernel the maximum
timeframe the syscall may live. When you tell me why you want the
syscall to be interrupted when some absolute time is on the clock,
instead of having a special event for that, then ok.

I think I know why you want absolute time there - because glibc
converts most timeouts to absolute time, since the POSIX wait function
pthread_cond_timedwait() works only with absolute time.

> Beside, I've already mentioned another case where a struct timespec*
> parameter is needed. There are even two different relative timeouts:
> using the monotonis clock or using the realtime clock. The latter is
> affected by gettimeofday and ntp.

Kevent converts it to jiffies, since it uses wait_event() and friends;
jiffies do not carry information about which clock is to be used.

> >>>Kernel use relative timeouts.
> >>Look again. This time at the implementation. For FUTEX_LOCK_PI the
> >>timeout is an absolute timeout.
> >
> >How come? It just uses timespec.
>
> Correct, it's using the value passed in.
>
>
> >>The signal mask handling is orthogonal to all this and must be explicit.
> >> In some cases explicit pthread_sigmask/sigprocmask calls. But this is
> >>not atomic if a signal must be masked/unmasked for the *_wait call.
> >>This is why we have variants like pselect/ppoll/epoll_pwait which
> >>explicitly and *atomically* change the signal mask for the duration of
> >>the call.
> >
> >You probably missed kevent signal patch - signal will not be delivered
> >(in special cases) since it will not be copied into signal mask. System
> >just will not know that it happend. Completely. Like putting it into
> >blocked mask.
>
>
> I don't really understand what you want to say here.
>
> I looked over the patch and I don't think I miss anything. You just
> deliver the signal as an event. No signal mask handling at all. This
> is exactly the problem.

Have you seen specific_send_sig_info():

	/* Short-circuit ignored signals. */
	if (sig_ignored(p, sig)) {
		ret = 1;
		goto out;
	}

Almost the same happens when a signal is delivered using kevent (the
special case) - the pending mask is not updated.

> >But it is completely irrelevant with kevent signals - there is no race
> >for that case when signal is delivered through file descriptor.
>
> Of course there is a race. You might not want the signal delivered.
> This is what the signal mask is for. Of the other way around, as I've
> said before.

Then ignore that event - there is no race between signal delivery and
reading other descriptors, and there _is_ one when the signal is
delivered not through the same queue but asynchronously with a mask
update.

> >It is much better to not know how thing works, then to not be possible
> >to understand how new things can work.
>
> Well, this explains why you don't understand signal masks at all.

Nice :)
I at least try to do something to solve this problem, instead of
blindly saying the same thing again and again without even trying to
hear and understand what others say.

> >Add kevent signal and do not process that event.
>
> That's not only a horrible hack, it does not work. If I want to ignore
> a signal for the duration of the call, while you have it occasionally
> blocked for the rest of the program you would have to register the
> kevent for the signal, unblock the signal, the kevent_wait call, reset
> the mask, remove the kevent for the signal.. Otherwise it would not be
> delivered to be ignored. And then you have a race, the same race
> pselect is designed to prevent. In fact, you have two races.
>
> There are other scenarios like this. Fact is, signal mask handling is
> necessary and it cannot be folded into the event handling, it's orthogonal.

You take too narrow a view.
Look broader - pselect() has a signal mask to prevent the race between
asynchronous signal delivery and file descriptor readiness. With kevent
both of those events are delivered through the same queue, so there is
no race, and the kevent syscalls do not need that workaround for a
20-year-old design which cannot handle events other than fds.

> >Having special type of kevent signal is the same as putting signal into
> >blocked mask, but signal event will be marked as ready - to indicate
> >that condition was there.
> >There will not be any race in that case.
>
> Nonsense on all counts.
>
>
> >I think I am a bit blind, probably parts of Leonids are still getting
> >into my brain, but there is one syscall called kevent_ctl() which adds
> >different events, including timer, signal, socket and others.
>
> You are searching for callbacks and if none is found you return EINVAL.
> This is exactly the same as if you'd create separate syscalls.
> Perhaps even worse, I really don't like demultiplexers, separate
> syscalls are much cleaner.
>
> Avoiding these callbacks would help reducing the kernel interface,
> especially for this useless since inferior timer implementation.

You completely refuse to understand how kevents work and why they are
needed; if you tried to accept that there are opinions different from
yours, then probably we could make some progress.

Those callbacks are needed to support different types of objects, which
can produce events, through the same interface.

> >I can replace with -ENOSYS if you like.
>
> It's necessary since we must be able to distinguish the errors.

And what if the user requests a bogus event type - is that an invalid
condition, or a normal but unhandled one (thus ENOSYS)?

> >No one asked and pain me to create kevent, but it is done.
> >Probably no the way some people wanted, but it always happend,
> >it is really not that bad.
>
> Nobody says that the work isn't appreciated. But if you don't want it
> to be critiqued, don't publish it. If you don't want to mask any more
> changes, fine, say so. I'll find somebody else to do it or will do it
> myself.

I greatly appreciate criticism, really. But when it comes to 'this
sucks because it sucks; no matter that it is done a completely
different way, it still sucks because others sucked there too', I
cannot call that criticism - it becomes nonsense.

> I claim that I know a thing or two about interfaces of the runtime
> programs expect to use. And I know POSIX and the way the interfaces are
> designed and how they interact.

Well, then I claim that I do not know 'a thing or two about interfaces
of the runtime programs expect to use', but instead I write those
programs and I know my needs. And POSIX interfaces are the last ones I
prefer to use.

We are in different positions - theoretical thoughts about world
happiness versus practical usage. I do not say that only one of these
approaches must exist; they both can live together, but that requires
people on both sides not merely to call the other side stupid or
ignorant, but instead to listen and take the other view into account.

> >Ulrich, tell me the truth, will you kill me if I say that I have an entry
> >in TODO to implement different AIO design (details for interested readers
> >can be found in my blog), and then present it to community? :))
>
> I don't care about the kernel implementation as long as the interface is
> compatible with what I need for the POSIX AIO implementation. The
> currently proposed code is going in that direction. Any implementation
> which like Ben's old one does not allow POSIX AIO to be implemented I
> will of oppose.

What if it is not called POSIX AIO, but instead some kind of 'true
AIO' or 'real AIO' or maybe 'alternative AIO'? :)
It is quite certain that the POSIX AIO interfaces are unlikely to be
applied there...

> >>Then define SIGEV_KEVENT as a value distinct from the other SIGEV_
> >>values. In the code which handles setup of timers (the timer_create
> >>syscall), recognize SIGEV_KEVENT and handle it appropriately. I.e.,
> >>call into the code to register the event source, just like you'd do with
> >>the current interface. Then add the code to post an event to the event
> >>queue where currently signals would be sent et voilà.
> >
> >Ok, I see.
> >It is doable and simple.
> >I will try to implement it tomorrow.
>
> Thanks, that's progress. And yes, I imagine it's not hard which is why
> the currently proposed timer interface is so unnecessary.

It is the first technical rather than political problem we have caught
in this endless discussion; I have already separated it into a
different subthread. Let's try to think more about it there.

> --
> ➧ Ulrich Drepper ➧ Red Hat, Inc. ➧ 444 Castro St ➧ Mountain View,
> CA ❖

--
Evgeniy Polyakov

2006-11-22 10:42:10

by Evgeniy Polyakov

[permalink] [raw]
Subject: Re: [take24 0/6] kevent: Generic event handling mechanism.

On Tue, Nov 21, 2006 at 03:01:45PM -0500, Jeff Garzik ([email protected]) wrote:
> nitpick: in ring_buffer.c (example app), I would use posix_memalign(3)
> rather than malloc(3)

Yes, it can be done.

> Jeff

--
Evgeniy Polyakov

2006-11-22 10:46:41

by Evgeniy Polyakov

[permalink] [raw]
Subject: Re: [take24 0/6] kevent: Generic event handling mechanism.

On Tue, Nov 21, 2006 at 11:38:25PM -0800, Ulrich Drepper ([email protected]) wrote:
> Evgeniy Polyakov wrote:
> >I've checked the code.
> >Since it will be a union, it is impossible to use _sigev_thread and it
> >becomes just SIGEV_SIGNAL case with different delivery mechanism.
> >Is it what you want?
>
> struct sigevent is defined like this:
>
> typedef struct sigevent {
> sigval_t sigev_value;
> int sigev_signo;
> int sigev_notify;
> union {
> int _pad[SIGEV_PAD_SIZE];
> int _tid;
>
> struct {
> void (*_function)(sigval_t);
> void *_attribute; /* really pthread_attr_t */
> } _sigev_thread;
> } _sigev_un;
> } sigevent_t;
>
>
> For the SIGEV_KEVENT case:
>
> sigev_notify is set to SIGEV_KEVENT (obviously)
>
> sigev_value can be used for the void* data passed along with the
> signal, just like in the case of a signal delivery
>
> Now you need a way to specify the kevent descriptor. Just add
>
> int _kevent;
>
> inside the union and if you want
>
> #define sigev_kevent_descr _sigev_un._kevent
>
> That should be all.

That is what I implemented.
But in this case it will be impossible to have SIGEV_THREAD and SIGEV_KEVENT
at the same time; it will be just the same as SIGEV_SIGNAL but with a
different delivery mechanism. Is this what you expect?

> --
> ➧ Ulrich Drepper ➧ Red Hat, Inc. ➧ 444 Castro St ➧ Mountain View,
> CA ❖

--
Evgeniy Polyakov

2006-11-22 11:39:00

by Michael Tokarev

[permalink] [raw]
Subject: Re: [take24 0/6] kevent: Generic event handling mechanism.

Ulrich Drepper wrote:
> Jeff Garzik wrote:
>> I think we have lived with relative timeouts for so long, it would be
>> unusual to change now. select(2), poll(2), epoll_wait(2) all take
>> relative timeouts.
>
> I'm not talking about always using absolute timeouts.
>
> I'm saying the timeout parameter should be a struct timespec* and then
> the flags word could have a flag meaning "this is an absolute timeout".
> I.e., enable both uses, even make relative timeouts the default. This
> is what the modern POSIX interfaces do, too, see clock_nanosleep.


Can't the argument be something like u64 instead of struct timespec,
regardless of this discussion (relative vs absolute)?

Compare:

void mysleep(int msec) {
	struct timeval tv;
	tv.tv_sec = msec / 1000;
	tv.tv_usec = (msec % 1000) * 1000;	/* tv_usec is microseconds */
	select(0, 0, 0, 0, &tv);
}

with

void mysleep(int msec) {
	poll(0, 0, msec * SOME_TIME_SCALE_VALUE);
}

That is to say: struct time{spec,val,whatever} is more difficult to use
than plain numbers.

But yes... struct timespec has the advantage of already existing.
Oh well.

/mjt

2006-11-22 11:49:07

by Evgeniy Polyakov

[permalink] [raw]
Subject: Re: [take24 0/6] kevent: Generic event handling mechanism.

On Wed, Nov 22, 2006 at 02:38:50PM +0300, Michael Tokarev ([email protected]) wrote:
> Ulrich Drepper wrote:
> > Jeff Garzik wrote:
> >> I think we have lived with relative timeouts for so long, it would be
> >> unusual to change now. select(2), poll(2), epoll_wait(2) all take
> >> relative timeouts.
> >
> > I'm not talking about always using absolute timeouts.
> >
> > I'm saying the timeout parameter should be a struct timespec* and then
> > the flags word could have a flag meaning "this is an absolute timeout".
> > I.e., enable both uses, even make relative timeouts the default. This
> > is what the modern POSIX interfaces do, too, see clock_nanosleep.
>
>
> Can't the argument be something like u64 instead of struct timespec,
> regardless of this discussion (relative vs absolute)?

It is right now :)

> /mjt

--
Evgeniy Polyakov

2006-11-22 12:11:17

by Evgeniy Polyakov

[permalink] [raw]
Subject: Re: [take24 0/6] kevent: Generic event handling mechanism.

On Tue, Nov 21, 2006 at 11:33:39PM -0800, Ulrich Drepper ([email protected]) wrote:
> Evgeniy Polyakov wrote:
> >Threads are parked in syscalls - which one should be interrupted?
>
> It doesn't matter, use the same policy you use when waking a thread in
> case of an event. This is not about waking a specific thread, it's
> about not dropping the event notification.
>
>
> >And what if there were no threads waiting in syscalls?
>
> This is fine, do nothing. It means that the other threads are about to
> read the ring buffer and will pick up the event.
>
>
> The case which must be avoided is that of all threads being in the
> kernel, one threads gets woken, and then is canceled. Without notifying
> the kernel about the cancellation and in the absence of further events
> notifications the process is deadlocked.
>
> A second case which should be avoided is that there is a thread waiting
> when a thread gets canceled and there are one or more addition threads
> around, but not in the kernel. But those other threads might not get to
> the ring buffer anytime soon, so handling the event is unnecessarily
> delayed.

Ok, to solve the problem in a way which should be good for both of us,
I decided to implement an additional syscall which allows marking any
event as ready and thus waking up the appropriate threads. If userspace
requests zero events to be marked as ready, the syscall will just
interrupt/wake up one of the listeners parked in the syscall.

Peace?

--
Evgeniy Polyakov

2006-11-22 12:16:48

by Evgeniy Polyakov

[permalink] [raw]
Subject: Re: [take24 0/6] kevent: Generic event handling mechanism.

On Wed, Nov 22, 2006 at 03:09:34PM +0300, Evgeniy Polyakov ([email protected]) wrote:
> Ok, to solve the problem in the way which should be good for both I
> decided to implement additional syscall which will allow to mark any
> event as ready and thus wake up appropriate threads. If userspace will
> request zero events to be marked as ready, syscall will just
> interrupt/wakeup one of the listeners parked in syscall.

Btw, what about putting an additional multiplexer into the
add/remove/modify switch? A logical 'ready' addition would fit there.

--
Evgeniy Polyakov

2006-11-22 12:34:13

by Jeff Garzik

[permalink] [raw]
Subject: Re: [take24 0/6] kevent: Generic event handling mechanism.

Michael Tokarev wrote:
> Can't the argument be something like u64 instead of struct timespec,
> regardless of this discussion (relative vs absolute)?


Newer syscalls (ppoll, pselect) take struct timespec, which is a
reasonable, modern form of the timeout argument...

Jeff


2006-11-22 13:48:33

by Evgeniy Polyakov

[permalink] [raw]
Subject: Re: [take24 0/6] kevent: Generic event handling mechanism.

On Wed, Nov 22, 2006 at 03:15:16PM +0300, Evgeniy Polyakov ([email protected]) wrote:
> On Wed, Nov 22, 2006 at 03:09:34PM +0300, Evgeniy Polyakov ([email protected]) wrote:
> > Ok, to solve the problem in the way which should be good for both I
> > decided to implement additional syscall which will allow to mark any
> > event as ready and thus wake up appropriate threads. If userspace will
> > request zero events to be marked as ready, syscall will just
> > interrupt/wakeup one of the listeners parked in syscall.
>
> Btw, what about putting aditional multiplexer into add/remove/modify
> switch? There will be logical 'ready' addon?

Something like this.

Signed-off-by: Evgeniy Polyakov <[email protected]>

diff --git a/include/linux/kevent.h b/include/linux/kevent.h
index c909c62..7afb3d6 100644
--- a/include/linux/kevent.h
+++ b/include/linux/kevent.h
@@ -99,6 +99,8 @@ struct kevent_user
struct mutex ctl_mutex;
/* Wait until some events are ready. */
wait_queue_head_t wait;
+ /* Exit from syscall if someone wants us to do it */
+ int need_exit;

/* Reference counter, increased for each new kevent. */
atomic_t refcnt;
@@ -132,6 +134,8 @@ void kevent_storage_fini(struct kevent_s
int kevent_storage_enqueue(struct kevent_storage *st, struct kevent *k);
void kevent_storage_dequeue(struct kevent_storage *st, struct kevent *k);

+void kevent_ready(struct kevent *k, int ret);
+
int kevent_user_add_ukevent(struct ukevent *uk, struct kevent_user *u);

#ifdef CONFIG_KEVENT_POLL
diff --git a/include/linux/ukevent.h b/include/linux/ukevent.h
index 0680fdf..6bc0c79 100644
--- a/include/linux/ukevent.h
+++ b/include/linux/ukevent.h
@@ -174,5 +174,6 @@ struct kevent_ring
#define KEVENT_CTL_ADD 0
#define KEVENT_CTL_REMOVE 1
#define KEVENT_CTL_MODIFY 2
+#define KEVENT_CTL_READY 3

#endif /* __UKEVENT_H */
diff --git a/kernel/kevent/kevent.c b/kernel/kevent/kevent.c
index 4d2d878..d1770a1 100644
--- a/kernel/kevent/kevent.c
+++ b/kernel/kevent/kevent.c
@@ -91,10 +91,10 @@ int kevent_init(struct kevent *k)
spin_lock_init(&k->ulock);
k->flags = 0;

- if (unlikely(k->event.type >= KEVENT_MAX)
+ if (unlikely(k->event.type >= KEVENT_MAX))
return kevent_break(k);

- if (!kevent_registered_callbacks[k->event.type].callback)) {
+ if (!kevent_registered_callbacks[k->event.type].callback) {
kevent_break(k);
return -ENOSYS;
}
@@ -142,16 +142,10 @@ void kevent_storage_dequeue(struct keven
spin_unlock_irqrestore(&st->lock, flags);
}

-/*
- * Call kevent ready callback and queue it into ready queue if needed.
- * If kevent is marked as one-shot, then remove it from storage queue.
- */
-static int __kevent_requeue(struct kevent *k, u32 event)
+void kevent_ready(struct kevent *k, int ret)
{
- int ret, rem;
unsigned long flags;
-
- ret = k->callbacks.callback(k);
+ int rem;

spin_lock_irqsave(&k->ulock, flags);
if (ret > 0)
@@ -178,6 +172,19 @@ static int __kevent_requeue(struct keven
spin_unlock_irqrestore(&k->user->ready_lock, flags);
wake_up(&k->user->wait);
}
+}
+
+/*
+ * Call kevent ready callback and queue it into ready queue if needed.
+ * If kevent is marked as one-shot, then remove it from storage queue.
+ */
+static int __kevent_requeue(struct kevent *k, u32 event)
+{
+ int ret;
+
+ ret = k->callbacks.callback(k);
+
+ kevent_ready(k, ret);

return ret;
}
diff --git a/kernel/kevent/kevent_user.c b/kernel/kevent/kevent_user.c
index 2cd8c99..3d1ea6b 100644
--- a/kernel/kevent/kevent_user.c
+++ b/kernel/kevent/kevent_user.c
@@ -47,8 +47,9 @@ static unsigned int kevent_user_poll(str
poll_wait(file, &u->wait, wait);
mask = 0;

- if (u->ready_num)
+ if (u->ready_num || u->need_exit)
mask |= POLLIN | POLLRDNORM;
+ u->need_exit = 0;

return mask;
}
@@ -136,6 +137,7 @@ static struct kevent_user *kevent_user_a

mutex_init(&u->ctl_mutex);
init_waitqueue_head(&u->wait);
+ u->need_exit = 0;

atomic_set(&u->refcnt, 1);

@@ -487,6 +489,97 @@ static struct ukevent *kevent_get_user(u
return ukev;
}

+static int kevent_mark_ready(struct ukevent *uk, struct kevent_user *u)
+{
+ struct kevent *k;
+ int err = -ENODEV;
+ unsigned long flags;
+
+ spin_lock_irqsave(&u->kevent_lock, flags);
+ k = __kevent_search(&uk->id, u);
+ if (k) {
+ spin_lock(&k->st->lock);
+ kevent_ready(k, 1);
+ spin_unlock(&k->st->lock);
+ err = 0;
+ }
+ spin_unlock_irqrestore(&u->kevent_lock, flags);
+
+ return err;
+}
+
+/*
+ * Mark appropriate kevents as ready.
+ * If number of events is zero just wake up one listener.
+ */
+static int kevent_user_ctl_ready(struct kevent_user *u, unsigned int num, void __user *arg)
+{
+ int err = -EINVAL, cerr = 0, rnum = 0, i;
+ void __user *orig = arg;
+ struct ukevent uk;
+
+ if (num > u->kevent_num)
+ return err;
+
+ if (!num) {
+ u->need_exit = 1;
+ wake_up(&u->wait);
+ return 0;
+ }
+
+ mutex_lock(&u->ctl_mutex);
+
+ if (num > KEVENT_MIN_BUFFS_ALLOC) {
+ struct ukevent *ukev;
+
+ ukev = kevent_get_user(num, arg);
+ if (ukev) {
+ for (i = 0; i < num; ++i) {
+ err = kevent_mark_ready(&ukev[i], u);
+ if (err) {
+ if (i != rnum)
+ memcpy(&ukev[rnum], &ukev[i], sizeof(struct ukevent));
+ rnum++;
+ }
+ }
+ if (copy_to_user(orig, ukev, rnum*sizeof(struct ukevent)))
+ cerr = -EFAULT;
+ kfree(ukev);
+ goto out_setup;
+ }
+ }
+
+ for (i = 0; i < num; ++i) {
+ if (copy_from_user(&uk, arg, sizeof(struct ukevent))) {
+ cerr = -EFAULT;
+ break;
+ }
+ arg += sizeof(struct ukevent);
+
+ err = kevent_mark_ready(&uk, u);
+ if (err) {
+ if (copy_to_user(orig, &uk, sizeof(struct ukevent))) {
+ cerr = -EFAULT;
+ break;
+ }
+ orig += sizeof(struct ukevent);
+ rnum++;
+ }
+ }
+
+out_setup:
+ if (cerr < 0) {
+ err = cerr;
+ goto out_remove;
+ }
+
+ err = num - rnum;
+out_remove:
+ mutex_unlock(&u->ctl_mutex);
+
+ return err;
+}
+
/*
* Read from userspace all ukevents and modify appropriate kevents.
* If provided number of ukevents is more that threshold, it is faster
@@ -779,9 +872,10 @@ static int kevent_user_wait(struct file

if (!(file->f_flags & O_NONBLOCK)) {
wait_event_interruptible_timeout(u->wait,
- u->ready_num >= min_nr,
+ (u->ready_num >= min_nr) || u->need_exit,
clock_t_to_jiffies(nsec_to_clock_t(timeout)));
}
+ u->need_exit = 0;

while (num < max_nr && ((k = kevent_dequeue_ready(u)) != NULL)) {
if (copy_to_user(buf + num*sizeof(struct ukevent),
@@ -819,6 +913,9 @@ static int kevent_ctl_process(struct fil
case KEVENT_CTL_MODIFY:
err = kevent_user_ctl_modify(u, num, arg);
break;
+ case KEVENT_CTL_READY:
+ err = kevent_user_ctl_ready(u, num, arg);
+ break;
default:
err = -EINVAL;
break;
@@ -994,9 +1091,10 @@ asmlinkage long sys_kevent_wait(int ctl_

if (!(file->f_flags & O_NONBLOCK)) {
wait_event_interruptible_timeout(u->wait,
- ((u->ready_num >= 1) && (kevent_ring_space(u))),
+ ((u->ready_num >= 1) && kevent_ring_space(u)) || u->need_exit,
clock_t_to_jiffies(nsec_to_clock_t(timeout)));
}
+ u->need_exit = 0;

for (i=0; i<num; ++i) {
k = kevent_dequeue_ready_ring(u);

--
Evgeniy Polyakov

2006-11-22 21:05:40

by Ulrich Drepper

[permalink] [raw]
Subject: Re: [take24 0/6] kevent: Generic event handling mechanism.

Evgeniy Polyakov wrote:
> But in this case it will be impossible to have SIGEV_THREAD and SIGEV_KEVENT
> at the same time, it will be just the same as SIGEV_SIGNAL but with
> different delivery mechanism. Is is what you expect for that?

Yes, that's expected. The event is for the queue, not directed to a
specific thread.

If in future we want to think about preferably waking a specific thread
we can then think about it. But I doubt that'll be beneficial. The
thread specific part in the signal handling is only used to implement
the SIGEV_THREAD notification.

--
➧ Ulrich Drepper ➧ Red Hat, Inc. ➧ 444 Castro St ➧ Mountain View, CA ❖

2006-11-22 22:26:22

by Ulrich Drepper

[permalink] [raw]
Subject: Re: [take24 0/6] kevent: Generic event handling mechanism.

Evgeniy Polyakov wrote:
> Event notification is not dropped - [...]

Since you said you added the new syscall I'll leave this alone.


> I repeat - the timeout is needed to tell the kernel the maximum possible
> timeframe the syscall can live. When you tell me why you want the syscall
> to be interrupted when some absolute time is on the clock, instead of
> having a special event for that, then OK.

This goes together with...


> I think I know why you want absolute time there - because glibc converts
> most timeouts to absolute time, since the POSIX waiting primitive
> pthread_cond_timedwait() works only with it.

I did not make the decision to use absolute timeouts/deadlines. This is
what is needed in many situations. It's the more general way to specify
delays. These are real-world requirements which were taken into account
when designing the interfaces.

For most cases I would agree that when doing AIO you need relative
timeouts. But the event handling is not about AIO alone. It's all
kinds of events and some/many are wall clock related. And it is
definitely necessary in some situations to be able to interrupt if the
clock jumps ahead. If a program deals with devices in the real world
this can be crucial. The new event handling must be generic enough to
accommodate all these uses and using struct timespec* plus eventually
flags does not add any measurable overhead so there is no reason to not
do it right.


> Kevent converts it to jiffies since it uses wait_event() and friends,
> and jiffies do not carry information about which clock to use.

Then this points to a place in the implementation which needs changing.
The interface cannot be restricted just because of what the
implementation currently happens to allow.


> /* Short-circuit ignored signals. */
> if (sig_ignored(p, sig)) {
> ret = 1;
> goto out;
> }
>
> almost the same happens when a signal is delivered using kevent (a special
> case) - the pending mask is not updated.

Yes, and how do you set the signal mask atomically with respect to
registering and unregistering signals with kevent and the syscall itself?
You cannot. But this is exactly what is resolved by adding the signal
mask parameter.

Programs which don't need the functionality simply pass a NULL pointer
and the cost is once again not measurable. But don't restrict the
functionality just because you don't see a use for this in your small world.

Yes, we could (later again) add new syscalls. But this is plain stupid.
I would love to never have had the epoll_wait or select syscalls and just
have epoll_pwait and pselect, since the functionality is a superset. As
it is we have a larger kernel ABI. Here we can stop making the same
mistake again.

For the userlevel side we might even have separate interfaces, one with
and one without the signal mask parameter. But that's userlevel; both
functions would use the same syscall.
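That wrapper relationship can be sketched at userlevel with today's epoll API: the mask-less wait is trivially a wrapper over the mask-taking syscall, so only the latter needs to exist in the kernel ABI. A minimal sketch (the wrapper name is made up):

```c
#include <stddef.h>
#include <sys/epoll.h>

/* epoll_wait expressed as a wrapper over the superset syscall:
 * a NULL sigmask in epoll_pwait means "leave the mask alone". */
static int my_epoll_wait(int epfd, struct epoll_event *events,
                         int maxevents, int timeout)
{
        return epoll_pwait(epfd, events, maxevents, timeout, NULL);
}
```

Both entry points then share one kernel path; the cost for callers that do not need a mask is a single NULL argument.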


>> There are other scenarios like this. Fact is, signal mask handling is
>> necessary and it cannot be folded into the event handling, it's orthogonal.
>
> You have too narrow look.
> Look broader - pselect() has signal mask to prevent race between async
> signal delivery and file descriptor readiness. With kevent both that
> events are delivered through the same queue, so there is no race, so
> kevent syscalls do not need that workaround for 20 years-old design,
> which can not handle different than fd events.

Your failure to understand the signal model leads to wrong conclusions.
There are races, several of them, and you cannot do anything without
signal mask parameters. I've explained this before.


>> Avoiding these callbacks would help reducing the kernel interface,
>> especially for this useless since inferior timer implementation.
>
> You completely do not want to understand how kevent works and why they
> are needed; if you would accept that there are opinions different from
> yours, then probably we could make some progress.

Meanwhile I think I know very well how they work.


> Those callbacks are needed to support different types of objects, which
> can produce events, with the same interface.

Yes, but it is not necessary to expose all the different types in the
userlevel APIs. That's the issue. Reduce the exposure of kernel
functionality to userlevel APIs.

If you integrate the timer handling into the POSIX timer syscalls the
callbacks in your timer patch might not need to be there. At least the
enqueue callback, if I remember correctly. All enqueue operations are
initiated by timer_create calls, which can call the function directly.
Removing the callback from the list used by add_ctl will reduce the
exposed interface.


>>> I can replace with -ENOSYS if you like.
>> It's necessary since we must be able to distinguish the errors.
>
> And what if the user requests a bogus event type - is it an invalid
> condition, or normal but not handled (thus ENOSYS)?

It's ENOSYS. Just like for system calls. You cannot distinguish
completely invalid values from values which are correct only on later
kernels. But: the first use is a bug while the latter is not a bug and
is needed to write robust and well performing apps. The former's
problems therefore are unimportant.


> Well, then I claim that I do not know a 'thing or two about interfaces
> the runtime programs expect to use', but instead I write those programs
> and I know my needs. And POSIX interfaces are the last ones I prefer to
> use.

Well, there it is. You look out for yourself while I make sure that all
the bases I can think of are covered.

Again, if you don't want to work on the generalization, fine. That's
your right. Nobody will think bad of you for doing this. But don't
expect that a) I'll not try to change it and b) I'll not object to the
changes being accepted as they are.


> What if it will not be called POSIX AIO, but instead some kind of 'true
> AIO' or 'real AIO' or maybe 'alternative AIO'? :)
> It is quite certain that the POSIX AIO interfaces are unlikely to be
> applied there...

Programmers don't like specialized OS-specific interfaces. AIO users
who put up with libaio are rare. The same will happen with any other
approach. The Samba use is symptomatic: they need portability even if
this costs a minute percentage of performance compared to a highly
specialized implementation.

There might be some aspects of POSIX AIO which could be implemented
better on Linux. But the important part in the name is the 'P'.

--
➧ Ulrich Drepper ➧ Red Hat, Inc. ➧ 444 Castro St ➧ Mountain View, CA ❖

2006-11-22 22:27:43

by Ulrich Drepper

Subject: Re: [take24 0/6] kevent: Generic event handling mechanism.

Evgeniy Polyakov wrote:
> On Wed, Nov 22, 2006 at 03:09:34PM +0300, Evgeniy Polyakov ([email protected]) wrote:
>> Ok, to solve the problem in the way which should be good for both I
>> decided to implement additional syscall which will allow to mark any
>> event as ready and thus wake up appropriate threads. If userspace will
>> request zero events to be marked as ready, syscall will just
>> interrupt/wakeup one of the listeners parked in syscall.

I'll wait for the new code drop to comment.


> Btw, what about putting an additional multiplexer into the add/remove/modify
> switch? It would be a logical 'ready' addon.

Is it needed? Usually this is done with a *_wait call with a timeout of
zero. That code path might have to be optimized but it should already
be there.

--
➧ Ulrich Drepper ➧ Red Hat, Inc. ➧ 444 Castro St ➧ Mountain View, CA ❖

2006-11-23 11:48:17

by Evgeniy Polyakov

Subject: Kevent POSIX timers support.

On Wed, Nov 22, 2006 at 01:44:16PM +0300, Evgeniy Polyakov ([email protected]) wrote:
> That is what I implemented.
> But in this case it will be impossible to have SIGEV_THREAD and SIGEV_KEVENT
> at the same time, it will be just the same as SIGEV_SIGNAL but with a
> different delivery mechanism. Is this what you expect from that?

Something like this morning's hack (compile-tested only).
If my thoughts are correct, I will create a simple application and
test whether it works.

Signed-off-by: Evgeniy Polyakov <[email protected]>

diff --git a/include/linux/posix-timers.h b/include/linux/posix-timers.h
index a7dd38f..4b9deb4 100644
--- a/include/linux/posix-timers.h
+++ b/include/linux/posix-timers.h
@@ -4,6 +4,7 @@
#include <linux/spinlock.h>
#include <linux/list.h>
#include <linux/sched.h>
+#include <linux/kevent_storage.h>

union cpu_time_count {
cputime_t cpu;
@@ -49,6 +50,9 @@ struct k_itimer {
sigval_t it_sigev_value; /* value word of sigevent struct */
struct task_struct *it_process; /* process to send signal to */
struct sigqueue *sigq; /* signal queue entry. */
+#ifdef CONFIG_KEVENT_TIMER
+ struct kevent_storage st;
+#endif
union {
struct {
struct hrtimer timer;
diff --git a/kernel/posix-timers.c b/kernel/posix-timers.c
index e5ebcc1..148a9f9 100644
--- a/kernel/posix-timers.c
+++ b/kernel/posix-timers.c
@@ -48,6 +48,8 @@
#include <linux/wait.h>
#include <linux/workqueue.h>
#include <linux/module.h>
+#include <linux/kevent.h>
+#include <linux/file.h>

/*
* Management arrays for POSIX timers. Timers are kept in slab memory
@@ -224,6 +226,95 @@ static int posix_ktime_get_ts(clockid_t
return 0;
}

+#ifdef CONFIG_KEVENT_TIMER
+static int posix_kevent_enqueue(struct kevent *k)
+{
+ struct k_itimer *tmr = k->event.ptr;
+ return kevent_storage_enqueue(&tmr->st, k);
+}
+static int posix_kevent_dequeue(struct kevent *k)
+{
+ struct k_itimer *tmr = k->event.ptr;
+ kevent_storage_dequeue(&tmr->st, k);
+ return 0;
+}
+static int posix_kevent_callback(struct kevent *k)
+{
+ return 1;
+}
+static int posix_kevent_init(void)
+{
+ struct kevent_callbacks tc = {
+ .callback = &posix_kevent_callback,
+ .enqueue = &posix_kevent_enqueue,
+ .dequeue = &posix_kevent_dequeue};
+
+ return kevent_add_callbacks(&tc, KEVENT_POSIX_TIMER);
+}
+
+extern struct file_operations kevent_user_fops;
+
+static int posix_kevent_init_timer(struct k_itimer *tmr, int fd)
+{
+ struct ukevent uk;
+ struct file *file;
+ struct kevent_user *u;
+ int err;
+
+ file = fget(fd);
+ if (!file) {
+ err = -EBADF;
+ goto err_out;
+ }
+
+ if (file->f_op != &kevent_user_fops) {
+ err = -EINVAL;
+ goto err_out_fput;
+ }
+
+ u = file->private_data;
+
+ memset(&uk, 0, sizeof(struct ukevent));
+
+ uk.type = KEVENT_POSIX_TIMER;
+ uk.id.raw_u64 = (unsigned long)(tmr); /* Just cast to something unique */
+ uk.ptr = tmr;
+
+ tmr->it_sigev_value.sival_ptr = file;
+
+ err = kevent_user_add_ukevent(&uk, u);
+ if (err)
+ goto err_out_fput;
+
+ fput(file);
+
+ return 0;
+
+err_out_fput:
+ fput(file);
+err_out:
+ return err;
+}
+
+static void posix_kevent_fini_timer(struct k_itimer *tmr)
+{
+ kevent_storage_fini(&tmr->st);
+}
+#else
+static int posix_kevent_init_timer(struct k_itimer *tmr, int fd)
+{
+ return -ENOSYS;
+}
+static int posix_kevent_init(void)
+{
+ return 0;
+}
+static void posix_kevent_fini_timer(struct k_itimer *tmr)
+{
+}
+#endif
+
+
/*
* Initialize everything, well, just everything in Posix clocks/timers ;)
*/
@@ -241,6 +332,11 @@ static __init int init_posix_timers(void
register_posix_clock(CLOCK_REALTIME, &clock_realtime);
register_posix_clock(CLOCK_MONOTONIC, &clock_monotonic);

+ if (posix_kevent_init()) {
+ printk(KERN_ERR "Failed to initialize kevent posix timers.\n");
+ BUG();
+ }
+
posix_timers_cache = kmem_cache_create("posix_timers_cache",
sizeof (struct k_itimer), 0, 0, NULL, NULL);
idr_init(&posix_timers_id);
@@ -343,23 +439,27 @@ static int posix_timer_fn(struct hrtimer

timr = container_of(timer, struct k_itimer, it.real.timer);
spin_lock_irqsave(&timr->it_lock, flags);
+
+ if (timr->it_sigev_notify & SIGEV_KEVENT) {
+ kevent_storage_ready(&timr->st, NULL, KEVENT_MASK_ALL);
+ } else {
+ if (timr->it.real.interval.tv64 != 0)
+ si_private = ++timr->it_requeue_pending;

- if (timr->it.real.interval.tv64 != 0)
- si_private = ++timr->it_requeue_pending;
-
- if (posix_timer_event(timr, si_private)) {
- /*
- * signal was not sent because of sig_ignor
- * we will not get a call back to restart it AND
- * it should be restarted.
- */
- if (timr->it.real.interval.tv64 != 0) {
- timr->it_overrun +=
- hrtimer_forward(timer,
- timer->base->softirq_time,
- timr->it.real.interval);
- ret = HRTIMER_RESTART;
- ++timr->it_requeue_pending;
+ if (posix_timer_event(timr, si_private)) {
+ /*
+ * signal was not sent because of sig_ignor
+ * we will not get a call back to restart it AND
+ * it should be restarted.
+ */
+ if (timr->it.real.interval.tv64 != 0) {
+ timr->it_overrun +=
+ hrtimer_forward(timer,
+ timer->base->softirq_time,
+ timr->it.real.interval);
+ ret = HRTIMER_RESTART;
+ ++timr->it_requeue_pending;
+ }
}
}

@@ -407,6 +507,9 @@ static struct k_itimer * alloc_posix_tim
kmem_cache_free(posix_timers_cache, tmr);
tmr = NULL;
}
+#ifdef CONFIG_KEVENT_TIMER
+ if (tmr)
+ kevent_storage_init(tmr, &tmr->st);
+#endif
return tmr;
}

@@ -424,6 +527,7 @@ static void release_posix_timer(struct k
if (unlikely(tmr->it_process) &&
tmr->it_sigev_notify == (SIGEV_SIGNAL|SIGEV_THREAD_ID))
put_task_struct(tmr->it_process);
+ posix_kevent_fini_timer(tmr);
kmem_cache_free(posix_timers_cache, tmr);
}

@@ -496,40 +600,52 @@ sys_timer_create(const clockid_t which_c
new_timer->it_sigev_signo = event.sigev_signo;
new_timer->it_sigev_value = event.sigev_value;

- read_lock(&tasklist_lock);
- if ((process = good_sigevent(&event))) {
- /*
- * We may be setting up this process for another
- * thread. It may be exiting. To catch this
- * case the we check the PF_EXITING flag. If
- * the flag is not set, the siglock will catch
- * him before it is too late (in exit_itimers).
- *
- * The exec case is a bit more invloved but easy
- * to code. If the process is in our thread
- * group (and it must be or we would not allow
- * it here) and is doing an exec, it will cause
- * us to be killed. In this case it will wait
- * for us to die which means we can finish this
- * linkage with our last gasp. I.e. no code :)
- */
+ if (event.sigev_notify & SIGEV_KEVENT) {
+ error = posix_kevent_init_timer(new_timer, event._sigev_un.kevent_fd);
+ if (error)
+ goto out;
+
+ process = current->group_leader;
spin_lock_irqsave(&process->sighand->siglock, flags);
- if (!(process->flags & PF_EXITING)) {
- new_timer->it_process = process;
- list_add(&new_timer->list,
- &process->signal->posix_timers);
- spin_unlock_irqrestore(&process->sighand->siglock, flags);
- if (new_timer->it_sigev_notify == (SIGEV_SIGNAL|SIGEV_THREAD_ID))
- get_task_struct(process);
- } else {
- spin_unlock_irqrestore(&process->sighand->siglock, flags);
- process = NULL;
+ new_timer->it_process = process;
+ list_add(&new_timer->list, &process->signal->posix_timers);
+ spin_unlock_irqrestore(&process->sighand->siglock, flags);
+ } else {
+ read_lock(&tasklist_lock);
+ if ((process = good_sigevent(&event))) {
+ /*
+ * We may be setting up this process for another
+ * thread. It may be exiting. To catch this
+ * case the we check the PF_EXITING flag. If
+ * the flag is not set, the siglock will catch
+ * him before it is too late (in exit_itimers).
+ *
+ * The exec case is a bit more invloved but easy
+ * to code. If the process is in our thread
+ * group (and it must be or we would not allow
+ * it here) and is doing an exec, it will cause
+ * us to be killed. In this case it will wait
+ * for us to die which means we can finish this
+ * linkage with our last gasp. I.e. no code :)
+ */
+ spin_lock_irqsave(&process->sighand->siglock, flags);
+ if (!(process->flags & PF_EXITING)) {
+ new_timer->it_process = process;
+ list_add(&new_timer->list,
+ &process->signal->posix_timers);
+ spin_unlock_irqrestore(&process->sighand->siglock, flags);
+ if (new_timer->it_sigev_notify == (SIGEV_SIGNAL|SIGEV_THREAD_ID))
+ get_task_struct(process);
+ } else {
+ spin_unlock_irqrestore(&process->sighand->siglock, flags);
+ process = NULL;
+ }
+ }
+ read_unlock(&tasklist_lock);
+ if (!process) {
+ error = -EINVAL;
+ goto out;
}
- }
- read_unlock(&tasklist_lock);
- if (!process) {
- error = -EINVAL;
- goto out;
}
} else {
new_timer->it_sigev_notify = SIGEV_SIGNAL;


--
Evgeniy Polyakov

2006-11-23 12:20:40

by Evgeniy Polyakov

Subject: Re: [take24 0/6] kevent: Generic event handling mechanism.

On Wed, Nov 22, 2006 at 02:22:15PM -0800, Ulrich Drepper ([email protected]) wrote:
> >I repeat - the timeout is needed to tell the kernel the maximum possible
> >timeframe the syscall can live. When you tell me why you want the syscall
> >to be interrupted when some absolute time is on the clock, instead of
> >having a special event for that, then OK.
>
> This goes together with...
>
>
> >I think I know why you want absolute time there - because glibc converts
> >most timeouts to absolute time, since the POSIX waiting primitive
> >pthread_cond_timedwait() works only with it.
>
> I did not make the decision to use absolute timeouts/deadlines. This is
> what is needed in many situations. It's the more general way to specify
> delays. These are real-world requirements which were taken into account
> when designing the interfaces.
>
> For most cases I would agree that when doing AIO you need relative
> timeouts. But the event handling is not about AIO alone. It's all
> kinds of events and some/many are wall clock related. And it is
> definitely necessary in some situations to be able to interrupt if the
> clock jumps ahead. If a program deals with devices in the real world
> this can be crucial. The new event handling must be generic enough to
> accommodate all these uses and using struct timespec* plus eventually
> flags does not add any measurable overhead so there is no reason to not
> do it right.

The timeout is not about AIO or any other event type (there are a lot of
them already, as you can see); it is only about the syscall itself.
Please point me to _any_ syscall out there which uses absolute time
(except settimeofday() and similar syscalls).

> >Kevent converts it to jiffies since it uses wait_event() and friends,
> >and jiffies do not carry information about which clock to use.
>
> Then this points to a place in the implementation which needs changing.
> The interface cannot be restricted just because of what the
> implementation currently happens to allow.

Btw, do you propose to change all users of wait_event()?

The interface is not restricted, it is just different from what you want
it to be, and you have not shown why it requires changes.

Btw, there are _no_ interfaces similar to 'wait event with absolute
times' in the kernel.

> > /* Short-circuit ignored signals. */
> > if (sig_ignored(p, sig)) {
> > ret = 1;
> > goto out;
> > }
> >
> >almost the same happens when a signal is delivered using kevent (a special
> >case) - the pending mask is not updated.
>
> Yes, and how do you set the signal mask atomically with respect to
> registering and unregistering signals with kevent and the syscall itself?
> You cannot. But this is exactly what is resolved by adding the signal
> mask parameter.

Kevent signal registration is atomic with respect to other kevent
syscalls: control syscalls are protected by a mutex, and waiting syscalls
work with the queue, which is protected by its own lock.
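That two-lock scheme can be sketched in userspace with pthreads (a hypothetical mirror with made-up names; the real kernel code uses its own mutexes and spinlocks):

```c
#include <pthread.h>

/* Hypothetical mirror of the kevent_user locking scheme: control
 * operations serialize on ctl_mutex, the ready queue has its own lock. */
struct hyp_kevent_user {
        pthread_mutex_t ctl_mutex;
        pthread_mutex_t ready_lock;
        int ready_num;                  /* length of the ready queue */
};

static void hyp_ctl_add_ready(struct hyp_kevent_user *u)
{
        pthread_mutex_lock(&u->ctl_mutex);   /* vs. other control calls */
        pthread_mutex_lock(&u->ready_lock);  /* vs. waiters */
        u->ready_num++;
        pthread_mutex_unlock(&u->ready_lock);
        pthread_mutex_unlock(&u->ctl_mutex);
}

static int hyp_dequeue_ready(struct hyp_kevent_user *u)
{
        int got = 0;

        pthread_mutex_lock(&u->ready_lock);
        if (u->ready_num > 0) {
                u->ready_num--;
                got = 1;
        }
        pthread_mutex_unlock(&u->ready_lock);
        return got;
}
```

Registration and dequeue never observe a half-updated queue, which is the atomicity claim being made here.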

> Programs which don't need the functionality simply pass a NULL pointer
> and the cost is once again not measurable. But don't restrict the
> functionality just because you don't see a use for this in your small world.
>
> Yes, we could (later again) add new syscalls. But this is plain stupid.
> I would love to never have had the epoll_wait or select syscalls and just
> have epoll_pwait and pselect, since the functionality is a superset. As
> it is we have a larger kernel ABI. Here we can stop making the same
> mistake again.
>
> For the userlevel side we might even have separate interfaces, one with
> and one without the signal mask parameter. But that's userlevel; both
> functions would use the same syscall.

Let me formulate the signal problem here; please tell me whether it is
correct or not.

A user registers some async signal notifications and calls poll(),
waiting for some file descriptors to become ready. When it is
interrupted there is no way to know what really happened first - whether
the signal was delivered or the file descriptor became ready.

Is it correct?
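For reference, the race being described is the one pselect()/ppoll() were added to close: a signal unblocked just before poll() can fire between the unblock and the sleep and then be "lost" while poll() blocks. A sketch of the atomic variant (ppoll is Linux's poll-side counterpart of pselect):

```c
#define _GNU_SOURCE
#include <poll.h>
#include <signal.h>
#include <unistd.h>

/* Sleep in poll() while *unblocked is atomically installed as the
 * signal mask; a signal arriving at any point during the sleep
 * interrupts it with EINTR instead of being lost beforehand. */
static int wait_fd_or_signal(int fd, const sigset_t *unblocked)
{
        struct pollfd pfd = { .fd = fd, .events = POLLIN };

        return ppoll(&pfd, 1, NULL, unblocked);
}
```

The question in this thread is whether kevent's unified queue makes this mask-swap unnecessary.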

If it is, let me explain why this situation cannot happen with kevent:
signals are not delivered in the old way; instead they are queued into
the same queue as the file descriptor events. Queueing is atomic and the
pending signal mask is not updated, so the user simply reads one event
after another, which automatically (since delivery is atomic) means that
whatever was read first also happened first.

So why, in that situation, would we need to specify a signal mask that
blocks some signals from _async_ delivery, when there is _no_ async
delivery?
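As an aside, the delivery model argued for here - block the signal from async delivery and read it as an ordinary queued event - is the same shape Linux later exposed as signalfd(2) (which postdates this thread). A sketch:

```c
#define _GNU_SOURCE
#include <signal.h>
#include <sys/signalfd.h>
#include <unistd.h>

/* Turn a signal into an ordinary readable event: block async
 * delivery, then obtain an fd that queues the signal as data. */
static int signal_as_event_fd(int signo)
{
        sigset_t mask;

        sigemptyset(&mask);
        sigaddset(&mask, signo);
        if (sigprocmask(SIG_BLOCK, &mask, NULL) < 0)
                return -1;
        return signalfd(-1, &mask, 0);
}
```

With the signal read from a queue rather than interrupting the thread, ordering against fd readiness is whatever order the events were queued in.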

> >>There are other scenarios like this. Fact is, signal mask handling is
> >>necessary and it cannot be folded into the event handling, it's
> >>orthogonal.
> >
> >You have too narrow look.
> >Look broader - pselect() has a signal mask to prevent the race between
> >async signal delivery and file descriptor readiness. With kevent both of
> >those events are delivered through the same queue, so there is no race,
> >and kevent syscalls do not need that workaround for a 20-year-old design
> >which cannot handle events other than fds.
>
> Your failure to understand the signal model leads to wrong conclusions.
> There are races, several of them, and you cannot do anything without
> signal mask parameters. I've explained this before.

Please refer to my explanation above and point out in that example what
we are talking about. It seems we do not understand each other.

> >>Avoiding these callbacks would help reducing the kernel interface,
> >>especially for this useless since inferior timer implementation.
> >
> >You completely do not want to understand how kevent works and why they
> >are needed; if you would accept that there are opinions different from
> >yours, then probably we could make some progress.
>
> Meanwhile I think I know very well how they work.

If that were true, I would be very happy. Definitely.

> >Those callbacks are needed to support different types of objects, which
> >can produce events, with the same interface.
>
> Yes, but it is not necessary to expose all the different types in the
> userlevel APIs. That's the issue. Reduce the exposure of kernel
> functionality to userlevel APIs.
>
> If you integrate the timer handling into the POSIX timer syscalls the
> callbacks in your timer patch might not need to be there. At least the
> enqueue callback, if I remember correctly. All enqueue operations are
> initiated by timer_create calls which can call the function directly.
> Removing the callback from the list used by add_ctl will reduce the
> exposed interface.

I posted a patch to implement kevent support for POSIX timers; it is
quite simple in the existing model. There is no need to remove anything -
that preserves the flexibility to create usage models other than the one
required by a fairly small part of the users.

> >>>I can replace with -ENOSYS if you like.
> >>It's necessary since we must be able to distinguish the errors.
> >
> >And what if the user requests a bogus event type - is it an invalid
> >condition, or normal but not handled (thus ENOSYS)?
>
> It's ENOSYS. Just like for system calls. You cannot distinguish
> completely invalid values from values which are correct only on later
> kernels. But: the first use is a bug while the latter is not a bug and
> is needed to write robust and well performing apps. The former's
> problems therefore are unimportant.

I implemented it to return -ENOSYS for the case when the event type is
smaller than the maximum allowed but no subsystem is registered, and
-EINVAL for the case when the requested type is higher.

> >Well, then I claim that I do not know a 'thing or two about interfaces
> >the runtime programs expect to use', but instead I write those programs
> >and I know my needs. And POSIX interfaces are the last ones I prefer to
> >use.
>
> Well, there it is. You look out for yourself while I make sure that all
> the bases I can think of are covered.
>
> Again, if you don't want to work on the generalization, fine. That's
> your right. Nobody will think bad of you for doing this. But don't
> expect that a) I'll not try to change it and b) I'll not object to the
> changes being accepted as they are.

It is not about generalization, but about those who do practical work
versus those who prefer to spread theoretical thoughts, which results in
several months of fruitless, empty discussion.

> >What if it will not be called POSIX AIO, but instead some kind of 'true
> >AIO' or 'real AIO' or maybe 'alternative AIO'? :)
> >It is quite certain that the POSIX AIO interfaces are unlikely to be
> >applied there...
>
> Programmers don't like specialized OS-specific interfaces. AIO users
> who put up with libaio are rare. The same will happen with any other
> approach. The Samba use is symptomatic: they need portability even if
> this costs a minute percentage of performance compared to a highly
> specialized implementation.

Do not speak for everyone - it is not some kind of feudalism with only
one opinion allowed - respect those who do not like or do not want what
you propose they use.

> There might be some aspects of POSIX AIO which could be implemented
> better on Linux. But the important part in the name is the 'P'.

I will create a completely different model; POSIX is simply not designed
for that. That model allows specifying a set of tasks to be performed on
an object, completely asynchronously to the user, before the object is
returned - for example, specify a destination socket and a filename, and
async sendfile will asynchronously open the file, transfer it to the
remote destination and probably even close it (or return the file
descriptor). The same can be applied to AIO read/write.
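The operation-chain idea can be sketched as a submission descriptor. Everything below is a hypothetical illustration, not an interface from the patchset; a synchronous stand-in shows the semantics a real implementation would complete asynchronously via kevent:

```c
#include <fcntl.h>
#include <unistd.h>

/* Hypothetical one-shot descriptor for the chained operation
 * "open file, send it to a socket, close it". */
struct hyp_async_sendfile {
        const char *path;   /* file to open */
        int sock;           /* destination descriptor */
};

/* Synchronous reference semantics: returns bytes sent or -1. */
static long hyp_sendfile_sync(const struct hyp_async_sendfile *req)
{
        char buf[4096];
        long total = 0;
        ssize_t n;
        int fd = open(req->path, O_RDONLY);

        if (fd < 0)
                return -1;
        while ((n = read(fd, buf, sizeof(buf))) > 0) {
                if (write(req->sock, buf, n) != n) {
                        close(fd);
                        return -1;
                }
                total += n;
        }
        close(fd);      /* the chain includes the close */
        return n < 0 ? -1 : total;
}
```

The point of the model is that the caller never sees the intermediate file descriptor unless it asks for it.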

> --
> ➧ Ulrich Drepper ➧ Red Hat, Inc. ➧ 444 Castro St ➧ Mountain View,
> CA ❖

--
Evgeniy Polyakov

2006-11-23 12:25:25

by Evgeniy Polyakov

Subject: Re: [take24 0/6] kevent: Generic event handling mechanism.

On Wed, Nov 22, 2006 at 02:24:00PM -0800, Ulrich Drepper ([email protected]) wrote:
> Evgeniy Polyakov wrote:
> >On Wed, Nov 22, 2006 at 03:09:34PM +0300, Evgeniy Polyakov
> >([email protected]) wrote:
> >>Ok, to solve the problem in the way which should be good for both I
> >>decided to implement additional syscall which will allow to mark any
> >>event as ready and thus wake up appropriate threads. If userspace will
> >>request zero events to be marked as ready, syscall will just
> >>interrupt/wakeup one of the listeners parked in syscall.
>
> I'll wait for the new code drop to comment.

I posted it.

> >Btw, what about putting an additional multiplexer into the add/remove/modify
> >switch? It would be a logical 'ready' addon.
>
> Is it needed? Usually this is done with a *_wait call with a timeout of
> zero. That code path might have to be optimized but it should already
> be there.

It does not allow marking events as ready.
And the current interfaces wake up either when the timeout is zero (in
which case the thread itself does not sleep and can process events), or
when there is _new_ work - and since there is no _new_ work when the
thread that was woken up to process it has been killed, the kernel does
not think that anything is wrong.

> --
> ➧ Ulrich Drepper ➧ Red Hat, Inc. ➧ 444 Castro St ➧ Mountain View,
> CA ❖

--
Evgeniy Polyakov

2006-11-23 12:26:28

by Evgeniy Polyakov

Subject: Re: [take24 0/6] kevent: Generic event handling mechanism.

On Wed, Nov 22, 2006 at 01:02:00PM -0800, Ulrich Drepper ([email protected]) wrote:
> Evgeniy Polyakov wrote:
> >But in this case it will be impossible to have SIGEV_THREAD and
> >SIGEV_KEVENT
> >at the same time, it will be just the same as SIGEV_SIGNAL but with a
> >different delivery mechanism. Is this what you expect from that?
>
> Yes, that's expected. The event is for the queue, not directed to a
> specific thread.
>
> If in future we want to think about preferably waking a specific thread
> we can then think about it. But I doubt that'll be beneficial. The
> thread specific part in the signal handling is only used to implement
> the SIGEV_THREAD notification.

OK, so please review the patch I sent; if it is OK from a design point
of view, I will run some tests here.

> --
> ➧ Ulrich Drepper ➧ Red Hat, Inc. ➧ 444 Castro St ➧ Mountain View,
> CA ❖

--
Evgeniy Polyakov

2006-11-23 20:27:11

by Ulrich Drepper

Subject: Re: Kevent POSIX timers support.

Evgeniy Polyakov wrote:
> +static int posix_kevent_init(void)
> +{
> + struct kevent_callbacks tc = {
> + .callback = &posix_kevent_callback,
> + .enqueue = &posix_kevent_enqueue,
> + .dequeue = &posix_kevent_dequeue};

How do we prevent somebody from trying to register a POSIX timer event
source with kevent_ctl(KEVENT_CTL_ADD)? This should only be possible
from sys_timer_create and nowhere else.

Can you add a parameter to kevent_enqueue indicating this is a call from
inside the kernel and then ignore certain enqueue callbacks?


> @@ -343,23 +439,27 @@ static int posix_timer_fn(struct hrtimer
>
> timr = container_of(timer, struct k_itimer, it.real.timer);
> spin_lock_irqsave(&timr->it_lock, flags);
> +
> + if (timr->it_sigev_notify & SIGEV_KEVENT) {
> + kevent_storage_ready(&timr->st, NULL, KEVENT_MASK_ALL);
> + } else {

We need to pass the data in the sigev_value member of the struct
sigevent passed to timer_create through to the caller. I don't see that
being done here nor when the timer is created. Am I missing something?
The sigev_value value should be stored in the user/ptr member of struct
ukevent.


> + if (event.sigev_notify & SIGEV_KEVENT) {

Don't use a bit. It makes no sense to combine SIGEV_SIGNAL with
SIGEV_KEVENT etc. Only SIGEV_THREAD_ID is a special case.

Just define SIGEV_KEVENT to 3 and replace the tests like the one cited
above with

if (timr->it_sigev_notify == SIGEV_KEVENT)
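The hazard with a bit test can be shown directly: on Linux the sigev_notify values form an enumeration (SIGEV_SIGNAL=0, SIGEV_NONE=1, SIGEV_THREAD=2) with SIGEV_THREAD_ID=4 as the only OR-ed flag, so a SIGEV_KEVENT chosen as a power of two risks colliding under `&`, while equality against an unused small value (3, as suggested) cannot. Hypothetical values for illustration:

```c
/* Hypothetical values mirroring the Linux sigevent constants. */
#define HYP_SIGEV_THREAD_ID  4
#define HYP_SIGEV_KEVENT_BIT 4   /* bit-style choice: collides with above */
#define HYP_SIGEV_KEVENT_VAL 3   /* free enumeration value, as suggested */

/* Fragile: matches any notify value that happens to share the bit. */
static int is_kevent_bit(int notify)
{
        return (notify & HYP_SIGEV_KEVENT_BIT) != 0;
}

/* Robust: matches exactly one enumeration value. */
static int is_kevent_val(int notify)
{
        return notify == HYP_SIGEV_KEVENT_VAL;
}
```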

--
➧ Ulrich Drepper ➧ Red Hat, Inc. ➧ 444 Castro St ➧ Mountain View, CA ❖

2006-11-23 20:36:09

by Ulrich Drepper

Subject: Re: [take24 0/6] kevent: Generic event handling mechanism.

Evgeniy Polyakov wrote:
>>> Btw, what about putting an additional multiplexer into the add/remove/modify
>>> switch? It would be a logical 'ready' addon.
>> Is it needed? Usually this is done with a *_wait call with a timeout of
>> zero. That code path might have to be optimized but it should already
>> be there.
>
> It does not allow marking events as ready.
> And the current interfaces wake up either when the timeout is zero (in
> which case the thread itself does not sleep and can process events), or
> when there is _new_ work - and since there is no _new_ work when the
> thread that was woken up to process it has been killed, the kernel does
> not think that anything is wrong.

Rather than mark an existing entry as ready, how about a call to inject
a new ready event?

This would be useful to implement functionality at userlevel and still
use an event queue to announce the availability. Without this type of
functionality we'd need to use indirect notification via signal or pipe
or something like that.
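The "indirect notification via signal or pipe" fallback mentioned above is the familiar self-pipe pattern; without an injection call, a userlevel producer would have to do something like this sketch:

```c
#include <unistd.h>

/* Post a wakeup byte to a pipe whose read end is registered with the
 * event queue; the actual payload travels out of band.  Costs a
 * syscall plus a dedicated fd pair per queue, which is exactly what
 * an "inject ready event" call would avoid. */
static int notify_via_pipe(int pipe_wr)
{
        return write(pipe_wr, "", 1) == 1 ? 0 : -1;
}
```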

--
➧ Ulrich Drepper ➧ Red Hat, Inc. ➧ 444 Castro St ➧ Mountain View, CA ❖

2006-11-23 22:24:01

by Ulrich Drepper

Subject: Re: [take24 0/6] kevent: Generic event handling mechanism.

Evgeniy Polyakov wrote:
> On Wed, Nov 22, 2006 at 02:22:15PM -0800, Ulrich Drepper ([email protected]) wrote:
> Timeouts are not about AIO or any other event types (there are a lot of
> them already as you can see), it is only about syscall itself.
> Please point me to _any_ syscall out there which uses absolute time
> (except settimeofday() and similar syscalls).

futex(FUTEX_LOCK_PI).


> Btw, do you propose to change all users of wait_event()?

Which users?


> Interface is not restricted, it is just different from what you want it
> to be, and you did not show why it requires changes.

No, it is restricted, because I cannot express something like an
absolute timeout/deadline. If the parameter were a struct timespec*,
then at any time we could implement relative timeouts with and without
observance of settimeofday/NTP, as well as absolute timeouts. This is
what makes the interface generic and unrestricted, while your current
version cannot be used for the latter.


> kevent signal registering is atomic with respect to other kevent
> syscalls: control syscalls are protected by mutex and waiting syscalls
> work with queue, which is protected by appropriate lock.

It is about atomicity wrt the signal mask manipulation which would
have to precede the kevent_wait call and the call itself (and
registering a signal for kevent delivery). This is not atomic.


> Let me formulate signal problem here, please point me if it is correct
> or not.

There are a myriad of different scenarios, it makes no sense to pick
one. The interface must be generic to cover them all, I don't know how
often I have to repeat this.


> User registers some async signal notifications and calls poll() waiting
> for some file descriptors to become ready. When it is interrupted there
> is no knowledge about what really happened first - whether the signal
> was delivered or the file descriptor became ready.

The order is unimportant. You change the signal mask, for instance, if
the time when a thread is waiting in poll() is the only time when a
signal can be handled. Or vice versa, it's the time when signals are
not wanted. And these are per-thread decisions.

Signal handlers and kevent registrations for signals are process-wide
decisions. And furthermore: with kevent delivered signals there is no
signal mask anymore (at least you seem to not check it). Even if this
would be done it doesn't change the fact that you cannot use signals the
way many programs want to.

Fact is that without a signal queue you cannot implement the above
cases. You cannot block/unblock a signal for a specific thread. You
also cannot work together with signals which cannot be delivered through
kevent. This is the case for existing code in a program which happens
to use also kevent and it is the case if there is more than one possible
recipient. With kevent signals can be attached to one kevent queue only
but the recipients (different threads or only different parts of a
program) need not use the same kevent queue.

I've said from the start that you cannot possibly expect that programs
are not using signal delivery in the current form. And the complete
loss of blocking signals for individual threads makes the kevent-based
signal delivery incomplete (in a non-fixable form) anyway.


> In case it is, let me explain why this situation can not happen with
> kevent: since signals are not delivered in the old way, but instead
> are queued into the same queue where file descriptors are, and queueing
> is atomic, and the pending signal mask is not updated, the user will
> only read one event after another, which automatically (since delivery
> is atomic) means that whatever was read first is what happened first.

This really has nothing to do with the problem.



> I posted a patch to implement kevent support for posix timers, it is
> quite simple in existing model. No need to remove anything,

Surely you don't suggest keeping your original timer patch?


> I implemented it to return -ENOSYS for the case when the event type is
> smaller than the maximum allowed and no subsystem is registered, and
> -EINVAL for the case when the requested type is higher.

What is the "maximum allowed"? ENOSYS must be returned for all values
which could potentially in future be used as a valid type value. If you
limit the values which are treated this way you are setting a fixed
upper limit for the type values which _ever_ can be used.


> It is not about generalization, but about those who do practical work
> and those who prefer to spread theoretical thoughts, which result in
> several month of unused empty discussions.

I've told you, then don't work on these parts. I'll get the changes I
think are needed implemented by somebody else or I'll do it myself. If
you say that only those who implement something have a say in the way
this is done then this is fine with me. But you have to realize that
you're not the one who will make all the final decisions.

--
➧ Ulrich Drepper ➧ Red Hat, Inc. ➧ 444 Castro St ➧ Mountain View, CA ❖

2006-11-24 09:53:10

by Evgeniy Polyakov

[permalink] [raw]
Subject: Re: Kevent POSIX timers support.

On Thu, Nov 23, 2006 at 12:26:15PM -0800, Ulrich Drepper ([email protected]) wrote:
> Evgeniy Polyakov wrote:
> >+static int posix_kevent_init(void)
> >+{
> >+ struct kevent_callbacks tc = {
> >+ .callback = &posix_kevent_callback,
> >+ .enqueue = &posix_kevent_enqueue,
> >+ .dequeue = &posix_kevent_dequeue};
>
> How do we prevent that somebody tries to register a POSIX timer event
> source with kevent_ctl(KEVENT_CTL_ADD)? This should only be possible
> from sys_timer_create and nowhere else.
>
> Can you add a parameter to kevent_enqueue indicating this is a call from
> inside the kernel and then ignore certain enqueue callbacks?

I think we need some set of flags for callbacks - where they can be
called, maybe even from which context and so on. So userspace will not
be allowed to create such timers through the kevent API.
I will do it for the release.

> >@@ -343,23 +439,27 @@ static int posix_timer_fn(struct hrtimer
> >
> > timr = container_of(timer, struct k_itimer, it.real.timer);
> > spin_lock_irqsave(&timr->it_lock, flags);
> >+
> >+ if (timr->it_sigev_notify & SIGEV_KEVENT) {
> >+ kevent_storage_ready(&timr->st, NULL, KEVENT_MASK_ALL);
> >+ } else {
>
> We need to pass the data in the sigev_value member of the struct
> sigevent structure passed to timer_create to the caller. I don't see it
> being done here nor when the timer is created. Do I miss something?
> The sigev_value value should be stored in the user/ptr member of struct
> ukevent.

sigev_value was stored in the k_itimer structure; I just do not know
where to put it in the ukevent provided to userspace - it can be placed
in the pointer value if you like.

> >+ if (event.sigev_notify & SIGEV_KEVENT) {
>
> Don't use a bit. It makes no sense to combine SIGEV_SIGNAL with
> SIGEV_KEVENT etc. Only SIGEV_THREAD_ID is a special case.
>
> Just define SIGEV_KEVENT to 3 and replace the tests like the one cited
> above with
>
> if (timr->it_sigev_notify == SIGEV_KEVENT)

Ok.

> --
> ➧ Ulrich Drepper ➧ Red Hat, Inc. ➧ 444 Castro St ➧ Mountain View,
> CA ❖

--
Evgeniy Polyakov

2006-11-24 10:58:50

by Evgeniy Polyakov

[permalink] [raw]
Subject: Re: [take24 0/6] kevent: Generic event handling mechanism.

On Thu, Nov 23, 2006 at 02:23:12PM -0800, Ulrich Drepper ([email protected]) wrote:
> Evgeniy Polyakov wrote:
> >On Wed, Nov 22, 2006 at 02:22:15PM -0800, Ulrich Drepper
> >([email protected]) wrote:
> >Timeouts are not about AIO or any other event types (there are a lot of
> >them already as you can see), it is only about syscall itself.
> >Please point me to _any_ syscall out there which uses absolute time
> >(except settimeofday() and similar syscalls).
>
> futex(FUTEX_LOCK_PI).

It just sets an hrtimer with absolute time and sleeps - it can achieve
the same goals using a mechanism similar to wait_event().

> >Btw, do you propose to change all users of wait_event()?
>
> Which users?

Any users which use wait_event() or schedule_timeout(). Futex for
example - it lives perfectly OK with relative timeouts provided to
schedule_timeout() - the same (roughly speaking, of course) is done in kevent.

> >Interface is not restricted, it is just different from what you want it
> >to be, and you did not show why it requires changes.
>
> No, it is restricted because I cannot express something like an absolute
> timeout/deadline. If the parameter would be a struct timespec* then at
> any time we can implement either relative timeouts w/ and w/out
> observance of settimeofday/ntp and absolute timeouts. This is what
> makes the interface generic and unrestricted while your current version
> cannot be used for the latter.

I think I have said already several times that absolute timeouts are not
related to the syscall execution process. But you seem not to hear me
and insist.

Ok, I will change the waiting syscalls to have a 'flags' parameter and a
'struct timespec' timeout parameter. A special bit in the flags will
result in an additional timer setup which will fire at the absolute
timeout and wake up those who wait...

> >kevent signal registering is atomic with respect to other kevent
> >syscalls: control syscalls are protected by mutex and waiting syscalls
> >work with queue, which is protected by appropriate lock.
>
> It is about atomicity wrt the signal mask manipulation which would
> have to precede the kevent_wait call and the call itself (and
> registering a signal for kevent delivery). This is not atomic.

If the signal mask is updated from userspace it should be done through
kevent - adding/removing different kevent signals. The mask of pending
signals is not updated for special kevent signals.

> >Let me formulate signal problem here, please point me if it is correct
> >or not.
>
> There are a myriad of different scenarios, it makes no sense to pick
> one. The interface must be generic to cover them all, I don't know how
> often I have to repeat this.

The whole signal mask was added by POSIX exactly for that single
practical race in the event dispatching mechanism, which can not handle
other types of events like signals.

> >User registers some async signal notifications and calls poll() waiting
> >for some file descriptors to become ready. When it is interrupted there
> >is no knowledge about what really happened first - whether the signal
> >was delivered or the file descriptor became ready.
>
> The order is unimportant. You change the signal mask, for instance, if
> the time when a thread is waiting in poll() is the only time when a
> signal can be handled. Or vice versa, it's the time when signals are
> not wanted. And these are per-thread decisions.
>
> Signal handlers and kevent registrations for signals are process-wide
> decisions. And furthermore: with kevent delivered signals there is no
> signal mask anymore (at least you seem to not check it). Even if this
> would be done it doesn't change the fact that you cannot use signals the
> way many programs want to.

There is a major contradiction here: you say that programmers will use
old-style signal delivery and want me to add a signal mask to prevent
that delivery, so signals would be in the blocked mask. But when I say
that the current kevent signal delivery does not update the pending
signal mask, which is the same as putting signals into the blocked mask,
you say that it is not what is required.

> Fact is that without a signal queue you cannot implement the above
> cases. You cannot block/unblock a signal for a specific thread. You
> also cannot work together with signals which cannot be delivered through
> kevent. This is the case for existing code in a program which happens
> to use also kevent and it is the case if there is more than one possible
> recipient. With kevent signals can be attached to one kevent queue only
> but the recipients (different threads or only different parts of a
> program) need not use the same kevent queue.

The signal queue is replaced with the kevent queue, and it is in sync
with all other kevents.
Programmers who want to use kevents will use them (if a miracle happens
and we agree that kevent is good for inclusion), and they will know how
kevent signal delivery works.

> I've said from the start that you cannot possibly expect that programs
> are not using signal delivery in the current form. And the complete
> loss of blocking signals for individual threads makes the kevent-based
> signal delivery incomplete (in a non-fixable form) anyway.

Having a sigmask parameter is the same as creating kevent signal delivery.

And, btw, programmers can change the signal mask before calling the
syscall, since inside the syscall there is a gap between its start and
the sigprocmask() call.

> >In case it is, let me explain why this situation can not happen with
> >kevent: since signals are not delivered in the old way, but instead
> >are queued into the same queue where file descriptors are, and queueing
> >is atomic, and the pending signal mask is not updated, the user will
> >only read one event after another, which automatically (since delivery
> >is atomic) means that whatever was read first is what happened first.
>
> This really has nothing to do with the problem.

It is the only practical example of the need for that signal mask.
And it can be perfectly handled by kevent.

> >I posted a patch to implement kevent support for posix timers, it is
> >quite simple in existing model. No need to remove anything,
>
> Surely you don't suggest keeping your original timer patch?

Of course not - kevent timers are more scalable than POSIX timers (the
latter uses idr, which is slower than a balanced binary tree, since it
appears to use a radix-tree-like algorithm), and the POSIX interface is
much more inconvenient to use than a simple add/wait.

> >I implemented it to return -ENOSYS for the case when the event type is
> >smaller than the maximum allowed and no subsystem is registered, and
> >-EINVAL for the case when the requested type is higher.
>
> What is the "maximum allowed"? ENOSYS must be returned for all values
> which could potentially in future be used as a valid type value. If you
> limit the values which are treated this way you are setting a fixed
> upper limit for the type values which _ever_ can be used.

The upper limit is for the current version - when a new type is added
the limit is increased, just like the maximum number of syscalls.
Ok, I will use -ENOSYS for all cases.

> >It is not about generalization, but about those who do practical work
> >and those who prefer to spread theoretical thoughts, which result in
> >several month of unused empty discussions.
>
> I've told you, then don't work on these parts. I'll get the changes I
> think are needed implemented by somebody else or I'll do it myself. If
> you say that only those who implement something have a say in the way
> this is done then this is fine with me. But you have to realize that
> you're not the one who will make all the final decisions.

Because our void discussion seems never to end, kevent is left in a hung
state. I would definitely prefer a final word from the kernel
maintainers about including or declining kevent, but they keep silent,
since they are looking not only for my decision as the author, but also
for the different opinions of the potential users.

> --
> ➧ Ulrich Drepper ➧ Red Hat, Inc. ➧ 444 Castro St ➧ Mountain View,
> CA ❖

--
Evgeniy Polyakov

2006-11-24 11:00:46

by Evgeniy Polyakov

[permalink] [raw]
Subject: Re: [take24 0/6] kevent: Generic event handling mechanism.

On Thu, Nov 23, 2006 at 12:34:50PM -0800, Ulrich Drepper ([email protected]) wrote:
> Evgeniy Polyakov wrote:
> >>>Btw, what about putting aditional multiplexer into add/remove/modify
> >>>switch? There will be logical 'ready' addon?
> >>Is it needed? Usually this is done with a *_wait call with a timeout of
> >>zero. That code path might have to be optimized but it should already
> >>be there.
> >
> >It does not allow marking events as ready.
> >And the current interfaces wake up either when the timeout is zero (in
> >that case the thread does not sleep and can process events itself), or
> >when there is _new_ work; since there is no _new_ work when the thread
> >awakened to process it was killed, the kernel does not think that
> >something is wrong.
>
> Rather than mark an existing entry as ready, how about a call to inject
> a new ready event?
>
> This would be useful to implement functionality at userlevel and still
> use an event queue to announce the availability. Without this type of
> functionality we'd need to use indirect notification via signal or pipe
> or something like that.

With the provided patch it is possible to wake up 'for free' - just call
kevent_ctl(ready) with a zero number of ready events, so the thread will
be awakened if it was in poll(kevent_fd), kevent_wait() or
kevent_get_events().

> --
> ➧ Ulrich Drepper ➧ Red Hat, Inc. ➧ 444 Castro St ➧ Mountain View,
> CA ❖

--
Evgeniy Polyakov

2006-11-27 18:21:54

by Ulrich Drepper

[permalink] [raw]
Subject: Re: Kevent POSIX timers support.

Evgeniy Polyakov wrote:
>> We need to pass the data in the sigev_value member of the struct
>> sigevent structure passed to timer_create to the caller. I don't see it
>> being done here nor when the timer is created. Do I miss something?
>> The sigev_value value should be stored in the user/ptr member of struct
>> ukevent.
>
> sigev_value was stored in k_itimer structure, I just do not know where
> to put it in the ukevent provided to userspace - it can be placed in
> pointer value if you like.

sigev_value is a union and the largest element is a pointer. So,
transporting the pointer value is sufficient and it should be passed up
to the user in the ptr member of struct ukevent.

--
➧ Ulrich Drepper ➧ Red Hat, Inc. ➧ 444 Castro St ➧ Mountain View, CA ❖

2006-11-27 18:24:09

by Ulrich Drepper

[permalink] [raw]
Subject: Re: [take24 0/6] kevent: Generic event handling mechanism.

Evgeniy Polyakov wrote:
>
> With the provided patch it is possible to wake up 'for free' - just call
> kevent_ctl(ready) with a zero number of ready events, so the thread will
> be awakened if it was in poll(kevent_fd), kevent_wait() or
> kevent_get_events().

Yes, I realize that. But I wrote something else:

>> Rather than mark an existing entry as ready, how about a call to
>> inject a new ready event?
>>
>> This would be useful to implement functionality at userlevel and
>> still use an event queue to announce the availability. Without this
>> type of functionality we'd need to use indirect notification via
>> signal or pipe or something like that.

This is still something which is wanted.

--
➧ Ulrich Drepper ➧ Red Hat, Inc. ➧ 444 Castro St ➧ Mountain View, CA ❖

2006-11-27 18:24:43

by David Miller

[permalink] [raw]
Subject: Re: Kevent POSIX timers support.

From: Ulrich Drepper <[email protected]>
Date: Mon, 27 Nov 2006 10:20:50 -0800

> Evgeniy Polyakov wrote:
> >> We need to pass the data in the sigev_value member of the struct
> >> sigevent structure passed to timer_create to the caller. I don't see it
> >> being done here nor when the timer is created. Do I miss something?
> >> The sigev_value value should be stored in the user/ptr member of struct
> >> ukevent.
> >
> > sigev_value was stored in k_itimer structure, I just do not know where
> > to put it in the ukevent provided to userspace - it can be placed in
> > pointer value if you like.
>
> sigev_value is a union and the largest element is a pointer. So,
> transporting the pointer value is sufficient and it should be passed up
> to the user in the ptr member of struct ukevent.

Now we'll have to have a compat layer for 32-bit/64-bit environments
thanks to POSIX timers, which is ridiculous.

This is exactly the kind of thing I was hoping we could avoid when
designing these data structures. No pointers, no non-fixed sized
types, only types which are identically sized and aligned between
32-bit and 64-bit environments.

It's OK to have these problems for things designed a long time ago
before 32-bit/64-bit compat issues existed, but for new stuff no
way.
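The layout rule described above can be made concrete. The struct below is an illustrative example, not the real ukevent: only fixed-size, naturally aligned fields, so the record is byte-identical on 32-bit and 64-bit ABIs and never needs compat translation.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical event record following the rule: no pointers, no longs,
 * every field a fixed-size type at its natural alignment, so there is
 * no padding and no ABI-dependent layout. */
struct clean_event {
	uint32_t type;       /* event type                     */
	uint32_t id;         /* event identifier               */
	uint64_t user_data;  /* carries a pointer *value* only */
	uint32_t ret_flags;  /* result flags                   */
	uint32_t ret_data;   /* result payload                 */
};
```

The 64-bit field sits at offset 8 on both ABIs, so the total size (24 bytes) and every offset are identical everywhere.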

2006-11-27 18:36:47

by Ulrich Drepper

[permalink] [raw]
Subject: Re: Kevent POSIX timers support.

David Miller wrote:
> Now we'll have to have a compat layer for 32-bit/64-bit environments
> thanks to POSIX timers, which is ridiculous.

We already have compat_sys_timer_create. It should be sufficient just
to add the conversion (if anything new is needed) there. The pointer
value can be passed to userland in one or two int fields, I don't really
care. When reporting the event to the user code we cannot just point
into the ring buffer anyway. So while copying the data we can rewrite
it if necessary. I see no need to complicate the code more than it
already is.

--
➧ Ulrich Drepper ➧ Red Hat, Inc. ➧ 444 Castro St ➧ Mountain View, CA ❖

2006-11-27 18:49:54

by David Miller

[permalink] [raw]
Subject: Re: Kevent POSIX timers support.

From: Ulrich Drepper <[email protected]>
Date: Mon, 27 Nov 2006 10:36:06 -0800

> David Miller wrote:
> > Now we'll have to have a compat layer for 32-bit/64-bit environments
> > thanks to POSIX timers, which is ridiculous.
>
> We already have compat_sys_timer_create. It should be sufficient just
> to add the conversion (if anything new is needed) there. The pointer
> value can be passed to userland in one or two int fields, I don't really
> care. When reporting the event to the user code we cannot just point
> into the ring buffer anyway. So while copying the data we can rewrite
> it if necessary. I see no need to complicate the code more than it
> already is.

Ok, as long as that thing doesn't end up in the ring buffer entry
data structure, that's where the real troubles would be.

2006-11-27 19:13:19

by Ulrich Drepper

[permalink] [raw]
Subject: Re: [take24 0/6] kevent: Generic event handling mechanism.

Evgeniy Polyakov wrote:
> It just sets an hrtimer with absolute time and sleeps - it can achieve
> the same goals using a mechanism similar to wait_event().

I don't follow. Of course it is somehow possible to wait until an
absolute deadline. But it's not part of the parameter list and hence
not easily and _quickly_ usable.


>>> Btw, do you propose to change all users of wait_event()?
>> Which users?
>
> Any users which use wait_event() or schedule_timeout(). Futex for
> example - it lives perfectly OK with relative timeouts provided to
> schedule_timeout() - the same (roughly speaking, of course) is done in kevent.

No, it does not live perfectly OK with relative timeouts. The userlevel
implementation is actually wrong because of this in subtle ways. Some
futex interfaces take absolute timeouts and they have to be interrupted
if the realtime clock is set forward.

Also, the calls are complicated and slow because the userlevel wrapper
has to call clock_gettime/gettimeofday before each futex syscall. If
the kernel would accept absolute timeouts as well we would save a
syscall and have actually a correct implementation.


> I think I have said already several times that absolute timeouts are not
> related to the syscall execution process. But you seem not to hear me
> and insist.

Because you're wrong. For your use cases it might not be but it's not
true in general. And your interface is preventing it from being
implemented forever.


> Ok, I will change waiting syscalls to have 'flags' parameter and 'struct
> timespec' as timeout parameter. Special bit in flags will result in
> additional timer setup which will fire after absolute timeout and will
> wake up those who wait...

Thanks a lot.


>>> kevent signal registering is atomic with respect to other kevent
>>> syscalls: control syscalls are protected by mutex and waiting syscalls
>>> work with queue, which is protected by appropriate lock.
>> It is about atomicity wrt the signal mask manipulation which would
>> have to precede the kevent_wait call and the call itself (and
>> registering a signal for kevent delivery). This is not atomic.
>
> If signal mask is updated from userspace it should be done through
> kevent - add/remove different kevent signals.

Indeed, this is what I've been saying and why ppoll/pselect/epoll_pwait
take the sigset_t parameter.

Adding the signal mask to the queued events (e.g., the signal events)
does not work. First of all it's slow, you'd have to find and combine
all mask at least every time a signal event is added/removed. Then how
do you combine them, OR or AND? Not all threads might want/need the
same signal mask.

These are just some of the usability problems. The only clean and
usable solution is really to OPTIONALLY pass in the signal mask. Nobody
forces anybody to use this feature. Pass a NULL pointer and nothing
happens, this is how the other syscalls also work.


> The whole signal mask was added by POSIX exactly for that single
> practical race in the event dispatching mechanism, which can not handle
> other types of events like signals.

No. How should this argument make sense? Signals cannot be used in
the current event handling and are therefore used for something
completely different. And they will have to be used like this for many
applications (e.g., thread cancellation, setuid/setgid implementation, etc.).

That fact that the new event handling can handle signals is orthogonal
(and good). But it does not supersede the old signal use, it's
something new. The old uses are still valid.

BTW: there is a little design decision which has to be made: if a signal
is registered with kevent and this signal is sent to a specific thread
instead of the process (tkill and tgkill), what should happen? I'm
currently leaning toward failing the tkill/tgkill syscall if delivery of
the signal requires posting to an event queue.


> There is a major contradiction here: you say that programmers will use
> old-style signal delivery and want me to add a signal mask to prevent
> that delivery, so signals would be in the blocked mask,

That's one thing you can do. You also can unblock signals.


> when I say that the current kevent
> signal delivery does not update the pending signal mask, which is the
> same as putting signals into the blocked mask, you say that it is not
> what is required.

First, what is "pending signal mask"? There is one signal mask per
thread. And "pending" refers to thread delivery (either per-process or
per-thread) which is not the signal mask (well, for non-RT signals it
can be a bitmap but this still is no mask).

Second, I'm not talking about signal delivery. Yes, sigaction allows to
specify how the signal mask is to be changed when a signal is delivered.
But this is not what I'm talking about. I'm talking about the signal
mask used for the duration of the kevent_wait syscall, regardless of
whether signals are waited for or delivered.



> Signal queue is replaced with kevent queue, and it is in sync with all
> other kevents.

But the signal mask is something completely different and completely
independent from the signal queue. There is nothing in the kevent
interface to replace that functionality. Nor should this be possible
with the events; only a sigset_t parameter to kevent_wait makes sense.


> Having sigmask parameter is the same as creating kevent signal delivery.

No, no, no. Not at all.

>> Surely you don't suggest keeping your original timer patch?
>
> Of course not - kevent timers are more scalable than POSIX timers (the
> latter uses idr, which is slower than a balanced binary tree, since it
> appears to use a radix-tree-like algorithm), and the POSIX interface is
> much more inconvenient to use than a simple add/wait.

I assume you misread the question. You agree to drop the patch and then
go on listing things why you think it's better to keep them. I don't
think these arguments are in any way sufficient. The interface is
already too big and this is 100% duplicate functionality. If there are
performance problems with the POSIX timer implementation (and I have yet
to see indications) it should be fixed instead of worked around.

--
➧ Ulrich Drepper ➧ Red Hat, Inc. ➧ 444 Castro St ➧ Mountain View, CA ❖

2006-11-28 09:17:50

by Evgeniy Polyakov

[permalink] [raw]
Subject: Re: Kevent POSIX timers support.

On Mon, Nov 27, 2006 at 10:49:55AM -0800, David Miller ([email protected]) wrote:
> From: Ulrich Drepper <[email protected]>
> Date: Mon, 27 Nov 2006 10:36:06 -0800
>
> > David Miller wrote:
> > > Now we'll have to have a compat layer for 32-bit/64-bit environments
> > > thanks to POSIX timers, which is ridiculous.
> >
> > We already have compat_sys_timer_create. It should be sufficient just
> > to add the conversion (if anything new is needed) there. The pointer
> > value can be passed to userland in one or two int fields, I don't really
> > care. When reporting the event to the user code we cannot just point
> > into the ring buffer anyway. So while copying the data we can rewrite
> > it if necessary. I see no need to complicate the code more than it
> > already is.
>
> Ok, as long as that thing doesn't end up in the ring buffer entry
> data structure, that's where the real troubles would be.

Although ukevent has a pointer embedded, it is unioned with a u64, so
there should be no problems until a 128-bit arch appears, which likely
will not happen soon. There is also an unused 'u32 ret_val[2]' field in
the kevent POSIX timers patch, which could store the sigev_value too.

But the fact that ukevent does not and will not in any way have a
variable size is absolute.

--
Evgeniy Polyakov

2006-11-28 09:20:57

by Evgeniy Polyakov

[permalink] [raw]
Subject: Re: Kevent POSIX timers support.

On Mon, Nov 27, 2006 at 10:20:50AM -0800, Ulrich Drepper ([email protected]) wrote:
> sigev_value is a union and the largest element is a pointer. So,
> transporting the pointer value is sufficient and it should be passed up
> to the user in the ptr member of struct ukevent.

That is where I've put it in current version.

--
Evgeniy Polyakov

2006-11-28 10:14:50

by Evgeniy Polyakov

[permalink] [raw]
Subject: Re: [take24 0/6] kevent: Generic event handling mechanism.

On Mon, Nov 27, 2006 at 10:23:39AM -0800, Ulrich Drepper ([email protected]) wrote:
> Evgeniy Polyakov wrote:
> >
> >With the provided patch it is possible to wake up 'for free' - just call
> >kevent_ctl(ready) with a zero number of ready events, so the thread will
> >be awakened if it was in poll(kevent_fd), kevent_wait() or
> >kevent_get_events().
>
> Yes, I realize that. But I wrote something else:
>
> >> Rather than mark an existing entry as ready, how about a call to
> >> inject a new ready event?
> >>
> >> This would be useful to implement functionality at userlevel and
> >> still use an event queue to announce the availability. Without this
> >> type of functionality we'd need to use indirect notification via
> >> signal or pipe or something like that.
>
> This is still something which is wanted.

Why do we want to inject a _ready_ event, when it is possible to mark an
event as ready and wake up a thread parked in the syscall?

> --
> ➧ Ulrich Drepper ➧ Red Hat, Inc. ➧ 444 Castro St ➧ Mountain View,
> CA ❖

--
Evgeniy Polyakov

2006-11-28 11:04:15

by Evgeniy Polyakov

[permalink] [raw]
Subject: Re: [take24 0/6] kevent: Generic event handling mechanism.

On Mon, Nov 27, 2006 at 11:12:21AM -0800, Ulrich Drepper ([email protected]) wrote:
> Evgeniy Polyakov wrote:
> >It just sets an hrtimer with absolute time and sleeps - it can achieve
> >the same goals using a mechanism similar to wait_event().
>
> I don't follow. Of course it is somehow possible to wait until an
> absolute deadline. But it's not part of the parameter list and hence
> not easily and _quickly_ usable.

I just described how it is implemented in futex. I will take the same
approach: an hrtimer which will wake up wait_event() with an infinite timeout.

> >>>Btw, do you propose to change all users of wait_event()?
> >>Which users?
> >
> >Any users which use wait_event() or schedule_timeout(). Futex for
> >example - it lives perfectly OK with relative timeouts provided to
> >schedule_timeout() - the same (roughly speaking, of course) is done in kevent.
>
> No, it does not live perfectly OK with relative timeouts. The userlevel
> implementation is actually wrong because of this in subtle ways. Some
> futex interfaces take absolute timeouts and they have to be interrupted
> if the realtime clock is set forward.
>
> Also, the calls are complicated and slow because the userlevel wrapper
> has to call clock_gettime/gettimeofday before each futex syscall. If
> the kernel would accept absolute timeouts as well we would save a
> syscall and have actually a correct implementation.

It is only done for the LOCK_PI case, which was specially created to
have an absolute timeout, i.e. futex does not need it, but there is the
option.

I will extend the waiting syscalls to take a timespec and an absolute
timeout; I just want to stop this (I hope you agree) stupid endless
arguing about a completely unimportant thing.

> >I think I have said already several times that absolute timeouts are not
> >related to the syscall execution process. But you seem not to hear me
> >and insist.
>
> Because you're wrong. For your use cases it might not be but it's not
> true in general. And your interface is preventing it from being
> implemented forever.

Because I'm right and it will not be used :)
Well, it does not matter anymore, right?

> >Ok, I will change waiting syscalls to have 'flags' parameter and 'struct
> >timespec' as timeout parameter. Special bit in flags will result in
> >additional timer setup which will fire after absolute timeout and will
> >wake up those who wait...
>
> Thanks a lot.

No problem - I always like to spend a couple of months arguing about taste
and 'right-from-my-point-of-view' theories - isn't that the best way to
waste time?

...

> >Having sigmask parameter is the same as creating kevent signal delivery.
>
> No, no, no. Not at all.

I've dropped a lot, but let me describe the signal mask problem in a few
words: the signal mask provided to sys_pselect() and friends is a mask
of signals which will be put into the blocked mask in the task structure
in the kernel. When a new signal is about to be delivered, its number is
checked against the blocked mask; if it is there, the task is not woken
up and the signal is not delivered to userspace (it stays pending).
Kevent (with a special flag) does exactly the same - but it does not
update the blocked mask; instead it adds another check: if the signal is
in the kevent set of requests, it is delivered to userspace through the
kevent queue.

It is _exactly_ the same behaviour from the userspace point of view
concerning the race between signal delivery and file descriptor
readiness. Exactly.

Here is a code snippet:

specific_send_sig_info()
{
	...
	/* Short-circuit ignored signals. */
	if (sig_ignored(t, sig))
		goto out;
	...
	ret = send_signal(sig, info, t, &t->pending);
	if (!ret && !sigismember(&t->blocked, sig))
		signal_wake_up(t, sig == SIGKILL);
#ifdef CONFIG_KEVENT_SIGNAL
	/*
	 * Kevent allows to deliver signals through the kevent queue.
	 * It is possible to setup kevent to not deliver
	 * a signal through the usual way; in that case send_signal()
	 * returns 1 and the signal is delivered only through the kevent queue.
	 * We simulate successful delivery notification through this hack:
	 */
	if (ret == 1)
		ret = 0;
#endif
out:
	return ret;
}

> >>Surely you don't suggest keeping your original timer patch?
> >
> >Of course not - kevent timers are more scalable than posix timers (the
> >latter uses idr, which is slower than balanced binary tree, since it
> >looks like it uses similar to radix tree algo), POSIX interface is
> >much-much-much more unconvenient to use than simple add/wait.
>
> I assume you misread the question. You agree to drop the patch and then
> go on listing things why you think it's better to keep them. I don't
> think these arguments are in any way sufficient. The interface is
> already too big and this is 100% duplicate functionality. If there are
> performance problems with the POSIX timer implementation (and I have yet
> to see indications) it should be fixed instead of worked around.

I do _not_ agree to drop the kevent timer patch (not the posix timer
one), since from my point of view it is a much more convenient
interface, it is more scalable, and it is generic enough to be used with
other kevent methods.

But anyway, we can spend an awful lot of time arguing about taste, which
is definitely _NOT_ what we want. So, there are two worlds - POSIX
timers and usual timers, both accessible from userspace, the first one
through timer_create() and friends, the second one through the kevent
interface.

> --
> ➧ Ulrich Drepper ➧ Red Hat, Inc. ➧ 444 Castro St ➧ Mountain View,
> CA ❖

--
Evgeniy Polyakov

2006-11-28 19:12:56

by David Miller

[permalink] [raw]
Subject: Re: Kevent POSIX timers support.

From: Evgeniy Polyakov <[email protected]>
Date: Tue, 28 Nov 2006 12:16:02 +0300

> Although ukevent has a pointer embedded, it is unioned with u64, so
> there should be no problems until a 128-bit arch appears, which likely
> will not happen soon. There is also an unused 'u32 ret_val[2]' field in
> the kevent posix timers patch, which can store sigval's value too.
>
> But the fact is that ukevent does not and will not in any way have a
> variable size.

I believe that in order to be 100% safe you will need to use the
special aligned_u64 type, as that takes care of a crucial difference
between the x86 and x86_64 ABI, namely that u64 needs 8-byte alignment
on x86_64 but not on x86.

You probably know this already :-)

2006-11-28 19:23:22

by Evgeniy Polyakov

[permalink] [raw]
Subject: Re: Kevent POSIX timers support.

On Tue, Nov 28, 2006 at 11:13:00AM -0800, David Miller ([email protected]) wrote:
> From: Evgeniy Polyakov <[email protected]>
> Date: Tue, 28 Nov 2006 12:16:02 +0300
>
> > Although ukevent has a pointer embedded, it is unioned with u64, so
> > there should be no problems until a 128-bit arch appears, which likely
> > will not happen soon. There is also an unused 'u32 ret_val[2]' field in
> > the kevent posix timers patch, which can store sigval's value too.
> >
> > But the fact is that ukevent does not and will not in any way have a
> > variable size.
>
> I believe that in order to be 100% safe you will need to use the
> special aligned_u64 type, as that takes care of a crucial difference
> between the x86 and x86_64 ABI, namely that u64 needs 8-byte alignment
> on x86_64 but not on x86.
>
> You probably know this already :-)

Yep :)
So I put it at the end, where the structure is already correctly aligned,
so there is no need for special alignment.
And, btw, last time I checked, aligned_u64 was not exported to
userspace.

--
Evgeniy Polyakov

2006-12-12 01:36:46

by David Miller

[permalink] [raw]
Subject: Re: Kevent POSIX timers support.

From: Evgeniy Polyakov <[email protected]>
Date: Tue, 28 Nov 2006 22:22:36 +0300

> And, btw, last time I checked, aligned_u64 was not exported to
> userspace.

It is in linux/types.h and not protected by __KERNEL__ ifdefs.
Perhaps you mean something else?

2006-12-12 05:32:31

by Evgeniy Polyakov

[permalink] [raw]
Subject: Re: Kevent POSIX timers support.

On Mon, Dec 11, 2006 at 05:36:44PM -0800, David Miller ([email protected]) wrote:
> From: Evgeniy Polyakov <[email protected]>
> Date: Tue, 28 Nov 2006 22:22:36 +0300
>
> > And, btw, last time I checked, aligned_u64 was not exported to
> > userspace.
>
> It is in linux/types.h and not protected by __KERNEL__ ifdefs.
> Perhaps you mean something else?

It looks like I checked the wrong #ifdef __KERNEL__/#endif pair.
It is indeed there.

--
Evgeniy Polyakov

2006-12-13 13:21:50

by Tushar Adeshara

[permalink] [raw]
Subject: Re: Kevent POSIX timers support.

On 11/23/06, Evgeniy Polyakov <[email protected]> wrote:
> On Wed, Nov 22, 2006 at 01:44:16PM +0300, Evgeniy Polyakov ([email protected]) wrote:
> +static int posix_kevent_init_timer(struct k_itimer *tmr, int fd)
> +{
> +	struct ukevent uk;
> +	struct file *file;
> +	struct kevent_user *u;
> +	int err;
> +
> +	file = fget(fd);
> +	if (!file) {
> +		err = -EBADF;
> +		goto err_out;
> +	}
> +
> +	if (file->f_op != &kevent_user_fops) {
> +		err = -EINVAL;
> +		goto err_out_fput;
> +	}
> +
> +	u = file->private_data;
> +
> +	memset(&uk, 0, sizeof(struct ukevent));
> +
> +	uk.type = KEVENT_POSIX_TIMER;
> +	uk.id.raw_u64 = (unsigned long)(tmr); /* Just cast to something unique */
> +	uk.ptr = tmr;
> +
> +	tmr->it_sigev_value.sival_ptr = file;
> +
> +	err = kevent_user_add_ukevent(&uk, u);

I think these four lines are not required. Irrespective of the return
value of kevent_user_add_ukevent(), we are going to release the file
and return err.

> +	if (err)
> +		goto err_out_fput;
> +
> +	fput(file);
> +
> +	return 0;

> +
> +err_out_fput:
> +	fput(file);
> +err_out:
> +	return err;
> +}
> +
> +

--
Regards,
Tushar
--------------------
It's not a problem, it's an opportunity for improvement. Lets improve.

2006-12-13 13:38:15

by Evgeniy Polyakov

[permalink] [raw]
Subject: Re: Kevent POSIX timers support.

Hello.

Please _NEVER_ drop the Cc: list, since not everyone is subscribed to
linux-kernel@ - I am not, for example.

On Wed, Dec 13, 2006 at 06:51:47PM +0530, Tushar Adeshara ([email protected]) wrote:
> I think these four lines are not required. Irrespective of return
> value of kevent_user_add_ukevent(), we are going to release file, and
> return err.

> >+	if (err)
> >+		goto err_out_fput;
> >+
> >+	fput(file);
> >+
> >+	return 0;
>
>
> >+
> >+err_out_fput:
> >+	fput(file);
> >+err_out:
> >+	return err;
> >+}
> >+

I put them there so as to always know where the error path is and where
it is not, making it easier to add error statistics or debugging.
They can be removed, but imho that reduces readability.

--
Evgeniy Polyakov

2006-12-27 20:48:05

by Ulrich Drepper

[permalink] [raw]
Subject: Re: [take24 0/6] kevent: Generic event handling mechanism.

Evgeniy Polyakov wrote:
> Why do we want to inject _ready_ event, when it is possible to mark
> event as ready and wakeup thread parked in syscall?

Going back to this old one:

How do you want to mark an event ready if you don't want to introduce
yet another layer of data structures? The event notification happens
through entries in the ring buffer. Userlevel code should never add
anything to the ring buffer directly, this would mean huge
synchronization problems. Yes, one could add additional data structures
accompanying the ring buffer which can specify userlevel-generated
events. But this is a) clumsy and b) a pain to use when the same ring
buffer is used in multiple threads (you'd have to have another shared
memory segment).

It's much cleaner if the userlevel code can get the kernel to inject a
userlevel-generated event. This is the equivalent of userlevel code
generating a signal with kill().

--
➧ Ulrich Drepper ➧ Red Hat, Inc. ➧ 444 Castro St ➧ Mountain View, CA ❖



2006-12-28 09:52:32

by Evgeniy Polyakov

[permalink] [raw]
Subject: Re: [take24 0/6] kevent: Generic event handling mechanism.

On Wed, Dec 27, 2006 at 12:45:50PM -0800, Ulrich Drepper ([email protected]) wrote:
> Evgeniy Polyakov wrote:
> > Why do we want to inject _ready_ event, when it is possible to mark
> > event as ready and wakeup thread parked in syscall?
>
> Going back to this old one:
>
> How do you want to mark an event ready if you don't want to introduce
> yet another layer of data structures? The event notification happens
> through entries in the ring buffer. Userlevel code should never add
> anything to the ring buffer directly, this would mean huge
> synchronization problems. Yes, one could add additional data structures
> accompanying the ring buffer which can specify userlevel-generated
> events. But this is a) clumsy and b) a pain to use when the same ring
> buffer is used in multiple threads (you'd have to have another shared
> memory segment).
>
> It's much cleaner if the userlevel code can get the kernel to inject a
> userlevel-generated event. This is the equivalent of userlevel code
> generating a signal with kill().

The existing possibility to mark an event as ready works the following
way: an event is queued into a storage queue (socket, inode or some
other queue); when the readiness condition becomes true, the event is
also queued into the ready queue (while still remaining in the storage
queue). This happens completely asynchronously to _any_ kind of
userspace processing. When userspace calls the appropriate syscall, the
event is copied into the ring buffer.

Thus userspace readiness will just mark the event as ready, i.e. queue
it into the ready queue, so that userspace can later call a syscall to
actually get the event.

When one thread is parked in the syscall and there are _no_ events which
should be marked as ready (for example only sockets are there, and it is
not a good idea to wake up the whole socket processing state machine),
then there is no possibility to receive such an event (although it is
possible to interrupt and break the syscall).

So, regarding injecting ready events, it can be done - just add a
special flag which will force the kevent core to move the event into the
ready queue immediately. In this case userspace can even prepare a
needed event (like a signal event) and deliver it to a process, so the
process will think (only from the kevent point of view) that a real
signal has arrived.

I will also add a special type of events - userspace events - which will
not have empty callbacks and which will be intended for user-defined use
(i.e. for inter-thread communication).

> --
> ➧ Ulrich Drepper ➧ Red Hat, Inc. ➧ 444 Castro St ➧ Mountain View, CA ❖
>



--
Evgeniy Polyakov