2006-11-07 16:51:32

by Evgeniy Polyakov

[permalink] [raw]
Subject: [take23 0/5] kevent: Generic event handling mechanism.


Generic event handling mechanism.

Kevent is a generic subsytem which allows to handle event notifications.
It supports both level and edge triggered events. It is similar to
poll/epoll in some cases, but it is more scalable, it is faster and
allows to work with essentially eny kind of events.

Events are provided into kernel through control syscall and can be read
back through mmaped ring or syscall.
Kevent update (i.e. readiness switching) happens directly from internals
of the appropriate state machine of the underlying subsytem (like
network, filesystem, timer or any other).

Homepage:
http://tservice.net.ru/~s0mbre/old/?section=projects&item=kevent

Documentation page:
http://linux-net.osdl.org/index.php/Kevent

Consider for inclusion.

Changes from 'take22' patchset:
* new ring buffer implementation in process' memory
* wakeup-one-thread flag
* edge-triggered behaviour
With this release additional independent benchmark shows kevent speed compared to epoll:
Eric Dumazet created special benchmark which creates set of AF_INET sockets and two threads
start to simultaneously read and write data from/into them.
Here is results:
epoll (no EPOLLET): 57428 events/sec
kevent (no ET): 59794 events/sec
epoll (with EPOLLET): 71000 events/sec
kevent (with ET): 78265 events/sec
Maximum (busy loop reading events): 88482 events/sec

Changes from 'take21' patchset:
* minor cleanups (different return values, removed unneded variables, whitespaces and so on)
* fixed bug in kevent removal in case when kevent being removed
is the same as overflow_kevent (spotted by Eric Dumazet)

Changes from 'take20' patchset:
* new ring buffer implementation
* removed artificial limit on possible number of kevents
With this release and fixed userspace web server it was possible to
achive 3960+ req/s with client connection rate of 4000 con/s
over 100 Mbit lan, data IO over network was about 10582.7 KB/s, which
is too close to wire speed if we get into account headers and the like.

Changes from 'take19' patchset:
* use __init instead of __devinit
* removed 'default N' from config for user statistic
* removed kevent_user_fini() since kevent can not be unloaded
* use KERN_INFO for statistic output

Changes from 'take18' patchset:
* use __init instead of __devinit
* removed 'default N' from config for user statistic
* removed kevent_user_fini() since kevent can not be unloaded
* use KERN_INFO for statistic output

Changes from 'take17' patchset:
* Use RB tree instead of hash table.
At least for a web sever, frequency of addition/deletion of new kevent
is comparable with number of search access, i.e. most of the time events
are added, accesed only couple of times and then removed, so it justifies
RB tree usage over AVL tree, since the latter does have much slower deletion
time (max O(log(N)) compared to 3 ops),
although faster search time (1.44*O(log(N)) vs. 2*O(log(N))).
So for kevents I use RB tree for now and later, when my AVL tree implementation
is ready, it will be possible to compare them.
* Changed readiness check for socket notifications.

With both above changes it is possible to achieve more than 3380 req/second compared to 2200,
sometimes 2500 req/second for epoll() for trivial web-server and httperf client on the same
hardware.
It is possible that above kevent limit is due to maximum allowed kevents in a time limit, which is
4096 events.

Changes from 'take16' patchset:
* misc cleanups (__read_mostly, const ...)
* created special macro which is used for mmap size (number of pages) calculation
* export kevent_socket_notify(), since it is used in network protocols which can be
built as modules (IPv6 for example)

Changes from 'take15' patchset:
* converted kevent_timer to high-resolution timers, this forces timer API update at
http://linux-net.osdl.org/index.php/Kevent
* use struct ukevent* instead of void * in syscalls (documentation has been updated)
* added warning in kevent_add_ukevent() if ring has broken index (for testing)

Changes from 'take14' patchset:
* added kevent_wait()
This syscall waits until either timeout expires or at least one event
becomes ready. It also commits that @num events from @start are processed
by userspace and thus can be be removed or rearmed (depending on it's flags).
It can be used for commit events read by userspace through mmap interface.
Example userspace code (evtest.c) can be found on project's homepage.
* added socket notifications (send/recv/accept)

Changes from 'take13' patchset:
* do not get lock aroung user data check in __kevent_search()
* fail early if there were no registered callbacks for given type of kevent
* trailing whitespace cleanup

Changes from 'take12' patchset:
* remove non-chardev interface for initialization
* use pointer to kevent_mring instead of unsigned longs
* use aligned 64bit type in raw user data (can be used by high-res timer if needed)
* simplified enqueue/dequeue callbacks and kevent initialization
* use nanoseconds for timeout
* put number of milliseconds into timer's return data
* move some definitions into user-visible header
* removed filenames from comments

Changes from 'take11' patchset:
* include missing headers into patchset
* some trivial code cleanups (use goto instead of if/else games and so on)
* some whitespace cleanups
* check for ready_callback() callback before main loop which should save us some ticks

Changes from 'take10' patchset:
* removed non-existent prototypes
* added helper function for kevent_registered_callbacks
* fixed 80 lines comments issues
* added shared between userspace and kernelspace header instead of embedd them in one
* core restructuring to remove forward declarations
* s o m e w h i t e s p a c e c o d y n g s t y l e c l e a n u p
* use vm_insert_page() instead of remap_pfn_range()

Changes from 'take9' patchset:
* fixed ->nopage method

Changes from 'take8' patchset:
* fixed mmap release bug
* use module_init() instead of late_initcall()
* use better structures for timer notifications

Changes from 'take7' patchset:
* new mmap interface (not tested, waiting for other changes to be acked)
- use nopage() method to dynamically substitue pages
- allocate new page for events only when new added kevent requres it
- do not use ugly index dereferencing, use structure instead
- reduced amount of data in the ring (id and flags),
maximum 12 pages on x86 per kevent fd

Changes from 'take6' patchset:
* a lot of comments!
* do not use list poisoning for detection of the fact, that entry is in the list
* return number of ready kevents even if copy*user() fails
* strict check for number of kevents in syscall
* use ARRAY_SIZE for array size calculation
* changed superblock magic number
* use SLAB_PANIC instead of direct panic() call
* changed -E* return values
* a lot of small cleanups and indent fixes

Changes from 'take5' patchset:
* removed compilation warnings about unused wariables when lockdep is not turned on
* do not use internal socket structures, use appropriate (exported) wrappers instead
* removed default 1 second timeout
* removed AIO stuff from patchset

Changes from 'take4' patchset:
* use miscdevice instead of chardevice
* comments fixes

Changes from 'take3' patchset:
* removed serializing mutex from kevent_user_wait()
* moved storage list processing to RCU
* removed lockdep screaming - all storage locks are initialized in the same function, so it was
learned
to differentiate between various cases
* remove kevent from storage if is marked as broken after callback
* fixed a typo in mmaped buffer implementation which would end up in wrong index calcualtion

Changes from 'take2' patchset:
* split kevent_finish_user() to locked and unlocked variants
* do not use KEVENT_STAT ifdefs, use inline functions instead
* use array of callbacks of each type instead of each kevent callback initialization
* changed name of ukevent guarding lock
* use only one kevent lock in kevent_user for all hash buckets instead of per-bucket locks
* do not use kevent_user_ctl structure instead provide needed arguments as syscall parameters
* various indent cleanups
* added optimisation, which is aimed to help when a lot of kevents are being copied from
userspace
* mapped buffer (initial) implementation (no userspace yet)

Changes from 'take1' patchset:
- rebased against 2.6.18-git tree
- removed ioctl controlling
- added new syscall kevent_get_events(int fd, unsigned int min_nr, unsigned int max_nr,
unsigned int timeout, void __user *buf, unsigned flags)
- use old syscall kevent_ctl for creation/removing, modification and initial kevent
initialization
- use mutuxes instead of semaphores
- added file descriptor check and return error if provided descriptor does not match
kevent file operations
- various indent fixes
- removed aio_sendfile() declarations.

Thank you.

Signed-off-by: Evgeniy Polyakov <[email protected]>



2006-11-07 16:51:57

by Evgeniy Polyakov

[permalink] [raw]
Subject: [take23 1/5] kevent: Description.


Description.

int kevent_ctl(int fd, unsigned int cmd, unsigned int num, struct ukevent *arg);

fd - is the file descriptor referring to the kevent queue to manipulate.
It is created by opening "/dev/kevent" char device, which is created with dynamic
minor number and major number assigned for misc devices.

cmd - is the requested operation. It can be one of the following:
KEVENT_CTL_ADD - add event notification
KEVENT_CTL_REMOVE - remove event notification
KEVENT_CTL_MODIFY - modify existing notification

num - number of struct ukevent in the array pointed to by arg
arg - array of struct ukevent

When called, kevent_ctl will carry out the operation specified in the cmd parameter.
-------------------------------------------------------------------------------------

int kevent_get_events(int ctl_fd, unsigned int min_nr, unsigned int max_nr, __u64 timeout, struct ukevent *buf, unsigned flags)

ctl_fd - file descriptor referring to the kevent queue
min_nr - minimum number of completed events that kevent_get_events will block waiting for
max_nr - number of struct ukevent in buf
timeout - number of nanoseconds to wait before returning less than min_nr events.
If this is -1, then wait forever.
buf - pointer to an array of struct ukevent.
flags - unused

kevent_get_events will wait timeout milliseconds for at least min_nr completed events,
copying completed struct ukevents to buf and deleting any KEVENT_REQ_ONESHOT event requests.
In nonblocking mode it returns as many events as possible, but not more than max_nr.
In blocking mode it waits until timeout or if at least min_nr events are ready.
-------------------------------------------------------------------------------------

int kevent_wait(int ctl_fd, unsigned int num, __u64 timeout)

ctl_fd - file descriptor referring to the kevent queue
num - number of processed kevents
timeout - this timeout specifies number of nanoseconds to wait until there is free space in kevent queue

This syscall waits until either timeout expires or at least one event becomes ready.
It also copies that num events into special ring buffer and requeues them (or removes depending on flags).
-------------------------------------------------------------------------------------

int kevent_ring_init(int ctl_fd, struct kevent_ring *ring, unsigned int num)

ctl_fd - file descriptor referring to the kevent queue
num - size of the ring buffer in events

struct kevent_ring
{
unsigned int ring_kidx;
struct ukevent event[0];
}

ring_kidx - is an index in the ring buffer where kernel will put new events when
kevent_wait() or kevent_get_events() is called

Example userspace code (ring_buffer.c) can be found on project's homepage.

Each kevent syscall can be so called cancellation point in glibc, i.e. when thread has
been cancelled in kevent syscall, thread can be safely removed and no events will be lost,
since each syscall (kevent_wait() or kevent_get_events()) will copy event into special ring buffer,
accessible from other threads or even processes (if shared memory is used).

When kevent is removed (not dequeued when it is ready, but just removed), even if it was ready,
it is not copied into ring buffer, since if it is removed, no one cares about it (otherwise user
would wait until it becomes ready and got it through usual way using kevent_get_events() or kevent_wait())
and thus no need to copy it to the ring buffer.

It is possible with userspace ring buffer, that events in the ring buffer can be replaced without knowledge
for the thread currently reading them (when other thread calls kevent_get_events() or kevent_wait()),
so appropriate locking between threads or processes, which can simultaneously access the same ring buffer,
is required.
-------------------------------------------------------------------------------------

The bulk of the interface is entirely done through the ukevent struct.
It is used to add event requests, modify existing event requests,
specify which event requests to remove, and return completed events.

struct ukevent contains the following members:

struct kevent_id id
Id of this request, e.g. socket number, file descriptor and so on
__u32 type
Event type, e.g. KEVENT_SOCK, KEVENT_INODE, KEVENT_TIMER and so on
__u32 event
Event itself, e.g. SOCK_ACCEPT, INODE_CREATED, TIMER_FIRED
__u32 req_flags
Per-event request flags,

KEVENT_REQ_ONESHOT
event will be removed when it is ready

KEVENT_REQ_WAKEUP_ONE
When several threads wait on the same kevent queue and requested the same event,
for example 'wake me up when new client has connected, so I could call accept()',
then all threads will be awakened when new client has connected, but only one of
them can process the data. This problem is known as thundering nerd problem.
Events which have this flag set will not be marked as ready (and appropriate threads
will not be awakened) if at least one event has been already marked.

KEVENT_REQ_ET
Edge Triggered behaviour. It is an optimisation which allows to move ready and dequeued
(i.e. copied to userspace) event to move into set of interest for given storage (socket,
inode and so on) again. It is very usefull for cases when the same event should be used
many times (like reading from pipe). It is similar to epoll()'s EPOLLET flag.

__u32 ret_flags
Per-event return flags

KEVENT_RET_BROKEN
Kevent is broken

KEVENT_RET_DONE
Kevent processing was finished successfully

KEVENT_RET_COPY_FAILED
Kevent was not copied into ring buffer due to some error conditions.

__u32 ret_data
Event return data. Event originator fills it with anything it likes (for example
timer notifications put number of milliseconds when timer has fired
union { __u32 user[2]; void *ptr; }
User's data. It is not used, just copied to/from user. The whole structure is aligned
to 8 bytes already, so the last union is aligned properly.

---------------------------------------------------------------------------------

Usage

For KEVENT_CTL_ADD, all fields relevant to the event type must be filled
(id, type, possibly event, req_flags). After kevent_ctl(..., KEVENT_CTL_ADD, ...)
returns each struct's ret_flags should be checked to see if the event is already broken or done.

For KEVENT_CTL_MODIFY, the id, req_flags, and user and event fields must be set and an
existing kevent request must have matching id and user fields. If a match is found,
req_flags and event are replaced with the newly supplied values and requeueing is started,
so modified kevent can be checked and probably marked as ready immediately. If a match can't
be found, the passed in ukevent's ret_flags has KEVENT_RET_BROKEN set. KEVENT_RET_DONE is always set.

For KEVENT_CTL_REMOVE, the id and user fields must be set and an existing kevent request must
have matching id and user fields. If a match is found, the kevent request is removed.
If a match can't be found, the passed in ukevent's ret_flags has KEVENT_RET_BROKEN set.
KEVENT_RET_DONE is always set.

For kevent_get_events, the entire structure is returned.

---------------------------------------------------------------------------------

Usage cases

kevent_timer
struct ukevent should contain following fields:
type - KEVENT_TIMER
event - KEVENT_TIMER_FIRED
req_flags - KEVENT_REQ_ONESHOT if you want to fire that timer only once
id.raw[0] - number of seconds after commit when this timer shout expire
id.raw[0] - additional to number of seconds number of nanoseconds


2006-11-07 16:51:28

by Evgeniy Polyakov

[permalink] [raw]
Subject: [take23 5/5] kevent: Timer notifications.


Timer notifications.

Timer notifications can be used for fine grained per-process time
management, since interval timers are very inconvenient to use,
and they are limited.

This subsystem uses high-resolution timers.
id.raw[0] is used as number of seconds
id.raw[1] is used as number of nanoseconds

Signed-off-by: Evgeniy Polyakov <[email protected]>

diff --git a/kernel/kevent/kevent_timer.c b/kernel/kevent/kevent_timer.c
new file mode 100644
index 0000000..df93049
--- /dev/null
+++ b/kernel/kevent/kevent_timer.c
@@ -0,0 +1,112 @@
+/*
+ * 2006 Copyright (c) Evgeniy Polyakov <[email protected]>
+ * All rights reserved.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software
+ * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA
+ */
+
+#include <linux/kernel.h>
+#include <linux/types.h>
+#include <linux/list.h>
+#include <linux/slab.h>
+#include <linux/spinlock.h>
+#include <linux/hrtimer.h>
+#include <linux/jiffies.h>
+#include <linux/kevent.h>
+
+struct kevent_timer
+{
+ struct hrtimer ktimer;
+ struct kevent_storage ktimer_storage;
+ struct kevent *ktimer_event;
+};
+
+static int kevent_timer_func(struct hrtimer *timer)
+{
+ struct kevent_timer *t = container_of(timer, struct kevent_timer, ktimer);
+ struct kevent *k = t->ktimer_event;
+
+ kevent_storage_ready(&t->ktimer_storage, NULL, KEVENT_MASK_ALL);
+ hrtimer_forward(timer, timer->base->softirq_time,
+ ktime_set(k->event.id.raw[0], k->event.id.raw[1]));
+ return HRTIMER_RESTART;
+}
+
+static struct lock_class_key kevent_timer_key;
+
+static int kevent_timer_enqueue(struct kevent *k)
+{
+ int err;
+ struct kevent_timer *t;
+
+ t = kmalloc(sizeof(struct kevent_timer), GFP_KERNEL);
+ if (!t)
+ return -ENOMEM;
+
+ hrtimer_init(&t->ktimer, CLOCK_MONOTONIC, HRTIMER_REL);
+ t->ktimer.expires = ktime_set(k->event.id.raw[0], k->event.id.raw[1]);
+ t->ktimer.function = kevent_timer_func;
+ t->ktimer_event = k;
+
+ err = kevent_storage_init(&t->ktimer, &t->ktimer_storage);
+ if (err)
+ goto err_out_free;
+ lockdep_set_class(&t->ktimer_storage.lock, &kevent_timer_key);
+
+ err = kevent_storage_enqueue(&t->ktimer_storage, k);
+ if (err)
+ goto err_out_st_fini;
+
+ hrtimer_start(&t->ktimer, t->ktimer.expires, HRTIMER_REL);
+
+ return 0;
+
+err_out_st_fini:
+ kevent_storage_fini(&t->ktimer_storage);
+err_out_free:
+ kfree(t);
+
+ return err;
+}
+
+static int kevent_timer_dequeue(struct kevent *k)
+{
+ struct kevent_storage *st = k->st;
+ struct kevent_timer *t = container_of(st, struct kevent_timer, ktimer_storage);
+
+ hrtimer_cancel(&t->ktimer);
+ kevent_storage_dequeue(st, k);
+ kfree(t);
+
+ return 0;
+}
+
+static int kevent_timer_callback(struct kevent *k)
+{
+ k->event.ret_data[0] = jiffies_to_msecs(jiffies);
+ return 1;
+}
+
+static int __init kevent_init_timer(void)
+{
+ struct kevent_callbacks tc = {
+ .callback = &kevent_timer_callback,
+ .enqueue = &kevent_timer_enqueue,
+ .dequeue = &kevent_timer_dequeue};
+
+ return kevent_add_callbacks(&tc, KEVENT_TIMER);
+}
+module_init(kevent_init_timer);
+

2006-11-07 16:52:37

by Evgeniy Polyakov

[permalink] [raw]
Subject: [take23 2/5] kevent: Core files.


Core files.

This patch includes core kevent files:
* userspace controlling
* kernelspace interfaces
* initialization
* notification state machines

Some bits of documentation can be found on project's homepage (and links from there):
http://tservice.net.ru/~s0mbre/old/?section=projects&item=kevent

Signed-off-by: Evgeniy Polyakov <[email protected]>

diff --git a/arch/i386/kernel/syscall_table.S b/arch/i386/kernel/syscall_table.S
index 7e639f7..fa8075b 100644
--- a/arch/i386/kernel/syscall_table.S
+++ b/arch/i386/kernel/syscall_table.S
@@ -318,3 +318,7 @@ ENTRY(sys_call_table)
.long sys_vmsplice
.long sys_move_pages
.long sys_getcpu
+ .long sys_kevent_get_events
+ .long sys_kevent_ctl /* 320 */
+ .long sys_kevent_wait
+ .long sys_kevent_ring_init
diff --git a/arch/x86_64/ia32/ia32entry.S b/arch/x86_64/ia32/ia32entry.S
index b4aa875..95fb252 100644
--- a/arch/x86_64/ia32/ia32entry.S
+++ b/arch/x86_64/ia32/ia32entry.S
@@ -714,8 +714,12 @@ #endif
.quad compat_sys_get_robust_list
.quad sys_splice
.quad sys_sync_file_range
- .quad sys_tee
+ .quad sys_tee /* 315 */
.quad compat_sys_vmsplice
.quad compat_sys_move_pages
.quad sys_getcpu
+ .quad sys_kevent_get_events
+ .quad sys_kevent_ctl /* 320 */
+ .quad sys_kevent_wait
+ .quad sys_kevent_ring_init
ia32_syscall_end:
diff --git a/include/asm-i386/unistd.h b/include/asm-i386/unistd.h
index bd99870..2161ef2 100644
--- a/include/asm-i386/unistd.h
+++ b/include/asm-i386/unistd.h
@@ -324,10 +324,14 @@ #define __NR_tee 315
#define __NR_vmsplice 316
#define __NR_move_pages 317
#define __NR_getcpu 318
+#define __NR_kevent_get_events 319
+#define __NR_kevent_ctl 320
+#define __NR_kevent_wait 321
+#define __NR_kevent_ring_init 322

#ifdef __KERNEL__

-#define NR_syscalls 319
+#define NR_syscalls 323
#include <linux/err.h>

/*
diff --git a/include/asm-x86_64/unistd.h b/include/asm-x86_64/unistd.h
index 6137146..3669c0f 100644
--- a/include/asm-x86_64/unistd.h
+++ b/include/asm-x86_64/unistd.h
@@ -619,10 +619,18 @@ #define __NR_vmsplice 278
__SYSCALL(__NR_vmsplice, sys_vmsplice)
#define __NR_move_pages 279
__SYSCALL(__NR_move_pages, sys_move_pages)
+#define __NR_kevent_get_events 280
+__SYSCALL(__NR_kevent_get_events, sys_kevent_get_events)
+#define __NR_kevent_ctl 281
+__SYSCALL(__NR_kevent_ctl, sys_kevent_ctl)
+#define __NR_kevent_wait 282
+__SYSCALL(__NR_kevent_wait, sys_kevent_wait)
+#define __NR_kevent_ring_init 283
+__SYSCALL(__NR_kevent_ring_init, sys_kevent_ring_init)

#ifdef __KERNEL__

-#define __NR_syscall_max __NR_move_pages
+#define __NR_syscall_max __NR_kevent_ring_init
#include <linux/err.h>

#ifndef __NO_STUBS
diff --git a/include/linux/kevent.h b/include/linux/kevent.h
new file mode 100644
index 0000000..781ffa8
--- /dev/null
+++ b/include/linux/kevent.h
@@ -0,0 +1,201 @@
+/*
+ * 2006 Copyright (c) Evgeniy Polyakov <[email protected]>
+ * All rights reserved.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software
+ * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA
+ */
+
+#ifndef __KEVENT_H
+#define __KEVENT_H
+#include <linux/types.h>
+#include <linux/list.h>
+#include <linux/rbtree.h>
+#include <linux/spinlock.h>
+#include <linux/mutex.h>
+#include <linux/wait.h>
+#include <linux/net.h>
+#include <linux/rcupdate.h>
+#include <linux/kevent_storage.h>
+#include <linux/ukevent.h>
+
+#define KEVENT_MIN_BUFFS_ALLOC 3
+
+struct kevent;
+struct kevent_storage;
+typedef int (* kevent_callback_t)(struct kevent *);
+
+/* @callback is called each time new event has been caught. */
+/* @enqueue is called each time new event is queued. */
+/* @dequeue is called each time event is dequeued. */
+
+struct kevent_callbacks {
+ kevent_callback_t callback, enqueue, dequeue;
+};
+
+#define KEVENT_READY 0x1
+#define KEVENT_STORAGE 0x2
+#define KEVENT_USER 0x4
+
+struct kevent
+{
+ /* Used for kevent freeing.*/
+ struct rcu_head rcu_head;
+ struct ukevent event;
+ /* This lock protects ukevent manipulations, e.g. ret_flags changes. */
+ spinlock_t ulock;
+
+ /* Entry of user's tree. */
+ struct rb_node kevent_node;
+ /* Entry of origin's queue. */
+ struct list_head storage_entry;
+ /* Entry of user's ready. */
+ struct list_head ready_entry;
+
+ u32 flags;
+
+ /* User who requested this kevent. */
+ struct kevent_user *user;
+ /* Kevent container. */
+ struct kevent_storage *st;
+
+ struct kevent_callbacks callbacks;
+
+ /* Private data for different storages.
+ * poll()/select storage has a list of wait_queue_t containers
+ * for each ->poll() { poll_wait()' } here.
+ */
+ void *priv;
+};
+
+struct kevent_user
+{
+ struct rb_root kevent_root;
+ spinlock_t kevent_lock;
+ /* Number of queued kevents. */
+ unsigned int kevent_num;
+
+ /* List of ready kevents. */
+ struct list_head ready_list;
+ /* Number of ready kevents. */
+ unsigned int ready_num;
+ /* Protects all manipulations with ready queue. */
+ spinlock_t ready_lock;
+
+ /* Protects against simultaneous kevent_user control manipulations. */
+ struct mutex ctl_mutex;
+ /* Wait until some events are ready. */
+ wait_queue_head_t wait;
+
+ /* Reference counter, increased for each new kevent. */
+ atomic_t refcnt;
+
+ /* Mutex protecting userspace ring buffer. */
+ struct mutex ring_lock;
+ /* Kernel index and size of the userspace ring buffer. */
+ unsigned int kidx, ring_size;
+ /* Pointer to userspace ring buffer. */
+ struct kevent_ring __user *pring;
+
+#ifdef CONFIG_KEVENT_USER_STAT
+ unsigned long im_num;
+ unsigned long wait_num, ring_num;
+ unsigned long total;
+#endif
+};
+
+int kevent_enqueue(struct kevent *k);
+int kevent_dequeue(struct kevent *k);
+int kevent_init(struct kevent *k);
+void kevent_requeue(struct kevent *k);
+int kevent_break(struct kevent *k);
+
+int kevent_add_callbacks(const struct kevent_callbacks *cb, int pos);
+
+void kevent_storage_ready(struct kevent_storage *st,
+ kevent_callback_t ready_callback, u32 event);
+int kevent_storage_init(void *origin, struct kevent_storage *st);
+void kevent_storage_fini(struct kevent_storage *st);
+int kevent_storage_enqueue(struct kevent_storage *st, struct kevent *k);
+void kevent_storage_dequeue(struct kevent_storage *st, struct kevent *k);
+
+int kevent_user_add_ukevent(struct ukevent *uk, struct kevent_user *u);
+
+#ifdef CONFIG_KEVENT_POLL
+void kevent_poll_reinit(struct file *file);
+#else
+static inline void kevent_poll_reinit(struct file *file)
+{
+}
+#endif
+
+#ifdef CONFIG_KEVENT_USER_STAT
+static inline void kevent_stat_init(struct kevent_user *u)
+{
+ u->wait_num = u->im_num = u->total = 0;
+}
+static inline void kevent_stat_print(struct kevent_user *u)
+{
+ printk(KERN_INFO "%s: u: %p, wait: %lu, ring: %lu, immediately: %lu, total: %lu.\n",
+ __func__, u, u->wait_num, u->ring_num, u->im_num, u->total);
+}
+static inline void kevent_stat_im(struct kevent_user *u)
+{
+ u->im_num++;
+}
+static inline void kevent_stat_ring(struct kevent_user *u)
+{
+ u->ring_num++;
+}
+static inline void kevent_stat_wait(struct kevent_user *u)
+{
+ u->wait_num++;
+}
+static inline void kevent_stat_total(struct kevent_user *u)
+{
+ u->total++;
+}
+#else
+#define kevent_stat_print(u) ({ (void) u;})
+#define kevent_stat_init(u) ({ (void) u;})
+#define kevent_stat_im(u) ({ (void) u;})
+#define kevent_stat_wait(u) ({ (void) u;})
+#define kevent_stat_ring(u) ({ (void) u;})
+#define kevent_stat_total(u) ({ (void) u;})
+#endif
+
+#ifdef CONFIG_LOCKDEP
+void kevent_socket_reinit(struct socket *sock);
+void kevent_sk_reinit(struct sock *sk);
+#else
+static inline void kevent_socket_reinit(struct socket *sock)
+{
+}
+static inline void kevent_sk_reinit(struct sock *sk)
+{
+}
+#endif
+#ifdef CONFIG_KEVENT_SOCKET
+void kevent_socket_notify(struct sock *sock, u32 event);
+int kevent_socket_dequeue(struct kevent *k);
+int kevent_socket_enqueue(struct kevent *k);
+#define sock_async(__sk) sock_flag(__sk, SOCK_ASYNC)
+#else
+static inline void kevent_socket_notify(struct sock *sock, u32 event)
+{
+}
+#define sock_async(__sk) ({ (void)__sk; 0; })
+#endif
+
+#endif /* __KEVENT_H */
diff --git a/include/linux/kevent_storage.h b/include/linux/kevent_storage.h
new file mode 100644
index 0000000..a38575d
--- /dev/null
+++ b/include/linux/kevent_storage.h
@@ -0,0 +1,11 @@
+#ifndef __KEVENT_STORAGE_H
+#define __KEVENT_STORAGE_H
+
+struct kevent_storage
+{
+ void *origin; /* Originator's pointer, e.g. struct sock or struct file. Can be NULL. */
+ struct list_head list; /* List of queued kevents. */
+ spinlock_t lock; /* Protects users queue. */
+};
+
+#endif /* __KEVENT_STORAGE_H */
diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
index 2d1c3d5..471a685 100644
--- a/include/linux/syscalls.h
+++ b/include/linux/syscalls.h
@@ -54,6 +54,8 @@ struct compat_stat;
struct compat_timeval;
struct robust_list_head;
struct getcpu_cache;
+struct ukevent;
+struct kevent_ring;

#include <linux/types.h>
#include <linux/aio_abi.h>
@@ -599,4 +601,9 @@ asmlinkage long sys_set_robust_list(stru
size_t len);
asmlinkage long sys_getcpu(unsigned __user *cpu, unsigned __user *node, struct getcpu_cache __user *cache);

+asmlinkage long sys_kevent_get_events(int ctl_fd, unsigned int min, unsigned int max,
+ __u64 timeout, struct ukevent __user *buf, unsigned flags);
+asmlinkage long sys_kevent_ctl(int ctl_fd, unsigned int cmd, unsigned int num, struct ukevent __user *buf);
+asmlinkage long sys_kevent_wait(int ctl_fd, unsigned int num, __u64 timeout);
+asmlinkage long sys_kevent_ring_init(int ctl_fd, struct kevent_ring __user *ring, unsigned int num);
#endif
diff --git a/include/linux/ukevent.h b/include/linux/ukevent.h
new file mode 100644
index 0000000..ee881c9
--- /dev/null
+++ b/include/linux/ukevent.h
@@ -0,0 +1,153 @@
+/*
+ * 2006 Copyright (c) Evgeniy Polyakov <[email protected]>
+ * All rights reserved.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software
+ * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA
+ */
+
+#ifndef __UKEVENT_H
+#define __UKEVENT_H
+
+/*
+ * Kevent request flags.
+ */
+
+/* Process this event only once and then remove it. */
+#define KEVENT_REQ_ONESHOT 0x1
+/* Wake up only when event exclusively belongs to this thread,
+ * for example when several threads are waiting for new client
+ * connection so they could perform accept() it is a good idea
+ * to set this flag, so only one thread of all with this flag set
+ * will be awakened.
+ * If there are events without this flags, appropriate threads will
+ * be awakened too. */
+#define KEVENT_REQ_WAKEUP_ONE 0x2
+/* Edge Triggered behaviour. */
+#define KEVENT_REQ_ET 0x4
+
+/*
+ * Kevent return flags.
+ */
+/* Kevent is broken. */
+#define KEVENT_RET_BROKEN 0x1
+/* Kevent processing was finished successfully. */
+#define KEVENT_RET_DONE 0x2
+/* Kevent was not copied into ring buffer due to some error conditions. */
+#define KEVENT_RET_COPY_FAILED 0x4
+
+/*
+ * Kevent type set.
+ */
+#define KEVENT_SOCKET 0
+#define KEVENT_INODE 1
+#define KEVENT_TIMER 2
+#define KEVENT_POLL 3
+#define KEVENT_NAIO 4
+#define KEVENT_AIO 5
+#define KEVENT_MAX 6
+
+/*
+ * Per-type event sets.
+ * Number of per-event sets should be exactly as number of kevent types.
+ */
+
+/*
+ * Timer events.
+ */
+#define KEVENT_TIMER_FIRED 0x1
+
+/*
+ * Socket/network asynchronous IO events.
+ */
+#define KEVENT_SOCKET_RECV 0x1
+#define KEVENT_SOCKET_ACCEPT 0x2
+#define KEVENT_SOCKET_SEND 0x4
+
+/*
+ * Inode events.
+ */
+#define KEVENT_INODE_CREATE 0x1
+#define KEVENT_INODE_REMOVE 0x2
+
+/*
+ * Poll events.
+ */
+#define KEVENT_POLL_POLLIN 0x0001
+#define KEVENT_POLL_POLLPRI 0x0002
+#define KEVENT_POLL_POLLOUT 0x0004
+#define KEVENT_POLL_POLLERR 0x0008
+#define KEVENT_POLL_POLLHUP 0x0010
+#define KEVENT_POLL_POLLNVAL 0x0020
+
+#define KEVENT_POLL_POLLRDNORM 0x0040
+#define KEVENT_POLL_POLLRDBAND 0x0080
+#define KEVENT_POLL_POLLWRNORM 0x0100
+#define KEVENT_POLL_POLLWRBAND 0x0200
+#define KEVENT_POLL_POLLMSG 0x0400
+#define KEVENT_POLL_POLLREMOVE 0x1000
+
+/*
+ * Asynchronous IO events.
+ */
+#define KEVENT_AIO_BIO 0x1
+
+#define KEVENT_MASK_ALL 0xffffffff
+/* Mask of all possible event values. */
+#define KEVENT_MASK_EMPTY 0x0
+/* Empty mask of ready events. */
+
+struct kevent_id
+{
+ union {
+ __u32 raw[2];
+ __u64 raw_u64 __attribute__((aligned(8)));
+ };
+};
+
+struct ukevent
+{
+ /* Id of this request, e.g. socket number, file descriptor and so on... */
+ struct kevent_id id;
+ /* Event type, e.g. KEVENT_SOCK, KEVENT_INODE, KEVENT_TIMER and so on... */
+ __u32 type;
+ /* Event itself, e.g. SOCK_ACCEPT, INODE_CREATED, TIMER_FIRED... */
+ __u32 event;
+ /* Per-event request flags */
+ __u32 req_flags;
+ /* Per-event return flags */
+ __u32 ret_flags;
+ /* Event return data. Event originator fills it with anything it likes. */
+ __u32 ret_data[2];
+ /* User's data. It is not used, just copied to/from user.
+ * The whole structure is aligned to 8 bytes already, so the last union
+ * is aligned properly.
+ */
+ union {
+ __u32 user[2];
+ void *ptr;
+ };
+};
+
+struct kevent_ring
+{
+ unsigned int ring_kidx;
+ struct ukevent event[0];
+};
+
+#define KEVENT_CTL_ADD 0
+#define KEVENT_CTL_REMOVE 1
+#define KEVENT_CTL_MODIFY 2
+
+#endif /* __UKEVENT_H */
diff --git a/init/Kconfig b/init/Kconfig
index d2eb7a8..c7d8250 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -201,6 +201,8 @@ config AUDITSYSCALL
such as SELinux. To use audit's filesystem watch feature, please
ensure that INOTIFY is configured.

+source "kernel/kevent/Kconfig"
+
config IKCONFIG
bool "Kernel .config support"
---help---
diff --git a/kernel/Makefile b/kernel/Makefile
index d62ec66..2d7a6dd 100644
--- a/kernel/Makefile
+++ b/kernel/Makefile
@@ -47,6 +47,7 @@ obj-$(CONFIG_DETECT_SOFTLOCKUP) += softl
obj-$(CONFIG_GENERIC_HARDIRQS) += irq/
obj-$(CONFIG_SECCOMP) += seccomp.o
obj-$(CONFIG_RCU_TORTURE_TEST) += rcutorture.o
+obj-$(CONFIG_KEVENT) += kevent/
obj-$(CONFIG_RELAY) += relay.o
obj-$(CONFIG_TASK_DELAY_ACCT) += delayacct.o
obj-$(CONFIG_TASKSTATS) += taskstats.o
diff --git a/kernel/kevent/Kconfig b/kernel/kevent/Kconfig
new file mode 100644
index 0000000..5ba8086
--- /dev/null
+++ b/kernel/kevent/Kconfig
@@ -0,0 +1,39 @@
+config KEVENT
+ bool "Kernel event notification mechanism"
+ help
+ This option enables event queue mechanism.
+ It can be used as replacement for poll()/select(), AIO callback
+ invocations, advanced timer notifications and other kernel
+ object status changes.
+
+config KEVENT_USER_STAT
+ bool "Kevent user statistic"
+ depends on KEVENT
+ help
+ This option will turn kevent_user statistic collection on.
+ Statistic data includes total number of kevent, number of kevents
+ which are ready immediately at insertion time and number of kevents
+ which were removed through readiness completion.
+ It will be printed each time control kevent descriptor is closed.
+
+config KEVENT_TIMER
+ bool "Kernel event notifications for timers"
+ depends on KEVENT
+ help
+ This option allows to use timers through KEVENT subsystem.
+
+config KEVENT_POLL
+ bool "Kernel event notifications for poll()/select()"
+ depends on KEVENT
+ help
+ This option allows to use kevent subsystem for poll()/select()
+ notifications.
+
+config KEVENT_SOCKET
+ bool "Kernel event notifications for sockets"
+ depends on NET && KEVENT
+ help
+ This option enables notifications through KEVENT subsystem of
+ sockets operations, like new packet receiving conditions,
+ ready for accept conditions and so on.
+
diff --git a/kernel/kevent/Makefile b/kernel/kevent/Makefile
new file mode 100644
index 0000000..9130cad
--- /dev/null
+++ b/kernel/kevent/Makefile
@@ -0,0 +1,4 @@
+obj-y := kevent.o kevent_user.o
+obj-$(CONFIG_KEVENT_TIMER) += kevent_timer.o
+obj-$(CONFIG_KEVENT_POLL) += kevent_poll.o
+obj-$(CONFIG_KEVENT_SOCKET) += kevent_socket.o
diff --git a/kernel/kevent/kevent.c b/kernel/kevent/kevent.c
new file mode 100644
index 0000000..24ee44a
--- /dev/null
+++ b/kernel/kevent/kevent.c
@@ -0,0 +1,232 @@
+/*
+ * 2006 Copyright (c) Evgeniy Polyakov <[email protected]>
+ * All rights reserved.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software
+ * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA
+ */
+
+#include <linux/kernel.h>
+#include <linux/types.h>
+#include <linux/list.h>
+#include <linux/slab.h>
+#include <linux/spinlock.h>
+#include <linux/mempool.h>
+#include <linux/sched.h>
+#include <linux/wait.h>
+#include <linux/kevent.h>
+
+/*
+ * Attempts to add an event into appropriate origin's queue.
+ * Returns positive value if this event is ready immediately,
+ * negative value in case of error and zero if event has been queued.
+ * ->enqueue() callback must increase origin's reference counter.
+ */
+int kevent_enqueue(struct kevent *k)
+{
+ return k->callbacks.enqueue(k);
+}
+
+/*
+ * Remove event from the appropriate queue.
+ * ->dequeue() callback must decrease origin's reference counter.
+ */
+int kevent_dequeue(struct kevent *k)
+{
+ return k->callbacks.dequeue(k);
+}
+
+/*
+ * Mark kevent as broken.
+ */
+int kevent_break(struct kevent *k)
+{
+ unsigned long flags;
+
+ spin_lock_irqsave(&k->ulock, flags);
+ k->event.ret_flags |= KEVENT_RET_BROKEN;
+ spin_unlock_irqrestore(&k->ulock, flags);
+ return -EINVAL;
+}
+
+static struct kevent_callbacks kevent_registered_callbacks[KEVENT_MAX] __read_mostly;
+
+int kevent_add_callbacks(const struct kevent_callbacks *cb, int pos)
+{
+ struct kevent_callbacks *p;
+
+ if (pos >= KEVENT_MAX)
+ return -EINVAL;
+
+ p = &kevent_registered_callbacks[pos];
+
+ p->enqueue = (cb->enqueue) ? cb->enqueue : kevent_break;
+ p->dequeue = (cb->dequeue) ? cb->dequeue : kevent_break;
+ p->callback = (cb->callback) ? cb->callback : kevent_break;
+
+ printk(KERN_INFO "KEVENT: Added callbacks for type %d.\n", pos);
+ return 0;
+}
+
+/*
+ * Must be called before event is going to be added into some origin's queue.
+ * Initializes ->enqueue(), ->dequeue() and ->callback() callbacks.
+ * If failed, kevent should not be used or kevent_enqueue() will fail to add
+ * this kevent into origin's queue with setting
+ * KEVENT_RET_BROKEN flag in kevent->event.ret_flags.
+ */
+int kevent_init(struct kevent *k)
+{
+ spin_lock_init(&k->ulock);
+ k->flags = 0;
+
+ if (unlikely(k->event.type >= KEVENT_MAX ||
+ !kevent_registered_callbacks[k->event.type].callback))
+ return kevent_break(k);
+
+ k->callbacks = kevent_registered_callbacks[k->event.type];
+ if (unlikely(k->callbacks.callback == kevent_break))
+ return kevent_break(k);
+
+ return 0;
+}
+
+/*
+ * Called from ->enqueue() callback when reference counter for given
+ * origin (socket, inode...) has been increased.
+ */
+int kevent_storage_enqueue(struct kevent_storage *st, struct kevent *k)
+{
+ unsigned long flags;
+
+ k->st = st;
+ spin_lock_irqsave(&st->lock, flags);
+ list_add_tail_rcu(&k->storage_entry, &st->list);
+ k->flags |= KEVENT_STORAGE;
+ spin_unlock_irqrestore(&st->lock, flags);
+ return 0;
+}
+
+/*
+ * Dequeue kevent from origin's queue.
+ * It does not decrease origin's reference counter in any way
+ * and must be called before it, so storage itself must be valid.
+ * It is called from ->dequeue() callback.
+ */
+void kevent_storage_dequeue(struct kevent_storage *st, struct kevent *k)
+{
+ unsigned long flags;
+
+ spin_lock_irqsave(&st->lock, flags);
+ if (k->flags & KEVENT_STORAGE) {
+ list_del_rcu(&k->storage_entry);
+ k->flags &= ~KEVENT_STORAGE;
+ }
+ spin_unlock_irqrestore(&st->lock, flags);
+}
+
+/*
+ * Call kevent ready callback and queue it into ready queue if needed.
+ * If kevent is marked as one-shot, then remove it from storage queue.
+ */
+static int __kevent_requeue(struct kevent *k, u32 event)
+{
+ int ret, rem;
+ unsigned long flags;
+
+ ret = k->callbacks.callback(k);
+
+ spin_lock_irqsave(&k->ulock, flags);
+ if (ret > 0)
+ k->event.ret_flags |= KEVENT_RET_DONE;
+ else if (ret < 0)
+ k->event.ret_flags |= (KEVENT_RET_BROKEN | KEVENT_RET_DONE);
+ else
+ ret = (k->event.ret_flags & (KEVENT_RET_BROKEN|KEVENT_RET_DONE));
+ rem = (k->event.req_flags & KEVENT_REQ_ONESHOT);
+ spin_unlock_irqrestore(&k->ulock, flags);
+
+ if (ret) {
+ if ((rem || ret < 0) && (k->flags & KEVENT_STORAGE)) {
+ list_del_rcu(&k->storage_entry);
+ k->flags &= ~KEVENT_STORAGE;
+ }
+
+ spin_lock_irqsave(&k->user->ready_lock, flags);
+ if (!(k->flags & KEVENT_READY)) {
+ list_add_tail(&k->ready_entry, &k->user->ready_list);
+ k->flags |= KEVENT_READY;
+ k->user->ready_num++;
+ }
+ spin_unlock_irqrestore(&k->user->ready_lock, flags);
+ wake_up(&k->user->wait);
+ }
+
+ return ret;
+}
+
+/*
+ * Check if kevent is ready (by invoking it's callback) and requeue/remove
+ * if needed.
+ */
+void kevent_requeue(struct kevent *k)
+{
+ unsigned long flags;
+
+ spin_lock_irqsave(&k->st->lock, flags);
+ __kevent_requeue(k, 0);
+ spin_unlock_irqrestore(&k->st->lock, flags);
+}
+
+/*
+ * Called each time some activity in origin (socket, inode...) is noticed.
+ */
+void kevent_storage_ready(struct kevent_storage *st,
+ kevent_callback_t ready_callback, u32 event)
+{
+ struct kevent *k;
+ int wake_num = 0;
+
+ rcu_read_lock();
+ if (ready_callback)
+ list_for_each_entry_rcu(k, &st->list, storage_entry)
+ (*ready_callback)(k);
+
+ list_for_each_entry_rcu(k, &st->list, storage_entry) {
+ if (event & k->event.event)
+ if (!(k->event.req_flags & KEVENT_REQ_WAKEUP_ONE) || wake_num == 0)
+ if (__kevent_requeue(k, event))
+ wake_num++;
+ }
+ rcu_read_unlock();
+}
+
+int kevent_storage_init(void *origin, struct kevent_storage *st)
+{
+ spin_lock_init(&st->lock);
+ st->origin = origin;
+ INIT_LIST_HEAD(&st->list);
+ return 0;
+}
+
+/*
+ * Mark all events as broken, that will remove them from storage,
+ * so storage origin (inode, sockt and so on) can be safely removed.
+ * No new entries are allowed to be added into the storage at this point.
+ * (Socket is removed from file table at this point for example).
+ */
+void kevent_storage_fini(struct kevent_storage *st)
+{
+ kevent_storage_ready(st, kevent_break, KEVENT_MASK_ALL);
+}
diff --git a/kernel/kevent/kevent_user.c b/kernel/kevent/kevent_user.c
new file mode 100644
index 0000000..5ebfa6d
--- /dev/null
+++ b/kernel/kevent/kevent_user.c
@@ -0,0 +1,913 @@
+/*
+ * 2006 Copyright (c) Evgeniy Polyakov <[email protected]>
+ * All rights reserved.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software
+ * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA
+ */
+
+#include <linux/kernel.h>
+#include <linux/module.h>
+#include <linux/types.h>
+#include <linux/list.h>
+#include <linux/slab.h>
+#include <linux/spinlock.h>
+#include <linux/fs.h>
+#include <linux/file.h>
+#include <linux/mount.h>
+#include <linux/device.h>
+#include <linux/poll.h>
+#include <linux/kevent.h>
+#include <linux/miscdevice.h>
+#include <asm/io.h>
+
+static const char kevent_name[] = "kevent";
+static kmem_cache_t *kevent_cache __read_mostly;
+
+/*
+ * kevents are pollable, return POLLIN and POLLRDNORM
+ * when there is at least one ready kevent.
+ */
+static unsigned int kevent_user_poll(struct file *file, struct poll_table_struct *wait)
+{
+ struct kevent_user *u = file->private_data;
+ unsigned int mask;
+
+ poll_wait(file, &u->wait, wait);
+ mask = 0;
+
+ if (u->ready_num)
+ mask |= POLLIN | POLLRDNORM;
+
+ return mask;
+}
+
+/*
+ * Copies kevent into userspace ring buffer if it was initialized.
+ * Returns
+ * 0 on success,
+ * -EAGAIN if there were no place for that kevent (impossible)
+ * -EFAULT if copy_to_user() failed.
+ *
+ * Must be called under kevent_user->ring_lock locked.
+ */
+static int kevent_copy_ring_buffer(struct kevent *k)
+{
+ struct kevent_ring __user *ring;
+ struct kevent_user *u = k->user;
+ unsigned long flags;
+ int err;
+
+ ring = u->pring;
+ if (!ring)
+ return 0;
+
+ if (copy_to_user(&ring->event[u->kidx], &k->event, sizeof(struct ukevent))) {
+ err = -EFAULT;
+ goto err_out_exit;
+ }
+
+ if (put_user(u->kidx, &ring->ring_kidx)) {
+ err = -EFAULT;
+ goto err_out_exit;
+ }
+
+ if (++u->kidx >= u->ring_size)
+ u->kidx = 0;
+
+ return 0;
+
+err_out_exit:
+ spin_lock_irqsave(&k->ulock, flags);
+ k->event.ret_flags |= KEVENT_RET_COPY_FAILED;
+ spin_unlock_irqrestore(&k->ulock, flags);
+ return err;
+}
+
+static int kevent_user_open(struct inode *inode, struct file *file)
+{
+ struct kevent_user *u;
+
+ u = kzalloc(sizeof(struct kevent_user), GFP_KERNEL);
+ if (!u)
+ return -ENOMEM;
+
+ INIT_LIST_HEAD(&u->ready_list);
+ spin_lock_init(&u->ready_lock);
+ kevent_stat_init(u);
+ spin_lock_init(&u->kevent_lock);
+ u->kevent_root = RB_ROOT;
+
+ mutex_init(&u->ctl_mutex);
+ init_waitqueue_head(&u->wait);
+
+ atomic_set(&u->refcnt, 1);
+
+ mutex_init(&u->ring_lock);
+ u->kidx = u->ring_size = 0;
+ u->pring = NULL;
+
+ file->private_data = u;
+ return 0;
+}
+
+/*
+ * Kevent userspace control block reference counting.
+ * Set to 1 at creation time, when appropriate kevent file descriptor
+ * is closed, that reference counter is decreased.
+ * When counter hits zero block is freed.
+ */
+static inline void kevent_user_get(struct kevent_user *u)
+{
+ atomic_inc(&u->refcnt);
+}
+
+static inline void kevent_user_put(struct kevent_user *u)
+{
+ if (atomic_dec_and_test(&u->refcnt)) {
+ kevent_stat_print(u);
+ kfree(u);
+ }
+}
+
+static inline int kevent_compare_id(struct kevent_id *left, struct kevent_id *right)
+{
+ if (left->raw_u64 > right->raw_u64)
+ return -1;
+
+ if (right->raw_u64 > left->raw_u64)
+ return 1;
+
+ return 0;
+}
+
+/*
+ * RCU protects storage list (kevent->storage_entry).
+ * Free entry in RCU callback, it is dequeued from all lists at
+ * this point.
+ */
+
+static void kevent_free_rcu(struct rcu_head *rcu)
+{
+ struct kevent *kevent = container_of(rcu, struct kevent, rcu_head);
+ kmem_cache_free(kevent_cache, kevent);
+}
+
+/*
+ * Must be called under u->ready_lock.
+ * This function unlinks kevent from ready queue.
+ */
+static inline void kevent_unlink_ready(struct kevent *k)
+{
+ list_del(&k->ready_entry);
+ k->flags &= ~KEVENT_READY;
+ k->user->ready_num--;
+}
+
+static void kevent_remove_ready(struct kevent *k)
+{
+ struct kevent_user *u = k->user;
+ unsigned long flags;
+
+ spin_lock_irqsave(&u->ready_lock, flags);
+ if (k->flags & KEVENT_READY)
+ kevent_unlink_ready(k);
+ spin_unlock_irqrestore(&u->ready_lock, flags);
+}
+
+/*
+ * Complete kevent removing - it dequeues kevent from storage list
+ * if it is requested, removes kevent from ready list, drops userspace
+ * control block reference counter and schedules kevent freeing through RCU.
+ */
+static void kevent_finish_user_complete(struct kevent *k, int deq)
+{
+ if (deq)
+ kevent_dequeue(k);
+
+ kevent_remove_ready(k);
+
+ kevent_user_put(k->user);
+ call_rcu(&k->rcu_head, kevent_free_rcu);
+}
+
+/*
+ * Remove from all lists and free kevent.
+ * Must be called under kevent_user->kevent_lock to protect
+ * kevent->kevent_entry removing.
+ */
+static void __kevent_finish_user(struct kevent *k, int deq)
+{
+ struct kevent_user *u = k->user;
+
+ rb_erase(&k->kevent_node, &u->kevent_root);
+ k->flags &= ~KEVENT_USER;
+ u->kevent_num--;
+ kevent_finish_user_complete(k, deq);
+}
+
+/*
+ * Remove kevent from user's list of all events,
+ * dequeue it from storage and decrease user's reference counter,
+ * since this kevent does not exist anymore. That is why it is freed here.
+ */
+static void kevent_finish_user(struct kevent *k, int deq)
+{
+ struct kevent_user *u = k->user;
+ unsigned long flags;
+
+ spin_lock_irqsave(&u->kevent_lock, flags);
+ rb_erase(&k->kevent_node, &u->kevent_root);
+ k->flags &= ~KEVENT_USER;
+ u->kevent_num--;
+ spin_unlock_irqrestore(&u->kevent_lock, flags);
+ kevent_finish_user_complete(k, deq);
+}
+
+/*
+ * Dequeue one entry from user's ready queue.
+ */
+static struct kevent *kqueue_dequeue_ready(struct kevent_user *u)
+{
+ unsigned long flags;
+ struct kevent *k = NULL;
+
+ mutex_lock(&u->ring_lock);
+ spin_lock_irqsave(&u->ready_lock, flags);
+ if (u->ready_num && !list_empty(&u->ready_list)) {
+ k = list_entry(u->ready_list.next, struct kevent, ready_entry);
+ kevent_unlink_ready(k);
+ }
+ spin_unlock_irqrestore(&u->ready_lock, flags);
+
+ if (k)
+ kevent_copy_ring_buffer(k);
+ mutex_unlock(&u->ring_lock);
+
+ return k;
+}
+
+static void kevent_complete_ready(struct kevent *k)
+{
+ if (k->event.req_flags & KEVENT_REQ_ONESHOT)
+ /*
+ * If it is one-shot kevent, it has been removed already from
+ * origin's queue, so we can easily free it here.
+ */
+ kevent_finish_user(k, 1);
+ else if (k->event.req_flags & KEVENT_REQ_ET) {
+ unsigned long flags;
+
+ /*
+ * Edge-triggered behaviour: mark event as clear new one.
+ */
+
+ spin_lock_irqsave(&k->ulock, flags);
+ k->event.ret_flags = 0;
+ k->event.ret_data[0] = k->event.ret_data[1] = 0;
+ spin_unlock_irqrestore(&k->ulock, flags);
+ }
+}
+
+/*
+ * Search a kevent inside kevent tree for given ukevent.
+ */
+static struct kevent *__kevent_search(struct kevent_id *id, struct kevent_user *u)
+{
+ struct kevent *k, *ret = NULL;
+ struct rb_node *n = u->kevent_root.rb_node;
+ int cmp;
+
+ while (n) {
+ k = rb_entry(n, struct kevent, kevent_node);
+ cmp = kevent_compare_id(&k->event.id, id);
+
+ if (cmp > 0)
+ n = n->rb_right;
+ else if (cmp < 0)
+ n = n->rb_left;
+ else {
+ ret = k;
+ break;
+ }
+ }
+
+ return ret;
+}
+
+/*
+ * Search and modify kevent according to provided ukevent.
+ */
+static int kevent_modify(struct ukevent *uk, struct kevent_user *u)
+{
+ struct kevent *k;
+ int err = -ENODEV;
+ unsigned long flags;
+
+ spin_lock_irqsave(&u->kevent_lock, flags);
+ k = __kevent_search(&uk->id, u);
+ if (k) {
+ spin_lock(&k->ulock);
+ k->event.event = uk->event;
+ k->event.req_flags = uk->req_flags;
+ k->event.ret_flags = 0;
+ spin_unlock(&k->ulock);
+ kevent_requeue(k);
+ err = 0;
+ }
+ spin_unlock_irqrestore(&u->kevent_lock, flags);
+
+ return err;
+}
+
+/*
+ * Remove kevent which matches provided ukevent.
+ */
+static int kevent_remove(struct ukevent *uk, struct kevent_user *u)
+{
+ int err = -ENODEV;
+ struct kevent *k;
+ unsigned long flags;
+
+ spin_lock_irqsave(&u->kevent_lock, flags);
+ k = __kevent_search(&uk->id, u);
+ if (k) {
+ __kevent_finish_user(k, 1);
+ err = 0;
+ }
+ spin_unlock_irqrestore(&u->kevent_lock, flags);
+
+ return err;
+}
+
+/*
+ * Detaches userspace control block from file descriptor
+ * and decrease it's reference counter.
+ * No new kevents can be added or removed from any list at this point.
+ */
+static int kevent_user_release(struct inode *inode, struct file *file)
+{
+ struct kevent_user *u = file->private_data;
+ struct kevent *k;
+ struct rb_node *n;
+
+ for (n = rb_first(&u->kevent_root); n; n = rb_next(n)) {
+ k = rb_entry(n, struct kevent, kevent_node);
+ kevent_finish_user(k, 1);
+ }
+
+ kevent_user_put(u);
+ file->private_data = NULL;
+
+ return 0;
+}
+
+/*
+ * Read requested number of ukevents in one shot.
+ */
+static struct ukevent *kevent_get_user(unsigned int num, void __user *arg)
+{
+ struct ukevent *ukev;
+
+ ukev = kmalloc(sizeof(struct ukevent) * num, GFP_KERNEL);
+ if (!ukev)
+ return NULL;
+
+ if (copy_from_user(ukev, arg, sizeof(struct ukevent) * num)) {
+ kfree(ukev);
+ return NULL;
+ }
+
+ return ukev;
+}
+
+/*
+ * Read from userspace all ukevents and modify appropriate kevents.
+ * If provided number of ukevents is more that threshold, it is faster
+ * to allocate a room for them and copy in one shot instead of copy
+ * one-by-one and then process them.
+ */
+static int kevent_user_ctl_modify(struct kevent_user *u, unsigned int num, void __user *arg)
+{
+ int err = 0, i;
+ struct ukevent uk;
+
+ mutex_lock(&u->ctl_mutex);
+
+ if (num > u->kevent_num) {
+ err = -EINVAL;
+ goto out;
+ }
+
+ if (num > KEVENT_MIN_BUFFS_ALLOC) {
+ struct ukevent *ukev;
+
+ ukev = kevent_get_user(num, arg);
+ if (ukev) {
+ for (i = 0; i < num; ++i) {
+ if (kevent_modify(&ukev[i], u))
+ ukev[i].ret_flags |= KEVENT_RET_BROKEN;
+ ukev[i].ret_flags |= KEVENT_RET_DONE;
+ }
+ if (copy_to_user(arg, ukev, num*sizeof(struct ukevent)))
+ err = -EFAULT;
+ kfree(ukev);
+ goto out;
+ }
+ }
+
+ for (i = 0; i < num; ++i) {
+ if (copy_from_user(&uk, arg, sizeof(struct ukevent))) {
+ err = -EFAULT;
+ break;
+ }
+
+ if (kevent_modify(&uk, u))
+ uk.ret_flags |= KEVENT_RET_BROKEN;
+ uk.ret_flags |= KEVENT_RET_DONE;
+
+ if (copy_to_user(arg, &uk, sizeof(struct ukevent))) {
+ err = -EFAULT;
+ break;
+ }
+
+ arg += sizeof(struct ukevent);
+ }
+out:
+ mutex_unlock(&u->ctl_mutex);
+
+ return err;
+}
+
+/*
+ * Read from userspace all ukevents and remove appropriate kevents.
+ * If provided number of ukevents is more that threshold, it is faster
+ * to allocate a room for them and copy in one shot instead of copy
+ * one-by-one and then process them.
+ */
+static int kevent_user_ctl_remove(struct kevent_user *u, unsigned int num, void __user *arg)
+{
+ int err = 0, i;
+ struct ukevent uk;
+
+ mutex_lock(&u->ctl_mutex);
+
+ if (num > u->kevent_num) {
+ err = -EINVAL;
+ goto out;
+ }
+
+ if (num > KEVENT_MIN_BUFFS_ALLOC) {
+ struct ukevent *ukev;
+
+ ukev = kevent_get_user(num, arg);
+ if (ukev) {
+ for (i = 0; i < num; ++i) {
+ if (kevent_remove(&ukev[i], u))
+ ukev[i].ret_flags |= KEVENT_RET_BROKEN;
+ ukev[i].ret_flags |= KEVENT_RET_DONE;
+ }
+ if (copy_to_user(arg, ukev, num*sizeof(struct ukevent)))
+ err = -EFAULT;
+ kfree(ukev);
+ goto out;
+ }
+ }
+
+ for (i = 0; i < num; ++i) {
+ if (copy_from_user(&uk, arg, sizeof(struct ukevent))) {
+ err = -EFAULT;
+ break;
+ }
+
+ if (kevent_remove(&uk, u))
+ uk.ret_flags |= KEVENT_RET_BROKEN;
+
+ uk.ret_flags |= KEVENT_RET_DONE;
+
+ if (copy_to_user(arg, &uk, sizeof(struct ukevent))) {
+ err = -EFAULT;
+ break;
+ }
+
+ arg += sizeof(struct ukevent);
+ }
+out:
+ mutex_unlock(&u->ctl_mutex);
+
+ return err;
+}
+
+/*
+ * Queue kevent into userspace control block and increase
+ * it's reference counter.
+ */
+static int kevent_user_enqueue(struct kevent_user *u, struct kevent *new)
+{
+ unsigned long flags;
+ struct rb_node **p = &u->kevent_root.rb_node, *parent = NULL;
+ struct kevent *k;
+ int err = 0, cmp;
+
+ spin_lock_irqsave(&u->kevent_lock, flags);
+ while (*p) {
+ parent = *p;
+ k = rb_entry(parent, struct kevent, kevent_node);
+
+ cmp = kevent_compare_id(&k->event.id, &new->event.id);
+ if (cmp > 0)
+ p = &parent->rb_right;
+ else if (cmp < 0)
+ p = &parent->rb_left;
+ else {
+ err = -EEXIST;
+ break;
+ }
+ }
+ if (likely(!err)) {
+ rb_link_node(&new->kevent_node, parent, p);
+ rb_insert_color(&new->kevent_node, &u->kevent_root);
+ new->flags |= KEVENT_USER;
+ u->kevent_num++;
+ kevent_user_get(u);
+ }
+ spin_unlock_irqrestore(&u->kevent_lock, flags);
+
+ return err;
+}
+
+/*
+ * Add kevent from both kernel and userspace users.
+ * This function allocates and queues kevent, returns negative value
+ * on error, positive if kevent is ready immediately and zero
+ * if kevent has been queued.
+ */
+int kevent_user_add_ukevent(struct ukevent *uk, struct kevent_user *u)
+{
+ struct kevent *k;
+ int err;
+
+ k = kmem_cache_alloc(kevent_cache, GFP_KERNEL);
+ if (!k) {
+ err = -ENOMEM;
+ goto err_out_exit;
+ }
+
+ memcpy(&k->event, uk, sizeof(struct ukevent));
+ INIT_RCU_HEAD(&k->rcu_head);
+
+ k->event.ret_flags = 0;
+
+ err = kevent_init(k);
+ if (err) {
+ kmem_cache_free(kevent_cache, k);
+ goto err_out_exit;
+ }
+ k->user = u;
+ kevent_stat_total(u);
+ err = kevent_user_enqueue(u, k);
+ if (err) {
+ kmem_cache_free(kevent_cache, k);
+ goto err_out_exit;
+ }
+
+ err = kevent_enqueue(k);
+ if (err) {
+ memcpy(uk, &k->event, sizeof(struct ukevent));
+ kevent_finish_user(k, 0);
+ goto err_out_exit;
+ }
+
+ return 0;
+
+err_out_exit:
+ if (err < 0) {
+ uk->ret_flags |= KEVENT_RET_BROKEN | KEVENT_RET_DONE;
+ uk->ret_data[1] = err;
+ } else if (err > 0)
+ uk->ret_flags |= KEVENT_RET_DONE;
+ return err;
+}
+
+/*
+ * Copy all ukevents from userspace, allocate kevent for each one
+ * and add them into appropriate kevent_storages,
+ * e.g. sockets, inodes and so on...
+ * Ready events will replace ones provided by used and number
+ * of ready events is returned.
+ * User must check ret_flags field of each ukevent structure
+ * to determine if it is fired or failed event.
+ */
+static int kevent_user_ctl_add(struct kevent_user *u, unsigned int num, void __user *arg)
+{
+ int err, cerr = 0, rnum = 0, i;
+ void __user *orig = arg;
+ struct ukevent uk;
+
+ mutex_lock(&u->ctl_mutex);
+
+ err = -EINVAL;
+ if (num > KEVENT_MIN_BUFFS_ALLOC) {
+ struct ukevent *ukev;
+
+ ukev = kevent_get_user(num, arg);
+ if (ukev) {
+ for (i = 0; i < num; ++i) {
+ err = kevent_user_add_ukevent(&ukev[i], u);
+ if (err) {
+ kevent_stat_im(u);
+ if (i != rnum)
+ memcpy(&ukev[rnum], &ukev[i], sizeof(struct ukevent));
+ rnum++;
+ }
+ }
+ if (copy_to_user(orig, ukev, rnum*sizeof(struct ukevent)))
+ cerr = -EFAULT;
+ kfree(ukev);
+ goto out_setup;
+ }
+ }
+
+ for (i = 0; i < num; ++i) {
+ if (copy_from_user(&uk, arg, sizeof(struct ukevent))) {
+ cerr = -EFAULT;
+ break;
+ }
+ arg += sizeof(struct ukevent);
+
+ err = kevent_user_add_ukevent(&uk, u);
+ if (err) {
+ kevent_stat_im(u);
+ if (copy_to_user(orig, &uk, sizeof(struct ukevent))) {
+ cerr = -EFAULT;
+ break;
+ }
+ orig += sizeof(struct ukevent);
+ rnum++;
+ }
+ }
+
+out_setup:
+ if (cerr < 0) {
+ err = cerr;
+ goto out_remove;
+ }
+
+ err = rnum;
+out_remove:
+ mutex_unlock(&u->ctl_mutex);
+
+ return err;
+}
+
+/*
+ * In nonblocking mode it returns as many events as possible, but not more than @max_nr.
+ * In blocking mode it waits until timeout or if at least @min_nr events are ready.
+ */
+static int kevent_user_wait(struct file *file, struct kevent_user *u,
+ unsigned int min_nr, unsigned int max_nr, __u64 timeout,
+ void __user *buf)
+{
+ struct kevent *k;
+ int num = 0;
+
+ if (!(file->f_flags & O_NONBLOCK)) {
+ wait_event_interruptible_timeout(u->wait,
+ u->ready_num >= min_nr,
+ clock_t_to_jiffies(nsec_to_clock_t(timeout)));
+ }
+
+ while (num < max_nr && ((k = kqueue_dequeue_ready(u)) != NULL)) {
+ if (copy_to_user(buf + num*sizeof(struct ukevent),
+ &k->event, sizeof(struct ukevent)))
+ break;
+ kevent_complete_ready(k);
+ ++num;
+ kevent_stat_wait(u);
+ }
+
+ return num;
+}
+
+static struct file_operations kevent_user_fops = {
+ .open = kevent_user_open,
+ .release = kevent_user_release,
+ .poll = kevent_user_poll,
+ .owner = THIS_MODULE,
+};
+
+static struct miscdevice kevent_miscdev = {
+ .minor = MISC_DYNAMIC_MINOR,
+ .name = kevent_name,
+ .fops = &kevent_user_fops,
+};
+
+static int kevent_ctl_process(struct file *file, unsigned int cmd, unsigned int num, void __user *arg)
+{
+ int err;
+ struct kevent_user *u = file->private_data;
+
+ switch (cmd) {
+ case KEVENT_CTL_ADD:
+ err = kevent_user_ctl_add(u, num, arg);
+ break;
+ case KEVENT_CTL_REMOVE:
+ err = kevent_user_ctl_remove(u, num, arg);
+ break;
+ case KEVENT_CTL_MODIFY:
+ err = kevent_user_ctl_modify(u, num, arg);
+ break;
+ default:
+ err = -EINVAL;
+ break;
+ }
+
+ return err;
+}
+
+/*
+ * Used to get ready kevents from queue.
+ * @ctl_fd - kevent control descriptor which must be obtained through kevent_ctl(KEVENT_CTL_INIT).
+ * @min_nr - minimum number of ready kevents.
+ * @max_nr - maximum number of ready kevents.
+ * @timeout - timeout in nanoseconds to wait until some events are ready.
+ * @buf - buffer to place ready events.
+ * @flags - ununsed for now (will be used for mmap implementation).
+ */
+asmlinkage long sys_kevent_get_events(int ctl_fd, unsigned int min_nr, unsigned int max_nr,
+ __u64 timeout, struct ukevent __user *buf, unsigned flags)
+{
+ int err = -EINVAL;
+ struct file *file;
+ struct kevent_user *u;
+
+ file = fget(ctl_fd);
+ if (!file)
+ return -EBADF;
+
+ if (file->f_op != &kevent_user_fops)
+ goto out_fput;
+ u = file->private_data;
+
+ err = kevent_user_wait(file, u, min_nr, max_nr, timeout, buf);
+out_fput:
+ fput(file);
+ return err;
+}
+
+asmlinkage long sys_kevent_ring_init(int ctl_fd, struct kevent_ring __user *ring, unsigned int num)
+{
+ int err = -EINVAL;
+ struct file *file;
+ struct kevent_user *u;
+
+ file = fget(ctl_fd);
+ if (!file)
+ return -EBADF;
+
+ if (file->f_op != &kevent_user_fops)
+ goto out_fput;
+ u = file->private_data;
+
+ mutex_lock(&u->ring_lock);
+ if (u->pring) {
+ err = -EINVAL;
+ goto err_out_exit;
+ }
+ u->pring = ring;
+ u->ring_size = num;
+ mutex_unlock(&u->ring_lock);
+
+ fput(file);
+
+ return 0;
+
+err_out_exit:
+ mutex_unlock(&u->ring_lock);
+out_fput:
+ fput(file);
+ return err;
+}
+
+/*
+ * This syscall is used to perform waiting until there is free space in kevent queue
+ * and removes/requeues requested number of events (commits them). Function returns
+ * number of actually committed events.
+ *
+ * @ctl_fd - kevent file descriptor.
+ * @num - number of kevents to process.
+ * @timeout - this timeout specifies number of nanoseconds to wait until there is
+ * free space in kevent queue.
+ *
+ * When we need to commit @num events, it means we should just remove first @num
+ * kevents from ready queue and copy them into the buffer.
+ * Kevents will be copied into ring buffer in order they were placed into ready queue.
+ */
+asmlinkage long sys_kevent_wait(int ctl_fd, unsigned int num, __u64 timeout)
+{
+ int err = -EINVAL, committed = 0;
+ struct file *file;
+ struct kevent_user *u;
+ struct kevent *k;
+ struct kevent_ring __user *ring;
+ unsigned int i;
+
+ file = fget(ctl_fd);
+ if (!file)
+ return -EBADF;
+
+ if (file->f_op != &kevent_user_fops)
+ goto out_fput;
+ u = file->private_data;
+
+ ring = u->pring;
+ if (!ring || num >= u->ring_size)
+ goto out_fput;
+
+ if (!(file->f_flags & O_NONBLOCK)) {
+ wait_event_interruptible_timeout(u->wait,
+ u->ready_num >= 1,
+ clock_t_to_jiffies(nsec_to_clock_t(timeout)));
+ }
+
+ for (i=0; i<num; ++i) {
+ k = kqueue_dequeue_ready(u);
+ if (!k)
+ break;
+ kevent_complete_ready(k);
+ kevent_stat_ring(u);
+ committed++;
+ }
+
+ fput(file);
+
+ return committed;
+out_fput:
+ fput(file);
+ return err;
+}
+
+/*
+ * This syscall is used to perform various control operations
+ * on given kevent queue, which is obtained through kevent file descriptor @fd.
+ * @cmd - type of operation.
+ * @num - number of kevents to be processed.
+ * @arg - pointer to array of struct ukevent.
+ */
+asmlinkage long sys_kevent_ctl(int fd, unsigned int cmd, unsigned int num, struct ukevent __user *arg)
+{
+ int err = -EINVAL;
+ struct file *file;
+
+ file = fget(fd);
+ if (!file)
+ return -EBADF;
+
+ if (file->f_op != &kevent_user_fops)
+ goto out_fput;
+
+ err = kevent_ctl_process(file, cmd, num, arg);
+
+out_fput:
+ fput(file);
+ return err;
+}
+
+/*
+ * Kevent subsystem initialization - create kevent cache and register
+ * filesystem to get control file descriptors from.
+ */
+static int __init kevent_user_init(void)
+{
+ int err = 0;
+
+ kevent_cache = kmem_cache_create("kevent_cache",
+ sizeof(struct kevent), 0, SLAB_PANIC, NULL, NULL);
+
+ err = misc_register(&kevent_miscdev);
+ if (err) {
+ printk(KERN_ERR "Failed to register kevent miscdev: err=%d.\n", err);
+ goto err_out_exit;
+ }
+
+ printk(KERN_INFO "KEVENT subsystem has been successfully registered.\n");
+
+ return 0;
+
+err_out_exit:
+ kmem_cache_destroy(kevent_cache);
+ return err;
+}
+
+module_init(kevent_user_init);
diff --git a/kernel/sys_ni.c b/kernel/sys_ni.c
index 7a3b2e7..5200583 100644
--- a/kernel/sys_ni.c
+++ b/kernel/sys_ni.c
@@ -122,6 +122,11 @@ cond_syscall(ppc_rtas);
cond_syscall(sys_spu_run);
cond_syscall(sys_spu_create);

+cond_syscall(sys_kevent_get_events);
+cond_syscall(sys_kevent_wait);
+cond_syscall(sys_kevent_ctl);
+cond_syscall(sys_kevent_ring_init);
+
/* mmu depending weak syscall entries */
cond_syscall(sys_mprotect);
cond_syscall(sys_msync);

2006-11-07 16:52:48

by Evgeniy Polyakov

[permalink] [raw]
Subject: [take23 3/5] kevent: poll/select() notifications.


poll/select() notifications.

This patch includes generic poll/select notifications.
kevent_poll works simialr to epoll and has the same issues (callback
is invoked not from internal state machine of the caller, but through
process awake, a lot of allocations and so on).

Signed-off-by: Evgeniy Polyakov <[email protected]>

diff --git a/include/linux/fs.h b/include/linux/fs.h
index 5baf3a1..f81299f 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -276,6 +276,7 @@ #include <linux/prio_tree.h>
#include <linux/init.h>
#include <linux/sched.h>
#include <linux/mutex.h>
+#include <linux/kevent.h>

#include <asm/atomic.h>
#include <asm/semaphore.h>
@@ -586,6 +587,10 @@ #ifdef CONFIG_INOTIFY
struct mutex inotify_mutex; /* protects the watches list */
#endif

+#ifdef CONFIG_KEVENT_SOCKET
+ struct kevent_storage st;
+#endif
+
unsigned long i_state;
unsigned long dirtied_when; /* jiffies of first dirtying */

@@ -739,6 +744,9 @@ #ifdef CONFIG_EPOLL
struct list_head f_ep_links;
spinlock_t f_ep_lock;
#endif /* #ifdef CONFIG_EPOLL */
+#ifdef CONFIG_KEVENT_POLL
+ struct kevent_storage st;
+#endif
struct address_space *f_mapping;
};
extern spinlock_t files_lock;
diff --git a/kernel/kevent/kevent_poll.c b/kernel/kevent/kevent_poll.c
new file mode 100644
index 0000000..94facbb
--- /dev/null
+++ b/kernel/kevent/kevent_poll.c
@@ -0,0 +1,222 @@
+/*
+ * 2006 Copyright (c) Evgeniy Polyakov <[email protected]>
+ * All rights reserved.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ */
+
+#include <linux/kernel.h>
+#include <linux/types.h>
+#include <linux/list.h>
+#include <linux/slab.h>
+#include <linux/spinlock.h>
+#include <linux/timer.h>
+#include <linux/file.h>
+#include <linux/kevent.h>
+#include <linux/poll.h>
+#include <linux/fs.h>
+
+static kmem_cache_t *kevent_poll_container_cache;
+static kmem_cache_t *kevent_poll_priv_cache;
+
+struct kevent_poll_ctl
+{
+ struct poll_table_struct pt;
+ struct kevent *k;
+};
+
+struct kevent_poll_wait_container
+{
+ struct list_head container_entry;
+ wait_queue_head_t *whead;
+ wait_queue_t wait;
+ struct kevent *k;
+};
+
+struct kevent_poll_private
+{
+ struct list_head container_list;
+ spinlock_t container_lock;
+};
+
+static int kevent_poll_enqueue(struct kevent *k);
+static int kevent_poll_dequeue(struct kevent *k);
+static int kevent_poll_callback(struct kevent *k);
+
+static int kevent_poll_wait_callback(wait_queue_t *wait,
+ unsigned mode, int sync, void *key)
+{
+ struct kevent_poll_wait_container *cont =
+ container_of(wait, struct kevent_poll_wait_container, wait);
+ struct kevent *k = cont->k;
+ struct file *file = k->st->origin;
+ u32 revents;
+
+ revents = file->f_op->poll(file, NULL);
+
+ kevent_storage_ready(k->st, NULL, revents);
+
+ return 0;
+}
+
+static void kevent_poll_qproc(struct file *file, wait_queue_head_t *whead,
+ struct poll_table_struct *poll_table)
+{
+ struct kevent *k =
+ container_of(poll_table, struct kevent_poll_ctl, pt)->k;
+ struct kevent_poll_private *priv = k->priv;
+ struct kevent_poll_wait_container *cont;
+ unsigned long flags;
+
+ cont = kmem_cache_alloc(kevent_poll_container_cache, SLAB_KERNEL);
+ if (!cont) {
+ kevent_break(k);
+ return;
+ }
+
+ cont->k = k;
+ init_waitqueue_func_entry(&cont->wait, kevent_poll_wait_callback);
+ cont->whead = whead;
+
+ spin_lock_irqsave(&priv->container_lock, flags);
+ list_add_tail(&cont->container_entry, &priv->container_list);
+ spin_unlock_irqrestore(&priv->container_lock, flags);
+
+ add_wait_queue(whead, &cont->wait);
+}
+
+static int kevent_poll_enqueue(struct kevent *k)
+{
+ struct file *file;
+ int err, ready = 0;
+ unsigned int revents;
+ struct kevent_poll_ctl ctl;
+ struct kevent_poll_private *priv;
+
+ file = fget(k->event.id.raw[0]);
+ if (!file)
+ return -EBADF;
+
+ err = -EINVAL;
+ if (!file->f_op || !file->f_op->poll)
+ goto err_out_fput;
+
+ err = -ENOMEM;
+ priv = kmem_cache_alloc(kevent_poll_priv_cache, SLAB_KERNEL);
+ if (!priv)
+ goto err_out_fput;
+
+ spin_lock_init(&priv->container_lock);
+ INIT_LIST_HEAD(&priv->container_list);
+
+ k->priv = priv;
+
+ ctl.k = k;
+ init_poll_funcptr(&ctl.pt, &kevent_poll_qproc);
+
+ err = kevent_storage_enqueue(&file->st, k);
+ if (err)
+ goto err_out_free;
+
+ revents = file->f_op->poll(file, &ctl.pt);
+ if (revents & k->event.event) {
+ ready = 1;
+ kevent_poll_dequeue(k);
+ }
+
+ return ready;
+
+err_out_free:
+ kmem_cache_free(kevent_poll_priv_cache, priv);
+err_out_fput:
+ fput(file);
+ return err;
+}
+
+static int kevent_poll_dequeue(struct kevent *k)
+{
+ struct file *file = k->st->origin;
+ struct kevent_poll_private *priv = k->priv;
+ struct kevent_poll_wait_container *w, *n;
+ unsigned long flags;
+
+ kevent_storage_dequeue(k->st, k);
+
+ spin_lock_irqsave(&priv->container_lock, flags);
+ list_for_each_entry_safe(w, n, &priv->container_list, container_entry) {
+ list_del(&w->container_entry);
+ remove_wait_queue(w->whead, &w->wait);
+ kmem_cache_free(kevent_poll_container_cache, w);
+ }
+ spin_unlock_irqrestore(&priv->container_lock, flags);
+
+ kmem_cache_free(kevent_poll_priv_cache, priv);
+ k->priv = NULL;
+
+ fput(file);
+
+ return 0;
+}
+
+static int kevent_poll_callback(struct kevent *k)
+{
+ struct file *file = k->st->origin;
+ unsigned int revents = file->f_op->poll(file, NULL);
+
+ k->event.ret_data[0] = revents & k->event.event;
+
+ return (revents & k->event.event);
+}
+
+static int __init kevent_poll_sys_init(void)
+{
+ struct kevent_callbacks pc = {
+ .callback = &kevent_poll_callback,
+ .enqueue = &kevent_poll_enqueue,
+ .dequeue = &kevent_poll_dequeue};
+
+ kevent_poll_container_cache = kmem_cache_create("kevent_poll_container_cache",
+ sizeof(struct kevent_poll_wait_container), 0, 0, NULL, NULL);
+ if (!kevent_poll_container_cache) {
+ printk(KERN_ERR "Failed to create kevent poll container cache.\n");
+ return -ENOMEM;
+ }
+
+ kevent_poll_priv_cache = kmem_cache_create("kevent_poll_priv_cache",
+ sizeof(struct kevent_poll_private), 0, 0, NULL, NULL);
+ if (!kevent_poll_priv_cache) {
+ printk(KERN_ERR "Failed to create kevent poll private data cache.\n");
+ kmem_cache_destroy(kevent_poll_container_cache);
+ kevent_poll_container_cache = NULL;
+ return -ENOMEM;
+ }
+
+ kevent_add_callbacks(&pc, KEVENT_POLL);
+
+ printk(KERN_INFO "Kevent poll()/select() subsystem has been initialized.\n");
+ return 0;
+}
+
+static struct lock_class_key kevent_poll_key;
+
+void kevent_poll_reinit(struct file *file)
+{
+ lockdep_set_class(&file->st.lock, &kevent_poll_key);
+}
+
+static void __exit kevent_poll_sys_fini(void)
+{
+ kmem_cache_destroy(kevent_poll_priv_cache);
+ kmem_cache_destroy(kevent_poll_container_cache);
+}
+
+module_init(kevent_poll_sys_init);
+module_exit(kevent_poll_sys_fini);

2006-11-07 16:53:09

by Evgeniy Polyakov

[permalink] [raw]
Subject: [take23 4/5] kevent: Socket notifications.


Socket notifications.

This patch includes socket send/recv/accept notifications.
Using trivial web server based on kevent and this features
instead of epoll it's performance increased more than noticebly.
More details about various benchmarks and server itself
(evserver_kevent.c) can be found on project's homepage.

Signed-off-by: Evgeniy Polyakov <[email protected]>

diff --git a/fs/inode.c b/fs/inode.c
index ada7643..ff1b129 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -21,6 +21,7 @@ #include <linux/pagemap.h>
#include <linux/cdev.h>
#include <linux/bootmem.h>
#include <linux/inotify.h>
+#include <linux/kevent.h>
#include <linux/mount.h>

/*
@@ -164,12 +165,18 @@ #endif
}
inode->i_private = 0;
inode->i_mapping = mapping;
+#if defined CONFIG_KEVENT_SOCKET
+ kevent_storage_init(inode, &inode->st);
+#endif
}
return inode;
}

void destroy_inode(struct inode *inode)
{
+#if defined CONFIG_KEVENT_SOCKET
+ kevent_storage_fini(&inode->st);
+#endif
BUG_ON(inode_has_buffers(inode));
security_inode_free(inode);
if (inode->i_sb->s_op->destroy_inode)
diff --git a/include/net/sock.h b/include/net/sock.h
index edd4d73..d48ded8 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -48,6 +48,7 @@ #include <linux/lockdep.h>
#include <linux/netdevice.h>
#include <linux/skbuff.h> /* struct sk_buff */
#include <linux/security.h>
+#include <linux/kevent.h>

#include <linux/filter.h>

@@ -450,6 +451,21 @@ static inline int sk_stream_memory_free(

extern void sk_stream_rfree(struct sk_buff *skb);

+struct socket_alloc {
+ struct socket socket;
+ struct inode vfs_inode;
+};
+
+static inline struct socket *SOCKET_I(struct inode *inode)
+{
+ return &container_of(inode, struct socket_alloc, vfs_inode)->socket;
+}
+
+static inline struct inode *SOCK_INODE(struct socket *socket)
+{
+ return &container_of(socket, struct socket_alloc, socket)->vfs_inode;
+}
+
static inline void sk_stream_set_owner_r(struct sk_buff *skb, struct sock *sk)
{
skb->sk = sk;
@@ -477,6 +493,7 @@ static inline void sk_add_backlog(struct
sk->sk_backlog.tail = skb;
}
skb->next = NULL;
+ kevent_socket_notify(sk, KEVENT_SOCKET_RECV);
}

#define sk_wait_event(__sk, __timeo, __condition) \
@@ -679,21 +696,6 @@ static inline struct kiocb *siocb_to_kio
return si->kiocb;
}

-struct socket_alloc {
- struct socket socket;
- struct inode vfs_inode;
-};
-
-static inline struct socket *SOCKET_I(struct inode *inode)
-{
- return &container_of(inode, struct socket_alloc, vfs_inode)->socket;
-}
-
-static inline struct inode *SOCK_INODE(struct socket *socket)
-{
- return &container_of(socket, struct socket_alloc, socket)->vfs_inode;
-}
-
extern void __sk_stream_mem_reclaim(struct sock *sk);
extern int sk_stream_mem_schedule(struct sock *sk, int size, int kind);

diff --git a/include/net/tcp.h b/include/net/tcp.h
index 7a093d0..69f4ad2 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -857,6 +857,7 @@ static inline int tcp_prequeue(struct so
tp->ucopy.memory = 0;
} else if (skb_queue_len(&tp->ucopy.prequeue) == 1) {
wake_up_interruptible(sk->sk_sleep);
+ kevent_socket_notify(sk, KEVENT_SOCKET_RECV|KEVENT_SOCKET_SEND);
if (!inet_csk_ack_scheduled(sk))
inet_csk_reset_xmit_timer(sk, ICSK_TIME_DACK,
(3 * TCP_RTO_MIN) / 4,
diff --git a/kernel/kevent/kevent_socket.c b/kernel/kevent/kevent_socket.c
new file mode 100644
index 0000000..7f74110
--- /dev/null
+++ b/kernel/kevent/kevent_socket.c
@@ -0,0 +1,135 @@
+/*
+ * kevent_socket.c
+ *
+ * 2006 Copyright (c) Evgeniy Polyakov <[email protected]>
+ * All rights reserved.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software
+ * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA
+ */
+
+#include <linux/kernel.h>
+#include <linux/types.h>
+#include <linux/list.h>
+#include <linux/slab.h>
+#include <linux/spinlock.h>
+#include <linux/timer.h>
+#include <linux/file.h>
+#include <linux/tcp.h>
+#include <linux/kevent.h>
+
+#include <net/sock.h>
+#include <net/request_sock.h>
+#include <net/inet_connection_sock.h>
+
+static int kevent_socket_callback(struct kevent *k)
+{
+ struct inode *inode = k->st->origin;
+ unsigned int events = SOCKET_I(inode)->ops->poll(SOCKET_I(inode)->file, SOCKET_I(inode), NULL);
+
+ if ((events & (POLLIN | POLLRDNORM)) && (k->event.event & (KEVENT_SOCKET_RECV | KEVENT_SOCKET_ACCEPT)))
+ return 1;
+ if ((events & (POLLOUT | POLLWRNORM)) && (k->event.event & KEVENT_SOCKET_SEND))
+ return 1;
+ return 0;
+}
+
+int kevent_socket_enqueue(struct kevent *k)
+{
+ struct inode *inode;
+ struct socket *sock;
+ int err = -EBADF;
+
+ sock = sockfd_lookup(k->event.id.raw[0], &err);
+ if (!sock)
+ goto err_out_exit;
+
+ inode = igrab(SOCK_INODE(sock));
+ if (!inode)
+ goto err_out_fput;
+
+ err = kevent_storage_enqueue(&inode->st, k);
+ if (err)
+ goto err_out_iput;
+
+ err = k->callbacks.callback(k);
+ if (err)
+ goto err_out_dequeue;
+
+ return err;
+
+err_out_dequeue:
+ kevent_storage_dequeue(k->st, k);
+err_out_iput:
+ iput(inode);
+err_out_fput:
+ sockfd_put(sock);
+err_out_exit:
+ return err;
+}
+
+int kevent_socket_dequeue(struct kevent *k)
+{
+ struct inode *inode = k->st->origin;
+ struct socket *sock;
+
+ kevent_storage_dequeue(k->st, k);
+
+ sock = SOCKET_I(inode);
+ iput(inode);
+ sockfd_put(sock);
+
+ return 0;
+}
+
+void kevent_socket_notify(struct sock *sk, u32 event)
+{
+ if (sk->sk_socket)
+ kevent_storage_ready(&SOCK_INODE(sk->sk_socket)->st, NULL, event);
+}
+
+/*
+ * It is required for network protocols compiled as modules, like IPv6.
+ */
+EXPORT_SYMBOL_GPL(kevent_socket_notify);
+
+#ifdef CONFIG_LOCKDEP
+static struct lock_class_key kevent_sock_key;
+
+void kevent_socket_reinit(struct socket *sock)
+{
+ struct inode *inode = SOCK_INODE(sock);
+
+ lockdep_set_class(&inode->st.lock, &kevent_sock_key);
+}
+
+void kevent_sk_reinit(struct sock *sk)
+{
+ if (sk->sk_socket) {
+ struct inode *inode = SOCK_INODE(sk->sk_socket);
+
+ lockdep_set_class(&inode->st.lock, &kevent_sock_key);
+ }
+}
+#endif
+static int __init kevent_init_socket(void)
+{
+ struct kevent_callbacks sc = {
+ .callback = &kevent_socket_callback,
+ .enqueue = &kevent_socket_enqueue,
+ .dequeue = &kevent_socket_dequeue};
+
+ return kevent_add_callbacks(&sc, KEVENT_SOCKET);
+}
+module_init(kevent_init_socket);
diff --git a/net/core/sock.c b/net/core/sock.c
index b77e155..7d5fa3e 100644
--- a/net/core/sock.c
+++ b/net/core/sock.c
@@ -1402,6 +1402,7 @@ static void sock_def_wakeup(struct sock
if (sk->sk_sleep && waitqueue_active(sk->sk_sleep))
wake_up_interruptible_all(sk->sk_sleep);
read_unlock(&sk->sk_callback_lock);
+ kevent_socket_notify(sk, KEVENT_SOCKET_RECV|KEVENT_SOCKET_SEND);
}

static void sock_def_error_report(struct sock *sk)
@@ -1411,6 +1412,7 @@ static void sock_def_error_report(struct
wake_up_interruptible(sk->sk_sleep);
sk_wake_async(sk,0,POLL_ERR);
read_unlock(&sk->sk_callback_lock);
+ kevent_socket_notify(sk, KEVENT_SOCKET_RECV|KEVENT_SOCKET_SEND);
}

static void sock_def_readable(struct sock *sk, int len)
@@ -1420,6 +1422,7 @@ static void sock_def_readable(struct soc
wake_up_interruptible(sk->sk_sleep);
sk_wake_async(sk,1,POLL_IN);
read_unlock(&sk->sk_callback_lock);
+ kevent_socket_notify(sk, KEVENT_SOCKET_RECV|KEVENT_SOCKET_SEND);
}

static void sock_def_write_space(struct sock *sk)
@@ -1439,6 +1442,7 @@ static void sock_def_write_space(struct
}

read_unlock(&sk->sk_callback_lock);
+ kevent_socket_notify(sk, KEVENT_SOCKET_SEND|KEVENT_SOCKET_RECV);
}

static void sock_def_destruct(struct sock *sk)
@@ -1489,6 +1493,8 @@ #endif
sk->sk_state = TCP_CLOSE;
sk->sk_socket = sock;

+ kevent_sk_reinit(sk);
+
sock_set_flag(sk, SOCK_ZAPPED);

if(sock)
@@ -1555,8 +1561,10 @@ void fastcall release_sock(struct sock *
if (sk->sk_backlog.tail)
__release_sock(sk);
sk->sk_lock.owner = NULL;
- if (waitqueue_active(&sk->sk_lock.wq))
+ if (waitqueue_active(&sk->sk_lock.wq)) {
wake_up(&sk->sk_lock.wq);
+ kevent_socket_notify(sk, KEVENT_SOCKET_RECV|KEVENT_SOCKET_SEND);
+ }
spin_unlock_bh(&sk->sk_lock.slock);
}
EXPORT_SYMBOL(release_sock);
diff --git a/net/core/stream.c b/net/core/stream.c
index d1d7dec..2878c2a 100644
--- a/net/core/stream.c
+++ b/net/core/stream.c
@@ -36,6 +36,7 @@ void sk_stream_write_space(struct sock *
wake_up_interruptible(sk->sk_sleep);
if (sock->fasync_list && !(sk->sk_shutdown & SEND_SHUTDOWN))
sock_wake_async(sock, 2, POLL_OUT);
+ kevent_socket_notify(sk, KEVENT_SOCKET_SEND|KEVENT_SOCKET_RECV);
}
}

diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index 3f884ce..e7dd989 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -3119,6 +3119,7 @@ static void tcp_ofo_queue(struct sock *s

__skb_unlink(skb, &tp->out_of_order_queue);
__skb_queue_tail(&sk->sk_receive_queue, skb);
+ kevent_socket_notify(sk, KEVENT_SOCKET_RECV);
tp->rcv_nxt = TCP_SKB_CB(skb)->end_seq;
if(skb->h.th->fin)
tcp_fin(skb, sk, skb->h.th);
diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
index c83938b..b0dd70d 100644
--- a/net/ipv4/tcp_ipv4.c
+++ b/net/ipv4/tcp_ipv4.c
@@ -61,6 +61,7 @@ #include <linux/cache.h>
#include <linux/jhash.h>
#include <linux/init.h>
#include <linux/times.h>
+#include <linux/kevent.h>

#include <net/icmp.h>
#include <net/inet_hashtables.h>
@@ -870,6 +871,7 @@ #endif
reqsk_free(req);
} else {
inet_csk_reqsk_queue_hash_add(sk, req, TCP_TIMEOUT_INIT);
+ kevent_socket_notify(sk, KEVENT_SOCKET_ACCEPT);
}
return 0;

diff --git a/net/socket.c b/net/socket.c
index 1bc4167..5582b4a 100644
--- a/net/socket.c
+++ b/net/socket.c
@@ -85,6 +85,7 @@ #include <linux/compat.h>
#include <linux/kmod.h>
#include <linux/audit.h>
#include <linux/wireless.h>
+#include <linux/kevent.h>

#include <asm/uaccess.h>
#include <asm/unistd.h>
@@ -490,6 +491,8 @@ static struct socket *sock_alloc(void)
inode->i_uid = current->fsuid;
inode->i_gid = current->fsgid;

+ kevent_socket_reinit(sock);
+
get_cpu_var(sockets_in_use)++;
put_cpu_var(sockets_in_use);
return sock;

2006-11-07 22:18:59

by Andrew Morton

[permalink] [raw]
Subject: Re: [take23 0/5] kevent: Generic event handling mechanism.

On Tue, 7 Nov 2006 19:50:48 +0300
Evgeniy Polyakov <[email protected]> wrote:

> Generic event handling mechanism.

I updated the version in -mm to v23. So people can play with it and review
it. It looks like a bit of work will be needed to get it to compile.


It seems that most of the fixes which were added to the previous version
were merged or are now irrelevant, however you lost this change:


From: Andrew Morton <[email protected]>

If kevent_user_wait() gets -EFAULT on the attempt to copy the first event, it
will return 0, which is indistinguishable from "no events pending".

It can and should return EFAULT in this case.

Cc: Evgeniy Polyakov <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
---

kernel/kevent/kevent_user.c | 5 ++++-
1 files changed, 4 insertions(+), 1 deletion(-)

diff -puN kernel/kevent/kevent_user.c~kevent_user_wait-retval-fix kernel/kevent/kevent_user.c
--- a/kernel/kevent/kevent_user.c~kevent_user_wait-retval-fix
+++ a/kernel/kevent/kevent_user.c
@@ -690,8 +690,11 @@ static int kevent_user_wait(struct file

while (num < max_nr && ((k = kqueue_dequeue_ready(u)) != NULL)) {
if (copy_to_user(buf + num*sizeof(struct ukevent),
- &k->event, sizeof(struct ukevent)))
+ &k->event, sizeof(struct ukevent))) {
+ if (num == 0)
+ num = -EFAULT;
break;
+ }
kevent_complete_ready(k);
++num;
kevent_stat_wait(u);
_

2006-11-07 22:19:18

by Andrew Morton

[permalink] [raw]
Subject: Re: [take23 2/5] kevent: Core files.

On Tue, 7 Nov 2006 19:50:48 +0300
Evgeniy Polyakov <[email protected]> wrote:

> This patch includes core kevent files:
> * userspace controlling
> * kernelspace interfaces
> * initialization
> * notification state machines

I fixed up all the rejects, but your syscall numbers changed. Please
always raise patches against the latest kernel.

2006-11-07 22:18:40

by Andrew Morton

[permalink] [raw]
Subject: Re: [take23 1/5] kevent: Description.

On Tue, 7 Nov 2006 19:50:48 +0300
Evgeniy Polyakov <[email protected]> wrote:

> Description.

I converted this into Documentation/kevent.txt. It looks like crap in an 80-col
xterm btw.

2006-11-07 22:53:41

by Davide Libenzi

[permalink] [raw]
Subject: Re: [take23 3/5] kevent: poll/select() notifications.

On Tue, 7 Nov 2006, Evgeniy Polyakov wrote:

> +static int kevent_poll_wait_callback(wait_queue_t *wait,
> + unsigned mode, int sync, void *key)
> +{
> + struct kevent_poll_wait_container *cont =
> + container_of(wait, struct kevent_poll_wait_container, wait);
> + struct kevent *k = cont->k;
> + struct file *file = k->st->origin;
> + u32 revents;
> +
> + revents = file->f_op->poll(file, NULL);
> +
> + kevent_storage_ready(k->st, NULL, revents);
> +
> + return 0;
> +}

Are you sure you can safely call file->f_op->poll() from inside a callback
based wakeup? The low level driver may be calling the wakeup with one of
its locks held, and during the file->f_op->poll may be trying to acquire
the same lock. I remember there was a discussion about this, and assuming
the above not true, made epoll code more complex (and slower, since an
extra O(R) loop was needed to fetch events).



- Davide


2006-11-08 08:23:07

by Evgeniy Polyakov

[permalink] [raw]
Subject: Re: [take23 0/5] kevent: Generic event handling mechanism.

On Tue, Nov 07, 2006 at 02:17:18PM -0800, Andrew Morton ([email protected]) wrote:
> From: Andrew Morton <[email protected]>
>
> If kevent_user_wait() gets -EFAULT on the attempt to copy the first event, it
> will return 0, which is indistinguishable from "no events pending".
>
> It can and should return EFAULT in this case.

Correct, I missed that.
Thanks Andrew, I will put into my tree, -mm seems to have it already.

--
Evgeniy Polyakov

2006-11-08 08:24:44

by Evgeniy Polyakov

[permalink] [raw]
Subject: Re: [take23 1/5] kevent: Description.

On Tue, Nov 07, 2006 at 02:16:40PM -0800, Andrew Morton ([email protected]) wrote:
> On Tue, 7 Nov 2006 19:50:48 +0300
> Evgeniy Polyakov <[email protected]> wrote:
>
> > Description.
>
> I converted this into Documentation/kevent.txt. It looks like crap in an 80-col
> xterm btw.

Thanks.
It was copied as is from documentation page, so it does looks like crap
in non-browser window. I'm quite sure there will be some questions about
kevent, so I will update that file and fix indent.

--
Evgeniy Polyakov

2006-11-08 08:25:44

by Evgeniy Polyakov

[permalink] [raw]
Subject: Re: [take23 2/5] kevent: Core files.

On Tue, Nov 07, 2006 at 02:16:57PM -0800, Andrew Morton ([email protected]) wrote:
> On Tue, 7 Nov 2006 19:50:48 +0300
> Evgeniy Polyakov <[email protected]> wrote:
>
> > This patch includes core kevent files:
> > * userspace controlling
> > * kernelspace interfaces
> > * initialization
> > * notification state machines
>
> I fixed up all the rejects, but your syscall numbers changed. Please
> always raise patches against the latest kernel.

Will do. NUmbers actually are the same, but added new syscall which was
against old tree.
Thanks Andrew.

--
Evgeniy Polyakov

2006-11-08 08:47:20

by Evgeniy Polyakov

[permalink] [raw]
Subject: Re: [take23 3/5] kevent: poll/select() notifications.

On Tue, Nov 07, 2006 at 02:53:33PM -0800, Davide Libenzi ([email protected]) wrote:
> On Tue, 7 Nov 2006, Evgeniy Polyakov wrote:
>
> > +static int kevent_poll_wait_callback(wait_queue_t *wait,
> > + unsigned mode, int sync, void *key)
> > +{
> > + struct kevent_poll_wait_container *cont =
> > + container_of(wait, struct kevent_poll_wait_container, wait);
> > + struct kevent *k = cont->k;
> > + struct file *file = k->st->origin;
> > + u32 revents;
> > +
> > + revents = file->f_op->poll(file, NULL);
> > +
> > + kevent_storage_ready(k->st, NULL, revents);
> > +
> > + return 0;
> > +}
>
> Are you sure you can safely call file->f_op->poll() from inside a callback
> based wakeup? The low level driver may be calling the wakeup with one of
> its locks held, and during the file->f_op->poll may be trying to acquire
> the same lock. I remember there was a discussion about this, and assuming
> the above not true, made epoll code more complex (and slower, since an
> extra O(R) loop was needed to fetch events).

Indeed, I have not paid too much attention to poll/select notifications in
kevent actually. As far as I recall it should be called on behalf of process
doing kevent_get_event(). I will check and fix if that is not correct.
Thanks Davide.

> - Davide
>

--
Evgeniy Polyakov

2006-11-08 14:51:14

by Eric Dumazet

[permalink] [raw]
Subject: Re: [take23 0/5] kevent: Generic event handling mechanism.

--- linux/fs/eventpoll.c 2006-11-08 15:37:36.000000000 +0100
+++ linux/fs/eventpoll.c 2006-11-08 15:38:31.000000000 +0100
@@ -1447,7 +1447,7 @@
&events[eventcnt].events) ||
__put_user(epi->event.data,
&events[eventcnt].data))
- return -EFAULT;
+ return eventcnt ? eventcnt : -EFAULT;
if (epi->event.events & EPOLLONESHOT)
epi->event.events &= EP_PRIVATE_BITS;
eventcnt++;


Attachments:
(No filename) (1.03 kB)
eventpoll.patch (423.00 B)
Download all attachments

2006-11-08 17:06:13

by Evgeniy Polyakov

[permalink] [raw]
Subject: Re: [take23 3/5] kevent: poll/select() notifications.

On Wed, Nov 08, 2006 at 11:45:54AM +0300, Evgeniy Polyakov ([email protected]) wrote:
> > Are you sure you can safely call file->f_op->poll() from inside a callback
> > based wakeup? The low level driver may be calling the wakeup with one of
> > its locks held, and during the file->f_op->poll may be trying to acquire
> > the same lock. I remember there was a discussion about this, and assuming
> > the above not true, made epoll code more complex (and slower, since an
> > extra O(R) loop was needed to fetch events).
>
> Indeed, I have not paid too much attention to poll/select notifications in
> kevent actually. As far as I recall it should be called on behalf of process
> doing kevent_get_event(). I will check and fix if that is not correct.
> Thanks Davide.

Indeed there was a bug.
Actually poll/select patch was broken quite noticebly - patchset did not
include major changes I made for it.
I will put them all into next release.

Thanks again Davide for pointing that out.

--
Evgeniy Polyakov

2006-11-08 22:08:35

by Andrew Morton

[permalink] [raw]
Subject: Re: [take23 0/5] kevent: Generic event handling mechanism.

On Wed, 8 Nov 2006 15:51:13 +0100
Eric Dumazet <[email protected]> wrote:

> [PATCH] eventpoll : In case a fault occurs during copy_to_user(), we should
> report the count of events that were successfully copied into user space,
> instead of EFAULT. That would be consistent with behavior of read/write()
> syscalls for example.
>
> Signed-off-by: Eric Dumazet <[email protected]>
>
>
>
> [eventpoll.patch text/plain (424B)]
> --- linux/fs/eventpoll.c 2006-11-08 15:37:36.000000000 +0100
> +++ linux/fs/eventpoll.c 2006-11-08 15:38:31.000000000 +0100
> @@ -1447,7 +1447,7 @@
> &events[eventcnt].events) ||
> __put_user(epi->event.data,
> &events[eventcnt].data))
> - return -EFAULT;
> + return eventcnt ? eventcnt : -EFAULT;
> if (epi->event.events & EPOLLONESHOT)
> epi->event.events &= EP_PRIVATE_BITS;
> eventcnt++;
>

Definitely a better interface, but I wonder if it's too late to change it.

An app which does

if (epoll_wait(...) == -1)
barf(errno);
else
assume_all_events_were_received();

will now do the wrong thing.

otoh, such an applciation basically _has_ to use the epoll_wait()
return value to work out how many events it received, so maybe it's OK...

2006-11-08 22:44:57

by Davide Libenzi

[permalink] [raw]
Subject: Re: [take23 0/5] kevent: Generic event handling mechanism.

On Wed, 8 Nov 2006, Andrew Morton wrote:

> On Wed, 8 Nov 2006 15:51:13 +0100
> Eric Dumazet <[email protected]> wrote:
>
> > [PATCH] eventpoll : In case a fault occurs during copy_to_user(), we should
> > report the count of events that were successfully copied into user space,
> > instead of EFAULT. That would be consistent with behavior of read/write()
> > syscalls for example.
> >
> > Signed-off-by: Eric Dumazet <[email protected]>
> >
> >
> >
> > [eventpoll.patch text/plain (424B)]
> > --- linux/fs/eventpoll.c 2006-11-08 15:37:36.000000000 +0100
> > +++ linux/fs/eventpoll.c 2006-11-08 15:38:31.000000000 +0100
> > @@ -1447,7 +1447,7 @@
> > &events[eventcnt].events) ||
> > __put_user(epi->event.data,
> > &events[eventcnt].data))
> > - return -EFAULT;
> > + return eventcnt ? eventcnt : -EFAULT;
> > if (epi->event.events & EPOLLONESHOT)
> > epi->event.events &= EP_PRIVATE_BITS;
> > eventcnt++;
> >
>
> Definitely a better interface, but I wonder if it's too late to change it.
>
> An app which does
>
> if (epoll_wait(...) == -1)
> barf(errno);
> else
> assume_all_events_were_received();
>
> will now do the wrong thing.
>
> otoh, such an applciation basically _has_ to use the epoll_wait()
> return value to work out how many events it received, so maybe it's OK...

I don't care about both ways, but sys_poll() does the same thing epoll
does right now, so I would not change epoll behaviour.



- Davide


2006-11-08 23:07:32

by Eric Dumazet

[permalink] [raw]
Subject: Re: [take23 0/5] kevent: Generic event handling mechanism.

Davide Libenzi a ?crit :
> On Wed, 8 Nov 2006, Andrew Morton wrote:
>
>> On Wed, 8 Nov 2006 15:51:13 +0100
>> Eric Dumazet <[email protected]> wrote:
>>
>>> [PATCH] eventpoll : In case a fault occurs during copy_to_user(), we should
>>> report the count of events that were successfully copied into user space,
>>> instead of EFAULT. That would be consistent with behavior of read/write()
>>> syscalls for example.
>>>
>>> Signed-off-by: Eric Dumazet <[email protected]>
>>>
>>>
>>>
>>> [eventpoll.patch text/plain (424B)]
>>> --- linux/fs/eventpoll.c 2006-11-08 15:37:36.000000000 +0100
>>> +++ linux/fs/eventpoll.c 2006-11-08 15:38:31.000000000 +0100
>>> @@ -1447,7 +1447,7 @@
>>> &events[eventcnt].events) ||
>>> __put_user(epi->event.data,
>>> &events[eventcnt].data))
>>> - return -EFAULT;
>>> + return eventcnt ? eventcnt : -EFAULT;
>>> if (epi->event.events & EPOLLONESHOT)
>>> epi->event.events &= EP_PRIVATE_BITS;
>>> eventcnt++;
>>>
>> Definitely a better interface, but I wonder if it's too late to change it.
>>
>> An app which does
>>
>> if (epoll_wait(...) == -1)
>> barf(errno);
>> else
>> assume_all_events_were_received();
>>
>> will now do the wrong thing.
>>
>> otoh, such an applciation basically _has_ to use the epoll_wait()
>> return value to work out how many events it received, so maybe it's OK...
>
> I don't care about both ways, but sys_poll() does the same thing epoll
> does right now, so I would not change epoll behaviour.
>

Sure poll() cannot return a partial count, since its return value is :

On success, a positive number is returned, where the number returned is
the number of structures which have non-zero revents fields (in other
words, those descriptors with events or errors reported).

poll() is non destructive (it doesnt change any state into kernel). Returning
EFAULT in case of an error in the very last bit of user area is mandatory.

On the contrary :

epoll_wait() does return a count of transfered events, and update some state
in kernel (it consume Edge Trigered events : They can be lost forever if not
reported to user)

So epoll_wait() is much more like read(), that also updates file state in
kernel (current file position)

2006-11-08 23:56:33

by Davide Libenzi

[permalink] [raw]
Subject: Re: [take23 0/5] kevent: Generic event handling mechanism.

On Thu, 9 Nov 2006, Eric Dumazet wrote:

> Davide Libenzi a ?crit :
> >
> > I don't care about both ways, but sys_poll() does the same thing epoll does
> > right now, so I would not change epoll behaviour.
> >
>
> Sure poll() cannot return a partial count, since its return value is :
>
> On success, a positive number is returned, where the number returned is
> the number of structures which have non-zero revents fields (in other
> words, those descriptors with events or errors reported).
>
> poll() is non destructive (it doesnt change any state into kernel). Returning
> EFAULT in case of an error in the very last bit of user area is mandatory.
>
> On the contrary :
>
> epoll_wait() does return a count of transfered events, and update some state
> in kernel (it consume Edge Trigered events : They can be lost forever if not
> reported to user)
>
> So epoll_wait() is much more like read(), that also updates file state in
> kernel (current file position)

Lost forever means? If there are more processes watching some fd
(external events), they all get their own copy of the events in their own
private epoll fd. It's not that we "steal" things out of the kernel, is
not a 1:1 producer/consumer thing (one producer, 1 queue). It's one
producer, broadcast to all listeners (consumers) thing. The only case
where it'd matter is in the case of multiple threads sharing the same
epoll fd.
In general, I'd be more for having the userspace get his own SEGFAULT
instead of letting it go with broken parameters. If I'm coding userspace,
and I'm doing something wrong, I like the kernel to let me know, instead
of trying to fix things for me.
Also, epoll can easily be fixed (add a param to ep_reinject_items() to
re-inject items in case of error/EFAULT) to leave events in the ready-list
and let the EFAULT emerge.
Anyone else has opinions about this?




PS: Next time it'd be great if you Cc: me when posting epoll patches, so
you avoid Andrew the job of doing it.



- Davide

2006-11-09 07:24:21

by Eric Dumazet

[permalink] [raw]
Subject: Re: [take23 0/5] kevent: Generic event handling mechanism.

Davide Libenzi a ?crit :
> On Thu, 9 Nov 2006, Eric Dumazet wrote:
>
>> Davide Libenzi a ?crit :
>>> I don't care about both ways, but sys_poll() does the same thing epoll does
>>> right now, so I would not change epoll behaviour.
>>>
>> Sure poll() cannot return a partial count, since its return value is :
>>
>> On success, a positive number is returned, where the number returned is
>> the number of structures which have non-zero revents fields (in other
>> words, those descriptors with events or errors reported).
>>
>> poll() is non destructive (it doesnt change any state into kernel). Returning
>> EFAULT in case of an error in the very last bit of user area is mandatory.
>>
>> On the contrary :
>>
>> epoll_wait() does return a count of transfered events, and update some state
>> in kernel (it consume Edge Trigered events : They can be lost forever if not
>> reported to user)
>>
>> So epoll_wait() is much more like read(), that also updates file state in
>> kernel (current file position)
>
> Lost forever means? If there are more processes watching some fd
> (external events), they all get their own copy of the events in their own
> private epoll fd. It's not that we "steal" things out of the kernel, is
> not a 1:1 producer/consumer thing (one producer, 1 queue). It's one
> producer, broadcast to all listeners (consumers) thing. The only case
> where it'd matter is in the case of multiple threads sharing the same
> epoll fd.

In my particular epoll application, the producer is tcp stack, and I have one
consumer. If an network event is lost in the EFAULT handling, its lost
forever. In any case, my application do provide a correct user area, so this
problem is only theorical.

> In general, I'd be more for having the userspace get his own SEGFAULT
> instead of letting it go with broken parameters. If I'm coding userspace,
> and I'm doing something wrong, I like the kernel to let me know, instead
> of trying to fix things for me.
> Also, epoll can easily be fixed (add a param to ep_reinject_items() to
> re-inject items in case of error/EFAULT) to leave events in the ready-list
> and let the EFAULT emerge.

Please dont slow the hot path for a basically "User Error". It's already
tested in the transfert function, with two conditional branches for each
transfered event.

> Anyone else has opinions about this?
>
>
>
>
> PS: Next time it'd be great if you Cc: me when posting epoll patches, so
> you avoid Andrew the job of doing it.

Yes, but this particular patch was a followup on own kevent Andrew patch.

I have a bunch of patches for epoll I will send to you :)

Eric

2006-11-09 07:52:40

by Eric Dumazet

[permalink] [raw]
Subject: Re: [take23 0/5] kevent: Generic event handling mechanism.

Eric Dumazet a ?crit :
> Davide Libenzi a ?crit :
>> Lost forever means? If there are more processes watching some fd
>> (external events), they all get their own copy of the events in their
>> own private epoll fd. It's not that we "steal" things out of the
>> kernel, is not a 1:1 producer/consumer thing (one producer, 1 queue).
>> It's one producer, broadcast to all listeners (consumers) thing. The
>> only case where it'd matter is in the case of multiple threads sharing
>> the same epoll fd.
>
> In my particular epoll application, the producer is tcp stack, and I
> have one consumer. If an network event is lost in the EFAULT handling,
> its lost forever. In any case, my application do provide a correct user
> area, so this problem is only theorical.

I realize I was not explicit, and dit not answer your question (Lost forever
means ?)

if (epi->revents) {
if (__put_user(epi->revents,
&events[eventcnt].events) ||
__put_user(epi->event.data,
&events[eventcnt].data))
return -EFAULT;
>> if (epi->event.events & EPOLLONESHOT)
>> epi->event.events &= EP_PRIVATE_BITS;
eventcnt++;
}

If one EPOLLONESHOT event is correctly copied to user space, its status is
updated.

If other ready events in the same epoll_wait() call cannot be transferred
because of an EFAULT (we reach the real end of user provided area), this
EPOLLONESHOT event is lost forever, because it wont be requeued in ready list.

Eric

2006-11-09 17:13:21

by Davide Libenzi

[permalink] [raw]
Subject: Re: [take23 0/5] kevent: Generic event handling mechanism.

On Thu, 9 Nov 2006, Eric Dumazet wrote:

> > > Lost forever means? If there are more processes watching some fd (external
> > > events), they all get their own copy of the events in their own private
> > > epoll fd. It's not that we "steal" things out of the kernel, is not a 1:1
> > > producer/consumer thing (one producer, 1 queue). It's one producer,
> > > broadcast to all listeners (consumers) thing. The only case where it'd
> > > matter is in the case of multiple threads sharing the same epoll fd.
> >
> > In my particular epoll application, the producer is tcp stack, and I have
> > one consumer. If an network event is lost in the EFAULT handling, its lost
> > forever. In any case, my application do provide a correct user area, so this
> > problem is only theorical.
>
> I realize I was not explicit, and dit not answer your question (Lost forever
> means ?)
>
> if (epi->revents) {
> if (__put_user(epi->revents,
> &events[eventcnt].events) ||
> __put_user(epi->event.data,
> &events[eventcnt].data))
> return -EFAULT;
> >> if (epi->event.events & EPOLLONESHOT)
> >> epi->event.events &= EP_PRIVATE_BITS;
> eventcnt++;
> }
>
> If one EPOLLONESHOT event is correctly copied to user space, its status is
> updated.
>
> If other ready events in the same epoll_wait() call cannot be transferred
> because of an EFAULT (we reach the real end of user provided area), this
> EPOLLONESHOT event is lost forever, because it wont be requeued in ready list.

Your application is feeding crap to the kernel, because of programming
bugs. If that happens, I want an EFAULT and not a partially filled buffer.
And which buffer then? This could have been scribbled in userspace memory
(the pointer), and the try of the kernel to mask out bugs might create
even more subtle problems. Such bug will *never* show up in the up in case
the wrong buffer is partially valid (first part, that is the *only* case
where your fix would make a difference compared to the status quo), since
in case of no ready events we'll never hit it, and in case of some events
we'll always return few of them and never EFAULT. No, the more I think
about it, the more I personally disagree with the change.



> Please dont slow the hot path for a basically "User Error". It's already
> tested in the transfert function, with two conditional
> branches for each transfered event.

Ohh, if you think you can measure them from userspace, those can be turned
in 'err |= __put_user();' with err tested only out of the loop.



- Davide