2006-08-09 07:39:16

by Evgeniy Polyakov

Subject: [take6 0/3] kevent: Generic event handling mechanism.


Generic event handling mechanism.

Changes from 'take5' patchset:
* removed compilation warnings about unused variables when lockdep is not turned on
* do not use internal socket structures, use appropriate (exported) wrappers instead
* removed default 1 second timeout
* removed AIO stuff from patchset

Changes from 'take4' patchset:
* use miscdevice instead of chardevice
* comments fixes

Changes from 'take3' patchset:
* removed serializing mutex from kevent_user_wait()
* moved storage list processing to RCU
* removed lockdep screaming - all storage locks are initialized in the same function, so lockdep was taught
to differentiate between the various cases
* remove kevent from storage if it is marked as broken after callback
* fixed a typo in the mmapped buffer implementation which would result in wrong index calculation

Changes from 'take2' patchset:
* split kevent_finish_user() to locked and unlocked variants
* do not use KEVENT_STAT ifdefs, use inline functions instead
* use array of callbacks of each type instead of each kevent callback initialization
* changed name of ukevent guarding lock
* use only one kevent lock in kevent_user for all hash buckets instead of per-bucket locks
* do not use kevent_user_ctl structure; instead provide needed arguments as syscall parameters
* various indent cleanups
* added an optimisation aimed at the case when a lot of kevents are being copied from userspace
* mapped buffer (initial) implementation (no userspace yet)

Changes from 'take1' patchset:
- rebased against 2.6.18-git tree
- removed ioctl controlling
- added new syscall kevent_get_events(int fd, unsigned int min_nr, unsigned int max_nr,
unsigned int timeout, void __user *buf, unsigned flags) - see the usage sketch after this list
- use old syscall kevent_ctl for creation/removing, modification and initial kevent
initialization
- use mutexes instead of semaphores
- added a file descriptor check that returns an error if the provided descriptor does not match
kevent file operations
- various indent fixes
- various indent fixes
- removed aio_sendfile() declarations.
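
For reference, a hedged userspace sketch of the control flow these two syscalls
give. struct ukevent, the KEVENT_* constants and the i386 syscall numbers are
taken from the take6 1/3 patch below; the /dev/kevent node name, the timeout
units and the return-value convention of kevent_get_events() are assumptions,
not confirmed by this posting:

#include <stdio.h>
#include <string.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/syscall.h>

#define __NR_kevent_get_events	320	/* i386 numbers from take6 1/3 */
#define __NR_kevent_ctl		321

#define KEVENT_CTL_ADD		0
#define KEVENT_TIMER		2
#define KEVENT_TIMER_FIRED	0x1
#define KEVENT_REQ_ONESHOT	0x1

struct kevent_id { unsigned int raw[2]; };

struct ukevent {
	struct kevent_id id;
	unsigned int type, event, req_flags, ret_flags;
	unsigned int ret_data[2];
	union { unsigned int user[2]; void *ptr; };
};

int main(void)
{
	struct ukevent uk, ready[8];
	int fd, num, i;

	fd = open("/dev/kevent", O_RDWR);	/* device node name is an assumption */
	if (fd == -1)
		return 1;

	memset(&uk, 0, sizeof(uk));
	uk.type = KEVENT_TIMER;			/* periodic timer */
	uk.event = KEVENT_TIMER_FIRED;
	uk.id.raw[0] = 1000;			/* period in msecs, per kevent_timer_enqueue() */

	if (syscall(__NR_kevent_ctl, fd, KEVENT_CTL_ADD, 1, &uk) < 0)
		return 1;

	/* Wait for at least 1 and at most 8 ready events; timeout units
	 * and return value (number of ready events) are assumed here. */
	num = syscall(__NR_kevent_get_events, fd, 1, 8, 1000, ready, 0);
	for (i = 0; i < num; ++i)
		printf("type %u, ret_flags 0x%x\n", ready[i].type, ready[i].ret_flags);

	close(fd);
	return 0;
}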

Thank you.

Signed-off-by: Evgeniy Polyakov <[email protected]>



2006-08-09 07:39:22

by Evgeniy Polyakov

Subject: [take6 2/3] kevent: poll/select() notifications. Timer notifications.


poll/select() notifications. Timer notifications.

This patch includes generic poll/select and timer notifications.

kevent_poll works similarly to epoll and has the same issues (the callback
is invoked not from the caller's internal state machine, but through
process wakeup).

Timer notifications can be used for fine-grained per-process time
management, since interval timers are very inconvenient to use
and limited in number.
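
A hedged sketch of how these two event types are described from userspace,
building on the example in the 0/3 mail (same struct ukevent, constants and
includes from take6 1/3; the descriptor and mask values are only placeholders):

/* Periodic timer: id.raw[0] carries the period in milliseconds,
 * as consumed by kevent_timer_enqueue() below. */
static void fill_timer_event(struct ukevent *uk, unsigned int msecs)
{
	memset(uk, 0, sizeof(*uk));
	uk->type = KEVENT_TIMER;
	uk->event = KEVENT_TIMER_FIRED;
	uk->id.raw[0] = msecs;
}

/* poll()-style readiness: id.raw[0] carries the file descriptor to watch,
 * as consumed by kevent_poll_enqueue() below; event is a KEVENT_POLL_* mask. */
static void fill_poll_event(struct ukevent *uk, int fd, unsigned int mask)
{
	memset(uk, 0, sizeof(*uk));
	uk->type = KEVENT_POLL;
	uk->event = mask;		/* e.g. KEVENT_POLL_POLLIN */
	uk->id.raw[0] = fd;
}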

Signed-off-by: Evgeniy Polyakov <[email protected]>

diff --git a/kernel/kevent/kevent_poll.c b/kernel/kevent/kevent_poll.c
new file mode 100644
index 0000000..8a4f863
--- /dev/null
+++ b/kernel/kevent/kevent_poll.c
@@ -0,0 +1,220 @@
+/*
+ * kevent_poll.c
+ *
+ * 2006 Copyright (c) Evgeniy Polyakov <[email protected]>
+ * All rights reserved.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ */
+
+#include <linux/kernel.h>
+#include <linux/types.h>
+#include <linux/list.h>
+#include <linux/slab.h>
+#include <linux/spinlock.h>
+#include <linux/timer.h>
+#include <linux/file.h>
+#include <linux/kevent.h>
+#include <linux/poll.h>
+#include <linux/fs.h>
+
+static kmem_cache_t *kevent_poll_container_cache;
+static kmem_cache_t *kevent_poll_priv_cache;
+
+struct kevent_poll_ctl
+{
+ struct poll_table_struct pt;
+ struct kevent *k;
+};
+
+struct kevent_poll_wait_container
+{
+ struct list_head container_entry;
+ wait_queue_head_t *whead;
+ wait_queue_t wait;
+ struct kevent *k;
+};
+
+struct kevent_poll_private
+{
+ struct list_head container_list;
+ spinlock_t container_lock;
+};
+
+static int kevent_poll_enqueue(struct kevent *k);
+static int kevent_poll_dequeue(struct kevent *k);
+static int kevent_poll_callback(struct kevent *k);
+
+static int kevent_poll_wait_callback(wait_queue_t *wait,
+ unsigned mode, int sync, void *key)
+{
+ struct kevent_poll_wait_container *cont =
+ container_of(wait, struct kevent_poll_wait_container, wait);
+ struct kevent *k = cont->k;
+ struct file *file = k->st->origin;
+ u32 revents;
+
+ revents = file->f_op->poll(file, NULL);
+
+ kevent_storage_ready(k->st, NULL, revents);
+
+ return 0;
+}
+
+static void kevent_poll_qproc(struct file *file, wait_queue_head_t *whead,
+ struct poll_table_struct *poll_table)
+{
+ struct kevent *k =
+ container_of(poll_table, struct kevent_poll_ctl, pt)->k;
+ struct kevent_poll_private *priv = k->priv;
+ struct kevent_poll_wait_container *cont;
+ unsigned long flags;
+
+ cont = kmem_cache_alloc(kevent_poll_container_cache, SLAB_KERNEL);
+ if (!cont) {
+ kevent_break(k);
+ return;
+ }
+
+ cont->k = k;
+ init_waitqueue_func_entry(&cont->wait, kevent_poll_wait_callback);
+ cont->whead = whead;
+
+ spin_lock_irqsave(&priv->container_lock, flags);
+ list_add_tail(&cont->container_entry, &priv->container_list);
+ spin_unlock_irqrestore(&priv->container_lock, flags);
+
+ add_wait_queue(whead, &cont->wait);
+}
+
+static int kevent_poll_enqueue(struct kevent *k)
+{
+ struct file *file;
+ int err, ready = 0;
+ unsigned int revents;
+ struct kevent_poll_ctl ctl;
+ struct kevent_poll_private *priv;
+
+ file = fget(k->event.id.raw[0]);
+ if (!file)
+ return -ENODEV;
+
+ err = -EINVAL;
+ if (!file->f_op || !file->f_op->poll)
+ goto err_out_fput;
+
+ err = -ENOMEM;
+ priv = kmem_cache_alloc(kevent_poll_priv_cache, SLAB_KERNEL);
+ if (!priv)
+ goto err_out_fput;
+
+ spin_lock_init(&priv->container_lock);
+ INIT_LIST_HEAD(&priv->container_list);
+
+ k->priv = priv;
+
+ ctl.k = k;
+ init_poll_funcptr(&ctl.pt, &kevent_poll_qproc);
+
+ err = kevent_storage_enqueue(&file->st, k);
+ if (err)
+ goto err_out_free;
+
+ revents = file->f_op->poll(file, &ctl.pt);
+ if (revents & k->event.event) {
+ ready = 1;
+ kevent_poll_dequeue(k);
+ }
+
+ return ready;
+
+err_out_free:
+ kmem_cache_free(kevent_poll_priv_cache, priv);
+err_out_fput:
+ fput(file);
+ return err;
+}
+
+static int kevent_poll_dequeue(struct kevent *k)
+{
+ struct file *file = k->st->origin;
+ struct kevent_poll_private *priv = k->priv;
+ struct kevent_poll_wait_container *w, *n;
+ unsigned long flags;
+
+ kevent_storage_dequeue(k->st, k);
+
+ spin_lock_irqsave(&priv->container_lock, flags);
+ list_for_each_entry_safe(w, n, &priv->container_list, container_entry) {
+ list_del(&w->container_entry);
+ remove_wait_queue(w->whead, &w->wait);
+ kmem_cache_free(kevent_poll_container_cache, w);
+ }
+ spin_unlock_irqrestore(&priv->container_lock, flags);
+
+ kmem_cache_free(kevent_poll_priv_cache, priv);
+ k->priv = NULL;
+
+ fput(file);
+
+ return 0;
+}
+
+static int kevent_poll_callback(struct kevent *k)
+{
+ struct file *file = k->st->origin;
+ unsigned int revents = file->f_op->poll(file, NULL);
+ return (revents & k->event.event);
+}
+
+static int __init kevent_poll_sys_init(void)
+{
+ struct kevent_callbacks *pc = &kevent_registered_callbacks[KEVENT_POLL];
+
+ kevent_poll_container_cache = kmem_cache_create("kevent_poll_container_cache",
+ sizeof(struct kevent_poll_wait_container), 0, 0, NULL, NULL);
+ if (!kevent_poll_container_cache) {
+ printk(KERN_ERR "Failed to create kevent poll container cache.\n");
+ return -ENOMEM;
+ }
+
+ kevent_poll_priv_cache = kmem_cache_create("kevent_poll_priv_cache",
+ sizeof(struct kevent_poll_private), 0, 0, NULL, NULL);
+ if (!kevent_poll_priv_cache) {
+ printk(KERN_ERR "Failed to create kevent poll private data cache.\n");
+ kmem_cache_destroy(kevent_poll_container_cache);
+ kevent_poll_container_cache = NULL;
+ return -ENOMEM;
+ }
+
+ pc->enqueue = &kevent_poll_enqueue;
+ pc->dequeue = &kevent_poll_dequeue;
+ pc->callback = &kevent_poll_callback;
+
+ printk(KERN_INFO "Kevent poll()/select() subsystem has been initialized.\n");
+ return 0;
+}
+
+static struct lock_class_key kevent_poll_key;
+
+void kevent_poll_reinit(struct file *file)
+{
+ lockdep_set_class(&file->st.lock, &kevent_poll_key);
+}
+
+static void __exit kevent_poll_sys_fini(void)
+{
+ kmem_cache_destroy(kevent_poll_priv_cache);
+ kmem_cache_destroy(kevent_poll_container_cache);
+}
+
+module_init(kevent_poll_sys_init);
+module_exit(kevent_poll_sys_fini);
diff --git a/kernel/kevent/kevent_timer.c b/kernel/kevent/kevent_timer.c
new file mode 100644
index 0000000..f175edd
--- /dev/null
+++ b/kernel/kevent/kevent_timer.c
@@ -0,0 +1,119 @@
+/*
+ * kevent_timer.c
+ *
+ * 2006 Copyright (c) Evgeniy Polyakov <[email protected]>
+ * All rights reserved.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software
+ * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA
+ */
+
+#include <linux/kernel.h>
+#include <linux/types.h>
+#include <linux/list.h>
+#include <linux/slab.h>
+#include <linux/spinlock.h>
+#include <linux/timer.h>
+#include <linux/jiffies.h>
+#include <linux/kevent.h>
+
+static void kevent_timer_func(unsigned long data)
+{
+ struct kevent *k = (struct kevent *)data;
+ struct timer_list *t = k->st->origin;
+
+ kevent_storage_ready(k->st, NULL, KEVENT_MASK_ALL);
+ mod_timer(t, jiffies + msecs_to_jiffies(k->event.id.raw[0]));
+}
+
+static struct lock_class_key kevent_timer_key;
+
+static int kevent_timer_enqueue(struct kevent *k)
+{
+ struct timer_list *t;
+ struct kevent_storage *st;
+ int err;
+
+ t = kmalloc(sizeof(struct timer_list) + sizeof(struct kevent_storage),
+ GFP_KERNEL);
+ if (!t)
+ return -ENOMEM;
+
+ init_timer(t);
+ t->function = kevent_timer_func;
+ t->expires = jiffies + msecs_to_jiffies(k->event.id.raw[0]);
+ t->data = (unsigned long)k;
+
+ st = (struct kevent_storage *)(t+1);
+ err = kevent_storage_init(t, st);
+ if (err)
+ goto err_out_free;
+ lockdep_set_class(&st->lock, &kevent_timer_key);
+
+ err = kevent_storage_enqueue(st, k);
+ if (err)
+ goto err_out_st_fini;
+
+ add_timer(t);
+
+ return 0;
+
+err_out_st_fini:
+ kevent_storage_fini(st);
+err_out_free:
+ kfree(t);
+
+ return err;
+}
+
+static int kevent_timer_dequeue(struct kevent *k)
+{
+ struct kevent_storage *st = k->st;
+ struct timer_list *t = st->origin;
+
+ if (!t)
+ return -ENODEV;
+
+ del_timer_sync(t);
+
+ kevent_storage_dequeue(st, k);
+
+ kfree(t);
+
+ return 0;
+}
+
+static int kevent_timer_callback(struct kevent *k)
+{
+ struct kevent_storage *st = k->st;
+ struct timer_list *t = st->origin;
+
+ if (!t)
+ return -ENODEV;
+
+ k->event.ret_data[0] = (__u32)jiffies;
+ return 1;
+}
+
+static int __init kevent_init_timer(void)
+{
+ struct kevent_callbacks *tc = &kevent_registered_callbacks[KEVENT_TIMER];
+
+ tc->enqueue = &kevent_timer_enqueue;
+ tc->dequeue = &kevent_timer_dequeue;
+ tc->callback = &kevent_timer_callback;
+
+ return 0;
+}
+late_initcall(kevent_init_timer);

2006-08-09 07:39:41

by Evgeniy Polyakov

Subject: [take6 3/3] kevent: Network AIO, socket notifications.


Network AIO, socket notifications.

This patchset includes socket notifications and network asynchronous IO.
Network AIO is based on kevent and works as a usual kevent storage on top
of the socket's inode.
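
A hedged userspace sketch of the intended usage: the socket is switched into
asynchronous mode with the new SO_ASYNC_SOCK option and a receive is queued
with the new aio_recv syscall; completion is then delivered as a KEVENT_NAIO
event through the kevent control descriptor from the 0/3 sketch. The i386
syscall number is taken from take6 1/3; the helper below is an illustration,
not part of this patchset:

#include <unistd.h>
#include <sys/syscall.h>
#include <sys/socket.h>

#define __NR_aio_recv	318	/* i386 number from take6 1/3 */
#define SO_ASYNC_SOCK	35	/* from this patch */

static long submit_async_recv(int ctl_fd, int s, void *buf, size_t size)
{
	int one = 1;

	/* Mark the socket asynchronous so the receive path takes
	 * the kevent-based branch instead of the usual socket lock path. */
	if (setsockopt(s, SOL_SOCKET, SO_ASYNC_SOCK, &one, sizeof(one)))
		return -1;

	/* Queue the receive; data lands in buf and readiness is reported
	 * on ctl_fd when the KEVENT_NAIO event completes. */
	return syscall(__NR_aio_recv, ctl_fd, s, buf, size, 0);
}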

Signed-off-by: Evgeniy Polyakov <[email protected]>

diff --git a/include/asm-i386/socket.h b/include/asm-i386/socket.h
index 5755d57..9300678 100644
--- a/include/asm-i386/socket.h
+++ b/include/asm-i386/socket.h
@@ -50,4 +50,6 @@ #define SO_ACCEPTCONN 30
#define SO_PEERSEC 31
#define SO_PASSSEC 34

+#define SO_ASYNC_SOCK 35
+
#endif /* _ASM_SOCKET_H */
diff --git a/include/asm-x86_64/socket.h b/include/asm-x86_64/socket.h
index b467026..fc2b49d 100644
--- a/include/asm-x86_64/socket.h
+++ b/include/asm-x86_64/socket.h
@@ -50,4 +50,6 @@ #define SO_ACCEPTCONN 30
#define SO_PEERSEC 31
#define SO_PASSSEC 34

+#define SO_ASYNC_SOCK 35
+
#endif /* _ASM_SOCKET_H */
diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index 4307e76..9267873 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -1283,6 +1283,8 @@ extern struct sk_buff *skb_recv_datagram
int noblock, int *err);
extern unsigned int datagram_poll(struct file *file, struct socket *sock,
struct poll_table_struct *wait);
+extern int skb_copy_datagram(const struct sk_buff *from,
+ int offset, void *dst, int size);
extern int skb_copy_datagram_iovec(const struct sk_buff *from,
int offset, struct iovec *to,
int size);
diff --git a/include/net/sock.h b/include/net/sock.h
index 324b3ea..c43a153 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -48,6 +48,7 @@ #include <linux/lockdep.h>
#include <linux/netdevice.h>
#include <linux/skbuff.h> /* struct sk_buff */
#include <linux/security.h>
+#include <linux/kevent.h>

#include <linux/filter.h>

@@ -391,6 +392,8 @@ enum sock_flags {
SOCK_RCVTSTAMP, /* %SO_TIMESTAMP setting */
SOCK_LOCALROUTE, /* route locally only, %SO_DONTROUTE setting */
SOCK_QUEUE_SHRUNK, /* write queue has been shrunk recently */
+ SOCK_ASYNC,
+ SOCK_ASYNC_INUSE,
};

static inline void sock_copy_flags(struct sock *nsk, struct sock *osk)
@@ -450,6 +453,21 @@ static inline int sk_stream_memory_free(

extern void sk_stream_rfree(struct sk_buff *skb);

+struct socket_alloc {
+ struct socket socket;
+ struct inode vfs_inode;
+};
+
+static inline struct socket *SOCKET_I(struct inode *inode)
+{
+ return &container_of(inode, struct socket_alloc, vfs_inode)->socket;
+}
+
+static inline struct inode *SOCK_INODE(struct socket *socket)
+{
+ return &container_of(socket, struct socket_alloc, socket)->vfs_inode;
+}
+
static inline void sk_stream_set_owner_r(struct sk_buff *skb, struct sock *sk)
{
skb->sk = sk;
@@ -477,6 +495,7 @@ static inline void sk_add_backlog(struct
sk->sk_backlog.tail = skb;
}
skb->next = NULL;
+ kevent_socket_notify(sk, KEVENT_SOCKET_RECV);
}

#define sk_wait_event(__sk, __timeo, __condition) \
@@ -548,6 +567,12 @@ struct proto {

int (*backlog_rcv) (struct sock *sk,
struct sk_buff *skb);
+
+ int (*async_recv) (struct sock *sk,
+ void *dst, size_t size);
+ int (*async_send) (struct sock *sk,
+ struct page **pages, unsigned int poffset,
+ size_t size);

/* Keeping track of sk's, looking them up, and port selection methods. */
void (*hash)(struct sock *sk);
@@ -679,21 +704,6 @@ static inline struct kiocb *siocb_to_kio
return si->kiocb;
}

-struct socket_alloc {
- struct socket socket;
- struct inode vfs_inode;
-};
-
-static inline struct socket *SOCKET_I(struct inode *inode)
-{
- return &container_of(inode, struct socket_alloc, vfs_inode)->socket;
-}
-
-static inline struct inode *SOCK_INODE(struct socket *socket)
-{
- return &container_of(socket, struct socket_alloc, socket)->vfs_inode;
-}
-
extern void __sk_stream_mem_reclaim(struct sock *sk);
extern int sk_stream_mem_schedule(struct sock *sk, int size, int kind);

diff --git a/include/net/tcp.h b/include/net/tcp.h
index 0720bdd..5a1899b 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -364,6 +364,8 @@ extern int compat_tcp_setsockopt(struc
int level, int optname,
char __user *optval, int optlen);
extern void tcp_set_keepalive(struct sock *sk, int val);
+extern int tcp_async_recv(struct sock *sk, void *dst, size_t size);
+extern int tcp_async_send(struct sock *sk, struct page **pages, unsigned int poffset, size_t size);
extern int tcp_recvmsg(struct kiocb *iocb, struct sock *sk,
struct msghdr *msg,
size_t len, int nonblock,
@@ -857,6 +859,7 @@ static inline int tcp_prequeue(struct so
tp->ucopy.memory = 0;
} else if (skb_queue_len(&tp->ucopy.prequeue) == 1) {
wake_up_interruptible(sk->sk_sleep);
+ kevent_socket_notify(sk, KEVENT_SOCKET_RECV|KEVENT_SOCKET_SEND);
if (!inet_csk_ack_scheduled(sk))
inet_csk_reset_xmit_timer(sk, ICSK_TIME_DACK,
(3 * TCP_RTO_MIN) / 4,
diff --git a/kernel/kevent/kevent_naio.c b/kernel/kevent/kevent_naio.c
new file mode 100644
index 0000000..1b6122a
--- /dev/null
+++ b/kernel/kevent/kevent_naio.c
@@ -0,0 +1,237 @@
+/*
+ * kevent_naio.c
+ *
+ * 2006 Copyright (c) Evgeniy Polyakov <[email protected]>
+ * All rights reserved.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software
+ * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA
+ */
+
+#include <linux/kernel.h>
+#include <linux/types.h>
+#include <linux/list.h>
+#include <linux/spinlock.h>
+#include <linux/file.h>
+#include <linux/pagemap.h>
+#include <linux/kevent.h>
+
+#include <net/sock.h>
+#include <net/tcp_states.h>
+
+static int kevent_naio_enqueue(struct kevent *k);
+static int kevent_naio_dequeue(struct kevent *k);
+static int kevent_naio_callback(struct kevent *k);
+
+static int kevent_naio_setup_aio(int ctl_fd, int s, void __user *buf,
+ size_t size, u32 event)
+{
+ struct kevent_user *u;
+ struct file *file;
+ int err;
+ struct ukevent uk;
+
+ file = fget(ctl_fd);
+ if (!file)
+ return -ENODEV;
+
+ u = file->private_data;
+ if (!u) {
+ err = -EINVAL;
+ goto err_out_fput;
+ }
+
+ memset(&uk, 0, sizeof(struct ukevent));
+ uk.type = KEVENT_NAIO;
+ uk.ptr = buf;
+ uk.req_flags = KEVENT_REQ_ONESHOT;
+ uk.event = event;
+ uk.id.raw[0] = s;
+ uk.id.raw[1] = size;
+
+ err = kevent_user_add_ukevent(&uk, u);
+
+err_out_fput:
+ fput(file);
+ return err;
+}
+
+asmlinkage long sys_aio_recv(int ctl_fd, int s, void __user *buf,
+ size_t size, unsigned flags)
+{
+ return kevent_naio_setup_aio(ctl_fd, s, buf, size, KEVENT_SOCKET_RECV);
+}
+
+asmlinkage long sys_aio_send(int ctl_fd, int s, void __user *buf,
+ size_t size, unsigned flags)
+{
+ return kevent_naio_setup_aio(ctl_fd, s, buf, size, KEVENT_SOCKET_SEND);
+}
+
+static int kevent_naio_enqueue(struct kevent *k)
+{
+ int err = -ENODEV, i;
+ struct page **page;
+ void *addr;
+ unsigned int size = k->event.id.raw[1];
+ int num = size/PAGE_SIZE;
+ struct socket *sock;
+ struct sock *sk = NULL;
+
+ sock = sockfd_lookup(k->event.id.raw[0], &err);
+ if (!sock)
+ return -ENODEV;
+
+ sk = sock->sk;
+
+ err = -ESOCKTNOSUPPORT;
+ if (!sk || !sk->sk_prot->async_recv || !sk->sk_prot->async_send ||
+ !sock_flag(sk, SOCK_ASYNC))
+ goto err_out_fput;
+
+ addr = k->event.ptr;
+ if (((unsigned long)addr & PAGE_MASK) != (unsigned long)addr)
+ num++;
+
+ page = kmalloc(sizeof(struct page *) * num, GFP_KERNEL);
+ if (!page)
+ goto err_out_fput;
+
+ down_read(&current->mm->mmap_sem);
+ err = get_user_pages(current, current->mm, (unsigned long)addr,
+ num, 1, 0, page, NULL);
+ up_read(&current->mm->mmap_sem);
+ if (err <= 0)
+ goto err_out_free;
+ num = err;
+
+ k->event.ret_data[0] = num;
+ k->event.ret_data[1] = offset_in_page(k->event.ptr);
+ k->priv = page;
+
+ sk->sk_allocation = GFP_ATOMIC;
+
+ spin_lock_bh(&sk->sk_lock.slock);
+ err = kevent_socket_enqueue(k);
+ spin_unlock_bh(&sk->sk_lock.slock);
+ if (err)
+ goto err_out_put_pages;
+
+ sockfd_put(sock);
+
+ return err;
+
+err_out_put_pages:
+ for (i=0; i<num; ++i)
+ page_cache_release(page[i]);
+err_out_free:
+ kfree(page);
+err_out_fput:
+ sockfd_put(sock);
+
+ return err;
+}
+
+static int kevent_naio_dequeue(struct kevent *k)
+{
+ int err, i, num;
+ struct page **page = k->priv;
+
+ num = k->event.ret_data[0];
+
+ err = kevent_socket_dequeue(k);
+
+ for (i=0; i<num; ++i)
+ page_cache_release(page[i]);
+
+ kfree(k->priv);
+ k->priv = NULL;
+
+ return err;
+}
+
+static int kevent_naio_callback(struct kevent *k)
+{
+ struct inode *inode = k->st->origin;
+ struct sock *sk = SOCKET_I(inode)->sk;
+ unsigned int size = k->event.id.raw[1];
+ unsigned int off = k->event.ret_data[1];
+ struct page **pages = k->priv, *page;
+ int ready = 0, num = off/PAGE_SIZE, err = 0, send = 0;
+ void *ptr, *optr;
+ unsigned int len;
+
+ if (!sock_flag(sk, SOCK_ASYNC))
+ return -1;
+
+ if (k->event.event & KEVENT_SOCKET_SEND)
+ send = 1;
+ else if (!(k->event.event & KEVENT_SOCKET_RECV))
+ return -EINVAL;
+
+ /*
+ * sk_prot->async_*() can return either number of bytes processed,
+ * or negative error value, or zero if socket is closed.
+ */
+
+ if (!send) {
+ page = pages[num];
+
+ optr = ptr = kmap_atomic(page, KM_IRQ0);
+ if (!ptr)
+ return -ENOMEM;
+
+ ptr += off % PAGE_SIZE;
+ len = min_t(unsigned int, PAGE_SIZE - (ptr - optr), size);
+
+ err = sk->sk_prot->async_recv(sk, ptr, len);
+
+ kunmap_atomic(optr, KM_IRQ0);
+ } else {
+ len = size;
+ err = sk->sk_prot->async_send(sk, pages, off, size);
+ }
+
+ if (err > 0) {
+ num++;
+ size -= err;
+ off += err;
+ }
+
+ k->event.ret_data[1] = off;
+ k->event.id.raw[1] = size;
+
+ if (err == 0 || (err < 0 && err != -EAGAIN))
+ ready = -1;
+
+ if (!size)
+ ready = 1;
+#if 0
+ printk("%s: sk=%p, k=%p, size=%4u, off=%4u, err=%3d, ready=%1d.\n",
+ __func__, sk, k, size, off, err, ready);
+#endif
+
+ return ready;
+}
+
+static int __init kevent_init_naio(void)
+{
+ struct kevent_callbacks *nc = &kevent_registered_callbacks[KEVENT_NAIO];
+
+ nc->enqueue = &kevent_naio_enqueue;
+ nc->dequeue = &kevent_naio_dequeue;
+ nc->callback = &kevent_naio_callback;
+ return 0;
+}
+late_initcall(kevent_init_naio);
diff --git a/kernel/kevent/kevent_socket.c b/kernel/kevent/kevent_socket.c
new file mode 100644
index 0000000..3c4a9ad
--- /dev/null
+++ b/kernel/kevent/kevent_socket.c
@@ -0,0 +1,144 @@
+/*
+ * kevent_socket.c
+ *
+ * 2006 Copyright (c) Evgeniy Polyakov <[email protected]>
+ * All rights reserved.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software
+ * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA
+ */
+
+#include <linux/kernel.h>
+#include <linux/types.h>
+#include <linux/list.h>
+#include <linux/slab.h>
+#include <linux/spinlock.h>
+#include <linux/timer.h>
+#include <linux/file.h>
+#include <linux/tcp.h>
+#include <linux/kevent.h>
+
+#include <net/sock.h>
+#include <net/request_sock.h>
+#include <net/inet_connection_sock.h>
+
+static int kevent_socket_callback(struct kevent *k)
+{
+ struct inode *inode = k->st->origin;
+ struct sock *sk = SOCKET_I(inode)->sk;
+ int rmem;
+
+ if (k->event.event & KEVENT_SOCKET_RECV) {
+ int ret = 0;
+
+ if ((rmem = atomic_read(&sk->sk_rmem_alloc)) > 0 ||
+ !skb_queue_empty(&sk->sk_receive_queue))
+ ret = 1;
+ if (sk->sk_shutdown & RCV_SHUTDOWN)
+ ret = 1;
+ if (ret)
+ return ret;
+ }
+ if ((k->event.event & KEVENT_SOCKET_ACCEPT) &&
+ (!reqsk_queue_empty(&inet_csk(sk)->icsk_accept_queue) ||
+ reqsk_queue_len_young(&inet_csk(sk)->icsk_accept_queue))) {
+ k->event.ret_data[1] = reqsk_queue_len(&inet_csk(sk)->icsk_accept_queue);
+ return 1;
+ }
+
+ return 0;
+}
+
+int kevent_socket_enqueue(struct kevent *k)
+{
+ struct inode *inode;
+ struct socket *sock;
+ int err = -ENODEV;
+
+ sock = sockfd_lookup(k->event.id.raw[0], &err);
+ if (!sock)
+ goto err_out_exit;
+
+ inode = igrab(SOCK_INODE(sock));
+ if (!inode)
+ goto err_out_fput;
+
+ err = kevent_storage_enqueue(&inode->st, k);
+ if (err)
+ goto err_out_iput;
+
+ err = k->callbacks.callback(k);
+ if (err)
+ goto err_out_dequeue;
+
+ sockfd_put(sock);
+ return err;
+
+err_out_dequeue:
+ kevent_storage_dequeue(k->st, k);
+err_out_iput:
+ iput(inode);
+err_out_fput:
+ sockfd_put(sock);
+err_out_exit:
+ return err;
+}
+
+int kevent_socket_dequeue(struct kevent *k)
+{
+ struct inode *inode = k->st->origin;
+
+ kevent_storage_dequeue(k->st, k);
+ iput(inode);
+
+ return 0;
+}
+
+void kevent_socket_notify(struct sock *sk, u32 event)
+{
+ if (sk->sk_socket && !test_and_set_bit(SOCK_ASYNC_INUSE, &sk->sk_flags)) {
+ kevent_storage_ready(&SOCK_INODE(sk->sk_socket)->st, NULL, event);
+ sock_reset_flag(sk, SOCK_ASYNC_INUSE);
+ }
+}
+
+#ifdef CONFIG_LOCKDEP
+static struct lock_class_key kevent_sock_key;
+
+void kevent_socket_reinit(struct socket *sock)
+{
+ struct inode *inode = SOCK_INODE(sock);
+
+ lockdep_set_class(&inode->st.lock, &kevent_sock_key);
+}
+
+void kevent_sk_reinit(struct sock *sk)
+{
+ if (sk->sk_socket) {
+ struct inode *inode = SOCK_INODE(sk->sk_socket);
+
+ lockdep_set_class(&inode->st.lock, &kevent_sock_key);
+ }
+}
+#endif
+static int __init kevent_init_socket(void)
+{
+ struct kevent_callbacks *sc = &kevent_registered_callbacks[KEVENT_SOCKET];
+
+ sc->enqueue = &kevent_socket_enqueue;
+ sc->dequeue = &kevent_socket_dequeue;
+ sc->callback = &kevent_socket_callback;
+ return 0;
+}
+late_initcall(kevent_init_socket);
diff --git a/net/core/datagram.c b/net/core/datagram.c
index aecddcc..493245b 100644
--- a/net/core/datagram.c
+++ b/net/core/datagram.c
@@ -236,6 +236,60 @@ void skb_kill_datagram(struct sock *sk,
EXPORT_SYMBOL(skb_kill_datagram);

/**
+ * skb_copy_datagram - Copy a datagram.
+ * @skb: buffer to copy
+ * @offset: offset in the buffer to start copying from
+ * @to: pointer to copy to
+ * @len: amount of data to copy from buffer to iovec
+ */
+int skb_copy_datagram(const struct sk_buff *skb, int offset,
+ void *to, int len)
+{
+ int i, fraglen, end = 0;
+ struct sk_buff *next = skb_shinfo(skb)->frag_list;
+
+ if (!len)
+ return 0;
+
+next_skb:
+ fraglen = skb_headlen(skb);
+ i = -1;
+
+ while (1) {
+ int start = end;
+
+ if ((end += fraglen) > offset) {
+ int copy = end - offset, o = offset - start;
+
+ if (copy > len)
+ copy = len;
+ if (i == -1)
+ memcpy(to, skb->data + o, copy);
+ else {
+ skb_frag_t *frag = &skb_shinfo(skb)->frags[i];
+ struct page *page = frag->page;
+ void *p = kmap(page) + frag->page_offset + o;
+ memcpy(to, p, copy);
+ kunmap(page);
+ }
+ if (!(len -= copy))
+ return 0;
+ offset += copy;
+ }
+ if (++i >= skb_shinfo(skb)->nr_frags)
+ break;
+ fraglen = skb_shinfo(skb)->frags[i].size;
+ }
+ if (next) {
+ skb = next;
+ BUG_ON(skb_shinfo(skb)->frag_list);
+ next = skb->next;
+ goto next_skb;
+ }
+ return -EFAULT;
+}
+
+/**
* skb_copy_datagram_iovec - Copy a datagram to an iovec.
* @skb: buffer to copy
* @offset: offset in the buffer to start copying from
@@ -530,6 +584,7 @@ unsigned int datagram_poll(struct file *

EXPORT_SYMBOL(datagram_poll);
EXPORT_SYMBOL(skb_copy_and_csum_datagram_iovec);
+EXPORT_SYMBOL(skb_copy_datagram);
EXPORT_SYMBOL(skb_copy_datagram_iovec);
EXPORT_SYMBOL(skb_free_datagram);
EXPORT_SYMBOL(skb_recv_datagram);
diff --git a/net/core/sock.c b/net/core/sock.c
index 51fcfbc..138ce90 100644
--- a/net/core/sock.c
+++ b/net/core/sock.c
@@ -617,6 +617,16 @@ #endif
spin_unlock_bh(&sk->sk_lock.slock);
ret = -ENONET;
break;
+#ifdef CONFIG_KEVENT_SOCKET
+ case SO_ASYNC_SOCK:
+ spin_lock_bh(&sk->sk_lock.slock);
+ if (valbool)
+ sock_set_flag(sk, SOCK_ASYNC);
+ else
+ sock_reset_flag(sk, SOCK_ASYNC);
+ spin_unlock_bh(&sk->sk_lock.slock);
+ break;
+#endif

case SO_PASSSEC:
if (valbool)
@@ -1406,6 +1416,7 @@ static void sock_def_wakeup(struct sock
if (sk->sk_sleep && waitqueue_active(sk->sk_sleep))
wake_up_interruptible_all(sk->sk_sleep);
read_unlock(&sk->sk_callback_lock);
+ kevent_socket_notify(sk, KEVENT_SOCKET_RECV|KEVENT_SOCKET_SEND);
}

static void sock_def_error_report(struct sock *sk)
@@ -1415,6 +1426,7 @@ static void sock_def_error_report(struct
wake_up_interruptible(sk->sk_sleep);
sk_wake_async(sk,0,POLL_ERR);
read_unlock(&sk->sk_callback_lock);
+ kevent_socket_notify(sk, KEVENT_SOCKET_RECV|KEVENT_SOCKET_SEND);
}

static void sock_def_readable(struct sock *sk, int len)
@@ -1424,6 +1436,7 @@ static void sock_def_readable(struct soc
wake_up_interruptible(sk->sk_sleep);
sk_wake_async(sk,1,POLL_IN);
read_unlock(&sk->sk_callback_lock);
+ kevent_socket_notify(sk, KEVENT_SOCKET_RECV|KEVENT_SOCKET_SEND);
}

static void sock_def_write_space(struct sock *sk)
@@ -1443,6 +1456,7 @@ static void sock_def_write_space(struct
}

read_unlock(&sk->sk_callback_lock);
+ kevent_socket_notify(sk, KEVENT_SOCKET_SEND|KEVENT_SOCKET_RECV);
}

static void sock_def_destruct(struct sock *sk)
@@ -1493,6 +1507,8 @@ #endif
sk->sk_state = TCP_CLOSE;
sk->sk_socket = sock;

+ kevent_sk_reinit(sk);
+
sock_set_flag(sk, SOCK_ZAPPED);

if(sock)
@@ -1559,8 +1575,10 @@ void fastcall release_sock(struct sock *
if (sk->sk_backlog.tail)
__release_sock(sk);
sk->sk_lock.owner = NULL;
- if (waitqueue_active(&sk->sk_lock.wq))
+ if (waitqueue_active(&sk->sk_lock.wq)) {
wake_up(&sk->sk_lock.wq);
+ kevent_socket_notify(sk, KEVENT_SOCKET_RECV|KEVENT_SOCKET_SEND);
+ }
spin_unlock_bh(&sk->sk_lock.slock);
}
EXPORT_SYMBOL(release_sock);
diff --git a/net/core/stream.c b/net/core/stream.c
index d1d7dec..2878c2a 100644
--- a/net/core/stream.c
+++ b/net/core/stream.c
@@ -36,6 +36,7 @@ void sk_stream_write_space(struct sock *
wake_up_interruptible(sk->sk_sleep);
if (sock->fasync_list && !(sk->sk_shutdown & SEND_SHUTDOWN))
sock_wake_async(sock, 2, POLL_OUT);
+ kevent_socket_notify(sk, KEVENT_SOCKET_SEND|KEVENT_SOCKET_RECV);
}
}

diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index f6a2d92..e878a41 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -206,6 +206,7 @@
* lingertime == 0 (RFC 793 ABORT Call)
* Hirokazu Takahashi : Use copy_from_user() instead of
* csum_and_copy_from_user() if possible.
+ * Evgeniy Polyakov : Network asynchronous IO.
*
* This program is free software; you can redistribute it and/or
* modify it under the terms of the GNU General Public License
@@ -1085,6 +1086,301 @@ int tcp_read_sock(struct sock *sk, read_
}

/*
+ * Must be called with locked sock.
+ */
+int tcp_async_send(struct sock *sk, struct page **pages, unsigned int poffset, size_t len)
+{
+ struct tcp_sock *tp = tcp_sk(sk);
+ int mss_now, size_goal;
+ int err = -EAGAIN;
+ ssize_t copied;
+
+ /* Wait for a connection to finish. */
+ if ((1 << sk->sk_state) & ~(TCPF_ESTABLISHED | TCPF_CLOSE_WAIT))
+ goto out_err;
+
+ clear_bit(SOCK_ASYNC_NOSPACE, &sk->sk_socket->flags);
+
+ mss_now = tcp_current_mss(sk, 1);
+ size_goal = tp->xmit_size_goal;
+ copied = 0;
+
+ err = -EPIPE;
+ if (sk->sk_err || (sk->sk_shutdown & SEND_SHUTDOWN) || sock_flag(sk, SOCK_DONE) ||
+ (sk->sk_state == TCP_CLOSE) || (atomic_read(&sk->sk_refcnt) == 1))
+ goto do_error;
+
+ while (len > 0) {
+ struct sk_buff *skb = sk->sk_write_queue.prev;
+ struct page *page = pages[poffset / PAGE_SIZE];
+ int copy, i, can_coalesce;
+ int offset = poffset % PAGE_SIZE;
+ int size = min_t(size_t, len, PAGE_SIZE - offset);
+
+ if (!sk->sk_send_head || (copy = size_goal - skb->len) <= 0) {
+new_segment:
+ if (!sk_stream_memory_free(sk))
+ goto wait_for_sndbuf;
+
+ skb = sk_stream_alloc_pskb(sk, 0, 0,
+ sk->sk_allocation);
+ if (!skb)
+ goto wait_for_memory;
+
+ skb_entail(sk, tp, skb);
+ copy = size_goal;
+ }
+
+ if (copy > size)
+ copy = size;
+
+ i = skb_shinfo(skb)->nr_frags;
+ can_coalesce = skb_can_coalesce(skb, i, page, offset);
+ if (!can_coalesce && i >= MAX_SKB_FRAGS) {
+ tcp_mark_push(tp, skb);
+ goto new_segment;
+ }
+ if (!sk_stream_wmem_schedule(sk, copy))
+ goto wait_for_memory;
+
+ if (can_coalesce) {
+ skb_shinfo(skb)->frags[i - 1].size += copy;
+ } else {
+ get_page(page);
+ skb_fill_page_desc(skb, i, page, offset, copy);
+ }
+
+ skb->len += copy;
+ skb->data_len += copy;
+ skb->truesize += copy;
+ sk->sk_wmem_queued += copy;
+ sk->sk_forward_alloc -= copy;
+ skb->ip_summed = CHECKSUM_HW;
+ tp->write_seq += copy;
+ TCP_SKB_CB(skb)->end_seq += copy;
+ skb_shinfo(skb)->gso_segs = 0;
+
+ if (!copied)
+ TCP_SKB_CB(skb)->flags &= ~TCPCB_FLAG_PSH;
+
+ copied += copy;
+ poffset += copy;
+ if (!(len -= copy))
+ goto out;
+
+ if (skb->len < mss_now)
+ continue;
+
+ if (forced_push(tp)) {
+ tcp_mark_push(tp, skb);
+ __tcp_push_pending_frames(sk, tp, mss_now, TCP_NAGLE_PUSH);
+ } else if (skb == sk->sk_send_head)
+ tcp_push_one(sk, mss_now);
+ continue;
+
+wait_for_sndbuf:
+ set_bit(SOCK_NOSPACE, &sk->sk_socket->flags);
+wait_for_memory:
+ if (copied)
+ tcp_push(sk, tp, 0, mss_now, TCP_NAGLE_PUSH);
+
+ err = -EAGAIN;
+ goto do_error;
+ }
+
+out:
+ if (copied)
+ tcp_push(sk, tp, 0, mss_now, tp->nonagle);
+ return copied;
+
+do_error:
+ if (copied)
+ goto out;
+out_err:
+ return sk_stream_error(sk, 0, err);
+}
+
+/*
+ * Must be called with locked sock.
+ */
+int tcp_async_recv(struct sock *sk, void *dst, size_t len)
+{
+ struct tcp_sock *tp = tcp_sk(sk);
+ int copied = 0;
+ u32 *seq;
+ unsigned long used;
+ int err;
+ int target; /* Read at least this many bytes */
+ int copied_early = 0;
+
+ TCP_CHECK_TIMER(sk);
+
+ err = -ENOTCONN;
+ if (sk->sk_state == TCP_LISTEN)
+ goto out;
+
+ seq = &tp->copied_seq;
+
+ target = sock_rcvlowat(sk, 0, len);
+
+ do {
+ struct sk_buff *skb;
+ u32 offset;
+
+ /* Are we at urgent data? Stop if we have read anything or have SIGURG pending. */
+ if (tp->urg_data && tp->urg_seq == *seq) {
+ if (copied)
+ break;
+ }
+
+ /* Next get a buffer. */
+
+ skb = skb_peek(&sk->sk_receive_queue);
+ do {
+ if (!skb)
+ break;
+
+ /* Now that we have two receive queues this
+ * shouldn't happen.
+ */
+ if (before(*seq, TCP_SKB_CB(skb)->seq)) {
+ printk(KERN_INFO "async_recv bug: copied %X "
+ "seq %X\n", *seq, TCP_SKB_CB(skb)->seq);
+ break;
+ }
+ offset = *seq - TCP_SKB_CB(skb)->seq;
+ if (skb->h.th->syn)
+ offset--;
+ if (offset < skb->len)
+ goto found_ok_skb;
+ if (skb->h.th->fin)
+ goto found_fin_ok;
+ skb = skb->next;
+ } while (skb != (struct sk_buff *)&sk->sk_receive_queue);
+
+ if (copied)
+ break;
+
+ if (sock_flag(sk, SOCK_DONE))
+ break;
+
+ if (sk->sk_err) {
+ copied = sock_error(sk);
+ break;
+ }
+
+ if (sk->sk_shutdown & RCV_SHUTDOWN)
+ break;
+
+ if (sk->sk_state == TCP_CLOSE) {
+ if (!sock_flag(sk, SOCK_DONE)) {
+ /* This occurs when user tries to read
+ * from never connected socket.
+ */
+ copied = -ENOTCONN;
+ break;
+ }
+ break;
+ }
+
+ copied = -EAGAIN;
+ break;
+
+ found_ok_skb:
+ /* Ok so how much can we use? */
+ used = skb->len - offset;
+ if (len < used)
+ used = len;
+
+ /* Do we have urgent data here? */
+ if (tp->urg_data) {
+ u32 urg_offset = tp->urg_seq - *seq;
+ if (urg_offset < used) {
+ if (!urg_offset) {
+ if (!sock_flag(sk, SOCK_URGINLINE)) {
+ ++*seq;
+ offset++;
+ used--;
+ if (!used)
+ goto skip_copy;
+ }
+ } else
+ used = urg_offset;
+ }
+ }
+#ifdef CONFIG_NET_DMA
+ if (!tp->ucopy.dma_chan && tp->ucopy.pinned_list)
+ tp->ucopy.dma_chan = get_softnet_dma();
+
+ if (tp->ucopy.dma_chan) {
+ tp->ucopy.dma_cookie = dma_skb_copy_datagram_iovec(
+ tp->ucopy.dma_chan, skb, offset,
+ msg->msg_iov, used,
+ tp->ucopy.pinned_list);
+
+ if (tp->ucopy.dma_cookie < 0) {
+
+ printk(KERN_ALERT "dma_cookie < 0\n");
+
+ /* Exception. Bailout! */
+ if (!copied)
+ copied = -EFAULT;
+ break;
+ }
+ if ((offset + used) == skb->len)
+ copied_early = 1;
+
+ } else
+#endif
+ {
+ err = skb_copy_datagram(skb, offset, dst, used);
+ if (err) {
+ /* Exception. Bailout! */
+ if (!copied)
+ copied = -EFAULT;
+ break;
+ }
+ }
+
+ *seq += used;
+ copied += used;
+ len -= used;
+ dst += used;
+
+ tcp_rcv_space_adjust(sk);
+
+skip_copy:
+ if (tp->urg_data && after(tp->copied_seq, tp->urg_seq)) {
+ tp->urg_data = 0;
+ tcp_fast_path_check(sk, tp);
+ }
+ if (used + offset < skb->len)
+ continue;
+
+ if (skb->h.th->fin)
+ goto found_fin_ok;
+ sk_eat_skb(sk, skb, copied_early);
+ continue;
+
+ found_fin_ok:
+ /* Process the FIN. */
+ ++*seq;
+ sk_eat_skb(sk, skb, copied_early);
+ break;
+ } while (len > 0);
+
+ /* Clean up data we have read: This will do ACK frames. */
+ tcp_cleanup_rbuf(sk, copied);
+
+ TCP_CHECK_TIMER(sk);
+ return copied;
+
+out:
+ TCP_CHECK_TIMER(sk);
+ return err;
+}
+
+/*
* This routine copies from a sock struct into the user buffer.
*
* Technical note: in 2.3 we work on _locked_ socket, so that
@@ -2342,6 +2638,8 @@ EXPORT_SYMBOL(tcp_getsockopt);
EXPORT_SYMBOL(tcp_ioctl);
EXPORT_SYMBOL(tcp_poll);
EXPORT_SYMBOL(tcp_read_sock);
+EXPORT_SYMBOL(tcp_async_recv);
+EXPORT_SYMBOL(tcp_async_send);
EXPORT_SYMBOL(tcp_recvmsg);
EXPORT_SYMBOL(tcp_sendmsg);
EXPORT_SYMBOL(tcp_sendpage);
diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index 738dad9..f70d045 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -3112,6 +3112,7 @@ static void tcp_ofo_queue(struct sock *s

__skb_unlink(skb, &tp->out_of_order_queue);
__skb_queue_tail(&sk->sk_receive_queue, skb);
+ kevent_socket_notify(sk, KEVENT_SOCKET_RECV);
tp->rcv_nxt = TCP_SKB_CB(skb)->end_seq;
if(skb->h.th->fin)
tcp_fin(skb, sk, skb->h.th);
@@ -3955,7 +3956,8 @@ int tcp_rcv_established(struct sock *sk,
int copied_early = 0;

if (tp->copied_seq == tp->rcv_nxt &&
- len - tcp_header_len <= tp->ucopy.len) {
+ len - tcp_header_len <= tp->ucopy.len &&
+ !sock_async(sk)) {
#ifdef CONFIG_NET_DMA
if (tcp_dma_try_early_copy(sk, skb, tcp_header_len)) {
copied_early = 1;
diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
index f6f39e8..ae4f23c 100644
--- a/net/ipv4/tcp_ipv4.c
+++ b/net/ipv4/tcp_ipv4.c
@@ -61,6 +61,7 @@ #include <linux/cache.h>
#include <linux/jhash.h>
#include <linux/init.h>
#include <linux/times.h>
+#include <linux/kevent.h>

#include <net/icmp.h>
#include <net/inet_hashtables.h>
@@ -868,6 +869,7 @@ #endif
reqsk_free(req);
} else {
inet_csk_reqsk_queue_hash_add(sk, req, TCP_TIMEOUT_INIT);
+ kevent_socket_notify(sk, KEVENT_SOCKET_ACCEPT);
}
return 0;

@@ -1108,24 +1110,30 @@ process:

skb->dev = NULL;

- bh_lock_sock_nested(sk);
ret = 0;
- if (!sock_owned_by_user(sk)) {
+ if (sock_async(sk)) {
+ spin_lock_bh(&sk->sk_lock.slock);
+ ret = tcp_v4_do_rcv(sk, skb);
+ spin_unlock_bh(&sk->sk_lock.slock);
+ } else {
+ bh_lock_sock_nested(sk);
+ if (!sock_owned_by_user(sk)) {
#ifdef CONFIG_NET_DMA
- struct tcp_sock *tp = tcp_sk(sk);
- if (!tp->ucopy.dma_chan && tp->ucopy.pinned_list)
- tp->ucopy.dma_chan = get_softnet_dma();
- if (tp->ucopy.dma_chan)
- ret = tcp_v4_do_rcv(sk, skb);
- else
+ struct tcp_sock *tp = tcp_sk(sk);
+ if (!tp->ucopy.dma_chan && tp->ucopy.pinned_list)
+ tp->ucopy.dma_chan = get_softnet_dma();
+ if (tp->ucopy.dma_chan)
+ ret = tcp_v4_do_rcv(sk, skb);
+ else
#endif
- {
- if (!tcp_prequeue(sk, skb))
- ret = tcp_v4_do_rcv(sk, skb);
- }
- } else
- sk_add_backlog(sk, skb);
- bh_unlock_sock(sk);
+ {
+ if (!tcp_prequeue(sk, skb))
+ ret = tcp_v4_do_rcv(sk, skb);
+ }
+ } else
+ sk_add_backlog(sk, skb);
+ bh_unlock_sock(sk);
+ }

sock_put(sk);

@@ -1849,6 +1857,8 @@ struct proto tcp_prot = {
.getsockopt = tcp_getsockopt,
.sendmsg = tcp_sendmsg,
.recvmsg = tcp_recvmsg,
+ .async_recv = tcp_async_recv,
+ .async_send = tcp_async_send,
.backlog_rcv = tcp_v4_do_rcv,
.hash = tcp_v4_hash,
.unhash = tcp_unhash,
diff --git a/net/ipv6/tcp_ipv6.c b/net/ipv6/tcp_ipv6.c
index 923989d..a5d3ac8 100644
--- a/net/ipv6/tcp_ipv6.c
+++ b/net/ipv6/tcp_ipv6.c
@@ -1230,22 +1230,28 @@ process:

skb->dev = NULL;

- bh_lock_sock(sk);
ret = 0;
- if (!sock_owned_by_user(sk)) {
+ if (sock_async(sk)) {
+ spin_lock_bh(&sk->sk_lock.slock);
+ ret = tcp_v4_do_rcv(sk, skb);
+ spin_unlock_bh(&sk->sk_lock.slock);
+ } else {
+ bh_lock_sock(sk);
+ if (!sock_owned_by_user(sk)) {
#ifdef CONFIG_NET_DMA
- struct tcp_sock *tp = tcp_sk(sk);
- if (tp->ucopy.dma_chan)
- ret = tcp_v6_do_rcv(sk, skb);
- else
-#endif
- {
- if (!tcp_prequeue(sk, skb))
+ struct tcp_sock *tp = tcp_sk(sk);
+ if (tp->ucopy.dma_chan)
ret = tcp_v6_do_rcv(sk, skb);
- }
- } else
- sk_add_backlog(sk, skb);
- bh_unlock_sock(sk);
+ else
+#endif
+ {
+ if (!tcp_prequeue(sk, skb))
+ ret = tcp_v6_do_rcv(sk, skb);
+ }
+ } else
+ sk_add_backlog(sk, skb);
+ bh_unlock_sock(sk);
+ }

sock_put(sk);
return ret ? -1 : 0;
@@ -1596,6 +1602,8 @@ struct proto tcpv6_prot = {
.getsockopt = tcp_getsockopt,
.sendmsg = tcp_sendmsg,
.recvmsg = tcp_recvmsg,
+ .async_recv = tcp_async_recv,
+ .async_send = tcp_async_send,
.backlog_rcv = tcp_v6_do_rcv,
.hash = tcp_v6_hash,
.unhash = tcp_unhash,

2006-08-09 07:40:06

by Evgeniy Polyakov

Subject: [take6 1/3] kevent: Core files.


Core files.

This patch includes core kevent files:
- userspace controlling
- kernelspace interfaces
- initialization
- notification state machines

It might also include parts from other subsystems (like network-related
syscalls), so it is possible that it will not compile without other
patches applied.
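
To show how the kernelspace interfaces fit together, here is a condensed
sketch of the pattern an event origin follows (kevent_timer.c in take6 2/3
is the real in-tree example); KEVENT_MYTYPE, my_origin and my_event_fire()
are hypothetical placeholders, not part of this patchset:

#include <linux/kernel.h>
#include <linux/slab.h>
#include <linux/init.h>
#include <linux/kevent.h>

struct my_origin {
	struct kevent_storage st;	/* queue of kevents watching this object */
};

static int my_enqueue(struct kevent *k)
{
	struct my_origin *o = kmalloc(sizeof(*o), GFP_KERNEL);
	int err;

	if (!o)
		return -ENOMEM;
	err = kevent_storage_init(o, &o->st);	/* o becomes st->origin */
	if (err) {
		kfree(o);
		return err;
	}
	return kevent_storage_enqueue(&o->st, k);
}

static int my_dequeue(struct kevent *k)
{
	struct my_origin *o = k->st->origin;

	kevent_storage_dequeue(k->st, k);
	kevent_storage_fini(&o->st);
	kfree(o);
	return 0;
}

static int my_callback(struct kevent *k)
{
	/* Return positive if the event is ready, zero otherwise,
	 * negative on error (the kevent is then marked broken). */
	return 1;
}

/* Called by the origin when the watched condition happens. */
static void my_event_fire(struct my_origin *o)
{
	kevent_storage_ready(&o->st, NULL, KEVENT_MASK_ALL);
}

static int __init my_kevent_init(void)
{
	/* KEVENT_MYTYPE would be a hypothetical new entry in the type set. */
	struct kevent_callbacks *c = &kevent_registered_callbacks[KEVENT_MYTYPE];

	c->enqueue = &my_enqueue;
	c->dequeue = &my_dequeue;
	c->callback = &my_callback;
	return 0;
}
late_initcall(my_kevent_init);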

Signed-off-by: Evgeniy Polyakov <[email protected]>

diff --git a/arch/i386/kernel/syscall_table.S b/arch/i386/kernel/syscall_table.S
index dd63d47..0af988a 100644
--- a/arch/i386/kernel/syscall_table.S
+++ b/arch/i386/kernel/syscall_table.S
@@ -317,3 +317,7 @@ ENTRY(sys_call_table)
.long sys_tee /* 315 */
.long sys_vmsplice
.long sys_move_pages
+ .long sys_aio_recv
+ .long sys_aio_send
+ .long sys_kevent_get_events
+ .long sys_kevent_ctl
diff --git a/arch/x86_64/ia32/ia32entry.S b/arch/x86_64/ia32/ia32entry.S
index 5d4a7d1..e157ad4 100644
--- a/arch/x86_64/ia32/ia32entry.S
+++ b/arch/x86_64/ia32/ia32entry.S
@@ -713,4 +713,8 @@ #endif
.quad sys_tee
.quad compat_sys_vmsplice
.quad compat_sys_move_pages
+ .quad sys_aio_recv
+ .quad sys_aio_send
+ .quad sys_kevent_get_events
+ .quad sys_kevent_ctl
ia32_syscall_end:
diff --git a/include/asm-i386/unistd.h b/include/asm-i386/unistd.h
index fc1c8dd..a76e50d 100644
--- a/include/asm-i386/unistd.h
+++ b/include/asm-i386/unistd.h
@@ -323,10 +323,14 @@ #define __NR_sync_file_range 314
#define __NR_tee 315
#define __NR_vmsplice 316
#define __NR_move_pages 317
+#define __NR_aio_recv 318
+#define __NR_aio_send 319
+#define __NR_kevent_get_events 320
+#define __NR_kevent_ctl 321

#ifdef __KERNEL__

-#define NR_syscalls 318
+#define NR_syscalls 322

/*
* user-visible error numbers are in the range -1 - -128: see
diff --git a/include/asm-x86_64/unistd.h b/include/asm-x86_64/unistd.h
index 94387c9..9a0b581 100644
--- a/include/asm-x86_64/unistd.h
+++ b/include/asm-x86_64/unistd.h
@@ -619,10 +619,18 @@ #define __NR_vmsplice 278
__SYSCALL(__NR_vmsplice, sys_vmsplice)
#define __NR_move_pages 279
__SYSCALL(__NR_move_pages, sys_move_pages)
+#define __NR_aio_recv 280
+__SYSCALL(__NR_aio_recv, sys_aio_recv)
+#define __NR_aio_send 281
+__SYSCALL(__NR_aio_send, sys_aio_send)
+#define __NR_kevent_get_events 282
+__SYSCALL(__NR_kevent_get_events, sys_kevent_get_events)
+#define __NR_kevent_ctl 283
+__SYSCALL(__NR_kevent_ctl, sys_kevent_ctl)

#ifdef __KERNEL__

-#define __NR_syscall_max __NR_move_pages
+#define __NR_syscall_max __NR_kevent_ctl

#ifndef __NO_STUBS

diff --git a/include/linux/kevent.h b/include/linux/kevent.h
new file mode 100644
index 0000000..b4342f0
--- /dev/null
+++ b/include/linux/kevent.h
@@ -0,0 +1,296 @@
+/*
+ * kevent.h
+ *
+ * 2006 Copyright (c) Evgeniy Polyakov <[email protected]>
+ * All rights reserved.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software
+ * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA
+ */
+
+#ifndef __KEVENT_H
+#define __KEVENT_H
+
+/*
+ * Kevent request flags.
+ */
+
+#define KEVENT_REQ_ONESHOT 0x1 /* Process this event only once and then dequeue. */
+
+/*
+ * Kevent return flags.
+ */
+#define KEVENT_RET_BROKEN 0x1 /* Kevent is broken. */
+#define KEVENT_RET_DONE 0x2 /* Kevent processing was finished successfully. */
+
+/*
+ * Kevent type set.
+ */
+#define KEVENT_SOCKET 0
+#define KEVENT_INODE 1
+#define KEVENT_TIMER 2
+#define KEVENT_POLL 3
+#define KEVENT_NAIO 4
+#define KEVENT_AIO 5
+#define KEVENT_MAX 6
+
+/*
+ * Per-type event sets.
+ * The number of per-type event sets should exactly match the number of kevent types.
+ */
+
+/*
+ * Timer events.
+ */
+#define KEVENT_TIMER_FIRED 0x1
+
+/*
+ * Socket/network asynchronous IO events.
+ */
+#define KEVENT_SOCKET_RECV 0x1
+#define KEVENT_SOCKET_ACCEPT 0x2
+#define KEVENT_SOCKET_SEND 0x4
+
+/*
+ * Inode events.
+ */
+#define KEVENT_INODE_CREATE 0x1
+#define KEVENT_INODE_REMOVE 0x2
+
+/*
+ * Poll events.
+ */
+#define KEVENT_POLL_POLLIN 0x0001
+#define KEVENT_POLL_POLLPRI 0x0002
+#define KEVENT_POLL_POLLOUT 0x0004
+#define KEVENT_POLL_POLLERR 0x0008
+#define KEVENT_POLL_POLLHUP 0x0010
+#define KEVENT_POLL_POLLNVAL 0x0020
+
+#define KEVENT_POLL_POLLRDNORM 0x0040
+#define KEVENT_POLL_POLLRDBAND 0x0080
+#define KEVENT_POLL_POLLWRNORM 0x0100
+#define KEVENT_POLL_POLLWRBAND 0x0200
+#define KEVENT_POLL_POLLMSG 0x0400
+#define KEVENT_POLL_POLLREMOVE 0x1000
+
+/*
+ * Asynchronous IO events.
+ */
+#define KEVENT_AIO_BIO 0x1
+
+#define KEVENT_MASK_ALL 0xffffffff /* Mask of all possible event values. */
+#define KEVENT_MASK_EMPTY 0x0 /* Empty mask of ready events. */
+
+struct kevent_id
+{
+ __u32 raw[2];
+};
+
+struct ukevent
+{
+ struct kevent_id id; /* Id of this request, e.g. socket number, file descriptor and so on... */
+ __u32 type; /* Event type, e.g. KEVENT_SOCK, KEVENT_INODE, KEVENT_TIMER and so on... */
+ __u32 event; /* Event itself, e.g. SOCK_ACCEPT, INODE_CREATED, TIMER_FIRED... */
+ __u32 req_flags; /* Per-event request flags */
+ __u32 ret_flags; /* Per-event return flags */
+ __u32 ret_data[2]; /* Event return data. Event originator fills it with anything it likes. */
+ union {
+ __u32 user[2]; /* User's data. It is not used, just copied to/from user. */
+ void *ptr;
+ };
+};
+
+#define KEVENT_CTL_ADD 0
+#define KEVENT_CTL_REMOVE 1
+#define KEVENT_CTL_MODIFY 2
+#define KEVENT_CTL_INIT 3
+
+#ifdef __KERNEL__
+
+#include <linux/types.h>
+#include <linux/list.h>
+#include <linux/spinlock.h>
+#include <linux/mutex.h>
+#include <linux/wait.h>
+#include <linux/net.h>
+#include <linux/rcupdate.h>
+#include <linux/kevent_storage.h>
+
+#define KEVENT_MAX_EVENTS 4096
+#define KEVENT_MIN_BUFFS_ALLOC 3
+
+struct inode;
+struct dentry;
+struct sock;
+
+struct kevent;
+struct kevent_storage;
+typedef int (* kevent_callback_t)(struct kevent *);
+
+/* @callback is called each time new event has been caught. */
+/* @enqueue is called each time new event is queued. */
+/* @dequeue is called each time event is dequeued. */
+
+struct kevent_callbacks {
+ kevent_callback_t callback, enqueue, dequeue;
+};
+
+struct kevent
+{
+ struct rcu_head rcu_head; /* Used for kevent freeing.*/
+ struct ukevent event;
+ spinlock_t ulock; /* This lock protects ukevent manipulations, e.g. ret_flags changes. */
+
+ struct list_head kevent_entry; /* Entry of user's queue. */
+ struct list_head storage_entry; /* Entry of origin's queue. */
+ struct list_head ready_entry; /* Entry of user's ready. */
+
+ struct kevent_user *user; /* User who requested this kevent. */
+ struct kevent_storage *st; /* Kevent container. */
+
+ struct kevent_callbacks callbacks;
+
+ void *priv; /* Private data for different storages.
+ * poll()/select storage has a list of wait_queue_t containers
+ * for each ->poll() { poll_wait()' } here.
+ */
+};
+
+extern struct kevent_callbacks kevent_registered_callbacks[];
+
+#define KEVENT_HASH_MASK 0xff
+
+struct kevent_user
+{
+ struct list_head kevent_list[KEVENT_HASH_MASK+1];
+ spinlock_t kevent_lock;
+ unsigned int kevent_num; /* Number of queued kevents. */
+
+ struct list_head ready_list; /* List of ready kevents. */
+ unsigned int ready_num; /* Number of ready kevents. */
+ spinlock_t ready_lock; /* Protects all manipulations with ready queue. */
+
+ unsigned int max_ready_num; /* Requested number of kevents. */
+
+ struct mutex ctl_mutex; /* Protects against simultaneous kevent_user control manipulations. */
+ wait_queue_head_t wait; /* Wait until some events are ready. */
+
+ atomic_t refcnt; /* Reference counter, increased for each new kevent. */
+
+ unsigned long *pring; /* Array of pages forming mapped ring buffer */
+
+#ifdef CONFIG_KEVENT_USER_STAT
+ unsigned long im_num;
+ unsigned long wait_num;
+ unsigned long total;
+#endif
+};
+
+extern kmem_cache_t *kevent_cache;
+int kevent_enqueue(struct kevent *k);
+int kevent_dequeue(struct kevent *k);
+int kevent_init(struct kevent *k);
+void kevent_requeue(struct kevent *k);
+int kevent_break(struct kevent *k);
+
+void kevent_user_ring_add_event(struct kevent *k);
+
+void kevent_storage_ready(struct kevent_storage *st,
+ kevent_callback_t ready_callback, u32 event);
+int kevent_storage_init(void *origin, struct kevent_storage *st);
+void kevent_storage_fini(struct kevent_storage *st);
+int kevent_storage_enqueue(struct kevent_storage *st, struct kevent *k);
+void kevent_storage_dequeue(struct kevent_storage *st, struct kevent *k);
+
+int kevent_user_add_ukevent(struct ukevent *uk, struct kevent_user *u);
+
+#ifdef CONFIG_KEVENT_POLL
+void kevent_poll_reinit(struct file *file);
+#else
+static inline void kevent_poll_reinit(struct file *file)
+{
+}
+#endif
+
+#ifdef CONFIG_KEVENT_INODE
+void kevent_inode_notify(struct inode *inode, u32 event);
+void kevent_inode_notify_parent(struct dentry *dentry, u32 event);
+void kevent_inode_remove(struct inode *inode);
+#else
+static inline void kevent_inode_notify(struct inode *inode, u32 event)
+{
+}
+static inline void kevent_inode_notify_parent(struct dentry *dentry, u32 event)
+{
+}
+static inline void kevent_inode_remove(struct inode *inode)
+{
+}
+#endif /* CONFIG_KEVENT_INODE */
+#ifdef CONFIG_KEVENT_SOCKET
+#ifdef CONFIG_LOCKDEP
+void kevent_socket_reinit(struct socket *sock);
+void kevent_sk_reinit(struct sock *sk);
+#else
+static inline void kevent_socket_reinit(struct socket *sock)
+{
+}
+static inline void kevent_sk_reinit(struct sock *sk)
+{
+}
+#endif
+void kevent_socket_notify(struct sock *sock, u32 event);
+int kevent_socket_dequeue(struct kevent *k);
+int kevent_socket_enqueue(struct kevent *k);
+#define sock_async(__sk) sock_flag(__sk, SOCK_ASYNC)
+#else
+static inline void kevent_socket_notify(struct sock *sock, u32 event)
+{
+}
+#define sock_async(__sk) ({ (void)__sk; 0; })
+#endif
+
+#ifdef CONFIG_KEVENT_USER_STAT
+static inline void kevent_user_stat_init(struct kevent_user *u)
+{
+ u->wait_num = u->im_num = u->total = 0;
+}
+static inline void kevent_user_stat_print(struct kevent_user *u)
+{
+ pr_debug("%s: u=%p, wait=%lu, immediately=%lu, total=%lu.\n",
+ __func__, u, u->wait_num, u->im_num, u->total);
+}
+static inline void kevent_user_stat_increase_im(struct kevent_user *u)
+{
+ u->im_num++;
+}
+static inline void kevent_user_stat_increase_wait(struct kevent_user *u)
+{
+ u->wait_num++;
+}
+static inline void kevent_user_stat_increase_total(struct kevent_user *u)
+{
+ u->total++;
+}
+#else
+#define kevent_user_stat_print(u) ({ (void) u;})
+#define kevent_user_stat_init(u) ({ (void) u;})
+#define kevent_user_stat_increase_im(u) ({ (void) u;})
+#define kevent_user_stat_increase_wait(u) ({ (void) u;})
+#define kevent_user_stat_increase_total(u) ({ (void) u;})
+#endif
+
+#endif /* __KERNEL__ */
+#endif /* __KEVENT_H */
diff --git a/include/linux/kevent_storage.h b/include/linux/kevent_storage.h
new file mode 100644
index 0000000..bd891f0
--- /dev/null
+++ b/include/linux/kevent_storage.h
@@ -0,0 +1,12 @@
+#ifndef __KEVENT_STORAGE_H
+#define __KEVENT_STORAGE_H
+
+struct kevent_storage
+{
+ void *origin; /* Originator's pointer, e.g. struct sock or struct file. Can be NULL. */
+ struct list_head list; /* List of queued kevents. */
+ unsigned int qlen; /* Number of queued kevents. */
+ spinlock_t lock; /* Protects users queue. */
+};
+
+#endif /* __KEVENT_STORAGE_H */
diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
index 008f04c..143f3b5 100644
--- a/include/linux/syscalls.h
+++ b/include/linux/syscalls.h
@@ -597,4 +597,9 @@ asmlinkage long sys_get_robust_list(int
asmlinkage long sys_set_robust_list(struct robust_list_head __user *head,
size_t len);

+asmlinkage long sys_aio_recv(int ctl_fd, int s, void __user *buf, size_t size, unsigned flags);
+asmlinkage long sys_aio_send(int ctl_fd, int s, void __user *buf, size_t size, unsigned flags);
+asmlinkage long sys_kevent_get_events(int ctl_fd, unsigned int min, unsigned int max,
+ unsigned int timeout, void __user *buf, unsigned flags);
+asmlinkage long sys_kevent_ctl(int ctl_fd, unsigned int cmd, unsigned int num, void __user *buf);
#endif
diff --git a/init/Kconfig b/init/Kconfig
index a099fc6..c550fcc 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -218,6 +218,8 @@ config AUDITSYSCALL
such as SELinux. To use audit's filesystem watch feature, please
ensure that INOTIFY is configured.

+source "kernel/kevent/Kconfig"
+
config IKCONFIG
bool "Kernel .config support"
---help---
diff --git a/kernel/Makefile b/kernel/Makefile
index d62ec66..2d7a6dd 100644
--- a/kernel/Makefile
+++ b/kernel/Makefile
@@ -47,6 +47,7 @@ obj-$(CONFIG_DETECT_SOFTLOCKUP) += softl
obj-$(CONFIG_GENERIC_HARDIRQS) += irq/
obj-$(CONFIG_SECCOMP) += seccomp.o
obj-$(CONFIG_RCU_TORTURE_TEST) += rcutorture.o
+obj-$(CONFIG_KEVENT) += kevent/
obj-$(CONFIG_RELAY) += relay.o
obj-$(CONFIG_TASK_DELAY_ACCT) += delayacct.o
obj-$(CONFIG_TASKSTATS) += taskstats.o
diff --git a/kernel/kevent/Kconfig b/kernel/kevent/Kconfig
new file mode 100644
index 0000000..88b35af
--- /dev/null
+++ b/kernel/kevent/Kconfig
@@ -0,0 +1,50 @@
+config KEVENT
+ bool "Kernel event notification mechanism"
+ help
+ This option enables the event queue mechanism.
+ It can be used as a replacement for poll()/select(), AIO callback invocations,
+ advanced timer notifications and other kernel object status changes.
+
+config KEVENT_USER_STAT
+ bool "Kevent user statistic"
+ depends on KEVENT
+ default n
+ help
+ This option turns on kevent_user statistics collection.
+ Collected data includes the total number of kevents, the number of kevents which are ready
+ immediately at insertion time and the number of kevents which were removed through
+ readiness completion. The statistics are printed each time the control kevent descriptor
+ is closed.
+
+config KEVENT_SOCKET
+ bool "Kernel event notifications for sockets"
+ depends on NET && KEVENT
+ help
+ This option enables notifications through the KEVENT subsystem of
+ socket operations, like new packet receiving conditions, ready-for-accept
+ conditions and so on.
+
+config KEVENT_INODE
+ bool "Kernel event notifications for inodes"
+ depends on KEVENT
+ help
+ This option enables notifications through the KEVENT subsystem of
+ inode operations, like file creation, removal and so on.
+
+config KEVENT_TIMER
+ bool "Kernel event notifications for timers"
+ depends on KEVENT
+ help
+ This option allows timers to be used through the KEVENT subsystem.
+
+config KEVENT_POLL
+ bool "Kernel event notifications for poll()/select()"
+ depends on KEVENT
+ help
+ This option allows the kevent subsystem to be used for poll()/select() notifications.
+
+config KEVENT_NAIO
+ bool "Network asynchronous IO"
+ depends on KEVENT && KEVENT_SOCKET
+ help
+ This option enables kevent based network asynchronous IO subsystem.
diff --git a/kernel/kevent/Makefile b/kernel/kevent/Makefile
new file mode 100644
index 0000000..d1ef9ba
--- /dev/null
+++ b/kernel/kevent/Makefile
@@ -0,0 +1,6 @@
+obj-y := kevent.o kevent_user.o
+obj-$(CONFIG_KEVENT_SOCKET) += kevent_socket.o
+obj-$(CONFIG_KEVENT_INODE) += kevent_inode.o
+obj-$(CONFIG_KEVENT_TIMER) += kevent_timer.o
+obj-$(CONFIG_KEVENT_POLL) += kevent_poll.o
+obj-$(CONFIG_KEVENT_NAIO) += kevent_naio.o
diff --git a/kernel/kevent/kevent.c b/kernel/kevent/kevent.c
new file mode 100644
index 0000000..e63a8fd
--- /dev/null
+++ b/kernel/kevent/kevent.c
@@ -0,0 +1,238 @@
+/*
+ * kevent.c
+ *
+ * 2006 Copyright (c) Evgeniy Polyakov <[email protected]>
+ * All rights reserved.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software
+ * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA
+ */
+
+#include <linux/kernel.h>
+#include <linux/types.h>
+#include <linux/list.h>
+#include <linux/slab.h>
+#include <linux/spinlock.h>
+#include <linux/mempool.h>
+#include <linux/sched.h>
+#include <linux/wait.h>
+#include <linux/kevent.h>
+
+kmem_cache_t *kevent_cache;
+
+/*
+ * Attempts to add an event into appropriate origin's queue.
+ * Returns positive value if this event is ready immediately,
+ * negative value in case of error and zero if event has been queued.
+ * ->enqueue() callback must increase origin's reference counter.
+ */
+int kevent_enqueue(struct kevent *k)
+{
+ if (k->event.type >= KEVENT_MAX)
+ return -E2BIG;
+
+ if (!k->callbacks.enqueue) {
+ kevent_break(k);
+ return -EINVAL;
+ }
+
+ return k->callbacks.enqueue(k);
+}
+
+/*
+ * Remove event from the appropriate queue.
+ * ->dequeue() callback must decrease origin's reference counter.
+ */
+int kevent_dequeue(struct kevent *k)
+{
+ if (k->event.type >= KEVENT_MAX)
+ return -E2BIG;
+
+ if (!k->callbacks.dequeue) {
+ kevent_break(k);
+ return -EINVAL;
+ }
+
+ return k->callbacks.dequeue(k);
+}
+
+int kevent_break(struct kevent *k)
+{
+ unsigned long flags;
+
+ spin_lock_irqsave(&k->ulock, flags);
+ k->event.ret_flags |= KEVENT_RET_BROKEN;
+ spin_unlock_irqrestore(&k->ulock, flags);
+ return 0;
+}
+
+struct kevent_callbacks kevent_registered_callbacks[KEVENT_MAX];
+
+/*
+ * Must be called before event is going to be added into some origin's queue.
+ * Initializes ->enqueue(), ->dequeue() and ->callback() callbacks.
+ * If it fails, the kevent should not be used; kevent_enqueue() will fail to add
+ * this kevent into the origin's queue and will set the
+ * KEVENT_RET_BROKEN flag in kevent->event.ret_flags.
+ */
+int kevent_init(struct kevent *k)
+{
+ spin_lock_init(&k->ulock);
+ k->kevent_entry.next = LIST_POISON1;
+ k->storage_entry.prev = LIST_POISON2;
+ k->ready_entry.next = LIST_POISON1;
+
+ if (k->event.type >= KEVENT_MAX)
+ return -E2BIG;
+
+ k->callbacks = kevent_registered_callbacks[k->event.type];
+ if (!k->callbacks.callback) {
+ kevent_break(k);
+ return -EINVAL;
+ }
+
+ return 0;
+}
+
+/*
+ * Called from ->enqueue() callback when reference counter for given
+ * origin (socket, inode...) has been increased.
+ */
+int kevent_storage_enqueue(struct kevent_storage *st, struct kevent *k)
+{
+ unsigned long flags;
+
+ k->st = st;
+ spin_lock_irqsave(&st->lock, flags);
+ list_add_tail_rcu(&k->storage_entry, &st->list);
+ st->qlen++;
+ spin_unlock_irqrestore(&st->lock, flags);
+ return 0;
+}
+
+/*
+ * Dequeue kevent from origin's queue.
+ * It does not decrease origin's reference counter in any way
+ * and must be called before it, so storage itself must be valid.
+ * It is called from ->dequeue() callback.
+ */
+void kevent_storage_dequeue(struct kevent_storage *st, struct kevent *k)
+{
+ unsigned long flags;
+
+ spin_lock_irqsave(&st->lock, flags);
+ if (k->storage_entry.prev != LIST_POISON2) {
+ list_del_rcu(&k->storage_entry);
+ st->qlen--;
+ }
+ spin_unlock_irqrestore(&st->lock, flags);
+}
+
+static void __kevent_requeue(struct kevent *k, u32 event)
+{
+ int err, rem = 0;
+ unsigned long flags;
+
+ err = k->callbacks.callback(k);
+
+ spin_lock_irqsave(&k->ulock, flags);
+ if (err > 0) {
+ k->event.ret_flags |= KEVENT_RET_DONE;
+ } else if (err < 0) {
+ k->event.ret_flags |= KEVENT_RET_BROKEN;
+ k->event.ret_flags |= KEVENT_RET_DONE;
+ }
+ rem = (k->event.req_flags & KEVENT_REQ_ONESHOT);
+ if (!err)
+ err = (k->event.ret_flags & (KEVENT_RET_BROKEN|KEVENT_RET_DONE));
+ spin_unlock_irqrestore(&k->ulock, flags);
+
+ if (err) {
+ if ((rem || err < 0) && k->storage_entry.prev != LIST_POISON2) {
+ list_del_rcu(&k->storage_entry);
+ k->st->qlen--;
+ }
+
+ spin_lock_irqsave(&k->user->ready_lock, flags);
+ if (k->ready_entry.next == LIST_POISON1) {
+ kevent_user_ring_add_event(k);
+ list_add_tail(&k->ready_entry, &k->user->ready_list);
+ k->user->ready_num++;
+ }
+ spin_unlock_irqrestore(&k->user->ready_lock, flags);
+ wake_up(&k->user->wait);
+ }
+}
+
+void kevent_requeue(struct kevent *k)
+{
+ unsigned long flags;
+
+ spin_lock_irqsave(&k->st->lock, flags);
+ __kevent_requeue(k, 0);
+ spin_unlock_irqrestore(&k->st->lock, flags);
+}
+
+/*
+ * Called each time some activity in origin (socket, inode...) is noticed.
+ */
+void kevent_storage_ready(struct kevent_storage *st,
+ kevent_callback_t ready_callback, u32 event)
+{
+ struct kevent *k;
+
+ rcu_read_lock();
+ list_for_each_entry_rcu(k, &st->list, storage_entry) {
+ if (ready_callback)
+ ready_callback(k);
+
+ if (event & k->event.event)
+ __kevent_requeue(k, event);
+ }
+ rcu_read_unlock();
+}
+
+int kevent_storage_init(void *origin, struct kevent_storage *st)
+{
+ spin_lock_init(&st->lock);
+ st->origin = origin;
+ st->qlen = 0;
+ INIT_LIST_HEAD(&st->list);
+ return 0;
+}
+
+void kevent_storage_fini(struct kevent_storage *st)
+{
+ kevent_storage_ready(st, kevent_break, KEVENT_MASK_ALL);
+}
+
+static int __init kevent_sys_init(void)
+{
+ int i;
+
+ kevent_cache = kmem_cache_create("kevent_cache",
+ sizeof(struct kevent), 0, 0, NULL, NULL);
+ if (!kevent_cache)
+ panic("kevent: Unable to create a cache.\n");
+
+ for (i=0; i<ARRAY_SIZE(kevent_registered_callbacks); ++i) {
+ struct kevent_callbacks *c = &kevent_registered_callbacks[i];
+
+ c->callback = c->enqueue = c->dequeue = NULL;
+ }
+
+ return 0;
+}
+
+late_initcall(kevent_sys_init);
diff --git a/kernel/kevent/kevent_user.c b/kernel/kevent/kevent_user.c
new file mode 100644
index 0000000..7b6374b
--- /dev/null
+++ b/kernel/kevent/kevent_user.c
@@ -0,0 +1,857 @@
+/*
+ * kevent_user.c
+ *
+ * 2006 Copyright (c) Evgeniy Polyakov <[email protected]>
+ * All rights reserved.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software
+ * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA
+ */
+
+#include <linux/kernel.h>
+#include <linux/module.h>
+#include <linux/types.h>
+#include <linux/list.h>
+#include <linux/slab.h>
+#include <linux/spinlock.h>
+#include <linux/fs.h>
+#include <linux/file.h>
+#include <linux/mount.h>
+#include <linux/device.h>
+#include <linux/poll.h>
+#include <linux/kevent.h>
+#include <linux/jhash.h>
+#include <linux/miscdevice.h>
+#include <asm/io.h>
+
+static char kevent_name[] = "kevent";
+
+static int kevent_user_open(struct inode *, struct file *);
+static int kevent_user_release(struct inode *, struct file *);
+static unsigned int kevent_user_poll(struct file *, struct poll_table_struct *);
+static int kevnet_user_mmap(struct file *, struct vm_area_struct *);
+
+static struct file_operations kevent_user_fops = {
+ .mmap = kevnet_user_mmap,
+ .open = kevent_user_open,
+ .release = kevent_user_release,
+ .poll = kevent_user_poll,
+ .owner = THIS_MODULE,
+};
+
+static struct miscdevice kevent_miscdev = {
+ .minor = MISC_DYNAMIC_MINOR,
+ .name = kevent_name,
+ .fops = &kevent_user_fops,
+};
+
+static int kevent_get_sb(struct file_system_type *fs_type,
+ int flags, const char *dev_name, void *data, struct vfsmount *mnt)
+{
+ /* So original magic... */
+ return get_sb_pseudo(fs_type, kevent_name, NULL, 0xabcdef, mnt);
+}
+
+static struct file_system_type kevent_fs_type = {
+ .name = kevent_name,
+ .get_sb = kevent_get_sb,
+ .kill_sb = kill_anon_super,
+};
+
+static struct vfsmount *kevent_mnt;
+
+static unsigned int kevent_user_poll(struct file *file, struct poll_table_struct *wait)
+{
+ struct kevent_user *u = file->private_data;
+ unsigned int mask;
+
+ poll_wait(file, &u->wait, wait);
+ mask = 0;
+
+ if (u->ready_num)
+ mask |= POLLIN | POLLRDNORM;
+
+ return mask;
+}
+
+static inline void kevent_user_ring_set(struct kevent_user *u, unsigned int num)
+{
+ unsigned int *idx;
+
+ idx = (unsigned int *)u->pring[0];
+ idx[0] = num;
+}
+
+/*
+ * Note that kevents do not exactly fill the page (each ukevent is 40 bytes),
+ * so we reuse 4 bytes at the beginning of the first page to store the index.
+ * Take that into account if you want to change the size of struct ukevent.
+ */
+#define KEVENTS_ON_PAGE (PAGE_SIZE/sizeof(struct ukevent))
+
+/*
+ * Called under kevent_user->ready_lock, so updates are always protected.
+ */
+void kevent_user_ring_add_event(struct kevent *k)
+{
+ unsigned int *idx_ptr, idx, pidx, off;
+ struct ukevent *ukev;
+
+ idx_ptr = (unsigned int *)k->user->pring[0];
+ idx = idx_ptr[0];
+
+ pidx = idx/KEVENTS_ON_PAGE;
+ off = idx%KEVENTS_ON_PAGE;
+
+ if (pidx == 0)
+ ukev = (struct ukevent *)(k->user->pring[pidx] + sizeof(unsigned int));
+ else
+ ukev = (struct ukevent *)(k->user->pring[pidx]);
+
+ memcpy(&ukev[off], &k->event, sizeof(struct ukevent));
+
+ idx++;
+ if (idx >= KEVENT_MAX_EVENTS)
+ idx = 0;
+
+ idx_ptr[0] = idx;
+}
+
+static int kevent_user_ring_init(struct kevent_user *u)
+{
+ int i, pnum;
+
+ pnum = ALIGN(KEVENT_MAX_EVENTS*sizeof(struct ukevent) + sizeof(unsigned int), PAGE_SIZE)/PAGE_SIZE;
+
+ u->pring = kmalloc(pnum * sizeof(unsigned long), GFP_KERNEL);
+ if (!u->pring)
+ return -ENOMEM;
+
+ for (i=0; i<pnum; ++i) {
+ u->pring[i] = __get_free_page(GFP_KERNEL);
+ if (!u->pring)
+ break;
+ }
+
+ if (i != pnum) {
+ pnum = i;
+ goto err_out_free;
+ }
+
+ kevent_user_ring_set(u, 0);
+
+ return 0;
+
+err_out_free:
+ for (i=0; i<pnum; ++i)
+ free_page(u->pring[i]);
+
+ kfree(u->pring);
+
+ return -ENOMEM;
+}
+
+static void kevent_user_ring_fini(struct kevent_user *u)
+{
+ int i, pnum;
+
+ pnum = ALIGN(KEVENT_MAX_EVENTS*sizeof(struct ukevent) + sizeof(unsigned int), PAGE_SIZE)/PAGE_SIZE;
+
+ for (i=0; i<pnum; ++i)
+ free_page(u->pring[i]);
+
+ kfree(u->pring);
+}
+
+static struct kevent_user *kevent_user_alloc(void)
+{
+ struct kevent_user *u;
+ int i;
+
+ u = kzalloc(sizeof(struct kevent_user), GFP_KERNEL);
+ if (!u)
+ return NULL;
+
+ INIT_LIST_HEAD(&u->ready_list);
+ spin_lock_init(&u->ready_lock);
+ u->ready_num = 0;
+ kevent_user_stat_init(u);
+ spin_lock_init(&u->kevent_lock);
+ for (i=0; i<ARRAY_SIZE(u->kevent_list); ++i)
+ INIT_LIST_HEAD(&u->kevent_list[i]);
+ u->kevent_num = 0;
+
+ mutex_init(&u->ctl_mutex);
+ init_waitqueue_head(&u->wait);
+ u->max_ready_num = 0;
+
+ atomic_set(&u->refcnt, 1);
+
+ if (kevent_user_ring_init(u)) {
+ kfree(u);
+ u = NULL;
+ }
+
+ return u;
+}
+
+static int kevent_user_open(struct inode *inode, struct file *file)
+{
+ struct kevent_user *u = kevent_user_alloc();
+
+ if (!u)
+ return -ENOMEM;
+
+ file->private_data = u;
+
+ return 0;
+}
+
+static inline void kevent_user_get(struct kevent_user *u)
+{
+ atomic_inc(&u->refcnt);
+}
+
+static inline void kevent_user_put(struct kevent_user *u)
+{
+ if (atomic_dec_and_test(&u->refcnt)) {
+ kevent_user_stat_print(u);
+ kevent_user_ring_fini(u);
+ kfree(u);
+ }
+}
+
+static int kevnet_user_mmap(struct file *file, struct vm_area_struct *vma)
+{
+ size_t size = vma->vm_end - vma->vm_start, psize;
+ int pnum = size/PAGE_SIZE, i;
+ unsigned long start = vma->vm_start;
+ struct kevent_user *u = file->private_data;
+
+ psize = ALIGN(KEVENT_MAX_EVENTS*sizeof(struct ukevent) + sizeof(unsigned int), PAGE_SIZE);
+
+ if (size + vma->vm_pgoff*PAGE_SIZE != psize)
+ return -EINVAL;
+
+ if (vma->vm_flags & VM_WRITE)
+ return -EPERM;
+
+ vma->vm_page_prot = pgprot_noncached(vma->vm_page_prot);
+
+ for (i=0; i<pnum; ++i) {
+ if (remap_pfn_range(vma, start, virt_to_phys((void *)u->pring[i+vma->vm_pgoff]), PAGE_SIZE,
+ vma->vm_page_prot))
+ return -EAGAIN;
+ start += PAGE_SIZE;
+ }
+
+ return 0;
+}
+
+#if 0
+static inline unsigned int kevent_user_hash(struct ukevent *uk)
+{
+ unsigned int h = (uk->user[0] ^ uk->user[1]) ^ (uk->id.raw[0] ^ uk->id.raw[1]);
+
+ h = (((h >> 16) & 0xffff) ^ (h & 0xffff)) & 0xffff;
+ h = (((h >> 8) & 0xff) ^ (h & 0xff)) & KEVENT_HASH_MASK;
+
+ return h;
+}
+#else
+static inline unsigned int kevent_user_hash(struct ukevent *uk)
+{
+ return jhash_1word(uk->id.raw[0], 0) & KEVENT_HASH_MASK;
+}
+#endif
+
+static void kevent_free_rcu(struct rcu_head *rcu)
+{
+ struct kevent *kevent = container_of(rcu, struct kevent, rcu_head);
+ kmem_cache_free(kevent_cache, kevent);
+}
+
+static void kevent_finish_user_complete(struct kevent *k, int deq)
+{
+ struct kevent_user *u = k->user;
+ unsigned long flags;
+
+ if (deq)
+ kevent_dequeue(k);
+
+ spin_lock_irqsave(&u->ready_lock, flags);
+ if (k->ready_entry.next != LIST_POISON1) {
+ list_del(&k->ready_entry);
+ u->ready_num--;
+ }
+ spin_unlock_irqrestore(&u->ready_lock, flags);
+
+ kevent_user_put(u);
+ call_rcu(&k->rcu_head, kevent_free_rcu);
+}
+
+static void __kevent_finish_user(struct kevent *k, int deq)
+{
+ struct kevent_user *u = k->user;
+
+ list_del(&k->kevent_entry);
+ u->kevent_num--;
+ kevent_finish_user_complete(k, deq);
+}
+
+/*
+ * Remove kevent from user's list of all events,
+ * dequeue it from storage and decrease user's reference counter,
+ * since this kevent does not exist anymore. That is why it is freed here.
+ */
+static void kevent_finish_user(struct kevent *k, int deq)
+{
+ struct kevent_user *u = k->user;
+ unsigned long flags;
+
+ spin_lock_irqsave(&u->kevent_lock, flags);
+ list_del(&k->kevent_entry);
+ u->kevent_num--;
+ spin_unlock_irqrestore(&u->kevent_lock, flags);
+ kevent_finish_user_complete(k, deq);
+}
+
+/*
+ * Dequeue one entry from user's ready queue.
+ */
+
+static struct kevent *kqueue_dequeue_ready(struct kevent_user *u)
+{
+ unsigned long flags;
+ struct kevent *k = NULL;
+
+ spin_lock_irqsave(&u->ready_lock, flags);
+ if (u->ready_num && !list_empty(&u->ready_list)) {
+ k = list_entry(u->ready_list.next, struct kevent, ready_entry);
+ list_del(&k->ready_entry);
+ u->ready_num--;
+ }
+ spin_unlock_irqrestore(&u->ready_lock, flags);
+
+ return k;
+}
+
+static struct kevent *__kevent_search(struct list_head *head, struct ukevent *uk,
+ struct kevent_user *u)
+{
+ struct kevent *k;
+ int found = 0;
+
+ list_for_each_entry(k, head, kevent_entry) {
+ spin_lock(&k->ulock);
+ if (k->event.user[0] == uk->user[0] && k->event.user[1] == uk->user[1] &&
+ k->event.id.raw[0] == uk->id.raw[0] &&
+ k->event.id.raw[1] == uk->id.raw[1]) {
+ found = 1;
+ spin_unlock(&k->ulock);
+ break;
+ }
+ spin_unlock(&k->ulock);
+ }
+
+ return (found)?k:NULL;
+}
+
+static int kevent_modify(struct ukevent *uk, struct kevent_user *u)
+{
+ struct kevent *k;
+ unsigned int hash = kevent_user_hash(uk);
+ int err = -ENODEV;
+ unsigned long flags;
+
+ spin_lock_irqsave(&u->kevent_lock, flags);
+ k = __kevent_search(&u->kevent_list[hash], uk, u);
+ if (k) {
+ spin_lock(&k->ulock);
+ k->event.event = uk->event;
+ k->event.req_flags = uk->req_flags;
+ k->event.ret_flags = 0;
+ spin_unlock(&k->ulock);
+ kevent_requeue(k);
+ err = 0;
+ }
+ spin_unlock_irqrestore(&u->kevent_lock, flags);
+
+ return err;
+}
+
+static int kevent_remove(struct ukevent *uk, struct kevent_user *u)
+{
+ int err = -ENODEV;
+ struct kevent *k;
+ unsigned int hash = kevent_user_hash(uk);
+ unsigned long flags;
+
+ spin_lock_irqsave(&u->kevent_lock, flags);
+ k = __kevent_search(&u->kevent_list[hash], uk, u);
+ if (k) {
+ __kevent_finish_user(k, 1);
+ err = 0;
+ }
+ spin_unlock_irqrestore(&u->kevent_lock, flags);
+
+ return err;
+}
+
+/*
+ * No new entry can be added or removed from any list at this point.
+ * It is not permitted to call ->ioctl() and ->release() in parallel.
+ */
+static int kevent_user_release(struct inode *inode, struct file *file)
+{
+ struct kevent_user *u = file->private_data;
+ struct kevent *k, *n;
+ int i;
+
+ for (i=0; i<KEVENT_HASH_MASK+1; ++i) {
+ list_for_each_entry_safe(k, n, &u->kevent_list[i], kevent_entry)
+ kevent_finish_user(k, 1);
+ }
+
+ kevent_user_put(u);
+ file->private_data = NULL;
+
+ return 0;
+}
+
+static struct ukevent *kevent_get_user(unsigned int num, void __user *arg)
+{
+ struct ukevent *ukev;
+
+ ukev = kmalloc(sizeof(struct ukevent) * num, GFP_KERNEL);
+ if (!ukev)
+ return NULL;
+
+ if (copy_from_user(arg, ukev, sizeof(struct ukevent) * num)) {
+ kfree(ukev);
+ return NULL;
+ }
+
+ return ukev;
+}
+
+static int kevent_user_ctl_modify(struct kevent_user *u, unsigned int num, void __user *arg)
+{
+ int err = 0, i;
+ struct ukevent uk;
+
+ mutex_lock(&u->ctl_mutex);
+
+ if (num > KEVENT_MIN_BUFFS_ALLOC) {
+ struct ukevent *ukev;
+
+ ukev = kevent_get_user(num, arg);
+ if (ukev) {
+ for (i=0; i<num; ++i) {
+ if (kevent_modify(&ukev[i], u))
+ ukev[i].ret_flags |= KEVENT_RET_BROKEN;
+ ukev[i].ret_flags |= KEVENT_RET_DONE;
+ }
+ if (copy_to_user(arg, ukev, num*sizeof(struct ukevent)))
+ err = -EINVAL;
+ kfree(ukev);
+ goto out;
+ }
+ }
+
+ for (i=0; i<num; ++i) {
+ if (copy_from_user(&uk, arg, sizeof(struct ukevent))) {
+ err = -EINVAL;
+ break;
+ }
+
+ if (kevent_modify(&uk, u))
+ uk.ret_flags |= KEVENT_RET_BROKEN;
+ uk.ret_flags |= KEVENT_RET_DONE;
+
+ if (copy_to_user(arg, &uk, sizeof(struct ukevent))) {
+ err = -EINVAL;
+ break;
+ }
+
+ arg += sizeof(struct ukevent);
+ }
+out:
+ mutex_unlock(&u->ctl_mutex);
+
+ return err;
+}
+
+static int kevent_user_ctl_remove(struct kevent_user *u, unsigned int num, void __user *arg)
+{
+ int err = 0, i;
+ struct ukevent uk;
+
+ mutex_lock(&u->ctl_mutex);
+
+ if (num > KEVENT_MIN_BUFFS_ALLOC) {
+ struct ukevent *ukev;
+
+ ukev = kevent_get_user(num, arg);
+ if (ukev) {
+ for (i=0; i<num; ++i) {
+ if (kevent_remove(&ukev[i], u))
+ ukev[i].ret_flags |= KEVENT_RET_BROKEN;
+ ukev[i].ret_flags |= KEVENT_RET_DONE;
+ }
+ if (copy_to_user(arg, ukev, num*sizeof(struct ukevent)))
+ err = -EINVAL;
+ kfree(ukev);
+ goto out;
+ }
+ }
+
+ for (i=0; i<num; ++i) {
+ if (copy_from_user(&uk, arg, sizeof(struct ukevent))) {
+ err = -EINVAL;
+ break;
+ }
+
+ if (kevent_remove(&uk, u))
+ uk.ret_flags |= KEVENT_RET_BROKEN;
+
+ uk.ret_flags |= KEVENT_RET_DONE;
+
+ if (copy_to_user(arg, &uk, sizeof(struct ukevent))) {
+ err = -EINVAL;
+ break;
+ }
+
+ arg += sizeof(struct ukevent);
+ }
+out:
+ mutex_unlock(&u->ctl_mutex);
+
+ return err;
+}
+
+static void kevent_user_enqueue(struct kevent_user *u, struct kevent *k)
+{
+ unsigned long flags;
+ unsigned int hash = kevent_user_hash(&k->event);
+
+ spin_lock_irqsave(&u->kevent_lock, flags);
+ list_add_tail(&k->kevent_entry, &u->kevent_list[hash]);
+ u->kevent_num++;
+ kevent_user_get(u);
+ spin_unlock_irqrestore(&u->kevent_lock, flags);
+}
+
+int kevent_user_add_ukevent(struct ukevent *uk, struct kevent_user *u)
+{
+ struct kevent *k;
+ int err;
+
+ k = kmem_cache_alloc(kevent_cache, GFP_KERNEL);
+ if (!k) {
+ err = -ENOMEM;
+ goto err_out_exit;
+ }
+
+ memcpy(&k->event, uk, sizeof(struct ukevent));
+ INIT_RCU_HEAD(&k->rcu_head);
+
+ k->event.ret_flags = 0;
+
+ err = kevent_init(k);
+ if (err) {
+ kmem_cache_free(kevent_cache, k);
+ goto err_out_exit;
+ }
+ k->user = u;
+ kevent_user_stat_increase_total(u);
+ kevent_user_enqueue(u, k);
+
+ err = kevent_enqueue(k);
+ if (err) {
+ memcpy(uk, &k->event, sizeof(struct ukevent));
+ if (err < 0)
+ uk->ret_flags |= KEVENT_RET_BROKEN;
+ uk->ret_flags |= KEVENT_RET_DONE;
+ kevent_finish_user(k, 0);
+ }
+
+err_out_exit:
+ return err;
+}
+
+/*
+ * Copy all ukevents from userspace, allocate a kevent for each one
+ * and add them into the appropriate kevent_storages,
+ * e.g. sockets, inodes and so on...
+ * If something goes wrong, all events will be dequeued and
+ * a negative error will be returned.
+ * On success the number of finished events is returned and
+ * an array of finished events (struct ukevent) will be placed behind
+ * the kevent_user_control structure. The user must run through that array and check
+ * the ret_flags field of each ukevent structure to determine whether it is a fired or failed event.
+ */
+static int kevent_user_ctl_add(struct kevent_user *u, unsigned int num, void __user *arg)
+{
+ int err, cerr = 0, knum = 0, rnum = 0, i;
+ void __user *orig = arg;
+ struct ukevent uk;
+
+ mutex_lock(&u->ctl_mutex);
+
+ err = -ENFILE;
+ if (u->kevent_num + num >= KEVENT_MAX_EVENTS)
+ goto out_remove;
+
+ if (num > KEVENT_MIN_BUFFS_ALLOC) {
+ struct ukevent *ukev;
+
+ ukev = kevent_get_user(num, arg);
+ if (ukev) {
+ for (i=0; i<num; ++i) {
+ err = kevent_user_add_ukevent(&ukev[i], u);
+ if (err) {
+ kevent_user_stat_increase_im(u);
+ if (i != rnum)
+ memcpy(&ukev[rnum], &ukev[i], sizeof(struct ukevent));
+ rnum++;
+ } else
+ knum++;
+ }
+ if (copy_to_user(orig, ukev, rnum*sizeof(struct ukevent)))
+ cerr = -EINVAL;
+ kfree(ukev);
+ goto out_setup;
+ }
+ }
+
+ for (i=0; i<num; ++i) {
+ if (copy_from_user(&uk, arg, sizeof(struct ukevent))) {
+ cerr = -EINVAL;
+ break;
+ }
+ arg += sizeof(struct ukevent);
+
+ err = kevent_user_add_ukevent(&uk, u);
+ if (err) {
+ kevent_user_stat_increase_im(u);
+ if (copy_to_user(orig, &uk, sizeof(struct ukevent))) {
+ cerr = -EINVAL;
+ break;
+ }
+ orig += sizeof(struct ukevent);
+ rnum++;
+ } else
+ knum++;
+ }
+
+out_setup:
+ if (cerr < 0) {
+ err = cerr;
+ goto out_remove;
+ }
+
+ err = rnum;
+out_remove:
+ mutex_unlock(&u->ctl_mutex);
+
+ return err;
+}
+
+/*
+ * In nonblocking mode it returns as many events as possible, but not more than @max_nr.
+ * In blocking mode it waits until the timeout expires or until at least @min_nr events are ready.
+ */
+static int kevent_user_wait(struct file *file, struct kevent_user *u,
+ unsigned int min_nr, unsigned int max_nr, unsigned int timeout,
+ void __user *buf)
+{
+ struct kevent *k;
+ int cerr = 0, num = 0;
+
+ if (!(file->f_flags & O_NONBLOCK)) {
+ wait_event_interruptible_timeout(u->wait,
+ u->ready_num >= min_nr, msecs_to_jiffies(timeout));
+ }
+
+ while (num < max_nr && ((k = kqueue_dequeue_ready(u)) != NULL)) {
+ if (copy_to_user(buf + num*sizeof(struct ukevent),
+ &k->event, sizeof(struct ukevent))) {
+ cerr = -EINVAL;
+ break;
+ }
+
+ /*
+ * If it is one-shot kevent, it has been removed already from
+ * origin's queue, so we can easily free it here.
+ */
+ if (k->event.req_flags & KEVENT_REQ_ONESHOT)
+ kevent_finish_user(k, 1);
+ ++num;
+ kevent_user_stat_increase_wait(u);
+ }
+
+ return (cerr)?cerr:num;
+}
+
+static int kevent_ctl_init(void)
+{
+ struct kevent_user *u;
+ struct file *file;
+ int fd, ret;
+
+ fd = get_unused_fd();
+ if (fd < 0)
+ return fd;
+
+ file = get_empty_filp();
+ if (!file) {
+ ret = -ENFILE;
+ goto out_put_fd;
+ }
+
+ u = kevent_user_alloc();
+ if (unlikely(!u)) {
+ ret = -ENOMEM;
+ goto out_put_file;
+ }
+
+ file->f_op = &kevent_user_fops;
+ file->f_vfsmnt = mntget(kevent_mnt);
+ file->f_dentry = dget(kevent_mnt->mnt_root);
+ file->f_mapping = file->f_dentry->d_inode->i_mapping;
+ file->f_mode = FMODE_READ;
+ file->f_flags = O_RDONLY;
+ file->private_data = u;
+
+ fd_install(fd, file);
+
+ return fd;
+
+out_put_file:
+ put_filp(file);
+out_put_fd:
+ put_unused_fd(fd);
+ return ret;
+}
+
+static int kevent_ctl_process(struct file *file, unsigned int cmd, unsigned int num, void __user *arg)
+{
+ int err;
+ struct kevent_user *u = file->private_data;
+
+ if (!u)
+ return -EINVAL;
+
+ switch (cmd) {
+ case KEVENT_CTL_ADD:
+ err = kevent_user_ctl_add(u, num, arg);
+ break;
+ case KEVENT_CTL_REMOVE:
+ err = kevent_user_ctl_remove(u, num, arg);
+ break;
+ case KEVENT_CTL_MODIFY:
+ err = kevent_user_ctl_modify(u, num, arg);
+ break;
+ default:
+ err = -EINVAL;
+ break;
+ }
+
+ return err;
+}
+
+asmlinkage long sys_kevent_get_events(int ctl_fd, unsigned int min_nr, unsigned int max_nr,
+ unsigned int timeout, void __user *buf, unsigned flags)
+{
+ int err = -EINVAL;
+ struct file *file;
+ struct kevent_user *u;
+
+ file = fget(ctl_fd);
+ if (!file)
+ return -ENODEV;
+
+ if (file->f_op != &kevent_user_fops)
+ goto out_fput;
+ u = file->private_data;
+
+ err = kevent_user_wait(file, u, min_nr, max_nr, timeout, buf);
+out_fput:
+ fput(file);
+ return err;
+}
+
+asmlinkage long sys_kevent_ctl(int fd, unsigned int cmd, unsigned int num, void __user *arg)
+{
+ int err = -EINVAL;
+ struct file *file;
+
+ if (cmd == KEVENT_CTL_INIT)
+ return kevent_ctl_init();
+
+ file = fget(fd);
+ if (!file)
+ return -ENODEV;
+
+ if (file->f_op != &kevent_user_fops)
+ goto out_fput;
+
+ err = kevent_ctl_process(file, cmd, num, arg);
+
+out_fput:
+ fput(file);
+ return err;
+}
+
+static int __devinit kevent_user_init(void)
+{
+ int err = 0;
+
+ err = register_filesystem(&kevent_fs_type);
+ if (err)
+ panic("%s: failed to register filesystem: err=%d.\n",
+ kevent_name, err);
+
+ kevent_mnt = kern_mount(&kevent_fs_type);
+ if (IS_ERR(kevent_mnt))
+ panic("%s: failed to mount filesystem: err=%ld.\n",
+ kevent_name, PTR_ERR(kevent_mnt));
+
+ err = misc_register(&kevent_miscdev);
+ if (err) {
+ printk(KERN_ERR "Failed to register kevent miscdev: err=%d.\n", err);
+ goto err_out_exit;
+ }
+
+ printk("KEVENT subsystem has been successfully registered.\n");
+
+ return 0;
+
+err_out_exit:
+ mntput(kevent_mnt);
+ unregister_filesystem(&kevent_fs_type);
+
+ return err;
+}
+
+static void __devexit kevent_user_fini(void)
+{
+ misc_deregister(&kevent_miscdev);
+ mntput(kevent_mnt);
+ unregister_filesystem(&kevent_fs_type);
+}
+
+module_init(kevent_user_init);
+module_exit(kevent_user_fini);
diff --git a/kernel/sys_ni.c b/kernel/sys_ni.c
index 6991bec..8843cca 100644
--- a/kernel/sys_ni.c
+++ b/kernel/sys_ni.c
@@ -122,6 +122,11 @@ cond_syscall(ppc_rtas);
cond_syscall(sys_spu_run);
cond_syscall(sys_spu_create);

+cond_syscall(sys_aio_recv);
+cond_syscall(sys_aio_send);
+cond_syscall(sys_kevent_get_events);
+cond_syscall(sys_kevent_ctl);
+
/* mmu depending weak syscall entries */
cond_syscall(sys_mprotect);
cond_syscall(sys_msync);

2006-08-09 07:59:08

by David Miller

[permalink] [raw]
Subject: Re: [take6 0/3] kevent: Generic event handling mechanism.

From: Evgeniy Polyakov <[email protected]>
Date: Wed, 9 Aug 2006 12:02:39 +0400

Evgeniy, it's things like the following that make it very draining
mentally to review your work.

> * removed AIO stuff from patchset

You didn't really do this, you left the aio_* syscalls and stubs in
there, and you also left things like tcp_async_send() in there.

All the foo_naio_*() stuff is still in there too.

Please remove all of the async business we've asked you to.

Thanks.

2006-08-09 08:09:13

by Evgeniy Polyakov

[permalink] [raw]
Subject: Re: [take6 0/3] kevent: Generic event handling mechanism.

On Wed, Aug 09, 2006 at 12:58:56AM -0700, David Miller ([email protected]) wrote:
> From: Evgeniy Polyakov <[email protected]>
> Date: Wed, 9 Aug 2006 12:02:39 +0400
>
> Evgeniy, it's things like the following that make it very draining
> mentally to review your work.
>
> > * removed AIO stuff from patchset
>
> You didn't really do this, you left the aio_* syscalls and stubs in
> there, and you also left things like tcp_async_send() in there.

By AIO I meant VFS AIO, not the network stuff; exactly that part was frowned
upon in reviews.

> All the foo_naio_*() stuff is still in there too.
>
> Please remove all of the async business we've asked you to.

So you want to review only the kevent core at first and postpone network
AIO and the rest of the implementation until the core is correct.
Should I remove poll/timer notifications too?

--
Evgeniy Polyakov

2006-08-09 08:20:58

by David Miller

[permalink] [raw]
Subject: Re: [take6 0/3] kevent: Generic event handling mechanism.

From: Evgeniy Polyakov <[email protected]>
Date: Wed, 9 Aug 2006 12:07:57 +0400

> So you want to review only the kevent core at first and postpone network
> AIO and the rest of the implementation until the core is correct.

That's the idea

> Should I remove poll/timer notifications too?

That can stay

2006-08-09 08:25:13

by Evgeniy Polyakov

[permalink] [raw]
Subject: Re: [take6 0/3] kevent: Generic event handling mechanism.

On Wed, Aug 09, 2006 at 01:20:45AM -0700, David Miller ([email protected]) wrote:
> From: Evgeniy Polyakov <[email protected]>
> Date: Wed, 9 Aug 2006 12:07:57 +0400
>
> > So you want to review only the kevent core at first and postpone network
> > AIO and the rest of the implementation until the core is correct.
>
> That's the idea
>
> > Should I remove poll/timer notifications too?
>
> That can stay


Ok, I will regenerate the latest patchset completely without AIO stuff
(both network and VFS) and resend it soon.
Thank you.

--
Evgeniy Polyakov

2006-08-09 17:48:24

by Stephen Hemminger

[permalink] [raw]
Subject: Re: [take6 1/3] kevent: Core files.

On Wed, 9 Aug 2006 12:02:40 +0400
Evgeniy Polyakov <[email protected]> wrote:

>
> Core files.
>
> This patch includes core kevent files:
> - userspace controlling
> - kernelspace interfaces
> - initialization
> - notification state machines
>
> It might also include parts from other subsystems (like network related
> syscalls), so it is possible that it will not compile without other
> patches applied.
>
> Signed-off-by: Evgeniy Polyakov <[email protected]>
>
>
> +#ifdef CONFIG_KEVENT_USER_STAT
> +static inline void kevent_user_stat_init(struct kevent_user *u)
> +{
> + u->wait_num = u->im_num = u->total = 0;
> +}
> +static inline void kevent_user_stat_print(struct kevent_user *u)
> +{
> + pr_debug("%s: u=%p, wait=%lu, immediately=%lu, total=%lu.\n",
> + __func__, u, u->wait_num, u->im_num, u->total);
> +}
> +static inline void kevent_user_stat_increase_im(struct kevent_user *u)
> +{
> + u->im_num++;
> +}
> +static inline void kevent_user_stat_increase_wait(struct kevent_user *u)
> +{
> + u->wait_num++;
> +}
> +static inline void kevent_user_stat_increase_total(struct kevent_user *u)
> +{
> + u->total++;
> +}
>

static wrapper_functions_with_execessive_long_names(struct i_really_hate *this)
{
suck();
}

2006-08-09 19:17:57

by Evgeniy Polyakov

[permalink] [raw]
Subject: Re: [take6 1/3] kevent: Core files.

On Wed, Aug 09, 2006 at 10:47:38AM -0700, Stephen Hemminger ([email protected]) wrote:
> > +static inline void kevent_user_stat_increase_total(struct kevent_user *u)
> > +{
> > + u->total++;
> > +}
> >
>
> static wrapper_functions_with_execessive_long_names(struct i_really_hate *this)
> {
> suck();
> }

Understood...

--
Evgeniy Polyakov

2006-08-09 22:22:35

by Andrew Morton

[permalink] [raw]
Subject: Re: [take6 1/3] kevent: Core files.

On Wed, 9 Aug 2006 12:02:40 +0400
Evgeniy Polyakov <[email protected]> wrote:

>
> Core files.
>
> This patch includes core kevent files:
> - userspace controlling
> - kernelspace interfaces
> - initialization
> - notification state machines
>
> It might also include parts from other subsystems (like network related
> syscalls), so it is possible that it will not compile without other
> patches applied.

Summary:

- has serious bugs which indicate that much better testing is needed.

- All -EFOO return values need to be reviewed for appropriateness

- needs much better commenting before I can do more than a local-level review.


> --- /dev/null
> +++ b/include/linux/kevent.h
> ...
>
> +/*
> + * Poll events.
> + */
> +#define KEVENT_POLL_POLLIN 0x0001
> +#define KEVENT_POLL_POLLPRI 0x0002
> +#define KEVENT_POLL_POLLOUT 0x0004
> +#define KEVENT_POLL_POLLERR 0x0008
> +#define KEVENT_POLL_POLLHUP 0x0010
> +#define KEVENT_POLL_POLLNVAL 0x0020
> +
> +#define KEVENT_POLL_POLLRDNORM 0x0040
> +#define KEVENT_POLL_POLLRDBAND 0x0080
> +#define KEVENT_POLL_POLLWRNORM 0x0100
> +#define KEVENT_POLL_POLLWRBAND 0x0200
> +#define KEVENT_POLL_POLLMSG 0x0400
> +#define KEVENT_POLL_POLLREMOVE 0x1000

0x0800 got lost.

> +struct ukevent
> +{
> + struct kevent_id id; /* Id of this request, e.g. socket number, file descriptor and so on... */
> + __u32 type; /* Event type, e.g. KEVENT_SOCK, KEVENT_INODE, KEVENT_TIMER and so on... */
> + __u32 event; /* Event itself, e.g. SOCK_ACCEPT, INODE_CREATED, TIMER_FIRED... */
> + __u32 req_flags; /* Per-event request flags */
> + __u32 ret_flags; /* Per-event return flags */
> + __u32 ret_data[2]; /* Event return data. Event originator fills it with anything it likes. */
> + union {
> + __u32 user[2]; /* User's data. It is not used, just copied to/from user. */
> + void *ptr;
> + };
> +};

What is this union for?

`ptr' needs a __user tag, does it not?

`ptr' will be 64-bit in-kernel and 64-bit for 64-bit userspace, but 32-bit
for 32-bit userspace. I guess that's why user[] is there.

On big-endian machines, this pointer will appear to be word-swapped as far
as a 64-bit kernel is concerned. Or something.

IOW: What's going on here??
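
One way this kind of thing is usually resolved (a sketch, not something the
patch does) is to drop the pointer from the ABI entirely and carry an opaque
64-bit cookie, so the layout is identical for 32-bit and 64-bit userspace.
The field name below is only illustrative; note that a bare __u64 is 4-byte
aligned on i386, so aligned_u64 or explicit padding would be needed to keep a
64-bit kernel and 32-bit userspace agreeing on the layout:

	/* hypothetical replacement for the union above */
	__u32 ret_data[2];	/* Event return data. */
	__u64 user;		/* Opaque cookie, copied to/from userspace;
				 * an application may store a pointer in it. */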

> +#ifdef CONFIG_KEVENT_INODE
> +void kevent_inode_notify(struct inode *inode, u32 event);
> +void kevent_inode_notify_parent(struct dentry *dentry, u32 event);
> +void kevent_inode_remove(struct inode *inode);
> +#else
> +static inline void kevent_inode_notify(struct inode *inode, u32 event)
> +{
> +}
> +static inline void kevent_inode_notify_parent(struct dentry *dentry, u32 event)
> +{
> +}
> +static inline void kevent_inode_remove(struct inode *inode)
> +{
> +}
> +#endif /* CONFIG_KEVENT_INODE */
> +#ifdef CONFIG_KEVENT_SOCKET
> +#ifdef CONFIG_LOCKDEP
> +void kevent_socket_reinit(struct socket *sock);
> +void kevent_sk_reinit(struct sock *sk);
> +#else
> +static inline void kevent_socket_reinit(struct socket *sock)
> +{
> +}
> +static inline void kevent_sk_reinit(struct sock *sk)
> +{
> +}
> +#endif
> +void kevent_socket_notify(struct sock *sock, u32 event);
> +int kevent_socket_dequeue(struct kevent *k);
> +int kevent_socket_enqueue(struct kevent *k);
> +#define sock_async(__sk) sock_flag(__sk, SOCK_ASYNC)

Is this header the correct place to be implementing sock_async()?

> --- /dev/null
> +++ b/kernel/kevent/Kconfig
> @@ -0,0 +1,50 @@
> +config KEVENT
> + bool "Kernel event notification mechanism"
> + help
> + This option enables the event queue mechanism.
> + It can be used as a replacement for poll()/select(), AIO callback invocations,
> + advanced timer notifications and other kernel object status changes.

Please squeeze all the help text into 80 columns. Or at least check that
it looks OK in menuconfig in an 80-col xterm.

> --- /dev/null
> +++ b/kernel/kevent/kevent.c
> @@ -0,0 +1,238 @@
> +/*
> + * kevent.c
> + *
> + * 2006 Copyright (c) Evgeniy Polyakov <[email protected]>
> + * All rights reserved.
> + *
> + * This program is free software; you can redistribute it and/or modify
> + * it under the terms of the GNU General Public License as published by
> + * the Free Software Foundation; either version 2 of the License, or
> + * (at your option) any later version.
> + *
> + * This program is distributed in the hope that it will be useful,
> + * but WITHOUT ANY WARRANTY; without even the implied warranty of
> + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
> + * GNU General Public License for more details.
> + *
> + * You should have received a copy of the GNU General Public License
> + * along with this program; if not, write to the Free Software
> + * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA
> + */
> +
> +#include <linux/kernel.h>
> +#include <linux/types.h>
> +#include <linux/list.h>
> +#include <linux/slab.h>
> +#include <linux/spinlock.h>
> +#include <linux/mempool.h>
> +#include <linux/sched.h>
> +#include <linux/wait.h>
> +#include <linux/kevent.h>
> +
> +kmem_cache_t *kevent_cache;
> +
> +/*
> + * Attempts to add an event into appropriate origin's queue.
> + * Returns positive value if this event is ready immediately,
> + * negative value in case of error and zero if event has been queued.
> + * ->enqueue() callback must increase origin's reference counter.
> + */
> +int kevent_enqueue(struct kevent *k)
> +{
> + if (k->event.type >= KEVENT_MAX)
> + return -E2BIG;

E2BIG is "Argument list too long". EINVAL is appropriate here.

> + if (!k->callbacks.enqueue) {
> + kevent_break(k);
> + return -EINVAL;
> + }
> +
> + return k->callbacks.enqueue(k);
> +}
> +
> +/*
> + * Remove event from the appropriate queue.
> + * ->dequeue() callback must decrease origin's reference counter.
> + */
> +int kevent_dequeue(struct kevent *k)
> +{
> + if (k->event.type >= KEVENT_MAX)
> + return -E2BIG;
> +
> + if (!k->callbacks.dequeue) {
> + kevent_break(k);
> + return -EINVAL;
> + }
> +
> + return k->callbacks.dequeue(k);
> +}
> +
> +int kevent_break(struct kevent *k)
> +{
> + unsigned long flags;
> +
> + spin_lock_irqsave(&k->ulock, flags);
> + k->event.ret_flags |= KEVENT_RET_BROKEN;
> + spin_unlock_irqrestore(&k->ulock, flags);
> + return 0;
> +}
> +
> +struct kevent_callbacks kevent_registered_callbacks[KEVENT_MAX];
> +
> +/*
> + * Must be called before event is going to be added into some origin's queue.
> + * Initializes ->enqueue(), ->dequeue() and ->callback() callbacks.
> + * If it fails, the kevent should not be used; kevent_enqueue() will fail to add
> + * this kevent into the origin's queue and will set the
> + * KEVENT_RET_BROKEN flag in kevent->event.ret_flags.
> + */
> +int kevent_init(struct kevent *k)
> +{
> + spin_lock_init(&k->ulock);
> + k->kevent_entry.next = LIST_POISON1;
> + k->storage_entry.prev = LIST_POISON2;
> + k->ready_entry.next = LIST_POISON1;

Nope ;)

> + if (k->event.type >= KEVENT_MAX)
> + return -E2BIG;
> +
> + k->callbacks = kevent_registered_callbacks[k->event.type];
> + if (!k->callbacks.callback) {
> + kevent_break(k);
> + return -EINVAL;
> + }
> +
> + return 0;
> +}
> +
> +/*
> + * Called from ->enqueue() callback when reference counter for given
> + * origin (socket, inode...) has been increased.
> + */
> +int kevent_storage_enqueue(struct kevent_storage *st, struct kevent *k)
> +{
> + unsigned long flags;
> +
> + k->st = st;
> + spin_lock_irqsave(&st->lock, flags);
> + list_add_tail_rcu(&k->storage_entry, &st->list);
> + st->qlen++;
> + spin_unlock_irqrestore(&st->lock, flags);
> + return 0;
> +}

Is the _rcu variant needed here?

> +/*
> + * Dequeue kevent from origin's queue.
> + * It does not decrease origin's reference counter in any way
> + * and must be called before it, so storage itself must be valid.
> + * It is called from ->dequeue() callback.
> + */
> +void kevent_storage_dequeue(struct kevent_storage *st, struct kevent *k)
> +{
> + unsigned long flags;
> +
> + spin_lock_irqsave(&st->lock, flags);
> + if (k->storage_entry.prev != LIST_POISON2) {

Nope, as discussed earlier.

> + list_del_rcu(&k->storage_entry);
> + st->qlen--;
> + }
> + spin_unlock_irqrestore(&st->lock, flags);
> +}
> +
> +static void __kevent_requeue(struct kevent *k, u32 event)
> +{
> + int err, rem = 0;
> + unsigned long flags;
> +
> + err = k->callbacks.callback(k);
> +
> + spin_lock_irqsave(&k->ulock, flags);
> + if (err > 0) {
> + k->event.ret_flags |= KEVENT_RET_DONE;
> + } else if (err < 0) {
> + k->event.ret_flags |= KEVENT_RET_BROKEN;
> + k->event.ret_flags |= KEVENT_RET_DONE;
> + }
> + rem = (k->event.req_flags & KEVENT_REQ_ONESHOT);
> + if (!err)
> + err = (k->event.ret_flags & (KEVENT_RET_BROKEN|KEVENT_RET_DONE));
> + spin_unlock_irqrestore(&k->ulock, flags);

Local variable `err' no longer actually indicates an error, does it?

If not, a differently-named local would be appropriate here.

> + if (err) {
> + if ((rem || err < 0) && k->storage_entry.prev != LIST_POISON2) {
> + list_del_rcu(&k->storage_entry);
> + k->st->qlen--;

->qlen was previously modified under spinlock. Here it is not.

> + }
> +
> + spin_lock_irqsave(&k->user->ready_lock, flags);
> + if (k->ready_entry.next == LIST_POISON1) {
> + kevent_user_ring_add_event(k);
> + list_add_tail(&k->ready_entry, &k->user->ready_list);
> + k->user->ready_num++;
> + }
> + spin_unlock_irqrestore(&k->user->ready_lock, flags);
> + wake_up(&k->user->wait);
> + }
> +}
> +
> +void kevent_requeue(struct kevent *k)
> +{
> + unsigned long flags;
> +
> + spin_lock_irqsave(&k->st->lock, flags);
> + __kevent_requeue(k, 0);
> + spin_unlock_irqrestore(&k->st->lock, flags);
> +}
> +
> +/*
> + * Called each time some activity in origin (socket, inode...) is noticed.
> + */
> +void kevent_storage_ready(struct kevent_storage *st,
> + kevent_callback_t ready_callback, u32 event)
> +{
> + struct kevent *k;
> +
> + rcu_read_lock();
> + list_for_each_entry_rcu(k, &st->list, storage_entry) {
> + if (ready_callback)
> + ready_callback(k);

For readability reasons I prefer the old-style

(*ready_callback)(k);

so the reader knows not to go off hunting for the function "ready_callback".
Minor point.

So the kevent_callback_t handlers are not allowed to sleep.

> + if (event & k->event.event)
> + __kevent_requeue(k, event);
> + }

Under what circumstances will `event' be zero??

> + rcu_read_unlock();
> +}
> +
> +int kevent_storage_init(void *origin, struct kevent_storage *st)
> +{
> + spin_lock_init(&st->lock);
> + st->origin = origin;
> + st->qlen = 0;
> + INIT_LIST_HEAD(&st->list);
> + return 0;
> +}
> +
> +void kevent_storage_fini(struct kevent_storage *st)
> +{
> + kevent_storage_ready(st, kevent_break, KEVENT_MASK_ALL);
> +}
> +
> +static int __init kevent_sys_init(void)
> +{
> + int i;
> +
> + kevent_cache = kmem_cache_create("kevent_cache",
> + sizeof(struct kevent), 0, 0, NULL, NULL);
> + if (!kevent_cache)
> + panic("kevent: Unable to create a cache.\n");

Can use SLAB_PANIC (a silly thing I added to avoid code duplication).
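
A sketch of what that would look like here (the explicit check and panic()
call then go away):

	kevent_cache = kmem_cache_create("kevent_cache",
			sizeof(struct kevent), 0, SLAB_PANIC, NULL, NULL);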

> + for (i=0; i<ARRAY_SIZE(kevent_registered_callbacks); ++i) {
> + struct kevent_callbacks *c = &kevent_registered_callbacks[i];
> +
> + c->callback = c->enqueue = c->dequeue = NULL;
> + }

This zeroing is redundant.

> + return 0;
> +}
> +
> +late_initcall(kevent_sys_init);

Why is it late_initcall? (A comment is needed)

> diff --git a/kernel/kevent/kevent_user.c b/kernel/kevent/kevent_user.c
> new file mode 100644
> index 0000000..7b6374b
> --- /dev/null
> +++ b/kernel/kevent/kevent_user.c
> @@ -0,0 +1,857 @@
> +/*
> + * kevent_user.c
> + *
> + * 2006 Copyright (c) Evgeniy Polyakov <[email protected]>
> + * All rights reserved.
> + *
> + * This program is free software; you can redistribute it and/or modify
> + * it under the terms of the GNU General Public License as published by
> + * the Free Software Foundation; either version 2 of the License, or
> + * (at your option) any later version.
> + *
> + * This program is distributed in the hope that it will be useful,
> + * but WITHOUT ANY WARRANTY; without even the implied warranty of
> + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
> + * GNU General Public License for more details.
> + *
> + * You should have received a copy of the GNU General Public License
> + * along with this program; if not, write to the Free Software
> + * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA
> + */
> +
> +#include <linux/kernel.h>
> +#include <linux/module.h>
> +#include <linux/types.h>
> +#include <linux/list.h>
> +#include <linux/slab.h>
> +#include <linux/spinlock.h>
> +#include <linux/fs.h>
> +#include <linux/file.h>
> +#include <linux/mount.h>
> +#include <linux/device.h>
> +#include <linux/poll.h>
> +#include <linux/kevent.h>
> +#include <linux/jhash.h>
> +#include <linux/miscdevice.h>
> +#include <asm/io.h>
> +
> +static char kevent_name[] = "kevent";
> +
> +static int kevent_user_open(struct inode *, struct file *);
> +static int kevent_user_release(struct inode *, struct file *);
> +static unsigned int kevent_user_poll(struct file *, struct poll_table_struct *);
> +static int kevnet_user_mmap(struct file *, struct vm_area_struct *);
> +
> +static struct file_operations kevent_user_fops = {
> + .mmap = kevnet_user_mmap,
> + .open = kevent_user_open,
> + .release = kevent_user_release,
> + .poll = kevent_user_poll,
> + .owner = THIS_MODULE,
> +};
> +
> +static struct miscdevice kevent_miscdev = {
> + .minor = MISC_DYNAMIC_MINOR,
> + .name = kevent_name,
> + .fops = &kevent_user_fops,
> +};
> +
> +static int kevent_get_sb(struct file_system_type *fs_type,
> + int flags, const char *dev_name, void *data, struct vfsmount *mnt)
> +{
> + /* So original magic... */
> + return get_sb_pseudo(fs_type, kevent_name, NULL, 0xabcdef, mnt);
> +}

That doesn't look like a well-chosen magic number...
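
A named constant would read better than the bare literal; the value below is
only illustrative ("Kevt" in ASCII):

	#define KEVENT_MAGIC	0x4b657674
	...
	return get_sb_pseudo(fs_type, kevent_name, NULL, KEVENT_MAGIC, mnt);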

> +static struct file_system_type kevent_fs_type = {
> + .name = kevent_name,
> + .get_sb = kevent_get_sb,
> + .kill_sb = kill_anon_super,
> +};
> +
> +static struct vfsmount *kevent_mnt;
> +
> +static unsigned int kevent_user_poll(struct file *file, struct poll_table_struct *wait)
> +{
> + struct kevent_user *u = file->private_data;
> + unsigned int mask;
> +
> + poll_wait(file, &u->wait, wait);
> + mask = 0;
> +
> + if (u->ready_num)
> + mask |= POLLIN | POLLRDNORM;
> +
> + return mask;
> +}
> +
> +static inline void kevent_user_ring_set(struct kevent_user *u, unsigned int num)
> +{
> + unsigned int *idx;
> +
> + idx = (unsigned int *)u->pring[0];

This is a bit ugly.


> + idx[0] = num;
> +}
> +
> +/*
> + * Note that kevents do not exactly fill the page (each ukevent is 40 bytes),
> + * so we reuse 4 bytes at the beginning of the first page to store the index.
> + * Take that into account if you want to change the size of struct ukevent.
> + */
> +#define KEVENTS_ON_PAGE (PAGE_SIZE/sizeof(struct ukevent))

How about doing

struct ukevent_ring {
	unsigned int index;
	struct ukevent events[0];
};

and removing all those nasty typecasting and offsetting games?

In fact you can even do

struct ukevent_ring {
	struct ukevent events[(PAGE_SIZE - sizeof(unsigned int)) /
			      sizeof(struct ukevent)];
	unsigned int index;
};

if you're careful ;)
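
With such a structure the first-page accessors reduce to ordinary member
accesses; a sketch (later pages would still be plain arrays of struct
ukevent):

	struct ukevent_ring {
		unsigned int index;
		struct ukevent events[0];
	};

	static inline void kevent_user_ring_set(struct kevent_user *u, unsigned int num)
	{
		struct ukevent_ring *ring = (struct ukevent_ring *)u->pring[0];

		ring->index = num;
	}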

> +/*
> + * Called under kevent_user->ready_lock, so updates are always protected.
> + */
> +void kevent_user_ring_add_event(struct kevent *k)
> +{
> + unsigned int *idx_ptr, idx, pidx, off;
> + struct ukevent *ukev;
> +
> + idx_ptr = (unsigned int *)k->user->pring[0];
> + idx = idx_ptr[0];
> +
> + pidx = idx/KEVENTS_ON_PAGE;
> + off = idx%KEVENTS_ON_PAGE;
> +
> + if (pidx == 0)
> + ukev = (struct ukevent *)(k->user->pring[pidx] + sizeof(unsigned int));
> + else
> + ukev = (struct ukevent *)(k->user->pring[pidx]);

Such as these.

> + memcpy(&ukev[off], &k->event, sizeof(struct ukevent));
> +
> + idx++;
> + if (idx >= KEVENT_MAX_EVENTS)
> + idx = 0;
> +
> + idx_ptr[0] = idx;
> +}
> +
> +static int kevent_user_ring_init(struct kevent_user *u)
> +{
> + int i, pnum;
> +
> + pnum = ALIGN(KEVENT_MAX_EVENTS*sizeof(struct ukevent) + sizeof(unsigned int), PAGE_SIZE)/PAGE_SIZE;

And these.

> + u->pring = kmalloc(pnum * sizeof(unsigned long), GFP_KERNEL);
> + if (!u->pring)
> + return -ENOMEM;
> +
> + for (i=0; i<pnum; ++i) {
> + u->pring[i] = __get_free_page(GFP_KERNEL);
> + if (!u->pring)

bug: this is testing the wrong thing.

> + break;
> + }
> +
> + if (i != pnum) {
> + pnum = i;
> + goto err_out_free;
> + }

Move this logic into the `if (!u->pring)' logic, above.
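
Folding the two together, the allocation loop could look like this (a sketch):

	for (i = 0; i < pnum; ++i) {
		u->pring[i] = __get_free_page(GFP_KERNEL);
		if (!u->pring[i]) {
			pnum = i;
			goto err_out_free;
		}
	}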

> + kevent_user_ring_set(u, 0);
> +
> + return 0;
> +
> +err_out_free:
> + for (i=0; i<pnum; ++i)
> + free_page(u->pring[i]);
> +
> + kfree(u->pring);
> +
> + return -ENOMEM;
> +}
> +
> +static void kevent_user_ring_fini(struct kevent_user *u)
> +{
> + int i, pnum;
> +
> + pnum = ALIGN(KEVENT_MAX_EVENTS*sizeof(struct ukevent) + sizeof(unsigned int), PAGE_SIZE)/PAGE_SIZE;
> +
> + for (i=0; i<pnum; ++i)
> + free_page(u->pring[i]);
> +
> + kfree(u->pring);
> +}
> +
> +static struct kevent_user *kevent_user_alloc(void)
> +{
> + struct kevent_user *u;
> + int i;
> +
> + u = kzalloc(sizeof(struct kevent_user), GFP_KERNEL);
> + if (!u)
> + return NULL;
> +
> + INIT_LIST_HEAD(&u->ready_list);
> + spin_lock_init(&u->ready_lock);
> + u->ready_num = 0;
> + kevent_user_stat_init(u);
> + spin_lock_init(&u->kevent_lock);
> + for (i=0; i<ARRAY_SIZE(u->kevent_list); ++i)
> + INIT_LIST_HEAD(&u->kevent_list[i]);
> + u->kevent_num = 0;
> +
> + mutex_init(&u->ctl_mutex);
> + init_waitqueue_head(&u->wait);
> + u->max_ready_num = 0;

The above zeroes out a bunch of known-to-already-be-zero things.

> +static int kevnet_user_mmap(struct file *file, struct vm_area_struct *vma)

The function name is mistyped.

This code doesn't have many comments, does it? What are we mapping here,
and why would an application want to map it?

And what are the portability implications? Does userspace need to know the
64-bitness of its kernel to be able to work out the alignment of things?
If so, what happens if a later/different compiler aligns things
differently?

> +{
> + size_t size = vma->vm_end - vma->vm_start, psize;
> + int pnum = size/PAGE_SIZE, i;
> + unsigned long start = vma->vm_start;
> + struct kevent_user *u = file->private_data;
> +
> + psize = ALIGN(KEVENT_MAX_EVENTS*sizeof(struct ukevent) + sizeof(unsigned int), PAGE_SIZE);
> +
> + if (size + vma->vm_pgoff*PAGE_SIZE != psize)
> + return -EINVAL;
> +
> + if (vma->vm_flags & VM_WRITE)
> + return -EPERM;
> +
> + vma->vm_page_prot = pgprot_noncached(vma->vm_page_prot);
> +
> + for (i=0; i<pnum; ++i) {
> + if (remap_pfn_range(vma, start, virt_to_phys((void *)u->pring[i+vma->vm_pgoff]), PAGE_SIZE,
> + vma->vm_page_prot))
> + return -EAGAIN;
> + start += PAGE_SIZE;
> + }
> +
> + return 0;
> +}

Is EAGAIN an appropriate return value?

If this function had a decent comment we could ask Hugh to review it.
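
Something along these lines, based on the ring code earlier in the patch,
would answer most of the questions above:

	/*
	 * Maps the ring of struct ukevent entries read-only into userspace
	 * so that ready events can be read without a syscall.  The first
	 * page starts with a 4-byte index, see KEVENTS_ON_PAGE.
	 */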

> +#if 0
> +static inline unsigned int kevent_user_hash(struct ukevent *uk)
> +{
> + unsigned int h = (uk->user[0] ^ uk->user[1]) ^ (uk->id.raw[0] ^ uk->id.raw[1]);
> +
> + h = (((h >> 16) & 0xffff) ^ (h & 0xffff)) & 0xffff;
> + h = (((h >> 8) & 0xff) ^ (h & 0xff)) & KEVENT_HASH_MASK;
> +
> + return h;
> +}
> +#else
> +static inline unsigned int kevent_user_hash(struct ukevent *uk)
> +{
> + return jhash_1word(uk->id.raw[0], 0) & KEVENT_HASH_MASK;
> +}
> +#endif
> +
> +static void kevent_free_rcu(struct rcu_head *rcu)
> +{
> + struct kevent *kevent = container_of(rcu, struct kevent, rcu_head);
> + kmem_cache_free(kevent_cache, kevent);
> +}
> +
> +static void kevent_finish_user_complete(struct kevent *k, int deq)
> +{
> + struct kevent_user *u = k->user;
> + unsigned long flags;
> +
> + if (deq)
> + kevent_dequeue(k);
> +
> + spin_lock_irqsave(&u->ready_lock, flags);
> + if (k->ready_entry.next != LIST_POISON1) {
> + list_del(&k->ready_entry);

list_del_rcu()?

> + u->ready_num--;
> + }
> + spin_unlock_irqrestore(&u->ready_lock, flags);
> +
> + kevent_user_put(u);
> + call_rcu(&k->rcu_head, kevent_free_rcu);
> +}
> +
> +static void __kevent_finish_user(struct kevent *k, int deq)
> +{
> + struct kevent_user *u = k->user;
> +
> + list_del(&k->kevent_entry);
> + u->kevent_num--;
> + kevent_finish_user_complete(k, deq);
> +}

No locking needed?

It's hard to review uncommented code. And the review is less useful if the
reviewer cannot determine what the developer was attempting to do.

> +/*
> + * Remove kevent from user's list of all events,
> + * dequeue it from storage and decrease user's reference counter,
> + * since this kevent does not exist anymore. That is why it is freed here.
> + */

That's nice.

> +static void kevent_finish_user(struct kevent *k, int deq)
> +{
> + struct kevent_user *u = k->user;
> + unsigned long flags;
> +
> + spin_lock_irqsave(&u->kevent_lock, flags);
> + list_del(&k->kevent_entry);

list_del_rcu()?

> + u->kevent_num--;
> + spin_unlock_irqrestore(&u->kevent_lock, flags);
> + kevent_finish_user_complete(k, deq);
> +}
> +
> +/*
> + * Dequeue one entry from user's ready queue.
> + */
> +
> +static struct kevent *kqueue_dequeue_ready(struct kevent_user *u)
> +{
> + unsigned long flags;
> + struct kevent *k = NULL;
> +
> + spin_lock_irqsave(&u->ready_lock, flags);
> + if (u->ready_num && !list_empty(&u->ready_list)) {
> + k = list_entry(u->ready_list.next, struct kevent, ready_entry);
> + list_del(&k->ready_entry);
> + u->ready_num--;
> + }
> + spin_unlock_irqrestore(&u->ready_lock, flags);
> +
> + return k;
> +}
> +
> +static struct kevent *__kevent_search(struct list_head *head, struct ukevent *uk,
> + struct kevent_user *u)
> +{
> + struct kevent *k;
> + int found = 0;
> +
> + list_for_each_entry(k, head, kevent_entry) {
> + spin_lock(&k->ulock);
> + if (k->event.user[0] == uk->user[0] && k->event.user[1] == uk->user[1] &&
> + k->event.id.raw[0] == uk->id.raw[0] &&
> + k->event.id.raw[1] == uk->id.raw[1]) {
> + found = 1;
> + spin_unlock(&k->ulock);
> + break;
> + }
> + spin_unlock(&k->ulock);
> + }
> +
> + return (found)?k:NULL;
> +}

Remove `found', do

struct kevent *ret = NULL;

...
ret = k;
break;
...
return ret;
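
Spelled out against the function quoted above, that would be (a sketch):

	static struct kevent *__kevent_search(struct list_head *head, struct ukevent *uk,
			struct kevent_user *u)
	{
		struct kevent *k, *ret = NULL;

		list_for_each_entry(k, head, kevent_entry) {
			spin_lock(&k->ulock);
			if (k->event.user[0] == uk->user[0] &&
			    k->event.user[1] == uk->user[1] &&
			    k->event.id.raw[0] == uk->id.raw[0] &&
			    k->event.id.raw[1] == uk->id.raw[1])
				ret = k;
			spin_unlock(&k->ulock);
			if (ret)
				break;
		}

		return ret;
	}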


> +static int kevent_modify(struct ukevent *uk, struct kevent_user *u)

<wonders what this function does>

> +{
> + struct kevent *k;
> + unsigned int hash = kevent_user_hash(uk);
> + int err = -ENODEV;
> + unsigned long flags;
> +
> + spin_lock_irqsave(&u->kevent_lock, flags);
> + k = __kevent_search(&u->kevent_list[hash], uk, u);
> + if (k) {
> + spin_lock(&k->ulock);
> + k->event.event = uk->event;
> + k->event.req_flags = uk->req_flags;
> + k->event.ret_flags = 0;
> + spin_unlock(&k->ulock);
> + kevent_requeue(k);
> + err = 0;
> + }
> + spin_unlock_irqrestore(&u->kevent_lock, flags);
> +
> + return err;
> +}

ENODEV: "No such device". Doesn't sound appropriate.

> +static int kevent_remove(struct ukevent *uk, struct kevent_user *u)
> +{
> + int err = -ENODEV;
> + struct kevent *k;
> + unsigned int hash = kevent_user_hash(uk);
> + unsigned long flags;
> +
> + spin_lock_irqsave(&u->kevent_lock, flags);
> + k = __kevent_search(&u->kevent_list[hash], uk, u);
> + if (k) {
> + __kevent_finish_user(k, 1);
> + err = 0;
> + }
> + spin_unlock_irqrestore(&u->kevent_lock, flags);
> +
> + return err;
> +}
> +
> +/*
> + * No new entry can be added or removed from any list at this point.
> + * It is not permitted to call ->ioctl() and ->release() in parallel.
> + */
> +static int kevent_user_release(struct inode *inode, struct file *file)
> +{
> + struct kevent_user *u = file->private_data;
> + struct kevent *k, *n;
> + int i;
> +
> + for (i=0; i<KEVENT_HASH_MASK+1; ++i) {

ARRAY_SIZE

> + list_for_each_entry_safe(k, n, &u->kevent_list[i], kevent_entry)
> + kevent_finish_user(k, 1);
> + }
> +
> + kevent_user_put(u);
> + file->private_data = NULL;
> +
> + return 0;
> +}
> +
> +static struct ukevent *kevent_get_user(unsigned int num, void __user *arg)
> +{
> + struct ukevent *ukev;
> +
> + ukev = kmalloc(sizeof(struct ukevent) * num, GFP_KERNEL);
> + if (!ukev)
> + return NULL;
> +
> + if (copy_from_user(arg, ukev, sizeof(struct ukevent) * num)) {
> + kfree(ukev);
> + return NULL;
> + }
> +
> + return ukev;
> +}

The copy_fom_user() args are reversed.

This is serious breakage and raises concerns about the amount of testing
which has been performed.

AFAICT there is no bounds checking on `num', so the user can force a
deliberate multiplication overflow and cause havoc here.
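
A corrected sketch, with the argument order fixed and num bounded before the
multiplication (the particular limit is only an example):

	static struct ukevent *kevent_get_user(unsigned int num, void __user *arg)
	{
		struct ukevent *ukev;

		if (!num || num > KEVENT_MAX_EVENTS)
			return NULL;

		ukev = kmalloc(sizeof(struct ukevent) * num, GFP_KERNEL);
		if (!ukev)
			return NULL;

		if (copy_from_user(ukev, arg, sizeof(struct ukevent) * num)) {
			kfree(ukev);
			return NULL;
		}

		return ukev;
	}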

> +static int kevent_user_ctl_modify(struct kevent_user *u, unsigned int num, void __user *arg)
> +{
> + int err = 0, i;
> + struct ukevent uk;
> +
> + mutex_lock(&u->ctl_mutex);
> +
> + if (num > KEVENT_MIN_BUFFS_ALLOC) {
> + struct ukevent *ukev;
> +
> + ukev = kevent_get_user(num, arg);
> + if (ukev) {
> + for (i=0; i<num; ++i) {
> + if (kevent_modify(&ukev[i], u))
> + ukev[i].ret_flags |= KEVENT_RET_BROKEN;
> + ukev[i].ret_flags |= KEVENT_RET_DONE;
> + }
> + if (copy_to_user(arg, ukev, num*sizeof(struct ukevent)))
> + err = -EINVAL;

EFAULT

> + kfree(ukev);
> + goto out;
> + }
> + }
> +
> + for (i=0; i<num; ++i) {
> + if (copy_from_user(&uk, arg, sizeof(struct ukevent))) {
> + err = -EINVAL;

EFAULT

> + break;
> + }
> +
> + if (kevent_modify(&uk, u))
> + uk.ret_flags |= KEVENT_RET_BROKEN;
> + uk.ret_flags |= KEVENT_RET_DONE;
> +
> + if (copy_to_user(arg, &uk, sizeof(struct ukevent))) {
> + err = -EINVAL;

EFAULT.

> + if (copy_from_user(&uk, arg, sizeof(struct ukevent))) {
> + err = -EINVAL;

EFAULT (all over the place).

> +static void kevent_user_enqueue(struct kevent_user *u, struct kevent *k)
> +{
> + unsigned long flags;
> + unsigned int hash = kevent_user_hash(&k->event);
> +
> + spin_lock_irqsave(&u->kevent_lock, flags);
> + list_add_tail(&k->kevent_entry, &u->kevent_list[hash]);
> + u->kevent_num++;
> + kevent_user_get(u);
> + spin_unlock_irqrestore(&u->kevent_lock, flags);
> +}

kevent_user_get() can be moved outside the lock?

> +/*
> + * Copy all ukevents from userspace, allocate a kevent for each one
> + * and add them into the appropriate kevent_storages,
> + * e.g. sockets, inodes and so on...
> + * If something goes wrong, all events will be dequeued and
> + * a negative error will be returned.
> + * On success the number of finished events is returned and
> + * an array of finished events (struct ukevent) will be placed behind
> + * the kevent_user_control structure. The user must run through that array and check
> + * the ret_flags field of each ukevent structure to determine whether it is a fired or failed event.
> + */
> +static int kevent_user_ctl_add(struct kevent_user *u, unsigned int num, void __user *arg)
> +{
> + int err, cerr = 0, knum = 0, rnum = 0, i;
> + void __user *orig = arg;
> + struct ukevent uk;
> +
> + mutex_lock(&u->ctl_mutex);
> +
> + err = -ENFILE;
> + if (u->kevent_num + num >= KEVENT_MAX_EVENTS)

Can a malicious user force an arithmetic overflow here?

> + goto out_remove;
> +
> + if (num > KEVENT_MIN_BUFFS_ALLOC) {
> + struct ukevent *ukev;
> +
> + ukev = kevent_get_user(num, arg);
> + if (ukev) {
> + for (i=0; i<num; ++i) {
> + err = kevent_user_add_ukevent(&ukev[i], u);
> + if (err) {
> + kevent_user_stat_increase_im(u);
> + if (i != rnum)
> + memcpy(&ukev[rnum], &ukev[i], sizeof(struct ukevent));
> + rnum++;

What's happening here? The games with `rnum' and comparing it with `i'??

Perhaps these are not the best-chosen identifiers..

> +/*
> + * In nonblocking mode it returns as many events as possible, but not more than @max_nr.
> + * In blocking mode it waits until timeout or if at least @min_nr events are ready.
> + */
> +static int kevent_user_wait(struct file *file, struct kevent_user *u,
> + unsigned int min_nr, unsigned int max_nr, unsigned int timeout,
> + void __user *buf)
> +{
> + struct kevent *k;
> + int cerr = 0, num = 0;
> +
> + if (!(file->f_flags & O_NONBLOCK)) {
> + wait_event_interruptible_timeout(u->wait,
> + u->ready_num >= min_nr, msecs_to_jiffies(timeout));
> + }
> +
> + while (num < max_nr && ((k = kqueue_dequeue_ready(u)) != NULL)) {
> + if (copy_to_user(buf + num*sizeof(struct ukevent),
> + &k->event, sizeof(struct ukevent))) {
> + cerr = -EINVAL;
> + break;
> + }
> +
> + /*
> + * If it is one-shot kevent, it has been removed already from
> + * origin's queue, so we can easily free it here.
> + */
> + if (k->event.req_flags & KEVENT_REQ_ONESHOT)
> + kevent_finish_user(k, 1);
> + ++num;
> + kevent_user_stat_increase_wait(u);
> + }
> +
> + return (cerr)?cerr:num;
> +}

So if this returns an error, the user doesn't know how many events were
actually completed? That doesn't seem good.

> +asmlinkage long sys_kevent_ctl(int fd, unsigned int cmd, unsigned int num, void __user *arg)

At some point Michael will want to be writing the manpages for things like
this. He'll start out by reading the comment block, poor guy.

> +{
> + int err = -EINVAL;
> + struct file *file;
> +
> + if (cmd == KEVENT_CTL_INIT)
> + return kevent_ctl_init();
> +
> + file = fget(fd);
> + if (!file)
> + return -ENODEV;
> +
> + if (file->f_op != &kevent_user_fops)
> + goto out_fput;
> +
> + err = kevent_ctl_process(file, cmd, num, arg);
> +
> +out_fput:
> + fput(file);
> + return err;
> +}

2006-08-10 00:04:51

by David Miller

[permalink] [raw]
Subject: Re: [take6 1/3] kevent: Core files.

From: Stephen Hemminger <[email protected]>
Date: Wed, 9 Aug 2006 10:47:38 -0700

> static wrapper_functions_with_execessive_long_names(struct i_really_hate *this)
> {
> suck();
> }

Yes, typing 50 characters just to bump a counter, it's beyond
ridiculous.

Go hack on the X server if you like long-winded function names
that do trivial operations :-)

2006-08-10 06:15:22

by Evgeniy Polyakov

[permalink] [raw]
Subject: Re: [take6 1/3] kevent: Core files.

On Wed, Aug 09, 2006 at 03:21:27PM -0700, Andrew Morton ([email protected]) wrote:
> On Wed, 9 Aug 2006 12:02:40 +0400
> Evgeniy Polyakov <[email protected]> wrote:
>
> >
> > Core files.
> >
> > This patch includes core kevent files:
> > - userspace controlling
> > - kernelspace interfaces
> > - initialization
> > - notification state machines
> >
> > It might also include parts from other subsystems (like network related
> > syscalls, so it is possible that it will not compile without other
> > patches applied).
>
> Summary:
>
> - has serious bugs which indicate that much better testing is needed.
>
> - All -EFOO return values need to be reviewed for appropriateness
>
> - needs much better commenting before I can do more than a local-level review.
>
>
> > --- /dev/null
> > +++ b/include/linux/kevent.h
> > ...
> >
> > +/*
> > + * Poll events.
> > + */
> > +#define KEVENT_POLL_POLLIN 0x0001
> > +#define KEVENT_POLL_POLLPRI 0x0002
> > +#define KEVENT_POLL_POLLOUT 0x0004
> > +#define KEVENT_POLL_POLLERR 0x0008
> > +#define KEVENT_POLL_POLLHUP 0x0010
> > +#define KEVENT_POLL_POLLNVAL 0x0020
> > +
> > +#define KEVENT_POLL_POLLRDNORM 0x0040
> > +#define KEVENT_POLL_POLLRDBAND 0x0080
> > +#define KEVENT_POLL_POLLWRNORM 0x0100
> > +#define KEVENT_POLL_POLLWRBAND 0x0200
> > +#define KEVENT_POLL_POLLMSG 0x0400
> > +#define KEVENT_POLL_POLLREMOVE 0x1000
>
> 0x0800 got lost.

I will use usual poll definitions.

> > +struct ukevent
> > +{
> > + struct kevent_id id; /* Id of this request, e.g. socket number, file descriptor and so on... */
> > + __u32 type; /* Event type, e.g. KEVENT_SOCK, KEVENT_INODE, KEVENT_TIMER and so on... */
> > + __u32 event; /* Event itself, e.g. SOCK_ACCEPT, INODE_CREATED, TIMER_FIRED... */
> > + __u32 req_flags; /* Per-event request flags */
> > + __u32 ret_flags; /* Per-event return flags */
> > + __u32 ret_data[2]; /* Event return data. Event originator fills it with anything it likes. */
> > + union {
> > + __u32 user[2]; /* User's data. It is not used, just copied to/from user. */
> > + void *ptr;
> > + };
> > +};
>
> What is this union for?
>
> `ptr' needs a __user tag, does it not?

No, it is never touched by the kernel.

> `ptr' will be 64-bit in-kernel and 64-bit for 64-bit userspace, but 32-bit
> for 32-bit userspace. I guess that's why user[] is there.

Exactly.

> On big-endian machines, this pointer will appear to be word-swapped as far
> as a 64-bit kernel is concerned. Or something.
>
> IOW: What's going on here??

It is user data - I put there a union just to simplify userspace, so it
should not require some typecasting.

> > +#ifdef CONFIG_KEVENT_INODE
> > +void kevent_inode_notify(struct inode *inode, u32 event);
> > +void kevent_inode_notify_parent(struct dentry *dentry, u32 event);
> > +void kevent_inode_remove(struct inode *inode);
> > +#else
> > +static inline void kevent_inode_notify(struct inode *inode, u32 event)
> > +{
> > +}
> > +static inline void kevent_inode_notify_parent(struct dentry *dentry, u32 event)
> > +{
> > +}
> > +static inline void kevent_inode_remove(struct inode *inode)
> > +{
> > +}
> > +#endif /* CONFIG_KEVENT_INODE */
> > +#ifdef CONFIG_KEVENT_SOCKET
> > +#ifdef CONFIG_LOCKDEP
> > +void kevent_socket_reinit(struct socket *sock);
> > +void kevent_sk_reinit(struct sock *sk);
> > +#else
> > +static inline void kevent_socket_reinit(struct socket *sock)
> > +{
> > +}
> > +static inline void kevent_sk_reinit(struct sock *sk)
> > +{
> > +}
> > +#endif
> > +void kevent_socket_notify(struct sock *sock, u32 event);
> > +int kevent_socket_dequeue(struct kevent *k);
> > +int kevent_socket_enqueue(struct kevent *k);
> > +#define sock_async(__sk) sock_flag(__sk, SOCK_ASYNC)
>
> Is this header the correct place to be implementing sock_async()?

I decided to keep kevent as separate as possible, so I put a lot
there. When people decide that it is ok, then it can be moved into the
appropriate network header file - here it is much easier to review.

> > --- /dev/null
> > +++ b/kernel/kevent/Kconfig
> > @@ -0,0 +1,50 @@
> > +config KEVENT
> > + bool "Kernel event notification mechanism"
> > + help
> > + This option enables event queue mechanism.
> > + It can be used as replacement for poll()/select(), AIO callback invocations,
> > + advanced timer notifications and other kernel object status changes.
>
> Please squeeze all the help text into 80-columns. Or at least check that
> it looks OK in menuconfig in an 80-col xterm,

Ok.

> > --- /dev/null
> > +++ b/kernel/kevent/kevent.c
> > @@ -0,0 +1,238 @@
> > +/*
> > + * kevent.c
> > + *
> > + * 2006 Copyright (c) Evgeniy Polyakov <[email protected]>
> > + * All rights reserved.
> > + *
> > + * This program is free software; you can redistribute it and/or modify
> > + * it under the terms of the GNU General Public License as published by
> > + * the Free Software Foundation; either version 2 of the License, or
> > + * (at your option) any later version.
> > + *
> > + * This program is distributed in the hope that it will be useful,
> > + * but WITHOUT ANY WARRANTY; without even the implied warranty of
> > + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
> > + * GNU General Public License for more details.
> > + *
> > + * You should have received a copy of the GNU General Public License
> > + * along with this program; if not, write to the Free Software
> > + * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA
> > + */
> > +
> > +#include <linux/kernel.h>
> > +#include <linux/types.h>
> > +#include <linux/list.h>
> > +#include <linux/slab.h>
> > +#include <linux/spinlock.h>
> > +#include <linux/mempool.h>
> > +#include <linux/sched.h>
> > +#include <linux/wait.h>
> > +#include <linux/kevent.h>
> > +
> > +kmem_cache_t *kevent_cache;
> > +
> > +/*
> > + * Attempts to add an event into appropriate origin's queue.
> > + * Returns positive value if this event is ready immediately,
> > + * negative value in case of error and zero if event has been queued.
> > + * ->enqueue() callback must increase origin's reference counter.
> > + */
> > +int kevent_enqueue(struct kevent *k)
> > +{
> > + if (k->event.type >= KEVENT_MAX)
> > + return -E2BIG;
>
> E2BIG is "Argument list too long". EINVAL is appropriate here.

No problem.

> > + if (!k->callbacks.enqueue) {
> > + kevent_break(k);
> > + return -EINVAL;
> > + }
> > +
> > + return k->callbacks.enqueue(k);
> > +}
> > +
> > +/*
> > + * Remove event from the appropriate queue.
> > + * ->dequeue() callback must decrease origin's reference counter.
> > + */
> > +int kevent_dequeue(struct kevent *k)
> > +{
> > + if (k->event.type >= KEVENT_MAX)
> > + return -E2BIG;
> > +
> > + if (!k->callbacks.dequeue) {
> > + kevent_break(k);
> > + return -EINVAL;
> > + }
> > +
> > + return k->callbacks.dequeue(k);
> > +}
> > +
> > +int kevent_break(struct kevent *k)
> > +{
> > + unsigned long flags;
> > +
> > + spin_lock_irqsave(&k->ulock, flags);
> > + k->event.ret_flags |= KEVENT_RET_BROKEN;
> > + spin_unlock_irqrestore(&k->ulock, flags);
> > + return 0;
> > +}
> > +
> > +struct kevent_callbacks kevent_registered_callbacks[KEVENT_MAX];
> > +
> > +/*
> > + * Must be called before event is going to be added into some origin's queue.
> > + * Initializes ->enqueue(), ->dequeue() and ->callback() callbacks.
> > + * If failed, kevent should not be used or kevent_enqueue() will fail to add
> > + * this kevent into origin's queue with setting
> > + * KEVENT_RET_BROKEN flag in kevent->event.ret_flags.
> > + */
> > +int kevent_init(struct kevent *k)
> > +{
> > + spin_lock_init(&k->ulock);
> > + k->kevent_entry.next = LIST_POISON1;
> > + k->storage_entry.prev = LIST_POISON2;
> > + k->ready_entry.next = LIST_POISON1;
>
> Nope ;)

I use pointer checks to determine if entry is in the list or not, why it
is frowned upon here?
Please do not say about poisoning which takes a lot of cpu cycles to get
new cachelines and so on - everything in that entry is in the cache,
since entry was added/deleted/accessed through list walk macro.

> > + if (k->event.type >= KEVENT_MAX)
> > + return -E2BIG;
> > +
> > + k->callbacks = kevent_registered_callbacks[k->event.type];
> > + if (!k->callbacks.callback) {
> > + kevent_break(k);
> > + return -EINVAL;
> > + }
> > +
> > + return 0;
> > +}
> > +
> > +/*
> > + * Called from ->enqueue() callback when reference counter for given
> > + * origin (socket, inode...) has been increased.
> > + */
> > +int kevent_storage_enqueue(struct kevent_storage *st, struct kevent *k)
> > +{
> > + unsigned long flags;
> > +
> > + k->st = st;
> > + spin_lock_irqsave(&st->lock, flags);
> > + list_add_tail_rcu(&k->storage_entry, &st->list);
> > + st->qlen++;
> > + spin_unlock_irqrestore(&st->lock, flags);
> > + return 0;
> > +}
>
> Is the _rcu variant needed here?

Yes, storage list is protected by RCU.
st->lock is used to remove race between several "writers".

> > +/*
> > + * Dequeue kevent from origin's queue.
> > + * It does not decrease origin's reference counter in any way
> > + * and must be called before it, so storage itself must be valid.
> > + * It is called from ->dequeue() callback.
> > + */
> > +void kevent_storage_dequeue(struct kevent_storage *st, struct kevent *k)
> > +{
> > + unsigned long flags;
> > +
> > + spin_lock_irqsave(&st->lock, flags);
> > + if (k->storage_entry.prev != LIST_POISON2) {
>
> Nope, as discussed earlier.

Sorry, but I do not agree that list poisoning costs anything, and I
explained why. It can be wrong from some architectural point of view,
but I can create macros and place them into list.h which will do just
the same.

The problem is how to determine if an entry is in the list or not,
regardless of kevent code.

> > + list_del_rcu(&k->storage_entry);
> > + st->qlen--;
> > + }
> > + spin_unlock_irqrestore(&st->lock, flags);
> > +}
> > +
> > +static void __kevent_requeue(struct kevent *k, u32 event)
> > +{
> > + int err, rem = 0;
> > + unsigned long flags;
> > +
> > + err = k->callbacks.callback(k);
> > +
> > + spin_lock_irqsave(&k->ulock, flags);
> > + if (err > 0) {
> > + k->event.ret_flags |= KEVENT_RET_DONE;
> > + } else if (err < 0) {
> > + k->event.ret_flags |= KEVENT_RET_BROKEN;
> > + k->event.ret_flags |= KEVENT_RET_DONE;
> > + }
> > + rem = (k->event.req_flags & KEVENT_REQ_ONESHOT);
> > + if (!err)
> > + err = (k->event.ret_flags & (KEVENT_RET_BROKEN|KEVENT_RET_DONE));
> > + spin_unlock_irqrestore(&k->ulock, flags);
>
> Local variable `err' no longer actually indicates an error, does it?
>
> If not, a differently-named local would be appropriate here.

Ok, I will rename it.

> > + if (err) {
> > + if ((rem || err < 0) && k->storage_entry.prev != LIST_POISON2) {
> > + list_del_rcu(&k->storage_entry);
> > + k->st->qlen--;
>
> ->qlen was previously modified under spinlock. Here it is not.

It is a tricky part - this part of the code can not be reentered (by
design of all storages which are used with kevents), and different CPUs
can not access the same list due to RCU rules - i.e. it is perfectly ok
if a different CPU sees an old value of the queue length.

> > + }
> > +
> > + spin_lock_irqsave(&k->user->ready_lock, flags);
> > + if (k->ready_entry.next == LIST_POISON1) {
> > + kevent_user_ring_add_event(k);
> > + list_add_tail(&k->ready_entry, &k->user->ready_list);
> > + k->user->ready_num++;
> > + }
> > + spin_unlock_irqrestore(&k->user->ready_lock, flags);
> > + wake_up(&k->user->wait);
> > + }
> > +}
> > +
> > +void kevent_requeue(struct kevent *k)
> > +{
> > + unsigned long flags;
> > +
> > + spin_lock_irqsave(&k->st->lock, flags);
> > + __kevent_requeue(k, 0);
> > + spin_unlock_irqrestore(&k->st->lock, flags);
> > +}
> > +
> > +/*
> > + * Called each time some activity in origin (socket, inode...) is noticed.
> > + */
> > +void kevent_storage_ready(struct kevent_storage *st,
> > + kevent_callback_t ready_callback, u32 event)
> > +{
> > + struct kevent *k;
> > +
> > + rcu_read_lock();
> > + list_for_each_entry_rcu(k, &st->list, storage_entry) {
> > + if (ready_callback)
> > + ready_callback(k);
>
> For readability reasons I prefer the old-style
>
> (*ready_callback)(k);
>
> so the reader knows not to go off hunting for the function "ready_callback".
> Minor point.
>
> So the kevent_callback_t handlers are not allowed to sleep.

No, since they are called from internals of state machines of the
origins - it is either softirqs (network) or hard irqs (block layer).

> > + if (event & k->event.event)
> > + __kevent_requeue(k, event);
> > + }
>
> Under what circumstances will `event' be zero??

It is a bitwise AND; requeueing happens only when the requested event
matches at least one produced event.

> > + rcu_read_unlock();
> > +}
> > +
> > +int kevent_storage_init(void *origin, struct kevent_storage *st)
> > +{
> > + spin_lock_init(&st->lock);
> > + st->origin = origin;
> > + st->qlen = 0;
> > + INIT_LIST_HEAD(&st->list);
> > + return 0;
> > +}
> > +
> > +void kevent_storage_fini(struct kevent_storage *st)
> > +{
> > + kevent_storage_ready(st, kevent_break, KEVENT_MASK_ALL);
> > +}
> > +
> > +static int __init kevent_sys_init(void)
> > +{
> > + int i;
> > +
> > + kevent_cache = kmem_cache_create("kevent_cache",
> > + sizeof(struct kevent), 0, 0, NULL, NULL);
> > + if (!kevent_cache)
> > + panic("kevent: Unable to create a cache.\n");
>
> Can use SLAB_PANIC (a silly thing I added to avoid code duplication).

Ok.
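For reference, the SLAB_PANIC variant would be roughly (a sketch, keeping
the other cache parameters as in the patch):

        kevent_cache = kmem_cache_create("kevent_cache",
                        sizeof(struct kevent), 0, SLAB_PANIC, NULL, NULL);
        /* kmem_cache_create() itself panics on failure with SLAB_PANIC,
         * so the explicit if (!kevent_cache) panic(...) check goes away. */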

> > + for (i=0; i<ARRAY_SIZE(kevent_registered_callbacks); ++i) {
> > + struct kevent_callbacks *c = &kevent_registered_callbacks[i];
> > +
> > + c->callback = c->enqueue = c->dequeue = NULL;
> > + }
>
> This zeroing is redundant.

It is not static, are you sure it will be zeroed?

> > + return 0;
> > +}
> > +
> > +late_initcall(kevent_sys_init);
>
> Why is it late_initcall? (A comment is needed)

Why not?
It can be any initcall or __init.

> > diff --git a/kernel/kevent/kevent_user.c b/kernel/kevent/kevent_user.c
> > new file mode 100644
> > index 0000000..7b6374b
> > --- /dev/null
> > +++ b/kernel/kevent/kevent_user.c
> > @@ -0,0 +1,857 @@
> > +/*
> > + * kevent_user.c
> > + *
> > + * 2006 Copyright (c) Evgeniy Polyakov <[email protected]>
> > + * All rights reserved.
> > + *
> > + * This program is free software; you can redistribute it and/or modify
> > + * it under the terms of the GNU General Public License as published by
> > + * the Free Software Foundation; either version 2 of the License, or
> > + * (at your option) any later version.
> > + *
> > + * This program is distributed in the hope that it will be useful,
> > + * but WITHOUT ANY WARRANTY; without even the implied warranty of
> > + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
> > + * GNU General Public License for more details.
> > + *
> > + * You should have received a copy of the GNU General Public License
> > + * along with this program; if not, write to the Free Software
> > + * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA
> > + */
> > +
> > +#include <linux/kernel.h>
> > +#include <linux/module.h>
> > +#include <linux/types.h>
> > +#include <linux/list.h>
> > +#include <linux/slab.h>
> > +#include <linux/spinlock.h>
> > +#include <linux/fs.h>
> > +#include <linux/file.h>
> > +#include <linux/mount.h>
> > +#include <linux/device.h>
> > +#include <linux/poll.h>
> > +#include <linux/kevent.h>
> > +#include <linux/jhash.h>
> > +#include <linux/miscdevice.h>
> > +#include <asm/io.h>
> > +
> > +static char kevent_name[] = "kevent";
> > +
> > +static int kevent_user_open(struct inode *, struct file *);
> > +static int kevent_user_release(struct inode *, struct file *);
> > +static unsigned int kevent_user_poll(struct file *, struct poll_table_struct *);
> > +static int kevnet_user_mmap(struct file *, struct vm_area_struct *);
> > +
> > +static struct file_operations kevent_user_fops = {
> > + .mmap = kevnet_user_mmap,
> > + .open = kevent_user_open,
> > + .release = kevent_user_release,
> > + .poll = kevent_user_poll,
> > + .owner = THIS_MODULE,
> > +};
> > +
> > +static struct miscdevice kevent_miscdev = {
> > + .minor = MISC_DYNAMIC_MINOR,
> > + .name = kevent_name,
> > + .fops = &kevent_user_fops,
> > +};
> > +
> > +static int kevent_get_sb(struct file_system_type *fs_type,
> > + int flags, const char *dev_name, void *data, struct vfsmount *mnt)
> > +{
> > + /* So original magic... */
> > + return get_sb_pseudo(fs_type, kevent_name, NULL, 0xabcdef, mnt);
> > +}
>
> That doesn't look like a well-chosen magic number...
>
> > +static struct file_system_type kevent_fs_type = {
> > + .name = kevent_name,
> > + .get_sb = kevent_get_sb,
> > + .kill_sb = kill_anon_super,
> > +};
> > +
> > +static struct vfsmount *kevent_mnt;
> > +
> > +static unsigned int kevent_user_poll(struct file *file, struct poll_table_struct *wait)
> > +{
> > + struct kevent_user *u = file->private_data;
> > + unsigned int mask;
> > +
> > + poll_wait(file, &u->wait, wait);
> > + mask = 0;
> > +
> > + if (u->ready_num)
> > + mask |= POLLIN | POLLRDNORM;
> > +
> > + return mask;
> > +}
> > +
> > +static inline void kevent_user_ring_set(struct kevent_user *u, unsigned int num)
> > +{
> > + unsigned int *idx;
> > +
> > + idx = (unsigned int *)u->pring[0];
>
> This is a bit ugly.

I specially use first 4 bytes in the first page to store index there,
since it must be accessed from userspace and kernelspace.

> > + idx[0] = num;
> > +}
> > +
> > +/*
> > + * Note that kevents does not exactly fill the page (each ukevent is 40 bytes),
> > + * so we reuse 4 bytes at the begining of the first page to store index.
> > + * Take that into account if you want to change size of struct ukevent.
> > + */
> > +#define KEVENTS_ON_PAGE (PAGE_SIZE/sizeof(struct ukevent))
>
> How about doing
>
> struct ukevent_ring {
> unsigned int index;
> struct ukevent[0];
> }
>
> and removing all those nasty typeasting and offsetting games?
>
> In fact you can even do
>
> struct ukevent_ring {
> struct ukevent[(PAGE_SIZE - sizeof(unsigned int)) /
> sizeof(struct ukevent)];
> unsigned int index;
> };
>
> if you're careful ;)

Ring takes more than one page, so it will be
struct ukevent_ring_0 and struct ukevent_ring_other.
Is it really needed?
Not a big problem, if you do think it is worth it.

> > +/*
> > + * Called under kevent_user->ready_lock, so updates are always protected.
> > + */
> > +void kevent_user_ring_add_event(struct kevent *k)
> > +{
> > + unsigned int *idx_ptr, idx, pidx, off;
> > + struct ukevent *ukev;
> > +
> > + idx_ptr = (unsigned int *)k->user->pring[0];
> > + idx = idx_ptr[0];
> > +
> > + pidx = idx/KEVENTS_ON_PAGE;
> > + off = idx%KEVENTS_ON_PAGE;
> > +
> > + if (pidx == 0)
> > + ukev = (struct ukevent *)(k->user->pring[pidx] + sizeof(unsigned int));
> > + else
> > + ukev = (struct ukevent *)(k->user->pring[pidx]);
>
> Such as these.
>
> > + memcpy(&ukev[off], &k->event, sizeof(struct ukevent));
> > +
> > + idx++;
> > + if (idx >= KEVENT_MAX_EVENTS)
> > + idx = 0;
> > +
> > + idx_ptr[0] = idx;
> > +}
> > +
> > +static int kevent_user_ring_init(struct kevent_user *u)
> > +{
> > + int i, pnum;
> > +
> > + pnum = ALIGN(KEVENT_MAX_EVENTS*sizeof(struct ukevent) + sizeof(unsigned int), PAGE_SIZE)/PAGE_SIZE;
>
> And these.
>
> > + u->pring = kmalloc(pnum * sizeof(unsigned long), GFP_KERNEL);
> > + if (!u->pring)
> > + return -ENOMEM;
> > +
> > + for (i=0; i<pnum; ++i) {
> > + u->pring[i] = __get_free_page(GFP_KERNEL);
> > + if (!u->pring)
>
> bug: this is testing the wrong thing.

How come?
__get_free_page() can return 0 if page was not allocated.

> > + break;
> > + }
> > +
> > + if (i != pnum) {
> > + pnum = i;
> > + goto err_out_free;
> > + }
>
> Move this logic into the `if (!u->pring)' logic, above.

Ok.

> > + kevent_user_ring_set(u, 0);
> > +
> > + return 0;
> > +
> > +err_out_free:
> > + for (i=0; i<pnum; ++i)
> > + free_page(u->pring[i]);
> > +
> > + kfree(u->pring);
> > +
> > + return -ENOMEM;
> > +}
> > +
> > +static void kevent_user_ring_fini(struct kevent_user *u)
> > +{
> > + int i, pnum;
> > +
> > + pnum = ALIGN(KEVENT_MAX_EVENTS*sizeof(struct ukevent) + sizeof(unsigned int), PAGE_SIZE)/PAGE_SIZE;
> > +
> > + for (i=0; i<pnum; ++i)
> > + free_page(u->pring[i]);
> > +
> > + kfree(u->pring);
> > +}
> > +
> > +static struct kevent_user *kevent_user_alloc(void)
> > +{
> > + struct kevent_user *u;
> > + int i;
> > +
> > + u = kzalloc(sizeof(struct kevent_user), GFP_KERNEL);
> > + if (!u)
> > + return NULL;
> > +
> > + INIT_LIST_HEAD(&u->ready_list);
> > + spin_lock_init(&u->ready_lock);
> > + u->ready_num = 0;
> > + kevent_user_stat_init(u);
> > + spin_lock_init(&u->kevent_lock);
> > + for (i=0; i<ARRAY_SIZE(u->kevent_list); ++i)
> > + INIT_LIST_HEAD(&u->kevent_list[i]);
> > + u->kevent_num = 0;
> > +
> > + mutex_init(&u->ctl_mutex);
> > + init_waitqueue_head(&u->wait);
> > + u->max_ready_num = 0;
>
> The above zeroes out a bunch of known-to-already-be-zero things.

Ok, I will remove redundant settings.

> > +static int kevnet_user_mmap(struct file *file, struct vm_area_struct *vma)
>
> The function name is mistyped.
>
> This code doesn't have many comments, does it? What are we mapping here,
> and why would an application want to map it?

That code waits comments from people who requested it.
It is ring of the ready events, which can be read by userspace instead
of calling syscall, so syscall just becomes "wait until there is a
place" or something like that.

> And what are the portability implications? Does userspace need to know the
> 64-bitness of its kernel to be able to work out the alignment of things?
> If so, what happens if a later/different compiler aligns things
> differently?

There are no alignment issues - I use 32bit values anywhere.

> > +{
> > + size_t size = vma->vm_end - vma->vm_start, psize;
> > + int pnum = size/PAGE_SIZE, i;
> > + unsigned long start = vma->vm_start;
> > + struct kevent_user *u = file->private_data;
> > +
> > + psize = ALIGN(KEVENT_MAX_EVENTS*sizeof(struct ukevent) + sizeof(unsigned int), PAGE_SIZE);
> > +
> > + if (size + vma->vm_pgoff*PAGE_SIZE != psize)
> > + return -EINVAL;
> > +
> > + if (vma->vm_flags & VM_WRITE)
> > + return -EPERM;
> > +
> > + vma->vm_page_prot = pgprot_noncached(vma->vm_page_prot);
> > +
> > + for (i=0; i<pnum; ++i) {
> > + if (remap_pfn_range(vma, start, virt_to_phys((void *)u->pring[i+vma->vm_pgoff]), PAGE_SIZE,
> > + vma->vm_page_prot))
> > + return -EAGAIN;
> > + start += PAGE_SIZE;
> > + }
> > +
> > + return 0;
> > +}
>
> Is EAGAIN an appropriate return value?
>
> If this function had a decent comment we could ask Hugh to review it.

It is a trivial ->mmap() implementation - the ring buffer, which
consists of several pages (allocated through __get_free_page()), is
mapped into userspace.
vm_pgoff shows the offset inside that ring.

> > +#if 0
> > +static inline unsigned int kevent_user_hash(struct ukevent *uk)
> > +{
> > + unsigned int h = (uk->user[0] ^ uk->user[1]) ^ (uk->id.raw[0] ^ uk->id.raw[1]);
> > +
> > + h = (((h >> 16) & 0xffff) ^ (h & 0xffff)) & 0xffff;
> > + h = (((h >> 8) & 0xff) ^ (h & 0xff)) & KEVENT_HASH_MASK;
> > +
> > + return h;
> > +}
> > +#else
> > +static inline unsigned int kevent_user_hash(struct ukevent *uk)
> > +{
> > + return jhash_1word(uk->id.raw[0], 0) & KEVENT_HASH_MASK;
> > +}
> > +#endif
> > +
> > +static void kevent_free_rcu(struct rcu_head *rcu)
> > +{
> > + struct kevent *kevent = container_of(rcu, struct kevent, rcu_head);
> > + kmem_cache_free(kevent_cache, kevent);
> > +}
> > +
> > +static void kevent_finish_user_complete(struct kevent *k, int deq)
> > +{
> > + struct kevent_user *u = k->user;
> > + unsigned long flags;
> > +
> > + if (deq)
> > + kevent_dequeue(k);
> > +
> > + spin_lock_irqsave(&u->ready_lock, flags);
> > + if (k->ready_entry.next != LIST_POISON1) {
> > + list_del(&k->ready_entry);
>
> list_del_rcu()?

No, ready list does not need RCU protection.

> > + u->ready_num--;
> > + }
> > + spin_unlock_irqrestore(&u->ready_lock, flags);
> > +
> > + kevent_user_put(u);
> > + call_rcu(&k->rcu_head, kevent_free_rcu);
> > +}
> > +
> > +static void __kevent_finish_user(struct kevent *k, int deq)
> > +{
> > + struct kevent_user *u = k->user;
> > +
> > + list_del(&k->kevent_entry);
> > + u->kevent_num--;
> > + kevent_finish_user_complete(k, deq);
> > +}
>
> No locking needed?

It is a special function which is called without taking the lock itself;
the function without the __ prefix takes the appropriate lock.

> It's hard to review uncommented code. And the review is less useful if the
> reviewer cannot determine what the developer was attempting to do.

Comment is 5 lines below, where that function is called wrapped with
appropriate lock.

> > +/*
> > + * Remove kevent from user's list of all events,
> > + * dequeue it from storage and decrease user's reference counter,
> > + * since this kevent does not exist anymore. That is why it is freed here.
> > + */
>
> That's nice.

Here it is.

> > +static void kevent_finish_user(struct kevent *k, int deq)
> > +{
> > + struct kevent_user *u = k->user;
> > + unsigned long flags;
> > +
> > + spin_lock_irqsave(&u->kevent_lock, flags);
> > + list_del(&k->kevent_entry);
>
> list_del_rcu()?

No, this list does not require RCU protection, only storage_list
(storage_entry) requires that.

> > + u->kevent_num--;
> > + spin_unlock_irqrestore(&u->kevent_lock, flags);
> > + kevent_finish_user_complete(k, deq);
> > +}
> > +
> > +/*
> > + * Dequeue one entry from user's ready queue.
> > + */
> > +
> > +static struct kevent *kqueue_dequeue_ready(struct kevent_user *u)
> > +{
> > + unsigned long flags;
> > + struct kevent *k = NULL;
> > +
> > + spin_lock_irqsave(&u->ready_lock, flags);
> > + if (u->ready_num && !list_empty(&u->ready_list)) {
> > + k = list_entry(u->ready_list.next, struct kevent, ready_entry);
> > + list_del(&k->ready_entry);
> > + u->ready_num--;
> > + }
> > + spin_unlock_irqrestore(&u->ready_lock, flags);
> > +
> > + return k;
> > +}
> > +
> > +static struct kevent *__kevent_search(struct list_head *head, struct ukevent *uk,
> > + struct kevent_user *u)
> > +{
> > + struct kevent *k;
> > + int found = 0;
> > +
> > + list_for_each_entry(k, head, kevent_entry) {
> > + spin_lock(&k->ulock);
> > + if (k->event.user[0] == uk->user[0] && k->event.user[1] == uk->user[1] &&
> > + k->event.id.raw[0] == uk->id.raw[0] &&
> > + k->event.id.raw[1] == uk->id.raw[1]) {
> > + found = 1;
> > + spin_unlock(&k->ulock);
> > + break;
> > + }
> > + spin_unlock(&k->ulock);
> > + }
> > +
> > + return (found)?k:NULL;
> > +}
>
> Remove `found', do
>
> struct kevent *ret = NULL;
>
> ...
> ret = k;
> break;
> ...
> return ret;

Ok.

> > +static int kevent_modify(struct ukevent *uk, struct kevent_user *u)
>
> <wonders what this function does>

Let me guess... It modifies kevent? :)
I will add comments.

> > +{
> > + struct kevent *k;
> > + unsigned int hash = kevent_user_hash(uk);
> > + int err = -ENODEV;
> > + unsigned long flags;
> > +
> > + spin_lock_irqsave(&u->kevent_lock, flags);
> > + k = __kevent_search(&u->kevent_list[hash], uk, u);
> > + if (k) {
> > + spin_lock(&k->ulock);
> > + k->event.event = uk->event;
> > + k->event.req_flags = uk->req_flags;
> > + k->event.ret_flags = 0;
> > + spin_unlock(&k->ulock);
> > + kevent_requeue(k);
> > + err = 0;
> > + }
> > + spin_unlock_irqrestore(&u->kevent_lock, flags);
> > +
> > + return err;
> > +}
>
> ENODEV: "No such device". Doesn't sound appropriate.

ENOKEVENT? I expect ENODEV means "there is no requested thing".

> > +static int kevent_remove(struct ukevent *uk, struct kevent_user *u)
> > +{
> > + int err = -ENODEV;
> > + struct kevent *k;
> > + unsigned int hash = kevent_user_hash(uk);
> > + unsigned long flags;
> > +
> > + spin_lock_irqsave(&u->kevent_lock, flags);
> > + k = __kevent_search(&u->kevent_list[hash], uk, u);
> > + if (k) {
> > + __kevent_finish_user(k, 1);
> > + err = 0;
> > + }
> > + spin_unlock_irqrestore(&u->kevent_lock, flags);
> > +
> > + return err;
> > +}
> > +
> > +/*
> > + * No new entry can be added or removed from any list at this point.
> > + * It is not permitted to call ->ioctl() and ->release() in parallel.
> > + */
> > +static int kevent_user_release(struct inode *inode, struct file *file)
> > +{
> > + struct kevent_user *u = file->private_data;
> > + struct kevent *k, *n;
> > + int i;
> > +
> > + for (i=0; i<KEVENT_HASH_MASK+1; ++i) {
>
> ARRAY_SIZE

Ok.

> > + list_for_each_entry_safe(k, n, &u->kevent_list[i], kevent_entry)
> > + kevent_finish_user(k, 1);
> > + }
> > +
> > + kevent_user_put(u);
> > + file->private_data = NULL;
> > +
> > + return 0;
> > +}
> > +
> > +static struct ukevent *kevent_get_user(unsigned int num, void __user *arg)
> > +{
> > + struct ukevent *ukev;
> > +
> > + ukev = kmalloc(sizeof(struct ukevent) * num, GFP_KERNEL);
> > + if (!ukev)
> > + return NULL;
> > +
> > + if (copy_from_user(arg, ukev, sizeof(struct ukevent) * num)) {
> > + kfree(ukev);
> > + return NULL;
> > + }
> > +
> > + return ukev;
> > +}
>
> The copy_from_user() args are reversed.
>
> This is serious breakage and raises concerns about the amount of testing
> which has been performed.

It is a typo in the new code, which was added by request in this thread.

> AFAICT there is no bounds checking on `num', so the user can force a
> deliberate multiplication overflow and cause havoc here.

It is checked when it is added into the ring; if the kevent was not
added and is going to be removed or modified, it will just be thrown out
with an appropriate return code.
It should be checked against u->kevent_num here.
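A minimal sketch of what a bounded, fixed-up kevent_get_user() might look
like (the exact limit used here is an assumption; KEVENT_MAX_EVENTS is the
obvious candidate):

static struct ukevent *kevent_get_user(unsigned int num, void __user *arg)
{
        struct ukevent *ukev;

        /* bound num before the multiplication so it cannot overflow */
        if (!num || num > KEVENT_MAX_EVENTS)
                return NULL;

        ukev = kmalloc(sizeof(struct ukevent) * num, GFP_KERNEL);
        if (!ukev)
                return NULL;

        /* kernel buffer is the destination, the userspace pointer the source */
        if (copy_from_user(ukev, arg, sizeof(struct ukevent) * num)) {
                kfree(ukev);
                return NULL;
        }

        return ukev;
}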

> > +static int kevent_user_ctl_modify(struct kevent_user *u, unsigned int num, void __user *arg)
> > +{
> > + int err = 0, i;
> > + struct ukevent uk;
> > +
> > + mutex_lock(&u->ctl_mutex);
> > +
> > + if (num > KEVENT_MIN_BUFFS_ALLOC) {
> > + struct ukevent *ukev;
> > +
> > + ukev = kevent_get_user(num, arg);
> > + if (ukev) {
> > + for (i=0; i<num; ++i) {
> > + if (kevent_modify(&ukev[i], u))
> > + ukev[i].ret_flags |= KEVENT_RET_BROKEN;
> > + ukev[i].ret_flags |= KEVENT_RET_DONE;
> > + }
> > + if (copy_to_user(arg, ukev, num*sizeof(struct ukevent)))
> > + err = -EINVAL;
>
> EFAULT

Ok.

> > + kfree(ukev);
> > + goto out;
> > + }
> > + }
> > +
> > + for (i=0; i<num; ++i) {
> > + if (copy_from_user(&uk, arg, sizeof(struct ukevent))) {
> > + err = -EINVAL;
>
> EFAULT
>
> > + break;
> > + }
> > +
> > + if (kevent_modify(&uk, u))
> > + uk.ret_flags |= KEVENT_RET_BROKEN;
> > + uk.ret_flags |= KEVENT_RET_DONE;
> > +
> > + if (copy_to_user(arg, &uk, sizeof(struct ukevent))) {
> > + err = -EINVAL;
>
> EFAULT.
>
> > + if (copy_from_user(&uk, arg, sizeof(struct ukevent))) {
> > + err = -EINVAL;
>
> EFAULT (all over the place).

Ok, I will return EFAULT when copy*user fails.

> > +static void kevent_user_enqueue(struct kevent_user *u, struct kevent *k)
> > +{
> > + unsigned long flags;
> > + unsigned int hash = kevent_user_hash(&k->event);
> > +
> > + spin_lock_irqsave(&u->kevent_lock, flags);
> > + list_add_tail(&k->kevent_entry, &u->kevent_list[hash]);
> > + u->kevent_num++;
> > + kevent_user_get(u);
> > + spin_unlock_irqrestore(&u->kevent_lock, flags);
> > +}
>
> kevent_user_get() can be moved outside the lock?

Yes.
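i.e. something like this (a sketch of the agreed change, nothing else
touched):

static void kevent_user_enqueue(struct kevent_user *u, struct kevent *k)
{
        unsigned long flags;
        unsigned int hash = kevent_user_hash(&k->event);

        kevent_user_get(u);     /* the refcount bump does not need kevent_lock */

        spin_lock_irqsave(&u->kevent_lock, flags);
        list_add_tail(&k->kevent_entry, &u->kevent_list[hash]);
        u->kevent_num++;
        spin_unlock_irqrestore(&u->kevent_lock, flags);
}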

> > +/*
> > + * Copy all ukevents from userspace, allocate kevent for each one
> > + * and add them into appropriate kevent_storages,
> > + * e.g. sockets, inodes and so on...
> > + * If something goes wrong, all events will be dequeued and
> > + * negative error will be returned.
> > + * On success number of finished events is returned and
> > + * Array of finished events (struct ukevent) will be placed behind
> > + * kevent_user_control structure. User must run through that array and check
> > + * ret_flags field of each ukevent structure to determine if it is fired or failed event.
> > + */
> > +static int kevent_user_ctl_add(struct kevent_user *u, unsigned int num, void __user *arg)
> > +{
> > + int err, cerr = 0, knum = 0, rnum = 0, i;
> > + void __user *orig = arg;
> > + struct ukevent uk;
> > +
> > + mutex_lock(&u->ctl_mutex);
> > +
> > + err = -ENFILE;
> > + if (u->kevent_num + num >= KEVENT_MAX_EVENTS)
>
> Can a malicious user force an arithmetic overflow here?

All numbers here are unsigned and are compared against 4096.
So, answer is no.

> > + goto out_remove;
> > +
> > + if (num > KEVENT_MIN_BUFFS_ALLOC) {
> > + struct ukevent *ukev;
> > +
> > + ukev = kevent_get_user(num, arg);
> > + if (ukev) {
> > + for (i=0; i<num; ++i) {
> > + err = kevent_user_add_ukevent(&ukev[i], u);
> > + if (err) {
> > + kevent_user_stat_increase_im(u);
> > + if (i != rnum)
> > + memcpy(&ukev[rnum], &ukev[i], sizeof(struct ukevent));
> > + rnum++;
>
> What's happening here? The games with `rnum' and comparing it with `i'??
>
> Perhaps these are not the best-chosen identifiers..

When a kevent is ready immediately it is copied into the same buffer at
the previous (rnum, ready num) position. The kevent at position rnum was
not ready immediately (otherwise it would have been copied and rnum
increased) and thus it has already been copied into the queue and can be
overwritten here.
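For readers following along, the same loop with more explicit naming
(purely an illustrative paraphrase of the quoted code, not the actual
patch):

        unsigned int i, ready = 0;      /* 'ready' plays the role of rnum */
        int err;

        for (i = 0; i < num; ++i) {
                err = kevent_user_add_ukevent(&ukev[i], u);
                if (err) {
                        /* ready (or broken) immediately: pack it towards the
                         * front of ukev[], which is later copied back to
                         * userspace */
                        if (i != ready)
                                memcpy(&ukev[ready], &ukev[i],
                                                sizeof(struct ukevent));
                        ready++;
                }
                /* otherwise the kevent was queued; its slot in ukev[] is no
                 * longer needed and may be overwritten by the memcpy above */
        }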

> > +/*
> > + * In nonblocking mode it returns as many events as possible, but not more than @max_nr.
> > + * In blocking mode it waits until timeout or if at least @min_nr events are ready.
> > + */
> > +static int kevent_user_wait(struct file *file, struct kevent_user *u,
> > + unsigned int min_nr, unsigned int max_nr, unsigned int timeout,
> > + void __user *buf)
> > +{
> > + struct kevent *k;
> > + int cerr = 0, num = 0;
> > +
> > + if (!(file->f_flags & O_NONBLOCK)) {
> > + wait_event_interruptible_timeout(u->wait,
> > + u->ready_num >= min_nr, msecs_to_jiffies(timeout));
> > + }
> > +
> > + while (num < max_nr && ((k = kqueue_dequeue_ready(u)) != NULL)) {
> > + if (copy_to_user(buf + num*sizeof(struct ukevent),
> > + &k->event, sizeof(struct ukevent))) {
> > + cerr = -EINVAL;
> > + break;
> > + }
> > +
> > + /*
> > + * If it is one-shot kevent, it has been removed already from
> > + * origin's queue, so we can easily free it here.
> > + */
> > + if (k->event.req_flags & KEVENT_REQ_ONESHOT)
> > + kevent_finish_user(k, 1);
> > + ++num;
> > + kevent_user_stat_increase_wait(u);
> > + }
> > +
> > + return (cerr)?cerr:num;
> > +}
>
> So if this returns an error, the user doesn't know how many events were
> actually completed? That doesn't seem good.

What is the alternative?
read() works the same way - either an error or the number of bytes read.

> > +asmlinkage long sys_kevent_ctl(int fd, unsigned int cmd, unsigned int num, void __user *arg)
>
> At some point Michael will want to be writing the manpages for things like
> this. He'll start out by reading the comment block, poor guy.

I will add comments.

> > +{
> > + int err = -EINVAL;
> > + struct file *file;
> > +
> > + if (cmd == KEVENT_CTL_INIT)
> > + return kevent_ctl_init();
> > +
> > + file = fget(fd);
> > + if (!file)
> > + return -ENODEV;
> > +
> > + if (file->f_op != &kevent_user_fops)
> > + goto out_fput;
> > +
> > + err = kevent_ctl_process(file, cmd, num, arg);
> > +
> > +out_fput:
> > + fput(file);
> > + return err;
> > +}

So let me quote your first words about kevent:

> Summary:
>
> - has serious bugs which indicate that much better testing is needed.
>
> - All -EFOO return values need to be reviewed for appropriateness
>
> - needs much better commenting before I can do more than a local-level review.

As far as I can see there are no serious bugs except the absence of two
checks and a typo in argument order, which obviously will be fixed.
All EFOO will be changed according to comments and better comments will
be added.

Thank you for review, Andrew.

--
Evgeniy Polyakov

2006-08-10 06:42:31

by David Miller

[permalink] [raw]
Subject: Re: [take6 1/3] kevent: Core files.

From: Evgeniy Polyakov <[email protected]>
Date: Thu, 10 Aug 2006 10:14:33 +0400

> On Wed, Aug 09, 2006 at 03:21:27PM -0700, Andrew Morton ([email protected]) wrote:
> > On big-endian machines, this pointer will appear to be word-swapped as far
> > as a 64-bit kernel is concerned. Or something.
> >
> > IOW: What's going on here??
>
> It is user data - I put there a union just to simplify userspace, so it
> should not require some typecasting.

And this is consistent with similar mechianism we use for
netlink socket dumping, so that we don't have compat layer
crap just because we provide a place for the user to store
his pointer or whatever there.

> > > + k->kevent_entry.next = LIST_POISON1;
> > > + k->storage_entry.prev = LIST_POISON2;
> > > + k->ready_entry.next = LIST_POISON1;
> >
> > Nope ;)
>
> I use pointer checks to determine if entry is in the list or not, why it
> is frowned upon here?

As Andrew mentioned in another posting, these poison macros
are likely to simply go away some day, so you should not use
them.

If you want pointer encoded tags you use internally, define your own.

2006-08-10 06:49:00

by Evgeniy Polyakov

[permalink] [raw]
Subject: Re: [take6 1/3] kevent: Core files.

On Wed, Aug 09, 2006 at 11:42:35PM -0700, David Miller ([email protected]) wrote:
> > > > + k->kevent_entry.next = LIST_POISON1;
> > > > + k->storage_entry.prev = LIST_POISON2;
> > > > + k->ready_entry.next = LIST_POISON1;
> > >
> > > Nope ;)
> >
> > I use pointer checks to determine if entry is in the list or not, why it
> > is frowned upon here?
>
> As Andrew mentioned in another posting, these poison macros
> are likely to simply go away some day, so you should not use
> them.

They have existed for ages and suddenly can go away?..

> If you want pointer encoded tags you use internally, define your own.

I think that if I add code like this
list_del(&k->entry);
k->entry.prev = KEVENT_POISON1;
k->entry.next = KEVENT_POISON2;

it will be suggested that I give myself a lobotomy.

I have enough space in flags in each kevent, so I will use some bits there.

--
Evgeniy Polyakov

2006-08-10 07:19:12

by Andrew Morton

[permalink] [raw]
Subject: Re: [take6 1/3] kevent: Core files.

On Thu, 10 Aug 2006 10:14:33 +0400
Evgeniy Polyakov <[email protected]> wrote:

> > > + union {
> > > + __u32 user[2]; /* User's data. It is not used, just copied to/from user. */
> > > + void *ptr;
> > > + };
> > > +};
> >
> > What is this union for?
> >
> > `ptr' needs a __user tag, does it not?
>
> No, it is never touched by the kernel.

hrm, if you say so.

> > > +/*
> > > + * Must be called before event is going to be added into some origin's queue.
> > > + * Initializes ->enqueue(), ->dequeue() and ->callback() callbacks.
> > > + * If failed, kevent should not be used or kevent_enqueue() will fail to add
> > > + * this kevent into origin's queue with setting
> > > + * KEVENT_RET_BROKEN flag in kevent->event.ret_flags.
> > > + */
> > > +int kevent_init(struct kevent *k)
> > > +{
> > > + spin_lock_init(&k->ulock);
> > > + k->kevent_entry.next = LIST_POISON1;
> > > + k->storage_entry.prev = LIST_POISON2;
> > > + k->ready_entry.next = LIST_POISON1;
> >
> > Nope ;)
>
> I use pointer checks to determine if entry is in the list or not, why it
> is frowned upon here?
> Please do not say about poisoning which takes a lot of cpu cycles to get
> new cachelines and so on - everything in that entry is in the cache,
> since entry was added/deleted/accessed through list walk macro.

"poisoning which takes a lot of cpu cycles". So there ;)

I assure you, that poisoning code might disappear at any time.

If you want to be able to determine whether a list_head has been detached
you can detach it with list_del_init() and then use list_empty() on it.
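i.e. the suggested pattern, assuming the entry is set up with
INIT_LIST_HEAD() and always removed with list_del_init():

        spin_lock_irqsave(&u->ready_lock, flags);
        if (!list_empty(&k->ready_entry)) {     /* still on the ready list? */
                list_del_init(&k->ready_entry); /* detach and re-initialise */
                u->ready_num--;
        }
        spin_unlock_irqrestore(&u->ready_lock, flags);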

> > > +}
> > > +
> > > +late_initcall(kevent_sys_init);
> >
> > Why is it late_initcall? (A comment is needed)
>
> Why not?

Why?

There must have been some reason for having made this a late_initcall() and
that reason is 100% concealed from the reader of this code.

IOW, it needs a comment.

> > > +static inline void kevent_user_ring_set(struct kevent_user *u, unsigned int num)
> > > +{
> > > + unsigned int *idx;
> > > +
> > > + idx = (unsigned int *)u->pring[0];
> >
> > This is a bit ugly.
>
> I specially use first 4 bytes in the first page to store index there,
> since it must be accessed from userspace and kernelspace.

Sure, but the C language is the preferred way in which we communicate and
calculate pointer offsets.

> > > + idx[0] = num;
> > > +}
> > > +
> > > +/*
> > > + * Note that kevents does not exactly fill the page (each ukevent is 40 bytes),
> > > + * so we reuse 4 bytes at the begining of the first page to store index.
> > > + * Take that into account if you want to change size of struct ukevent.
> > > + */
> > > +#define KEVENTS_ON_PAGE (PAGE_SIZE/sizeof(struct ukevent))
> >
> > How about doing
> >
> > struct ukevent_ring {
> > unsigned int index;
> > struct ukevent[0];
> > }
> >
> > and removing all those nasty typeasting and offsetting games?
> >
> > In fact you can even do
> >
> > struct ukevent_ring {
> > struct ukevent[(PAGE_SIZE - sizeof(unsigned int)) /
> > sizeof(struct ukevent)];
> > unsigned int index;
> > };
> >
> > if you're careful ;)
>
> Ring takes more than one page, so it will be
> struct ukevent_ring_0 and struct ukevent_ring_other.
> Is it really needed?
> Not a big problem, if you do think it is worth it.

Well, I've given a couple of prototype-style suggestions. Please take a
look, see if all this open-coded offsetting magic can be done by the
compiler in some reliable and readable fashion. It might not work out, but
I suspect it will.

> > > + u->pring = kmalloc(pnum * sizeof(unsigned long), GFP_KERNEL);
> > > + if (!u->pring)
> > > + return -ENOMEM;
> > > +
> > > + for (i=0; i<pnum; ++i) {
> > > + u->pring[i] = __get_free_page(GFP_KERNEL);
> > > + if (!u->pring)
> >
> > bug: this is testing the wrong thing.
>
> How come?

Take a closer look ;)

> __get_free_page() can return 0 if page was not allocated.

And that 0 is copied to u->pring[0], not to u->pring.
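So the allocation loop presumably wants to be something like (a sketch,
folding in the cleanup as suggested above):

        for (i = 0; i < pnum; ++i) {
                u->pring[i] = __get_free_page(GFP_KERNEL);
                if (!u->pring[i]) {     /* test the page, not the array pointer */
                        pnum = i;       /* free only what was actually allocated */
                        goto err_out_free;
                }
        }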

> > The function name is mistyped.

Did you miss an "OK"? It needs s/kevnet_user_mmap/kevent_user_mmap/g

> > This code doesn't have many comments, does it? What are we mapping here,
> > and why would an application want to map it?
>
> That code waits comments from people who requested it.
> It is ring of the ready events, which can be read by userspace instead
> of calling syscall, so syscall just becomes "wait until there is a
> place" or something like that.

hm. Well, please fully comment code prior to sending it out for review. I
do go on about this, but trust me, it makes the review much more effective.

Afaict this mmap function gives a user a free way of getting pinned memory.
What is the upper bound on the amount of memory which a user can thus
obtain?

> > > +static int kevent_modify(struct ukevent *uk, struct kevent_user *u)
> >
> > <wonders what this function does>
>
> Let me guess... It modifies kevent? :)
> I will add comments.
>
> > > +{
> > > + struct kevent *k;
> > > + unsigned int hash = kevent_user_hash(uk);
> > > + int err = -ENODEV;
> > > + unsigned long flags;
> > > +
> > > + spin_lock_irqsave(&u->kevent_lock, flags);
> > > + k = __kevent_search(&u->kevent_list[hash], uk, u);
> > > + if (k) {
> > > + spin_lock(&k->ulock);
> > > + k->event.event = uk->event;
> > > + k->event.req_flags = uk->req_flags;
> > > + k->event.ret_flags = 0;
> > > + spin_unlock(&k->ulock);
> > > + kevent_requeue(k);
> > > + err = 0;
> > > + }
> > > + spin_unlock_irqrestore(&u->kevent_lock, flags);
> > > +
> > > + return err;
> > > +}
> >
> > ENODEV: "No such device". Doesn't sound appropriate.
>
> ENOKEVENT? I expect ENODEV means "there is no requested thing".

yes, it is hard to map standard errnos onto new and complex non-standard
features.

I don't have a good answer to this, sorry.

Perhaps we should do

#define EPER_SYSCALL_BASE 0x10000

and then each syscall is free to implement new, syscall-specific errnos
starting at this base. But that might be a stupid idea - I don't know.
I'm sure the implementor of strerror() would think so ;)

> >
> > EFAULT (all over the place).
>
> Ok, I will return EFAULT when copy*user fails.

If that makes sense, fine. Sometimes it makes sense to return the number
of bytes transferred up to the point of the fault. Please have a careful
think and decide which behaviour is best in each of these cases.

> > > + int err, cerr = 0, knum = 0, rnum = 0, i;
> > > + void __user *orig = arg;
> > > + struct ukevent uk;
> > > +
> > > + mutex_lock(&u->ctl_mutex);
> > > +
> > > + err = -ENFILE;
> > > + if (u->kevent_num + num >= KEVENT_MAX_EVENTS)
> >
> > Can a malicious user force an arithmetic overflow here?
>
> All numbers here are unsigned and are compared against 4096.

Are they? I only see a comparison of a _sum_ against KEVENT_MAX_EVENTS.
So if the user passes 0x0800,0xfffffff0, for example?

> So, answer is no.
>
> > > + goto out_remove;
> > > +
> > > + if (num > KEVENT_MIN_BUFFS_ALLOC) {
> > > + struct ukevent *ukev;
> > > +
> > > + ukev = kevent_get_user(num, arg);
> > > + if (ukev) {
> > > + for (i=0; i<num; ++i) {
> > > + err = kevent_user_add_ukevent(&ukev[i], u);
> > > + if (err) {
> > > + kevent_user_stat_increase_im(u);
> > > + if (i != rnum)
> > > + memcpy(&ukev[rnum], &ukev[i], sizeof(struct ukevent));
> > > + rnum++;
> >
> > What's happening here? The games with `rnum' and comparing it with `i'??
> >
> > Perhaps these are not the best-chosen identifiers..
>
> When a kevent is ready immediately it is copied into the same buffer at
> the previous (rnum, ready num) position. The kevent at position rnum was
> not ready immediately (otherwise it would have been copied and rnum
> increased) and thus it has already been copied into the queue and can be
> overwritten here.

If you say so ;)

Please bear in mind that Michael Kerrisk <[email protected]> will want
to be writing manpages for all this stuff.

And I must say that Michael repeatedly and correctly dragged me across the
coals for something as simple and stupid as sys_sync_file_range(). Based
on that experience, I wouldn't consider a new syscall like this to be
settled until Michael has fully understood it. And I suspect he doesn't
fully understand it until he has fully documented it.

> > > +/*
> > > + * In nonblocking mode it returns as many events as possible, but not more than @max_nr.
> > > + * In blocking mode it waits until timeout or if at least @min_nr events are ready.
> > > + */
> > > +static int kevent_user_wait(struct file *file, struct kevent_user *u,
> > > + unsigned int min_nr, unsigned int max_nr, unsigned int timeout,
> > > + void __user *buf)
> > > +{
> > > + struct kevent *k;
> > > + int cerr = 0, num = 0;
> > > +
> > > + if (!(file->f_flags & O_NONBLOCK)) {
> > > + wait_event_interruptible_timeout(u->wait,
> > > + u->ready_num >= min_nr, msecs_to_jiffies(timeout));
> > > + }
> > > +
> > > + while (num < max_nr && ((k = kqueue_dequeue_ready(u)) != NULL)) {
> > > + if (copy_to_user(buf + num*sizeof(struct ukevent),
> > > + &k->event, sizeof(struct ukevent))) {
> > > + cerr = -EINVAL;
> > > + break;
> > > + }
> > > +
> > > + /*
> > > + * If it is one-shot kevent, it has been removed already from
> > > + * origin's queue, so we can easily free it here.
> > > + */
> > > + if (k->event.req_flags & KEVENT_REQ_ONESHOT)
> > > + kevent_finish_user(k, 1);
> > > + ++num;
> > > + kevent_user_stat_increase_wait(u);
> > > + }
> > > +
> > > + return (cerr)?cerr:num;
> > > +}
> >
> > So if this returns an error, the user doesn't know how many events were
> > actually completed? That doesn't seem good.
>
> What is the alternative?
> read() works the same way - either an error or the number of bytes read.

No. If read() hits an IO error or EFAULT partway through, read() will
return the number-of-bytes-transferred. read() will only return a -ve
errno if it transferred zero bytes. This way, there is no lost
information.

However kevent_user_wait() will return a -ve errno even if it has reaped
some events. That's lost information and this might make it hard for a
robust userspace client to implement error recovery?
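Following the read() convention, the tail of kevent_user_wait() could
instead look like this (a sketch; the copy failure would then only be
reported when nothing at all was transferred):

        /* report events already handed out to userspace; only return
         * the error if zero events were transferred */
        return num ? num : cerr;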

> So let me quote your first words about kevent:
>
> > Summary:
> >
> > - has serious bugs which indicate that much better testing is needed.
> >
> > - All -EFOO return values need to be reviewed for appropriateness
> >
> > - needs much better commenting before I can do more than a local-level review.
>
> As far as I can see there are no serious bugs except the absence of two
> checks and a typo in argument order, which obviously will be fixed.

Thus far I have found at least two bugs in this patchset which provide at
least a local DoS and possibly a privilege escalation (aka a roothole) to
local users. We hit a similar one in the epoll() implementation a while
back.

This is serious stuff. So experience tells us to be fanatical in the
checking of incoming syscall args. Check all the arguments to death for
correct values. Look out for overflows in additions and multiplications.
Look out for opportunities for excessive resource consumption.
Exhaustively test your new syscalls with many combinations of values when
the kernel is in various states.

> All EFOO will be changed according to comments and better comments will
> be added.

Thanks.

2006-08-10 07:51:34

by Evgeniy Polyakov

[permalink] [raw]
Subject: Re: [take6 1/3] kevent: Core files.

On Thu, Aug 10, 2006 at 12:18:44AM -0700, Andrew Morton ([email protected]) wrote:
> > > > + spin_lock_init(&k->ulock);
> > > > + k->kevent_entry.next = LIST_POISON1;
> > > > + k->storage_entry.prev = LIST_POISON2;
> > > > + k->ready_entry.next = LIST_POISON1;
> > >
> > > Nope ;)
> >
> > I use pointer checks to determine if entry is in the list or not, why it
> > is frowned upon here?
> > Please do not say about poisoning which takes a lot of cpu cycles to get
> > new cachelines and so on - everything in that entry is in the cache,
> > since entry was added/deleted/accessed through list walk macro.
>
> "poisoning which takes a lot of cpu cycles". So there ;)
>
> I assure you, that poisoning code might disappear at any time.
>
> If you want to be able to determine whether a list_head has been detached
> you can detach it with list_del_init() and then use list_empty() on it.

I can't due to RCU rules.

> > > > +}
> > > > +
> > > > +late_initcall(kevent_sys_init);
> > >
> > > Why is it late_initcall? (A comment is needed)
> >
> > Why not?
>
> Why?
>
> There must have been some reason for having made this a late_initcall() and
> that reason is 100% concealed from the reader of this code.

kevent must be initialized before use, and it must happen before
userspace is started, so I use late_initcall(); as I said, it can be
anything else which is called before userspace starts.

> IOW, it needs a comment.

Sure.
I'm working right now on fixing all the issues mentioned in this thread,
and comments are not the last item on the list.

> > > > +static inline void kevent_user_ring_set(struct kevent_user *u, unsigned int num)
> > > > +{
> > > > + unsigned int *idx;
> > > > +
> > > > + idx = (unsigned int *)u->pring[0];
> > >
> > > This is a bit ugly.
> >
> > I specially use first 4 bytes in the first page to store index there,
> > since it must be accessed from userspace and kernelspace.
>
> Sure, but the C language is the preferred way in which we communicate and
> calculate pointer offsets.
>
> > > > + idx[0] = num;
> > > > +}
> > > > +
> > > > +/*
> > > > + * Note that kevents does not exactly fill the page (each ukevent is 40 bytes),
> > > > + * so we reuse 4 bytes at the begining of the first page to store index.
> > > > + * Take that into account if you want to change size of struct ukevent.
> > > > + */
> > > > +#define KEVENTS_ON_PAGE (PAGE_SIZE/sizeof(struct ukevent))
> > >
> > > How about doing
> > >
> > > struct ukevent_ring {
> > > unsigned int index;
> > > struct ukevent[0];
> > > }
> > >
> > > and removing all those nasty typeasting and offsetting games?
> > >
> > > In fact you can even do
> > >
> > > struct ukevent_ring {
> > > struct ukevent[(PAGE_SIZE - sizeof(unsigned int)) /
> > > sizeof(struct ukevent)];
> > > unsigned int index;
> > > };
> > >
> > > if you're careful ;)
> >
> > Ring takes more than one page, so it will be
> > struct ukevent_ring_0 and struct ukevent_ring_other.
> > Is it really needed?
> > Not a big problem, if you do think it is worth it.
>
> Well, I've given a couple of prototype-style suggestions. Please take a
> look, see if all this open-coded offsetting magic can be done by the
> compiler in some reliable and readable fashion. It might not work out, but
> I suspect it will.

I think I will use a structure with an index on each page, since kevents
are not aligned to exactly fit a page, and it can be some kind of (later)
optimisation to use a per-page counter instead of a global one.
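A per-page layout along those lines might look like this (the name and the
exact split are illustrative only, not from the patch):

struct ukevent_ring {
        unsigned int    index;  /* number of events used on this page */
        struct ukevent  event[(PAGE_SIZE - sizeof(unsigned int)) /
                                sizeof(struct ukevent)];
};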

> > > > + u->pring = kmalloc(pnum * sizeof(unsigned long), GFP_KERNEL);
> > > > + if (!u->pring)
> > > > + return -ENOMEM;
> > > > +
> > > > + for (i=0; i<pnum; ++i) {
> > > > + u->pring[i] = __get_free_page(GFP_KERNEL);
> > > > + if (!u->pring)
> > >
> > > bug: this is testing the wrong thing.
> >
> > How come?
>
> Take a closer look ;)

[i] My fault :)

> > __get_free_page() can return 0 if page was not allocated.
>
> And that 0 is copied to u->pring[0], not to u->pring.
>
> > > The function name is mistyped.
>
> Did you miss an "OK"? It needs s/kevnet_user_mmap/kevent_user_mmap/g

It is already fixed :)

> > > This code doesn't have many comments, does it? What are we mapping here,
> > > and why would an application want to map it?
> >
> > That code is awaiting comments from the people who requested it.
> > It is a ring of the ready events, which can be read by userspace instead
> > of calling the syscall, so the syscall just becomes "wait until there is
> > a place" or something like that.
>
> hm. Well, please fully comment code prior to sending it out for review. I
> do go on about this, but trust me, it makes the review much more effective.
>
> Afaict this mmap function gives a user a free way of getting pinned memory.
> What is the upper bound on the amount of memory which a user can thus
> obtain?

it is limited by the maximum queue length, which is 4k entries right now, so
the maximum number of pages here is 4k*40/page_size (4096 * 40 = 163840 bytes),
i.e. about 40 pages on x86.
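
For completeness, reading that ring from userspace would look roughly like
this (a sketch against the current layout; ring_size, kevent_fd and the
consumer position i are placeholders, and KEVENTS_ON_PAGE/PAGE_SIZE mirror
the kernel-side constants):

	void *ring = mmap(NULL, ring_size, PROT_READ, MAP_SHARED, kevent_fd, 0);
	unsigned int idx = *(volatile unsigned int *)ring; /* kernel-updated index */
	unsigned int pg  = i / KEVENTS_ON_PAGE;            /* i: consumer position */
	unsigned int off = i % KEVENTS_ON_PAGE;
	struct ukevent *ev = (struct ukevent *)((char *)ring + pg * PAGE_SIZE +
				(pg == 0 ? sizeof(unsigned int) : 0)) + off;

i.e. the same page/offset arithmetic the kernel uses when it fills the ring.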

> > > > +static int kevent_modify(struct ukevent *uk, struct kevent_user *u)
> > >
> > > <wonders what this function does>
> >
> > Let me guess... It modifies kevent? :)
> > I will add comments.
> >
> > > > +{
> > > > + struct kevent *k;
> > > > + unsigned int hash = kevent_user_hash(uk);
> > > > + int err = -ENODEV;
> > > > + unsigned long flags;
> > > > +
> > > > + spin_lock_irqsave(&u->kevent_lock, flags);
> > > > + k = __kevent_search(&u->kevent_list[hash], uk, u);
> > > > + if (k) {
> > > > + spin_lock(&k->ulock);
> > > > + k->event.event = uk->event;
> > > > + k->event.req_flags = uk->req_flags;
> > > > + k->event.ret_flags = 0;
> > > > + spin_unlock(&k->ulock);
> > > > + kevent_requeue(k);
> > > > + err = 0;
> > > > + }
> > > > + spin_unlock_irqrestore(&u->kevent_lock, flags);
> > > > +
> > > > + return err;
> > > > +}
> > >
> > > ENODEV: "No such device". Doesn't sound appropriate.
> >
> > ENOKEVENT? I expect ENODEV means "there is no requested thing".
>
> yes, it is hard to map standard errnos onto new and complex non-standard
> features.
>
> I don't have a good answer to this, sorry.
>
> Perhaps we should do
>
> #define EPER_SYSCALL_BASE 0x10000
>
> and then each syscall is free to implement new, syscall-specific errnos
> starting at this base. But that might be a stupid idea - I don't know.
> I'm sure the implementor of strerror() would think so ;)

There are some issues with errno and rules for kernel-only errno
holes... Let's just return EINVAL here.

> > > EFAULT (all over the place).
> >
> > Ok, I will return EFAULT when copy*user fails.
>
> If that makes sense, fine. Sometimes it makes sense to return the number
> of bytes transferred up to the point of the fault. Please have a careful
> think and decide which behaviour is best in each of these cases.

No, it is much better to show that things are broken.
Half of a transferred event will not help the user, since his pointer is the
last field in the structure :)

> > > > + int err, cerr = 0, knum = 0, rnum = 0, i;
> > > > + void __user *orig = arg;
> > > > + struct ukevent uk;
> > > > +
> > > > + mutex_lock(&u->ctl_mutex);
> > > > +
> > > > + err = -ENFILE;
> > > > + if (u->kevent_num + num >= KEVENT_MAX_EVENTS)
> > >
> > > Can a malicious user force an arithmetic overflow here?
> >
> > All numbers here are unsigned and are compared against 4096.
>
> Are they? I only see a comparison of a _sum_ against KEVENT_MAX_EVENTS.
> So if the user passes 0x0800,0xfffffff0, for example?

I've already added a check for that user-supplied value; you are correct that
the sum can overflow.
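
Concretely, something like (just a sketch):

	err = -EINVAL;
	if (num > KEVENT_MAX_EVENTS ||
	    u->kevent_num + num >= KEVENT_MAX_EVENTS)
		goto out_remove;

so the user-supplied count is rejected on its own before it is ever added to
kevent_num, and the sum can no longer wrap.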

> > So, answer is no.
> >
> > > > + goto out_remove;
> > > > +
> > > > + if (num > KEVENT_MIN_BUFFS_ALLOC) {
> > > > + struct ukevent *ukev;
> > > > +
> > > > + ukev = kevent_get_user(num, arg);
> > > > + if (ukev) {
> > > > + for (i=0; i<num; ++i) {
> > > > + err = kevent_user_add_ukevent(&ukev[i], u);
> > > > + if (err) {
> > > > + kevent_user_stat_increase_im(u);
> > > > + if (i != rnum)
> > > > + memcpy(&ukev[rnum], &ukev[i], sizeof(struct ukevent));
> > > > + rnum++;
> > >
> > > What's happening here? The games with `rnum' and comparing it with `i'??
> > >
> > > Perhaps these are not the best-chosen identifiers..
> >
> > When a kevent is ready immediately, it is copied back into the same buffer
> > at the previous ("rnum", ready number) position. The kevent at "rnum" was
> > not ready immediately (otherwise it would have been copied back and rnum
> > increased), so it has already gone into the queue and its slot in the
> > buffer can safely be overwritten here.
>
> If you say so ;)
>
> Please bear in mind that Michael Kerrisk <[email protected]> will want
> to be writing manpages for all this stuff.
>
> And I must say that Michael repeatedly and correctly dragged me across the
> coals for something as simple and stupid as sys_sync_file_range(). Based
> on that experience, I wouldn't consider a new syscall like this to be
> settled until Michael has fully understood it. And I suspect he doesn't
> fully understand it until he has fully documented it.

Yep, writing comments is never the most exciting part...

> > > > +/*
> > > > + * In nonblocking mode it returns as many events as possible, but not more than @max_nr.
> > > > + * In blocking mode it waits until timeout or if at least @min_nr events are ready.
> > > > + */
> > > > +static int kevent_user_wait(struct file *file, struct kevent_user *u,
> > > > + unsigned int min_nr, unsigned int max_nr, unsigned int timeout,
> > > > + void __user *buf)
> > > > +{
> > > > + struct kevent *k;
> > > > + int cerr = 0, num = 0;
> > > > +
> > > > + if (!(file->f_flags & O_NONBLOCK)) {
> > > > + wait_event_interruptible_timeout(u->wait,
> > > > + u->ready_num >= min_nr, msecs_to_jiffies(timeout));
> > > > + }
> > > > +
> > > > + while (num < max_nr && ((k = kqueue_dequeue_ready(u)) != NULL)) {
> > > > + if (copy_to_user(buf + num*sizeof(struct ukevent),
> > > > + &k->event, sizeof(struct ukevent))) {
> > > > + cerr = -EINVAL;
> > > > + break;
> > > > + }
> > > > +
> > > > + /*
> > > > + * If it is one-shot kevent, it has been removed already from
> > > > + * origin's queue, so we can easily free it here.
> > > > + */
> > > > + if (k->event.req_flags & KEVENT_REQ_ONESHOT)
> > > > + kevent_finish_user(k, 1);
> > > > + ++num;
> > > > + kevent_user_stat_increase_wait(u);
> > > > + }
> > > > +
> > > > + return (cerr)?cerr:num;
> > > > +}
> > >
> > > So if this returns an error, the user doesn't know how many events were
> > > actually completed? That doesn't seem good.
> >
> > What is the alternative?
> > read() work the same way - either error or number of bytes read.
>
> No. If read() hits an IO error or EFAULT partway through, read() will
> return the number-of-bytes-transferred. read() will only return a -ve
> errno if it transferred zero bytes. This way, there is no lost
> information.
>
> However kevent_user_wait() will return a -ve errno even if it has reaped
> some events. That's lost information and this might make it hard for a
> robust userspace client to implement error recovery?

I have no strong opinion on what value must be returned here.
I have always thought that if an error happened, then it must be indicated.
But it is perfectly ok to return the number of correctly read kevents and let
userspace compare that number with the requested one.
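
I.e. the tail of kevent_user_wait() would then become something along the
lines of:

	/* report what was actually copied; only return the error
	 * when nothing at all was transferred */
	return num ? num : cerr;

and userspace can treat a short return as "retry the remainder".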

> > So let me quote your first words about kevent:
> >
> > > Summary:
> > >
> > > - has serious bugs which indicate that much better testing is needed.
> > >
> > > - All -EFOO return values need to be reviewed for appropriateness
> > >
> > > - needs much better commenting before I can do more than a local-level review.
> >
> > As far as I can see there are no serious bugs, except the absence of two
> > checks and a typo in a function name, which obviously will be fixed.
>
> Thus far I have found at least two bugs in this patchset which provide at
> least a local DoS and possibly a privilege escalation (aka a roothole) to
> local users. We hit a similar one in the epoll() implementation a while
> back.

It is already fixed.

> This is serious stuff. So experience tells us to be fanatical in the
> checking of incoming syscall args. Check all the arguments to death for
> correct values. Look out for overflows in additions and multiplications.
> Look out for opportunities for excessive resource consumption.
> Exhaustively test your new syscalls with many combinations of values when
> the kernel is in various states.

Agree, such security-related issues must be reviewed and tested as much
as possible.

--
Evgeniy Polyakov

2006-08-10 08:03:41

by Andrew Morton

[permalink] [raw]
Subject: Re: [take6 1/3] kevent: Core files.

On Thu, 10 Aug 2006 11:50:47 +0400
Evgeniy Polyakov <[email protected]> wrote:

> > Afaict this mmap function gives a user a free way of getting pinned memory.
> > What is the upper bound on the amount of memory which a user can thus
> > obtain?
>
> it is limited by maximum queue length which is 4k entries right now, so
> maximum number of paged here is 4k*40/page_size, i.e. about 40 pages on
> x86.

Is that per user or per fd? If the latter that is, with the usual
RLIMIT_NOFILE, 160MBytes. 2GB with 64k pagesize. Problem ;)

2006-08-10 08:23:13

by Evgeniy Polyakov

[permalink] [raw]
Subject: Re: [take6 1/3] kevent: Core files.

On Thu, Aug 10, 2006 at 01:02:54AM -0700, Andrew Morton ([email protected]) wrote:
> > > Afaict this mmap function gives a user a free way of getting pinned memory.
> > > What is the upper bound on the amount of memory which a user can thus
> > > obtain?
> >
> > it is limited by maximum queue length which is 4k entries right now, so
> > maximum number of paged here is 4k*40/page_size, i.e. about 40 pages on
> > x86.
>
> Is that per user or per fd? If the latter that is, with the usual
> RLIMIT_NOFILE, 160MBytes. 2GB with 64k pagesize. Problem ;)

Per kevent fd.
I have some ideas about a better mmap ring implementation, which would
dynamically grow its buffer when events are added and reuse the same
space for subsequent events, but there are some unresolved nitpicks yet.
Let's not touch that area in the next releases (no merge of course) until a
better solution is ready. I will change it when the other things are ready.

--
Evgeniy Polyakov

2006-08-10 12:13:39

by Evgeniy Polyakov

[permalink] [raw]
Subject: [take7 0/1] kevent: generic event handling mechanism.

Hello.

Generic event handling mechanism.

Changes from 'take6' patchset:
* a lot of comments!
* do not use list poisoning for detecting whether an entry is in the list
* return number of ready kevents even if copy*user() fails
* strict check for number of kevents in syscall
* use ARRAY_SIZE for array size calculation
* changed superblock magic number
* use SLAB_PANIC instead of direct panic() call
* changed -E* return values
* a lot of small cleanups and indent fixes
* fully removed AIO stuff from patchset

Changes from 'take5' patchset:
* removed compilation warnings about unused variables when lockdep is not turned on
* do not use internal socket structures, use appropriate (exported) wrappers instead
* removed default 1 second timeout
* removed AIO stuff from patchset

Changes from 'take4' patchset:
* use miscdevice instead of chardevice
* comments fixes

Changes from 'take3' patchset:
* removed serializing mutex from kevent_user_wait()
* moved storage list processing to RCU
* removed lockdep screaming - all storage locks are initialized in the same function, so lockdep was taught
to differentiate between the various cases
* remove kevent from storage if is marked as broken after callback
* fixed a typo in mmaped buffer implementation which would end up in wrong index calculation

Changes from 'take2' patchset:
* split kevent_finish_user() to locked and unlocked variants
* do not use KEVENT_STAT ifdefs, use inline functions instead
* use array of callbacks of each type instead of each kevent callback initialization
* changed name of ukevent guarding lock
* use only one kevent lock in kevent_user for all hash buckets instead of per-bucket locks
* do not use kevent_user_ctl structure instead provide needed arguments as syscall parameters
* various indent cleanups
* added optimisation, which is aimed to help when a lot of kevents are being copied from userspace
* mapped buffer (initial) implementation (no userspace yet)

Changes from 'take1' patchset:
- rebased against 2.6.18-git tree
- removed ioctl controlling
- added new syscall kevent_get_events(int fd, unsigned int min_nr, unsigned int max_nr,
unsigned int timeout, void __user *buf, unsigned flags)
- use old syscall kevent_ctl for creation/removing, modification and initial kevent
initialization
- use mutexes instead of semaphores
- added file descriptor check and return error if provided descriptor does not match
kevent file operations
- various indent fixes
- removed aio_sendfile() declarations.

Thank you.

Signed-off-by: Evgeniy Polyakov <[email protected]>



--
Evgeniy Polyakov

2006-08-10 12:17:21

by Evgeniy Polyakov

[permalink] [raw]
Subject: [take7 1/1] kevent: core files and timer/poll notifications.

This patch includes core kevent files:
- userspace controlling
- kernelspace interfaces
- initialization
- notification state machines
- timer and poll/select notifications

With this patchset the request rate reaches 2500 requests/sec, while with
epoll/kqueue and similar techniques it is about 1600-1800 requests per second
on my test hardware with a trivial web server.
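
For reference, the test client drives the interface roughly as follows (a
simplified sketch, not the actual benchmark code; kevent_ctl() and
kevent_get_events() stand for the raw syscalls, handle() is a placeholder and
error handling is omitted):

	struct ukevent uk;
	int fd = open("/dev/kevent", O_RDWR);	/* misc device registered below */

	memset(&uk, 0, sizeof(uk));
	uk.type = KEVENT_POLL;
	uk.event = KEVENT_POLL_POLLIN;
	uk.id.raw[0] = sock;			/* descriptor to be monitored */
	kevent_ctl(fd, KEVENT_CTL_ADD, 1, &uk);

	for (;;) {
		struct ukevent ev[128];
		int i, n = kevent_get_events(fd, 1, 128, 1000, ev, 0);
		for (i = 0; i < n; ++i)
			handle(ev[i].id.raw[0]);
	}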

Signed-off-by: Evgeniy Polyakov <[email protected]>

diff --git a/arch/i386/kernel/syscall_table.S b/arch/i386/kernel/syscall_table.S
index dd63d47..091ff42 100644
--- a/arch/i386/kernel/syscall_table.S
+++ b/arch/i386/kernel/syscall_table.S
@@ -317,3 +317,5 @@ ENTRY(sys_call_table)
.long sys_tee /* 315 */
.long sys_vmsplice
.long sys_move_pages
+ .long sys_kevent_get_events
+ .long sys_kevent_ctl
diff --git a/arch/x86_64/ia32/ia32entry.S b/arch/x86_64/ia32/ia32entry.S
index 5d4a7d1..b2af4a8 100644
--- a/arch/x86_64/ia32/ia32entry.S
+++ b/arch/x86_64/ia32/ia32entry.S
@@ -713,4 +713,6 @@ #endif
.quad sys_tee
.quad compat_sys_vmsplice
.quad compat_sys_move_pages
+ .quad sys_kevent_get_events
+ .quad sys_kevent_ctl
ia32_syscall_end:
diff --git a/include/asm-i386/unistd.h b/include/asm-i386/unistd.h
index fc1c8dd..c9dde13 100644
--- a/include/asm-i386/unistd.h
+++ b/include/asm-i386/unistd.h
@@ -323,10 +323,12 @@ #define __NR_sync_file_range 314
#define __NR_tee 315
#define __NR_vmsplice 316
#define __NR_move_pages 317
+#define __NR_kevent_get_events 318
+#define __NR_kevent_ctl 319

#ifdef __KERNEL__

-#define NR_syscalls 318
+#define NR_syscalls 320

/*
* user-visible error numbers are in the range -1 - -128: see
diff --git a/include/asm-x86_64/unistd.h b/include/asm-x86_64/unistd.h
index 94387c9..61363e0 100644
--- a/include/asm-x86_64/unistd.h
+++ b/include/asm-x86_64/unistd.h
@@ -619,10 +619,14 @@ #define __NR_vmsplice 278
__SYSCALL(__NR_vmsplice, sys_vmsplice)
#define __NR_move_pages 279
__SYSCALL(__NR_move_pages, sys_move_pages)
+#define __NR_kevent_get_events 280
+__SYSCALL(__NR_kevent_get_events, sys_kevent_get_events)
+#define __NR_kevent_ctl 281
+__SYSCALL(__NR_kevent_ctl, sys_kevent_ctl)

#ifdef __KERNEL__

-#define __NR_syscall_max __NR_move_pages
+#define __NR_syscall_max __NR_kevent_ctl

#ifndef __NO_STUBS

diff --git a/include/linux/kevent.h b/include/linux/kevent.h
new file mode 100644
index 0000000..d3ff0cd
--- /dev/null
+++ b/include/linux/kevent.h
@@ -0,0 +1,302 @@
+/*
+ * kevent.h
+ *
+ * 2006 Copyright (c) Evgeniy Polyakov <[email protected]>
+ * All rights reserved.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software
+ * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA
+ */
+
+#ifndef __KEVENT_H
+#define __KEVENT_H
+
+/*
+ * Kevent request flags.
+ */
+
+#define KEVENT_REQ_ONESHOT 0x1 /* Process this event only once and then dequeue. */
+
+/*
+ * Kevent return flags.
+ */
+#define KEVENT_RET_BROKEN 0x1 /* Kevent is broken. */
+#define KEVENT_RET_DONE 0x2 /* Kevent processing was finished successfully. */
+
+/*
+ * Kevent type set.
+ */
+#define KEVENT_SOCKET 0
+#define KEVENT_INODE 1
+#define KEVENT_TIMER 2
+#define KEVENT_POLL 3
+#define KEVENT_NAIO 4
+#define KEVENT_AIO 5
+#define KEVENT_MAX 6
+
+/*
+ * Per-type event sets.
+ * Number of per-event sets should be exactly as number of kevent types.
+ */
+
+/*
+ * Timer events.
+ */
+#define KEVENT_TIMER_FIRED 0x1
+
+/*
+ * Socket/network asynchronous IO events.
+ */
+#define KEVENT_SOCKET_RECV 0x1
+#define KEVENT_SOCKET_ACCEPT 0x2
+#define KEVENT_SOCKET_SEND 0x4
+
+/*
+ * Inode events.
+ */
+#define KEVENT_INODE_CREATE 0x1
+#define KEVENT_INODE_REMOVE 0x2
+
+/*
+ * Poll events.
+ */
+#define KEVENT_POLL_POLLIN 0x0001
+#define KEVENT_POLL_POLLPRI 0x0002
+#define KEVENT_POLL_POLLOUT 0x0004
+#define KEVENT_POLL_POLLERR 0x0008
+#define KEVENT_POLL_POLLHUP 0x0010
+#define KEVENT_POLL_POLLNVAL 0x0020
+
+#define KEVENT_POLL_POLLRDNORM 0x0040
+#define KEVENT_POLL_POLLRDBAND 0x0080
+#define KEVENT_POLL_POLLWRNORM 0x0100
+#define KEVENT_POLL_POLLWRBAND 0x0200
+#define KEVENT_POLL_POLLMSG 0x0400
+#define KEVENT_POLL_POLLREMOVE 0x1000
+
+/*
+ * Asynchronous IO events.
+ */
+#define KEVENT_AIO_BIO 0x1
+
+#define KEVENT_MASK_ALL 0xffffffff /* Mask of all possible event values. */
+#define KEVENT_MASK_EMPTY 0x0 /* Empty mask of ready events. */
+
+struct kevent_id
+{
+ __u32 raw[2];
+};
+
+struct ukevent
+{
+ struct kevent_id id; /* Id of this request, e.g. socket number, file descriptor and so on... */
+ __u32 type; /* Event type, e.g. KEVENT_SOCK, KEVENT_INODE, KEVENT_TIMER and so on... */
+ __u32 event; /* Event itself, e.g. SOCK_ACCEPT, INODE_CREATED, TIMER_FIRED... */
+ __u32 req_flags; /* Per-event request flags */
+ __u32 ret_flags; /* Per-event return flags */
+ __u32 ret_data[2]; /* Event return data. Event originator fills it with anything it likes. */
+ union {
+ __u32 user[2]; /* User's data. It is not used, just copied to/from user. */
+ void *ptr;
+ };
+};
+
+#define KEVENT_CTL_ADD 0
+#define KEVENT_CTL_REMOVE 1
+#define KEVENT_CTL_MODIFY 2
+#define KEVENT_CTL_INIT 3
+
+#ifdef __KERNEL__
+
+#include <linux/types.h>
+#include <linux/list.h>
+#include <linux/spinlock.h>
+#include <linux/mutex.h>
+#include <linux/wait.h>
+#include <linux/net.h>
+#include <linux/rcupdate.h>
+#include <linux/kevent_storage.h>
+
+#define KEVENT_MAX_EVENTS 4096
+#define KEVENT_MIN_BUFFS_ALLOC 3
+
+struct inode;
+struct dentry;
+struct sock;
+
+struct kevent;
+struct kevent_storage;
+typedef int (* kevent_callback_t)(struct kevent *);
+
+/* @callback is called each time new event has been caught. */
+/* @enqueue is called each time new event is queued. */
+/* @dequeue is called each time event is dequeued. */
+
+struct kevent_callbacks {
+ kevent_callback_t callback, enqueue, dequeue;
+};
+
+#define KEVENT_READY 0x1
+#define KEVENT_STORAGE 0x2
+#define KEVENT_USER 0x4
+
+struct kevent
+{
+ struct rcu_head rcu_head; /* Used for kevent freeing.*/
+ struct ukevent event;
+ spinlock_t ulock; /* This lock protects ukevent manipulations, e.g. ret_flags changes. */
+
+ struct list_head kevent_entry; /* Entry of user's queue. */
+ struct list_head storage_entry; /* Entry of origin's queue. */
+ struct list_head ready_entry; /* Entry of user's ready. */
+
+ u32 flags;
+
+ struct kevent_user *user; /* User who requested this kevent. */
+ struct kevent_storage *st; /* Kevent container. */
+
+ struct kevent_callbacks callbacks;
+
+ void *priv; /* Private data for different storages.
+ * poll()/select storage has a list of wait_queue_t containers
+ * for each ->poll() { poll_wait()' } here.
+ */
+};
+
+extern struct kevent_callbacks kevent_registered_callbacks[];
+
+#define KEVENT_HASH_MASK 0xff
+
+struct kevent_user
+{
+ struct list_head kevent_list[KEVENT_HASH_MASK+1];
+ spinlock_t kevent_lock;
+ unsigned int kevent_num; /* Number of queued kevents. */
+
+ struct list_head ready_list; /* List of ready kevents. */
+ unsigned int ready_num; /* Number of ready kevents. */
+ spinlock_t ready_lock; /* Protects all manipulations with ready queue. */
+
+ unsigned int max_ready_num; /* Requested number of kevents. */
+
+ struct mutex ctl_mutex; /* Protects against simultaneous kevent_user control manipulations. */
+ wait_queue_head_t wait; /* Wait until some events are ready. */
+
+ atomic_t refcnt; /* Reference counter, increased for each new kevent. */
+
+ unsigned long *pring; /* Array of pages forming mapped ring buffer */
+
+#ifdef CONFIG_KEVENT_USER_STAT
+ unsigned long im_num;
+ unsigned long wait_num;
+ unsigned long total;
+#endif
+};
+
+extern kmem_cache_t *kevent_cache;
+int kevent_enqueue(struct kevent *k);
+int kevent_dequeue(struct kevent *k);
+int kevent_init(struct kevent *k);
+void kevent_requeue(struct kevent *k);
+int kevent_break(struct kevent *k);
+
+void kevent_user_ring_add_event(struct kevent *k);
+
+void kevent_storage_ready(struct kevent_storage *st,
+ kevent_callback_t ready_callback, u32 event);
+int kevent_storage_init(void *origin, struct kevent_storage *st);
+void kevent_storage_fini(struct kevent_storage *st);
+int kevent_storage_enqueue(struct kevent_storage *st, struct kevent *k);
+void kevent_storage_dequeue(struct kevent_storage *st, struct kevent *k);
+
+int kevent_user_add_ukevent(struct ukevent *uk, struct kevent_user *u);
+
+#ifdef CONFIG_KEVENT_POLL
+void kevent_poll_reinit(struct file *file);
+#else
+static inline void kevent_poll_reinit(struct file *file)
+{
+}
+#endif
+
+#ifdef CONFIG_KEVENT_INODE
+void kevent_inode_notify(struct inode *inode, u32 event);
+void kevent_inode_notify_parent(struct dentry *dentry, u32 event);
+void kevent_inode_remove(struct inode *inode);
+#else
+static inline void kevent_inode_notify(struct inode *inode, u32 event)
+{
+}
+static inline void kevent_inode_notify_parent(struct dentry *dentry, u32 event)
+{
+}
+static inline void kevent_inode_remove(struct inode *inode)
+{
+}
+#endif /* CONFIG_KEVENT_INODE */
+#ifdef CONFIG_KEVENT_SOCKET
+#ifdef CONFIG_LOCKDEP
+void kevent_socket_reinit(struct socket *sock);
+void kevent_sk_reinit(struct sock *sk);
+#else
+static inline void kevent_socket_reinit(struct socket *sock)
+{
+}
+static inline void kevent_sk_reinit(struct sock *sk)
+{
+}
+#endif
+void kevent_socket_notify(struct sock *sock, u32 event);
+int kevent_socket_dequeue(struct kevent *k);
+int kevent_socket_enqueue(struct kevent *k);
+#define sock_async(__sk) sock_flag(__sk, SOCK_ASYNC)
+#else
+static inline void kevent_socket_notify(struct sock *sock, u32 event)
+{
+}
+#define sock_async(__sk) ({ (void)__sk; 0; })
+#endif
+
+#ifdef CONFIG_KEVENT_USER_STAT
+static inline void kevent_stat_init(struct kevent_user *u)
+{
+ u->wait_num = u->im_num = u->total = 0;
+}
+static inline void kevent_stat_print(struct kevent_user *u)
+{
+ pr_debug("%s: u=%p, wait=%lu, immediately=%lu, total=%lu.\n",
+ __func__, u, u->wait_num, u->im_num, u->total);
+}
+static inline void kevent_stat_im(struct kevent_user *u)
+{
+ u->im_num++;
+}
+static inline void kevent_stat_wait(struct kevent_user *u)
+{
+ u->wait_num++;
+}
+static inline void kevent_stat_total(struct kevent_user *u)
+{
+ u->total++;
+}
+#else
+#define kevent_stat_print(u) ({ (void) u;})
+#define kevent_stat_init(u) ({ (void) u;})
+#define kevent_stat_im(u) ({ (void) u;})
+#define kevent_stat_wait(u) ({ (void) u;})
+#define kevent_stat_total(u) ({ (void) u;})
+#endif
+
+#endif /* __KERNEL__ */
+#endif /* __KEVENT_H */
diff --git a/include/linux/kevent_storage.h b/include/linux/kevent_storage.h
new file mode 100644
index 0000000..a38575d
--- /dev/null
+++ b/include/linux/kevent_storage.h
@@ -0,0 +1,11 @@
+#ifndef __KEVENT_STORAGE_H
+#define __KEVENT_STORAGE_H
+
+struct kevent_storage
+{
+ void *origin; /* Originator's pointer, e.g. struct sock or struct file. Can be NULL. */
+ struct list_head list; /* List of queued kevents. */
+ spinlock_t lock; /* Protects users queue. */
+};
+
+#endif /* __KEVENT_STORAGE_H */
diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
index 008f04c..8609910 100644
--- a/include/linux/syscalls.h
+++ b/include/linux/syscalls.h
@@ -597,4 +597,7 @@ asmlinkage long sys_get_robust_list(int
asmlinkage long sys_set_robust_list(struct robust_list_head __user *head,
size_t len);

+asmlinkage long sys_kevent_get_events(int ctl_fd, unsigned int min, unsigned int max,
+ unsigned int timeout, void __user *buf, unsigned flags);
+asmlinkage long sys_kevent_ctl(int ctl_fd, unsigned int cmd, unsigned int num, void __user *buf);
#endif
diff --git a/init/Kconfig b/init/Kconfig
index a099fc6..c550fcc 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -218,6 +218,8 @@ config AUDITSYSCALL
such as SELinux. To use audit's filesystem watch feature, please
ensure that INOTIFY is configured.

+source "kernel/kevent/Kconfig"
+
config IKCONFIG
bool "Kernel .config support"
---help---
diff --git a/kernel/Makefile b/kernel/Makefile
index d62ec66..2d7a6dd 100644
--- a/kernel/Makefile
+++ b/kernel/Makefile
@@ -47,6 +47,7 @@ obj-$(CONFIG_DETECT_SOFTLOCKUP) += softl
obj-$(CONFIG_GENERIC_HARDIRQS) += irq/
obj-$(CONFIG_SECCOMP) += seccomp.o
obj-$(CONFIG_RCU_TORTURE_TEST) += rcutorture.o
+obj-$(CONFIG_KEVENT) += kevent/
obj-$(CONFIG_RELAY) += relay.o
obj-$(CONFIG_TASK_DELAY_ACCT) += delayacct.o
obj-$(CONFIG_TASKSTATS) += taskstats.o
diff --git a/kernel/kevent/Kconfig b/kernel/kevent/Kconfig
new file mode 100644
index 0000000..31ea7b2
--- /dev/null
+++ b/kernel/kevent/Kconfig
@@ -0,0 +1,59 @@
+config KEVENT
+ bool "Kernel event notification mechanism"
+ help
+ This option enables event queue mechanism.
+ It can be used as replacement for poll()/select(), AIO callback
+ invocations, advanced timer notifications and other kernel
+ object status changes.
+
+config KEVENT_USER_STAT
+ bool "Kevent user statistic"
+ depends on KEVENT
+ default N
+ help
+ This option will turn kevent_user statistics collection on.
+ Statistics include the total number of kevents, the number of kevents
+ which are ready immediately at insertion time, and the number of kevents
+ which were removed through readiness completion.
+ It will be printed each time control kevent descriptor is closed.
+
+config KEVENT_SOCKET
+ bool "Kernel event notifications for sockets"
+ depends on NET && KEVENT
+ help
+ This option enables notifications through the KEVENT subsystem of
+ socket operations, such as new packets being received,
+ readiness for accept and so on.
+
+config KEVENT_INODE
+ bool "Kernel event notifications for inodes"
+ depends on KEVENT
+ help
+ This option enables notifications through KEVENT subsystem of
+ inode operations, like file creation, removal and so on.
+
+config KEVENT_TIMER
+ bool "Kernel event notifications for timers"
+ depends on KEVENT
+ help
+ This option allows timers to be used through the KEVENT subsystem.
+
+config KEVENT_POLL
+ bool "Kernel event notifications for poll()/select()"
+ depends on KEVENT
+ help
+ This option allows the kevent subsystem to be used for poll()/select()
+ notifications.
+
+config KEVENT_NAIO
+ bool "Network asynchronous IO"
+ depends on KEVENT && KEVENT_SOCKET
+ help
+ This option enables kevent based network asynchronous IO subsystem.
+
+config KEVENT_AIO
+ bool "Asynchronous IO"
+ depends on KEVENT
+ help
+ This option allows the kevent subsystem to be used for AIO operations.
+ AIO read is currently supported.
diff --git a/kernel/kevent/Makefile b/kernel/kevent/Makefile
new file mode 100644
index 0000000..d1ef9ba
--- /dev/null
+++ b/kernel/kevent/Makefile
@@ -0,0 +1,7 @@
+obj-y := kevent.o kevent_user.o
+obj-$(CONFIG_KEVENT_SOCKET) += kevent_socket.o
+obj-$(CONFIG_KEVENT_INODE) += kevent_inode.o
+obj-$(CONFIG_KEVENT_TIMER) += kevent_timer.o
+obj-$(CONFIG_KEVENT_POLL) += kevent_poll.o
+obj-$(CONFIG_KEVENT_NAIO) += kevent_naio.o
+obj-$(CONFIG_KEVENT_AIO) += kevent_aio.o
diff --git a/kernel/kevent/kevent.c b/kernel/kevent/kevent.c
new file mode 100644
index 0000000..03430c9
--- /dev/null
+++ b/kernel/kevent/kevent.c
@@ -0,0 +1,251 @@
+/*
+ * kevent.c
+ *
+ * 2006 Copyright (c) Evgeniy Polyakov <[email protected]>
+ * All rights reserved.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software
+ * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA
+ */
+
+#include <linux/kernel.h>
+#include <linux/types.h>
+#include <linux/list.h>
+#include <linux/slab.h>
+#include <linux/spinlock.h>
+#include <linux/mempool.h>
+#include <linux/sched.h>
+#include <linux/wait.h>
+#include <linux/kevent.h>
+
+kmem_cache_t *kevent_cache;
+
+/*
+ * Attempts to add an event into appropriate origin's queue.
+ * Returns positive value if this event is ready immediately,
+ * negative value in case of error and zero if event has been queued.
+ * ->enqueue() callback must increase origin's reference counter.
+ */
+int kevent_enqueue(struct kevent *k)
+{
+ if (k->event.type >= KEVENT_MAX)
+ return -EINVAL;
+
+ if (!k->callbacks.enqueue) {
+ kevent_break(k);
+ return -EINVAL;
+ }
+
+ return k->callbacks.enqueue(k);
+}
+
+/*
+ * Remove event from the appropriate queue.
+ * ->dequeue() callback must decrease origin's reference counter.
+ */
+int kevent_dequeue(struct kevent *k)
+{
+ if (k->event.type >= KEVENT_MAX)
+ return -EINVAL;
+
+ if (!k->callbacks.dequeue) {
+ kevent_break(k);
+ return -EINVAL;
+ }
+
+ return k->callbacks.dequeue(k);
+}
+
+/*
+ * Mark kevent as broken.
+ */
+int kevent_break(struct kevent *k)
+{
+ unsigned long flags;
+
+ spin_lock_irqsave(&k->ulock, flags);
+ k->event.ret_flags |= KEVENT_RET_BROKEN;
+ spin_unlock_irqrestore(&k->ulock, flags);
+ return 0;
+}
+
+struct kevent_callbacks kevent_registered_callbacks[KEVENT_MAX];
+
+/*
+ * Must be called before event is going to be added into some origin's queue.
+ * Initializes ->enqueue(), ->dequeue() and ->callback() callbacks.
+ * If it fails, the kevent must not be used, since kevent_enqueue() will then
+ * refuse to add this kevent into the origin's queue and will set the
+ * KEVENT_RET_BROKEN flag in kevent->event.ret_flags.
+ */
+int kevent_init(struct kevent *k)
+{
+ spin_lock_init(&k->ulock);
+ k->flags = 0;
+
+ if (k->event.type >= KEVENT_MAX)
+ return -EINVAL;
+
+ k->callbacks = kevent_registered_callbacks[k->event.type];
+ if (!k->callbacks.callback) {
+ kevent_break(k);
+ return -EINVAL;
+ }
+
+ return 0;
+}
+
+/*
+ * Called from ->enqueue() callback when reference counter for given
+ * origin (socket, inode...) has been increased.
+ */
+int kevent_storage_enqueue(struct kevent_storage *st, struct kevent *k)
+{
+ unsigned long flags;
+
+ k->st = st;
+ spin_lock_irqsave(&st->lock, flags);
+ list_add_tail_rcu(&k->storage_entry, &st->list);
+ k->flags |= KEVENT_STORAGE;
+ spin_unlock_irqrestore(&st->lock, flags);
+ return 0;
+}
+
+/*
+ * Dequeue kevent from origin's queue.
+ * It does not decrease origin's reference counter in any way
+ * and must be called before it, so storage itself must be valid.
+ * It is called from ->dequeue() callback.
+ */
+void kevent_storage_dequeue(struct kevent_storage *st, struct kevent *k)
+{
+ unsigned long flags;
+
+ spin_lock_irqsave(&st->lock, flags);
+ if (k->flags & KEVENT_STORAGE) {
+ list_del_rcu(&k->storage_entry);
+ k->flags &= ~KEVENT_STORAGE;
+ }
+ spin_unlock_irqrestore(&st->lock, flags);
+}
+
+/*
+ * Call kevent ready callback and queue it into ready queue if needed.
+ * If kevent is marked as one-shot, then remove it from storage queue.
+ */
+static void __kevent_requeue(struct kevent *k, u32 event)
+{
+ int ret, rem = 0;
+ unsigned long flags;
+
+ ret = k->callbacks.callback(k);
+
+ spin_lock_irqsave(&k->ulock, flags);
+ if (ret > 0) {
+ k->event.ret_flags |= KEVENT_RET_DONE;
+ } else if (ret < 0) {
+ k->event.ret_flags |= KEVENT_RET_BROKEN;
+ k->event.ret_flags |= KEVENT_RET_DONE;
+ }
+ rem = (k->event.req_flags & KEVENT_REQ_ONESHOT);
+ if (!ret)
+ ret = (k->event.ret_flags & (KEVENT_RET_BROKEN|KEVENT_RET_DONE));
+ spin_unlock_irqrestore(&k->ulock, flags);
+
+ if (ret) {
+ if ((rem || ret < 0) && k->flags & KEVENT_STORAGE) {
+ list_del_rcu(&k->storage_entry);
+ k->flags &= ~KEVENT_STORAGE;
+ }
+
+ spin_lock_irqsave(&k->user->ready_lock, flags);
+ if (!(k->flags & KEVENT_READY)) {
+ kevent_user_ring_add_event(k);
+ list_add_tail(&k->ready_entry, &k->user->ready_list);
+ k->flags |= KEVENT_READY;
+ k->user->ready_num++;
+ }
+ spin_unlock_irqrestore(&k->user->ready_lock, flags);
+ wake_up(&k->user->wait);
+ }
+}
+
+/*
+ * Check if kevent is ready (by invoking its callback) and requeue/remove
+ * if needed.
+ */
+void kevent_requeue(struct kevent *k)
+{
+ unsigned long flags;
+
+ spin_lock_irqsave(&k->st->lock, flags);
+ __kevent_requeue(k, 0);
+ spin_unlock_irqrestore(&k->st->lock, flags);
+}
+
+/*
+ * Called each time some activity in origin (socket, inode...) is noticed.
+ */
+void kevent_storage_ready(struct kevent_storage *st,
+ kevent_callback_t ready_callback, u32 event)
+{
+ struct kevent *k;
+
+ rcu_read_lock();
+ list_for_each_entry_rcu(k, &st->list, storage_entry) {
+ if (ready_callback)
+ (*ready_callback)(k);
+
+ if (event & k->event.event)
+ __kevent_requeue(k, event);
+ }
+ rcu_read_unlock();
+}
+
+int kevent_storage_init(void *origin, struct kevent_storage *st)
+{
+ spin_lock_init(&st->lock);
+ st->origin = origin;
+ INIT_LIST_HEAD(&st->list);
+ return 0;
+}
+
+/*
+ * Mark all events as broken; that removes them from the storage,
+ * so the storage origin (inode, socket and so on) can be safely removed.
+ * No new entries are allowed to be added into the storage at this point
+ * (the socket has already been removed from the file table, for example).
+ */
+void kevent_storage_fini(struct kevent_storage *st)
+{
+ kevent_storage_ready(st, kevent_break, KEVENT_MASK_ALL);
+}
+
+static int __init kevent_sys_init(void)
+{
+ int i;
+
+ kevent_cache = kmem_cache_create("kevent_cache",
+ sizeof(struct kevent), 0, SLAB_PANIC, NULL, NULL);
+
+ for (i=0; i<ARRAY_SIZE(kevent_registered_callbacks); ++i) {
+ struct kevent_callbacks *c = &kevent_registered_callbacks[i];
+
+ c->callback = c->enqueue = c->dequeue = NULL;
+ }
+
+ return 0;
+}
+
+late_initcall(kevent_sys_init);
diff --git a/kernel/kevent/kevent_poll.c b/kernel/kevent/kevent_poll.c
new file mode 100644
index 0000000..8a4f863
--- /dev/null
+++ b/kernel/kevent/kevent_poll.c
@@ -0,0 +1,220 @@
+/*
+ * kevent_poll.c
+ *
+ * 2006 Copyright (c) Evgeniy Polyakov <[email protected]>
+ * All rights reserved.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ */
+
+#include <linux/kernel.h>
+#include <linux/types.h>
+#include <linux/list.h>
+#include <linux/slab.h>
+#include <linux/spinlock.h>
+#include <linux/timer.h>
+#include <linux/file.h>
+#include <linux/kevent.h>
+#include <linux/poll.h>
+#include <linux/fs.h>
+
+static kmem_cache_t *kevent_poll_container_cache;
+static kmem_cache_t *kevent_poll_priv_cache;
+
+struct kevent_poll_ctl
+{
+ struct poll_table_struct pt;
+ struct kevent *k;
+};
+
+struct kevent_poll_wait_container
+{
+ struct list_head container_entry;
+ wait_queue_head_t *whead;
+ wait_queue_t wait;
+ struct kevent *k;
+};
+
+struct kevent_poll_private
+{
+ struct list_head container_list;
+ spinlock_t container_lock;
+};
+
+static int kevent_poll_enqueue(struct kevent *k);
+static int kevent_poll_dequeue(struct kevent *k);
+static int kevent_poll_callback(struct kevent *k);
+
+static int kevent_poll_wait_callback(wait_queue_t *wait,
+ unsigned mode, int sync, void *key)
+{
+ struct kevent_poll_wait_container *cont =
+ container_of(wait, struct kevent_poll_wait_container, wait);
+ struct kevent *k = cont->k;
+ struct file *file = k->st->origin;
+ u32 revents;
+
+ revents = file->f_op->poll(file, NULL);
+
+ kevent_storage_ready(k->st, NULL, revents);
+
+ return 0;
+}
+
+static void kevent_poll_qproc(struct file *file, wait_queue_head_t *whead,
+ struct poll_table_struct *poll_table)
+{
+ struct kevent *k =
+ container_of(poll_table, struct kevent_poll_ctl, pt)->k;
+ struct kevent_poll_private *priv = k->priv;
+ struct kevent_poll_wait_container *cont;
+ unsigned long flags;
+
+ cont = kmem_cache_alloc(kevent_poll_container_cache, SLAB_KERNEL);
+ if (!cont) {
+ kevent_break(k);
+ return;
+ }
+
+ cont->k = k;
+ init_waitqueue_func_entry(&cont->wait, kevent_poll_wait_callback);
+ cont->whead = whead;
+
+ spin_lock_irqsave(&priv->container_lock, flags);
+ list_add_tail(&cont->container_entry, &priv->container_list);
+ spin_unlock_irqrestore(&priv->container_lock, flags);
+
+ add_wait_queue(whead, &cont->wait);
+}
+
+static int kevent_poll_enqueue(struct kevent *k)
+{
+ struct file *file;
+ int err, ready = 0;
+ unsigned int revents;
+ struct kevent_poll_ctl ctl;
+ struct kevent_poll_private *priv;
+
+ file = fget(k->event.id.raw[0]);
+ if (!file)
+ return -ENODEV;
+
+ err = -EINVAL;
+ if (!file->f_op || !file->f_op->poll)
+ goto err_out_fput;
+
+ err = -ENOMEM;
+ priv = kmem_cache_alloc(kevent_poll_priv_cache, SLAB_KERNEL);
+ if (!priv)
+ goto err_out_fput;
+
+ spin_lock_init(&priv->container_lock);
+ INIT_LIST_HEAD(&priv->container_list);
+
+ k->priv = priv;
+
+ ctl.k = k;
+ init_poll_funcptr(&ctl.pt, &kevent_poll_qproc);
+
+ err = kevent_storage_enqueue(&file->st, k);
+ if (err)
+ goto err_out_free;
+
+ revents = file->f_op->poll(file, &ctl.pt);
+ if (revents & k->event.event) {
+ ready = 1;
+ kevent_poll_dequeue(k);
+ }
+
+ return ready;
+
+err_out_free:
+ kmem_cache_free(kevent_poll_priv_cache, priv);
+err_out_fput:
+ fput(file);
+ return err;
+}
+
+static int kevent_poll_dequeue(struct kevent *k)
+{
+ struct file *file = k->st->origin;
+ struct kevent_poll_private *priv = k->priv;
+ struct kevent_poll_wait_container *w, *n;
+ unsigned long flags;
+
+ kevent_storage_dequeue(k->st, k);
+
+ spin_lock_irqsave(&priv->container_lock, flags);
+ list_for_each_entry_safe(w, n, &priv->container_list, container_entry) {
+ list_del(&w->container_entry);
+ remove_wait_queue(w->whead, &w->wait);
+ kmem_cache_free(kevent_poll_container_cache, w);
+ }
+ spin_unlock_irqrestore(&priv->container_lock, flags);
+
+ kmem_cache_free(kevent_poll_priv_cache, priv);
+ k->priv = NULL;
+
+ fput(file);
+
+ return 0;
+}
+
+static int kevent_poll_callback(struct kevent *k)
+{
+ struct file *file = k->st->origin;
+ unsigned int revents = file->f_op->poll(file, NULL);
+ return (revents & k->event.event);
+}
+
+static int __init kevent_poll_sys_init(void)
+{
+ struct kevent_callbacks *pc = &kevent_registered_callbacks[KEVENT_POLL];
+
+ kevent_poll_container_cache = kmem_cache_create("kevent_poll_container_cache",
+ sizeof(struct kevent_poll_wait_container), 0, 0, NULL, NULL);
+ if (!kevent_poll_container_cache) {
+ printk(KERN_ERR "Failed to create kevent poll container cache.\n");
+ return -ENOMEM;
+ }
+
+ kevent_poll_priv_cache = kmem_cache_create("kevent_poll_priv_cache",
+ sizeof(struct kevent_poll_private), 0, 0, NULL, NULL);
+ if (!kevent_poll_priv_cache) {
+ printk(KERN_ERR "Failed to create kevent poll private data cache.\n");
+ kmem_cache_destroy(kevent_poll_container_cache);
+ kevent_poll_container_cache = NULL;
+ return -ENOMEM;
+ }
+
+ pc->enqueue = &kevent_poll_enqueue;
+ pc->dequeue = &kevent_poll_dequeue;
+ pc->callback = &kevent_poll_callback;
+
+ printk(KERN_INFO "Kevent poll()/select() subsystem has been initialized.\n");
+ return 0;
+}
+
+static struct lock_class_key kevent_poll_key;
+
+void kevent_poll_reinit(struct file *file)
+{
+ lockdep_set_class(&file->st.lock, &kevent_poll_key);
+}
+
+static void __exit kevent_poll_sys_fini(void)
+{
+ kmem_cache_destroy(kevent_poll_priv_cache);
+ kmem_cache_destroy(kevent_poll_container_cache);
+}
+
+module_init(kevent_poll_sys_init);
+module_exit(kevent_poll_sys_fini);
diff --git a/kernel/kevent/kevent_timer.c b/kernel/kevent/kevent_timer.c
new file mode 100644
index 0000000..f175edd
--- /dev/null
+++ b/kernel/kevent/kevent_timer.c
@@ -0,0 +1,119 @@
+/*
+ * kevent_timer.c
+ *
+ * 2006 Copyright (c) Evgeniy Polyakov <[email protected]>
+ * All rights reserved.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software
+ * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA
+ */
+
+#include <linux/kernel.h>
+#include <linux/types.h>
+#include <linux/list.h>
+#include <linux/slab.h>
+#include <linux/spinlock.h>
+#include <linux/timer.h>
+#include <linux/jiffies.h>
+#include <linux/kevent.h>
+
+static void kevent_timer_func(unsigned long data)
+{
+ struct kevent *k = (struct kevent *)data;
+ struct timer_list *t = k->st->origin;
+
+ kevent_storage_ready(k->st, NULL, KEVENT_MASK_ALL);
+ mod_timer(t, jiffies + msecs_to_jiffies(k->event.id.raw[0]));
+}
+
+static struct lock_class_key kevent_timer_key;
+
+static int kevent_timer_enqueue(struct kevent *k)
+{
+ struct timer_list *t;
+ struct kevent_storage *st;
+ int err;
+
+ t = kmalloc(sizeof(struct timer_list) + sizeof(struct kevent_storage),
+ GFP_KERNEL);
+ if (!t)
+ return -ENOMEM;
+
+ init_timer(t);
+ t->function = kevent_timer_func;
+ t->expires = jiffies + msecs_to_jiffies(k->event.id.raw[0]);
+ t->data = (unsigned long)k;
+
+ st = (struct kevent_storage *)(t+1);
+ err = kevent_storage_init(t, st);
+ if (err)
+ goto err_out_free;
+ lockdep_set_class(&st->lock, &kevent_timer_key);
+
+ err = kevent_storage_enqueue(st, k);
+ if (err)
+ goto err_out_st_fini;
+
+ add_timer(t);
+
+ return 0;
+
+err_out_st_fini:
+ kevent_storage_fini(st);
+err_out_free:
+ kfree(t);
+
+ return err;
+}
+
+static int kevent_timer_dequeue(struct kevent *k)
+{
+ struct kevent_storage *st = k->st;
+ struct timer_list *t = st->origin;
+
+ if (!t)
+ return -ENODEV;
+
+ del_timer_sync(t);
+
+ kevent_storage_dequeue(st, k);
+
+ kfree(t);
+
+ return 0;
+}
+
+static int kevent_timer_callback(struct kevent *k)
+{
+ struct kevent_storage *st = k->st;
+ struct timer_list *t = st->origin;
+
+ if (!t)
+ return -ENODEV;
+
+ k->event.ret_data[0] = (__u32)jiffies;
+ return 1;
+}
+
+static int __init kevent_init_timer(void)
+{
+ struct kevent_callbacks *tc = &kevent_registered_callbacks[KEVENT_TIMER];
+
+ tc->enqueue = &kevent_timer_enqueue;
+ tc->dequeue = &kevent_timer_dequeue;
+ tc->callback = &kevent_timer_callback;
+
+ return 0;
+}
+late_initcall(kevent_init_timer);
diff --git a/kernel/kevent/kevent_user.c b/kernel/kevent/kevent_user.c
new file mode 100644
index 0000000..7d699aa
--- /dev/null
+++ b/kernel/kevent/kevent_user.c
@@ -0,0 +1,959 @@
+/*
+ * kevent_user.c
+ *
+ * 2006 Copyright (c) Evgeniy Polyakov <[email protected]>
+ * All rights reserved.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software
+ * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA
+ */
+
+#include <linux/kernel.h>
+#include <linux/module.h>
+#include <linux/types.h>
+#include <linux/list.h>
+#include <linux/slab.h>
+#include <linux/spinlock.h>
+#include <linux/fs.h>
+#include <linux/file.h>
+#include <linux/mount.h>
+#include <linux/device.h>
+#include <linux/poll.h>
+#include <linux/kevent.h>
+#include <linux/jhash.h>
+#include <linux/miscdevice.h>
+#include <asm/io.h>
+
+static char kevent_name[] = "kevent";
+
+static int kevent_user_open(struct inode *, struct file *);
+static int kevent_user_release(struct inode *, struct file *);
+static unsigned int kevent_user_poll(struct file *, struct poll_table_struct *);
+static int kevent_user_mmap(struct file *, struct vm_area_struct *);
+
+static struct file_operations kevent_user_fops = {
+ .mmap = kevent_user_mmap,
+ .open = kevent_user_open,
+ .release = kevent_user_release,
+ .poll = kevent_user_poll,
+ .owner = THIS_MODULE,
+};
+
+static struct miscdevice kevent_miscdev = {
+ .minor = MISC_DYNAMIC_MINOR,
+ .name = kevent_name,
+ .fops = &kevent_user_fops,
+};
+
+static int kevent_get_sb(struct file_system_type *fs_type,
+ int flags, const char *dev_name, void *data, struct vfsmount *mnt)
+{
+ /* So original magic... */
+ return get_sb_pseudo(fs_type, kevent_name, NULL, 0xbcdbcdbcdul, mnt);
+}
+
+static struct file_system_type kevent_fs_type = {
+ .name = kevent_name,
+ .get_sb = kevent_get_sb,
+ .kill_sb = kill_anon_super,
+};
+
+static struct vfsmount *kevent_mnt;
+
+/*
+ * kevents are pollable, return POLLIN and POLLRDNORM
+ * when there is at least one ready kevent.
+ */
+static unsigned int kevent_user_poll(struct file *file, struct poll_table_struct *wait)
+{
+ struct kevent_user *u = file->private_data;
+ unsigned int mask;
+
+ poll_wait(file, &u->wait, wait);
+ mask = 0;
+
+ if (u->ready_num)
+ mask |= POLLIN | POLLRDNORM;
+
+ return mask;
+}
+
+static inline void kevent_user_ring_set(struct kevent_user *u, unsigned int num)
+{
+ unsigned int *idx;
+
+ idx = (unsigned int *)u->pring[0];
+ idx[0] = num;
+}
+
+/*
+ * Note that kevents do not exactly fill the page (each ukevent is 40 bytes),
+ * so we reuse 4 bytes at the beginning of the first page to store the index.
+ * Take that into account if you want to change the size of struct ukevent.
+ */
+#define KEVENTS_ON_PAGE (PAGE_SIZE/sizeof(struct ukevent))
+
+/*
+ * Called under kevent_user->ready_lock, so updates are always protected.
+ */
+void kevent_user_ring_add_event(struct kevent *k)
+{
+ unsigned int *idx_ptr, idx, pidx, off;
+ struct ukevent *ukev;
+
+ idx_ptr = (unsigned int *)k->user->pring[0];
+ idx = idx_ptr[0];
+
+ pidx = idx/KEVENTS_ON_PAGE;
+ off = idx%KEVENTS_ON_PAGE;
+
+ if (pidx == 0)
+ ukev = (struct ukevent *)(k->user->pring[pidx] + sizeof(unsigned int));
+ else
+ ukev = (struct ukevent *)(k->user->pring[pidx]);
+
+ memcpy(&ukev[off], &k->event, sizeof(struct ukevent));
+
+ idx++;
+ if (idx >= KEVENT_MAX_EVENTS)
+ idx = 0;
+
+ idx_ptr[0] = idx;
+}
+
+/*
+ * Initialize mmap ring buffer.
+ * It will store ready kevents, so userspace can fetch them directly instead
+ * of using the syscall. Essentially the syscall becomes just a waiting point.
+ */
+static int kevent_user_ring_init(struct kevent_user *u)
+{
+ int i, pnum;
+
+ pnum = ALIGN(KEVENT_MAX_EVENTS*sizeof(struct ukevent) + sizeof(unsigned int), PAGE_SIZE)/PAGE_SIZE;
+
+ u->pring = kmalloc(pnum * sizeof(unsigned long), GFP_KERNEL);
+ if (!u->pring)
+ return -ENOMEM;
+
+ for (i=0; i<pnum; ++i) {
+ u->pring[i] = __get_free_page(GFP_KERNEL);
+ if (!u->pring[i]) {
+ pnum = i;
+ goto err_out_free;
+ }
+ }
+
+ kevent_user_ring_set(u, 0);
+
+ return 0;
+
+err_out_free:
+ for (i=0; i<pnum; ++i)
+ free_page(u->pring[i]);
+
+ kfree(u->pring);
+
+ return -ENOMEM;
+}
+
+static void kevent_user_ring_fini(struct kevent_user *u)
+{
+ int i, pnum;
+
+ pnum = ALIGN(KEVENT_MAX_EVENTS*sizeof(struct ukevent) + sizeof(unsigned int), PAGE_SIZE)/PAGE_SIZE;
+
+ for (i=0; i<pnum; ++i)
+ free_page(u->pring[i]);
+
+ kfree(u->pring);
+}
+
+
+/*
+ * Allocate new kevent userspace control entry.
+ */
+static struct kevent_user *kevent_user_alloc(void)
+{
+ struct kevent_user *u;
+ int i;
+
+ u = kzalloc(sizeof(struct kevent_user), GFP_KERNEL);
+ if (!u)
+ return NULL;
+
+ INIT_LIST_HEAD(&u->ready_list);
+ spin_lock_init(&u->ready_lock);
+ kevent_stat_init(u);
+ spin_lock_init(&u->kevent_lock);
+ for (i=0; i<ARRAY_SIZE(u->kevent_list); ++i)
+ INIT_LIST_HEAD(&u->kevent_list[i]);
+
+ mutex_init(&u->ctl_mutex);
+ init_waitqueue_head(&u->wait);
+
+ atomic_set(&u->refcnt, 1);
+
+ if (kevent_user_ring_init(u)) {
+ kfree(u);
+ u = NULL;
+ }
+
+ return u;
+}
+
+static int kevent_user_open(struct inode *inode, struct file *file)
+{
+ struct kevent_user *u = kevent_user_alloc();
+
+ if (!u)
+ return -ENOMEM;
+
+ file->private_data = u;
+
+ return 0;
+}
+
+
+/*
+ * Kevent userspace control block reference counting.
+ * Set to 1 at creation time; when the corresponding kevent file descriptor
+ * is closed, the reference counter is decreased.
+ * When the counter hits zero the block is freed.
+ */
+static inline void kevent_user_get(struct kevent_user *u)
+{
+ atomic_inc(&u->refcnt);
+}
+
+static inline void kevent_user_put(struct kevent_user *u)
+{
+ if (atomic_dec_and_test(&u->refcnt)) {
+ kevent_stat_print(u);
+ kevent_user_ring_fini(u);
+ kfree(u);
+ }
+}
+
+/*
+ * Mmap implementation for ring buffer, which is created as array
+ * of pages, so vm_pgoff is an offset (in pages, not in bytes) of
+ * the first page to be mapped.
+ */
+static int kevent_user_mmap(struct file *file, struct vm_area_struct *vma)
+{
+ size_t size = vma->vm_end - vma->vm_start, psize;
+ int pnum = size/PAGE_SIZE, i;
+ unsigned long start = vma->vm_start;
+ struct kevent_user *u = file->private_data;
+
+ psize = ALIGN(KEVENT_MAX_EVENTS*sizeof(struct ukevent) + sizeof(unsigned int), PAGE_SIZE);
+
+ if (size + vma->vm_pgoff*PAGE_SIZE != psize)
+ return -EINVAL;
+
+ if (vma->vm_flags & VM_WRITE)
+ return -EPERM;
+
+ vma->vm_page_prot = pgprot_noncached(vma->vm_page_prot);
+
+ for (i=0; i<pnum; ++i) {
+ if (remap_pfn_range(vma, start, virt_to_phys((void *)u->pring[i+vma->vm_pgoff]), PAGE_SIZE,
+ vma->vm_page_prot))
+ return -EFAULT;
+ start += PAGE_SIZE;
+ }
+
+ return 0;
+}
+
+#if 0
+static inline unsigned int kevent_user_hash(struct ukevent *uk)
+{
+ unsigned int h = (uk->user[0] ^ uk->user[1]) ^ (uk->id.raw[0] ^ uk->id.raw[1]);
+
+ h = (((h >> 16) & 0xffff) ^ (h & 0xffff)) & 0xffff;
+ h = (((h >> 8) & 0xff) ^ (h & 0xff)) & KEVENT_HASH_MASK;
+
+ return h;
+}
+#else
+static inline unsigned int kevent_user_hash(struct ukevent *uk)
+{
+ return jhash_1word(uk->id.raw[0], 0) & KEVENT_HASH_MASK;
+}
+#endif
+
+/*
+ * RCU protects storage list (kevent->storage_entry).
+ * Free entry in RCU callback, it is dequeued from all lists at
+ * this point.
+ */
+
+static void kevent_free_rcu(struct rcu_head *rcu)
+{
+ struct kevent *kevent = container_of(rcu, struct kevent, rcu_head);
+ kmem_cache_free(kevent_cache, kevent);
+}
+
+/*
+ * Complete kevent removing - it dequeues kevent from storage list
+ * if it is requested, removes kevent from ready list, drops userspace
+ * control block reference counter and schedules kevent freeing through RCU.
+ */
+static void kevent_finish_user_complete(struct kevent *k, int deq)
+{
+ struct kevent_user *u = k->user;
+ unsigned long flags;
+
+ if (deq)
+ kevent_dequeue(k);
+
+ spin_lock_irqsave(&u->ready_lock, flags);
+ if (k->flags & KEVENT_READY) {
+ list_del(&k->ready_entry);
+ k->flags &= ~KEVENT_READY;
+ u->ready_num--;
+ }
+ spin_unlock_irqrestore(&u->ready_lock, flags);
+
+ kevent_user_put(u);
+ call_rcu(&k->rcu_head, kevent_free_rcu);
+}
+
+/*
+ * Remove from all lists and free kevent.
+ * Must be called under kevent_user->kevent_lock to protect
+ * kevent->kevent_entry removing.
+ */
+static void __kevent_finish_user(struct kevent *k, int deq)
+{
+ struct kevent_user *u = k->user;
+
+ list_del(&k->kevent_entry);
+ k->flags &= ~KEVENT_USER;
+ u->kevent_num--;
+ kevent_finish_user_complete(k, deq);
+}
+
+/*
+ * Remove kevent from user's list of all events,
+ * dequeue it from storage and decrease user's reference counter,
+ * since this kevent does not exist anymore. That is why it is freed here.
+ */
+static void kevent_finish_user(struct kevent *k, int deq)
+{
+ struct kevent_user *u = k->user;
+ unsigned long flags;
+
+ spin_lock_irqsave(&u->kevent_lock, flags);
+ list_del(&k->kevent_entry);
+ k->flags &= ~KEVENT_USER;
+ u->kevent_num--;
+ spin_unlock_irqrestore(&u->kevent_lock, flags);
+ kevent_finish_user_complete(k, deq);
+}
+
+/*
+ * Dequeue one entry from user's ready queue.
+ */
+static struct kevent *kqueue_dequeue_ready(struct kevent_user *u)
+{
+ unsigned long flags;
+ struct kevent *k = NULL;
+
+ spin_lock_irqsave(&u->ready_lock, flags);
+ if (u->ready_num && !list_empty(&u->ready_list)) {
+ k = list_entry(u->ready_list.next, struct kevent, ready_entry);
+ list_del(&k->ready_entry);
+ k->flags &= ~KEVENT_READY;
+ u->ready_num--;
+ }
+ spin_unlock_irqrestore(&u->ready_lock, flags);
+
+ return k;
+}
+
+/*
+ * Search a kevent inside hash bucket for given ukevent.
+ */
+static struct kevent *__kevent_search(struct list_head *head, struct ukevent *uk,
+ struct kevent_user *u)
+{
+ struct kevent *k, *ret = NULL;
+
+ list_for_each_entry(k, head, kevent_entry) {
+ spin_lock(&k->ulock);
+ if (k->event.user[0] == uk->user[0] && k->event.user[1] == uk->user[1] &&
+ k->event.id.raw[0] == uk->id.raw[0] &&
+ k->event.id.raw[1] == uk->id.raw[1]) {
+ ret = k;
+ spin_unlock(&k->ulock);
+ break;
+ }
+ spin_unlock(&k->ulock);
+ }
+
+ return ret;
+}
+
+/*
+ * Search and modify kevent according to provided ukevent.
+ */
+static int kevent_modify(struct ukevent *uk, struct kevent_user *u)
+{
+ struct kevent *k;
+ unsigned int hash = kevent_user_hash(uk);
+ int err = -ENODEV;
+ unsigned long flags;
+
+ spin_lock_irqsave(&u->kevent_lock, flags);
+ k = __kevent_search(&u->kevent_list[hash], uk, u);
+ if (k) {
+ spin_lock(&k->ulock);
+ k->event.event = uk->event;
+ k->event.req_flags = uk->req_flags;
+ k->event.ret_flags = 0;
+ spin_unlock(&k->ulock);
+ kevent_requeue(k);
+ err = 0;
+ }
+ spin_unlock_irqrestore(&u->kevent_lock, flags);
+
+ return err;
+}
+
+/*
+ * Remove kevent which matches provided ukevent.
+ */
+static int kevent_remove(struct ukevent *uk, struct kevent_user *u)
+{
+ int err = -ENODEV;
+ struct kevent *k;
+ unsigned int hash = kevent_user_hash(uk);
+ unsigned long flags;
+
+ spin_lock_irqsave(&u->kevent_lock, flags);
+ k = __kevent_search(&u->kevent_list[hash], uk, u);
+ if (k) {
+ __kevent_finish_user(k, 1);
+ err = 0;
+ }
+ spin_unlock_irqrestore(&u->kevent_lock, flags);
+
+ return err;
+}
+
+/*
+ * Detaches userspace control block from file descriptor
+ * and decreases its reference counter.
+ * No new kevents can be added or removed from any list at this point.
+ */
+static int kevent_user_release(struct inode *inode, struct file *file)
+{
+ struct kevent_user *u = file->private_data;
+ struct kevent *k, *n;
+ int i;
+
+ for (i=0; i<ARRAY_SIZE(u->kevent_list); ++i) {
+ list_for_each_entry_safe(k, n, &u->kevent_list[i], kevent_entry)
+ kevent_finish_user(k, 1);
+ }
+
+ kevent_user_put(u);
+ file->private_data = NULL;
+
+ return 0;
+}
+
+/*
+ * Read requested number of ukevents in one shot.
+ */
+static struct ukevent *kevent_get_user(unsigned int num, void __user *arg)
+{
+ struct ukevent *ukev;
+
+ ukev = kmalloc(sizeof(struct ukevent) * num, GFP_KERNEL);
+ if (!ukev)
+ return NULL;
+
+ if (copy_from_user(ukev, arg, sizeof(struct ukevent) * num)) {
+ kfree(ukev);
+ return NULL;
+ }
+
+ return ukev;
+}
+
+/*
+ * Read from userspace all ukevents and modify appropriate kevents.
+ * If the provided number of ukevents is more than the threshold, it is faster
+ * to allocate room for them and copy them in one shot instead of copying
+ * one-by-one and then processing them.
+ */
+static int kevent_user_ctl_modify(struct kevent_user *u, unsigned int num, void __user *arg)
+{
+ int err = 0, i;
+ struct ukevent uk;
+
+ mutex_lock(&u->ctl_mutex);
+
+ if (num > u->kevent_num) {
+ err = -EINVAL;
+ goto out;
+ }
+
+ if (num > KEVENT_MIN_BUFFS_ALLOC) {
+ struct ukevent *ukev;
+
+ ukev = kevent_get_user(num, arg);
+ if (ukev) {
+ for (i=0; i<num; ++i) {
+ if (kevent_modify(&ukev[i], u))
+ ukev[i].ret_flags |= KEVENT_RET_BROKEN;
+ ukev[i].ret_flags |= KEVENT_RET_DONE;
+ }
+ if (copy_to_user(arg, ukev, num*sizeof(struct ukevent)))
+ err = -EFAULT;
+ kfree(ukev);
+ goto out;
+ }
+ }
+
+ for (i=0; i<num; ++i) {
+ if (copy_from_user(&uk, arg, sizeof(struct ukevent))) {
+ err = -EFAULT;
+ break;
+ }
+
+ if (kevent_modify(&uk, u))
+ uk.ret_flags |= KEVENT_RET_BROKEN;
+ uk.ret_flags |= KEVENT_RET_DONE;
+
+ if (copy_to_user(arg, &uk, sizeof(struct ukevent))) {
+ err = -EFAULT;
+ break;
+ }
+
+ arg += sizeof(struct ukevent);
+ }
+out:
+ mutex_unlock(&u->ctl_mutex);
+
+ return err;
+}
+
+/*
+ * Read from userspace all ukevents and remove appropriate kevents.
+ * If the provided number of ukevents exceeds the threshold, it is faster
+ * to allocate room for them and copy them in one shot instead of copying
+ * them one by one and then processing them.
+ */
+static int kevent_user_ctl_remove(struct kevent_user *u, unsigned int num, void __user *arg)
+{
+ int err = 0, i;
+ struct ukevent uk;
+
+ mutex_lock(&u->ctl_mutex);
+
+ if (num > u->kevent_num) {
+ err = -EINVAL;
+ goto out;
+ }
+
+ if (num > KEVENT_MIN_BUFFS_ALLOC) {
+ struct ukevent *ukev;
+
+ ukev = kevent_get_user(num, arg);
+ if (ukev) {
+ for (i=0; i<num; ++i) {
+ if (kevent_remove(&ukev[i], u))
+ ukev[i].ret_flags |= KEVENT_RET_BROKEN;
+ ukev[i].ret_flags |= KEVENT_RET_DONE;
+ }
+ if (copy_to_user(arg, ukev, num*sizeof(struct ukevent)))
+ err = -EFAULT;
+ kfree(ukev);
+ goto out;
+ }
+ }
+
+ for (i=0; i<num; ++i) {
+ if (copy_from_user(&uk, arg, sizeof(struct ukevent))) {
+ err = -EFAULT;
+ break;
+ }
+
+ if (kevent_remove(&uk, u))
+ uk.ret_flags |= KEVENT_RET_BROKEN;
+
+ uk.ret_flags |= KEVENT_RET_DONE;
+
+ if (copy_to_user(arg, &uk, sizeof(struct ukevent))) {
+ err = -EFAULT;
+ break;
+ }
+
+ arg += sizeof(struct ukevent);
+ }
+out:
+ mutex_unlock(&u->ctl_mutex);
+
+ return err;
+}
+
+/*
+ * Queue kevent into userspace control block and increase
+ * its reference counter.
+ */
+static void kevent_user_enqueue(struct kevent_user *u, struct kevent *k)
+{
+ unsigned long flags;
+ unsigned int hash = kevent_user_hash(&k->event);
+
+ spin_lock_irqsave(&u->kevent_lock, flags);
+ list_add_tail(&k->kevent_entry, &u->kevent_list[hash]);
+ k->flags |= KEVENT_USER;
+ u->kevent_num++;
+ kevent_user_get(u);
+ spin_unlock_irqrestore(&u->kevent_lock, flags);
+}
+
+/*
+ * Add kevent on behalf of both kernel and userspace users.
+ * This function allocates and queues a kevent; it returns a negative value
+ * on error, a positive value if the kevent is ready immediately and zero
+ * if the kevent has been queued.
+ */
+int kevent_user_add_ukevent(struct ukevent *uk, struct kevent_user *u)
+{
+ struct kevent *k;
+ int err;
+
+ k = kmem_cache_alloc(kevent_cache, GFP_KERNEL);
+ if (!k) {
+ err = -ENOMEM;
+ goto err_out_exit;
+ }
+
+ memcpy(&k->event, uk, sizeof(struct ukevent));
+ INIT_RCU_HEAD(&k->rcu_head);
+
+ k->event.ret_flags = 0;
+
+ err = kevent_init(k);
+ if (err) {
+ kmem_cache_free(kevent_cache, k);
+ goto err_out_exit;
+ }
+ k->user = u;
+ kevent_stat_total(u);
+ kevent_user_enqueue(u, k);
+
+ err = kevent_enqueue(k);
+ if (err) {
+ memcpy(uk, &k->event, sizeof(struct ukevent));
+ if (err < 0)
+ uk->ret_flags |= KEVENT_RET_BROKEN;
+ uk->ret_flags |= KEVENT_RET_DONE;
+ kevent_finish_user(k, 0);
+ }
+
+err_out_exit:
+ return err;
+}
+
+/*
+ * Copy all ukevents from userspace, allocate kevent for each one
+ * and add them into appropriate kevent_storages,
+ * e.g. sockets, inodes and so on...
+ * Ready events will replace the ones provided by the user, and the
+ * number of ready events is returned.
+ * User must check ret_flags field of each ukevent structure
+ * to determine if it is fired or failed event.
+ */
+static int kevent_user_ctl_add(struct kevent_user *u, unsigned int num, void __user *arg)
+{
+ int err, cerr = 0, knum = 0, rnum = 0, i;
+ void __user *orig = arg;
+ struct ukevent uk;
+
+ mutex_lock(&u->ctl_mutex);
+
+ err = -EINVAL;
+ if (u->kevent_num + num >= KEVENT_MAX_EVENTS)
+ goto out_remove;
+
+ if (num > KEVENT_MIN_BUFFS_ALLOC) {
+ struct ukevent *ukev;
+
+ ukev = kevent_get_user(num, arg);
+ if (ukev) {
+ for (i=0; i<num; ++i) {
+ err = kevent_user_add_ukevent(&ukev[i], u);
+ if (err) {
+ kevent_stat_im(u);
+ if (i != rnum)
+ memcpy(&ukev[rnum], &ukev[i], sizeof(struct ukevent));
+ rnum++;
+ } else
+ knum++;
+ }
+ if (copy_to_user(orig, ukev, rnum*sizeof(struct ukevent)))
+ cerr = -EFAULT;
+ kfree(ukev);
+ goto out_setup;
+ }
+ }
+
+ for (i=0; i<num; ++i) {
+ if (copy_from_user(&uk, arg, sizeof(struct ukevent))) {
+ cerr = -EFAULT;
+ break;
+ }
+ arg += sizeof(struct ukevent);
+
+ err = kevent_user_add_ukevent(&uk, u);
+ if (err) {
+ kevent_stat_im(u);
+ if (copy_to_user(orig, &uk, sizeof(struct ukevent))) {
+ cerr = -EFAULT;
+ break;
+ }
+ orig += sizeof(struct ukevent);
+ rnum++;
+ } else
+ knum++;
+ }
+
+out_setup:
+ if (cerr < 0) {
+ err = cerr;
+ goto out_remove;
+ }
+
+ err = rnum;
+out_remove:
+ mutex_unlock(&u->ctl_mutex);
+
+ return err;
+}
+
+/*
+ * In nonblocking mode it returns as many events as possible, but not more than @max_nr.
+ * In blocking mode it waits until the timeout expires or at least @min_nr events are ready.
+ */
+static int kevent_user_wait(struct file *file, struct kevent_user *u,
+ unsigned int min_nr, unsigned int max_nr, unsigned int timeout,
+ void __user *buf)
+{
+ struct kevent *k;
+ int num = 0;
+
+ if (!(file->f_flags & O_NONBLOCK)) {
+ wait_event_interruptible_timeout(u->wait,
+ u->ready_num >= min_nr, msecs_to_jiffies(timeout));
+ }
+
+ while (num < max_nr && ((k = kqueue_dequeue_ready(u)) != NULL)) {
+ if (copy_to_user(buf + num*sizeof(struct ukevent),
+ &k->event, sizeof(struct ukevent)))
+ break;
+
+ /*
+ * If it is one-shot kevent, it has been removed already from
+ * origin's queue, so we can easily free it here.
+ */
+ if (k->event.req_flags & KEVENT_REQ_ONESHOT)
+ kevent_finish_user(k, 1);
+ ++num;
+ kevent_stat_wait(u);
+ }
+
+ return num;
+}
+
+/*
+ * Userspace control block creation and initialization.
+ */
+static int kevent_ctl_init(void)
+{
+ struct kevent_user *u;
+ struct file *file;
+ int fd, ret;
+
+ fd = get_unused_fd();
+ if (fd < 0)
+ return fd;
+
+ file = get_empty_filp();
+ if (!file) {
+ ret = -ENFILE;
+ goto out_put_fd;
+ }
+
+ u = kevent_user_alloc();
+ if (unlikely(!u)) {
+ ret = -ENOMEM;
+ goto out_put_file;
+ }
+
+ file->f_op = &kevent_user_fops;
+ file->f_vfsmnt = mntget(kevent_mnt);
+ file->f_dentry = dget(kevent_mnt->mnt_root);
+ file->f_mapping = file->f_dentry->d_inode->i_mapping;
+ file->f_mode = FMODE_READ;
+ file->f_flags = O_RDONLY;
+ file->private_data = u;
+
+ fd_install(fd, file);
+
+ return fd;
+
+out_put_file:
+ put_filp(file);
+out_put_fd:
+ put_unused_fd(fd);
+ return ret;
+}
+
+static int kevent_ctl_process(struct file *file, unsigned int cmd, unsigned int num, void __user *arg)
+{
+ int err;
+ struct kevent_user *u = file->private_data;
+
+ if (!u || num > KEVENT_MAX_EVENTS)
+ return -EINVAL;
+
+ switch (cmd) {
+ case KEVENT_CTL_ADD:
+ err = kevent_user_ctl_add(u, num, arg);
+ break;
+ case KEVENT_CTL_REMOVE:
+ err = kevent_user_ctl_remove(u, num, arg);
+ break;
+ case KEVENT_CTL_MODIFY:
+ err = kevent_user_ctl_modify(u, num, arg);
+ break;
+ default:
+ err = -EINVAL;
+ break;
+ }
+
+ return err;
+}
+
+/*
+ * Used to get ready kevents from queue.
+ * @ctl_fd - kevent control descriptor which must be obtained through kevent_ctl(KEVENT_CTL_INIT).
+ * @min_nr - minimum number of ready kevents.
+ * @max_nr - maximum number of ready kevents.
+ * @timeout - timeout in milliseconds to wait until some events are ready.
+ * @buf - buffer to place ready events.
+ * @flags - unused for now (will be used for the mmap implementation).
+ */
+asmlinkage long sys_kevent_get_events(int ctl_fd, unsigned int min_nr, unsigned int max_nr,
+ unsigned int timeout, void __user *buf, unsigned flags)
+{
+ int err = -EINVAL;
+ struct file *file;
+ struct kevent_user *u;
+
+ file = fget(ctl_fd);
+ if (!file)
+ return -ENODEV;
+
+ if (file->f_op != &kevent_user_fops)
+ goto out_fput;
+ u = file->private_data;
+
+ err = kevent_user_wait(file, u, min_nr, max_nr, timeout, buf);
+out_fput:
+ fput(file);
+ return err;
+}
+
+/*
+ * This syscall is used to perform various control operations
+ * on given kevent queue, which is obtained through kevent file descriptor @fd.
+ * @cmd - type of operation.
+ * @num - number of kevents to be processed.
+ * @arg - pointer to array of struct ukevent.
+ */
+asmlinkage long sys_kevent_ctl(int fd, unsigned int cmd, unsigned int num, void __user *arg)
+{
+ int err = -EINVAL;
+ struct file *file;
+
+ if (cmd == KEVENT_CTL_INIT)
+ return kevent_ctl_init();
+
+ file = fget(fd);
+ if (!file)
+ return -ENODEV;
+
+ if (file->f_op != &kevent_user_fops)
+ goto out_fput;
+
+ err = kevent_ctl_process(file, cmd, num, arg);
+
+out_fput:
+ fput(file);
+ return err;
+}
+
+/*
+ * Kevent subsystem initialization - create kevent cache and register
+ * filesystem to get control file descriptors from.
+ */
+static int __devinit kevent_user_init(void)
+{
+ int err = 0;
+
+ err = register_filesystem(&kevent_fs_type);
+ if (err)
+ panic("%s: failed to register filesystem: err=%d.\n",
+ kevent_name, err);
+
+ kevent_mnt = kern_mount(&kevent_fs_type);
+ if (IS_ERR(kevent_mnt))
+ panic("%s: failed to mount silesystem: err=%ld.\n",
+ kevent_name, PTR_ERR(kevent_mnt));
+
+ err = misc_register(&kevent_miscdev);
+ if (err) {
+ printk(KERN_ERR "Failed to register kevent miscdev: err=%d.\n", err);
+ goto err_out_exit;
+ }
+
+ printk("KEVENT subsystem has been successfully registered.\n");
+
+ return 0;
+
+err_out_exit:
+ mntput(kevent_mnt);
+ unregister_filesystem(&kevent_fs_type);
+
+ return err;
+}
+
+static void __devexit kevent_user_fini(void)
+{
+ misc_deregister(&kevent_miscdev);
+ mntput(kevent_mnt);
+ unregister_filesystem(&kevent_fs_type);
+}
+
+module_init(kevent_user_init);
+module_exit(kevent_user_fini);
diff --git a/kernel/sys_ni.c b/kernel/sys_ni.c
index 6991bec..8d3769b 100644
--- a/kernel/sys_ni.c
+++ b/kernel/sys_ni.c
@@ -122,6 +122,9 @@ cond_syscall(ppc_rtas);
cond_syscall(sys_spu_run);
cond_syscall(sys_spu_create);

+cond_syscall(sys_kevent_get_events);
+cond_syscall(sys_kevent_ctl);
+
/* mmu depending weak syscall entries */
cond_syscall(sys_mprotect);
cond_syscall(sys_msync);

--
Evgeniy Polyakov
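
For readers following the interface only from this mail, here is a minimal
userspace sketch of how the two new syscalls might be driven. It is not part
of the patch: the struct ukevent layout and the KEVENT_* constants are
assumptions reproduced from the core patch of this series (the numeric values
below are placeholders), and the syscall numbers are likewise placeholders
for whatever the architecture's table assigns.

/*
 * Editorial sketch, not part of the patchset: drive kevent_ctl() and
 * kevent_get_events() from userspace.  All KEVENT_* values, the ukevent
 * layout and the syscall numbers are assumed/placeholder.
 */
#include <sys/syscall.h>
#include <unistd.h>
#include <string.h>
#include <stdio.h>
#include <linux/types.h>

#define __NR_kevent_ctl		318	/* placeholder */
#define __NR_kevent_get_events	319	/* placeholder */

#define KEVENT_CTL_ADD		0	/* values assumed, must match the kernel */
#define KEVENT_CTL_REMOVE	1
#define KEVENT_CTL_MODIFY	2
#define KEVENT_CTL_INIT		3

struct ukevent {			/* ~40 bytes, layout assumed from the core patch */
	struct { __u32 raw[2]; } id;	/* origin-specific id, e.g. a file descriptor */
	__u32 type;			/* event origin: poll, timer, ... */
	__u32 event;			/* requested events, e.g. POLLIN for the poll origin */
	__u32 req_flags;		/* per-event flags such as KEVENT_REQ_ONESHOT */
	__u32 ret_flags;		/* KEVENT_RET_DONE/KEVENT_RET_BROKEN on completion */
	__u32 ret_data[2];		/* origin-specific return data */
	union {
		__u32 user[2];		/* opaque cookie, echoed back to the caller */
		void *ptr;
	};
};

int main(void)
{
	struct ukevent uk;
	long kfd, num;

	/* KEVENT_CTL_INIT ignores the fd argument and returns a new kevent descriptor. */
	kfd = syscall(__NR_kevent_ctl, -1, KEVENT_CTL_INIT, 0, NULL);
	if (kfd < 0)
		return 1;

	memset(&uk, 0, sizeof(uk));
	/* Fill uk.type, uk.event, uk.id and uk.user[] for the desired origin here. */

	/* Returns the number of immediately ready events, 0 if queued, negative on error. */
	num = syscall(__NR_kevent_ctl, kfd, KEVENT_CTL_ADD, 1, &uk);
	if (num < 0)
		return 1;

	/* Wait up to 1000 msec for at least one and at most one ready event. */
	num = syscall(__NR_kevent_get_events, kfd, 1, 1, 1000, &uk, 0);
	if (num > 0)
		printf("event completed, ret_flags 0x%x\n", uk.ret_flags);

	return 0;
}

Note that, per kevent_user_ctl_add() above, KEVENT_CTL_ADD copies back only
the entries that completed or broke immediately, packed at the start of the
caller's array, and returns their count; the user[] cookie is what ties a
returned event back to its request.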

2006-08-10 12:23:06

by Evgeniy Polyakov

[permalink] [raw]
Subject: Re: [take7 1/1] kevent: core files and timer/poll notifications.

On Thu, Aug 10, 2006 at 04:16:38PM +0400, Evgeniy Polyakov ([email protected]) wrote:
> With this patchset rate of requests per second has achieved 2500 req/sec
> while with epoll/kqueue and similar techniques it is about 1600-1800
> requests per second on my test hardware and trivial web server.

Nope, that is an old record from the archives... The current one is 2600+.

--
Evgeniy Polyakov

2006-08-11 00:57:07

by Andrew Morton

[permalink] [raw]
Subject: Re: [take6 1/3] kevent: Core files.

On Thu, 10 Aug 2006 12:22:35 +0400
Evgeniy Polyakov <[email protected]> wrote:

> On Thu, Aug 10, 2006 at 01:02:54AM -0700, Andrew Morton ([email protected]) wrote:
> > > > Afaict this mmap function gives a user a free way of getting pinned memory.
> > > > What is the upper bound on the amount of memory which a user can thus
> > > > obtain?
> > >
> > > it is limited by maximum queue length which is 4k entries right now, so
> > > maximum number of paged here is 4k*40/page_size, i.e. about 40 pages on
> > > x86.
> >
> > Is that per user or per fd? If the latter that is, with the usual
> > RLIMIT_NOFILE, 160MBytes. 2GB with 64k pagesize. Problem ;)
>
> Per kevent fd.
> I have some ideas about better mmap ring implementation, which would
> dinamically grow it's buffer when events are added and reuse the same
> place for next events, but there are some nitpics unresolved yet.
> Let's not see there in next releases (no merge of course), until better
> solution is ready. I will change that area when other things are ready.

This is not a problem with the mmap interface per-se. If the proposed
event code permits each user to pin 160MB of kernel memory then that would
be a serious problem.
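
For reference, the arithmetic behind that figure from the numbers quoted
above: a 4096-entry queue at roughly 40 bytes per event is about 160KB
(some 40 pages on x86) of pinned memory per kevent fd, and with the usual
RLIMIT_NOFILE of 1024 descriptors that comes to roughly 160MB per user.
Since the mapping is sized in whole pages, a larger page size (e.g. 64KB)
inflates the per-fd cost, and with it the per-user worst case, accordingly.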


2006-08-11 06:16:18

by Evgeniy Polyakov

[permalink] [raw]
Subject: Re: [take6 1/3] kevent: Core files.

On Thu, Aug 10, 2006 at 05:56:39PM -0700, Andrew Morton ([email protected]) wrote:
> > Per kevent fd.
> > I have some ideas about better mmap ring implementation, which would
> > dinamically grow it's buffer when events are added and reuse the same
> > place for next events, but there are some nitpics unresolved yet.
> > Let's not see there in next releases (no merge of course), until better
> > solution is ready. I will change that area when other things are ready.
>
> This is not a problem with the mmap interface per-se. If the proposed
> event code permits each user to pin 160MB of kernel memory then that would
> be a serious problem.

The main disadvantage is that all memory is allocated at the start even
if it will not be used later. I think dynamic growth is an appropriate
solution, since the user will end up using that memory anyway: kevents
are allocated in any case, just part of them will be allocated from the
possibly mmapped memory.

--
Evgeniy Polyakov

2006-08-11 06:24:13

by Andrew Morton

[permalink] [raw]
Subject: Re: [take6 1/3] kevent: Core files.

On Fri, 11 Aug 2006 10:15:35 +0400
Evgeniy Polyakov <[email protected]> wrote:

> On Thu, Aug 10, 2006 at 05:56:39PM -0700, Andrew Morton ([email protected]) wrote:
> > > Per kevent fd.
> > > I have some ideas about better mmap ring implementation, which would
> > > dinamically grow it's buffer when events are added and reuse the same
> > > place for next events, but there are some nitpics unresolved yet.
> > > Let's not see there in next releases (no merge of course), until better
> > > solution is ready. I will change that area when other things are ready.
> >
> > This is not a problem with the mmap interface per-se. If the proposed
> > event code permits each user to pin 160MB of kernel memory then that would
> > be a serious problem.
>
> The main disadvantage is that all memory is allocated on the start even
> if it will not be used later. I think dynamic grow is appropriate
> solution, since user will have that memory used anyway, since kevents
> are allocated, just part of them will be allocated from possibly
> mmaped memory.

But the worst-case remains the same, doesn't it? 160MB of pinned kernel
memory per user?

2006-08-11 06:25:36

by Ulrich Drepper

[permalink] [raw]
Subject: Re: [take6 1/3] kevent: Core files.

Evgeniy Polyakov wrote:
> The main disadvantage is that all memory is allocated on the start even
> if it will not be used later. I think dynamic grow is appropriate
> solution, since user will have that memory used anyway, since kevents
> are allocated,

If you _allocate_ memory at startup you're doing something wrong. All
you should do is allocate address space. Memory should be allocated
when it is needed.

Growing a memory region is always hard because it means you cannot keep
any addresses around and always have to reload a base pointer. That's
not ideal.

Especially on 64-bit machines address space really is no limitation
anymore. So, allocate as much as needed, allocate memory when it's
needed, and don't resize.

--
➧ Ulrich Drepper ➧ Red Hat, Inc. ➧ 444 Castro St ➧ Mountain View, CA ❖


2006-08-11 06:31:08

by Evgeniy Polyakov

[permalink] [raw]
Subject: Re: [take6 1/3] kevent: Core files.

On Thu, Aug 10, 2006 at 11:23:40PM -0700, Andrew Morton ([email protected]) wrote:
> On Fri, 11 Aug 2006 10:15:35 +0400
> Evgeniy Polyakov <[email protected]> wrote:
>
> > On Thu, Aug 10, 2006 at 05:56:39PM -0700, Andrew Morton ([email protected]) wrote:
> > > > Per kevent fd.
> > > > I have some ideas about better mmap ring implementation, which would
> > > > dinamically grow it's buffer when events are added and reuse the same
> > > > place for next events, but there are some nitpics unresolved yet.
> > > > Let's not see there in next releases (no merge of course), until better
> > > > solution is ready. I will change that area when other things are ready.
> > >
> > > This is not a problem with the mmap interface per-se. If the proposed
> > > event code permits each user to pin 160MB of kernel memory then that would
> > > be a serious problem.
> >
> > The main disadvantage is that all memory is allocated on the start even
> > if it will not be used later. I think dynamic grow is appropriate
> > solution, since user will have that memory used anyway, since kevents
> > are allocated, just part of them will be allocated from possibly
> > mmaped memory.
>
> But the worst-case remains the same, doesn't it? 160MB of pinned kernel
> memory per user?

Yes. And now I think dynamic growing is not a good solution, since the user
cannot know when he must call mmap() again to get additional pages
(although I have some hacks to "dynamically" replace previously mmapped
pages with new ones).

This area can be decreased to 70MB by reducing the amount of
information placed into the buffer (only the user's data and flags) without
additional hints.

--
Evgeniy Polyakov

2006-08-11 06:34:41

by Evgeniy Polyakov

[permalink] [raw]
Subject: Re: [take6 1/3] kevent: Core files.

On Thu, Aug 10, 2006 at 11:25:05PM -0700, Ulrich Drepper ([email protected]) wrote:
> Evgeniy Polyakov wrote:
> > The main disadvantage is that all memory is allocated on the start even
> > if it will not be used later. I think dynamic grow is appropriate
> > solution, since user will have that memory used anyway, since kevents
> > are allocated,
>
> If you _allocate_ memory at startup you're doing something wrong. All
> you should do is allocate address space. Memory should be allocated
> when it is needed.
>
> Growing a memory region is always hard because it means you cannot keep
> any addresses around and always have to reload a base pointer. That's
> not ideal.
>
> Especially on 64-bit machines address space really is no limitation
> anymore. So, allocate as much as needed, allocate memory when it's
> needed, and don't resize.

That requires mmap hacks to substitute pages at run time without user
notification. I do not expect it to be a good solution, since on x86 it
requires a full TLB flush (at least when I did it there were no exported
methods to flush separate addresses).

> --
> ➧ Ulrich Drepper ➧ Red Hat, Inc. ➧ 444 Castro St ➧ Mountain View, CA ❖
>



--
Evgeniy Polyakov

2006-08-11 06:38:18

by David Miller

[permalink] [raw]
Subject: Re: [take6 1/3] kevent: Core files.

From: Evgeniy Polyakov <[email protected]>
Date: Fri, 11 Aug 2006 10:33:53 +0400

> That requires mmap hacks to substitute pages in run-time without user
> notifications. I do not expect it is a good solution, since on x86 it
> requires full TLB flush (at least when I did it there were no exported
> methods to flush separate addresses).

You just need to provide a do_no_page method, the VM layer will
take care of the page level flushing or whatever else might be
needed.
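
To make that suggestion concrete, a rough sketch of a lazily populated mmap
ring (an editorial illustration, not code from the patchset): the pring[]
page array and kevent_max_pages limit are illustrative names assumed for a
kevent_user-like structure, and a real implementation would need locking
against concurrent faults on the same slot. As noted, the VM layer handles
the per-page flushing when do_no_page() installs the returned page.

/*
 * Sketch only: populate the mmapped ring from a ->nopage handler instead
 * of allocating every page at mmap() time.  pring[] and kevent_max_pages
 * are assumed names, not the patchset's.
 */
#include <linux/mm.h>
#include <linux/fs.h>

static struct page *kevent_user_nopage(struct vm_area_struct *vma,
		unsigned long address, int *type)
{
	struct kevent_user *u = vma->vm_file->private_data;
	unsigned long off = (address - vma->vm_start) >> PAGE_SHIFT;
	struct page *page;

	if (off >= kevent_max_pages)
		return NOPAGE_SIGBUS;

	/* Allocate the backing page only when userspace first touches it. */
	if (!u->pring[off]) {
		u->pring[off] = (void *)__get_free_page(GFP_KERNEL | __GFP_ZERO);
		if (!u->pring[off])
			return NOPAGE_OOM;
	}

	page = virt_to_page(u->pring[off]);
	get_page(page);
	if (type)
		*type = VM_FAULT_MINOR;
	return page;
}

static struct vm_operations_struct kevent_user_vm_ops = {
	.nopage = kevent_user_nopage,
};

static int kevent_user_mmap(struct file *file, struct vm_area_struct *vma)
{
	/* No upfront allocation: just install the lazy fault handler. */
	vma->vm_ops = &kevent_user_vm_ops;
	vma->vm_flags |= VM_RESERVED;
	return 0;
}

kevent_user_mmap() would then be hooked in as the .mmap method of
kevent_user_fops in place of any eager allocation.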

2006-08-11 06:55:41

by Evgeniy Polyakov

[permalink] [raw]
Subject: Re: [take6 1/3] kevent: Core files.

On Thu, Aug 10, 2006 at 11:38:26PM -0700, David Miller ([email protected]) wrote:
> From: Evgeniy Polyakov <[email protected]>
> Date: Fri, 11 Aug 2006 10:33:53 +0400
>
> > That requires mmap hacks to substitute pages in run-time without user
> > notifications. I do not expect it is a good solution, since on x86 it
> > requires full TLB flush (at least when I did it there were no exported
> > methods to flush separate addresses).
>
> You just need to provide a do_no_page method, the VM layer will
> take care of the page level flushing or whatever else might be
> needed.

Yes, it is the simplest way to extend a mapping, though not to replace pages
which are already successfully mapped; but such hacks are not needed for
kevent, which only expects to extend the mapping when the number of ready
kevents increases.

So I will create such an implementation and will place a reduced amount of
info into those pages.

--
Evgeniy Polyakov

2006-08-11 07:05:21

by Andrew Morton

[permalink] [raw]
Subject: Re: [take6 1/3] kevent: Core files.

On Fri, 11 Aug 2006 10:30:21 +0400
Evgeniy Polyakov <[email protected]> wrote:

> On Thu, Aug 10, 2006 at 11:23:40PM -0700, Andrew Morton ([email protected]) wrote:
> > On Fri, 11 Aug 2006 10:15:35 +0400
> > Evgeniy Polyakov <[email protected]> wrote:
> >
> > > On Thu, Aug 10, 2006 at 05:56:39PM -0700, Andrew Morton ([email protected]) wrote:
> > > > > Per kevent fd.
> > > > > I have some ideas about better mmap ring implementation, which would
> > > > > dinamically grow it's buffer when events are added and reuse the same
> > > > > place for next events, but there are some nitpics unresolved yet.
> > > > > Let's not see there in next releases (no merge of course), until better
> > > > > solution is ready. I will change that area when other things are ready.
> > > >
> > > > This is not a problem with the mmap interface per-se. If the proposed
> > > > event code permits each user to pin 160MB of kernel memory then that would
> > > > be a serious problem.
> > >
> > > The main disadvantage is that all memory is allocated on the start even
> > > if it will not be used later. I think dynamic grow is appropriate
> > > solution, since user will have that memory used anyway, since kevents
> > > are allocated, just part of them will be allocated from possibly
> > > mmaped memory.
> >
> > But the worst-case remains the same, doesn't it? 160MB of pinned kernel
> > memory per user?
>
> Yes. And now I think dynamic growing is not a good solution, since user
> can not know when he must call mmap() again to get additional pages
> (although I have some hacks to "dynamically" replace previously mmapped
> pages with new ones).
>
> This area can be decreased down to 70mb by reducing amount of
> information placed into the buffer (only user's data and flags) without
> additional hints.
>

70MB is still very bad, naturally.

There are other ways in which users can do this sort of thing - passing
fd's across sockets, allocating zillions of pagetables come to mind. But
we don't want to add more.

Possible options:

- Add a new rlimit for the number of kevent fd's

- Add a new rlimit for the amount of kevent memory

- Add a new rlimit for the total amount of pinned kernel memory. First
user is kevent.

- Account a kevent fd as being worth 100 regular fds, so the naughty user
hits EMFILE early (ug).

A new rlimit is attractive, and they're easy to add. Problem is, userspace
support is hard (I think). afaik a standard Linux system doesn't have
global and per-user rlimit config files which are parsed and acted upon at
login. That would make rlimits more useful.
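
As a rough illustration of the second option above (an editorial sketch, not
code from the patchset), pinned kevent memory could be charged to the
allocating user the same way the RLIMIT_MSGQUEUE accounting charges POSIX
message queue memory in this kernel. RLIMIT_KEVENT, the kevent_bytes counter
in user_struct and the charged/owner fields of kevent_user are all
hypothetical names here.

/*
 * Sketch only: RLIMIT_KEVENT, user_struct->kevent_bytes and the
 * charged/owner fields of kevent_user do not exist in the patchset;
 * the pattern mirrors the RLIMIT_MSGQUEUE accounting in ipc/mqueue.c.
 */
#include <linux/sched.h>
#include <linux/spinlock.h>

static DEFINE_SPINLOCK(kevent_charge_lock);

/* Charge the pinned ring memory of a new kevent descriptor to its creator. */
static int kevent_charge_user(struct kevent_user *u, unsigned long bytes)
{
	struct user_struct *user = current->user;
	unsigned long limit = current->signal->rlim[RLIMIT_KEVENT].rlim_cur;
	int err = 0;

	spin_lock(&kevent_charge_lock);
	if (user->kevent_bytes + bytes > limit) {
		err = -ENOMEM;
	} else {
		user->kevent_bytes += bytes;
		u->charged = bytes;
		u->owner = get_uid(user);	/* uncharge against the same user later */
	}
	spin_unlock(&kevent_charge_lock);

	return err;
}

/* Undo the charge when the kevent descriptor is released. */
static void kevent_uncharge_user(struct kevent_user *u)
{
	spin_lock(&kevent_charge_lock);
	u->owner->kevent_bytes -= u->charged;
	spin_unlock(&kevent_charge_lock);
	free_uid(u->owner);
}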

2006-08-11 07:28:26

by Evgeniy Polyakov

[permalink] [raw]
Subject: Re: [take6 1/3] kevent: Core files.

On Fri, Aug 11, 2006 at 12:04:54AM -0700, Andrew Morton ([email protected]) wrote:
> > This area can be decreased down to 70mb by reducing amount of
> > information placed into the buffer (only user's data and flags) without
> > additional hints.
> >
>
> 70MB is still very bad, naturally.

Actually I do not think that 4k events is a good choice - I expect people
will scale it to tens of thousands at least, so we definitely do not want
to allow a user to create way too many kevent fds.

> There are other ways in which users can do this sort of thing - passing
> fd's across sockets, allocating zillions of pagetables come to mind. But
> we don't want to add more.
>
> Possible options:
>
> - Add a new rlimit for the number of kevent fd's
>
> - Add a new rlimit for the amount of kevent memory
>
> - Add a new rlimit for the total amount of pinned kernel memory. First
> user is kevent.

I think this rlimit and the first one are the best choices.

> - Account a kevent fd as being worth 100 regular fds, so the naughty user
> hits EMFILE early (ug).
>
> A new rlimit is attractive, and they're easy to add. Problem is, userspace
> support is hard (I think). afaik a standard Linux system doesn't have
> global and per-user rlimit config files which are parsed and acted upon at
> login. That would make rlimits more useful.

For now it is possible to use the stack size rlimit, for example.

--
Evgeniy Polyakov