2002-10-28 19:14:12

by Hanna Linder

Subject: [PATCH] epoll more scalable than poll

--On Friday, October 18, 2002 15:05:50 -0700 Linus Torvalds <[email protected]> wrote:

> I like it noticeably better as a system call, so it's maybe worth
> discussing. It's not going to happen before I leave (very early tomorrow
> morning), but if people involved agree on this and clean patches to
> actually add the code (not just system call stubs) can be made..
>
> Linus

Linus,

The results of our testing show not only that the system call
interface to epoll performs as well as the /dev interface, but also that epoll
is many times more scalable than standard poll. No other implementation of poll
has performed as well as epoll by any measure. Testing details and results
are published here; please take a minute to check them out: http://lse.sourceforge.net/epoll/index.html
Davide Libenzi finished the system call interface to epoll, including
the changes Andrew Morton requested early last week. See that thread here: http://marc.theaimsgroup.com/?t=103516170500003&r=1&w=2
Please consider sys_epoll for inclusion in the next 2.5 kernel release.
The results clearly show that epoll is the most scalable of the existing poll implementations, and that its impact on existing code is minimal.

Thank you.

Hanna

ps- Did I mention there is a web site? http://lse.sf.net/epoll/index.html

-----
diff -Nru linux-2.5.44.vanilla/arch/i386/kernel/entry.S linux-2.5.44.epoll/arch/i386/kernel/entry.S
--- linux-2.5.44.vanilla/arch/i386/kernel/entry.S Fri Oct 18 21:01:19 2002
+++ linux-2.5.44.epoll/arch/i386/kernel/entry.S Sat Oct 19 21:16:19 2002
@@ -737,6 +737,10 @@
.long sys_free_hugepages
.long sys_exit_group
.long sys_lookup_dcookie
+ .long sys_epoll_create
+ .long sys_epoll_ctl /* 255 */
+ .long sys_epoll_wait
+

.rept NR_syscalls-(.-sys_call_table)/4
.long sys_ni_syscall
diff -Nru linux-2.5.44.vanilla/drivers/char/Makefile linux-2.5.44.epoll/drivers/char/Makefile
--- linux-2.5.44.vanilla/drivers/char/Makefile Fri Oct 18 21:02:32 2002
+++ linux-2.5.44.epoll/drivers/char/Makefile Tue Oct 22 10:08:40 2002
@@ -7,14 +7,14 @@
#
FONTMAPFILE = cp437.uni

-obj-y += mem.o tty_io.o n_tty.o tty_ioctl.o pty.o misc.o random.o
+obj-y += mem.o tty_io.o n_tty.o tty_ioctl.o pty.o misc.o random.o eventpoll.o

# All of the (potential) objects that export symbols.
# This list comes from 'grep -l EXPORT_SYMBOL *.[hc]'.

export-objs := busmouse.o vt.o generic_serial.o ip2main.o \
ite_gpio.o keyboard.o misc.o nvram.o random.o rtc.o \
- selection.o sonypi.o sysrq.o tty_io.o tty_ioctl.o
+ selection.o sonypi.o sysrq.o tty_io.o tty_ioctl.o eventpoll.o

obj-$(CONFIG_VT) += vt_ioctl.o vc_screen.o consolemap.o consolemap_deftbl.o selection.o keyboard.o
obj-$(CONFIG_HW_CONSOLE) += vt.o defkeymap.o
diff -Nru linux-2.5.44.vanilla/drivers/char/eventpoll.c linux-2.5.44.epoll/drivers/char/eventpoll.c
--- linux-2.5.44.vanilla/drivers/char/eventpoll.c Wed Dec 31 16:00:00 1969
+++ linux-2.5.44.epoll/drivers/char/eventpoll.c Sun Oct 27 15:23:47 2002
@@ -0,0 +1,1136 @@
+/*
+ * drivers/char/eventpoll.c ( Efficient event polling implementation )
+ * Copyright (C) 2001,...,2002 Davide Libenzi
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * Davide Libenzi <[email protected]>
+ *
+ */
+
+#include <linux/module.h>
+#include <linux/init.h>
+#include <linux/kernel.h>
+#include <linux/sched.h>
+#include <linux/fs.h>
+#include <linux/file.h>
+#include <linux/signal.h>
+#include <linux/errno.h>
+#include <linux/mm.h>
+#include <linux/vmalloc.h>
+#include <linux/slab.h>
+#include <linux/poll.h>
+#include <linux/miscdevice.h>
+#include <linux/random.h>
+#include <linux/smp_lock.h>
+#include <linux/wrapper.h>
+#include <linux/string.h>
+#include <linux/list.h>
+#include <linux/spinlock.h>
+#include <linux/wait.h>
+#include <linux/fcblist.h>
+#include <asm/bitops.h>
+#include <asm/uaccess.h>
+#include <asm/system.h>
+#include <asm/io.h>
+#include <asm/mman.h>
+#include <asm/atomic.h>
+#include <linux/eventpoll.h>
+
+
+
+#define EVENTPOLLFS_MAGIC 0x03111965 /* My birthday should work for this :) */
+
+#define DEBUG_EPOLL 0
+
+#if DEBUG_EPOLL > 0
+#define DPRINTK(x) printk x
+#define DNPRINTK(n, x) do { if ((n) <= DEBUG_EPOLL) printk x; } while (0)
+#else /* #if DEBUG_EPOLL > 0 */
+#define DPRINTK(x) (void) 0
+#define DNPRINTK(n, x) (void) 0
+#endif /* #if DEBUG_EPOLL > 0 */
+
+#define DEBUG_DPI 0
+
+#if DEBUG_DPI != 0
+#define DPI_SLAB_DEBUG (SLAB_DEBUG_FREE | SLAB_RED_ZONE /* | SLAB_POISON */)
+#else /* #if DEBUG_DPI != 0 */
+#define DPI_SLAB_DEBUG 0
+#endif /* #if DEBUG_DPI != 0 */
+
+#define INITIAL_HASH_BITS 7
+#define MAX_HASH_BITS 18
+#define RESIZE_LENGTH 2
+
+#define DPI_MEM_ALLOC() (struct epitem *) kmem_cache_alloc(dpi_cache, SLAB_KERNEL)
+#define DPI_MEM_FREE(p) kmem_cache_free(dpi_cache, p)
+#define IS_FILE_EPOLL(f) ((f)->f_op == &eventpoll_fops)
+
+
+
+typedef unsigned long long event_version_t;
+
+struct eventpoll {
+ rwlock_t lock;
+ wait_queue_head_t wq;
+ wait_queue_head_t poll_wait;
+ struct list_head *hash;
+ unsigned int hbits;
+ unsigned int hmask;
+ atomic_t hents;
+ atomic_t resize;
+ int numpages;
+ char **pages;
+ char *pages0[MAX_EVENTPOLL_PAGES];
+ char *pages1[MAX_EVENTPOLL_PAGES];
+ unsigned long vmabase;
+ atomic_t mmapped;
+ int eventcnt;
+ event_version_t ver;
+};
+
+struct epitem {
+ struct list_head llink;
+ struct eventpoll *ep;
+ struct file *file;
+ struct pollfd pfd;
+ int index;
+ event_version_t ver;
+};
+
+
+
+
+static int ep_getfd(int *efd, struct inode **einode, struct file **efile);
+static int ep_alloc_pages(char **pages, int numpages);
+static int ep_free_pages(char **pages, int numpages);
+static int ep_init(struct eventpoll *ep);
+static void ep_free(struct eventpoll *ep);
+static struct epitem *ep_find_nl(struct eventpoll *ep, int fd);
+static struct epitem *ep_find(struct eventpoll *ep, int fd);
+static int ep_hashresize(struct eventpoll *ep, unsigned long *kflags);
+static int ep_insert(struct eventpoll *ep, struct pollfd *pfd);
+static int ep_remove(struct eventpoll *ep, struct epitem *dpi);
+static void notify_proc(struct file *file, void *data, unsigned long *local, long *event);
+static int open_eventpoll(struct inode *inode, struct file *file);
+static int close_eventpoll(struct inode *inode, struct file *file);
+static unsigned int poll_eventpoll(struct file *file, poll_table *wait);
+static int write_eventpoll(struct file *file, const char *buffer, size_t count,
+ loff_t *ppos);
+static int ep_poll(struct eventpoll *ep, struct evpoll *dvp);
+static int ep_do_alloc_pages(struct eventpoll *ep, int numpages);
+static int ioctl_eventpoll(struct inode *inode, struct file *file,
+ unsigned int cmd, unsigned long arg);
+static void eventpoll_mm_open(struct vm_area_struct * vma);
+static void eventpoll_mm_close(struct vm_area_struct * vma);
+static int mmap_eventpoll(struct file *file, struct vm_area_struct *vma);
+static int eventpollfs_delete_dentry(struct dentry *dentry);
+static struct inode *get_eventpoll_inode(void);
+static struct super_block *eventpollfs_get_sb(struct file_system_type *fs_type,
+ int flags, char *dev_name, void *data);
+
+
+
+
+static kmem_cache_t *dpi_cache;
+static struct vfsmount *eventpoll_mnt;
+
+static struct file_operations eventpoll_fops = {
+ .write = write_eventpoll,
+ .ioctl = ioctl_eventpoll,
+ .mmap = mmap_eventpoll,
+ .open = open_eventpoll,
+ .release = close_eventpoll,
+ .poll = poll_eventpoll
+};
+
+static struct vm_operations_struct eventpoll_mmap_ops = {
+ .open = eventpoll_mm_open,
+ .close = eventpoll_mm_close,
+};
+
+static struct miscdevice eventpoll_miscdev = {
+ EVENTPOLL_MINOR, "eventpoll", &eventpoll_fops
+};
+
+static struct file_system_type eventpoll_fs_type = {
+ .name = "eventpollfs",
+ .get_sb = eventpollfs_get_sb,
+ .kill_sb = kill_anon_super,
+};
+
+static struct dentry_operations eventpollfs_dentry_operations = {
+ .d_delete = eventpollfs_delete_dentry,
+};
+
+
+
+asmlinkage int sys_epoll_create(int maxfds)
+{
+ int error = -EINVAL, fd;
+ unsigned long addr;
+ struct inode *inode;
+ struct file *file;
+ struct eventpoll *ep;
+
+ DNPRINTK(3, (KERN_INFO "[%p] eventpoll: sys_epoll_create(%d)\n",
+ current, maxfds));
+
+ if (maxfds > MAX_FDS_IN_EVENTPOLL)
+ goto eexit_1;
+ error = ep_getfd(&fd, &inode, &file);
+ if (error)
+ goto eexit_1;
+ error = open_eventpoll(inode, file);
+ if (error)
+ goto eexit_2;
+ ep = file->private_data;
+ error = ep_do_alloc_pages(ep, EP_FDS_PAGES(maxfds + 1));
+ if (error)
+ goto eexit_2;
+ down_write(&current->mm->mmap_sem);
+ addr = do_mmap_pgoff(file, 0, EP_MAP_SIZE(maxfds + 1), PROT_READ,
+ MAP_PRIVATE, 0);
+ up_write(&current->mm->mmap_sem);
+ error = PTR_ERR((void *) addr);
+ if (IS_ERR((void *) addr))
+ goto eexit_2;
+
+ DNPRINTK(3, (KERN_INFO "[%p] eventpoll: sys_epoll_create(%d) = %d\n",
+ current, maxfds, fd));
+
+ return fd;
+
+eexit_2:
+ sys_close(fd);
+eexit_1:
+ DNPRINTK(3, (KERN_INFO "[%p] eventpoll: sys_epoll_create(%d) = %d\n",
+ current, maxfds, error));
+ return error;
+}
+EXPORT_SYMBOL(sys_epoll_create);
+
+
+asmlinkage int sys_epoll_ctl(int epfd, int op, int fd, unsigned int events)
+{
+ int error = -EBADF;
+ struct file *file;
+ struct eventpoll *ep;
+ struct epitem *dpi;
+ struct pollfd pfd;
+
+ DNPRINTK(3, (KERN_INFO "[%p] eventpoll: sys_epoll_ctl(%d, %d, %d, %u)\n",
+ current, epfd, op, fd, events));
+
+ file = fget(epfd);
+ if (!file)
+ goto eexit_1;
+ error = -EINVAL;
+ if (!IS_FILE_EPOLL(file))
+ goto eexit_2;
+
+ ep = file->private_data;
+
+ pfd.fd = fd;
+ pfd.events = events;
+ pfd.revents = 0;
+
+ dpi = ep_find(ep, fd);
+
+ error = -EINVAL;
+ switch (op) {
+ case EP_CTL_ADD:
+ if (!dpi)
+ error = ep_insert(ep, &pfd);
+ else
+ error = -EEXIST;
+ break;
+ case EP_CTL_DEL:
+ if (dpi)
+ error = ep_remove(ep, dpi);
+ else
+ error = -ENOENT;
+ break;
+ case EP_CTL_MOD:
+ if (dpi) {
+ dpi->pfd.events = events;
+ error = 0;
+ } else
+ error = -ENOENT;
+ break;
+ }
+
+ DNPRINTK(3, (KERN_INFO "[%p] eventpoll: sys_epoll_ctl(%d, %d, %d, %u) = %d\n",
+ current, epfd, op, fd, events, error));
+
+eexit_2:
+ fput(file);
+eexit_1:
+ return error;
+}
+EXPORT_SYMBOL(sys_epoll_ctl);
+
+
+asmlinkage int sys_epoll_wait(int epfd, struct pollfd const **events, int timeout)
+{
+ int error = -EBADF;
+ void *eaddr;
+ struct file *file;
+ struct eventpoll *ep;
+ struct evpoll dvp;
+
+ DNPRINTK(3, (KERN_INFO "[%p] eventpoll: sys_epoll_wait(%d, %p, %d)\n",
+ current, epfd, events, timeout));
+
+ file = fget(epfd);
+ if (!file)
+ goto eexit_1;
+ error = -EINVAL;
+ if (!IS_FILE_EPOLL(file))
+ goto eexit_2;
+
+ ep = file->private_data;
+
+ error = -EINVAL;
+ if (!atomic_read(&ep->mmapped))
+ goto eexit_2;
+
+ dvp.ep_timeout = timeout;
+ error = ep_poll(ep, &dvp);
+ if (error > 0) {
+ eaddr = (void *) (ep->vmabase + dvp.ep_resoff);
+ if (copy_to_user(events, &eaddr, sizeof(struct pollfd *)))
+ error = -EFAULT;
+ }
+
+ DNPRINTK(3, (KERN_INFO "[%p] eventpoll: sys_epoll_wait(%d, %p, %d) = %d\n",
+ current, epfd, events, timeout, error));
+
+eexit_2:
+ fput(file);
+eexit_1:
+ return error;
+}
+EXPORT_SYMBOL(sys_epoll_wait);
+
+
+static int ep_getfd(int *efd, struct inode **einode, struct file **efile)
+{
+ struct qstr this;
+ char name[32];
+ struct dentry *dentry;
+ struct inode *inode;
+ struct file *file;
+ int error, fd;
+
+ error = -ENFILE;
+ file = get_empty_filp();
+ if (!file)
+ goto eexit_1;
+
+ inode = get_eventpoll_inode();
+ error = PTR_ERR(inode);
+ if (IS_ERR(inode))
+ goto eexit_2;
+
+ error = get_unused_fd();
+ if (error < 0)
+ goto eexit_3;
+ fd = error;
+
+ error = -ENOMEM;
+ sprintf(name, "[%lu]", inode->i_ino);
+ this.name = name;
+ this.len = strlen(name);
+ this.hash = inode->i_ino;
+ dentry = d_alloc(eventpoll_mnt->mnt_sb->s_root, &this);
+ if (!dentry)
+ goto eexit_4;
+ dentry->d_op = &eventpollfs_dentry_operations;
+ d_add(dentry, inode);
+ file->f_vfsmnt = mntget(mntget(eventpoll_mnt));
+ file->f_dentry = dget(dentry);
+
+ file->f_pos = 0;
+ file->f_flags = O_RDWR;
+ file->f_op = &eventpoll_fops;
+ file->f_mode = FMODE_READ | FMODE_WRITE;
+ file->f_version = 0;
+ file->private_data = NULL;
+
+ fd_install(fd, file);
+
+ *efd = fd;
+ *einode = inode;
+ *efile = file;
+ return 0;
+
+eexit_4:
+ put_unused_fd(fd);
+eexit_3:
+ iput(inode);
+eexit_2:
+ put_filp(file);
+eexit_1:
+ return error;
+}
+
+
+static int ep_alloc_pages(char **pages, int numpages)
+{
+ int ii;
+
+ for (ii = 0; ii < numpages; ii++) {
+ pages[ii] = (char *) __get_free_pages(GFP_KERNEL, 0);
+ if (!pages[ii]) {
+ for (--ii; ii >= 0; ii--) {
+ ClearPageReserved(virt_to_page(pages[ii]));
+ free_pages((unsigned long) pages[ii], 0);
+ }
+ return -ENOMEM;
+ }
+ SetPageReserved(virt_to_page(pages[ii]));
+ }
+ return 0;
+}
+
+
+static int ep_free_pages(char **pages, int numpages)
+{
+ int ii;
+
+ for (ii = 0; ii < numpages; ii++) {
+ ClearPageReserved(virt_to_page(pages[ii]));
+ free_pages((unsigned long) pages[ii], 0);
+ }
+ return 0;
+}
+
+
+static int ep_init(struct eventpoll *ep)
+{
+ int ii, hentries;
+
+ rwlock_init(&ep->lock);
+ init_waitqueue_head(&ep->wq);
+ init_waitqueue_head(&ep->poll_wait);
+ ep->hbits = INITIAL_HASH_BITS;
+ ep->hmask = (1 << ep->hbits) - 1;
+ atomic_set(&ep->hents, 0);
+ atomic_set(&ep->resize, 0);
+ atomic_set(&ep->mmapped, 0);
+ ep->numpages = 0;
+ ep->vmabase = 0;
+ ep->pages = ep->pages0;
+ ep->eventcnt = 0;
+ ep->ver = 1;
+
+ hentries = ep->hmask + 1;
+ if (!(ep->hash = (struct list_head *) vmalloc(hentries * sizeof(struct list_head))))
+ return -ENOMEM;
+
+ for (ii = 0; ii < hentries; ii++)
+ INIT_LIST_HEAD(&ep->hash[ii]);
+
+ return 0;
+}
+
+
+static void ep_free(struct eventpoll *ep)
+{
+ int ii;
+ struct list_head *lnk;
+
+ for (ii = 0; ii <= ep->hmask; ii++) {
+ while ((lnk = list_first(&ep->hash[ii]))) {
+ struct epitem *dpi = list_entry(lnk, struct epitem, llink);
+
+ file_notify_delcb(dpi->file, notify_proc);
+ list_del(lnk);
+ DPI_MEM_FREE(dpi);
+ }
+ }
+ vfree(ep->hash);
+ if (ep->numpages > 0) {
+ ep_free_pages(ep->pages0, ep->numpages);
+ ep_free_pages(ep->pages1, ep->numpages);
+ }
+}
+
+
+static struct epitem *ep_find_nl(struct eventpoll *ep, int fd)
+{
+ struct epitem *dpi = NULL;
+ struct list_head *lsthead, *lnk;
+
+ lsthead = &ep->hash[fd & ep->hmask];
+ list_for_each(lnk, lsthead) {
+ dpi = list_entry(lnk, struct epitem, llink);
+
+ if (dpi->pfd.fd == fd) break;
+ dpi = NULL;
+ }
+
+ DNPRINTK(3, (KERN_INFO "[%p] eventpoll: ep_find(%d) -> %p\n", current, fd, dpi));
+
+ return dpi;
+}
+
+
+static struct epitem *ep_find(struct eventpoll *ep, int fd)
+{
+ struct epitem *dpi;
+ unsigned long flags;
+
+ read_lock_irqsave(&ep->lock, flags);
+
+ dpi = ep_find_nl(ep, fd);
+
+ read_unlock_irqrestore(&ep->lock, flags);
+
+ return dpi;
+}
+
+
+static int ep_hashresize(struct eventpoll *ep, unsigned long *kflags)
+{
+ struct list_head *hash, *oldhash;
+ unsigned int hbits = ep->hbits + 1;
+ unsigned int hmask = (1 << hbits) - 1;
+ int ii, res, hentries = hmask + 1;
+ unsigned long flags = *kflags;
+
+ DNPRINTK(3, (KERN_INFO "[%p] eventpoll: ep_hashresize(%p) bits=%u\n", current, ep, hbits));
+
+ write_unlock_irqrestore(&ep->lock, flags);
+
+ res = -ENOMEM;
+ if (!(hash = (struct list_head *) vmalloc(hentries * sizeof(struct list_head)))) {
+ write_lock_irqsave(&ep->lock, flags);
+ goto eexit_1;
+ }
+
+ for (ii = 0; ii < hentries; ii++)
+ INIT_LIST_HEAD(&hash[ii]);
+
+ write_lock_irqsave(&ep->lock, flags);
+
+ oldhash = ep->hash;
+ for (ii = 0; ii <= ep->hmask; ii++) {
+ struct list_head *oldhead = &oldhash[ii], *lnk;
+
+ while ((lnk = list_first(oldhead))) {
+ struct epitem *dpi = list_entry(lnk, struct epitem, llink);
+
+ list_del(lnk);
+ list_add(lnk, &hash[dpi->pfd.fd & hmask]);
+ }
+ }
+
+ ep->hash = hash;
+ ep->hbits = hbits;
+ ep->hmask = hmask;
+
+ write_unlock_irqrestore(&ep->lock, flags);
+ vfree(oldhash);
+ write_lock_irqsave(&ep->lock, flags);
+
+ res = 0;
+eexit_1:
+ *kflags = flags;
+ atomic_dec(&ep->resize);
+ return res;
+}
+
+
+static int ep_insert(struct eventpoll *ep, struct pollfd *pfd)
+{
+ int error;
+ struct epitem *dpi;
+ struct file *file;
+ unsigned long flags;
+
+ if (atomic_read(&ep->hents) >= (ep->numpages * POLLFD_X_PAGE))
+ return -E2BIG;
+
+ file = fget(pfd->fd);
+ if (!file)
+ return -EBADF;
+
+ error = -ENOMEM;
+ if (!(dpi = DPI_MEM_ALLOC()))
+ goto eexit_1;
+
+ INIT_LIST_HEAD(&dpi->llink);
+ dpi->ep = ep;
+ dpi->file = file;
+ dpi->pfd = *pfd;
+ dpi->index = -1;
+ dpi->ver = ep->ver - 1;
+
+ write_lock_irqsave(&ep->lock, flags);
+
+ list_add(&dpi->llink, &ep->hash[pfd->fd & ep->hmask]);
+ atomic_inc(&ep->hents);
+
+ if (!atomic_read(&ep->resize) &&
+ (atomic_read(&ep->hents) >> ep->hbits) > RESIZE_LENGTH &&
+ ep->hbits < MAX_HASH_BITS) {
+ atomic_inc(&ep->resize);
+ ep_hashresize(ep, &flags);
+ }
+
+ write_unlock_irqrestore(&ep->lock, flags);
+
+ file_notify_addcb(file, notify_proc, dpi);
+
+ DNPRINTK(3, (KERN_INFO "[%p] eventpoll: ep_insert(%p, %d)\n", current, ep, pfd->fd));
+
+ error = 0;
+eexit_1:
+ fput(file);
+
+ return error;
+}
+
+
+static int ep_remove(struct eventpoll *ep, struct epitem *dpi)
+{
+ unsigned long flags;
+ struct pollfd *pfd, *lpfd;
+ struct epitem *ldpi;
+
+ file_notify_delcb(dpi->file, notify_proc);
+
+ write_lock_irqsave(&ep->lock, flags);
+
+ list_del(&dpi->llink);
+ atomic_dec(&ep->hents);
+
+ if (dpi->index >= 0 && dpi->ver == ep->ver && dpi->index < ep->eventcnt) {
+ pfd = (struct pollfd *) (ep->pages[EVENT_PAGE_INDEX(dpi->index)] +
+ EVENT_PAGE_OFFSET(dpi->index));
+ if (pfd->fd == dpi->pfd.fd && dpi->index < --ep->eventcnt) {
+ lpfd = (struct pollfd *) (ep->pages[EVENT_PAGE_INDEX(ep->eventcnt)] +
+ EVENT_PAGE_OFFSET(ep->eventcnt));
+ *pfd = *lpfd;
+
+ if ((ldpi = ep_find_nl(ep, pfd->fd))) ldpi->index = dpi->index;
+ }
+ }
+
+ write_unlock_irqrestore(&ep->lock, flags);
+
+ DNPRINTK(3, (KERN_INFO "[%p] eventpoll: ep_remove(%p, %d)\n",
+ current, ep, dpi->pfd.fd));
+
+ DPI_MEM_FREE(dpi);
+
+ return 0;
+}
+
+
+static void notify_proc(struct file *file, void *data, unsigned long *local, long *event)
+{
+ struct epitem *dpi = data;
+ struct eventpoll *ep = dpi->ep;
+ struct pollfd *pfd;
+
+ DNPRINTK(3, (KERN_INFO "[%p] eventpoll: notify(%p, %p, %ld, %ld) ep=%p\n",
+ current, file, data, event[0], event[1], ep));
+
+ write_lock(&ep->lock);
+ if (!(dpi->pfd.events & event[1]))
+ goto out;
+
+ if (dpi->index < 0 || dpi->ver != ep->ver) {
+ if (ep->eventcnt >= (ep->numpages * POLLFD_X_PAGE))
+ goto out;
+ dpi->index = ep->eventcnt++;
+ dpi->ver = ep->ver;
+ pfd = (struct pollfd *) (ep->pages[EVENT_PAGE_INDEX(dpi->index)] +
+ EVENT_PAGE_OFFSET(dpi->index));
+ *pfd = dpi->pfd;
+ } else {
+ pfd = (struct pollfd *) (ep->pages[EVENT_PAGE_INDEX(dpi->index)] +
+ EVENT_PAGE_OFFSET(dpi->index));
+ if (pfd->fd != dpi->pfd.fd) {
+ if (ep->eventcnt >= (ep->numpages * POLLFD_X_PAGE))
+ goto out;
+ dpi->index = ep->eventcnt++;
+ pfd = (struct pollfd *) (ep->pages[EVENT_PAGE_INDEX(dpi->index)] +
+ EVENT_PAGE_OFFSET(dpi->index));
+ *pfd = dpi->pfd;
+ }
+ }
+
+ pfd->revents |= (pfd->events & event[1]);
+
+ if (waitqueue_active(&ep->wq))
+ wake_up(&ep->wq);
+ if (waitqueue_active(&ep->poll_wait))
+ wake_up(&ep->poll_wait);
+out:
+ write_unlock(&ep->lock);
+}
+
+
+static int open_eventpoll(struct inode *inode, struct file *file)
+{
+ int res;
+ struct eventpoll *ep;
+
+ if (!(ep = kmalloc(sizeof(struct eventpoll), GFP_KERNEL)))
+ return -ENOMEM;
+
+ memset(ep, 0, sizeof(*ep));
+ if ((res = ep_init(ep))) {
+ kfree(ep);
+ return res;
+ }
+
+ file->private_data = ep;
+
+ DNPRINTK(3, (KERN_INFO "[%p] eventpoll: open() ep=%p\n", current, ep));
+ return 0;
+}
+
+
+static int close_eventpoll(struct inode *inode, struct file *file)
+{
+ struct eventpoll *ep = file->private_data;
+
+ if (ep) {
+ ep_free(ep);
+ kfree(ep);
+ }
+
+ DNPRINTK(3, (KERN_INFO "[%p] eventpoll: close() ep=%p\n", current, ep));
+ return 0;
+}
+
+
+static unsigned int poll_eventpoll(struct file *file, poll_table *wait)
+{
+ struct eventpoll *ep = file->private_data;
+
+ poll_wait(file, &ep->poll_wait, wait);
+ if (ep->eventcnt)
+ return POLLIN | POLLRDNORM;
+
+ return 0;
+}
+
+
+static int write_eventpoll(struct file *file, const char *buffer, size_t count,
+ loff_t *ppos)
+{
+ int rcount;
+ struct eventpoll *ep = file->private_data;
+ struct epitem *dpi;
+ struct pollfd pfd;
+
+ DNPRINTK(3, (KERN_INFO "[%p] eventpoll: write(%p, %d)\n", current, ep, count));
+
+ rcount = -EINVAL;
+ if (count % sizeof(struct pollfd))
+ goto eexit_1;
+
+ if ((rcount = verify_area(VERIFY_READ, buffer, count)))
+ goto eexit_1;
+
+ rcount = 0;
+
+ while (count > 0) {
+ if (__copy_from_user(&pfd, buffer, sizeof(pfd))) {
+ rcount = -EFAULT;
+ goto eexit_1;
+ }
+
+ dpi = ep_find(ep, pfd.fd);
+
+ if (pfd.fd >= current->files->max_fds || !current->files->fd[pfd.fd])
+ pfd.events = POLLREMOVE;
+ if (pfd.events & POLLREMOVE) {
+ if (dpi) {
+ ep_remove(ep, dpi);
+ rcount += sizeof(pfd);
+ }
+ }
+ else if (dpi) {
+ dpi->pfd.events = pfd.events;
+ rcount += sizeof(pfd);
+ } else {
+ pfd.revents = 0;
+ if (!ep_insert(ep, &pfd))
+ rcount += sizeof(pfd);
+ }
+
+ buffer += sizeof(pfd);
+ count -= sizeof(pfd);
+ }
+
+eexit_1:
+ DNPRINTK(3, (KERN_INFO "[%p] eventpoll: write(%p, %d) = %d\n",
+ current, ep, count, rcount));
+
+ return rcount;
+}
+
+
+static int ep_poll(struct eventpoll *ep, struct evpoll *dvp)
+{
+ int res = 0;
+ long timeout;
+ unsigned long flags;
+ wait_queue_t wait;
+
+ if (!atomic_read(&ep->mmapped))
+ return -EINVAL;
+
+ write_lock_irqsave(&ep->lock, flags);
+
+ res = 0;
+ if (!ep->eventcnt) {
+ init_waitqueue_entry(&wait, current);
+ add_wait_queue(&ep->wq, &wait);
+ timeout = dvp->ep_timeout == -1 || dvp->ep_timeout > MAX_SCHEDULE_TIMEOUT / HZ ?
+ MAX_SCHEDULE_TIMEOUT: (dvp->ep_timeout * HZ) / 1000;
+ for (;;) {
+ set_current_state(TASK_INTERRUPTIBLE);
+ if (ep->eventcnt || !timeout)
+ break;
+ if (signal_pending(current)) {
+ res = -EINTR;
+ break;
+ }
+
+ write_unlock_irqrestore(&ep->lock, flags);
+ timeout = schedule_timeout(timeout);
+ write_lock_irqsave(&ep->lock, flags);
+ }
+ remove_wait_queue(&ep->wq, &wait);
+
+ set_current_state(TASK_RUNNING);
+ }
+
+ if (!res && ep->eventcnt) {
+ res = ep->eventcnt;
+ ep->eventcnt = 0;
+ ++ep->ver;
+ if (ep->pages == ep->pages0) {
+ ep->pages = ep->pages1;
+ dvp->ep_resoff = 0;
+ } else {
+ ep->pages = ep->pages0;
+ dvp->ep_resoff = ep->numpages * PAGE_SIZE;
+ }
+ }
+
+ write_unlock_irqrestore(&ep->lock, flags);
+
+ return res;
+}
+
+
+static int ep_do_alloc_pages(struct eventpoll *ep, int numpages)
+{
+ int res, pgalloc, pgcpy;
+ unsigned long flags;
+ char **pages, **pages0, **pages1;
+
+ if (atomic_read(&ep->mmapped))
+ return -EBUSY;
+ if (numpages > MAX_EVENTPOLL_PAGES)
+ return -EINVAL;
+
+ pgalloc = numpages - ep->numpages;
+ if ((pages = (char **) vmalloc(2 * (pgalloc + 1) * sizeof(char *))) == NULL)
+ return -ENOMEM;
+ pages0 = &pages[0];
+ pages1 = &pages[pgalloc + 1];
+
+ if ((res = ep_alloc_pages(pages0, pgalloc)))
+ goto eexit_1;
+
+ if ((res = ep_alloc_pages(pages1, pgalloc))) {
+ ep_free_pages(pages0, pgalloc);
+ goto eexit_1;
+ }
+
+ write_lock_irqsave(&ep->lock, flags);
+ pgcpy = (ep->numpages + pgalloc) > numpages ? numpages - ep->numpages: pgalloc;
+ if (pgcpy > 0) {
+ memcpy(&ep->pages0[ep->numpages], pages0, pgcpy * sizeof(char *));
+ memcpy(&ep->pages1[ep->numpages], pages1, pgcpy * sizeof(char *));
+ ep->numpages += pgcpy;
+ }
+ write_unlock_irqrestore(&ep->lock, flags);
+
+ if (pgcpy < pgalloc) {
+ if (pgcpy < 0)
+ pgcpy = 0;
+ ep_free_pages(&pages0[pgcpy], pgalloc - pgcpy);
+ ep_free_pages(&pages1[pgcpy], pgalloc - pgcpy);
+ }
+
+eexit_1:
+ vfree(pages);
+ return res;
+}
+
+
+static int ioctl_eventpoll(struct inode *inode, struct file *file,
+ unsigned int cmd, unsigned long arg)
+{
+ int res;
+ struct eventpoll *ep = file->private_data;
+ struct epitem *dpi;
+ unsigned long flags;
+ struct pollfd pfd;
+ struct evpoll dvp;
+
+ switch (cmd) {
+ case EP_ALLOC:
+ res = ep_do_alloc_pages(ep, EP_FDS_PAGES(arg));
+
+ DNPRINTK(3, (KERN_INFO "[%p] eventpoll: ioctl(%p, EP_ALLOC, %lu) == %d\n",
+ current, ep, arg, res));
+ return res;
+
+ case EP_FREE:
+ if (atomic_read(&ep->mmapped))
+ return -EBUSY;
+
+ res = -EINVAL;
+ write_lock_irqsave(&ep->lock, flags);
+ if (ep->numpages > 0) {
+ ep_free_pages(ep->pages0, ep->numpages);
+ ep_free_pages(ep->pages1, ep->numpages);
+ ep->numpages = 0;
+ ep->pages = ep->pages0;
+ res = 0;
+ }
+ write_unlock_irqrestore(&ep->lock, flags);
+
+ DNPRINTK(3, (KERN_INFO "[%p] eventpoll: ioctl(%p, EP_FREE) == %d\n",
+ current, ep, res));
+ return res;
+
+ case EP_POLL:
+ if (copy_from_user(&dvp, (void *) arg, sizeof(struct evpoll)))
+ return -EFAULT;
+
+ DNPRINTK(3, (KERN_INFO "[%p] eventpoll: ioctl(%p, EP_POLL, %d)\n",
+ current, ep, dvp.ep_timeout));
+
+ res = ep_poll(ep, &dvp);
+
+ DNPRINTK(3, (KERN_INFO "[%p] eventpoll: ioctl(%p, EP_POLL, %d) == %d\n",
+ current, ep, dvp.ep_timeout, res));
+
+ if (res > 0 && copy_to_user((void *) arg, &dvp, sizeof(struct evpoll)))
+ res = -EFAULT;
+
+ return res;
+
+ case EP_ISPOLLED:
+ if (copy_from_user(&pfd, (void *) arg, sizeof(struct pollfd)))
+ return 0;
+
+ read_lock_irqsave(&ep->lock, flags);
+
+ res = 0;
+ if (!(dpi = ep_find_nl(ep, pfd.fd)))
+ goto is_not_polled;
+
+ pfd = dpi->pfd;
+ res = 1;
+
+ is_not_polled:
+ read_unlock_irqrestore(&ep->lock, flags);
+
+ if (res)
+ copy_to_user((void *) arg, &pfd, sizeof(struct pollfd));
+
+ DNPRINTK(3, (KERN_INFO "[%p] eventpoll: ioctl(%p, EP_ISPOLLED, %d) == %d\n",
+ current, ep, pfd.fd, res));
+ return res;
+ }
+
+ return -EINVAL;
+}
+
+
+static void eventpoll_mm_open(struct vm_area_struct * vma)
+{
+ struct file *file = vma->vm_file;
+ struct eventpoll *ep = file->private_data;
+
+ if (ep) atomic_inc(&ep->mmapped);
+
+ DNPRINTK(3, (KERN_INFO "[%p] eventpoll: mm_open(%p)\n", current, ep));
+}
+
+
+static void eventpoll_mm_close(struct vm_area_struct * vma)
+{
+ struct file *file = vma->vm_file;
+ struct eventpoll *ep = file->private_data;
+
+ if (ep && atomic_dec_and_test(&ep->mmapped))
+ ep->vmabase = 0;
+
+ DNPRINTK(3, (KERN_INFO "[%p] eventpoll: mm_close(%p)\n", current, ep));
+}
+
+
+static int mmap_eventpoll(struct file *file, struct vm_area_struct *vma)
+{
+ struct eventpoll *ep = file->private_data;
+ unsigned long start;
+ int ii, res, numpages;
+ size_t mapsize;
+
+ DNPRINTK(3, (KERN_INFO "[%p] eventpoll: mmap(%p, %lx, %lx)\n",
+ current, ep, vma->vm_start, vma->vm_pgoff << PAGE_SHIFT));
+
+ if (vma->vm_flags & VM_WRITE)
+ return -EACCES;
+ if ((vma->vm_pgoff << PAGE_SHIFT) != 0)
+ return -EINVAL;
+
+ mapsize = PAGE_ALIGN(vma->vm_end - vma->vm_start);
+ numpages = mapsize >> PAGE_SHIFT;
+
+ res = -EINVAL;
+ if (numpages != (2 * ep->numpages))
+ goto eexit_1;
+
+ start = vma->vm_start;
+ for (ii = 0; ii < ep->numpages; ii++) {
+ if ((res = remap_page_range(vma, start, __pa(ep->pages0[ii]),
+ PAGE_SIZE, vma->vm_page_prot)))
+ goto eexit_1;
+ start += PAGE_SIZE;
+ }
+ for (ii = 0; ii < ep->numpages; ii++) {
+ if ((res = remap_page_range(vma, start, __pa(ep->pages1[ii]),
+ PAGE_SIZE, vma->vm_page_prot)))
+ goto eexit_1;
+ start += PAGE_SIZE;
+ }
+ vma->vm_ops = &eventpoll_mmap_ops;
+ ep->vmabase = vma->vm_start;
+ atomic_set(&ep->mmapped, 1);
+ res = 0;
+eexit_1:
+
+ DNPRINTK(3, (KERN_INFO "[%p] eventpoll: mmap(%p, %lx, %lx) == %d\n",
+ current, ep, vma->vm_start, vma->vm_pgoff << PAGE_SHIFT, res));
+ return res;
+}
+
+
+static int eventpollfs_delete_dentry(struct dentry *dentry)
+{
+
+ return 1;
+}
+
+
+static struct inode *get_eventpoll_inode(void)
+{
+ int error = -ENOMEM;
+ struct inode *inode = new_inode(eventpoll_mnt->mnt_sb);
+
+ if (!inode)
+ goto eexit_1;
+
+ inode->i_fop = &eventpoll_fops;
+
+ /*
+ * Mark the inode dirty from the very beginning,
+ * that way it will never be moved to the dirty
+ * list because "mark_inode_dirty()" will think
+ * that it already _is_ on the dirty list.
+ */
+ inode->i_state = I_DIRTY;
+ inode->i_mode = S_IRUSR | S_IWUSR;
+ inode->i_uid = current->fsuid;
+ inode->i_gid = current->fsgid;
+ inode->i_atime = inode->i_mtime = inode->i_ctime = CURRENT_TIME;
+ inode->i_blksize = PAGE_SIZE;
+ return inode;
+
+eexit_1:
+ return ERR_PTR(error);
+}
+
+
+static struct super_block *eventpollfs_get_sb(struct file_system_type *fs_type,
+ int flags, char *dev_name, void *data)
+{
+
+ return get_sb_pseudo(fs_type, "eventpoll:", NULL, EVENTPOLLFS_MAGIC);
+}
+
+
+static int __init eventpoll_init(void)
+{
+ int error;
+
+ error = -ENOMEM;
+ dpi_cache = kmem_cache_create("eventpoll",
+ sizeof(struct epitem),
+ __alignof__(struct epitem),
+ DPI_SLAB_DEBUG, NULL, NULL);
+ if (!dpi_cache)
+ goto eexit_1;
+
+ error = register_filesystem(&eventpoll_fs_type);
+ if (error)
+ goto eexit_2;
+
+ eventpoll_mnt = kern_mount(&eventpoll_fs_type);
+ error = PTR_ERR(eventpoll_mnt);
+ if (IS_ERR(eventpoll_mnt))
+ goto eexit_3;
+
+ error = misc_register(&eventpoll_miscdev);
+ if (error)
+ goto eexit_4;
+
+ printk(KERN_INFO "[%p] eventpoll: driver installed.\n", current);
+
+ return error;
+
+eexit_4:
+ mntput(eventpoll_mnt);
+eexit_3:
+ unregister_filesystem(&eventpoll_fs_type);
+eexit_2:
+ kmem_cache_destroy(dpi_cache);
+eexit_1:
+
+ return error;
+}
+
+static void __exit eventpoll_exit(void)
+{
+ unregister_filesystem(&eventpoll_fs_type);
+ mntput(eventpoll_mnt);
+ misc_deregister(&eventpoll_miscdev);
+ kmem_cache_destroy(dpi_cache);
+}
+
+module_init(eventpoll_init);
+module_exit(eventpoll_exit);
+
+MODULE_LICENSE("GPL");
+
+
diff -Nru linux-2.5.44.vanilla/fs/Makefile linux-2.5.44.epoll/fs/Makefile
--- linux-2.5.44.vanilla/fs/Makefile Fri Oct 18 21:01:57 2002
+++ linux-2.5.44.epoll/fs/Makefile Sat Oct 19 12:05:48 2002
@@ -6,14 +6,14 @@
#

export-objs := open.o dcache.o buffer.o bio.o inode.o dquot.o mpage.o aio.o \
- fcntl.o read_write.o dcookies.o
+ fcntl.o read_write.o dcookies.o fcblist.o

obj-y := open.o read_write.o devices.o file_table.o buffer.o \
bio.o super.o block_dev.o char_dev.o stat.o exec.o pipe.o \
namei.o fcntl.o ioctl.o readdir.o select.o fifo.o locks.o \
dcache.o inode.o attr.o bad_inode.o file.o dnotify.o \
filesystems.o namespace.o seq_file.o xattr.o libfs.o \
- fs-writeback.o mpage.o direct-io.o aio.o
+ fs-writeback.o mpage.o direct-io.o aio.o fcblist.o

ifneq ($(CONFIG_NFSD),n)
ifneq ($(CONFIG_NFSD),)
diff -Nru linux-2.5.44.vanilla/fs/fcblist.c linux-2.5.44.epoll/fs/fcblist.c
--- linux-2.5.44.vanilla/fs/fcblist.c Wed Dec 31 16:00:00 1969
+++ linux-2.5.44.epoll/fs/fcblist.c Sun Oct 27 15:23:07 2002
@@ -0,0 +1,135 @@
+/*
+ * linux/fs/fcblist.c ( File event callbacks handling )
+ * Copyright (C) 2001,...,2002 Davide Libenzi
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * Davide Libenzi <[email protected]>
+ *
+ */
+
+#include <linux/config.h>
+#include <linux/module.h>
+#include <linux/fs.h>
+#include <linux/mm.h>
+#include <linux/sched.h>
+#include <linux/slab.h>
+#include <linux/vmalloc.h>
+#include <linux/poll.h>
+#include <asm/bitops.h>
+#include <linux/fcblist.h>
+
+
+long ion_band_table[NSIGPOLL] = {
+ ION_IN, /* POLL_IN */
+ ION_OUT, /* POLL_OUT */
+ ION_IN, /* POLL_MSG */
+ ION_ERR, /* POLL_ERR */
+ 0, /* POLL_PRI */
+ ION_HUP /* POLL_HUP */
+};
+EXPORT_SYMBOL(ion_band_table);
+
+long poll_band_table[NSIGPOLL] = {
+ POLLIN | POLLRDNORM, /* POLL_IN */
+ POLLOUT | POLLWRNORM | POLLWRBAND, /* POLL_OUT */
+ POLLIN | POLLRDNORM | POLLMSG, /* POLL_MSG */
+ POLLERR, /* POLL_ERR */
+ POLLPRI | POLLRDBAND, /* POLL_PRI */
+ POLLHUP | POLLERR /* POLL_HUP */
+};
+EXPORT_SYMBOL(poll_band_table);
+
+
+void file_notify_event(struct file *filep, long *event)
+{
+ unsigned long flags;
+ struct list_head *lnk, *lsthead;
+
+ fcblist_read_lock(filep, flags);
+
+ lsthead = &filep->f_cblist;
+ list_for_each(lnk, lsthead) {
+ struct fcb_struct *fcbp = list_entry(lnk, struct fcb_struct, llink);
+
+ fcbp->cbproc(filep, fcbp->data, fcbp->local, event);
+ }
+
+ fcblist_read_unlock(filep, flags);
+}
+EXPORT_SYMBOL(file_notify_event);
+
+
+int file_notify_addcb(struct file *filep,
+ void (*cbproc)(struct file *, void *, unsigned long *, long *), void *data)
+{
+ unsigned long flags;
+ struct fcb_struct *fcbp;
+
+ if (!(fcbp = (struct fcb_struct *) kmalloc(sizeof(struct fcb_struct), GFP_KERNEL)))
+ return -ENOMEM;
+
+ memset(fcbp, 0, sizeof(struct fcb_struct));
+ fcbp->cbproc = cbproc;
+ fcbp->data = data;
+
+ fcblist_write_lock(filep, flags);
+ list_add_tail(&fcbp->llink, &filep->f_cblist);
+ fcblist_write_unlock(filep, flags);
+
+ return 0;
+}
+EXPORT_SYMBOL(file_notify_addcb);
+
+
+int file_notify_delcb(struct file *filep,
+ void (*cbproc)(struct file *, void *, unsigned long *, long *))
+{
+ unsigned long flags;
+ struct list_head *lnk, *lsthead;
+
+ fcblist_write_lock(filep, flags);
+
+ lsthead = &filep->f_cblist;
+ list_for_each(lnk, lsthead) {
+ struct fcb_struct *fcbp = list_entry(lnk, struct fcb_struct, llink);
+
+ if (fcbp->cbproc == cbproc) {
+ list_del(lnk);
+ fcblist_write_unlock(filep, flags);
+ kfree(fcbp);
+ return 0;
+ }
+ }
+
+ fcblist_write_unlock(filep, flags);
+
+ return -ENOENT;
+}
+EXPORT_SYMBOL(file_notify_delcb);
+
+
+void file_notify_cleanup(struct file *filep)
+{
+ unsigned long flags;
+ struct list_head *lnk, *lsthead;
+
+ fcblist_write_lock(filep, flags);
+
+ lsthead = &filep->f_cblist;
+ while ((lnk = list_first(lsthead))) {
+ struct fcb_struct *fcbp = list_entry(lnk, struct fcb_struct, llink);
+
+ list_del(lnk);
+ fcblist_write_unlock(filep, flags);
+ kfree(fcbp);
+ fcblist_write_lock(filep, flags);
+ }
+
+ fcblist_write_unlock(filep, flags);
+}
+EXPORT_SYMBOL(file_notify_cleanup);
+
diff -Nru linux-2.5.44.vanilla/fs/file_table.c linux-2.5.44.epoll/fs/file_table.c
--- linux-2.5.44.vanilla/fs/file_table.c Fri Oct 18 21:01:08 2002
+++ linux-2.5.44.epoll/fs/file_table.c Sat Oct 19 12:01:33 2002
@@ -8,6 +8,7 @@
#include <linux/string.h>
#include <linux/slab.h>
#include <linux/file.h>
+#include <linux/fcblist.h>
#include <linux/init.h>
#include <linux/module.h>
#include <linux/smp_lock.h>
@@ -58,6 +59,7 @@
f->f_gid = current->fsgid;
f->f_owner.lock = RW_LOCK_UNLOCKED;
list_add(&f->f_list, &anon_list);
+ file_notify_init(f);
file_list_unlock();
return f;
}
@@ -102,6 +104,7 @@
filp->f_uid = current->fsuid;
filp->f_gid = current->fsgid;
filp->f_op = dentry->d_inode->i_fop;
+ file_notify_init(filp);
if (filp->f_op->open)
return filp->f_op->open(dentry->d_inode, filp);
else
@@ -123,6 +126,7 @@
struct vfsmount * mnt = file->f_vfsmnt;
struct inode * inode = dentry->d_inode;

+ file_notify_cleanup(file);
locks_remove_flock(file);

if (file->f_op && file->f_op->release)
diff -Nru linux-2.5.44.vanilla/fs/pipe.c linux-2.5.44.epoll/fs/pipe.c
--- linux-2.5.44.vanilla/fs/pipe.c Fri Oct 18 21:01:56 2002
+++ linux-2.5.44.epoll/fs/pipe.c Sat Oct 19 12:32:34 2002
@@ -11,6 +11,7 @@
#include <linux/module.h>
#include <linux/init.h>
#include <linux/fs.h>
+#include <linux/fcblist.h>

#include <asm/uaccess.h>
#include <asm/ioctls.h>
@@ -47,7 +48,7 @@
pipe_read(struct file *filp, char *buf, size_t count, loff_t *ppos)
{
struct inode *inode = filp->f_dentry->d_inode;
- int do_wakeup;
+ int do_wakeup, pfull;
ssize_t ret;

/* pread is not allowed on pipes. */
@@ -63,6 +64,7 @@
down(PIPE_SEM(*inode));
for (;;) {
int size = PIPE_LEN(*inode);
+ pfull = PIPE_FULL(*inode);
if (size) {
char *pipebuf = PIPE_BASE(*inode) + PIPE_START(*inode);
ssize_t chars = PIPE_MAX_RCHUNK(*inode);
@@ -108,12 +110,18 @@
if (!ret) ret = -ERESTARTSYS;
break;
}
+ /* Send notification message */
+ if (pfull && !PIPE_FULL(*inode) && PIPE_WRITEFILE(*inode))
+ file_send_notify(PIPE_WRITEFILE(*inode), ION_OUT, POLLOUT | POLLWRNORM | POLLWRBAND);
if (do_wakeup) {
wake_up_interruptible(PIPE_WAIT(*inode));
kill_fasync(PIPE_FASYNC_WRITERS(*inode), SIGIO, POLL_OUT);
}
pipe_wait(inode);
}
+ /* Send notification message */
+ if (pfull && !PIPE_FULL(*inode) && PIPE_WRITEFILE(*inode))
+ file_send_notify(PIPE_WRITEFILE(*inode), ION_OUT, POLLOUT | POLLWRNORM | POLLWRBAND);
up(PIPE_SEM(*inode));
/* Signal writers asynchronously that there is more room. */
if (do_wakeup) {
@@ -131,7 +139,7 @@
struct inode *inode = filp->f_dentry->d_inode;
ssize_t ret;
size_t min;
- int do_wakeup;
+ int do_wakeup, pempty;

/* pwrite is not allowed on pipes. */
if (unlikely(ppos != &filp->f_pos))
@@ -149,6 +157,7 @@
down(PIPE_SEM(*inode));
for (;;) {
int free;
+ pempty = PIPE_EMPTY(*inode);
if (!PIPE_READERS(*inode)) {
send_sig(SIGPIPE, current, 0);
if (!ret) ret = -EPIPE;
@@ -194,6 +203,9 @@
if (!ret) ret = -ERESTARTSYS;
break;
}
+ /* Send notification message */
+ if (pempty && !PIPE_EMPTY(*inode) && PIPE_READFILE(*inode))
+ file_send_notify(PIPE_READFILE(*inode), ION_IN, POLLIN | POLLRDNORM);
if (do_wakeup) {
wake_up_interruptible_sync(PIPE_WAIT(*inode));
kill_fasync(PIPE_FASYNC_READERS(*inode), SIGIO, POLL_IN);
@@ -203,6 +215,9 @@
pipe_wait(inode);
PIPE_WAITING_WRITERS(*inode)--;
}
+ /* Send notification message */
+ if (pempty && !PIPE_EMPTY(*inode) && PIPE_READFILE(*inode))
+ file_send_notify(PIPE_READFILE(*inode), ION_IN, POLLIN | POLLRDNORM);
up(PIPE_SEM(*inode));
if (do_wakeup) {
wake_up_interruptible(PIPE_WAIT(*inode));
@@ -266,9 +281,22 @@
static int
pipe_release(struct inode *inode, int decr, int decw)
{
+ struct file *rdfile, *wrfile;
down(PIPE_SEM(*inode));
PIPE_READERS(*inode) -= decr;
PIPE_WRITERS(*inode) -= decw;
+ rdfile = PIPE_READFILE(*inode);
+ wrfile = PIPE_WRITEFILE(*inode);
+ if (decr && !PIPE_READERS(*inode)) {
+ PIPE_READFILE(*inode) = NULL;
+ if (wrfile)
+ file_send_notify(wrfile, ION_HUP, POLLHUP);
+ }
+ if (decw && !PIPE_WRITERS(*inode)) {
+ PIPE_WRITEFILE(*inode) = NULL;
+ if (rdfile)
+ file_send_notify(rdfile, ION_HUP, POLLHUP);
+ }
if (!PIPE_READERS(*inode) && !PIPE_WRITERS(*inode)) {
struct pipe_inode_info *info = inode->i_pipe;
inode->i_pipe = NULL;
@@ -488,6 +516,7 @@
PIPE_READERS(*inode) = PIPE_WRITERS(*inode) = 0;
PIPE_WAITING_WRITERS(*inode) = 0;
PIPE_RCOUNTER(*inode) = PIPE_WCOUNTER(*inode) = 1;
+ PIPE_READFILE(*inode) = PIPE_WRITEFILE(*inode) = NULL;
*PIPE_FASYNC_READERS(*inode) = *PIPE_FASYNC_WRITERS(*inode) = NULL;

return inode;
@@ -595,6 +624,9 @@
f2->f_op = &write_pipe_fops;
f2->f_mode = 2;
f2->f_version = 0;
+
+ PIPE_READFILE(*inode) = f1;
+ PIPE_WRITEFILE(*inode) = f2;

fd_install(i, f1);
fd_install(j, f2);
diff -Nru linux-2.5.44.vanilla/include/asm-i386/poll.h linux-2.5.44.epoll/include/asm-i386/poll.h
--- linux-2.5.44.vanilla/include/asm-i386/poll.h Fri Oct 18 21:01:52 2002
+++ linux-2.5.44.epoll/include/asm-i386/poll.h Sat Oct 19 12:01:33 2002
@@ -15,6 +15,7 @@
#define POLLWRNORM 0x0100
#define POLLWRBAND 0x0200
#define POLLMSG 0x0400
+#define POLLREMOVE 0x1000

struct pollfd {
int fd;
diff -Nru linux-2.5.44.vanilla/include/asm-i386/unistd.h linux-2.5.44.epoll/include/asm-i386/unistd.h
--- linux-2.5.44.vanilla/include/asm-i386/unistd.h Fri Oct 18 21:02:00 2002
+++ linux-2.5.44.epoll/include/asm-i386/unistd.h Sat Oct 19 20:23:33 2002
@@ -258,6 +258,9 @@
#define __NR_free_hugepages 251
#define __NR_exit_group 252
#define __NR_lookup_dcookie 253
+#define __NR_sys_epoll_create 254
+#define __NR_sys_epoll_ctl 255
+#define __NR_sys_epoll_wait 256


/* user-visible error numbers are in the range -1 - -124: see <asm-i386/errno.h> */
diff -Nru linux-2.5.44.vanilla/include/linux/eventpoll.h linux-2.5.44.epoll/include/linux/eventpoll.h
--- linux-2.5.44.vanilla/include/linux/eventpoll.h Wed Dec 31 16:00:00 1969
+++ linux-2.5.44.epoll/include/linux/eventpoll.h Sun Oct 27 15:23:37 2002
@@ -0,0 +1,51 @@
+/*
+ * include/linux/eventpoll.h ( Efficient event polling implementation )
+ * Copyright (C) 2001,...,2002 Davide Libenzi
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * Davide Libenzi <[email protected]>
+ *
+ */
+
+#ifndef _LINUX_EVENTPOLL_H
+#define _LINUX_EVENTPOLL_H
+
+
+#define EVENTPOLL_MINOR 124
+#define POLLFD_X_PAGE (PAGE_SIZE / sizeof(struct pollfd))
+#define MAX_FDS_IN_EVENTPOLL (1024 * 128)
+#define MAX_EVENTPOLL_PAGES (MAX_FDS_IN_EVENTPOLL / POLLFD_X_PAGE)
+#define EVENT_PAGE_INDEX(n) ((n) / POLLFD_X_PAGE)
+#define EVENT_PAGE_REM(n) ((n) % POLLFD_X_PAGE)
+#define EVENT_PAGE_OFFSET(n) (((n) % POLLFD_X_PAGE) * sizeof(struct pollfd))
+#define EP_FDS_PAGES(n) (((n) + POLLFD_X_PAGE - 1) / POLLFD_X_PAGE)
+#define EP_MAP_SIZE(n) (EP_FDS_PAGES(n) * PAGE_SIZE * 2)
+
+
+struct evpoll {
+ int ep_timeout;
+ unsigned long ep_resoff;
+};
+
+#define EP_ALLOC _IOR('P', 1, int)
+#define EP_POLL _IOWR('P', 2, struct evpoll)
+#define EP_FREE _IO('P', 3)
+#define EP_ISPOLLED _IOWR('P', 4, struct pollfd)
+
+#define EP_CTL_ADD 1
+#define EP_CTL_DEL 2
+#define EP_CTL_MOD 3
+
+
+asmlinkage int sys_epoll_create(int maxfds);
+asmlinkage int sys_epoll_ctl(int epfd, int op, int fd, unsigned int events);
+asmlinkage int sys_epoll_wait(int epfd, struct pollfd const **events, int timeout);
+
+
+
+#endif
+
diff -Nru linux-2.5.44.vanilla/include/linux/fcblist.h linux-2.5.44.epoll/include/linux/fcblist.h
--- linux-2.5.44.vanilla/include/linux/fcblist.h Wed Dec 31 16:00:00 1969
+++ linux-2.5.44.epoll/include/linux/fcblist.h Sun Oct 27 15:23:21 2002
@@ -0,0 +1,73 @@
+/*
+ * include/linux/fcblist.h ( File event callbacks handling )
+ * Copyright (C) 2001,...,2002 Davide Libenzi
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * Davide Libenzi <[email protected]>
+ *
+ */
+
+#ifndef __LINUX_FCBLIST_H
+#define __LINUX_FCBLIST_H
+
+#include <linux/config.h>
+#include <linux/list.h>
+#include <linux/spinlock.h>
+#include <linux/fs.h>
+#include <linux/file.h>
+
+
+
+/* file callback notification events */
+#define ION_IN 1
+#define ION_OUT 2
+#define ION_HUP 3
+#define ION_ERR 4
+
+#define FCB_LOCAL_SIZE 4
+
+#define fcblist_read_lock(fp, fl) read_lock_irqsave(&(fp)->f_cblock, fl)
+#define fcblist_read_unlock(fp, fl) read_unlock_irqrestore(&(fp)->f_cblock, fl)
+#define fcblist_write_lock(fp, fl) write_lock_irqsave(&(fp)->f_cblock, fl)
+#define fcblist_write_unlock(fp, fl) write_unlock_irqrestore(&(fp)->f_cblock, fl)
+
+struct fcb_struct {
+ struct list_head llink;
+ void (*cbproc)(struct file *, void *, unsigned long *, long *);
+ void *data;
+ unsigned long local[FCB_LOCAL_SIZE];
+};
+
+
+extern long ion_band_table[];
+extern long poll_band_table[];
+
+
+void file_notify_event(struct file *filep, long *event);
+
+int file_notify_addcb(struct file *filep,
+ void (*cbproc)(struct file *, void *, unsigned long *, long *), void *data);
+
+int file_notify_delcb(struct file *filep,
+ void (*cbproc)(struct file *, void *, unsigned long *, long *));
+
+void file_notify_cleanup(struct file *filep);
+
+
+static inline void file_notify_init(struct file *filep)
+{
+ rwlock_init(&filep->f_cblock);
+ INIT_LIST_HEAD(&filep->f_cblist);
+}
+
+static inline void file_send_notify(struct file *filep, long ioevt, long plevt) {
+ long event[] = { ioevt, plevt, -1 };
+
+ file_notify_event(filep, event);
+}
+
+#endif
diff -Nru linux-2.5.44.vanilla/include/linux/fs.h linux-2.5.44.epoll/include/linux/fs.h
--- linux-2.5.44.vanilla/include/linux/fs.h Fri Oct 18 21:01:18 2002
+++ linux-2.5.44.epoll/include/linux/fs.h Sat Oct 19 12:01:33 2002
@@ -506,6 +506,10 @@

/* needed for tty driver, and maybe others */
void *private_data;
+
+ /* file callback list */
+ rwlock_t f_cblock;
+ struct list_head f_cblist;
};
extern spinlock_t files_lock;
#define file_list_lock() spin_lock(&files_lock);
diff -Nru linux-2.5.44.vanilla/include/linux/list.h linux-2.5.44.epoll/include/linux/list.h
--- linux-2.5.44.vanilla/include/linux/list.h Fri Oct 18 21:01:07 2002
+++ linux-2.5.44.epoll/include/linux/list.h Sat Oct 19 12:01:33 2002
@@ -319,6 +319,11 @@
for (pos = (head)->next, n = pos->next; pos != (head); \
pos = n, ({ read_barrier_depends(); 0;}), n = pos->next)

+#define list_first(head) (((head)->next != (head)) ? (head)->next: (struct list_head *) 0)
+#define list_last(head) (((head)->prev != (head)) ? (head)->prev: (struct list_head *) 0)
+#define list_next(pos, head) (((pos)->next != (head)) ? (pos)->next: (struct list_head *) 0)
+#define list_prev(pos, head) (((pos)->prev != (head)) ? (pos)->prev: (struct list_head *) 0)
+
#endif /* __KERNEL__ || _LVM_H_INCLUDE */

#endif
diff -Nru linux-2.5.44.vanilla/include/linux/pipe_fs_i.h linux-2.5.44.epoll/include/linux/pipe_fs_i.h
--- linux-2.5.44.vanilla/include/linux/pipe_fs_i.h Fri Oct 18 21:02:24 2002
+++ linux-2.5.44.epoll/include/linux/pipe_fs_i.h Sat Oct 19 12:01:33 2002
@@ -12,6 +12,8 @@
unsigned int waiting_writers;
unsigned int r_counter;
unsigned int w_counter;
+ struct file *rdfile;
+ struct file *wrfile;
struct fasync_struct *fasync_readers;
struct fasync_struct *fasync_writers;
};
@@ -30,6 +32,8 @@
#define PIPE_WAITING_WRITERS(inode) ((inode).i_pipe->waiting_writers)
#define PIPE_RCOUNTER(inode) ((inode).i_pipe->r_counter)
#define PIPE_WCOUNTER(inode) ((inode).i_pipe->w_counter)
+#define PIPE_READFILE(inode) ((inode).i_pipe->rdfile)
+#define PIPE_WRITEFILE(inode) ((inode).i_pipe->wrfile)
#define PIPE_FASYNC_READERS(inode) (&((inode).i_pipe->fasync_readers))
#define PIPE_FASYNC_WRITERS(inode) (&((inode).i_pipe->fasync_writers))

diff -Nru linux-2.5.44.vanilla/include/linux/sys.h linux-2.5.44.epoll/include/linux/sys.h
--- linux-2.5.44.vanilla/include/linux/sys.h Fri Oct 18 21:01:49 2002
+++ linux-2.5.44.epoll/include/linux/sys.h Sun Oct 20 15:13:06 2002
@@ -4,7 +4,7 @@
/*
* system call entry points ... but not all are defined
*/
-#define NR_syscalls 256
+#define NR_syscalls 260

/*
* These are system calls that will be removed at some time
diff -Nru linux-2.5.44.vanilla/include/net/sock.h linux-2.5.44.epoll/include/net/sock.h
--- linux-2.5.44.vanilla/include/net/sock.h Fri Oct 18 21:02:27 2002
+++ linux-2.5.44.epoll/include/net/sock.h Tue Oct 22 15:57:38 2002
@@ -52,6 +52,9 @@
#include <asm/atomic.h>
#include <net/dst.h>
#include <net/scm.h>
+#include <linux/fs.h>
+#include <linux/file.h>
+#include <linux/fcblist.h>

/*
* This structure really needs to be cleaned up.
@@ -766,8 +769,13 @@

static inline void sk_wake_async(struct sock *sk, int how, int band)
{
- if (sk->socket && sk->socket->fasync_list)
- sock_wake_async(sk->socket, how, band);
+ if (sk->socket) {
+ if (sk->socket->file)
+ file_send_notify(sk->socket->file, ion_band_table[band - POLL_IN],
+ poll_band_table[band - POLL_IN]);
+ if (sk->socket->fasync_list)
+ sock_wake_async(sk->socket, how, band);
+ }
}

#define SOCK_MIN_SNDBUF 2048
diff -Nru linux-2.5.44.vanilla/net/ipv4/tcp.c linux-2.5.44.epoll/net/ipv4/tcp.c
--- linux-2.5.44.vanilla/net/ipv4/tcp.c Fri Oct 18 21:01:19 2002
+++ linux-2.5.44.epoll/net/ipv4/tcp.c Sat Oct 19 12:01:33 2002
@@ -476,8 +476,8 @@
if (sk->sleep && waitqueue_active(sk->sleep))
wake_up_interruptible(sk->sleep);

- if (sock->fasync_list && !(sk->shutdown & SEND_SHUTDOWN))
- sock_wake_async(sock, 2, POLL_OUT);
+ if (!(sk->shutdown & SEND_SHUTDOWN))
+ sk_wake_async(sk, 2, POLL_OUT);
}
}






2002-10-28 20:11:47

by Hanna Linder

Subject: Re: [PATCH] epoll more scalable than poll


--On Monday, October 28, 2002 11:14:19 -0800 Hanna Linder <[email protected]> wrote:

> ps- Did I mention there is a web site? http://lse.sf.net/epoll/index.html
>
> -----
> diff -Nru linux-2.5.44.vanilla/arch/i386/kernel/entry.S linux-2.5.44.epoll/arch/i386/kernel/entry.S

Forgot to include the diffstat for Davide's sys_epoll patch (v11):


arch/i386/kernel/entry.S | 4
drivers/char/Makefile | 4
drivers/char/eventpoll.c | 1136 ++++++++++++++++++++++++++++++++++++++++++++++
fs/Makefile | 4
fs/fcblist.c | 135 +++++
fs/file_table.c | 4
fs/pipe.c | 36 +
include/asm-i386/poll.h | 1
include/asm-i386/unistd.h | 3
include/linux/eventpoll.h | 51 ++
include/linux/fcblist.h | 73 ++
include/linux/fs.h | 4
include/linux/list.h | 5
include/linux/pipe_fs_i.h | 4
include/linux/sys.h | 2
include/net/sock.h | 12
net/ipv4/tcp.c | 4
17 files changed, 1471 insertions(+), 11 deletions(-)

2002-10-28 20:50:49

by Martin Waitz

Subject: Re: [PATCH] epoll more scalable than poll

hi :)

On Mon, Oct 28, 2002 at 11:14:19AM -0800, Hanna Linder wrote:
> The results of our testing show not only that the system call
> interface to epoll performs as well as the /dev interface, but also that epoll
> is many times more scalable than standard poll. No other implementation of poll
> has performed as well as epoll by any measure. Testing details and results
> are published here; please take a minute to check them out: http://lse.sourceforge.net/epoll/index.html
how does this compare to the kqueue mechanism found in {Free,Net}BSD?
(see http://people.freebsd.org/~jlemon/papers/kqueue.pdf)

i especially like their combined event update/event wait,
needing only one syscall per poll while building a changelist in
userspace...
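
for illustration, the combined form looks roughly like this (a sketch
against the BSD kevent(2) interface, not tested on linux; fd1 and fd2
are placeholder descriptors):

#include <sys/types.h>
#include <sys/event.h>
#include <sys/time.h>

static int watch_and_wait(int fd1, int fd2)
{
	struct kevent chg[2], evs[64];
	int kq, n, i;

	if ((kq = kqueue()) < 0)
		return -1;
	/* batch the registration changes in userspace */
	EV_SET(&chg[0], fd1, EVFILT_READ,  EV_ADD, 0, 0, NULL);
	EV_SET(&chg[1], fd2, EVFILT_WRITE, EV_ADD, 0, 0, NULL);
	/* one syscall: submit the changelist and wait for events */
	n = kevent(kq, chg, 2, evs, 64, NULL);
	for (i = 0; i < n; i++)
		;	/* evs[i].ident is the fd, evs[i].filter the event */
	return n;
}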

a replacement for poll/select is _really_ needed.
but i think we should use existing interfaces if possible,
to reduce the changes needed in userspace.


--
CU, / Friedrich-Alexander University Erlangen, Germany
Martin Waitz // [Tali on IRCnet] [tali.home.pages.de] _________
______________/// - - - - - - - - - - - - - - - - - - - - ///
this is a manually generated mail, it contains //
typos and is valid even without capital letters. /
-
Those who would give up essential liberty to purchase a little
temporary safety deserve neither liberty nor safety.
Benjamin Franklin (1706 - 1790)



2002-10-28 21:56:29

by bert hubert

Subject: Re: [PATCH] epoll more scalable than poll

On Mon, Oct 28, 2002 at 09:56:47PM +0100, Martin Waitz wrote:

> needing only one syscall per poll while building a changelist in
> userspace...

Which is so smashingly successful for iptables. I would very much doubt the
utility of building tables in userspace and then blasting them across,
especially as they will tend to be large when people bother to use epoll.

Regards,

bert

--
http://www.PowerDNS.com Versatile DNS Software & Services
http://lartc.org Linux Advanced Routing & Traffic Control HOWTO

2002-10-28 21:53:50

by Dan Kegel

Subject: Re: [PATCH] epoll more scalable than poll

Martin Waitz <[email protected]> wrote:
> On Mon, Oct 28, 2002 at 11:14:19AM -0800, Hanna Linder wrote:
>> The results of our testing show not only that the system call
>> interface to epoll performs as well as the /dev interface, but also that epoll
>> is many times more scalable than standard poll. No other implementation of poll
>> has performed as well as epoll by any measure. Testing details and results
>> are published here; please take a minute to check them out:
>> http://lse.sourceforge.net/epoll/index.html
>
> how does this compare to the kqueue mechanism found in {Free,Net}BSD?
> (see http://people.freebsd.org/~jlemon/papers/kqueue.pdf)
>
> i especially like their combined event update/event wait,
> needing only one syscall per poll while building a changelist in
> userspace...
>
> a replacement for poll/select is _really_ needed.
> but i think we should use existing interfaces if possible,
> to reduce the changes needed in userspace.

I'd kinda like to see a unified event queue object
used uniformly for everything. You might instantiate
several of them in one process (so e.g. libraries could have
their own).

The idea of using the kqueue interface was discussed once before. See
http://marc.theaimsgroup.com/?l=linux-kernel&m=97236943118139&w=2
for Linus' opinion of kqueues (he doesn't like them much).

Another existing event queue via which readiness notification could be
delivered is Ben's AIO completion notification queue,
but I haven't heard a definitive story about whether
epoll events could be delivered that way. (The discussion
seems to always veer into a discussion of asynchronous
poll, which is something else.)
- Dan


2002-10-28 22:01:52

by bert hubert

Subject: Re: [PATCH] epoll more scalable than poll

On Mon, Oct 28, 2002 at 11:14:19AM -0800, Hanna Linder wrote:

> The results of our testing show not only that the system call
> interface to epoll performs as well as the /dev interface, but also that epoll
> is many times more scalable than standard poll. No other implementation of poll

Hanna,

Are you sure this works? The following trivial program doesn't work on standard
input when I'd expect it to. It just waits until the timeout passes and then
returns 0. It also does not work on a file, which is to be expected,
although 'select' returns with an immediate availability of data on a file
according to SuS.

Furthermore, there is some const weirdness going on; the ephttpd server has
a different second argument to sys_epoll_wait.

I compile this with:
cc -Wall -I/mnt/linux-2.5.44/include/ epoll.c -o epoll


#include <stdio.h>
#include <errno.h>
#include <asm/page.h>
#include <asm/poll.h>
#include <linux/linkage.h>
#include <linux/eventpoll.h>
#include <linux/unistd.h>

#define __sys_epoll_create(maxfds) _syscall1(int, sys_epoll_create, int, maxfds)
#define __sys_epoll_ctl(epfd, op, fd, events) _syscall4(int, sys_epoll_ctl, \
int, epfd, int, op, int, fd, unsigned int, events)

#define __sys_epoll_wait(epfd, events, timeout) _syscall3(int, sys_epoll_wait, \
int, epfd, struct pollfd const **, events, int, timeout)

__sys_epoll_create(maxfds)
__sys_epoll_ctl(epfd, op, fd, events)
__sys_epoll_wait(epfd, events, timeout)

// asmlinkage int sys_epoll_wait(int epfd, struct pollfd const **events, int timeout);

int main()
{
	int kdpfd;
	struct pollfd const *pfds;
	int nfds;
	int timeout = 2;

	/* create the epoll fd, sized for up to 10 descriptors */
	if ((kdpfd = sys_epoll_create(10)) < 0) {
		perror("sys_epoll_create");
		return -1;
	}
	/* register stdin (fd 0) for input events */
	if (sys_epoll_ctl(kdpfd, EP_CTL_ADD, 0, POLLIN) < 0) {
		fprintf(stderr, "sys_epoll set insertion error: fd=%d\n", 0);
		return -1;
	}

	/* wait up to timeout seconds for events on the registered fds */
	nfds = sys_epoll_wait(kdpfd, &pfds, timeout * 1000);
	fprintf(stderr, "sys_epoll_wait returned: %d\n", nfds);
	return 0;
}

--
http://www.PowerDNS.com Versatile DNS Software & Services
http://lartc.org Linux Advanced Routing & Traffic Control HOWTO

2002-10-28 22:01:41

by Davide Libenzi

Subject: Re: [PATCH] epoll more scalable than poll

On Mon, 28 Oct 2002, Martin Waitz wrote:

> hi :)
>
> On Mon, Oct 28, 2002 at 11:14:19AM -0800, Hanna Linder wrote:
> > The results of our testing show not only that the system call
> > interface to epoll performs as well as the /dev interface, but also that epoll
> > is many times more scalable than standard poll. No other implementation of poll
> > has performed as well as epoll by any measure. Testing details and results
> > are published here; please take a minute to check them out: http://lse.sourceforge.net/epoll/index.html
> how does this compare to the kqueue mechanism found in {Free,Net}BSD?
> (see http://people.freebsd.org/~jlemon/papers/kqueue.pdf)
>
> i especially like their combined event update/event wait,
> needing only one syscall per poll while building a changelist in
> userspace...
>
> a replacement for poll/select is _really_ needed.
> but i think we should use existing interfaces if possible,
> to reduce the changes needed in userspace.

KQueue has not been tested simply because it does not ( to my knowledge )
have patches to apply to the Linux kernel. I'd expect kqueue to scale
similarly to sys_epoll, though; by "similar" I mean that it would not suffer
high connection-load drops. As for the interface, it looks pretty simple to me:

http://www.xmailserver.org/linux-patches/epoll_create.txt
http://www.xmailserver.org/linux-patches/epoll_ctl.txt
http://www.xmailserver.org/linux-patches/epoll_wait.txt
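
In practice the three calls combine into a short event loop. A minimal
sketch ( not part of the patch: the sys_epoll_* stubs are generated with
_syscall macros as in the test program quoted earlier in the thread, and
handle_client() is a hypothetical handler ) :

#include <sys/socket.h>
#include <asm/poll.h>
#include <linux/eventpoll.h>

static void event_loop(int listenfd)
{
	struct pollfd const *events;
	int epfd, nfds, i;

	if ((epfd = sys_epoll_create(1024)) < 0)
		return;
	/* register the listening socket for input events */
	sys_epoll_ctl(epfd, EP_CTL_ADD, listenfd, POLLIN);
	for (;;) {
		/* -1 == block until at least one event arrives */
		nfds = sys_epoll_wait(epfd, &events, -1);
		for (i = 0; i < nfds; i++) {
			if (events[i].fd == listenfd) {
				int conn = accept(listenfd, NULL, NULL);
				if (conn >= 0)
					sys_epoll_ctl(epfd, EP_CTL_ADD, conn, POLLIN);
			} else
				handle_client(events[i].fd, events[i].revents);
		}
	}
}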




- Davide


2002-10-28 22:30:33

by Dan Kegel

Subject: Re: [PATCH] epoll more scalable than poll

Dan Kegel wrote:
> The idea of using the kqueue interface was discussed once before. See
> http://marc.theaimsgroup.com/?l=linux-kernel&m=97236943118139&w=2
> for Linus' opinion of kqueues (he doesn't like them much).

Hang on - reading again, I wonder if the main reason he didn't like
kqueue is because it allowed for multiple event queues
(so libraries don't need to be tightly integrated into
the main program, for instance). He preferred one queue
and callbacks.

However, I think Linus admitted later on that nobody liked his
callback idea, so maybe he'd be receptive to the multiple
event queue idea now.

Um, I assume Ben's aio stuff allows multiple completion queues, right?
- Dan

2002-10-28 22:35:51

by Davide Libenzi

[permalink] [raw]
Subject: Re: [PATCH] epoll more scalable than poll

On Mon, 28 Oct 2002, Dan Kegel wrote:

> Another existing event queue via which readiness notification could
> be delivered is Ben's AIO completion notification queue,
> but I haven't heard a definitive story about whether
> epoll events could be delivered that way. (The discussion
> seems to always veer into a discussion of asynchronous
> poll, which is something else.)

Yep Dan, Ben proposed that approach, but we did not have the time to test it.
The way sys_epoll returns events is very efficient, as you can
see in the scalability page ( pipetest ) that Hanna and her team set up :

http://lse.sourceforge.net/epoll/index.html



- Davide


2002-10-28 22:16:30

by Davide Libenzi

[permalink] [raw]
Subject: Re: [PATCH] epoll more scalable than poll

On Mon, 28 Oct 2002, bert hubert wrote:

> On Mon, Oct 28, 2002 at 11:14:19AM -0800, Hanna Linder wrote:
>
> > The results of our testing show not only does the system call
> > interface to epoll perform as well as the /dev interface but also that epoll
> > is many times better than standard poll. No other implementations of poll
>
> Hanna,
>
> Sure that this works? The following trivial program doesn't work on standard
> input when I'd expect it to. It just waits until the timeout passes and then
> returns 0. It also does not work on a file, which is to be expected,
> although 'select' returns with an immediate availability of data on a file
> according to SuS.
>
> Furthermore, there is some const weirdness going on, the ephttpd server has
> a different second argument to sys_epoll_wait.

sys_epoll, by plugging directly into the existing kernel architecture,
supports sockets and pipes. It does not support, and there are no plans
to support, other devices like ttys, where poll() and select() work
flawlessly. Since the sys_epoll ( and /dev/epoll ) fd supports standard polling, you
can mix sys_epoll handling with other methods like poll() and AIO's
POLL function when it is ready. For example, for devices that sys_epoll
intentionally does not support, you can use a method like :

put_sys_epoll_fd_inside_XXX();
...
wait_for_XXX_events();
...
if (XXX_event_fd() == sys_epoll_fd) {
        sys_epoll_wait();
        for_each_sys_epoll_event {
                handle_fd_event();
        }
}
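
Concretely, since the sys_epoll fd itself supports standard polling, the XXX
scheme above can be instantiated with plain poll(). A sketch ( epfd, pfds,
nfds and n as in the test programs in this thread; handle_fd_event() and
handle_tty() are hypothetical handlers ):

struct pollfd outer[2];

outer[0].fd = epfd;             /* the sys_epoll fd, pollable like any other */
outer[0].events = POLLIN;
outer[1].fd = tty_fd;           /* a device sys_epoll does not cover */
outer[1].events = POLLIN;

if (poll(outer, 2, -1) > 0) {
        if (outer[0].revents & POLLIN) {
                /* events are pending inside the epoll set: fetch them */
                nfds = sys_epoll_wait(epfd, &pfds, 0);
                for (n = 0; n < nfds; n++)
                        handle_fd_event(pfds[n].fd, pfds[n].revents);
        }
        if (outer[1].revents & POLLIN)
                handle_tty(tty_fd);
}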




- Davide



2002-10-28 22:10:53

by bert hubert

[permalink] [raw]
Subject: Re: [PATCH] epoll more scalable than poll

On Mon, Oct 28, 2002 at 11:02:49PM +0100, bert hubert wrote:
> On Mon, Oct 28, 2002 at 09:56:47PM +0100, Martin Waitz wrote:
>
> > needing only one syscall per poll while building a changelist in
> > userspace...
>
> Which is so smashingly successful for iptables. I would very much doubt the
> utility of building tables in userspace and then blasting them across,
> especially as they will tend to be large when people bother to use epoll.

Never mind, I should have read what you wrote. Building changesets in
userspace may have some utility.

--
http://www.PowerDNS.com Versatile DNS Software & Services
http://lartc.org Linux Advanced Routing & Traffic Control HOWTO

2002-10-28 22:52:04

by bert hubert

[permalink] [raw]
Subject: and nicer too - Re: [PATCH] epoll more scalable than poll

On Mon, Oct 28, 2002 at 02:29:37PM -0800, Davide Libenzi wrote:

> sys_epoll, by plugging directly in the existing kernel architecture,
> supports sockets and pipes. It does not support and there're not even
> plans to support other devices like tty, where poll() and select() works
> flawlessy. Since the sys_epoll ( and /dev/epoll ) fd support standard polling, you

Ok. I suggest the manpage mention this prominently.

I tried a somewhat more involved example and it indeed works as expected. As an
application developer, this suits my needs just fine. I really like the
'edge' nature of it all.

The interface is also lovely:

for(;;) {
        nfds = sys_epoll_wait(kdpfd, &pfds, -1);
        fprintf(stderr, "sys_epoll_wait returned: %d\n", nfds);

        for(n = 0; n < nfds; ++n) {
                if(pfds[n].fd == s) {
                        client = accept(s, (struct sockaddr*)&local, &addrlen);
                        if(client < 0) {
                                perror("accept");
                                continue;
                        }
                        if (sys_epoll_ctl(kdpfd, EP_CTL_ADD, client, POLLIN) < 0) {
                                fprintf(stderr, "sys_epoll set insertion error: fd=%d\n", client);
                                return -1;
                        }
                }
                else
                        printf("something happened on fd %d\n", pfds[n].fd);
        }
}

Each time a packet comes in, sys_epoll_wait returns just once, so I can
immediately call it again without having to wait for another thread to have
actually *done* something with that socket.

Righteous stuff, I'll be using this, thanks.

Regards,

bert

--
http://www.PowerDNS.com Versatile DNS Software & Services
http://lartc.org Linux Advanced Routing & Traffic Control HOWTO

2002-10-28 22:57:48

by Dan Kegel

[permalink] [raw]
Subject: Re: [PATCH] epoll more scalable than poll

Davide Libenzi wrote:
> On Mon, 28 Oct 2002, Dan Kegel wrote:
>>Another existing event queue via which readiness notification could
>>be delivered is Ben's AIO completion notification queue,
>>but I haven't heard a definitive story about whether
>>epoll events could be delivered that way. (The discussion
>>seems to always veer into a discussion of asynchronous
>>poll, which is something else.)
>
> Yep Dan, Ben proposed that approach, but we did not have the time to test it.
> The way sys_epoll returns events is very efficient, as you can
> see in the scalability page ( pipetest ) that Hanna and her team set up :
>
> http://lse.sourceforge.net/epoll/index.html

I do like those results. If, however, the unified approach performs
as well, it might be good to go with it to reduce the number of
interfaces, as Martin suggested. (Though he was suggesting kqueue as
the preferred interface, not Ben's aio...)
- Dan

2002-10-28 23:07:28

by Davide Libenzi

[permalink] [raw]
Subject: Re: and nicer too - Re: [PATCH] epoll more scalable than poll

On Mon, 28 Oct 2002, bert hubert wrote:

> The interface is also lovely:
>
> for(;;) {
>         nfds = sys_epoll_wait(kdpfd, &pfds, -1);
>         fprintf(stderr, "sys_epoll_wait returned: %d\n", nfds);
>
>         for(n = 0; n < nfds; ++n) {
>                 if(pfds[n].fd == s) {
>                         client = accept(s, (struct sockaddr*)&local, &addrlen);
>                         if(client < 0) {
>                                 perror("accept");
>                                 continue;
>                         }
>                         if (sys_epoll_ctl(kdpfd, EP_CTL_ADD, client, POLLIN) < 0) {
>                                 fprintf(stderr, "sys_epoll set insertion error: fd=%d\n", client);
>                                 return -1;
>                         }
>                 }
>                 else
>                         printf("something happened on fd %d\n", pfds[n].fd);
>         }
> }

This is how you probably want to use it :

for(;;) {
        nfds = sys_epoll_wait(kdpfd, &pfds, -1);

        for(n = 0; n < nfds; ++n) {
                if((fd = pfds[n].fd) == s) {
                        client = accept(s, (struct sockaddr*)&local, &addrlen);
                        if(client < 0) {
                                perror("accept");
                                continue;
                        }
                        if (sys_epoll_ctl(kdpfd, EP_CTL_ADD, client, POLLIN) < 0) {
                                fprintf(stderr, "sys_epoll set insertion error: fd=%d\n", client);
                                return -1;
                        }
                        fd = client;
                }

                do_use_fd(fd);
        }
}




- Davide


2002-10-28 23:38:26

by Jamie Lokier

[permalink] [raw]
Subject: Re: [PATCH] epoll more scalable than poll

Davide Libenzi wrote:
> sys_epoll, by plugging directly in the existing kernel architecture,
> supports sockets and pipes. It does not support and there're not even
> plans to support other devices like tty, where poll() and select() works
> flawlessy. Since the sys_epoll ( and /dev/epoll ) fd support standard polling, you
> can mix sys_epoll handling with other methods like poll() and the AIO's
> POLL function when it'll be ready. For example, for devices that sys_epoll
> intentionally does not support, you can use a method like :

:( I was hoping sys_epoll would be scalable without increasing the
number of system calls per event.

Is it too much work to support all kinds of fd? It would be rather a
good thing IMHO.

I'm thinking that a typical generic event handling library (like in a
typical home grown server) takes a set of fds and event handling
callbacks. sys_epoll is obviously not so trivial to use in place of a
poll() loop, because the library needs to fstat() each fd that is
registered to decide if epoll will return events for that fd.

For that to work, it's important that you can determine, through
fstat(), whether sys_epoll will actually return events for the fd, or
whether a sigqueue event is needed to trigger the epoll read.

So, is it exactly _all_ sockets and pipes, and nothing else?

Btw, is the set of fd types supported by epoll the same as the set of
fd types supported by SIGIO? That would be convenient - and logical.
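
A library-side gate of that kind could be as small as the following sketch,
assuming the stated socket+pipe coverage maps onto the standard
S_ISSOCK()/S_ISFIFO() stat predicates:

#include <sys/stat.h>

/* nonzero if, per the stated socket+pipe coverage, the fd can be
 * watched through sys_epoll; otherwise the library falls back to
 * poll()/sigio for it */
static int epoll_can_watch(int fd)
{
        struct stat st;

        if (fstat(fd, &st) < 0)
                return 0;
        return S_ISSOCK(st.st_mode) || S_ISFIFO(st.st_mode);
}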

thanks,
-- Jamie (who thinks a lot about fast web servers)

2002-10-28 23:38:53

by John Myers

[permalink] [raw]
Subject: Re: and nicer too - Re: [PATCH] epoll more scalable than poll

bert hubert wrote:

>The interface is also lovely:
>
>
The code you wrote has the standard epoll race condition. If the file
descriptor 's' becomes readable before the call to sys_epoll_ctl,
sys_epoll_wait() will never return the socket. The connection will hang
and the file descriptor will effectively leak.

As you have amply demonstrated, the current epoll API is error prone.
The API should be fixed to test the poll condition and, if necessary,
drop an event upon insertion to the set.


2002-10-28 23:47:10

by Davide Libenzi

[permalink] [raw]
Subject: Re: [PATCH] epoll more scalable than poll

On Mon, 28 Oct 2002, Jamie Lokier wrote:

> Davide Libenzi wrote:
> > sys_epoll, by plugging directly into the existing kernel architecture,
> > supports sockets and pipes. It does not support, and there are no plans
> > to support, other devices like ttys, where poll() and select() work
> > flawlessly. Since the sys_epoll ( and /dev/epoll ) fd supports standard polling, you
> > can mix sys_epoll handling with other methods like poll() and AIO's
> > POLL function when it is ready. For example, for devices that sys_epoll
> > intentionally does not support, you can use a method like :
>
> :( I was hoping sys_epoll would be scalable without increasing the
> number of system calls per event.
>
> Is it too much work to support all kinds of fd? It would be rather a
> good thing IMHO.
>
> I'm thinking that a typical generic event handling library (like in a
> typical home grown server) takes a set of fds and event handling
> callbacks. sys_epoll is obviously not so trivial to use in place of a
> poll() loop, because the library needs to fstat() each fd that is
> registered to decide if epoll will return events for that fd.
>
> For that to work, it's important that you can determine, through
> fstat(), whether sys_epoll will actually return events for the fd, or
> whether a sigqueue event is needed to trigger the epoll read.
>
> So, is it exactly _all_ sockets and pipes, and nothing else?
>
> Btw, is the set of fd types supported by epoll the same as the set of
> fd types supported by SIGIO? That would be convenient - and logical.

Jamie, doing that is not a real problem. The fact is that sys_epoll aims
to solve issues where scalability on a huge number of fds is involved. By
covering sockets ( network connections ) and pipes ( cgi execution ) you
get pretty wide scalability coverage. Usually you know where an
fd was born, so you're typically able to handle it correctly. Those reasons,
together with the fact that I did not want to introduce revolutions in the
kernel, drove me to limit the sys_epoll coverage to sockets and pipes.



- Davide



2002-10-28 23:52:20

by Davide Libenzi

[permalink] [raw]
Subject: Re: and nicer too - Re: [PATCH] epoll more scalable than poll

On Mon, 28 Oct 2002, John Gardiner Myers wrote:

> bert hubert wrote:
>
> >The interface is also lovely:
> >
> >
> The code you wrote has the standard epoll race condition. If the file
> descriptor 's' becomes readable before the call to sys_epoll_ctl,
> sys_epoll_wait() will never return the socket. The connection will hang
> and the file descriptor will effectively leak.
>
> As you have amply demonstrated, the current epoll API is error prone.
> The API should be fixed to test the poll condition and, if necessary,
> drop an event upon insertion to the set.

So, please don't use :

free((void *) rand());

free() is flawed !! Be warned !!



- Davide


2002-10-28 23:57:19

by bert hubert

[permalink] [raw]
Subject: Re: [PATCH] epoll more scalable than poll

On Mon, Oct 28, 2002 at 11:44:34PM +0000, Jamie Lokier wrote:

> :( I was hoping sys_epoll would be scalable without increasing the
> number of system calls per event.

I see only one call per event? sys_epoll_wait. Perhaps sys_epoll_ctl to
register a new interest.

Regards,

bert
--
http://www.PowerDNS.com Versatile DNS Software & Services
http://lartc.org Linux Advanced Routing & Traffic Control HOWTO

2002-10-29 00:04:59

by Davide Libenzi

[permalink] [raw]
Subject: Re: [PATCH] epoll more scalable than poll

On Tue, 29 Oct 2002, bert hubert wrote:

> On Mon, Oct 28, 2002 at 11:44:34PM +0000, Jamie Lokier wrote:
>
> > :( I was hoping sys_epoll would be scalable without increasing the
> > number of system calls per event.
>
> I see only one call per event? sys_epoll_wait. Perhaps sys_epoll_ctl to
> register a new interest.

In theory you can register the fd at creation time with the full interest
set and leave it there for its whole life, without having to call
epoll_ctl() at every switch between read/write. It's true that you could
receive false events, but when we studied the frequency of those false events
on a very highly loaded HTTP server, they turned out to be both very rare
and without effect on server performance.
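
In code, the register-once pattern looks roughly like this ( a sketch;
try_read() and try_write() are hypothetical handlers ):

/* register once, for the whole life of the connection */
if (sys_epoll_ctl(epfd, EP_CTL_ADD, fd, POLLIN | POLLOUT) < 0)
        return -1;

/* later, demultiplex at event time instead of re-registering */
nfds = sys_epoll_wait(epfd, &pfds, -1);
for (n = 0; n < nfds; n++) {
        if (pfds[n].revents & POLLIN)
                try_read(pfds[n].fd);
        if (pfds[n].revents & POLLOUT)
                try_write(pfds[n].fd);  /* a false event just makes the
                                           write return EAGAIN again */
}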



- Davide


2002-10-29 00:12:26

by bert hubert

[permalink] [raw]
Subject: Re: and nicer too - Re: [PATCH] epoll more scalable than poll

On Mon, Oct 28, 2002 at 03:45:06PM -0800, John Gardiner Myers wrote:

> As you have amply demonstrated, the current epoll API is error prone.
> The API should be fixed to test the poll condition and, if necessary,
> drop an event upon insertion to the set.

That is a semantics change and not an API/ABI change. To reiterate, you
mention the following scenario:

for(;;) {
        nfds = sys_epoll_wait(kdpfd, &pfds, -1);
        for(n = 0; n < nfds; ++n) {
                if((fd = pfds[n].fd) == s) {
                        /* 1: accept client (SYN/SYN|ACK/ACK completed) */
                        client = accept(s, (struct sockaddr*)&local, &addrlen);
                        if(client < 0) {
                                perror("accept");
                                continue;
                        }

                        /* 2: packet comes in, client becomes readable */
                        /* 3: registering interest */
                        if (sys_epoll_ctl(kdpfd, EP_CTL_ADD, client, POLLIN) < 0) {
                                fprintf(stderr, "sys_epoll set insertion error: fd=%d\n", client);
                                return -1;
                        }

                        /* 4: interest only now registered, no edge will be
                           reported, our fd is lost */

                        fd = client;
                }
                do_use_fd(fd);
        }
}

There are lots of ways to solve this, I bet Davide knows best. Perhaps it is
solved already, you can't tell from only studying the API, the problem isn't
intrinsic to it.

An easy solution is to have sys_epoll_ctl() check whether there is data
ready, and make sure there is an edge to report in that case to the next
call of sys_epoll_wait().
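
Inside the kernel, that check might look roughly like this at the tail of
the EP_CTL_ADD path ( a sketch only, not code from the patch;
ep_queue_event() is a hypothetical helper, and the locking would need care ):

/* after the new epitem "dpi" has been hashed into the set */
unsigned int revents = file->f_op->poll(file, NULL);

if (revents & dpi->pfd.events) {
        /* the condition already holds: queue a synthetic event so
         * the next sys_epoll_wait() reports the edge that the
         * caller would otherwise have missed */
        ep_queue_event(ep, dpi, revents);
}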

Regards,

bert

--
http://www.PowerDNS.com Versatile DNS Software & Services
http://lartc.org Linux Advanced Routing & Traffic Control HOWTO

2002-10-29 00:17:26

by Davide Libenzi

[permalink] [raw]
Subject: Re: and nicer too - Re: [PATCH] epoll more scalable than poll

On Tue, 29 Oct 2002, bert hubert wrote:

> On Mon, Oct 28, 2002 at 03:45:06PM -0800, John Gardiner Myers wrote:
>
> > As you have amply demonstrated, the current epoll API is error prone.
> > The API should be fixed to test the poll condition and, if necessary,
> > drop an event upon insertion to the set.
>
> That is a semantics change and not an API/ABI change. To reiterate, you
> mention the following scenario:
>
> for(;;) {
>         nfds = sys_epoll_wait(kdpfd, &pfds, -1);
>         for(n = 0; n < nfds; ++n) {
>                 if((fd = pfds[n].fd) == s) {
>                         /* 1: accept client (SYN/SYN|ACK/ACK completed) */
>                         client = accept(s, (struct sockaddr*)&local, &addrlen);
>                         if(client < 0) {
>                                 perror("accept");
>                                 continue;
>                         }
>
>                         /* 2: packet comes in, client becomes readable */
>                         /* 3: registering interest */
>                         if (sys_epoll_ctl(kdpfd, EP_CTL_ADD, client, POLLIN) < 0) {
>                                 fprintf(stderr, "sys_epoll set insertion error: fd=%d\n", client);
>                                 return -1;
>                         }
>
>                         /* 4: interest only now registered, no edge will be
>                            reported, our fd is lost */
>
>                         fd = client;
>                 }
>                 do_use_fd(fd);
>         }
> }
>
> There are lots of ways to solve this, I bet Davide knows best. Perhaps it is
> solved already, you can't tell from only studying the API, the problem isn't
> intrinsic to it.

The code above is just fine. The "fd" is not lost, because falling
through to :

do_use_fd(fd);

will make good use of it.



- Davide


2002-10-29 00:34:18

by bert hubert

[permalink] [raw]
Subject: Re: and nicer too - Re: [PATCH] epoll more scalable than poll

On Mon, Oct 28, 2002 at 04:32:50PM -0800, Davide Libenzi wrote:

> The code above is just fine. The "fd" is not lost, because falling
> through to :
>
> do_use_fd(fd);
>
> will make good use of it.

Only if that code dares to read immediately from the fd without having seen
an explicit POLLIN event, which also means that epoll can only be used in
this fashion with nonblocking sockets.

The listening socket needs to do that anyhow, as the connection may have
vanished in the meantime - select has that problem too.

To trigger the problem, see the following very nasty 'exploit'. Telnet to
port 10000 and immediately enter a line of text and press enter. This line
might very well be 'GET / HTTP/1.0'.

Because of the inserted sleep(2); below, this line will never be reported
by sys_epoll_wait() in its current form. To see it appear, enter an
additional line after a while.

So either fix up sys_epoll_ctl() to insert an edge if there is data at
insertion time (which probably needs some locking to get right), OR
document in the interface that it should only be used with non-blocking
sockets and that the caller should immediately try to read after the initial
sys_epoll_ctl insert call. This last solution sounds very bad and confusing.

Code:

#include <stdio.h>
#include <errno.h>
#include <asm/page.h>
#include <asm/poll.h>
#include <linux/linkage.h>
#include <linux/eventpoll.h>
#include <linux/unistd.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <stdlib.h>
#include <unistd.h>

#define __sys_epoll_create(maxfds) _syscall1(int, sys_epoll_create, int, maxfds)
#define __sys_epoll_ctl(epfd, op, fd, events) _syscall4(int, sys_epoll_ctl, \
        int, epfd, int, op, int, fd, unsigned int, events)

#define __sys_epoll_wait(epfd, events, timeout) _syscall3(int, sys_epoll_wait, \
        int, epfd, struct pollfd const **, events, int, timeout)

__sys_epoll_create(maxfds)
__sys_epoll_ctl(epfd, op, fd, events)
__sys_epoll_wait(epfd, events, timeout)

// asmlinkage int sys_epoll_wait(int epfd, struct pollfd const **events, int timeout);

int main()
{
        int kdpfd;
        struct pollfd const *pfds;
        int nfds;
        struct sockaddr_in local;
        socklen_t addrlen;
        int s, client;
        int n;
        char line[80];
        int ret;
        int tmp = 1;

        if ((kdpfd = sys_epoll_create(10)) < 0) {
                perror("sys_epoll_create");
                return -1;
        }

        s = socket(AF_INET, SOCK_STREAM, 0);

        memset(&local, 0, sizeof(local));
        local.sin_family = AF_INET;
        local.sin_addr.s_addr = INADDR_ANY;

        if (setsockopt(s, SOL_SOCKET, SO_REUSEADDR, (char *)&tmp, sizeof tmp) < 0) {
                perror("setsockopt");
                return 0;
        }

        local.sin_port = htons(10000);

        if (bind(s, (struct sockaddr *)&local, sizeof(local)) < 0) {
                perror("bind");
                return 0;
        }

        if (listen(s, 128) < 0) {
                perror("listen");
                return 0;
        }

        if (sys_epoll_ctl(kdpfd, EP_CTL_ADD, s, POLLIN) < 0) {
                fprintf(stderr, "sys_epoll set insertion error: fd=%d\n", s);
                return -1;
        }
        addrlen = sizeof(local);

        for (;;) {
                nfds = sys_epoll_wait(kdpfd, &pfds, -1);
                fprintf(stderr, "sys_epoll_wait returned: %d\n", nfds);

                for (n = 0; n < nfds; ++n) {
                        if (pfds[n].fd == s) {
                                client = accept(s, (struct sockaddr *)&local, &addrlen);
                                /* widen the race window: the client's first
                                   line arrives before the EP_CTL_ADD below */
                                sleep(2);
                                if (client < 0) {
                                        perror("accept");
                                        continue;
                                }
                                if (sys_epoll_ctl(kdpfd, EP_CTL_ADD, client, POLLIN) < 0) {
                                        fprintf(stderr, "sys_epoll set insertion error: fd=%d\n", client);
                                        return -1;
                                }
                        } else {
                                if ((ret = read(pfds[n].fd, line, 79)) > 0)
                                        printf("read from fd %d: '%.*s'\n", pfds[n].fd,
                                               ret > 2 ? ret - 2 : 0, line);
                        }
                }
        }
        return 0;
}



>
> - Davide

--
http://www.PowerDNS.com Versatile DNS Software & Services
http://lartc.org Linux Advanced Routing & Traffic Control HOWTO

2002-10-29 00:41:28

by Davide Libenzi

[permalink] [raw]
Subject: Re: and nicer too - Re: [PATCH] epoll more scalable than poll

On Tue, 29 Oct 2002, bert hubert wrote:

> Only if that code dares to read immediately from the fd without having seen
> an explicit POLLIN event, which also means that epoll can only be used in
> this fashion with nonblocking sockets.

The epoll interface has to be used with non-blocking fds. The EAGAIN
return code from read/write tells you that you can safely go to wait for
events for that fd, because by making the read/write return EAGAIN you
consumed the whole I/O space for that fd. Consuming the whole I/O space
means that you brought the signal to zero ( talking in EE terms ), and a
following 0->1 transition will trigger the event, where 1 means I/O space
available ...
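
In user space, this rule comes down to a drain loop like the following
sketch ( consume() and handle_eof_or_error() are hypothetical ):

/* drain a non-blocking fd completely; only once read() reports
 * EAGAIN is it safe to go back to sys_epoll_wait() for this fd,
 * because the next readable event is a fresh 0->1 transition */
static void drain(int fd)
{
        char buf[4096];
        ssize_t n;

        for (;;) {
                n = read(fd, buf, sizeof(buf));
                if (n > 0) {
                        consume(buf, n);
                        continue;
                }
                if (n < 0 && errno == EAGAIN)
                        break;          /* I/O space fully consumed */
                handle_eof_or_error(fd, n);
                break;
        }
}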



- Davide


2002-10-29 00:41:55

by Jamie Lokier

[permalink] [raw]
Subject: Re: [PATCH] epoll more scalable than poll

bert hubert wrote:
> > :( I was hoping sys_epoll would be scalable without increasing the
> > number of system calls per event.
>
> I see only one call per event? sys_epoll_wait. Perhaps sys_epoll_ctl to
> register a new interest.

As Davide pointed out, you need a second call before the sys_epoll_wait
if you're waiting for fds that epoll doesn't support (such as a tty).

-- Jamie

2002-10-29 00:47:13

by bert hubert

[permalink] [raw]
Subject: Re: and nicer too - Re: [PATCH] epoll more scalable than poll

On Mon, Oct 28, 2002 at 04:57:12PM -0800, Davide Libenzi wrote:

> > Only if that code dares to read immediately from the fd without having seen
> > an explicit POLLIN event, which also means that epoll can only be used in
> > this fashion with nonblocking sockets.
>
> The epoll interface has to be used with non-blocking fds. The EAGAIN

Ok, so that is two things that need to be in the manpage and probably in
bold:

1) epoll only works on pipes and sockets
(not on tty, not on files)

2) epoll must be used on non-blocking sockets only
(and describe the race that happens otherwise)
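
For rule 2, the manpage could carry an illustration as small as this
standard fcntl idiom:

#include <fcntl.h>

/* put a socket into non-blocking mode before handing it to
 * sys_epoll_ctl() */
int flags = fcntl(fd, F_GETFL, 0);
if (flags < 0 || fcntl(fd, F_SETFL, flags | O_NONBLOCK) < 0)
        perror("fcntl");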

If you send me the source of your manpages I'll work it in if you want.

I still vote for checking at insert time however, unless that is too
expensive. Yes, we want people to read the manpages and heed them, but we
also do not want to create interfaces that are needlessly error-prone.

Regards,

bert

--
http://www.PowerDNS.com Versatile DNS Software & Services
http://lartc.org Linux Advanced Routing & Traffic Control HOWTO

2002-10-29 00:58:08

by Davide Libenzi

[permalink] [raw]
Subject: Re: and nicer too - Re: [PATCH] epoll more scalable than poll

On Tue, 29 Oct 2002, bert hubert wrote:

> Ok, so that is two things that need to be in the manpage and probably in
> bold:
>
> 1) epoll only works on pipes and sockets
> (not on tty, not on files)
>
> 2) epoll must be used on non-blocking sockets only
> (and describe the race that happens otherwise)
>
> If you send me the source of your manpages I'll work it in if you want.

No problem ...


> I still vote for checking at insert time however unless that is too
> expensive. Yes, we want people to read the manpages and heed them, but we
> also not want to create interfaces that are needlessly errorprone.

This would make the API really error prone, by giving the user the
impression that he can go to sleep on a given fd any time he wants. The
rule "you can wait for events only after you received EAGAIN" is a
very short and precise rule, as opposed to "you can go wait for
events after you received EAGAIN or after accept/connect". And using the
fd immediately after an accept/connect is encouraged by two very
likely facts :

1) on accept() it's very likely that the first packet brought you something
more than a SYN

2) on connect() you have the full I/O write space available
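
On the accept side, that rule comes out roughly as ( a sketch;
set_nonblocking() and drain_until_eagain() are hypothetical helpers ):

client = accept(s, (struct sockaddr *)&local, &addrlen);
if (client >= 0) {
        set_nonblocking(client);
        if (sys_epoll_ctl(epfd, EP_CTL_ADD, client, POLLIN) == 0)
                /* per 1), the first packets very likely carried data
                 * beyond the SYN: use the fd immediately, down to
                 * EAGAIN, before ever waiting on it */
                drain_until_eagain(client);
}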



- Davide


2002-10-29 01:08:20

by Hanna Linder

[permalink] [raw]
Subject: Re: [Lse-tech] Re: and nicer too - Re: [PATCH] epoll more scalable than poll

--On Monday, October 28, 2002 17:13:53 -0800 Davide Libenzi <[email protected]> wrote:

> On Tue, 29 Oct 2002, bert hubert wrote:
>
>> Ok, so that is two things that need to be in the manpage and probably in
>> bold:
>>
>> 1) epoll only works on pipes and sockets
>> (not on tty, not on files)
>>
>> 2) epoll must be used on non-blocking sockets only
>> (and describe the race that happens otherwise)
>>
>> If you send me the source of your manpages I'll work it in if you want.
>
> No problem ...
>

If you need any help with the Man pages I will be glad to
help too. It looks like providing examples of how to use it would be
very useful since this is something application writers are supposed
to use...

> events after you received EAGAIN or after accept/connect". And using the
> fd immediately after an accept/connect is encouraged by two very
> likely facts :
>
> 1) on accept() it's very likely that the first packet brought you something
> more than a SYN
>
> 2) on connect() you have the full I/O write space available

Do you think these should be mentioned explicitly?

Thanks a lot.

Hanna

2002-10-29 01:23:42

by Davide Libenzi

[permalink] [raw]
Subject: Re: [Lse-tech] Re: and nicer too - Re: [PATCH] epoll more scalable than poll

On Mon, 28 Oct 2002, Hanna Linder wrote:

> If you need any help with the Man pages I will be glad to
> help too. It looks like providing examples of how to use it would be
> very useful since this is something application writers are supposed
> to use...

IMHO the ephttpd.c, once stripped of the tons of #ifdefs that handle
the other interfaces, is a very clear example of how to use the API
together with coroutines. Using it with an event driven state machine
would require a little bit more code, but nothing brutal ...
Anyway yes, the NOTES section of the man pages should be "enriched" with a
few notes and maybe code samples.



- Davide


2002-10-29 01:41:09

by John Myers

[permalink] [raw]
Subject: Security critical race condition in epoll code

There appears to be a race condition in the epoll patch which permits
user space to scribble in the kernel's free memory.

First a user space program creates an epoll fd and adds a socket to it
using sys_epoll_ctl(...EP_CTL_ADD...)

Then the program creates two threads, A and B. Simultaneously, A calls
sys_epoll_ctl(...EP_CTL_MOD...) and B calls
sys_epoll_ctl(...EP_CTL_DEL...) on the socket that was previously added.

Thread A runs up through the point where ep_find() returns the (struct
epitem *) for the socket.

Thread B then runs and ep_remove() frees the (struct epitem *).

Thread A then runs some more and stores the value of events into the now
freed block of memory pointed to by dpi.


2002-10-29 01:45:33

by Jamie Lokier

[permalink] [raw]
Subject: Re: [PATCH] epoll more scalable than poll

Davide Libenzi wrote:
> Jamie, doing that is not a real problem. The fact is that sys_epoll aims
> to solve issues where scalability on a huge number of fds is involved. By
> covering sockets ( network connections ) and pipes ( cgi execution ) you
> get pretty wide scalability coverage. Usually you know where an
> fd was born, so you're typically able to handle it correctly. Those reasons,
> together with the fact that I did not want to introduce revolutions in the
> kernel, drove me to limit the sys_epoll coverage to sockets and pipes.

Oh I agree this is an acceptable limitation. Just wondering whether
being a socket/pipe is sufficient for an fd to be safely depended on.
I.e. does it work on a non-IP socket, a packet socket, an IPX socket,
etc.?

It would be good if epoll would at least refuse to register fds that
it can't handle, returning EINVAL for them. If it's as simple as
socket+pipe, that's a trivial test in ep_insert.

I've just read the /dev/epoll patch. I think it makes sense, in the
long run, to share infrastructure with that other event notification
subsystem - sigio. The two should really be interchangeable interfaces
to the same underlying event notification system - not one interface
handling some fds and the other handling different fds.

(Ideally, though, with the new waitqueue wakeup callback functions
that were needed for aio the old fd poll mechanism can be made to
generate events - which epoll and sigio and aio and poll() could all
use - full circle back to a beautiful and harmonious unix world once
more.)

-- Jamie

2002-10-29 01:48:31

by Jamie Lokier

[permalink] [raw]
Subject: Re: [PATCH] epoll more scalable than poll

Jamie Lokier wrote:
> bert hubert wrote:
> > > :( I was hoping sys_epoll would be scalable without increasing the
> > > number of system calls per event.
> >
> > I see only one call per event? sys_epoll_wait. Perhaps sys_epoll_ctl to
> > register a new interest.
>
> As Davide pointed out, you need a second call before the sys_epoll_wait
> if you're waiting for fds that epoll doesn't support (such as a tty).
^
insert "also"

-- Jamie

2002-10-29 01:59:37

by Jamie Lokier

[permalink] [raw]
Subject: Re: and nicer too - Re: [PATCH] epoll more scalable than poll

bert hubert wrote:
> 1) epoll only works on pipes and sockets
> (not on tty, not on files)

fifos come to mind as the other things which are a bit like pipes/sockets.

>
> 2) epoll must be used on non-blocking sockets only
> (and describe the race that happens otherwise)

That much is implied simply by epoll being a queue of not_ready->ready
edge transitions. It would be good for the manual to explain this,
but it is clearly implied by the API if you think about it.

-- Jamie

2002-10-29 01:57:24

by Davide Libenzi

[permalink] [raw]
Subject: Re: Security critical race condition in epoll code

On Mon, 28 Oct 2002, John Gardiner Myers wrote:

> First a user space program creates an epoll fd and adds a socket to it
> using sys_epoll_ctl(...EP_CTL_ADD...)
>
> Then the program creates two threads, A and B. Simultaneously, A calls
> sys_epoll_ctl(...EP_CTL_MOD...) and B calls
> sys_epoll_ctl(...EP_CTL_DEL...) on the socket that was previously added.
>
> Thread A runs up through the point where ep_find() returns the (struct
> epitem *) for the socket.
>
> Thread B then runs and ep_remove() frees the (struct epitem *).
>
> Thread A then runs some more and stores the value of events into the now
> freed block of memory pointed to by dpi.

Ugh ... I forgot that you're the one that is handling an fd with 25000
threads :) This is true and it'll be fixed before you can read this
message ...



- Davide


2002-10-29 02:29:05

by Davide Libenzi

[permalink] [raw]
Subject: Re: and nicer too - Re: [PATCH] epoll more scalable than poll

On Tue, 29 Oct 2002, Jamie Lokier wrote:

> >
> > 2) epoll must be used on non-blocking sockets only
> > (and describe the race that happens otherwise)
>
> That much is implied simply by epoll being a queue of not_ready->ready
> edge transitions. It would be good for the manual to explain this,
> but it is clearly implied by the API if you think about it.

Bert and Hanna are working on man pages right now with the aim of adding
those notes ...



- Davide


2002-10-29 03:22:26

by Davide Libenzi

[permalink] [raw]
Subject: Re: Security critical race condition in epoll code

On Mon, 28 Oct 2002, John Gardiner Myers wrote:

> Thread A then runs some more and stores the value of events into the now
> freed block of memory pointed to by dpi.

Thank you very much for your input John, I really appreciate it :

http://www.xmailserver.org/linux-patches/sys_epoll-2.5.44-last.diff

If you have any other complaints, please let me know ...



- Davide


2002-10-29 04:01:15

by Hanna Linder

[permalink] [raw]
Subject: [PATCH] Updated sys_epoll now with man pages

--On Monday, October 28, 2002 18:44:50 -0800 Davide Libenzi wrote:
> Bert and Hanna are working on man pages right now with the aim of adding
> those notes ...

The source for the man pages is included and the viewable versions
are on Davide's web site here:

http://www.xmailserver.org/linux-patches/epoll.txt
http://www.xmailserver.org/linux-patches/epoll_create.txt
http://www.xmailserver.org/linux-patches/epoll_ctl.txt
http://www.xmailserver.org/linux-patches/epoll_wait.txt

They describe how to use epoll, under what circumstances it will be
best used or not to use it and other man page tidbits. Also note
there is an updated version of sys_epoll with the latest input received
from the mailing lists. That is attached as well (for easy merging) ;)

Bert and Davide did all the work. I just did a little cleanup.
Please include the man pages if^H^Hwhen you take the patches.

Thanks.

Hanna


Attachments:
epoll.2 (3.52 kB)
epoll_create.2 (1.92 kB)
epoll_ctl.2 (2.69 kB)
epoll_wait.2 (2.44 kB)
sys_epoll-2.5.44-last.diff (46.28 kB)

2002-10-29 04:50:15

by Davide Libenzi

[permalink] [raw]
Subject: Re: [PATCH] epoll more scalable than poll

On Tue, 29 Oct 2002, Jamie Lokier wrote:

> Davide Libenzi wrote:
> Oh I agree this is an acceptable limitation. Just wondering whether
> being a socket/pipe is sufficient for an fd to be safely depended on.
> I.e. does it work on a non-IP socket, a packet socket, an IPX socket,
> etc.?

Yes, by hooking sk_wake_async(), which is called from the standard
->data_ready and ->write_space of the generic socket support, all socket
types are supported. Well, I should say "should" instead of "are", because
I never tested it with sockets other than TCP/IP :)


> It would be good if epoll would at least refuse to register fds that
> it can't handle, returning EINVAL for them. If it's as simple as
> socket+pipe, that's a trivial test in ep_insert.

This can certainly be implemented if many of you feel that it could be
useful. The clean way to find out whether a file* is of a given type would
be to make the "struct file_operations" of the compatible files ( sockets
and pipes ) non-static and to use something like :

if (f->f_op == ...)

to test the target file type. I'm already doing this to verify the
coherence of the epoll file descriptor.
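
With the pipe and socket file_operations exported, the gate in ep_insert()
could then be as small as this sketch ( the socket_file_ops / *_pipe_fops
identifiers are assumptions about what would be made non-static ):

extern struct file_operations socket_file_ops;          /* net/socket.c */
extern struct file_operations read_pipe_fops,           /* fs/pipe.c */
        write_pipe_fops, rdwr_pipe_fops;

static inline int ep_supported_file(struct file *f)
{
        return f->f_op == &socket_file_ops ||
                f->f_op == &read_pipe_fops ||
                f->f_op == &write_pipe_fops ||
                f->f_op == &rdwr_pipe_fops;
}

/* ... and in ep_insert() : */
if (!ep_supported_file(file))
        return -EINVAL;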



> I've just read the /dev/epoll patch. I think it makes sense, in the
> long run, to share infrastructure with that other event notification
> subsystem - sigio. The two should really be interchangeable interfaces
> to the same underlying event notification system - not one interface
> handling some fds and the other handling different fds.

IMHO sys_epoll is going to be a replacement for rt-signals, because it
scales better, it collapses events and does not have the overflowing queue
problem.



> (Ideally, though, with the new waitqueue wakeup callback functions
> that were needed for aio the old fd poll mechanism can be made to
> generate events - which epoll and sigio and aio and poll() could all
> use - full circle back to a beautiful and harmonious unix world once
> more.)

The sys_epoll interface was coded to use the existing infrastructure,
without any legacy code added to suit the implementation. Basically,
besides the few lines added to fs/pipe.c to support pipes ( rt-signals did
not support them ), the hook lies inside sk_wake_async().



- Davide




2002-10-29 05:03:44

by Andrew Morton

[permalink] [raw]
Subject: Re: [PATCH] Updated sys_epoll now with man pages

Hanna Linder wrote:
>
> sys_epoll-2.5.44-last.diff

Folks,

when I took a 15-minute look at this code last week I found several
bugs, some of which were grave. It's a terrible thing to say, but
a sensible person would expect that a closer inspection would turn
up more problems.

Now, adding bugs to existing code is fine and traditional - people
find them quickly and they get fixed up.

But for *new* code, problems will take months to discover. The only
practical way to get this code vetted for inclusion is a close review.

And that is a sizeable task. The core implementation file is
1,600 lines. And I wonder how many people have counted the
number of comments in there?

Well, I'll make it easy: zero. Nil. Nada.

(Well, OK, a copyright header, and something which got cut-n-pasted
from inode.c)

In my wildly unconventional opinion this alone makes epoll just a hack,
of insufficient quality for inclusion in Linux. We *have* to stop doing
this to ourselves!


epoll seems to be a good and desirable thing. To move forward I
believe we need to get this code reviewed, and documented.

I can do that if you like; it will take me several weeks to get onto
it. But until that is completed I would oppose inclusion of this
code.

2002-10-29 05:15:41

by Davide Libenzi

[permalink] [raw]
Subject: Re: [PATCH] Updated sys_epoll now with man pages

On Mon, 28 Oct 2002, Andrew Morton wrote:

> Folks,
>
> when I took a 15-minute look at this code last week I found several
> bugs, some of which were grave. It's a terrible thing to say, but
> a sensible person would expect that a closer inspection would turn
> up more problems.
>
> Now, adding bugs to existing code is fine and traditional - people
> find them quickly and they get fixed up.
>
> But for *new* code, problems will take months to discover. The only
> practical way to get this code vetted for inclusion is a close review.
>
> And that is a sizeable task. The core implementation file is
> 1,600 lines. And I wonder how many people have counted the
> number of comments in there?
>
> Well, I'll make it easy: zero. Nil. Nada.
>
> (Well, OK, a copyright header, and something which got cut-n-pasted
> from inode.c)
>
> In my wildly unconventional opinion this alone makes epoll just a hack,
> of insufficient quality for inclusion in Linux. We *have* to stop doing
> this to ourselves!
>
>
> epoll seems to be a good and desirable thing. To move forward I
> believe we need to get this code reviewed, and documented.
>
> I can do that if you like; it will take me several weeks to get onto
> it. But until that is completed I would oppose inclusion of this
> code.

Andrew, last time you wrote me you told me that it was a matter of a
couple of hours spent together looking at the code. And I believe that a
couple of hours is just about the right time. Most of the code is
_trivial_ and the "core code" can be condensed into no more than 100 lines.
I agree with you that comments are missing, but I don't think it's a
problem to add them in the "sensible" places. It would have helped if
you had given me this feedback one week ago though ...



- Davide


2002-10-29 05:25:30

by Randy.Dunlap

[permalink] [raw]
Subject: Re: [Lse-tech] Re: [PATCH] Updated sys_epoll now with man pages

On Mon, 28 Oct 2002, Andrew Morton wrote:

| Hanna Linder wrote:
| >
| > sys_epoll-2.5.44-last.diff
|
| Folks,
|
| when I took a 15-minute look at this code last week I found several
| bugs, some of which were grave. It's a terrible thing to say, but
| a sensible person would expect that a closer inspection would turn
| up more problems.
|
| Now, adding bugs to existing code is fine and traditional - people
| find them quickly and they get fixed up.
|
| But for *new* code, problems will take months to discover. The only
| practical way to get this code vetted for inclusion is a close review.
|
| And that is a sizeable task. The core implementation file is
| 1,600 lines. And I wonder how many people have counted the
| number of comments in there?
|
| Well, I'll make it easy: zero. Nil. Nada.
|
| (Well, OK, a copyright header, and something which got cut-n-pasted
| from inode.c)
|
| In my wildly unconventional opinion this alone makes epoll just a hack,
| of insufficient quality for inclusion in Linux. We *have* to stop doing
| this to ourselves!

I expect this to be unpopular, but I've been saying lately that
when new kernel APIs or syscalls or whatsoever are added to
Linux, there should be sufficient docs along with the patch(es)
explaining the patch and some intended uses of it, perhaps even
with examples. Ingo does this sometimes. Some people don't
bother to even come close.

Leading by example would be a nice thing to see here.

And "use the source Luke" is cool and is definitive.
But it's not the only way to get things done.
Well, maybe it is, but it shouldn't and needn't be.

| epoll seems to be a good and desirable thing. To move forward I
| believe we need to get this code reviewed, and documented.
|
| I can do that if you like; it will take me several weeks to get onto
| it. But until that is completed I would oppose inclusion of this
| code.
| -------------------------------------------------------

--
~Randy

2002-10-29 05:31:57

by Davide Libenzi

[permalink] [raw]
Subject: Re: [Lse-tech] Re: [PATCH] Updated sys_epoll now with man pages

On Mon, 28 Oct 2002, Randy.Dunlap wrote:

> I expect this to be unpopular, but I've been saying lately that
> when new kernel APIs or syscalls or whatsoever are added to
> Linux, there should be sufficient docs along with the patch(es)
> explaining the patch and some intended uses of it, perhaps even
> with examples. Ingo does this sometimes. Some people don't
> bother to even come close.
>
> Leading by example would be a nice thing to see here.

Hi Randy, no it's not unpopular :) Those were posted together with the
patch :

http://www.xmailserver.org/linux-patches/epoll.txt
http://www.xmailserver.org/linux-patches/epoll_create.txt
http://www.xmailserver.org/linux-patches/epoll_ctl.txt
http://www.xmailserver.org/linux-patches/epoll_wait.txt

to try to explain the new API from the user side. From the kernel side,
epoll entirely uses the sk_wake_async() that has been available for many
versions, and it is not expected to offer kernel services to other kernel
modules.




- Davide


2002-10-29 05:38:20

by Randy.Dunlap

[permalink] [raw]
Subject: Re: [Lse-tech] Re: [PATCH] Updated sys_epoll now with man pages

On Mon, 28 Oct 2002, Davide Libenzi wrote:

| On Mon, 28 Oct 2002, Randy.Dunlap wrote:
|
| > I expect this to be unpopular, but I've been saying lately that
| > when new kernel APIs or syscalls or whatsoever are added to
| > Linux, there should be sufficient docs along with the patch(es)
| > explaining the patch and some intended uses of it, perhaps even
| > with examples. Ingo does this sometimes. Some people don't
| > bother to even come close.
| >
| > Leading by example would be a nice thing to see here.
|
| Hi Randy, no it's not unpopular :) Those were posted together with the
| patch :
|
| http://www.xmailserver.org/linux-patches/epoll.txt
| http://www.xmailserver.org/linux-patches/epoll_create.txt
| http://www.xmailserver.org/linux-patches/epoll_ctl.txt
| http://www.xmailserver.org/linux-patches/epoll_wait.txt
|
| to try to explain the new API from the user side. From the kernel side,
| epoll entirely uses the sk_wake_async() that has been available for many
| versions, and it is not expected to offer kernel services to other kernel
| modules.

Yes, I knew that and I thought about it while typing, but
my dynamic RAM was too dynamic and not being refreshed often
enough. Thanks for doing it for me.

BTW, I didn't mean unpopular for the epoll patch, I meant
unpopular in general, especially for development kernel patches:
if every new feature required docs along with it, it might slow
down Linux development by one day, but help out everyone in
the long run (tm?).

--
~Randy

2002-10-29 06:00:12

by Randy.Dunlap

[permalink] [raw]
Subject: Re: [Lse-tech] Re: [PATCH] Updated sys_epoll now with man pages

On Mon, 28 Oct 2002, Davide Libenzi wrote:

| On Mon, 28 Oct 2002, Randy.Dunlap wrote:
|
| > Yes, I knew that and I thought about it while typing, but
| > my dynamic RAM was too dynamic and not being refreshed often
| > enough. Thanks for doing it for me.
|
| I knew it, I already sent you the links before :)
Yep. 8:)

| > BTW, I didn't mean unpopular for the epoll patch, I meant
| > unpopular in general, especially for development kernel patches:
| > if every new feature required docs along with it, it might slow
| > down Linux development by one day, but help out everyone in
| > the long run (tm?).
|
| I do agree with you about comments, Randy, don't get me wrong. But you
| know what my job situation is :) Looking at the kernel source though,
| you find something like :
|
| /* add the fd to the interest set */
| do_add_fd_to_the_interest_set();
|
| and then the code that really would need comments is completely naked.
| While, again, I do agree that comments are completely missing in the
| patch, I'm not the kind of guy that would like a function like :
|
| static struct epitem *ep_find_nl(struct eventpoll *ep, int fd)
| {
...
| }
|
| commented with "search an fd inside the hash". What does a comment like
| that add to this code ?

nada.

Just to be clear (and reiterate), my comments were not about
the epoll patch, but about adding new features/kernel APIs/etc
in general.

Later,
--
~Randy

2002-10-29 05:56:34

by Davide Libenzi

[permalink] [raw]
Subject: Re: [Lse-tech] Re: [PATCH] Updated sys_epoll now with man pages

On Mon, 28 Oct 2002, Randy.Dunlap wrote:

> Yes, I knew that and I thought about it while typing, but
> my dynamic RAM was too dynamic and not being refreshed often
> enough. Thanks for doing it for me.

I knew it, I already sent you the links before :)


> BTW, I didn't mean unpopular for the epoll patch, I meant
> unpopular in general, especially for development kernel patches:
> if every new feature required docs along with it, it might slow
> down Linux development by one day, but help out everyone in
> the long run (tm?).

I do agree with you about comments, Randy, don't get me wrong. But you
know what my job situation is :) Looking at the kernel source though,
you find something like :

/* add the fd to the interest set */
do_add_fd_to_the_interest_set();

and then the code that really would need comments is completely naked.
While, again, I do agree that comments are completely missing in the
patch, I'm not the kind of guy that would like a function like :

static struct epitem *ep_find_nl(struct eventpoll *ep, int fd)
{
        struct epitem *dpi = NULL;
        struct list_head *lsthead, *lnk;

        lsthead = &ep->hash[fd & ep->hmask];
        list_for_each(lnk, lsthead) {
                dpi = list_entry(lnk, struct epitem, llink);

                if (dpi->pfd.fd == fd) break;
                dpi = NULL;
        }

        DNPRINTK(3, (KERN_INFO "[%p] eventpoll: ep_find(%d) -> %p\n", current, fd, dpi));

        return dpi;
}

commented with "search an fd inside the hash". What a comment like that
adds to this code ?




- Davide




2002-10-29 06:07:34

by Davide Libenzi

[permalink] [raw]
Subject: Re: [Lse-tech] Re: [PATCH] Updated sys_epoll now with man pages

On Mon, 28 Oct 2002, Randy.Dunlap wrote:

> Just to be clear (and reiterate), my comments were not about
> the epoll patch, but about adding new features/kernel APIs/etc
> in general.

I know, I know, I'm just trying to recover from Andrew's punches :-)



- Davide


2002-10-29 07:18:44

by Davide Libenzi

[permalink] [raw]
Subject: Re: [PATCH] Updated sys_epoll now with man pages

On Mon, 28 Oct 2002, Andrew Morton wrote:

> epoll seems to be a good and desirable thing. To move forward I
> believe we need to get this code reviewed, and documented.

Ok, this is the long story ...
The epoll interface hooks completely into the existing sk_wake_async()
function that is already called to issue async events for rt-signals. The
patch adds :

struct file {
        ...
        /* file callback list */
        rwlock_t f_cblock;
        struct list_head f_cblist;
};

a list of callbacks to be called upon event dispatch. This list is
maintained by fs/fcblist.c, and each entry of the list is an item :

struct fcb_struct {
        struct list_head llink;
        void (*cbproc)(struct file *, void *, unsigned long *, long *);
        void *data;
        unsigned long local[FCB_LOCAL_SIZE];
};

This list is initialized in fs/file_table.c by calling :

static inline void file_notify_init(struct file *filep)
{
        rwlock_init(&filep->f_cblock);
        INIT_LIST_HEAD(&filep->f_cblist);
}

( defined in include/linux/fcblist.h ) in functions :

get_empty_filp()
init_private_file()

and is cleaned up by calling :

void file_notify_cleanup(struct file *filep)
{
        unsigned long flags;
        struct list_head *lnk, *lsthead;

        fcblist_write_lock(filep, flags);

        lsthead = &filep->f_cblist;
        while ((lnk = list_first(lsthead))) {
                struct fcb_struct *fcbp = list_entry(lnk, struct fcb_struct, llink);

                list_del(lnk);
                fcblist_write_unlock(filep, flags);
                kfree(fcbp);
                fcblist_write_lock(filep, flags);
        }

        fcblist_write_unlock(filep, flags);
}

from inside :

__fput()

When an fd is inserted in the interest set, the drivers/char/eventpoll.c
code adds a callback ( notify_proc() ) to the callback list of the
inserted file, thereby monitoring all the async dispatches done on that
file. The dispatch hooks in here :

static inline void sk_wake_async(struct sock *sk, int how, int band)
{
        if (sk->socket) {
                if (sk->socket->file)
                        file_send_notify(sk->socket->file, ion_band_table[band - POLL_IN],
                                         poll_band_table[band - POLL_IN]);
                if (sk->socket->fasync_list)
                        sock_wake_async(sk->socket, how, band);
        }
}

where the simple insertion of :

static inline void file_send_notify(struct file *filep, long ioevt, long plevt)
{
        long event[] = { ioevt, plevt, -1 };

        file_notify_event(filep, event);
}

void file_notify_event(struct file *filep, long *event)
{
        unsigned long flags;
        struct list_head *lnk, *lsthead;

        fcblist_read_lock(filep, flags);

        lsthead = &filep->f_cblist;
        list_for_each(lnk, lsthead) {
                struct fcb_struct *fcbp = list_entry(lnk, struct fcb_struct, llink);

                fcbp->cbproc(filep, fcbp->data, fcbp->local, event);
        }

        fcblist_read_unlock(filep, flags);
}

will guarantee that all the associated callbacks are properly invoked.
So, now we know how an event reaches the core of the implementation, that is :

static void notify_proc(struct file *file, void *data, unsigned long *local, long *event)
{
        struct epitem *dpi = data;
        struct eventpoll *ep = dpi->ep;
        struct pollfd *pfd;

        DNPRINTK(3, (KERN_INFO "[%p] eventpoll: notify(%p, %p, %ld, %ld) ep=%p\n",
                     current, file, data, event[0], event[1], ep));

        write_lock(&ep->lock);
        if (!(dpi->pfd.events & event[1]))
                goto out;

        if (dpi->index < 0 || dpi->ver != ep->ver) {
                if (ep->eventcnt >= (ep->numpages * POLLFD_X_PAGE))
                        goto out;
                dpi->index = ep->eventcnt++;
                dpi->ver = ep->ver;
                pfd = (struct pollfd *) (ep->pages[EVENT_PAGE_INDEX(dpi->index)] +
                                         EVENT_PAGE_OFFSET(dpi->index));
                *pfd = dpi->pfd;
        } else {
                pfd = (struct pollfd *) (ep->pages[EVENT_PAGE_INDEX(dpi->index)] +
                                         EVENT_PAGE_OFFSET(dpi->index));
                if (pfd->fd != dpi->pfd.fd) {
                        if (ep->eventcnt >= (ep->numpages * POLLFD_X_PAGE))
                                goto out;
                        dpi->index = ep->eventcnt++;
                        pfd = (struct pollfd *) (ep->pages[EVENT_PAGE_INDEX(dpi->index)] +
                                                 EVENT_PAGE_OFFSET(dpi->index));
                        *pfd = dpi->pfd;
                }
        }

        pfd->revents |= (pfd->events & event[1]);

        if (waitqueue_active(&ep->wq))
                wake_up(&ep->wq);
        if (waitqueue_active(&ep->poll_wait))
                wake_up(&ep->poll_wait);
out:
        write_unlock(&ep->lock);
}

The eventpoll interface uses a double buffer to store events. While user
space processes the previously returned ones, the kernel fills in the other
buffer. The buffers are switched on sys_epoll_wait() or the old
ioctl(EP_POLL). The buffers are mmap()ed in user space, so there is no
kernel to user space memory movement besides the 4/8 bytes of the returned
pointer. notify_proc() basically stores the new incoming event into the
existing slot ( if any ), or moves to the next slot if the source "fd" did
not have an event slot allocated yet. Determining whether the fd had a
previous slot allocated is done by checking ->index, ->ver and the stored
->pfd.fd. If the event triggers a new slot allocation, the number of
stored event slots ( ep->eventcnt ) is increased and, on exit, the wake
up list is invoked. Each epoll file descriptor has its own f->private_data
structure :

struct eventpoll {
        /* semaphore used to serialize event set changes */
        struct rw_semaphore acsem;
        /* rw lock used to serialize hash and double buffer access */
        rwlock_t lock;
        /* wait queue for sys_epoll_wait() and ioctl(EP_POLL) */
        wait_queue_head_t wq;
        /* wait queue for f->poll() */
        wait_queue_head_t poll_wait;
        /* fd hash and associated variables */
        struct list_head *hash;
        unsigned int hbits;
        unsigned int hmask;
        atomic_t hents;
        atomic_t resize;
        /* double buffer variables */
        int numpages;
        char **pages;
        char *pages0[MAX_EVENTPOLL_PAGES];
        char *pages1[MAX_EVENTPOLL_PAGES];
        /* vma base address where the mmap() took place */
        unsigned long vmabase;
        /* mmap()ed flag */
        atomic_t mmapped;
        /* number of available events */
        int eventcnt;
        /* versioning used to validate double buffer entries */
        event_version_t ver;
};

And each entry in the hash table is described by :

struct epitem {
        struct list_head llink;
        struct eventpoll *ep;
        struct file *file;
        struct pollfd pfd;
        int index;
        event_version_t ver;
};
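
From the fields above, the buffer switch that sys_epoll_wait() performs can
be pictured roughly like this ( a reconstruction for illustration only, not
the literal code from the patch ):

/* hand the filled buffer to user space and start filling the other
 * one; caller holds ep->lock */
static char **ep_switch_buffers(struct eventpoll *ep, int *nevents)
{
        char **filled = ep->pages;

        *nevents = ep->eventcnt;
        /* new events now go to the other buffer */
        ep->pages = (ep->pages == ep->pages0) ? ep->pages1 : ep->pages0;
        ep->eventcnt = 0;
        /* bumping ->ver invalidates the ->ver cached in each epitem,
         * forcing fresh slot allocation in the new epoch */
        ep->ver++;
        return filled;  /* mmap()ed: user space reads the events in place */
}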

The rest of the code is simply "infrastructure" needed to support a
file* ( f_op ) and also the inode source file system.
About the stability of the code, I can say that it has been heavily stress
tested ( the old epoll code is used in production appliances ) on both UP
and SMP, and the guys that actually tested it can report how many crashes
they had. Does it have bugs ? Obviously it does ... but spotting the
eventual new ones will be more productive than ranting about the fact
that you found some in the previous revision. Andrew, if we don't want to
merge this, it's fine for me. Just don't quote "several weeks" to review
something that "at the bone" is no more than 200 lines of code. And we're
talking about code that, per Linus' statement, will have about 8 months
to see the light ...




- Davide








2002-10-29 10:58:31

by bert hubert

[permalink] [raw]
Subject: Re: [PATCH] Updated sys_epoll now with man pages

On Mon, Oct 28, 2002 at 09:09:56PM -0800, Andrew Morton wrote:

> when I took a 15-minute look at this code last week I found several
> bugs, some of which were grave. It's a terrible thing to say, but

Are there still unresolved bugs?

> epoll seems to be a good and desirable thing. To move forward I
> believe we need to get this code reviewed, and documented.

Davide's code explanation looks nice - do you think it is sufficient?

> I can do that if you like; it will take me several weeks to get onto
> it. But until that is completed I would oppose inclusion of this
> code.

This sounds really harsh, as it means that the code does not, in fact, go
in. I'm no guru, but the code in question looks sane and reasonable. The
state of the kernel right now is such that I would really advocate this
going in; I'm sure we can get commitment to vet this code - which is all
the easier because it is going to see more use.

I'll be releasing a version of mtasker (http://ds9a.nl/mtasker) later today
which uses epoll, and which I'm going to take into production.

Regards,

bert

--
http://www.PowerDNS.com Versatile DNS Software & Services
http://lartc.org Linux Advanced Routing & Traffic Control HOWTO

2002-10-29 11:21:16

by Jamie Lokier

[permalink] [raw]
Subject: Re: [PATCH] epoll more scalable than poll

Davide Libenzi wrote:
> IMHO sys_epoll is going to be a replacement for rt-signals, because it
> scales better, it collapses events and does not have the overflowing queue
> problem.

Scalability is also solved by the signal-per-fd patch, as you know.
The main advantage of epoll is that it's lighter weight than rt-signals.

(IMHO signal-per-fd really ought to be included in 2.6 _anyway_, regardless
of any better mechanism for reading events.)

> The sys_epoll interface was coded to use the existing infrastructure,
> without any legacy code added to suit the implementation. Basically,
> besides the few lines added to fs/pipe.c to support pipes ( rt-signals did
> not support them ), the hook lies inside sk_wake_async().

I agree that was the way to do it for 2.4 and earlier - you have to
work with a range of kernels, with minimum impact.

But now in 2.5 it's appropriate to implement whatever's _right_.

Time for me to take the big plunge and try a 2.5 kernel on my IDE
laptop, I guess :-)

-- Jamie

2002-10-29 12:53:00

by Martin Waitz

[permalink] [raw]
Subject: Re: and nicer too - Re: [PATCH] epoll more scalable than poll

hi :)

On Mon, Oct 28, 2002 at 04:08:04PM -0800, Davide Libenzi wrote:
> > The code you wrote has the standard epoll race condition. If the file
> > descriptor 's' becomes readable before the call to sys_epoll_ctl,
> > sys_epoll_wait() will never return the socket. The connection will hang
> > and the file descriptor will effectively leak.
>
> So, please don't use :
>
> free((void *) rand());
>
> free() is flawed !! Be warned !!

this is not a good argument.
the situation above is real, and a new poll mechanism should handle it.
(the kevent paper has a nice section about this problem)
why put the burden on the application, if the kernel already has
all the information needed?

every api should be built to cause the least astonishment to its users.
epoll is much more scalable than standard poll, yet i don't think
it's a nice api.

you should at least build it in a way that it can be extended later.
it would be sad if one had to add yet another syscall to build a
better notification mechanism later.

the unified event mechanism introduced in bsd is a good example imho.
we should build something that is similarly useful for applications.

don't get me wrong, i greatly appreciate the work put into epoll,
i just think it's far from perfect at the moment.

--
CU, / Friedrich-Alexander University Erlangen, Germany
Martin Waitz // [Tali on IRCnet] [tali.home.pages.de] _________
______________/// - - - - - - - - - - - - - - - - - - - - ///
this is a manually generated mail; it contains //
typos and is valid even without capital letters. /
-
Those who would give up essential liberty to purchase a little
temporary safety deserve neither liberty nor safety.
Benjamin Franklin (1706 - 1790)



2002-10-29 13:03:23

by bert hubert

[permalink] [raw]
Subject: Re: and nicer too - Re: [PATCH] epoll more scalable than poll

On Mon, Oct 28, 2002 at 04:57:12PM -0800, Davide Libenzi wrote:

> The epoll interface has to be used with non-blocking fds. The EAGAIN
> return code from read/write tells you that you can safely go to wait for
> events for that fd because, by making the read/write return EAGAIN, you
> consumed the whole I/O space for that fd. Consuming the whole I/O space
> means that you brought the signal to zero ( talking in ee terms ), and a
> following 0->1 transition will trigger the event. Where 1 means I/O space
> available ...

I tried to modify the mtasker webserver
(http://ds9a.nl/mtasker/releases/mtasker-0.2.tar.gz) to work with epoll
instead of select and, well, this is awkward.

The right way to use epoll is (schematically);

int clientfd=accept();
setnonblocking(clientfd);

epfd=epoll_create();
epoll_ctl(epfd,EP_CTL_ADD, clientfd, POLLIN);

for(;;) {
	if(read(clientfd)<0 && errno==EAGAIN) {
		epoll_wait(epfd);
		continue;
	}
	epoll_ctl(epfd, EP_CTL_DEL, clientfd);	// remove again
	break;
}

This requires having the fd in epoll before trying to read, which means a
weird interface where you have to register your interest, try to read, and
unregister your interest in case it succeeded.

This means two epoll syscalls per read.

This is all solved if epoll_ctl() creates an edge if it finds that the poll
condition is met at insert time.

The way it works now is way more awkward than using regular poll(), and the
way it works now is very easy to get wrong because of this awkwardness. Even
semi-correct code which zeroes the poll state before calling epoll is wrong:

for(;;) {
	if(read(clientfd)<0 && errno==EAGAIN) {
		waitOn(clientfd);	/* function which registers our
					   interest with a central
					   dispatcher and waits */
		continue;
	}
	break;
}

Code like this would appear in many userspace threading abstractions, like
GNU Pth or mtasker. Instead, we need:

for(;;) {
	registerReadInterest(clientfd);
	if(read(clientfd)<0 && errno==EAGAIN) {
		waitOn(clientfd);	/* function which waits for our
					   interest to be satisfied */
		continue;
	}
	deleteReadInterest(clientfd);
	break;
}

If epoll_ctl made epoll_wait report an edge in case it finds that there is
already data, all this would be way simpler, allowing:

waitOn(clientfd);	/* function which registers interest and waits for
			   an edge to appear */
read(clientfd);

Regards,

bert

--
http://www.PowerDNS.com Versatile DNS Software & Services
http://lartc.org Linux Advanced Routing & Traffic Control HOWTO

2002-10-29 15:04:22

by Paul Larson

[permalink] [raw]
Subject: Re: [Lse-tech] Re: [PATCH] Updated sys_epoll now with man pages

On Mon, 2002-10-28 at 23:28, Randy.Dunlap wrote:
> I expect this to be unpopular, but I've been saying lately that
> when new kernel APIs or syscalls or whatsoever are added to
> Linux, there should be sufficient docs along with the patch(es)
> explaining the patch and some intended uses of it, perhaps even
> with examples. Ingo does this sometimes. Some people don't
> bother to even come close.
It would be great to see more people doing that; releasing docs
about intentions beforehand (RFCs and such) would be nice to see too.
As far as example programs go, cc'ing the Linux Test Project at
[email protected] would be a really nice thing too and would make
you very popular! :)

Thanks,
Paul Larson

2002-10-29 15:13:03

by bert hubert

[permalink] [raw]
Subject: Re: and nicer too - Re: [PATCH] epoll more scalable than poll

On Tue, Oct 29, 2002 at 01:59:04PM +0100, Martin Waitz wrote:

> every api should be build to cause the least astonishment to its users.
> epoll is much more scalable than standard poll, yet i don't think
> it's a nice api.

epoll is not a 'faster poll()'. It is edge based instead of level based -
some astonishment is in order here. Anyhow, all problems with the API
mentioned go away if epoll_ctl() inserts an edge if it finds that its poll
condition is met.

> the unified event mechanism introduced in bsd is a good example imho.
> we should build something that is similar useful for applications.

Sounds 2.7-ish.

Regards,

bert

--
http://www.PowerDNS.com Versatile DNS Software & Services
http://lartc.org Linux Advanced Routing & Traffic Control HOWTO

2002-10-29 19:44:25

by John Myers

[permalink] [raw]
Subject: Re: and nicer too - Re: [PATCH] epoll more scalable than poll

bert hubert wrote:

>That is a semantics change and not an API/ABI change.
>
An API/ABI encompasses the semantics of the calls. A change to the
semantics is a change to the API/ABI.

>I bet Davide knows best.
>
Nope, he doesn't.

>An easy solution is to have sys_epoll_ctl check if there is data
>ready and make sure there is an edge to report in that case to the next call
>of sys_epoll_ctl().
>
>
This is the very solution I am proposing.


2002-10-29 20:47:44

by Davide Libenzi

[permalink] [raw]
Subject: Re: and nicer too - Re: [PATCH] epoll more scalable than poll

On Tue, 29 Oct 2002, John Gardiner Myers wrote:

> >I bet Davide knows best.
> >
> Nope, he doesn't.

It is very easy for me to remain calm here. You're a funny guy. You've been
in computer science for many, many years and still you're not able to
understand how edge triggered events work. And look, this applies to every
field, from ee to cs. Book suggestions would be in order here, but since
I believe rummaging inside a technical library to be pretty fun, I'll leave
you this pleasure.



> >An easy solution is to have sys_epoll_ctl check if there is data
> >ready and make sure there is an edge to report in that case to the next call
> >of sys_epoll_ctl().
> >
> >
> This is the very solution I am proposing.

This is an example code snippet that can be used with the current API :

for(;;) {
	nfds = sys_epoll_wait(kdpfd, &pfds, -1);

	for(n = 0; n < nfds; ++n) {
		if((fd = pfds[n].fd) == s) {
			client = accept(s, (struct sockaddr *) &local, &addrlen);
			if(client < 0){
				perror("accept");
				continue;
			}
			if (sys_epoll_ctl(kdpfd, EP_CTL_ADD, client, POLLIN | POLLOUT) < 0) {
				fprintf(stderr, "sys_epoll set insertion error: fd=%d\n", client);
				return -1;
			}
			fd = client;
		}
		do_use_fd(fd);
	}
}

This is what will be used in case of your
failing-to-understand-edge-triggered-api method :

for(;;) {
	nfds = sys_epoll_wait(kdpfd, &pfds, -1);

	for(n = 0; n < nfds; ++n) {
		if((fd = pfds[n].fd) == s) {
			client = accept(s, (struct sockaddr *) &local, &addrlen);
			if(client < 0){
				perror("accept");
				continue;
			}
			if (sys_epoll_ctl(kdpfd, EP_CTL_ADD, client, POLLIN | POLLOUT) < 0) {
				fprintf(stderr, "sys_epoll set insertion error: fd=%d\n", client);
				return -1;
			}
		} else
			do_use_fd(fd);
	}
}

Why the heck ( and this for the 100th time ) do you want to go to wait for
an event on the newly born fd if :

1) On connect() you have the _full_ write I/O space available
2) On accept() it's very likely that you'll find something more than a SYN
in the first packet

Besides, the first code is even more cleaner and simmetric, while adopting
your half *ss solution might suggest to the user that he can go waiting for
events any time he wants. Like going to sleep on the wait queue of an IDE
disk w/out having issued any command. Now, to bring this to 101 level, consider :

1) "issuing a command to an IDE disk" == "using read/write until EAGAIN"

2) "adding yourself on the IDE disk wait queue" == "calling sys_epoll_wait()"


PS: since my time is not infinite, and since I'm working on the changes we
agreed on with Andrew, I would suggest you either take another look at the
code, suggesting new changes to us ( like you did yesterday ), or go
shopping for books.



- Davide



2002-10-29 21:09:36

by Davide Libenzi

[permalink] [raw]
Subject: Re: and nicer too - Re: [PATCH] epoll more scalable than poll

On Tue, 29 Oct 2002, bert hubert wrote:

> On Mon, Oct 28, 2002 at 04:57:12PM -0800, Davide Libenzi wrote:
>
> > The epoll interface has to be used with non-blocking fds. The EAGAIN
> > return code from read/write tells you that you can safely go to wait for
> > events for that fd because, by making the read/write return EAGAIN, you
> > consumed the whole I/O space for that fd. Consuming the whole I/O space
> > means that you brought the signal to zero ( talking in ee terms ), and a
> > following 0->1 transition will trigger the event. Where 1 means I/O space
> > available ...
>
> I tried to modify the mtasker webserver
> (http://ds9a.nl/mtasker/releases/mtasker-0.2.tar.gz) to work with epoll
> instead of select and, well, this is awkward.
>
> The right way to use epoll is (schematically);
>
> int clientfd=accept();
> setnonblocking(clientfd);
>
> epfd=epoll_create();
> epoll_ctl(epfd,EP_CTL_ADD, clientfd, POLLIN);
>
> for(;;) {
> if(read(clientfd)<0 && errno==EAGAIN) {
> epoll_wait(epfd);
> continue;
> }
> epoll_ctl(epfd,EP_CTL_DEL, clientfd); // remove again
> break;
> }
>
> This requires having the fd in epoll before trying to read, which means a
> weird interface where you have to register your interest, try to read, and
> unregister your interest in case it succeeded.
>
> This means two epoll syscalls per read.

Bert, you don't need to do that. Like I wrote in the latest man pages, you
can easily add the fd inside the interest set with POLLIN | POLLOUT when
the fd is born ( accept/connect ) and leave it there for its whole life. The
fact that the API is edge triggered guarantees that you won't receive
any events when the fd is not used. I personally tried to measure the
number of stale events and it is basically ( better, definitely ) negligible.
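
Schematically, the register-once pattern looks like this ( just a
sketch following the sys_epoll calls used in this thread;
setnonblocking() and handle_io() are placeholder names, error checks
omitted ) :

	/* when the fd is born, register it once and forget about it */
	client = accept(s, (struct sockaddr *) &local, &addrlen);
	setnonblocking(client);
	epoll_ctl(kdpfd, EP_CTL_ADD, client, POLLIN | POLLOUT);

	/* the main loop just harvests events; handle_io() does
	 * read()/write() until EAGAIN, per the edge triggered rule */
	for (;;) {
		nfds = sys_epoll_wait(kdpfd, &pfds, -1);
		for (n = 0; n < nfds; ++n)
			handle_io(pfds[n].fd);
	}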


> This is all solved if epoll_ctl() creates an edge if it finds that the poll
> condition is met at insert time.
>
> The way it works now is way more awkward than using regular poll(), and the
> way it works now is very easy to get wrong because of this awkwardness. Even
> semi-correct code which zeroes the poll state before calling epoll is wrong:
>
> for(;;) {
> if(read(clientfd)<0 && errno==EAGAIN) {
> waitOn(clientfd); /* function which registers our
> interest with a central
> dispatcher and waits */
> continue;
> }
> break;
> }
>
> Code like this would appear in many userspace threading abstractions, like
> GNU Pth or mtasker. Instead, we need:
>
> for(;;) {
> registerReadInterest(clientfd);
> if(read(clientfd)<0 && errno==EAGAIN) {
> waitOn(clientfd); /* function which waits for our
> interest to be satisfied */
> continue;
> }
> deleteReadInterest(clientfd);
> break;
> }
>
> If epoll_ctl made epoll_wait report an edge in case it finds that there is
> already data, all this would be way simpler, allowing:
>
> waitOn(clientfd); /* function which registers interest and waits for
> an edge to appear */
> read(clientfd);

You can do all this easily, especially with an API that has a core dispatch
loop like GNU Pth. The example http server used to test it is an example
of how to use it ( with coroutines ).


Hanna, is it possible for you guys to cleanup ephttpd and make it an
example program for sys_epoll ?



- Davide



2002-10-29 21:24:35

by Hanna Linder

[permalink] [raw]
Subject: Re: and nicer too - Re: [PATCH] epoll more scalable than poll

--On Tuesday, October 29, 2002 13:25:24 -0800 Davide Libenzi <[email protected]> wrote:

> Hanna, is it possible for you guys to cleanup ephttpd and make it an
> example program for sys_epoll ?

You mean for the man page? Sure, I will look into it.

Hanna

2002-10-29 21:25:43

by Davide Libenzi

[permalink] [raw]
Subject: Re: and nicer too - Re: [PATCH] epoll more scalable than poll

On Tue, 29 Oct 2002, Hanna Linder wrote:

> --On Tuesday, October 29, 2002 13:25:24 -0800 Davide Libenzi <[email protected]> wrote:
>
> > Hanna, is it possible for you guys to cleanup ephttpd and make it an
> > example program for sys_epoll ?
>
> You mean for the man page? Sure, I will look into it.

Not only for the man page; more like a reference for possible usage ...



- Davide


2002-10-29 22:48:39

by Martin Waitz

[permalink] [raw]
Subject: Re: and nicer too - Re: [PATCH] epoll more scalable than poll

hi :)

On Tue, Oct 29, 2002 at 04:19:23PM +0100, bert hubert wrote:
> It is edge based instead of level based -
seems i have to look closer at your papers/code

what is the motivation to make it edge based?

> > the unified event mechanism introduced in bsd is a good example imho.
> > we should build something that is similar useful for applications.
> Sounds 2.7-ish.
definitely.

--
CU, / Friedrich-Alexander University Erlangen, Germany
Martin Waitz // [Tali on IRCnet] [tali.home.pages.de] _________
______________/// - - - - - - - - - - - - - - - - - - - - ///
this is a manually generated mail; it contains //
typos and is valid even without capital letters. /
-
Those who would give up essential liberty to purchase a little
temporary safety deserve neither liberty nor safety.
Benjamin Franklin (1706 - 1790)



2002-10-29 23:07:18

by Hanna Linder

[permalink] [raw]
Subject: Re: and nicer too - Re: [PATCH] epoll more scalable than poll

--On Tuesday, October 29, 2002 13:41:34 -0800 Davide Libenzi <[email protected]> wrote:
>>
>> > Hanna, is it possible for you guys to cleanup ephttpd and make it an
>> > example program for sys_epoll ?
>>
>> You mean for the man page? Sure, I will look into it.
>
> Not only for the man page; more like a reference for possible usage ...


Would it make sense to put sys_epoll examples in the Documentation
directory?

Thanks.

Hanna

2002-10-29 23:12:15

by Randy.Dunlap

[permalink] [raw]
Subject: Re: [Lse-tech] Re: and nicer too - Re: [PATCH] epoll more scalable than poll

On Tue, 29 Oct 2002, Hanna Linder wrote:

| --On Tuesday, October 29, 2002 13:41:34 -0800 Davide Libenzi <[email protected]> wrote:
| >>
| >> > Hanna, is it possible for you guys to cleanup ephttpd and make it an
| >> > example program for sys_epoll ?
| >>
| >> You mean for the man page? Sure, I will look into it.
| >
| > Not only for the man page; more like a reference for possible usage ...
|
|
| Would it make sense to put sys_epoll examples in the Documentation
| directory?

You mean in the linux/Documentation/ directory?
No, please don't litter it up like that.

--
~Randy

2002-10-29 23:09:51

by Davide Libenzi

[permalink] [raw]
Subject: Re: and nicer too - Re: [PATCH] epoll more scalable than poll

On Tue, 29 Oct 2002, Hanna Linder wrote:

> --On Tuesday, October 29, 2002 13:41:34 -0800 Davide Libenzi <[email protected]> wrote:
> >>
> >> > Hanna, is it possible for you guys to cleanup ephttpd and make it an
> >> > example program for sys_epoll ?
> >>
> >> You mean for the man page? Sure, I will look into it.
> >
> > Not only for the man page; more like a reference for possible usage ...
>
>
> Would it make sense to put sys_epoll examples in the Documentation
> directory?

I don't really know ... epoll does not have kernel services. IMHO this is
more user space documentation.



- Davide


2002-10-30 00:00:46

by Davide Libenzi

[permalink] [raw]
Subject: Re: [PATCH] epoll more scalable than poll

On Tue, 29 Oct 2002, Jamie Lokier wrote:

> Davide Libenzi wrote:
> > IMHO sys_epoll is going to be a replacement for rt-signals, because it
> > scales better, it collapses events and does not have the overflowing queue
> > problem.
>
> Scalability is also solved by the signal-per-fd patch, as you know.
> The main advantage of epoll is that it's lighter weight than rt-signals.
>
> (IMHO signal-per-fd really ought to be included in 2.6 _anyway_, regardless
> of any better mechanism for reading events.)

It scales pretty well, yes. You have to be careful to build your kernel
with a huge queue to avoid the SIGIO fallback, and that makes you pay
something. Also, it does not support pipes.


> > The sys_epoll interface was coded to use the existing infrastructure w/out
> > adding any legacy code added to suite the implementation. Basically,
> > besides the few lines added to fs/pipe.c to support pipes ( rt-signal did
> > not support them ), the hook lays inside sk_wake_async().
>
> I agree that was the way to do it for 2.4 and earlier - you have to
> work with a range of kernels, and minimum impact.
>
> But now in 2.5 it's appropriate to implement whatever's _right_.

Yes Jamie, you can add sys_epoll support for other devices, but IMHO the
only devices where you're going to have scalability problems due to a huge
number of handled fds are 1) sockets 2) pipes. I feel that devices that do
not go over 100-500 fds can be easily handled with the fully supportive
poll(). The fact that you can drop a sys_epoll fd inside a poll() set
guarantees 1) scalability, due to the stocking of sockets+pipes inside the
sys_epoll fd 2) compatibility with the full set of devices. All this w/out
screwing up the kernel code much. What do you think ?
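
To make that concrete, a minimal sketch of the mixed setup ( assuming
the sys_epoll fd reports POLLIN via poll() when events are pending, as
described above; ttyfd is just a made-up example of a low-volume fd ) :

	struct pollfd pfd[2];

	pfd[0].fd = kdpfd;	/* sys_epoll fd stocking sockets+pipes */
	pfd[0].events = POLLIN;
	pfd[1].fd = ttyfd;	/* low-volume device, plain poll() is fine */
	pfd[1].events = POLLIN;

	if (poll(pfd, 2, -1) > 0 && (pfd[0].revents & POLLIN))
		nfds = sys_epoll_wait(kdpfd, &pfds, 0);	/* harvest, no blocking */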



- Davide


2002-10-30 00:20:50

by Jamie Lokier

[permalink] [raw]
Subject: Re: and nicer too - Re: [PATCH] epoll more scalable than poll

> 1) "issuing a command to an IDE disk" == "using read/write until EAGAIN"
> 2) "adding yourself on the IDE disk wait queue" == "calling sys_epoll_wait()"

That is quite a good analogy. epoll is like a waitqueue - which is
also like a futex. To use a waitqueue properly you have to do these
things in the order shown:

1. Set the task state to stopped.
2. Register yourself on the waitqueue.
3. Check the condition.
4. If condition is not met, schedule.

With epoll it is very similar. To wait for a condition on a file
descriptor, such as readability, you must do these things in the order
shown:

1. Register your interest using epoll_ctl.
2. Check the condition by actually calling read().
3. If the condition is not met (i.e. read() returned EAGAIN),
call epoll_wait (i.e. equivalent to schedule).

With epoll, you can optimise by registering interest just once. In
other words, steps 2 and 3 may be repeated without repeating step 1.
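
In code, the sequence is roughly this ( a sketch only, reusing the
syscall names from this thread; buf and the other declarations are
elided ) :

	epoll_ctl(epfd, EP_CTL_ADD, fd, POLLIN);	/* step 1, done once */

	for (;;) {
		n = read(fd, buf, sizeof(buf));		/* step 2: check the condition */
		if (n < 0 && errno == EAGAIN) {
			nfds = sys_epoll_wait(epfd, &pfds, -1);	/* step 3: "schedule" */
			continue;
		}
		/* consume the n bytes read */
	}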

And if you are concerned about starvation -- that is, one of your file
descriptors always has new data so others don't get a chance to be
serviced -- don't be. You don't have to completely read one fd until
you see EAGAIN. All that matters is that until you see the EAGAIN,
your user space data structure should have a flag that says the fd is
still readable, so another epoll event is not expected or required for
that fd.

-- Jamie

2002-10-30 01:54:17

by Davide Libenzi

[permalink] [raw]
Subject: Re: and nicer too - Re: [PATCH] epoll more scalable than poll

On Wed, 30 Oct 2002, Jamie Lokier wrote:

> > 1) "issuing a command to an IDE disk" == "using read/write until EAGAIN"
> > 2) "adding yourself on the IDE disk wait queue" == "calling sys_epoll_wait()"
>
> That is quite a good analogy. epoll is like a waitqueue - which is
> also like a futex. To use a waitqueue properly you have to do these
> things in the order shown:
>
> 1. Set the task state to stopped.
> 2. Register yourself on the waitqueue.
> 3. Check the condition.
> 4. If condition is not met, schedule.
>
> With epoll it is very similar. To wait for a condition on a file
> descriptor, such as readability, you must do these things in the order
> shown:
>
> 1. Register your interest using epoll_ctl.
> 2. Check the condition by actually calling read().
> 3. If the condition is not met (i.e. read() returned EAGAIN),
> call epoll_wait (i.e. equivalent to schedule).
>
> With epoll, you can optimise by registering interest just once. In
> other words, steps 2 and 3 may be repeated without repeating step 1.
>
> And if you are concerned about starvation -- that is, one of your file
> descriptors always has new data so others don't get a chance to be
> serviced -- don't be. You don't have to completely read one fd until
> you see EAGAIN. All that matters is that until you see the EAGAIN,
> your user space data structure should have a flag that says the fd is
> still readable, so another epoll event is not expected or required for
> that fd.

Jamie,

can I buy you a beer ? Your comment describes the API perfectly. You can
replace read() with write() in your description, and the whole thing is
still true.



- Davide


2002-10-30 02:08:35

by Davide Libenzi

[permalink] [raw]
Subject: Re: and nicer too - Re: [PATCH] epoll more scalable than poll

On Tue, 29 Oct 2002, Martin Waitz wrote:

> hi :)
>
> On Tue, Oct 29, 2002 at 04:19:23PM +0100, bert hubert wrote:
> > It is edge based instead of level based -
> seems i have to look closer at your papers/code
>
> what is the motivation to make it edge based?

You have two ways to know if "something" changed. You call everyone each
time and you ask him if he has changed, or you call everyone one time by
saying "call me when you're changed". It's behind the "call me when you're
changed" phrase that lies the concept of edge triggered APIs. So, pretty
evidently, the answer to your question is : efficiency.




- Davide


2002-10-30 02:16:16

by John Myers

[permalink] [raw]
Subject: Re: and nicer too - Re: [PATCH] epoll more scalable than poll

Davide Libenzi wrote:

>You've been in computer science for many, many years and still you're not able to
>understand how edge triggered events work.
>
Failure to agree does not imply failure to understand. I understand the
model you want to apply to this problem, I do not agree that it is the
best model to apply to this problem.

>Why the heck ( and this for the 100th time ) do you want to go to wait for
>an event on the newly born fd if :
>
>1) On connect() you have the _full_ write I/O space available
>2) On accept() it's very likely that you'll find something more than a SYN
> in the first packet
>
>Besides, the first code is even more cleaner and simmetric, while adopting
>your half *ss solution might suggest to the user that he can go waiting for
>events any time he wants.
>
The first code is hardly cleaner and is definitely not symmetric--the
way the accept code has to set up to fall through the do_use_fd() code
is subtle. In the first code, the accept segment cannot be cleanly
pulled into a callback:

for (;;) {
	nfds = sys_epoll_wait(kdpfd, &pfds, -1);
	for (n = 0; n < nfds; ++n) {
		(cb[pfds[n].fd])(pfds[n].fd);
	}
}


Also, your first code does not fit your "edge triggered" model--the code
for handling 's' does not drain its input. By the time you call
accept(), there could be multiple connections ready to be accepted.

Your connect() argument is not applicable to "server sends first"
protocols. I suspect you are being overly optimistic about the
likelihood of getting data with SYN, but whatever. The argument is
basically that not delivering an event upon registration (and thus
having the event be implicit) improves performance because the socket is
going to be ready with sufficiently high probability. I would counter
that the cost of explicitly delivering such an event is miniscule
compared to the rest of the cost of connection setup and teardown--the
optimization is not worthwhile.

>Like going to sleep on the wait queue of an IDE
>disk w/out having issued any command.
>
The key difference between this interface and wait queues is that with
wait queues it is not technically feasible to both register interest and
test the condition in a single, atomic operation. epoll does not have
this technical limitation, so it can provide a better interface.
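
Under such semantics ( hypothetical here -- not what the current patch
does ), registration itself would stand in for the initial speculative
read :

	/* would atomically test the condition and queue an event
	 * if the fd is already readable */
	epoll_ctl(epfd, EP_CTL_ADD, fd, POLLIN);

	/* so the caller could go straight to waiting, with no
	 * redundant read() in between */
	nfds = sys_epoll_wait(epfd, &pfds, -1);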

>PS: since my time is not infinite, and since I'm working on the changes we
>agreed with Andrew I would suggest you either to take another look at the
>code suggesting us new changes ( like you did yesterday ) or to go
>shopping for books.
>
I am uncomfortable with the way the epoll code adds its own set of
notification hooks into the socket and pipe code. Much better would be
to extend the existing set of notification hooks, like the aio poll code
does. That would reduce the risk of kernel bugs where some subsystem
delivers an event to some but not all types of poll notification
hooks, and it would minimize the cost of the epoll patch when epoll is
not being used.



2002-10-30 03:37:06

by Davide Libenzi

[permalink] [raw]
Subject: Re: and nicer too - Re: [PATCH] epoll more scalable than poll

On Tue, 29 Oct 2002, John Gardiner Myers wrote:

> Failure to agree does not imply failure to understand. I understand the
> model you want to apply to this problem, I do not agree that it is the
> best model to apply to this problem.

John, your first post about epoll was "the interface has a bug, please
do not merge it". Now, either you have a strange way of communicating
disagreement or it is something more than that. Or maybe you just wanted to
blindly kill the interface with your comments because you're totally
committed to another one currently. And the existence of an interface that
might work as well, or maybe better in some cases, could create some
problem for you that I'm unaware of.



> >Why the heck ( and this for the 100th time ) do you want to go to wait for
> >an event on the newly born fd if :
> >
> >1) On connect() you have the _full_ write I/O space available
> >2) On accept() it's very likely that you'll find something more than a SYN
> > in the first packet
> >
> >Besides, the first code is even more cleaner and simmetric, while adopting
> >your half *ss solution might suggest to the user that he can go waiting for
> >events any time he wants.
> >
> The first code is hardly cleaner and is definitely not symmetric--the
> way the accept code has to set up to fall through the do_use_fd() code
> is subtle. In the first code, the accept segment cannot be cleanly
> pulled into a callback:
>
> for (;;) {
> 	nfds = sys_epoll_wait(kdpfd, &pfds, -1);
> 	for (n = 0; n < nfds; ++n) {
> 		(cb[pfds[n].fd])(pfds[n].fd);
> 	}
> }

Sorry, what prevents you from coding that ? If, instead of ranting
because epoll does not fit your personal idea of event notification, you took
a look at the example http server used for the test ( coroutine based ),
you'd see that it does exactly that. Ok, it's a mess because it supports 5
interfaces, all #ifdef'ed, but the concept is there.



> Also, your first code does not fit your "edge triggered" model--the code
> for handling 's' does not drain its input. By the time you call
> accept(), there could be multiple connections ready to be accepted.

I really don't believe this. Are you just trolling or what ? It is clear
that your acceptor routine has to do a little more work than that in a
real program. Again, looking at the example http server might help you.
This is what the acceptor coroutine does in such a _trivial_ http server :

static void *dph_acceptor(void *data)
{
	struct dph_conn *conn = (struct dph_conn *) data;
	struct sockaddr_in addr;
	int sfd, addrlen = sizeof(addr);

	while ((sfd = dph_accept(conn, (struct sockaddr *) &addr, &addrlen)) != -1) {
		if (dph_new_conn(sfd, dph_httpd) < 0) {
			dph_close(sfd);
		}
	}
	return data;
}

and this is dph_accept :

int dph_accept(struct dph_conn *conn, struct sockaddr *addr, int *addrlen)
{
	int sfd, flags = 1;

	while ((sfd = accept(conn->sfd, addr, (socklen_t *) addrlen)) < 0) {
		if (errno == EINTR)
			continue;
		if (errno != EAGAIN && errno != EWOULDBLOCK)
			return -1;
		conn->events = POLLIN;
		co_resume(conn);
	}
	if (ioctl(sfd, FIONBIO, &flags) &&
	    ((flags = fcntl(sfd, F_GETFL, 0)) < 0 ||
	     fcntl(sfd, F_SETFL, flags | O_NONBLOCK) < 0)) {
		close(sfd);
		return -1;
	}
	return sfd;
}

and this is dph_new_conn :

static int dph_new_conn(int sfd, void *func)
{
	struct dph_conn *conn = (struct dph_conn *) malloc(sizeof(struct dph_conn));
	struct pollfd pfd;

	if (!conn)
		return -1;

	DBL_INIT_LIST_HEAD(&conn->lnk);
	conn->sfd = sfd;
	conn->events = POLLIN | POLLOUT;
	conn->revents = 0;
	if (!(conn->co = co_create(func, NULL, stksize))) {
		free(conn);
		return -1;
	}

	DBL_LIST_ADDT(&conn->lnk, &chash[sfd % chash_size]);

	if (epoll_ctl(kdpfd, EP_CTL_ADD, sfd, POLLIN | POLLOUT) < 0) {
		DBL_LIST_DEL(&conn->lnk);
		co_delete(conn->co);
		free(conn);
		return -1;
	}

	co_call(conn->co, conn);

	return 0;
}

Oh ... I forgot the scheduler :

static int dph_scheduler(int loop, unsigned int timeout)
{
	int ii, nfds;
	struct dph_conn *conn;
	struct pollfd const *pfds;

	do {
		nfds = sys_epoll_wait(kdpfd, &pfds, timeout * 1000);

		for (ii = 0; ii < nfds; ii++, pfds++) {
			if ((conn = dph_find(pfds->fd))) {
				conn->revents = pfds->revents;

				if (conn->revents & conn->events)
					co_call(conn->co, conn);
			}
		}
	} while (loop);
	return 0;
}

And just to make it complete, those are read/write :

int dph_read(struct dph_conn *conn, char *buf, int nbyte)
{
	int n;

	while ((n = read(conn->sfd, buf, nbyte)) < 0) {
		if (errno == EINTR)
			continue;
		if (errno != EAGAIN && errno != EWOULDBLOCK)
			return -1;
		conn->events = POLLIN;
		co_resume(conn);
	}
	return n;
}

int dph_write(struct dph_conn *conn, char const *buf, int nbyte)
{
	int n;

	while ((n = write(conn->sfd, buf, nbyte)) < 0) {
		if (errno == EINTR)
			continue;
		if (errno != EAGAIN && errno != EWOULDBLOCK)
			return -1;
		conn->events = POLLOUT;
		co_resume(conn);
	}
	return n;
}


The functions co_resume() and co_call() are the coroutine suspend and
call. The one I'm using is this :

http://www.goron.de/~froese/coro/

but coroutine implementation is trivial. You can change the same
implementation to use an I/O driven state machine and the result does not
change.



> I am uncomfortable with the way the epoll code adds its own set of
> notification hooks into the socket and pipe code. Much better would be
> to extend the existing set of notification hooks, like the aio poll code
> does. That would reduce the risk of kernel bugs where some subsystem
> fails to deliver an event to one but not all types of poll notification
> hooks and it would minimize the cost of the epoll patch when epoll is
> not being used.

Doh ! John, did you actually read the code ? Could you compare the AIO level
of intrusion inside the kernel code with the epoll one ? It fits _exactly_
the rt-signal hooks. One of the design goals for me was to add almost
nothing on the main path. You can look here for a quick comparison between
aio poll and epoll, for a test where event delivery efficiency does matter
( pipetest ) :

http://lse.sourceforge.net/epoll/index.html

Now, I don't believe that a real world app will exchange 300000 tokens per
second through a pipe, but this helps you understand the efficiency of
the epoll event notification subsystem.




- Davide




2002-10-30 05:35:15

by Davide Libenzi

[permalink] [raw]
Subject: Re: and nicer too - Re: [PATCH] epoll more scalable than poll

On Wed, 30 Oct 2002, Jamie Lokier wrote:

> > 1) "issuing a command to an IDE disk" == "using read/write until EAGAIN"
> > 2) "adding yourself on the IDE disk wait queue" == "calling sys_epoll_wait()"
>
> That is quite a good analogy. epoll is like a waitqueue - which is
> also like a futex. To use a waitqueue properly you have to do these
> things in the order shown:
>
> 1. Set the task state to stopped.
> 2. Register yourself on the waitqueue.
> 3. Check the condition.
> 4. If condition is not met, schedule.
>
> With epoll it is very similar. To wait for a condition on a file
> descriptor, such as readability, you must do these things in the order
> shown:
>
> 1. Register your interest using epoll_ctl.
> 2. Check the condition by actually calling read().
> 3. If the condition is not met (i.e. read() returned EAGAIN),
> call epoll_wait (i.e. equivalent to schedule).
>
> With epoll, you can optimise by registering interest just once. In
> other words, steps 2 and 3 may be repeated without repeating step 1.
>
> And if you are concerned about starvation -- that is, one of your file
> descriptors always has new data so others don't get a chance to be
> serviced -- don't be. You don't have to completely read one fd until
> you see EAGAIN. All that matters is that until you see the EAGAIN,
> your user space data structure should have a flag that says the fd is
> still readable, so another epoll event is not expected or required for
> that fd.

I would say more about this. I did not understand the interface until I
read this one :-)
Hanna, is it possible for you to make Jamie's description fit the epoll.2
man page ?
If you can do that I'll send you the source of the man page ...



- Davide



2002-10-30 18:52:57

by Zach Brown

[permalink] [raw]
Subject: Re: and nicer too - Re: [PATCH] epoll more scalable than poll

> It is very easy for me to remain calm here. You're a funny guy. You've been
> in computer science for many, many years and still you're not able to
> understand how edge triggered events work. And look, this applies to every
> field, from ee to cs. Book suggestions would be in order here, but since
> I believe rummaging inside a technical library to be pretty fun, I'll leave
> you this pleasure.

http://www.infidels.org/news/atheism/logic.html#hominem

I know it's hard, but can we try to avoid the most pathetic pitfalls of
arguing over email?

- z

2002-10-30 19:09:30

by Davide Libenzi

[permalink] [raw]
Subject: Re: and nicer too - Re: [PATCH] epoll more scalable than poll

On Wed, 30 Oct 2002, Zach Brown wrote:

> > It is very easy for me to remain calm here. You're a funny guy. You've been
> > in computer science for many, many years and still you're not able to
> > understand how edge triggered events work. And look, this applies to every
> > field, from ee to cs. Book suggestions would be in order here, but since
> > I believe rummaging inside a technical library to be pretty fun, I'll leave
> > you this pleasure.
>
> http://www.infidels.org/news/atheism/logic.html#hominem
>
> I know its hard, but can we try and avoid the most pathetic pitfalls of
> arguing over email?

Zach, on one side it's very easy for me: I just won't reply. This should
cut this very short. Looking at the whole thread you'll find that he
wanted to pass off his disagreement with the interface, which is a pretty
normal and legitimate thing, as a bug of the interface. Now, while
disagreement implies a very subjective way of seeing a thing, a bug means a
very objective thing. That is, "it does not work". Now, when someone states
something that is proven to be false ( it's not even an RTQA, it's a
NOT-A-BUG ), and when this someone has tried in every way to kill the
interface ( for reasons that I'm not aware of ), and has also implied that
"I do not understand", well, I've been educated to respond. Look, I'm a
very simple guy. You don't touch me and I'll be transparent like a ghost
for you. You touch me personally, and I retaliate.



- Davide



2002-10-30 19:32:03

by Martin Waitz

[permalink] [raw]
Subject: Re: and nicer too - Re: [PATCH] epoll more scalable than poll

hi :)

> > what is the motivation to make it edge based?

On Tue, Oct 29, 2002 at 06:24:27PM -0800, Davide Libenzi wrote:
> You have two ways to know if "something" changed. You call everyone each
> time and you ask him if he has changed, or you call everyone one time by
> saying "call me when you're changed".
well, you don't say 'call me when you're changed' but
'i'm interested in your status, please be prepared to report if you
have changed' when calling epoll_ctl.

> It's behind the "call me when you're changed" phrase that lies the
> concept of edge triggered APIs.
in most situations, you are not really interested in 'has it changed?'
but in 'what has it changed to?'.
this is the difference between edge- and level-triggered notification.

e.g. the application wants to know
'from which fds can i read without blocking' and not which fds
happened to change their block-state
(perhaps there is still data available after the last read, in
which case the application would like to be notified about this
situation)

> So, pretty evidently, the answer to your question is : efficiency.
but the api should be useful, too...


is it possible that http://www.xmailserver.org is down atm?
i couldn't get as much docu about epoll as i wanted to,
so please correct me if my above view about epoll is incorrect

and yes, i haven't looked at any code yet,
i just like the kevent docu better than the epoll docu... ;)
and yes, i would like to port kevent to linux,
but i don't have any time to do this in the next months... :(

--
CU, / Friedrich-Alexander University Erlangen, Germany
Martin Waitz // [Tali on IRCnet] [tali.home.pages.de] _________
______________/// - - - - - - - - - - - - - - - - - - - - ///
this is a manually generated mail; it contains //
typos and is valid even without capital letters. /
-
Those who would give up essential liberty to purchase a little
temporary safety deserve neither liberty nor safety.
Benjamin Franklin (1706 - 1790)



2002-10-30 22:56:47

by Jamie Lokier

[permalink] [raw]
Subject: Re: and nicer too - Re: [PATCH] epoll more scalable than poll

John Gardiner Myers wrote:
> I am uncomfortable with the way the epoll code adds its own set of
> notification hooks into the socket and pipe code. Much better would be
> to extend the existing set of notification hooks, like the aio poll code
> does.

Fwiw, I agree with the above (I'm having a think about it).

I also agree with criticisms that epoll should test and send an event
on registration, but only _if_ the test is cheap. Nothing to do with
correctness (I like the edge semantics as they are), but because
delivering one event is so infinitesimally low impact with epoll that
it's preferable to doing a single speculative read/write/whatever.

Regarding the effectiveness of the optimisation, I'd guess that quite
a lot of incoming connections do not come with initial data in the
short scheduling time after a SYN (unless it's on a LAN). I don't
know this for sure though.

-- Jamie

2002-10-30 23:37:49

by Davide Libenzi

[permalink] [raw]
Subject: Re: and nicer too - Re: [PATCH] epoll more scalable than poll

On Wed, 30 Oct 2002, Jamie Lokier wrote:

> John Gardiner Myers wrote:
> > I am uncomfortable with the way the epoll code adds its own set of
> > notification hooks into the socket and pipe code. Much better would be
> > to extend the existing set of notification hooks, like the aio poll code
> > does.
>
> Fwiw, I agree with the above (I'm having a think about it).
>
> I also agree with criticisms that epoll should test and send an event
> on registration, but only _if_ the test is cheap. Nothing to do with
> correctness (I like the edge semantics as they are), but because
> delivering one event is so infinitesimally low impact with epoll that
> it's preferable to doing a single speculative read/write/whatever.
>
> Regarding the effectiveness of the optimisation, I'd guess that quite
> a lot of incoming connections do not come with initial data in the
> short scheduling time after a SYN (unless it's on a LAN). I don't
> know this for sure though.

Ok Jamie, try to explain to me what kind of improvement this first drop will
bring. And also, how such a first drop would not bring "confusion" to the
user, letting him think that he can go to sleep waiting for events w/out
having first received EAGAIN. Isn't it better to say "you wait for events
after EAGAIN" instead of "you wait for events after EAGAIN, except after
accept/connect" ?
The cost of the test will be basically the cost of a ->poll(), that is
exactly the same cost as the very first read()/write() that you would do
by following the current API rule.



- Davide


2002-10-31 00:46:53

by Jamie Lokier

[permalink] [raw]
Subject: Re: and nicer too - Re: [PATCH] epoll more scalable than poll

Davide Libenzi wrote:
> The cost of the test will be basically the cost of a ->poll(), that is
> exactly the same cost as the very first read()/write() that you would do
> by following the current API rule.

No, the cost of ->poll() is somewhat less than read()/write(), because
the latter requires a system call and the former does not. System
calls are still nowhere near as cheap as function calls.

> > I also agree with criticisms that epoll should test and send an event
> > on registration, but only _if_ the test is cheap. Nothing to do with
> > correctness (I like the edge semantics as they are), but because
> > delivering one event is so infinitesimally low impact with epoll that
> > it's preferable to doing a single speculative read/write/whatever.
> >
> > Regarding the effectiveness of the optimisation, I'd guess that quite
> > a lot of incoming connections do not come with initial data in the
> > short scheduling time after a SYN (unless it's on a LAN). I don't
> > know this for sure though.
>
> Ok Jamie, try to explain to me what kind of improvement this first drop will
> bring.

I have thought about an optimal server state machine. (I presume from
your carefully thought out implementation that you have too).

In a state machine, each fd has some user-space state. I've already
hinted at how this is used to prevent starvation/livelock on a busy
server, and make service fairer.

I would take that further and _defer_ the epoll_ctl() to register an
fd until the first time I have seen EAGAIN from that fd. This is
because in some cases, epoll_ctl() would not be needed at all - so we
can remove that overhead, and the system call overhead.

Now you would force me to call read() or write() after the
epoll_ctl(), even though I _know_ the result is always going to be
EAGAIN. You're forcing me to make an always redundant system call.
But I can't omit it, because that's a race condition.

So, I've thought about the _optimal_ state machine and it's clear that
epoll should test the condition on fd registration - for best
performance. (Nothing to do with scalability, just raw performance).
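
For concreteness, the deferred registration I have in mind looks
roughly like this ( a sketch; conn->registered is per-fd user-space
state, not part of the API ) :

	n = read(fd, buf, sizeof(buf));
	if (n < 0 && errno == EAGAIN) {
		if (!conn->registered) {
			epoll_ctl(epfd, EP_CTL_ADD, fd, POLLIN);
			conn->registered = 1;
			/* current rule: I must read() again here, even
			 * though it will almost always return EAGAIN;
			 * the proposed rule would queue the event for
			 * me if data raced in before the epoll_ctl() */
		}
		nfds = sys_epoll_wait(epfd, &pfds, -1);
	}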

> And also, how such a first drop would not bring "confusion" to the
> user, letting him think that he can go to sleep waiting for events w/out
> having first received EAGAIN. Isn't it better to say "you wait for events
> after EAGAIN" instead of "you wait for events after EAGAIN, except after
> accept/connect" ?

Be careful with your rules. epoll should work with blocking fds too,
if you understand the rules well enough, and fd registration doesn't
have to be done at the same time as accept/connect/pipe.

Your current rule in practice is:

an event is generated on every "would-block" -> "ready" transition.
after fd registration, you must treat the fd as "ready".

The proposed rule is this:

an event is generated on every "would-block" -> "ready" transition.
after fd registration, you may treat the fd as in any state you like.

The proposed rule is better because it permits better optimisations in
user space, as explained earlier. (If you _really_ want to avoid the
call to ->poll() when user space doesn't care, make that a flag
argument to epoll_ctl()).

enjoy :)
-- Jamie

2002-10-31 02:00:50

by John Myers

[permalink] [raw]
Subject: Re: and nicer too - Re: [PATCH] epoll more scalable than poll

Davide Libenzi wrote:

>John, your first post about epoll was "the interface has a bug, please
>do not merge it".
>
My first post about epoll pointed out how it was designed for single
threaded callers and concluded:

I certainly hope /dev/epoll itself doesn't get accepted into the kernel;
the interface is error prone. Registering interest in a condition when
the condition is already true should immediately generate an event; the
epoll interface did not do that last time I saw it discussed. This
deficiency in the interface requires callers to include more complex
workaround code and is likely to result in subtle, hard to diagnose bugs.

I did not say "the interface has a bug", I said that the interface is
error prone. This is a deficiency that should be fixed before the
interface is added to the kernel.

>Sorry, what prevents you from coding that ? If, instead of ranting
>because epoll does not fit your personal idea of event notification, you took
>a look at the example http server used for the test ( coroutine based ),
>you'd see that it does exactly that.
>
You posted code which you claimed was "even more cleaner and simmetric"
(sic) because it fell through to the do_use_fd() code instead of putting
the do_use_fd() code in an else clause. A callback scheme is akin to
the if/else structure. To adapt the first code to a callback scheme,
the accept callback has to somehow arrange to call the do_use_fd()
callback before returning to the event loop. This requirement is subtle
and asymmetric.

>I really don't believe this. Are you just trolling or what ? It is clear
>that your acceptor routine has to do a little more work than that in a
>real program.
>
Basically, you spawn off another coroutine. That complicates the "fall
through to do_use_fd()" logic in the first code by requiring an external
facility not required by the second code. The second code could simply
have the accept code loop until EAGAIN.

>Doh ! John, did you actually read the code ?
>
Yes, indeed.

>Could you compare the AIO level
>of intrusion inside the kernel code with the epoll one ?
>
Aio poll extends the existing set of poll notification hooks with a
callback mechanism. It then plugs into this callback mechanism in order
to deliver events. The end result is that the same notification hooks
are used for classic poll and aio poll. When aio poll is not being
used, there is no additional performance penalty other than a slightly
larger poll_table_entry and poll_table_page.

Epoll creates a new callback mechanism and plugs into this new callback
mechanism. It adds a new set of notification hooks which feed into this
new callback mechanism. The end result is that there is one set of
notification hooks for classic poll and another set for epoll. When
epoll is not being used, the poll and socket code makes an additional
set of checks to see that nobody has registered interest through the new
callback mechanism.

> It fits _exactly_
>the rt-signal hooks. One of the design goals for me was to add almost
>nothing on the main path. You can look here for a quick comparison between
>aio poll and epoll, for a test where event delivery efficiency does matter
>( pipetest ) :
>
This is a comparison of the cost of using epoll to the cost of using aio
in one particular situation. It is irrelevant to the point I was making.

>Now, I don't believe that a real world app will exchange 300000 tokens per
>second through a pipe, but this helps you understand the efficiency of
>the epoll event notification subsystem.
>
>
My understanding of the efficiency of the epoll event notification
subsystem is:

1) Unlike the current aio poll, it amortizes the cost of interest
registration/deregistration across multiple events for a given connection.

2) It declares multithreaded use out of scope, making optimizations that
are only appropriate for use by single threaded callers.


2002-10-31 03:05:33

by Davide Libenzi

[permalink] [raw]
Subject: Re: and nicer too - Re: [PATCH] epoll more scalable than poll

On Wed, 30 Oct 2002, John Gardiner Myers wrote:

> You posted code which you claimed was "even more cleaner and simmetric"
> (sic) because it fell through to the do_use_fd() code instead of putting
> the do_use_fd() code in an else clause. A callback scheme is akin to
> the if/else structure. To adapt the first code to a callback scheme,
> the accept callback has to somehow arrange to call the do_use_fd()
> callback before returning to the event loop. This requirement is subtle
> and asymmetric.

A callback scheme can be _trivially_ implemented using the current epoll.
I'm sure you know exactly how to do it, so I'm not spending more time
explaining it to you.



> Basically, you spawn off another coroutine. That complicates the "fall
> through to do_use_fd()" logic in the first code by requiring an external
> facility not required by the second code. The second code could simply
> have the accept code loop until EAGAIN.

No, it does not; you always fall through to do_use_fd(). It's that simple.



> Epoll creates a new callback mechanism and plugs into this new callback
> mechanism. It adds a new set of notification hooks which feed into this
> new callback mechanism. The end result is that there is one set of
> notification hooks for classic poll and another set for epoll. When
> epoll is not being used, the poll and socket code makes an additional
> set of checks to see that nobody has registered interest through the new
> callback mechanism.

Whereas the epoll hooks have nothing to do with ->f_op->poll()



> > It fits _exactly_
> >the rt-signal hooks. One of the design goals for me was to add almost
> >nothing on the main path. You can look here for a quick comparison between
> >aio poll and epoll, for a test where event delivery efficiency does matter
> >( pipetest ) :
> >
> This is a comparison of the cost of using epoll to the cost of using aio
> in one particular situation. It is irrelevant to the point I was making.

See, I believe numbers talk. And they do make a pretty clear point
indeed.



> My understanding of the efficiency of the epoll event notification
> subsystem is:
>
> 1) Unlike the current aio poll, it amortizes the cost of interest
> registration/deregistration across multiple events for a given connection.

Yep


> 2) It declares multithreaded use out of scope, making optimizations that
> are only appropriate for use by single threaded callers.

It's not single threaded. It can be used in a multithreaded environment if
the one coding the app has a minimal idea of what he's doing. Like
everything else. You cannot use a FILE* wildly, sharing it randomly inside
a multithreaded app, and expect to receive coherent results. Like 95%
of the APIs. Can those APIs be used in a multithreaded environment ? You bet,
with care, like everything that uses freakin' threads.




- Davide






2002-10-31 03:59:16

by Davide Libenzi

[permalink] [raw]
Subject: Re: and nicer too - Re: [PATCH] epoll more scalable than poll

On Thu, 31 Oct 2002, Jamie Lokier wrote:

> No, the cost of ->poll() is somewhat less than read()/write(), because
> the latter requires a system call and the former does not. System
> calls are still nowhere near as cheap as function calls.

Jamie, it's not about the cost, it's that IMHO it is useless. And it might
generate confusion about the API usage.



> I have thought about an optimal server state machine. (I presume from
> your carefully thought out implementation that you have too).
>
> In a state machine, each fd has some user-space state. I've already
> hinted at how this is used to prevent starvation/livelock on a busy
> server, and make service fairer.
>
> I would take that further and _defer_ the epoll_ctl() to register an
> fd until the first time I have seen EAGAIN from that fd. This is
> because in some cases, epoll_ctl() would not be needed at all - so we
> can remove that overhead, and the system call overhead.
>
> Now you would force me to call read() or write() after the
> epoll_ctl(), even though I _know_ the result is always going to be
> EAGAIN. You're forcing me to make an always redundant system call.
> But I can't omit it, because that's a race condition.
>
> So, I've thought about the _optimal_ state machine and it's clear that
> epoll should test the condition on fd registration - for best
> performance. (Nothing to do with scalability, just raw performance).

Jamie, I don't force you to call read/write soon. Your state machine will
have a state 0, from where everything starts. Let's say that this server
is an SMTP server that supports PIPELINING. When a client connects ( you
accept ), your acceptor routine basically puts the fd for the new
connection inside your list of ready fds. Such a list will contain, at the
bare bone, connection status, state machine state and a callback.

Then, whenever you feel it appropriate, you pop the fd from the ready list
and you call the associated callback, which for state 0 will have encoded
"send SMTP welcome message" to the client. The socket write buffer will be
empty and your write() will return != EAGAIN, so you keep your fd inside
your ready list. Having a ready list enables you to handle priorities,
fairness, etc... Having successfully sent the welcome string moves you to
the next state, state 1. Whenever you find it appropriate, you call again
the callback associated with the file descriptor, which for state 1 will
have encoded "read SMTP command". Now suppose that the SMTP client is lazy
and you have nothing in the input buffer ( or you partially read the SMTP
command ). The read() will return EAGAIN, you remain in state 1 and you
remove the fd from your ready list. This guy is now _ready_ to generate an
event. One of the next times you call epoll_wait(2) you'll find our famous
fd ready to be used. You push it into the ready list, and it's up to you,
based on your fairness policies, to use it soon or not. The important
thing is that you keep it in your ready list and you do not go wait for it.

Now, the PIPELINING stuff makes it worth having your ready-fds list to
apply fairness rules among your clients. The above pattern repeats, moving
your state machine among its states, until finally you reach the final
state, where you drop the connection. Now, this one, which is a typical
state machine implementation, can be _trivially_ implemented with epoll,
and I don't see how adding an initial event might help in this design. The
other, even more trivial, implementation using coroutines shows its
simplicity in a pretty clear way.
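
At the bare bone, the per-connection state could look like this ( a
sketch only; all names here are made up ) :

	struct smtp_conn {
		int fd;
		int state;	/* 0 = send welcome, 1 = read command, ... */
		int (*cb)(struct smtp_conn *);	/* encodes the current state */
		struct smtp_conn *next;	/* link in the user-space ready list */
	};

	/* pop from the ready list whenever the fairness policy decides;
	 * the callback returns 0 on EAGAIN ( fd leaves the ready list
	 * and a later epoll_wait(2) will re-arm it ) or 1 ( still ready ) */
	while ((conn = ready_list_pop()) != NULL) {
		if (conn->cb(conn))
			ready_list_push(conn);
	}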



> Be careful with your rules. epoll should work with blocking fds too,
> if you understand the rules well enough, and fd registration doesn't
> have to be done at the same time as accept/connect/pipe.

Obviously you can register the fd whenever you want. I would take _a_lot_
of care using it with blocking files though. Not because it will crash or
something like that, but because you might stall your app on a read/write
operation. Suppose you received your event, and you have, for example, 2000
bytes in your input buffer. You start reading the data with a blocking
file, and when the data is over you'll be left waiting on the system call,
which is definitely _not_ what you want to do in a 1:N ( one task, N
files ) application architecture. You don't really want to use blocking
files with an edge triggered event API.
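
In other words ( sketch; consume() is just a placeholder ) :

	/* event received: say 2000 bytes are buffered on a BLOCKING fd */
	while ((n = read(fd, buf, sizeof(buf))) > 0)
		consume(buf, n);
	/* once the buffered data is gone, this read() blocks the whole
	 * task instead of returning EAGAIN, stalling every other
	 * connection handled by the same 1:N loop */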



> Your current rule in practice is:
>
> an event is generated on every "would-block" -> "ready" transition.
> after fd registration, you must treat the fd as "ready".
>
> The proposed rule is this:
>
> an event is generated on every "would-block" -> "ready" transition.
> after fd registration, you may treat the fd as in any state you like.
>
> The proposed rule is better because it permits better optimisations in
> user space, as explained earlier. (If you _really_ want to avoid the
> call to ->poll() when user space doesn't care, make that a flag
> argument to epoll_ctl()).

I still prefer 1), Jamie. Besides the system call cost ( which is not always
a cost, see soon-ready ops ), there's the fact of making the user
follow a behavior pattern, which point 2) leaves uncertain. Now, I guess
that we will spend a lot of time arguing and talking about nothing. Let's
go to the code. Show me with real code ( possibly not 25000 lines :) )
where you get stuck w/out having the initial event, and if it makes sense
and there's no clean way to solve it in user space, I'll seriously
consider your ( and John's ) proposal.




- Davide


2002-10-31 04:48:40

by Davide Libenzi

[permalink] [raw]
Subject: Re: and nicer too - Re: [PATCH] epoll more scalable than poll

On Wed, 30 Oct 2002, Martin Waitz wrote:

> On Tue, Oct 29, 2002 at 06:24:27PM -0800, Davide Libenzi wrote:
> > You have two ways to know if "something" changed. You call everyone each
> > time and you ask him if he has changed, or you call everyone one time by
> > saying "call me when you're changed".
> well, you don't say 'call me when you're changed' but
> 'i'm interested in your status, please be prepared to report if you
> have changed' when calling epoll_ctl.

Yes, I just don't like to write much :)



> > It's behind the "call me when you're changed" phrase that lays the
> > concept of edge triggered APIs.
> in most situations, you are not really interested in 'has it changed?'
> but in 'what has it changed to?'.
> this is the difference between edge- or level-triggered notification.
>
> e.g. the application wants to know
> 'from which fds can i read without blocking' and not which fds
> happend to change their block-state
> (perhaps there is still data available after the last read, in
> which case the application would like to be notified about this
> situation)

The state of I/O can change in only two ways: from I/O space available ( 1 )
to I/O space empty ( 0 ) and the reverse. You generate the 1->0 transitions
yourself, and I guess those are not very interesting. You're very much
interested in the 0->1 transitions indeed, and those are the ones the
kernel generates.
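
In code, the consumer side of those 0->1 transitions is the usual drain
loop; a sketch, with handle_data()/handle_error() standing in for whatever
the application does :

#include <errno.h>
#include <unistd.h>

void handle_data(const char *buf, ssize_t n);  /* application-defined */
void handle_error(int fd);                     /* application-defined */

/* After an edge ( 0->1 ) event, consume until the kernel says
 * "would block"; the next event only arrives on the next 0->1. */
void drain(int fd)
{
	char buf[4096];
	ssize_t n;

	while ((n = read(fd, buf, sizeof(buf))) > 0)
		handle_data(buf, n);
	if (n < 0 && errno != EAGAIN)
		handle_error(fd);
}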



> is it possible that http://www.xmailserver.org is down atm?
> i couldn't get as much docu about epoll as i wanted to,
> so please correct me if my above view about epoll is incorrect
>
> and yes, i haven't looked at any code yet,
> i just like the kevent docu better then the epoll docu... ;)
> and yes, i would like to port kevent to linux,
> but i don't have any time to do this in the next months... :(

It has always been up as far as I can tell. It's on a T1, but during these
days it has taken quite a few hits because of epoll.



- Davide



2002-10-31 11:03:17

by Suparna Bhattacharya

[permalink] [raw]
Subject: Re: [Lse-tech] Re: and nicer too - Re: [PATCH] epoll more scalable than poll

On Wed, Oct 30, 2002 at 07:21:24PM -0800, Davide Libenzi wrote:
> On Wed, 30 Oct 2002, John Gardiner Myers wrote:
>
> > Epoll creates a new callback mechanism and plugs into this new callback
> > mechansim. It adds a new set of notification hooks which feed into this
> > new callback mechansim. The end result is that there is one set of
> > notification hooks for classic poll and another set for epoll. When
> > epoll is not being used, the poll and socket code makes an additional
> > set of checks to see that nobody has registered interest through the new
> > callback mechanism.
>
Whereas the epoll hooks have nothing to do with ->f_op->poll()
>

I think what John means, and what Jamie has also brought up in a
separate note is that now when an event happens on an fd, in some cases
there are tests for 3 kinds of callbacks that get triggered -- the wait
queue for poll type registrations, the fasync list for sigio, and the
new epoll file send notify type callbacks. There is a little overhead
(not sure if significant) for each kind of test ...

>
>
> > > It fits _exactly_
> > >the rt-signal hooks. One of the design goals for me was to add almost
> > >nothing on the main path. You can lookup here for a quick compare between
> > >aio poll and epoll for a test where events delivery efficency does matter
> > >( pipetest ) :
> > >
> > This is a comparison of the cost of using epoll to the cost of using aio
> > in one particular situation. It is irrelevant to the point I was making.
>
> See, I believe numbers talk. And they make a pretty clear point
> indeed.
>
>
>
> > My understanding of the efficiency of the epoll event notification
> > subsystem is:
> >
> > 1) Unlike the current aio poll, it amortizes the cost of interest
> > registration/deregistration across multiple events for a given connection.
>
> Yep
>

Adding persistent iocb support to aio doesn't appear too hard, and
to be fair to aio, it does seem to help it come much closer to epoll,
in fact very much closer, at least for pipetest with a quickly hacked
version that I tried. There still appears to be a gap remaining to
be covered, i.e. epoll continues to lead :) albeit by a smaller margin.

A little more magic is going on than just interest registration
amortization (and I suspect it's not just the threading argument),
worth analysing if for no other reason than to gain a better
understanding of these 2 event delivery mechanisms, the cores of
which are now both in the mainline kernel.

Regards
Suparna

--
Suparna Bhattacharya ([email protected])
Linux Technology Center
IBM Software Labs, India

2002-10-31 15:01:19

by Jamie Lokier

[permalink] [raw]
Subject: Re: and nicer too - Re: [PATCH] epoll more scalable than poll

Davide Libenzi wrote:
> [long description of ready lists]

Davide, you have exactly explained ready lists, which I assume we'd
already agreed on (cf. beer), and completely missed the point about
deferring the call to epoll_ctl(). You haven't mentioned that at all
in your description.

Consider a caching HTTP proxy server. Let's say 25% of the requests to
a proxy server are _not_ using pipelining. Then you can save 25% of
the calls to epoll_ctl() if network conditions are favourable.

> Jamie I don't force you to call read/write soon.

You do if I try to optimise by deferring the call to epoll_ctl().
Let's see how my user space optimisation is affected in your description.

> Your state machine will
> have a state 0, from where everything starts. Let's say that this server
> is an SMTP server and that supports PIPELINING. When a client connect (
> you accept ) you will basically have your acceptor routine that puts the
> fd for the new connection inside your list of ready-fds. Such list will
> contain connection status, state machine state and a callback at the bare
> bone. Then whenever you feel it appropriate you pop the fd from the ready
> list and you call the associated callback. That for state 0 will have
> encoded "send SMTP welcome message" to the client. The socket write buffer
> will be empty and your write() will return != EAGAIN. So you keep your fd
> inside your ready list. Having a ready list enables you to handle
> priorities, fairness, etc... Having successfully sent the welcome string
> will move you to the next state, state 1. Whenever you'll find it
> appropriate, you'll call again the callback associated with the file
> descriptor, that for state 1 will have encoded "read SMTP command". Now
> suppose that the SMTP client is lazy and you have nothing in the input
> buffer ( or you partially read the SMTP command ). The read() will return
> EAGAIN, you remain in state 1 and you remove the fd from your ready list.

At this point, I would call epoll_ctl(). Note, I do _not_ call
epoll_ctl() after accept(), because that is a waste of time. It is
better to defer it because sometimes it is not needed at all.

> This guy is _ready_ to generate an event. The next time you call
> epoll_wait(2) you'll find our famous fd ready to be used.

Most of the time this works, but there's a race condition: after I saw
EAGAIN and before I called epoll_ctl(), the state might have changed.
So I must call read() after epoll_ctl(), even though it is 99.99%
likely to return EAGAIN.

Here is the system call sequence that I end up executing:

 - read() = nbytes
 - read() = EAGAIN
 - epoll_ctl()   // Called first time we see EAGAIN,
                 // if we will want to read more.
 - read() = EAGAIN

The final read is 99.99% likely to return EAGAIN, and could be
eliminated if epoll_ctl() had an option to test the condition.
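
A sketch of that deferred-registration pattern, with the race-closing
read; consume() and ep_add() are illustrative stand-ins ( ep_add() for an
EP_CTL_ADD call on the application's epoll fd ) :

#include <errno.h>
#include <unistd.h>

void consume(const char *buf, ssize_t n); /* application-defined */
int ep_add(int epfd, int fd);             /* stands for EP_CTL_ADD */

/* Register with epoll only on the first EAGAIN, then read once more
 * to close the window where data arrived before the registration. */
void read_with_deferred_register(int epfd, int fd, int *registered)
{
	char buf[4096];
	ssize_t n;

	for (;;) {
		n = read(fd, buf, sizeof(buf));
		if (n > 0) {
			consume(buf, n);
			continue;
		}
		if (n < 0 && errno == EAGAIN && !*registered) {
			ep_add(epfd, fd);   /* first EAGAIN: register now */
			*registered = 1;
			continue;           /* re-read: data may have arrived
			                     * between EAGAIN and epoll_ctl() */
		}
		break; /* EAGAIN after registering, EOF, or a real error */
	}
}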

> Now, this typical state machine implementation
> can be _trivially_ built with epoll, and I don't see how adding an
> initial event might help in this design. The other, even more trivial,
> implementation using coroutines shows its simplicity in a pretty clear
> way.

That's because it doesn't help in that design. It helps in a
different (faster in some scenarios ;) design.

> You don't really want to use blocking files with
> an edge triggered event API.

Agreed, it is rarely useful, but I felt your description ("event after
AGAIN") was technically incorrect. Of course it would be ok to
_specify_ the API as only giving defined behaviour for non-blocking I/O.

> >
> > an event is generated on every "would-block" -> "ready" transition.
> > after fd registration, you must treat the fd as "ready".
> >
> > The proposed rule is this:
> >
> > an event is generated on every "would-block" -> "ready" transition.
> > after fd registration, you may treat the fd as in any state you like.
> >
> > The proposed rule is better because it permits better optimisations in
> > user space, as explained earlier. (If you _really_ want to avoid the
> > call to ->poll() when user space doesn't care, make that a flag
> > argument to epoll_ctl()).
>
> I still prefer 1), Jamie. Besides the system call cost ( which is not always
> a cost, see soon-ready ops ),

Do you avoid the cost of epoll_ctl() per new fd?

> there's the fact of making the user follow a behavior
> pattern, which point 2) leaves uncertain.

That's the point :-) Flexibility is _good_. It means that somebody
can implement a technique that you haven't thought of.

With 2) the _programmer_ (let's assume some level of understanding)
can use the exact application code that you offered. It will work fine.

However if they're feeling clever, like me, they can optimise further.

I would suggest, though, to simply provide both options: EP_CTL_ADD
and EP_CTL_ADD_AND_TEST. That's so explicit that nobody can be
confused!

> Show me with real code ( possibly not 25000 lines :) )
> where you get stuck w/out having the initial event and if it makes sense
> and there's no clean way to solve it in user space, I'll seriously
> consider your ( and John ) proposal.

Davide, I don't write buggy code deliberately. My code would not get
stuck using present epoll. Neither API 1) nor 2) has a bug, but
version 1) is slower with some kinds of state machine.

(Don't confuse me with the person who said your API is buggy -- it is
_not_ buggy, it's just not as flexible as it should be).

I can write code that shows the optimisation if that would make it
clearer.

-- Jamie

2002-10-31 15:37:19

by Jamie Lokier

[permalink] [raw]
Subject: Unifying epoll,aio,futexes etc. (What I really want from epoll)

ps. I thought I should explain what bothers me most about epoll at the
moment. It's good at what it does, but it's so very limited in what
it supports.

I have a high performance server application in mind, that epoll is
_almost_ perfect for but not quite.

Davide, you like coroutines, so perhaps you will appreciate a web
server that serves a mixture of dynamic and static content, using
coroutines and user+kernel threading in a carefully balanced way.
Dynamic content is cached, accurately (taking advantage of nanosecond
mtimes if possible), yet served as fast as static pages (using a
clever cache validation method), and is built from files (read using
aio to improve throughput) and subrequests to other servers just like
a proxy. Data is served zero-copy using sendfile and /dev/shm.

A top quality server like that, optimised for performance, has to
respond to these events:

- network accept()
- read/write/exception on sockets and pipes
- timers
- aio
- futexes
- dnotify events

See how epoll only helps with the first two? And this is the very
application space that epoll could _almost_ be perfect for.

Btw, it doesn't _have_ to be a web server. Enterprise scale Java
runtimes, database servers, spider clients, network load generators,
proxies, even humble X servers - also have very similar requirements.

There are several scalable and fast event queuing mechanisms in the
kernel now: rt-signals, aio and epoll, yet each of them is limited by
only keeping track of a few kinds of possible event.

Technically, it's possible to use them all together. If you want to
react to all the kinds of events I listed above, you have to. But
it's mighty ugly code to use them all at once, and it's certainly not
the "lean and mean" event loop that everyone aspires to.

By adding yet another mechanism without solving the general problem,
epoll just makes the mighty ugly userspace more ugly. (But it's
probably worth using - socket notification through rt-signals has its
own problems).

I would very much like to see a general solution to the problem of all
different kinds of events being queued to userspace efficiently,
through one mechanism ("to bind them all..."). Every piece of this puzzle
has been written already, they're just not joined up very well.

I'm giving this serious thought now, if anyone wants to offer input.

-- Jamie

ps. Alan, you mentioned something about futexes being suitable.
Was that a vague notion, or do you have a clear idea in mind?

(A nice way to collect events from a _set_ of futexes might be just the thing.)

2002-10-31 15:41:43

by bert hubert

[permalink] [raw]
Subject: Re: Unifying epoll,aio,futexes etc. (What I really want from epoll)

On Thu, Oct 31, 2002 at 03:41:12PM +0000, Jamie Lokier wrote:

> There are several scalable and fast event queuing mechanisms in the
> kernel now: rt-signals, aio and epoll, yet each of them is limited by
> only keeping track of a few kinds of possible event.

Probably you could even abuse netlink for this purpose. netlink for kqueue?

Regards,

bert

--
http://www.PowerDNS.com Versatile DNS Software & Services
http://lartc.org Linux Advanced Routing & Traffic Control HOWTO

2002-10-31 16:21:22

by Alan

[permalink] [raw]
Subject: Re: Unifying epoll,aio,futexes etc. (What I really want from epoll)

On Thu, 2002-10-31 at 15:41, Jamie Lokier wrote:
> - network accept()
> - read/write/exception on sockets and pipes
> - timers
> - aio
> - futexes
> - dnotify events
>
> See how epoll only helps with the first two? And this is the very
> application space that epoll could _almost_ be perfect for.
>
> ps. Alan, you mentioned something about futexes being suitable.
> Was that a vague notion, or do you have a clear idea in mind?
>
> (A nice way to collect events from a _set_ of futexes might be just the thing.)

The futexes do all the high performance stuff you actually need. One way
to do it is to do user space signal delivery setting futexes off, but
that means user space switches and is just wrong. Setting a list of
futexes instead of signal delivery in kernel space is fast. Letting the
user pick which futexes get set allows you to do neat stuff like trees of
wakeup without having to handle it kernel-side.

What is hard is multiple futex waits and livelock for that. I think it
can be done properly but I've not sat down and designed it all out - I
wonder what Rusty thinks.

Alan

2002-10-31 16:39:29

by Davide Libenzi

[permalink] [raw]
Subject: Re: and nicer too - Re: [PATCH] epoll more scalable than poll

On Wed, 30 Oct 2002, Zach Brown wrote:

> > It is very easy for me to remain calm here. You're a funny guy. You've been
> > in computer science for many many years and still you're not able to
> > understand how edge triggered events work. And look, this applies to every
> > field, from EE to CS. Book suggestions would be requested here, but since
> > I believe grasping inside a technical library to be pretty fun, I'll leave
> > you this pleasure.
>
> http://www.infidels.org/news/atheism/logic.html#hominem
>
> I know its hard, but can we try and avoid the most pathetic pitfalls of
> arguing over email?

Zach, is it just me or did I receive this one twice :)


- Davide


2002-10-31 18:28:10

by Davide Libenzi

[permalink] [raw]
Subject: Re: [Lse-tech] Re: and nicer too - Re: [PATCH] epoll more scalable than poll

On Thu, 31 Oct 2002, Suparna Bhattacharya wrote:

> I think what John means, and what Jamie has also brought up in a
> separate note is that now when an event happens on an fd, in some cases
> there are tests for 3 kinds of callbacks that get triggered -- the wait
> queue for poll type registrations, the fasync list for sigio, and the
> new epoll file send notify type callbacks. There is a little overhead
> (not sure if significant) for each kind of test ...

The poll hook is not where an edge triggered event notification API wants
to hook. Given the way notifications are sent and the registration
method, it is not the most efficient thing. Hooking inside the fasync
list is worth investigating and I'll look into it as soon as I've
finished the patch for 2.5.45 for Linus. It does have certain limits IMHO,
like the single lock protection. I'll look into it, even though the famous
cost of the extra callback check cannot even be measured IMHO.



- Davide


2002-10-31 18:54:11

by Davide Libenzi

[permalink] [raw]
Subject: Re: and nicer too - Re: [PATCH] epoll more scalable than poll

On Thu, 31 Oct 2002, Jamie Lokier wrote:

> Davide, you have exactly explained ready lists, which I assume we'd
> already agreed on (cf. beer), and completely missed the point about
> deferring the call to epoll_ctl(). You haven't mentioned that at all
> in your description.
>
> Consider a caching HTTP proxy server. Let's say 25% of the requests to
> a proxy server are _not_ using pipelining. Then you can save 25% of
> the calls to epoll_ctl() if network conditions are favourable.
>
> You do if I try to optimise by deferring the call to epoll_ctl().
> Let's see how my user space optimisation is affected in your description.
>
> At this point, I would call epoll_ctl(). Note, I do _not_ call
> epoll_ctl() after accept(), because that is a waste of time. It is
> better to defer it because sometimes it is not needed at all.
>
> > This guy is _ready_ to generate an event. The next time you call
> > epoll_wait(2) you'll find our famous fd ready to be used.
>
> Most of the time this works, but there's a race condition: after I saw
> EAGAIN and before I called epoll_ctl(), the state might have changed.
> So I must call read() after epoll_ctl(), even though it is 99.99%
> likely to return EAGAIN.
>
> Here is the system call sequence that I end up executing:
>
> - read() = nbytes
> - read() = EAGAIN
> - epoll_ctl()   // Called first time we see EAGAIN,
>                 // if we will want to read more.
> - read() = EAGAIN
>
> The final read is 99.99% likely to return EAGAIN, and could be
> eliminated if epoll_ctl() had an option to test the condition.
>
> Do you avoid the cost of epoll_ctl() per new fd?

Jamie, the cost of epoll_ctl(2) is minimal/zero compared with the average
life of a connection. Also, it might be paid once at fd "creation" and once
at fd "removal". It's not inside a high frequency loop like epoll_wait(2).
Believe me, or ... do not believe me and show me a little performance data
that demonstrates this degradation due to an early epoll_ctl(2). And
Jamie, if you really really want to use such a pattern ( delaying the fd
registration, which IMHO does not get you any performance boost )
you can still do it in user space ( poll() with timeout 0 after epoll_ctl(2) ).
Jamie, I'm _really_ willing to be contradicted with performance data here.
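
Spelled out, the user-space trick is a zero-timeout poll(2) on the
just-registered fd; no new kernel interface needed :

#include <poll.h>

/* After EP_CTL_ADD, one zero-timeout poll(2) recovers the initial
 * readiness state entirely in user space: if the fd is already ready,
 * push it straight onto the application's ready list. */
int initially_ready(int fd)
{
	struct pollfd pfd;

	pfd.fd = fd;
	pfd.events = POLLIN;
	pfd.revents = 0;
	return poll(&pfd, 1, 0) > 0 && (pfd.revents & POLLIN);
}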


> I would suggest, though, to simply provide both options: EP_CTL_ADD
> and EP_CTL_ADD_AND_TEST. That's so explicit that nobody can be
> confused!

The EP_CTL_ADD_AND_TEST would do the poll() timeout 0 trick in kernel
space. Is it faster done in the kernel ? Sure, you can measure something at
rates of 500000 registrations per second. But even here, Jamie, I'm willing
to be contradicted by performance data. You're trying to optimize
something that is not inside a high frequency loop, and it's not going to
give you any measurable improvement IMHO.




- Davide


2002-10-31 20:13:09

by Davide Libenzi

[permalink] [raw]
Subject: Re: Unifying epoll,aio,futexes etc. (What I really want from epoll)

On Thu, 31 Oct 2002, Jamie Lokier wrote:

> ps. I thought I should explain what bothers me most about epoll at the
> moment. It's good at what it does, but it's so very limited in what
> it supports.
>
> I have a high performance server application in mind, that epoll is
> _almost_ perfect for but not quite.
>
> Davide, you like coroutines, so perhaps you will appreciate a web
> server that serves a mixture of dynamic and static content, using
> coroutines and user+kernel threading in a carefully balanced way.
> Dynamic content is cached, accurately (taking advantage of nanosecond
> mtimes if possible), yet served as fast as static pages (using a
> clever cache validation method), and is built from files (read using
> aio to improve throughput) and subrequests to other servers just like
> a proxy. Data is served zero-copy using sendfile and /dev/shm.
>
> A top quality server like that, optimised for performance, has to
> respond to these events:
>
> - network accept()
> - read/write/exception on sockets and pipes
> - timers
> - aio
> - futexes
> - dnotify events
>
> See how epoll only helps with the first two? And this is the very
> application space that epoll could _almost_ be perfect for.
>
> Btw, it doesn't _have_ to be a web server. Enterprise scale Java
> runtimes, database servers, spider clients, network load generators,
> proxies, even humble X servers - also have very similar requirements.
>
> There are several scalable and fast event queuing mechanisms in the
> kernel now: rt-signals, aio and epoll, yet each of them is limited by
> only keeping track of a few kinds of possible event.
>
> Technically, it's possible to use them all together. If you want to
> react to all the kinds of events I listed above, you have to. But
> it's mighty ugly code to use them all at once, and it's certainly not
> the "lean and mean" event loop that everyone aspires to.
>
> By adding yet another mechanism without solving the general problem,
> epoll just makes the mighty ugly userspace more ugly. (But it's
> probably worth using - socket notifcation through rt-signals has its
> own problems).
>
> I would very much like to see a general solution to the problem of all
> different kinds of events being queued to userspace efficiently,
> through one mechanism ("to bind them all..."). Every piece of this puzzle
> has been written already, they're just not joined up very well.
>
> I'm giving this serious thought now, if anyone wants to offer input.

Jamie, the fact that epoll supports a limited number of "objects" was
as-designed at the time. I see it quite easy to extend it to support
other objects. Futexes are a matter of one line of code :

/* Waiter either waiting in FUTEX_WAIT or poll(), or expecting signal */
static inline void tell_waiter(struct futex_q *q)
{
	wake_up_all(&q->waiters);
	if (q->filp) {
		send_sigio(&q->filp->f_owner, q->fd, POLL_IN);
+		file_notify_send(q->filp, ION_IN, POLLIN | POLLRDNORM);
	}
}

Timers, as long as you access them through a file* interface ( like futexes ),
will become trivial too. Another line should be sufficient for dnotify :

void __inode_dir_notify(struct inode *inode, unsigned long event)
{
	struct dnotify_struct *dn;
	struct dnotify_struct **prev;
	struct fown_struct *fown;
	int changed = 0;

	write_lock(&dn_lock);
	prev = &inode->i_dnotify;
	while ((dn = *prev) != NULL) {
		if ((dn->dn_mask & event) == 0) {
			prev = &dn->dn_next;
			continue;
		}
		fown = &dn->dn_filp->f_owner;
		send_sigio(fown, dn->dn_fd, POLL_MSG);
+		file_notify_send(dn->dn_filp, ION_IN, POLLIN | POLLRDNORM | POLLMSG);
		if (dn->dn_mask & DN_MULTISHOT)
			prev = &dn->dn_next;
		else {
			*prev = dn->dn_next;
			changed = 1;
			kmem_cache_free(dn_cache, dn);
		}
	}
	if (changed)
		redo_inode_mask(inode);
	write_unlock(&dn_lock);
}

This is the result of a quite quick analysis, but I do not expect it to be
much more difficult than that.




- Davide


2002-10-31 22:56:10

by Jamie Lokier

[permalink] [raw]
Subject: Re: Unifying epoll,aio,futexes etc. (What I really want from epoll)

Davide, I think you are right. That's why I said epoll was _nearly_ perfect :)

Davide Libenzi wrote:
> Jamie, the fact that epoll supports a limited number of "objects" was
> as-designed at the time. I see it quite easy to extend it to support
> other objects. Futexes are a matter of one line of code :

Agreed - though I'd prefer if the overhead of creating a temporary fd
for each futex were eliminated, as well as the potentially large fd
table. (In a threaded app, it's reasonable to have many more futexes
than sockets, and they are created and destroyed rapidly too. No data
on how many of those futexes need to be registered, though).

In other words, add another op to sys_futex() called FUTEX_EPOLL which
directly registers the futex on an epoll interest list, and let epoll
report those events as futex events.

(I suspect that is quite easy too).

> Timers, as long as you access them through a file* interface ( like futexes ),
> will become trivial too. Another line should be sufficient for dnotify :

Sorry (<humble/>), ignore timers. Somehow I picked up the idea that
epoll_wait() didn't have a timeout from some example or other, which
was very silly of me. I've read the patch properly now! Of course
epoll supports timers - a timeout is quite enough for user space.

> void __inode_dir_notify(struct inode *inode, unsigned long event)

Agreed. This is looking good :)

It's lucky that polling for readability on a directory is not useful
in any other way, though :)

The semantics for this are a bit confusing and inconsistent with
poll(). User gets POLL_RDNORM event which means something in the
directory has changed, not that the directory is now readable or that
poll() would return POLL_RDNORM. It really should be a different
flag, made for the purpose.

> This is the result of a quite quick analysis, but I do not expect it to be
> much more difficult than that.

Someone suggested hooking into ->poll() as a less obtrusive way to
implement epoll. You're probably right that it's quicker to hook
directly as you have done, although it would be great if epoll could
somehow "fall back" to using ->poll() in the cases where epoll isn't
directly supported by a file object.

I wrote quite a lot about futexes above. That's because good futex
support, and fallback to ->poll() would pretty much make epoll
universal. What do you think of these ideas?:

1. Add FUTEX_EPOLL operation to futex.c, which registers a futex
with an epoll interest list. This would cause FUTEX_WAKE
calls on that address to generate epoll events. Some care is
needed here to keep track of the exact number of events generated,
because some rather subtle usages of futex depend on the
return value from futex_wake being the _exact_ number of waiters
that are woken. It would have to correspond to the exact number
of events counted by userspace.

2. Add a check to EP_CTL_ADD which checks whether a file supports
epoll notifications natively. Perhaps a file_operations hook
is in order here. If it does, great. If not, fall back to
a generic mechanism that uses the file's ->poll() method. (I
haven't thought through for sure how plausible this is).
Magically, every kind of fd works, including special devices,
and the things that are most performance critical (sockets,
pipes, futexes) are tuned. Yum!

3. Eliminate send_sigio() calls - pass all events to epoll, and let
epoll dispatch signals where they have been registered. In
combination with (2), this magically provides SIGIO support for
all fd types as well.

4. Merge the aio and epoll event reporting functions: io_getevents
and epoll_wait are remarkably similar, and should really be one
function. It would introduce binary incompatibility somewhere though.

There are a few cherries to go on top but I don't want to make this
email any longer. Those are the essentials :)

-- Jamie

ps. Falling back to ->poll also means you can make trees of
notifications, like Alan suggested :)

pps. I much prefer epoll's use of an fd for the interest list than
aio's aio_context_t.

2002-10-31 23:45:50

by Rusty Russell

[permalink] [raw]
Subject: Re: Unifying epoll,aio,futexes etc. (What I really want from epoll)

On 31 Oct 2002 16:45:58 +0000
Alan Cox <[email protected]> wrote:

> What is hard is multiple futex waits and livelock for that. I think it
> can be done properly but I've not sat down and designed it all out - I
> wonder what Rusty thinks.

Hmm... Never thought about it. You mean an API like:

struct futex_set *futex_set_init();
struct futex_set *futex_set_add(struct futex_set *, struct futex *);
/* Returns futex obtained. */
struct futex *futex_set_wait(struct futex_set *);

I think a naive implementation of futex_set_wait would look like:

	set = futex_set
try:
	for each futex in set {
		if (grab in userspace) {
			close fds;
			return with futex;
		}
		close old fd for futex if any
		call FUTEX_FD to get fd notification of futex;
	}

	select on fds
	set = fds which are ready
	goto try

You could, of course, loop through the fast path once before making any
syscalls. Another optimization is to have FUTEX_FD reuse an existing fd
rather than requiring the close.

Not sure I get the point about livelock though: deadlock is possible if
apps seek multiple locks at once without care, of course.

Rusty.
--
there are those who do and those who hang on and you don't see too
many doers quoting their contemporaries. -- Larry McVoy

2002-11-01 00:26:36

by Jamie Lokier

[permalink] [raw]
Subject: Re: Unifying epoll,aio,futexes etc. (What I really want from epoll)

Rusty Russell wrote:
> I think a naive implementation of futex_set_wait would look like:

Vaguely. We are looking for something with the queue-like semantics
of epoll and rt-signals: persistent (as opposed to one-shot)
listening, ordered delivery of events, scalable listening to thousands
at once (without the poll/select O(n) problem).

> Not sure I get the point about livelock though: deadlock is possible if
> apps seek multiple locks at once without care, of course.

I'm not sure what Alan meant either.

-- Jamie

2002-11-01 00:49:31

by Davide Libenzi

[permalink] [raw]
Subject: Re: Unifying epoll,aio,futexes etc. (What I really want from epoll)

On Thu, 31 Oct 2002, Jamie Lokier wrote:

> Davide, I think you are right. That's why I said epoll was _nearly_ perfect :)
>
> Davide Libenzi wrote:
> > Jamie, the fact that epoll supports a limited number of "objects" was
> > as-designed at the time. I see it quite easy to extend it to support
> > other objects. Futexes are a matter of one line of code :
>
> Agreed - though I'd prefer if the overhead of creating a temporary fd
> for each futex were eliminated, as well as the potentially large fd
> table. (In a threaded app, it's reasonable to have many more futexes
> than sockets, and they are created and destroyed rapidly too. No data
> on how many of those futexes need to be registered, though).
>
> In other words, add another op to sys_futex() called FUTEX_EPOLL which
> directly registers the futex on an epoll interest list, and let epoll
> report those events as futex events.

Jamie, the futex support can easily be done with a one line patch. I
still prefer the one-to-one mapping between futexes and files. It makes
everything easier. I don't really see futex creation/destruction as a high
frequency event that might be suitable for optimization. Usually you have
your own set of resources to be "protected" and in 95% of cases you know
those resources from the beginning.



> > Timers, as long as you access them through a file* interface ( like futexes ),
> > will become trivial too. Another line should be sufficient for dnotify :
>
> Sorry (<humble/>), ignore timers. Somehow I picked up the idea that
> epoll_wait() didn't have a timeout from some example or other, which
> was very silly of me. I've read the patch properly now! Of course
> epoll supports timers - a timeout is quite enough for user space.

If you want to time out I/O operations you can easily put a timer routine
in your main event scheduler loop. But I still like the idea of timers
easily accessible through a file* interface.



> > void __inode_dir_notify(struct inode *inode, unsigned long event)
>
> Agreed. This is looking good :)

I asked Linus what he thinks about this one-line patch.



> Someone suggested hooking into ->poll() as a less obtrusive way to
> implement epoll. You're probably right that it's quicker to hook
> directly as you have done, although it would be great if epoll could
> somehow "fall back" to using ->poll() in the cases where epoll isn't
> directly supported by a file object.

I'm currently investigating this ... looks like an easy way to support
"alien" files :)



> I wrote quite a lot about futexes above. That's because good futex
> support, and fallback to ->poll() would pretty much make epoll
> universal. What do you think of these ideas?:
>
> 1. Add FUTEX_EPOLL operation to futex.c, which registers a futex
> with an epoll interest list. This would cause FUTEX_WAKE
> calls on that address to generate epoll events. Some care is
> needed here to keep track of the exact number of events generated,
> because some rather subtle usages of futex depend on the
> return value from futex_wake being the _exact_ number of waiters
> that are woken. It would have to correspond to the exact number
> of events counted by userspace.

I still believe that the 1:1 mapping is sufficient and with that in place (
and the one line patch to kernel/futex.c ) futex support comes nicely.



> 2. Add a check to EP_CTL_ADD which checks whether a file supports
> epoll notifications natively. Perhaps a file_operations hook
> is in order here. If it does, great. If not, fall back to
> a generic mechanism that uses the file's ->poll() method. (I
> haven't thought through for sure how plausible this is).
> Magically, every kind of fd works, including special devices,
> and the things that are most performance critical (sockets,
> pipes, futexes) are tuned. Yum!

Yes, kind of. The hook for an efficient edge triggered event notification
should be something like the socket one, where you have a ->data_ready()
and a ->write_space(), and the caller of these callbacks knows that
notifications have to be delivered on 0->1 transitions. With the poll hook
you have the drawback that the wakeup list is invoked each time data
arrives, and this might generate a few too many events. This is not a
problem since epoll collapses them, but collapsing still costs CPU cycles.
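
Schematically, such a hook could look like this ( field names follow the
2.5-era struct sock, and saving/chaining the original callback is elided;
an illustration of where the 0->1 knowledge lives, not a drop-in patch ) :

static void epoll_data_ready(struct sock *sk, int bytes)
{
	/* The protocol code knows the buffer state here, so the hook can
	 * ( with a suitable emptiness test, elided ) deliver exactly one
	 * notification per 0->1 transition instead of one per packet. */
	if (sk->socket && sk->socket->file)
		file_notify_send(sk->socket->file, ION_IN,
				 POLLIN | POLLRDNORM);
	original_data_ready(sk, bytes);  /* chain to the saved callback */
}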



> 3. Eliminate send_sigio() calls - pass all events to epoll, and let
> epoll dispatch signals where they have been registered. In
> combination with (2), this magically provides SIGIO support for
> all fd types as well.

I would leave that as a later cleanup operation, eventually.




- Davide





2002-11-01 01:49:27

by Matthew D. Hall

[permalink] [raw]
Subject: Re: Unifying epoll,aio,futexes etc. (What I really want from epoll)

If I may respectfully weigh in...

If a new API and/or a significant change in semantics is to be applied
to the kernel for a unified event notification system, this is obviously
an issue for 2.7 or 2.9. As such, we have plenty of time to focus upon
simplicity and correctness rather than plain old inertia. We need to
bring a truly unified, and therefore new, event API to the kernel, and
it must be done right. kevent attempts to achieve this for FreeBSD, and
generally speaking, it succeeds. But Linux can do much better.

The API should present the notion of a general edge-triggered event
(e.g. I/O upon sockets, pipes, files, timers, etc.), and it should do so
simply. Linus made some suggestions on lkml back in 2000
(http://marc.theaimsgroup.com/?l=linux-kernel&m=97236943118139&w=2) that
pretty much hit the nail on the head -- with some exceptions.

* Unless every conceivable event is to be represented as a file (rather
unintuitive IMHO), his proposed interface fails to accommodate non-I/O
events (e.g. timers, signals, directory updates, etc.). As much as I
appreciate the UNIX Way, making everything a file is a massive
oversimplification. We can often stretch the definition far enough to
make this work, but I'd be impressed to see how one intends to call,
e.g., a timer a type of file.

* There is a seemingly significant overhead in performing exactly one
callback per event. Doesn't this prevent any kind of event coalescence?
It seems like we could be doing an awful lot of cache thrashing, among
other things. In some cases, this might happen anyway, but why should
the interface enforce such behavior? In most other cases, it's
perfectly acceptable to inline an event handler (either via compile-time
inlining or literally). I do realize that Linus only suggested that the
C library do the mass callbacks, BTW, so strictly speaking, it's the
userland API that would "enforce such behavior." Nonetheless, this is
of concern.

Enough of what we shouldn't do. Here's what we should:

* The interface should allow the implementation to be highly extensible
without horrible code contortions within the kernel. It is important to
be able to add new types of events as they become necessary. The
interface should be general and simple enough to accommodate these
extensions. Linux (really, UNIX) has failed to exercise this foresight
in the past, and that's why we have the current mess of varied
polling/triggering methods.

* I might be getting a bit utopian here, but IMHO the kernel should
move toward a completely edge-triggered event system. The old
level-triggered interfaces should only wrap this paradigm.

* Might as well reiterate: simplicity. FreeBSD's kevent solves nearly
all of the traditional problems, but it is gross. And complicated.
There were clearly too many computer scientists and not enough
engineers on that team.

* Only one queue per process or kernel thread. Multiple queues per
flow of execution is just ugly and ultimately pointless. That is not to
say that you can't multithread, but each thread simply uses the same
queue. In cases when you want one thread to only wait on a certain
type(s) of event, you can do this as well; you just can't have one
thread juggling more than one queue. Since the event notification is
edge-triggered, I cannot see any reason for a significant performance
degradation resulting from this policy. I am not altogether convinced
that this must be a strict policy, however; the issue of different
userspace threads having different event queues inside the same task is
interesting.

* No re-arming events. They must be manually killed.

* I'm sure everyone would agree that passing an opaque "user context"
pointer is necessary with each event.

* Zero-copy event delivery (of course).

Some question marks:
- Should the kernel attempt to prune the queue of "cancelled" events
(hints later deemed irrelevant, untrue, or obsolete by newer events)?
- Is one-queue-per-task really the way to go? This stops many bad
practices but could prevent some decent ones (see user threads comment).

Matthew D. Hall


2002-11-01 01:55:33

by Jamie Lokier

[permalink] [raw]
Subject: Re: Unifying epoll,aio,futexes etc. (What I really want from epoll)

Davide Libenzi wrote:
> Jamie, the futex support can be easily done with one line of code patch. I
> still prefer the one-to-one mapping between futexes and files. It makes
> everything easier.

I do agree it is very simple and hence good.

> I don't really see futex creation/destroy as an high frequency event
> that might be suitable for optimization. Usually you have your own
> set of resources to be "protected" and in 95% of cases you know
> those resources from the beginning.

Well, I'll disagree but stay mostly quiet. I think it is reasonable
to have a futex per _object_ in certain language run-times.
Allocation rate: 10,000,000 per second in some examples (e.g. certain
kinds of threaded simulator).

Hardly any of those will need associated fds, and I have no figures on
how many or how often, but you can see that futexes are sometimes used
in a very dynamic way because they are so cheap until contention.

That's the cool thing about futexes: there's absolutely zero kernel
overhead until contention, and only one "long" of overhead in user
space.

At contention, two syscalls resolve it synchronously: futex_wait,
futex_wake. The async method using an fd with epoll takes five:
futex_fd, epoll_ctl, poll, futex_wake, futex_close. That works, but
lacks the _cool_ factor that futexes have IMHO. It should be:
futex_wait_async, futex_wake.

I realise my argument is a weak one though :)
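
For reference, the fd in that five-syscall path comes from the FUTEX_FD
op that current 2.5 kernels provide; a sketch ( the epoll registration
and wait steps are elided, since the sys_epoll signatures are being
settled in this very thread ) :

#include <linux/futex.h>
#include <sys/syscall.h>
#include <unistd.h>

/* FUTEX_FD: get an fd that becomes readable on FUTEX_WAKE; the third
 * argument ( 0 ) means no signal wanted. The fd is then registered
 * with epoll, waited on, and closed after the wake. */
int futex_fd_for(int *uaddr)
{
	return syscall(SYS_futex, uaddr, FUTEX_FD, 0, NULL);
}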

> > > Timers, as long as you access them through a file* interface ( like futexes ),
> > > will become trivial too. Another line should be sufficient for dnotify :
> >
> > Sorry (<humble/>), ignore timers. Somehow I picked up the idea that
> > epoll_wait() didn't have a timeout from some example or other, which
> > was very silly of me. I've read the patch properly now! Of course
> > epoll supports timers - a timeout is quite enough for user space.
>
> If you want to timeout I/O operations you can easily put a timer routine
> in your main event scheduler loop. But I still like the idea of timers
> easily accessible through a file* interface.

Sure, but using a file* interface implies entering the kernel - that
can sometimes be skipped* if your timer queue is in user space.

* - it happens under heavy load, conveniently.

> > > void __inode_dir_notify(struct inode *inode, unsigned long event)
> >
> > Agreed. This is looking good :)
>
> I asked Linus what he thinks about this one-line patch.

I have no objections to it. Generally, I'd like epoll to be able to
report _what_ the event was (not just POLL_RDNORM, but what kind of
dnotify event), but as I don't get to run on an ideal kernel [;)] I'll
be happy with POLL_RDNORM.

> I still believe that the 1:1 mapping is sufficient and with that in place (
> and the one line patch to kernel/futex.c ) futex support comes nicely.

It does work - actually, with ->poll() you don't need any lines in futex.c :)

Even if a specialised futex hook is added someday, the fd support will
continue to be useful.

> > 2. Add a check to EP_CTL_ADD which checks whether a file supports
> > epoll notifications natively. Perhaps a file_operations hook
> > is in order here. If it does, great. If not, fall back to
> > a generic mechanism that uses the file's ->poll() method. (I
> > haven't thought through for sure how plausible this is).
> > Magically, every kind of fd works, including special devices,
> > and the things that are most performance critical (sockets,
> > pipes, futexes) are tuned. Yum!
>
> Yes, kind of. The hook for an efficient edge triggered event notification
> should be something like the socket one, where you have a ->data_ready()
> and a ->write_space(), and the caller of these callbacks knows that
> notifications have to be delivered on 0->1 transitions. With the poll hook
> you have the drawback that the wakeup list is invoked each time data
> arrives, and this might generate a few too many events. This is not a
> problem since epoll collapses them, but collapsing still costs CPU cycles.

You avoid the extra CPU cycles like this:

1. EP_CTL_ADD adds the listener to the file's wait queue using
->poll(), and gets a free test of the object readiness [;)]

2. When the transition happens, the wakeup will call your function,
epoll_wakeup_function. That removes the listener from the file's
wait queue. Note, you won't see any more wakeups from that file.

3. When you report the event to user space, _then_ you automatically
add the listener back to the file's wait queue by calling ->poll().

This way, there are no spurious wakeups, and nothing to collapse. I
would not be surprised if this is quite fast - perhaps as fast as the
special epoll hooks.

The nice feature that makes this possible is that waitqueues don't
wake up tasks any more: they simply call your choice of callback
function. It was changed for aio, and it's a good change.

-- Jamie

2002-11-01 02:42:01

by Davide Libenzi

[permalink] [raw]
Subject: Re: Unifying epoll,aio,futexes etc. (What I really want from epoll)

On Thu, 31 Oct 2002, Matthew D. Hall wrote:

> * Unless every conceivable event is to be represented as a file (rather
> unintuitive IMHO), his proposed interface fails to accommodate non-I/O
> events (e.g. timers, signals, directory updates, etc.). As much as I
> appreciate the UNIX Way, making everything a file is a massive
> oversimplification. We can often stretch the definition far enough to
> make this work, but I'd be impressed to see how one intends to call,
> e.g., a timer a type of file.

The fact that a timer is a file guarantees you the usage of an existing
infrastructure and existing APIs to use it. For example epoll_create(2)
returns you a file descriptor, and this enables you ( for example ) to drop
this file descriptor inside a poll set. You also get the cleanup
infrastructure that otherwise you would have to code every time, for each
new object that you create, by yourself. Something like :

int timer_create(void);
int timer_set(struct timespec *ts);

and you can use epoll or poll to get the timer event, and close(2) to
destroy it. You get automatic compatibility with lots of nice stuff by
having an object that is actually a file, and I usually like it as an idea.
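
A sketch of how that proposed pair would compose with the existing fd
machinery ( timer_create()/timer_set() are the proposed calls above, not
an existing API ) :

#include <poll.h>
#include <unistd.h>

struct timespec;                     /* only passed through */
int timer_create(void);              /* proposed call */
int timer_set(struct timespec *ts);  /* proposed call */

/* Because the timer is a file, it drops into poll()/epoll like any
 * other fd and is destroyed with plain close(2). */
int sleep_on_timer(struct timespec *ts)
{
	int tfd = timer_create();
	struct pollfd pfd;

	if (tfd < 0)
		return -1;
	timer_set(ts);
	pfd.fd = tfd;
	pfd.events = POLLIN;
	pfd.revents = 0;
	poll(&pfd, 1, -1);           /* wakes when the timer expires */
	close(tfd);
	return 0;
}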



> * I'm sure everyone would agree that passing an opaque "user context"
> pointer is necessary with each event.

I asked this about a week ago. It's _trivial_ to do in epoll. I did not
receive any feedback, so I didn't implement it. Feedback will be very much
appreciated here ...




- Davide




2002-11-01 02:50:07

by Jamie Lokier

[permalink] [raw]
Subject: Re: Unifying epoll,aio,futexes etc. (What I really want from epoll)

Matthew D. Hall wrote:
> The API should present the notion of a general edge-triggered event
> (e.g. I/O upon sockets, pipes, files, timers, etc.), and it should do so
> simply.

Agreed. Btw, earlier today I had misgivings about epoll, but since
I've had such positive response from Davide I think epoll has
potential to become that ideal interface...

> * Unless every conceivable event is to be represented as a file (rather
> unintuitive IMHO), his proposed interface fails to accommodate non-I/O
> events (e.g. timers, signals, directory updates, etc.).

...apart from this one point!

> As much as I appreciate the UNIX Way, making everything a file is a
> massive oversimplification. We can often stretch the definition far
> enough to make this work, but I'd be impressed to see how one
> intends to call, e.g., a timer a type of file.

If it has an fd, that is, if it has an index into file_table, then
it's a "file". No other semantics are required for event purposes.

This seems quite weird and pointless at first, but actually fds are
quite useful: you can dup them and pass them between processes, and
they have a security model (you can't use someone else's fd unless
they've passed it to you). Think of an fd as a handle to an arbitrary
object.

OTOH look at rt-signals: a complete mess, no kernel allocation
mechanism, libraries fight it out and don't always work. Look at aio:
it has an aio_context_t - IMHO that should be an fd, not an opaque
number that cannot be transferred to another process or polled on.

However, despite all the goodness of fds, you're right. Event queues
really need to deliver more info than which event and
read/write/hangup. dnotify events should include what happened and
maybe the inode number. futex events should include the address.
(rt-signals get close to this but fail due to pseudo-compatibility
with a braindead POSIX standard).

> * There is a seemingly significant overhead in performing exactly one
> callback per event. Doesn't this prevent any kind of event coalescence?

As you go on to say, this should be a matter for userspace. My
concern is that kernel space should provide a flexible mechanism for a
variety of possible userspaces.

> * The interface should allow the implementation to be highly extensible
> without horrible code contortions within the kernel. It is important to
> be able to add new types of events as they become necessary. The
> interface should be general and simple enough to accommodate these
> extensions. Linux (really, UNIX) has failed to exercise this foresight
> in the past, and that's why we have the current mess of varied
> polling/triggering methods.

Agreed, agreed, agreed, agreed. Fwiw, I now think these can all be
satisfied with some evolution of epoll, if that is permitted.

> * I might be getting a bit utopian here, but IMHO the kernel should
> move toward a completely edge-triggered event system. The old
> level-triggered interfaces should only wrap this paradigm.

Take a close look at how wait queues and notifier chains are used.
Some of the kernel is edge triggered already. Admittedly, it's about
as clear as clay at times - some wait queues are used in an
edge-triggered way, others level-triggered.

> * Might as well reiterate: simplicity. FreeBSD's kevent solves nearly
> all of the traditional problems, but it is gross. And complicated.

Could you explain what you find complicated and/or gross about kevent?
We should learn from their mistakes.

> There were clearly too many computer scientists and not enough
> engineers on that team.

I am both ;)

> * Only one queue per process or kernel thread. Multiple queues per
> flow of execution is just ugly and ultimately pointless.

Disagree. You're right that it's technically not necessary to have
multiple queues, but in practice you can't always force an entire
program to funnel all its I/O through one library - that just doesn't
happen in reality. And there is basically no cost to having multiple
queues. Keyed off fds :)

That was a mistake made by rt-signals: assuming all the different bits
of code that use rt-signals will cooperate. Theoretically solvable in
user space. In reality, they don't know about each other. Although
my code at least searches for a free rt-signal, that's not guaranteed
to work if another library assumes a fixed value.

The same problem occurs with the LDT. Glibc wants to use it and so do I.
Conflict.

> Since the event notification is edge-triggered, I cannot see any
> reason for a significant performance degradation resulting from this
> policy. I am not altogether convinced that this must be a strict
> policy, however; the issue of different userspace threads having
> different event queues inside the same task is interesting.

User space threads are often but not always built on top of a simple
scheduler which converts blocking system calls to queued non-blocking
system calls. If done well, this is a form of virtualisation which
may even be nestable.

You'd expect the event queue mechanism to be included in the set of
blocking system calls which are converted, so multiple userspace
threads would "see" individual queues even though they are multiplexed
by the userspace scheduler.

This works great, until those threads expect mmap() to provide them
with their own separate zero-copy event queues :) So another reason to
need multiple queues from the kernel.

> * No re-arming events. They must be manually killed.

I would provide both options, like dnotify does: one-shot and
continuous. There are occasions when one-shot is better for resource
usage - it depends what the event is monitoring.

> * I'm sure everyone would agree that passing an opaque "user context"
> pointer is necessary with each event.

It is not the end of the world to use an fd number and a table, but it
may improve thread scalability to use a pointer instead.

> * Zero-copy event delivery (of course).

I think this is not critical for performance, but desirable anyway. I
would take this further:

1. zero-copy delivery
2. zero system calls as long as the queue is non-empty
(like the packet socket mmap interface)
3. no fixed limit on the size of the queue at creation time

Neither epoll nor aio satisfies (3). Luckily I have a nice design which
satisfies all these and is extensible in the ways we agree on.

> Some question marks:
> - Should the kernel attempt to prune the queue of "cancelled" events
> (hints later deemed irrelevant, untrue, or obsolete by newer events)?

Something is needed in the case of aio cancellations, but I think
that's different to what you mean. Backtracking hints is quite
difficult to synchronise with userspace if done through mmap and no
system calls. It's best not to bother. Coalescing events, which can
have the effect of superseding events in some cases, is a possibility.
It's tricky but more worthwhile than backtracking.

For some kinds of event, such as round robin futex wakeups, it's
critically important that the _number_ of events seen by userspace is
the same as the number sent from the kernel. In these cases, they are
not just hints, they are synchronisation tokens. That adds some
excitement to coalescing in a shared memory buffer.

-- Jamie

2002-11-01 04:21:24

by Mark Mielke

[permalink] [raw]
Subject: Re: Unifying epoll,aio,futexes etc. (What I really want from epoll)

On Thu, Oct 31, 2002 at 11:02:15PM +0000, Jamie Lokier wrote:
> The semantics for this are a bit confusing and inconsistent with
> poll(). User gets POLL_RDNORM event which means something in the
> directory has changed, not that the directory is now readable or that
> poll() would return POLL_RDNORM. It really should be a different
> flag, made for the purpose.

Don't be encouraging any of us to expect the ability to poll() for changes
to regular files (log file parsers that sit on EOF periodically polling for
further data...). Just get *something* decent out so that we can play with
it in a production environment. I would put off extensions such as this until
the API is well established.

mark

--
[email protected]/[email protected]/[email protected] __________________________
. . _ ._ . . .__ . . ._. .__ . . . .__ | Neighbourhood Coder
|\/| |_| |_| |/ |_ |\/| | |_ | |/ |_ |
| | | | | \ | \ |__ . | | .|. |__ |__ | \ |__ | Ottawa, Ontario, Canada

One ring to rule them all, one ring to find them, one ring to bring them all
and in the darkness bind them...

http://mark.mielke.cc/

2002-11-01 04:53:18

by Jamie Lokier

[permalink] [raw]
Subject: Re: Unifying epoll,aio,futexes etc. (What I really want from epoll)

Mark Mielke wrote:
> On Thu, Oct 31, 2002 at 11:02:15PM +0000, Jamie Lokier wrote:
> > The semantics for this are a bit confusing and inconsistent with
> > poll(). User gets POLL_RDNORM event which means something in the
> > directory has changed, not that the directory is now readable or that
> > poll() would return POLL_RDNORM. It really should be a different
> > flag, made for the purpose.
>
> Don't be encouraging any of us to expect the ability to poll() for changes
> to regular files (log file parsers that sit on EOF periodically polling for
> further data...).

Actually you can already do something similar, if a little coarse
grained, in 2.4 kernels using dnotify on the parent directory.

> Just get *something* decent out so that we can play with it in a
> production environment. I would put off extensions such as this
> until the API is well established.

"something decent" is already out - epoll is quite useful in its
present form. (Take that with a grain of salt - I haven't tried it,
and it only just went into 2.5.45, and I have the impression Davide is
cleaning up the code for 2.5.46 - but it looks basically ok).

-- Jamie

2002-11-01 12:57:08

by Alan

[permalink] [raw]
Subject: Re: Unifying epoll,aio,futexes etc. (What I really want from epoll)

On Thu, 2002-10-31 at 22:00, Rusty Russell wrote:
> try:
> 	for each futex in set {
> 		if (grab in userspace) {
> 			close fds;
> 			return with futex;
> 		}
> 		close old fd for futex if any
> 		call FUTEX_FD to get fd notification of futex;
> 	}
>
> 	select on fds
> 	set = fds which are ready
> 	goto try
>
> You could, of course, loop through the fast path once before making any
> syscalls. Another optimization is to have FUTEX_FD reuse an existing fd
> rather than requiring the close.
>
> Not sure I get the point about livelock though: deadlock is possible if
> apps seek multiple locks at once without care, of course.

Think about 1000 futexes where one task wants to grab them all and other
tasks want any one of them - done wrongly, the busy traffic at that point
will starve the 1000-futex waiter.


2002-11-01 17:19:47

by Dan Kegel

[permalink] [raw]
Subject: Re: and nicer too - Re: [PATCH] epoll more scalable than poll

Davide Libenzi wrote:
>>Do you avoid the cost of epoll_ctl() per new fd?
>
> Jamie, the cost of epoll_ctl(2) is minimal/zero compared with the average
> life of a connection.

Depends on the workload. Where I work, the http client I'm writing
has to perform extremely well even on 1 byte files with HTTP 1.0.
Minimizing system calls is surprisingly important - even
a gettimeofday hurts.

- Dan

2002-11-01 17:20:47

by Davide Libenzi

[permalink] [raw]
Subject: Re: Unifying epoll,aio,futexes etc. (What I really want from epoll)

On Fri, 1 Nov 2002, Jamie Lokier wrote:

> I have no objections to it. Generally, I'd like epoll to be able to
> report _what_ the event was (not just POLL_RDNORM, but what kind of
> dnotify event), but as I don't get to run on an ideal kernel [;)] I'll
> be happy with POLL_RDNORM.

See below ...


> > I still believe that the 1:1 mapping is sufficient and with that in place (
> > and the one line patch to kernel/futex.c ) futex support comes nicely.
>
> It does work - actually, with ->poll() you don't need any lines in futex.c :)

The global poll hook might work, but with your own callback you can
deliver anything. And you have to actually do another poll to understand
which poll flags you really got. Adding the famous one-line patches to
single devices whenever there's the need, with file_notify_send() being
completely extensible, enables you to send a more detailed report back to
the caller.



> You avoid the extra CPU cycles like this:
>
> 1. EP_CTL_ADD adds the listener to the file's wait queue using
> ->poll(), and gets a free test of the object readiness [;)]
>
> 2. When the transition happens, the wakeup will call your function,
> epoll_wakeup_function. That removes the listener from the file's
> wait queue. Note, you won't see any more wakeups from that file.
>
> 3. When you report the event to user space, _then_ you automatically
> add the listener back to the file's wait queue by calling ->poll().
>
> This way, there are no spurious wakeups, and nothing to collapse. I
> would not be surprised if this is quite fast - perhaps as fast as the
> special epoll hooks.

Jamie, I'm afraid it won't. This is the cost of reporting events to the
user with the current epoll :

	ep->eventcnt = 0;
	++ep->ver;
	if (ep->pages == ep->pages0) {
		ep->pages = ep->pages1;
		dvp->ep_resoff = 0;
	} else {
		ep->pages = ep->pages0;
		dvp->ep_resoff = ep->numpages * PAGE_SIZE;
	}

Using the global poll hook you have several problems. First, the poll table
is not suitable for single insert/removal; all you need is a poll_wait()
that, recognizing a special type of table, does the insert in the wait queue
differently. For example you'll have :


typedef struct poll_table_struct {
+	int queue;
+	wait_queue_t *q;
	int error;
	struct poll_table_page * table;
} poll_table;

This together with :

static inline void poll_initwait_ex(poll_table* pt, wait_queue_t *q, int queue)
{
+	pt->queue = queue;
+	pt->q = q;
	pt->error = 0;
	pt->table = NULL;
}

And the :

void __pollwait(struct file * filp, wait_queue_head_t * wait_address, poll_table *p)
{
	struct poll_table_page *table = p->table;

+	if (!p->queue)
+		return;
+	if (p->q) {
+		add_wait_queue(wait_address, p->q);
+		return;
+	}
	if (!table || POLL_TABLE_FULL(table)) {
		struct poll_table_page *new_table;

		new_table = (struct poll_table_page *) __get_free_page(GFP_KERNEL);
		if (!new_table) {
			p->error = -ENOMEM;
			__set_current_state(TASK_RUNNING);
			return;
		}
		new_table->entry = new_table->entries;
		new_table->next = table;
		p->table = new_table;
		table = new_table;
	}

	/* Add a new entry */
	{
		struct poll_table_entry * entry = table->entry;
		table->entry = entry+1;
		get_file(filp);
		entry->filp = filp;
		entry->wait_address = wait_address;
		init_waitqueue_entry(&entry->wait, current);
		add_wait_queue(wait_address,&entry->wait);
	}
}


This enables you to do two things :

1) During the EP_CTL_ADD you do :

	poll_table pt;

	poll_initwait_ex(&pt, &dpi->q, 1);
	file->f_op->poll(file, &pt);

and this adds _your_own_ wait queue object to the file's poll queue.
No full-blown poll_table. You need your own wait queue because when
the callback ( wakeup ) is called you need to call container_of()
to get the dpi*

2) Before reporting events you need to fetch _real_ poll flags for each
file you received the callback from. You do :

	poll_table pt;

	poll_initwait_ex(&pt, NULL, 0);
	flags = file->f_op->poll(file, &pt);

You really don't want to remove and re-add to the file's poll queue _every_
time you receive an event. You're going to pay a lot for that. I'm currently
coding this one to give it a try, to see what kind of performance I get.
With the global poll hook you won't be able to do the more detailed event
reporting that file_notify_event() enables you to do.
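
For reference, the callback side of (1) would look something like this
(a sketch only: the dpi layout and ep_queue_ready() are illustrative,
and the wakeup hook signature is simplified):

#include <linux/wait.h>
#include <linux/kernel.h>

struct dpi {
	struct file *file;	/* the file registered with EP_CTL_ADD */
	wait_queue_t q;		/* our entry in the file's poll queue */
	/* ... ready-list linkage, reported event mask, ... */
};

/* simplified wakeup hook: called when the file's wait queue fires */
static void ep_wakeup_callback(wait_queue_t *wait)
{
	struct dpi *dpi = container_of(wait, struct dpi, q);

	/* just queue the dpi; the real poll flags are fetched later
	 * with poll_initwait_ex(&pt, NULL, 0) as in (2) above */
	ep_queue_ready(dpi);
}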



- Davide


2002-11-01 17:29:34

by Davide Libenzi

[permalink] [raw]
Subject: Re: and nicer too - Re: [PATCH] epoll more scalable than poll

On Fri, 1 Nov 2002, Dan Kegel wrote:

> Davide Libenzi wrote:
> >>Do you avoid the cost of epoll_ctl() per new fd?
> >
> > Jamie, the cost of epoll_ctl(2) is minimal/zero compared with the average
> > life of a connection.
>
> Depends on the workload. Where I work, the http client I'm writing
> has to perform extremely well even on 1 byte files with HTTP 1.0.
> Minimizing system calls is surprisingly important - even
> a gettimeofday hurts.

Dan, is it _one_ gettimeofday() or a gettimeofday() inside a loop ?
gettimeofday() is of the order of a few microseconds ... and if your client
works with anything other than a loopback, a few microseconds shouldn't weigh
in much compared to connect/send/recv/close on a network connection. It is
not so much the fact that you transfer one byte, it's the whole TCP handshake
cost that weighs in.



- Davide


2002-11-01 17:56:21

by Dan Kegel

[permalink] [raw]
Subject: Re: Unifying epoll,aio,futexes etc. (What I really want from epoll)

Davide Libenzi wrote:
>>* I'm sure everyone would agree that passing an opaque "user context"
>>pointer is necessary with each event.
>
> I asked this about a week ago. It's _trivial_ to do in epoll. I did not
> receive any feedback, so I didn't implement it. Feedback will be very much
> appreciated here ...

If it's cheap, do it! It relieves the programmer of having
to manage an fd-to-object lookup table.
- Dan


2002-11-01 18:19:03

by Dan Kegel

[permalink] [raw]
Subject: Re: and nicer too - Re: [PATCH] epoll more scalable than poll

Davide Libenzi wrote:
> On Fri, 1 Nov 2002, Dan Kegel wrote:
>
>>Davide Libenzi wrote:
>>
>>>>Do you avoid the cost of epoll_ctl() per new fd?
>>>
>>>Jamie, the cost of epoll_ctl(2) is minimal/zero compared with the average
>>>life of a connection.
>>
>>Depends on the workload. Where I work, the http client I'm writing
>>has to perform extremely well even on 1 byte files with HTTP 1.0.
>>Minimizing system calls is surprisingly important - even
>>a gettimeofday hurts.
>
> Dan, is it _one_ gettimeofday() or a gettimeofday() inside a loop ?
> gettimeofday() is of the order of a few microseconds ... and if your client
> works with anything other than a loopback, a few microseconds shouldn't weigh
> in much compared to connect/send/recv/close on a network connection. It is
> not so much the fact that you transfer one byte, it's the whole TCP handshake
> cost that weighs in.

The scenario is: we're doing load testing of http products,
and for various reasons, we want line-rate traffic with
the smallest possible message size, i.e. we want the
maximum number of HTTP requests/responses per second.
Hence the 1-byte payloads. A single system call on the
slowish embedded processor I'm using has a surprisingly large
impact on the number of HTTP GETs per second I can do.
A 1% increase in speed is worth it for me!

So please do try to reduce the number of syscalls needed
to handle very short TCP sessions, if possible.

- Dan


2002-11-01 19:10:38

by Jamie Lokier

[permalink] [raw]
Subject: Re: and nicer too - Re: [PATCH] epoll more scalable than poll

Dan Kegel wrote:
> Davide Libenzi wrote:
> >>Do you avoid the cost of epoll_ctl() per new fd?
> >
> >Jamie, the cost of epoll_ctl(2) is minimal/zero compared with the average
> >life of a connection.
>
> Depends on the workload. Where I work, the http client I'm writing
> has to perform extremely well even on 1 byte files with HTTP 1.0.
> Minimizing system calls is surprisingly important - even
> a gettimeofday hurts.

For this sort of thing, I would like to see an option to automatically
set the non-blocking flag on accept(). To really squeeze the system
calls, you could also automatically epoll-register on accept(), and
for super bonus automatically do the accept() at event delivery time.

But it's getting very silly at that point.

-- Jamie

2002-11-01 19:59:58

by Charlie Krasic

[permalink] [raw]
Subject: Re: and nicer too - Re: [PATCH] epoll more scalable than poll


Jamie Lokier <[email protected]> writes:

> For this sort of thing, I would like to see an option to automatically
> set the non-blocking flag on accept(). To really squeeze the system
> calls, you could also automatically epoll-register on accept(), and
> for super bonus automatically do the accept() at event delivery time.

> But it's getting very silly at that point.

> -- Jamie

I would like to see a new kind of nonblocking flag that implies the
use of epoll. So instead of giving O_NONBLOCK to fcntl(F_SETFL), you
give O_NONBLOCK_EPOLL. In addition to becoming non-blocking, the
socket is added to the epoll interest set. Furthermore, if the socket is
a "listener" socket, all connections accepted on the socket inherit
the non-blocking status and are added automatically to the same epoll
interest set. It's true that this can get silly though. I'd like to
do the same with other flags, like TCP_CORK.
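
From user space it would look something like this (O_NONBLOCK_EPOLL is
hypothetical, and how the fd gets tied to a particular interest set is
left open - perhaps one set per process, as assumed here):

int epfd = epoll_create(max_fds);
int lsk = socket(AF_INET, SOCK_STREAM, 0);

/* ... bind() and listen() as usual ... */

fcntl(lsk, F_SETFL, O_NONBLOCK_EPOLL);	/* hypothetical flag */

/* lsk is now non-blocking and in the interest set; every socket
 * returned by accept(lsk, ...) would inherit both properties,
 * saving one fcntl() and one epoll_ctl() per accepted connection. */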

-- Buck


2002-11-01 20:14:10

by Mark Mielke

[permalink] [raw]
Subject: Re: and nicer too - Re: [PATCH] epoll more scalable than poll

On Fri, Nov 01, 2002 at 07:16:43PM +0000, Jamie Lokier wrote:
> > Depends on the workload. Where I work, the http client I'm writing
> > has to perform extremely well even on 1 byte files with HTTP 1.0.
> > Minimizing system calls is surprisingly important - even
> > a gettimeofday hurts.
> For this sort of thing, I would like to see an option to automatically
> set the non-blocking flag on accept(). To really squeeze the system
> calls, you could also automatically epoll-register on accept(), and
> for super bonus automatically do the accept() at event delivery time.
> But it's getting very silly at that point.

Not really... isn't accept() automatically performed ahead of time anyway,
as long as the listen queue isn't full?

Another issue for the 'unified event notification model':

How does epoll interact with signals, specifically the race
condition between determining the timeout that should be passed
to epoll_wait(), and epoll_wait() itself? (see pselect() for info)
For example: it is quite common for priority to be given to an fd
callback before a signal callback, meaning that epoll_wait() would
be called with timeout=0 if a received signal has not had its
callback executed yet, or something greater otherwise.
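
(The usual userspace workaround is the self-pipe trick, sketched here
on the assumption that the pipe's read end is registered like any
other fd in the interest set:)

#include <fcntl.h>
#include <signal.h>
#include <unistd.h>

static int sigpipe[2];

static void on_signal(int sig)
{
	char c = (char)sig;
	write(sigpipe[1], &c, 1);	/* write() is async-signal-safe */
}

static void setup_self_pipe(void)
{
	pipe(sigpipe);
	fcntl(sigpipe[0], F_SETFL, O_NONBLOCK);
	fcntl(sigpipe[1], F_SETFL, O_NONBLOCK);
	signal(SIGUSR1, on_signal);
	/* register sigpipe[0] for reading in the interest set; a
	 * signal arriving at any point now forces the wait to return
	 * with sigpipe[0] readable, so no wakeup is lost */
}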

I would like to see at least one of the following (suggestions made by
other people) in the final version:

1) Userspace data pointer to allow more efficient userspace dispatching
when epoll_wait() returns. (Something about scanning array structures
for matching fd arguments rubs me the wrong way -- it shouldn't be
necessary; a sketch follows this list.)

2) Reduced requirements to issue system calls such as read() when EAGAIN
is the expected return value. The whole 'do a quick poll() or similar
at registration time upon request' issue - for obscure cases that would
require complex code, or code that cannot yet be agreed upon, this
could temporarily mark events ready at registration without checking,
with a goal of eliminating this behaviour one type of file at a time.
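
Something like this is all (1) needs at the interface level (struct
and field names here are made up, not Davide's actual API):

struct ep_event_sketch {
	int fd;			/* ready descriptor */
	unsigned events;	/* POLLIN/POLLOUT/... mask */
	void *udata;		/* opaque pointer given at registration */
};

/* dispatch loop, with evs/nready as returned by the wait call:
 * straight from event to connection object, no fd-indexed table */
for (int i = 0; i < nready; i++) {
	struct connection *conn = evs[i].udata;
	conn->handle(conn, evs[i].events);
}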

Although the ability to wait on futex or timeout objects seems clever,
I'm not sure that we are at a point where we know how they would be
commonly used yet. Right now people need a poll() replacement for file
descriptors. Timeouts can be handled by manipulating the argument
to epoll_wait() and performing userspace analysis (same as poll()).

Futex objects have not (to my knowledge) yet been used in great
numbers at the same time (i.e. waiting for 100 futexes to be obtained),
probably because the routines necessary to perform this operation
do not yet exist. It might be nice to fit this into epoll later, but
it doesn't need to happen yet.

mark

--
[email protected]/[email protected]/[email protected] __________________________
. . _ ._ . . .__ . . ._. .__ . . . .__ | Neighbourhood Coder
|\/| |_| |_| |/ |_ |\/| | |_ | |/ |_ |
| | | | | \ | \ |__ . | | .|. |__ |__ | \ |__ | Ottawa, Ontario, Canada

One ring to rule them all, one ring to find them, one ring to bring them all
and in the darkness bind them...

http://mark.mielke.cc/

2002-11-01 20:08:01

by Jamie Lokier

[permalink] [raw]
Subject: Re: and nicer too - Re: [PATCH] epoll more scalable than poll

Charlie Krasic wrote:
> I would like to see a new kind of nonblocking flag that implies the
> use of epoll. So instead of giving O_NONBLOCK to fcntl(F_SETFL), you
> give O_NONBLOCK_EPOLL. In addition to becoming non-blocking, the
> socket is added to the epoll interest set. Furthermore, if the socket is
> the non-blocking status and are added automatically to the same epoll
> interest set. It's true that this can get silly though. I'd like to
> do the same with other flags, like TCP_CORK.

... and close-on-exec.

-- Jamie

2002-11-01 20:39:42

by Jamie Lokier

[permalink] [raw]
Subject: Re: Unifying epoll,aio,futexes etc. (What I really want from epoll)

Davide Libenzi wrote:
> > In other words, add another op to sys_futex() called FUTEX_EPOLL which
> > directly registers the futex on an epoll interest list, and let epoll
> > report those events as futex events.
>
> Jamie, the futex support can be easily done with one line of code patch. I
> still prefer the one-to-one mapping between futexes and files.

I forgot something important: futex notifications must be _exactly
counted_ for some uses of futexes. It's all very subtle, but there's
an example in Rusty's futex library where a token is passed to one of
the waiters, and waiters are queued up behind each other in the order
they started waiting. (See futex_up_fair() in usersem.h). You need
this to prevent starvation, with Alan's example of waiting for
multiple futexes being a particularly nasty case.

Because of this, and the way your one-liner works, I think* that a
multi-threaded program will need to allocate one fd per waiter to
guarantee the counter - not one fd per waited-upon futex. So when
1000 threads are waiting on some global mutex (as happens), they'll
need an fd each - they can't share one.

* - If I'm wrong about this, please someone correct me.

Consequently, fds will need to be allocated when a thread wants to
wait, instead of lazily once per contended futex - hence a higher rate
of allocations and deallocations.

The fixes for this are twofold:

1. You must change file_send_notify() so that it takes a count which
limits the number of notifications (like FUTEX_WAKE), and returns
the number of notifications sent (sketched just after this list).

2. The futex's queue of waiters must contain the epoll waiters _and_
waitqueue waiters, in the order that they started waiting. It's
not enough to wake the epoll waiters first, and if any
notifications are left, wake the others, nor vice versa.
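
Something like this for fix 1 (the prototype and the struct waiter,
next_waiter() and deliver_event() helpers are purely illustrative,
not the patch under discussion):

int file_send_notify(struct file *filp, unsigned long event, int nr)
{
	struct waiter *w;
	int sent = 0;

	/* per fix 2, one queue holds epoll listeners and waitqueue
	 * sleepers together, in the order they started waiting */
	while (sent < nr && (w = next_waiter(filp)) != NULL) {
		deliver_event(w, event);
		sent++;
	}
	return sent;	/* so the caller knows exactly who got the token */
}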

Futex epolls are a bit fiddly.

-- Jamie

2002-11-01 23:09:52

by John Myers

[permalink] [raw]
Subject: Re: Unifying epoll,aio,futexes etc. (What I really want from epoll)

Matthew D. Hall wrote:

> * There is a seemingly significant overhead in performing exactly one
> callback per event.

The "exactly one callback per event" semantics of aio are important for
cancellation in thread pool environments. When you're shutting down a
connection, you need to be able to get to a point where you know no
other thread is processing or will process an event for the connection,
so it is safe to free the connection state.

> * Only one queue per process or kernel thread.

Having a single thread process multiple queues is not particularly
interesting (unless you have user-space threads or coroutines). Having
different threads in the same process handle different queues is
interesting--it permits a library to set up its own queue, using its
own threads to process it.

> * No re-arming events. They must be manually killed.

Rearming events is a useful way to get the correct cancellation
semantics in thread pool environments.

> - Should the kernel attempt to prune the queue of "cancelled" events
> (hints later deemed irrelevant, untrue, or obsolete by newer events)?

This makes the cancellation semantics much easier to deal with in single
threaded event loops. Single threaded cancellation is difficult in the
current aio interface because in the case where the canceled operation
already has an undelivered event in the queue, the canceling code has to
defer freeing the context until it receives that event.

An additional point: In a thread pool environment, you want event wakeup
to be in LIFO order and use wake-one semantics. You also want
concurrency control: don't deliver an event to a waiting thread unless
the pool has fewer threads in runnable state than CPUs.


2002-11-01 23:21:19

by John Myers

[permalink] [raw]
Subject: Re: Unifying epoll,aio,futexes etc. (What I really want from epoll)

Jamie Lokier wrote:

>You avoid the extra CPU cycles like this:
>
> 1. EP_CTL_ADD adds the listener to the file's wait queue using
> ->poll(), and gets a free test of the object readiness [;)]
>
> 2. When the transition happens, the wakeup will call your function,
> epoll_wakeup_function. That removes the listener from the file's
> wait queue. Note, you won't see any more wakeups from that file.
>
> 3. When you report the event to user space, _then_ you automatically
> add the listener back to the file's wait queue by calling ->poll().

The cost of removing and re-adding the listener to the file's wait queue
is part of what epoll is amortizing.

There's also the oddity that I noticed this week: pipes don't report
POLLOUT readiness through the classic poll interface until the pipe's
buffer is completely empty. Changing this to report POLLOUT readiness
when the pipe's buffer is not full apparently causes NIS to break.


2002-11-02 04:47:52

by Mark Mielke

[permalink] [raw]
Subject: Re: Unifying epoll,aio,futexes etc. (What I really want from epoll)

On Fri, Nov 01, 2002 at 03:27:41PM -0800, John Gardiner Myers wrote:
> There's also the oddity that I noticed this week: pipes don't report
> POLLOUT readiness through the classic poll interface until the pipe's
> buffer is completely empty. Changing this to report POLLOUT readiness
> when the pipe's buffer is not full apparently causes NIS to break.

This seems deficient. Does this mean that pipes managed via poll() are
not able to achieve maximum throughput?

mark

--
[email protected]/[email protected]/[email protected] __________________________
. . _ ._ . . .__ . . ._. .__ . . . .__ | Neighbourhood Coder
|\/| |_| |_| |/ |_ |\/| | |_ | |/ |_ |
| | | | | \ | \ |__ . | | .|. |__ |__ | \ |__ | Ottawa, Ontario, Canada

One ring to rule them all, one ring to find them, one ring to bring them all
and in the darkness bind them...

http://mark.mielke.cc/

2002-11-02 15:35:21

by Jamie Lokier

[permalink] [raw]
Subject: Re: Unifying epoll,aio,futexes etc. (What I really want from epoll)

John Gardiner Myers wrote:
> The cost of removing and readding the listener to the file's wait queue
> is part of what epoll is amortizing.

Not really. The main point of epoll is to ensure O(1) processing time
per event - one list add and removal doesn't affect that. It has a
constant time overhead, which I expect is rather small - but Davide
says he's measuring that so we'll see.

> There's also the oddity that I noticed this week: pipes don't report
> POLLOUT readiness through the classic poll interface until the pipe's
> buffer is completely empty. Changing this to report POLLOUT readiness
> when the pipe's buffer is not full apparently causes NIS to break.

There's a section in the Glibc manual which talks about pipe
atomicity. A pipe must guarantee that a write of PIPE_BUF bytes or
less either blocks or is accepted whole. So you can't report POLLOUT
just because there is room in the pipe - there must be PIPE_BUF room.

Furthermore, the manual says that after writing PIPE_BUF bytes,
further writes will block until some bytes are read. This latter does
not seem a useful requirement to me - I think that a pipe could be
larger than the PIPE_BUF atomicity value, but perhaps it is defined in
POSIX or SUS to be like this. (Someone care to check?)

Together these would seem to imply the behaviour noted by John.
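
In code terms the pipe's poll routine is constrained to something like
this (pipe_free_space() is a made-up helper, not the actual fs/pipe.c
logic):

	if (pipe_free_space(inode) >= PIPE_BUF)
		mask |= POLLOUT | POLLWRNORM;

	/* with the classic one-page pipe buffer, PIPE_BUF equals the
	 * whole buffer, so "PIPE_BUF bytes free" is the same condition
	 * as "buffer completely empty" -- the behaviour John observed */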

-- Jamie

2002-11-05 18:09:24

by John Myers

[permalink] [raw]
Subject: pipe POLLOUT oddity



Mark Mielke wrote:

>On Fri, Nov 01, 2002 at 03:27:41PM -0800, John Gardiner Myers wrote:
>
>>There's also the oddity that I noticed this week: pipes don't report
>>POLLOUT readiness through the classic poll interface until the pipe's
>>buffer is completely empty. Changing this to report POLLOUT readiness
>>when the pipe's buffer is not full apparently causes NIS to break.
>
>This seems deficient. Does this mean that pipes managed via poll() are
>not able to achieve maximum throughput?
I could see this going either way, depending on the application.
Holding off the POLLOUT readiness could improve performance by making
sure that whenever a process is scheduled to write to a pipe, the pipe
has enough buffer space to take all of the data.



2002-11-05 18:13:24

by Benjamin LaHaise

[permalink] [raw]
Subject: Re: pipe POLLOUT oddity

On Tue, Nov 05, 2002 at 10:15:33AM -0800, John Gardiner Myers wrote:
> I could see this going either way, depending on the application.
> Holding off the POLLOUT readiness could improve performance by making
> sure that whenever a process is scheduled to write to a pipe the pipe
> has enough buffer to take all of the data.

Aio write to pipes has a distinct advantage here as the pipe code can
provide the write atomicity guarantees while preserving the non-blocking
aspect of the io submission interface.

-ben
--
"Do you seek knowledge in time travel?"