this is the v3 release of the syslet/threadlet subsystem:
http://redhat.com/~mingo/syslet-patches/
This release came a few days later than i originally wanted, because
i've implemented many fundamental changes to the code. The biggest
highlights of v3 are:
- "Threadlets": the introduction of the 'threadlet' execution concept.
- syslets: multiple rings support with no kernel-side footprint, the
elimination of mlock() pinning, no more need for
async_register/unregister() calls, and more.
"Threadlets" are basically the user-space equivalent of syslets: small
user-space functions that the kernel attempts to execute without
scheduling. If the threadlet blocks, the kernel creates a real thread
from it, and execution continues in that thread. The 'head' context (the
context that never blocks) returns to the original function that called
the threadlet. Threadlets are very easy to use:
long my_threadlet_fn(void *data)
{
	char *name = data;
	int fd;

	fd = open(name, O_RDONLY);
	if (fd < 0)
		goto out;

	fstat(fd, &stat);
	read(fd, buf, count);
	...

out:
	return threadlet_complete();
}

main()
{
	done = threadlet_exec(my_threadlet_fn, new_stack, &user_head);
	if (!done)
		reqs_queued++;
}
There is no limitation whatsoever on what a threadlet function can
look like: it can use arbitrary system-calls and all execution will be
procedural. There is no 'registration' needed when running threadlets
either: the kernel will take care of all the details, user-space just
runs a threadlet without any preparation and that's it.
Completion of async threadlets can be done from user-space via any of
the existing APIs: in threadlet-test.c (see the async-test-v3.tar.gz
user-space examples at the URL above) i've for example used a futex
between the head and the async threads to do threadlet notification. But
select(), poll() or signals can be used too - whichever is most
convenient to the application writer.
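As a rough illustration of the poll() variant mentioned above, here is a
minimal sketch of a completion channel built on a pipe. It is not taken from
threadlet-test.c (which uses a futex); the names completion_pipe,
completion_post() and completion_reap() are hypothetical, and the only
assumption is that the async side runs completion_post() just before it
finishes:

```c
#include <assert.h>
#include <poll.h>
#include <unistd.h>

/* Hypothetical completion channel: async contexts write, the head polls. */
struct completion_pipe {
	int rfd;	/* read end, polled by the head context */
	int wfd;	/* write end, used by async contexts */
};

static int completion_pipe_init(struct completion_pipe *cp)
{
	int fds[2];

	assert(cp != NULL);
	if (pipe(fds) < 0)
		return -1;
	cp->rfd = fds[0];
	cp->wfd = fds[1];
	return 0;
}

/* Async side: post one completion token. Returns 0 on success. */
static int completion_post(struct completion_pipe *cp)
{
	char token = 1;

	return write(cp->wfd, &token, 1) == 1 ? 0 : -1;
}

/*
 * Head side: wait up to timeout_ms for completions. Returns the number
 * of tokens reaped, 0 on timeout, or -1 on error.
 */
static int completion_reap(struct completion_pipe *cp, int timeout_ms)
{
	struct pollfd pfd = { .fd = cp->rfd, .events = POLLIN };
	char tokens[64];
	int ready;

	ready = poll(&pfd, 1, timeout_ms);
	if (ready <= 0)
		return ready;

	/* One byte per completed request; several may be batched. */
	return (int)read(cp->rfd, tokens, sizeof(tokens));
}
```

The head would decrement its count of queued requests by whatever
completion_reap() returns. A pipe costs a syscall per completion, so a futex
is likely the cheaper choice when the head does not need to multiplex with
other file descriptors.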
Threadlets can also be thought of as 'optional threads': they execute in
the original context as long as they do not block, but once they block,
they are moved off into their separate thread context - and the original
context can continue execution.
Threadlets can also be thought of as 'on-demand parallelism': user-space
does not have to worry about setting up, sizing and feeding a thread
pool - the kernel will execute the workload in a single-threaded manner
as long as it makes sense, but once the context blocks, a parallel
context is created. So parallelism inside applications is utilized in a
natural way. (The best place to do this is in the kernel - user-space
has no idea about what level of parallelism is best for any given
moment.)
I believe this threadlet concept is what user-space will want to use for
programmable parallelism.
[ Note that right now there's a pair of system-calls: sys_threadlet_on()
and sys_threadlet_off() that demarcate the beginning and the end of a
threadlet function, which enter the kernel even in the 'cached' case -
but my plan is to do these two system calls via a vsyscall, without
having to enter the kernel at all. That will reduce cached threadlet
execution NULL-overhead to around 10 nsecs - making it essentially
zero. ]
Threadlets share much of the scheduling infrastructure with syslets.
Syslets (small, kernel-side, scripted "syscall plugins") are still
supported - they are (much...) harder to program than threadlets but
they allow the highest performance. Core infrastructure libraries like
glibc/libaio are expected to use syslets. Jens Axboe's FIO tool already
includes support for v2 syslets, and the following patch updates FIO to
the v3 API:
http://redhat.com/~mingo/syslet-patches/fio-syslet-v3.patch
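To give a feel for why syslets are harder to program: control flow is
expressed through per-atom stop flags rather than C statements. The sketch
below is a user-space model only. The flag values are copied from
include/linux/syslet.h in this series and atom_continues() mirrors the
kernel's run_next_atom(), but struct model_atom and run_chain() are
hypothetical, syscall results are precomputed instead of executed, and
SYSLET_SKIP_TO_NEXT_ON_STOP is not modelled:

```c
#include <assert.h>
#include <stddef.h>

/* Flag values as defined in include/linux/syslet.h of this series. */
#define SYSLET_STOP_ON_NONZERO		0x00000008
#define SYSLET_STOP_ON_ZERO		0x00000010
#define SYSLET_STOP_ON_NEGATIVE		0x00000020
#define SYSLET_STOP_ON_NON_POSITIVE	0x00000040

#define SYSLET_STOP_MASK			\
	(SYSLET_STOP_ON_NONZERO |		\
	 SYSLET_STOP_ON_ZERO |			\
	 SYSLET_STOP_ON_NEGATIVE |		\
	 SYSLET_STOP_ON_NON_POSITIVE)

/* Model of run_next_atom(): 1 means continue, 0 means stop. */
static int atom_continues(unsigned long flags, long ret)
{
	switch (flags & SYSLET_STOP_MASK) {
	case SYSLET_STOP_ON_NONZERO:
		return !ret;
	case SYSLET_STOP_ON_ZERO:
		return ret != 0;
	case SYSLET_STOP_ON_NEGATIVE:
		return ret >= 0;
	case SYSLET_STOP_ON_NON_POSITIVE:
		return ret > 0;
	}
	return 1;
}

/* Hypothetical stand-in for an atom: the "syscall" result is given. */
struct model_atom {
	unsigned long flags;
	long ret;			/* pretend syscall return value */
	struct model_atom *next;
};

/*
 * Follow ->next until it is NULL or a stop condition triggers.
 * Returns the atom at which execution ended.
 */
static struct model_atom *run_chain(struct model_atom *atom)
{
	assert(atom != NULL);
	while (atom->next && atom_continues(atom->flags, atom->ret))
		atom = atom->next;
	return atom;
}
```

An open/read/close sequence would then be three atoms, with
SYSLET_STOP_ON_NEGATIVE on the open atom so that a failed open ends the
syslet right there instead of running read and close on a bad descriptor.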
Furthermore, the syslet code and API have been significantly enhanced as
well:
- support for multiple completion rings has been added
- there is no more mlock()ing of the completion ring(s)
- sys_async_register()/unregister() has been removed as it is not
needed anymore. sys_async_exec() can be called straight away.
- there is no kernel-side resource used up by async completion rings at
all (all the state is in user-space), so an arbitrary number of
completion rings are supported.
plus lots of bugs were fixed and a good number of cleanups were done as
well. The v3 code is ABI-incompatible with v2, due to these fundamental
changes.
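Since all completion-ring state now lives in user-space, reaping completions
is plain memory access. The following is a single-threaded model of that
idea, not the kernel's actual protocol: the field names match struct
async_head_user from this series, but struct ring_model and both functions
are hypothetical, and the "NULL slot means empty, consumer clears the slot"
convention is an assumption. A real consumer would also need to think about
memory ordering against the kernel side:

```c
#include <assert.h>
#include <stddef.h>

struct syslet_uatom;	/* opaque here; see include/linux/syslet.h */

/* Ring-related subset of struct async_head_user (model only). */
struct ring_model {
	unsigned long kernel_ring_idx;
	unsigned long user_ring_idx;
	struct syslet_uatom **completion_ring;
	unsigned long ring_size_bytes;
};

static unsigned long ring_entries(const struct ring_model *r)
{
	return r->ring_size_bytes / sizeof(r->completion_ring[0]);
}

/*
 * Producer, standing in for the kernel: queue a completed atom.
 * Returns 0 on success, -1 if the next slot is still occupied.
 */
static int ring_model_complete(struct ring_model *r, struct syslet_uatom *done)
{
	assert(done != NULL);
	if (r->completion_ring[r->kernel_ring_idx])
		return -1;
	r->completion_ring[r->kernel_ring_idx] = done;
	r->kernel_ring_idx = (r->kernel_ring_idx + 1) % ring_entries(r);
	return 0;
}

/* Consumer: return the next completed atom, or NULL if none is pending. */
static struct syslet_uatom *ring_model_reap(struct ring_model *r)
{
	struct syslet_uatom *done = r->completion_ring[r->user_ring_idx];

	if (!done)
		return NULL;
	/* Free the slot so the producer can reuse it. */
	r->completion_ring[r->user_ring_idx] = NULL;
	r->user_ring_idx = (r->user_ring_idx + 1) % ring_entries(r);
	return done;
}
```

Because nothing here is registered with the kernel, an application could
keep one such ring per subsystem or per thread simply by allocating more of
them.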
As always, comments, suggestions, reports are welcome.
Ingo
From: Ingo Molnar <[email protected]>
add include/linux/async.h which contains the kernel-side API
declarations.
it also provides NOP stubs for the !CONFIG_ASYNC_SUPPORT case.
Signed-off-by: Ingo Molnar <[email protected]>
Signed-off-by: Arjan van de Ven <[email protected]>
---
include/linux/async.h | 92 ++++++++++++++++++++++++++++++++++++++++++++++++++
1 file changed, 92 insertions(+)
Index: linux/include/linux/async.h
===================================================================
--- /dev/null
+++ linux/include/linux/async.h
@@ -0,0 +1,92 @@
+#ifndef _LINUX_ASYNC_H
+#define _LINUX_ASYNC_H
+
+#include <linux/completion.h>
+#include <linux/compiler.h>
+
+/*
+ * The syslet subsystem - asynchronous syscall execution support.
+ *
+ * Generic kernel API definitions:
+ */
+
+struct syslet_uatom;
+struct async_thread;
+struct async_head;
+
+/*
+ * The syslet subsystem - asynchronous syscall execution support.
+ *
+ * Syslet-subsystem internal definitions:
+ */
+
+/*
+ * The kernel-side copy of a syslet atom - with arguments expanded:
+ */
+struct syslet_atom {
+ unsigned long flags;
+ unsigned long nr;
+ long __user *ret_ptr;
+ struct syslet_uatom __user *next;
+ unsigned long args[6];
+};
+
+/*
+ * The 'async head' is the thread which has user-space context (ptregs)
+ * 'below it' - this is the one that can return to user-space:
+ */
+struct async_head {
+ spinlock_t lock;
+ struct task_struct *user_task;
+
+ struct list_head ready_async_threads;
+ struct list_head busy_async_threads;
+
+ struct mutex completion_lock;
+ long events_left;
+ wait_queue_head_t wait;
+
+ struct async_head_user __user *ahu;
+
+ unsigned long __user *new_stackp;
+ unsigned long new_eip;
+ unsigned long restore_stack;
+ unsigned long restore_eip;
+ struct completion start_done;
+ struct completion exit_done;
+};
+
+/*
+ * The 'async thread' is either a newly created async thread or it is
+ * an 'ex-head' - it cannot return to user-space and only has kernel
+ * context.
+ */
+struct async_thread {
+ struct task_struct *task;
+ struct syslet_uatom __user *work;
+ unsigned long user_stack;
+ unsigned long user_eip;
+ struct async_head *ah;
+
+ struct list_head entry;
+
+ unsigned int exit;
+};
+
+#ifdef CONFIG_ASYNC_SUPPORT
+extern void async_init(struct task_struct *t);
+extern void async_exit(struct task_struct *t);
+extern void __async_schedule(struct task_struct *t);
+#else /* !CONFIG_ASYNC_SUPPORT */
+static inline void async_init(struct task_struct *t)
+{
+}
+static inline void async_exit(struct task_struct *t)
+{
+}
+static inline void __async_schedule(struct task_struct *t)
+{
+}
+#endif /* !CONFIG_ASYNC_SUPPORT */
+
+#endif
From: Ingo Molnar <[email protected]>
add include/linux/syslet.h which contains the user-space API/ABI
declarations. Add the new header to include/linux/Kbuild as well.
Signed-off-by: Ingo Molnar <[email protected]>
Signed-off-by: Arjan van de Ven <[email protected]>
---
include/linux/Kbuild | 1
include/linux/syslet.h | 155 +++++++++++++++++++++++++++++++++++++++++++++++++
2 files changed, 156 insertions(+)
Index: linux/include/linux/Kbuild
===================================================================
--- linux.orig/include/linux/Kbuild
+++ linux/include/linux/Kbuild
@@ -140,6 +140,7 @@ header-y += sockios.h
header-y += som.h
header-y += sound.h
header-y += synclink.h
+header-y += syslet.h
header-y += telephony.h
header-y += termios.h
header-y += ticable.h
Index: linux/include/linux/syslet.h
===================================================================
--- /dev/null
+++ linux/include/linux/syslet.h
@@ -0,0 +1,155 @@
+#ifndef _LINUX_SYSLET_H
+#define _LINUX_SYSLET_H
+/*
+ * The syslet subsystem - asynchronous syscall execution support.
+ *
+ * Started by Ingo Molnar:
+ *
+ * Copyright (C) 2007 Red Hat, Inc., Ingo Molnar <[email protected]>
+ *
+ * User-space API/ABI definitions:
+ */
+
+#ifndef __user
+# define __user
+#endif
+
+/*
+ * This is the 'Syslet Atom' - the basic unit of execution
+ * within the syslet framework. A syslet always represents
+ * a single system-call plus its arguments, plus has conditions
+ * attached to it that allows the construction of larger
+ * programs from these atoms. User-space variables can be used
+ * (for example a loop index) via the special sys_umem*() syscalls.
+ *
+ * Arguments are implemented via pointers to arguments. This not
+ * only increases the flexibility of syslet atoms (multiple syslets
+ * can share the same variable for example), but is also an
+ * optimization: copy_uatom() will only fetch syscall parameters
+ * up until the point it meets the first NULL pointer. 50% of all
+ * syscalls have 2 or less parameters (and 90% of all syscalls have
+ * 4 or less parameters).
+ *
+ * [ Note: since the argument array is at the end of the atom, and the
+ * kernel will not touch any argument beyond the final NULL one, atoms
+ * might be packed more tightly. (the only special case exception to
+ * this rule would be SKIP_TO_NEXT_ON_STOP atoms, where the kernel will
+ * jump a full syslet_uatom number of bytes.) ]
+ */
+struct syslet_uatom {
+ unsigned long flags;
+ unsigned long nr;
+ long __user *ret_ptr;
+ struct syslet_uatom __user *next;
+ unsigned long __user *arg_ptr[6];
+ /*
+ * User-space can put anything in here, kernel will not
+ * touch it:
+ */
+ void __user *private;
+};
+
+/*
+ * Flags to modify/control syslet atom behavior:
+ */
+
+/*
+ * Immediately queue this syslet asynchronously - do not even
+ * attempt to execute it synchronously in the user context:
+ */
+#define SYSLET_ASYNC 0x00000001
+
+/*
+ * Never queue this syslet asynchronously - even if synchronous
+ * execution causes a context-switching:
+ */
+#define SYSLET_SYNC 0x00000002
+
+/*
+ * Do not queue the syslet in the completion ring when done.
+ *
+ * ( the default is that the final atom of a syslet is queued
+ * in the completion ring. )
+ *
+ * Some syscalls generate implicit completion events of their
+ * own.
+ */
+#define SYSLET_NO_COMPLETE 0x00000004
+
+/*
+ * Execution control: conditions upon the return code
+ * of the just executed syslet atom. 'Stop' means syslet
+ * execution is stopped and the atom is put into the
+ * completion ring:
+ */
+#define SYSLET_STOP_ON_NONZERO 0x00000008
+#define SYSLET_STOP_ON_ZERO 0x00000010
+#define SYSLET_STOP_ON_NEGATIVE 0x00000020
+#define SYSLET_STOP_ON_NON_POSITIVE 0x00000040
+
+#define SYSLET_STOP_MASK \
+ ( SYSLET_STOP_ON_NONZERO | \
+ SYSLET_STOP_ON_ZERO | \
+ SYSLET_STOP_ON_NEGATIVE | \
+ SYSLET_STOP_ON_NON_POSITIVE )
+
+/*
+ * Special modifier to 'stop' handling: instead of stopping the
+ * execution of the syslet, the linearly next syslet is executed.
+ * (Normal execution flows along atom->next, and execution stops
+ * if atom->next is NULL or a stop condition becomes true.)
+ *
+ * This is what allows true branches of execution within syslets.
+ */
+#define SYSLET_SKIP_TO_NEXT_ON_STOP 0x00000080
+
+/*
+ * This is the (per-user-context) descriptor of the async completion
+ * ring. This gets passed in to sys_async_exec():
+ */
+struct async_head_user {
+ /*
+ * Current completion ring index - managed by the kernel:
+ */
+ unsigned long kernel_ring_idx;
+ /*
+ * User-side ring index:
+ */
+ unsigned long user_ring_idx;
+
+ /*
+ * Ring of pointers to completed async syslets (i.e. syslets that
+ * generated a cachemiss and went async, returning -EASYNCSYSLET
+ * to the user context by sys_async_exec()) are queued here.
+ * Syslets that were executed synchronously (cached) are not
+ * queued here.
+ *
+ * Note: the final atom that generated the exit condition is
+ * queued here. Normally this would be the last atom of a syslet.
+ */
+ struct syslet_uatom __user **completion_ring;
+
+ /*
+ * Ring size in bytes:
+ */
+ unsigned long ring_size_bytes;
+
+ /*
+ * The head task can become a cachemiss thread later on
+ * too, if it blocks - so it needs its separate thread
+ * stack and start address too:
+ */
+ unsigned long head_stack;
+ unsigned long head_eip;
+
+ /*
+ * Newly started async kernel threads will take their
+ * user stack and user start address from here. User-space
+ * code has to check for new_thread_stack going to NULL
+ * and has to refill it with a new stack if that happens.
+ */
+ unsigned long new_thread_stack;
+ unsigned long new_thread_eip;
+};
+
+#endif
From: Ingo Molnar <[email protected]>
add the kernel generic bits - these are present even if !CONFIG_ASYNC_SUPPORT.
Signed-off-by: Ingo Molnar <[email protected]>
Signed-off-by: Arjan van de Ven <[email protected]>
---
fs/exec.c | 4 ++++
include/linux/sched.h | 23 ++++++++++++++++++++++-
kernel/capability.c | 3 +++
kernel/exit.c | 7 +++++++
kernel/fork.c | 5 +++++
kernel/sched.c | 9 +++++++++
kernel/sys.c | 36 ++++++++++++++++++++++++++++++++++++
7 files changed, 86 insertions(+), 1 deletion(-)
Index: linux/fs/exec.c
===================================================================
--- linux.orig/fs/exec.c
+++ linux/fs/exec.c
@@ -1446,6 +1446,10 @@ static int coredump_wait(int exit_code)
tsk->vfork_done = NULL;
complete(vfork_done);
}
+ /*
+ * Make sure we exit our async context before waiting:
+ */
+ async_exit(tsk);
if (core_waiters)
wait_for_completion(&startup_done);
Index: linux/include/linux/sched.h
===================================================================
--- linux.orig/include/linux/sched.h
+++ linux/include/linux/sched.h
@@ -83,12 +83,12 @@ struct sched_param {
#include <linux/timer.h>
#include <linux/hrtimer.h>
#include <linux/task_io_accounting.h>
+#include <linux/async.h>
#include <asm/processor.h>
struct exec_domain;
struct futex_pi_state;
-
/*
* List of flags we want to share for kernel threads,
* if only because they are not used by them anyway.
@@ -997,6 +997,12 @@ struct task_struct {
/* journalling filesystem info */
void *journal_info;
+/* async syscall support: */
+ struct async_thread *at, *async_ready;
+ struct async_head *ah;
+ struct async_thread __at;
+ struct async_head __ah;
+
/* VM state */
struct reclaim_state *reclaim_state;
@@ -1053,6 +1059,21 @@ struct task_struct {
#endif
};
+/*
+ * Is an async syscall being executed currently?
+ */
+#ifdef CONFIG_ASYNC_SUPPORT
+static inline int async_syscall(struct task_struct *t)
+{
+ return t->async_ready != NULL;
+}
+#else /* !CONFIG_ASYNC_SUPPORT */
+static inline int async_syscall(struct task_struct *t)
+{
+ return 0;
+}
+#endif /* !CONFIG_ASYNC_SUPPORT */
+
static inline pid_t process_group(struct task_struct *tsk)
{
return tsk->signal->pgrp;
Index: linux/kernel/capability.c
===================================================================
--- linux.orig/kernel/capability.c
+++ linux/kernel/capability.c
@@ -176,6 +176,9 @@ asmlinkage long sys_capset(cap_user_head
int ret;
pid_t pid;
+ if (async_syscall(current))
+ return -ENOSYS;
+
if (get_user(version, &header->version))
return -EFAULT;
Index: linux/kernel/exit.c
===================================================================
--- linux.orig/kernel/exit.c
+++ linux/kernel/exit.c
@@ -26,6 +26,7 @@
#include <linux/ptrace.h>
#include <linux/profile.h>
#include <linux/mount.h>
+#include <linux/async.h>
#include <linux/proc_fs.h>
#include <linux/mempolicy.h>
#include <linux/taskstats_kern.h>
@@ -889,6 +890,12 @@ fastcall NORET_TYPE void do_exit(long co
schedule();
}
+ /*
+ * Note: async threads have to exit their context before the MM
+ * exit (due to the coredumping wait):
+ */
+ async_exit(tsk);
+
tsk->flags |= PF_EXITING;
if (unlikely(in_atomic()))
Index: linux/kernel/fork.c
===================================================================
--- linux.orig/kernel/fork.c
+++ linux/kernel/fork.c
@@ -22,6 +22,7 @@
#include <linux/personality.h>
#include <linux/mempolicy.h>
#include <linux/sem.h>
+#include <linux/async.h>
#include <linux/file.h>
#include <linux/key.h>
#include <linux/binfmts.h>
@@ -1054,6 +1055,7 @@ static struct task_struct *copy_process(
p->lock_depth = -1; /* -1 = no lock */
do_posix_clock_monotonic_gettime(&p->start_time);
+ async_init(p);
p->security = NULL;
p->io_context = NULL;
p->io_wait = NULL;
@@ -1621,6 +1623,9 @@ asmlinkage long sys_unshare(unsigned lon
struct uts_namespace *uts, *new_uts = NULL;
struct ipc_namespace *ipc, *new_ipc = NULL;
+ if (async_syscall(current))
+ return -ENOSYS;
+
check_unshare_flags(&unshare_flags);
/* Return -EINVAL for all unsupported flags */
Index: linux/kernel/sched.c
===================================================================
--- linux.orig/kernel/sched.c
+++ linux/kernel/sched.c
@@ -38,6 +38,7 @@
#include <linux/vmalloc.h>
#include <linux/blkdev.h>
#include <linux/delay.h>
+#include <linux/async.h>
#include <linux/smp.h>
#include <linux/threads.h>
#include <linux/timer.h>
@@ -3436,6 +3437,14 @@ asmlinkage void __sched schedule(void)
}
profile_hit(SCHED_PROFILING, __builtin_return_address(0));
+ prev = current;
+ if (unlikely(prev->async_ready)) {
+ if (prev->state && !(preempt_count() & PREEMPT_ACTIVE) &&
+ (!(prev->state & TASK_INTERRUPTIBLE) ||
+ !signal_pending(prev)))
+ __async_schedule(prev);
+ }
+
need_resched:
preempt_disable();
prev = current;
Index: linux/kernel/sys.c
===================================================================
--- linux.orig/kernel/sys.c
+++ linux/kernel/sys.c
@@ -933,6 +933,9 @@ asmlinkage long sys_setregid(gid_t rgid,
int new_egid = old_egid;
int retval;
+ if (async_syscall(current))
+ return -ENOSYS;
+
retval = security_task_setgid(rgid, egid, (gid_t)-1, LSM_SETID_RE);
if (retval)
return retval;
@@ -979,6 +982,9 @@ asmlinkage long sys_setgid(gid_t gid)
int old_egid = current->egid;
int retval;
+ if (async_syscall(current))
+ return -ENOSYS;
+
retval = security_task_setgid(gid, (gid_t)-1, (gid_t)-1, LSM_SETID_ID);
if (retval)
return retval;
@@ -1049,6 +1055,9 @@ asmlinkage long sys_setreuid(uid_t ruid,
int old_ruid, old_euid, old_suid, new_ruid, new_euid;
int retval;
+ if (async_syscall(current))
+ return -ENOSYS;
+
retval = security_task_setuid(ruid, euid, (uid_t)-1, LSM_SETID_RE);
if (retval)
return retval;
@@ -1112,6 +1121,9 @@ asmlinkage long sys_setuid(uid_t uid)
int old_ruid, old_suid, new_suid;
int retval;
+ if (async_syscall(current))
+ return -ENOSYS;
+
retval = security_task_setuid(uid, (uid_t)-1, (uid_t)-1, LSM_SETID_ID);
if (retval)
return retval;
@@ -1152,6 +1164,9 @@ asmlinkage long sys_setresuid(uid_t ruid
int old_suid = current->suid;
int retval;
+ if (async_syscall(current))
+ return -ENOSYS;
+
retval = security_task_setuid(ruid, euid, suid, LSM_SETID_RES);
if (retval)
return retval;
@@ -1206,6 +1221,9 @@ asmlinkage long sys_setresgid(gid_t rgid
{
int retval;
+ if (async_syscall(current))
+ return -ENOSYS;
+
retval = security_task_setgid(rgid, egid, sgid, LSM_SETID_RES);
if (retval)
return retval;
@@ -1261,6 +1279,9 @@ asmlinkage long sys_setfsuid(uid_t uid)
{
int old_fsuid;
+ if (async_syscall(current))
+ return -ENOSYS;
+
old_fsuid = current->fsuid;
if (security_task_setuid(uid, (uid_t)-1, (uid_t)-1, LSM_SETID_FS))
return old_fsuid;
@@ -1290,6 +1311,9 @@ asmlinkage long sys_setfsgid(gid_t gid)
{
int old_fsgid;
+ if (async_syscall(current))
+ return -ENOSYS;
+
old_fsgid = current->fsgid;
if (security_task_setgid(gid, (gid_t)-1, (gid_t)-1, LSM_SETID_FS))
return old_fsgid;
@@ -1365,6 +1389,9 @@ asmlinkage long sys_setpgid(pid_t pid, p
struct task_struct *group_leader = current->group_leader;
int err = -EINVAL;
+ if (async_syscall(current))
+ return -ENOSYS;
+
if (!pid)
pid = group_leader->pid;
if (!pgid)
@@ -1488,6 +1515,9 @@ asmlinkage long sys_setsid(void)
pid_t session;
int err = -EPERM;
+ if (async_syscall(current))
+ return -ENOSYS;
+
write_lock_irq(&tasklist_lock);
/* Fail if I am already a session leader */
@@ -1732,6 +1762,9 @@ asmlinkage long sys_setgroups(int gidset
struct group_info *group_info;
int retval;
+ if (async_syscall(current))
+ return -ENOSYS;
+
if (!capable(CAP_SETGID))
return -EPERM;
if ((unsigned)gidsetsize > NGROUPS_MAX)
@@ -2073,6 +2106,9 @@ asmlinkage long sys_prctl(int option, un
{
long error;
+ if (async_syscall(current))
+ return -ENOSYS;
+
error = security_task_prctl(option, arg2, arg3, arg4, arg5);
if (error)
return error;
From: Ingo Molnar <[email protected]>
add the create_async_thread() way of creating kernel threads:
these threads first execute a kernel function and when they
return from it they execute user-space.
An architecture must implement this interface before it can turn
CONFIG_ASYNC_SUPPORT on.
Signed-off-by: Ingo Molnar <[email protected]>
Signed-off-by: Arjan van de Ven <[email protected]>
---
arch/i386/kernel/entry.S | 25 +++++++++++++++++++++++++
arch/i386/kernel/process.c | 31 +++++++++++++++++++++++++++++++
include/asm-i386/processor.h | 5 +++++
3 files changed, 61 insertions(+)
Index: linux/arch/i386/kernel/entry.S
===================================================================
--- linux.orig/arch/i386/kernel/entry.S
+++ linux/arch/i386/kernel/entry.S
@@ -996,6 +996,31 @@ ENTRY(kernel_thread_helper)
CFI_ENDPROC
ENDPROC(kernel_thread_helper)
+ENTRY(async_thread_helper)
+ CFI_STARTPROC
+ /*
+ * Allocate space on the stack for pt-regs.
+ * sizeof(struct pt_regs) == 64, and we've got 8 bytes on the
+ * kernel stack already:
+ */
+ subl $64-8, %esp
+ CFI_ADJUST_CFA_OFFSET 64
+ movl %edx,%eax
+ push %edx
+ CFI_ADJUST_CFA_OFFSET 4
+ call *%ebx
+ addl $4, %esp
+ CFI_ADJUST_CFA_OFFSET -4
+
+ movl %eax, PT_EAX(%esp)
+
+ GET_THREAD_INFO(%ebp)
+
+ jmp syscall_exit
+ CFI_ENDPROC
+ENDPROC(async_thread_helper)
+
+
.section .rodata,"a"
#include "syscall_table.S"
Index: linux/arch/i386/kernel/process.c
===================================================================
--- linux.orig/arch/i386/kernel/process.c
+++ linux/arch/i386/kernel/process.c
@@ -352,6 +352,37 @@ int kernel_thread(int (*fn)(void *), voi
EXPORT_SYMBOL(kernel_thread);
/*
+ * This gets run with %ebx containing the
+ * function to call, and %edx containing
+ * the "args".
+ */
+extern void async_thread_helper(void);
+
+/*
+ * Create an async thread
+ */
+int create_async_thread(int (*fn)(void *), void * arg, unsigned long flags)
+{
+ struct pt_regs regs;
+
+ memset(&regs, 0, sizeof(regs));
+
+ regs.ebx = (unsigned long) fn;
+ regs.edx = (unsigned long) arg;
+
+ regs.xds = __USER_DS;
+ regs.xes = __USER_DS;
+ regs.xgs = __KERNEL_PDA;
+ regs.orig_eax = -1;
+ regs.eip = (unsigned long) async_thread_helper;
+ regs.xcs = __KERNEL_CS | get_kernel_rpl();
+ regs.eflags = X86_EFLAGS_IF | X86_EFLAGS_SF | X86_EFLAGS_PF | 0x2;
+
+ /* Ok, create the new task.. */
+ return do_fork(flags, 0, &regs, 0, NULL, NULL);
+}
+
+/*
* Free current thread data structures etc..
*/
void exit_thread(void)
Index: linux/include/asm-i386/processor.h
===================================================================
--- linux.orig/include/asm-i386/processor.h
+++ linux/include/asm-i386/processor.h
@@ -472,6 +472,11 @@ extern void prepare_to_copy(struct task_
*/
extern int kernel_thread(int (*fn)(void *), void * arg, unsigned long flags);
+/*
+ * create an async thread:
+ */
+extern int create_async_thread(int (*fn)(void *), void * arg, unsigned long flags);
+
extern unsigned long thread_saved_pc(struct task_struct *tsk);
void show_trace(struct task_struct *task, struct pt_regs *regs, unsigned long *stack);
From: Ingo Molnar <[email protected]>
enable CONFIG_ASYNC_SUPPORT on x86.
Signed-off-by: Ingo Molnar <[email protected]>
Signed-off-by: Arjan van de Ven <[email protected]>
---
arch/i386/Kconfig | 4 ++++
1 file changed, 4 insertions(+)
Index: linux/arch/i386/Kconfig
===================================================================
--- linux.orig/arch/i386/Kconfig
+++ linux/arch/i386/Kconfig
@@ -38,6 +38,10 @@ config MMU
bool
default y
+config ASYNC_SUPPORT
+ bool
+ default y
+
config SBUS
bool
From: Ingo Molnar <[email protected]>
the core syslet / async system calls infrastructure code.
Is built only if CONFIG_ASYNC_SUPPORT is enabled.
Signed-off-by: Ingo Molnar <[email protected]>
Signed-off-by: Arjan van de Ven <[email protected]>
---
kernel/Makefile | 1
kernel/async.c | 958 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++
2 files changed, 959 insertions(+)
Index: linux/kernel/Makefile
===================================================================
--- linux.orig/kernel/Makefile
+++ linux/kernel/Makefile
@@ -10,6 +10,7 @@ obj-y = sched.o fork.o exec_domain.o
kthread.o wait.o kfifo.o sys_ni.o posix-cpu-timers.o mutex.o \
hrtimer.o rwsem.o latency.o nsproxy.o srcu.o
+obj-$(CONFIG_ASYNC_SUPPORT) += async.o
obj-$(CONFIG_STACKTRACE) += stacktrace.o
obj-y += time/
obj-$(CONFIG_DEBUG_MUTEXES) += mutex-debug.o
Index: linux/kernel/async.c
===================================================================
--- /dev/null
+++ linux/kernel/async.c
@@ -0,0 +1,958 @@
+/*
+ * kernel/async.c
+ *
+ * The syslet and threadlet subsystem - asynchronous syscall and user-space
+ * code execution support.
+ *
+ * Started by Ingo Molnar:
+ *
+ * Copyright (C) 2007 Red Hat, Inc., Ingo Molnar <[email protected]>
+ *
+ * This file is released under the GPLv2.
+ *
+ * This code implements asynchronous syscalls via 'syslets'.
+ *
+ * Syslets consist of a set of 'syslet atoms' which are residing
+ * purely in user-space memory and have no kernel-space resource
+ * attached to them. These atoms can be linked to each other via
+ * pointers. Besides the fundamental ability to execute system
+ * calls, syslet atoms can also implement branches, loops and
+ * arithmetics.
+ *
+ * Thus syslets can be used to build small autonomous programs that
+ * the kernel can execute purely from kernel-space, without having
+ * to return to any user-space context. Syslets can be run by any
+ * unprivileged user-space application - they are executed safely
+ * by the kernel.
+ *
+ * "Threadlets" are the user-space equivalent of syslets: small
+ * functions of execution that the kernel attempts to execute
+ * without scheduling. If the threadlet blocks, the kernel creates
+ * a real thread from it, and execution continues in that thread.
+ * The 'head' context (the context that never blocks) returns to
+ * the original function that called the threadlet.
+ */
+#include <linux/syscalls.h>
+#include <linux/syslet.h>
+#include <linux/delay.h>
+#include <linux/async.h>
+#include <linux/sched.h>
+#include <linux/init.h>
+#include <linux/err.h>
+
+#include <asm/uaccess.h>
+#include <asm/unistd.h>
+
+typedef asmlinkage long (*syscall_fn_t)(long, long, long, long, long, long);
+
+extern syscall_fn_t sys_call_table[NR_syscalls];
+
+/*
+ * An async 'cachemiss context' is either busy, or it is ready.
+ * If it is ready, the 'head' might switch its user-space context
+ * to that ready thread anytime - so that if the ex-head blocks,
+ * one ready thread can become the next head and can continue to
+ * execute user-space code.
+ */
+static void
+__mark_async_thread_ready(struct async_thread *at, struct async_head *ah)
+{
+ list_del(&at->entry);
+ list_add_tail(&at->entry, &ah->ready_async_threads);
+ if (list_empty(&ah->busy_async_threads))
+ wake_up(&ah->wait);
+}
+
+static void
+mark_async_thread_ready(struct async_thread *at, struct async_head *ah)
+{
+ spin_lock(&ah->lock);
+ __mark_async_thread_ready(at, ah);
+ spin_unlock(&ah->lock);
+}
+
+static void
+__mark_async_thread_busy(struct async_thread *at, struct async_head *ah)
+{
+ list_del(&at->entry);
+ list_add_tail(&at->entry, &ah->busy_async_threads);
+}
+
+static void
+mark_async_thread_busy(struct async_thread *at, struct async_head *ah)
+{
+ spin_lock(&ah->lock);
+ __mark_async_thread_busy(at, ah);
+ spin_unlock(&ah->lock);
+}
+
+static void
+__async_thread_init(struct task_struct *t, struct async_thread *at,
+ struct async_head *ah)
+{
+ INIT_LIST_HEAD(&at->entry);
+ at->exit = 0;
+ at->task = t;
+ at->ah = ah;
+ at->work = NULL;
+
+ t->at = at;
+}
+
+static void
+async_thread_init(struct task_struct *t, struct async_thread *at,
+ struct async_head *ah)
+{
+ spin_lock(&ah->lock);
+ __async_thread_init(t, at, ah);
+ __mark_async_thread_ready(at, ah);
+ spin_unlock(&ah->lock);
+}
+
+static void
+async_thread_exit(struct async_thread *at, struct task_struct *t)
+{
+ struct async_head *ah = at->ah;
+
+ spin_lock(&ah->lock);
+ list_del_init(&at->entry);
+ if (at->exit)
+ complete(&ah->exit_done);
+ t->at = NULL;
+ at->task = NULL;
+ spin_unlock(&ah->lock);
+}
+
+static struct async_thread *
+pick_ready_cachemiss_thread(struct async_head *ah)
+{
+ struct list_head *head = &ah->ready_async_threads;
+
+ if (list_empty(head))
+ return NULL;
+
+ return list_entry(head->next, struct async_thread, entry);
+}
+
+void __async_schedule(struct task_struct *t)
+{
+ struct async_thread *new_async_thread;
+ struct async_thread *async_ready;
+ struct async_head *ah = t->ah;
+ struct task_struct *new_task;
+
+ WARN_ON(!ah);
+ spin_lock(&ah->lock);
+
+ new_async_thread = pick_ready_cachemiss_thread(ah);
+ if (!new_async_thread)
+ goto out_unlock;
+
+ async_ready = t->async_ready;
+ WARN_ON(!async_ready);
+ t->async_ready = NULL;
+
+ new_task = new_async_thread->task;
+
+ move_user_context(new_task, t);
+ if (ah->restore_stack) {
+ task_pt_regs(new_task)->esp = ah->restore_stack;
+ WARN_ON(!ah->restore_eip);
+ task_pt_regs(new_task)->eip = ah->restore_eip;
+ /*
+ * The return code 0 is needed to tell the
+ * head user-context that the threadlet went async:
+ */
+ task_pt_regs(new_task)->eax = 0;
+ }
+
+ new_task->at = NULL;
+ t->ah = NULL;
+ new_task->ah = ah;
+ ah->user_task = new_task;
+
+ wake_up_process(new_task);
+
+ __async_thread_init(t, async_ready, ah);
+ __mark_async_thread_busy(t->at, ah);
+
+ out_unlock:
+ spin_unlock(&ah->lock);
+}
+
+static void async_schedule(struct task_struct *t)
+{
+ if (t->async_ready)
+ __async_schedule(t);
+}
+
+static long __exec_atom(struct task_struct *t, struct syslet_atom *atom)
+{
+ struct async_thread *async_ready_save;
+ long ret;
+
+ /*
+ * If user-space expects the syscall to schedule then
+ * (try to) switch user-space to another thread straight
+ * away and execute the syscall asynchronously:
+ */
+ if (unlikely(atom->flags & SYSLET_ASYNC))
+ async_schedule(t);
+ /*
+ * Does user-space want synchronous execution for this atom?:
+ */
+ async_ready_save = t->async_ready;
+ if (unlikely(atom->flags & SYSLET_SYNC))
+ t->async_ready = NULL;
+
+ if (unlikely(atom->nr >= NR_syscalls))
+ return -ENOSYS;
+
+ ret = sys_call_table[atom->nr](atom->args[0], atom->args[1],
+ atom->args[2], atom->args[3],
+ atom->args[4], atom->args[5]);
+
+ if (atom->ret_ptr && put_user(ret, atom->ret_ptr))
+ return -EFAULT;
+
+ if (t->ah)
+ t->async_ready = async_ready_save;
+
+ return ret;
+}
+
+/*
+ * Arithmetics syscall, add a value to a user-space memory location.
+ *
+ * Generic C version - in case the architecture has not implemented it
+ * in assembly.
+ */
+asmlinkage __attribute__((weak)) long
+sys_umem_add(unsigned long __user *uptr, unsigned long inc)
+{
+ unsigned long val, new_val;
+
+ if (get_user(val, uptr))
+ return -EFAULT;
+ /*
+ * inc == 0 means 'read memory value':
+ */
+ if (!inc)
+ return val;
+
+ new_val = val + inc;
+ if (__put_user(new_val, uptr))
+ return -EFAULT;
+
+ return new_val;
+}
+
+/*
+ * Open-coded because this is a very hot codepath during syslet
+ * execution and every cycle counts ...
+ *
+ * [ NOTE: it's an explicit fastcall because optimized assembly code
+ * might depend on this. There are some kernels that disable regparm,
+ * so lets not break those if possible. ]
+ */
+fastcall __attribute__((weak)) long
+copy_uatom(struct syslet_atom *atom, struct syslet_uatom __user *uatom)
+{
+ unsigned long __user *arg_ptr;
+ long ret = 0;
+
+ if (!access_ok(VERIFY_READ, uatom, sizeof(*uatom)))
+ return -EFAULT;
+
+ ret = __get_user(atom->nr, &uatom->nr);
+ ret |= __get_user(atom->ret_ptr, &uatom->ret_ptr);
+ ret |= __get_user(atom->flags, &uatom->flags);
+ ret |= __get_user(atom->next, &uatom->next);
+
+ memset(atom->args, 0, sizeof(atom->args));
+
+ ret |= __get_user(arg_ptr, &uatom->arg_ptr[0]);
+ if (!arg_ptr)
+ return ret;
+ if (!access_ok(VERIFY_READ, arg_ptr, sizeof(*arg_ptr)))
+ return -EFAULT;
+ ret |= __get_user(atom->args[0], arg_ptr);
+
+ ret |= __get_user(arg_ptr, &uatom->arg_ptr[1]);
+ if (!arg_ptr)
+ return ret;
+ if (!access_ok(VERIFY_READ, arg_ptr, sizeof(*arg_ptr)))
+ return -EFAULT;
+ ret |= __get_user(atom->args[1], arg_ptr);
+
+ ret |= __get_user(arg_ptr, &uatom->arg_ptr[2]);
+ if (!arg_ptr)
+ return ret;
+ if (!access_ok(VERIFY_READ, arg_ptr, sizeof(*arg_ptr)))
+ return -EFAULT;
+ ret |= __get_user(atom->args[2], arg_ptr);
+
+ ret |= __get_user(arg_ptr, &uatom->arg_ptr[3]);
+ if (!arg_ptr)
+ return ret;
+ if (!access_ok(VERIFY_READ, arg_ptr, sizeof(*arg_ptr)))
+ return -EFAULT;
+ ret |= __get_user(atom->args[3], arg_ptr);
+
+ ret |= __get_user(arg_ptr, &uatom->arg_ptr[4]);
+ if (!arg_ptr)
+ return ret;
+ if (!access_ok(VERIFY_READ, arg_ptr, sizeof(*arg_ptr)))
+ return -EFAULT;
+ ret |= __get_user(atom->args[4], arg_ptr);
+
+ ret |= __get_user(arg_ptr, &uatom->arg_ptr[5]);
+ if (!arg_ptr)
+ return ret;
+ if (!access_ok(VERIFY_READ, arg_ptr, sizeof(*arg_ptr)))
+ return -EFAULT;
+ ret |= __get_user(atom->args[5], arg_ptr);
+
+ return ret;
+}
+
+/*
+ * Should the next atom run, depending on the return value of
+ * the current atom - or should we stop execution?
+ */
+static int run_next_atom(struct syslet_atom *atom, long ret)
+{
+ switch (atom->flags & SYSLET_STOP_MASK) {
+ case SYSLET_STOP_ON_NONZERO:
+ if (!ret)
+ return 1;
+ return 0;
+ case SYSLET_STOP_ON_ZERO:
+ if (ret)
+ return 1;
+ return 0;
+ case SYSLET_STOP_ON_NEGATIVE:
+ if (ret >= 0)
+ return 1;
+ return 0;
+ case SYSLET_STOP_ON_NON_POSITIVE:
+ if (ret > 0)
+ return 1;
+ return 0;
+ }
+ return 1;
+}
+
+static struct syslet_uatom __user *
+next_uatom(struct syslet_atom *atom, struct syslet_uatom *uatom, long ret)
+{
+ /*
+ * If the stop condition is false then continue
+ * to atom->next:
+ */
+ if (run_next_atom(atom, ret))
+ return atom->next;
+ /*
+ * Special-case: if the stop condition is true and the atom
+ * has SKIP_TO_NEXT_ON_STOP set, then instead of
+ * stopping we skip to the atom directly after this atom
+ * (in linear address-space).
+ *
+ * This, combined with the atom->next pointer and the
+ * stop condition flags is what allows true branches and
+ * loops in syslets:
+ */
+ if (atom->flags & SYSLET_SKIP_TO_NEXT_ON_STOP)
+ return uatom + 1;
+
+ return NULL;
+}
+
+/*
+ * If user-space requested a completion event then put the last
+ * executed uatom into the completion ring:
+ */
+static long
+complete_uatom(struct async_head *ah, struct task_struct *t,
+ struct syslet_atom *atom, struct syslet_uatom __user *uatom,
+ struct async_head_user __user *ahu)
+{
+ unsigned long ring_size_bytes, max_ring_idx, kernel_ring_idx;
+ struct syslet_uatom __user **ring_slot, *slot_val = NULL;
+ struct syslet_uatom __user **completion_ring;
+
+ WARN_ON(!t->at);
+ WARN_ON(t->ah);
+
+ if (atom->flags & SYSLET_NO_COMPLETE)
+ return 0;
+
+ if (!access_ok(VERIFY_WRITE, ahu, sizeof(*ahu)))
+ return -EFAULT;
+
+ if (__get_user(completion_ring, &ahu->completion_ring))
+ return -EFAULT;
+ if (__get_user(ring_size_bytes, &ahu->ring_size_bytes))
+ return -EFAULT;
+ if (!ring_size_bytes)
+ return -EINVAL;
+
+ max_ring_idx = ring_size_bytes / sizeof(void *);
+ if (ring_size_bytes != max_ring_idx * sizeof(void *))
+ return -EINVAL;
+ /*
+ * We pre-check the ring pointer, so that in the fastpath
+ * we can use __get_user():
+ */
+ if (!access_ok(VERIFY_WRITE, completion_ring, ring_size_bytes))
+ return -EFAULT;
+
+ mutex_lock(&ah->completion_lock);
+ /*
+ * Asynchronous threads can complete in parallel so use the
+ * head-lock to serialize:
+ */
+ if (__get_user(kernel_ring_idx, &ahu->kernel_ring_idx))
+ goto fault_unlock;
+ if (kernel_ring_idx >= max_ring_idx)
+ goto err_unlock;
+
+ ring_slot = completion_ring + kernel_ring_idx;
+ if (__get_user(slot_val, ring_slot))
+ goto fault_unlock;
+ /*
+ * User-space submitted more work than what fits into the
+ * completion ring - do not stomp over it silently and signal
+ * the error condition:
+ */
+ if (slot_val)
+ goto err_unlock;
+
+ slot_val = uatom;
+ if (__put_user(slot_val, ring_slot))
+ goto fault_unlock;
+ /*
+ * Update the ring index:
+ */
+ kernel_ring_idx++;
+ if (kernel_ring_idx == max_ring_idx)
+ kernel_ring_idx = 0;
+
+ if (__put_user(kernel_ring_idx, &ahu->kernel_ring_idx))
+ goto fault_unlock;
+
+ /*
+ * See whether the async-head is waiting and needs a wakeup:
+ */
+ if (ah->events_left) {
+ if (!--ah->events_left) {
+ /*
+ * We first unlock the mutex - to reduce the size
+ * of the critical section. We have a safe
+ * reference to 'ah':
+ */
+ mutex_unlock(&ah->completion_lock);
+ wake_up(&ah->wait);
+ goto out;
+ }
+ }
+
+ mutex_unlock(&ah->completion_lock);
+ out:
+ return 0;
+
+ fault_unlock:
+ mutex_unlock(&ah->completion_lock);
+
+ return -EFAULT;
+
+ err_unlock:
+ mutex_unlock(&ah->completion_lock);
+
+ return -EINVAL;
+}
+
+/*
+ * This is the main syslet atom execution loop. This fetches atoms
+ * and executes them until it runs out of atoms, until a stop
+ * condition is met, or until a signal or reschedule is pending:
+ */
+static struct syslet_uatom __user *
+exec_atom(struct async_head *ah, struct task_struct *t,
+ struct syslet_uatom __user *uatom,
+ struct async_head_user __user *ahu)
+{
+ struct syslet_uatom __user *last_uatom;
+ struct syslet_atom atom;
+ long ret;
+
+ run_next:
+ if (unlikely(copy_uatom(&atom, uatom)))
+ return ERR_PTR(-EFAULT);
+
+ last_uatom = uatom;
+ ret = __exec_atom(t, &atom);
+ if (unlikely(signal_pending(t) || need_resched()))
+ goto stop;
+
+ uatom = next_uatom(&atom, uatom, ret);
+ if (uatom)
+ goto run_next;
+ stop:
+ /*
+ * We do completion only in async context:
+ */
+ if (t->at && complete_uatom(ah, t, &atom, last_uatom, ahu))
+ return ERR_PTR(-EFAULT);
+
+ return last_uatom;
+}
+
+static void cachemiss_execute(struct async_thread *at, struct async_head *ah,
+ struct task_struct *t)
+{
+ struct syslet_uatom __user *uatom;
+
+ uatom = at->work;
+ WARN_ON(!uatom);
+ at->work = NULL;
+ WARN_ON(1); /* need to pass the ahu too */
+
+ exec_atom(ah, t, uatom, NULL);
+}
+
+static struct syslet_uatom __user *
+cachemiss_loop(struct async_thread *at, struct async_head *ah,
+ struct task_struct *t)
+{
+ for (;;) {
+ mark_async_thread_busy(at, ah);
+ set_task_state(t, TASK_INTERRUPTIBLE);
+ if (at->work)
+ cachemiss_execute(at, ah, t);
+ if (unlikely(t->ah || at->exit || signal_pending(t)))
+ break;
+ mark_async_thread_ready(at, ah);
+ schedule();
+ }
+ t->state = TASK_RUNNING;
+
+ async_thread_exit(at, t);
+
+ if (at->exit)
+ do_exit(0);
+
+ if (!t->ah) {
+ /*
+ * Cachemiss threads return to one given
+ * user-space instruction address and stack
+ * pointer:
+ */
+ task_pt_regs(t)->esp = at->user_stack;
+ task_pt_regs(t)->eip = at->user_eip;
+
+ return (void *)-1;
+ }
+ /*
+ * Head context: return to user-space with NULL:
+ */
+ return NULL;
+}
+
+/*
+ * This is what a newly created cachemiss thread executes for the
+ * first time: initialize, pick up the user stack/IP addresses from
+ * the head and then execute the cachemiss loop. If the cachemiss
+ * loop returns then we return back to user-space:
+ */
+static int cachemiss_thread(void *data)
+{
+ struct pt_regs *head_regs, *regs;
+ struct task_struct *t = current;
+ struct async_head *ah = data;
+ struct async_thread *at;
+ int ret;
+
+ at = &t->__at;
+ async_thread_init(t, at, ah);
+
+ /*
+ * Clone the head thread's user-space ptregs over,
+ * now that we are in kernel-space:
+ */
+ head_regs = task_pt_regs(ah->user_task);
+ regs = task_pt_regs(t);
+
+ *regs = *head_regs;
+ ret = get_user(at->user_stack, ah->new_stackp);
+ WARN_ON(ret);
+ /*
+ * Clear the stack pointer, signalling to user-space that
+ * this thread stack has been used up:
+ */
+ ret = put_user(0, ah->new_stackp);
+ WARN_ON(ret);
+
+ complete(&ah->start_done);
+
+ /*
+ * Fixme: 64-bit kernel threads should return long
+ */
+ return (int)cachemiss_loop(at, ah, t);
+}
+
+/**
+ * sys_async_thread - do work as an async cachemiss thread again
+ *
+ * If an async thread has returned back to user-space (due to say
+ * a signal) then it is a 'busy' thread during that period. It
+ * can again offer itself into the cachemiss pool by calling this
+ * syscall:
+ */
+asmlinkage long sys_async_thread(void)
+{
+ struct task_struct *t = current;
+ struct async_thread *at = t->at;
+ struct async_head *ah = t->__at.ah;
+
+ /*
+ * Only async threads are allowed to do this:
+ */
+ if (!ah || t->ah)
+ return -EINVAL;
+
+ /*
+ * If a cachemiss threadlet calls sys_async_thread()
+ * then we first have to mark it ready:
+ */
+ if (at) {
+ mark_async_thread_ready(at, ah);
+ } else {
+ at = &t->__at;
+ WARN_ON(!at->ah);
+
+ async_thread_init(t, at, ah);
+ }
+
+ return (long)cachemiss_loop(at, at->ah, t);
+}
+
+/*
+ * Initialize the in-kernel async head, based on the user-space async
+ * head:
+ */
+static long
+async_head_init(struct task_struct *t, struct async_head_user __user *ahu)
+{
+ struct async_head *ah;
+
+ ah = &t->__ah;
+
+ spin_lock_init(&ah->lock);
+ INIT_LIST_HEAD(&ah->ready_async_threads);
+ INIT_LIST_HEAD(&ah->busy_async_threads);
+ init_waitqueue_head(&ah->wait);
+ mutex_init(&ah->completion_lock);
+ ah->events_left = 0;
+ ah->ahu = NULL;
+ ah->new_stackp = NULL;
+ ah->new_eip = 0;
+ ah->restore_stack = 0;
+ ah->restore_eip = 0;
+ ah->user_task = t;
+ t->ah = ah;
+
+ return 0;
+}
+
+/*
+ * If the head cache-misses then it will become a cachemiss
+ * thread after having finished its current syslet. If it
+ * returns to user-space after that point (to handle a signal
+ * for example) then it will need a thread stack of its own:
+ */
+static long init_head(struct async_head *ah, struct task_struct *t,
+ struct async_head_user __user *ahu)
+{
+ unsigned long head_stack, head_eip;
+
+ if (get_user(head_stack, &ahu->head_stack))
+ return -EFAULT;
+ if (get_user(head_eip, &ahu->head_eip))
+ return -EFAULT;
+ t->__at.user_stack = head_stack;
+ t->__at.user_eip = head_eip;
+
+ return async_head_init(t, ahu);
+}
+
+/*
+ * Simple limit and pool management mechanism for now:
+ */
+static long
+refill_cachemiss_pool(struct async_head *ah, struct task_struct *t,
+ struct async_head_user __user *ahu)
+{
+ unsigned long new_eip;
+ long pid, ret;
+
+ init_completion(&ah->start_done);
+ ah->new_stackp = &ahu->new_thread_stack;
+ ret = get_user(new_eip, &ahu->new_thread_eip);
+ WARN_ON(ret);
+ ah->new_eip = new_eip;
+
+ pid = create_async_thread(cachemiss_thread, (void *)ah,
+ CLONE_VM | CLONE_FS | CLONE_FILES | CLONE_SIGHAND |
+ CLONE_THREAD | CLONE_SYSVSEM);
+ if (pid < 0)
+ return pid;
+
+ wait_for_completion(&ah->start_done);
+ ah->new_stackp = NULL;
+ ah->new_eip = 0;
+
+ return 0;
+}
+
+/**
+ * sys_async_exec - execute a syslet.
+ *
+ * returns the uatom that was last executed, if the kernel was able to
+ * execute the syslet synchronously, or NULL if the syslet became
+ * asynchronous. (in the latter case syslet completion will be notified
+ * via the completion ring)
+ *
+ * (Various errors might also be returned via the usual negative numbers.)
+ */
+asmlinkage struct syslet_uatom __user *
+sys_async_exec(struct syslet_uatom __user *uatom,
+ struct async_head_user __user *ahu)
+{
+ struct syslet_uatom __user *ret;
+ struct task_struct *t = current;
+ struct async_head *ah = t->ah;
+ struct async_thread *at = &t->__at;
+
+ /*
+ * Do not allow recursive calls of sys_async_exec():
+ */
+ if (async_syscall(t))
+ return ERR_PTR(-ENOSYS);
+
+ if (unlikely(!ah)) {
+ ret = (void *)init_head(ah, t, ahu);
+ if (ret)
+ return ret;
+ ah = t->ah;
+ }
+
+ if (unlikely(list_empty(&ah->ready_async_threads))) {
+ ret = (void *)refill_cachemiss_pool(ah, t, ahu);
+ if (ret)
+ return ret;
+ }
+
+ t->async_ready = at;
+ ah->ahu = ahu;
+
+ ret = exec_atom(ah, t, uatom, ahu);
+
+ /*
+ * Are we still executing as head?
+ */
+ if (t->ah) {
+ t->async_ready = NULL;
+
+ return ret;
+ }
+
+ /*
+ * We got turned into a cachemiss thread,
+ * enter the cachemiss loop:
+ */
+ set_task_state(t, TASK_INTERRUPTIBLE);
+ mark_async_thread_ready(at, ah);
+
+ return cachemiss_loop(at, ah, t);
+}
+
+/**
+ * sys_async_wait - wait for async completion events
+ *
+ * This syscall waits for @min_wait_events syslet completion events
+ * to finish or for all async processing to finish (whichever
+ * comes first).
+ */
+asmlinkage long
+sys_async_wait(unsigned long min_wait_events, unsigned long user_ring_idx,
+ struct async_head_user __user *ahu)
+{
+ struct task_struct *t = current;
+ struct async_head *ah = t->ah;
+ unsigned long kernel_ring_idx;
+
+ /*
+ * Do not allow async waiting:
+ */
+ if (async_syscall(t))
+ return -ENOSYS;
+ if (!ah)
+ return -EINVAL;
+
+ mutex_lock(&ah->completion_lock);
+ if (get_user(kernel_ring_idx, &ahu->kernel_ring_idx))
+ goto err_unlock;
+ /*
+ * Account any completions that happened since user-space
+ * checked the ring:
+ */
+ ah->events_left = min_wait_events - (kernel_ring_idx - user_ring_idx);
+ mutex_unlock(&ah->completion_lock);
+
+ return wait_event_interruptible(ah->wait,
+ list_empty(&ah->busy_async_threads) || ah->events_left <= 0);
+
+ err_unlock:
+ mutex_unlock(&ah->completion_lock);
+ return -EFAULT;
+}
+
+asmlinkage long
+sys_threadlet_on(unsigned long restore_stack,
+ unsigned long restore_eip,
+ struct async_head_user __user *ahu)
+{
+ struct task_struct *t = current;
+ struct async_head *ah = t->ah;
+ struct async_thread *at = &t->__at;
+ long ret;
+
+ /*
+ * Do not allow recursive calls of sys_threadlet_on():
+ */
+ if (t->async_ready || t->at)
+ return -EINVAL;
+
+ if (unlikely(!ah)) {
+ ret = init_head(ah, t, ahu);
+ if (ret)
+ return ret;
+ ah = t->ah;
+ }
+
+ if (unlikely(list_empty(&ah->ready_async_threads))) {
+ ret = refill_cachemiss_pool(ah, t, ahu);
+ if (ret)
+ return ret;
+ }
+
+ t->async_ready = at;
+ ah->restore_stack = restore_stack;
+ ah->restore_eip = restore_eip;
+
+ ah->ahu = ahu;
+
+ return 0;
+}
+
+asmlinkage long sys_threadlet_off(void)
+{
+ struct task_struct *t = current;
+ struct async_head *ah = t->ah;
+
+ /*
+ * Are we still executing as head?
+ */
+ if (ah) {
+ t->async_ready = NULL;
+
+ return 1;
+ }
+
+ /*
+ * We got turned into a cachemiss thread,
+ * return to user-space, which can do
+ * the notification, etc:
+ */
+ return 0;
+}
+
+static void __notify_async_thread_exit(struct async_thread *at,
+ struct async_head *ah)
+{
+ list_del_init(&at->entry);
+ at->exit = 1;
+ init_completion(&ah->exit_done);
+ wake_up_process(at->task);
+}
+
+static void stop_cachemiss_threads(struct async_head *ah)
+{
+ struct async_thread *at;
+
+repeat:
+ spin_lock(&ah->lock);
+ list_for_each_entry(at, &ah->ready_async_threads, entry) {
+
+ __notify_async_thread_exit(at, ah);
+ spin_unlock(&ah->lock);
+
+ wait_for_completion(&ah->exit_done);
+
+ goto repeat;
+ }
+
+ list_for_each_entry(at, &ah->busy_async_threads, entry) {
+
+ __notify_async_thread_exit(at, ah);
+ spin_unlock(&ah->lock);
+
+ wait_for_completion(&ah->exit_done);
+
+ goto repeat;
+ }
+ spin_unlock(&ah->lock);
+}
+
+static void async_head_exit(struct async_head *ah, struct task_struct *t)
+{
+ stop_cachemiss_threads(ah);
+ WARN_ON(!list_empty(&ah->ready_async_threads));
+ WARN_ON(!list_empty(&ah->busy_async_threads));
+ WARN_ON(spin_is_locked(&ah->lock));
+
+ t->ah = NULL;
+}
+
+/*
+ * fork()-time initialization:
+ */
+void async_init(struct task_struct *t)
+{
+ t->at = NULL;
+ t->async_ready = NULL;
+ t->ah = NULL;
+ t->__at.ah = NULL;
+}
+
+/*
+ * do_exit()-time cleanup:
+ */
+void async_exit(struct task_struct *t)
+{
+ struct async_thread *at = t->at;
+ struct async_head *ah = t->ah;
+
+ /*
+ * If head does a sys_exit() then the final schedule() must
+ * not be passed on to another cachemiss thread:
+ */
+ t->async_ready = NULL;
+
+ if (unlikely(at))
+ async_thread_exit(at, t);
+
+ if (unlikely(ah))
+ async_head_exit(ah, t);
+}
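The branching rules in run_next_atom() and next_uatom() above are easier to
follow in isolation. What follows is only a user-space sketch of that logic,
not kernel code: the flag values, struct model_atom and the model_*() names
are hypothetical and chosen for illustration (the real SYSLET_* constants are
defined elsewhere in this series).

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical flag values; the real SYSLET_* constants live in the patch's header. */
enum {
	STOP_ON_NONZERO      = 0x01,
	STOP_ON_ZERO         = 0x02,
	STOP_ON_NEGATIVE     = 0x04,
	STOP_ON_NON_POSITIVE = 0x08,
	STOP_MASK            = 0x0f,
	SKIP_TO_NEXT_ON_STOP = 0x10,
};

/* Hypothetical stand-in for the flags/next part of a syslet atom. */
struct model_atom {
	unsigned long flags;
	struct model_atom *next;
};

/* Returns 1 if the stop condition is false, i.e. execution may follow ->next. */
static int model_run_next(const struct model_atom *atom, long ret)
{
	assert(atom != NULL);

	switch (atom->flags & STOP_MASK) {
	case STOP_ON_NONZERO:
		return ret == 0;
	case STOP_ON_ZERO:
		return ret != 0;
	case STOP_ON_NEGATIVE:
		return ret >= 0;
	case STOP_ON_NON_POSITIVE:
		return ret > 0;
	}
	return 1;	/* no stop flag set: always continue */
}

/* Pick the atom to run after 'atom' returned 'ret', or NULL to stop. */
static struct model_atom *model_next(struct model_atom *atom, long ret)
{
	if (model_run_next(atom, ret))
		return atom->next;
	/* Stop condition hit: either fall through to the adjacent atom ... */
	if (atom->flags & SKIP_TO_NEXT_ON_STOP)
		return atom + 1;
	/* ... or end the syslet. */
	return NULL;
}
```

With an atom whose ->next points back at itself and whose flags combine a stop
condition with SKIP_TO_NEXT_ON_STOP, model_next() keeps returning the same atom
until the condition triggers and then moves on to the atom that follows it in
memory, which is the loop construct the comment in next_uatom() refers to.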
From: Ingo Molnar <[email protected]>
Add Documentation/syslet-design.txt with a high-level description
of the syslet concepts.
Signed-off-by: Ingo Molnar <[email protected]>
Signed-off-by: Arjan van de Ven <[email protected]>
---
Documentation/syslet-design.txt | 137 ++++++++++++++++++++++++++++++++++++++++
1 file changed, 137 insertions(+)
Index: linux/Documentation/syslet-design.txt
===================================================================
--- /dev/null
+++ linux/Documentation/syslet-design.txt
@@ -0,0 +1,137 @@
+Syslets / asynchronous system calls
+===================================
+
+started by Ingo Molnar <[email protected]>
+
+Goal:
+-----
+
+The goal of the syslet subsystem is to allow user-space to execute
+arbitrary system calls asynchronously. It does so by allowing user-space
+to execute "syslets" which are small scriptlets that the kernel can execute
+both securely and asynchronously without having to exit to user-space.
+
+The core syslet concepts are:
+
+The Syslet Atom:
+----------------
+
+The syslet atom is a small, fixed-size (44 bytes on 32-bit) piece of
+user-space memory, which is the basic unit of execution within the syslet
+framework. A syslet atom represents a single system call and its arguments.
+In addition it has condition flags attached to it that allow the
+construction of larger programs (syslets) from these atoms.
+
+Arguments to the system call are implemented via pointers to arguments.
+This not only increases the flexibility of syslet atoms (multiple syslets
+can share the same variable for example), but is also an optimization:
+copy_uatom() will only fetch syscall parameters up until the point it
+meets the first NULL pointer. 50% of all syscalls have 2 or fewer
+parameters (and 90% of all syscalls have 4 or fewer parameters).
+
+ [ Note: since the argument array is at the end of the atom, and the
+ kernel will not touch any argument beyond the final NULL one, atoms
+ might be packed more tightly. (the only special case exception to
+ this rule would be SKIP_TO_NEXT_ON_STOP atoms, where the kernel will
+ jump a full syslet_uatom number of bytes.) ]
+
+The Syslet:
+-----------
+
+A syslet is a program, represented by a graph of syslet atoms. The
+syslet atoms are chained to each other either via the atom->next pointer,
+or via the SYSLET_SKIP_TO_NEXT_ON_STOP flag.
+
+Running Syslets:
+----------------
+
+Syslets can be run via the sys_async_exec() system call, which takes
+the first atom of the syslet as an argument. The kernel does not need
+to be told about the other atoms - it will fetch them on the fly as
+execution goes forward.
+
+A syslet might either be executed 'cached', or it might generate a
+'cachemiss'.
+
+'Cached' syslet execution means that the whole syslet was executed
+without blocking. The system-call returns the address of the last
+executed atom in this case.
+
+If a syslet blocks while the kernel executes a system-call embedded in
+one of its atoms, the kernel will keep working on that syscall in
+parallel, but it immediately returns to user-space with a NULL pointer,
+so the submitting task can submit other syslets.
+
+Completion of asynchronous syslets:
+-----------------------------------
+
+Completion of asynchronous syslets is done via the 'completion ring',
+which is a ringbuffer of syslet atom pointers in user-space memory,
+provided by user-space as an argument to the sys_async_exec() syscall.
+The kernel fills in the ringbuffer starting at index 0, and user-space
+must clear out these pointers. Once the kernel reaches the end of
+the ring it wraps back to index 0. The kernel will not overwrite
+non-NULL pointers (it returns an error instead), so user-space has to
+make sure it completes all events it asked for.
+
+Waiting for completions:
+------------------------
+
+Syslet completions can be waited for via the sys_async_wait()
+system call - which takes the number of events it should wait for as
+a parameter. This system call will also return if the number of
+pending events goes down to zero.
+
+Sample Hello World syslet code:
+
+--------------------------->
+/*
+ * Set up a syslet atom:
+ */
+static void
+init_atom(struct syslet_uatom *atom, int nr,
+ void *arg_ptr0, void *arg_ptr1, void *arg_ptr2,
+ void *arg_ptr3, void *arg_ptr4, void *arg_ptr5,
+ void *ret_ptr, unsigned long flags, struct syslet_uatom *next)
+{
+ atom->nr = nr;
+ atom->arg_ptr[0] = arg_ptr0;
+ atom->arg_ptr[1] = arg_ptr1;
+ atom->arg_ptr[2] = arg_ptr2;
+ atom->arg_ptr[3] = arg_ptr3;
+ atom->arg_ptr[4] = arg_ptr4;
+ atom->arg_ptr[5] = arg_ptr5;
+ atom->ret_ptr = ret_ptr;
+ atom->flags = flags;
+ atom->next = next;
+}
+
+int main(int argc, char *argv[])
+{
+ unsigned long int fd_out = 1; /* standard output */
+ char *buf = "Hello Syslet World!\n";
+ unsigned long size = strlen(buf);
+ struct syslet_uatom atom, *done;
+
+ async_head_init();
+
+ /*
+ * Simple syslet consisting of a single atom:
+ */
+ init_atom(&atom, __NR_sys_write, &fd_out, &buf, &size,
+ NULL, NULL, NULL, NULL, SYSLET_ASYNC, NULL);
+ done = sys_async_exec(&atom);
+ if (!done) {
+ sys_async_wait(1);
+ if (completion_ring[curr_ring_idx] == &atom) {
+ completion_ring[curr_ring_idx] = NULL;
+ printf("completed an async syslet atom!\n");
+ }
+ } else {
+ printf("completed a cached syslet atom!\n");
+ }
+
+ async_head_exit();
+
+ return 0;
+}
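The completion-ring protocol described in syslet-design.txt can be modelled
entirely in user-space. The sketch below is an illustration under stated
assumptions, not the kernel implementation: struct model_ring, RING_SLOTS and
the ring_*() helpers are hypothetical names, and no locking is shown (the
kernel side serializes completions with ah->completion_lock).

```c
#include <assert.h>
#include <stddef.h>

#define RING_SLOTS 4	/* hypothetical ring size, chosen for illustration */

struct model_ring {
	void *slot[RING_SLOTS];
	unsigned long kernel_idx;	/* next slot the "kernel" side fills */
	unsigned long user_idx;		/* next slot user-space reaps */
};

/* Producer side, modelled on complete_uatom(): 0 on success, -1 if the slot is taken. */
static int ring_complete(struct model_ring *r, void *atom)
{
	assert(r != NULL && atom != NULL);

	/* Never overwrite an event that user-space has not cleared yet. */
	if (r->slot[r->kernel_idx])
		return -1;

	r->slot[r->kernel_idx] = atom;
	if (++r->kernel_idx == RING_SLOTS)
		r->kernel_idx = 0;	/* wrap back to index 0 */
	return 0;
}

/* Consumer side: return the next completed atom, or NULL if there is none. */
static void *ring_reap(struct model_ring *r)
{
	void *atom;

	assert(r != NULL);

	atom = r->slot[r->user_idx];
	if (!atom)
		return NULL;

	r->slot[r->user_idx] = NULL;	/* hand the slot back to the producer */
	if (++r->user_idx == RING_SLOTS)
		r->user_idx = 0;
	return atom;
}
```

The point of the NULL check on the producer side is that an over-committed ring
shows up as an error rather than as silently lost completions; the consumer
frees a slot simply by writing NULL into it.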
From: Ingo Molnar <[email protected]>
Mark clone() and fork() as not available for async execution: both need
an intact user context beneath them to work. The same async_syscall()
check is added to ioperm(), iopl(), modify_ldt(), vm86old() and vm86().
Signed-off-by: Ingo Molnar <[email protected]>
Signed-off-by: Arjan van de Ven <[email protected]>
---
arch/i386/kernel/ioport.c | 6 ++++++
arch/i386/kernel/ldt.c | 3 +++
arch/i386/kernel/process.c | 6 ++++++
arch/i386/kernel/vm86.c | 6 ++++++
4 files changed, 21 insertions(+)
Index: linux/arch/i386/kernel/ioport.c
===================================================================
--- linux.orig/arch/i386/kernel/ioport.c
+++ linux/arch/i386/kernel/ioport.c
@@ -62,6 +62,9 @@ asmlinkage long sys_ioperm(unsigned long
struct tss_struct * tss;
unsigned long *bitmap;
+ if (async_syscall(current))
+ return -ENOSYS;
+
if ((from + num <= from) || (from + num > IO_BITMAP_BITS))
return -EINVAL;
if (turn_on && !capable(CAP_SYS_RAWIO))
@@ -139,6 +142,9 @@ asmlinkage long sys_iopl(unsigned long u
unsigned int old = (regs->eflags >> 12) & 3;
struct thread_struct *t = &current->thread;
+ if (async_syscall(current))
+ return -ENOSYS;
+
if (level > 3)
return -EINVAL;
/* Trying to gain more privileges? */
Index: linux/arch/i386/kernel/ldt.c
===================================================================
--- linux.orig/arch/i386/kernel/ldt.c
+++ linux/arch/i386/kernel/ldt.c
@@ -233,6 +233,9 @@ asmlinkage int sys_modify_ldt(int func,
{
int ret = -ENOSYS;
+ if (async_syscall(current))
+ return -ENOSYS;
+
switch (func) {
case 0:
ret = read_ldt(ptr, bytecount);
Index: linux/arch/i386/kernel/process.c
===================================================================
--- linux.orig/arch/i386/kernel/process.c
+++ linux/arch/i386/kernel/process.c
@@ -731,6 +731,9 @@ struct task_struct fastcall * __switch_t
asmlinkage int sys_fork(struct pt_regs regs)
{
+ if (async_syscall(current))
+ return -ENOSYS;
+
return do_fork(SIGCHLD, regs.esp, &regs, 0, NULL, NULL);
}
@@ -740,6 +743,9 @@ asmlinkage int sys_clone(struct pt_regs
unsigned long newsp;
int __user *parent_tidptr, *child_tidptr;
+ if (async_syscall(current))
+ return -ENOSYS;
+
clone_flags = regs.ebx;
newsp = regs.ecx;
parent_tidptr = (int __user *)regs.edx;
Index: linux/arch/i386/kernel/vm86.c
===================================================================
--- linux.orig/arch/i386/kernel/vm86.c
+++ linux/arch/i386/kernel/vm86.c
@@ -208,6 +208,9 @@ asmlinkage int sys_vm86old(struct pt_reg
struct task_struct *tsk;
int tmp, ret = -EPERM;
+ if (async_syscall(current))
+ return -ENOSYS;
+
tsk = current;
if (tsk->thread.saved_esp0)
goto out;
@@ -238,6 +241,9 @@ asmlinkage int sys_vm86(struct pt_regs r
int tmp, ret;
struct vm86plus_struct __user *v86;
+ if (async_syscall(current))
+ return -ENOSYS;
+
tsk = current;
switch (regs.ebx) {
case VM86_REQUEST_IRQ:
From: Ingo Molnar <[email protected]>
Provide an optimized assembly version of copy_uatom().
It is about 3 times faster than the C version.
Signed-off-by: Ingo Molnar <[email protected]>
Signed-off-by: Arjan van de Ven <[email protected]>
---
arch/i386/lib/getuser.S | 115 ++++++++++++++++++++++++++++++++++++++++++++++++
1 file changed, 115 insertions(+)
Index: linux/arch/i386/lib/getuser.S
===================================================================
--- linux.orig/arch/i386/lib/getuser.S
+++ linux/arch/i386/lib/getuser.S
@@ -10,6 +10,121 @@
*/
#include <asm/thread_info.h>
+/*
+ * copy_uatom() - copy a syslet_atom from user-space into kernel-space:
+ */
+
+.text
+.align 4
+.globl copy_uatom
+copy_uatom:
+ #
+ # regparm(2) call, %eax: atom, %edx: uatom
+ #
+ movl %eax, %ecx # %ecx is atom, %edx remains uatom
+
+ cmpl $__PAGE_OFFSET-40, %edx # access_ok(uatom-sizeof(*uatom))
+ jae bad_copy_uatom
+
+
+1: movl (%edx), %eax # atom->flags = uatom->flags
+ movl %eax, (%ecx)
+
+2: movl 4(%edx), %eax # atom->nr = uatom->nr
+ movl %eax, 4(%ecx)
+
+3: movl 8(%edx), %eax # atom->ret_ptr = uatom->ret_ptr
+ movl %eax, 8(%ecx)
+
+4: movl 12(%edx), %eax # atom->next = uatom->next
+ movl %eax, 12(%ecx)
+
+
+10: movl 16(%edx), %eax # atom->args[0] = *uatom->arg_ptr[0]
+ testl %eax, %eax
+ jz 20f # NULL ptr - zero out remaining args
+ cmpl $__PAGE_OFFSET, %eax # access_ok(arg_ptr)
+ jae bad_copy_uatom
+100: movl (%eax), %eax
+ movl %eax, 16(%ecx)
+
+11: movl 20(%edx), %eax # atom->args[1] = *uatom->arg_ptr[1]
+ testl %eax, %eax
+ jz 21f # NULL ptr - zero out remaining args
+ cmpl $__PAGE_OFFSET, %eax # access_ok(arg_ptr)
+ jae bad_copy_uatom
+110: movl (%eax), %eax
+ movl %eax, 20(%ecx)
+
+12: movl 24(%edx), %eax # atom->args[2] = *uatom->arg_ptr[2]
+ testl %eax, %eax
+ jz 22f # NULL ptr - zero out remaining args
+ cmpl $__PAGE_OFFSET, %eax # access_ok(arg_ptr)
+ jae bad_copy_uatom
+120: movl (%eax), %eax
+ movl %eax, 24(%ecx)
+
+13: movl 28(%edx), %eax # atom->args[3] = *uatom->arg_ptr[3]
+ testl %eax, %eax
+ jz 23f # NULL ptr - zero out remaining args
+ cmpl $__PAGE_OFFSET, %eax # access_ok(arg_ptr)
+ jae bad_copy_uatom
+130: movl (%eax), %eax
+ movl %eax, 28(%ecx)
+
+14: movl 32(%edx), %eax # atom->args[4] = *uatom->arg_ptr[4]
+ testl %eax, %eax
+ jz 24f # NULL ptr - zero out remaining args
+ cmpl $__PAGE_OFFSET, %eax # access_ok(arg_ptr)
+ jae bad_copy_uatom
+140: movl (%eax), %eax
+ movl %eax, 32(%ecx)
+
+15: movl 36(%edx), %eax # atom->args[5] = *uatom->arg_ptr[5]
+ testl %eax, %eax
+ jz 25f # NULL ptr - zero out remaining args
+ cmpl $__PAGE_OFFSET, %eax # access_ok(arg_ptr)
+ jae bad_copy_uatom
+150: movl (%eax), %eax
+ movl %eax, 36(%ecx)
+
+ xorl %eax, %eax # return 0
+ ret
+
+ #
+ # Zero out arg values. Since ptr was NULL, %eax is zero:
+ #
+20: movl %eax, 16(%ecx) # atom->args[0] = 0
+21: movl %eax, 20(%ecx) # atom->args[1] = 0
+22: movl %eax, 24(%ecx) # atom->args[2] = 0
+23: movl %eax, 28(%ecx) # atom->args[3] = 0
+24: movl %eax, 32(%ecx) # atom->args[4] = 0
+25: movl %eax, 36(%ecx) # atom->args[5] = 0
+
+ ret # return 0
+
+bad_copy_uatom:
+ movl $-14, %eax
+ ret
+
+.section __ex_table,"a"
+ .long 1b, bad_copy_uatom
+ .long 2b, bad_copy_uatom
+ .long 3b, bad_copy_uatom
+ .long 4b, bad_copy_uatom
+ .long 10b, bad_copy_uatom
+ .long 100b, bad_copy_uatom
+ .long 11b, bad_copy_uatom
+ .long 110b, bad_copy_uatom
+ .long 12b, bad_copy_uatom
+ .long 120b, bad_copy_uatom
+ .long 13b, bad_copy_uatom
+ .long 130b, bad_copy_uatom
+ .long 14b, bad_copy_uatom
+ .long 140b, bad_copy_uatom
+ .long 15b, bad_copy_uatom
+ .long 150b, bad_copy_uatom
+.previous
/*
* __get_user_X
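Both the C and the assembly versions of copy_uatom() implement the same
argument-fetching rule: walk arg_ptr[] in order, dereference each pointer, and
leave all remaining arguments zero once the first NULL pointer is met. A
minimal user-space sketch of just that rule follows; model_fetch_args() and
MODEL_NR_ARGS are hypothetical names, and the access_ok()/fault handling that
the real code needs is deliberately left out.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define MODEL_NR_ARGS 6	/* matches the six syscall arguments of an atom */

/*
 * Fetch argument values through arg_ptr[], stopping at the first NULL
 * pointer. Returns the number of arguments actually fetched; all
 * remaining entries of args[] are left at zero.
 */
static int model_fetch_args(unsigned long args[MODEL_NR_ARGS],
			    unsigned long *const arg_ptr[MODEL_NR_ARGS])
{
	int i;

	assert(args != NULL && arg_ptr != NULL);

	memset(args, 0, MODEL_NR_ARGS * sizeof(args[0]));

	for (i = 0; i < MODEL_NR_ARGS; i++) {
		if (!arg_ptr[i])
			break;		/* pointers after a NULL are never touched */
		args[i] = *arg_ptr[i];
	}
	return i;
}
```

Note that a non-NULL pointer placed after a NULL one is ignored, which is what
allows user-space to leave the tail of the pointer array unused.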
From: Ingo Molnar <[email protected]>
Provide an optimized assembly version of sys_umem_add().
It is about 2 times faster than the C version.
Signed-off-by: Ingo Molnar <[email protected]>
Signed-off-by: Arjan van de Ven <[email protected]>
---
arch/i386/lib/getuser.S | 27 +++++++++++++++++++++++++++
1 file changed, 27 insertions(+)
Index: linux/arch/i386/lib/getuser.S
===================================================================
--- linux.orig/arch/i386/lib/getuser.S
+++ linux/arch/i386/lib/getuser.S
@@ -11,6 +11,33 @@
#include <asm/thread_info.h>
/*
+ * Optimized user-memory arithmetics system-call:
+ */
+
+.text
+.align 4
+.globl sys_umem_add
+sys_umem_add:
+ movl 0x4(%esp), %ecx # uptr
+ movl 0x8(%esp), %edx # inc
+
+ cmpl $__PAGE_OFFSET, %ecx # access_ok(uptr)
+ jae bad_sys_umem_add
+
+1: addl %edx, (%ecx) # *uptr += inc
+2: movl (%ecx), %eax # return *uptr
+ ret
+
+bad_sys_umem_add:
+ movl $-14, %eax
+ ret
+
+.section __ex_table,"a"
+ .long 1b, bad_sys_umem_add
+ .long 2b, bad_sys_umem_add
+.previous
+
+/*
* copy_uatom() - copy a syslet_atom from user-space into kernel-space:
*/
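The semantics shared by the generic C sys_umem_add() and this assembly version
are small enough to state as a user-space model: add inc to the word at the
given address and return the new value, with inc == 0 acting as a plain read.
The sketch below only illustrates that contract; model_umem_add() is a
hypothetical name and the -EFAULT paths are not modelled.

```c
#include <assert.h>
#include <stddef.h>

/* Add 'inc' to *ptr and return the new value; inc == 0 is a plain read. */
static unsigned long model_umem_add(unsigned long *ptr, unsigned long inc)
{
	assert(ptr != NULL);

	if (!inc)
		return *ptr;

	*ptr += inc;	/* unsigned arithmetic: adding (unsigned long)-1 decrements */
	return *ptr;
}
```

Because the arithmetic is unsigned, passing (unsigned long)-1 as inc counts a
variable down by one, so an atom of this kind could plausibly serve as a loop
counter whose return value feeds a stop condition.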
From: Arjan van de Ven <[email protected]>
Split the FPU save area from the task struct. This allows easy migration
of FPU context, and it's generally cleaner. It also allows the following
two (future) optimizations:
1) allocate the right size for the actual CPU rather than always 512 bytes
2) only allocate when the application actually uses the FPU, i.e. on the
first lazy FPU trap. This could save memory for apps that do not use the FPU.
Signed-off-by: Arjan van de Ven <[email protected]>
Signed-off-by: Ingo Molnar <[email protected]>
---
arch/i386/kernel/i387.c | 96 ++++++++++++++++++++---------------------
arch/i386/kernel/process.c | 56 +++++++++++++++++++++++
arch/i386/kernel/traps.c | 10 ----
include/asm-i386/i387.h | 6 +-
include/asm-i386/processor.h | 6 ++
include/asm-i386/thread_info.h | 6 ++
kernel/fork.c | 7 ++
7 files changed, 123 insertions(+), 64 deletions(-)
Index: linux/arch/i386/kernel/i387.c
===================================================================
--- linux.orig/arch/i386/kernel/i387.c
+++ linux/arch/i386/kernel/i387.c
@@ -31,9 +31,9 @@ void mxcsr_feature_mask_init(void)
unsigned long mask = 0;
clts();
if (cpu_has_fxsr) {
- memset(&current->thread.i387.fxsave, 0, sizeof(struct i387_fxsave_struct));
- asm volatile("fxsave %0" : : "m" (current->thread.i387.fxsave));
- mask = current->thread.i387.fxsave.mxcsr_mask;
+ memset(&current->thread.i387->fxsave, 0, sizeof(struct i387_fxsave_struct));
+ asm volatile("fxsave %0" : : "m" (current->thread.i387->fxsave));
+ mask = current->thread.i387->fxsave.mxcsr_mask;
if (mask == 0) mask = 0x0000ffbf;
}
mxcsr_feature_mask &= mask;
@@ -49,16 +49,16 @@ void mxcsr_feature_mask_init(void)
void init_fpu(struct task_struct *tsk)
{
if (cpu_has_fxsr) {
- memset(&tsk->thread.i387.fxsave, 0, sizeof(struct i387_fxsave_struct));
- tsk->thread.i387.fxsave.cwd = 0x37f;
+ memset(&tsk->thread.i387->fxsave, 0, sizeof(struct i387_fxsave_struct));
+ tsk->thread.i387->fxsave.cwd = 0x37f;
if (cpu_has_xmm)
- tsk->thread.i387.fxsave.mxcsr = 0x1f80;
+ tsk->thread.i387->fxsave.mxcsr = 0x1f80;
} else {
- memset(&tsk->thread.i387.fsave, 0, sizeof(struct i387_fsave_struct));
- tsk->thread.i387.fsave.cwd = 0xffff037fu;
- tsk->thread.i387.fsave.swd = 0xffff0000u;
- tsk->thread.i387.fsave.twd = 0xffffffffu;
- tsk->thread.i387.fsave.fos = 0xffff0000u;
+ memset(&tsk->thread.i387->fsave, 0, sizeof(struct i387_fsave_struct));
+ tsk->thread.i387->fsave.cwd = 0xffff037fu;
+ tsk->thread.i387->fsave.swd = 0xffff0000u;
+ tsk->thread.i387->fsave.twd = 0xffffffffu;
+ tsk->thread.i387->fsave.fos = 0xffff0000u;
}
/* only the device not available exception or ptrace can call init_fpu */
set_stopped_child_used_math(tsk);
@@ -152,18 +152,18 @@ static inline unsigned long twd_fxsr_to_
unsigned short get_fpu_cwd( struct task_struct *tsk )
{
if ( cpu_has_fxsr ) {
- return tsk->thread.i387.fxsave.cwd;
+ return tsk->thread.i387->fxsave.cwd;
} else {
- return (unsigned short)tsk->thread.i387.fsave.cwd;
+ return (unsigned short)tsk->thread.i387->fsave.cwd;
}
}
unsigned short get_fpu_swd( struct task_struct *tsk )
{
if ( cpu_has_fxsr ) {
- return tsk->thread.i387.fxsave.swd;
+ return tsk->thread.i387->fxsave.swd;
} else {
- return (unsigned short)tsk->thread.i387.fsave.swd;
+ return (unsigned short)tsk->thread.i387->fsave.swd;
}
}
@@ -171,9 +171,9 @@ unsigned short get_fpu_swd( struct task_
unsigned short get_fpu_twd( struct task_struct *tsk )
{
if ( cpu_has_fxsr ) {
- return tsk->thread.i387.fxsave.twd;
+ return tsk->thread.i387->fxsave.twd;
} else {
- return (unsigned short)tsk->thread.i387.fsave.twd;
+ return (unsigned short)tsk->thread.i387->fsave.twd;
}
}
#endif /* 0 */
@@ -181,7 +181,7 @@ unsigned short get_fpu_twd( struct task_
unsigned short get_fpu_mxcsr( struct task_struct *tsk )
{
if ( cpu_has_xmm ) {
- return tsk->thread.i387.fxsave.mxcsr;
+ return tsk->thread.i387->fxsave.mxcsr;
} else {
return 0x1f80;
}
@@ -192,27 +192,27 @@ unsigned short get_fpu_mxcsr( struct tas
void set_fpu_cwd( struct task_struct *tsk, unsigned short cwd )
{
if ( cpu_has_fxsr ) {
- tsk->thread.i387.fxsave.cwd = cwd;
+ tsk->thread.i387->fxsave.cwd = cwd;
} else {
- tsk->thread.i387.fsave.cwd = ((long)cwd | 0xffff0000u);
+ tsk->thread.i387->fsave.cwd = ((long)cwd | 0xffff0000u);
}
}
void set_fpu_swd( struct task_struct *tsk, unsigned short swd )
{
if ( cpu_has_fxsr ) {
- tsk->thread.i387.fxsave.swd = swd;
+ tsk->thread.i387->fxsave.swd = swd;
} else {
- tsk->thread.i387.fsave.swd = ((long)swd | 0xffff0000u);
+ tsk->thread.i387->fsave.swd = ((long)swd | 0xffff0000u);
}
}
void set_fpu_twd( struct task_struct *tsk, unsigned short twd )
{
if ( cpu_has_fxsr ) {
- tsk->thread.i387.fxsave.twd = twd_i387_to_fxsr(twd);
+ tsk->thread.i387->fxsave.twd = twd_i387_to_fxsr(twd);
} else {
- tsk->thread.i387.fsave.twd = ((long)twd | 0xffff0000u);
+ tsk->thread.i387->fsave.twd = ((long)twd | 0xffff0000u);
}
}
@@ -298,8 +298,8 @@ static inline int save_i387_fsave( struc
struct task_struct *tsk = current;
unlazy_fpu( tsk );
- tsk->thread.i387.fsave.status = tsk->thread.i387.fsave.swd;
- if ( __copy_to_user( buf, &tsk->thread.i387.fsave,
+ tsk->thread.i387->fsave.status = tsk->thread.i387->fsave.swd;
+ if ( __copy_to_user( buf, &tsk->thread.i387->fsave,
sizeof(struct i387_fsave_struct) ) )
return -1;
return 1;
@@ -312,15 +312,15 @@ static int save_i387_fxsave( struct _fps
unlazy_fpu( tsk );
- if ( convert_fxsr_to_user( buf, &tsk->thread.i387.fxsave ) )
+ if ( convert_fxsr_to_user( buf, &tsk->thread.i387->fxsave ) )
return -1;
- err |= __put_user( tsk->thread.i387.fxsave.swd, &buf->status );
+ err |= __put_user( tsk->thread.i387->fxsave.swd, &buf->status );
err |= __put_user( X86_FXSR_MAGIC, &buf->magic );
if ( err )
return -1;
- if ( __copy_to_user( &buf->_fxsr_env[0], &tsk->thread.i387.fxsave,
+ if ( __copy_to_user( &buf->_fxsr_env[0], &tsk->thread.i387->fxsave,
sizeof(struct i387_fxsave_struct) ) )
return -1;
return 1;
@@ -343,7 +343,7 @@ int save_i387( struct _fpstate __user *b
return save_i387_fsave( buf );
}
} else {
- return save_i387_soft( &current->thread.i387.soft, buf );
+ return save_i387_soft( &current->thread.i387->soft, buf );
}
}
@@ -351,7 +351,7 @@ static inline int restore_i387_fsave( st
{
struct task_struct *tsk = current;
clear_fpu( tsk );
- return __copy_from_user( &tsk->thread.i387.fsave, buf,
+ return __copy_from_user( &tsk->thread.i387->fsave, buf,
sizeof(struct i387_fsave_struct) );
}
@@ -360,11 +360,11 @@ static int restore_i387_fxsave( struct _
int err;
struct task_struct *tsk = current;
clear_fpu( tsk );
- err = __copy_from_user( &tsk->thread.i387.fxsave, &buf->_fxsr_env[0],
+ err = __copy_from_user( &tsk->thread.i387->fxsave, &buf->_fxsr_env[0],
sizeof(struct i387_fxsave_struct) );
/* mxcsr reserved bits must be masked to zero for security reasons */
- tsk->thread.i387.fxsave.mxcsr &= mxcsr_feature_mask;
- return err ? 1 : convert_fxsr_from_user( &tsk->thread.i387.fxsave, buf );
+ tsk->thread.i387->fxsave.mxcsr &= mxcsr_feature_mask;
+ return err ? 1 : convert_fxsr_from_user( &tsk->thread.i387->fxsave, buf );
}
int restore_i387( struct _fpstate __user *buf )
@@ -378,7 +378,7 @@ int restore_i387( struct _fpstate __user
err = restore_i387_fsave( buf );
}
} else {
- err = restore_i387_soft( &current->thread.i387.soft, buf );
+ err = restore_i387_soft( &current->thread.i387->soft, buf );
}
set_used_math();
return err;
@@ -391,7 +391,7 @@ int restore_i387( struct _fpstate __user
static inline int get_fpregs_fsave( struct user_i387_struct __user *buf,
struct task_struct *tsk )
{
- return __copy_to_user( buf, &tsk->thread.i387.fsave,
+ return __copy_to_user( buf, &tsk->thread.i387->fsave,
sizeof(struct user_i387_struct) );
}
@@ -399,7 +399,7 @@ static inline int get_fpregs_fxsave( str
struct task_struct *tsk )
{
return convert_fxsr_to_user( (struct _fpstate __user *)buf,
- &tsk->thread.i387.fxsave );
+ &tsk->thread.i387->fxsave );
}
int get_fpregs( struct user_i387_struct __user *buf, struct task_struct *tsk )
@@ -411,7 +411,7 @@ int get_fpregs( struct user_i387_struct
return get_fpregs_fsave( buf, tsk );
}
} else {
- return save_i387_soft( &tsk->thread.i387.soft,
+ return save_i387_soft( &tsk->thread.i387->soft,
(struct _fpstate __user *)buf );
}
}
@@ -419,14 +419,14 @@ int get_fpregs( struct user_i387_struct
static inline int set_fpregs_fsave( struct task_struct *tsk,
struct user_i387_struct __user *buf )
{
- return __copy_from_user( &tsk->thread.i387.fsave, buf,
+ return __copy_from_user( &tsk->thread.i387->fsave, buf,
sizeof(struct user_i387_struct) );
}
static inline int set_fpregs_fxsave( struct task_struct *tsk,
struct user_i387_struct __user *buf )
{
- return convert_fxsr_from_user( &tsk->thread.i387.fxsave,
+ return convert_fxsr_from_user( &tsk->thread.i387->fxsave,
(struct _fpstate __user *)buf );
}
@@ -439,7 +439,7 @@ int set_fpregs( struct task_struct *tsk,
return set_fpregs_fsave( tsk, buf );
}
} else {
- return restore_i387_soft( &tsk->thread.i387.soft,
+ return restore_i387_soft( &tsk->thread.i387->soft,
(struct _fpstate __user *)buf );
}
}
@@ -447,7 +447,7 @@ int set_fpregs( struct task_struct *tsk,
int get_fpxregs( struct user_fxsr_struct __user *buf, struct task_struct *tsk )
{
if ( cpu_has_fxsr ) {
- if (__copy_to_user( buf, &tsk->thread.i387.fxsave,
+ if (__copy_to_user( buf, &tsk->thread.i387->fxsave,
sizeof(struct user_fxsr_struct) ))
return -EFAULT;
return 0;
@@ -461,11 +461,11 @@ int set_fpxregs( struct task_struct *tsk
int ret = 0;
if ( cpu_has_fxsr ) {
- if (__copy_from_user( &tsk->thread.i387.fxsave, buf,
+ if (__copy_from_user( &tsk->thread.i387->fxsave, buf,
sizeof(struct user_fxsr_struct) ))
ret = -EFAULT;
/* mxcsr reserved bits must be masked to zero for security reasons */
- tsk->thread.i387.fxsave.mxcsr &= mxcsr_feature_mask;
+ tsk->thread.i387->fxsave.mxcsr &= mxcsr_feature_mask;
} else {
ret = -EIO;
}
@@ -479,7 +479,7 @@ int set_fpxregs( struct task_struct *tsk
static inline void copy_fpu_fsave( struct task_struct *tsk,
struct user_i387_struct *fpu )
{
- memcpy( fpu, &tsk->thread.i387.fsave,
+ memcpy( fpu, &tsk->thread.i387->fsave,
sizeof(struct user_i387_struct) );
}
@@ -490,10 +490,10 @@ static inline void copy_fpu_fxsave( stru
unsigned short *from;
int i;
- memcpy( fpu, &tsk->thread.i387.fxsave, 7 * sizeof(long) );
+ memcpy( fpu, &tsk->thread.i387->fxsave, 7 * sizeof(long) );
to = (unsigned short *)&fpu->st_space[0];
- from = (unsigned short *)&tsk->thread.i387.fxsave.st_space[0];
+ from = (unsigned short *)&tsk->thread.i387->fxsave.st_space[0];
for ( i = 0 ; i < 8 ; i++, to += 5, from += 8 ) {
memcpy( to, from, 5 * sizeof(unsigned short) );
}
@@ -540,7 +540,7 @@ int dump_task_extended_fpu(struct task_s
if (fpvalid) {
if (tsk == current)
unlazy_fpu(tsk);
- memcpy(fpu, &tsk->thread.i387.fxsave, sizeof(*fpu));
+ memcpy(fpu, &tsk->thread.i387->fxsave, sizeof(*fpu));
}
return fpvalid;
}
Index: linux/arch/i386/kernel/process.c
===================================================================
--- linux.orig/arch/i386/kernel/process.c
+++ linux/arch/i386/kernel/process.c
@@ -645,7 +645,7 @@ struct task_struct fastcall * __switch_t
/* we're going to use this soon, after a few expensive things */
if (next_p->fpu_counter > 5)
- prefetch(&next->i387.fxsave);
+ prefetch(&next->i387->fxsave);
/*
* Reload esp0.
@@ -908,3 +908,57 @@ unsigned long arch_align_stack(unsigned
sp -= get_random_int() % 8192;
return sp & ~0xf;
}
+
+
+
+struct kmem_cache *task_struct_cachep;
+struct kmem_cache *task_i387_cachep;
+
+struct task_struct * alloc_task_struct(void)
+{
+ struct task_struct *tsk;
+ tsk = kmem_cache_alloc(task_struct_cachep, GFP_KERNEL);
+ if (!tsk)
+ return NULL;
+ tsk->thread.i387 = kmem_cache_alloc(task_i387_cachep, GFP_KERNEL);
+ if (!tsk->thread.i387)
+ goto error;
+ WARN_ON((unsigned long)tsk->thread.i387 & 15);
+ return tsk;
+
+error:
+ kmem_cache_free(task_struct_cachep, tsk);
+ return NULL;
+}
+
+void memcpy_task_struct(struct task_struct *dst, struct task_struct *src)
+{
+ union i387_union *ptr;
+ ptr = dst->thread.i387;
+ *dst = *src;
+ dst->thread.i387 = ptr;
+ memcpy(dst->thread.i387, src->thread.i387, sizeof(union i387_union));
+}
+
+void free_task_struct(struct task_struct *tsk)
+{
+ kmem_cache_free(task_i387_cachep, tsk->thread.i387);
+ tsk->thread.i387=NULL;
+ kmem_cache_free(task_struct_cachep, tsk);
+}
+
+
+void task_struct_slab_init(void)
+{
+ /* create a slab on which task_structs can be allocated */
+ task_struct_cachep =
+ kmem_cache_create("task_struct", sizeof(struct task_struct),
+ ARCH_MIN_TASKALIGN, SLAB_PANIC, NULL, NULL);
+ task_i387_cachep =
+ kmem_cache_create("task_i387", sizeof(union i387_union), 32,
+ SLAB_PANIC | SLAB_MUST_HWCACHE_ALIGN, NULL, NULL);
+}
+
+
+/* the very init task needs a static allocated i387 area */
+union i387_union init_i387_context;
Index: linux/arch/i386/kernel/traps.c
===================================================================
--- linux.orig/arch/i386/kernel/traps.c
+++ linux/arch/i386/kernel/traps.c
@@ -1154,16 +1154,6 @@ void __init trap_init(void)
set_trap_gate(19,&simd_coprocessor_error);
if (cpu_has_fxsr) {
- /*
- * Verify that the FXSAVE/FXRSTOR data will be 16-byte aligned.
- * Generates a compile-time "error: zero width for bit-field" if
- * the alignment is wrong.
- */
- struct fxsrAlignAssert {
- int _:!(offsetof(struct task_struct,
- thread.i387.fxsave) & 15);
- };
-
printk(KERN_INFO "Enabling fast FPU save and restore... ");
set_in_cr4(X86_CR4_OSFXSR);
printk("done.\n");
Index: linux/include/asm-i386/i387.h
===================================================================
--- linux.orig/include/asm-i386/i387.h
+++ linux/include/asm-i386/i387.h
@@ -34,7 +34,7 @@ extern void init_fpu(struct task_struct
"nop ; frstor %1", \
"fxrstor %1", \
X86_FEATURE_FXSR, \
- "m" ((tsk)->thread.i387.fxsave))
+ "m" ((tsk)->thread.i387->fxsave))
extern void kernel_fpu_begin(void);
#define kernel_fpu_end() do { stts(); preempt_enable(); } while(0)
@@ -60,8 +60,8 @@ static inline void __save_init_fpu( stru
"fxsave %[fx]\n"
"bt $7,%[fsw] ; jnc 1f ; fnclex\n1:",
X86_FEATURE_FXSR,
- [fx] "m" (tsk->thread.i387.fxsave),
- [fsw] "m" (tsk->thread.i387.fxsave.swd) : "memory");
+ [fx] "m" (tsk->thread.i387->fxsave),
+ [fsw] "m" (tsk->thread.i387->fxsave.swd) : "memory");
/* AMD K7/K8 CPUs don't save/restore FDP/FIP/FOP unless an exception
is pending. Clear the x87 state here by setting it to fixed
values. safe_address is a random variable that should be in L1 */
Index: linux/include/asm-i386/processor.h
===================================================================
--- linux.orig/include/asm-i386/processor.h
+++ linux/include/asm-i386/processor.h
@@ -407,7 +407,7 @@ struct thread_struct {
/* fault info */
unsigned long cr2, trap_no, error_code;
/* floating point info */
- union i387_union i387;
+ union i387_union *i387;
/* virtual 86 mode info */
struct vm86_struct __user * vm86_info;
unsigned long screen_bitmap;
@@ -420,11 +420,15 @@ struct thread_struct {
unsigned long io_bitmap_max;
};
+
+extern union i387_union init_i387_context;
+
#define INIT_THREAD { \
.vm86_info = NULL, \
.sysenter_cs = __KERNEL_CS, \
.io_bitmap_ptr = NULL, \
.gs = __KERNEL_PDA, \
+ .i387 = &init_i387_context, \
}
/*
Index: linux/include/asm-i386/thread_info.h
===================================================================
--- linux.orig/include/asm-i386/thread_info.h
+++ linux/include/asm-i386/thread_info.h
@@ -102,6 +102,12 @@ static inline struct thread_info *curren
#define free_thread_info(info) kfree(info)
+#define __HAVE_ARCH_TASK_STRUCT_ALLOCATOR
+extern struct task_struct * alloc_task_struct(void);
+extern void free_task_struct(struct task_struct *tsk);
+extern void memcpy_task_struct(struct task_struct *dst, struct task_struct *src);
+extern void task_struct_slab_init(void);
+
#else /* !__ASSEMBLY__ */
/* how to get the thread information struct from ASM */
Index: linux/kernel/fork.c
===================================================================
--- linux.orig/kernel/fork.c
+++ linux/kernel/fork.c
@@ -84,6 +84,8 @@ int nr_processes(void)
#ifndef __HAVE_ARCH_TASK_STRUCT_ALLOCATOR
# define alloc_task_struct() kmem_cache_alloc(task_struct_cachep, GFP_KERNEL)
# define free_task_struct(tsk) kmem_cache_free(task_struct_cachep, (tsk))
+# define memcpy_task_struct(dst, src) (*(dst) = *(src))
+
static struct kmem_cache *task_struct_cachep;
#endif
@@ -138,6 +140,8 @@ void __init fork_init(unsigned long memp
task_struct_cachep =
kmem_cache_create("task_struct", sizeof(struct task_struct),
ARCH_MIN_TASKALIGN, SLAB_PANIC, NULL, NULL);
+#else
+ task_struct_slab_init();
#endif
/*
@@ -176,7 +180,8 @@ static struct task_struct *dup_task_stru
return NULL;
}
- *tsk = *orig;
+ memcpy_task_struct(tsk, orig);
+
tsk->thread_info = ti;
setup_thread_stack(tsk, orig);
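The allocation pattern in the patch above (FPU state moved out of line,
and a struct copy that must not clobber the destination's own FPU
pointer) can be sketched in user-space C. This is only an illustration:
struct task, struct fpu_area and the task_*() helpers are hypothetical
stand-ins, and malloc()/aligned_alloc() take the place of the slab
caches.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for union i387_union (FXSAVE area is 512 bytes). */
struct fpu_area {
	unsigned char bytes[512];
};

/* Hypothetical stand-in for task_struct with out-of-line FPU state. */
struct task {
	int pid;
	struct fpu_area *fpu;
};

struct task *task_alloc(void)
{
	struct task *t = malloc(sizeof(*t));

	if (!t)
		return NULL;
	/* FXSAVE/FXRSTOR require a 16-byte aligned save area. */
	t->fpu = aligned_alloc(16, sizeof(struct fpu_area));
	if (!t->fpu) {
		free(t);	/* release with the allocator that produced it */
		return NULL;
	}
	t->pid = 0;
	memset(t->fpu, 0, sizeof(*t->fpu));
	return t;
}

/* Copy src into dst, but keep dst's own FPU area and deep-copy into it. */
void task_copy(struct task *dst, const struct task *src)
{
	struct fpu_area *own = dst->fpu;

	*dst = *src;		/* shallow copy overwrites dst->fpu ... */
	dst->fpu = own;		/* ... so restore the saved pointer */
	memcpy(dst->fpu, src->fpu, sizeof(*dst->fpu));
}

void task_free(struct task *t)
{
	free(t->fpu);
	free(t);
}
```

Without the save/restore around the struct assignment, both tasks would
end up pointing at the same FPU area and the second free would be a
double free.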
From: Ingo Molnar <[email protected]>
add the move_user_context() method to move the user-space
context of one kernel thread to another kernel thread.
User-space might notice the changed TID, but execution,
stack and register contents (general purpose and FPU) are
still the same.
An architecture must implement this interface before it can turn
CONFIG_ASYNC_SUPPORT on.
Signed-off-by: Ingo Molnar <[email protected]>
Signed-off-by: Arjan van de Ven <[email protected]>
---
arch/i386/kernel/process.c | 21 +++++++++++++++++++++
include/asm-i386/system.h | 7 +++++++
2 files changed, 28 insertions(+)
Index: linux/arch/i386/kernel/process.c
===================================================================
--- linux.orig/arch/i386/kernel/process.c
+++ linux/arch/i386/kernel/process.c
@@ -820,6 +820,27 @@ unsigned long get_wchan(struct task_stru
}
/*
+ * Move user-space context from one kernel thread to another.
+ * This includes registers and FPU state. Callers must make
+ * sure that neither task is running user context at the moment:
+ */
+void
+move_user_context(struct task_struct *new_task, struct task_struct *old_task)
+{
+ struct pt_regs *old_regs = task_pt_regs(old_task);
+ struct pt_regs *new_regs = task_pt_regs(new_task);
+ union i387_union *tmp;
+
+ *new_regs = *old_regs;
+ /*
+ * Flip around the FPU state too:
+ */
+ tmp = new_task->thread.i387;
+ new_task->thread.i387 = old_task->thread.i387;
+ old_task->thread.i387 = tmp;
+}
+
+/*
* sys_alloc_thread_area: get a yet unused TLS descriptor index.
*/
static int get_free_idx(void)
Index: linux/include/asm-i386/system.h
===================================================================
--- linux.orig/include/asm-i386/system.h
+++ linux/include/asm-i386/system.h
@@ -33,6 +33,13 @@ extern struct task_struct * FASTCALL(__s
"2" (prev), "d" (next)); \
} while (0)
+/*
+ * Move user-space context from one kernel thread to another.
+ * This includes registers and FPU state for now:
+ */
+extern void
+move_user_context(struct task_struct *new_task, struct task_struct *old_task);
+
#define _set_base(addr,base) do { unsigned long __pr; \
__asm__ __volatile__ ("movw %%dx,%1\n\t" \
"rorl $16,%%edx\n\t" \
From: Ingo Molnar <[email protected]>
wire up the new syslet / async system calls and thus make them
available to user-space.
Signed-off-by: Ingo Molnar <[email protected]>
Signed-off-by: Arjan van de Ven <[email protected]>
---
arch/i386/kernel/syscall_table.S | 6 ++++++
include/asm-i386/unistd.h | 8 +++++++-
2 files changed, 13 insertions(+), 1 deletion(-)
Index: linux/arch/i386/kernel/syscall_table.S
===================================================================
--- linux.orig/arch/i386/kernel/syscall_table.S
+++ linux/arch/i386/kernel/syscall_table.S
@@ -319,3 +319,9 @@ ENTRY(sys_call_table)
.long sys_move_pages
.long sys_getcpu
.long sys_epoll_pwait
+ .long sys_async_exec /* 320 */
+ .long sys_async_wait
+ .long sys_umem_add
+ .long sys_async_thread
+ .long sys_threadlet_on
+ .long sys_threadlet_off /* 325 */
Index: linux/include/asm-i386/unistd.h
===================================================================
--- linux.orig/include/asm-i386/unistd.h
+++ linux/include/asm-i386/unistd.h
@@ -325,10 +325,16 @@
#define __NR_move_pages 317
#define __NR_getcpu 318
#define __NR_epoll_pwait 319
+#define __NR_async_exec 320
+#define __NR_async_wait 321
+#define __NR_umem_add 322
+#define __NR_async_thread 323
+#define __NR_threadlet_on 324
+#define __NR_threadlet_off 325
#ifdef __KERNEL__
-#define NR_syscalls 320
+#define NR_syscalls 326
#define __ARCH_WANT_IPC_PARSE_VERSION
#define __ARCH_WANT_OLD_READDIR
On 2/21/07, Ingo Molnar <[email protected]> wrote:
> I believe this threadlet concept is what user-space will want to use for
> programmable parallelism.
This is brilliant. Now it needs just four more things:
1) Documentation of what you can and can't do safely from a threadlet,
given that it runs in an unknown thread context;
2) Facilities for manipulating pools of threadlets, so you can
throttle their concurrency, reprioritize them, and cancel them in
bulk, disposing safely of any dynamically allocated memory,
synchronization primitives, and so forth that they may be holding;
3) Reworked threadlet scheduling to allow tens of thousands of blocked
threadlets to be dispatched efficiently in a controlled, throttled,
non-cache-and-MMU-thrashing manner, immediately following the softirq
that unblocks the I/O they're waiting on; and
4) AIO vsyscalls whose semantics resemble those of IEEE 754 floating
point operations, with a clear distinction between a) pipeline state
vs. operands, b) results vs. side effects, and c) coding errors vs.
not-a-number results vs. exceptions that cost you a pipeline flush and
nonlocal branch.
When these four problems are solved (and possibly one or two more that
I'm not thinking of), you will have caught up with the state of the
art in massively parallel event-driven cooperative multitasking
frameworks. This would be a really, really good thing for Linux and
its users.
Cheers,
- Michael
* Michael K. Edwards <[email protected]> wrote:
> 3) Reworked threadlet scheduling to allow tens of thousands of blocked
> threadlets to be dispatched efficiently in a controlled, throttled,
> non-cache-and-MMU-thrashing manner, immediately following the softirq
> that unblocks the I/O they're waiting on; and
threadlets, when they don't block, are just regular user-space function
calls - so no need to schedule or throttle them. [*]
threadlets, when they block, are regular kernel threads, so the regular
O(1) scheduler takes care of them. If MMU thrashing is of any concern
then syslets should be used to implement the most performance-critical
events: under Linux a kernel thread that does not exit out to user-space
does not do any TLB switching at all. (even if there are multiple
processes active and their syslets intermix)
throttling of outstanding async contexts is most easily done by
user-space - you can see an example in threadlet-test.c, but there's
also fio/engines/syslet-rw.c. v2 had a kernel-space throttling mechanism
as well, i'll probably reintroduce that in later versions.
Ingo
[*] although certain more advanced scheduling tactics like the detection
of frequently executed threadlet functions and their pushing out to
separate contexts is possible too - but this is an optional add-on
and for later.
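A minimal sketch of what user-space throttling of outstanding async
contexts could look like. This is not the code from threadlet-test.c or
fio; struct throttle and both helpers are hypothetical, and the sketch
assumes only the head context touches the counter (otherwise atomics
would be needed).

```c
#include <stdbool.h>

/* Hypothetical user-space throttle; not taken from threadlet-test.c. */
struct throttle {
	unsigned int in_flight;	/* async contexts currently outstanding */
	unsigned int limit;	/* maximum allowed outstanding contexts */
};

/* Returns true if the caller may queue one more async request. */
bool throttle_try_start(struct throttle *t)
{
	if (t->in_flight >= t->limit)
		return false;	/* caller should wait for a completion first */
	t->in_flight++;
	return true;
}

/* Called once per completion notification (futex, poll(), signal, ...). */
void throttle_complete(struct throttle *t)
{
	if (t->in_flight > 0)
		t->in_flight--;
}
```

The submission loop would call throttle_try_start() before each request
that may go async and block on its completion mechanism whenever it
returns false.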
On Wed, 21 Feb 2007, Ingo Molnar wrote:
> From: Ingo Molnar <[email protected]>
>
> add the move_user_context() method to move the user-space
> context of one kernel thread to another kernel thread.
> User-space might notice the changed TID, but execution,
> stack and register contents (general purpose and FPU) are
> still the same.
Signal handling should also be maintained, on top of the TID.
You don't want the user to be presented with different signal handling
after a sys_async_exec call.
> Signed-off-by: Ingo Molnar <[email protected]>
> Signed-off-by: Arjan van de Ven <[email protected]>
> ---
> arch/i386/kernel/process.c | 21 +++++++++++++++++++++
> include/asm-i386/system.h | 7 +++++++
> 2 files changed, 28 insertions(+)
>
> Index: linux/arch/i386/kernel/process.c
> ===================================================================
> --- linux.orig/arch/i386/kernel/process.c
> +++ linux/arch/i386/kernel/process.c
> @@ -820,6 +820,27 @@ unsigned long get_wchan(struct task_stru
> }
>
> /*
> + * Move user-space context from one kernel thread to another.
> + * This includes registers and FPU state. Callers must make
> + * sure that neither task is running user context at the moment:
> + */
> +void
> +move_user_context(struct task_struct *new_task, struct task_struct *old_task)
> +{
> + struct pt_regs *old_regs = task_pt_regs(old_task);
> + struct pt_regs *new_regs = task_pt_regs(new_task);
> + union i387_union *tmp;
> +
> + *new_regs = *old_regs;
> + /*
> + * Flip around the FPU state too:
> + */
> + tmp = new_task->thread.i387;
> + new_task->thread.i387 = old_task->thread.i387;
> + old_task->thread.i387 = tmp;
> +}
This is not going to work in this case (already posted twice in other
emails):
---
Given TS_USEDFPU set (NTSK == new_task, OTSK == old_task), before
move_user_context():
CPU => FPUc
NTSK => FPUn
OTSK => FPUo
After move_user_context():
CPU => FPUc
NTSK => FPUo
OTSK => FPUn
After the incoming __unlazy_fpu() in __switch_to():
CPU => FPUc
NTSK => FPUo
OTSK => FPUc
After the first fault in NTSK:
CPU => FPUo
NTSK => FPUo
OTSK => FPUc
So NTSK loads a stale FPUo instead of the FPUc that was the "dirty"
context to migrate (since TS_USEDFPU was set). I think you need an early
__unlazy_fpu() in that case, which would turn the above into:
Before move_user_context():
CPU => FPUc
NTSK => FPUn
OTSK => FPUo
After an early __unlazy_fpu() before FPU member swap:
CPU => FPUc
NTSK => FPUn
OTSK => FPUc
After move_user_context():
CPU => FPUc
NTSK => FPUc
OTSK => FPUn
After the first fault in NTSK:
CPU => FPUc
NTSK => FPUc
OTSK => FPUn
So NTSK (the return-to-userspace task) will get the correct FPUc after a
fault. But OTSK (now becoming the service thread) will load FPUn after a
fault, which is not what is expected. You may need a copy in that case.
I think correct FPU context handling is not going to be as easy as
swapping FPU pointers.
- Davide
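The two state tables in the mail above can be replayed with a toy model
of lazy FPU switching. Everything here is illustrative: one char stands
in for a whole FPU context, toy_unlazy() and toy_fault() only mimic what
__unlazy_fpu() and the device-not-available fault do, and none of this
is kernel code.

```c
/* Toy model of lazy FPU switching; all names here are illustrative. */
struct toy_task {
	char *save_area;	/* stands in for thread.i387 */
	int used_fpu;		/* stands in for TS_USEDFPU */
};

static char cpu_fpu;		/* live FPU register contents */

/* Like __unlazy_fpu(): flush live registers if the task owns them. */
static void toy_unlazy(struct toy_task *t)
{
	if (t->used_fpu) {
		*t->save_area = cpu_fpu;
		t->used_fpu = 0;
	}
}

/* Like the device-not-available fault: reload from the save area. */
static void toy_fault(struct toy_task *t)
{
	cpu_fpu = *t->save_area;
	t->used_fpu = 1;
}

/* The pointer flip done by move_user_context(). */
static void toy_swap(struct toy_task *a, struct toy_task *b)
{
	char *tmp = a->save_area;

	a->save_area = b->save_area;
	b->save_area = tmp;
}

/* Returns the FPU state NTSK ends up running with. */
char toy_migrate(int early_unlazy)
{
	char area_n = 'n', area_o = 'o';
	struct toy_task ntsk = { &area_n, 0 };
	struct toy_task otsk = { &area_o, 1 };

	cpu_fpu = 'c';			/* dirty state that must follow NTSK */
	if (early_unlazy)
		toy_unlazy(&otsk);	/* FPUc lands in OTSK's area first */
	toy_swap(&ntsk, &otsk);
	toy_unlazy(&otsk);		/* __switch_to() away from OTSK */
	toy_fault(&ntsk);		/* first FPU use in NTSK */
	return cpu_fpu;
}
```

In this model toy_migrate(0) hands NTSK the stale 'o' and toy_migrate(1)
hands it the dirty 'c', matching the two tables. The model does not
cover the second concern raised above, namely what OTSK loads afterwards.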
* Michael K. Edwards <[email protected]> wrote:
> 1) Documentation of what you can and can't do safely from a threadlet,
> given that it runs in an unknown thread context;
you can do just about anything from a threadlet, using bog standard
procedural programming. (Certain system-calls are excluded at the moment
out of caution - but i'll probably lift restrictions like sys_clone()
use because sys_clone() can be done safely from a threadlet.)
The code must be thread-safe, because the kernel can move execution to a
new thread anytime and then it will execute in parallel with the main
thread. There's no other requirement.
Wrt. performance, one good model is to run request-alike functionality
from a threadlet, to maximize parallelism.
ingo
* Davide Libenzi <[email protected]> wrote:
> On Wed, 21 Feb 2007, Ingo Molnar wrote:
>
> > From: Ingo Molnar <[email protected]>
> >
> > add the move_user_context() method to move the user-space
> > context of one kernel thread to another kernel thread.
> > User-space might notice the changed TID, but execution,
> > stack and register contents (general purpose and FPU) are
> > still the same.
>
> Also signal handling should/must be maintained, on top of TID. You
> don't want the user to be presented with different signal handling
> after a sys_async_exec call.
right now CLONE_SIGNAL and CLONE_SIGHAND are used for new async threads,
so they should inherit and share all the signal settings.
one area that definitely needs more work is that the ptrace parent (if
any) should probably follow the 'head' context. gdb at the moment copes
surprisingly well, but some artifacts are visible every now and then.
> > + *new_regs = *old_regs;
> > + /*
> > + * Flip around the FPU state too:
> > + */
> > + tmp = new_task->thread.i387;
> > + new_task->thread.i387 = old_task->thread.i387;
> > + old_task->thread.i387 = tmp;
> > +}
>
> This is not going to work in this case (already posted twice in other
> emails):
i'm really sorry - i still have a huge email backlog.
> So NTSK loads a non up2date FPUo, instead of the FPUc that was the
> "dirty" context to migrate (since TS_USEDFPU was set). I think you
> need an early __unlazy_fpu() in that case, that would turn the above
> into:
yes. My plan is to avoid all these problems by having a
special-purpose sched_yield_to(old_task, new_task) function.
this, besides being even faster than the default scheduler (because the
runqueue balance does not change so no real scheduling decision has to
be done - the true scheduling decisions happen later on at async-wakeup
time), should also avoid all the FPU races: the FPU just gets flipped
between old_task and new_task (and TS_USEDFPU needs to be moved as well,
etc.). No intermediate task can come in between.
can you see a hole in this sched_yield_to() method as well?
Ingo
* Michael K. Edwards <[email protected]> wrote:
> 2) Facilities for manipulating pools of threadlets, so you can
> throttle their concurrency, reprioritize them, and cancel them in
> bulk, disposing safely of any dynamically allocated memory,
> synchronization primitives, and so forth that they may be holding;
pthread_cancel() [if/once threadlets are integrated into pthreads] ought
to do that. A threadlet, if it gets moved to an async context, is a
full-blown thread.
Ingo
* Michael K. Edwards <[email protected]> wrote:
> 4) AIO vsyscalls whose semantics resemble those of IEEE 754 floating
> point operations, with a clear distinction between a) pipeline state
> vs. operands, b) results vs. side effects, and c) coding errors vs.
> not-a-number results vs. exceptions that cost you a pipeline flush and
> nonlocal branch.
threadlets (and syslets) are parallel contexts and they behave as such -
queuing and execution semantics are then on top of that, implemented
either by glibc, or implemented by the application. There is no
'pipeline' of requests imposed - the structure of pending requests is
totally free-form. For example in threadlet-test.c i've in essence
implemented a 'set of requests' with the submission site only interested
in whether all requests are done or not - but any stricter (or even
looser) semantics and ordering can be used too.
in terms of AIO, the best queueing model is i think what the kernel uses
internally: freely ordered, with barrier support. (That is equivalent to
a "queue of sets", where the queue are the barriers, and the sets are
the requests within barriers. If there is no barrier pending then
there's just one large freely-ordered set of requests.)
Ingo
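The "queue of sets" view described above can be sketched as follows. The
enum, the function name and the exact ordering rule (everything before a
barrier completes before anything after it starts) are assumptions made
for illustration, not an existing kernel or glibc interface.

```c
#include <stddef.h>

enum op_kind { OP_REQUEST, OP_BARRIER };

/*
 * Assign each request the index of the set it belongs to; barriers get -1.
 * Requests sharing an index may complete in any order, but every request
 * in set k must complete before any request in set k+1 is started.
 * Returns the number of sets (barriers + 1).
 */
size_t assign_sets(const enum op_kind *ops, size_t n, int *set_of)
{
	int cur_set = 0;
	size_t i;

	for (i = 0; i < n; i++) {
		if (ops[i] == OP_BARRIER) {
			set_of[i] = -1;
			cur_set++;	/* a barrier closes the current set */
		} else {
			set_of[i] = cur_set;
		}
	}
	return (size_t)cur_set + 1;
}
```

With no barrier in the submission stream the function returns 1, i.e.
one large freely-ordered set of requests.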
Ingo Molnar wrote:
> in terms of AIO, the best queueing model is i think what the kernel uses
> internally: freely ordered, with barrier support.
Speaking of AIO, how do you imagine lio_listio is implemented? If there
is no asynchronous syscall it would mean creating a threadlet for each
request but this means either waiting or creating several/many threads.
--
➧ Ulrich Drepper ➧ Red Hat, Inc. ➧ 444 Castro St ➧ Mountain View, CA ❖
On 2/21/07, Ingo Molnar <[email protected]> wrote:
> threadlets, when they don't block, are just regular user-space function
> calls - so no need to schedule or throttle them. [*]
Right. That's a great design feature.
> threadlets, when they block, are regular kernel threads, so the regular
> O(1) scheduler takes care of them. If MMU thrashing is of any concern
> then syslets should be used to implement the most performance-critical
> events: under Linux a kernel thread that does not exit out to user-space
> does not do any TLB switching at all. (even if there are multiple
> processes active and their syslets intermix)
As far as I am concerned syslets by themselves are a dead letter,
because you can't do any of the things that potential application
coders need to do with them. As for threadlets, making them kernel
threads is not such a good design feature, O(1) scheduler or not. You
take the full hit of kernel task creation, on the spot, for every
threadlet that blocks. You don't fence off the threadlet from any of
the stuff that it ought to be fenced off from for thread-safety
reasons, so you don't have much choice but to create a new TLS arena
for it (which you need anyway for errno), so they have horrible MMU
and memory overhead. You can't dispatch them inexpensively, while the
data delivered by a softirq is still hot in cache, to traverse 1-3
lines of userspace code and make the next syscall. So they're just
not usable for any of the things that a real AIO application actually
does.
> throttling of outstanding async contexts is most easily done by
> user-space - you can see an example in threadlet-test.c, but there's
> also fio/engines/syslet-rw.c. v2 had a kernel-space throttling mechanism
> as well, i'll probably reintroduce that in later versions.
You're telling me that scheduling parallel I/O is the kernel's job but
throttling it is userspace's job?
> [*] although certain more advanced scheduling tactics like the detection
> of frequently executed threadlet functions and their pushing out to
> separate contexts is possible too - but this is an optional add-on
> and for later.
You won't be able to do it later if you don't design for it now.
Don't reinvent the square wheel -- there's a model to follow that was
so successful that it has killed all alternate models in its sphere.
Namely, IEEE 754. But please try not to make pipeline flushes suck as
much as they did on the i860.
Cheers,
- Michael
On 2/21/07, Ingo Molnar <[email protected]> wrote:
> pthread_cancel() [if/once threadlets are integrated into pthreads] ought
> to do that. A threadlet, if it gets moved to an async context, is a
> full-blown thread.
The fact that you are proposing pthread_cancel as a model for how to
abort an unfinished threadlet suggests -- to me, and I would think to
anyone who has ever written code that had no choice but to call
pthread_cancel -- that you have not thought about this part of the
problem.
Cheers,
- Michael
On 2/21/07, Ingo Molnar <[email protected]> wrote:
> threadlets (and syslets) are parallel contexts and they behave as such -
> queuing and execution semantics are then on top of that, implemented
> either by glibc, or implemented by the application. There is no
> 'pipeline' of requests imposed - the structure of pending requests is
> totally free-form. For example in threadlet-test.c i've in essence
> implemented a 'set of requests' with the submission site only interested
> in whether all requests are done or not - but any stricter (or even
> looser) semantics and ordering can be used too.
In short, you have a dataflow model with infinite parallelism,
implemented using threads of control mapped willy-nilly onto the
underlying hardware. This has not proven to be a successful model in
the past.
> in terms of AIO, the best queueing model is i think what the kernel uses
> internally: freely ordered, with barrier support. (That is equivalent to
> a "queue of sets", where the queue are the barriers, and the sets are
> the requests within barriers. If there is no barrier pending then
> there's just one large freely-ordered set of requests.)
That's a big part of why Linux scales poorly for workloads that
involve a large volume of in-flight I/O transactions. Unless you
essentially lock one application thread to each CPU core, with a
complete understanding of its cache sharing and latency relationships
to all the other cores, and do your own userspace I/O scheduling and
dispatching state machine -- which is what all industrial-strength
databases and other sorts of transaction engines currently do -- you
get the same old best-effort context-thrashing scheduler we've had
since Solaris 2.0.
Let's do something genuinely better this time, OK?
Cheers,
- Michael
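For reference, the "freely ordered, with barrier support" queueing model
quoted in the message above (a "queue of sets") can be sketched in a few
lines of C. This is only an illustration under stated assumptions: the
names barrier_queue, bq_submit(), bq_barrier() and bq_complete(), and the
MAX_SETS bound, are hypothetical and are not taken from the kernel or from
the syslet patches.

```c
#include <stddef.h>

#define MAX_SETS 8   /* hypothetical bound on sets separated by pending barriers */

/* A "queue of sets": requests inside one set may complete in any order,
 * but a set behind a barrier is not issued until the set before it drains. */
struct barrier_queue {
    size_t pending[MAX_SETS]; /* unfinished requests per set (ring buffer) */
    size_t head;              /* oldest set, the only one allowed to run */
    size_t nsets;             /* sets currently queued, always >= 1 */
};

void bq_init(struct barrier_queue *q)
{
    *q = (struct barrier_queue){ .nsets = 1 };
}

/* Add a request to the newest set. Returns 1 if it may be issued now
 * (no barrier ahead of it), 0 if it has to wait behind a barrier. */
int bq_submit(struct barrier_queue *q)
{
    size_t tail = (q->head + q->nsets - 1) % MAX_SETS;

    q->pending[tail]++;
    return q->nsets == 1;
}

/* Insert a barrier: later requests join a new set.
 * Returns 0 on success, -1 if too many barriers are pending. */
int bq_barrier(struct barrier_queue *q)
{
    size_t tail = (q->head + q->nsets - 1) % MAX_SETS;

    if (q->pending[tail] == 0)
        return 0;             /* empty set: nothing to order against */
    if (q->nsets == MAX_SETS)
        return -1;
    q->nsets++;
    return 0;
}

/* One request of the oldest set finished. Returns how many waiting
 * requests became issuable because a barrier was crossed. */
size_t bq_complete(struct barrier_queue *q)
{
    size_t released = 0;

    q->pending[q->head]--;
    while (q->pending[q->head] == 0 && q->nsets > 1) {
        q->head = (q->head + 1) % MAX_SETS;
        q->nsets--;
        released += q->pending[q->head];
    }
    return released;
}
```

With no barrier pending, nsets stays at 1 and every bq_submit() returns 1,
which is the "one large freely-ordered set of requests" case.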
On 2/21/07, Michael K. Edwards <[email protected]> wrote:
> You won't be able to do it later if you don't design for it now.
> Don't reinvent the square wheel -- there's a model to follow that was
> so successful that it has killed all alternate models in its sphere.
> Namely, IEEE 754. But please try not to make pipeline flushes suck as
> much as they did on the i860.
To understand why I harp on IEEE 754 as a sane model for pipelined
AIO, you might consider reading (at least parts of):
http://www.cs.berkeley.edu/~wkahan/JAVAhurt.pdf
People who write industrial-strength floating point programs rely on
the IEEE floating point semantics to avoid having to check every
result of every arithmetic step to see whether it is a valid input to
the next step. NaNs and +/-0 and overflows and all that jazz are
essential to efficient coding of things like matrix inversion, because
the only alternative is simply to fail. But in-line indications of an
exceptional result aren't enough, because it may or may not have been
a coding error, and you may need fine-grained control over which
"failure conditions" are within the realm of the expected and which
are not.
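A small illustration of the in-line part of that argument, assuming the
default IEEE 754 environment (no traps enabled, no -ffast-math). The
function names harmonic_mean() and result_is_usable() are made up for this
sketch. The loop never checks an intermediate value; infinities and NaNs
simply flow through, and the caller inspects the result once.

```c
#include <math.h>
#include <stddef.h>

/* Harmonic mean of n values with no validity check inside the loop.
 * A +0.0 input gives 1/0 = +inf, and n/inf = 0, which is the correct
 * limit. A NaN input propagates through every later step to the result. */
double harmonic_mean(const double *x, size_t n)
{
    double sum = 0.0;

    for (size_t i = 0; i < n; i++)
        sum += 1.0 / x[i];
    return (double)n / sum;
}

/* The single check, done once at the end instead of at every step. */
int result_is_usable(double r)
{
    return !isnan(r) && !isinf(r);
}
```

The finer-grained control mentioned above (deciding which conditions are
expected and which are not) is what the sticky exception flags and traps
are for; in C99 those are reached through <fenv.h>, which this sketch does
not use.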
Here's a quotable bit from Kahan and Darcy's polemic:
<quote>
To achieve Floating-Point Predictability:
Limit programmers' choices to what is reasonable and necessary as
well as parsimonious, and
Limit language implementors' choices so as always to honor the
programmer's choices.
To do so, language designers must understand floating-point well
enough to validate their determination of "what is reasonable and
necessary," or else must entrust that determination to someone else
with the necessary competency.
</quote>
Now I ain't that "someone else", when it comes to AIO pipelining. But
they're out there. Figure out how to create an AIO model that honors
the RDBMS programmer's choices efficiently on a NUMA box without
making him understand NUMA, and you really will have created something
for the ages.
Cheers,
- Michael
On Thu, 22 Feb 2007, Ingo Molnar wrote:
>
> * Davide Libenzi <[email protected]> wrote:
>
> > On Wed, 21 Feb 2007, Ingo Molnar wrote:
> >
> > > From: Ingo Molnar <[email protected]>
> > >
> > > add the move_user_context() method to move the user-space
> > > context of one kernel thread to another kernel thread.
> > > User-space might notice the changed TID, but execution,
> > > stack and register contents (general purpose and FPU) are
> > > still the same.
> >
> > Also signal handling should/must be maintained, on top of TID. You
> > don't want the user to be presented with a different signal handling
> > after an sys_async_exec call.
>
> right now CLONE_SIGNAL and CLONE_SIGHAND is used for new async threads,
> so they should inherit and share all the signal settings.
Right. Sorry I missed the signal cloning flags (I still have to go through
the whole code).
> > So NTSK loads a non up2date FPUo, instead of the FPUc that was the
> > "dirty" context to migrate (since TS_USEDFPU was set). I think you
> > need an early __unlazy_fpu() in that case, that would turn the above
> > into:
>
> yes. My plan is to avoid all these problems by having a
> special-purpose sched_yield_to(old_task, new_task) function.
>
> this, besides being even faster than the default scheduler (because the
> runqueue balance does not change so no real scheduling decision has to
> be done - the true scheduling decisions happen later on at async-wakeup
> time), should also avoid all the FPU races: the FPU just gets flipped
> between old_task and new_task (and TS_USEDFPU needs to be moved as well,
> etc.). No intermediate task can come inbetween.
>
> can you see a hole in this sched_yield_to() method as well?
Not sure till I see the code, but in that case we really need a sync&copy.
This is really a fork-like for the FPU context IMO. The current "dirty"
FPU context should follow both OTSK and NTSK.
- Davide
* Michael K. Edwards <[email protected]> wrote:
> [...] As for threadlets, making them kernel threads is not such a good
> design feature, O(1) scheduler or not. You take the full hit of
> kernel task creation, on the spot, for every threadlet that blocks.
> [...]
this is very much not how they work. Threadlets share the same basic
infrastructure with syslets and they do /not/ take a 'full hit of kernel
thread creation' when they block. Please read the announcements, past
discussions on lkml and the code about how it works.
about your other point:
> > threadlets, when they block, are regular kernel threads, so the
> > regular O(1) scheduler takes care of them. If MMU thrashing is of any
> > concern then syslets should be used to implement the most
> > performance-critical events: under Linux a kernel thread that does
> > not exit out to user-space does not do any TLB switching at all.
> > (even if there are multiple processes active and their syslets
> > intermix)
>
> As far as I am concerned syslets by themselves are a dead letter,
> because you can't do any of the things that potential application
> coders need to do with them. [...]
syslets are not meant to be directly exposed to application coders.
Syslets (like many of my previous mails stated) are meant as building
blocks to higher-level AIO interfaces such as in glibc or libaio. Then
user-space can build its state-machine based on syslet-implemented
glibc/libaio. In that specific role they are a very fast and scalable
mechanism.
Ingo
* Michael K. Edwards <[email protected]> wrote:
> [...] Unless you essentially lock one application thread to each CPU
> core, with a complete understanding of its cache sharing and latency
> relationships to all the other cores, and do your own userspace I/O
> scheduling and dispatching state machine -- which is what all
> industrial-strength databases and other sorts of transaction engines
> currently do -- you get the same old best-effort context-thrashing
> scheduler we've had since Solaris 2.0.
the syslet/threadlet framework has been derived from Tux, which one can
accuse of many things, but which i definitely can not accuse of being
slow. It has no relationship whatsoever to Solaris 2.0 or later.
Your other mail showed that you have very basic misunderstandings about
how threadlets work, on which you based a string of firm but incorrect
conclusions. In this discussion i'm mostly interested in specific
feedback about syslets/threadlets - thankfully we are past the years of
"unless Linux does generic technique X it will stay a hobby OS forever"
type of time-wasting discussions.
Ingo
* Ulrich Drepper <[email protected]> wrote:
> Ingo Molnar wrote:
> > in terms of AIO, the best queueing model is i think what the kernel uses
> > internally: freely ordered, with barrier support.
>
> Speaking of AIO, how do you imagine lio_listio is implemented? If
> there is no asynchronous syscall it would mean creating a threadlet
> for each request but this means either waiting or creating
> several/many threads.
my current thinking is that special-purpose (non-programmable, static)
APIs like aio_*() and lio_*(), where every last cycle of performance
matters, should be implemented using syslets - even if it is quite
tricky to write syslets (which they no doubt are - just compare the size
of syslet-test.c to threadlet-test.c). So i'd move syslets into the same
category as raw syscalls: pieces of the raw infrastructure between the
kernel and glibc, not an exposed API to apps. [and even if we keep them
in that category they still need quite a bit of API work, to clean up
the 32/64-bit issues, etc.]
The size of the async thread pool can be kept in check either from
user-space (by starting to queue up requests after a certain point of
saturation without submitting them) or from kernel-space which involves
waiting (the latter was present in v2 but i temporarily removed it from
v3). "You have to wait" is the eventual final answer in every
well-behaved queueing system anyway.
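The user-space variant (queue requests after a saturation point instead of
submitting them) can be sketched roughly as follows. Everything here is an
assumption made for illustration: the struct throttle type, the
throttle_submit()/throttle_complete() helpers and both limits are
hypothetical, and request ids are assumed to be non-negative integers.

```c
#include <stddef.h>

#define MAX_INFLIGHT 4    /* hypothetical saturation point */
#define MAX_BACKLOG  64   /* hypothetical backlog capacity */

struct throttle {
    int inflight;               /* submitted and not yet completed */
    int backlog[MAX_BACKLOG];   /* ids held back, FIFO ring buffer */
    size_t head, count;
};

/* Returns 1 if the caller should submit the request now, 0 if it was
 * queued, -1 if the backlog is full (the "you have to wait" case). */
int throttle_submit(struct throttle *t, int req_id)
{
    if (t->inflight < MAX_INFLIGHT) {
        t->inflight++;
        return 1;
    }
    if (t->count == MAX_BACKLOG)
        return -1;
    t->backlog[(t->head + t->count) % MAX_BACKLOG] = req_id;
    t->count++;
    return 0;
}

/* Called when one in-flight request completes. Returns the id of a
 * held-back request to submit in its place, or -1 if none is waiting. */
int throttle_complete(struct throttle *t)
{
    int req_id;

    t->inflight--;
    if (t->count == 0)
        return -1;
    req_id = t->backlog[t->head];
    t->head = (t->head + 1) % MAX_BACKLOG;
    t->count--;
    t->inflight++;              /* the freed slot goes to this request */
    return req_id;
}
```

A zero-initialized struct throttle is a valid empty state. The number of
async contexts the kernel can be asked to create is then bounded by
MAX_INFLIGHT, regardless of how many requests the application generates.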
How things work out with a large number of outstanding threads in real
apps is still an open question (until someone tries it) but i'm
cautiously optimistic: in my own (FIO based) measurements syslets beat
the native KAIO interfaces both in the cached and in the non-cached [==
many threads] case. I did not expect the latter at all: the non-cached
syslet codepath is not optimized at all yet, so i expected it to have
(much) higher CPU overhead than KAIO.
This means that KAIO is in worse shape than i thought - there's just way
too much context KAIO has to build up to submit parallel IO contexts.
Many years of optimizations went into KAIO already, so it's probably at
its outer edge of performance capabilities. Furthermore, what KAIO has
to compete against in the syslet case are the synchronous syscalls
turned async, and more than a decade of optimizations went into all the
synchronous syscalls. Plus the 'threading overhead of syslets' really
boils down to 'scheduling overhead' in the end - and we can do over a
million context-switches a second, per CPU. What killed user-space
thread-based AIO performance many moons ago wasn't really the threading
concept itself or scheduling overhead, it was the (then) fragile
threading implementation of Linux, combined with the resulting
signal-based AIO code. Catching and handling a single signal is more
expensive than a context-switch - and signals have legacies attached to
them that make them hard to scale within the kernel. Plus with syslets
the 'threading overhead' is optional, it only happens when it has to.
Plus there's the fundamental killer that KAIO is a /lot/ harder to
implement (and to maintain) on the kernel side: it has to be implemented
for every IO discipline, and even for the IO disciplines it supports at
the moment, it is not truly asynchronous for things like metadata
blocking or VFS blocking. To handle things like metadata blocking it has
to resort to non-statemachine techniques like retries - which are bad
for performance.
Syslets/threadlets on the other hand, once the core is implemented, have
near zero ongoing maintenance cost (compared to KAIO pushed into every
IO subsystem) and cover all IO disciplines and API variants immediately,
and they are as perfectly asynchronous as it gets.
So all in one, i used to think that AIO state-machines have a long-term
place within the kernel, but with syslets i think i've proven myself
embarrassingly wrong =B-)
Ingo
On 2/21/07, Ingo Molnar <[email protected]> wrote:
> > [...] As for threadlets, making them kernel threads is not such a good
> > design feature, O(1) scheduler or not. You take the full hit of
> > kernel task creation, on the spot, for every threadlet that blocks.
> > [...]
>
> this is very much not how they work. Threadlets share the same basic
> infrastructure with syslets and they do /not/ take a 'full hit of kernel
> thread creation' when they block. Please read the announcements, past
> discussions on lkml and the code about how it works.
Sorry, you're right, I jumped to a conclusion here without reading the
implementation. I read too much into your statement that "threadlets,
when they block, are regular kernel threads". So tell me, at what
stage is NPTL going to need a new TLS context for errno and all that?
Immediately when the threadlet first blocks, right? At most you can
delay the bulk page copies with CoW MMU tricks (about which I cannot
begin to match your knowledge). But you can't let any code run that
might touch errno -- or FPU state or anything else you need to
preserve for when you do carry on with the threadlet -- until you've
at least told the hardware to fault on write.
Yes, I will go back and read the code for myself. This will take me
some time because I have only a hand-waving level of knowledge about
task_structs and pt_regs, and have largely avoided the dark corners of
the x86 architecture. But I think my point still stands: allowing
code inside threadlets to use the usual C library wrappers around the
usual synchronous syscalls is going to mean that the threadlet context
is fairly heavyweight, both in memory and CPU/MMU state. This means
that you can't pull it cheaply over to whatever CPU core happened to
process the device I/O that delivered the data it wanted.
If the cost of threadlet migration isn't negligible, then you can't
just write code that initiates a zillion threadlets in a loop -- not
if you want to be able to exploit NUMA or even SMP efficiently. You
have to distribute the threadlet initiation among parallel threads on
all the CPUs -- at which point you are back where you started, with
the application explicitly partitioned among CPU-locked dispatch
threads. Any programming team prepared to cope with that is probably
going to stick to the userspace state machine they probably already
have for managing delayed I/O responses.
> syslets are not meant to be directly exposed to application coders.
> Syslets (like many of my previous mails stated) are meant as building
> blocks to higher-level AIO interfaces such as in glibc or libaio. Then
> user-space can build its state-machine based on syslet-implemented
> glibc/libaio. In that specific role they are a very fast and scalable
> mechanism.
With all due respect -- and I know how successful you have been with
major new kernel features in the past -- I think that's a cop-out.
That's like saying that floating point is not meant to be directly
exposed to application coders. Sure, the details of the floating
point _pipeline_ are essentially opaque; but I don't have to package
up a string of floating point operations into a "floatlet" and send it
off to the FPU. From the point of view of the integer unit, I move
things into and out of floating point registers, and in between I
issue instructions to the FPU to do arithmetic on its registers. If
something goes mildly wrong (say, underflow), and I've told the FPU to
carry on under those circumstances, it does. If something goes bad
wrong, or if underflow happens and I've told the FPU that my code
won't do the right thing on underflow, I get an exception. That's it.
If the FPU decides to pipeline things, that's not my problem; when I
get to the operation that wants to pull a result back over to the
integer unit, I stall until it's ready. And in a cleverer, more
tightly interlocked processor that issues some things in parallel and
speculatively executes others, exceptions may not be deliverable until
long after the arithmetic op that produced them (you might not wind up
taking that branch). Hence other state may have to be rewound so that
the exception is delivered with everything else in the state it would
have been in if the processor weren't so damn clever.
You don't have to invent all that pipelining and migration and
speculative execution stuff up front for AIO. But if you don't stick
to what is "reasonable and necessary as well as parsimonious" when
defining what's allowed inside a threadlet, you won't leave
implementations any future flexibility. And "want speed? use syslets
instead" is no answer, especially if you tell me that syslets wrapped
in glibc are only really useful for short-circuiting chained states in
a state machine. I. Don't. Want. To. Write. State. Machines. Any.
More.
Cheers,
- Michael
Oh, and while I haven't written a kernel or an RDBMS, I have written
some fairly serious non-blocking I/O code (without resorting to
threads; one socket and thousands of independent userspace state
machines) and rewritten the plumbing of two fairly heavy-duty network
operations frameworks, one of them attached to a horrifically complex
GUI. And briefly, long ago, I made Transputers dance with Occam and
galaxies spin with PVM. This gives me exactly zero credentials here,
of course, but may suggest to you that when I say something that seems
naive, it's more likely that I got the Linux-specific lingo wrong.
That, or I'm _actively_ stupid. :-)
On 2/21/07, Ingo Molnar <[email protected]> wrote:
> the syslet/threadlet framework has been derived from Tux, which one can
> accuse of many things, but which i definitely can not accuse of being
> slow. It has no relationship whatsoever to Solaris 2.0 or later.
So how well does Tux fare on a NUMA box? The Solaris 2.0 reference
was not about the origins of the code, it was about the era when SMP
first became relevant to the UNIX application programmer. I remember
an emphasis on scheduler scalability then, too. It took them quite a
while to figure out that having an efficient scheduler is of little
use if you are scheduling the wrong things on the wrong CPUs in the
wrong order and thereby thrashing continuously. By that time we had
given up and gone back to message passing via the network stack, which
was the one kernel component that could figure out how to get state
from one CPU to another without taking all of its clothes off and
changing underwear in between. Sound familiar?
> Your other mail showed that you have very basic misunderstandings about
> how threadlets work, on which you based a string of firm but incorrect
> conclusions. In this discussion i'm mostly interested in specific
> feedback about syslets/threadlets - thankfully we are past the years of
> "unless Linux does generic technique X it will stay a hobby OS forever"
> type of time-wasting discussions.
Well, maybe I'll let someone else weigh in about whether I understood
threadlets well enough to provide feedback worth reading. As for the
"hobby OS forever" bit, that's an utter misrepresentation of my
comments and criticism. Linux is now good enough for Oracle to have
more or less abandoned Sun for Linux. That's as good as it needs to
be, as far as Oracle and IBM are concerned. The question is now
whether it will ever get substantially _better_, so that you can do
something constructive with a NUMA box or a 64-core MIPS without
having the resources of an Oracle or a Google to build an
OS-atop-the-OS.
Cheers,
- Michael
On Wed, Feb 21, 2007 at 10:13:55PM +0100, Ingo Molnar wrote:
> this is the v3 release of the syslet/threadlet subsystem:
>
> http://redhat.com/~mingo/syslet-patches/
>
> This release came a few days later than i originally wanted, because
> i've implemented many fundamental changes to the code. The biggest
> highlights of v3 are:
>
> - "Threadlets": the introduction of the 'threadlet' execution concept.
>
> - syslets: multiple rings support with no kernel-side footprint, the
> elimination of mlock() pinning, no async_register/unregister() calls
> needed anymore and more.
>
> "Threadlets" are basically the user-space equivalent of syslets: small
> functions of execution that the kernel attempts to execute without
> scheduling. If the threadlet blocks, the kernel creates a real thread
> from it, and execution continues in that thread. The 'head' context (the
> context that never blocks) returns to the original function that called
> the threadlet. Threadlets are very easy to use:
>
> long my_threadlet_fn(void *data)
> {
> char *name = data;
> int fd;
>
> fd = open(name, O_RDONLY);
> if (fd < 0)
> goto out;
>
> fstat(fd, &stat);
> read(fd, buf, count)
> ...
>
> out:
> return threadlet_complete();
> }
>
>
> main()
> {
> done = threadlet_exec(threadlet_fn, new_stack, &user_head);
> if (!done)
> reqs_queued++;
> }
>
> There is no limitation whatsoever about how a threadlet function can
> look like: it can use arbitrary system-calls and all execution will be
> procedural. There is no 'registration' needed when running threadlets
> either: the kernel will take care of all the details, user-space just
> runs a threadlet without any preparation and that's it.
>
> Completion of async threadlets can be done from user-space via any of
> the existing APIs: in threadlet-test.c (see the async-test-v3.tar.gz
> user-space examples at the URL above) i've for example used a futex
> between the head and the async threads to do threadlet notification. But
> select(), poll() or signals can be used too - whichever is most
> convenient to the application writer.
>
> Threadlets can also be thought of as 'optional threads': they execute in
> the original context as long as they do not block, but once they block,
> they are moved off into their separate thread context - and the original
> context can continue execution.
>
> Threadlets can also be thought of as 'on-demand parallelism': user-space
> does not have to worry about setting up, sizing and feeding a thread
> pool - the kernel will execute the workload in a single-threaded manner
> as long as it makes sense, but once the context blocks, a parallel
> context is created. So parallelism inside applications is utilized in a
> natural way. (The best place to do this is in the kernel - user-space
> has no idea about what level of parallelism is best for any given
> moment.)
>
> I believe this threadlet concept is what user-space will want to use for
> programmable parallelism.
>
> [ Note that right now there's a pair of system-calls: sys_threadlet_on()
> and sys_threadlet_off() that demarks the beginning and the end of a
> syslet function, which enter the kernel even in the 'cached' case -
> but my plan is to do these two system calls via a vsyscall, without
> having to enter the kernel at all. That will reduce cached threadlet
> execution NULL-overhead to around 10 nsecs - making it essentially
> zero. ]
>
> Threadlets share much of the scheduling infrastructure with syslets.
>
> Syslets (small, kernel-side, scripted "syscall plugins") are still
> supported - they are (much...) harder to program than threadlets but
> they allow the highest performance. Core infrastructure libraries like
> glibc/libaio are expected to use syslets. Jens Axboe's FIO tool already
> includes support for v2 syslets, and the following patch updates FIO to
Ah, glad to see that - I was wondering if it was worthwhile to try adding
syslet support to aio-stress to be able to perform some comparisons.
Hopefully FIO should be able to generate a similar workload, but I haven't
tried it yet so am not sure. Are you planning to upload some results
(so I can compare it with patterns I am familiar with) ?
Regards
Suparna
> the v3 API:
>
> http://redhat.com/~mingo/syslet-patches/fio-syslet-v3.patch
>
> Furthermore, the syslet code and API has been significantly enhanced as
> well:
>
> - support for multiple completion rings has been added
>
> - there is no more mlock()ing of the completion ring(s)
>
> - sys_async_register()/unregister() has been removed as it is not
> needed anymore. sys_async_exec() can be called straight away.
>
> - there is no kernel-side resource used up by async completion rings at
> all (all the state is in user-space), so an arbitrary number of
> completion rings are supported.
>
> plus lots of bugs were fixed and a good number of cleanups were done as
> well. The v3 code is ABI-incompatible with v2, due to these fundamental
> changes.
>
> As always, comments, suggestions, reports are welcome.
>
> Ingo
--
Suparna Bhattacharya ([email protected])
Linux Technology Center
IBM Software Lab, India
* Suparna Bhattacharya <[email protected]> wrote:
> > Threadlets share much of the scheduling infrastructure with syslets.
> >
> > Syslets (small, kernel-side, scripted "syscall plugins") are still
> > supported - they are (much...) harder to program than threadlets but
> > they allow the highest performance. Core infrastructure libraries
> > like glibc/libaio are expected to use syslets. Jens Axboe's FIO tool
> > already includes support for v2 syslets, and the following patch
> > updates FIO to
>
> Ah, glad to see that - I was wondering if it was worthwhile to try
> adding syslet support to aio-stress to be able to perform some
> comparisons. [...]
i think it would definitely be worth it.
> [...] Hopefully FIO should be able to generate a similar workload, but
> I haven't tried it yet so am not sure. Are you planning to upload some
> results (so I can compare it with patterns I am familiar with) ?
i had no time yet to do careful benchmarks. Right now my impression from
quick testing is that libaio performance can be exceeded via syslets. So
it would be very interesting if you could try this too, independently of
me.
Ingo
Hi Ingo, developers.
On Thu, Feb 22, 2007 at 08:40:44AM +0100, Ingo Molnar ([email protected]) wrote:
> Syslets/threadlets on the other hand, once the core is implemented, have
> near zero ongoing maintenance cost (compared to KAIO pushed into every
> IO subsystem) and cover all IO disciplines and API variants immediately,
> and they are as perfectly asynchronous as it gets.
>
> So all in one, i used to think that AIO state-machines have a long-term
> place within the kernel, but with syslets i think i've proven myself
> embarrassingly wrong =B-)
Hmm...
Try to have a network web server with huge load made on top of
syslets/threadlets.
It is not a TUX anymore - you had 1024 threads, and all of them will be
consumed by tcp_sendmsg() for slow clients - rescheduling will kill a
machine.
My tests show that with 4k connections per second (8k concurrency) more
than 20k connections of 80k total block in tcp_sendmsg() over gigabit
lan between quite fast machines.
Or threadlet/syslet AIO should not be used with networking too?
> Ingo
--
Evgeniy Polyakov
> It is not a TUX anymore - you had 1024 threads, and all of them will be
> consumed by tcp_sendmsg() for slow clients - rescheduling will kill a
> machine.
I think it's time to make a split in what "context switch" or
"reschedule" means...
there are two types of context switch:
1) To a different process. This means teardown of the TLB, going to a
new MMU state, saving FPU state etc etc etc. This is obviously quite
expensive
2) To a thread of the same process. No TLB flush no new MMU state,
effectively all it does is getting a new task struct on the kernel side,
and a new ESP/EIP pair on the userspace side. If there is FPU code
involved that gets saved as well.
Number 1 is very expensive and that is what is really worrying normally;
number 2 is a LOT lighter weight, and while Linux is a bit heavy there,
it can be made lighter... there's no fundamental reason for it to be
really expensive.
--
if you want to mail me at work (you don't), use arjan (at) linux.intel.com
Test the interaction between Linux and your BIOS via http://www.linuxfirmwarekit.org
May I just say, that this is f***ing brilliant.
It completely separates the threadlet/fibril core from the (contentious)
completion notification debate, and allows you to use whatever mechanism
you like. (fd, signal, kevent, futex, ...)
You can also add a "macro syscall" like the original syslet idea,
and it can be independent of the threadlet mechanism but provide the
same effects.
If the macros can be designed to always exit when done, with a guarantee
never to return to user space, then you can always recycle the stack
after threadlet_exec() returns, whether it blocked in the syscall or not,
and you have your original design.
May I just suggest, however, that the interface be:
tid = threadlet_exec(...)
Where tid < 0 means error, tid == 0 means completed synchronously,
and tid > 0 identifies the child so it can be waited for?
Anyway, this is a really excellent user-space API. (You might add some
sort of "am I synchronous?" query, or maybe you could just use gettid()
for the purpose.)
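A caller written against that suggested convention might look like the
sketch below. The convention itself is only a proposal in this thread, and
every name in the code (threadlet_classify(), threadlet_account(), the
enum and the stats struct) is hypothetical; the sketch takes the tid as a
plain argument and does not call the real threadlet_exec().

```c
/* Outcomes under the return convention suggested above.
 * All names are hypothetical; none of them come from the v3 patches. */
enum threadlet_outcome {
    THREADLET_ERROR,       /* tid < 0 */
    THREADLET_DONE_SYNC,   /* tid == 0: completed without blocking */
    THREADLET_WENT_ASYNC   /* tid > 0: blocked, continues as thread 'tid' */
};

enum threadlet_outcome threadlet_classify(long tid)
{
    if (tid < 0)
        return THREADLET_ERROR;
    if (tid == 0)
        return THREADLET_DONE_SYNC;
    return THREADLET_WENT_ASYNC;
}

/* Caller-side bookkeeping, in the spirit of the reqs_queued counter
 * in the announcement's example. */
struct submit_stats {
    long errors;
    long completed;
    long queued;
    long last_child;       /* most recent tid > 0, something to wait on */
};

void threadlet_account(struct submit_stats *s, long tid)
{
    switch (threadlet_classify(tid)) {
    case THREADLET_ERROR:
        s->errors++;
        break;
    case THREADLET_DONE_SYNC:
        s->completed++;
        break;
    case THREADLET_WENT_ASYNC:
        s->queued++;
        s->last_child = tid;
        break;
    }
}
```

The point of returning the tid rather than a boolean is visible in
last_child: the submitter gets a handle it can later wait on, without a
separate lookup.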
The one interesting question is, can you nest threadlet_exec() calls?
I think it's implementable, and I can definitely see the attraction
of being able to call libraries that use it internally (to do async
read-ahead or whatever) from a threadlet function.
On Thu, Feb 22, 2007 at 12:52:39PM +0100, Arjan van de Ven ([email protected]) wrote:
>
> > It is not a TUX anymore - you had 1024 threads, and all of them will be
> > consumed by tcp_sendmsg() for slow clients - rescheduling will kill a
> > machine.
>
> I think it's time to make a split in what "context switch" or
> "reschedule" means...
>
> there are two types of context switch:
>
> 1) To a different process. This means teardown of the TLB, going to a
> new MMU state, saving FPU state etc etc etc. This is obviously quite
> expensive
>
> 2) To a thread of the same process. No TLB flush no new MMU state,
> effectively all it does is getting a new task struct on the kernel side,
> and a new ESP/EIP pair on the userspace side. If there is FPU code
> involved that gets saved as well.
>
> Number 1 is very expensive and that is what is really worrying normally;
> number 2 is a LOT lighter weight, and while Linux is a bit heavy there,
> it can be made lighter... there's no fundamental reason for it to be
> really expensive.
It does not matter - even with threads, the cost of having thousands of
threads is _too_ expensive. So, IMO, it is wrong to have to create
20k threads for a simple web server which only sends one index page to
80k connections at a rate of 4k connections per second.
Just keep that example in mind when designing a new type of AIO - more
than 20k blocks in 80k connections over gigabit lan, and that is likely
an optimistic result.
> --
> if you want to mail me at work (you don't), use arjan (at) linux.intel.com
> Test the interaction between Linux and your BIOS via http://www.linuxfirmwarekit.org
--
Evgeniy Polyakov
* Evgeniy Polyakov <[email protected]> wrote:
> It is not a TUX anymore - you had 1024 threads, and all of them will
> be consumed by tcp_sendmsg() for slow clients - rescheduling will kill
> a machine.
maybe it will, maybe it won't. Let's try? There is no true difference
between having a 'request structure' that represents the current state
of the HTTP connection plus a statemachine that moves that request
between various queues, and a 'kernel stack' that goes in and out of
runnable state and carries its processing state in its stack - other
than the amount of RAM they take. (the kernel stack is 4K at a minimum -
so with a million outstanding requests they would use up 4 GB of RAM.
With 20k outstanding requests it's 80 MB of RAM - that's acceptable.)
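The arithmetic behind those two figures, as a minimal sketch. It assumes
exactly one 4K kernel stack per outstanding request and nothing else
(no task_struct, no user stack); stack_bytes() is a name made up here.

```c
#include <stdint.h>

#define KERNEL_STACK_BYTES 4096ULL   /* the 4K minimum quoted above */

/* Total kernel-stack memory for a given number of outstanding
 * requests, one stack each. */
uint64_t stack_bytes(uint64_t outstanding)
{
    return outstanding * KERNEL_STACK_BYTES;
}
```

stack_bytes(1000000) is 4,096,000,000 bytes, i.e. roughly 4 GB, and
stack_bytes(20000) is 81,920,000 bytes, i.e. roughly 80 MB.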
> My tests show that with 4k connections per second (8k concurrency)
> more than 20k connections of 80k total block in tcp_sendmsg() over
> gigabit lan between quite fast machines.
yeah. Note that you can have a million sleeping threads if you want, the
scheduler won't care. What matters more is the amount of true concurrency
that is present at any given time. But yes, i agree that overscheduling
can be a problem.
btw., what is the measurement utility you are using with kevents ('ab'
perhaps, with a high -c concurrency count?), and which webserver are you
using? (light-httpd?)
Ingo
On Thu, Feb 22, 2007 at 01:59:31PM +0100, Ingo Molnar ([email protected]) wrote:
>
> * Evgeniy Polyakov <[email protected]> wrote:
>
> > It is not a TUX anymore - you had 1024 threads, and all of them will
> > be consumed by tcp_sendmsg() for slow clients - rescheduling will kill
> > a machine.
>
> maybe it will, maybe it won't. Let's try? There is no true difference
> between having a 'request structure' that represents the current state
> of the HTTP connection plus a statemachine that moves that request
> between various queues, and a 'kernel stack' that goes in and out of
> runnable state and carries its processing state in its stack - other
> than the amount of RAM they take. (the kernel stack is 4K at a minimum -
> so with a million outstanding requests they would use up 4 GB of RAM.
> With 20k outstanding requests it's 80 MB of RAM - that's acceptable.)
I tried already :) - I just made allocations atomic in tcp_sendmsg() and
ended up with 1/4 of the sends blocking (I counted both allocation
failure and socket queue overflow). Those 20k blocked requests were
created in about 20 seconds, so roughly speaking we have 1k thread
creations/frees per second - do we want this?
> > My tests show that with 4k connections per second (8k concurrency)
> > more than 20k connections of 80k total block in tcp_sendmsg() over
> > gigabit lan between quite fast machines.
>
> yeah. Note that you can have a million sleeping threads if you want, the
> scheduler wont care. What matters more is the amount of true concurrency
> that is present at any given time. But yes, i agree that overscheduling
> can be a problem.
Before I started the M:N threading library implementation I checked the
threading performance of the current POSIX library - I created a simple
pool of threads and 'sent' a message between them using futex wait/wake
(sema_post/wait) one-by-one. The results are quite disappointing - given
that the number of sleeping threads was in the hundreds, kernel rescheduling
is about 10 times slower than the setjmp based one (I think so) used in Erlang.
The above example is not 100% correct, I understand, but the situation with
thread-like AIO is much worse - it is possible that several threads will
be ready simultaneously, so rescheduling between them will kill
performance.
> btw., what is the measurement utility you are using with kevents ('ab'
> perhaps, with a high -c concurrency count?), and which webserver are you
> using? (light-httpd?)
Yes, it is ab and lighttpd; before that it was httperf (unfair under high
load due to poll/select usage) and my own web server (evserver_kevent.c).
> Ingo
--
Evgeniy Polyakov
From: [email protected]
Date: 22 Feb 2007 07:27:21 -0500
> May I just say, that this is f***ing brilliant.
It's brilliant for disk I/O, not for networking for which
blocking is the norm not the exception.
So people will likely have to do something like divide their
applications into handling for I/O to files and I/O to networking.
So beautiful. :-)
Nobody has proposed anything yet which scales well and handles both
cases.
It is one recurring point made by Evgeniy, and he is very right
about it.
If used for networking one could easily make this new interface create
an arbitrary number of threads by just opening up that many
connections to such a server and just sitting there not reading
anything from any of the client sockets. And this happens
non-maliciously for slow clients, whether that is due to application
blockage or the characteristics of the network path.
From: Evgeniy Polyakov <[email protected]>
Date: Thu, 22 Feb 2007 15:39:30 +0300
> It does not matter - even with threads cost of having thousands of
> threads is _too_ expensive. So, IMO, it is wrong to have to create
> 20k threads for the simple web server which only sends one index page to
> 80k connections at a rate of 4k connections per second.
>
> Just have that example in mind - more than 20k blocks in 80k connections
> over gigabit lan, and it is likely optimistic result, when designing new
> type of AIO.
I totally agree with Evgeniy on these points.
Using things like syslets and threadlets for networking I/O
is not a very good idea. Blocking is more the norm than the
exception for networking I/O.
On Thu, Feb 22, 2007 at 01:59:31PM +0100, Ingo Molnar wrote:
>
> * Evgeniy Polyakov <[email protected]> wrote:
>
> > It is not a TUX anymore - you had 1024 threads, and all of them will
> > be consumed by tcp_sendmsg() for slow clients - rescheduling will kill
> > a machine.
>
> maybe it will, maybe it wont. Lets try? There is no true difference
> between having a 'request structure' that represents the current state
> of the HTTP connection plus a statemachine that moves that request
> between various queues, and a 'kernel stack' that goes in and out of
> runnable state and carries its processing state in its stack - other
> than the amount of RAM they take. (the kernel stack is 4K at a minimum -
> so with a million outstanding requests they would use up 4 GB of RAM.
> With 20k outstanding requests it's 80 MB of RAM - that's acceptable.)
At what point are the cachemiss threads destroyed ? In other words how well
does this adapt to load variations ? For example, would this 80MB of RAM
continue to be locked down even during periods of lighter loads thereafter ?
Regards
Suparna
>
> > My tests show that with 4k connections per second (8k concurrency)
> > more than 20k connections of 80k total block in tcp_sendmsg() over
> > gigabit lan between quite fast machines.
>
> yeah. Note that you can have a million sleeping threads if you want, the
> scheduler wont care. What matters more is the amount of true concurrency
> that is present at any given time. But yes, i agree that overscheduling
> can be a problem.
>
> btw., what is the measurement utility you are using with kevents ('ab'
> perhaps, with a high -c concurrency count?), and which webserver are you
> using? (light-httpd?)
>
> Ingo
--
Suparna Bhattacharya ([email protected])
Linux Technology Center
IBM Software Lab, India
* David Miller <[email protected]> wrote:
> If used for networking one could easily make this new interface create
> an arbitrary number of threads by just opening up that many
> connections to such a server and just sitting there not reading
> anything from any of the client sockets. And this happens
> non-maliciously for slow clients, whether that is due to application
> blockage or the characteristics of the network path.
there are two issues on which i'd like to disagree.
Firstly, i dont think you are fully applying the syslet/threadlet model.
There is no reason why an 'idle' client would have to use up a full
thread! It all depends on how you use syslets/threadlets, and how
(frequently) you queue back requests from cachemiss threads back to the
primary thread. It is only the simplest queueing model where there is
one thread per request that is currently blocked. Syslets/threadlets do
/not/ force request processing to be performed in the async context
forever - the async thread could very much queue it back to the primary
context. (That's in essence what Tux did.) So the same state-machine
techniques can be applied on both the syslet and the threadlet model,
but in much more natural (and thus lower overhead) points: /between/
system calls and not in the middle of them. There are a number of
measures that can be used to keep the number of parallel threads down.
Secondly, even assuming lots of pending requests/async-threads and a
naive queueing model, an open request will eat up resources on the
server no matter what. So if your point is that "+4K of kernel stack
pinned down per open, blocked request makes syslets and threadlets not a
very good idea", then i'd like to disagree with that: while it wont be
zero-cost (4K does cost you 400MB of RAM per 100,000 outstanding
threads), it's often comparable to the other RAM costs that are already
attached to an open connection.
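The RAM figures traded back and forth in this thread all come from the same product. A minimal sketch, assuming one 4K kernel stack per blocked request (the minimum quoted above) and the decimal units used in the thread; the helper name is purely illustrative and not part of any kernel API:

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative helper (hypothetical name): RAM pinned by kernel stacks
 * when every blocked request owns one thread.  Assumes a fixed
 * per-thread stack size given in KB. */
static uint64_t stack_ram_kb(uint64_t nr_threads, uint64_t stack_kb)
{
	assert(stack_kb > 0);
	return nr_threads * stack_kb;
}
```

With a 4K stack, 20,000 threads pin 80,000 KB (the 80 MB figure), 100,000 threads pin 400,000 KB (400MB), and a million threads pin 4,000,000 KB (4 GB). Larger stacks scale these numbers linearly.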
(let me know if i misunderstood your point.)
Ingo
* Suparna Bhattacharya <[email protected]> wrote:
> > maybe it will, maybe it wont. Lets try? There is no true difference
> > between having a 'request structure' that represents the current
> > state of the HTTP connection plus a statemachine that moves that
> > request between various queues, and a 'kernel stack' that goes in
> > and out of runnable state and carries its processing state in its
> > stack - other than the amount of RAM they take. (the kernel stack is
> > 4K at a minimum - so with a million outstanding requests they would
> > use up 4 GB of RAM. With 20k outstanding requests it's 80 MB of RAM
> > - that's acceptable.)
>
> At what point are the cachemiss threads destroyed ? In other words how
> well does this adapt to load variations ? For example, would this 80MB
> of RAM continue to be locked down even during periods of lighter loads
> thereafter ?
you can destroy them at will from user-space too - just start a slow
timer that zaps them if load goes down. I can add a
sys_async_thread_exit(nr_threads) API to be able to drive this without
knowing the TIDs of those threads, and/or i can add a kernel-internal
mechanism to zap inactive threads. It would be rather easy and
low-overhead - the v2 code already had a max_nr_threads tunable, i can
reintroduce it. So the size of the pool of contexts does not have to be
permanent at all.
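A user-space 'slow timer' policy of this kind could look roughly like the sketch below. It is only an illustration: the function name and the choice to retire half of the surplus per tick are assumptions, and no sys_async_thread_exit() call is shown because that API is only a proposal here.

```c
/* Hypothetical policy helper: how many idle async threads a periodic
 * timer should retire on this tick.  Retiring half of the surplus each
 * time lets the pool decay gradually, so a short dip in load does not
 * throw away contexts that the next burst would recreate. */
static unsigned int threads_to_zap(unsigned int nr_idle, unsigned int keep_idle)
{
	unsigned int excess;

	if (nr_idle <= keep_idle)
		return 0;
	excess = nr_idle - keep_idle;
	return (excess + 1) / 2;	/* round up so the surplus reaches zero */
}
```

The keep_idle floor plays a role similar to a pool-size tunable: the timer never trims below it.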
Ingo
From: Ingo Molnar <[email protected]>
Date: Thu, 22 Feb 2007 15:31:45 +0100
> Firstly, i dont think you are fully applying the syslet/threadlet model.
> There is no reason why an 'idle' client would have to use up a full
> thread! It all depends on how you use syslets/threadlets, and how
> (frequently) you queue back requests from cachemiss threads back to the
> primary thread. It is only the simplest queueing model where there is
> one thread per request that is currently blocked. Syslets/threadlets do
> /not/ force request processing to be performed in the async context
> forever - the async thread could very much queue it back to the primary
> context. (That's in essence what Tux did.) So the same state-machine
> techniques can be applied on both the syslet and the threadlet model,
> but in much more natural (and thus lower overhead) points: /between/
> system calls and not in the middle of them. There are a number of
> measures that can be used to keep the number of parallel threads down.
Ok.
> Secondly, even assuming lots of pending requests/async-threads and a
> naive queueing model, an open request will eat up resources on the
> server no matter what. So if your point is that "+4K of kernel stack
> pinned down per open, blocked request makes syslets and threadlets not a
> very good idea", then i'd like to disagree with that: while it wont be
> zero-cost (4K does cost you 400MB of RAM per 100,000 outstanding
> threads), it's often comparable to the other RAM costs that are already
> attached to an open connection.
The 400MB is extra, and it's in no way commensurate with the cost
of the TCP socket itself even including the application specific
state being used for that connection.
Even if it would be _equal_, we would be doubling the memory
requirements for such a scenario.
This is why I dislike the threadlet model, when used in that way.
The pushback to the primary thread you speak of is just extra work in
my mind, for networking. Better to just begin operations and sit in
the primary thread(s) waiting for events, and when they arrive push
the operations further along using non-blocking writes, reads, and
accept() calls. There is no blocking context really needed for these
kinds of things, so a mechanism that tries to provide one is a waste.
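The event-driven pattern described here can be sketched with plain POSIX calls. This is only an illustration under stated assumptions: it uses poll() rather than kevent or epoll, the helper names are hypothetical, and a real server would track many descriptors and per-connection buffers.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <poll.h>
#include <stddef.h>
#include <sys/types.h>
#include <unistd.h>

/* Put a descriptor into non-blocking mode; returns 0 on success. */
static int set_nonblocking(int fd)
{
	int flags = fcntl(fd, F_GETFL, 0);

	if (flags < 0)
		return -1;
	return fcntl(fd, F_SETFL, flags | O_NONBLOCK);
}

/* Push as much of buf as the descriptor accepts right now.  Returns the
 * number of bytes written (possibly 0 if the queue is full) or -1 on a
 * real error.  It never sleeps. */
static ssize_t push_some(int fd, const char *buf, size_t len)
{
	size_t done = 0;

	while (done < len) {
		ssize_t n = write(fd, buf + done, len - done);

		if (n < 0) {
			if (errno == EAGAIN || errno == EWOULDBLOCK)
				break;		/* queue full: wait for an event */
			if (errno == EINTR)
				continue;
			return -1;
		}
		done += (size_t)n;
	}
	return (ssize_t)done;
}

/* Wait in the primary thread until fd is writable again.
 * Returns 1 if writable, 0 on timeout, -1 on error. */
static int wait_writable(int fd, int timeout_ms)
{
	struct pollfd pfd = { .fd = fd, .events = POLLOUT };
	int ret = poll(&pfd, 1, timeout_ms);

	if (ret <= 0)
		return ret;
	return (pfd.revents & POLLOUT) ? 1 : -1;
}
```

The primary thread calls push_some() until it reports a short write, remembers the offset, and resumes only after wait_writable() (or its kevent/epoll equivalent) says the peer has drained some data. No per-connection blocking context exists at any point.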
As a side note although Evgeniy likes M:N threading model ideas, they
are a mine field wrt. signal semantics. Solaris guys took several
years to get it right, just grep through the Solaris kernel patch
readme files over the years to get an idea of how bad it can be. I
would therefore never advocate such an approach.
The more I think about it, a reasonable solution might actually be to
use threadlets for disk I/O and pure event based processing for
networking. It is two different handling paths and non-unified,
but that might be the price for good performance :-)
David Miller wrote:
> From: Evgeniy Polyakov <[email protected]>
> Date: Thu, 22 Feb 2007 15:39:30 +0300
>
>
>> It does not matter - even with threads cost of having thousands of
>> threads is _too_ expensive. So, IMO, it is wrong to have to create
>> 20k threads for the simple web server which only sends one index page to
>> 80k connections at a rate of 4k connections per second.
>>
>> Just have that example in mind - more than 20k blocks in 80k connections
>> over gigabit lan, and it is likely optimistic result, when designing new
>> type of AIO.
>>
>
> I totally agree with Evgeniy on these points.
>
> Using things like syslets and threadlets for networking I/O
> is not a very good idea. Blocking is more the norm than the
> exception for networking I/O.
>
And for O_DIRECT, and for large storage systems which overwhelm caches.
The optimize-for-the-nonblocking-case approach does not fit all
workloads. And of course we have to be able to mix mostly-nonblocking
threadlets and mostly-blocking O_DIRECT and networking.
--
error compiling committee.c: too many arguments to function
* Ingo Molnar <[email protected]> wrote:
> Firstly, i dont think you are fully applying the syslet/threadlet
> model. There is no reason why an 'idle' client would have to use up a
> full thread! It all depends on how you use syslets/threadlets, and how
> (frequently) you queue back requests from cachemiss threads back to
> the primary thread. It is only the simplest queueing model where there
> is one thread per request that is currently blocked.
> Syslets/threadlets do /not/ force request processing to be performed
> in the async context forever - the async thread could very much queue
> it back to the primary context. (That's in essence what Tux did.) So
> the same state-machine techniques can be applied on both the syslet
> and the threadlet model, but in much more natural (and thus lower
> overhead) points: /between/ system calls and not in the middle of
> them. There are a number of measures that can be used to keep the
> number of parallel threads down.
i think the best model here is to use kevents or epoll to discover
accept()-able or recv()-able keepalive sockets, and to do the main
request loop via syslets/threadlets, with a 'queue back to the main
context if we went async and if the request is done' feedback mechanism
that keeps the size of the pool down.
Ingo
On Thu, Feb 22, 2007 at 06:47:04AM -0800, David Miller ([email protected]) wrote:
> As a side note although Evgeniy likes M:N threading model ideas, they
> are a mine field wrt. signal semantics. Solaris guys took several
> years to get it right, just grep through the Solaris kernel patch
> readme files over the years to get an idea of how bad it can be. I
> would therefore never advocate such an approach.
I have fully synchronous kevent signal delivery for that purpose :)
Having all events synchronous allows trivial handling of them -
including signals.
> The more I think about it, a reasonable solution might actually be to
> use threadlets for disk I/O and pure event based processing for
> networking. It is two different handling paths and non-unified,
> but that might be the price for good performance :-)
Hmm, yes, for such a scenario we need some kind of event delivery
mechanism, which would allow waiting on different kinds of events.
In the above sentence I see painfully familiar letters -
letter k
letter e
letter v
letter e
letter n
letter t
Or more modern trend - async_wait(epoll).
--
Evgeniy Polyakov
* David Miller <[email protected]> wrote:
> The pushback to the primary thread you speak of is just extra work in
> my mind, for networking. Better to just begin operations and sit in
> the primary thread(s) waiting for events, and when they arrive push
> the operations further along using non-blocking writes, reads, and
> accept() calls. There is no blocking context really needed for these
> kinds of things, so a mechanism that tries to provide one is a waste.
one question is, what is cheaper, to block out of a read and a write and
to set up the event notification and then to return to the user context,
or to just stay right in there with all the context already constructed
and on the stack, and schedule away and then come back and queue back to
the primary thread once the condition the thread is waiting for is done?
The latter isnt all that unattractive in my mind, because it always does
forward progress, with minimal 'backout' costs.
furthermore, in a real webserver there's a whole lot of other stuff
happening too: VFS blocking, mutex/lock blocking, memory pressure
blocking, filesystem blocking, etc., etc. Threadlets/syslets cover them
/all/ and never hold up the primary context: as long as there's more
requests to process, they will be processed. Plus other important
networked workloads, like fileservers are typically on fast LANs and
those requests are very much a fire-and-forget matter most of the time.
in any case, this definitely needs to be measured.
Ingo
* Ingo Molnar <[email protected]> wrote:
> > The pushback to the primary thread you speak of is just extra work
> > in my mind, for networking. Better to just begin operations and sit
> > in the primary thread(s) waiting for events, and when they arrive
> > push the operations further along using non-blocking writes, reads,
> > and accept() calls. There is no blocking context really needed for
> > these kinds of things, so a mechanism that tries to provide one is a
> > waste.
>
> one question is, what is cheaper, to block out of a read and a write and
^-------to back out
> to set up the event notification and then to return to the user
> context, or to just stay right in there with all the context already
> constructed and on the stack, and schedule away and then come back and
> queue back to the primary thread once the condition the thread is
> waiting for is done? The latter isnt all that unattractive in my mind,
> because it always does forward progress, with minimal 'backout' costs.
Ingo
From: Ingo Molnar <[email protected]>
Date: Thu, 22 Feb 2007 16:15:09 +0100
> furthermore, in a real webserver there's a whole lot of other stuff
> happening too: VFS blocking, mutex/lock blocking, memory pressure
> blocking, filesystem blocking, etc., etc. Threadlets/syslets cover them
> /all/ and never hold up the primary context: as long as there's more
> requests to process, they will be processed. Plus other important
> networked workloads, like fileservers are typically on fast LANs and
> those requests are very much a fire-and-forget matter most of the time.
I expect clients of a fileserver to cause the server to block in
places such as tcp_sendmsg() as much if not more so than a webserver
:-)
But yes, it should all be tested, for sure.
On Thu, 22 Feb 2007, Ingo Molnar wrote:
>
> * Ulrich Drepper <[email protected]> wrote:
>
> > Ingo Molnar wrote:
> > > in terms of AIO, the best queueing model is i think what the kernel uses
> > > internally: freely ordered, with barrier support.
> >
> > Speaking of AIO, how do you imagine lio_listio is implemented? If
> > there is no asynchronous syscall it would mean creating a threadlet
> > for each request but this means either waiting or creating
> > several/many threads.
>
> my current thinking is that special-purpose (non-programmable, static)
> APIs like aio_*() and lio_*(), where every last cycle of performance
> matters, should be implemented using syslets - even if it is quite
> tricky to write syslets (which they no doubt are - just compare the size
> of syslet-test.c to threadlet-test.c). So i'd move syslets into the same
> category as raw syscalls: pieces of the raw infrastructure between the
> kernel and glibc, not an exposed API to apps. [and even if we keep them
> in that category they still need quite a bit of API work, to clean up
> the 32/64-bit issues, etc.]
Now that chains of syscalls can be way more easily handled with clets^wthreadlets,
why would we need the whole syslets crud inside?
Why can't aio_* be implemented with *simple* (or parallel/unrelated)
syscall submit w/out the burden of a complex, limiting and heavy API (I
won't list all the points against syslets, because I already did it
enough times)? The compat layer alone is so bad that it's not even funny.
Look at the code. Only removing the syslets crud would prolly cut 40% of
it. And we did not even touch the compat code yet.
- Davide
> From: "Michael K. Edwards"
> Newsgroups: gmane.linux.kernel
> Subject: Re: [patch 00/13] Syslets, "Threadlets", generic AIO support, v3
> Date: Thu, 22 Feb 2007 00:59:10 -0800
[stripping cc list]
[]
> On 2/21/07, Ingo Molnar <[email protected]> wrote:
>> > [...] As for threadlets, making them kernel threads is not such a good
>> > design feature, O(1) scheduler or not. You take the full hit of
>> > kernel task creation, on the spot, for every threadlet that blocks.
>> > [...]
>>
>> this is very much not how they work. Threadlets share the same basic
>> infrastructure with syslets and they do /not/ take a 'full hit of kernel
>> thread creation' when they block. Please read the announcements, past
>> discussions on lkml and the code about how it works.
[]
> Yes, I will go back and read the code for myself. This will take me
> some time because I have only a hand-waving level of knowledge about
> task_structs and pt_regs, and have largely avoided the dark corners of
> the x86 architecture.
This architecture was brought to us by windows on our screens. And it
took years (a decade?) for them to use all hardware features:
(IA-32, i386) --> (MS Windows NT,9X)
Yet you must still do much system programming to use those features.
While
> But I think my point still stands: allowing code inside threadlets to
> use the usual C library wrappers around the usual synchronous syscalls
> is going to mean that the threadlet context is fairly heavyweight, both
> in memory and CPU/MMU state. This means that you can't pull it cheaply
> over to whatever CPU core happened to process the device I/O that
> delivered the data it wanted.
[]
> Oh, and while I haven't written a kernel or an RDBMS, I have written
> some fairly serious non-blocking I/O code (without resorting to
> threads; one socket and thousands of independent userspace state
> machines) and rewritten the plumbing of two fairly heavy-duty network
> operations frameworks, one of them attached to a horrifically complex
> GUI. And briefly, long ago, I made Transputers dance with Occam and
> galaxies spin with PVM.
transputers were (AFAIK) completely orthogonal to any of today's x86 CPU
architectures -- hardware parallelism, a special programming language and
technique to match this hardware. And they were chosen to work on Mars in
the mid-90s, while the crowd wanted more stupid windows on cheap CPUs.
> This gives me exactly zero credentials here, of course, but may suggest
> to you that when I say something that seems naive, it's more likely
> that I got the Linux-specific lingo wrong. That, or I'm _actively_
> stupid. :-)
Thus, i think, you are thinking mostly on the hardware level, while it's
a longstanding software problem, i.e. to use x86 (:.
Regards.
--
-o--=O`C
#oo'L O
<___=E M
On Thu, 22 Feb 2007, Evgeniy Polyakov wrote:
> > maybe it will, maybe it wont. Lets try? There is no true difference
> > between having a 'request structure' that represents the current state
> > of the HTTP connection plus a statemachine that moves that request
> > between various queues, and a 'kernel stack' that goes in and out of
> > runnable state and carries its processing state in its stack - other
> > than the amount of RAM they take. (the kernel stack is 4K at a minimum -
> > so with a million outstanding requests they would use up 4 GB of RAM.
> > With 20k outstanding requests it's 80 MB of RAM - that's acceptable.)
>
> I tried already :) - I just made allocations atomic in tcp_sendmsg() and
> ended up with 1/4 of the sends blocking (I counted both allocation
> failure and socket queue overflow). Those 20k blocked requests were
> created in about 20 seconds, so roughly speaking we have 1k thread
> creations/frees per second - do we want this?
A dynamic pool will smooth thread creation/freeing up by a lot.
And, on my box a *pthread* create/free takes ~10us, at 1000/s is 10ms, 1%.
Bad, but not so awful ;)
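The 1% figure is just rate times cost. A minimal sketch, assuming a fixed per-thread create/free cost (the ~10us is one measurement from one box, not a general constant); the helper name is illustrative only:

```c
/* Fraction of one CPU spent creating and freeing threads, assuming a
 * fixed create+free cost in microseconds (hypothetical helper). */
static double churn_cpu_fraction(double threads_per_sec, double cost_us)
{
	return threads_per_sec * cost_us / 1e6;
}
```

For 1000 threads per second at 10us each this gives 0.01, i.e. 10ms of every second.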
Look, I'm *definitely* not trying to advocate the use of async syscalls for
network here, just pointing out that when we're talking about threads,
Linux does a pretty good job.
- Davide
On Thu, 22 Feb 2007, David Miller wrote:
> The more I think about it, a reasonable solution might actually be to
> use threadlets for disk I/O and pure event based processing for
> networking. It is two different handling paths and non-unified,
> but that might be the price for good performance :-)
Well, it takes 20 lines of userspace C code to bring *unification* to the
universe ;)
- Davide
On 2/22/07, Oleg Verych <[email protected]> wrote:
> > Yes, I will go back and read the code for myself. This will take me
> > some time because I have only a hand-waving level of knowledge about
> > task_structs and pt_regs, and have largely avoided the dark corners of
> > the x86 architecture.
>
> This architecture was brought to us by windows on our screens. And it
> took years (a decade?) for them to use all hardware features:
>
> (IA-32, i386) --> (MS Windows NT,9X)
>
> Yet you must still do much system programming to use those features.
Actually, this architecture was brought to us largely by WordPerfect,
VisiCalc, and IEEE754. Nobody really cares what bloody operating
system the thing is running; they cared then, and care now, about
being able to write reports and memos that print cleanly and to build
spreadsheets that calculate correctly. Both of these things are made
much more practical by predictable floating point semantics, which
meant at first that you had to write your own floating point library.
The first (and for a long time the only) piece of hardware to put
_usable_ hardware floating point within reach of the desktop was the
Intel 80387. Usable, not because it was more accurate than the soft
version (it wasn't, actually quite the reverse), but because it got
the exception semantics right.
The '387 is what made the PC architecture the _only_ choice for the
corporate desktop, pretty much immediately upon its release in 1987.
My first corporate job was porting an electric utility's in-house
revenue requirements application -- written in Fortran with assembly
display routines -- from a Prime mini to the PC. I still have the
nice leather coffee coaster with the Prime logo on my desk. The rest
of Prime is dead and forgotten, largely because of the '387.
> transputers were (AFAIK) completely orthogonal to any of today's x86 CPU
> architectures -- hardware parallelism, a special programming language and
> technique to match this hardware. And they were chosen to work on Mars in
> the mid-90s, while the crowd wanted more stupid windows on cheap CPUs.
Y'know, what goes around comes around. The HyperTransport CPU
interconnect from AMD that all the overclocker types are so excited
about is just the Transputer ports warmed over, with a modern cache
coherency protocol stacked on top. Worked on one of those too, on the
Intel HyperCube -- it's not my fault that the i860 (predecessor of the
Itanic, in a way) lost its marbles so comprehensively on IRQ that you
couldn't do anything I/O intensive with it.
My first government job (at NASA) started with a crash course in Occam
and explicit parallelism, but it was so blindingly obvious that this
had no future outside its little niche that I looked around for other
stuff to do. The adjacent console belonged to a Symbolics LISP
machine -- also clearly a box with no future, since the only
applications for it that still mattered (expert systems) were in the
last stages of being ported to a C-based expert system engine
developed down the hall (which is now open source, and which I have
had occasion to use for many purposes since). I was a Mac weenie at
the time, so I polished my C skills working on the Mac port of CLIPS
and writing a genetic algorithm engine. Had I stuck to the
Transputer, I would probably know a lot more about faking NUMA using a
cache coherency protocol than I do today.
> Thus, i think, you are thinking mostly on the hardware level, while it's
> a longstanding software problem, i.e. to use x86 (:.
I don't think much about hardware or software. I think about shipping
products in volume at positive gross margin, even when what's coming
out of my fingertips is source code and shell commands. That's why
I've worked mostly on widgets with ARMs inside in recent years. But
I'm kinda bored with that, and Niagara or Octeon may wind up cheap in
volume if somebody with extra fab capacity scoops up the wreckage of
Sun or Cavium, so I'm here harassing people more competent than I
about what it takes to make them programmable by mere mortals.
Cheers,
- Michael
On 2/22/07, Ingo Molnar <[email protected]> wrote:
> > It is not a TUX anymore - you had 1024 threads, and all of them will
> > be consumed by tcp_sendmsg() for slow clients - rescheduling will kill
> > a machine.
>
> maybe it will, maybe it wont. Lets try? There is no true difference
> between having a 'request structure' that represents the current state
> of the HTTP connection plus a statemachine that moves that request
> between various queues, and a 'kernel stack' that goes in and out of
> runnable state and carries its processing state in its stack - other
> than the amount of RAM they take. (the kernel stack is 4K at a minimum -
> so with a million outstanding requests they would use up 4 GB of RAM.
> With 20k outstanding requests it's 80 MB of RAM - that's acceptable.)
This is a fundamental misconception. The state machine doesn't have
to do anything but chase pointers through cache. Done right, it
hardly even branches (although the branch misprediction penalty is a
lot less of a worry on current x86_64 than it was in the
mega-superscalar-out-of-order-speculative-execution days). It's damn
near free -- but it's a pain in the butt to code, and it has to be
done either in-kernel or in per-CPU OS-atop-the-OS dispatch threads.
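To make the 'hardly even branches' claim concrete, here is a minimal table-driven sketch. The states, events and names are invented for illustration and do not come from any existing server; the point is only that advancing a request is a single array lookup.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical request states and events, invented for this sketch. */
enum req_state { REQ_READ_HEADERS, REQ_SEND_REPLY, REQ_DONE, REQ_NR_STATES };
enum req_event { EV_READABLE, EV_WRITABLE, EV_HANGUP, EV_NR_EVENTS };

struct request {
	enum req_state state;
};

/* next_state[state][event]: one table lookup per event, with no
 * per-state branching in the dispatcher. */
static const enum req_state next_state[REQ_NR_STATES][EV_NR_EVENTS] = {
	[REQ_READ_HEADERS] = { REQ_SEND_REPLY, REQ_READ_HEADERS, REQ_DONE },
	[REQ_SEND_REPLY]   = { REQ_SEND_REPLY, REQ_DONE,         REQ_DONE },
	[REQ_DONE]         = { REQ_DONE,       REQ_DONE,         REQ_DONE },
};

/* Advance one request in response to one event. */
static void req_handle(struct request *req, enum req_event ev)
{
	assert(req != NULL && (unsigned int)ev < EV_NR_EVENTS);
	req->state = next_state[req->state][ev];
}
```

The per-request footprint here is a single enum, compared with a full kernel stack per blocked thread. The cost is that every blocking point must be enumerated by hand as a state.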
The scheduler, on the other hand, has to blow and reload all of the
hidden state associated with force-loading the PC and wherever your
architecture keeps its TLS (maybe not the whole TLB, but not nothing,
either). The only way around this that I can think of is to make
threadlets promise that they will not touch anything thread-local, and
that when the FPU is handed to them in a specific, known state, they
leave it in that same state. (Some of the flags can be
unspecified-but-don't-touch-me.) Then you can schedule threadlets in
bursts with negligible transition cost from one to the next.
There is, however, a substantial setup cost for a burst, because you
have to put the FPU in that known state and lock out TLS access (this
is user code, after all). If the wrong process is in foreground, you
also need to switch process context at the start of a burst; no
fandangos on other processes' core, please, and to be remotely useful
the threadlets need access to process-global data structures and
synchronization primitives anyway. That's why you need for threadlets
to have a separate SCHED_THREADLET priority and at least a weak
ordering by PID. At which point you are outside the feature set of
the O(1) scheduler as I understand it, and you might as well schedule
them from the next tasklet following the softirq dispatcher.
> > My tests show that with 4k connections per second (8k concurrency)
> > more than 20k connections of 80k total block in tcp_sendmsg() over
> > gigabit lan between quite fast machines.
>
> yeah. Note that you can have a million sleeping threads if you want, the
> scheduler wont care. What matters more is the amount of true concurrency
> that is present at any given time. But yes, i agree that overscheduling
> can be a problem.
What matters is that a burst of I/O responses be scheduled efficiently
without taking down the rest of the box. That, and the ability to
cancel no-longer-interesting I/O requests in bulk, without leaking
memory and synchronization primitives all over the place. If you
don't have that, this scheme is UNUSABLE for network I/O.
> btw., what is the measurement utility you are using with kevents ('ab'
> perhaps, with a high -c concurrency count?), and which webserver are you
> using? (light-httpd?)
Do me a favor. Do some floating point math and a memcpy() in between
syscalls in the threadlet. Actually fiddle with errno and the FPU
rounding flags. Watch it slow to a crawl and/or break floating point
arithmetic horribly. Understand why no one with half a brain uses
Java, or any other language which cuts FP corners for the sake of
cheap threads, for calculations that have to be correct. (Note that
Kahan received the Turing award for contributions to IEEE 754. If his
polemic is too thick, read
http://www-128.ibm.com/developerworks/java/library/j-jtp0114/.)
Cheers,
- Michael
> Plus there's the fundamental killer that KAIO is a /lot/ harder to
> implement (and to maintain) on the kernel side: it has to be implemented
> for every IO discipline, and even for the IO disciplines it supports at
> the moment, it is not truly asynchronous for things like metadata
> blocking or VFS blocking. To handle things like metadata blocking it has
> to resort to non-statemachine techniques like retries - which are bad
> for performance.
Yes, yes, yes.
As one of the poor suckers who has been fixing bugs in fs/aio.c and
fs/direct-io.c, I really want everyone to read Ingo's paragraph a few
times. Have it printed on a t-shirt.
Look at the number of man-years that have gone into fs/aio.c and
fs/direct-io.c. After all that effort it *barely* supports non-blocking
O_DIRECT IO.
The maintenance overhead of those two files, above all else, is what
pushed me to finally try that nutty fibril attempt.
> Syslets/threadlets on the other hand, once the core is implemented, have
> near zero ongoing maintenance cost (compared to KAIO pushed into every
> IO subsystem) and cover all IO disciplines and API variants immediately,
> and they are as perfectly asynchronous as it gets.
Amen.
As an experiment, I'm working on backing the sys_io_*() calls with
syslets. It's looking very promising so far.
> So all in one, i used to think that AIO state-machines have a long-term
> place within the kernel, but with syslets i think i've proven myself
> embarrassingly wrong =B-)
Welcome to the party :).
- z
> The more I think about it, a reasonable solution might actually be to
> use threadlets for disk I/O and pure event based processing for
> networking. It is two different handling paths and non-unified,
> but that might be the price for good performance :-)
I generally agree, with some comments.
If we come to the decision that there are some message rates that are
better suited to delivery into a user-read ring (10gige rx to kevent,
say) then it doesn't seem like it would be much of a stretch to add a
facility where syslet completion could be funneled into that channel
as well.
I also wonder if there isn't some opportunity to cut down the number
of syscalls / op in networking land. Is it madness to think of a
call like recvmsgv() which could provide a vector of msghdrs? It
might not make sense, but it might cut down on the per-op overhead
for loads that know they're going to be heavy enough to get a decent
amount of batching without fatally harming latency. Maybe those
loads are rare..
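To make the batching idea concrete: recvmsgv() is only a name floated
above, but mainline Linux later gained recvmmsg(2), which takes an array
of struct mmsghdr and fills in as many as are ready in one call. A minimal
sketch, assuming a datagram socket; BATCH, MSG_MAX and recv_batch() are
names made up for this illustration, not anything from the patches:

```c
#define _GNU_SOURCE             /* recvmmsg() is a Linux/glibc extension */
#include <assert.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/uio.h>
#include <unistd.h>

#define BATCH   8               /* illustrative: max datagrams per call */
#define MSG_MAX 64              /* illustrative: max bytes per datagram */

/*
 * Receive up to BATCH datagrams from fd with a single system call.
 * Returns the number of datagrams received, or -1 on error (for
 * example EAGAIN when nothing is queued). lens[i] is set to the
 * length of datagram i.
 */
static int recv_batch(int fd, char bufs[BATCH][MSG_MAX],
                      unsigned int lens[BATCH])
{
    struct mmsghdr msgs[BATCH];
    struct iovec iov[BATCH];
    int i, n;

    memset(msgs, 0, sizeof(msgs));
    for (i = 0; i < BATCH; i++) {
        iov[i].iov_base = bufs[i];
        iov[i].iov_len = MSG_MAX;
        msgs[i].msg_hdr.msg_iov = &iov[i];
        msgs[i].msg_hdr.msg_iovlen = 1;
    }

    /* MSG_DONTWAIT: take whatever is queued right now, do not sleep. */
    n = recvmmsg(fd, msgs, BATCH, MSG_DONTWAIT, NULL);
    for (i = 0; i < n; i++)
        lens[i] = msgs[i].msg_len;
    return n;
}
```

Whether this wins anything depends on the load being heavy enough that
several messages are usually queued by the time the call is made, which
is the same caveat as above.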
- z
On Thu, Feb 22, 2007 at 01:23:57PM -0800, Zach Brown wrote:
> As one of the poor suckers who has been fixing bugs in fs/aio.c and
> fs/direct-io.c, I really want everyone to read Ingo's paragraph a few
> times. Have it printed on a t-shirt.
direct-io.c is evil. Ridiculously.
> Amen.
>
> As an experiment, I'm working on backing the sys_io_*() calls with
> syslets. It's looking very promising so far.
Great, I'd love to see the comparisons.
> >So all in one, i used to think that AIO state-machines have a long-term
> >place within the kernel, but with syslets i think i've proven myself
> >embarrassingly wrong =B-)
>
> Welcome to the party :).
Well, there are always the 2.4 patches which are properly state driven and
reasonably simple. Retry was born out of a need to come up with a mechanism
that had less impact on the core kernel code, and yes, it seems to be a
failure and in dire need of replacement.
One other implementation to consider is actually using kernel threads
compared to how syslets perform. Direct IO for one always blocks, so
there shouldn't be much of a performance difference compared to syslets,
with the bonus that no arch specific code is needed.
-ben
--
"Time is of no importance, Mr. President, only life is important."
Don't Email: <[email protected]>.
On 2/22/07, Ingo Molnar <[email protected]> wrote:
> Secondly, even assuming lots of pending requests/async-threads and a
> naive queueing model, an open request will eat up resources on the
> server no matter what.
Another fundamental misconception. Kernel AIO is not for servers.
One programmer in a hundred is working on a server codebase, and one
in a thousand dares to touch server plumbing. Kernel AIO is for
clients, especially when mated to GUIs with an event delivery
mechanism. Ask yourself why the one and only thing that Windows NT
has ever gotten right about networking is I/O completion ports.
Cheers,
- Michael
> direct-io.c is evil. Ridiculously.
You will have a hard time finding someone to defend it, I predict :).
There is good news on that front, too. Chris (Mason) is making
progress on getting rid of the worst of the Magical Locking that
makes buffered races with O_DIRECT ops so awful.
I'm not holding my breath for a page cache so fine grained that it
could pin and reference 512B granular user regions and build bios
from them, though that sure would be nice :).
>> As an experiment, I'm working on backing the sys_io_*() calls with
>> syslets. It's looking very promising so far.
>
> Great, I'd love to see the comparisons.
I'm out for data. If it sucks, well, we'll know just how much. I'm
pretty hopeful that it won't :).
> One other implementation to consider is actually using kernel threads
> compared to how syslets perform. Direct IO for one always blocks, so
> there shouldn't be much of a performance difference compared to syslets,
> with the bonus that no arch specific code is needed.
Yeah, I'm starting with raw kernel threads so we can get some numbers
before moving to syslets.
One of the benefits syslets bring is the ability to start processing
the next pending request the moment current request processing
blocks. In the concurrent O_DIRECT write case that avoids releasing
a ton of kernel threads which all just run to serialize on i_mutex
(potentially bouncing it around cache domains) as the O_DIRECT ops
are built and sent.
-z
> to do anything but chase pointers through cache. Done right, it
> hardly even branches (although the branch misprediction penalty is a
> lot less of a worry on current x86_64 than it was in the
> mega-superscalar-out-of-order-speculative-execution days). It's damn
Actually it costs a lot more on at least one vendor's processor because
you stall very long pipelines.
> threadlets promise that they will not touch anything thread-local, and
> that when the FPU is handed to them in a specific, known state, they
> leave it in that same state. (Some of the flags can be
We don't use the FPU in the kernel except in very weird cases where it
makes an enormous performance difference. The threadlets also have the
same page tables so they have the same %cr3 so it's very cheap to switch,
basically a predicted jump and some register loads
> Do me a favor. Do some floating point math and a memcpy() in between
> syscalls in the threadlet. Actually fiddle with errno and the FPU
We don't have an errno in the kernel because it's a stupid idea. Errno is
a user space hack for compatibility with 1970's bad design. So it's not
relevant either.
Alan
> It's brilliant for disk I/O, not for networking for which
> blocking is the norm not the exception.
>
> So people will have to likely do something like divide their
> applications into handling for I/O to files and I/O to networking.
> So beautiful. :-)
>
> Nobody has proposed anything yet which scales well and handles both
> cases.
The truly brilliant thing about the whole "create a thread on blocking"
is that you immediately make *every* system call asynchronous-capable,
including the thousands of obscure ioctls, without having to boil the
ocean rewriting 5/6 of the kernel from implicit (stack-based) to
explicit state machines.
You're right that it doesn't solve everything, but it's a big step
forward while keeping a reasonably clean interface.
Now, we have some portions of the kernel (to be precise, those that
currently support poll() and select()) that are written as explicit
state machines and can block on a much smaller context structure.
In truth, the division you assume above isn't so terrible.
My applications are *already* written like that. It's just "poll()
until I accumulate a whole request, then fork a thread to handle it."
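For concreteness, that pattern looks roughly like the sketch below. The
framing rule (a request ends with a newline), struct request and the
function names are all invented for the example; a real server would
multiplex many fds in one poll() set rather than one at a time:

```c
#include <poll.h>
#include <pthread.h>
#include <unistd.h>

#define REQ_MAX 256

struct request {
    char data[REQ_MAX];
    size_t len;
    int handled;                /* set by the handler thread */
};

/* Hypothetical framing rule: a request is complete once it ends in '\n'. */
static int request_complete(const struct request *r)
{
    return r->len > 0 && r->data[r->len - 1] == '\n';
}

/* Handler thread: free to block, it already owns a complete request. */
static void *handle_request(void *arg)
{
    struct request *r = arg;

    r->handled = 1;
    return NULL;
}

/*
 * poll() on fd until a whole request is buffered, then start a thread
 * for it. Returns 0 and stores the thread in *tid on success, -1 on
 * error, EOF, or an oversized request.
 */
static int accumulate_then_spawn(int fd, struct request *r, pthread_t *tid)
{
    struct pollfd pfd = { .fd = fd, .events = POLLIN };

    r->len = 0;
    r->handled = 0;
    while (!request_complete(r)) {
        ssize_t n;

        if (r->len == REQ_MAX)
            return -1;          /* request too large */
        if (poll(&pfd, 1, -1) < 0)
            return -1;
        n = read(fd, r->data + r->len, REQ_MAX - r->len);
        if (n <= 0)
            return -1;          /* error or EOF mid-request */
        r->len += (size_t)n;
    }
    return pthread_create(tid, NULL, handle_request, r) == 0 ? 0 : -1;
}
```

The accumulation loop is the explicit state machine (its state is just
data[] and len); everything after pthread_create() is implicit,
stack-based state.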
The only way to avoid allocating a kernel stack is to have the entire
handling code path, including the return to user space, written in
explicit state machine style. (Once you get to user space, you can have
a threading library there if you like.) All the flaming about different
ways to implement completion notification is precisely because not much
is known about the best way to do it; there aren't a lot of applications
that work that way.
(Certainly that's because it wasn't possible before, but it's clearly
an area that requires research, so not committing to an implementation
is A Good Thing.)
But once that is solved, and "system call complete" can be reported
without returning to a user-space thread (which is basically an alternate
system call submission interface, *independent* of the fibril/threadlet
non-blocking implementation), then you can find the hot paths in the
kernel and special-case them to avoid creating a whole thread.
To use a networking analogy, this is a cleanly layered protocol design,
with an optimized fast path *implementation* that blurs the boundaries.
As for the overhead of threading, there are basically three parts:
1) System call (user/kernel boundary crossing) costs. These depend only
on the total number of system calls and not on the number of threads
making them. They can be mitigated *if necessary* with a syslet-like
"macro syscall" mechanism to increase the work per boundary crossing.
The only place threading might increase these numbers is thread
synchronization, and futexes already solve that pretty well.
2) Register and stack swapping. These (and associated cache issues)
are basically unavoidable, and are the bare minimum that longjmp()
does. Nothing thread-based is going to reduce this. (Actually,
the kernel can do better than user space because it can do lazy FPU
state swapping.)
3) MMU context switch costs. These are the big ones, particularly on x86
without TLB context IDs. However, these fall into a few categories:
- Mandatory switches because the entire application is blocked.
I don't see how this can be avoided; these are the cases where
even a user-space longjmp-based thread library would context
switch.
- Context switches between threads in an application. The Linux
kernel already optimizes out the MMU context switch in this case,
and the scheduler already knows that such context switches are
cheaper and preferred.
The one further optimization that's possible is if you have a system
call that (in a common case) blocks multiple times *without accessing
user memory*. This is not a read() or write(), but could be
something like fsync() or ftruncate(). In this case, you could
temporarily mark the thread as a "kernel thread" that can run in any
MMU context, and then fix it explicitly when you unmark it on the
return path.
I can see the space overhead of 1:1 threading, but I really don't think
there's much time overhead.
On 2/22/07, Alan <[email protected]> wrote:
> > to do anything but chase pointers through cache. Done right, it
> > hardly even branches (although the branch misprediction penalty is a
> > lot less of a worry on current x86_64 than it was in the
> > mega-superscalar-out-of-order-speculative-execution days). It's damn
>
> Actually it costs a lot more on at least one vendor's processor because
> you stall very long pipelines.
You're right; I overreached there. I haven't measured branch
misprediction penalties in dog's years (I focus more on system latency
issues these days), so I'm just going on rumor. If your CPU vendor is
still playing the tune-for-SpecINT-at-the-expense-of-real-code game
(*cough* Itanic *cough*), get another CPU vendor -- while you still
can.
> > threadlets promise that they will not touch anything thread-local, and
> > that when the FPU is handed to them in a specific, known state, they
> > leave it in that same state. (Some of the flags can be
>
> We don't use the FPU in the kernel except in very weird cases where it
> makes an enormous performance difference. The threadlets also have the
> same page tables so they have the same %cr3 so it's very cheap to switch,
> basically a predicted jump and some register loads
Do you not understand that real user code touches FPU state at
unpredictable (to the kernel) junctures? Maybe not in a database or a
web server, but in the GUIs and web-based monitoring applications that
are 99% of the potential customers for kernel AIO? I have no idea
what a %cr3 is, but if you don't fence off thread-local stuff from the
threadlets you are just begging for end-user Heisenbugs and a place in
the dustheap of history next to Symbolics LISP.
> > Do me a favor. Do some floating point math and a memcpy() in between
> > syscalls in the threadlet. Actually fiddle with errno and the FPU
>
> We don't have an errno in the kernel because it's a stupid idea. Errno is
> a user space hack for compatibility with 1970's bad design. So it's not
> relevant either.
Dude, it's thread-local, and the glibc wrapper around most synchronous
syscalls touches it. If you don't instantiate a new TLS context (or
whatever the right lingo for that is) for every threadlet, you are
TOAST -- if you let the user call stuff out of <stdlib.h> (let alone
<stdio.h>) from within the threadlet.
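The "errno is thread-local" part is easy to check for ordinary pthreads
on a threaded libc; the sketch below (function names made up here) only
shows that each thread gets its own errno location. It says nothing
about what a threadlet sees, which is exactly the open question:

```c
#include <errno.h>
#include <pthread.h>
#include <stdint.h>

/* Runs in a second thread: record where that thread's errno lives. */
static void *record_errno_address(void *arg)
{
    *(uintptr_t *)arg = (uintptr_t)&errno;
    return NULL;
}

/*
 * Returns 1 if the second thread's errno lives at a different address
 * than the caller's, 0 if they share one, -1 if the thread could not
 * be created or joined.
 */
static int errno_is_per_thread(void)
{
    uintptr_t other = 0;
    pthread_t t;

    if (pthread_create(&t, NULL, record_errno_address, &other) != 0)
        return -1;
    if (pthread_join(t, NULL) != 0)
        return -1;
    return other != (uintptr_t)&errno;
}
```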
Cheers,
- Michael
OK, having skimmed through Ingo's code once now, I can already see I
have some crow to eat. But I still have some marginally less stupid
questions.
Cachemiss threads are created with CLONE_VM | CLONE_FS | CLONE_FILES |
CLONE_SIGHAND | CLONE_THREAD | CLONE_SYSVSEM. Does that mean they
share thread-local storage with the userspace thread, have
thread-local storage of their own, or have no thread-local storage
until NPTL asks for it?
When the kernel zeroes the userspace stack pointer in
cachemiss_thread(), presumably the allocation of a new userspace stack
page is postponed until that thread needs to resume userspace
execution (after completion of the first I/O that missed cache). When
do you copy the contents of the threadlet function's stack frame into
this new stack page?
Is there anything in a struct pt_regs that is expensive to restore
(perhaps because it flushes a pipeline or cache that wasn't already
flushed on syscall entry)? Is there any reason why the FPU context
has to differ among threadlets that have blocked while executing the
same userspace function with different stacks? If the TLS pointer
isn't in either of these, where is it, and why doesn't
move_user_context() swap it?
If you set out to cancel one of these threadlets, how are you going to
ensure that it isn't holding any locks? Is there any reasonable way
to implement a userland finally { } block so that you can release
malloc'd memory and clean up application data structures?
If you want to migrate a threadlet to another CPU on syscall entry
and/or exit, what has to travel other than the userspace stack and the
struct pt_regs? (I am assuming a quiesced FPU and thread(s) at the
destination with compatible FPU flags.) Does it make sense for the
userspace stack page to have space reserved for a struct pt_regs
before the threadlet stack frame, so that the entire userspace
threadlet state migrates as one page?
I now see that an effort is already made to schedule threadlets in
bursts, grouped by PID, when several have unblocked since the last
timeslice. What is the transition cost from one threadlet to another?
Can that transition cost be made lower by reducing the amount of
state that belongs to the individual threadlet vs. the pool of
cachemiss threads associated with that threadlet entrypoint?
Generally, is there a "contract" that could be made between the
threadlet application programmer and the implementation which would
allow, perhaps in future hardware, the kind of invisible pipelined
coprocessing for AIO that has been so successful for FP?
I apologize for having adopted a hostile tone in a couple of previous
messages in this thread; remind me in the future not to alternate
between thinking about code and about the FSF. :-) I do really like
a lot of things about the threadlet model, and would rather not see it
given up on for network I/O and NUMA systems. So I'm going to
reiterate again -- more politely this time -- the need for a
data-structure-centric threadlet pool abstraction that supports
request throttling, reprioritization, bulk cancellation, and migration
of individual threadlets to the node nearest the relevant I/O port.
I'm still not sold on syslets as anything userspace-visible, but I
could imagine them enabling a sort of functional syntax for chaining
I/O operations, with most failures handled as inline "Not-a-Pointer"
values or as "AEIOU" (asynchronously executed I/O unit?) exceptions
instead of syscall-test-branch-syscall-test-branch. Actually working
out the semantics and getting them adopted as an IEEE standard could
even win someone a Turing award. :-)
Cheers,
- Michael
* Michael K. Edwards <[email protected]> wrote:
> On 2/22/07, Alan <[email protected]> wrote:
> > We don't use the FPU in the kernel except in very weird cases where
> > it makes an enormous performance difference. The threadlets also
> > have the same page tables so they have the same %cr3 so it's very
> > cheap to switch, basically a predicted jump and some register loads
>
> Do you not understand that real user code touches FPU state at
> unpredictable (to the kernel) junctures? Maybe not in a database or a
> web server, but in the GUIs and web-based monitoring applications that
> are 99% of the potential customers for kernel AIO?
> I have no idea what a %cr3 is, [...]
then please stop wasting Alan's time ...
Ingo
* David Miller <[email protected]> wrote:
> > furthermore, in a real webserver there's a whole lot of other stuff
> > happening too: VFS blocking, mutex/lock blocking, memory pressure
> > blocking, filesystem blocking, etc., etc. Threadlets/syslets cover
> > them /all/ and never hold up the primary context: as long as there's
> > more requests to process, they will be processed. Plus other
> > important networked workloads, like fileservers are typically on
> > fast LANs and those requests are very much a fire-and-forget matter
> > most of the time.
>
> I expect clients of a fileserver to cause the server to block in
> places such as tcp_sendmsg() as much if not more so than a webserver
> :-)
yeah, true. But ... i'd still like to mildly disagree with the
characterisation that, due to blocking being the norm in networking,
this goes against the concept of syslets/threadlets.
Networking has a 'work cache' too, in an abstract sense, which works in
favor of syslets: the socket buffers. If there is a reasonably sized
output queue for sockets (not extremely small like 4k per socket but
something at least 16k), then user-space can chunk up its workload along
pretty reasonable lines without assuming too much, and turn one such
'chunk' into one atomic step done in the 'syslet/threadlet request'. In
the rare cases where blocking happens in an unexpected way, the
syslet/threadlet 'goes async' but that only holds up that particular
request, and only for that chunk, not the main loop of processing and
doesnt force the request into an async thread forever.
the kevent model is very much about: /never ever/ block the main
thread. If you ever block, performance goes down the drain.
the syslet performance model is: if you block less than say 10-20% of
the time, you are basically as fast as the most extreme kevents based
model. Syslets/threadlets also avoid the fundamental workflow problem
that nonblocking designs have in a natural way: 'how do we guarantee
that the system progresses forward', which problem makes nonblocking
code quite fragile.
another property is that the performance curve is a lot less sensitive to
blocking in the syslet model - and real user-space servers will always
have unexpected points of blockage - unless all of userspace code is
perfectly converted into state machines - which is not reasonable. So
with syslets we are not forced to program everything as state-machines
in user-space, such techniques are only needed to reduce the amount of
context-switching and to reduce the RAM footprint - they wont change
fundamental scalability.
plus there's the hidden advantage of having a 'constructed state' on the
kernel stack: a thread that blocks in the middle of tcp_sendmsg() has
quite some state: the socket has been looked up in the hash(es), all
input parameters have been validated, the timer has been set, skbs have
been allocated ahead, etc. Those things do add up especially if you are
after a long wait and all those things are scattered around the memory
map cache-cold - not nicely composed into a single block of memory on
the stack (generating only a single cachemiss in essence).
All in one, i'm cautiously optimistic that even a totally naive,
blocks-itself-for-every-request syslet application would perform pretty
close to a Tux/kevents type of nonblock+event-queueing based application
- with a vastly higher generic utility benefit. So i'm not dismissing
this possibility and i'm not writing off syslet performance just because
they do context-switches :-)
Ingo
> Do you not understand that real user code touches FPU state at
> unpredictable (to the kernel) junctures? Maybe not in a database or a
We don't care. We don't have to care. The kernel threadlets don't execute
in user space and don't do FP.
> web server, but in the GUIs and web-based monitoring applications that
> are 99% of the potential customers for kernel AIO? I have no idea
> what a %cr3 is, but if you don't fence off thread-local stuff from the
How about you go read the intel architecture manuals then you might know
more.
> > We don't have an errno in the kernel because it's a stupid idea. Errno is
> > a user space hack for compatibility with 1970's bad design. So it's not
> > relevant either.
>
> Dude, it's thread-local, and the glibc wrapper around most synchronous
Last time I checked glibc was in userspace and the interface for kernel
AIO is a matter for the kernel so errno is irrelevant, plus any
threadlets doing system calls will only be living in kernel space anyway.
* Evgeniy Polyakov <[email protected]> wrote:
> [...] Those 20k blocked requests were created in about 20 seconds, so
> roughly saying we have 1k of thread creation/freeing per second - do
> we want this?
i'm not sure why you mention thread creation and freeing. The
syslet/threadlet code reuses already created async threads, and that is
visible all around in both the kernel-space and in the user-space
syslet/threadlet code.
While Linux creates+destroys threads pretty damn fast (in about 10-15
usecs - which is roughly the cost of getting a single 1-byte packet
through a TCP socket from one process to another, on localhost), still
we dont want to create and destroy a thread per request.
Ingo
On Thu, Feb 22, 2007 at 11:46:48AM -0800, Davide Libenzi ([email protected]) wrote:
> > I tried already :) - I just made allocations atomic in tcp_sendmsg() and
> > ended up with 1/4 of the sends blocking (I counted both allocation
> > failure and socket queue overflow). Those 20k blocked requests were
> > created in about 20 seconds, so roughly saying we have 1k of thread
> > creation/freeing per second - do we want this?
>
> A dynamic pool will smooth thread creation/freeing up by a lot.
> And, in my box a *pthread* create/free takes ~10us, at 1000/s is 10ms, 1%.
> Bad, but not so awful ;)
> Look, I'm *definitely* not trying to advocate the use of async syscalls for
> network here, just pointing out that when we're talking about threads,
> Linux does a pretty good job.
If we are going to create 1000 threads each second, then it is better to
preallocate them and queue work to that pool - like syslets did with
syscalls - rather than create a new thread each time just because it is
not that slow.
All such micro-thread designs are especially good in the case when
1. switching is _rare_ (very)
2. the programmer does not want to create a complex model to achieve maximum
performance
Disk (cached) IO definitely hits the first entry, and the second one is there
for advertisement and fast deployment, but overall usage of the
asynchronous IO model is not limited to the above scenarios, so
micro-threads definitely have their own niche, but they can not cover all
usage cases.
>
> - Davide
>
--
Evgeniy Polyakov
On Fri, Feb 23, 2007 at 12:51:52PM +0100, Ingo Molnar ([email protected]) wrote:
> > [...] Those 20k blocked requests were created in about 20 seconds, so
> > roughly saying we have 1k of thread creation/freeing per second - do
> > we want this?
>
> i'm not sure why you mention thread creation and freeing. The
> syslet/threadlet code reuses already created async threads, and that is
> visible all around in both the kernel-space and in the user-space
> syslet/threadlet code.
>
> While Linux creates+destroys threads pretty damn fast (in about 10-15
> usecs - which is roughly the cost of getting a single 1-byte packet
> through a TCP socket from one process to another, on localhost), still
> we dont want to create and destroy a thread per request.
I meant that we end up with having one thread per IO - they were
preallocated, but that does not matter. And what about your idea of
switching userspace threads to cachemiss threads?
My main concern was only about the situation when we end up with a truly
blocking context (like network), and this results in having thousands of
threads doing the work - even with most of them sleeping, there is a
problem with memory overhead and context switching, although it is a usable
situation, but when all of them are ready immediately - context switching
will kill a machine even with the O(1) scheduler, which made the situation damn
better than before, but it is not a cure for the problem.
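To put a number on the memory overhead: the figure Ingo gave in this
thread is a 4K minimum kernel stack per blocked context. A rough sketch
of that arithmetic (kernel_stack_bytes() is a made-up helper, and it
ignores the task_struct and any userspace stack, so it is a lower bound):

```c
#include <stdint.h>

/*
 * Lower bound on memory pinned by blocked threads: one kernel stack
 * of stack_size bytes per thread. Everything else a thread costs
 * (task_struct, userspace stack) is deliberately left out.
 */
static uint64_t kernel_stack_bytes(uint64_t nr_threads, uint64_t stack_size)
{
    return nr_threads * stack_size;
}
```

With 4096-byte stacks, 20k blocked threads come to about 80 MB and a
million to about 4 GB, matching the numbers Ingo gave.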
> Ingo
--
Evgeniy Polyakov
* Michael K. Edwards <[email protected]> wrote:
> On 2/22/07, Ingo Molnar <[email protected]> wrote:
> > maybe it will, maybe it wont. Lets try? There is no true difference
> > between having a 'request structure' that represents the current
> > state of the HTTP connection plus a statemachine that moves that
> > request between various queues, and a 'kernel stack' that goes in
> > and out of runnable state and carries its processing state in its
> > stack - other than the amount of RAM they take. (the kernel stack is
> > 4K at a minimum - so with a million outstanding requests they would
> > use up 4 GB of RAM. With 20k outstanding requests it's 80 MB of RAM
> > - that's acceptable.)
>
> This is a fundamental misconception. [...]
> The scheduler, on the other hand, has to blow and reload all of the
> hidden state associated with force-loading the PC and wherever your
> architecture keeps its TLS (maybe not the whole TLB, but not nothing,
> either). [...]
please read up a bit more about how the Linux scheduler works. Maybe
even read the code if in doubt? In any case, please direct kernel newbie
questions to http://kernelnewbies.org/, not [email protected].
Ingo
On Fri, Feb 23, 2007 at 03:22:25PM +0300, Evgeniy Polyakov ([email protected]) wrote:
> I meant that we end up with having one thread per IO - they were
> preallocated, but that does not matter. And what about your idea of
> switching userspace threads to cachemiss threads?
>
> My main concern was only about the situation when we end up with a truly
> blocking context (like network), and this results in having thousands of
> threads doing the work - even with most of them sleeping, there is a
> problem with memory overhead and context switching, although it is a usable
> situation, but when all of them are ready immediately - context switching
simultaneously
> will kill a machine even with the O(1) scheduler, which made the situation damn
> better than before, but it is not a cure for the problem.
Week of no-dictionary writings starts beating me.
--
Evgeniy Polyakov
On Wed, Feb 21 2007, Ingo Molnar wrote:
> this is the v3 release of the syslet/threadlet subsystem:
>
> http://redhat.com/~mingo/syslet-patches/
[snip]
Ingo, some testing of the experimental syslet queueing stuff, in the
syslet-testing branch of fio.
Fio job file:
[global]
bs=8k
size=1g
direct=0
ioengine=syslet-rw
iodepth=32
rw=read
[file]
filename=/ramfs/testfile
The only change between runs was the ioengine and iodepth, as indicated
in the table below.
Results:
Engine          Depth           Bw (MiB/sec)
--------------------------------------------
libaio              1           441
syslet              1           574
sync                1           589
libaio             32           613
syslet             32           681
Results are stable to within +/- 1MiB/sec. So you can see that syslets
are still a bit slower than sync for depth 1, but beat the pants off
libaio for equal depths. Note that this is buffered IO. I'll be out for
the weekend, but I'll hack some direct IO testing up next week to
compare "real" queuing.
Just a quick microbenchmark to gauge current overhead...
--
Jens Axboe
On Fri, Feb 23, 2007 at 01:52:47PM +0100, Jens Axboe wrote:
> On Wed, Feb 21 2007, Ingo Molnar wrote:
> > this is the v3 release of the syslet/threadlet subsystem:
> >
> > http://redhat.com/~mingo/syslet-patches/
>
> [snip]
>
> Ingo, some testing of the experimental syslet queueing stuff, in the
> syslet-testing branch of fio.
>
> Fio job file:
>
> [global]
> bs=8k
> size=1g
> direct=0
> ioengine=syslet-rw
> iodepth=32
> rw=read
>
> [file]
> filename=/ramfs/testfile
>
> The only change between runs was the ioengine and iodepth, as indicated
> in the table below.
>
> Results:
>
> Engine          Depth           Bw (MiB/sec)
> --------------------------------------------
> libaio              1           441
> syslet              1           574
> sync                1           589
> libaio             32           613
> syslet             32           681
>
> Results are stable to within +/- 1MiB/sec. So you can see that syslets
> are still a bit slower than sync for depth 1, but beat the pants off
> libaio for equal depths. Note that this is buffered IO. I'll be out for
> the weekend, but I'll hack some direct IO testing up next week to
> compare "real" queuing.
>
> Just a quick microbenchmark to gauge current overhead...
This is just ramfs, to gauge pure overheads, is that correct ?
BTW, I'm not surprised at Ingo's initial results of syslet vs libaio
overheads, for aio-stress/fio type streaming io runs, because these cases
do not involve large numbers of outstanding ios. So the overhead of
thread creation with syslets is amortized across the entire run of io
submissions because of the reuse of already created async threads. While
in the libaio case there is a setup and teardown of kiocbs per request.
What I have been concerned about instead in the past when considering
thread based AIO implementations is the resource (memory) consumption impact
on overall system performance and adaptability to varying loads. It is nice
that we can avoid that for the cached cases, but for the general blocking
cases, it is still not clear to me whether we have addressed this well
enough yet. I used to think that even the kiocb was too heavyweight for its
purpose ... especially in terms of scaling to larger loads.
As a really crude (and not very realistic) example of the potential impact
of large numbers of outstanding IOs, I tried some quick direct IO comparisons
using fio:
[global]
ioengine=syslet-rw
buffered=0
rw=randread
bs=64k
size=1024m
iodepth=64
Engine          Depth           Bw (MiB/sec)
libaio             64           17.323
syslet             64           17.524
libaio          20000           15.226
syslet          20000           11.015
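Read as relative change when going from depth 64 to depth 20000, the
table works out as below (pct_change() is a throwaway helper for this
note, not part of fio):

```c
/*
 * Percentage change in bandwidth going from `before` to `after`.
 * Negative means `after` is slower. `before` must be non-zero.
 */
static double pct_change(double before, double after)
{
    return (after - before) / before * 100.0;
}
```

pct_change(17.323, 15.226) for libaio is a drop of roughly 12%, and
pct_change(17.524, 11.015) for syslet is a drop of roughly 37%.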
Regards
Suparna
--
Suparna Bhattacharya ([email protected])
Linux Technology Center
IBM Software Lab, India
On Thu, Feb 22, 2007 at 03:36:58PM +0100, Ingo Molnar wrote:
>
> * Suparna Bhattacharya <[email protected]> wrote:
>
> > > maybe it will, maybe it wont. Lets try? There is no true difference
> > > between having a 'request structure' that represents the current
> > > state of the HTTP connection plus a statemachine that moves that
> > > request between various queues, and a 'kernel stack' that goes in
> > > and out of runnable state and carries its processing state in its
> > > stack - other than the amount of RAM they take. (the kernel stack is
> > > 4K at a minimum - so with a million outstanding requests they would
> > > use up 4 GB of RAM. With 20k outstanding requests it's 80 MB of RAM
> > > - that's acceptable.)
> >
> > At what point are the cachemiss threads destroyed ? In other words how
> > well does this adapt to load variations ? For example, would this 80MB
> > of RAM continue to be locked down even during periods of lighter loads
> > thereafter ?
>
> you can destroy them at will from user-space too - just start a slow
> timer that zaps them if load goes down. I can add a
> sys_async_thread_exit(nr_threads) API to be able to drive this without
> knowing the TIDs of those threads, and/or i can add a kernel-internal
> mechanism to zap inactive threads. It would be rather easy and
> low-overhead - the v2 code already had a max_nr_threads tunable, i can
> reintroduce it. So the size of the pool of contexts does not have to be
> permanent at all.
If you can find a way to do this without placing an additional tuning burden
on the administrator, that would certainly help! IIRC, performance problems
linked to having too many or too few AIO kernel threads have been a commonly
reported issue elsewhere - it would be nice to avoid repeating
the crux of that (mistake) in Linux. To me, any need to manually tune the
number has always seemed to defeat the very benefit of adaptability to varying
loads that AIO intrinsically provides.
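
The 80 MB and 4 GB figures quoted above follow from simple multiplication. A minimal sketch, assuming 4 KiB of kernel stack per blocked context (the minimum mentioned in the quoted text) and ignoring all other per-thread state; the names are illustrative:

```c
#include <assert.h>
#include <stdint.h>

/* Assumption: 4 KiB of kernel stack per blocked async thread, the
 * minimum quoted in the thread. Other per-thread memory is ignored. */
#define KSTACK_BYTES 4096ULL

/* Kernel stack memory held by the given number of blocked requests. */
static uint64_t pinned_stack_bytes(uint64_t outstanding_requests)
{
    return outstanding_requests * KSTACK_BYTES;
}

/* Usage:
 *   pinned_stack_bytes(20000)    is 81,920,000 bytes (roughly 80 MB)
 *   pinned_stack_bytes(1000000)  is 4,096,000,000 bytes (roughly 4 GB)
 */
```

Whether that memory stays pinned after the load drops is exactly the question raised here; the sketch only covers the peak figure.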
Regards
Suparna
>
> Ingo
--
Suparna Bhattacharya ([email protected])
Linux Technology Center
IBM Software Lab, India
* Suparna Bhattacharya <[email protected]> wrote:
> As a really crude (and not very realistic) example of the potential
> impact of large numbers of outstanding IOs, I tried some quick direct
> IO comparisons using fio:
>
> [global]
> ioengine=syslet-rw
> buffered=0
> rw=randread
> bs=64k
> size=1024m
> iodepth=64
could you please try those iodepth=20000 tests with the latest
fio-testing branch of fio as well? Jens wrote a new, smarter syslet
plugin for FIO. You'll need the v3 syslet kernel plus:
git-clone git://git.kernel.dk/data/git/fio.git
cd fio
git-checkout syslet-testing
my expectation is that it should behave better with iodepth=20000
(although i havent tried that yet).
Ingo
On Fri, Feb 23, 2007 at 03:58:26PM +0100, Ingo Molnar wrote:
>
> * Suparna Bhattacharya <[email protected]> wrote:
>
> > As a really crude (and not very realistic) example of the potential
> > impact of large numbers of outstanding IOs, I tried some quick direct
> > IO comparisons using fio:
> >
> > [global]
> > ioengine=syslet-rw
> > buffered=0
> > rw=randread
> > bs=64k
> > size=1024m
> > iodepth=64
>
> could you please try those iodepth=20000 tests with the latest
> fio-testing branch of fio as well? Jens wrote a new, smarter syslet
> plugin for FIO. You'll need the v3 syslet kernel plus:
>
> git-clone git://git.kernel.dk/data/git/fio.git
> cd fio
> git-checkout syslet-testing
>
> my expectation is that it should behave better with iodepth=20000
> (although i havent tried that yet).
I picked up the fio snapshot from 22nd Feb (fio-git-20070222212513.tar.gz)
and used the v3 syslet patches from your web-site.
Do I still need to get something more recent ?
Regards
Suparna
>
> Ingo
--
Suparna Bhattacharya ([email protected])
Linux Technology Center
IBM Software Lab, India
On Fri, Feb 23 2007, Suparna Bhattacharya wrote:
> On Fri, Feb 23, 2007 at 03:58:26PM +0100, Ingo Molnar wrote:
> >
> > * Suparna Bhattacharya <[email protected]> wrote:
> >
> > > As a really crude (and not very realistic) example of the potential
> > > impact of large numbers of outstanding IOs, I tried some quick direct
> > > IO comparisons using fio:
> > >
> > > [global]
> > > ioengine=syslet-rw
> > > buffered=0
> > > rw=randread
> > > bs=64k
> > > size=1024m
> > > iodepth=64
> >
> > could you please try those iodepth=20000 tests with the latest
> > fio-testing branch of fio as well? Jens wrote a new, smarter syslet
> > plugin for FIO. You'll need the v3 syslet kernel plus:
> >
> > git-clone git://git.kernel.dk/data/git/fio.git
> > cd fio
> > git-checkout syslet-testing
> >
> > my expectation is that it should behave better with iodepth=20000
> > (although i havent tried that yet).
>
> I picked up the fio snapshot from 22nd Feb (fio-git-20070222212513.tar.gz)
> and used the v3 syslet patches from your web-site.
>
> Do I still need to get something more recent ?
Yes, you need to test the syslet-testing branch that Ingo referenced.
Your test above is not totally fair right now, since you are doing
significantly fewer system calls with libaio. So to compare apples with
apples, try the syslet-testing branch. If you can't get it because of
firewall problems, check http://brick.kernel.dk/snaps/ for the latest
fio snapshot. If it has the syslet-testing branch, then that is
recent enough.
--
Jens Axboe
* Suparna Bhattacharya <[email protected]> wrote:
> > my expectation is that it should behave better with iodepth=20000
> > (although i havent tried that yet).
>
> I picked up the fio snapshot from 22nd Feb
> (fio-git-20070222212513.tar.gz) and used the v3 syslet patches from
> your web-site.
>
> Do I still need to get something more recent ?
yeah, there's something more recent. Please do this:
yum install git
git-clone git://git.kernel.dk/data/git/fio.git
cd fio
git-checkout syslet-testing
this should give you the latest version of the v3 based FIO code. It's
one generation newer than the one you tried. I mean the snapshot you
used is meanwhile a /whole/ day old, so it's truly ancient stuff! ;-)
Ingo
On Fri, Feb 23, 2007 at 05:25:08PM +0100, Jens Axboe wrote:
> On Fri, Feb 23 2007, Suparna Bhattacharya wrote:
> > On Fri, Feb 23, 2007 at 03:58:26PM +0100, Ingo Molnar wrote:
> > >
> > > * Suparna Bhattacharya <[email protected]> wrote:
> > >
> > > > As a really crude (and not very realistic) example of the potential
> > > > impact of large numbers of outstanding IOs, I tried some quick direct
> > > > IO comparisons using fio:
> > > >
> > > > [global]
> > > > ioengine=syslet-rw
> > > > buffered=0
> > > > rw=randread
> > > > bs=64k
> > > > size=1024m
> > > > iodepth=64
> > >
> > > could you please try those iodepth=20000 tests with the latest
> > > fio-testing branch of fio as well? Jens wrote a new, smarter syslet
> > > plugin for FIO. You'll need the v3 syslet kernel plus:
> > >
> > > git-clone git://git.kernel.dk/data/git/fio.git
> > > cd fio
> > > git-checkout syslet-testing
> > >
> > > my expectation is that it should behave better with iodepth=20000
> > > (although i havent tried that yet).
> >
> > I picked up the fio snapshot from 22nd Feb (fio-git-20070222212513.tar.gz)
> > and used the v3 syslet patches from your web-site.
> >
> > Do I still need to get something more recent ?
>
> Yes, you need to test the syslet-testing branch that Ingo referenced.
> Your test above is not totally fair right now, since you are doing
> significantly fewer system calls with libaio. So to compare apples with
> apples, try the syslet-testing branch. If you can't get it because of
> firewall problems, check http://brick.kernel.dk/snaps/ for the latest
> fio snapshot. If it has the syslet-testing branch, then that is
> recent enough.
I have a feeling this is getting to be a little more bleeding edge than
I had anticipated :), so will just hold off for a bit until this
crystallizes a bit.
Regards
Suparna
>
> --
> Jens Axboe
--
Suparna Bhattacharya ([email protected])
Linux Technology Center
IBM Software Lab, India
On Fri, 23 Feb 2007, Evgeniy Polyakov wrote:
> On Thu, Feb 22, 2007 at 11:46:48AM -0800, Davide Libenzi ([email protected]) wrote:
> >
> > A dynamic pool will smooth thread creation/freeing up by a lot.
> > And, in my box a *pthread* create/free takes ~10us, at 1000/s is 10ms, 1%.
> > Bad, but not so awful ;)
> > Look, I'm *definitely* not trying to advocate the use of async syscalls for
> > network here, just pointing out that when we're talking about threads,
> > Linux does a pretty good job.
>
> If we are going to create 1000 threads each second, then it is better to
> preallocate them and queue a work to that pool - like syslets did with
> syscalls, but not ultimately create a new thread just because it is not
> that slow.
We do create a pool indeed, as I said in the opening of my answer. The
numbers I posted were just to show that thread creation/destroy is pretty
fast, but that does not justify it as a design choice.
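
The "~10us, at 1000/s is 10ms, 1%" estimate quoted above can be checked with a small sketch. The function names are illustrative, and the inputs are the rough figures from the quoted message rather than fresh measurements:

```c
#include <assert.h>
#include <stdint.h>

/* CPU time spent creating and freeing threads, in microseconds per
 * second of wall-clock time. */
static uint64_t create_cost_us_per_sec(uint64_t cost_us,
                                       uint64_t creates_per_sec)
{
    return cost_us * creates_per_sec;
}

/* The same cost as a percentage of one CPU (1 s = 1,000,000 us). */
static double create_cost_pct(uint64_t cost_us, uint64_t creates_per_sec)
{
    return (double)create_cost_us_per_sec(cost_us, creates_per_sec)
           / 1000000.0 * 100.0;
}

/* Usage: create_cost_us_per_sec(10, 1000) is 10000 us, i.e. 10 ms of
 * every second, which create_cost_pct(10, 1000) reports as about 1%. */
```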
> All such micro-thread designs are especially good in the case when
> 1. switching is _rare_ (very)
> 2. programmer does not want to create complex model to achieve maximum
> performance
>
> Disk (cached) IO definitely hits first entry and second one is there for
> advertisements and fast deployment, but overall usage of the
> asynchronous IO model is not limited to the above scenario, so
> micro-threads definitely hit own niche, but they can not cover all usage
> cases.
You know, I read this a few times, but I still don't get what your point
is here ;) Are you talking about micro-thread design in the kernel as for
kthreads usage for AIO, or about userspace?
- Davide
On Fri, Feb 23, 2007 at 09:43:14AM -0800, Davide Libenzi ([email protected]) wrote:
> On Fri, 23 Feb 2007, Evgeniy Polyakov wrote:
>
> > On Thu, Feb 22, 2007 at 11:46:48AM -0800, Davide Libenzi ([email protected]) wrote:
> > >
> > > A dynamic pool will smooth thread creation/freeing up by a lot.
> > > And, in my box a *pthread* create/free takes ~10us, at 1000/s is 10ms, 1%.
> > > Bad, but not so awful ;)
> > > Look, I'm *definitely* not trying to advocate the use of async syscalls for
> > > network here, just pointing out that when we're talking about threads,
> > > Linux does a pretty good job.
> >
> > If we are going to create 1000 threads each second, then it is better to
> > preallocate them and queue a work to that pool - like syslets did with
> > syscalls, but not ultimately create a new thread just because it is not
> > that slow.
>
> We do create a pool indeed, as I said in the opening of my answer. The
> numbers I posted were just to show that thread creation/destroy is pretty
> fast, but that does not justify it as a design choice.
I was not clear - I meant why do we need to do that when we can run the
same code in userspace? And better if we can have non-blocking dataflows
and number of threads equal to number of processors...
> > All such micro-thread designs are especially good in the case when
> > 1. switching is _rare_ (very)
> > 2. programmer does not want to create complex model to achieve maximum
> > performance
> >
> > Disk (cached) IO definitely hits first entry and second one is there for
> > advertisements and fast deployment, but overall usage of the
> > asynchronous IO model is not limited to the above scenario, so
> > micro-threads definitely hit own niche, but they can not cover all usage
> > cases.
>
> You know, I read this a few times, but I still don't get what your point
> is here ;) Are you talking about micro-thread design in the kernel as for
> kthreads usage for AIO, or about userspace?
I started a week of writing without russian-english dictionary, so
expect some troubles in communications with me :)
I said that about kernel design - when we have thousand(s) of threads
which do the work, and the number of context switches is small (i.e. when
operations mostly do not block), then it is ok (although 'ps' output
with threads can scare a grandma).
It is also ok to say - 'hey, Linux has such an easy AIO model that
everyone should switch and start using it and not care about the problems
associated with multi-threaded programming with high concurrency',
but, in my opinion, those two cases together cannot cover all (or even
most of) the usage cases.
To eat my hat (or force others to do the same) I'm preparing a tree for
threadlet test - I plan to write a trivial web server
(accept/recv/send/sendfile in one threadlet function) and give it a try
soon.
> - Davide
>
--
Evgeniy Polyakov
On Fri, Feb 23 2007, Suparna Bhattacharya wrote:
> On Fri, Feb 23, 2007 at 05:25:08PM +0100, Jens Axboe wrote:
> > On Fri, Feb 23 2007, Suparna Bhattacharya wrote:
> > > On Fri, Feb 23, 2007 at 03:58:26PM +0100, Ingo Molnar wrote:
> > > >
> > > > * Suparna Bhattacharya <[email protected]> wrote:
> > > >
> > > > > As a really crude (and not very realistic) example of the potential
> > > > > impact of large numbers of outstanding IOs, I tried some quick direct
> > > > > IO comparisons using fio:
> > > > >
> > > > > [global]
> > > > > ioengine=syslet-rw
> > > > > buffered=0
> > > > > rw=randread
> > > > > bs=64k
> > > > > size=1024m
> > > > > iodepth=64
> > > >
> > > > could you please try those iodepth=20000 tests with the latest
> > > > fio-testing branch of fio as well? Jens wrote a new, smarter syslet
> > > > plugin for FIO. You'll need the v3 syslet kernel plus:
> > > >
> > > > git-clone git://git.kernel.dk/data/git/fio.git
> > > > cd fio
> > > > git-checkout syslet-testing
> > > >
> > > > my expectation is that it should behave better with iodepth=20000
> > > > (although i havent tried that yet).
> > >
> > > I picked up the fio snapshot from 22nd Feb (fio-git-20070222212513.tar.gz)
> > > and used the v3 syslet patches from your web-site.
> > >
> > > Do I still need to get something more recent ?
> >
> > Yes, you need to test the syslet-testing branch that Ingo referenced.
> > Your test above is not totally fair right now, since you are doing
> > significantly fewer system calls with libaio. So to compare apples with
> > apples, try the syslet-testing branch. If you can't get it because of
> > firewall problems, check http://brick.kernel.dk/snaps/ for the latest
> > fio snapshot. If it has the syslet-testing branch, then that is
> > recent enough.
>
> I have a feeling this is getting to be a little more bleeding edge than
> I had anticipated :), so will just hold off for a bit until this
> crystallizes a bit.
Fair enough, I'll try your test with a huge number of pending requests
and see how it fares.
--
Jens Axboe
On Fri, 23 Feb 2007, Evgeniy Polyakov wrote:
> I was not clear - I meant why do we need to do that when we can run the
> same code in userspace? And better if we can have non-blocking dataflows
> and number of threads equal to number of processors...
I've a userspace library that does exactly that (GUASI - GPL code avail if
you want, but no man page written yet). It uses a pool of threads and
queues requests. I've a bench that crawls through a directory and reads files.
The sync against async performance sucks. You can't do the cachehit
optimization in userspace. With network stuff could prolly do better
(since network is more heavily towards async), but still.
> I started a week of writing without russian-english dictionary, so
> expect some troubles in communications with me :)
>
> I said that about kernel design - when we have thousand(s) of threads
> which do the work, and the number of context switches is small (i.e. when
> operations mostly do not block), then it is ok (although 'ps' output
> with threads can scare a grandma).
> It is also ok to say - 'hey, Linux has such an easy AIO model that
> everyone should switch and start using it and not care about the problems
> associated with multi-threaded programming with high concurrency',
> but, in my opinion, those two cases together cannot cover all (or even
> most of) the usage cases.
>
> To eat my hat (or force others to do the same) I'm preparing a tree for
> threadlet test - I plan to write a trivial web server
> (accept/recv/send/sendfile in one threadlet function) and give it a try
> soon.
Funny, I lazily started doing the same thing last weekend (then I had to
stop, since real job kicked in ;). I wanted to compare a fully MT trivial
HTTP server:
http://www.xmailserver.org/thrhttp.c
with one that is event-driven (epoll) and coroutine based. This one will
only be compared for memory-content delivery, since it has no async vfs
capabilities. They both support the special "/mem-XXXX" url, that allows
an HTTP loader to request a given content size.
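
As a rough illustration of the "/mem-XXXX" convention, a server could extract the requested size along these lines. This is a sketch under assumptions: the message does not say how XXXX is encoded, so a decimal byte count is assumed here, and parse_mem_url is a hypothetical helper, not code taken from thrhttp.c.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>
#include <string.h>

/* Parse a "/mem-<size>" path and store the requested size in *size.
 * Assumption: <size> is a decimal byte count.
 * Returns 0 on success, -1 if the path is not a valid /mem- URL. */
static int parse_mem_url(const char *path, unsigned long *size)
{
    static const char prefix[] = "/mem-";
    const size_t plen = sizeof(prefix) - 1;
    char *end;
    unsigned long val;

    assert(path != NULL && size != NULL);

    if (strncmp(path, prefix, plen) != 0)
        return -1;
    path += plen;

    /* Reject an empty size, signs and leading whitespace up front,
     * since strtoul() would otherwise accept some of them. */
    if (*path < '0' || *path > '9')
        return -1;

    errno = 0;
    val = strtoul(path, &end, 10);
    if (errno != 0 || *end != '\0')
        return -1;

    *size = val;
    return 0;
}
```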
I also have a epoll+coroutine HTTP loader (that works around httperf
limitations).
Then, I wanted to compare the above, with one that is epoll+GUASI+coroutine
based (basically a userspace-only thingy).
I've the code for all the above.
Finally, with one that is epoll+syslet+coroutine based (no code for this
yet - but it should be an easy port from the GUASI one).
Keep in mind though, that a threadlet solution doing accept/recv/send/sendfile
is becoming blazingly similar to a full MT solution.
I can only imagine the thunders and flames that Larry would throw at us
for using all those threads :D
- Davide
On Fri, Feb 23, 2007 at 01:52:47PM +0100, Jens Axboe wrote:
> Results:
>
> Engine          Depth      Bw (MiB/sec)
> --------------------------------------------
> libaio          1          441
> syslet          1          574
> sync            1          589
> libaio          32         613
> syslet          32         681
Can we get runs with large I/Os, large I/O depths, and most
importantly tons of processes? I can absolutely believe that syslets
would compete well with one process on the system. But with 1000
processes doing 1000s of blocking I/Os, I'd really be interested to see
how that plays out.
Joel
--
Joel's Second Law:
If a code change requires additional user setup, it is wrong.
Joel Becker
Principal Software Developer
Oracle
E-mail: [email protected]
Phone: (650) 506-8127
On Wed, 21 Feb 2007, Ingo Molnar wrote:
> +asmlinkage long
> +sys_threadlet_on(unsigned long restore_stack,
> + unsigned long restore_eip,
> + struct async_head_user __user *ahu)
> +{
> + struct task_struct *t = current;
> + struct async_head *ah = t->ah;
> + struct async_thread *at = &t->__at;
> + long ret;
> +
> + /*
> + * Do not allow recursive calls of sys_threadlet_on():
> + */
> + if (t->async_ready || t->at)
> + return -EINVAL;
> +
> + if (unlikely(!ah)) {
> + ret = init_head(ah, t, ahu);
> + if (ret)
> + return ret;
> + ah = t->ah;
> + }
> +
> + if (unlikely(list_empty(&ah->ready_async_threads))) {
> + ret = refill_cachemiss_pool(ah, t, ahu);
> + if (ret)
> + return ret;
> + }
> +
> + t->async_ready = at;
> + ah->restore_stack = restore_stack;
> + ah->restore_eip = restore_eip;
> +
> + ah->ahu = ahu;
> +
> + return 0;
> +}
> +
> +asmlinkage long sys_threadlet_off(void)
> +{
> + struct task_struct *t = current;
> + struct async_head *ah = t->ah;
> +
> + /*
> + * Are we still executing as head?
> + */
> + if (ah) {
> + t->async_ready = NULL;
> +
> + return 1;
> + }
> +
> + /*
> + * We got turned into a cachemiss thread,
> + * return to user-space, which can do
> + * the notification, etc:
> + */
> + return 0;
> +}
If we have a new syscall that does the exec, we can save the two on/off
calls. Also, the complete_thread() thingy can be done automatically from
inside the kernel upon function return, thereby making the threadlet
function look like a normal thread function:
long thfn(void *) {
...
return error;
}
No?
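
The idea of completing automatically on function return can be modelled in plain user-space C. This is only a sketch: run_and_complete, struct completion_event and struct completion_ring are hypothetical names invented for the illustration and are not the syslet/threadlet ABI. It shows a wrapper recording the return value so that the function body needs no explicit completion call.

```c
#include <assert.h>
#include <stddef.h>

/* All names here are hypothetical; this is a user-space model of the
 * idea, not the kernel interface. */
typedef long (*threadlet_fn_t)(void *data);

struct completion_event {
    long status;   /* return value of the threadlet function */
    int  done;     /* set once the function has returned */
};

#define RING_SIZE 8

struct completion_ring {
    struct completion_event *slot[RING_SIZE];
    unsigned int head;   /* next slot to fill */
};

/* Run fn(data); on return, record the result and publish the event,
 * so fn itself never has to call a "complete" helper. */
static int run_and_complete(threadlet_fn_t fn, void *data,
                            struct completion_event *ev,
                            struct completion_ring *ring)
{
    assert(fn != NULL && ev != NULL && ring != NULL);

    if (ring->head >= RING_SIZE)
        return -1;   /* ring full; a real ring would wrap around */

    ev->status = fn(data);
    ev->done = 1;
    ring->slot[ring->head++] = ev;
    return 0;
}

/* A threadlet body written like a normal thread function. */
static long thfn(void *data)
{
    long *value = data;
    return *value * 2;
}
```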
- Davide
On 2/23/07, Alan <[email protected]> wrote:
> > Do you not understand that real user code touches FPU state at
> > unpredictable (to the kernel) junctures? Maybe not in a database or a
>
> We don't care. We don't have to care. The kernel threadlets don't execute
> in user space and don't do FP.
Blocked threadlets go back out to userspace, as new threads, after the
first blocking syscall completes. That's how Ingo described them in
plain English, that's how his threadlet example would have to work,
and that appears to be what his patches actually do.
> > web server, but in the GUIs and web-based monitoring applications that
> > are 99% of the potential customers for kernel AIO? I have no idea
> > what a %cr3 is, but if you don't fence off thread-local stuff from the
>
> How about you go read the intel architecture manuals then you might know
> more.
Y'know, there's more to life than x86. I'm no MMU expert, but I know
enough about ARM TLS and ptrace to have fixed ltrace -- not that that
took any special wizardry, just a need for it to work and some basic
forensic skill. If you want me to go away completely or not follow up
henceforth on anything you write, say so, and I'll decide what to do
in response. Otherwise, you might consider evaluating whether there's
a way to interpret my comments so that they reflect a perspective that
does not overlap 100% with yours rather than total idiocy.
> Last time I checked glibc was in userspace and the interface for kernel
> AIO is a matter for the kernel so errno is irrelevant, plus any
> threadlets doing system calls will only be living in kernel space anyway.
Ingo's original example code:
long my_threadlet_fn(void *data)
{
char *name = data;
int fd;
fd = open(name, O_RDONLY);
if (fd < 0)
goto out;
fstat(fd, &stat);
read(fd, buf, count)
...
out:
return threadlet_complete();
}
You're telling me that runs entirely in kernel space when open()
blocks, and doesn't touch errno if fstat() fails? Now who hasn't read
the code?
Cheers,
- Michael
> long my_threadlet_fn(void *data)
> {
> char *name = data;
> int fd;
>
> fd = open(name, O_RDONLY);
> if (fd < 0)
> goto out;
>
> fstat(fd, &stat);
> read(fd, buf, count)
> ...
>
> out:
> return threadlet_complete();
> }
>
> You're telling me that runs entirely in kernel space when open()
> blocks, and doesn't touch errno if fstat() fails? Now who hasn't read
> the code?
That example touches back into user space, but doesnt involve MMU changes
or cache flushes, or tlb flushes, or floating point.
errno is thread specific if you use it but errno is as I said before
entirely a C library detail that you don't have to suffer if you don't
want to. Avoiding that saves a segment register load - which isn't too
costly but isn't free.
Alan
Thanks for taking me at least minimally seriously, Alan. Pretty
generous of you, all things considered.
On 2/23/07, Alan <[email protected]> wrote:
> That example touches back into user space, but doesnt involve MMU changes
> or cache flushes, or tlb flushes, or floating point.
True -- on an architecture where a change of TLS does not
substantially affect the TLB and cache, which (AIUI) it does on most
or all ARMs. (On a pre-EABI ARM, there is even a substantial
cache-related penalty for encoding the syscall number in the syscall
opcode, because you have to peek back at the text segment to see it,
which costs you a D-cache stall.) Now put an sprintf with a %d in it
between a couple of the syscalls, and _your_ arch is hurting. Deny
the userspace programmer the use of the FPU in threadlets, and they
become a lot less widely applicable -- and a lot flakier in a
non-wizard's hands, given that people often cheat around the small
number of x86 integer registers by using FP registers when copying
memory in bulk.
> errno is thread specific if you use it but errno is as I said before
> entirely a C library detail that you don't have to suffer if you don't
> want to. Avoiding that saves a segment register load - which isn't too
> costly but isn't free.
On your arch, it's a segment register -- and another
who-knows-how-many pages to migrate along with the stack and pt_regs.
On ARM, it's a coprocessor register that is incorrectly emulated by
most JTAG emulators (so bye-bye JTAG-assisted debugging and
profiling), or possibly a register stolen from the general purpose
register set. On some MIPSes I have known you probably can't
implement TLS safely without a cache flush.
If you tell people up front not to touch TLS in threadlets -- which
means not to use routines from <stdlib.h> and <stdio.h> -- then
implementors may have enough flexibility to make them perform well on
a wide range of architectures. Alternately, if there are some things
that threadlet users will genuinely need TLS for, you can tell them
that all of the threadlets belonging to process X on CPU Y share a TLS
context, and therefore things like errno can't be trusted across a
syscall -- but then you had better make fairly sure that threadlets
aren't preempted by other threadlets in between syscalls. Similar
arguments apply to FPU state.
IEEE 754. Harp, harp. :-)
Cheers,
- Michael
I wrote:
> (On a pre-EABI ARM, there is even a substantial
> cache-related penalty for encoding the syscall number in the syscall
> opcode, because you have to peek back at the text segment to see it,
> which costs you a D-cache stall.)
Before you say it, I'm aware that this is not directly relevant to TLS
switch costs, except insofar as the "arch-dependent syscalls"
introduced for certain parts of ARM TLS handling carry the same
overhead as any other syscall. My point is that the system impact of
seemingly benign operations is not always predictable even to the arch
experts, and therefore one should be "parsimonious" (to use Kahan's
word) in defining what semantics programmers may rely on in
performance-critical situations.
If you arrange things so that threadlets are scheduled as much as
possible in bursts that share the same processor context (process
context, location in program text, TLS arena, FPU state -- basically
everything other than stack and integer registers), you are giving
yourself and future designers the maximum opportunity for exploiting
hardware optimizations. This would be a good thing if you want
threadlets to be performance-competitive with state machine designs.
If you still allow application programmers to _use_ shared processor
state, in the knowledge that it will be clobbered on threadlet switch,
then threadlets can use most of the coding style with which
programmers of event-driven frameworks are familiar. This would be a
good thing if you want threadlets to get wider use than the innards of
three or four databases and web servers.
Cheers,
- Michael
On 2/23/07, Michael K. Edwards <[email protected]> wrote:
> which costs you a D-cache stall.) Now put an sprintf with a %d in it
> between a couple of the syscalls, and _your_ arch is hurting. ...
er, that would be a %f. :-)
Cheers,
- Michael
* Davide Libenzi <[email protected]> wrote:
> > +asmlinkage long
> > +sys_threadlet_on(unsigned long restore_stack,
> > + unsigned long restore_eip,
> > + struct async_head_user __user *ahu)
> > +asmlinkage long sys_threadlet_off(void)
> If we have a new syscall that does the exec, we can save the two
> on/off calls.
the on/off calls are shaped in a way that makes them ultimately
vsyscall-able - the kernel only needs to know about the fact that we are
in a threadlet (so that the scheduler can do its special
push-head-to-another-context thing) - and this can be signalled via a
small user-space-side info structure as well, put into the TLS.
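
A minimal sketch of what such a user-space-side structure might look like, assuming C11 thread-local storage. The structure layout and the tl_enter/tl_leave names are hypothetical; the message does not specify them, and in a real design the kernel (not this code) would be the reader of the flag.

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical per-thread info block. The fields mirror the arguments
 * of sys_threadlet_on() shown in the patch, but the layout is invented. */
struct threadlet_info {
    int           in_threadlet;    /* nonzero while inside a threadlet */
    unsigned long restore_stack;   /* where the head context resumes */
    unsigned long restore_ip;
};

static _Thread_local struct threadlet_info tl_info;

/* Mark entry into a threadlet without a system call. Like the patch,
 * nested entry is rejected with -EINVAL. */
static int tl_enter(unsigned long restore_stack, unsigned long restore_ip)
{
    if (tl_info.in_threadlet)
        return -EINVAL;
    tl_info.restore_stack = restore_stack;
    tl_info.restore_ip = restore_ip;
    tl_info.in_threadlet = 1;
    return 0;
}

/* Mark exit. Returns 1 if the flag was set, 0 otherwise. */
static int tl_leave(void)
{
    int was_set = tl_info.in_threadlet;
    tl_info.in_threadlet = 0;
    return was_set;
}
```

Since entering and leaving are plain stores to thread-local memory, no kernel entry is needed on the fast path, which is the property that makes the calls "vsyscall-able".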
> [...] Also, the complete_thread() thingy can be done automatically
> from inside the kernel upon function return, by hence making the
> threadlet function look like a normal thread function:
yeah - and that's how it works in my current codebase already,
threadlet_off() takes a 'completion event' pointer as well, and the ahu.
I'll release v4 so that you can have a look.
Ingo
this is the v4 release of the syslet/threadlet subsystem:
http://redhat.com/~mingo/syslet-patches/
v4 is a smaller update than v3 (so i wont send out the full queue to
lkml - see the broken out queue in the patches-v4 directory at the URL
above). Changes since v3:
- the threadlet API changed: the sys_threadlet_off() syscall now takes
a 'completion event' pointer, and auto-completes it into the
completion ring. I've updated the test-threadlet.c code to make use of
it. So completion of threadlets and syslets is quite similar now -
sharing even more infrastructure. (To get true pthread compatibility a
sys_exit() driven CLEARTID completion method will be added too in the
future).
- a small performance fix for syslet rescheduling. The syslet ABI has
not changed.
- test-threadlet.c fixes a thread stack leak, and the other tests too
have a number of small fixes and cleanups.
Ingo
On Fri, Feb 23 2007, Joel Becker wrote:
> On Fri, Feb 23, 2007 at 01:52:47PM +0100, Jens Axboe wrote:
> > Results:
> >
> > Engine          Depth      Bw (MiB/sec)
> > --------------------------------------------
> > libaio          1          441
> > syslet          1          574
> > sync            1          589
> > libaio          32         613
> > syslet          32         681
>
> Can we get runs with large I/Os, large I/O depths, and most
> importantly tons of processes? I can absolutely believe that syslets
> would compete well with one process on the system. But with 1000
> processes doing 1000s of blocking I/Os, I'd really be interested to see
> how that plays out.
Sure, I'll add this to the testing list for monday.
--
Jens Axboe
On 2/23/07, Ingo Molnar <[email protected]> wrote:
> > This is a fundamental misconception. [...]
>
> > The scheduler, on the other hand, has to blow and reload all of the
> > hidden state associated with force-loading the PC and wherever your
> > architecture keeps its TLS (maybe not the whole TLB, but not nothing,
> > either). [...]
>
> please read up a bit more about how the Linux scheduler works. Maybe
> even read the code if in doubt? In any case, please direct kernel newbie
> questions to http://kernelnewbies.org/, not [email protected].
This is not the first kernel I've swum around in, and I've been
mucking with the Linux kernel since early 2.2 and coding assembly for
heavily pipelined processors on and off since 1990. So I may be a
newbie to your lingo, and I may even be a loud-mouthed idiot, but I'm
not a wet-behind-the-ears undergrad, OK?
Now, I've addressed the non-free-ness of a TLS swap elsewhere; what
about function pointers in state machines (with or without flipping
"supervisor mode" bits)? Just because loading the PC from a data
register is one opcode in the instruction stream does not mean that it
is not quite expensive in terms of blown pipeline state and I-cache
stalls. Really fast state machines exploit PC-relative branches that
really smart CPUs can speculatively execute past (after a few
traversals) because there are a small number of branch targets
actually hit. The instruction prefetch / scheduler unit actually
keeps a table of PC-relative jump instructions found in I-cache, with
a little histogram of destinations eventually branched to, and
speculatively executes down the top branch or two. (Intel Pentiums
have a fairly primitive but effective variant of this; see
http://www.x86.org/articles/branch/branchprediction.htm.)
More general mechanisms are called "branch target buffers" and US
Patent 6609194 is a good hook into the literature. A sufficiently
smart CPU designer may have figured out how to do something similar
with computed jumps (add pc, pc, foo), but odds are high that it cuts
out when you throw function pointers around. Syscall dispatch is a
special and heavily optimized case, though -- so it's quite
conceivable that a well designed userland switch/case state machine
that makes syscalls will outperform an in-kernel state machine data
structure traversal. If this doesn't happen to be true on today's
desktop, it may be on tomorrow's desktop or today's NUMA monstrosity
or embedded mega-multi-MIPS.
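
For readers unfamiliar with the two dispatch styles being contrasted, here is a small sketch of the same three-step state machine written both ways. The state names are invented for the illustration (loosely following the open/fstat/read example), and the code makes no claim about which form is faster on any given CPU.

```c
#include <assert.h>

enum state { ST_OPEN, ST_STAT, ST_READ, ST_DONE };

/* Style 1: switch/case. A compiler may lower this to direct
 * compare-and-branch sequences or to a jump table. */
static enum state step_switch(enum state s)
{
    switch (s) {
    case ST_OPEN: return ST_STAT;
    case ST_STAT: return ST_READ;
    case ST_READ: return ST_DONE;
    default:      return ST_DONE;
    }
}

/* Style 2: a table of function pointers. Each transition is an
 * indirect call through a pointer loaded from data. */
typedef enum state (*step_fn)(void);

static enum state do_open(void) { return ST_STAT; }
static enum state do_stat(void) { return ST_READ; }
static enum state do_read(void) { return ST_DONE; }

static const step_fn step_table[] = { do_open, do_stat, do_read };

static enum state step_indirect(enum state s)
{
    return s < ST_DONE ? step_table[s]() : ST_DONE;
}

/* Drive a machine from ST_OPEN to ST_DONE and count the transitions. */
static int run(enum state (*step)(enum state))
{
    enum state s = ST_OPEN;
    int transitions = 0;

    assert(step != 0);
    while (s != ST_DONE) {
        s = step(s);
        transitions++;
    }
    return transitions;
}
```

Both variants visit the same states; only the way the next step is selected differs, which is the distinction the argument above turns on.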
There can also be other reasons why tabulated PC-relative jumps and
immediate PC loads are faster than PC loads from data registers.
Take, for instance, the Transmeta Crusoe, which (AIUI) used a trick
similar to the FX!32 x86 emulation on Alpha/NT. If you're going to
"translate" CISC to RISC on the fly, you're going to recognize
switch/case idioms (including tabulated PC-relative branches), and fix
up the translated branch table to contain offsets to the
RISC-translated branch targets. So the state transitions are just as
cheap as if they had been compiled to RISC in the first place. Do it
with function pointers, and the execution machine is going to have
to stall while it looks up the text location to see if it has it
translated in I-cache somewhere. Guess what: the PIV works the same
way (http://www.karbosguide.com/books/pcarchitecture/chapter12.htm).
Are you starting to get the picture that syslets -- clever as they
might have been on a VAX -- defeat many of the mechanisms that CPU and
compiler architects have negotiated over decades for accelerating real
code? Especially now that we have hyper-threaded CPUs (parallel
instruction decode/issue units sharing almost all of their cache
hierarchy), you can almost treat the kernel as if it were microcode
for a syscall coprocessor. If you try to migrate application code
across the syscall boundary, you may perform well on micro-benchmarks
but you're storing up trouble for the future.
If you don't think this kind of fallout is real, talk to whoever had
the bright idea of hijacking FPU registers to implement memcpy in
1996. The PIII designers rolled over and added XMM so
micro-optimizers would get their dirty mitts off the FPU, which it
appears that Doug Ledford and Jim Blandy duly acted on in 1999. Yes,
you still need to use FXSAVE/FXRSTOR when you want to mess with the
XMM stuff, but the CPU is smart enough to keep a shadow copy of all
the microstate that the flag states represent. So if all you do
between FXSAVE and FXRSTOR is shlep bytes around with MOVAPS, the
FXRSTOR costs you little or nothing. What hurts is an FXRSTOR from a
location that isn't the last location you FXSAVEd to, or an FXRSTOR
after actual FP arithmetic instructions have altered status flags.
The preceding may contain errors in detail -- I am neither a CPU
architect nor an x86 compiler writer nor even a serious kernel hacker.
But hopefully it's at least food for thought. If not, you know where
the "ignore this prolix nitwit" key is to be found on your keyboard.
Cheers,
- Michael
On Sat, 24 Feb 2007, Michael K. Edwards wrote:
> The preceding may contain errors in detail -- I am neither a CPU
> architect nor an x86 compiler writer nor even a serious kernel hacker.
Ok, roger that. But why are you playing "Google & Preach" games to Ingo,
that ate bread and CPUs for the last 15 years?
- Davide
On Sat, 24 Feb 2007, Ingo Molnar wrote:
> * Davide Libenzi <[email protected]> wrote:
>
> > > +asmlinkage long
> > > +sys_threadlet_on(unsigned long restore_stack,
> > > + unsigned long restore_eip,
> > > + struct async_head_user __user *ahu)
>
> > > +asmlinkage long sys_threadlet_off(void)
>
> > If we have a new syscall that does the exec, we can save the two
> > on/off calls.
>
> the on/off calls are shaped in a way that makes them ultimately
> vsyscall-able - the kernel only needs to know about the fact that we are
> in a threadlet (so that the scheduler can do its special
> push-head-to-another-context thing) - and this can be signalled via a
> small user-space-side info structure as well, put into the TLS.
IMO it's not a matter of speed. We'll have those two new syscalls, that I
don't see other practical use for. IMO the best thing would be to hide all
inside the sys_threadlet_exec (or whatever name).
> > [...] Also, the complete_thread() thingy can be done automatically
> > from inside the kernel upon function return, by hence making the
> > threadlet function look like a normal thread function:
>
> yeah - and that's how it works in my current codebase already,
> threadlet_off() takes a 'completion event' pointer as well, and the ahu.
> I'll release v4 so that you can have a look.
Ok. Will look into it...
- Davide
On Feb 24, 2007, at 16:10:33, Davide Libenzi wrote:
> On Sat, 24 Feb 2007, Ingo Molnar wrote:
>> the on/off calls are shaped in a way that makes them ultimately
>> vsyscall-able - the kernel only needs to know about the fact that
>> we are in a threadlet (so that the scheduler can do its special
>> push-head-to-another-context thing) - and this can be signalled
>> via a small user-space-side info structure as well, put into the TLS.
>
> IMO it's not a matter of speed. We'll have those two new syscalls,
> that I don't see other practical use for. IMO the best thing would
> be to hide all inside the sys_threadlet_exec (or whatever name).
No, it absolutely is a matter of speed. The reason to have those two
implemented that way is so that they can be implemented as vsyscalls
completely in userspace. This means that on most modern platforms
you can implement the "make a threadlet when I block" semantic
without even touching kernel-mode. The way it's set up all you'd
have to do is save some parameters, set up a new callstack, and poke
a "1" into a memory address in the TLS. To stop, you effectively
just poke a "0" into the spot in the TLS and either return or
terminate the thread.
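A rough sketch of what that could look like from user-space, assuming a per-thread info block that the kernel would be told about. The structure layout and the function names here are hypothetical (nothing like this is defined in the v3 patches), and setting up the new callstack is left out entirely:

```c
#include <assert.h>

/*
 * Hypothetical per-thread info block. A real layout would be part of the
 * kernel ABI; this one exists only for illustration.
 */
struct threadlet_info {
	int active;                  /* 1 while inside a threadlet, else 0 */
	unsigned long restore_stack; /* where the head context resumes */
	unsigned long restore_ip;
};

/* __thread is the GCC/Clang spelling of thread-local storage. */
static __thread struct threadlet_info tl_info;

static inline void threadlet_on_sketch(unsigned long stack, unsigned long ip)
{
	assert(!tl_info.active); /* nesting is not modelled in this sketch */
	tl_info.restore_stack = stack;
	tl_info.restore_ip = ip;
	/* Compiler barrier: publish the parameters before raising the flag. */
	__asm__ __volatile__("" ::: "memory");
	tl_info.active = 1;          /* poke a "1" into the TLS */
}

static inline void threadlet_off_sketch(void)
{
	tl_info.active = 0;          /* poke a "0" into the TLS */
}

static inline int in_threadlet(void)
{
	return tl_info.active;
}
```

No kernel entry happens anywhere in this sketch; the scheduler would only need to look at the flag at the moment the thread actually blocks.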
Cheers,
Kyle Moffett
On Sat, 24 Feb 2007, Kyle Moffett wrote:
> On Feb 24, 2007, at 16:10:33, Davide Libenzi wrote:
> > On Sat, 24 Feb 2007, Ingo Molnar wrote:
> > > the on/off calls are shaped in a way that makes them ultimately
> > > vsyscall-able - the kernel only needs to know about the fact that we are
> > > in a threadlet (so that the scheduler can do its special
> > > push-head-to-another-context thing) - and this can be signalled via a
> > > small user-space-side info structure as well, put into the TLS.
> >
> > IMO it's not a matter of speed. We'll have those two new syscalls, that I
> > don't see other practical use for. IMO the best thing would be to hide all
> > inside the sys_threadlet_exec (or whatever name).
>
> No, it absolutely is a matter of speed. The reason to have those two
> implemented that way is so that they can be implemented as vsyscalls
> completely in userspace. This means that on most modern platforms you can
> implement the "make a threadlet when I block" semantic without even touching
> kernel-mode. The way it's set up all you'd have to do is save some
> parameters, set up a new callstack, and poke a "1" into a memory address in
> the TLS. To stop, you effectively just poke a "0" into the spot in the TLS
> and either return or terminate the thread.
Right. I don't know why, but I got the impression Ingo's threadlet_exec example
was just sketch code to be moved into a syscall. That's why I was talking
about a sys_threadlet_exec. But yeah, it makes a lot of sense to turn
threadlet_exec into a glibc thing, and play everything in userspace at that
point.
- Davide
On 2/24/07, Davide Libenzi <[email protected]> wrote:
> Ok, roger that. But why are you playing "Google & Preach" games to Ingo,
> that ate bread and CPUs for the last 15 years?
Sure I used Google -- for clickable references so that lurkers can
tell I'm not making these things up as I go along. Ingo and Alan have
obviously forgotten more about x86en than I will ever know, and I'm
carrying coals to Newcastle when I comment on pros and cons of XMM
memcpy.
But although the latest edition of the threadlet patches actually has
quite good internal documentation and makes most of its intent clear
even to a reader (me) who is unfamiliar with the code being patched,
it lacks "theory of operations". How is an arch maintainer supposed
to adapt this interface to a completely different CPU, with different
stuff in pt_regs and different cost profiles for blown pipelines and
reloaded coprocessor state? What are the hidden costs of this
particular style of M:N microthreading, and will they explode when
this model escapes out of the microbenchmarks and people who don't
know CPUs inside and out start using it? What standard
thread-pool-management use cases are being glossed over at kernel
level and left to Ulrich (or implementors of JVMs and other bytecode
machines) to sort out?
At some level, I'm just along for the ride; nobody with any sense is
going to pay me to design this sort of thing, and the level of effort
involved in coding an alternate AIO implementation is not something I
can afford to expend on non-revenue-producing activities even if I did
have the skill. Maybe half of my quibbles are sheer stupidity and
four out of five of the rest are things that Ingo has already taken
into account in v4 of his patch set. But that would leave one quibble in
ten that has some substance, which might save some nasty rework down
the line. Even if everything I ask about has a simple explanation,
and for Alan and Ingo to waste time spelling it out for me would
result in nothing but an accelerated "theory of operation" document,
would that be a bad thing?
Now I know very little about x86_64 other than that 64-bit code not
only has double-size integer registers to work with, it has twice as
many of them. So for all I know the transition to pure-64-bit 2-4
core x 2-4 thread/core systems, which is going to be 90% or more of
the revenue-generating Linux market over the next few years, makes all
of my concerns moot for Ingo's purposes. After all, as long as Linux
stays good enough to keep Oracle from losing confidence and switching
to Darwin or something, the 100 or so people who earn invites to the
kernel summit have cush jobs for life.
The rest of us would perhaps like for major proposed kernel overhauls
to be accompanied by some kind of analysis of their impact on arches
that live elsewhere in CPU parameter space. That analysis might
suggest small design refinements that make Linux AIO scale well on the
class of processors I'm interested in, too. And I personally would like
to see Ingo get that Turing award for designing AIO semantics that are
as big an advance over the past as IEEE 754 was over its predecessors.
He'd have to earn it, though.
Cheers,
- Michael
* Davide Libenzi <[email protected]> wrote:
> > No, it absolutely is a matter of speed. The reason to have those
> > two implemented that way is so that they can be implemented as
> > vsyscalls completely in userspace. This means that on most modern
> > platforms you can implement the "make a threadlet when I block"
> > semantic without even touching kernel-mode. The way it's set up all
> > you'd have to do is save some parameters, set up a new callstack,
> > and poke a "1" into a memory address in the TLS. To stop, you
> > effectively just poke a "0" into the spot in the TLS and either
> > return or terminate the thread.
>
> Right. I don't know why, but I got the impression Ingo's threadlet_exec
> example was just sketch code to be moved into a syscall. That's why I
> was talking about a sys_threadlet_exec. But yeah, it makes a lot of
> sense to turn threadlet_exec into a glibc thing, and play everything in
> userspace at that point.
yeah, not having to do any extra entry into the kernel at all (in the
cached case), and to make them in essence equivalent to a function call
is my plan/hope for threadlets :-)
Ingo
On Sun, Feb 25, 2007 at 05:33:10PM +0300, Evgeniy Polyakov ([email protected]) wrote:
> Hi Ingo.
>
> I tried to create web server application for threadlets to show (will
> not write what I wanted to show, but you might guess :)
>
> I started threadlet-test from async-test-v4 and got following bug.
>
> To compile v4 and v3 I need to apply a patch (sent in previous e-mail),
> which removes %xgs assignment since it is never used and actually does
> not even exist in pt_regs.
If I uncomment xgs and recompile - kernel crashes early on init...
Btw, could you create a git tree to pull from?
That would be much easier to track problems and to test for others.
--
Evgeniy Polyakov
On Wed, Feb 21, 2007 at 10:13:55PM +0100, Ingo Molnar ([email protected]) wrote:
> this is the v3 release of the syslet/threadlet subsystem:
>
> http://redhat.com/~mingo/syslet-patches/
There is no %xgs.
--- ./arch/i386/kernel/process.c~ 2007-02-24 22:56:14.000000000 +0300
+++ ./arch/i386/kernel/process.c 2007-02-24 22:53:19.000000000 +0300
@@ -426,7 +426,6 @@
regs.xds = __USER_DS;
regs.xes = __USER_DS;
- regs.xgs = __KERNEL_PDA;
regs.orig_eax = -1;
regs.eip = (unsigned long) async_thread_helper;
regs.xcs = __KERNEL_CS | get_kernel_rpl();
--
Evgeniy Polyakov
Hi Ingo.
I tried to create a web server application for threadlets to show (will
not write what I wanted to show, but you might guess :)
I started threadlet-test from async-test-v4 and got the following bug.
To compile v4 and v3 I need to apply a patch (sent in previous e-mail),
which removes the %xgs assignment since it is never used and actually does
not even exist in pt_regs.
It is a 2.6.21-rc1 tree (the patch failed in Makefile (no need to export your
kernel extraversion) and in asm/i386/kernel/process (i387
initialization)), both are trivially fixable by hand.
Machine is via epia, fc5 distro.
I'm currently compiling the tree with debug support to decode where the bug
lives if it is reproduced.
[ 894.597934] general protection fault: 0000 [#1]
[ 894.597959] PREEMPT SMP
[ 894.597975] Modules linked in: dm_mod nvram ehci_hcd uhci_hcd
snd_via82xx snd_ac97_codec ac97_bus snd_seq_dummy snd_seq_oss
snd_seq_midi_event snd_seq snd_pcm_oss snd_mixer_oss snd_pcm snd_timer
snd_page_alloc snd_mpu401_uart snd_rawmidi snd_seq_device snd generic
soundcore via_rhine i2c_viapro i2c_core mii ext3 jbd ide_disk
[ 894.598131] CPU: 0
[ 894.598135] EIP: 0060:[<c012ab99>] Not tainted VLI
[ 894.598141] EFLAGS: 00010282 (2.6.21-rc1 #2)
[ 894.598179] EIP is at cachemiss_thread+0xa/0xc6
[ 894.598200] eax: cdfa8ef4 ebx: cdfa8ef4 ecx: cdfa8ef4 edx: cdfa8ef4
[ 894.598222] esi: 00000000 edi: 00000000 ebp: 00000000 esp: ca6aff94
[ 894.598245] ds: 007b es: 007b fs: 0000 gs: 0033 ss: 0068
[ 894.598268] Process threadlet-test (pid: 12030, ti=ca6ae000 task=cdf9a790 task.ti=ca6ae000)
[ 894.598290] Stack: cd28a080 cdfa8db8 c0114536 cdfa8c10 c012ab8f 00000000 00000000 c0103af8
[ 894.598339] cdfa8ef4 c012ab8f 00000000 cdfa8ef4 00000000 00000000 00000000 00000000
[ 894.598385] 0000007b 0000007b 00000000 ffffffff c0103af0 00000060 00000286 00000000
[ 894.598437] Call Trace:
[ 894.598463] [<c0114536>] schedule_tail+0x2d/0x80
[ 894.598497] [<c012ab8f>] cachemiss_thread+0x0/0xc6
[ 894.598526] [<c0103af8>] async_thread_helper+0x8/0x1c
[ 894.598561] [<c012ab8f>] cachemiss_thread+0x0/0xc6
[ 894.598598] [<c0103af0>] async_thread_helper+0x0/0x1c
[ 894.598631] =======================
[ 894.598648] Code: b8 1f 00 00 89 42 38 8b 53 04 8b 46 0c 81 c2 b8 1f
00 00 89 42 2c 83 c8 ff 83 c4 10 5b 5e 5f 5d c3 57 89 c1 56 53 89 c3 83
ec 10 <64> 8b 35 08 00 00 00 8d be c4 02 00 00 89 f0 89 fa e8 ab f8 ff
[ 894.598875] EIP: [<c012ab99>] cachemiss_thread+0xa/0xc6 SS:ESP 0068:ca6aff94
--
Evgeniy Polyakov
* Evgeniy Polyakov <[email protected]> wrote:
> On Wed, Feb 21, 2007 at 10:13:55PM +0100, Ingo Molnar ([email protected]) wrote:
> > this is the v3 release of the syslet/threadlet subsystem:
> >
> > http://redhat.com/~mingo/syslet-patches/
>
> There is no %xgs.
>
> --- ./arch/i386/kernel/process.c~ 2007-02-24 22:56:14.000000000 +0300
> +++ ./arch/i386/kernel/process.c 2007-02-24 22:53:19.000000000 +0300
> @@ -426,7 +426,6 @@
>
> regs.xds = __USER_DS;
> regs.xes = __USER_DS;
> - regs.xgs = __KERNEL_PDA;
hm, what tree are you using as a base? The syslet patches are against
v2.6.20 at the moment. (the x86 PDA changes will probably interfere with
it on v2.6.21-rc1-ish kernels) Note that otherwise the syslet/threadlet
patches are for x86 only at the moment (as i mentioned in the
announcement), and the generic code itself contains some occasional
x86-isms as well. (None of the concepts are x86-specific though -
multi-stack architectures should work just as well as RISC-ish CPUs.)
if you create a threadlet based test-webserver, could you please do a
comparable kevents implementation as well? I.e. same HTTP parser (or
non-parser, as usually the case is with prototypes ;). Best would be
something that one could trigger between threadlet and kevent mode,
using the same binary :-)
Ingo
* Evgeniy Polyakov <[email protected]> wrote:
> My main concern was only about the situation, when we end up with
> truly blocking context (like network), and this results in having
> thousands of threads doing the work - even having most of them
> sleeping, there is a problem with memory overhead and context
> switching, although it is usable situation, but when all of them are
> ready immediately - context switching will kill a machine even with
> O(1) scheduler which made situation damn better than before, but it is
> not a cure for the problem.
yes. This is why in the original fibril discussion i concentrated so
much on scheduling performance.
to me the picture is this: conceptually the scheduler runqueue is a
queue of work. You get items queued upon certain events, and they can
unqueue themselves. (there is also register context but that is already
optimized to death by hardware) So whatever scheduling overhead we have,
it's a pure software thing. It's because we have priorities attached.
It's because we have some legacies. Etc., etc. - it's all stuff /we/
wanted to add, but nothing truly fundamental on top of the basic 'work
queueing' model.
now look at kevents as the queueing model. It does not queue 'tasks', it
lets user-space queue requests in essence, in various states. But it's
still the same conceptual thing: a memory buffer with some state
associated to it. Yes, it has no legacies, it has no priorities and
other queueing concepts attached to it ... yet. If kevents got
mainstream, it would get the same kind of pressure to grow 'more
advanced' event queueing and event scheduling capabilities.
Prioritization would be needed, etc.
So my fundamental claim is: a kernel thread /is/ our main request
structure. We've got tons of really good system calls that queue these
'requests' around the place and offer functionality around this concept.
Plus there's a 1.2+ billion lines of Linux userspace code that works
well with this abstraction - while there's nary a few thousand lines of
event-based user-space code.
I also say that you'll likely get kevents to outperform threadlets. Maybe
even significantly so under the right conditions. But i very much
believe we want to get similar kind of performance out of thread/task
scheduling, and not introduce a parallel framework to do request
scheduling the hard way ... just because our task concept and scheduling
implementation got too fat. For the same reason i didnt really like
fibrils: they are nice, and Zach's core idea i think nicely survived in
the syslet/threadlet model too, but they are more limited than true
threads. So doing that parallel infrastructure, which really just
implements the same, and is only faster because it skips features, would
just be hiding the problem with our primary abstraction. Ok?
Ingo
On Sun, Feb 25, 2007 at 06:23:38PM +0100, Ingo Molnar ([email protected]) wrote:
>
> * Evgeniy Polyakov <[email protected]> wrote:
>
> > On Wed, Feb 21, 2007 at 10:13:55PM +0100, Ingo Molnar ([email protected]) wrote:
> > > this is the v3 release of the syslet/threadlet subsystem:
> > >
> > > http://redhat.com/~mingo/syslet-patches/
> >
> > There is no %xgs.
> >
> > --- ./arch/i386/kernel/process.c~ 2007-02-24 22:56:14.000000000 +0300
> > +++ ./arch/i386/kernel/process.c 2007-02-24 22:53:19.000000000 +0300
> > @@ -426,7 +426,6 @@
> >
> > regs.xds = __USER_DS;
> > regs.xes = __USER_DS;
> > - regs.xgs = __KERNEL_PDA;
>
> hm, what tree are you using as a base? The syslet patches are against
> v2.6.20 at the moment. (the x86 PDA changes will probably interfere with
> it on v2.6.21-rc1-ish kernels) Note that otherwise the syslet/threadlet
> patches are for x86 only at the moment (as i mentioned in the
> announcement), and the generic code itself contains some occasional
> x86-ishms as well. (None of the concepts are x86-specific though -
> multi-stack architectures should work just as well as RISC-ish CPUs.)
It is rc1 - and crashes.
I test on i386 via epia (the only machine which runs x86 right now).
If there will not be any new patches, I will create 2.6.20 test tree
tomorrow.
> if you create a threadlet based test-webserver, could you please do a
> comparable kevents implementation as well? I.e. same HTTP parser (or
> non-parser, as usually the case is with prototypes ;). Best would be
> something that one could trigger between threadlet and kevent mode,
> using the same binary :-)
Ok, I will create such a monster tomorrow :)
I will use the same base for threadlet as for kevent/epoll - there is no
parser, just sendfile() of the static file which contains http header
and actual page.
threadlet1 {
accept()
create threadlet2 {
send data
}
}
Is above scheme correct for threadlet scenario?
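For reference, the inner "send data" step could look roughly like the sketch below. It is written as an ordinary blocking function: how it would be launched as a threadlet is left out on purpose, since that API is still changing. The function name is hypothetical, the file is assumed to already contain the http header plus the page, and EINTR handling is omitted.

```c
#include <assert.h>
#include <fcntl.h>
#include <sys/sendfile.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Send the whole file at 'path' to client_fd using sendfile(2).
 * Returns the number of bytes sent, or -1 on error.
 */
static long send_static_file(int client_fd, const char *path)
{
	struct stat st;
	off_t off = 0;
	int file_fd;

	assert(client_fd >= 0 && path != NULL);

	file_fd = open(path, O_RDONLY);
	if (file_fd < 0)
		return -1;
	if (fstat(file_fd, &st) < 0) {
		close(file_fd);
		return -1;
	}

	/* sendfile() may send less than requested, so loop until done. */
	while (off < st.st_size) {
		ssize_t n = sendfile(client_fd, file_fd, &off,
				     (size_t)(st.st_size - off));
		if (n <= 0) {
			close(file_fd);
			return -1;
		}
	}

	close(file_fd);
	return (long)off;
}
```

In the threadlet variant the outer accept() loop would call something like this for each new connection, and it would only turn into a real thread if the send blocks.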
But note, that on my athlon64 3500 test machine kevent is about 7900
requests per second compared to 4000+ epoll, so expect a challenge.
lighttpd is about the same 4000 requests per second though, since it can
not be easily optimized for kevents.
> Ingo
--
Evgeniy Polyakov
* Evgeniy Polyakov <[email protected]> wrote:
> > hm, what tree are you using as a base? The syslet patches are
> > against v2.6.20 at the moment. (the x86 PDA changes will probably
> > interfere with it on v2.6.21-rc1-ish kernels) Note that otherwise
> > the syslet/threadlet patches are for x86 only at the moment (as i
> > mentioned in the announcement), and the generic code itself contains
> > some occasional x86-ishms as well. (None of the concepts are
> > x86-specific though - multi-stack architectures should work just as
> > well as RISC-ish CPUs.)
>
> It is rc1 - and crashes.
yeah. I'm not surprised. The PDA is not set up in create_async_thread()
for example.
> > if you create a threadlet based test-webserver, could you please do
> > a comparable kevents implementation as well? I.e. same HTTP parser
> > (or non-parser, as usually the case is with prototypes ;). Best
> > would be something that one could trigger between threadlet and
> > kevent mode, using the same binary :-)
>
> Ok, I will create such a monster tomorrow :)
>
> I will use the same base for threadlet as for kevent/epoll - there is
> no parser, just sendfile() of the static file which contains http
> header and actual page.
>
> threadlet1 {
> accept()
> create threadlet2 {
> send data
> }
> }
>
> Is above scheme correct for threadlet scenario?
yep, this is a good first cut. Doing this on the listening socket (before
bind(), so that it takes effect) is useful:
int one = 1;
ret = setsockopt(listen_sock_fd, SOL_SOCKET, SO_REUSEADDR,
(char *)&one, sizeof(int));
and i'd suggest to do this after every accept()-ed socket:
int flag = 1;
setsockopt(sock_fd, IPPROTO_TCP, TCP_NODELAY,
(char *) &flag, sizeof(int));
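Put together as two small helpers, this could be sketched as follows. The helper names are hypothetical, and note that SO_REUSEADDR generally has to be set before bind() to have any effect, so the first helper is meant to be called before bind()/listen():

```c
#include <assert.h>
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>
#include <unistd.h>

/* Create a TCP socket with SO_REUSEADDR set, ready for bind()/listen(). */
static int make_listen_socket(void)
{
	int one = 1;
	int fd = socket(AF_INET, SOCK_STREAM, 0);

	if (fd < 0)
		return -1;
	if (setsockopt(fd, SOL_SOCKET, SO_REUSEADDR, &one, sizeof(one)) < 0) {
		close(fd);
		return -1;
	}
	return fd;
}

/* Disable Nagle's algorithm on an accept()-ed socket. Returns 0 or -1. */
static int set_nodelay(int sock_fd)
{
	int flag = 1;

	assert(sock_fd >= 0);
	return setsockopt(sock_fd, IPPROTO_TCP, TCP_NODELAY,
			  &flag, sizeof(flag));
}
```

The caller would still do the usual bind(), listen() and accept() and call set_nodelay() on each accepted fd.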
Do you have any link where i could check the type of HTTP parsing and
send transport you are (or will be) using? What type of http client are
you using to measure, with precisely what options?
> But note, that on my athlon64 3500 test machine kevent is about 7900
> requests per second compared to 4000+ epoll, so expect a challenge.
single-core CPU i suspect?
> lighttpd is about the same 4000 requests per second though, since it
> can not be easily optimized for kevents.
mean question: do you promise to post the results even if they are not
unfavorable to threadlets? ;-)
if i want to test kevents on a v2.6.20 kernel base, do you have an URL
for me to try?
Ingo
On Sun, Feb 25, 2007 at 06:45:05PM +0100, Ingo Molnar ([email protected]) wrote:
>
> * Evgeniy Polyakov <[email protected]> wrote:
>
> > My main concern was only about the situation, when we end up with
> > truly blocking context (like network), and this results in having
> > thousands of threads doing the work - even having most of them
> > sleeping, there is a problem with memory overhead and context
> > switching, although it is usable situation, but when all of them are
> > ready immediately - context switching will kill a machine even with
> > O(1) scheduler which made situation damn better than before, but it is
> > not a cure for the problem.
>
> yes. This is why in the original fibril discussion i concentrated so
> much on scheduling performance.
>
> to me the picture is this: conceptually the scheduler runqueue is a
> queue of work. You get items queued upon certain events, and they can
> unqueue themselves. (there is also register context but that is already
> optimized to death by hardware) So whatever scheduling overhead we have,
> it's a pure software thing. It's because we have priorities attached.
> It's because we have some legacies. Etc., etc. - it's all stuff /we/
> wanted to add, but nothing truly fundamental on top of the basic 'work
> queueing' model.
>
> now look at kevents as the queueing model. It does not queue 'tasks', it
> lets user-space queue requests in essence, in various states. But it's
> still the same conceptual thing: a memory buffer with some state
> associated to it. Yes, it has no legacies, it has no priorities and
> other queueing concepts attached to it ... yet. If kevents got
> mainstream, it would get the same kind of pressure to grow 'more
> advanced' event queueing and event scheduling capabilities.
> Prioritization would be needed, etc.
>
> So my fundamental claim is: a kernel thread /is/ our main request
> structure. We've got tons of really good system calls that queue these
> 'requests' around the place and offer functionality around this concept.
> Plus there's a 1.2+ billion lines of Linux userspace code that works
> well with this abstraction - while there's nary a few thousand lines of
> event-based user-space code.
>
> I also say that you'll likely get kevents outperform threadlets. Maybe
> even significantly so under the right conditions. But i very much
> believe we want to get similar kind of performance out of thread/task
> scheduling, and not introduce a parallel framework to do request
> scheduling the hard way ... just because our task concept and scheduling
> implementation got too fat. For the same reason i didnt really like
> fibrils: they are nice, and Zach's core idea i think nicely survived in
> the syslet/threadlet model too, but they are more limited than true
> threads. So doing that parallel infrastructure, which really just
> implements the same, and is only faster because it skips features, would
> just be hiding the problem with our primary abstraction. Ok?
Kevent is a _very_ small entity and there is _no_ cost of requeueing
(well, there is list_add guarded by lock) - after it is done, process
can start real work. With rescheduling there are _too_ many things to be
done before we can start new work. We have to change registers, change
address space, various tlb bits and so on - we have to do it, since task
describes very heavy entity - the whole process.
IO in turn is a very small subset of what process is (can do), so there
is no need to change the whole picture, so it is enough to have one
process, which does the work.
Threads are a bit smaller than process, but still it is too heavy to
have it per IO - so we have pools - this decreases rescheduling
overhead, but limits parallelism.
I think it is _too_ heavy to have such a monster structure like
task(thread/process) and related overhead just to do an IO.
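To illustrate the "list_add guarded by lock" cost in isolation, here is a user-space sketch of a ready queue. It is not the kevent code: the struct names are hypothetical, and a singly linked FIFO with a C11 spinlock stands in for the kernel's list and locking primitives.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical event descriptor: a link plus a little state. */
struct event {
	struct event *next;
	int fd;
};

struct ready_queue {
	atomic_flag lock;            /* simple spinlock */
	struct event *head, *tail;
};

static void rq_lock(struct ready_queue *q)
{
	while (atomic_flag_test_and_set_explicit(&q->lock, memory_order_acquire))
		; /* spin */
}

static void rq_unlock(struct ready_queue *q)
{
	atomic_flag_clear_explicit(&q->lock, memory_order_release);
}

/* Requeue: a few pointer stores under the lock, nothing else. */
static void rq_add(struct ready_queue *q, struct event *ev)
{
	assert(ev != NULL);
	ev->next = NULL;
	rq_lock(q);
	if (q->tail)
		q->tail->next = ev;
	else
		q->head = ev;
	q->tail = ev;
	rq_unlock(q);
}

/* Take the oldest ready event, or NULL if the queue is empty. */
static struct event *rq_pop(struct ready_queue *q)
{
	struct event *ev;

	rq_lock(q);
	ev = q->head;
	if (ev) {
		q->head = ev->next;
		if (!q->head)
			q->tail = NULL;
	}
	rq_unlock(q);
	return ev;
}
```

Compare that with a task switch, which has to save and restore registers and possibly change the address space before any new work starts.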
> Ingo
--
Evgeniy Polyakov
On Sun, Feb 25, 2007 at 06:54:37PM +0100, Ingo Molnar ([email protected]) wrote:
>
> * Evgeniy Polyakov <[email protected]> wrote:
>
> > > hm, what tree are you using as a base? The syslet patches are
> > > against v2.6.20 at the moment. (the x86 PDA changes will probably
> > > interfere with it on v2.6.21-rc1-ish kernels) Note that otherwise
> > > the syslet/threadlet patches are for x86 only at the moment (as i
> > > mentioned in the announcement), and the generic code itself contains
> > > some occasional x86-ishms as well. (None of the concepts are
> > > x86-specific though - multi-stack architectures should work just as
> > > well as RISC-ish CPUs.)
> >
> > It is rc1 - and crashes.
>
> yeah. I'm not surprised. The PDA is not set up in create_async_thread()
> for example.
Ok, I will roll back to vanilla 2.6.20 tomorrow.
> > > if you create a threadlet based test-webserver, could you please do
> > > a comparable kevents implementation as well? I.e. same HTTP parser
> > > (or non-parser, as usually the case is with prototypes ;). Best
> > > would be something that one could trigger between threadlet and
> > > kevent mode, using the same binary :-)
> >
> > Ok, I will create such a monster tomorrow :)
> >
> > I will use the same base for threadlet as for kevent/epoll - there is
> > no parser, just sendfile() of the static file which contains http
> > header and actual page.
> >
> > threadlet1 {
> > accept()
> > create threadlet2 {
> > send data
> > }
> > }
> >
> > Is above scheme correct for threadlet scenario?
>
> yep, this is a good first cut. Doing this after the listen() is useful:
>
> int one = 1;
>
> ret = setsockopt(listen_sock_fd, SOL_SOCKET, SO_REUSEADDR,
> (char *)&one, sizeof(int));
>
> and i'd suggest to do this after every accept()-ed socket:
>
> int flag = 1;
>
> setsockopt(sock_fd, IPPROTO_TCP, TCP_NODELAY,
> (char *) &flag, sizeof(int));
>
> Do you have any link where i could check the type of HTTP parsing and
> send transport you are (or will be) using? What type of http client are
> you using to measure, with precisely what options?
For example these ones (essentially the same, except that epoll and
kevent are used):
http://tservice.net.ru/~s0mbre/archive/kevent/evserver_kevent.c
http://tservice.net.ru/~s0mbre/archive/kevent/evserver_epoll.c
> > But note, that on my athlon64 3500 test machine kevent is about 7900
> > requests per second compared to 4000+ epoll, so expect a challenge.
>
> single-core CPU i suspect?
Yep.
> > lighttpd is about the same 4000 requests per second though, since it
> > can not be easily optimized for kevents.
>
> mean question: do you promise to post the results even if they are not
> unfavorable to threadlets? ;-)
If they are too good, I will start searching for bugs and tune my code
first, but eventually of course yes.
In my blog I will post them in 'real-time' even if kevent will
unbelievably suck.
> if i want to test kevents on a v2.6.20 kernel base, do you have an URL
> for me to try?
I have a git tree at (based on rc1 as requested by Andrew Morton):
http://tservice.net.ru/~s0mbre/archive/kevent/kevent.git/
Or patches at kevent homepage:
http://tservice.net.ru/~s0mbre/old/?section=projects&item=kevent
Direct link to the latest patchset:
http://tservice.net.ru/~s0mbre/archive/kevent/kevent-37/
(order is insignificant as far as I recall, except 'compile-fix',
which must be applied last).
> Ingo
--
Evgeniy Polyakov
* Evgeniy Polyakov <[email protected]> wrote:
> > Do you have any link where i could check the type of HTTP parsing
> > and send transport you are (or will be) using? What type of http
> > client are you using to measure, with precisely what options?
>
> For example this ones (essentially the same, except that epoll and
> kevent are used):
> http://tservice.net.ru/~s0mbre/archive/kevent/evserver_kevent.c
> http://tservice.net.ru/~s0mbre/archive/kevent/evserver_epoll.c
thx - i guess i should just run them without any options and they bind
themselves to port 80? What 'ab' options are you using typically to
measure them?
Ingo
* Evgeniy Polyakov <[email protected]> wrote:
> > > Do you have any link where i could check the type of HTTP parsing
> > > and send transport you are (or will be) using? What type of http
> > > client are you using to measure, with precisely what options?
> >
> > For example this ones (essentially the same, except that epoll and
> > kevent are used):
> > http://tservice.net.ru/~s0mbre/archive/kevent/evserver_kevent.c
> > http://tservice.net.ru/~s0mbre/archive/kevent/evserver_epoll.c
>
> Client is 'ab' with high number of connections and concurrency. For
> example for athlon64 3500 I used concurrency of 8000 connections and
> 80k total.
ok. So it's:
ab -c 8000 -n 80000 http://yourserver/tmp/index.html
right? How large is index.html typically - the default 40960 bytes, and
with a constructed HTTP reply header already included in the file?
Ingo
On Sun, Feb 25, 2007 at 09:21:35PM +0300, Evgeniy Polyakov ([email protected]) wrote:
> > Do you have any link where i could check the type of HTTP parsing and
> > send transport you are (or will be) using? What type of http client are
> > you using to measure, with precisely what options?
>
> For example this ones (essentially the same, except that epoll and
> kevent are used):
> http://tservice.net.ru/~s0mbre/archive/kevent/evserver_kevent.c
> http://tservice.net.ru/~s0mbre/archive/kevent/evserver_epoll.c
Client is 'ab' with high number of connections and concurrency.
For example for athlon64 3500 I used concurrency of 8000 connections
and 80k total.
--
Evgeniy Polyakov
On Sun, Feb 25, 2007 at 07:22:30PM +0100, Ingo Molnar ([email protected]) wrote:
> > > Do you have any link where i could check the type of HTTP parsing
> > > and send transport you are (or will be) using? What type of http
> > > client are you using to measure, with precisely what options?
> >
> > For example this ones (essentially the same, except that epoll and
> > kevent are used):
> > http://tservice.net.ru/~s0mbre/archive/kevent/evserver_kevent.c
> > http://tservice.net.ru/~s0mbre/archive/kevent/evserver_epoll.c
>
> thx - i guess i should just run them without any options and they bind
> themselves to port 80? What 'ab' options are you using typically to
> measure them?
Yes, but they require /tmp/index.html to have http header and actual
data page. They do not parse http request :)
For athlon 3500 I used
ab -c8000 -n80000 $url
for via epia likely two/three times less.
> Ingo
--
Evgeniy Polyakov
* Evgeniy Polyakov <[email protected]> wrote:
> > thx - i guess i should just run them without any options and they
> > bind themselves to port 80? What 'ab' options are you using
> > typically to measure them?
>
> Yes, but they require /tmp/index.html to have http header and actual
> data page. They do not parse http request :)
ok. When i connect to the epoll server via "telnet myserver 80", and
enter a 'request', i get back the content - but the socket connection is
not closed. Every time i type enter i get a new content back. Why is
that so - the code seems to contain a close(fd).
Ingo
Ingo Molnar wrote:
> if you create a threadlet based test-webserver, could you please do a
> comparable kevents implementation as well? I.e. same HTTP parser (or
> non-parser, as usually the case is with prototypes ;). Best would be
> something that one could trigger between threadlet and kevent mode,
> using the same binary :-)
Now, why would you want this?
Is there some performance issue with separately loaded binaries?
Thanks!
--
Al
Ingo Molnar wrote:
> now look at kevents as the queueing model. It does not queue 'tasks', it
> lets user-space queue requests in essence, in various states. But it's
> still the same conceptual thing: a memory buffer with some state
> associated to it. Yes, it has no legacies, it has no priorities and
> other queueing concepts attached to it ... yet. If kevents got
> mainstream, it would get the same kind of pressure to grow 'more
> advanced' event queueing and event scheduling capabilities.
> Prioritization would be needed, etc.
But it would probably be tuned specifically to its use case, which would mean
inherently better performance.
> So my fundamental claim is: a kernel thread /is/ our main request
> structure. We've got tons of really good system calls that queue these
> 'requests' around the place and offer functionality around this concept.
> Plus there's a 1.2+ billion lines of Linux userspace code that works
> well with this abstraction - while there's nary a few thousand lines of
> event-based user-space code.
Think of the kernel scheduler as a default fallback scheduler, for procs that
are randomly queued. Anytime you can identify a group of procs/threads that
behave in a similar way, it's almost always best to do specific/private
scheduling, for performance reasons.
> I also say that you'll likely get kevents outperform threadlets. Maybe
> even significantly so under the right conditions. But i very much
> believe we want to get similar kind of performance out of thread/task
> scheduling, and not introduce a parallel framework to do request
> scheduling the hard way ... just because our task concept and scheduling
> implementation got too fat. For the same reason i didnt really like
> fibrils: they are nice, and Zach's core idea i think nicely survived in
> the syslet/threadlet model too, but they are more limited than true
> threads. So doing that parallel infrastructure, which really just
> implements the same, and is only faster because it skips features, would
> just be hiding the problem with our primary abstraction. Ok?
Ok. But what you are proposing here is a dynamically pluggable scheduler that
is extensible on top of that.
Sounds Great!
Thanks!
--
Al
* Evgeniy Polyakov <[email protected]> wrote:
> Kevent is a _very_ small entity and there is _no_ cost of requeueing
> (well, there is list_add guarded by lock) - after it is done, process
> can start real work. With rescheduling there are _too_ many things to
> be done before we can start new work. [...]
actually, no. For example a wakeup too is fundamentally a list_add
guarded by a lock. Take a look at try_to_wake_up(). The rest you see
there is just extra frills that relate to things like 'load-balancing
the requests over multiple CPUs [which i'm sure kevent users would
request in the future too]'.
> [...] We have to change registers, change address space, various tlb
> bits and so on - we have to do it, since task describes very heavy
> entity - the whole process. [...]
but ... 'threadlets' are called thread-lets because they are not full
processes, they are threads. There's no TLB state in that case. There's
indeed register state associated with them, and currently there can
certainly be quite a bit of overhead in a context switch - but not in
register saving. We do user-space register saving not in the scheduler
but upon /every system call/. Fundamentally a kernel thread is just its
EIP/ESP [on x86, similar on other architectures] - which can be
saved/restored in near zero time. All the rest is something we added for
good /work queueing/ reasons - and those same extras should either be
eliminated if they turn out to be not so good reasons after all, or they
will be wanted for kevents too eventually, once it matures as a work
queueing solution.
> I think it is _too_ heavy to have such a monster structure like
> task(thread/process) and related overhead just to do an IO.
i think you are really, really mistaken if you believe that the fact
that whole tasks/threads or processes can be 'monster structures',
somehow has any relevance to scheduling/task-queueing performance and
scalability. It does not matter how large a task's address space is -
scheduling only relates to the minimal context that is in the CPU. And
most of that context we save upon /every system call entry/, and restore
it upon every system call return. If it's so expensive to manipulate,
why can the Linux kernel do a full system call in ~150 cycles? That's
cheaper than the access latency to a single DRAM page.
for the same reason has it no relevance that the full kevent-based
webserver is a 'monster structure' - still a single request's basic
queueing operation is cheap. The same is true to tasks/threads.
Really, you dont even have to know or assume anything about the
scheduler, just lets do some elementary math here:
the reqs/sec your sendfile+kevent based webserver can do is 7900 per
sec. Lets assume you will write further great kevent code which will
optimize it further and it goes up to 10,100 reqs per sec (100 usecs per
request), ok? Then also try how many reschedules/sec can your Athlon64
3500 box do. My guess is: about a million per second (1 usec per
reschedule), perhaps a bit more.
Now lets assume that a threadlet based server would have to
context-switch for /every single/ request served. That's totally
over-estimating it, even with lots of slow clients, but lets assume it,
to judge the worst-case impact.
So if you had to schedule once per every request served, you'd have to
add 1 usec to your 100 usecs cost, making it 101 usecs. That would bring
your 10,100 requests per sec to 10,000 requests/sec, under a threadlet
model of operation. Put differently: it will cost you only 1% in
performance to schedule once for every request. Or lets assume the task
is totally cache-cold and you'd have to add 4 usecs for its scheduling -
that'd still only be 4%. So where is the fat?
Ingo
* Evgeniy Polyakov <[email protected]> wrote:
> > thx - i guess i should just run them without any options and they
> > bind themselves to port 80? What 'ab' options are you using
> > typically to measure them?
>
> Yes, but they require /tmp/index.html to have http header and actual
> data page. They do not parse http request :)
>
> For athlon 3500 I used
> ab -c8000 -n80000 $url
what do the header portions of your /tmp/index.html data page look like?
Ingo
On Sun, Feb 25, 2007 at 08:04:15PM +0100, Ingo Molnar ([email protected]) wrote:
>
> * Evgeniy Polyakov <[email protected]> wrote:
>
> > Kevent is a _very_ small entity and there is _no_ cost of requeueing
> > (well, there is list_add guarded by lock) - after it is done, process
> > can start real work. With rescheduling there are _too_ many things to
> > be done before we can start new work. [...]
>
> actually, no. For example a wakeup too is fundamentally a list_add
> guarded by a lock. Take a look at try_to_wake_up(). The rest you see
> there is just extra frills that relate to things like 'load-balancing
> the requests over multiple CPUs [which i'm sure kevent users would
> request in the future too]'.
wake_up() as a call is pretty simple and fast, but its result - it is
slow. I did not run rescheduling tests with kernel threads, but posix
threads (they do look like a kernel thread) have significant overhead
there. In the early development days of the M:N threading library I tested
rescheduling performance of the POSIX threads - I created a pool of
threads and 'sent' a message using futex wait/wake - such performance of
the userspace threading library (I tested erlang) was 10 times slower.
> > [...] We have to change registers, change address space, various tlb
> > bits and so on - we have to do it, since task describes very heavy
> > entity - the whole process. [...]
>
> but ... 'threadlets' are called thread-lets because they are not full
> processes, they are threads. There's no TLB state in that case. There's
> indeed register state associated with them, and currently there can
> certainly be quite a bit of overhead in a context switch - but not in
> register saving. We do user-space register saving not in the scheduler
> but upon /every system call/. Fundamentally a kernel thread is just its
> EIP/ESP [on x86, similar on other architectures] - which can be
> saved/restored in near zero time. All the rest is something we added for
> good /work queueing/ reasons - and those same extras should either be
> eliminated if they turn out to be not so good reasons after all, or they
> will be wanted for kevents too eventually, once it matures as a work
> queueing solution.
If things decrease performance noticeably, it is a bad thing, but it is
a matter of taste. Anyway, kevents are very small, threads are very big,
and both are the way they are exactly on purpose - threads serve for
processing of any generic code, kevents are used for event waiting - IO
is such an event, it does not require a lot of infrastructure to handle,
it only needs some simple bits, so it can be optimized to be extremely
fast, with huge infrastructure behind each IO (like in case when it is a
separated thread) it can not be done effectively.
> > I think it is _too_ heavy to have such a monster structure like
> > task(thread/process) and related overhead just to do an IO.
>
> i think you are really, really mistaken if you believe that the fact
> that whole tasks/threads or processes can be 'monster structures',
> somehow has any relevance to scheduling/task-queueing performance and
> scalability. It does not matter how large a task's address space is -
> scheduling only relates to the minimal context that is in the CPU. And
> most of that context we save upon /every system call entry/, and restore
> it upon every system call return. If it's so expensive to manipulate,
> why can the Linux kernel do a full system call in ~150 cycles? That's
> cheaper than the access latency to a single DRAM page.
I meant not its size, but the whole infrastructure, which surrounds
task. If it is that lightweight, why don't we have posix thread per IO?
One question is that mmap/allocation of the stack is too slow (and it is
very slow indeed, that is why glibc and M:N threading lib caches
allocated stacks), another one is kernel/userspace boundary crossing,
next one are tlb flushes, then copies.
Why userspace rescheduling is in order of tens times faster than
kernel/user?
> for the same reason has it no relevance that the full kevent-based
> webserver is a 'monster structure' - still a single request's basic
> queueing operation is cheap. The same is true to tasks/threads.
To move those tasks, too many steps must be done, and although each
one can be quite fast, the whole process of rescheduling in the case of
thousands of running threads creates too big an overhead per task, which
drops performance.
> Really, you dont even have to know or assume anything about the
> scheduler, just lets do some elementary math here:
>
> the reqs/sec your sendfile+kevent based webserver can do is 7900 per
> sec. Lets assume you will write further great kevent code which will
> optimize it further and it goes up to 10,100 reqs per sec (100 usecs per
> request), ok? Then also try how many reschedules/sec can your Athon64
> 3500 box do. My guess is: about a million per second (1 usec per
> reschedule), perhaps a bit more.
Let's calculate: disk bandwidth is about gigabytes per second (cached
case), so to transfer 10k file we need about 10 usec - 10% of it will be
spent in rescheduling (if there will be only one if any).
Network is an order of 10 slower (1gbit for example), but there are much
more blockings, so to transfer 10k we will have let's say 5 blocks, i.e.
5 reschedulings - another 5%, so we wasted 15% of our time in
rescheduling.
Event in turn is a 30 bytes copy (plus of course own overhead, but it is
still faster - it is faster just because ukevent size is smaller than
pt_regs :).
Interesting discussion, that will be very fun if kevent will lose badly
:)
> Now lets assume that a threadlet based server would have to
> context-switch for /every single/ request served. That's totally
> over-estimating it, even with lots of slow clients, but lets assume it,
> to judge the worst-case impact.
>
> So if you had to schedule once per every request served, you'd have to
> add 1 usec to your 100 usecs cost, making it 101 usecs. That would bring
> your 10,100 requests per sec to 10,000 requests/sec, under a threadlet
> model of operation. Put differently: it will cost you only 1% in
> performance to schedule once for every request. Or lets assume the task
> is totally cache-cold and you'd have to add 4 usecs for its scheduling -
> that'd still only be 4%. So where is the fat?
I need to move home or I will sleep on the street, otherwise I would have
already run a test and started to eat a hat (present me a red one like
Alan Cox had), or watched you do it :)
Give me several hours.
> Ingo
--
Evgeniy Polyakov
On Sun, Feb 25, 2007 at 07:34:38PM +0100, Ingo Molnar wrote:
>
> * Evgeniy Polyakov <[email protected]> wrote:
>
> > > thx - i guess i should just run them without any options and they
> > > bind themselves to port 80? What 'ab' options are you using
> > > typically to measure them?
> >
> > Yes, but they require /tmp/index.html to have http header and actual
> > data page. They do not parse http request :)
>
> ok. When i connect to the epoll server via "telnet mysever 80", and
> enter a 'request', i get back the content - but the socket connection is
> not closed. Every time i type enter i get a new content back. Why is
> that so - the code seems to contain a close(fd).
>
I'd say a close(s); is missing just before return 0; in
evtest_callback_client() ?
Regards,
Frederik
* Evgeniy Polyakov <[email protected]> wrote:
> Interesting discussion, that will be very fun if kevent will lose
> badly :)
with your keepalive test no way can it lose against 80,000 sync
threadlets - it's pretty much the worst-case thing for threadlets while
it's the best-case for kevents. Try a non-keepalive test perhaps?
Ingo
On Thu, 22 Feb 2007, Evgeniy Polyakov wrote:
>
> My tests show that with 4k connections per second (8k concurrency) more
> than 20k connections of 80k total block in tcp_sendmsg() over gigabit
> lan between quite fast machines.
Why do people *keep* taking this up as an issue?
Use select/poll/epoll/kevent/whatever for event mechanisms. STOP CLAIMING
that you'd use threadlets/syslets/aio for that. It's been pointed out over
and over and over again, and yet you continue to make the same mistake,
Evgeniy.
So please read that sentence ten times, and then don't continue to make
that same mistake. PLEASE.
Event mechanisms are *superior* for events. But they *suck* for things
that aren't events, but are actual code execution with random places that
can block. THE TWO THINGS ARE TOTALLY AND UTTERLY INDEPENDENT!
Examples of events:
- packet arrives
- timer happens
Examples of things that are *not* "events":
- filesystem lookup.
- page faults
So the basic point is: for events, you use an event-based thing. For code
execution, you use a thread-based thing. It's really that simple.
And yes, the two different things can usually be translated (at a very
high cost in complexity *and* performance) into each other, so people who
look at it as purely a theoretical exercise may think that "events" and
"code execution" are equivalent. That's a very very silly and stupid way
of looking at things in real life, though.
Yes, you can turn things that are better seen as threaded execution into
an event-based thing by turning it into a state machine. And usually that
is a TOTAL DISASTER, and the end result is fragile and impossible to
maintain.
And yes, you can often (more easily) turn an event-based mechanism into a
thread-based one, and usually the end result is a TOTAL DISASTER because
it doesn't scale very well, and while it may actually result in somewhat
simpler code, the overhead of managing ten thousand outstanding threads is
just too high, when you compare to managing just a list of ten thousand
outstanding events.
And yes, people have done both of those mistakes. Java, for example,
largely did the latter mistake ("we don't need anything like 'select',
because we'll just use threads for everything" - what a totally moronic
thing to do!)
So Evgeniy, threadlets/syslets/aio is *not* a replacement for event
queues. It's a TOTALLY DIFFERENT MECHANISM, and one that is hugely
superior to event queues for certain kinds of things. Anybody who thinks
they want to do pathname and inode lookup as a series of events is likely
a moron. It's really that simple.
In a complex server (say, a database), you'd use both. You'd probably use
events for doing the things you *already* use events for (whether it be
select/poll/epoll or whatever): probably things like the client network
connection handling.
But you'd *in*addition* use threadlets to be able to do the actual
database IO in a threaded manner, so that you can scale the things that
are not easily handled as events (usually because they have internal
kernel state that the user cannot even see, and *must*not* see because of
security issues).
So please. Stop this "kevents are better". The only thing you show by
trying to go down that avenue is that you don't understand the
*difference* between an event model and a thread model. They are both
perfectly fine models and they ARE NOT THE SAME! They aren't even mutually
incompatible - quite the reverse.
The thing people want to remove with threadlets is the internal overhead
of maintaining special-purpose code like aio_read() inside the kernel,
that doesn't even do all that people want it to do, and that really does
need a fair amount of internal complexity that we could hopefully do with
a more generic (and hopefully *simpler*) model.
Linus
On 2/25/07, Ingo Molnar <[email protected]> wrote:
> Fundamentally a kernel thread is just its
> EIP/ESP [on x86, similar on other architectures] - which can be
> saved/restored in near zero time.
That's because the kernel address space is identical in every
process's MMU context, so the MMU doesn't have to be touched _at_all_.
Also, the kernel very rarely touches FPU state, and even when it
does, the FXSAVE/FXRSTOR pair is highly optimized for the "save state
just long enough to move some memory around with XMM instructions"
case. (I know you know this; this is for the benefit of less
experienced readers.) If your threadlet model shares the FPU state
and TLS arena among all threadlets running on the same CPU, and
threadlets are scheduled in bursts belonging to the same process (and
preferably the same threadlet entrypoint), then you will get similarly
efficient userspace threadlet-to-threadlet transitions. If not, not.
> scheduling only relates to the minimal context that is in the CPU. And
> most of that context we save upon /every system call entry/, and restore
> it upon every system call return. If it's so expensive to manipulate,
> why can the Linux kernel do a full system call in ~150 cycles? That's
> cheaper than the access latency to a single DRAM page.
That would be the magic of shadow register files. When the software
does things that hardware expects it to do, everybody wins. When the
software tries to get clever based on micro-benchmarks, everybody
loses.
Cheers,
- Michael
* Davide Libenzi <[email protected]> wrote:
> Also, the evtest_kevent_remove call is superfluous with epoll.
it's only used in the error path AFAICS.
but you are right about evserver_epoll/kevent.c incorrectly assuming
that things wont block in evtest_callback_client(), which, after
receiving the "there's stuff on the input socket" event does:
recvmsg(sock),
fd = open();
sendfile(sock, fd)
close(fd);
while evserver_threadlet.c, even this naive implementation, does not
assume that we wont block in that function.
> In any case, comparing epoll/kevent with 100K active sessions, against
> threadlets, is not exactly a fair/appropriate test for it.
fully agreed.
Ingo
On Mon, Feb 26, 2007 at 09:16:56AM +0100, Ingo Molnar ([email protected]) wrote:
> > Also, the evtest_kevent_remove call is superfluous with epoll.
>
> it's only used in the error path AFAICS.
>
> but you are right about evserver_epoll/kevent.c incorrectly assuming
> that things wont block in evtest_callback_client(), which, after
> receiving the "there's stuff on the input socket" event does:
>
> recvmsg(sock),
> fd = open();
> sendfile(sock, fd)
> close(fd);
>
> while evserver_threadlet.c, even this naive implementation, does not
> assume that we wont block in that function.
>
> > In any case, comparing epoll/kevent with 100K active sessions, against
> > threadlets, is not exactly a fair/appropriate test for it.
>
> fully agreed.
Hi.
I will highlight several items in this mail:
1. evserver_epoll.c is broken in that regard, that it does not close a
socket - I tried to make it possible to work with keepalive, but failed.
So close of the socket is a must, like in evserver_kevent.c
2. keepalive is not supported - it is a hack server at the end.
3. this test does not assume that above snippet blocks or does not - it
is _typical_ case of the web server with one working thread (per cpu) -
every op can block, so there is no problem - threadlet will reschedule,
event based will block (bad for them).
4. benchmark does not cover all possible cases - initial goal of that
servers was to show how fast/slow _event_ generation/processing is in
epoll/kevent case, but not to create real-life web server.
lighttpd for example can not be used as a good bench too, since its arch
does not support some kevent extensions (which do not exist in epoll),
and, looking to the number of comments in kevent threads, I'm not motivated
to change it at all.
So, drawing a line, evserver_* is a simple event driven server, it does
have disadvantages, but the same approach only favours threadlet model.
Having millions or thousands of connections works against threadlets,
but we compare ultimate cases, it is a one of the possible tests.
So...
I'm cooking up a git tree with kevents and threadlets, which I will test
in via epia (1ghz, 256 mb of ram, 100mbit lan) and Intel(R) Core(TM)2 CPU
6600 @ 2.40GHz (2gb of ram, 1gbit lan) later today with
kevent/epoll/threadlet if wine will not end suddenly.
I will use Ingo's evserver_threadlet server along with evserver_epoll
(with fixed closing) and evserver_kevent.c.
Eventually I can move all of them into one file.
Client is 'ab' on my desktop core duo 3.7 ghz. Machines are connected
over gigabit dlink dgs-1216t switch (which freezes on slightly broken
dhcp and tftp packets btw).
Stay tuned.
P.S. Linus, if you do not mind, I will postpone scholastic masturbation
about event vs process context in IO. One sentence only - page fault and
filename lookup both wait - they wait until a new page is ready or an inode
is read from the disk, eventually they wake up, and the thing which wakes
them up _is_ an event.
> Ingo
--
Evgeniy Polyakov
* Evgeniy Polyakov <[email protected]> wrote:
> I will use Ingo's evserver_threadlet server as plong as evserver_epoll
> (with fixed closing) and evserver_kevent.c.
please also try evserver_epoll_threadlet.c that i've attached below - it
uses epoll as the main event mechanism but does threadlets for request
handling.
This is a one step more intelligent threadlet queueing model than
'thousands of threads' - although obviously epoll alone should do well
too with this trivial workload.
Ingo
---------------------------->
#include <sys/types.h>
#include <sys/socket.h>
#include <sys/resource.h>
#include <sys/wait.h>
#include <sys/ioctl.h>
#include <sys/stat.h>
#include <sys/time.h>
#include <sys/poll.h>
#include <sys/sendfile.h>
#include <sys/epoll.h>
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <arpa/inet.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <errno.h>
#include <string.h>
#include <fcntl.h>
#include <time.h>
#include <ctype.h>
#include <netdb.h>
#include <assert.h>
#define DEBUG 0
#include "syslet.h"
#include "sys.h"
#include "threadlet.h"
struct request {
struct request *next_free;
/*
* The threadlet stack is part of the request structure
* and is thus reused as threadlets complete:
*/
unsigned long threadlet_stack;
/*
* These are all the request-specific parameters:
*/
long sock;
};
/*
* Freelist to recycle requests:
*/
static struct request *freelist;
/*
* Allocate a request and set up its syslet atoms:
*/
static struct request *alloc_req(void)
{
struct request *req;
/*
* Occasionally we have to refill the new-thread stack
* entry:
*/
if (!async_head.new_thread_stack) {
async_head.new_thread_stack = thread_stack_alloc();
pr("allocated new thread stack: %08lx\n",
async_head.new_thread_stack);
}
if (freelist) {
req = freelist;
pr("reusing req %p, threadlet stack %08lx\n",
req, req->threadlet_stack);
freelist = freelist->next_free;
req->next_free = NULL;
return req;
}
req = calloc(1, sizeof(struct request));
if (!req)
return NULL;
pr("allocated req %p\n", req);
req->threadlet_stack = thread_stack_alloc();
pr("allocated thread stack %08lx\n", req->threadlet_stack);
return req;
}
/*
* Check whether there are any completions queued for user-space
* to finish up:
*/
static unsigned long complete(void)
{
unsigned long completed = 0;
struct request *req;
for (;;) {
req = (void *)completion_ring[async_head.user_ring_idx];
if (!req)
return completed;
completed++;
pr("completed req %p (threadlet stack %08lx)\n",
req, req->threadlet_stack);
req->next_free = freelist;
freelist = req;
/*
* Clear the completion pointer. To make sure the
* kernel never stomps upon still unhandled completions
* in the ring the kernel only writes to a NULL entry,
* so user-space has to clear it explicitly:
*/
completion_ring[async_head.user_ring_idx] = NULL;
async_head.user_ring_idx++;
if (async_head.user_ring_idx == MAX_PENDING)
async_head.user_ring_idx = 0;
}
}
static unsigned int pending_requests;
/*
* Handle a request that has just been submitted (either it has
* already been executed, or we have to account it as pending):
*/
static void handle_submitted_request(struct request *req, long done)
{
unsigned int nr;
if (done) {
/*
* This is the cached case - free the request:
*/
pr("cache completed req %p (threadlet stack %08lx)\n",
req, req->threadlet_stack);
req->next_free = freelist;
freelist = req;
return;
}
/*
* 'cachemiss' case - the syslet is not finished
* yet. We will be notified about its completion
* via the completion ring:
*/
assert(pending_requests < MAX_PENDING-1);
pending_requests++;
pr("req %p is pending. %d reqs pending.\n", req, pending_requests);
/*
* Attempt to complete requests - this is a fast
* check if there's no completions:
*/
nr = complete();
pending_requests -= nr;
/*
* If the ring is full then wait a bit:
*/
while (pending_requests == MAX_PENDING-1) {
pr("sys_async_wait()");
/*
* Wait for 4 events - to batch things a bit:
*/
sys_async_wait(4, async_head.user_ring_idx, &async_head);
nr = complete();
pending_requests -= nr;
pr("after wait: completed %d requests - still pending: %d\n",
nr, pending_requests);
}
}
#include <linux/types.h>
//#define ulog(f, a...) fprintf(stderr, f, ##a)
#define ulog(f, a...)
#define ulog_err(f, a...) printf(f ": %s [%d].\n", ##a, strerror(errno), errno)
static int kevent_ctl_fd, main_server_s;
static void usage(char *p)
{
ulog("Usage: %s -a addr -p port -f kevent_path -t timeout -w wait_num\n", p);
}
static int evtest_server_init(char *addr, unsigned short port)
{
struct hostent *h;
int s, on;
struct sockaddr_in sa;
if (!addr) {
ulog("%s: Bind address cannot be NULL.\n", __func__);
return -1;
}
h = gethostbyname(addr);
if (!h) {
ulog_err("%s: Failed to get address of %s.\n", __func__, addr);
return -1;
}
s = socket(AF_INET, SOCK_STREAM, IPPROTO_TCP);
if (s == -1) {
ulog_err("%s: Failed to create server socket", __func__);
return -1;
}
fcntl(s, F_SETFL, O_NONBLOCK);
memcpy(&(sa.sin_addr.s_addr), h->h_addr_list[0], 4);
sa.sin_port = htons(port);
sa.sin_family = AF_INET;
on = 1;
setsockopt(s, SOL_SOCKET, SO_REUSEADDR, &on, sizeof(on));
if (bind(s, (struct sockaddr *)&sa, sizeof(struct sockaddr_in)) == -1) {
ulog_err("%s: Failed to bind to %s", __func__, addr);
close(s);
return -1;
}
if (listen(s, 30000) == -1) {
ulog_err("%s: Failed to listen on %s", __func__, addr);
close(s);
return -1;
}
return s;
}
static int evtest_kevent_remove(int fd)
{
int err;
struct epoll_event event;
event.events = EPOLLIN | EPOLLET;
event.data.fd = fd;
err = epoll_ctl(kevent_ctl_fd, EPOLL_CTL_DEL, fd, &event);
if (err < 0) {
ulog_err("Failed to perform control REMOVE operation");
return err;
}
return err;
}
static int evtest_kevent_init(int fd)
{
int err;
struct timeval tm;
struct epoll_event event;
event.events = EPOLLIN | EPOLLET;
event.data.fd = fd;
err = epoll_ctl(kevent_ctl_fd, EPOLL_CTL_ADD, fd, &event);
gettimeofday(&tm, NULL);
ulog("%08lu:%06lu: fd=%3d, err=%1d.\n", tm.tv_sec, tm.tv_usec, fd, err);
if (err < 0) {
ulog_err("Failed to perform control ADD operation: fd=%d, events=%08x", fd, event.events);
return err;
}
return err;
}
static long handle_request(void *__req)
{
struct request *req = __req;
int s = req->sock, err, fd;
off_t offset;
int count;
char path[] = "/tmp/index.html";
char buf[4096];
struct timeval tm;
count = 40960;
offset = 0;
err = recv(s, buf, sizeof(buf), 0);
if (err < 0) {
ulog_err("Failed to read data from s=%d", s);
goto err_out_remove;
}
if (err == 0) {
gettimeofday(&tm, NULL);
ulog("%08lu:%06lu: Client exited: fd=%d.\n", tm.tv_sec, tm.tv_usec, s);
goto err_out_remove;
}
fd = open(path, O_RDONLY);
if (fd == -1) {
ulog_err("Failed to open '%s'", path);
err = -1;
goto err_out_remove;
}
#if 0
do {
err = read(fd, buf, sizeof(buf));
if (err <= 0)
break;
err = send(s, buf, err, 0);
if (err <= 0)
break;
} while (1);
#endif
err = sendfile(s, fd, &offset, count);
{
int on = 0;
setsockopt(s, SOL_TCP, TCP_CORK, &on, sizeof(on));
}
close(fd);
if (err < 0) {
ulog_err("Failed send %d bytes: fd=%d.\n", count, s);
goto err_out_remove;
}
gettimeofday(&tm, NULL);
ulog("%08lu:%06lu: %d bytes has been sent to client fd=%d.\n", tm.tv_sec, tm.tv_usec, err, s);
close(s);
return complete_threadlet_fn(req, &async_head);
err_out_remove:
evtest_kevent_remove(s);
close(s);
return complete_threadlet_fn(req, &async_head);
}
static int evtest_callback_client(int sock)
{
struct request *req;
long done;
req = alloc_req();
if (!req) {
printf("no req\n");
evtest_kevent_remove(sock);
return -ENOMEM;
}
req->sock = sock;
done = threadlet_exec(handle_request, req,
req->threadlet_stack, &async_head);
handle_submitted_request(req, done);
return 0;
}
static int evtest_callback_main(int s)
{
int cs, err;
struct sockaddr_in csa;
socklen_t addrlen = sizeof(struct sockaddr_in);
struct timeval tm;
memset(&csa, 0, sizeof(csa));
if ((cs = accept(s, (struct sockaddr *)&csa, &addrlen)) == -1) {
ulog_err("Failed to accept client");
return -1;
}
fcntl(cs, F_SETFL, O_NONBLOCK);
gettimeofday(&tm, NULL);
ulog("%08lu:%06lu: Accepted connect from %s:%d.\n",
tm.tv_sec, tm.tv_usec,
inet_ntoa(csa.sin_addr), ntohs(csa.sin_port));
err = evtest_kevent_init(cs);
if (err < 0) {
close(cs);
return -1;
}
return 0;
}
static int evtest_kevent_wait(unsigned int timeout, unsigned int wait_num)
{
int num, err;
struct timeval tm;
struct epoll_event event[256];
int i;
err = epoll_wait(kevent_ctl_fd, event, 256, -1);
if (err < 0) {
ulog_err("Failed to perform control operation");
return err;
}
gettimeofday(&tm, NULL);
num = err;
ulog("%08lu.%06lu: Wait: num=%d.\n", tm.tv_sec, tm.tv_usec, num);
for (i=0; i<num; ++i) {
if (event[i].data.fd == main_server_s)
err = evtest_callback_main(event[i].data.fd);
else
err = evtest_callback_client(event[i].data.fd);
}
return err;
}
int main(int argc, char *argv[])
{
int ch, err;
char *addr;
unsigned short port;
unsigned int timeout, wait_num;
addr = "0.0.0.0";
port = 8080;
timeout = 1000;
wait_num = 1;
async_head_init();
while ((ch = getopt(argc, argv, "f:n:t:a:p:h")) > 0) {
switch (ch) {
case 't':
timeout = atoi(optarg);
break;
case 'n':
wait_num = atoi(optarg);
break;
case 'a':
addr = optarg;
break;
case 'p':
port = atoi(optarg);
break;
case 'f':
break;
default:
usage(argv[0]);
return -1;
}
}
kevent_ctl_fd = epoll_create(10);
if (kevent_ctl_fd == -1) {
ulog_err("Failed to create epoll descriptor");
return -1;
}
main_server_s = evtest_server_init(addr, port);
if (main_server_s < 0)
return main_server_s;
err = evtest_kevent_init(main_server_s);
if (err < 0)
goto err_out_exit;
while (1) {
err = evtest_kevent_wait(timeout, wait_num);
}
err_out_exit:
close(kevent_ctl_fd);
async_head_exit();
return 0;
}
On Mon, Feb 26, 2007 at 10:55:47AM +0100, Ingo Molnar wrote:
> > I will use Ingo's evserver_threadlet server as well as evserver_epoll
> > (with fixed closing) and evserver_kevent.c.
>
> please also try evserver_epoll_threadlet.c that i've attached below - it
> uses epoll as the main event mechanism but does threadlets for request
> handling.
>
> This is a one step more intelligent threadlet queueing model than
> 'thousands of threads' - although obviously epoll alone should do well
> too with this trivial workload.
No problem.
If I complete the setup today before I go climbing (I need to do some
paid work too), I will post the results here and on my blog (without
political correctness).
Btw, 'evserver' in the name means 'event server', so you might think
about changing the name :)
Stay tuned.
> Ingo
--
Evgeniy Polyakov
* Ingo Molnar wrote:
> please also try evserver_epoll_threadlet.c that i've attached below -
> it uses epoll as the main event mechanism but does threadlets for
> request handling.
find updated code below - your evserver_epoll.c spuriously missed event
edges - so i changed it back to level-triggered. While that is not as
fast as edge-triggered, it does not result in spurious hangs and
workflow 'hiccups' during the test.
Could this be the reason why in your testing kevents outperformed epoll?
Also, i have removed the set-nonblocking calls because they are not
needed under threadlets.
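To illustrate the level-triggered vs. edge-triggered difference mentioned
above, here is a minimal sketch that is independent of the evserver code.
It assumes a Linux system; ready_reports() is a hypothetical helper and not
part of async-test. It registers the read end of a pipe, writes one byte and
polls twice without draining the pipe. Level-triggered mode reports the
descriptor on both polls, while EPOLLET reports only the first edge, so a
handler that does not fully drain a descriptor can stall on it. That is one
way 'missed edges' of this kind can show up.

```c
#include <stdint.h>
#include <string.h>
#include <sys/epoll.h>
#include <unistd.h>

/*
 * Hypothetical helper (illustration only): returns how many of two
 * back-to-back, zero-timeout epoll_wait() calls report the pipe as
 * readable when the pending byte is never read. Returns -1 on error.
 * Pass 0 for level-triggered or EPOLLET for edge-triggered mode.
 */
static int ready_reports(uint32_t extra_flags)
{
	struct epoll_event ev, got;
	int pfd[2], ep, i, n, reports = 0;

	if (pipe(pfd) < 0)
		return -1;
	ep = epoll_create1(0);
	if (ep < 0) {
		close(pfd[0]);
		close(pfd[1]);
		return -1;
	}

	memset(&ev, 0, sizeof(ev));
	ev.events = EPOLLIN | extra_flags;
	ev.data.fd = pfd[0];

	/* Register first, then make the descriptor readable. */
	if (epoll_ctl(ep, EPOLL_CTL_ADD, pfd[0], &ev) < 0 ||
	    write(pfd[1], "x", 1) != 1) {
		reports = -1;
		goto done;
	}

	/* Poll twice without consuming the byte. */
	for (i = 0; i < 2; i++) {
		n = epoll_wait(ep, &got, 1, 0);
		if (n < 0) {
			reports = -1;
			break;
		}
		reports += n;
	}
done:
	close(ep);
	close(pfd[0]);
	close(pfd[1]);
	return reports;
}
```

Calling ready_reports(0) counts both polls, whereas ready_reports(EPOLLET)
counts only the first one, because no new edge arrives after it.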
[ to build this code, copy it into the async-test/ directory and build
it there - or copy the *.h files from async-test/ directory into your
build directory. ]
Ingo
-------{ evserver_epoll_threadlet.c }-------------->
#include <sys/types.h>
#include <sys/socket.h>
#include <sys/resource.h>
#include <sys/wait.h>
#include <sys/ioctl.h>
#include <sys/stat.h>
#include <sys/time.h>
#include <sys/poll.h>
#include <sys/sendfile.h>
#include <sys/epoll.h>
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <arpa/inet.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <errno.h>
#include <string.h>
#include <fcntl.h>
#include <time.h>
#include <ctype.h>
#include <netdb.h>
#include <assert.h>
#define DEBUG 0
#include "syslet.h"
#include "sys.h"
#include "threadlet.h"
struct request {
struct request *next_free;
/*
* The threadlet stack is part of the request structure
* and is thus reused as threadlets complete:
*/
unsigned long threadlet_stack;
/*
* These are all the request-specific parameters:
*/
long sock;
};
/*
* Freelist to recycle requests:
*/
static struct request *freelist;
/*
* Allocate a request and set up its syslet atoms:
*/
static struct request *alloc_req(void)
{
struct request *req;
/*
* Occasionally we have to refill the new-thread stack
* entry:
*/
if (!async_head.new_thread_stack) {
async_head.new_thread_stack = thread_stack_alloc();
pr("allocated new thread stack: %08lx\n",
async_head.new_thread_stack);
}
if (freelist) {
req = freelist;
pr("reusing req %p, threadlet stack %08lx\n",
req, req->threadlet_stack);
freelist = freelist->next_free;
req->next_free = NULL;
return req;
}
req = calloc(1, sizeof(struct request));
pr("allocated req %p\n", req);
req->threadlet_stack = thread_stack_alloc();
pr("allocated thread stack %08lx\n", req->threadlet_stack);
return req;
}
/*
* Check whether there are any completions queued for user-space
* to finish up:
*/
static unsigned long complete(void)
{
unsigned long completed = 0;
struct request *req;
for (;;) {
req = (void *)completion_ring[async_head.user_ring_idx];
if (!req)
return completed;
completed++;
pr("completed req %p (threadlet stack %08lx)\n",
req, req->threadlet_stack);
req->next_free = freelist;
freelist = req;
/*
* Clear the completion pointer. To make sure the
* kernel never stomps upon still unhandled completions
* in the ring the kernel only writes to a NULL entry,
* so user-space has to clear it explicitly:
*/
completion_ring[async_head.user_ring_idx] = NULL;
async_head.user_ring_idx++;
if (async_head.user_ring_idx == MAX_PENDING)
async_head.user_ring_idx = 0;
}
}
static unsigned int pending_requests;
/*
* Handle a request that has just been submitted (either it has
* already been executed, or we have to account it as pending):
*/
static void handle_submitted_request(struct request *req, long done)
{
unsigned int nr;
if (done) {
/*
* This is the cached case - free the request:
*/
pr("cache completed req %p (threadlet stack %08lx)\n",
req, req->threadlet_stack);
req->next_free = freelist;
freelist = req;
return;
}
/*
* 'cachemiss' case - the syslet is not finished
* yet. We will be notified about its completion
* via the completion ring:
*/
assert(pending_requests < MAX_PENDING-1);
pending_requests++;
pr("req %p is pending. %d reqs pending.\n", req, pending_requests);
/*
* Attempt to complete requests - this is a fast
* check if there's no completions:
*/
nr = complete();
pending_requests -= nr;
/*
* If the ring is full then wait a bit:
*/
while (pending_requests == MAX_PENDING-1) {
pr("sys_async_wait()");
/*
* Wait for 4 events - to batch things a bit:
*/
sys_async_wait(4, async_head.user_ring_idx, &async_head);
nr = complete();
pending_requests -= nr;
pr("after wait: completed %d requests - still pending: %d\n",
nr, pending_requests);
}
}
#include <linux/types.h>
//#define ulog(f, a...) fprintf(stderr, f, ##a)
#define ulog(f, a...)
#define ulog_err(f, a...) printf(f ": %s [%d].\n", ##a, strerror(errno), errno)
static int kevent_ctl_fd, main_server_s;
static void usage(char *p)
{
fprintf(stderr, "Usage: %s -a addr -p port -f kevent_path -t timeout -n wait_num\n", p);
}
static int evtest_server_init(char *addr, unsigned short port)
{
struct hostent *h;
int s, on;
struct sockaddr_in sa;
if (!addr) {
ulog("%s: Bind address cannot be NULL.\n", __func__);
return -1;
}
h = gethostbyname(addr);
if (!h) {
ulog_err("%s: Failed to get address of %s", __func__, addr);
return -1;
}
s = socket(AF_INET, SOCK_STREAM, IPPROTO_TCP);
if (s == -1) {
ulog_err("%s: Failed to create server socket", __func__);
return -1;
}
// fcntl(s, F_SETFL, O_NONBLOCK);
memcpy(&(sa.sin_addr.s_addr), h->h_addr_list[0], 4);
sa.sin_port = htons(port);
sa.sin_family = AF_INET;
on = 1;
setsockopt(s, SOL_SOCKET, SO_REUSEADDR, &on, sizeof(on));
if (bind(s, (struct sockaddr *)&sa, sizeof(struct sockaddr_in)) == -1) {
ulog_err("%s: Failed to bind to %s", __func__, addr);
close(s);
return -1;
}
if (listen(s, 30000) == -1) {
ulog_err("%s: Failed to listen on %s", __func__, addr);
close(s);
return -1;
}
return s;
}
#define EPOLL_EVENT_MASK (EPOLLIN | EPOLLERR | EPOLLPRI)
static int evtest_kevent_remove(int fd)
{
int err;
struct epoll_event event;
event.events = EPOLL_EVENT_MASK;
event.data.fd = fd;
err = epoll_ctl(kevent_ctl_fd, EPOLL_CTL_DEL, fd, &event);
if (err < 0) {
ulog_err("Failed to perform control REMOVE operation");
return err;
}
return err;
}
static int evtest_kevent_init(int fd)
{
int err;
struct timeval tm;
struct epoll_event event;
event.events = EPOLL_EVENT_MASK;
event.data.fd = fd;
err = epoll_ctl(kevent_ctl_fd, EPOLL_CTL_ADD, fd, &event);
gettimeofday(&tm, NULL);
ulog("%08lu:%06lu: fd=%3d, err=%1d.\n", tm.tv_sec, tm.tv_usec, fd, err);
if (err < 0) {
ulog_err("Failed to perform control ADD operation: fd=%d, events=%08x", fd, event.events);
return err;
}
return err;
}
#define MAX_FILES 1000000
/*
* Debug check:
*/
static struct request *fd_to_req[MAX_FILES];
static long handle_request(void *__req)
{
struct request *req = __req;
int s = req->sock, err, fd;
off_t offset;
int count;
char path[] = "/tmp/index.html";
char buf[4096];
struct timeval tm;
if (!fd_to_req[s])
ulog_err("Bad: no request to fd?");
count = 40960;
offset = 0;
err = recv(s, buf, sizeof(buf), 0);
if (err < 0) {
ulog_err("Failed to read data from s=%d", s);
goto err_out_remove;
}
if (err == 0) {
gettimeofday(&tm, NULL);
ulog("%08lu:%06lu: Client exited: fd=%d.\n", tm.tv_sec, tm.tv_usec, s);
goto err_out_remove;
}
fd = open(path, O_RDONLY);
if (fd == -1) {
ulog_err("Failed to open '%s'", path);
err = -1;
goto err_out_remove;
}
#if 0
do {
err = read(fd, buf, sizeof(buf));
if (err <= 0)
break;
err = send(s, buf, err, 0);
if (err <= 0)
break;
} while (1);
#endif
err = sendfile(s, fd, &offset, count);
{
int on = 0;
setsockopt(s, SOL_TCP, TCP_CORK, &on, sizeof(on));
}
close(fd);
if (err < 0) {
ulog_err("Failed to send %d bytes: fd=%d", count, s);
goto err_out_remove;
}
gettimeofday(&tm, NULL);
ulog("%08lu:%06lu: %d bytes has been sent to client fd=%d.\n", tm.tv_sec, tm.tv_usec, err, s);
close(s);
fd_to_req[s] = NULL;
return complete_threadlet_fn(req, &async_head);
err_out_remove:
evtest_kevent_remove(s);
close(s);
fd_to_req[s] = NULL;
return complete_threadlet_fn(req, &async_head);
}
static int evtest_callback_client(int sock)
{
struct request *req;
long done;
if (fd_to_req[sock]) {
ulog_err("Bad: request overlap?");
return 0;
}
req = alloc_req();
if (!req) {
ulog_err("Bad: no req");
evtest_kevent_remove(sock);
return -ENOMEM;
}
req->sock = sock;
fd_to_req[sock] = req;
done = threadlet_exec(handle_request, req,
req->threadlet_stack, &async_head);
handle_submitted_request(req, done);
return 0;
}
static int evtest_callback_main(int s)
{
int cs, err;
struct sockaddr_in csa;
socklen_t addrlen = sizeof(struct sockaddr_in);
struct timeval tm;
memset(&csa, 0, sizeof(csa));
if ((cs = accept(s, (struct sockaddr *)&csa, &addrlen)) == -1) {
ulog_err("Failed to accept client");
return -1;
}
// fcntl(cs, F_SETFL, O_NONBLOCK);
gettimeofday(&tm, NULL);
ulog("%08lu:%06lu: Accepted connect from %s:%d.\n",
tm.tv_sec, tm.tv_usec,
inet_ntoa(csa.sin_addr), ntohs(csa.sin_port));
err = evtest_kevent_init(cs);
if (err < 0) {
close(cs);
return -1;
}
return 0;
}
static int evtest_kevent_wait(unsigned int timeout, unsigned int wait_num)
{
int num, err;
struct timeval tm;
struct epoll_event event[256];
int i;
err = epoll_wait(kevent_ctl_fd, event, 256, -1);
if (err < 0) {
ulog_err("Failed to perform control operation");
return err;
}
gettimeofday(&tm, NULL);
num = err;
ulog("%08lu.%06lu: Wait: num=%d.\n", tm.tv_sec, tm.tv_usec, num);
for (i=0; i<num; ++i) {
if (event[i].data.fd == main_server_s)
err = evtest_callback_main(event[i].data.fd);
else
err = evtest_callback_client(event[i].data.fd);
}
return err;
}
int main(int argc, char *argv[])
{
int ch, err;
char *addr;
unsigned short port;
unsigned int timeout, wait_num;
addr = "0.0.0.0";
port = 8080;
timeout = 1000;
wait_num = 1;
async_head_init();
while ((ch = getopt(argc, argv, "f:n:t:a:p:h")) > 0) {
switch (ch) {
case 't':
timeout = atoi(optarg);
break;
case 'n':
wait_num = atoi(optarg);
break;
case 'a':
addr = optarg;
break;
case 'p':
port = atoi(optarg);
break;
case 'f':
break;
default:
usage(argv[0]);
return -1;
}
}
kevent_ctl_fd = epoll_create(10);
if (kevent_ctl_fd == -1) {
ulog_err("Failed to create epoll descriptor");
return -1;
}
main_server_s = evtest_server_init(addr, port);
if (main_server_s < 0)
return main_server_s;
err = evtest_kevent_init(main_server_s);
if (err < 0)
goto err_out_exit;
while (1) {
err = evtest_kevent_wait(timeout, wait_num);
}
err_out_exit:
close(kevent_ctl_fd);
async_head_exit();
return 0;
}
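The completion-ring convention that complete() relies on in the listing
above (the producer only writes into a NULL slot, and the consumer clears
each slot after handling it) can be sketched in isolation. This is a
single-threaded user-space model under stated assumptions, not the kernel
implementation: struct ring, ring_produce(), ring_drain() and RING_SIZE are
hypothetical names, with RING_SIZE standing in for MAX_PENDING and
consumer_idx for async_head.user_ring_idx.

```c
#include <stddef.h>

#define RING_SIZE 8	/* hypothetical; plays the role of MAX_PENDING */

struct ring {
	void *slot[RING_SIZE];
	unsigned int producer_idx;	/* stand-in for the kernel-side index */
	unsigned int consumer_idx;	/* plays the role of user_ring_idx */
};

/*
 * Producer side: only ever writes into an empty (NULL) slot, so it can
 * never overwrite a completion the consumer has not handled yet.
 * Returns 0 on success, -1 if the next slot is still occupied.
 */
static int ring_produce(struct ring *r, void *item)
{
	if (r->slot[r->producer_idx] != NULL)
		return -1;
	r->slot[r->producer_idx] = item;
	r->producer_idx = (r->producer_idx + 1) % RING_SIZE;
	return 0;
}

/*
 * Consumer side: handle entries until a NULL slot is found, clearing
 * each slot explicitly so the producer may reuse it.
 * Returns the number of entries consumed.
 */
static unsigned long ring_drain(struct ring *r)
{
	unsigned long completed = 0;

	while (r->slot[r->consumer_idx] != NULL) {
		r->slot[r->consumer_idx] = NULL;
		r->consumer_idx = (r->consumer_idx + 1) % RING_SIZE;
		completed++;
	}
	return completed;
}
```

Because a NULL slot doubles as the 'empty' marker, the two sides need no
shared counter. A real implementation would additionally need the memory
ordering guarantees that this single-threaded model leaves out.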
* Evgeniy Polyakov wrote:
> Btw, 'evserver' in the name means 'event server', so you might think
> about changing the name :)
why should i change the name? The 'outer' loop, which feeds requests to
threadlets, is an epoll based event loop. The inner loop, where all the
application complexity resides, is a threadlet. This is the "more
intelligent queueing" model i talked about in my reply to David 4 days
ago:
http://lkml.org/lkml/2007/2/22/180
http://lkml.org/lkml/2007/2/22/191
Ingo
On Mon, Feb 26, 2007 at 11:31:17AM +0100, Ingo Molnar wrote:
>
> * Ingo Molnar wrote:
>
> > please also try evserver_epoll_threadlet.c that i've attached below -
> > it uses epoll as the main event mechanism but does threadlets for
> > request handling.
>
> find updated code below - your evserver_epoll.c spuriously missed event
> edges - so i changed it back to level-triggered. While that is not as
> fast as edge-triggered, it does not result in spurious hangs and
> workflow 'hiccups' during the test.
Hmm, exactly the same evserver_epoll.c you downloaded works ok for me,
although yes, it is buggy in that it does not close the socket once the
data has been transferred.
> Could this be the reason why in your testing kevents outperformed epoll?
I will try to check. In theory it should perform much worse without _ET,
but in practice its performance is essentially the same (the same applies
to kevent without the KEVENT_REQ_ET flag - since the same socket is almost
never used several times, there is essentially zero overhead whether or
not that flag is set).
> Also, i have removed the set-nonblocking calls because they are not
> needed under threadlets.
>
> [ to build this code, copy it into the async-test/ directory and build
> it there - or copy the *.h files from async-test/ directory into your
> build directory. ]
Ok, right now I'm compiling kevent/threadlet tree on my test machines.
> Ingo
--
Evgeniy Polyakov
On Mon, Feb 26, 2007 at 11:35:17AM +0100, Ingo Molnar wrote:
>
> * Evgeniy Polyakov wrote:
>
> > Btw, 'evserver' in the name means 'event server', so you might think
> > about changing the name :)
>
> why should i change the name? The 'outer' loop, which feeds requests to
> threadlets, is an epoll based event loop. The inner loop, where all the
> application complexity resides, is a threadlet. This is the "more
> intelligent queueing" model i talked about in my reply to David 4 days
> ago:
>
> http://lkml.org/lkml/2007/2/22/180
> http://lkml.org/lkml/2007/2/22/191
:)
Ingo, of course it was a joke.
Even having main dispatcher as epoll/kevent loop, the _whole_ threadlet
model is absolutely micro-thread in nature and not state machine/event.
So it does not have events at all, especially with speculations about
removing completion notifications - fire and forget model.
> Ingo
--
Evgeniy Polyakov
* Evgeniy Polyakov wrote:
> > > Kevent is a _very_ small entity and there is _no_ cost of
> > > requeueing (well, there is list_add guarded by lock) - after it is
> > > done, process can start real work. With rescheduling there are
> > > _too_ many things to be done before we can start new work. [...]
> >
> > actually, no. For example a wakeup too is fundamentally a list_add
> > guarded by a lock. Take a look at try_to_wake_up(). The rest you see
> > there is just extra frills that relate to things like
> > 'load-balancing the requests over multiple CPUs [which i'm sure
> > kevent users would request in the future too]'.
>
> wake_up() as a call is pretty simple and fast, but its result - it is
> slow. [...]
You are still very much wrong, and now you refuse to even /read/ what i
wrote. Your only reply to my detailed analysis is: "it is slow, because
it is slow and heavy". I told you how fast it is, i told you what
happens on a context switch and why, i told you that you can measure if
you want.
> [...] I did not run rescheduling tests with kernel thread, but posix
> threads (they do look like a kernel thread) have significant overhead
> there.
You are wrong. Let me show you some more numbers. This is a
hackbench_pth.c run:
$ ./hackbench_pth 500
Time: 14.371
this uses 20,000 real threads and during this test the runqueue length
is extreme - up to over ten thousand threads. (hackbench_pth.c was
posted to lkml recently.)
The same run with hackbench.c (20,000 forked processes):
$ ./hackbench 500
Time: 14.632
so the TLB overhead from using processes is 1.8%.
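The 1.8% figure follows directly from the two timings quoted above. As a
worked check (relative_overhead_pct() is a hypothetical helper used only
for this illustration):

```c
/*
 * Relative slowdown of 'other' versus 'baseline', in percent.
 * With the timings from the two runs:
 *   (14.632 - 14.371) / 14.371 * 100 = 0.261 / 14.371 * 100 ~= 1.8
 */
static double relative_overhead_pct(double baseline, double other)
{
	return (other - baseline) / baseline * 100.0;
}
```

Here baseline is the hackbench_pth time (threads) and other is the
hackbench time (processes).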
> [...] In early development days of M:N threading library I tested
> rescheduling performance of the POSIX threads - I created pool of
> threads and 'sent' a message using futex wait/wake - such performance
> of the userspace threading library (I tested erlang) was 10 times
> slower.
how much would it take for you to actually re-measure it and interpret
the results you are seeing? You've apparently built a whole mental house
of cards on the flawed proposition that tasks are 'super-heavy' and that
context-switching them is 'slow'. You are unwilling to explain /how/
they are slow, and all the numbers i post are contrary to that
proposition of yours.
your whole reasoning seems to be faith-based:
[...] Anyway, kevents are very small, threads are very big, [...]
How about following the scientific method instead?
> [...] and both are the way they are exactly on purpose - threads serve
> for processing of any generic code, kevents are used for event waiting
> - IO is such an event, it does not require a lot of infrastructure to
> handle, it only needs some simple bits, so it can be optimized to be
> extremely fast, with huge infrastructure behind each IO (like in case
> when it is a separated thread) it can not be done effectively.
you are wrong, and i have pointed out to you in my previous replies
why you are wrong. Your only coherent specific thought on this topic was
your incorrect assumption that the scheduler somehow saves registers
and that this makes it heavy. I pointed out to you in the mail you
reply to that /every/ system call saves user registers. You've not
even replied to that point of mine, you are ignoring it completely and
you are still repeating your same old, incorrect argument. If it is
heavy, /why/ do you think it is heavy? Where is that magic pixie dust
piece of scheduler code that miraculously turns the runqueue into a
molasses-slow, heavy piece of thing?
Or put in another way: your test-code does ~6 syscalls per every event.
So if what you said would be true (which it isnt), a kevent based
request would have to be just as slow as a thread based request ...
> > i think you are really, really mistaken if you believe that the fact
> > that whole tasks/threads or processes can be 'monster structures',
> > somehow has any relevance to scheduling/task-queueing performance
> > and scalability. It does not matter how large a task's address space
> > is - scheduling only relates to the minimal context that is in the
> > CPU. And most of that context we save upon /every system call
> > entry/, and restore it upon every system call return. If it's so
> > expensive to manipulate, why can the Linux kernel do a full system
> > call in ~150 cycles? That's cheaper than the access latency to a
> > single DRAM page.
>
> I meant not its size, but the whole infrastructure, which surrounds
> task. [...]
/what/ infrastructure do you mean? sched.c? Most of that never runs in
the scheduler hotpath.
> [...] If it is that lightweight, why don't we have posix thread per
> IO? [...]
because it would be pretty stupid to do that?
But more importantly: because many people still believe 'the scheduler
is slow and context-switching is evil'? The FIO AIO syslet code from
Jens is an intelligent mix of queueing combined with async execution. I
expect that model to prevail.
> [...] One question is that mmap/allocation of the stack is too slow
> (and it is very slow indeed, that is why glibc and M:N threading lib
> caches allocated stacks), another one is kernel/userspace boundary
> crossing, next one are tlb flushes, then copies.
now you come up again with creation overhead but nobody is talking about
context creation overhead. (btw., you are also wrong if you think that
mmap() is all that slow - try measuring it one day) We were talking
about context /switching/.
> Why userspace rescheduling is in order of tens times faster than
> kernel/user?
what on earth does this have to do with the topic of whether context
switches are fast enough? Or if you like random info, just let me throw
in a random piece of information as well:
user-space function calls are more than /two/ orders of magnitude faster
than system calls. Still you are using 6 ... SIX system calls in the
sample kevent request handling hotpath.
> > for the same reason has it no relevance that the full kevent-based
> > webserver is a 'monster structure' - still a single request's basic
> > queueing operation is cheap. The same is true to tasks/threads.
>
> To move that tasks there must be done too many steps, and although each
> one can be quite fast, the whole process of rescheduling in the case
> of thousands running threads creates too big overhead per task to drop
> performance.
again, please come up with specifics! I certainly came up with enough
specifics.
Ingo
* Evgeniy Polyakov wrote:
> Even having main dispatcher as epoll/kevent loop, the _whole_
> threadlet model is absolutely micro-thread in nature and not state
> machine/event.
Evgeniy, i'm not sure how many different ways there are to tell you this, but
you are not listening, you are not learning and you are still not
getting it at all.
The scheduler /IS/ a generic work/event queue. And it's pretty damn
fast. No amount of badmouthing will change that basic fact. Not exactly
as fast as a special-purpose queueing system (for all the reasons i
outlined to you, and which you ignored), but it gets pretty damn close
even for the web workload /you/ identified, and offers a user-space
programming model that is about 1000 times more useful than
state-machines.
Ingo
* Linus Torvalds wrote:
> > My tests show that with 4k connections per second (8k concurrency)
> > more than 20k connections of 80k total block in tcp_sendmsg() over
> > gigabit lan between quite fast machines.
>
> Why do people *keep* taking this up as an issue?
>
> Use select/poll/epoll/kevent/whatever for event mechanisms. STOP
> CLAIMING that you'd use threadlets/syslets/aio for that. It's been
> pointed out over and over and over again, and yet you continue to make
> the same mistake, Evgeniy.
>
> So please read that sentence ten times, and then don't continue to
> make that same mistake. PLEASE.
>
> Event mechanisms are *superior* for events. But they *suck* for things
> that aren't events, but are actual code execution with random places
> that can block. THE TWO THINGS ARE TOTALLY AND UTTERLY INDEPENDENT!
Note that even for something tasks are supposed to suck at, and even if
used in extremely stupid ways, they perform reasonably well in practice
;-)
And i fully agree: specialization based on knowledge about frequency of
blocking will always be useful - if not /forced/ on the workflow
architecture and if not overdone. On the other hand, fully event-driven
servers based on 'nonblocking' calls, which Evgeniy is advocating and
which the kevent model is forcing upon userspace, are pure madness.
We very much can and should use things like epoll for events that we
expect to happen asynchronously 100% of the time - it just makes no
sense for those events to take up 4-5K of RAM apiece, when they could
also be only using up the 32 bytes that say a pending timer takes. I've
posted the code for that, how to do an 'outer' epoll loop around an
internal threadlet iterator. But those will always be very narrow event
sources, and likely wont (and shouldnt) cover 'request-internal'
processing.
but otherwise, there is no real difference between a task that is
scheduled and a request that is queued, 'other' than the size of the
request (the task takes 4-5K of RAM), and the register context (64-128
bytes on most CPUs, the loading of which is optimized to death).
Which difference can still be significant for certain workloads, so we
certainly dont want to prohibit specialized event interfaces and force
generic threads on everything. But for anything that isnt a raw and
natural external event source (time, network, disk, user-generated)
there shouldnt be much of an event queueing abstraction i believe (other
than what we get 'for free' within epoll, from having poll()-able files)
- and even for those event sources threadlets offer a pretty good run
for the money.
one can always find the point and workload where say 40,000 threads
start thrashing the L2 cache, but where 40,000 queued special requests
are still fully in cache, and produce spectacular numbers.
Ingo
Some more results, using a larger number of processes and io depths. A
repeat of the tests from friday, with added depth 20000 for syslet and
libaio:
Engine       Depth    Processes    Bw (MiB/sec)
-----------------------------------------------
libaio           1            1             602
syslet           1            1             759
sync             1            1             776
libaio          32            1             832
syslet          32            1             898
libaio       20000            1             581
syslet       20000            1             609
syslet still on top. Measuring O_DIRECT reads (of 4kb size) on ramfs
with 100 processes each with a depth of 200, reading a per-process
private file of 10mb (need to fit in my ram...) 10 times each. IOW,
doing 10,000MiB of IO in total:
Engine       Depth    Processes    Bw (MiB/sec)
-----------------------------------------------
libaio         200          100            1488
syslet         200          100            1714
Results are stable to within approx +/- 10MiB/sec. The syslet case
completes a whole second faster than libaio (~6 vs ~7 seconds). Testing
was done with fio HEAD eb7c8ae27bc301b77490b3586dd5ccab7c95880a, and it
uses the v4 patch series.
Engine       Depth    Processes    Bw (MiB/sec)
-----------------------------------------------
libaio         200          100            1488
syslet         200          100            1714
sync           200          100            1843
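As a worked check of the "~6 vs ~7 seconds" remark, the run times can be
derived from the figures above: 100 processes each reading a 10 MiB file
10 times is 10,000 MiB in total, and dividing by the measured bandwidth
gives the elapsed time. The helper names below are hypothetical and only
serve this illustration.

```c
/* Total IO volume in MiB: processes * file size * passes per process. */
static double total_io_mib(int processes, double file_mib, int passes)
{
	return processes * file_mib * passes;
}

/*
 * Elapsed time in seconds for a given volume and bandwidth:
 *   libaio: 10000 / 1488 ~= 6.7 s
 *   syslet: 10000 / 1714 ~= 5.8 s
 */
static double elapsed_seconds(double total_mib, double bw_mib_per_sec)
{
	return total_mib / bw_mib_per_sec;
}
```

The difference between the two results is a little under one second, in
line with the rounded figures in the text.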
--
Jens Axboe
On Mon, Feb 26, 2007 at 01:39:23PM +0100, Ingo Molnar wrote:
>
> * Evgeniy Polyakov wrote:
>
> > > > Kevent is a _very_ small entity and there is _no_ cost of
> > > > requeueing (well, there is list_add guarded by lock) - after it is
> > > > done, process can start real work. With rescheduling there are
> > > > _too_ many things to be done before we can start new work. [...]
> > >
> > > actually, no. For example a wakeup too is fundamentally a list_add
> > > guarded by a lock. Take a look at try_to_wake_up(). The rest you see
> > > there is just extra frills that relate to things like
> > > 'load-balancing the requests over multiple CPUs [which i'm sure
> > > kevent users would request in the future too]'.
> >
> > wake_up() as a call is pretty simple and fast, but its result - it is
> > slow. [...]
>
> You are still very much wrong, and now you refuse to even /read/ what i
> wrote. Your only reply to my detailed analysis is: "it is slow, because
> it is slow and heavy". I told you how fast it is, i told you what
> happens on a context switch and why, i told you that you can measure if
> you want.
Ingo, you likely will not believe it, but your mails are among the few
which I always read several times to get every bit of them :)
I clearly understand your point of view, it is absolutely clear to me.
But... I can not agree with it, both from a theoretical point of view and
from a practical one based on my measurements. It is not pure
speculation, as one might expect, but a real-life comparison of
kernel/user scheduling with pure userspace scheduling (like in my own M:N
threading lib or a concurrent programming language like erlang).
For me (and probably _only_ for me), it is enough that some lib shows 10
times faster rescheduling to start developing my own, so I pointed to it
in the discussion.
> > [...] I did not run rescheduling tests with kernel thread, but posix
> > threads (they do look like a kernel thread) have significant overhead
> > there.
>
> You are wrong. Let me show you some more numbers. This is a
> hackbench_pth.c run:
>
> $ ./hackbench_pth 500
> Time: 14.371
>
> this uses 20,000 real threads and during this test the runqueue length
> is extreme - up to over ten thousand threads. (hackbench_pth.c was
> posted to lkml recently.)
>
> The same run with hackbench.c (20,000 forked processes):
>
> $ ./hackbench 500
> Time: 14.632
>
> so the TLB overhead from using processes is 1.8%.
>
> > [...] In early development days of M:N threading library I tested
> > rescheduling performance of the POSIX threads - I created pool of
> > threads and 'sent' a message using futex wait/wake - such performance
> > of the userspace threading library (I tested erlang) was 10 times
> > slower.
>
> how much would it take for you to actually re-measure it and interpret
> the results you are seeing? You've apparently built a whole mental house
> of cards on the flawed proposition that tasks are 'super-heavy' and that
> context-switching them is 'slow'. You are unwilling to explain /how/
> they are slow, and all the numbers i post are contrary to that
> proposition of yours.
>
> your whole reasoning seems to be faith-based:
>
> [...] Anyway, kevents are very small, threads are very big, [...]
>
> How about following the scientific method instead?
Those are only rhetorical words, as you have understood I bet. I meant
that the whole process of getting a readiness notification from kevent is
way faster than rescheduling a new process/thread to handle that IO.
The switch from one process to another can itself be as fast as bloody
hell, but all the other details just kill the thing.
I do not know which exact line ends up being the problem, but with
thousands of threads, rescheduling itself results in slower performance
than userspace rescheduling. Then I extrapolate that to our IO test case.
> > [...] and both are the way they are exactly on purpose - threads serve
> > for processing of any generic code, kevents are used for event waiting
> > - IO is such an event, it does not require a lot of infrastructure to
> > handle, it only needs some simple bits, so it can be optimized to be
> > extremely fast, with huge infrastructure behind each IO (like in case
> > when it is a separated thread) it can not be done effectively.
>
> you are wrong, and i have pointed it out to you in my previous replies
> why you are wrong. Your only coherent specific thought on this topic was
> your incorrect assumption that the scheduler somehow saves registers
> and that this makes it heavy. I pointed it out to you in the mail you
> reply to that /every/ system call saves user registers. You've not
> even replied to that point of mine, you are ignoring it completely and
> you are still repeating your same old, incorrect argument. If it is
> heavy, /why/ do you think it is heavy? Where is that magic pixie dust
> piece of scheduler code that miraculously turns the runqueue into a
> molasses-slow, heavy piece of thing?
I do not argue that I'm absolutely right.
I just point out that I tested some cases and that those tests ended up
with completely broken behaviour for the micro-thread design (even apart
from the case of thousands of new thread creations/reuses per second,
which itself does not look perfect).
I am not even trying to say that threadlets suck (although I do believe
that they do in some cases, at least for now :) ), I just point out that
rescheduling overhead happens to be too big in the benchmarks I ran
(which you never replied to, but that does not matter after all :).
It can come down to a (handwaving) broken syscall wrapper implementation,
or to anything else. Absolutely.
I never tried to say that the scheduler's code is broken - I just showed
my own tests, which resulted in a situation where many working threads
can end up with worse timings than some other case.
Register/tlb/whatever is just speculation about a _possible_ root of
the problem. I did not investigate the problem enough - I just decided to
implement a different library. Shame on me for that, since I never showed
what exactly the root of the problem is, but for _me_ it is enough, so I'm
trying to share it with you and other developers.
> Or put in another way: your test-code does ~6 syscalls per every event.
> So if what you said would be true (which it isnt), a kevent based
> request would have to be just as slow as thread based request ...
I can neither confirm nor dispute that statement.
> > > i think you are really, really mistaken if you believe that the fact
> > > that whole tasks/threads or processes can be 'monster structures',
> > > somehow has any relevance to scheduling/task-queueing performance
> > > and scalability. It does not matter how large a task's address space
> > > is - scheduling only relates to the minimal context that is in the
> > > CPU. And most of that context we save upon /every system call
> > > entry/, and restore it upon every system call return. If it's so
> > > expensive to manipulate, why can the Linux kernel do a full system
> > > call in ~150 cycles? That's cheaper than the access latency to a
> > > single DRAM page.
> >
> > I meant not its size, but the whole infrastructure, which surrounds
> > task. [...]
>
> /what/ infrastructure do you mean? sched.c? Most of that never runs in
> the scheduler hotpath.
>
> > [...] If it is that lightweight, why don't we have posix thread per
> > IO? [...]
>
> because it would be pretty stupid to do that?
>
> But more importantly: because many people still believe 'the scheduler
> is slow and context-switching is evil'? The FIO AIO syslet code from
> Jens is an intelligent mix of queueing combined with async execution. I
> expect that model to prevail.
Suparna showed its problems - although on an older version.
Let's see other tests.
> > [...] One question is that mmap/allocation of the stack is too slow
> > (and it is very slow indeed, that is why glibc and M:N threading lib
> > caches allocated stacks), another one is kernel/userspace boundary
> > crossing, next one are tlb flushes, then copies.
>
> now you come up again with creation overhead but nobody is talking about
> context creation overhead. (btw., you are also wrong if you think that
> mmap() is all that slow - try measuring it one day) We were talking
> about context /switching/.
Ugh, Ingo, do not think absolutely...
I did it. And it is slow.
http://tservice.net.ru/~s0mbre/blog/2007/01/15#2007_01_15
> > Why userspace rescheduling is in order of tens times faster than
> > kernel/user?
>
> what on earth does this have to do with the topic of whether context
> switches are fast enough? Or if you like random info, just let me throw
> in a random piece of information as well:
>
> user-space function calls are more than /two/ orders of magnitude faster
> than system calls. Still you are using 6 ... SIX system calls in the
> sample kevent request handling hotpath.
I can only laugh at that, Ingo :)
If you are ever in Moscow, I will buy you a beer, just drop me a mail.
What are we talking about, Ingo: kevent and IO in thread contexts,
or userspace vs. kernelspace scheduling?
Kevent can be broken as hell, or it can be a stupid application which does
not work at all - that is one of the possible theories.
Practice, however, shows that it is not true.
Anyway, if we are talking about kevents and micro-threads, that is one
point; if we are talking about the possible overhead of rescheduling, it
is another topic.
> > > for the same reason has it no relevance that the full kevent-based
> > > webserver is a 'monster structure' - still a single request's basic
> > > queueing operation is cheap. The same is true to tasks/threads.
> >
> > To move that tasks there must be done too many steps, and although each
> > one can be quite fast, the whole process of rescheduling in the case
> > of thousands running threads creates too big overhead per task to drop
> > performance.
>
> again, please come up with specifics! I certainly came up with enough
> specifics.
I thought I showed it several times already.
Anyway: http://tservice.net.ru/~s0mbre/blog/2006/11/09#2006_11_09
That is an initial step, which shows that rescheduling of threads (I DO
NOT talk about problems in sched.c, Ingo, that would be somewhat stupid,
although it could be right) has some overhead compared to userspace
rescheduling. If so, it can be eliminated or reduced.
Second (COMPLETELY DIFFERENT STARTING POINT).
If rescheduling has some overhead, is it possible to reduce it using a
different model for IO? So I created kevent (as you likely do not know, the
original idea was a bit different - network AIO, but the results are quite
good).
> Ingo
--
Evgeniy Polyakov
On Mon, Feb 26, 2007 at 02:57:36PM +0100, Jens Axboe wrote:
>
> Some more results, using a larger number of processes and io depths. A
> repeat of the tests from friday, with added depth 20000 for syslet and
> libaio:
>
> Engine         Depth   Processes        Bw (MiB/sec)
> ----------------------------------------------------
> libaio             1           1                 602
> syslet             1           1                 759
> sync               1           1                 776
> libaio            32           1                 832
> syslet            32           1                 898
> libaio         20000           1                 581
> syslet         20000           1                 609
>
> syslet still on top. Measuring O_DIRECT reads (of 4kb size) on ramfs
> with 100 processes each with a depth of 200, reading a per-process
> private file of 10mb (need to fit in my ram...) 10 times each. IOW,
> doing 10,000MiB of IO in total:
But, why ramfs ? Don't we want to exercise the case where O_DIRECT actually
blocks ? Or am I missing something here ?
Regards
Suparna
>
> Engine         Depth   Processes        Bw (MiB/sec)
> ----------------------------------------------------
> libaio           200         100                1488
> syslet           200         100                1714
>
> Results are stable to within approx +/- 10MiB/sec. The syslet case
> completes a whole second faster than libaio (~6 vs ~7 seconds). Testing
> was done with fio HEAD eb7c8ae27bc301b77490b3586dd5ccab7c95880a, and it
> uses the v4 patch series.
>
> Engine         Depth   Processes        Bw (MiB/sec)
> ----------------------------------------------------
> libaio           200         100                1488
> syslet           200         100                1714
> sync             200         100                1843
>
> --
> Jens Axboe
--
Suparna Bhattacharya ([email protected])
Linux Technology Center
IBM Software Lab, India
* Evgeniy Polyakov <[email protected]> wrote:
> > your whole reasoning seems to be faith-based:
> >
> > [...] Anyway, kevents are very small, threads are very big, [...]
> >
> > How about following the scientific method instead?
>
> That are only rhetorical words as you have understood I bet, I meant
> that the whole process of getting readiness notification from kevent
> is way tooo much faster than rescheduling of the new process/thread to
> handle that IO.
>
> The whole process of switching from one process to another can be as
> fast as bloody hell, but all other details just kill the thing.
for our primary abstractions there /IS NO OTHER DETAIL/ but wakeup and
context-switching! The "event notification" of a sys_read() /IS/ the
wakeup and context-switching that we do - or the epoll/kevent enqueueing
as an alternative.
yes, the two are still different in a number of ways, and yes, it's
still stupid to do a pool of thousands of threads and thus we can always
optimize queuing, RAM and cache footprint via specialization, but your
whole foundation seems to be constructed around the false notion that
queueing and scheduling a task by the scheduler is somehow magically
expensive and different from queueing and scheduling other types of
requests. Please reconsider that foundation and open up a bit more to a
slightly different world view: scheduling is really just another, more
generic (and thus certainly more expensive) type of 'request queueing',
and user-space, most of the time, is much better off if it handles its
'requests' and 'events' via tasks. (Especially if many of those 'events'
turn out to be non-events at all, so to speak.)
Ingo
* Suparna Bhattacharya <[email protected]> wrote:
> > syslet still on top. Measuring O_DIRECT reads (of 4kb size) on ramfs
> > with 100 processes each with a depth of 200, reading a per-process
> > private file of 10mb (need to fit in my ram...) 10 times each. IOW,
> > doing 10,000MiB of IO in total:
>
> But, why ramfs ? Don't we want to exercise the case where O_DIRECT
> actually blocks ? Or am I missing something here ?
ramfs is just the easiest way to measure the pure CPU overhead of a
workload without real IO delays (and resulting idle time) getting in the
way. It's certainly not the same thing, but still it's pretty useful
most of the time. I used a different method, loopback block device, and
got similar results. [ Real IO shows similar results as well, but is a
lot more noisy and hence harder to interpret (and thus easier to get
wrong). ]
Ingo
On Mon, Feb 26 2007, Suparna Bhattacharya wrote:
> On Mon, Feb 26, 2007 at 02:57:36PM +0100, Jens Axboe wrote:
> >
> > Some more results, using a larger number of processes and io depths. A
> > repeat of the tests from friday, with added depth 20000 for syslet and
> > libaio:
> >
> > Engine         Depth   Processes        Bw (MiB/sec)
> > ----------------------------------------------------
> > libaio             1           1                 602
> > syslet             1           1                 759
> > sync               1           1                 776
> > libaio            32           1                 832
> > syslet            32           1                 898
> > libaio         20000           1                 581
> > syslet         20000           1                 609
> >
> > syslet still on top. Measuring O_DIRECT reads (of 4kb size) on ramfs
> > with 100 processes each with a depth of 200, reading a per-process
> > private file of 10mb (need to fit in my ram...) 10 times each. IOW,
> > doing 10,000MiB of IO in total:
>
> But, why ramfs ? Don't we want to exercise the case where O_DIRECT actually
> blocks ? Or am I missing something here ?
Those were just overhead numbers for that test case, let's try something
like the job you described.
Test case is doing random reads from /dev/sdb, in chunks of 64kb:
Engine         Depth   Processes        Bw (KiB/sec)
----------------------------------------------------
libaio           200         100                2813
syslet           200         100                3944
libaio         20000           1                2793
syslet         20000           1                3854
sync (*)       20000           1                2866
deadline was used for IO scheduling, to minimize impact. Not sure why
syslet actually does so much better here, looking at vmstat the rate is
steady and all runs are basically 50/50 idle/wait. One difference is
that the submission itself takes a long time on libaio, since the
io_submit() will block on request allocation. The generated IO pattern
from each process is the same for all runs. The drive is a lousy sata
that doesn't even do queuing, FWIW.
[*] Just for comparison, the depth is obviously really 1 at the kernel
side since it's sync.
--
Jens Axboe
On Mon, Feb 26, 2007 at 01:51:23PM +0100, Ingo Molnar ([email protected]) wrote:
>
> * Evgeniy Polyakov <[email protected]> wrote:
>
> > Even having main dispatcher as epoll/kevent loop, the _whole_
> > threadlet model is absolutely micro-thread in nature and not state
> > machine/event.
>
> Evgeniy, i'm not sure how many different ways to tell this to you, but
> you are not listening, you are not learning and you are still not
> getting it at all.
>
> The scheduler /IS/ a generic work/event queue. And it's pretty damn
> fast. No amount of badmouthing will change that basic fact. Not exactly
> as fast as a special-purpose queueing system (for all the reasons i
> outlined to you, and which you ignored), but it gets pretty damn close
> even for the web workload /you/ identified, and offers a user-space
> programming model that is about 1000 times more useful than
> state-machines.
Meanwhile, on the practical side:
via epia kevent/epoll/threadlet:
client: ab -c500 -n5000 $url
kevent: 849.72
epoll: 538.16
threadlet:
gcc ./evserver_epoll_threadlet.c -o ./evserver_epoll_threadlet
In file included from ./evserver_epoll_threadlet.c:30:
./threadlet.h: In function ‘threadlet_exec’:
./threadlet.h:46: error: can't find a register in class ‘GENERAL_REGS’
while reloading ‘asm’
That particular asm optimization fails to compile.
> Ingo
--
Evgeniy Polyakov
On Mon, Feb 26, 2007 at 03:15:18PM +0100, Ingo Molnar ([email protected]) wrote:
> > > your whole reasoning seems to be faith-based:
> > >
> > > [...] Anyway, kevents are very small, threads are very big, [...]
> > >
> > > How about following the scientific method instead?
> >
> > That are only rhetorical words as you have understood I bet, I meant
> > that the whole process of getting readiness notification from kevent
> > is way tooo much faster than rescheduling of the new process/thread to
> > handle that IO.
> >
> > The whole process of switching from one process to another can be as
> > fast as bloody hell, but all other details just kill the thing.
>
> for our primary abstractions there /IS NO OTHER DETAIL/ but wakeup and
> context-switching! The "event notification" of a sys_read() /IS/ the
> wakeup and context-switching that we do - or the epoll/kevent enqueueing
> as an alternative.
>
> yes, the two are still different in a number of ways, and yes, it's
> still stupid to do a pool of thousands of threads and thus we can always
> optimize queuing, RAM and cache footprint via specialization, but your
> whole foundation seems to be constructed around the false notion that
> queueing and scheduling a task by the scheduler is somehow magically
> expensive and different from queueing and scheduling other types of
> requests. Please reconsider that foundation and open up a bit more to a
> slightly different world view: scheduling is really just another, more
> generic (and thus certainly more expensive) type of 'request queueing',
> and user-space, most of the time, is much better off if it handles its
> 'requests' and 'events' via tasks. (Especially if many of those 'events'
> turn out to be non-events at all, so to speak.)
If kernelspace rescheduling is that fast, then please explain to me why
userspace rescheduling always beats kernel/userspace?
And you showed that threadlets without polling accept still do not
scale well - if it is the same fast queueing of events, then why doesn't
it work?
Actually it does not matter: if such a bottleneck exists (like the
kernel/user transition, or register copying, or something else), it can
be eliminated in a different model - kevent is that model - it does not
require a lot of things to be changed to get a notification and start
working, so it scales better.
It is very similar to epoll, but there are at least two significant
differences:
1. it can work with _any_ type of event with minimal overhead (it can not
even remotely be compared with the 'file' binding, which is required to be
pollable).
2. its notifications do not go through a second loop, i.e. it is O(1),
not O(ready_num), and notifications happen directly from the internals of
the appropriate subsystem, which does not require a special wakeup
(although that can be done too).
> Ingo
--
Evgeniy Polyakov
On Sun, Feb 25, 2007 at 02:44:11PM -0800, Linus Torvalds ([email protected]) wrote:
>
>
> On Thu, 22 Feb 2007, Evgeniy Polyakov wrote:
> >
> > My tests show that with 4k connections per second (8k concurrency) more
> > than 20k connections of 80k total block in tcp_sendmsg() over gigabit
> > lan between quite fast machines.
>
> Why do people *keep* taking this up as an issue?
Let's warm up our brains a little with this pseudo-technical word throwing :)
> Use select/poll/epoll/kevent/whatever for event mechanisms. STOP CLAIMING
> that you'd use threadlets/syslets/aio for that. It's been pointed out over
> and over and over again, and yet you continue to make the same mistake,
> Evgeniy.
>
> So please read that sentence ten times, and then don't continue to make
> that same mistake. PLEASE.
>
> Event mechanisms are *superior* for events. But they *suck* for things
> that aren't events, but are actual code execution with random places that
> can block. THE TWO THINGS ARE TOTALLY AND UTTERLY INDEPENDENT!
>
> Examples of events:
> - packet arrives
> - timer happens
>
> Examples of things that are *not* "events":
> - filesystem lookup.
> - page faults
>
> So the basic point is: for events, you use an event-based thing. For code
> execution, you use a thread-based thing. It's really that simple.
Linus, you made your point clearly - generic AIO should not be used for
the cases where it is supposed to block 90% of the time - only when it
almost never blocks, like in the case of buffered IO.
Otherwise, when the nature of the load requires it to block almost always,
we would see that each block is eventually removed by something which
calls wake_up(). That something is an event, which we were supposed to
wait for, but instead we were rescheduled and waited there, so we just
added an additional layer of indirection - we were scheduled away, did
some work, then were awakened, instead of doing some work and getting the
event.
I do not even dispute that the micro-threading model is much simpler to
program, but from the above example it is clear that it adds additional
overhead, which in turn can be high or noticeable.
> And yes, the two different things can usually be translated (at a very
> high cost in complexity *and* performance) into each other, so people who
> look at it as purely a theoretical exercise may think that "events" and
> "code execution" are equivalent. That's a very very silly and stupid way
> of looking at things in real life, though.
>
> Yes, you can turn things that are better seen as threaded execution into
> an event-based thing by turning it into a state machine. And usually that
> is a TOTAL DISASTER, and the end result is fragile and impossible to
> maintain.
>
> And yes, you can often (more easily) turn an event-based mechanism into a
> thread-based one, and usually the end result is a TOTAL DISASTER because
> it doesn't scale very well, and while it may actually result in somewhat
> simpler code, the overhead of managing ten thousand outstanding threads is
> just too high, when you compare to managing just a list of ten thousand
> outstanding events.
>
> And yes, people have done both of those mistakes. Java, for example,
> largely did the latter mistake ("we don't need anything like 'select',
> because we'll just use threads for everything" - what a totally moronic
> thing to do!)
I can only say that I fully agree. Absolutely. No jokes.
> So Evgeniy, threadlets/syslets/aio is *not* a replacement for event
> queues. It's a TOTALLY DIFFERENT MECHANISM, and one that is hugely
> superior to event queues for certain kinds of things. Anybody who thinks
> they want to do pathname and inode lookup as a series of events is likely
> a moron. It's really that simple.
Hmmm... let me describe that process a bit:
Userspace wants to open a file, so it needs some file-related (inode,
direntry and others) structures in the mem, they should be read from
disk. Eventually it will be reading some blocks from the disk
(for example ext3_lookup->ext3_find_entry->ext3_getblk/ll_rw_block) and
we will wait for them (wait_on_bit()) - we will wait for event.
But I agree, it was a brainfscking example, but nevertheless, it can be
easily done using event driven model.
Reading from the disk is _exactly_ the same - the same waiting for
buffer_heads/pages, and (since it is bigger) it can be easily
transferred to event driven model.
Ugh, wait, it not only _can_ be transferred, it is already done in
kevent AIO, and it shows faster speeds (though I only tested sending
them over the net).
> In a complex server (say, a database), you'd use both. You'd probably use
> events for doing the things you *already* use events for (whether it be
> select/poll/epoll or whatever): probably things like the client network
> connection handling.
>
> But you'd *in*addition* use threadlets to be able to do the actual
> database IO in a threaded manner, so that you can scale the things that
> are not easily handled as events (usually because they have internal
> kernel state that the user cannot even see, and *must*not* see because of
> security issues).
An event is data readiness - no more, no less.
It has nothing to do with internal kernel structures - you just wait until
data is ready in the requested buffer (disk, net, whatever).
The internal mechanism of moving data to the destination point can use an
event driven model too, but that is another question.
Eventually threads wait for the same events - but there is an additional
overhead created to manage those objects called threads.
Ingo says that it is fast to manage them, but it can not be faster than a
properly created event driven abstraction, because, as you noted
yourself, mutual transformations are complex.
Threadlets are simpler to program, but they do not add a gain compared
to a properly created single-threaded model (or a one-thread-per-CPU
model) with the right event processing mechanism.
Waiting for any IO is waiting for an event; other tasks can be made an
event too, but I agree, it is simpler to use different models just
because they already exist.
> So please. Stop this "kevents are better". The only thing you show by
> trying to go down that avenue is that you don't understand the
> *difference* between an event model and a thread model. They are both
> perfectly fine models and they ARE NOT THE SAME! They aren't even mutually
> incompatible - quite the reverse.
>
> The thing people want to remove with threadlets is the internal overhead
> of maintaining special-purpose code like aio_read() inside the kernel,
> that doesn't even do all that people want it to do, and that really does
> need a fair amount of internal complexity that we could hopefully do with
> a more generic (and hopefully *simpler*) model.
Let me rephrase that stuff:
the thing people want to remove with linked lists is the internal
overhead of maintaining special-purpose code like RB trees.
It sounds like it has a similar amount of idiocy in it.
If additional code provides faster processing, it must be used, instead of
fearing 'the internal overhead of maintaining special-purpose code'.
> Linus
--
Evgeniy Polyakov
On Mon, Feb 26, 2007 at 02:11:33PM +0100, Ingo Molnar ([email protected]) wrote:
>
> * Linus Torvalds <[email protected]> wrote:
>
> > > My tests show that with 4k connections per second (8k concurrency)
> > > more than 20k connections of 80k total block in tcp_sendmsg() over
> > > gigabit lan between quite fast machines.
> >
> > Why do people *keep* taking this up as an issue?
> >
> > Use select/poll/epoll/kevent/whatever for event mechanisms. STOP
> > CLAIMING that you'd use threadlets/syslets/aio for that. It's been
> > pointed out over and over and over again, and yet you continue to make
> > the same mistake, Evgeniy.
> >
> > So please read that sentence ten times, and then don't continue to
> > make that same mistake. PLEASE.
> >
> > Event mechanisms are *superior* for events. But they *suck* for things
> > that aren't events, but are actual code execution with random places
> > that can block. THE TWO THINGS ARE TOTALLY AND UTTERLY INDEPENDENT!
>
> Note that even for something tasks are supposed to suck at, and even if
> used in extremely stupid ways, they perform reasonably well in practice
> ;-)
>
> And i fully agree: specialization based on knowledge about frequency of
> blocking will always be useful - if not /forced/ on the workflow
> architecture and if not overdone. On the other hand, fully event-driven
> servers based on 'nonblocking' calls, which Evgeniy is advocating and
> which the kevent model is forcing upon userspace, is pure madness.
>
> We very much can and should use things like epoll for events that we
> expect to happen asynchronously 100% of the time - it just makes no
> sense for those events to take up 4-5K of RAM apiece, when they could
> also be only using up the 32 bytes that say a pending timer takes. I've
> posted the code for that, how to do an 'outer' epoll loop around an
> internal threadlet iterator. But those will always be very narrow event
> sources, and likely wont (and shouldnt) cover 'request-internal'
> processing.
>
> but otherwise, there is no real difference between a task that is
> scheduled and a request that is queued, 'other' than the size of the
> request (the task takes 4-5K of RAM), and the register context (64-128
> bytes on most CPUs, the loading of which is optimized to death).
>
> Which difference can still be significant for certain workloads, so we
> certainly dont want to prohibit specialized event interfaces and force
> generic threads on everything. But for anything that isnt a raw and
> natural external event source (time, network, disk, user-generated)
> there shouldnt be much of an event queueing abstraction i believe (other
> than what we get 'for free' within epoll, from having poll()-able files)
> - and even for those event sources threadlets offer a pretty good run
> for the money.
I tend to agree.
Yes, some loads require an event driven model, others can be done using
threads. The only reason kevent was created is to allow processing any
type of event in exactly the same way in the same processing loop;
it was optimized to have an event registration structure smaller than a
cache line.
What I can not agree with is that IO is thread based stuff.
> one can always find the point and workload where say 40,000 threads
> start thrashing the L2 cache, but where 40,000 queued special requests
> are still fully in cache, and produce spectacular numbers.
>
> Ingo
--
Evgeniy Polyakov
On Mon, 26 Feb 2007, Evgeniy Polyakov wrote:
>
> Linus, you made your point clearly - generic AIO should not be used for
> the cases, when it is supposed to block 90% of the time - only when it
> almost never blocks, like in case of buffered IO.
I don't think it's quite that simple.
EVEN *IF* it were to block 100% of the time, it depends on other things
than just "blockingness".
For example, let's look at something like
	fd = open(filename, O_RDONLY);
	if (fd < 0)
		return -1;
	if (fstat(fd, &st) < 0) {
		close(fd);
		return -1;
	}
	.. do something ..
and please realize that EVEN IF YOU KNOW WITH 100% CERTAINTY that the
above open (or fstat()) is going to block, because you know that your
working set is bigger than the available memory for caching, YOU SIMPLY
CANNOT SANELY WRITE THAT AS AN EVENT-BASED STATE MACHINE!
It's really that simple. Some things block "in the middle". The reason
UNIX made non-blocking reads available for networking, but not for
filesystem accesses is not because one blocks and the other doesn't. No,
it's really much more fundamental than that!
When you do a "recvmsg()", there is a clear event-based model: you can
return -EAGAIN if the data simply isn't there, because a network
connection is a simple stream of data, and there is a clear event on "ok,
data arrived" without any state what-so-ever.
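
As a minimal sketch of that streaming model (not part of the original
mail), the following uses a pipe in place of a network connection. The
helper names set_nonblocking(), try_read() and wait_readable() are made
up for illustration; only the underlying fcntl(), read() and poll() calls
are real interfaces.

```c
#include <errno.h>
#include <fcntl.h>
#include <poll.h>
#include <unistd.h>

/* Make a descriptor non-blocking; returns 0 on success, -errno on failure. */
static int set_nonblocking(int fd)
{
	int flags = fcntl(fd, F_GETFL);

	if (flags < 0 || fcntl(fd, F_SETFL, flags | O_NONBLOCK) < 0)
		return -errno;
	return 0;
}

/*
 * Try to read from a non-blocking stream descriptor.
 * Returns the number of bytes read, 0 on EOF, -EAGAIN when no data has
 * arrived yet, or another negative errno value on failure.
 */
static ssize_t try_read(int fd, void *buf, size_t len)
{
	ssize_t n = read(fd, buf, len);

	return n < 0 ? -errno : n;
}

/*
 * Wait for the "data arrived" event on a stream descriptor.
 * Returns 1 if fd became readable, 0 on timeout, -errno on error.
 */
static int wait_readable(int fd, int timeout_ms)
{
	struct pollfd pfd = { .fd = fd, .events = POLLIN };
	int ret = poll(&pfd, 1, timeout_ms);

	if (ret < 0)
		return -errno;
	return ret > 0 && (pfd.revents & POLLIN);
}
```

The caller loops on try_read(), and whenever it sees -EAGAIN it goes back
to wait_readable() (or to a larger select/epoll/kevent loop). No state
beyond the descriptor itself has to be carried between the two steps.
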
The same is simply not true for "open a file descriptor". There is no sane
way to turn the "filename lookup blocked" into an event model with a
select- or kevent-based interface.
Similarly, even for a simple "read()" on a filesystem, there is no way to
just say "block until data is available" like there is for a socket,
because on a filesystem, the data may be available, BUT AT THE WRONG
OFFSET. So while a socket or a pipe are both simple "streaming interfaces"
as far as a "read()" is concerned, a file access is *not* a simple
streaming interface.
Notice? For a read()/recvmsg() call on a socket or a pipe, there is no
"position" involved. The "event" is clear: it's always the head of the
streaming interface that is relevant, and the event is "is there room" or
"is there data". It's an event-based thing.
But for a read() on a file, it's no longer a streaming interface, and
there is no longer a simple "is there data" event. You'd have to make the
event be a much more complex "is there data at position X through Y" kind
of thing.
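
To make the "position X through Y" point concrete, here is a small sketch
(not part of the original mail) using pread() on a temporary file. The
helper names make_temp_file() and read_range() and the /tmp path template
are assumptions made for illustration.

```c
#include <stdlib.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Create an unlinked temporary file holding `len` bytes of `data`.
 * Returns an open read/write descriptor, or -1 on failure.
 */
static int make_temp_file(const char *data, size_t len)
{
	char path[] = "/tmp/offset-demo-XXXXXX";
	int fd = mkstemp(path);

	if (fd < 0)
		return -1;
	unlink(path);	/* the file lives on until the descriptor is closed */
	if (write(fd, data, len) != (ssize_t)len) {
		close(fd);
		return -1;
	}
	return fd;
}

/*
 * Read `len` bytes starting at `offset`. Unlike a read on a socket or a
 * pipe, the request names an explicit byte range, so "is there data?"
 * only makes sense relative to that range.
 */
static ssize_t read_range(int fd, void *buf, size_t len, off_t offset)
{
	return pread(fd, buf, len, offset);
}
```

Two callers holding the same descriptor can ask for bytes 0-3 and bytes
4-7 and get different answers, and either range may or may not be in the
page cache. A single per-descriptor readiness bit, as poll() provides for
sockets and pipes, has no way to express that.
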
And "read()" on a filesystem is the _simple_ case. Sure, we could add
support for those kinds of ranges, and have an event interface for that.
But the "open a filename" is much more complicated, and doesn't even have
a file descriptor available to it (since we're trying to _create_ one), so
you'd have to do something even more complex to have the event "that
filename can now be opened without blocking".
See? Even if you could make those kinds of events, it would be absolutely
HORRIBLE to code for. And it would suck horribly performance-wise for most
code too.
THAT is what I'm saying. There's a *difference* between event-based and
thread-based programming. It makes no sense to try to turn one into the
other. But it often makes sense to *combine* the two approaches.
> Userspace wants to open a file, so it needs some file-related (inode,
> direntry and others) structures in the mem, they should be read from
> disk. Eventually it will be reading some blocks from the disk
> (for example ext3_lookup->ext3_find_entry->ext3_getblk/ll_rw_block) and
> we will wait for them (wait_on_bit()) - we will wait for event.
>
> But I agree, it was a brainfscking example, but nevertheless, it can be
> easily done using event driven model.
>
> Reading from the disk is _exactly_ the same - the same waiting for
> buffer_heads/pages, and (since it is bigger) it can be easily
> transferred to event driven model.
> Ugh, wait, it not only _can_ be transferred, it is already done in
> kevent AIO, and it shows faster speeds (though I only tested sending
> them over the net).
It would be absolutely horrible to program for. Try anything more complex
than read/write (which is the simplest case, but even that is nasty).
Try imagining yourself in the shoes of a database server (or just about
anything else). Imagine what kind of code you want to write. You probably
do *not* want to have everything be one big event loop, and having to make
different "states" for "I'm trying to open the file", "I opened the file,
am now doing 'fstat()' to figure out how big it is", "I'm now reading the
file and have read X bytes of the total Y bytes I want to read", "I took a
page fault in the middle" etc etc.
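
A rough sketch of what those explicit "states" look like in code (not part
of the original mail; struct request, request_init() and request_step()
are hypothetical names). Note that open(), fstat() and pread() here are
still ordinary blocking calls: there is no readiness event to wait on
before each step, which is exactly the problem described above, and a
page fault on the buffer has no state at all.

```c
#include <fcntl.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

enum req_state {
	REQ_OPEN,	/* "I'm trying to open the file" */
	REQ_FSTAT,	/* "I opened the file, am now doing fstat()" */
	REQ_READ,	/* "have read X bytes of the total Y bytes" */
	REQ_DONE,
	REQ_FAILED,
};

struct request {
	enum req_state state;
	const char *path;
	int fd;
	off_t done;	/* X: bytes read so far */
	off_t total;	/* Y: bytes wanted, capped at buf_len */
	char *buf;
	size_t buf_len;
};

static void request_init(struct request *req, const char *path,
			 char *buf, size_t buf_len)
{
	req->state = REQ_OPEN;
	req->path = path;
	req->fd = -1;
	req->done = 0;
	req->total = 0;
	req->buf = buf;
	req->buf_len = buf_len;
}

/* Close the descriptor, if any, and move to a terminal state. */
static void request_finish(struct request *req, enum req_state final)
{
	if (req->fd >= 0) {
		close(req->fd);
		req->fd = -1;
	}
	req->state = final;
}

/* Advance the request by one step; an event loop would call this repeatedly. */
static enum req_state request_step(struct request *req)
{
	struct stat st;
	ssize_t n;

	switch (req->state) {
	case REQ_OPEN:
		req->fd = open(req->path, O_RDONLY);
		req->state = req->fd < 0 ? REQ_FAILED : REQ_FSTAT;
		break;
	case REQ_FSTAT:
		if (fstat(req->fd, &st) < 0) {
			request_finish(req, REQ_FAILED);
			break;
		}
		req->total = st.st_size;
		if (st.st_size > (off_t)req->buf_len)
			req->total = (off_t)req->buf_len;
		if (req->total == 0)
			request_finish(req, REQ_DONE);
		else
			req->state = REQ_READ;
		break;
	case REQ_READ:
		n = pread(req->fd, req->buf + req->done,
			  (size_t)(req->total - req->done), req->done);
		if (n <= 0) {
			request_finish(req, REQ_FAILED);
			break;
		}
		req->done += n;
		if (req->done == req->total)
			request_finish(req, REQ_DONE);
		break;
	case REQ_DONE:
	case REQ_FAILED:
		break;
	}
	return req->state;
}
```

Even this three-call example needs a struct, an enum and a switch to
replace what is eight lines of straight-line code in the thread-style
version, and every further call in ".. do something .." adds another
state.
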
I pretty much can *guarantee* you that you'll never see anybody do that.
Page faults in user space are particularly hard to handle in a state
machine, since they basically require saving the whole thread state, as
they can happen on any random access. So yeah, you could do them as a
state machine, but in reality it would just become a "user-level thread
library" in the end, just to handle those.
In contrast, if you start using thread-like programming to begin with, you
have none of those issues. Sure, some thread may block because you got a
page fault, or because an inode needed to be brought into memory, but from
a user-level programming interface standpoint, the thread library just
takes care of the "state machine" on its own, so it's much simpler, and in
the end more efficient.
And *THAT* is what I'm trying to say. Some simple obvious events are
better handled and seen as "events" in user space. But other things are so
intertwined, and have basically random state associated with them, that
they are better seen as threads.
Yes, from a "turing machine" kind of viewpoint, the two are 100% logically
equivalent. But "logical equivalence" does NOT translate into "can
practically speaking be implemented".
Linus
On Mon, 2007-02-26 at 20:37 +0300, Evgeniy Polyakov wrote:
> I tend to agree.
> Yes, some loads require event driven model, other can be done using
> threads.
event driven model is really complex though. For event driven to work
well you basically can't tolerate blocking calls at all ...
open() blocks on files. read() may block on metadata reads (say indirect
blockgroups) not just on datareads etc etc. It gets really hairy really
quick that way.
--
if you want to mail me at work (you don't), use arjan (at) linux.intel.com
Test the interaction between Linux and your BIOS via http://www.linuxfirmwarekit.org
On Mon, Feb 26, 2007 at 09:57:00AM -0800, Linus Torvalds wrote:
> Similarly, even for a simple "read()" on a filesystem, there is no way to
> just say "block until data is available" like there is for a socket,
> because on a filesystem, the data may be available, BUT AT THE WRONG
> OFFSET. So while a socket or a pipe are both simple "streaming interfaces"
> as far as a "read()" is concerned, a file access is *not* a simple
> streaming interface.
>
> Notice? For a read()/recvmsg() call on a socket or a pipe, there is no
> "position" involved. The "event" is clear: it's always the head of the
> streaming interface that is relevant, and the event is "is there room" or
> "is there data". It's an event-based thing.
>
> But for a read() on a file, it's no longer a streaming interface, and
> there is no longer a simple "is there data" event. You'd have to make the
> event be a much more complex "is there data at position X through Y" kind
> of thing.
>
> And "read()" on a filesystem is the _simple_ case. Sure, we could add
> support for those kinds of ranges, and have an event interface for that.
> But the "open a filename" is much more complicated, and doesn't even have
> a file descriptor available to it (since we're trying to _create_ one), so
> you'd have to do something even more complex to have the event "that
> filename can now be opened without blocking".
>
> See? Even if you could make those kinds of events, it would be absolutely
> HORRIBLE to code for. And it would suck horribly performance-wise for most
> code too.
I see our main discussion trouble I think.
I never ever tried to say, that every bit in read() should be
non-blocking, it would be even more stupid than you expect it to be.
But you and Ingo do not want to see, what I'm trying to say, because it
is too cozy to read own right ideas than others.
I want to say, that read() consists of tons of events, but programmer
needs only one - data is ready in requested buffer. Programmer might
not even know what is the object behind provided file descriptor.
One only wants data in the buffer.
That is an event.
It is also possible to have other events inside complex read() machinery
- for example waiting for page obtained via ->readpages().
> THAT is what I'm saying. There's a *difference* between event-based and
> thread-based programming. It makes no sense to try to turn one into the
> other. But it often makes sense to *combine* the two approaches.
Thread is simple just because there is an interface.
Change interface, and no one will ever want to do it.
Provide nice aio_read() (forget about posix) and everyone will use it.
I always said that combining of such models is a must, I fully agree,
but it seems that we still do not agree where the bound should be drawn.
> > Userspace wants to open a file, so it needs some file-related (inode,
> > direntry and others) structures in the mem, they should be read from
> > disk. Eventually it will be reading some blocks from the disk
> > (for example ext3_lookup->ext3_find_entry->ext3_getblk/ll_rw_block) and
> > we will wait for them (wait_on_bit()) - we will wait for event.
> >
> > But I agree, it was a brainfscking example, but nevertheless, it can be
> > easily done using event driven model.
> >
> > Reading from the disk is _exactly_ the same - the same waiting for
> > buffer_heads/pages, and (since it is bigger) it can be easily
> > transferred to event driven model.
> > Ugh, wait, it not only _can_ be transferred, it is already done in
> > kevent AIO, and it shows faster speeds (though I only tested sending
> > them over the net).
>
> It would be absolutely horrible to program for. Try anything more complex
> than read/write (which is the simplest case, but even that is nasty).
>
> Try imagining yourself in the shoes of a database server (or just about
> anything else). Imagine what kind of code you want to write. You probably
> do *not* want to have everything be one big event loop, and having to make
> different "states" for "I'm trying to open the file", "I opened the file,
> am now doing 'fstat()' to figure out how big it is", "I'm now reading the
> file and have read X bytes of the total Y bytes I want to read", "I took a
> page fault in the middle" etc etc.
>
> I pretty much can *guarantee* you that you'll never see anybody do that.
> Page faults in user space are particularly hard to handle in a state
> machine, since they basically require saving the whole thread state, as
> they can happen on any random access. So yeah, you could do them as a
> state machine, but in reality it would just become a "user-level thread
> library" in the end, just to handle those.
>
> In contrast, if you start using thread-like programming to begin with, you
> have none of those issues. Sure, some thread may block because you got a
> page fault, or because an inode needed to be brought into memory, but from
> a user-level programming interface standpoint, the thread library just
> takes care of the "state machine" on its own, so it's much simpler, and in
> the end more efficient.
>
> And *THAT* is what I'm trying to say. Some simple obvious events are
> better handled and seen as "events" in user space. But other things are so
> intertwined, and have basically random state associated with them, that
> they are better seen as threads.
In threading you still need to do exactly the same, since stat() can
not be done before open(). So you will maintain some state (actually
order of things which will not be changed).
Have the same in the function.
Then start execution - it can be perfectly done using threads.
Just because it is inconvenient to program using state machines and
there is no appropriate interface.
But there are other operations.
Consider database or whatever you like, which wants to read thousands of
blocks from disk, each access potentially blocks, and blocking on the
mutex is nothing compared to blocking on waiting for storage to be ready
to provide data (network, disk, whatever). If storage is not optimized
(or small cache, or something else) we end up with blocked contexts,
which sleep - thousands of contexts.
And this number will not decrease.
So we ended with enormous overhead just because we do not have simple
enough aio_read() and aio_wait().
> Yes, from a "turing machine" kind of viewpoint, the two are 100% logically
> equivalent. But "logical equivalence" does NOT translate into "can
> practically speaking be implemented".
You have said that finally!
"can practically speaking be implemented".
I see your and Ingo's point - kevent is a 'merde' (pardon my French, but I'm
even somehow glad Ingo said (almost) that - I now have plenty of time for other
interesting projects), we want threads.
Ok, you have threads, but you need to wait on (some of) them.
So you will invent the wheel again - some subsystem which will wait for
threads (likely waiting on futexes), then other events (probably).
Eventually it will be the same from different point of view :)
Anyway, I _do_ hope we understood each other that both events and
threads can co-exist, although the boundary was not drawn.
My point is that many things can be done using event just because they
are faster and smaller - and IO (any IO without limit) is one of the
areas where it is unbeatable.
Threading in turn can be done as a higher-layer abstraction model - thread
can wrap around the whole transaction with all related data flows, but
not thread per IO.
> Linus
--
Evgeniy Polyakov
On Mon, Feb 26, 2007 at 10:19:03AM -0800, Arjan van de Ven wrote:
> On Mon, 2007-02-26 at 20:37 +0300, Evgeniy Polyakov wrote:
>
> > I tend to agree.
> > Yes, some loads require event driven model, other can be done using
> > threads.
>
> event driven model is really complex though. For event driven to work
> well you basically can't tolerate blocking calls at all ...
> open() blocks on files. read() may block on metadata reads (say indirect
> blockgroups) not just on datareads etc etc. It gets really hairy really
> quick that way.
I never ever tried to say _everything_ must be driven by events.
IO must be driven, it is a must IMO.
Some (and a lot of) things can easily be programmed by threads.
Threads are essentially for those tasks which can not be easily
programmed with events.
But have in mind that fact, that thread execution _is_ an event itself,
its completion is an event.
> --
> if you want to mail me at work (you don't), use arjan (at) linux.intel.com
> Test the interaction between Linux and your BIOS via http://www.linuxfirmwarekit.org
--
Evgeniy Polyakov
Evgeniy Polyakov wrote:
> I never ever tried to say _everything_ must be driven by events.
> IO must be driven, it is a must IMO.
Do you disagree with Linus' post about the difficulty of treating
open(), fstat(), page faults, etc. as events? Or do you not consider
them to be IO?
Chris
On Mon, 26 Feb 2007, Evgeniy Polyakov wrote:
>
> I want to say, that read() consists of tons of events, but programmer
> needs only one - data is ready in requested buffer. Programmer might
> not even know what is the object behind provided file descriptor.
> One only wants data in the buffer.
You're not following the discussion.
First off, I already *told* you that "read()" is the absolute simplest
case, and yes, we could make it an event if you also passed in the "which
range of the file do we care about" information (we could consider it
"f_pos", which the kernel already knows about, but that doesn't handle
pread()/pwrite(), so it's not very good for many cases).
But that's not THE ISSUE.
The issue is that it's a horrible interface from a users standpoint.
It's a lot better to program certain things as a thread. Why do you argue
against that, when that is just obviously true.
There's a reason that people write code that is functional, rather than
write code as a state machine.
We simply don't write code like
	for (;;) {
		switch (state) {
		case Open:
			fd = open();
			if (fd < 0)
				break;
			state = Stat;
			/* fall through */
		case Stat:
			if (fstat(fd, &stat) < 0)
				break;
			state = Read;
			/* fall through */
		case Read:
			count = read(fd, buf + pos, size - pos);
			if (count < 0)
				break;
			pos += count;
			if (!count || pos == size)
				state = Close;
			continue;
		case Close:
			if (close(fd) < 0)
				break;
			state = Done;
			return 0;
		}
		/* Returning 1 means wait in the event loop .. */
		if (errno == EAGAIN || errno == EINTR)
			return 1;
		/* Else we had a real error */
		state = Error;
		return -1;
	}
and instead we write it as
	fd = open(..);
	if (fd < 0)
		return -1;
	if (fstat(fd, &st) < 0) {
		close(fd);
		return -1;
	}
	..
and if you cannot see the *reason* why people don't use event-based
programming for everything, I don't see the point of continuing this
discussion.
See? Stop blathering about how everything is an event. THAT'S NOT
RELEVANT. I've told you a hundred times - they may be "logically
equivalent", but that doesn't change ANYTHING. Event-based programming
simply isn't suitable for 99% of all stuff, and for the 1% where it *is*
suitable, it actually tends to be a very specific subset of the code that
you actually use events for (ie accept and read/write on pure streams).
Linus
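For reference, the straight-line open()/fstat()/read() sequence from the mail above can be filled in as a complete function. This is only a minimal sketch: read_whole_file() and its signature are hypothetical names chosen for illustration, not something posted in this thread, and it assumes a regular file whose size fstat() reports correctly.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper (not from the thread): read a whole file into a
 * malloc()ed, NUL-terminated buffer. Stores the byte count in *len.
 * Returns NULL on any error. All "state" lives in ordinary locals and
 * in the position within the function.
 */
static char *read_whole_file(const char *name, size_t *len)
{
	struct stat st;
	size_t pos = 0, size;
	char *buf;
	int fd;

	fd = open(name, O_RDONLY);
	if (fd < 0)
		return NULL;
	if (fstat(fd, &st) < 0)
		goto fail_close;
	size = (size_t)st.st_size;
	buf = malloc(size + 1);
	if (!buf)
		goto fail_close;
	while (pos < size) {
		ssize_t count = read(fd, buf + pos, size - pos);

		if (count < 0) {
			if (errno == EINTR)
				continue;	/* interrupted, just retry */
			free(buf);
			goto fail_close;
		}
		if (count == 0)
			break;			/* file shrank: early EOF */
		pos += (size_t)count;
	}
	close(fd);
	buf[pos] = '\0';
	*len = pos;
	return buf;

fail_close:
	close(fd);
	return NULL;
}
```

Any of the calls may block (on the inode, on a page fault, on the data itself), and the caller does not have to name a single one of those states.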
On Mon, Feb 26, 2007 at 12:56:33PM -0600, Chris Friesen wrote:
> Evgeniy Polyakov wrote:
>
> >I never ever tried to say _everything_ must be driven by events.
> >IO must be driven, it is a must IMO.
>
> Do you disagree with Linus' post about the difficulty of treating
> open(), fstat(), page faults, etc. as events? Or do you not consider
> them to be IO?
From a practical point of view - yes, some of those processes are complex
enough to not attract attention as an async usage model.
But I'm absolutely for the scenario, when several operations are
performed asynchronously like open+stat+fadvise+sendfile.
By IO I meant something which has end result, and that result must be
enough to start async processing - data in the buffer for example.
Async open I would combine with actual data processing - that one can be
a one event.
> Chris
--
Evgeniy Polyakov
On Mon, Feb 26, 2007 at 11:22:46AM -0800, Linus Torvalds wrote:
> See? Stop blathering about how everything is an event. THAT'S NOT
> RELEVANT. I've told you a hundred times - they may be "logically
> equivalent", but that doesn't change ANYTHING. Event-based programming
> simply isn't suitable for 99% of all stuff, and for the 1% where it *is*
> suitable, it actually tends to be a very specific subset of the code that
> you actually use events for (ie accept and read/write on pure streams).
Will you argue that people do things like
num = epoll_wait()
for (i=0; i<num; ++i) {
process(event[i])?
}
Will you spawn thread per IO?
Stop writing the same again and again - I perfectly understand that not
everything can be easily covered by events, but covering everything with
threads is a more stupid idea.
High-performance IO requires as small as possible overhead, dispatching
events from ring buffer or queue from each cpu is the smallest one, but
not spawning a thread per read.
> Linus
--
Evgeniy Polyakov
On Sun, 25 Feb 2007, Evgeniy Polyakov wrote:
> Why userspace rescheduling is in order of tens times faster than
> kernel/user?
About 50 times in my Opteron 254 actually. That's libpcl's (swapcontext
based) cobench against lat_ctx.
- Davide
On Mon, 26 Feb 2007, Ingo Molnar wrote:
>
> * Ingo Molnar <[email protected]> wrote:
>
> > please also try evserver_epoll_threadlet.c that i've attached below -
> > it uses epoll as the main event mechanism but does threadlets for
> > request handling.
>
> find updated code below - your evserver_epoll.c spuriously missed event
> edges - so i changed it back to level-triggered. While that is not as
> fast as edge-triggered, it does not result in spurious hangs and
> workflow 'hickups' during the test.
>
> Could this be the reason why in your testing kevents outperformed epoll?
This is how I handle a read (write/accept/connect, same thing) inside
coronet (coroutine+epoll async library - http://www.xmailserver.org/coronet-lib.html ).
static int conet_read_ll(struct sk_conn *conn, char *buf, int nbyte) {
	int n;

	while ((n = read(conn->sfd, buf, nbyte)) < 0) {
		if (errno == EINTR)
			continue;
		if (errno != EAGAIN && errno != EWOULDBLOCK)
			return -1;
		if (!(conn->events & EPOLLIN)) {
			conn->events = EPOLLIN;
			if (conet_mod_conn(conn, conn->events) < 0)
				return -1;
		}
		if (conet_yield(conn) < 0)
			return -1;
	}
	return n;
}
I use EPOLLET, and you don't change the interest set until you actually
get an EAGAIN. *Many* read/write mode changes in the usage will simply
happen w/out an epoll_ctl() needed. The conet_mod_conn() function does the
actual epoll_ctl() and adds EPOLLET to the specified event set. The
conet_yield() function ends up calling libpcl's co_resume(), which is
basically a switch-to-next-coroutine-until-fd-becomes-ready (maps
directly to a swapcontext).
That cuts 50+% of the epoll_ctl(EPOLL_CTL_MOD) calls.
- Davide
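The edge-triggered pattern described above (register once, then keep reading until EAGAIN before waiting again) can be sketched without any coroutine library. This is not coronet code: set_nonblock(), watch_edge(), drain_fd() and wait_and_drain() are hypothetical helper names, and where coronet would yield to another coroutine this sketch simply goes back to epoll_wait().

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <sys/epoll.h>
#include <sys/types.h>
#include <unistd.h>

/* Put fd into non-blocking mode; returns 0 on success, -1 on error. */
static int set_nonblock(int fd)
{
	int flags = fcntl(fd, F_GETFL, 0);

	if (flags < 0)
		return -1;
	return fcntl(fd, F_SETFL, flags | O_NONBLOCK) < 0 ? -1 : 0;
}

/* Register fd once for edge-triggered input; no later EPOLL_CTL_MOD. */
static int watch_edge(int epfd, int fd)
{
	struct epoll_event ev;

	ev.events = EPOLLIN | EPOLLET;
	ev.data.fd = fd;
	return epoll_ctl(epfd, EPOLL_CTL_ADD, fd, &ev);
}

/*
 * Read until EAGAIN, EOF or a full buffer. Returns the bytes read, or
 * -1 on a real error. With EPOLLET the caller must reach EAGAIN before
 * waiting again, otherwise the next edge may never be reported.
 */
static ssize_t drain_fd(int fd, char *buf, size_t len)
{
	size_t total = 0;

	while (total < len) {
		ssize_t n = read(fd, buf + total, len - total);

		if (n > 0) {
			total += (size_t)n;
			continue;
		}
		if (n == 0)
			break;			/* EOF */
		if (errno == EINTR)
			continue;
		if (errno == EAGAIN || errno == EWOULDBLOCK)
			break;			/* drained, wait for next edge */
		return -1;
	}
	return (ssize_t)total;
}

/* Wait for one edge (up to timeout_ms), then drain the ready fd. */
static ssize_t wait_and_drain(int epfd, char *buf, size_t len, int timeout_ms)
{
	struct epoll_event ev;
	int n;

	do {
		n = epoll_wait(epfd, &ev, 1, timeout_ms);
	} while (n < 0 && errno == EINTR);
	if (n < 0)
		return -1;
	if (n == 0 || !(ev.events & EPOLLIN))
		return 0;			/* timeout or no input event */
	return drain_fd(ev.data.fd, buf, len);
}
```

Two writes that land before the wait show up as a single edge, and a second wait with nothing new written reports nothing, which is why the drain loop has to run to EAGAIN.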
On Mon, 26 Feb 2007, Evgeniy Polyakov wrote:
>
> Will you argue that people do things like
> num = epoll_wait()
> for (i=0; i<num; ++i) {
> process(event[i])?
> }
I have several times told you that I argue for a *combination* of
event-based interfaces and thread-like code. And that the choice depends
on which is more natural. Sometimes you might have just one or the other.
Sometimes you have both. And sometimes you have neither (although,
strictly speaking, normal single-threaded code is certainly "thread-like"
- it's a serial execution, the same way threadlets are serial executions -
it's just not running in parallel with anything else).
> Will you spawn thread per IO?
Depending on what the IO is, yes.
Is that _really_ so hard to understand? There is no "yes" or "no". There's
a "depends on what the problem is, and what the solution looks like".
Sometimes the best way to do parallelism is through explicit threads.
Sometimes it is through whatever "threadlets" or other that gets out of
this whole development discussion. Sometimes it's an event loop.
So get over it. The world is not a black and white, either-or kind of
place. It's full of grayscales, and colors, and mixing things
appropriately. And the choices are often done on whims and on whatever
feels most comfortable to the person doing the choice. Not on some strict
"you must always use things in an event-driven main loop" or "you must
always use threads for parallelism".
The world is simply _richer_ than that kind of either-or thing.
Linus
* Linus Torvalds wrote:
> > Reading from the disk is _exactly_ the same - the same waiting for
> > buffer_heads/pages, and (since it is bigger) it can be easily
> > transferred to event driven model. Ugh, wait, it not only _can_ be
> > transferred, it is already done in kevent AIO, and it shows faster
> > speeds (though I only tested sending them over the net).
>
> It would be absolutely horrible to program for. Try anything more
> complex than read/write (which is the simplest case, but even that is
> nasty).
note that even for something as 'simple and straightforward' as TCP
sockets, the 25-50 lines of evserver code i worked on today had 3
separate bugs, is known to be fundamentally incorrect and one of the
bugs (the lost event problem) showed up as a subtle epoll performance
problem and it took me more than an hour to track down. And that matches
my Tux experience as well: event based models are horribly hard to debug
BECAUSE there is /no procedural state associated with requests/. Hence
there is no real /proof of progress/. Not much to use for debugging -
except huge logs of execution, which, if one is unlucky (which i often
was with Tux) would just make the problem go away.
Furthermore, with a 'retry' model, who guarantees that the retry wont be
an infinite retry where none of the retries ever progresses the state of
the system enough to return the data we are interested in? The moment we
have to /retry/, depending on the 'depth' of how deep the retry kicked
in, we've got to reach that 'depth' of code again and execute it.
plus, 'there is not much state' is not even completely true to begin
with, even in the most simple, TCP socket case! There /is/ quite a bit
of state constructed on the kernel stack: user parameters have been
evaluated/converted, the socket has been looked up, its state has been
validated, etc. With a 'retry' model - but even with a pure 'event
queueing' model we redo all those things /both/ at request submission
and at event generation time, again and again - while with a synchronous
syscall you do it just once and upon event completion a piece of that
data is already on the kernel stack. I'd much rather spend time and
effort on simplifying the scheduler and reducing the cache footprint of
the kernel thread context switch path, etc., to make it more useful even
in more extreme, highly parallel '100% context-switching' case, because i
have first-hand experience about how fragile and inflexible event based
servers are. I do think that event interfaces for raw, external physical
events make sense in some circumstances, but for any more complex
'derived' event type it's less and less clear whether we want a direct
interface to it. For something like the VFS it's outright horrible to
even think about.
Ingo
* Evgeniy Polyakov wrote:
> If kernelspace rescheduling is that fast, then please explain me why
> userspace one always beats kernel/userspace?
because 'user space scheduling' makes no sense? I explained my thinking
about that in a past mail:
-------------------------->
One often repeated (because pretty much only) performance advantage of
'light threads' is context-switch performance between user-space
threads. But reality is, nobody /cares/ about being able to
context-switch between "light user-space threads"! Why? Because there
are only two reasons why such a high-performance context-switch would
occur:
1) there's contention between those two tasks. Wonderful: now two
artificial threads are running on the /same/ CPU and they are even
contending each other. Why not run a single context on a single CPU
instead and only get contended if /another/ CPU runs a conflicting
context?? While this makes for nice "pthread locking benchmarks",
it is not really useful for anything real.
2) there has been an IO event. The thing is, for IO events we enter the
kernel no matter what - and we'll do so for the next 10 years at
minimum. We want to abstract away the hardware, we want to do
reliable resource accounting, we want to share hardware resources,
we want to rate-limit, etc., etc. While in /theory/ you could handle
IO purely from user-space, in practice you dont want to do that. And
if we accept the premise that we'll enter the kernel anyway, there's
zero performance difference between scheduling right there in the
kernel, or returning back to user-space to schedule there. (in fact
i submit that the former is faster). Or if we accept the theoretical
possibility of 'perfect IO hardware' that implements /all/ the
features that the kernel wants (in a secure and generic way, and
mind you, such IO hardware does not exist yet), then /at most/ the
performance advantage of user-space doing the scheduling is the
overhead of a null syscall entry. Which is a whopping 100 nsecs on
modern CPUs! That's roughly the latency of a /single/ DRAM access!
....
<-----------
(see http://lwn.net/Articles/219958/)
btw., the words that follow that section are quite interesting in
retrospect:
| Furthermore, 'light thread' concepts can no way approach the
| performance of #2 state-machines: if you /know/ what the structure of
| your context is, and you can program it in a specialized state-machine
| way, there's just so many shortcuts possible that it's not even funny.
[ oops! ;-) ]
i severely under-estimated the kind of performance one can reach even
with pure procedural concepts. Btw., when i wrote this mail was when i
started thinking about "is it really true that the only way to get good
performance are 100% event-based servers and nonblocking designs?", and
started coding syslets and then threadlets.
Ingo
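The "null syscall entry" cost quoted above is easy to estimate on a given machine. A minimal sketch, assuming that getppid() enters the kernel on every call (it does trivial work there) and that CLOCK_MONOTONIC is available; null_syscall_ns() is a hypothetical helper name, and the result will vary with CPU, kernel version and mitigations, so no particular figure should be expected.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <sys/types.h>
#include <time.h>
#include <unistd.h>

/*
 * Average wall-clock cost, in nanoseconds, of one getppid() call,
 * measured over 'iters' back-to-back calls. Loop overhead is included,
 * so treat the result as an upper bound on the syscall entry/exit cost.
 */
static double null_syscall_ns(long iters)
{
	struct timespec t0, t1;
	volatile pid_t sink = 0;	/* keeps the calls from being dropped */
	long i;

	clock_gettime(CLOCK_MONOTONIC, &t0);
	for (i = 0; i < iters; i++)
		sink = getppid();
	clock_gettime(CLOCK_MONOTONIC, &t1);
	(void)sink;

	return ((double)(t1.tv_sec - t0.tv_sec) * 1e9 +
		(double)(t1.tv_nsec - t0.tv_nsec)) / (double)iters;
}
```

Calling null_syscall_ns(1000000) and printing the result gives a number that can be compared against a DRAM access latency on the same box.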
On Wed, 21 Feb 2007 22:15:21 +0100 Ingo Molnar wrote:
> From: Ingo Molnar <[email protected]>
>
> Add Documentation/syslet-design.txt with a high-level description
> of the syslet concepts.
Just a few comments & questions...
> Signed-off-by: Ingo Molnar <[email protected]>
> Signed-off-by: Arjan van de Ven <[email protected]>
> ---
> Documentation/syslet-design.txt | 137 ++++++++++++++++++++++++++++++++++++++++
> 1 file changed, 137 insertions(+)
>
> Index: linux/Documentation/syslet-design.txt
> ===================================================================
> --- /dev/null
> +++ linux/Documentation/syslet-design.txt
> @@ -0,0 +1,137 @@
> +Syslets / asynchronous system calls
> +===================================
> +
> +the core syslet concepts are:
> +
> +The Syslet Atom:
> +----------------
> +
> +The syslet atom is a small, fixed-size (44 bytes on 32-bit) piece of
> +user-space memory, which is the basic unit of execution within the syslet
> +framework. A syslet represents a single system-call and its arguments.
> +In addition it also has condition flags attached to it that allows the
> +construction of larger programs (syslets) from these atoms.
> +
> +Arguments to the system call are implemented via pointers to arguments.
> +This not only increases the flexibility of syslet atoms (multiple syslets
> +can share the same variable for example), but is also an optimization:
> +copy_uatom() will only fetch syscall parameters up until the point it
> +meets the first NULL pointer. 50% of all syscalls have 2 or less
> +parameters (and 90% of all syscalls have 4 or less parameters).
> +
> + [ Note: since the argument array is at the end of the atom, and the
> + kernel will not touch any argument beyond the final NULL one, atoms
I would s/final/first/ since one could put many (unneeded) NULL pointers
in the argument array.
> + might be packed more tightly. (the only special case exception to
> + this rule would be SKIP_TO_NEXT_ON_STOP atoms, where the kernel will
> + jump a full syslet_uatom number of bytes.) ]
...
> +Completion of asynchronous syslets:
> +-----------------------------------
> +
> +Completion of asynchronous syslets is done via the 'completion ring',
> +which is a ringbuffer of syslet atom pointers user user-space memory,
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ ^^^^ ^^^^ ??
> +provided by user-space as an argument to the sys_async_exec() syscall.
> +The kernel fills in the ringbuffer starting at index 0, and user-space
> +must clear out these pointers. Once the kernel reaches the end of
> +the ring it wraps back to index 0. The kernel will not overwrite
> +non-NULL pointers (but will return an error), user-space has to
> +make sure it completes all events it asked for.
Last sentence is actually 2 sentences, so (e.g.) change the comma
to a semi-colon, xor begin the sentence with "Since".
How does the kernel return an error if the ring buffer is full?
Just a syscall negative error return code?
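The ring protocol in the quoted text (fill from index 0, consumer clears slots, wrap at the end, never overwrite a non-NULL slot) can be modelled in plain user-space C. This is only a sketch of the described behaviour, not the kernel code: struct ring, RING_SIZE, ring_complete() and ring_reap() are hypothetical names, and the -1 return for a full ring is a placeholder, since how the real interface reports that error is exactly the open question above.

```c
#include <assert.h>
#include <stddef.h>

#define RING_SIZE 4	/* hypothetical size, chosen for illustration */

struct ring {
	void *slot[RING_SIZE];	/* NULL means "free" */
	size_t kernel_idx;	/* next slot the producer fills */
	size_t user_idx;	/* next slot the consumer checks */
};

/*
 * Producer side (the kernel's role in the description): store a
 * completed atom pointer. Refuses to overwrite a slot the consumer
 * has not cleared yet.
 */
static int ring_complete(struct ring *r, void *atom)
{
	if (r->slot[r->kernel_idx] != NULL)
		return -1;	/* placeholder error: ring is full */
	r->slot[r->kernel_idx] = atom;
	r->kernel_idx = (r->kernel_idx + 1) % RING_SIZE;
	return 0;
}

/*
 * Consumer side (user-space's role): take the next completion and
 * clear its slot so the producer can reuse it. Returns NULL if
 * nothing has completed.
 */
static void *ring_reap(struct ring *r)
{
	void *atom = r->slot[r->user_idx];

	if (atom == NULL)
		return NULL;
	r->slot[r->user_idx] = NULL;
	r->user_idx = (r->user_idx + 1) % RING_SIZE;
	return atom;
}
```

With four slots, a fifth completion is refused until the consumer reaps one entry, after which the producer wraps around and reuses index 0.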
> +int main(int argc, char *argv[])
> +{
> + unsigned long int fd_out = 1; /* standard output */
> + char *buf = "Hello Syslet World!\n";
> + unsigned long size = strlen(buf);
> + struct syslet_uatom atom, *done;
> +
> + async_head_init();
> +
> + /*
> + * Simple syslet consisting of a single atom:
> + */
> + init_atom(&atom, __NR_sys_write, &fd_out, &buf, &size,
> + NULL, NULL, NULL, NULL, SYSLET_ASYNC, NULL);
init_atom() (was) above. sys_async_exec(), sys_async_wait() are
new syscalls. What are async_head_init() and async_head_exit()?
> + done = sys_async_exec(&atom);
> + if (!done) {
> + sys_async_wait(1);
> + if (completion_ring[curr_ring_idx] == &atom) {
> + completion_ring[curr_ring_idx] = NULL;
> + printf("completed an async syslet atom!\n");
> + }
> + } else {
> + printf("completed an cached syslet atom!\n");
> + }
> +
> + async_head_exit();
> +
> + return 0;
> +}
---
~Randy
*** Remember to use Documentation/SubmitChecklist when testing your code ***
On Mon, 26 Feb 2007, Jens Axboe wrote:
>
> Some more results, using a larger number of processes and io depths. A
> repeat of the tests from friday, with added depth 20000 for syslet and
> libaio:
>
> Engine         Depth     Processes      Bw (MiB/sec)
> ----------------------------------------------------
> libaio             1             1               602
> syslet             1             1               759
> sync               1             1               776
> libaio            32             1               832
> syslet            32             1               898
> libaio         20000             1               581
> syslet         20000             1               609
That looks great! IMO there may be a little higher cost associated with
the syslets thread switches, that we currently do not perform 100%
cleanly, but results look nevertheless awesome.
- Davide
On Mon, Feb 26, 2007 at 09:35:43PM +0100, Ingo Molnar wrote:
> * Evgeniy Polyakov <[email protected]> wrote:
>
> > If kernelspace rescheduling is that fast, then please explain me why
> > userspace one always beats kernel/userspace?
>
> because 'user space scheduling' makes no sense? I explained my thinking
> about that in a past mail:
>
> -------------------------->
> One often repeated (because pretty much only) performance advantage of
> 'light threads' is context-switch performance between user-space
> threads. But reality is, nobody /cares/ about being able to
> context-switch between "light user-space threads"! Why? Because there
> are only two reasons why such a high-performance context-switch would
> occur:
...
> 2) there has been an IO event. The thing is, for IO events we enter the
> kernel no matter what - and we'll do so for the next 10 years at
> minimum. We want to abstract away the hardware, we want to do
> reliable resource accounting, we want to share hardware resources,
> we want to rate-limit, etc., etc. While in /theory/ you could handle
> IO purely from user-space, in practice you dont want to do that. And
> if we accept the premise that we'll enter the kernel anyway, there's
> zero performance difference between scheduling right there in the
> kernel, or returning back to user-space to schedule there. (in fact
> i submit that the former is faster). Or if we accept the theoretical
> possibility of 'perfect IO hardware' that implements /all/ the
> features that the kernel wants (in a secure and generic way, and
> mind you, such IO hardware does not exist yet), then /at most/ the
> performance advantage of user-space doing the scheduling is the
> overhead of a null syscall entry. Which is a whopping 100 nsecs on
> modern CPUs! That's roughly the latency of a /single/ DRAM access!
Ingo and Evgeniy,
I was trying to avoid getting into this discussion, but whatever. M:N
threading systems also require just about all of the threading semantics
that are inside the kernel to be available in userspace. Implementations
of the userspace scheduler side of things must be able to turn off
preemption to do per CPU local storage, report blocking/preempting
(via upcall or a mailbox) and other scheduler-ish things in a reliable way
so that the complexity of a system like that ends up not being worth it
and is often monstrously large to implement and debug. That's why
Solaris 10 removed their scheduler activations framework and went with
1:1 like in Linux since the scheduler activations model is so difficult
to control. The slowness of the futex stuff might be compounded by some
VM mapping issues that Bill Irwin and Peter Zijlstra have pointed out in
the past in this regard, if I understand correctly.
Bryan Cantrill of Solaris 10/dtrace fame can comment on that if you ask
him sometime.
As an exercise, think about all of the things you need to either migrate
or to do a cross CPU wake of a task. It goes to hell in complexity
really quick. Erlang and other language based concurrency systems get
their regularities by indirectly oversimplifying what threading is from
what kernel folks are used to. Try doing a cross CPU wake quickly on a
system like that, good luck. Now think about how to do an IPI in
userspace? Good luck.
That's all :)
bill
On Mon, 26 Feb 2007, Evgeniy Polyakov wrote:
> 2. its notifications do not go through the second loop, i.e. it is O(1),
> not O(ready_num), and notifications happens directly from internals of
> the appropriate subsystem, which does not require special wakeup
> (although it can be done too).
Sorry if I do not read kevent code correctly, but in kevent_user_wait()
there is a:
while (num < max_nr && ((k = kevent_dequeue_ready(u)) != NULL)) {
...
}
loop, that makes it O(ready_num). From a mathematical standpoint, they're
both O(ready_num), but epoll is doing three passes over the ready set.
I always thought that if the number of ready events is so big that the extra
passes over the ready set become relevant, probably the "work" done by
userspace for each fetched event would make the extra cost irrelevant.
But that can be fixed by a patch that will follow on lkml ...
- Davide
On Mon, Feb 26, 2007 at 03:45:48PM +0100, Jens Axboe wrote:
> On Mon, Feb 26 2007, Suparna Bhattacharya wrote:
> > On Mon, Feb 26, 2007 at 02:57:36PM +0100, Jens Axboe wrote:
> > >
> > > Some more results, using a larger number of processes and io depths. A
> > > repeat of the tests from friday, with added depth 20000 for syslet and
> > > libaio:
> > >
> > > Engine          Depth      Processes    Bw (MiB/sec)
> > > ----------------------------------------------------
> > > libaio          1          1            602
> > > syslet          1          1            759
> > > sync            1          1            776
> > > libaio          32         1            832
> > > syslet          32         1            898
> > > libaio          20000      1            581
> > > syslet          20000      1            609
> > >
> > > syslet still on top. Measuring O_DIRECT reads (of 4kb size) on ramfs
> > > with 100 processes each with a depth of 200, reading a per-process
> > > private file of 10mb (need to fit in my ram...) 10 times each. IOW,
> > > doing 10,000MiB of IO in total:
> >
> > But, why ramfs ? Don't we want to exercise the case where O_DIRECT actually
> > blocks ? Or am I missing something here ?
>
> Just overhead numbers for that test case, lets try something like your
> described job.
>
> Test case is doing random reads from /dev/sdb, in chunks of 64kb:
>
> Engine          Depth      Processes    Bw (KiB/sec)
> ----------------------------------------------------
> libaio          200        100          2813
> syslet          200        100          3944
> libaio          20000      1            2793
> syslet          20000      1            3854
> sync (*)        20000      1            2866
>
> deadline was used for IO scheduling, to minimize impact. Not sure why
> syslet actually does so much better here, looking at vmstat the rate is
> steady and all runs are basically 50/50 idle/wait. One difference is
> that the submission itself takes a long time on libaio, since the
> io_submit() will block on request allocation. The generated IO pattern
> from each process is the same for all runs. The drive is a lousy sata
> that doesn't even do queuing, FWIW.
I tried the latest fio code with syslet v4, and my results are a little
different - have yet to figure out why or what to make of it.
I hope I have all the right pieces now.
This is an ext2 filesystem, SCSI AIC7xxx.
I used an iodepth_batch size of 8 to limit the number of ios in a single
io_submit (thanks for adding that parameter to fio !), like we did in
aio-stress.
Engine          Depth      Batch        Bw (KiB/sec)
----------------------------------------------------
libaio          64         8            17,226
syslet          64         8            17,620
libaio          20000      8            18,552
syslet          20000      8            14,935
Which is not bad, actually.
If I do not specify the iodepth_batch (i.e. default to depth), then the
difference becomes more pronounced at higher depths. However, I doubt
whether anyone would be using such high batch sizes in practice ...
Engine          Depth      Batch        Bw (KiB/sec)
----------------------------------------------------
libaio          64         default      17,429
syslet          64         default      16,155
libaio          20000      default      15,494
syslet          20000      default       7,971
Often times it is the application tuning that makes all the difference,
so am not really sure how much to read into these results.
That's always been the hard part of async io ...
Regards
Suparna
>
> [*] Just for comparison, the depth is obviously really 1 at the kernel
> side since it's sync.
>
> --
> Jens Axboe
--
Suparna Bhattacharya ([email protected])
Linux Technology Center
IBM Software Lab, India
* Randy Dunlap <[email protected]> wrote:
> > + [ Note: since the argument array is at the end of the atom, and the
> > + kernel will not touch any argument beyond the final NULL one, atoms
>
> I would s/final/first/ since one could put many (unneeded) NULL
> pointers in the argument array.
ok, fixed.
> > +Completion of asynchronous syslets is done via the 'completion ring',
> > +which is a ringbuffer of syslet atom pointers user user-space memory,
>
> ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ ^^^^ ^^^^ ??
fixed - it's "in user-space memory".
> > +provided by user-space as an argument to the sys_async_exec() syscall.
> > +The kernel fills in the ringbuffer starting at index 0, and user-space
> > +must clear out these pointers. Once the kernel reaches the end of
> > +the ring it wraps back to index 0. The kernel will not overwrite
> > +non-NULL pointers (but will return an error), user-space has to
> > +make sure it completes all events it asked for.
>
> Last sentence is actually 2 sentences, so (e.g.) change the comma
> to a semi-colon, xor begin the sentence with "Since".
i've changed it to ', and thus user-space has to make sure'.
> How does the kernel return an error if the ring buffer is full? Just a
> syscall negative error return code?
correct - we return a -EFAULT in that case. Should probably do something
more distinguished though?
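
The completion-ring rules quoted above (fill from index 0, wrap at the end, never overwrite a non-NULL slot, user-space clears slots) can be modelled in plain user-space C. This is only a sketch of the described semantics, not the syslet code itself: the struct layout, RING_SIZE and the function names are made up for illustration, and the "kernel" side is simulated by an ordinary function.

```c
#include <errno.h>
#include <stddef.h>

#define RING_SIZE 4              /* hypothetical size, for illustration only */

struct ring {
    void *slot[RING_SIZE];       /* NULL means "free" */
    unsigned int kernel_idx;     /* next slot the producer fills */
    unsigned int user_idx;       /* next slot the consumer reads */
};

/* Producer side (the kernel's role): refuse to overwrite a non-NULL slot. */
int ring_complete(struct ring *r, void *atom)
{
    if (r->slot[r->kernel_idx] != NULL)
        return -EFAULT;          /* ring full: user-space has not cleared it */
    r->slot[r->kernel_idx] = atom;
    r->kernel_idx = (r->kernel_idx + 1) % RING_SIZE;   /* wrap to index 0 */
    return 0;
}

/* Consumer side (user-space): take one completion and clear its slot. */
void *ring_consume(struct ring *r)
{
    void *atom = r->slot[r->user_idx];

    if (atom == NULL)
        return NULL;             /* nothing completed yet */
    r->slot[r->user_idx] = NULL; /* clearing is what frees the slot */
    r->user_idx = (r->user_idx + 1) % RING_SIZE;
    return atom;
}
```

With a four-entry ring, a fifth ring_complete() call fails until ring_consume() has cleared the oldest slot, which is the "user-space has to make sure it completes all events it asked for" rule in miniature.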
> > + /*
> > + * Simple syslet consisting of a single atom:
> > + */
> > + init_atom(&atom, __NR_sys_write, &fd_out, &buf, &size,
> > + NULL, NULL, NULL, NULL, SYSLET_ASYNC, NULL);
>
> init_atom() (was) above. sys_async_exec(), sys_async_wait() are new
> syscalls. What are async_head_init() and async_head_exit()?
They used to initialize kernel-side stuff too, but now they only
initialize the user-space-head structure: the ring and a 'head stack
pointer'.
Ingo
* Evgeniy Polyakov <[email protected]> wrote:
> On Mon, Feb 26, 2007 at 01:51:23PM +0100, Ingo Molnar ([email protected]) wrote:
> >
> > * Evgeniy Polyakov <[email protected]> wrote:
> >
> > > Even having main dispatcher as epoll/kevent loop, the _whole_
> > > threadlet model is absolutely micro-thread in nature and not state
> > > machine/event.
> >
> > Evgeniy, i'm not sure how many different ways to tell this to you, but
> > you are not listening, you are not learning and you are still not
> > getting it at all.
> >
> > The scheduler /IS/ a generic work/event queue. And it's pretty damn
> > fast. No amount of badmouthing will change that basic fact. Not exactly
> > as fast as a special-purpose queueing system (for all the reasons i
> > outlined to you, and which you ignored), but it gets pretty damn close
> > even for the web workload /you/ identified, and offers a user-space
> > programming model that is about 1000 times more useful than
> > state-machines.
>
> Meanwhile on the practical side:
> via epia kevent/epoll/threadlet:
>
> client: ab -c500 -n5000 $url
>
> kevent: 849.72
> epoll: 538.16
> threadlet:
> gcc ./evserver_epoll_threadlet.c -o ./evserver_epoll_threadlet
> In file included from ./evserver_epoll_threadlet.c:30:
> ./threadlet.h: In function ‘threadlet_exec’:
> ./threadlet.h:46: error: can't find a register in class ‘GENERAL_REGS’
> while reloading ‘asm’
>
> That particular asm optimization fails to compile.
it's not really an asm optimization but an API glue. I'm using:
gcc -O2 -g -Wall -o evserver_epoll_threadlet evserver_epoll_threadlet.c -fomit-frame-pointer
does that work for you?
Ingo
On Mon, Feb 26, 2007 at 12:04:51PM -0800, Linus Torvalds ([email protected]) wrote:
>
>
> On Mon, 26 Feb 2007, Evgeniy Polyakov wrote:
> >
> > Will you argue that people do things like
> > num = epoll_wait()
> > for (i=0; i<num; ++i) {
> > process(event[i])?
> > }
>
> I have several times told you that I argue for a *combination* of
> event-based interfaces and thread-like code. And that the choice depends
> on which is more natural. Sometimes you might have just one or the other.
> Sometimes you have both. And sometimes you have neither (although,
> strictly speaking, normal single-threaded code is certainly "thread-like"
> - it's a serial execution, the same way threadlets are serial executions -
> it's just not running in parallel with anything else).
>
> > Will you spawn thread per IO?
>
> Depending on what the IO is, yes.
>
> Is that _really_ so hard to understand? There is no "yes" or "no". There's
> a "depends on what the problem is, and what the solution looks like".
>
> Sometimes the best way to do parallelism is through explicit threads.
> Sometimes it is through whatever "threadlets" or other that gets out of
> this whole development discussion. Sometimes it's an event loop.
>
> So get over it. The world is not a black and white, either-or kind of
> place. It's full of grayscales, and colors, and mixing things
> appropriately. And the choices are often done on whims and on whatever
> feels most comfortable to the person doing the choice. Not on some strict
> "you must always use things in an event-driven main loop" or "you must
> always use threads for parallelism".
>
> The world is simply _richer_ than that kind of either-or thing.
It seems you like to repeat that white is white and it is not black :)
Did you see what I wrote in my previous e-mail to you?
No, you did not.
I wrote that both models should co-exist, and the boundary between the two is
not clear, but for the described workloads IMO the event-driven model provides
much higher performance.
That is it, Linus, no one wants to say that you say something stupid -
just read what others write to you.
Thanks.
> Linus
--
Evgeniy Polyakov
On Tue, Feb 27 2007, Suparna Bhattacharya wrote:
> On Mon, Feb 26, 2007 at 03:45:48PM +0100, Jens Axboe wrote:
> > On Mon, Feb 26 2007, Suparna Bhattacharya wrote:
> > > On Mon, Feb 26, 2007 at 02:57:36PM +0100, Jens Axboe wrote:
> > > >
> > > > Some more results, using a larger number of processes and io depths. A
> > > > repeat of the tests from friday, with added depth 20000 for syslet and
> > > > libaio:
> > > >
> > > > Engine          Depth      Processes    Bw (MiB/sec)
> > > > ----------------------------------------------------
> > > > libaio          1          1            602
> > > > syslet          1          1            759
> > > > sync            1          1            776
> > > > libaio          32         1            832
> > > > syslet          32         1            898
> > > > libaio          20000      1            581
> > > > syslet          20000      1            609
> > > >
> > > > syslet still on top. Measuring O_DIRECT reads (of 4kb size) on ramfs
> > > > with 100 processes each with a depth of 200, reading a per-process
> > > > private file of 10mb (need to fit in my ram...) 10 times each. IOW,
> > > > doing 10,000MiB of IO in total:
> > >
> > > But, why ramfs ? Don't we want to exercise the case where O_DIRECT actually
> > > blocks ? Or am I missing something here ?
> >
> > Just overhead numbers for that test case, lets try something like your
> > described job.
> >
> > Test case is doing random reads from /dev/sdb, in chunks of 64kb:
> >
> > Engine          Depth      Processes    Bw (KiB/sec)
> > ----------------------------------------------------
> > libaio          200        100          2813
> > syslet          200        100          3944
> > libaio          20000      1            2793
> > syslet          20000      1            3854
> > sync (*)        20000      1            2866
> >
> > deadline was used for IO scheduling, to minimize impact. Not sure why
> > syslet actually does so much better here, looking at vmstat the rate is
> > steady and all runs are basically 50/50 idle/wait. One difference is
> > that the submission itself takes a long time on libaio, since the
> > io_submit() will block on request allocation. The generated IO pattern
> > from each process is the same for all runs. The drive is a lousy sata
> > that doesn't even do queuing, FWIW.
>
>
> I tried the latest fio code with syslet v4, and my results are a little
> different - have yet to figure out why or what to make of it.
> I hope I have all the right pieces now.
>
> This is an ext2 filesystem, SCSI AIC7xxx.
>
> I used an iodepth_batch size of 8 to limit the number of ios in a single
> io_submit (thanks for adding that parameter to fio !), like we did in
> aio-stress.
>
> Engine          Depth      Batch        Bw (KiB/sec)
> ----------------------------------------------------
> libaio          64         8            17,226
> syslet          64         8            17,620
> libaio          20000      8            18,552
> syslet          20000      8            14,935
>
>
> Which is not bad, actually.
It's not bad for such a high depth/batch setting, but I still wonder why
our results are so different. I'll look around for an x86 box with some
TCQ/NCQ enabled storage attached for testing. Can you pass me your
command line or job file (whatever you use) so we are on the same page?
> If I do not specify the iodepth_batch (i.e. default to depth), then the
> difference becomes more pronounced at higher depths. However, I doubt
> whether anyone would be using such high batch sizes in practice ...
>
> Engine          Depth      Batch        Bw (KiB/sec)
> ----------------------------------------------------
> libaio          64         default      17,429
> syslet          64         default      16,155
> libaio          20000      default      15,494
> syslet          20000      default       7,971
>
If iodepth_batch isn't set, the syslet queued io will be serialized and
not take advantage of queueing. How does the job file perform with
ioengine=sync?
> Often times it is the application tuning that makes all the difference,
> so am not really sure how much to read into these results.
> That's always been the hard part of async io ...
Yes I agree, it's handy to get an overview though.
--
Jens Axboe
On Mon, Feb 26, 2007 at 09:35:43PM +0100, Ingo Molnar ([email protected]) wrote:
> > If kernelspace rescheduling is that fast, then please explain me why
> > userspace one always beats kernel/userspace?
>
> because 'user space scheduling' makes no sense? I explained my thinking
> about that in a past mail:
>
> -------------------------->
> One often repeated (because pretty much only) performance advantage of
> 'light threads' is context-switch performance between user-space
> threads. But reality is, nobody /cares/ about being able to
> context-switch between "light user-space threads"! Why? Because there
> are only two reasons why such a high-performance context-switch would
> occur:
>
> 1) there's contention between those two tasks. Wonderful: now two
> artificial threads are running on the /same/ CPU and they are even
> contending each other. Why not run a single context on a single CPU
> instead and only get contended if /another/ CPU runs a conflicting
> context?? While this makes for nice "pthread locking benchmarks",
> it is not really useful for anything real.
Obviously there must be several real threads, each of them can reschedule
userspace threads, because it is fast and scalable.
So, one CPU - one real thread.
> 2) there has been an IO event. The thing is, for IO events we enter the
> kernel no matter what - and we'll do so for the next 10 years at
> minimum. We want to abstract away the hardware, we want to do
> reliable resource accounting, we want to share hardware resources,
> we want to rate-limit, etc., etc. While in /theory/ you could handle
> IO purely from user-space, in practice you dont want to do that. And
> if we accept the premise that we'll enter the kernel anyway, there's
> zero performance difference between scheduling right there in the
> kernel, or returning back to user-space to schedule there. (in fact
> i submit that the former is faster). Or if we accept the theoretical
> possibility of 'perfect IO hardware' that implements /all/ the
> features that the kernel wants (in a secure and generic way, and
> mind you, such IO hardware does not exist yet), then /at most/ the
> performance advantage of user-space doing the scheduling is the
> overhead of a null syscall entry. Which is a whopping 100 nsecs on
> modern CPUs! That's roughly the latency of a /single/ DRAM access!
> ....
And here we start our discussion again from the beginning :)
We entered the kernel, of course, but then the kernel decides to move the
thread away and put another one in its place under the hardware sun - so
the kernel starts to move registers, change some states and so on.
While in the case of userspace threads we have already returned back to
userspace (at the cost of the overhead described above) and start doing a
new task - move registers, change some states and so on.
And somehow it happens that doing it in userspace (for example with
setjmp/longjmp) is faster than in the kernel. I do not know why - I never
investigated that case, but that is it.
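
For readers who have not seen a user-space context switch, here is a minimal sketch of the mechanism. It deliberately uses the POSIX ucontext calls (getcontext/makecontext/swapcontext) instead of setjmp/longjmp, because jumping between separate stacks with longjmp is not portable. The names run_switches, task_fn and STACK_SIZE are invented for this example. Note that glibc's swapcontext also saves and restores the signal mask, which involves a system call, so this illustrates the idea and says nothing about which side of the argument above is right on performance.

```c
#include <ucontext.h>

#define STACK_SIZE (64 * 1024)   /* arbitrary stack size for the sketch */

static ucontext_t main_ctx, task_ctx;
static char task_stack[STACK_SIZE];
static int steps;                /* how many times the task was resumed */

/* The "light thread": do one unit of work, then yield to the main context. */
static void task_fn(void)
{
    for (;;) {
        steps++;
        swapcontext(&task_ctx, &main_ctx);   /* save ourselves, resume main */
    }
}

/* Switch into the task n times and report how often it actually ran. */
int run_switches(int n)
{
    steps = 0;
    getcontext(&task_ctx);
    task_ctx.uc_stack.ss_sp = task_stack;
    task_ctx.uc_stack.ss_size = sizeof(task_stack);
    task_ctx.uc_link = &main_ctx;            /* where to go if task_fn returns */
    makecontext(&task_ctx, task_fn, 0);

    for (int i = 0; i < n; i++)
        swapcontext(&main_ctx, &task_ctx);   /* save main, resume the task */
    return steps;
}
```

Each swapcontext() pair is one round trip between the two contexts; timing run_switches() against a pair of kernel threads ping-ponging on a pipe or futex would be one way to measure the difference being debated here.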
> <-----------
>
> (see http://lwn.net/Articles/219958/)
>
> btw., the words that follow that section are quite interesting in
> retrospect:
>
> | Furthermore, 'light thread' concepts can no way approach the
> | performance of #2 state-machines: if you /know/ what the structure of
> | your context is, and you can program it in a specialized state-machine
> | way, there's just so many shortcuts possible that it's not even funny.
>
> [ oops! ;-) ]
>
> i severely under-estimated the kind of performance one can reach even
> with pure procedural concepts. Btw., when i wrote this mail was when i
> started thinking about "is it really true that the only way to get good
> performance are 100% event-based servers and nonblocking designs?", and
> started coding syslets and then threadlets.
:-)
Threads happen to be really fast, and easy to program, but the maximum
performance still can not be achieved with them just because event
handling is faster - read one cacheline from the ring or queue, or
reschedule threads?
Anyway, what are we talking about, Ingo?
I understand your point, I also understand that you are not going to
change it (yes, that's it, but I need to confess that I'm guilty too - I
doubt I will ever think that thousands of threads doing IO is a win), so we
can close the discussion at this point.
My main point in participating in it was kevent introduction - although
I created kevent AIO as a state machine, I never wanted to promote
kevent as AIO - kevent is an event processing mechanism, one of its usage
models is AIO, other ones are signals, file events, timers, whatever...
You drew a line - kevent is not needed - that is your point.
I wanted to hear definitive words half a year ago, but the community kept
silent. Eventually things are done.
Thanks.
> Ingo
--
Evgeniy Polyakov
On Mon, Feb 26, 2007 at 06:18:51PM -0800, Davide Libenzi ([email protected]) wrote:
> On Mon, 26 Feb 2007, Evgeniy Polyakov wrote:
>
> > 2. its notifications do not go through the second loop, i.e. it is O(1),
> > not O(ready_num), and notifications happens directly from internals of
> > the appropriate subsystem, which does not require special wakeup
> > (although it can be done too).
>
> Sorry if I do not read kevent code correctly, but in kevent_user_wait()
> there is a:
>
> while (num < max_nr && ((k = kevent_dequeue_ready(u)) != NULL)) {
> ...
> }
>
> loop, that make it O(ready_num). From a mathematical standpoint, they're
> both O(ready_num), but epoll is doing three passes over the ready set.
> I always though that if the number of ready events is so big that the more
> passes over the ready set becomes relevant, probably the "work" done by
> userspace for each fetched event would make the extra cost irrelevant.
> But that can be fixed by a patch that will follow on lkml ...
No, kevent_dequeue_ready() copies data to userspace, that is it.
So it looks roughly like the following:
storage is ready: -> kevent_requeue() - ends up adding the event to the end
of the queue (list add under spinlock)
kevent_wait() -> copy first, second, ...
The kevent poll (as well as epoll) model requires an _additional_ check in
userspace context before it is copied, so we end up checking the
full ready queue again - that is what I pointed to as O(ready_num); O() implies
the price of copying to userspace, list_add and so on.
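
The two steps described above (an O(1) append when an event becomes ready, then a drain loop bounded by max_nr) can be sketched in user-space C. This is not the kevent code: the struct and function names are invented, the spinlock is left out, and "copy to userspace" is reduced to storing an id in an array.

```c
#include <stddef.h>

struct event {
    int id;
    struct event *next;
};

struct ready_queue {
    struct event *head, *tail;
};

/* Subsystem side: O(1) append to the tail (real code would hold a lock). */
void ready_enqueue(struct ready_queue *q, struct event *e)
{
    e->next = NULL;
    if (q->tail)
        q->tail->next = e;
    else
        q->head = e;
    q->tail = e;
}

/* Pop the oldest ready event, or NULL if the queue is empty. */
struct event *ready_dequeue(struct ready_queue *q)
{
    struct event *e = q->head;

    if (!e)
        return NULL;
    q->head = e->next;
    if (!q->head)
        q->tail = NULL;
    return e;
}

/* Waiter side: same loop shape as the one quoted from kevent_user_wait(). */
int ready_wait(struct ready_queue *q, int *out, int max_nr)
{
    struct event *e;
    int num = 0;

    while (num < max_nr && (e = ready_dequeue(q)) != NULL)
        out[num++] = e->id;      /* stands in for the copy to userspace */
    return num;
}
```

The drain loop is linear in the number of events returned, which is the point Davide makes; the disagreement is only about whether an extra readiness check per event is needed before the copy.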
> - Davide
>
--
Evgeniy Polyakov
On Mon, Feb 26, 2007 at 08:54:16PM +0100, Ingo Molnar ([email protected]) wrote:
>
> * Linus Torvalds <[email protected]> wrote:
>
> > > Reading from the disk is _exactly_ the same - the same waiting for
> > > buffer_heads/pages, and (since it is bigger) it can be easily
> > > transferred to event driven model. Ugh, wait, it not only _can_ be
> > > transferred, it is already done in kevent AIO, and it shows faster
> > > speeds (though I only tested sending them over the net).
> >
> > It would be absolutely horrible to program for. Try anything more
> > complex than read/write (which is the simplest case, but even that is
> > nasty).
>
> note that even for something as 'simple and straightforward' as TCP
> sockets, the 25-50 lines of evserver code i worked on today had 3
> separate bugs, is known to be fundamentally incorrect and one of the
> bugs (the lost event problem) showed up as a subtle epoll performance
> problem and it took me more than an hour to track down. And that matches
> my Tux experience as well: event based models are horribly hard to debug
> BECAUSE there is /no procedural state associated with requests/. Hence
> there is no real /proof of progress/. Not much to use for debugging -
> except huge logs of execution, which, if one is unlucky (which i often
> was with Tux) would just make the problem go away.
I'm glad you found a bug in my epoll processing code (which does not exist
in the kevent part though) - it took more than a year after kevent's
introduction to get someone to look into it.
Obviously there are bugs, it is simply how things work.
And debugging state machine code has exactly the same complexity as
debugging multi-threading code - if not less...
> Furthermore, with a 'retry' model, who guarantees that the retry wont be
> an infinite retry where none of the retries ever progresses the state of
> the system enough to return the data we are interested in? The moment we
> have to /retry/, depending on the 'depth' of how deep the retry kicked
> in, we've got to reach that 'depth' of code again and execute it.
>
> plus, 'there is not much state' is not even completely true to begin
> with, even in the most simple, TCP socket case! There /is/ quite a bit
> of state constructed on the kernel stack: user parameters have been
> evaluated/converted, the socket has been looked up, its state has been
> validated, etc. With a 'retry' model - but even with a pure 'event
> queueing' model we redo all those things /both/ at request submission
> and at event generation time, again and again - while with a synchronous
> syscall you do it just once and upon event completion a piece of that
> data is already on the kernel stack. I'd much rather spend time and
> effort on simplifying the scheduler and reducing the cache footprint of
> the kernel thread context switch path, etc., to make it more useful even
> in more extreme, highly parallel '100% context-switching' case, because i
> have first-hand experience about how fragile and inflexible event based
> servers are. I do think that event interfaces for raw, external physical
> events make sense in some circumstances, but for any more complex
> 'derived' event type it's less and less clear whether we want a direct
> interface to it. For something like the VFS it's outright horrible to
> even think about.
Ingo, you draw too much black into the picture.
No one will create a purely event-based model for socket or VFS processing
- an event is the completion of a request - data in the buffer, data was sent
and so on - it is also possible to add events into the middle of the
processing, especially if the operation can be logically separated - like
sendfile.
> Ingo
--
Evgeniy Polyakov
On Tue, Feb 27, 2007 at 07:24:19AM +0100, Ingo Molnar ([email protected]) wrote:
>
> * Evgeniy Polyakov <[email protected]> wrote:
>
> > On Mon, Feb 26, 2007 at 01:51:23PM +0100, Ingo Molnar ([email protected]) wrote:
> > >
> > > * Evgeniy Polyakov <[email protected]> wrote:
> > >
> > > > Even having main dispatcher as epoll/kevent loop, the _whole_
> > > > threadlet model is absolutely micro-thread in nature and not state
> > > > machine/event.
> > >
> > > Evgeniy, i'm not sure how many different ways to tell this to you, but
> > > you are not listening, you are not learning and you are still not
> > > getting it at all.
> > >
> > > The scheduler /IS/ a generic work/event queue. And it's pretty damn
> > > fast. No amount of badmouthing will change that basic fact. Not exactly
> > > as fast as a special-purpose queueing system (for all the reasons i
> > > outlined to you, and which you ignored), but it gets pretty damn close
> > > even for the web workload /you/ identified, and offers a user-space
> > > programming model that is about 1000 times more useful than
> > > state-machines.
> >
> > Meanwhile on the practical side:
> > via epia kevent/epoll/threadlet:
> >
> > client: ab -c500 -n5000 $url
> >
> > kevent: 849.72
> > epoll: 538.16
> > threadlet:
> > gcc ./evserver_epoll_threadlet.c -o ./evserver_epoll_threadlet
> > In file included from ./evserver_epoll_threadlet.c:30:
> > ./threadlet.h: In function ‘threadlet_exec’:
> > ./threadlet.h:46: error: can't find a register in class ‘GENERAL_REGS’
> > while reloading ‘asm’
> >
> > That particular asm optimization fails to compile.
>
> it's not really an asm optimization but an API glue. I'm using:
>
> gcc -O2 -g -Wall -o evserver_epoll_threadlet evserver_epoll_threadlet.c -fomit-frame-pointer
>
> does that work for you?
Yes, -fomit-frame-pointer does the trick.
On average, threadlet runs as fast as epoll.
Just because there is _no_ rescheduling in that case.
I added a printk into __async_schedule() and started
ab -c7000 -n20000 http://192.168.4.80/
against the slow via epia, and got only one of them.
Client is latest
../async-test-v4/evserver_epoll_threadlet
Btw, I need to admit that I have a totally broken kevent tree there - it
does not work on that machine at higher loads, so I'm investigating that
problem now.
> Ingo
--
Evgeniy Polyakov
* Evgeniy Polyakov <[email protected]> wrote:
> > does that work for you?
>
> Yes, -fomit-frame-pointer does the trick.
>
> On average, threadlet runs as fast as epoll.
yeah.
> Just because there is _no_ rescheduling in that case.
in my test it was 'little', not 'no'. But yes, that's exactly my point:
we can remove the nonblock hackeries from event loops and just
concentrate on making it schedule in less than 10-20% of the cases. Even
a relatively high 10-20% rescheduling rate is hardly measurable with
threadlets, while it gives a 10%-20% regression (and possibly bad
latencies) for the pure epoll/kevent server.
and such a mixed technique is simply not possible with ordinary
user-space threads, because there it's an all-or-nothing affair: either
you go fully to threads (at which point we are again back to a fully
threaded design, now also saddled with event loop overhead), or you try
to do user-space threads, which Just Make Little Sense (tm).
so threadlets remove the biggest headache from event loops: they dont
have to be '100% nonblocking' anymore. No O_NONBLOCK overhead, no
complex state machines - just handle the most common event type via an
outer event loop and keep the other 99% of server complexity in plain
procedural programming. 1% of state-machine code is perfectly
acceptable.
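
A rough sketch of the split described here: a small outer event loop built on epoll, with the request handler written as ordinary blocking, procedural code. The threadlet API is not used, since it exists only in the patches under discussion; in a threadlet-based server the call to handle_request() is the place where threadlet_exec() would go, so that a handler which blocks gets moved to its own thread. The function names are invented for this example.

```c
#include <sys/epoll.h>
#include <sys/types.h>
#include <unistd.h>

/* Procedural handler: a plain blocking read, no O_NONBLOCK, no state machine. */
static ssize_t handle_request(int fd, char *buf, size_t len)
{
    return read(fd, buf, len);
}

/*
 * One iteration of the outer event loop: wait up to one second for a ready
 * descriptor, then run the handler inline. Returns the handler's result,
 * or -1 on timeout or error.
 */
ssize_t event_loop_once(int epfd, char *buf, size_t len)
{
    struct epoll_event ev;
    int n = epoll_wait(epfd, &ev, 1, 1000);

    if (n <= 0)
        return -1;
    return handle_request(ev.data.fd, buf, len);
}
```

The caller is expected to have registered each descriptor with epoll_ctl(EPOLL_CTL_ADD) and stored the descriptor in ev.data.fd, which is the only piece of "state machine" left in this design.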
Ingo
My two coins:
# cat job
[global]
bs=8k
size=1g
direct=0
ioengine=sync
iodepth=32
rw=read
[file]
filename=/home/user/test
sync:
READ: io=1,024MiB, aggrb=39,329KiB/s, minb=39,329KiB/s,
maxb=39,329KiB/s, mint=27301msec, maxt=27301msec
libaio:
READ: io=1,024MiB, aggrb=39,435KiB/s, minb=39,435KiB/s,
maxb=39,435KiB/s, mint=27228msec, maxt=27228msec
syslet-rw:
READ: io=1,024MiB, aggrb=29,567KiB/s, minb=29,567KiB/s,
maxb=29,567KiB/s, mint=36315msec, maxt=36315msec
During the syslet-rw test about 9500 async schedules happened.
I use fio-git-20070226150114.tar.gz
P.S. Jens, fio_latest.tar.gz has wrong permissions, it can not be
opened.
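
The aggrb figures above can be reproduced from io and mint: 1,024 MiB is 1,073,741,824 bytes, and dividing that by the runtime in milliseconds gives the reported number (so the unit looks like bytes per millisecond). That formula is inferred from the numbers in this mail, not taken from fio's source, and the helper names below are made up for the sketch.

```c
#include <stdint.h>

/* Bytes transferred divided by runtime in milliseconds (integer division). */
uint64_t rate_bytes_per_msec(uint64_t bytes, uint64_t msec)
{
    return bytes / msec;
}

/* How much slower "other" is than "base", in whole percent (rounded down). */
uint64_t slowdown_percent(uint64_t base, uint64_t other)
{
    return (base - other) * 100 / base;
}
```

For the three runs above this gives 39329 (sync, 27301 msec), 39435 (libaio, 27228 msec) and 29567 (syslet-rw, 36315 msec), which puts syslet-rw roughly 24% below sync in this buffered-read test.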
--
Evgeniy Polyakov
On Tue, Feb 27 2007, Evgeniy Polyakov wrote:
> My two coins:
> # cat job
> [global]
> bs=8k
> size=1g
> direct=0
> ioengine=sync
> iodepth=32
> rw=read
>
> [file]
> filename=/home/user/test
>
> sync:
> READ: io=1,024MiB, aggrb=39,329KiB/s, minb=39,329KiB/s,
> maxb=39,329KiB/s, mint=27301msec, maxt=27301msec
>
> libaio:
> READ: io=1,024MiB, aggrb=39,435KiB/s, minb=39,435KiB/s,
> maxb=39,435KiB/s, mint=27228msec, maxt=27228msec
>
> syslet-rw:
> READ: io=1,024MiB, aggrb=29,567KiB/s, minb=29,567KiB/s,
> maxb=29,567KiB/s, mint=36315msec, maxt=36315msec
>
> During the syslet-rw test about 9500 async schedules happened.
> I use fio-git-20070226150114.tar.gz
That looks pretty pathetic :-). What IO scheduler did you use? syslets
will confuse CFQ currently, so you want to compare with using eg
deadline or as. That is one of the downsides of this approach.
I'll try your test as soon as this bisect series is done.
> P.S. Jens, fio_latest.tar.gz has wrong permissions, it can not be
> opened.
Oh thanks, indeed. It was disabled symlinks that broke it. Fixed now.
--
Jens Axboe
On Tue, Feb 27, 2007 at 01:28:32PM +0300, Evgeniy Polyakov wrote:
> Obviously there are bugs, it is simply how things work.
> And debugging state machine code has exactly the same complexity as
> debugging multi-threading code - if not less...
Evgeniy,
I think what you are not hearing, and what everyone else is saying
(INCLUDING Linus), is that for most programmers, state machines are
much, much harder to program, understand, and debug compared to
multi-threaded code. You may disagree (were you a MacOS 9 programmer
in another life?), and it may not even be true for you if you happen
to be one of those folks more at home with Scheme continuations, for
example. But it is true that for most kernel programmers, threaded
programming is much easier to understand, and we need to engineer the
kernel for what will be maintainable for the majority of the kernel
development community.
Regards,
- Ted
On Tue, Feb 27, 2007 at 06:52:22AM -0500, Theodore Tso ([email protected]) wrote:
> On Tue, Feb 27, 2007 at 01:28:32PM +0300, Evgeniy Polyakov wrote:
> > Obviously there are bugs, it is simply how things work.
> > And debugging state machine code has exactly the same complexity as
> > debugging multi-threading code - if not less...
>
> Evgeniy,
Hi Ted.
> I think what you are not hearing, and what everyone else is saying
> (INCLUDING Linus), is that for most programmers, state machines are
> much, much harder to program, understand, and debug compared to
> multi-threaded code. You may disagree (were you a MacOS 9 programmer
> in another life?), and it may not even be true for you if you happen
> to be one of those folks more at home with Scheme continuations, for
> example. But it is true that for most kernel programmers, threaded
> programming is much easier to understand, and we need to engineer the
> kernel for what will be maintainable for the majority of the kernel
> development community.
I understand that - and I totally agree.
But when more complex, more bug-prone code results in higher performance
- it must be used. We have linked lists and binary trees - the latter
are quite complex structures, but they allow higher performance
in search operations, so we use them.
The same applies to state machines - yes, in some cases they are hard to
program, but when things are already implemented and wrapped into a
nice (non-POSIX) aio_read(), there is absolutely no usage complexity.
Even if it is up to the programmer to program a state machine based on
generated events, those higher-layer state machines are not complex.
Let's take the simple case of (aio_)read() from a file descriptor - if the
page is in the cache, no readpage() method will be called, so we do not
need to create any kind of event - just copy the data; if there is no page
or the page is not uptodate, we allocate a bio and do not wait until the
buffers are read - we return to userspace and start another read; when the
bio is completed and its end_io callback is called, we mark the pages as
uptodate, copy the data to userspace, and mark the event bound to the above
(aio_)read() as completed.
(that is how kevent aio works, btw).
Userspace programmer just calls
cookie = aio_read();
aio_wait(cookie);
or something like that.
It is simple, it is straightforward, especially if the data read must then
be used somewhere else - in that case the processing thread will need to
cooperate with the main one, which is simple in the event model, since there
is a place where events of _all_ types are gathered.
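
The submit-then-wait pattern above can be tried today with the POSIX AIO calls from <aio.h>. To be clear, this is a stand-in: the mail refers to a non-POSIX kevent-based aio_read(), and glibc implements POSIX AIO with helper threads, not with the bio/end_io path described above. The wrapper name read_async is invented for this sketch, and older glibc versions need -lrt at link time.

```c
#include <aio.h>
#include <errno.h>
#include <string.h>
#include <sys/types.h>

/* Submit one read, wait for its completion, return bytes read or -1. */
ssize_t read_async(int fd, void *buf, size_t len, off_t off)
{
    struct aiocb cb;
    const struct aiocb *list[1] = { &cb };

    memset(&cb, 0, sizeof(cb));
    cb.aio_fildes = fd;
    cb.aio_buf = buf;
    cb.aio_nbytes = len;
    cb.aio_offset = off;

    if (aio_read(&cb) != 0)                  /* the "cookie = aio_read()" step */
        return -1;

    while (aio_error(&cb) == EINPROGRESS)    /* the "aio_wait(cookie)" step */
        aio_suspend(list, 1, NULL);

    return aio_return(&cb);                  /* final status of the request */
}
```

In a real program the caller would do other work between aio_read() and the wait; collapsing both into one function only shows how little code the submitting side needs.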
> Regards,
>
> - Ted
--
Evgeniy Polyakov
* Evgeniy Polyakov <[email protected]> wrote:
> > [...] But it is true that for most kernel programmers, threaded
> > programming is much easier to understand, and we need to engineer
> > the kernel for what will be maintainable for the majority of the
> > kernel development community.
>
> I understand that - and I totally agree.
why did you then write, just one mail ago, the exact opposite:
> And debugging state machine code has exactly the same complexity as
> debugging multi-threading code - if not less...
the kernel /IS/ multi-threaded code.
which statement of yours is also patently, absurdly untrue.
Multithreaded code is harder to debug than process based code, but it is
still a breeze compared to complex state-machines...
Ingo
On Tue, Feb 27, 2007 at 12:29:08PM +0100, Jens Axboe ([email protected]) wrote:
> On Tue, Feb 27 2007, Evgeniy Polyakov wrote:
> > My two coins:
> > # cat job
> > [global]
> > bs=8k
> > size=1g
> > direct=0
> > ioengine=sync
> > iodepth=32
> > rw=read
> >
> > [file]
> > filename=/home/user/test
> >
> > sync:
> > READ: io=1,024MiB, aggrb=39,329KiB/s, minb=39,329KiB/s,
> > maxb=39,329KiB/s, mint=27301msec, maxt=27301msec
> >
> > libaio:
> > READ: io=1,024MiB, aggrb=39,435KiB/s, minb=39,435KiB/s,
> > maxb=39,435KiB/s, mint=27228msec, maxt=27228msec
> >
> > syslet-rw:
> > READ: io=1,024MiB, aggrb=29,567KiB/s, minb=29,567KiB/s,
> > maxb=29,567KiB/s, mint=36315msec, maxt=36315msec
> >
> > > During the syslet-rw test about 9500 async schedules happened.
> > I use fio-git-20070226150114.tar.gz
>
> That looks pretty pathetic :-). What IO scheduler did you use? syslets
> will confuse CFQ currently, so you want to compare with using eg
> deadline or as. That is one of the downsides of this approach.
Deadline shows this:
sync:
READ: io=1,024MiB, aggrb=38,212KiB/s, minb=38,212KiB/s,
maxb=38,212KiB/s, mint=28099msec, maxt=28099msec
libaio:
READ: io=1,024MiB, aggrb=37,933KiB/s, minb=37,933KiB/s,
maxb=37,933KiB/s, mint=28306msec, maxt=28306msec
syslet-rw:
READ: io=1,024MiB, aggrb=34,759KiB/s, minb=34,759KiB/s,
maxb=34,759KiB/s, mint=30891msec, maxt=30891msec
There were about 10k async schedulings.
--
Evgeniy Polyakov
On Tue, Feb 27, 2007 at 10:42:11AM +0100, Jens Axboe wrote:
> On Tue, Feb 27 2007, Suparna Bhattacharya wrote:
> > On Mon, Feb 26, 2007 at 03:45:48PM +0100, Jens Axboe wrote:
> > > On Mon, Feb 26 2007, Suparna Bhattacharya wrote:
> > > > On Mon, Feb 26, 2007 at 02:57:36PM +0100, Jens Axboe wrote:
> > > > >
> > > > > Some more results, using a larger number of processes and io depths. A
> > > > > repeat of the tests from friday, with added depth 20000 for syslet and
> > > > > libaio:
> > > > >
> > > > > > Engine      Depth  Processes  Bw (MiB/sec)
> > > > > > ----------------------------------------------------
> > > > > > libaio          1          1           602
> > > > > > syslet          1          1           759
> > > > > > sync            1          1           776
> > > > > > libaio         32          1           832
> > > > > > syslet         32          1           898
> > > > > > libaio      20000          1           581
> > > > > > syslet      20000          1           609
> > > > >
> > > > > syslet still on top. Measuring O_DIRECT reads (of 4kb size) on ramfs
> > > > > with 100 processes each with a depth of 200, reading a per-process
> > > > > private file of 10mb (need to fit in my ram...) 10 times each. IOW,
> > > > > doing 10,000MiB of IO in total:
> > > >
> > > > But, why ramfs ? Don't we want to exercise the case where O_DIRECT actually
> > > > blocks ? Or am I missing something here ?
> > >
> > > Just overhead numbers for that test case, lets try something like your
> > > described job.
> > >
> > > Test case is doing random reads from /dev/sdb, in chunks of 64kb:
> > >
> > > Engine      Depth  Processes  Bw (KiB/sec)
> > > ----------------------------------------------------
> > > libaio        200        100          2813
> > > syslet        200        100          3944
> > > libaio      20000          1          2793
> > > syslet      20000          1          3854
> > > sync (*)    20000          1          2866
> > >
> > > deadline was used for IO scheduling, to minimize impact. Not sure why
> > > syslet actually does so much better here, looking at vmstat the rate is
> > > steady and all runs are basically 50/50 idle/wait. One difference is
> > > that the submission itself takes a long time on libaio, since the
> > > io_submit() will block on request allocation. The generated IO pattern
> > > from each process is the same for all runs. The drive is a lousy sata
> > > that doesn't even do queuing, FWIW.
> >
> >
> > I tried the latest fio code with syslet v4, and my results are a little
> > different - have yet to figure out why or what to make of it.
> > I hope I have all the right pieces now.
> >
> > This is an ext2 filesystem, SCSI AIC7xxx.
> >
> > I used an iodepth_batch size of 8 to limit the number of ios in a single
> > io_submit (thanks for adding that parameter to fio !), like we did in
> > aio-stress.
> >
> > Engine      Depth    Batch  Bw (KiB/sec)
> > ----------------------------------------------------
> > libaio         64        8        17,226
> > syslet         64        8        17,620
> > libaio      20000        8        18,552
> > syslet      20000        8        14,935
> >
> >
> > Which is not bad, actually.
>
> It's not bad for such a high depth/batch setting, but I still wonder why
> our results are so different. I'll look around for an x86 box with some
> TCQ/NCQ enabled storage attached for testing. Can you pass me your
> command line or job file (whatever you use) so we are on the same page?
Sure - I used variations of the following job file (e.g. engine=syslet-rw,
iodepth=20000).
Also the io scheduler on my system is set to Anticipatory by default.
FWIW it is a 4 way SMP (PIII, 700MHz)
; aio-stress -l -O -o3 <1GB file>
[global]
ioengine=libaio
buffered=0
rw=randread
bs=64k
size=1024m
directory=/kdump/suparna
[testfile2]
iodepth=64
iodepth_batch=8
>
> > If I do not specify the iodepth_batch (i.e. default to depth), then the
> > difference becomes more pronounced at higher depths. However, I doubt
> > whether anyone would be using such high batch sizes in practice ...
> >
> > Engine      Depth    Batch  Bw (KiB/sec)
> > ----------------------------------------------------
> > libaio         64  default        17,429
> > syslet         64  default        16,155
> > libaio      20000  default        15,494
> > syslet      20000  default         7,971
> >
> If iodepth_batch isn't set, the syslet queued io will be serialized and
I see, so then this particular setting is not very meaningful
> not take advantage of queueing. How does the job file perform with
> ioengine=sync?
Just tried it now : 9,027KiB/s
>
> > Often times it is the application tuning that makes all the difference,
> > so am not really sure how much to read into these results.
> > That's always been the hard part of async io ...
>
> Yes I agree, it's handy to get an overview though.
True, at least some of this helps us gain a little more understanding
about the boundaries and how to tune it to be most effective.
Regards
Suparna
>
> --
> Jens Axboe
--
Suparna Bhattacharya ([email protected])
Linux Technology Center
IBM Software Lab, India
* Theodore Tso <[email protected]> wrote:
> I think what you are not hearing, and what everyone else is saying
> (INCLUDING Linus), is that for most programmers, state machines are
> much, much harder to program, understand, and debug compared to
> multi-threaded code. [...]
btw., another crucial thing that i think Evgeniy is missing is that
threadlets /enable/ event loops to be used in practice! Right now the
epoll/kevent programming model requires a total 100% avoidance of all
context-switching in the 'main' event handler context while handling a
request. If just 1% of all requests happen to block it might cause a
/complete/ breakdown of an event loop's performance - it can easily
cause a 10x drop in performance or worse!
So context-switching has to be avoided in 100% of the code that runs
while handling requests, file descriptors have to be set to nonblocking
(causing extra system calls), and all the syscalls that might return
incomplete with either -EAGAIN or with a short read/write have to be
converted into a state machine. (or in the alternative, user-space
threading has to be used, which opens up another hornet's nest)
/That/ is the main inhibiting factor of the measured use of event loops
within Linux! It has zero integration capabilities with 'usual' coding
techniques - driving the costs of its application up in the sky, and
pushing event based servers into niches.
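To make the cost of that conversion concrete, here is a minimal sketch of
what a single read() turns into once the descriptor is nonblocking. The
struct read_state, set_nonblocking() and read_step() names are hypothetical;
only read(), fcntl() and the errno values are real interfaces. The sketch
assumes the "no data yet" case is reported as EAGAIN/EWOULDBLOCK.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stddef.h>
#include <unistd.h>

/* Hypothetical per-request state: how much of the buffer is filled so far. */
struct read_state {
    char *buf;
    size_t want;   /* total bytes the request needs */
    size_t have;   /* bytes received so far */
};

enum read_result { READ_DONE, READ_AGAIN, READ_EOF, READ_ERROR };

/* The extra system calls mentioned above: switch an fd to nonblocking. */
int set_nonblocking(int fd)
{
    int flags = fcntl(fd, F_GETFL, 0);

    if (flags < 0)
        return -1;
    return fcntl(fd, F_SETFL, flags | O_NONBLOCK);
}

/*
 * Advance one request on a nonblocking fd without ever sleeping.
 * READ_AGAIN means: go back to the event loop and call again once the
 * fd is reported readable. All progress lives in *st, not on the stack.
 */
enum read_result read_step(int fd, struct read_state *st)
{
    assert(st && st->have <= st->want);
    while (st->have < st->want) {
        ssize_t n = read(fd, st->buf + st->have, st->want - st->have);

        if (n > 0) {
            st->have += (size_t)n;   /* possibly a short read: try again */
            continue;
        }
        if (n == 0)
            return READ_EOF;
        if (errno == EINTR)
            continue;
        if (errno == EAGAIN || errno == EWOULDBLOCK)
            return READ_AGAIN;
        return READ_ERROR;
    }
    return READ_DONE;
}
```

Every blocking call in the request path needs this kind of treatment, which
is the integration cost the paragraph above is pointing at.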
With threadlets the picture changes dramatically: all we have to
concentrate on to get the performance of "100% event based servers" is
to handle 'most' rescheduling events in the event loop. A 10-20% context
switching ratio does not hurt at all. (it causes ~1% of throughput
loss.)
Furthermore, even if a particular configuration or module of the server
(say Apache) happens to trigger a high rate of scheduling, the
performance breakdown model of threadlets is /vastly/ superior to event
based servers. The measurements so far have shown that the absolute
worst-case threading server performance is at around 60% of that of
non-context-switching servers - and even that level is reached
gradually, leaving time for action for the server owner. While with
fully event based servers there are mostly only two modes of
performance: 100% performance and near-0% performance: total breakdown.
Ingo
On Tue, Feb 27, 2007 at 01:13:28PM +0100, Ingo Molnar ([email protected]) wrote:
>
> * Evgeniy Polyakov <[email protected]> wrote:
>
> > > [...] But it is true that for most kernel programmers, threaded
> > > programming is much easier to understand, and we need to engineer
> > > the kernel for what will be maintainable for the majority of the
> > > kernel development community.
> >
> > I understand that - and I totally agree.
>
> why did you then write, just one mail ago, the exact opposite:
>
> > And debugging state machine code has exactly the same complexity as
> > debugging multi-threading code - if not less...
Because thread machinery is much more complex than event one - just
compare amount of code in scheduler and the whole kevent -
kernel/sched.c has about the same size as the whole kevent :)
> the kernel /IS/ multi-threaded code.
>
> which statement of yours is also patently, absurdly untrue.
> Multithreaded code is harder to debug than process based code, but it is
> still a breeze compared to complex state-machines...
It seems that we are talking about different levels.
Model I propose to use in userspace - very simple events mostly about
completion of the request - they are simple to use and simple to debug.
It can be slightly harder to debug than the simplest threading model
(one thread - one logical entity, which would never work with others)
though.
From userspace point of view it is about the same complexity to check
why event is not marked as ready, or why some thread never got
scheduled...
And we do not take into account possible synchronization methods needed
to run multithreaded code without problems.
> Ingo
--
Evgeniy Polyakov
On Tue, Feb 27, 2007 at 01:34:21PM +0100, Ingo Molnar ([email protected]) wrote:
> based servers. The measurements so far have shown that the absolute
> worst-case threading server performance is at around 60% of that of
> non-context-switching servers - and even that level is reached
> gradually, leaving time for action for the server owner. While with
> fully event based servers there are mostly only two modes of
> performance: 100% performance and near-0% performance: total breakdown.
Let's live in peace! :)
I always agreed that they should be used together - event-based rings
of IO requests, if request happens to block (which should be avoided
as much as possible), then it continues on behalf of sleeping thread.
> Ingo
--
Evgeniy Polyakov
Ingo Molnar wrote:
> * Theodore Tso <[email protected]> wrote:
>
>
>> I think what you are not hearing, and what everyone else is saying
>> (INCLUDING Linus), is that for most programmers, state machines are
>> much, much harder to program, understand, and debug compared to
>> multi-threaded code. [...]
>>
>
> btw., another crucial thing that i think Evgeniy is missing is that
> threadlets /enable/ event loops to be used in practice! Right now the
> epoll/kevent programming model requires a total 100% avoidance of all
> context-switching in the 'main' event handler context while handling a
> request. If just 1% of all requests happen to block it might cause a
> /complete/ breakdown of an event loop's performance - it can easily
> cause a 10x drop in performance or worse!
>
> So context-switching has to be avoided in 100% of the code that runs
> while handling requests, file descriptors have to be set to nonblocking
> (causing extra system calls), and all the syscalls that might return
> incomplete with either -EAGAIN or with a short read/write have to be
> converted into a state machine. (or in the alternative, user-space
> threading has to be used, which opens up another hornet's nest)
>
> /That/ is the main inhibiting factor of the measured use of event loops
> within Linux! It has zero integration capabilities with 'usual' coding
> techniques - driving the costs of its application up in the sky, and
> pushing event based servers into niches.
>
>
Having written such a niche event based server, I can 100% confirm what
Ingo is saying here. We had a single process drive I/O to the kernel
through an event model (based on kernel aio extended with IO_CMD_POLL),
and user level threads managed by a custom scheduler that managed I/O,
timeouts, and thread scheduling.
We once considered dropping from a user-level thread model to a state
machine model, but the effort was astronomical and we wouldn't see the
rewards until it was all done, so naturally we didn't do it.
> With threadlets the picture changes dramatically: all we have to
> concentrate on to get the performance of "100% event based servers" is
> to handle 'most' rescheduling events in the event loop. A 10-20% context
> switching ratio does not hurt at all. (it causes ~1% of throughput
> loss.)
>
> Furthermore, even if a particular configuration or module of the server
> (say Apache) happens to trigger a high rate of scheduling, the
> performance breakdown model of threadlets is /vastly/ superior to event
> based servers. The measurements so far have shown that the absolute
> worst-case threading server performance is at around 60% of that of
> non-context-switching servers - and even that level is reached
> gradually, leaving time for action for the server owner. While with
> fully event based servers there are mostly only two modes of
> performance: 100% performance and near-0% performance: total breakdown.
>
Yes. Threadlets as the default aio solution (easy to use, acceptable
performance even in worst cases), with specialized solutions where
applicable (epoll for networking, aio for O_DIRECT disk) look like a
good mix of performance and sanity.
--
error compiling committee.c: too many arguments to function
Suparna Bhattacharya wrote:
> I tried the latest fio code with syslet v4, and my results are a little
> different - have yet to figure out why or what to make of it.
> I hope I have all the right pieces now.
>
> This is an ext2 filesystem, SCSI AIC7xxx.
>
> I used an iodepth_batch size of 8 to limit the number of ios in a single
> io_submit (thanks for adding that parameter to fio !), like we did in
> aio-stress.
>
> Engine      Depth    Batch  Bw (KiB/sec)
> ----------------------------------------------------
> libaio         64        8        17,226
> syslet         64        8        17,620
> libaio      20000        8        18,552
> syslet      20000        8        14,935
>
>
> Which is not bad, actually.
>
> If I do not specify the iodepth_batch (i.e. default to depth), then the
> difference becomes more pronounced at higher depths. However, I doubt
> whether anyone would be using such high batch sizes in practice ...
>
> Engine      Depth    Batch  Bw (KiB/sec)
> ----------------------------------------------------
> libaio         64  default        17,429
> syslet         64  default        16,155
> libaio      20000  default        15,494
> syslet      20000  default         7,971
>
But what about cpu usage? At these low levels, the cpu is probably
underutilized. It would be interesting to measure cpu time per I/O
request (or, alternatively, use an I/O subsystem that can saturate the
processors).
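One way to get the "cpu time per I/O request" figure, sketched under the
assumption that the benchmark runs in a single process and that getrusage()
snapshots are taken right before and right after the run. The helper names
are made up for this example; only getrusage() and struct rusage are
standard.

```c
#include <assert.h>
#include <sys/resource.h>
#include <sys/time.h>

/* Convert a struct timeval to a total number of microseconds. */
long long tv_usec_total(struct timeval tv)
{
    return (long long)tv.tv_sec * 1000000LL + tv.tv_usec;
}

/*
 * CPU microseconds (user + system) consumed per completed I/O request
 * between two getrusage() snapshots.
 */
double cpu_usec_per_request(const struct rusage *before,
                            const struct rusage *after,
                            unsigned long nr_requests)
{
    long long used;

    assert(before && after && nr_requests > 0);
    used = (tv_usec_total(after->ru_utime) - tv_usec_total(before->ru_utime)) +
           (tv_usec_total(after->ru_stime) - tv_usec_total(before->ru_stime));
    return (double)used / (double)nr_requests;
}
```

Typical use would be getrusage(RUSAGE_SELF, &before), run the job,
getrusage(RUSAGE_SELF, &after), then divide by the number of requests the
job reports. Work done in other kernel threads on behalf of the process is
not covered by this number.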
--
error compiling committee.c: too many arguments to function
On Mon, February 26, 2007 15:45, Jens Axboe wrote:
> Test case is doing random reads from /dev/sdb, in chunks of 64kb:
>
> Engine      Depth  Processes  Bw (KiB/sec)
> ----------------------------------------------------
> libaio        200        100          2813
> syslet        200        100          3944
> libaio      20000          1          2793
> syslet      20000          1          3854
> sync (*)    20000          1          2866
If you have time, could you please add Threadlet results?
Considering how awful the syslet API is, and how nice the threadlet
one, it would be great if it turned out that Syslets aren't worth it.
If they are, a better API/ABI should be thought of.
Greetings,
Indan
Theodore Tso wrote:
> On Tue, Feb 27, 2007 at 01:28:32PM +0300, Evgeniy Polyakov wrote:
> > Obviously there are bugs, it is simply how things work.
> > And debugging state machine code has exactly the same complexity as
> > debugging multi-threading code - if not less...
>
> Evgeniy,
>
> I think what you are not hearing, and what everyone else is saying
> (INCLUDING Linus),
Excluding possibly many others.
> is that for most programmers, state machines are
> much, much harder to program, understand, and debug compared to
> multi-threaded code.
That's why you introduce an infrastructure that hides all the nitty-gritty
plumbing, and makes it easy to use.
> You may disagree (were you a MacOS 9 programmer
> in another life?), and it may not even be true for you if you happen
> to be one of those folks more at home with Scheme continuations, for
> example.
Personal attacks are really rather unhelpful/unscientific.
> But it is true that for most kernel programmers, threaded
> programming is much easier to understand, and we need to engineer the
> kernel for what will be maintainable for the majority of the kernel
> development community.
What's probably true is that, for a kernel to stay competitive you need two
distinct traits:
1. Stability
2. Performance
And you can't get that, by arguing that the kernel development community
doesn't have the brains to code for performance, which I dearly doubt.
So, instead of using intimidating language to force one's opinion thru,
especially when it comes from those in control, why not have a democratic
vote?
Thanks!
--
Al
Evgeniy Polyakov wrote:
> Ingo Molnar ([email protected]) wrote:
> > based servers. The measurements so far have shown that the absolute
> > worst-case threading server performance is at around 60% of that of
> > non-context-switching servers - and even that level is reached
> > gradually, leaving time for action for the server owner. While with
> > fully event based servers there are mostly only two modes of
> > performance: 100% performance and near-0% performance: total breakdown.
>
> Let's live in peace! :)
> I always agreed that they should be used together - event-based rings
> of IO requests, if request happens to block (which should be avoided
> as much as possible), then it continues on behalf of sleeping thread.
Agreed 100%. Notify always when you can, thread only when you must.
Thanks!
--
Al
* Avi Kivity <[email protected]> wrote:
> But what about cpu usage? At these low levels, the cpu is probably
> underutilized. It would be interesting to measure cpu time per I/O
> request (or, alternatively, use an I/O subsystem that can saturate the
> processors).
yeah - that's what testing on ramdisk (Jens') or on a loopback block
device (mine) approximates to a certain degree.
Ingo
On Tue, 27 Feb 2007, Evgeniy Polyakov wrote:
> On Mon, Feb 26, 2007 at 06:18:51PM -0800, Davide Libenzi ([email protected]) wrote:
> > On Mon, 26 Feb 2007, Evgeniy Polyakov wrote:
> >
> > > 2. its notifications do not go through the second loop, i.e. it is O(1),
> > > not O(ready_num), and notifications happens directly from internals of
> > > the appropriate subsystem, which does not require special wakeup
> > > (although it can be done too).
> >
> > Sorry if I do not read kevent code correctly, but in kevent_user_wait()
> > there is a:
> >
> > while (num < max_nr && ((k = kevent_dequeue_ready(u)) != NULL)) {
> > ...
> > }
> >
> > loop, that make it O(ready_num). From a mathematical standpoint, they're
> > both O(ready_num), but epoll is doing three passes over the ready set.
> > I always thought that if the number of ready events is so big that the extra
> > passes over the ready set become relevant, probably the "work" done by
> > userspace for each fetched event would make the extra cost irrelevant.
> > But that can be fixed by a patch that will follow on lkml ...
>
> No, kevent_dequeue_ready() copies data to userspace, that is it.
> So it looks roughly following:
In all the books where I studied, the algorithms below would be classified
as O(num_ready) ones:
[sys_kevent_wait]
+ for (i=0; i<num; ++i) {
+ k = kevent_dequeue_ready_ring(u);
+ if (!k)
+ break;
+ kevent_complete_ready(k);
+
+ if (k->event.ret_flags & KEVENT_RET_COPY_FAILED)
+ break;
+ kevent_stat_ring(u);
+ copied++;
+ }
[kevent_user_wait]
+ while (num < max_nr && ((k = kevent_dequeue_ready(u)) != NULL)) {
+ if (copy_to_user(buf + num*sizeof(struct ukevent),
+ &k->event, sizeof(struct ukevent))) {
+ if (num == 0)
+ num = -EFAULT;
+ break;
+ }
+ kevent_complete_ready(k);
+ ++num;
+ kevent_stat_wait(u);
+ }
It does not matter if inside the loop you invert a 20Kx20K matrix or you
copy a byte, they both are O(num_ready).
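The point can be shown with a user-space stand-in for the second loop. The
toy_* types below are hypothetical simplifications (a plain linked list
instead of the kevent ready queue, an array store instead of
copy_to_user()); the loop condition is kept the same as in the quoted code.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for the kernel structures in the quoted loops. */
struct toy_kevent {
    int data;
    struct toy_kevent *next;
};

struct toy_queue {
    struct toy_kevent *ready_head;   /* singly linked list of ready events */
};

/* Pop one ready event, or return NULL if none is left: O(1) per call. */
struct toy_kevent *toy_dequeue_ready(struct toy_queue *q)
{
    struct toy_kevent *k = q->ready_head;

    if (k)
        q->ready_head = k->next;
    return k;
}

/*
 * Same shape as the kevent_user_wait() loop: one iteration per delivered
 * event, so the total work is O(min(ready, max_nr)) no matter how cheap
 * the loop body is. Returns the number of events copied out.
 */
size_t toy_wait(struct toy_queue *q, int *out, size_t max_nr)
{
    struct toy_kevent *k;
    size_t num = 0;

    assert(q && out);
    while (num < max_nr && (k = toy_dequeue_ready(q)) != NULL)
        out[num++] = k->data;   /* stands in for copy_to_user() */
    return num;
}
```

Each dequeue is constant time, but the caller still pays for one iteration
per event it receives.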
- Davide
Ingo Molnar wrote:
> * Avi Kivity <[email protected]> wrote:
>
>
>> But what about cpu usage? At these low levels, the cpu is probably
>> underutilized. It would be interesting to measure cpu time per I/O
>> request (or, alternatively, use an I/O subsystem that can saturate the
>> processors).
>>
>
> yeah - that's what testing on ramdisk (Jens') or on a loopback block
> device (mine) approximates to a certain degree.
>
>
Ramdisks or fully cached loopback return immediately, so cache thrashing
effects don't show up.
Maybe a device mapper delay target or nbd + O_DIRECT can insert delays
to make the workload more disk-like.
--
error compiling committee.c: too many arguments to function
* Avi Kivity <[email protected]> wrote:
> > yeah - that's what testing on ramdisk (Jens') or on a loopback block
> > device (mine) approximates to a certain degree.
>
> Ramdisks or fully cached loopback return immediately, so cache
> thrashing effects don't show up.
even fully cached loopback schedules the loopback kernel thread - but i
agree that it's inaccurate: hence the 'approximates to a certain
degree'.
> Maybe a device mapper delay target or nbd + O_DIRECT can insert delays
> to make the workload more disk-like.
yeah. I'll hack a small timeout into loopback requests i think. But then
real disk-platter effects are left out ... so it all comes down to
eventually having to try it on real disks too :)
Ingo
On Tue, Feb 27, 2007 at 08:01:05AM -0800, Davide Libenzi ([email protected]) wrote:
> It does not matter if inside the loop you invert a 20Kx20K matrix or you
> copy a byte, they both are O(num_ready).
I probably selected the wrong words to describe it; here in detail is how
kevent differs from epoll.
The polling case needs to perform an additional check before an event can be
copied to userspace, and that check must be done for each event being copied.
Kevent does not need that (it needs it for poll emulation) - if an event is
ready, then it is ready.
sys_poll() creates a wait queue where different events (callbacks for
them) are stored; when a driver calls wake_up(), the appropriate event is
added to the ready list and wake_up() is called for that wait queue, which
in turn calls ->poll for each event and transfers it to userspace if it is
ready.
Kevent works slightly differently - it does not perform an additional check
for readiness (although it can, and it does in poll notifications): if an
event is marked as ready, the thread parked in the waiting syscall is
awakened and the event is copied to userspace.
Also the waiting syscall is awakened through one queue - the event is added
and wake_up() is called, while in epoll() there are two queues.
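A rough user-space model of the difference described above, with made-up
names (toy_source, deliver_with_recheck, deliver_direct) and no real wait
queues: the first function re-checks each event through a ->poll style
callback before reporting it, the second trusts the ready mark.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical event source; toy_poll() models a driver's ->poll method. */
struct toy_source {
    bool readable;
    unsigned poll_calls;   /* how often the readiness check ran */
};

bool toy_poll(struct toy_source *s)
{
    s->poll_calls++;
    return s->readable;
}

/*
 * Poll-style delivery: every source on the ready list is re-checked
 * through ->poll before it is reported to userspace.
 */
size_t deliver_with_recheck(struct toy_source **ready, size_t n,
                            struct toy_source **out)
{
    size_t i, delivered = 0;

    assert(ready && out);
    for (i = 0; i < n; i++)
        if (toy_poll(ready[i]))
            out[delivered++] = ready[i];
    return delivered;
}

/*
 * Kevent-style delivery as described above: being on the ready list is
 * taken as proof of readiness, so the event is reported without a check.
 */
size_t deliver_direct(struct toy_source **ready, size_t n,
                      struct toy_source **out)
{
    size_t i;

    assert(ready && out);
    for (i = 0; i < n; i++)
        out[i] = ready[i];
    return n;
}
```

Both functions still walk the ready set once, so this does not change the
O(num_ready) bound; what differs is the per-event readiness call.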
> - Davide
>
--
Evgeniy Polyakov
Ingo Molnar wrote:
>
>> Maybe a device mapper delay target or nbd + O_DIRECT can insert delays
>> to make the workload more disk-like.
>>
>
> yeah. I'll hack a small timeout into loopback requests i think. But then
> real disk-platter effects are left out ... so it all comes down to
> eventually having to try it on real disks too :)
>
Having a random component in the timeout may increase realism.
hundred-disk boxes are noisy, though the blinkenlights are nice :)
--
error compiling committee.c: too many arguments to function
On Tuesday 27 February 2007 17:21, Evgeniy Polyakov wrote:
> I probably selected the wrong words to describe it; here in detail is how
> kevent differs from epoll.
>
> The polling case needs to perform an additional check before an event can be
> copied to userspace, and that check must be done for each event being copied.
> Kevent does not need that (it needs it for poll emulation) - if an event is
> ready, then it is ready.
>
> sys_poll() creates a wait queue where different events (callbacks for
> them) are stored; when a driver calls wake_up(), the appropriate event is
> added to the ready list and wake_up() is called for that wait queue, which
> in turn calls ->poll for each event and transfers it to userspace if it is
> ready.
>
> Kevent works slightly differently - it does not perform an additional check
> for readiness (although it can, and it does in poll notifications): if an
> event is marked as ready, the thread parked in the waiting syscall is
> awakened and the event is copied to userspace.
> Also the waiting syscall is awakened through one queue - the event is added
> and wake_up() is called, while in epoll() there are two queues.
Thank you Evgeniy for this comparison. poll()/select()/epoll() are tricky
indeed.
I believe one advantage of epoll is it uses standard mechanism (mandated for
poll()/select() ), while kevent adds some glue and kevent_storage in some
structures (struct inode, struct file, ...), thus adding some extra code and
extra storage in hot paths. Yes there might be a gain IF most users of these
path want kevent. But other users pay the price (larger kernel code and
data), that you cannot easily bench.
Using or not epoll has nearly zero cost over standard kernel (only struct file
has some extra storage)
On Tue, Feb 27, 2007 at 05:58:14PM +0100, Eric Dumazet ([email protected]) wrote:
> I believe one advantage of epoll is it uses standard mechanism (mandated for
> poll()/select() ), while kevent adds some glue and kevent_storage in some
> structures (struct inode, struct file, ...), thus adding some extra code and
> extra storage in hot paths. Yes there might be a gain IF most users of these
> path want kevent. But other users pay the price (larger kernel code and
> data), that you cannot easily bench.
>
> Using or not epoll has nearly zero cost over standard kernel (only struct file
> has some extra storage)
Well, that's a price - any object which wants to support events needs
storage for them - kevent_storage is a list_head plus spinlock and
pointer to itself (with all current users that pointer can be removed and
access transferred to container_of()) - it is exactly as in epoll storage -
both were created with the smallest possible overhead in mind.
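The container_of() remark can be illustrated in plain C. This is a sketch
with invented toy_* structures, not the real kevent_storage layout (the
spinlock is left out): because the storage is embedded in its owner, the
owner's address follows from the storage's address and the member offset, so
no back-pointer has to be stored.

```c
#include <assert.h>
#include <stddef.h>

/* User-space equivalent of the kernel's container_of() macro. */
#define toy_container_of(ptr, type, member) \
    ((type *)((char *)(ptr) - offsetof(type, member)))

/* Simplified doubly linked list head, as in the kernel's list_head. */
struct toy_list_head {
    struct toy_list_head *next, *prev;
};

/* Hypothetical storage: only the list, no pointer back to the owner. */
struct toy_storage {
    struct toy_list_head list;   /* events attached to this origin */
};

/* Hypothetical owner embedding the storage, as an inode or file would. */
struct toy_owner {
    int id;
    struct toy_storage st;
};

/* Recover the owning object from a pointer to its embedded storage. */
struct toy_owner *owner_of(struct toy_storage *st)
{
    assert(st);
    return toy_container_of(st, struct toy_owner, st);
}
```

The saving is one pointer per embedding structure, paid for with a
compile-time offset subtraction on each lookup.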
--
Evgeniy Polyakov
On Tue, Feb 27 2007, Evgeniy Polyakov wrote:
> On Tue, Feb 27, 2007 at 12:29:08PM +0100, Jens Axboe ([email protected]) wrote:
> > On Tue, Feb 27 2007, Evgeniy Polyakov wrote:
> > > My two coins:
> > > # cat job
> > > [global]
> > > bs=8k
> > > size=1g
> > > direct=0
> > > ioengine=sync
> > > iodepth=32
> > > rw=read
> > >
> > > [file]
> > > filename=/home/user/test
> > >
> > > sync:
> > > READ: io=1,024MiB, aggrb=39,329KiB/s, minb=39,329KiB/s,
> > > maxb=39,329KiB/s, mint=27301msec, maxt=27301msec
> > >
> > > libaio:
> > > READ: io=1,024MiB, aggrb=39,435KiB/s, minb=39,435KiB/s,
> > > maxb=39,435KiB/s, mint=27228msec, maxt=27228msec
> > >
> > > syslet-rw:
> > > READ: io=1,024MiB, aggrb=29,567KiB/s, minb=29,567KiB/s,
> > > maxb=29,567KiB/s, mint=36315msec, maxt=36315msec
> > >
> > > > During the syslet-rw test about 9500 async schedules happened.
> > > I use fio-git-20070226150114.tar.gz
> >
> > That looks pretty pathetic :-). What IO scheduler did you use? syslets
> > will confuse CFQ currently, so you want to compare with using eg
> > deadline or as. That is one of the downsides of this approach.
>
> Deadline shows this:
>
> sync:
> READ: io=1,024MiB, aggrb=38,212KiB/s, minb=38,212KiB/s,
> maxb=38,212KiB/s, mint=28099msec, maxt=28099msec
>
> libaio:
> READ: io=1,024MiB, aggrb=37,933KiB/s, minb=37,933KiB/s,
> maxb=37,933KiB/s, mint=28306msec, maxt=28306msec
>
> syslet-rw:
> READ: io=1,024MiB, aggrb=34,759KiB/s, minb=34,759KiB/s,
> maxb=34,759KiB/s, mint=30891msec, maxt=30891msec
>
> There were about 10k async schedulings.
I think the issue here is pretty simple - when fio gets a queue full
like condition (it reaches the depth you set, 32), it commits them and
starts queuing again. Since that'll likely block, it'll get issued by
another process. So you suddenly have a nice sequence of reads from one
process (pending, only one is actually committed since it's serialized),
and then a read further down the line that goes behind those you already
committed. The result is then seeky, where it should have been sequential.
Do you get expected results if you set iodepth_low=1? That'll make fio
drain the queue before building it up again, should get you a sequential
access pattern with syslets.
--
Jens Axboe
On Tue, Feb 27 2007, Avi Kivity wrote:
> Ingo Molnar wrote:
> >* Avi Kivity <[email protected]> wrote:
> >
> >
> >>But what about cpu usage? At these low levels, the cpu is probably
> >>underutilized. It would be interesting to measure cpu time per I/O
> >>request (or, alternatively, use an I/O subsystem that can saturate the
> >>processors).
> >>
> >
> >yeah - that's what testing on ramdisk (Jens') or on a loopback block
> >device (mine) approximates to a certain degree.
> >
> >
>
> Ramdisks or fully cached loopback return immediately, so cache thrashing
> effects don't show up.
>
> Maybe a device mapper delay target or nbd + O_DIRECT can insert delays
> to make the workload more disk-like.
Take a look at scsi-debug, it can do at least some of that.
--
Jens Axboe
On Tue, Feb 27, 2007 at 07:45:41PM +0100, Jens Axboe ([email protected]) wrote:
> > Deadline shows this:
> >
> > sync:
> > READ: io=1,024MiB, aggrb=38,212KiB/s, minb=38,212KiB/s,
> > maxb=38,212KiB/s, mint=28099msec, maxt=28099msec
> >
> > libaio:
> > READ: io=1,024MiB, aggrb=37,933KiB/s, minb=37,933KiB/s,
> > maxb=37,933KiB/s, mint=28306msec, maxt=28306msec
> >
> > syslet-rw:
> > READ: io=1,024MiB, aggrb=34,759KiB/s, minb=34,759KiB/s,
> > maxb=34,759KiB/s, mint=30891msec, maxt=30891msec
> >
> > There were about 10k async schedulings.
>
> I think the issue here is pretty simple - when fio gets a queue full
> like condition (it reaches the depth you set, 32), it commits them and
> starts queuing again. Since that'll likely block, it'll get issued by
> another process. So you suddenly have a nice sequence of reads from one
> process (pending, only one is actually committed since it's serialized),
> and then a read further down the line that goes behind those you already
> committed. The result is then seeky, where it should have been sequential.
>
> Do you get expected results if you set iodepth_low=1? That'll make fio
> drain the queue before building it up again, should get you a sequential
> access pattern with syslets.
With such a change results should be better - not only because seeking is
removed with sequential reads, but also because the number of working
threads decreases with time - until the queue is filled again.
So, syslet-rw has increased to 37mb/sec, compared to 39 for sync and 38 for
libaio; the latter two did not change.
With an iodepth of 10k, I get the same performance for
libaio and syslets - about 36mb/sec; it does not depend on iodepth_low
being set to 1 or default (full).
So syslets have small problems with a small iodepth - their
performance is about 34mb/sec and then increases to 36 as iodepth
grows, while libaio decreases from 38 down to 36 mb/sec.
iodepth_low=1 helps syslets reach 37mb/sec with iodepth=32; with 3200
and 10k it does not play any role.
> --
> Jens Axboe
--
Evgeniy Polyakov
On Tue, 27 Feb 2007, Evgeniy Polyakov wrote:
> I probably selected the wrong words to describe it; here in detail is how
> kevent differs from epoll.
>
> The polling case needs to perform an additional check before an event can be
> copied to userspace, and that check must be done for each event being copied.
> Kevent does not need that (it needs it for poll emulation) - if an event is
> ready, then it is ready.
That could be changed too. The "void *key" doesn't need to be NULL. Wake
ups to f_op->poll() waiters can use that to send ready events directly,
avoiding an extra f_op->poll() to fetch them.
Infrastructure is already there, just need a big patch to do it everywhere ;)
> Kevent works slightly differently - it does not perform an additional check
> for readiness (although it can, and it does in poll notifications); if an
> event is marked as ready, the thread parked in the waiting syscall is
> awakened and the event is copied to userspace.
> Also the waiting syscall is awakened through one queue - the event is added
> and wake_up() is called, while in epoll() there are two queues.
The really ancient version of epoll (called /dev/epoll at that time) was
doing a very similar thing. It was adding custom plugs all over the places
where we wanted to get events from, and was collecting them w/out
resorting to an extra f_op->poll(). Event masks were going straight through
an event buffer.
The reason why the current design of epoll was chosen was that it:
o Did not require custom plugs all over the place
o Worked with the current kernel abstractions as-is (through f_op->poll)
- Davide
On Tue, Feb 27, 2007 at 05:15:31PM +0300, Al Boldi wrote:
> > You may disagree (were you a MacOS 9 programmer
> > in another life?), and it may not even be true for you if you happen
> > to be one of those folks more at home with Scheme continuations, for
> > example.
>
> Personal attacks are really rather unhelpful/unscientific.
Just to be clear; this wasn't a personal attack. I know a lot of
people who I greatly respect who were MacOS and Scheme programmers (I
went to school at MIT, after all, the birthplace of Scheme). The
reality though is that most people don't program that way, and their
brains aren't wired in that fashion. There is a reason why procedural
languages are far more common than purely functional models, and why
aside from (pre-version 10) MacOS, most OS's don't use an event driven
system call interface.
> So, instead of using intimidating language to force one's opinion thru,
> especially when it comes from those in control, why not have a democratic
> vote?
So far, I'd have to say the people arguing for an event driven model
are in the distinct minority...
And as far as voting is concerned, I prefer informed voters who can
explain, in their own words, why they are in favor of a particular
alternative.
Regards,
- Ted
On Tue, Feb 27 2007, Evgeniy Polyakov wrote:
> On Tue, Feb 27, 2007 at 07:45:41PM +0100, Jens Axboe ([email protected]) wrote:
> > > Deadline shows this:
> > >
> > > sync:
> > > READ: io=1,024MiB, aggrb=38,212KiB/s, minb=38,212KiB/s,
> > > maxb=38,212KiB/s, mint=28099msec, maxt=28099msec
> > >
> > > libaio:
> > > READ: io=1,024MiB, aggrb=37,933KiB/s, minb=37,933KiB/s,
> > > maxb=37,933KiB/s, mint=28306msec, maxt=28306msec
> > >
> > > syslet-rw:
> > > READ: io=1,024MiB, aggrb=34,759KiB/s, minb=34,759KiB/s,
> > > maxb=34,759KiB/s, mint=30891msec, maxt=30891msec
> > >
> > > There were about 10k async schedulings.
> >
> > I think the issue here is pretty simple - when fio gets a queue full
> > like condition (it reaches the depth you set, 32), it commits them and
> > starts queuing again. Since that'll likely block, it'll get issued by
> > another process. So you suddenly have a nice sequence of reads from one
> > process (pending, only one is actually committed since it's serialized),
> > and then a read further down the line that goes behind those you already
> > committed. The result is seeky, where it should have been sequential.
> >
> > Do you get expected results if you set iodepth_low=1? That'll make fio
> > drain the queue before building it up again, should get you a sequential
> > access pattern with syslets.
>
> With such a change the results should be better - not only because the
> seeks are removed by the sequential read, but also because the number of
> working threads decreases with time, until the queue is filled again.
Yep, although it probably doesn't matter for such a low bandwidth test
anyway.
> So, syslet-rw has increased to 37mb/sec, compared with 39 for sync and 38
> for libaio; the latter two did not change.
I wonder why all three aren't doing 39mb/sec flat here, it's a pretty
trivial case...
> With iodepth of 10k, I get the same performance for
> libaio and syslets - about 36mb/sec, it does not depend on iodepth_low
> being set to 1 or default (full).
Yep, the larger the iodepth, the less costly a seek on new queue buildup
gets. So that is as expected.
> So syslets have a small problem at low iodepth - their performance is
> about 34mb/sec and then increases to 36 as iodepth grows, while libaio
> decreases from 38 down to 36 mb/sec.
Using your job file and fio HEAD (forces iodepth_low=1 for syslet if
iodepth_low isn't specified), I get:
Engine      Depth     Bw (kb/sec)
-----------------------------------
syslet          1     37163
syslet         32     37197
syslet      10000     36577
libaio          1     37144
libaio         32     37159
libaio      10000     36463
sync            1     37154
Results are highly stable. Note that this test case isn't totally fair,
since libaio isn't really async when you do buffered file IO.
--
Jens Axboe
On 2/27/07, Theodore Tso <[email protected]> wrote:
> I think what you are not hearing, and what everyone else is saying
> (INCLUDING Linus), is that for most programmers, state machines are
> much, much harder to program, understand, and debug compared to
> multi-threaded code. You may disagree (were you a MacOS 9 programmer
> in another life?), and it may not even be true for you if you happen
> to be one of those folks more at home with Scheme continuations, for
> example. But it is true that for most kernel programmers, threaded
> programming is much easier to understand, and we need to engineer the
> kernel for what will be maintainable for the majority of the kernel
> development community.
State machines are much harder to write without going through a real
on-paper design phase first. But multi-threaded code is much harder
for a team of average working coders to write correctly, judging from
the numerous train wrecks that I've been called in to salvage over the
last ten years or so.
The typical 50-250KLoC multi-threaded C/C++/Java application, even if
it's been shipping to customers for several years, is littered with
locking constructs yet routinely corrupts shared data structures.
Change the number of threads in a pool, adjust the thread priorities,
or move a couple of lines of code around, and you're very likely to
get an unexplained deadlock. God help you if you try to use a
debugger on it -- hundreds of latent race conditions will crop up that
didn't happen to trigger before because the thread didn't get
preempted there.
The only programming languages that I have seen widely used in US
industry (so Lisps and self-consciously functional languages are out)
in which mere mortals write passable multi-threaded applications are
Visual Basic and Python. That's partly because programmers in these
languages are not in the habit of throwing pointers around; but if
that were all there was to it, Java programmers would be a lot more
successful than they are at actually writing threaded programs rather
than nibbling cautiously around the edges with EJB. It also helps a
lot that strings are immutable; but again, Java shares this property.
No, the big difference is that VB and Python dicts and arrays are
thread-safed by the language runtime, and Java collections are not.
So while there may be all sorts of pointless and dangerous
mis-locking, it's "protecting" somewhat higher-level data structures.
What does this have to do with the kernel? Well, if you're going to
create Yet Another Micro^WLightweight-Threading Construct for AIO, it
would be mighty nice not to be slinging bare pointers around until the
IO is actually complete and the kernel isn't going to be touching the
data buffer any more. It would also be mighty nice to have a
thread-safe "request pool" data structure on which actions like bulk
cancellation and iteration over a subset can operate. (The iterator
returned by, say, a three-sided query on a RCU priority queue may
contain _stale_ information, but never _inconsistent_ information.)
I recognize that this is more object-oriented snake oil than kernel
programmers usually tolerate, but it would really help AIO-threaded
programs suck less. It is also very much in the Unix tradition --
what are file descriptors and fd_sets if not object-oriented design?
And if following the socket model was good enough for epoll and
netlink, why not for threadlet pools?
In the best of all possible worlds, AIO would look just like the good
old socket-bind-listen-accept model, except that I/O is transacted on
the "listen" socket as long as it can be serviced from cache, and
accept() only gets a new connection when a delayed I/O arrives. The
object hiding inside the fd returned by socket() would be the
"threadlet pool", and the object hiding inside each fd returned by
accept() would be a threadlet. Only this time you do it right and
make errno(fd) be a vsyscall that returns a threadlet-local error
state, and you assign reasonable semantics to operations on an fd that
has already encountered an exception. Much like IEEE 754, actually.
Anyway, like I said, good threaded code is quite rare. On the other
hand, I have seen plenty of reasonably successful event-loop
programming in C and C++, mostly in MacOS and Windows and PalmOS GUIs
where the OS deals with event queues and event handler registration.
It's not the world's most CPU-efficient strategy because of all those
function pointers and virtual methods, but those costs are dwarfed by
the GUI bounding boxes and repaints and things anyway. More to the
point, writing an event-loop framework for other people to use
involves extensive APIs that are stable in the tiniest details and
extensively documented. Not, perhaps, Plan A for the Linux kernel
community. :-)
Happily, a largely event-driven userspace framework can easily be
stacked on top of a threaded kernel -- as long as they're the right
kind of threads. The right kind of threads do not proliferate malloc
arenas by allowing preemption in mid-malloc. (They may need to
malloc(), and that may be a preemption point relative to _real_
threads, but you shouldn't switch or cancel threadlets there.) The
right kind of threads do not leak locking primitives when cancelled,
because they don't have to take a lock in order to update the right
kind of data structure. The right kind of threads can use floating
point safely as long as they don't expect FPU state to be preserved
across a syscall.
The right kind of threads, in short, work like coroutines or
MacOS/PalmOS "event handlers", with the added convenience of being
able to write them as if they were normal sequential code, with normal
access to a persistent stack frame and to process globals. And if you
do them right, they're cheap to migrate, easy and harmless to throttle
and cancel in bulk, and easy to punt out to an "I/O coprocessor" in
the future. The key is to move data into and out of the "I/O
registers" at well-defined points and not to break the encapsulation
in between. Delay the extraction of results from the "I/O registers"
as long as possible, and the hypothetical AIO coprocessor can go chew
on them in parallel with the main "integer" code flow, which only
stalls when it can't go any further without the I/O result.
If you've got some time to kill, you can even analyze an existing,
well-documented flavor of I/O strings (I like James Antill's Vstr
library myself) and define a set of "AIO opcodes" that manipulate AIO
fds and AIO strings as primitive types, just like the FPU manipulates
floating-point numbers as primitive types. Pick a CPU architecture
with a good range of unused trap opcodes (PPC, for instance) and
actually move the I/O strings into kernel space, mapping the AIO
operations and the I/O string API onto the free trap opcodes (really
no different from syscalls, except it's easier to inspect the assembly
that the compiler produces and see what's going on). For extra
credit, implement most of the AIO opcodes in Verilog, bolt them onto
the PPC core inside a Virtex4 FX FPGA, refine the semantics to permit
efficient pipelining, and collect your Turing award. :-)
Cheers,
- Michael
On Tue, Feb 27, 2007 at 07:03:21PM -0800, Michael K. Edwards ([email protected]) wrote:
>
> State machines are much harder to write without going through a real
> on-paper design phase first. But multi-threaded code is much harder
> for a team of average working coders to write correctly, judging from
> the numerous train wrecks that I've been called in to salvage over the
> last ten years or so.
130 lines skipped...
I have only one question - wasn't it too lazy to write all that? :)
> Cheers,
> - Michael
--
Evgeniy Polyakov
On Tue, Feb 27 2007, Suparna Bhattacharya wrote:
> > It's not bad for such a high depth/batch setting, but I still wonder why
> > the results are so different. I'll look around for an x86 box with some
> > TCQ/NCQ enabled storage attached for testing. Can you pass me your
> > command line or job file (whatever you use) so we are on the same page?
>
> Sure - I used variations of the following job file (e.g. engine=syslet-rw,
> iodepth=20000).
>
> Also the io scheduler on my system is set to Anticipatory by default.
> FWIW it is a 4 way SMP (PIII, 700MHz)
>
> ; aio-stress -l -O -o3 <1GB file>
> [global]
> ioengine=libaio
> buffered=0
> rw=randread
> bs=64k
> size=1024m
> directory=/kdump/suparna
>
> [testfile2]
> iodepth=64
> iodepth_batch=8
Ok, now that I can run this on more than x86, I gave it a spin on a box
with a little more potent storage. This is a core 2 quad, disks are
7200rpm sata (with NCQ) and a 15krpm SCSI disk. IO scheduler is
deadline.
SATA disk:
Engine      Depth     Batch     Bw (KiB/sec)
----------------------------------------------------
libaio         64         8     17,486
syslet         64         8     17,357
libaio      20000         8     17,625
syslet      20000         8     16,526
sync            1         1      7,529
SCSI disk:
Engine      Depth     Batch     Bw (KiB/sec)
----------------------------------------------------
libaio         64         8     20,723
syslet         64         8     20,742
libaio      20000         8     21,125
syslet      20000         8     19,610
sync            1         1     16,659
> > > Engine      Depth     Batch       Bw (KiB/sec)
> > > ----------------------------------------------------
> > > libaio         64     default     17,429
> > > syslet         64     default     16,155
> > > libaio      20000     default     15,494
> > > syslet      20000     default      7,971
> > >
> > If iodepth_batch isn't set, the syslet queued io will be serialized and
>
> I see, so then this particular setting is not very meaningful
Not if you want to take advantage of hw queuing, as in this random
workload. fio being a test tool, it's important to be able to control as
many aspects of what happens as possible. That means you can also do
things that you do not want to do in real life, having a pending list of
20000 serialized requests is indeed one of them. It also means you
pretty much have to know what you are doing, when testing little details
like this.
--
Jens Axboe
* Jens Axboe <[email protected]> wrote:
> Engine Depth Batch Bw (KiB/sec)
> libaio 20000 8 21,125
> syslet 20000 8 19,610
i'd like to do something more about this to be more in line with libaio
- if nothing else then for the bragging rights ;-) It seems to me that a
drop of ~7% in throughput cannot be explained with any CPU overhead, it
must be some sort of queueing + IO scheduling effect - right?
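As a quick check of the ~7% figure, here is a minimal sketch that computes the relative drop from the two bandwidth numbers quoted above. The helper name is made up for illustration.

```c
#include <assert.h>

/* Relative throughput drop of `other` versus `base`, in percent. */
static double drop_percent(double base, double other)
{
    assert(base > 0.0);
    return (base - other) / base * 100.0;
}
```

With the quoted values, drop_percent(21125, 19610) comes out at roughly 7.2.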
Ingo
On Wed, Feb 28 2007, Ingo Molnar wrote:
>
> * Jens Axboe <[email protected]> wrote:
>
> > Engine Depth Batch Bw (KiB/sec)
> > libaio 20000 8 21,125
> > syslet 20000 8 19,610
>
> i'd like to do something more about this to be more in line with libaio
> - if nothing else then for the bragging rights ;-) It seems to me that a
> drop of ~7% in throughput cannot be explained with any CPU overhead, it
> must be some sort of queueing + IO scheduling effect - right?
syslet shows a slightly higher overhead, but nothing that will account
for any bandwidth change in this test. The box is obviously mostly idle
when running this test, it's not very CPU consuming. The IO pattern
issued is not the same, since libaio would commit IO [0..7], then
[8..15] and so on, where syslet would expose [0,8,16,24,32,40,48,56] and
then [1,9,17,25,33,41,49,57] etc. If iodepth_batch is set to 1 you'd
get a closer match wrt io pattern, but at a higher cost (increased
system calls, and 8 times as many pending async threads). That gets it
to 20,253KiB/s here with ~1000 times as many context switches.
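The two issue orders described above can be sketched as follows. This is only an illustration, under the assumption that the syslet engine chains `batch` consecutive requests per async thread, so that one request from each chain is visible to the device at a time. The function names are hypothetical and not taken from fio.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Fill out[] with request indices in the order they would be exposed
 * to the device, for `depth` queued requests in batches of `batch`.
 */

/* libaio-like: contiguous batches [0..7], [8..15], ... */
static void order_contiguous(unsigned *out, unsigned depth, unsigned batch)
{
    assert(out != NULL && batch > 0);
    /* Batches are consecutive, so the overall order is just 0..depth-1. */
    for (unsigned i = 0; i < depth; i++)
        out[i] = i;
}

/* syslet-like: strided [0,8,16,...], then [1,9,17,...], ... */
static void order_strided(unsigned *out, unsigned depth, unsigned batch)
{
    unsigned n = 0;

    assert(out != NULL && batch > 0);
    /* Round r takes the r-th request of every chain of `batch` requests. */
    for (unsigned first = 0; first < batch; first++)
        for (unsigned i = first; i < depth; i += batch)
            out[n++] = i;
}
```

For depth=64 and batch=8 the strided order starts 0, 8, 16, ..., 56 and continues with 1, 9, ..., 57, which is the pattern given in the mail.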
So in short, it's harder to compare with real storage, as access
patterns don't translate very easily.
--
Jens Axboe
* Davide Libenzi <[email protected]> wrote:
> > my current thinking is that special-purpose (non-programmable,
> > static) APIs like aio_*() and lio_*(), where every last cycle of
> > performance matters, should be implemented using syslets - even if
> > it is quite tricky to write syslets (which they no doubt are - just
> > compare the size of syslet-test.c to threadlet-test.c). So i'd move
> > syslets into the same category as raw syscalls: pieces of the raw
> > infrastructure between the kernel and glibc, not an exposed API to
> > apps. [and even if we keep them in that category they still need
> > quite a bit of API work, to clean up the 32/64-bit issues, etc.]
>
> Why can't aio_* be implemented with *simple* (or parallel/unrelated)
> syscall submit w/out the burden of a complex, limiting and heavy API
there are so many variants of what people think 'asynchronous IO' should
look like - i'd not like to limit them. I agree that once a particular
syslet script becomes really popular, it might (and should) in fact be
pushed into a separate system call.
But i also agree that a one-shot-syscall sys_async() syscall could be
done too - for those uses where only a single system call is needed and
where the fetching of a single uatom would be small but nevertheless
unnecessary overhead. A one-shot async syscall needs to get /8/
parameters (the syscall nr is the seventh parameter and the return code
of the nested syscall is the eighth). So at least two parameters will
have to be passed in indirectly and validated, and 32/64-bit compat
conversions added, etc. anyway!
The copy_uatom() assembly code i did is really fast so i doubt there
would be much measurable performance difference between the two
solutions. Plus, putting the uatom into user memory allows the caching
of uatoms - further diluting the advantage of passing in the values per
register. The whole difference should be on the order of 10 cycles, so
this really isnt a high prio item in my view.
> Now that chains of syscalls can be way more easily handled with
> clets^wthreadlets, why would we need the whole syslets crud inside?
no, threadlets dont really solve the basic issue of people wanting to
'combine' syscalls, avoid the syscall entry overhead (even if that is
small), and the desire to rely on kthread->kthread context switching
which is even faster than uthread->uthread context-switching, etc.
Furthermore, syslets dont really cause any new problem. They are almost
totally orthogonal, isolated, and cause no wide infrastructure needs.
as long as syslets remain a syscall-level API, for the measured use of
the likes of glibc and libaio (and not exposed in a programmable manner
to user-space), i see no big problem with them at all. They can also be
used without them having any classic pthread user-state (without linking
to libpthread). Think of it like the raw use of clone(): possible and
useful in some cases, but not something that a typical application would
do. This is a 'raw syscall plugins' thing, to be used by those
user-space entities that use raw syscalls: infrastructure libraries. Raw
syscalls themselves are tied to the platform, are not easily used in
some cases, thus almost no application uses them directly, but uses the
generic functions glibc exposes.
in the long run, sys_syslet_exec(), were it not to establish itself as a
widely used interface, could be implemented purely from user-space too
(say from the VDSO, at much worse performance, but the kernel would stay
backwards compatible with the syscall), so there's almost no risk here.
You dont like it => dont use it. Meanwhile, i'll happily take any
suggestion to make the syslet API more digestible.
Ingo
Hi!
> > If kernelspace rescheduling is that fast, then please explain me why
> > userspace one always beats kernel/userspace?
>
> because 'user space scheduling' makes no sense? I explained my thinking
> about that in a past mail:
...
> 2) there has been an IO event. The thing is, for IO events we enter the
> kernel no matter what - and we'll do so for the next 10 years at
..actually, at some point 3D acceleration was done by accessing hw
directly from userspace. OTOH I think we are moving away from that
model, so it is probably irrelevant here.
Pavel
--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html
Hi!
> > I think what you are not hearing, and what everyone else is saying
> > (INCLUDING Linus), is that for most programmers, state machines are
> > much, much harder to program, understand, and debug compared to
> > multi-threaded code. You may disagree (were you a MacOS 9 programmer
> > in another life?), and it may not even be true for you if you happen
> > to be one of those folks more at home with Scheme continuations, for
> > example. But it is true that for most kernel programmers, threaded
> > programming is much easier to understand, and we need to engineer the
> > kernel for what will be maintainable for the majority of the kernel
> > development community.
>
> I understand that - and I totally agree.
> But when more complex, more bug-prone code results in higher performance
> - that must be used. We have linked lists and binary trees - the latter
No-o. Kernel is not designed like that.
Often, more complex and slightly faster code exists, and we simply use
slower variant, because it is fast enough.
10% gain in speed is NOT worth major complexity increase.
Pavel
--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html
On Wed, 28 Feb 2007, Ingo Molnar wrote:
>
> * Davide Libenzi <[email protected]> wrote:
>
> > Why can't aio_* be implemented with *simple* (or parallel/unrelated)
> > syscall submit w/out the burden of a complex, limiting and heavy API
>
> there are so many variants of what people think 'asynchronous IO' should
> look like - i'd not like to limit them. I agree that once a particular
> syslet script becomes really popular, it might (and should) in fact be
> pushed into a separate system call.
>
> But i also agree that a one-shot-syscall sys_async() syscall could be
> done too - for those uses where only a single system call is needed and
> where the fetching of a single uatom would be small but nevertheless
> unnecessary overhead. A one-shot async syscall needs to get /8/
> parameters (the syscall nr is the seventh parameter and the return code
> of the nested syscall is the eighth). So at least two parameters will
> have to be passed in indirectly and validated, and 32/64-bit compat
> conversions added, etc. anyway!
At this point, given how threadlets can be easily/effectively dispatched
from userspace, I'd argue against the presence of either single/parallel or
syslet submission altogether. Threadlets allow you to code chains *way* more
naturally than syslets, and since they basically are like function calls
in the fast path, they can be used even for single/parallel submissions.
No compat code required (ok, besides the trivial async_wait).
My point is, the syslet infrastructure is expensive for the kernel in
terms of compat and the extra code added to handle the cond/jumps/etc. It
is also non-trivial to use from userspace. Are there performance advantages
big enough to justify its existence? I doubt that the price of a
sysenter is a lot bigger than an atom decoding, but I'm looking forward to
being proven wrong by real life performance numbers ;)
- Davide
On Wed, 28 Feb 2007, Davide Libenzi wrote:
>
> At this point, given how threadlets can be easily/effectively dispatched
> from userspace, I'd argue against the presence of either single/parallel or
> syslet submission altogether. Threadlets allow you to code chains *way* more
> naturally than syslets, and since they basically are like function calls
> in the fast path, they can be used even for single/parallel submissions.
Well, I agree, except for one thing:
- user space execution is *inherently* more expensive.
Why? Stack. Stack. Stack.
If you support threadlets with user space code, it means that you need a
separate user-space stack for each threadlet. That's a potentially *big*
cost to bear, both from a setup standpoint and from simply a memory
allocation standpoint.
Quite frankly, I think threadlets are a great idea, but I think the lack
of user-level footprint is *also* a great idea, and you should support
both.
In short - the only thing I *don't* think is a great idea are those linked
lists of atoms. I still think it's a pretty horrible interface, and I
still don't think it really buys us very much. The only way it would buy
us a lot is to change the linked lists dynamically (ie add new events at
the end while old events are still executing), but quite frankly, that
just makes the whole interface *even*worse* and just makes me have
debugging nightmares (I'm also not even convinced it really would help
us: we might avoid some costs of adding new events, but it would only
avoid them for serial execution, and if the whole point of this is to
execute things in parallel, that's a stupid thing to do).
So I would repeat my call for getting rid of the atoms, and instead just
do a "single submission" at a time. Do the linking by running a threadlet
that has user space code (and the stack overhead), which is MUCH more
flexible. And do nonlinked single system calls without *either* atoms *or*
a user-space stack footprint.
Please?
What am I missing?
Linus
Michael K. Edwards wrote:
> State machines are much harder to write without going through a real
> on-paper design phase first. But multi-threaded code is much harder
> for a team of average working coders to write correctly, judging from
> the numerous train wrecks that I've been called in to salvage over the
> last ten years or so.
I have to agree; state machines are harder to design and read, but
multithreaded programs are harder to write and debug _correctly_.
Another way of putting it is that the threadlet approach is easier to
do, but harder to do _right_.
On 2/28/07, Evgeniy Polyakov <[email protected]> wrote:
> 130 lines skipped...
Yeah, I edited it down a lot before sending it. :-)
> I have only one question - wasn't it too lazy to write all that? :)
I'm pretty lazy all right. But occasionally an interesting problem
(and revamping AIO is very interesting) makes me think, and what
little thinking I do is always accompanied by writing. Once I've
thought something through to the point that I think I understand the
problem, I've even been known to attempt a solution. Not always,
though; more often, I find a new interesting problem, or else I am
forcibly reminded that I should be spending my little store of insight
on revenue-producing activity.
In this instance, there didn't seem to be any harm in sending my
thoughts to LKML as I wrote them, on the off chance that Ingo or
Davide would get some value out of them in this design cycle (which
any code I eventually get around to producing will miss). So far,
I've gotten some rather dismissive pushback from Ingo and Alan (who
seem to have no interest outside x86 and less understanding than I
would have thought of what real userspace code looks like), a "why
preach to people who know more than you do" from Davide, a brief aside
on the dominance of x86 from Oleg, and one off-list "keep up the good
work". Not a very rich harvest from (IMHO) pretty good seeds.
In short, so far the "Linux kernel community" is upholding its
reputation for insularity, arrogance, coding without prior design,
lack of interest in userspace problems, and inability to learn from
the mistakes of others. (None of these characterizations depends on
there being any real insight in anything I have written.) Linus
himself has a very different reputation -- plenty of arrogance all
right, but genuine brilliance and hard work, and sincere (if cranky)
efforts to explain the "theory of operations" underlying central
design choices. So far he hasn't commented directly on anything I
have had to say; it will be interesting to see whether he tells me to
stop annoying the pros and to go away until I have some code to
contribute.
Happy hacking,
- Michael
P. S. I do think "threadlets" are brilliant, though, and reading
Ingo's patches gave me a much better idea of what would be involved in
prototyping Asynchronously Executed I/O Unit opcodes.
* Linus Torvalds <[email protected]> wrote:
> [...] The only way it would buy us a lot is to change the linked lists
> dynamically (ie add new events at the end while old events are still
> executing), [...]
that's quite close to what Jens' FIO plugin for syslets
(engines/syslet-rw.c) does currently: it builds lists of syslets as IO
gets submitted, batches them up for some time and then sends them off.
It is a natural next step to do this for in-flight syslets as well.
Ingo
On Wed, 28 Feb 2007, Linus Torvalds wrote:
> On Wed, 28 Feb 2007, Davide Libenzi wrote:
> >
> > At this point, given how threadlets can be easily/effectively dispatched
> > from userspace, I'd argue against the presence of either single/parallel or
> > syslet submission altogether. Threadlets allow you to code chains *way* more
> > naturally than syslets, and since they basically are like function calls
> > in the fast path, they can be used even for single/parallel submissions.
>
> Well, I agree, except for one thing:
> - user space execution is *inherently* more expensive.
>
> Why? Stack. Stack. Stack.
>
> If you support threadlets with user space code, it means that you need a
> separate user-space stack for each threadlet. That's a potentially *big*
> cost to bear, both from a setup standpoint and from simply a memory
> allocation standpoint.
Right, point taken.
> In short - the only thing I *don't* think is a great idea are those linked
> lists of atoms. I still think it's a pretty horrible interface, and I
> still don't think it really buys us very much. The only way it would buy
> us a lot is to change the linked lists dynamically (ie add new events at
> the end while old events are still executing), but quite frankly, that
> just makes the whole interface *even*worse* and just makes me have
> debugging nightmares (I'm also not even convinced it really would help
> us: we might avoid some costs of adding new events, but it would only
> avoid them for serial execution, and if the whole point of this is to
> execute things in parallel, that's a stupid thing to do).
>
> So I would repeat my call for getting rid of the atoms, and instead just
> do a "single submission" at a time. Do the linking by running a threadlet
> that has user space code (and the stack overhead), which is MUCH more
> flexible. And do nonlinked single system calls without *either* atoms *or*
> a user-space stack footprint.
Here we very much agree. The way I'd like it:
struct async_syscall {
unsigned long nr_sysc;
unsigned long params[8];
long result;
};
int async_exec(struct async_syscall *a, int n);
or:
int async_exec(struct async_syscall **a, int n);
At this point I'm ok even with the userspace ring buffer, returning
back pointers to "struct async_syscall".
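To make the proposed interface concrete, here is a minimal userspace sketch. It assumes Linux and uses a hypothetical stand-in, `async_exec_emulated()`, that simply runs each request synchronously through syscall(2). Nothing in it is asynchronous and it is not the kernel implementation under discussion; it only shows how a caller would fill in and submit an array of `struct async_syscall`.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Layout as proposed in the mail above. */
struct async_syscall {
    unsigned long nr_sysc;
    unsigned long params[8];
    long result;
};

/*
 * Hypothetical stand-in for the proposed async_exec(): runs each request
 * synchronously via syscall(2). Only the first six params are forwarded,
 * since Linux system calls take at most six arguments.
 * Returns the number of requests executed.
 */
static int async_exec_emulated(struct async_syscall *a, int n)
{
    assert(a != NULL && n >= 0);
    for (int i = 0; i < n; i++) {
        unsigned long *p = a[i].params;
        long ret = syscall((long)a[i].nr_sysc,
                           p[0], p[1], p[2], p[3], p[4], p[5]);

        /* Store failures kernel-style, as a negative errno value. */
        a[i].result = (ret == -1) ? -(long)errno : ret;
    }
    return n;
}
```

A caller would set `nr_sysc` to, say, `SYS_getpid`, pass the array and its length, and read each `result` once the request has completed.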
- Davide
On Wed, 28 Feb 2007, Davide Libenzi wrote:
>
> Here we very much agree. The way I'd like it:
>
> struct async_syscall {
> unsigned long nr_sysc;
> unsigned long params[8];
> long result;
> };
No, the "result" needs to go somewhere else. The caller may be totally
uninterested in keeping the system call number or parameters around until
the operation completes, but if you put them in the same structure with
the result, you obviously cannot sanely get rid of them.
I also don't much like read-write interfaces (which the above would be:
the kernel would read most of the structure, and then write one member of
the structure).
It's entirely possible, for example, that the operation we submit is some
legacy "aio_read()", which has some other structure layout than the new
one (but one field will be the result code).
Linus
On Wed, 28 Feb 2007, Linus Torvalds wrote:
> On Wed, 28 Feb 2007, Davide Libenzi wrote:
> >
> > Here we very much agree. The way I'd like it:
> >
> > struct async_syscall {
> > unsigned long nr_sysc;
> > unsigned long params[8];
> > long result;
> > };
>
> No, the "result" needs to go somewhere else. The caller may be totally
> uninterested in keeping the system call number or parameters around until
> the operation completes, but if you put them in the same structure with
> the result, you obviously cannot sanely get rid of them.
>
> I also don't much like read-write interfaces (which the above would be:
> the kernel would read most of the structure, and then write one member of
> the structure).
>
> It's entirely possible, for example, that the operation we submit is some
> legacy "aio_read()", which has soem other structure layout than the new
> one (but one field will be the result code).
Ok, makes sense. Something like this then?
struct async_syscall {
unsigned long nr_sysc;
unsigned long params[8];
long *result;
};
And what would async_wait() return back? Pointers to "struct async_syscall"
or pointers to "result"?
- Davide
Davide Libenzi wrote:
> struct async_syscall {
> unsigned long nr_sysc;
> unsigned long params[8];
> long *result;
> };
>
> And what would async_wait() return bak? Pointers to "struct async_syscall"
> or pointers to "result"?
Either one has downsides. Pointer to struct async_syscall requires that
the caller keep the struct around. Pointer to result requires that the
caller always reserve a location for the result.
Does the kernel care about the (possibly rare) case of callers that
don't want to pay attention to result? If so, what about adding some
kind of caller-specified handle to struct async_syscall, and having
async_wait() return the handle? In the case where the caller does care
about the result, the handle could just be the address of result.
Chris
On Wed, 28 Feb 2007, Chris Friesen wrote:
> Davide Libenzi wrote:
>
> > struct async_syscall {
> > unsigned long nr_sysc;
> > unsigned long params[8];
> > long *result;
> > };
> >
> > And what would async_wait() return bak? Pointers to "struct async_syscall"
> > or pointers to "result"?
>
> Either one has downsides. Pointer to struct async_syscall requires that the
> caller keep the struct around. Pointer to result requires that the caller
> always reserve a location for the result.
>
> Does the kernel care about the (possibly rare) case of callers that don't want
> to pay attention to result? If so, what about adding some kind of
> caller-specified handle to struct async_syscall, and having async_wait()
> return the handle? In the case where the caller does care about the result,
> the handle could just be the address of result.
Something like this (with async_wait() returning asynid's)?
struct async_syscall {
long *result;
unsigned long asynid;
unsigned long nr_sysc;
unsigned long params[8];
};
- Davide
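
A minimal user-space sketch of the handle idea discussed above. The
struct layout is the one from the message; everything else is an
assumption made for illustration: this async_exec() and async_wait() are
mocks that run synchronously in user space, mock_do_syscall() and the
completion ring are hypothetical, and none of this is the kernel
interface. It only shows how a caller that cares about the result can
use the address of the result as its asynid, while a caller that does
not care can pass any value.

```c
#include <assert.h>
#include <stddef.h>

/* Descriptor layout from the message above. */
struct async_syscall {
	long *result;            /* where the return value goes (may be NULL) */
	unsigned long asynid;    /* caller-chosen handle reported on completion */
	unsigned long nr_sysc;   /* system call number */
	unsigned long params[8]; /* system call arguments */
};

#define RING_SIZE 16             /* hypothetical ring size, power of two */

/* Hypothetical completion ring: it stores asynids only. */
static unsigned long ring[RING_SIZE];
static unsigned int ring_head, ring_tail;

/* Stand-in for real work: the mock "syscall" adds its first two params. */
long mock_do_syscall(const struct async_syscall *a)
{
	return (long)(a->params[0] + a->params[1]);
}

/* Mock submission: runs n requests synchronously and queues their ids.
 * Returns the number of requests accepted. */
int async_exec(struct async_syscall *a, int n)
{
	int i;

	assert(a != NULL || n == 0);
	for (i = 0; i < n; i++) {
		long ret;

		if (ring_head - ring_tail == RING_SIZE)
			break;                  /* ring full */
		ret = mock_do_syscall(&a[i]);
		if (a[i].result)
			*a[i].result = ret;     /* result lives outside the descriptor */
		ring[ring_head++ % RING_SIZE] = a[i].asynid;
	}
	return i;
}

/* Mock completion: copies up to max completed ids, returns the count. */
int async_wait(unsigned long *ids, int max)
{
	int n = 0;

	while (n < max && ring_tail != ring_head)
		ids[n++] = ring[ring_tail++ % RING_SIZE];
	return n;
}
```

With this shape, a caller that sets asynid to (unsigned long)&result can
cast the returned id straight back to a long pointer, and the descriptor
itself does not have to be kept around to find the result.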
* Davide Libenzi wrote:
> My point is, the syslet infrastructure is expensive for the kernel in
> terms of compat, [...]
it is not. Today i've implemented 64-bit syslets on x86_64 and
32-bit-on-64-bit compat syslets. Both the 64-bit and the 32-bit syslet
(and threadlet) binaries work just fine on a 64-bit kernel, and they
share 99% of the infrastructure. There's only a single #ifdef
CONFIG_COMPAT in kernel/async.c:
#ifdef CONFIG_COMPAT
asmlinkage struct syslet_uatom __user *
compat_sys_async_exec(struct syslet_uatom __user *uatom,
struct async_head_user __user *ahu)
{
return __sys_async_exec(uatom, ahu, &compat_sys_call_table,
compat_NR_syscalls);
}
#endif
Even mixed-mode syslets should work (although i havent specifically
tested them), where the head switches between 64-bit and 32-bit mode and
submits syslets from both 64-bit and from 32-bit mode, and at the same
time there might be both 64-bit and 32-bit syslets 'in flight'.
But i'm happy to change the syslet API in any sane way, and did so based
on feedback from Jens who is actually using them.
Ingo
On Wed, 28 Feb 2007, Ingo Molnar wrote:
>
> * Davide Libenzi <[email protected]> wrote:
>
> > My point is, the syslet infrastructure is expensive for the kernel in
> > terms of compat, [...]
>
> it is not. Today i've implemented 64-bit syslets on x86_64 and
> 32-bit-on-64-bit compat syslets. Both the 64-bit and the 32-bit syslet
> (and threadlet) binaries work just fine on a 64-bit kernel, and they
> share 99% of the infrastructure. There's only a single #ifdef
> CONFIG_COMPAT in kernel/async.c:
>
> #ifdef CONFIG_COMPAT
>
> asmlinkage struct syslet_uatom __user *
> compat_sys_async_exec(struct syslet_uatom __user *uatom,
> struct async_head_user __user *ahu)
> {
> return __sys_async_exec(uatom, ahu, &compat_sys_call_table,
> compat_NR_syscalls);
> }
>
> #endif
Did you hide all the complexity of the userspace atom decoding inside
another function? :)
How much code would go away if we picked a simple/parallel
sys_async_exec engine? Atom decoding, special userspace variable access
for loops, jumps/cond/... the VM engine.
> Even mixed-mode syslets should work (although i havent specifically
> tested them), where the head switches between 64-bit and 32-bit mode and
> submits syslets from both 64-bit and from 32-bit mode, and at the same
> time there might be both 64-bit and 32-bit syslets 'in flight'.
>
> But i'm happy to change the syslet API in any sane way, and did so based
> on feedback from Jens who is actually using them.
Wouldn't you agree on a simple/parallel execution engine like the one
Linus and I are proposing (and threadlets, of course)?
- Davide
* Davide Libenzi wrote:
> On Wed, 28 Feb 2007, Ingo Molnar wrote:
>
> >
> > * Davide Libenzi <[email protected]> wrote:
> >
> > > My point is, the syslet infrastructure is expensive for the kernel in
> > > terms of compat, [...]
> >
> > it is not. Today i've implemented 64-bit syslets on x86_64 and
> > 32-bit-on-64-bit compat syslets. Both the 64-bit and the 32-bit syslet
> > (and threadlet) binaries work just fine on a 64-bit kernel, and they
> > share 99% of the infrastructure. There's only a single #ifdef
> > CONFIG_COMPAT in kernel/async.c:
> >
> > #ifdef CONFIG_COMPAT
> >
> > asmlinkage struct syslet_uatom __user *
> > compat_sys_async_exec(struct syslet_uatom __user *uatom,
> > struct async_head_user __user *ahu)
> > {
> > return __sys_async_exec(uatom, ahu, &compat_sys_call_table,
> > compat_NR_syscalls);
> > }
> >
> > #endif
>
> Did you hide all the complexity of the userspace atom decoding inside
> another function? :)
no, i made the 64-bit and 32-bit structures layout-compatible. This
makes the 32-bit structure as large as the 64-bit ones, but that's not a
big issue, compared to the simplifications it brings.
> > But i'm happy to change the syslet API in any sane way, and did so
> > based on feedback from Jens who is actually using them.
>
> Wouldn't you agree on a simple/parallel execution engine [...]
the thing is, there's almost zero overhead from having those basic
things like conditions and the ->next link, and they make it so much
more capable. As usual my biggest problem is that you are not trying to
use syslets at all - you are only trying to get rid of them ;-) My
purpose with syslets is to enable a syslet to do almost anything that
user-space could do too, as simply as possible. Syslets could even
allocate user-space memory and then use it (i dont think we actually
want to do that though). That doesnt mean arbitrary complex code
/should/ be done via syslets, or that it wont be significantly slower
than what user-space can do, but i'd not like to artificially dumb the
engine down. I'm totally willing to simplify/shrink the vectoring of
arguments and just about anything else, but your proposals so far (such
as your return-value-embedded-in-atom suggestion) all kill important
aspects of the engine.
All the existing syslet features were purpose-driven: i actually coded
up a sample syslet, trying to do something that makes sense, and added
these features based on that. The engine core takes up maybe 50 lines of
code.
Ingo
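
The layout-compatibility point above can be sketched as follows. This is
one common way to get identical 32-bit and 64-bit layouts, not
necessarily how the syslet patches do it: the struct compat_atom type,
its field names and the two helpers are hypothetical. Every field is a
fixed-width 64-bit slot, so a 32-bit binary stores its pointers
zero-extended and pays for the larger structure, and in exchange no
separate compat decoding of the structure is needed.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical atom: only 64-bit slots, so there is no padding and no
 * pointer-size dependence in the layout. */
struct compat_atom {
	uint64_t nr;          /* system call number */
	uint64_t ret_ptr;     /* user pointer to the result slot */
	uint64_t arg_ptr[6];  /* user pointers to the arguments */
	uint64_t next;        /* user pointer to the next atom */
};

/* The same numbers must hold for a 32-bit and a 64-bit build. */
static_assert(sizeof(struct compat_atom) == 72, "size is ABI independent");
static_assert(offsetof(struct compat_atom, next) == 64, "offsets too");

/* Store a native pointer into a 64-bit slot (zero-extended on 32-bit). */
uint64_t ptr_to_u64(const void *p)
{
	return (uint64_t)(uintptr_t)p;
}

/* Recover a native pointer from a 64-bit slot. */
void *u64_to_ptr(uint64_t v)
{
	return (void *)(uintptr_t)v;
}
```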
On Wed, 28 Feb 2007, Ingo Molnar wrote:
> * Davide Libenzi <[email protected]> wrote:
>
> > Did you hide all the complexity of the userspace atom decoding inside
> > another function? :)
>
> no, i made the 64-bit and 32-bit structures layout-compatible. This
> makes the 32-bit structure as large as the 64-bit ones, but that's not a
> big issue, compared to the simplifications it brings.
Do you have a new version to review?
> > > But i'm happy to change the syslet API in any sane way, and did so
> > > based on feedback from Jens who is actually using them.
> >
> > Wouldn't you agree on a simple/parallel execution engine [...]
>
> the thing is, there's almost zero overhead from having those basic
> things like conditions and the ->next link, and they make it so much
> more capable. As usual my biggest problem is that you are not trying to
> use syslets at all - you are only trying to get rid of them ;-) My
> purpose with syslets is to enable a syslet to do almost anything that
> user-space could do too, as simply as possible. Syslets could even
> allocate user-space memory and then use it (i dont think we actually
> want to do that though). That doesnt mean arbitrary complex code
> /should/ be done via syslets, or that it wont be significantly slower
> than what user-space can do, but i'd not like to artificially dumb the
> engine down. I'm totally willing to simplify/shrink the vectoring of
> arguments and just about anything else, but your proposals so far (such
> as your return-value-embedded-in-atom suggestion) all kill important
> aspects of the engine.
Ok, we're past the error code in the atom, as Linus pointed out ;)
How about this, with async_wait returning asynid's back to a userspace
ring buffer?
struct syslet_uatom {
long *result;
unsigned long asynid;
unsigned long nr_sysc;
unsigned long params[8];
};
My problem with the syslets in their current form is, do we have a real
use for them that justifies the extra complexity inside the kernel? Or can
we cover a pretty broad range of real life use cases with a simple/parallel
async submission, coupled with threadlets?
- Davide
* Davide Libenzi wrote:
> On Wed, 28 Feb 2007, Ingo Molnar wrote:
>
> > * Davide Libenzi <[email protected]> wrote:
> >
> > > Did you hide all the complexity of the userspace atom decoding inside
> > > another function? :)
> >
> > no, i made the 64-bit and 32-bit structures layout-compatible. This
> > makes the 32-bit structure as large as the 64-bit ones, but that's not a
> > big issue, compared to the simplifications it brings.
>
> Do you have a new version to review?
yep, i've just released -v5.
> How about this, with async_wait returning asynid's back to a userspace
> ring buffer?
>
> struct syslet_utaom {
> long *result;
> unsigned long asynid;
> unsigned long nr_sysc;
> unsigned long params[8];
> };
we talked about the parameters at length: if they are pointers the
layout is significantly more flexible and more capable. It's a pretty
similar argument to the return-pointer thing. For example take a look at
how the IO syslet atoms in Jens' FIO engine share the same fd. Even if
there's 20000 of them. And they are fully cacheable in constructed
state. The same goes for the webserving examples i've got in the
async-test userspace sample code. I can pick up a cached request and
only update req->fd, i dont have to reinit the atoms at all. It stays
nicely in the cache, is not re-dirtied, etc.
furthermore, having the parameters as pointers is also an optimization:
look at the copy_uatom() x86 assembly code i did - it can do a simple
jump out of the parameter fetching code. I actually tried /both/ of
these variants in assembly (as i mentioned it in a previous reply, in
the v1 thread) and the speed difference between a pointer and
non-pointer variant was negligible. (even with 6 parameters filled in)
but yes ... two more small changes and your layout will be
awfully similar to the current uatom layout =B-)
> My problem with the syslets in their current form is, do we have a
> real use for them that justify the extra complexity inside the kernel?
i call bullshit. really. I have just gone out and wasted some time
cutting & pasting all the syslet engine code: it is 153 lines total,
plus 51 lines of comments. The total patchset in comparison is:
35 files changed, 1890 insertions(+), 71 deletions(-)
(and this over-estimates it because if this got removed then we'd still
have to add an async execution syscall.) And the code is pretty compact
and self-contained. Threadlets share much of the infrastructure with
syslets: for example the completion ring code is _100%_ shared, the
async execution code is 98% shared.
You are free to not like it though, and i'm willing to change any aspect
of the API to make it more intuitive and more useful, but calling it
'complexity' at this point is just handwaving. And believe it or not, a
good number of people actually find syslets pretty cool.
> Or with a simple/parellel async submission, coupled with threadlets,
> we can cover a pretty broad range of real life use cases?
sure, if we debate its virtualization driven market penetration via self
promoting technologies that also drive customer satisfaction, then we'll
be able to increase shareholder value by improving the user experience
and we'll also succeed in turning this vision into a supply/demand
marketplace. Or not?
Ingo
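
The argument for pointer parameters made above (many atoms sharing one
fd, and a cached request reused by updating only req->fd) can be
illustrated with a small user-space sketch. The types and functions here
are hypothetical and simplified: struct mock_atom is not the real
syslet_uatom layout, and atom_arg() merely stands in for the kernel
fetching an argument through its pointer.

```c
#include <assert.h>
#include <stddef.h>

#define NR_ATOMS 4

/* Hypothetical, simplified atom: arguments are pointers to values. */
struct mock_atom {
	unsigned long nr;          /* system call number (placeholder) */
	long *ret_ptr;             /* where the return value would go */
	unsigned long *arg_ptr[6]; /* pointers to the argument values */
};

struct request {
	unsigned long fd;               /* shared by every atom of the request */
	unsigned long offset[NR_ATOMS]; /* per-atom argument */
	long ret[NR_ATOMS];
	struct mock_atom atom[NR_ATOMS];
};

/* Build the atoms once; afterwards they stay cached in constructed state. */
void request_init(struct request *req)
{
	assert(req != NULL);
	for (int i = 0; i < NR_ATOMS; i++) {
		req->offset[i] = (unsigned long)i * 4096;
		req->ret[i] = 0;
		req->atom[i].nr = 0;
		req->atom[i].ret_ptr = &req->ret[i];
		req->atom[i].arg_ptr[0] = &req->fd;        /* same fd for all */
		req->atom[i].arg_ptr[1] = &req->offset[i];
		for (int j = 2; j < 6; j++)
			req->atom[i].arg_ptr[j] = NULL;    /* unused arguments */
	}
}

/* Stand-in for the kernel reading argument idx; NULL means "not used". */
unsigned long atom_arg(const struct mock_atom *a, int idx)
{
	return a->arg_ptr[idx] ? *a->arg_ptr[idx] : 0;
}
```

Reusing a cached request is then a single store to req->fd: every atom
sees the new descriptor at fetch time and none of the atoms is rewritten.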
On Wed, 28 Feb 2007, Ingo Molnar wrote:
> > Or with a simple/parellel async submission, coupled with threadlets,
> > we can cover a pretty broad range of real life use cases?
>
> sure, if we debate its virtualization driven market penetration via self
> promoting technologies that also drive customer satisfaction, then we'll
> be able to increase shareholder value by improving the user experience
> and we'll also succeed in turning this vision into a supply/demand
> marketplace. Or not?
Okay then, I guess it's good to go as is :)
- Davide
* Linus Torvalds wrote:
> So I would repeat my call for getting rid of the atoms, and instead
> just do a "single submission" at a time. Do the linking by running a
> threadlet that has user space code (and the stack overhead), which is
> MUCH more flexible. And do nonlinked single system calls without
> *either* atoms *or* a user-space stack footprint.
I agree that threadlets are much more flexible - and they might in fact
win in the long run due to that.
i'll add a one-shot syscall API in v6 and then we'll be able to see them
side by side. (wanted to do that in v5 but it got delayed by x86_64
issues, x86_64's entry code is certainly ... tricky wrt. ptregs saving)
wrt. one-shot syscalls, the user-space stack footprint would still
probably be there, because even async contexts that only do single-shot
processing need to drop out of kernel mode to handle signals. We could
probably hack the signal routing code to never deliver to such threads
(but bounce it over to the head context, which is always available) but
i think that would be a bit messy. (i dont exclude it though)
I think syslets might also act as a prototyping platform for new system
calls. If any particular syslet atom string comes up more frequently
(and we could even automate the profiling of that within the kernel),
then it's a good candidate for a standalone syscall. Currently we dont
have such information in any structured way: the connection between
streams of syscalls done by applications is totally opaque to the
kernel.
Also, i genuinely believe that to be competitive (performance-wise) with
fully in-kernel queueing solutions, we need syslets - the syslet NULL
overhead is 20 cycles (this includes copying, engine overhead, etc.),
the syscall NULL overhead is 280-300 cycles. The syslet engine could
probably be made more capable by providing more special system calls like
sys_upcall() to
execute a user-space function. (that way a syslet could still execute
user-space code without having to exit out of kernel mode too
frequently) Or perhaps a sys_x86_bytecode() call, that would execute a
pre-verified, kernel-stored sequence of simplified x86 bytecode, using
the kernel stack.
My fear is that if we force all these things over to one-shot syscalls
or threadlets then this will become another second-tier mechanism. By
providing syslets we give the message: "sure, come on and play within
the kernel if you want to, but it's not easy".
Ingo
On Thu, Mar 01, 2007 at 12:12:28AM +0100, Ingo Molnar wrote:
> more capable by providing more special system calls like sys_upcall() to
> execute a user-space function. (that way a syslet could still execute
> user-space code without having to exit out of kernel mode too
> frequently) Or perhaps a sys_x86_bytecode() call, that would execute a
> pre-verified, kernel-stored sequence of simplified x86 bytecode, using
> the kernel stack.
Which means the userspace code would then run with kernel privilege
level somehow (after security verifier, whatever). You remember that I
think it's a plain crazy idea...
I don't want to argue about syslets, threadlets, whatever async or
syscall-merging mechanism here, I'm just focusing on the idea of yours
of running userland code in kernel space somehow (I hoped you had given up
on it by now). Fixing the greatest syslets limitation is going to open
a can of worms as far as security is concerned.
The fact that userland code must not run with kernel privilege level,
is the reason why syslets aren't very useful (but again: focusing on
the syslets vs async-syscalls isn't my interest).
Frankly I think this idea of running userland code with kernel
privileges fits in the same category as porting Linux to segmentation
to avoid the cost of pagetables to gain some bit of performance
despite losing in many other areas. Nobody in real life will want to
make that trade, for such an incredibly small performance
improvement.
For things that can be frequently combined, it's much simpler and
cheaper to create a "merged" syscall (i.e. sys_spawn =
sys_fork+sys_exec) than to invent a way to upload userland generated
bytecodes to kernel space to do that.
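
For the fork+exec combination mentioned above, the closest existing
user-space analogue is the POSIX posix_spawn() family. It is a library
interface, not the sys_spawn system call hypothesized in the message, and
the C library may still implement it with separate clone/exec steps; the
sketch only shows what a "merged" call looks like from the caller's side.
The helper name spawn_and_wait() is made up for this example.

```c
#include <assert.h>
#include <spawn.h>
#include <stddef.h>
#include <sys/types.h>
#include <sys/wait.h>

extern char **environ;

/* Start argv[0] (looked up in PATH) in one call and wait for it.
 * Returns the child's exit status, or -1 on failure or abnormal exit. */
int spawn_and_wait(char *const argv[])
{
	pid_t pid;
	int status;

	assert(argv != NULL && argv[0] != NULL);

	/* One call replaces the usual fork() + exec() pair. */
	if (posix_spawnp(&pid, argv[0], NULL, NULL, argv, environ) != 0)
		return -1;
	if (waitpid(pid, &status, 0) < 0)
		return -1;
	return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```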