2007-01-30 21:40:24

by Zach Brown

Subject: [PATCH 0 of 4] Generic AIO by scheduling stacks

This very rough patch series introduces a different way to provide AIO support
for system calls.

Right now to provide AIO support for a system call you have to express your
interface in the iocb argument struct for sys_io_submit(), teach fs/aio.c to
translate this into some call path in the kernel that passes in an iocb, and
then update your code path to implement either completion-based (EIOCBQUEUED)
or retry-based (EIOCBRETRY) AIO with the iocb.

This patch series changes this by moving the complexity into generic code such
that a system call handler would provide AIO support in exactly the same way
that it supports a synchronous call. It does this by letting a task have
multiple stacks executing system calls in the kernel. Stacks are switched in
schedule() as they block and are made runnable.

First, let's introduce the term 'fibril'. It means small fiber or thread. It
represents the stack and the bits of data which manage its scheduling under the
task_struct. There's a 1:1:1 relationship between a fibril, the stack it's
managing, and the thread_struct it's operating under. They're static for the
fibril's lifetime. I settled on a funny new term after a few iterations of
trying to use existing terms (stack, call, thread, task, path, fiber) without
going insane. They were just too confusing to use with a clear conscience.

So, let's illustrate the changes by walking through the execution of an asys
call. Let's say sys_nanosleep(). Then we'll talk about the trade-offs.

Maybe it'd make sense to walk through the patches in another window while
reading the commentary. I'm sorry if this seems tedious, but this is
non-trivial stuff. I want to get the point across.

We start in sys_asys_submit(). It allocates a fibril for the executing
submission syscall itself and hangs it off the task_struct. This lets the
submission fibril be scheduled along with the asys system calls it submits.

sys_asys_submit() has arguments which specify system call numbers and
arguments. For each of these calls the submission syscall allocates a fibril
and constructs an initial stack. The fibril has eip pointing to the system
call handler and esp pointing to the new stack such that when the handler is
called its arguments will be on the stack. The stack is also constructed so
that when the handler returns it jumps to a function which queues the return
code for collection in userspace.
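
To make that concrete, here's a rough sketch of what seeding a call fibril
might look like on i386. The struct asys_input layout and the
asys_init_fibril() helper are made up for illustration (the submission patch
isn't included in this posting); only the idea of pointing eip at the handler,
esp at a hand-built frame, and planting asys_teardown_stack() as the return
address comes from the description above.

struct asys_input {
	long syscall_nr;
	long nr_args;
	long args[6];
};

/*
 * Hypothetical sketch: build a frame so that jumping to the asmlinkage
 * handler looks like a normal call whose return lands in
 * asys_teardown_stack().  This ignores the pt_regs region reserved at the
 * top of the stack, and in reality the "return address" would probably be
 * a small asm thunk that moves the handler's %eax return value into an
 * argument for asys_teardown_stack().
 */
static void asys_init_fibril(struct fibril *fibril, struct thread_info *ti,
			     struct asys_input *inp, unsigned long handler)
{
	unsigned long *stack = (unsigned long *)((char *)ti + THREAD_SIZE);
	int i;

	/* i386 asmlinkage handlers take their arguments from the stack */
	for (i = inp->nr_args - 1; i >= 0; i--)
		*--stack = inp->args[i];

	/* returning from the handler "returns" into asys_teardown_stack() */
	*--stack = (unsigned long)asys_teardown_stack;

	INIT_LIST_HEAD(&fibril->run_list);
	fibril->ti = ti;
	fibril->eip = handler;
	fibril->esp = (unsigned long)stack;
}

Note that the new thread_info itself isn't filled in here; as described below,
the switch path copies the non-stack parts of thread_info over when the fibril
is first switched to.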

sys_asys_submit() asks the scheduler to put these new fibrils on the simple run
queue in the task_struct. It's just a list_head for now. It then calls
schedule().
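
Putting those pieces together, the per-call work in sys_asys_submit() might
reduce to something like the following. sched_new_runnable_fibril() is real
(it appears in the second patch); the asys_call container, asys_submit_one(),
and the use of sys_call_table are again assumptions for the sake of
illustration.

/* hypothetical container; the real asys_call also carries completion state */
struct asys_call {
	struct asys_completion *comp;	/* queued by asys_teardown_stack() */
	struct fibril fibril;
};

/* error handling and the sys_call_table declaration omitted for brevity */
static long asys_submit_one(struct asys_input *input)
{
	struct asys_call *call = kzalloc(sizeof(*call), GFP_KERNEL);
	struct thread_info *ti = alloc_thread_info(current);

	asys_init_fibril(&call->fibril, ti, input,
			 (unsigned long)sys_call_table[input->syscall_nr]);
	sched_new_runnable_fibril(&call->fibril);
	return 0;
}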

After we've gotten the run queue lock in the scheduler we notice that there are
fibrils on the task_struct's run queue. Instead of continuing with schedule(),
then, we switch fibrils. The old submission fibril will still be in schedule()
and we'll start executing sys_nanosleep() in the context of the submission
task_struct.

The specific switching mechanics of this implementation rely on the notion of
tracking a stack as a full thread_info pointer. To make the switch we transfer
the non-stack bits of the thread_info from the old fibril's ti to the new
fibril's ti. We update the bookkeeping in the task_struct to
consider the new thread_info as the current thread_info for the task. Like so:

*next->ti = *ti;
*thread_info_pt_regs(next->ti) = *thread_info_pt_regs(ti);

current->thread_info = next->ti;
current->thread.esp0 = (unsigned long)(thread_info_pt_regs(next->ti) + 1);
current->fibril = next;
current->state = next->state;
current->per_call = next->per_call;

Yeah, messy. I'm interested in aggressive feedback on how to do this sanely.
Especially from the perspective of worrying about all the archs.

Did everyone catch that "per_call" thing there? That's to switch members of
task_struct which are local to a specific call. link_count, journal_info, that
sort of thing. More on that as we talk about the costs later.

After the switch we're executing in sys_nanosleep(). Eventually it gets to the
point where it's building its timer which will wake it after the sleep
interval. Currently it would store a bare task_struct reference for
wake_up_process(). Instead we introduce a helper which returns a cookie that
is given to a specific wake_up_*() variant. Like so:

- sl->task = task;
+ sl->wake_target = task_wake_target(task);

It then marks itself as TASK_INTERRUPTIBLE and calls schedule(). schedule()
notices that we have another fibril on the run queue. It's the submission
fibril that we switched from earlier. When we switched away from it we saw
that it was still TASK_RUNNING, so we put it on the run queue. We now switch
back to
the submission fibril, leaving the sys_nanosleep() fibril sleeping. Let's say
the submission task returns to userspace which then immediately calls
sys_asys_await_completion(). It's an easy case :). It goes to sleep, there
are no running fibrils and the schedule() path really puts the task to sleep.

Eventually the timer fires and the hrtimer code path wakes the fibril:

- if (task)
- wake_up_process(task);
+ if (wake_target)
+ wake_up_target(wake_target);

We've doctored try_to_wake_up() to be able to tell if its argument is a
task_struct or one of these fibril targets. In the fibril case it calls
try_to_wake_up_fibril(). It notices that the target fibril does need to be
woken and sets it TASK_RUNNING. It notices that the fibril isn't current in
the task, so it puts the fibril on the task's fibril run queue and wakes the
task. There's
grossness here. It needs the task to come through schedule() again so that it
can find the new runnable fibril instead of continuing to execute its current
fibril. To this end, wake-up marks the task's current ti TIF_NEED_RESCHED.
This seems to work, but there are some pretty terrifying interactions between
schedule, wake-up, and the maintenance of fibril->state and task->state that
need to be sorted out.
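
How a wake target cookie encodes the difference between a task and a fibril
isn't spelled out here, but the "& ~1UL" in try_to_wake_up_fibril() in the
last patch suggests a tagged pointer. The helpers below are a guess at that
encoding, not the actual patch:

/*
 * Speculative sketch: a bare task_struct pointer means "wake the task", a
 * fibril pointer with bit 0 set means "wake this fibril".  Both are at
 * least word-aligned, so the low bit is free to act as a tag.
 */
void *task_wake_target(struct task_struct *task)
{
	if (task->fibril)
		return (void *)((unsigned long)task->fibril | 1UL);
	return task;
}

struct task_struct *wake_target_to_task(void *wake_target)
{
	unsigned long val = (unsigned long)wake_target;

	if (val & 1UL) {
		struct fibril *fibril = (struct fibril *)(val & ~1UL);
		/* assumes the fibril's thread_info points back at its task */
		return fibril->ti->task;
	}
	return wake_target;
}

try_to_wake_up() would test the same bit to decide whether to fall into
try_to_wake_up_fibril().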

Remember our task was sleeping in asys_await_completion()? The task was woken
by the fibril wake-up path, but it's still executing the
asys_await_completion() fibril. It comes out of schedule() and sees
TIF_NEED_RESCHED and comes back through the top of schedule(). This time it
finds the runnable sys_nanosleep() fibril and switches to it. sys_nanosleep()
runs to completion and it returns which, because of the way we built its stack,
calls asys_teardown_stack().

asys_teardown_stack() takes the return code and puts it off in a list for
asys_await_completion(). It wakes a wait_queue to notify waiters of pending
completions. In so doing it wakes up the asys_await_completion() fibril that
was sleeping in our task.

Then it has to tear down the fibril for the call that just completed. In the
current implementation the fibril struct is actually embedded in an "asys_call"
struct. asys_teardown_stack() frees the asys_call struct, and so the fibril,
after having cleared current->fibril. It then calls schedule(). Our
asys_await_completion() fibril is on the run queue so we switch to it.
Switching notices the null current->fibril that we're switching from and takes
that as a cue to mark the previous thread_info for freeing *after* the
switch.

After the switch we're in asys_await_completion(). We find the waiting return
code completion struct in the list that was left by asys_teardown_stack(). We
return it to userspace.
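
Neither asys_teardown_stack() nor sys_asys_await_completion() is included in
this posting, so here's a hedged sketch of what they might look like. The
asys_lock, asys_completions, and asys_wait members of task_struct and the
exact signatures are assumptions, and the completion record is assumed to have
been allocated along with the asys_call at submission time; only the ordering
(queue the return code, wake waiters, clear current->fibril, free the
asys_call, schedule away) comes from the walkthrough above.

struct asys_completion {
	struct list_head list;
	long return_code;
};

/* the call's handler "returns" here; the return code would really arrive
 * in %eax via a small asm thunk */
void asys_teardown_stack(long ret)
{
	struct asys_call *call = container_of(current->fibril,
					      struct asys_call, fibril);
	struct asys_completion *comp = call->comp;

	comp->return_code = ret;

	spin_lock(&current->asys_lock);
	list_add_tail(&comp->list, &current->asys_completions);
	spin_unlock(&current->asys_lock);
	wake_up(&current->asys_wait);

	/* a NULL current->fibril tells the switch path to free this stack's
	 * thread_info once we've switched away */
	current->fibril = NULL;
	kfree(call);
	schedule();
	/* never reached */
}

asmlinkage long sys_asys_await_completion(long __user *ret_ptr)
{
	struct asys_completion *comp;
	long ret;

	for (;;) {
		wait_event(current->asys_wait,
			   !list_empty(&current->asys_completions));

		spin_lock(&current->asys_lock);
		if (!list_empty(&current->asys_completions))
			break;
		spin_unlock(&current->asys_lock);
	}

	comp = list_entry(current->asys_completions.next,
			  struct asys_completion, list);
	list_del(&comp->list);
	spin_unlock(&current->asys_lock);

	ret = comp->return_code;
	kfree(comp);
	return put_user(ret, ret_ptr);
}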

Phew. OK, so what are the trade-offs here? I'll start with the benefits for
obvious reasons :).

- We get AIO support for all syscalls. Every single one. (Well, please, no
asys sys_exit() :)). Buffered IO, sendfile, recvmsg, poll, epoll, hardware
crypto ioctls, open, mmap, getdents, the entire splice API, etc.

- The syscall API does not change just because calls are being issued as AIO,
particularly things that reference task_struct. AIO sys_getpid() does what
you'd expect, signal masking, etc. You don't have to worry about your AIO call
being secretly handled by some worker threads that get very different results
from current-> references.

- We wouldn't multiply testing and maintenance burden with separate AIO paths.
No is_sync_kiocb() testing and divergence between returning or calling
aio_complete(). No auditing to make sure that EIOCBRETRY is only returned
after any significant references of current->. No worries about completion
racing from the submission return path and some aio_complete() being called
from another context. In this scheme if your sync syscall path isn't broken,
your AIO path stands a great chance of working.

- The submission syscall won't block while handling submitted calls. Not for
metadata IO, not for memory allocation, not for mutex contention, nothing.

- AIO syscalls which *don't* block see very little overhead. They'll allocate
stacks and juggle the run queue locks a little, but they'll execute in turn on
the submitting (cache-hot, presumably) processor. There's room to optimize
this path, too, of course.

- We don't need to duplicate interfaces for each AIO interface we want to
support. No iocb unions mirroring the syscall API, no magical AIO sys_
variants.

And the costs? It's not free.

- The 800lb elephant in the room. It uses a full stack per blocked operation.
I believe this is a reasonable price to pay for the flexibility of having *any*
call pending. It rules out some loads which would want to keep *millions* of
operations pending, but I humbly submit that a load rarely takes that number of
concurrent ops to saturate a resource. (think of it this way: we've gotten
this far by having to burn a full *task* to have *any* syscall pending.) While
not optimal, it opens the door to a lot of functionality without having to
rewrite the kernel as a giant non-blocking state machine.

It should be noted that my very first try was to copy the used part of stacks
into and out of one fully allocated stack. This uses less memory per blocking
operation at the cpu cost of copying the used regions. And it's a terrible
idea, for at least two reasons. First, to actually get the memory overhead
savings you have to allocate at stack switch time. If that allocation can't be
satisfied you are in *trouble* because you might not be able to switch over to
a fibril that is trying to free up memory. Deadlock city. Second, it means
that *you can't reference on-stack data in the wake-up path*. This is a
nightmare. Even our trivial sys_nanosleep() example would have had to take its
hrtimer_sleeper off the stack and allocate it. Never mind, you know, basically
every single user of <linux/wait.h>. My current thinking is that it's just
not worth it.

- We would now have some measure of task_struct concurrency. Read that twice,
it's scary. As two fibrils execute and block in turn they'll each be
referencing current->. It means that we need to audit task_struct to make sure
that paths can handle racing as it's scheduled away. The current implementation
*does not* let preemption trigger a fibril switch. So one only has to worry
about racing with voluntary scheduling of the fibril paths. This can mean
moving some task_struct members under an accessor that hides them in a struct
in task_struct so they're switched along with the fibril. I think this is a
manageable burden.

- The fibrils can only execute in the submitter's task_struct. While I think
this is in fact a feature, it does imply some interesting behaviour.
Submitters will be required to explicitly manage any concurrency between CPUs
by issuing their ops in separate tasks. To guarantee forward progress in
syscall handling
paths (releasing i_mutex, say) we'll have to interrupt userspace when a fibril
is ready to run.

- Signals. I have no idea what behaviour we want. Help? My first guess is
that we'll want signal state to be shared by fibrils by keeping it in the
task_struct. If we want something like individual cancellation, we'll augment
signal_pending() with some per-fibril test which will cause it to return from
TASK_INTERRUPTIBLE (the only reasonable way to implement generic cancellation,
I'll argue) as it would have if a signal were pending. A strawman sketch of
that test follows this list.

- lockdep and co. will need to be updated to track fibrils instead of tasks.
sysrq-t might want to dump fibril stacks, too. That kind of thing. Sorry.
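
Returning to the cancellation idea two points up: the per-fibril test could be
little more than a flag consulted next to TIF_SIGPENDING. This is purely
speculative; no 'cancelled' field exists in these patches.

/* purely speculative: make interruptible sleeps in a cancelled fibril
 * behave as though a signal were pending */
static inline int signal_pending(struct task_struct *p)
{
	if (unlikely(test_tsk_thread_flag(p, TIF_SIGPENDING)))
		return 1;
	/* hypothetical per-fibril cancellation flag; only meaningful for
	 * the currently executing fibril */
	return p == current && p->fibril && p->fibril->cancelled;
}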

As for the current implementation, it's obviously just a rough sketch. I'm
sending it out in this state because this is the first point at which a tree
walker using AIO openat(), fstat(), and getdents() actually worked on ext3.
Generally, though, these are paths that I don't have the most experience in.
I'd be thrilled to implement whatever the experts think is the right way to do
this.

Blah blah blah. Too much typing. What do people think?


2007-01-30 21:39:48

by Zach Brown

Subject: [PATCH 1 of 4] Introduce per_call_chain()

There are members of task_struct which are only used by a given call chain to
pass arguments up and down the chain itself. They are logically thread-local
storage.

The patches later in the series want to have multiple calls pending for a given
task, though only one will be executing at a given time. By putting these
thread-local members of task_struct in a seperate storage structure we're able
to trivially swap them in and out as their calls are swapped in and out.

per_call_chain() doesn't have a terribly great name. It was chosen in the
spirit of per_cpu().

The storage was left inline in task_struct to avoid introducing indirection for
the vast majority of uses which will never have multiple calls executing in a
task.

I chose a few members of task_struct to migrate under per_call_chain() along
with the introduction as an example of what it looks like. These would be
separate patches in a patch series that was suitable for merging.

diff -r b1128b48dc99 -r 26e278468209 fs/jbd/journal.c
--- a/fs/jbd/journal.c Fri Jan 12 20:00:03 2007 +0000
+++ b/fs/jbd/journal.c Mon Jan 29 15:36:13 2007 -0800
@@ -471,7 +471,7 @@ int journal_force_commit_nested(journal_
tid_t tid;

spin_lock(&journal->j_state_lock);
- if (journal->j_running_transaction && !current->journal_info) {
+ if (journal->j_running_transaction && !per_call_chain(journal_info)) {
transaction = journal->j_running_transaction;
__log_start_commit(journal, transaction->t_tid);
} else if (journal->j_committing_transaction)
diff -r b1128b48dc99 -r 26e278468209 fs/jbd/transaction.c
--- a/fs/jbd/transaction.c Fri Jan 12 20:00:03 2007 +0000
+++ b/fs/jbd/transaction.c Mon Jan 29 15:36:13 2007 -0800
@@ -279,12 +279,12 @@ handle_t *journal_start(journal_t *journ
if (!handle)
return ERR_PTR(-ENOMEM);

- current->journal_info = handle;
+ per_call_chain(journal_info) = handle;

err = start_this_handle(journal, handle);
if (err < 0) {
jbd_free_handle(handle);
- current->journal_info = NULL;
+ per_call_chain(journal_info) = NULL;
handle = ERR_PTR(err);
}
return handle;
@@ -1368,7 +1368,7 @@ int journal_stop(handle_t *handle)
} while (old_handle_count != transaction->t_handle_count);
}

- current->journal_info = NULL;
+ per_call_chain(journal_info) = NULL;
spin_lock(&journal->j_state_lock);
spin_lock(&transaction->t_handle_lock);
transaction->t_outstanding_credits -= handle->h_buffer_credits;
diff -r b1128b48dc99 -r 26e278468209 fs/namei.c
--- a/fs/namei.c Fri Jan 12 20:00:03 2007 +0000
+++ b/fs/namei.c Mon Jan 29 15:36:13 2007 -0800
@@ -628,20 +628,20 @@ static inline int do_follow_link(struct
static inline int do_follow_link(struct path *path, struct nameidata *nd)
{
int err = -ELOOP;
- if (current->link_count >= MAX_NESTED_LINKS)
+ if (per_call_chain(link_count) >= MAX_NESTED_LINKS)
goto loop;
- if (current->total_link_count >= 40)
+ if (per_call_chain(total_link_count) >= 40)
goto loop;
BUG_ON(nd->depth >= MAX_NESTED_LINKS);
cond_resched();
err = security_inode_follow_link(path->dentry, nd);
if (err)
goto loop;
- current->link_count++;
- current->total_link_count++;
+ per_call_chain(link_count)++;
+ per_call_chain(total_link_count)++;
nd->depth++;
err = __do_follow_link(path, nd);
- current->link_count--;
+ per_call_chain(link_count)--;
nd->depth--;
return err;
loop:
@@ -1025,7 +1025,7 @@ int fastcall link_path_walk(const char *

int fastcall path_walk(const char * name, struct nameidata *nd)
{
- current->total_link_count = 0;
+ per_call_chain(total_link_count) = 0;
return link_path_walk(name, nd);
}

@@ -1153,7 +1153,7 @@ static int fastcall do_path_lookup(int d

fput_light(file, fput_needed);
}
- current->total_link_count = 0;
+ per_call_chain(total_link_count) = 0;
retval = link_path_walk(name, nd);
out:
if (likely(retval == 0)) {
diff -r b1128b48dc99 -r 26e278468209 include/linux/init_task.h
--- a/include/linux/init_task.h Fri Jan 12 20:00:03 2007 +0000
+++ b/include/linux/init_task.h Mon Jan 29 15:36:13 2007 -0800
@@ -88,6 +88,11 @@ extern struct nsproxy init_nsproxy;

extern struct group_info init_groups;

+#define INIT_PER_CALL_CHAIN(tsk) \
+{ \
+ .journal_info = NULL, \
+}
+
/*
* INIT_TASK is used to set up the first task table, touch at
* your own risk!. Base=0, limit=0x1fffff (=2MB)
@@ -124,6 +129,7 @@ extern struct group_info init_groups;
.keep_capabilities = 0, \
.user = INIT_USER, \
.comm = "swapper", \
+ .per_call = INIT_PER_CALL_CHAIN(tsk), \
.thread = INIT_THREAD, \
.fs = &init_fs, \
.files = &init_files, \
@@ -135,7 +141,6 @@ extern struct group_info init_groups;
.signal = {{0}}}, \
.blocked = {{0}}, \
.alloc_lock = __SPIN_LOCK_UNLOCKED(tsk.alloc_lock), \
- .journal_info = NULL, \
.cpu_timers = INIT_CPU_TIMERS(tsk.cpu_timers), \
.fs_excl = ATOMIC_INIT(0), \
.pi_lock = SPIN_LOCK_UNLOCKED, \
diff -r b1128b48dc99 -r 26e278468209 include/linux/jbd.h
--- a/include/linux/jbd.h Fri Jan 12 20:00:03 2007 +0000
+++ b/include/linux/jbd.h Mon Jan 29 15:36:13 2007 -0800
@@ -883,7 +883,7 @@ extern void __wait_on_journal (journal_

static inline handle_t *journal_current_handle(void)
{
- return current->journal_info;
+ return per_call_chain(journal_info);
}

/* The journaling code user interface:
diff -r b1128b48dc99 -r 26e278468209 include/linux/sched.h
--- a/include/linux/sched.h Fri Jan 12 20:00:03 2007 +0000
+++ b/include/linux/sched.h Mon Jan 29 15:36:13 2007 -0800
@@ -784,6 +784,20 @@ static inline void prefetch_stack(struct
static inline void prefetch_stack(struct task_struct *t) { }
#endif

+/*
+ * Members of this structure are used to pass arguments down call chains
+ * without specific arguments. Historically they lived on task_struct,
+ * putting them in one place gives us some flexibility. They're accessed
+ * with per_call_chain(name).
+ */
+struct per_call_chain_storage {
+ int link_count; /* number of links in one symlink */
+ int total_link_count; /* total links followed in a lookup */
+ void *journal_info; /* journalling filesystem info */
+};
+
+#define per_call_chain(foo) current->per_call.foo
+
struct audit_context; /* See audit.c */
struct mempolicy;
struct pipe_inode_info;
@@ -920,7 +934,7 @@ struct task_struct {
it with task_lock())
- initialized normally by flush_old_exec */
/* file system info */
- int link_count, total_link_count;
+ struct per_call_chain_storage per_call;
#ifdef CONFIG_SYSVIPC
/* ipc stuff */
struct sysv_sem sysvsem;
@@ -993,9 +1007,6 @@ struct task_struct {
struct held_lock held_locks[MAX_LOCK_DEPTH];
unsigned int lockdep_recursion;
#endif
-
-/* journalling filesystem info */
- void *journal_info;

/* VM state */
struct reclaim_state *reclaim_state;

2007-01-30 21:39:49

by Zach Brown

Subject: [PATCH 2 of 4] Introduce i386 fibril scheduling

This patch introduces the notion of a 'fibril'. It's meant to be a lighter
kernel thread. There can be multiple of them in the process of executing for a
given task_struct, but only one can ever be actively running at a time. Think
of it as a stack and some metadata for scheduling it inside the task_struct.

This implementation is wildly architecture-specific but isn't put in the right
places. Since these are not code paths that I have extensive experience with,
I focused more on getting it going and being representative of the concept than on
making it right on the first try. I'm actively interested in feedback from
people who know more about the places this touches.

The fibril struct itself is left stand-alone for clarity. There is a 1:1
relationship between fibrils and struct thread_info, though, so it might make
more sense to embed the two somehow.

The use of list_head for the run queue is simplistic. As long as we're not
removing specific fibrils from the list, which seems unlikely, we could be more
clever. Maybe no more clever than a singly-linked list, though.

Fibril management is under the runqueue lock because that ends up working well
for the wake-up path as well. In the current patch, though, it makes for some
pretty sloppy code for unlocking the runqueue lock (and re-enabling interrupts
and pre-emption) on the other side of the switch.

The actual mechanics of switching from one stack to another at the end of
schedule_fibril() makes me nervous. I'm not convinced that blindly copying the
contents of thread_info from the previous to the next stack is safe, even if
done with interrupts disabled. (NMIs?) The juggling of current->thread_info
might be racy, etc.

diff -r 26e278468209 -r df7bc026d50e arch/i386/kernel/process.c
--- a/arch/i386/kernel/process.c Mon Jan 29 15:36:13 2007 -0800
+++ b/arch/i386/kernel/process.c Mon Jan 29 15:36:16 2007 -0800
@@ -698,6 +698,28 @@ struct task_struct fastcall * __switch_t
return prev_p;
}

+/*
+ * We've just switched the stack and instruction pointer to point to a new
+ * fibril. We were called from schedule() -> schedule_fibril() with the
+ * runqueue lock held _irq and with preemption disabled.
+ *
+ * We let finish_fibril_switch() unwind the state that was built up by
+ * our callers. We do that here so that we don't need to ask fibrils to
+ * first execute something analagous to schedule_tail(). Maybe that's
+ * wrong.
+ *
+ * We'd also have to reacquire the kernel lock here. For now we know the
+ * BUG_ON(lock_depth) prevents us from having to worry about it.
+ */
+void fastcall __switch_to_fibril(struct thread_info *ti)
+{
+ finish_fibril_switch();
+
+ /* free the ti if schedule_fibril() told us that it's done */
+ if (ti->status & TS_FREE_AFTER_SWITCH)
+ free_thread_info(ti);
+}
+
asmlinkage int sys_fork(struct pt_regs regs)
{
return do_fork(SIGCHLD, regs.esp, &regs, 0, NULL, NULL);
diff -r 26e278468209 -r df7bc026d50e include/asm-i386/system.h
--- a/include/asm-i386/system.h Mon Jan 29 15:36:13 2007 -0800
+++ b/include/asm-i386/system.h Mon Jan 29 15:36:16 2007 -0800
@@ -31,6 +31,31 @@ extern struct task_struct * FASTCALL(__s
"=a" (last),"=S" (esi),"=D" (edi) \
:"m" (next->thread.esp),"m" (next->thread.eip), \
"2" (prev), "d" (next)); \
+} while (0)
+
+struct thread_info;
+void fastcall __switch_to_fibril(struct thread_info *ti);
+
+/*
+ * This is called with the run queue lock held _irq and with preemption
+ * disabled. __switch_to_fibril drops those.
+ */
+#define switch_to_fibril(prev, next, ti) do { \
+ unsigned long esi,edi; \
+ asm volatile("pushfl\n\t" /* Save flags */ \
+ "pushl %%ebp\n\t" \
+ "movl %%esp,%0\n\t" /* save ESP */ \
+ "movl %4,%%esp\n\t" /* restore ESP */ \
+ "movl $1f,%1\n\t" /* save EIP */ \
+ "pushl %5\n\t" /* restore EIP */ \
+ "jmp __switch_to_fibril\n" \
+ "1:\t" \
+ "popl %%ebp\n\t" \
+ "popfl" \
+ :"=m" (prev->esp),"=m" (prev->eip), \
+ "=S" (esi),"=D" (edi) \
+ :"m" (next->esp),"m" (next->eip), \
+ "d" (prev), "a" (ti)); \
} while (0)

#define _set_base(addr,base) do { unsigned long __pr; \
diff -r 26e278468209 -r df7bc026d50e include/asm-i386/thread_info.h
--- a/include/asm-i386/thread_info.h Mon Jan 29 15:36:13 2007 -0800
+++ b/include/asm-i386/thread_info.h Mon Jan 29 15:36:16 2007 -0800
@@ -91,6 +91,12 @@ static inline struct thread_info *curren
static inline struct thread_info *current_thread_info(void)
{
return (struct thread_info *)(current_stack_pointer & ~(THREAD_SIZE - 1));
+}
+
+/* XXX perhaps should be integrated with task_pt_regs(task) */
+static inline struct pt_regs *thread_info_pt_regs(struct thread_info *info)
+{
+ return (struct pt_regs *)(KSTK_TOP(info)-8) - 1;
}

/* thread information allocation */
@@ -169,6 +175,7 @@ static inline struct thread_info *curren
*/
#define TS_USEDFPU 0x0001 /* FPU was used by this task this quantum (SMP) */
#define TS_POLLING 0x0002 /* True if in idle loop and not sleeping */
+#define TS_FREE_AFTER_SWITCH 0x0004 /* free ti in __switch_to_fibril() */

#define tsk_is_polling(t) ((t)->thread_info->status & TS_POLLING)

diff -r 26e278468209 -r df7bc026d50e include/linux/init_task.h
--- a/include/linux/init_task.h Mon Jan 29 15:36:13 2007 -0800
+++ b/include/linux/init_task.h Mon Jan 29 15:36:16 2007 -0800
@@ -111,6 +111,8 @@ extern struct group_info init_groups;
.cpus_allowed = CPU_MASK_ALL, \
.mm = NULL, \
.active_mm = &init_mm, \
+ .fibril = NULL, \
+ .runnable_fibrils = LIST_HEAD_INIT(tsk.runnable_fibrils), \
.run_list = LIST_HEAD_INIT(tsk.run_list), \
.ioprio = 0, \
.time_slice = HZ, \
diff -r 26e278468209 -r df7bc026d50e include/linux/sched.h
--- a/include/linux/sched.h Mon Jan 29 15:36:13 2007 -0800
+++ b/include/linux/sched.h Mon Jan 29 15:36:16 2007 -0800
@@ -812,6 +812,38 @@ enum sleep_type {

struct prio_array;

+/*
+ * A 'fibril' is a very small fiber. It's used here to mean a small thread.
+ *
+ * (Choosing a weird new name avoided yet more overloading of 'task', 'call',
+ * 'thread', 'stack', 'fib{er,re}', etc).
+ *
+ * This structure is used by the scheduler to track multiple executing stacks
+ * inside a task_struct.
+ *
+ * Only one fibril executes for a given task_struct at a time. When it
+ * blocks, however, another fibril has the chance to execute while it sleeps.
+ * This means that call chains executing in fibrils can see concurrent
+ * current-> accesses at blocking points. "per_call_chain()" members are
+ * switched along with the fibril, so they remain local. Preemption *will not*
+ * trigger a fibril switch.
+ *
+ * XXX
+ * - arch specific
+ */
+struct fibril {
+ struct list_head run_list;
+ /* -1 unrunnable, 0 runnable, >0 stopped */
+ long state;
+ unsigned long eip;
+ unsigned long esp;
+ struct thread_info *ti;
+ struct per_call_chain_storage per_call;
+};
+
+void sched_new_runnable_fibril(struct fibril *fibril);
+void finish_fibril_switch(void);
+
struct task_struct {
volatile long state; /* -1 unrunnable, 0 runnable, >0 stopped */
struct thread_info *thread_info;
@@ -857,6 +889,20 @@ struct task_struct {
struct list_head ptrace_list;

struct mm_struct *mm, *active_mm;
+
+ /*
+ * The scheduler uses this to determine if the current call is a
+ * stand-alone task or a fibril. If it's a fibril then wake-ups
+ * will target the fibril and a schedule() might result in swapping
+ * in another runnable fibril. So to start executing fibrils at all
+ * one allocates a fibril to represent the running task and then
+ * puts initialized runnable fibrils in the run list.
+ *
+ * The state members of the fibril and runnable_fibrils list are
+ * managed under the task's run queue lock.
+ */
+ struct fibril *fibril;
+ struct list_head runnable_fibrils;

/* task state */
struct linux_binfmt *binfmt;
diff -r 26e278468209 -r df7bc026d50e kernel/exit.c
--- a/kernel/exit.c Mon Jan 29 15:36:13 2007 -0800
+++ b/kernel/exit.c Mon Jan 29 15:36:16 2007 -0800
@@ -854,6 +854,13 @@ fastcall NORET_TYPE void do_exit(long co
{
struct task_struct *tsk = current;
int group_dead;
+
+ /*
+ * XXX this is just a debug helper, this should be waiting for all
+ * fibrils to return. Possibly after sending them lots of -KILL
+ * signals?
+ */
+ BUG_ON(!list_empty(&current->runnable_fibrils));

profile_task_exit(tsk);

diff -r 26e278468209 -r df7bc026d50e kernel/fork.c
--- a/kernel/fork.c Mon Jan 29 15:36:13 2007 -0800
+++ b/kernel/fork.c Mon Jan 29 15:36:16 2007 -0800
@@ -1179,6 +1179,9 @@ static struct task_struct *copy_process(

/* for sys_ioprio_set(IOPRIO_WHO_PGRP) */
p->ioprio = current->ioprio;
+
+ p->fibril = NULL;
+ INIT_LIST_HEAD(&p->runnable_fibrils);

/*
* The task hasn't been attached yet, so its cpus_allowed mask will
diff -r 26e278468209 -r df7bc026d50e kernel/sched.c
--- a/kernel/sched.c Mon Jan 29 15:36:13 2007 -0800
+++ b/kernel/sched.c Mon Jan 29 15:36:16 2007 -0800
@@ -3407,6 +3407,111 @@ static inline int interactive_sleep(enum
}

/*
+ * This unwinds the state that was built up by schedule -> schedule_fibril().
+ * The arch-specific switch_to_fibril() path calls here once the new fibril
+ * is executing.
+ */
+void finish_fibril_switch(void)
+{
+ spin_unlock_irq(&this_rq()->lock);
+ preempt_enable_no_resched();
+}
+
+/*
+ * Add a new fibril to the runnable list. It'll be switched to next time
+ * the caller comes through schedule().
+ */
+void sched_new_runnable_fibril(struct fibril *fibril)
+{
+ struct task_struct *tsk = current;
+ unsigned long flags;
+ struct rq *rq = task_rq_lock(tsk, &flags);
+
+ fibril->state = TASK_RUNNING;
+ BUG_ON(!list_empty(&fibril->run_list));
+ list_add_tail(&fibril->run_list, &tsk->runnable_fibrils);
+
+ task_rq_unlock(rq, &flags);
+}
+
+/*
+ * This is called from schedule() when we're not being preempted and there is a
+ * fibril waiting in current->runnable_fibrils.
+ *
+ * This is called under the run queue lock to serialize fibril->state and the
+ * runnable_fibrils list with wake-up. We drop it before switching and the
+ * return path takes that into account.
+ *
+ * We always switch so that a caller can specifically make a single pass
+ * through the runnable fibrils by marking itself _RUNNING and calling
+ * schedule().
+ */
+void schedule_fibril(struct task_struct *tsk)
+{
+ struct thread_info *ti = task_thread_info(tsk);
+ struct fibril *prev;
+ struct fibril *next;
+ struct fibril dummy;
+
+ /*
+ * XXX We don't deal with the kernel lock yet. It'd need to be audited
+ * and lock_depth moved under per_call_chain().
+ */
+ BUG_ON(tsk->lock_depth >= 0);
+
+ next = list_entry(current->runnable_fibrils.next, struct fibril,
+ run_list);
+ list_del_init(&next->run_list);
+ BUG_ON(next->state != TASK_RUNNING);
+
+ prev = tsk->fibril;
+ if (prev) {
+ prev->state = tsk->state;
+ prev->per_call = current->per_call;
+ /*
+ * This catches the case where the caller wants to make a pass
+ * through runnable fibrils by marking itself _RUNNING and
+ * calling schedule(). A fibril should not be able to be on
+ * both tsk->fibril and the runnable_list.
+ */
+ if (prev->state == TASK_RUNNING) {
+ BUG_ON(!list_empty(&prev->run_list));
+ list_add_tail(&prev->run_list,
+ &current->runnable_fibrils);
+ }
+ } else {
+ /*
+ * To free a fibril the calling path can free the structure
+ * itself, set current->fibril to NULL, and call schedule().
+ * That causes us to tell __switch_to_fibril() to free the ti
+ * associated with the fibril once we've switched away from it.
+ * The dummy is just used to give switch_to_fibril() something
+ * to save state in to.
+ */
+ prev = &dummy;
+ }
+
+ /*
+ * XXX The idea is to copy all but the actual call stack. Obviously
+ * this is wildly arch-specific and belongs abstracted out.
+ */
+ *next->ti = *ti;
+ *thread_info_pt_regs(next->ti) = *thread_info_pt_regs(ti);
+
+ current->thread_info = next->ti;
+ current->thread.esp0 = (unsigned long)(thread_info_pt_regs(next->ti) + 1);
+ current->fibril = next;
+ current->state = next->state;
+ current->per_call = next->per_call;
+
+ if (prev == &dummy)
+ ti->status |= TS_FREE_AFTER_SWITCH;
+
+ /* __switch_to_fibril() drops the runqueue lock and enables preempt */
+ switch_to_fibril(prev, next, ti);
+}
+
+/*
* schedule() is the main scheduler function.
*/
asmlinkage void __sched schedule(void)
@@ -3468,6 +3573,22 @@ need_resched_nonpreemptible:
run_time /= (CURRENT_BONUS(prev) ? : 1);

spin_lock_irq(&rq->lock);
+
+ /* always switch to a runnable fibril if we aren't being preempted */
+ if (unlikely(!(preempt_count() & PREEMPT_ACTIVE) &&
+ !list_empty(&prev->runnable_fibrils))) {
+ schedule_fibril(prev);
+ /*
+ * finish_fibril_switch() drops the rq lock and enables
+ * preemption, but the popfl disables interrupts again. Watch
+ * me learn how context switch locking works before your very
+ * eyes! XXX This will need to be fixed up by throwing
+ * together something like the prepare_lock_switch() path the
+ * scheduler does. Guidance appreciated!
+ */
+ local_irq_enable();
+ return;
+ }

switch_count = &prev->nivcsw;
if (prev->state && !(preempt_count() & PREEMPT_ACTIVE)) {

2007-01-30 21:40:23

by Zach Brown

Subject: [PATCH 3 of 4] Teach paths to wake a specific void * target instead of a whole task_struct

The addition of multiple sleeping fibrils under a task_struct means that we
can't simply wake a task_struct to be able to wake a specific sleeping code
path.

This patch introduces task_wake_target() as a way to refer to a code path that
is about to sleep and will be woken in the future. Sleepers that used to wake
a current task_struct reference with wake_up_process() now use this helper to
get a wake target cookie and wake it with wake_up_target().

Some paths know that waking a task will be sufficient. Paths working with
kernel threads that never use fibrils fall into this category. They're changed
to use wake_up_task() instead of wake_up_process().

This is not an exhaustive patch. It isn't yet clear how signals are going to
interact with fibrils. Once that is decided, callers of wake_up_state() are
going to need to reflect the desired behaviour. I add __deprecated to it to
highlight this detail.

The actual act of performing the wake-up is hidden under try_to_wake_up() and
is serialized with the scheduler under the runqueue lock. This is very
fiddly stuff. I'm sure I've missed some details. I've tried to comment
the intent above try_to_wake_up_fibril().

diff -r df7bc026d50e -r 4ea674e8825e arch/i386/kernel/ptrace.c
--- a/arch/i386/kernel/ptrace.c Mon Jan 29 15:36:16 2007 -0800
+++ b/arch/i386/kernel/ptrace.c Mon Jan 29 15:46:47 2007 -0800
@@ -492,7 +492,7 @@ long arch_ptrace(struct task_struct *chi
child->exit_code = data;
/* make sure the single step bit is not set. */
clear_singlestep(child);
- wake_up_process(child);
+ wake_up_task(child);
ret = 0;
break;

@@ -508,7 +508,7 @@ long arch_ptrace(struct task_struct *chi
child->exit_code = SIGKILL;
/* make sure the single step bit is not set. */
clear_singlestep(child);
- wake_up_process(child);
+ wake_up_task(child);
break;

case PTRACE_SYSEMU_SINGLESTEP: /* Same as SYSEMU, but singlestep if not syscall */
@@ -526,7 +526,7 @@ long arch_ptrace(struct task_struct *chi
set_singlestep(child);
child->exit_code = data;
/* give it a chance to run. */
- wake_up_process(child);
+ wake_up_task(child);
ret = 0;
break;

diff -r df7bc026d50e -r 4ea674e8825e drivers/block/loop.c
--- a/drivers/block/loop.c Mon Jan 29 15:36:16 2007 -0800
+++ b/drivers/block/loop.c Mon Jan 29 15:46:47 2007 -0800
@@ -824,7 +824,7 @@ static int loop_set_fd(struct loop_devic
goto out_clr;
}
lo->lo_state = Lo_bound;
- wake_up_process(lo->lo_thread);
+ wake_up_task(lo->lo_thread);
return 0;

out_clr:
diff -r df7bc026d50e -r 4ea674e8825e drivers/md/dm-io.c
--- a/drivers/md/dm-io.c Mon Jan 29 15:36:16 2007 -0800
+++ b/drivers/md/dm-io.c Mon Jan 29 15:46:47 2007 -0800
@@ -18,7 +18,7 @@ struct io {
struct io {
unsigned long error;
atomic_t count;
- struct task_struct *sleeper;
+ void *wake_target;
io_notify_fn callback;
void *context;
};
@@ -110,8 +110,8 @@ static void dec_count(struct io *io, uns
set_bit(region, &io->error);

if (atomic_dec_and_test(&io->count)) {
- if (io->sleeper)
- wake_up_process(io->sleeper);
+ if (io->wake_target)
+ wake_up_task(io->wake_target);

else {
int r = io->error;
@@ -323,7 +323,7 @@ static int sync_io(unsigned int num_regi

io.error = 0;
atomic_set(&io.count, 1); /* see dispatch_io() */
- io.sleeper = current;
+ io.wake_target = task_wake_target(current);

dispatch_io(rw, num_regions, where, dp, &io, 1);

@@ -358,7 +358,7 @@ static int async_io(unsigned int num_reg
io = mempool_alloc(_io_pool, GFP_NOIO);
io->error = 0;
atomic_set(&io->count, 1); /* see dispatch_io() */
- io->sleeper = NULL;
+ io->wake_target = NULL;
io->callback = fn;
io->context = context;

diff -r df7bc026d50e -r 4ea674e8825e drivers/scsi/qla2xxx/qla_os.c
--- a/drivers/scsi/qla2xxx/qla_os.c Mon Jan 29 15:36:16 2007 -0800
+++ b/drivers/scsi/qla2xxx/qla_os.c Mon Jan 29 15:46:47 2007 -0800
@@ -2403,7 +2403,7 @@ qla2xxx_wake_dpc(scsi_qla_host_t *ha)
qla2xxx_wake_dpc(scsi_qla_host_t *ha)
{
if (ha->dpc_thread)
- wake_up_process(ha->dpc_thread);
+ wake_up_task(ha->dpc_thread);
}

/*
diff -r df7bc026d50e -r 4ea674e8825e drivers/scsi/scsi_error.c
--- a/drivers/scsi/scsi_error.c Mon Jan 29 15:36:16 2007 -0800
+++ b/drivers/scsi/scsi_error.c Mon Jan 29 15:46:47 2007 -0800
@@ -51,7 +51,7 @@ void scsi_eh_wakeup(struct Scsi_Host *sh
void scsi_eh_wakeup(struct Scsi_Host *shost)
{
if (shost->host_busy == shost->host_failed) {
- wake_up_process(shost->ehandler);
+ wake_up_task(shost->ehandler);
SCSI_LOG_ERROR_RECOVERY(5,
printk("Waking error handler thread\n"));
}
diff -r df7bc026d50e -r 4ea674e8825e fs/aio.c
--- a/fs/aio.c Mon Jan 29 15:36:16 2007 -0800
+++ b/fs/aio.c Mon Jan 29 15:46:47 2007 -0800
@@ -907,7 +907,7 @@ void fastcall kick_iocb(struct kiocb *io
* single context. */
if (is_sync_kiocb(iocb)) {
kiocbSetKicked(iocb);
- wake_up_process(iocb->ki_obj.tsk);
+ wake_up_target(iocb->ki_obj.wake_target);
return;
}

@@ -941,7 +941,7 @@ int fastcall aio_complete(struct kiocb *
BUG_ON(iocb->ki_users != 1);
iocb->ki_user_data = res;
iocb->ki_users = 0;
- wake_up_process(iocb->ki_obj.tsk);
+ wake_up_target(iocb->ki_obj.wake_target);
return 1;
}

@@ -1053,7 +1053,7 @@ struct aio_timeout {
struct aio_timeout {
struct timer_list timer;
int timed_out;
- struct task_struct *p;
+ void *wake_target;
};

static void timeout_func(unsigned long data)
@@ -1061,7 +1061,7 @@ static void timeout_func(unsigned long d
struct aio_timeout *to = (struct aio_timeout *)data;

to->timed_out = 1;
- wake_up_process(to->p);
+ wake_up_target(to->wake_target);
}

static inline void init_timeout(struct aio_timeout *to)
@@ -1070,7 +1070,7 @@ static inline void init_timeout(struct a
to->timer.data = (unsigned long)to;
to->timer.function = timeout_func;
to->timed_out = 0;
- to->p = current;
+ to->wake_target = task_wake_target(current);
}

static inline void set_timeout(long start_jiffies, struct aio_timeout *to,
diff -r df7bc026d50e -r 4ea674e8825e fs/direct-io.c
--- a/fs/direct-io.c Mon Jan 29 15:36:16 2007 -0800
+++ b/fs/direct-io.c Mon Jan 29 15:46:47 2007 -0800
@@ -124,7 +124,7 @@ struct dio {
spinlock_t bio_lock; /* protects BIO fields below */
unsigned long refcount; /* direct_io_worker() and bios */
struct bio *bio_list; /* singly linked via bi_private */
- struct task_struct *waiter; /* waiting task (NULL if none) */
+ void *wake_target; /* waiting initiator (NULL if none) */

/* AIO related stuff */
struct kiocb *iocb; /* kiocb */
@@ -278,8 +278,8 @@ static int dio_bio_end_aio(struct bio *b

spin_lock_irqsave(&dio->bio_lock, flags);
remaining = --dio->refcount;
- if (remaining == 1 && dio->waiter)
- wake_up_process(dio->waiter);
+ if (remaining == 1 && dio->wake_target)
+ wake_up_target(dio->wake_target);
spin_unlock_irqrestore(&dio->bio_lock, flags);

if (remaining == 0) {
@@ -309,8 +309,8 @@ static int dio_bio_end_io(struct bio *bi
spin_lock_irqsave(&dio->bio_lock, flags);
bio->bi_private = dio->bio_list;
dio->bio_list = bio;
- if (--dio->refcount == 1 && dio->waiter)
- wake_up_process(dio->waiter);
+ if (--dio->refcount == 1 && dio->wake_target)
+ wake_up_target(dio->wake_target);
spin_unlock_irqrestore(&dio->bio_lock, flags);
return 0;
}
@@ -393,12 +393,12 @@ static struct bio *dio_await_one(struct
*/
while (dio->refcount > 1 && dio->bio_list == NULL) {
__set_current_state(TASK_UNINTERRUPTIBLE);
- dio->waiter = current;
+ dio->wake_target = task_wake_target(current);
spin_unlock_irqrestore(&dio->bio_lock, flags);
io_schedule();
/* wake up sets us TASK_RUNNING */
spin_lock_irqsave(&dio->bio_lock, flags);
- dio->waiter = NULL;
+ dio->wake_target = NULL;
}
if (dio->bio_list) {
bio = dio->bio_list;
@@ -990,7 +990,7 @@ direct_io_worker(int rw, struct kiocb *i
spin_lock_init(&dio->bio_lock);
dio->refcount = 1;
dio->bio_list = NULL;
- dio->waiter = NULL;
+ dio->wake_target = NULL;

/*
* In case of non-aligned buffers, we may need 2 more
diff -r df7bc026d50e -r 4ea674e8825e fs/jbd/journal.c
--- a/fs/jbd/journal.c Mon Jan 29 15:36:16 2007 -0800
+++ b/fs/jbd/journal.c Mon Jan 29 15:46:47 2007 -0800
@@ -94,7 +94,7 @@ static void commit_timeout(unsigned long
{
struct task_struct * p = (struct task_struct *) __data;

- wake_up_process(p);
+ wake_up_task(p);
}

/*
diff -r df7bc026d50e -r 4ea674e8825e include/linux/aio.h
--- a/include/linux/aio.h Mon Jan 29 15:36:16 2007 -0800
+++ b/include/linux/aio.h Mon Jan 29 15:46:47 2007 -0800
@@ -98,7 +98,7 @@ struct kiocb {

union {
void __user *user;
- struct task_struct *tsk;
+ void *wake_target;
} ki_obj;

__u64 ki_user_data; /* user's data for completion */
@@ -124,7 +124,6 @@ struct kiocb {
#define is_sync_kiocb(iocb) ((iocb)->ki_key == KIOCB_SYNC_KEY)
#define init_sync_kiocb(x, filp) \
do { \
- struct task_struct *tsk = current; \
(x)->ki_flags = 0; \
(x)->ki_users = 1; \
(x)->ki_key = KIOCB_SYNC_KEY; \
@@ -133,7 +132,7 @@ struct kiocb {
(x)->ki_cancel = NULL; \
(x)->ki_retry = NULL; \
(x)->ki_dtor = NULL; \
- (x)->ki_obj.tsk = tsk; \
+ (x)->ki_obj.wake_target = task_wake_target(current); \
(x)->ki_user_data = 0; \
init_wait((&(x)->ki_wait)); \
} while (0)
diff -r df7bc026d50e -r 4ea674e8825e include/linux/freezer.h
--- a/include/linux/freezer.h Mon Jan 29 15:36:16 2007 -0800
+++ b/include/linux/freezer.h Mon Jan 29 15:46:47 2007 -0800
@@ -42,7 +42,7 @@ static inline int thaw_process(struct ta
{
if (frozen(p)) {
p->flags &= ~PF_FROZEN;
- wake_up_process(p);
+ wake_up_task(p);
return 1;
}
return 0;
diff -r df7bc026d50e -r 4ea674e8825e include/linux/hrtimer.h
--- a/include/linux/hrtimer.h Mon Jan 29 15:36:16 2007 -0800
+++ b/include/linux/hrtimer.h Mon Jan 29 15:46:47 2007 -0800
@@ -65,7 +65,7 @@ struct hrtimer {
*/
struct hrtimer_sleeper {
struct hrtimer timer;
- struct task_struct *task;
+ void *wake_target;
};

/**
diff -r df7bc026d50e -r 4ea674e8825e include/linux/kthread.h
--- a/include/linux/kthread.h Mon Jan 29 15:36:16 2007 -0800
+++ b/include/linux/kthread.h Mon Jan 29 15:46:47 2007 -0800
@@ -22,7 +22,7 @@ struct task_struct *kthread_create(int (
struct task_struct *__k \
= kthread_create(threadfn, data, namefmt, ## __VA_ARGS__); \
if (!IS_ERR(__k)) \
- wake_up_process(__k); \
+ wake_up_task(__k); \
__k; \
})

diff -r df7bc026d50e -r 4ea674e8825e include/linux/module.h
--- a/include/linux/module.h Mon Jan 29 15:36:16 2007 -0800
+++ b/include/linux/module.h Mon Jan 29 15:46:47 2007 -0800
@@ -334,7 +334,7 @@ struct module
struct list_head modules_which_use_me;

/* Who is waiting for us to be unloaded */
- struct task_struct *waiter;
+ void *wake_target;

/* Destruction function. */
void (*exit)(void);
diff -r df7bc026d50e -r 4ea674e8825e include/linux/mutex.h
--- a/include/linux/mutex.h Mon Jan 29 15:36:16 2007 -0800
+++ b/include/linux/mutex.h Mon Jan 29 15:46:47 2007 -0800
@@ -65,7 +65,7 @@ struct mutex {
*/
struct mutex_waiter {
struct list_head list;
- struct task_struct *task;
+ void *wake_target;
#ifdef CONFIG_DEBUG_MUTEXES
struct mutex *lock;
void *magic;
diff -r df7bc026d50e -r 4ea674e8825e include/linux/posix-timers.h
--- a/include/linux/posix-timers.h Mon Jan 29 15:36:16 2007 -0800
+++ b/include/linux/posix-timers.h Mon Jan 29 15:46:47 2007 -0800
@@ -48,6 +48,7 @@ struct k_itimer {
int it_sigev_signo; /* signo word of sigevent struct */
sigval_t it_sigev_value; /* value word of sigevent struct */
struct task_struct *it_process; /* process to send signal to */
+ void *it_wake_target; /* wake target for nanosleep case */
struct sigqueue *sigq; /* signal queue entry. */
union {
struct {
diff -r df7bc026d50e -r 4ea674e8825e include/linux/sched.h
--- a/include/linux/sched.h Mon Jan 29 15:36:16 2007 -0800
+++ b/include/linux/sched.h Mon Jan 29 15:46:47 2007 -0800
@@ -1338,8 +1338,14 @@ extern void switch_uid(struct user_struc

extern void do_timer(unsigned long ticks);

-extern int FASTCALL(wake_up_state(struct task_struct * tsk, unsigned int state));
-extern int FASTCALL(wake_up_process(struct task_struct * tsk));
+/*
+ * XXX We need to figure out how signal delivery will wake the fibrils in
+ * a task. This is marked deprecated so that we get a compile-time warning
+ * to worry about it.
+ */
+extern int FASTCALL(wake_up_state(struct task_struct * tsk, unsigned int state)) __deprecated;
+extern int FASTCALL(wake_up_target(void *wake_target));
+extern int FASTCALL(wake_up_task(struct task_struct *task));
extern void FASTCALL(wake_up_new_task(struct task_struct * tsk,
unsigned long clone_flags));
#ifdef CONFIG_SMP
diff -r df7bc026d50e -r 4ea674e8825e include/linux/sem.h
--- a/include/linux/sem.h Mon Jan 29 15:36:16 2007 -0800
+++ b/include/linux/sem.h Mon Jan 29 15:46:47 2007 -0800
@@ -104,7 +104,7 @@ struct sem_queue {
struct sem_queue {
struct sem_queue * next; /* next entry in the queue */
struct sem_queue ** prev; /* previous entry in the queue, *(q->prev) == q */
- struct task_struct* sleeper; /* this process */
+ void *wake_target;
struct sem_undo * undo; /* undo structure */
int pid; /* process id of requesting process */
int status; /* completion status of operation */
diff -r df7bc026d50e -r 4ea674e8825e include/linux/wait.h
--- a/include/linux/wait.h Mon Jan 29 15:36:16 2007 -0800
+++ b/include/linux/wait.h Mon Jan 29 15:46:47 2007 -0800
@@ -54,13 +54,16 @@ typedef struct __wait_queue_head wait_qu
typedef struct __wait_queue_head wait_queue_head_t;

struct task_struct;
+/* XXX sigh, wait.h <-> sched.h have some fun ordering */
+void *task_wake_target(struct task_struct *task);
+struct task_struct *wake_target_to_task(void *wake_target);

/*
* Macros for declaration and initialisaton of the datatypes
*/

#define __WAITQUEUE_INITIALIZER(name, tsk) { \
- .private = tsk, \
+ .private = task_wake_target(tsk), \
.func = default_wake_function, \
.task_list = { NULL, NULL } }

@@ -91,7 +94,7 @@ static inline void init_waitqueue_entry(
static inline void init_waitqueue_entry(wait_queue_t *q, struct task_struct *p)
{
q->flags = 0;
- q->private = p;
+ q->private = task_wake_target(p);
q->func = default_wake_function;
}

@@ -389,7 +392,7 @@ int wake_bit_function(wait_queue_t *wait

#define DEFINE_WAIT(name) \
wait_queue_t name = { \
- .private = current, \
+ .private = task_wake_target(current), \
.func = autoremove_wake_function, \
.task_list = LIST_HEAD_INIT((name).task_list), \
}
@@ -398,7 +401,7 @@ int wake_bit_function(wait_queue_t *wait
struct wait_bit_queue name = { \
.key = __WAIT_BIT_KEY_INITIALIZER(word, bit), \
.wait = { \
- .private = current, \
+ .private = task_wake_target(current), \
.func = wake_bit_function, \
.task_list = \
LIST_HEAD_INIT((name).wait.task_list), \
@@ -407,7 +410,7 @@ int wake_bit_function(wait_queue_t *wait

#define init_wait(wait) \
do { \
- (wait)->private = current; \
+ (wait)->private = task_wake_target(current); \
(wait)->func = autoremove_wake_function; \
INIT_LIST_HEAD(&(wait)->task_list); \
} while (0)
diff -r df7bc026d50e -r 4ea674e8825e ipc/mqueue.c
--- a/ipc/mqueue.c Mon Jan 29 15:36:16 2007 -0800
+++ b/ipc/mqueue.c Mon Jan 29 15:46:47 2007 -0800
@@ -58,7 +58,7 @@


struct ext_wait_queue { /* queue of sleeping tasks */
- struct task_struct *task;
+ void *wake_target;
struct list_head list;
struct msg_msg *msg; /* ptr of loaded message */
int state; /* one of STATE_* values */
@@ -394,10 +394,11 @@ static void wq_add(struct mqueue_inode_i
{
struct ext_wait_queue *walk;

- ewp->task = current;
+ ewp->wake_target = task_wake_target(current);

list_for_each_entry(walk, &info->e_wait_q[sr].list, list) {
- if (walk->task->static_prio <= current->static_prio) {
+ if (wake_target_to_task(walk->wake_target)->static_prio
+ <= current->static_prio) {
list_add_tail(&ewp->list, &walk->list);
return;
}
@@ -785,7 +786,7 @@ static inline void pipelined_send(struct
receiver->msg = message;
list_del(&receiver->list);
receiver->state = STATE_PENDING;
- wake_up_process(receiver->task);
+ wake_up_target(receiver->wake_target);
smp_wmb();
receiver->state = STATE_READY;
}
@@ -804,7 +805,7 @@ static inline void pipelined_receive(str
msg_insert(sender->msg, info);
list_del(&sender->list);
sender->state = STATE_PENDING;
- wake_up_process(sender->task);
+ wake_up_target(sender->wake_target);
smp_wmb();
sender->state = STATE_READY;
}
@@ -869,7 +870,7 @@ asmlinkage long sys_mq_timedsend(mqd_t m
spin_unlock(&info->lock);
ret = timeout;
} else {
- wait.task = current;
+ wait.wake_target = task_wake_target(current);
wait.msg = (void *) msg_ptr;
wait.state = STATE_NONE;
ret = wq_sleep(info, SEND, timeout, &wait);
@@ -944,7 +945,7 @@ asmlinkage ssize_t sys_mq_timedreceive(m
ret = timeout;
msg_ptr = NULL;
} else {
- wait.task = current;
+ wait.wake_target = task_wake_target(current);
wait.state = STATE_NONE;
ret = wq_sleep(info, RECV, timeout, &wait);
msg_ptr = wait.msg;
diff -r df7bc026d50e -r 4ea674e8825e ipc/msg.c
--- a/ipc/msg.c Mon Jan 29 15:36:16 2007 -0800
+++ b/ipc/msg.c Mon Jan 29 15:46:47 2007 -0800
@@ -46,7 +46,7 @@
*/
struct msg_receiver {
struct list_head r_list;
- struct task_struct *r_tsk;
+ struct task_struct *r_wake_target;

int r_mode;
long r_msgtype;
@@ -58,7 +58,7 @@ struct msg_receiver {
/* one msg_sender for each sleeping sender */
struct msg_sender {
struct list_head list;
- struct task_struct *tsk;
+ void *wake_target;
};

#define SEARCH_ANY 1
@@ -180,7 +180,7 @@ static int newque (struct ipc_namespace

static inline void ss_add(struct msg_queue *msq, struct msg_sender *mss)
{
- mss->tsk = current;
+ mss->wake_target = task_wake_target(current);
current->state = TASK_INTERRUPTIBLE;
list_add_tail(&mss->list, &msq->q_senders);
}
@@ -203,7 +203,7 @@ static void ss_wakeup(struct list_head *
tmp = tmp->next;
if (kill)
mss->list.next = NULL;
- wake_up_process(mss->tsk);
+ wake_up_target(mss->wake_target);
}
}

@@ -218,7 +218,7 @@ static void expunge_all(struct msg_queue
msr = list_entry(tmp, struct msg_receiver, r_list);
tmp = tmp->next;
msr->r_msg = NULL;
- wake_up_process(msr->r_tsk);
+ wake_up_target(msr->r_wake_target);
smp_mb();
msr->r_msg = ERR_PTR(res);
}
@@ -602,20 +602,21 @@ static inline int pipelined_send(struct
msr = list_entry(tmp, struct msg_receiver, r_list);
tmp = tmp->next;
if (testmsg(msg, msr->r_msgtype, msr->r_mode) &&
- !security_msg_queue_msgrcv(msq, msg, msr->r_tsk,
- msr->r_msgtype, msr->r_mode)) {
+ !security_msg_queue_msgrcv(msq, msg,
+ wake_target_to_task(msr->r_wake_target),
+ msr->r_msgtype, msr->r_mode)) {

list_del(&msr->r_list);
if (msr->r_maxsize < msg->m_ts) {
msr->r_msg = NULL;
- wake_up_process(msr->r_tsk);
+ wake_up_target(msr->r_wake_target);
smp_mb();
msr->r_msg = ERR_PTR(-E2BIG);
} else {
msr->r_msg = NULL;
- msq->q_lrpid = msr->r_tsk->pid;
+ msq->q_lrpid = wake_target_to_task(msr->r_wake_target)->pid;
msq->q_rtime = get_seconds();
- wake_up_process(msr->r_tsk);
+ wake_up_target(msr->r_wake_target);
smp_mb();
msr->r_msg = msg;

@@ -826,7 +827,7 @@ long do_msgrcv(int msqid, long *pmtype,
goto out_unlock;
}
list_add_tail(&msr_d.r_list, &msq->q_receivers);
- msr_d.r_tsk = current;
+ msr_d.r_wake_target = task_wake_target(current);
msr_d.r_msgtype = msgtyp;
msr_d.r_mode = mode;
if (msgflg & MSG_NOERROR)
diff -r df7bc026d50e -r 4ea674e8825e ipc/sem.c
--- a/ipc/sem.c Mon Jan 29 15:36:16 2007 -0800
+++ b/ipc/sem.c Mon Jan 29 15:46:47 2007 -0800
@@ -411,7 +411,7 @@ static void update_queue (struct sem_arr
error = try_atomic_semop(sma, q->sops, q->nsops,
q->undo, q->pid);

- /* Does q->sleeper still need to sleep? */
+ /* Does q->wake_target still need to sleep? */
if (error <= 0) {
struct sem_queue *n;
remove_from_queue(sma,q);
@@ -431,7 +431,7 @@ static void update_queue (struct sem_arr
n = sma->sem_pending;
else
n = q->next;
- wake_up_process(q->sleeper);
+ wake_up_target(q->wake_target);
/* hands-off: q will disappear immediately after
* writing q->status.
*/
@@ -515,7 +515,7 @@ static void freeary (struct ipc_namespac
q->prev = NULL;
n = q->next;
q->status = IN_WAKEUP;
- wake_up_process(q->sleeper); /* doesn't sleep */
+ wake_up_target(q->wake_target); /* doesn't sleep */
smp_wmb();
q->status = -EIDRM; /* hands-off q */
q = n;
@@ -1223,7 +1223,7 @@ retry_undos:
prepend_to_queue(sma ,&queue);

queue.status = -EINTR;
- queue.sleeper = current;
+ queue.wake_target = task_wake_target(current);
current->state = TASK_INTERRUPTIBLE;
sem_unlock(sma);

diff -r df7bc026d50e -r 4ea674e8825e kernel/exit.c
--- a/kernel/exit.c Mon Jan 29 15:36:16 2007 -0800
+++ b/kernel/exit.c Mon Jan 29 15:46:47 2007 -0800
@@ -91,7 +91,7 @@ static void __exit_signal(struct task_st
* then notify it:
*/
if (sig->group_exit_task && atomic_read(&sig->count) == sig->notify_count) {
- wake_up_process(sig->group_exit_task);
+ wake_up_task(sig->group_exit_task);
sig->group_exit_task = NULL;
}
if (tsk == sig->curr_target)
diff -r df7bc026d50e -r 4ea674e8825e kernel/hrtimer.c
--- a/kernel/hrtimer.c Mon Jan 29 15:36:16 2007 -0800
+++ b/kernel/hrtimer.c Mon Jan 29 15:46:47 2007 -0800
@@ -660,11 +660,11 @@ static int hrtimer_wakeup(struct hrtimer
{
struct hrtimer_sleeper *t =
container_of(timer, struct hrtimer_sleeper, timer);
- struct task_struct *task = t->task;
-
- t->task = NULL;
- if (task)
- wake_up_process(task);
+ void *wake_target = t->wake_target;
+
+ t->wake_target = NULL;
+ if (wake_target)
+ wake_up_target(wake_target);

return HRTIMER_NORESTART;
}
@@ -672,7 +672,7 @@ void hrtimer_init_sleeper(struct hrtimer
void hrtimer_init_sleeper(struct hrtimer_sleeper *sl, struct task_struct *task)
{
sl->timer.function = hrtimer_wakeup;
- sl->task = task;
+ sl->wake_target = task_wake_target(task);
}

static int __sched do_nanosleep(struct hrtimer_sleeper *t, enum hrtimer_mode mode)
@@ -688,9 +688,9 @@ static int __sched do_nanosleep(struct h
hrtimer_cancel(&t->timer);
mode = HRTIMER_ABS;

- } while (t->task && !signal_pending(current));
-
- return t->task == NULL;
+ } while (t->wake_target && !signal_pending(current));
+
+ return t->wake_target == NULL;
}

long __sched hrtimer_nanosleep_restart(struct restart_block *restart)
diff -r df7bc026d50e -r 4ea674e8825e kernel/kthread.c
--- a/kernel/kthread.c Mon Jan 29 15:36:16 2007 -0800
+++ b/kernel/kthread.c Mon Jan 29 15:46:47 2007 -0800
@@ -232,7 +232,7 @@ int kthread_stop(struct task_struct *k)

/* Now set kthread_should_stop() to true, and wake it up. */
kthread_stop_info.k = k;
- wake_up_process(k);
+ wake_up_task(k);
put_task_struct(k);

/* Once it dies, reset stop ptr, gather result and we're done. */
diff -r df7bc026d50e -r 4ea674e8825e kernel/module.c
--- a/kernel/module.c Mon Jan 29 15:36:16 2007 -0800
+++ b/kernel/module.c Mon Jan 29 15:46:47 2007 -0800
@@ -508,7 +508,7 @@ static void module_unload_init(struct mo
/* Hold reference count during initialization. */
local_set(&mod->ref[raw_smp_processor_id()].count, 1);
/* Backwards compatibility macros put refcount during init. */
- mod->waiter = current;
+ mod->wake_target = task_wake_target(current);
}

/* modules using other modules */
@@ -699,7 +699,7 @@ sys_delete_module(const char __user *nam
}

/* Set this up before setting mod->state */
- mod->waiter = current;
+ mod->wake_target = task_wake_target(current);

/* Stop the machine so refcounts can't move and disable module. */
ret = try_stop_module(mod, flags, &forced);
@@ -797,7 +797,7 @@ void module_put(struct module *module)
local_dec(&module->ref[cpu].count);
/* Maybe they're waiting for us to drop reference? */
if (unlikely(!module_is_live(module)))
- wake_up_process(module->waiter);
+ wake_up_target(module->wake_target);
put_cpu();
}
}
diff -r df7bc026d50e -r 4ea674e8825e kernel/mutex-debug.c
--- a/kernel/mutex-debug.c Mon Jan 29 15:36:16 2007 -0800
+++ b/kernel/mutex-debug.c Mon Jan 29 15:46:47 2007 -0800
@@ -53,6 +53,7 @@ void debug_mutex_free_waiter(struct mute
memset(waiter, MUTEX_DEBUG_FREE, sizeof(*waiter));
}

+#warning "this is going to need updating for fibrils"
void debug_mutex_add_waiter(struct mutex *lock, struct mutex_waiter *waiter,
struct thread_info *ti)
{
@@ -67,12 +68,12 @@ void mutex_remove_waiter(struct mutex *l
struct thread_info *ti)
{
DEBUG_LOCKS_WARN_ON(list_empty(&waiter->list));
- DEBUG_LOCKS_WARN_ON(waiter->task != ti->task);
+ DEBUG_LOCKS_WARN_ON(waiter->wake_target != task_wake_target(ti->task));
DEBUG_LOCKS_WARN_ON(ti->task->blocked_on != waiter);
ti->task->blocked_on = NULL;

list_del_init(&waiter->list);
- waiter->task = NULL;
+ waiter->wake_target = NULL;
}

void debug_mutex_unlock(struct mutex *lock)
diff -r df7bc026d50e -r 4ea674e8825e kernel/mutex.c
--- a/kernel/mutex.c Mon Jan 29 15:36:16 2007 -0800
+++ b/kernel/mutex.c Mon Jan 29 15:46:47 2007 -0800
@@ -137,7 +137,7 @@ __mutex_lock_common(struct mutex *lock,

/* add waiting tasks to the end of the waitqueue (FIFO): */
list_add_tail(&waiter.list, &lock->wait_list);
- waiter.task = task;
+ waiter.wake_target = task_wake_target(task);

for (;;) {
/*
@@ -246,7 +246,7 @@ __mutex_unlock_common_slowpath(atomic_t

debug_mutex_wake_waiter(lock, waiter);

- wake_up_process(waiter->task);
+ wake_up_target(waiter->wake_target);
}

debug_mutex_clear_owner(lock);
diff -r df7bc026d50e -r 4ea674e8825e kernel/posix-cpu-timers.c
--- a/kernel/posix-cpu-timers.c Mon Jan 29 15:36:16 2007 -0800
+++ b/kernel/posix-cpu-timers.c Mon Jan 29 15:46:47 2007 -0800
@@ -673,7 +673,7 @@ static void cpu_timer_fire(struct k_itim
* This a special case for clock_nanosleep,
* not a normal timer from sys_timer_create.
*/
- wake_up_process(timer->it_process);
+ wake_up_target(timer->it_wake_target);
timer->it.cpu.expires.sched = 0;
} else if (timer->it.cpu.incr.sched == 0) {
/*
@@ -1423,6 +1423,12 @@ static int do_cpu_nanosleep(const clocki
timer.it_overrun = -1;
error = posix_cpu_timer_create(&timer);
timer.it_process = current;
+ /*
+ * XXX This isn't quite right, but the rest of the it_process users
+ * fall under the currently unresolved question of how signal delivery
+ * will behave.
+ */
+ timer.it_wake_target = task_wake_target(current);
if (!error) {
static struct itimerspec zero_it;

diff -r df7bc026d50e -r 4ea674e8825e kernel/ptrace.c
--- a/kernel/ptrace.c Mon Jan 29 15:36:16 2007 -0800
+++ b/kernel/ptrace.c Mon Jan 29 15:46:47 2007 -0800
@@ -221,7 +221,7 @@ static inline void __ptrace_detach(struc
__ptrace_unlink(child);
/* .. and wake it up. */
if (child->exit_state != EXIT_ZOMBIE)
- wake_up_process(child);
+ wake_up_task(child);
}

int ptrace_detach(struct task_struct *child, unsigned int data)
diff -r df7bc026d50e -r 4ea674e8825e kernel/rtmutex.c
--- a/kernel/rtmutex.c Mon Jan 29 15:36:16 2007 -0800
+++ b/kernel/rtmutex.c Mon Jan 29 15:46:47 2007 -0800
@@ -516,7 +516,8 @@ static void wakeup_next_waiter(struct rt
}
spin_unlock_irqrestore(&pendowner->pi_lock, flags);

- wake_up_process(pendowner);
+#warning "this looks like it needs expert attention"
+ wake_up_task(pendowner);
}

/*
@@ -640,7 +641,7 @@ rt_mutex_slowlock(struct rt_mutex *lock,
/* Signal pending? */
if (signal_pending(current))
ret = -EINTR;
- if (timeout && !timeout->task)
+ if (timeout && !timeout->wake_target)
ret = -ETIMEDOUT;
if (ret)
break;
diff -r df7bc026d50e -r 4ea674e8825e kernel/sched.c
--- a/kernel/sched.c Mon Jan 29 15:36:16 2007 -0800
+++ b/kernel/sched.c Mon Jan 29 15:46:47 2007 -0800
@@ -1381,10 +1381,52 @@ static inline int wake_idle(int cpu, str
}
#endif

+/*
+ * This path wakes a fibril.
+ *
+ * In the common case, a task will be sleeping with multiple pending
+ * sleeping fibrils. In that case we need to put the fibril on the task's
+ * runnable list and wake the task itself. We need it to go back through
+ * the scheduler to find the runnable fibril so we set TIF_NEED_RESCHED.
+ *
+ * A derivative of that case is when the fibril that we're waking is already
+ * current on the sleeping task. In that case we just need to wake the
+ * task itself, it will already be executing the fibril we're waking. We
+ * do not put it on the runnable list in that case.
+ *
+ * XXX Obviously, there are lots of very scary races here. We should get
+ * more confidence that they're taken care of.
+ */
+static int try_to_wake_up_fibril(struct task_struct *tsk, void *wake_target,
+ unsigned int state)
+{
+ struct fibril *fibril = (struct fibril *)
+ ((unsigned long)wake_target & ~1UL);
+ long old_state = fibril->state;
+ int ret = 1;
+
+ if (!(old_state & state))
+ goto out;
+
+ ret = 0;
+ fibril->state = TASK_RUNNING;
+
+ if (fibril->ti->task->fibril != fibril) {
+ BUG_ON(!list_empty(&fibril->run_list));
+ list_add_tail(&fibril->run_list, &tsk->runnable_fibrils);
+ if (!tsk->array)
+ set_ti_thread_flag(task_thread_info(tsk),
+ TIF_NEED_RESCHED);
+ }
+
+out:
+ return ret;
+}
+
/***
* try_to_wake_up - wake up a thread
- * @p: the to-be-woken-up thread
- * @state: the mask of task states that can be woken
+ * @wake_target: the to-be-woken-up sleeper, from task_wake_target()
+ * @state: the mask of states that can be woken
* @sync: do a synchronous wakeup?
*
* Put it on the run-queue if it's not already there. The "current"
@@ -1395,9 +1437,10 @@ static inline int wake_idle(int cpu, str
*
* returns failure only if the task is already active.
*/
-static int try_to_wake_up(struct task_struct *p, unsigned int state, int sync)
+static int try_to_wake_up(void *wake_target, unsigned int state, int sync)
{
int cpu, this_cpu, success = 0;
+ struct task_struct *p = wake_target_to_task(wake_target);
unsigned long flags;
long old_state;
struct rq *rq;
@@ -1408,6 +1451,12 @@ static int try_to_wake_up(struct task_st
#endif

rq = task_rq_lock(p, &flags);
+
+ /* See if we're just putting a fibril on its task's runnable list */
+ if (unlikely(((unsigned long)wake_target & 1) &&
+ try_to_wake_up_fibril(p, wake_target, state)))
+ goto out;
+
old_state = p->state;
if (!(old_state & state))
goto out;
@@ -1555,16 +1604,27 @@ out:
return success;
}

-int fastcall wake_up_process(struct task_struct *p)
-{
- return try_to_wake_up(p, TASK_STOPPED | TASK_TRACED |
+int fastcall wake_up_task(struct task_struct *task)
+{
+ return try_to_wake_up((void *)task, TASK_STOPPED | TASK_TRACED |
TASK_INTERRUPTIBLE | TASK_UNINTERRUPTIBLE, 0);
}
-EXPORT_SYMBOL(wake_up_process);
-
+EXPORT_SYMBOL(wake_up_task);
+
+int fastcall wake_up_target(void *wake_target)
+{
+ return try_to_wake_up(wake_target, TASK_STOPPED | TASK_TRACED |
+ TASK_INTERRUPTIBLE | TASK_UNINTERRUPTIBLE, 0);
+}
+EXPORT_SYMBOL(wake_up_target);
+
+/*
+ * XXX We need to figure out how signal delivery will wake the fibrils in
+ * a task.
+ */
int fastcall wake_up_state(struct task_struct *p, unsigned int state)
{
- return try_to_wake_up(p, state, 0);
+ return try_to_wake_up((void *)p, state, 0);
}

static void task_running_tick(struct rq *rq, struct task_struct *p);
@@ -2041,7 +2101,7 @@ static void sched_migrate_task(struct ta

get_task_struct(mt);
task_rq_unlock(rq, &flags);
- wake_up_process(mt);
+ wake_up_task(mt);
put_task_struct(mt);
wait_for_completion(&req.done);

@@ -2673,7 +2733,7 @@ redo:
}
spin_unlock_irqrestore(&busiest->lock, flags);
if (active_balance)
- wake_up_process(busiest->migration_thread);
+ wake_up_task(busiest->migration_thread);

/*
* We've kicked active balancing, reset the failure
@@ -3781,6 +3841,33 @@ need_resched:

#endif /* CONFIG_PREEMPT */

+/*
+ * This is a void * so that it's harder for people to stash it in a small
+ * scalar without getting warnings.
+ */
+void *task_wake_target(struct task_struct *task)
+{
+ if (task->fibril) {
+ return (void *)((unsigned long)task->fibril | 1);
+ } else {
+ BUG_ON((unsigned long)task & 1);
+ return task;
+ }
+}
+EXPORT_SYMBOL(task_wake_target);
+
+struct task_struct *wake_target_to_task(void *wake_target)
+{
+ if ((unsigned long)wake_target & 1) {
+ struct fibril *fibril;
+ fibril = (struct fibril *) ((unsigned long)wake_target ^ 1);
+ return fibril->ti->task;
+ } else
+ return (struct task_struct *)((unsigned long)wake_target);
+}
+EXPORT_SYMBOL(wake_target_to_task);
+
+
int default_wake_function(wait_queue_t *curr, unsigned mode, int sync,
void *key)
{
@@ -5140,7 +5227,7 @@ int set_cpus_allowed(struct task_struct
if (migrate_task(p, any_online_cpu(new_mask), &req)) {
/* Need help from migration thread: drop lock and wait. */
task_rq_unlock(rq, &flags);
- wake_up_process(rq->migration_thread);
+ wake_up_task(rq->migration_thread);
wait_for_completion(&req.done);
tlb_migrate_finish(p->mm);
return 0;
@@ -5462,7 +5549,7 @@ migration_call(struct notifier_block *nf

case CPU_ONLINE:
/* Strictly unneccessary, as first user will wake it. */
- wake_up_process(cpu_rq(cpu)->migration_thread);
+ wake_up_task(cpu_rq(cpu)->migration_thread);
break;

#ifdef CONFIG_HOTPLUG_CPU
diff -r df7bc026d50e -r 4ea674e8825e kernel/signal.c
--- a/kernel/signal.c Mon Jan 29 15:36:16 2007 -0800
+++ b/kernel/signal.c Mon Jan 29 15:46:47 2007 -0800
@@ -948,7 +948,7 @@ __group_complete_signal(int sig, struct
signal_wake_up(t, 0);
t = next_thread(t);
} while (t != p);
- wake_up_process(p->signal->group_exit_task);
+ wake_up_task(p->signal->group_exit_task);
return;
}

diff -r df7bc026d50e -r 4ea674e8825e kernel/softirq.c
--- a/kernel/softirq.c Mon Jan 29 15:36:16 2007 -0800
+++ b/kernel/softirq.c Mon Jan 29 15:46:47 2007 -0800
@@ -58,7 +58,7 @@ static inline void wakeup_softirqd(void)
struct task_struct *tsk = __get_cpu_var(ksoftirqd);

if (tsk && tsk->state != TASK_RUNNING)
- wake_up_process(tsk);
+ wake_up_task(tsk);
}

/*
@@ -583,7 +583,7 @@ static int __cpuinit cpu_callback(struct
per_cpu(ksoftirqd, hotcpu) = p;
break;
case CPU_ONLINE:
- wake_up_process(per_cpu(ksoftirqd, hotcpu));
+ wake_up_task(per_cpu(ksoftirqd, hotcpu));
break;
#ifdef CONFIG_HOTPLUG_CPU
case CPU_UP_CANCELED:
diff -r df7bc026d50e -r 4ea674e8825e kernel/stop_machine.c
--- a/kernel/stop_machine.c Mon Jan 29 15:36:16 2007 -0800
+++ b/kernel/stop_machine.c Mon Jan 29 15:46:47 2007 -0800
@@ -185,7 +185,7 @@ struct task_struct *__stop_machine_run(i
p = kthread_create(do_stop, &smdata, "kstopmachine");
if (!IS_ERR(p)) {
kthread_bind(p, cpu);
- wake_up_process(p);
+ wake_up_task(p);
wait_for_completion(&smdata.done);
}
up(&stopmachine_mutex);
diff -r df7bc026d50e -r 4ea674e8825e kernel/timer.c
--- a/kernel/timer.c Mon Jan 29 15:36:16 2007 -0800
+++ b/kernel/timer.c Mon Jan 29 15:46:47 2007 -0800
@@ -1290,7 +1290,7 @@ asmlinkage long sys_getegid(void)

static void process_timeout(unsigned long __data)
{
- wake_up_process((struct task_struct *)__data);
+ wake_up_task((struct task_struct *)__data);
}

/**
diff -r df7bc026d50e -r 4ea674e8825e kernel/workqueue.c
--- a/kernel/workqueue.c Mon Jan 29 15:36:16 2007 -0800
+++ b/kernel/workqueue.c Mon Jan 29 15:46:47 2007 -0800
@@ -504,14 +504,14 @@ struct workqueue_struct *__create_workqu
if (!p)
destroy = 1;
else
- wake_up_process(p);
+ wake_up_task(p);
} else {
list_add(&wq->list, &workqueues);
for_each_online_cpu(cpu) {
p = create_workqueue_thread(wq, cpu, freezeable);
if (p) {
kthread_bind(p, cpu);
- wake_up_process(p);
+ wake_up_task(p);
} else
destroy = 1;
}
@@ -773,7 +773,7 @@ static int __devinit workqueue_cpu_callb

cwq = per_cpu_ptr(wq->cpu_wq, hotcpu);
kthread_bind(cwq->thread, hotcpu);
- wake_up_process(cwq->thread);
+ wake_up_task(cwq->thread);
}
mutex_unlock(&workqueue_mutex);
break;
diff -r df7bc026d50e -r 4ea674e8825e lib/rwsem.c
--- a/lib/rwsem.c Mon Jan 29 15:36:16 2007 -0800
+++ b/lib/rwsem.c Mon Jan 29 15:46:47 2007 -0800
@@ -30,7 +30,7 @@ EXPORT_SYMBOL(__init_rwsem);

struct rwsem_waiter {
struct list_head list;
- struct task_struct *task;
+ void *wake_target;
unsigned int flags;
#define RWSEM_WAITING_FOR_READ 0x00000001
#define RWSEM_WAITING_FOR_WRITE 0x00000002
@@ -50,7 +50,7 @@ __rwsem_do_wake(struct rw_semaphore *sem
__rwsem_do_wake(struct rw_semaphore *sem, int downgrading)
{
struct rwsem_waiter *waiter;
- struct task_struct *tsk;
+ void *wake_target;
struct list_head *next;
signed long oldcount, woken, loop;

@@ -75,16 +75,17 @@ __rwsem_do_wake(struct rw_semaphore *sem
if (!(waiter->flags & RWSEM_WAITING_FOR_WRITE))
goto readers_only;

- /* We must be careful not to touch 'waiter' after we set ->task = NULL.
- * It is an allocated on the waiter's stack and may become invalid at
- * any time after that point (due to a wakeup from another source).
+ /* We must be careful not to touch 'waiter' after we set ->wake_target
+ * = NULL. It is an allocated on the waiter's stack and may become
+ * invalid at any time after that point (due to a wakeup from another
+ * source).
*/
list_del(&waiter->list);
- tsk = waiter->task;
+ wake_target = waiter->wake_target;
smp_mb();
- waiter->task = NULL;
- wake_up_process(tsk);
- put_task_struct(tsk);
+ waiter->wake_target = NULL;
+ wake_up_target(wake_target);
+ put_task_struct(wake_target_to_task(wake_target));
goto out;

/* don't want to wake any writers */
@@ -123,11 +124,11 @@ __rwsem_do_wake(struct rw_semaphore *sem
for (; loop > 0; loop--) {
waiter = list_entry(next, struct rwsem_waiter, list);
next = waiter->list.next;
- tsk = waiter->task;
+ wake_target = waiter->wake_target;
smp_mb();
- waiter->task = NULL;
- wake_up_process(tsk);
- put_task_struct(tsk);
+ waiter->wake_target = NULL;
+ wake_up_target(wake_target);
+ put_task_struct(wake_target_to_task(wake_target));
}

sem->wait_list.next = next;
@@ -157,7 +158,7 @@ rwsem_down_failed_common(struct rw_semap

/* set up my own style of waitqueue */
spin_lock_irq(&sem->wait_lock);
- waiter->task = tsk;
+ waiter->wake_target = task_wake_target(tsk);
get_task_struct(tsk);

list_add_tail(&waiter->list, &sem->wait_list);
@@ -173,7 +174,7 @@ rwsem_down_failed_common(struct rw_semap

/* wait to be given the lock */
for (;;) {
- if (!waiter->task)
+ if (!waiter->wake_target)
break;
schedule();
set_task_state(tsk, TASK_UNINTERRUPTIBLE);
diff -r df7bc026d50e -r 4ea674e8825e mm/pdflush.c
--- a/mm/pdflush.c Mon Jan 29 15:36:16 2007 -0800
+++ b/mm/pdflush.c Mon Jan 29 15:46:47 2007 -0800
@@ -217,7 +217,7 @@ int pdflush_operation(void (*fn)(unsigne
last_empty_jifs = jiffies;
pdf->fn = fn;
pdf->arg0 = arg0;
- wake_up_process(pdf->who);
+ wake_up_task(pdf->who);
spin_unlock_irqrestore(&pdflush_lock, flags);
}
return ret;
diff -r df7bc026d50e -r 4ea674e8825e net/core/pktgen.c
--- a/net/core/pktgen.c Mon Jan 29 15:36:16 2007 -0800
+++ b/net/core/pktgen.c Mon Jan 29 15:46:47 2007 -0800
@@ -3505,7 +3505,7 @@ static int __init pktgen_create_thread(i
pe->proc_fops = &pktgen_thread_fops;
pe->data = t;

- wake_up_process(p);
+ wake_up_task(p);

return 0;
}

2007-01-30 21:40:24

by Zach Brown

[permalink] [raw]
Subject: [PATCH 4 of 4] Introduce aio system call submission and completion system calls

This finally does something useful with the notion of being able to schedule
stacks as fibrils under a task_struct. Again, i386-specific and in need of
proper layering with archs.

sys_asys_submit() is added to let userspace submit asynchronous system calls.
It specifies the system call number and arguments. A fibril is constructed for
each call. Each starts with a stack which executes the given system call
handler and then returns to a function which records the return code of the
system call handler. sys_asys_await_completion() then lets userspace collect
these results.

sys_asys_submit() is careful to construct a fibril for the submission syscall
itself so that it can return to userspace if the calls it is dispatching block.
If none of them block, however, they will have all been run hot in this
submitting task on this processor.

It allocates and runs each system call in turn. It could certainly work in
batches to decrease locking overhead at the cost of increased peak memory
overhead for calls which don't end up blocking.

The complexity of a fully-formed submission and completion interface hasn't
been addressed. Details like targeting explicit completion contexts, batching,
timeouts, signal delivery, and syscall-free submission and completion (now with
more rings!) can all be hashed out in some giant thread, no doubt. I didn't
want them to cloud the basic mechanics being presented here.
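
To give a feel for the interface, here's a rough userspace sketch of driving
the two calls.  It's illustrative only and not part of the patches; the
structure layouts mirror kernel/asys.c below and the syscall numbers are the
i386 ones assigned by this patch.

#include <stdio.h>
#include <unistd.h>
#include <sys/syscall.h>

#define __NR_asys_submit		320
#define __NR_asys_await_completion	321

struct asys_input {
	int syscall_nr;
	unsigned long cookie;
	unsigned long nr_args;
	unsigned long *args;
};

struct asys_completion {
	long return_code;
	unsigned long cookie;
};

int main(void)
{
	struct asys_input inp;
	struct asys_completion comp;

	inp.syscall_nr = SYS_getpid;	/* any simple syscall will do */
	inp.cookie = 1;
	inp.nr_args = 0;
	inp.args = NULL;

	/* submit one call, then collect its completion */
	if (syscall(__NR_asys_submit, &inp, 1UL) != 1)
		return 1;
	if (syscall(__NR_asys_await_completion, &comp) != 1)
		return 1;

	printf("call %lu returned %ld\n", comp.cookie, comp.return_code);
	return 0;
}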

diff -r 4ea674e8825e -r 5bdda0f7bef2 arch/i386/kernel/syscall_table.S
--- a/arch/i386/kernel/syscall_table.S Mon Jan 29 15:46:47 2007 -0800
+++ b/arch/i386/kernel/syscall_table.S Mon Jan 29 15:50:10 2007 -0800
@@ -319,3 +319,5 @@ ENTRY(sys_call_table)
.long sys_move_pages
.long sys_getcpu
.long sys_epoll_pwait
+ .long sys_asys_submit /* 320 */
+ .long sys_asys_await_completion
diff -r 4ea674e8825e -r 5bdda0f7bef2 include/asm-i386/unistd.h
--- a/include/asm-i386/unistd.h Mon Jan 29 15:46:47 2007 -0800
+++ b/include/asm-i386/unistd.h Mon Jan 29 15:50:10 2007 -0800
@@ -325,6 +325,8 @@
#define __NR_move_pages 317
#define __NR_getcpu 318
#define __NR_epoll_pwait 319
+#define __NR_asys_submit 320
+#define __NR_asys_await_completion 321

#ifdef __KERNEL__

diff -r 4ea674e8825e -r 5bdda0f7bef2 include/linux/init_task.h
--- a/include/linux/init_task.h Mon Jan 29 15:46:47 2007 -0800
+++ b/include/linux/init_task.h Mon Jan 29 15:50:10 2007 -0800
@@ -148,6 +148,8 @@ extern struct group_info init_groups;
.pi_lock = SPIN_LOCK_UNLOCKED, \
INIT_TRACE_IRQFLAGS \
INIT_LOCKDEP \
+ .asys_wait = __WAIT_QUEUE_HEAD_INITIALIZER(tsk.asys_wait), \
+ .asys_completed = LIST_HEAD_INIT(tsk.asys_completed), \
}


diff -r 4ea674e8825e -r 5bdda0f7bef2 include/linux/sched.h
--- a/include/linux/sched.h Mon Jan 29 15:46:47 2007 -0800
+++ b/include/linux/sched.h Mon Jan 29 15:50:10 2007 -0800
@@ -1019,6 +1019,14 @@ struct task_struct {

/* Protection of the PI data structures: */
spinlock_t pi_lock;
+
+ /*
+ * XXX This is just a dummy that should be in a seperately managed
+ * context. An explicit contexts lets asys calls be nested (!) and
+ * will let us provide the sys_io_*() API on top of asys.
+ */
+ struct list_head asys_completed;
+ wait_queue_head_t asys_wait;

#ifdef CONFIG_RT_MUTEXES
/* PI waiters blocked on a rt_mutex held by this task */
diff -r 4ea674e8825e -r 5bdda0f7bef2 kernel/Makefile
--- a/kernel/Makefile Mon Jan 29 15:46:47 2007 -0800
+++ b/kernel/Makefile Mon Jan 29 15:50:10 2007 -0800
@@ -8,7 +8,7 @@ obj-y = sched.o fork.o exec_domain.o
signal.o sys.o kmod.o workqueue.o pid.o \
rcupdate.o extable.o params.o posix-timers.o \
kthread.o wait.o kfifo.o sys_ni.o posix-cpu-timers.o mutex.o \
- hrtimer.o rwsem.o latency.o nsproxy.o srcu.o
+ hrtimer.o rwsem.o latency.o nsproxy.o srcu.o asys.o

obj-$(CONFIG_STACKTRACE) += stacktrace.o
obj-y += time/
diff -r 4ea674e8825e -r 5bdda0f7bef2 kernel/exit.c
--- a/kernel/exit.c Mon Jan 29 15:46:47 2007 -0800
+++ b/kernel/exit.c Mon Jan 29 15:50:10 2007 -0800
@@ -42,6 +42,7 @@
#include <linux/audit.h> /* for audit_free() */
#include <linux/resource.h>
#include <linux/blkdev.h>
+#include <linux/asys.h>

#include <asm/uaccess.h>
#include <asm/unistd.h>
@@ -926,6 +927,8 @@ fastcall NORET_TYPE void do_exit(long co
taskstats_exit(tsk, group_dead);

exit_mm(tsk);
+
+ asys_task_exiting(tsk);

if (group_dead)
acct_process();
diff -r 4ea674e8825e -r 5bdda0f7bef2 kernel/fork.c
--- a/kernel/fork.c Mon Jan 29 15:46:47 2007 -0800
+++ b/kernel/fork.c Mon Jan 29 15:50:10 2007 -0800
@@ -49,6 +49,7 @@
#include <linux/delayacct.h>
#include <linux/taskstats_kern.h>
#include <linux/random.h>
+#include <linux/asys.h>

#include <asm/pgtable.h>
#include <asm/pgalloc.h>
@@ -987,6 +988,8 @@ static struct task_struct *copy_process(
goto fork_out;

rt_mutex_init_task(p);
+
+ asys_init_task(p);

#ifdef CONFIG_TRACE_IRQFLAGS
DEBUG_LOCKS_WARN_ON(!p->hardirqs_enabled);
diff -r 4ea674e8825e -r 5bdda0f7bef2 include/linux/asys.h
--- /dev/null Thu Jan 01 00:00:00 1970 +0000
+++ b/include/linux/asys.h Mon Jan 29 15:50:10 2007 -0800
@@ -0,0 +1,7 @@
+#ifndef _LINUX_ASYS_H
+#define _LINUX_ASYS_H
+
+void asys_task_exiting(struct task_struct *tsk);
+void asys_init_task(struct task_struct *tsk);
+
+#endif
diff -r 4ea674e8825e -r 5bdda0f7bef2 kernel/asys.c
--- /dev/null Thu Jan 01 00:00:00 1970 +0000
+++ b/kernel/asys.c Mon Jan 29 15:50:10 2007 -0800
@@ -0,0 +1,252 @@
+#include <linux/sched.h>
+#include <linux/uaccess.h>
+#include <linux/err.h>
+#include <linux/asys.h>
+
+/* XXX */
+#include <asm/processor.h>
+
+/*
+ * system call and argument specification given to _submit from userspace
+ */
+struct asys_input {
+ int syscall_nr;
+ unsigned long cookie;
+ unsigned long nr_args;
+ unsigned long *args;
+};
+
+/*
+ * system call completion event given to userspace
+ * XXX: compat
+ */
+struct asys_completion {
+ long return_code;
+ unsigned long cookie;
+};
+
+/*
+ * This record of a completed async system call is kept around until it
+ * is collected by userspace.
+ */
+struct asys_result {
+ struct list_head item;
+ struct asys_completion comp;
+};
+
+/*
+ * This stack is built-up and handed to the scheduler to first process
+ * the system call. It stores the progress of the call until the call returns
+ * and this structure is freed.
+ */
+struct asys_call {
+ struct asys_result *result;
+ struct fibril fibril;
+};
+
+void asys_init_task(struct task_struct *tsk)
+{
+ INIT_LIST_HEAD(&tsk->asys_completed);
+ init_waitqueue_head(&tsk->asys_wait);
+}
+
+void asys_task_exiting(struct task_struct *tsk)
+{
+ struct asys_result *res, *next;
+
+ list_for_each_entry_safe(res, next, &tsk->asys_completed, item)
+ kfree(res);
+
+ /*
+ * XXX this only works if tsk->fibril was allocated by
+ * sys_asys_submit(), not if its embedded in an asys_call. This
+ * implies that we must forbid sys_exit in asys_submit.
+ */
+ if (tsk->fibril) {
+ BUG_ON(!list_empty(&tsk->fibril->run_list));
+ kfree(tsk->fibril);
+ tsk->fibril = NULL;
+ }
+}
+
+/*
+ * Initial asys call stacks are constructed such that this is called when
+ * the system call handler returns. It records the return code from
+ * the handler in a completion event and frees data associated with the
+ * completed asys call.
+ *
+ * XXX we know that the x86 syscall handlers put their return code in eax and
+ * that regparm(3) here will take our rc argument from eax.
+ */
+static void fastcall NORET_TYPE asys_teardown_stack(long rc)
+{
+ struct asys_result *res;
+ struct asys_call *call;
+ struct fibril *fibril;
+
+ fibril = current->fibril;
+ call = container_of(fibril, struct asys_call, fibril);
+ res = call->result;
+ call->result = NULL;
+
+ res->comp.return_code = rc;
+ list_add_tail(&res->item, &current->asys_completed);
+ wake_up(&current->asys_wait);
+
+ /*
+ * We embedded the fibril in the call so that we could dereference it
+ * here without adding some tracking to the fibril. We then free the
+ * call and fibril because we're done with them.
+ *
+ * The ti itself, though, is still in use. It will only be freed once
+ * the scheduler switches away from it to another fibril. It does
+ * that when it sees current->fibril assigned to NULL.
+ */
+ current->fibril = NULL;
+ BUG_ON(!list_empty(&fibril->run_list));
+ kfree(call);
+
+ /*
+ * XXX This is sloppy. We "know" this is likely for now as the task
+ * with fibrils is only going to be in sys_asys_submit() or
+ * sys_asys_complete()
+ */
+ BUG_ON(list_empty(&current->runnable_fibrils));
+
+ schedule();
+ BUG();
+}
+
+asmlinkage long sys_asys_await_completion(struct asys_completion __user *comp)
+{
+ struct asys_result *res;
+ long ret;
+
+ ret = wait_event_interruptible(current->asys_wait,
+ !list_empty(&current->asys_completed));
+ if (ret)
+ goto out;
+
+ res = list_entry(current->asys_completed.next, struct asys_result,
+ item);
+
+ /* XXX compat */
+ ret = copy_to_user(comp, &res->comp, sizeof(struct asys_completion));
+ if (ret) {
+ ret = -EFAULT;
+ goto out;
+ }
+
+ list_del(&res->item);
+ kfree(res);
+ ret = 1;
+
+out:
+ return ret;
+}
+
+/*
+ * This initializes a newly allocated fibril so that it can be handed to the
+ * scheduler. The fibril is private to this code path at this point.
+ *
+ * XXX
+ * - this is arch specific
+ * - should maybe have a sched helper that uses INIT_PER_CALL_CHAIN
+ */
+extern unsigned long sys_call_table[]; /* XXX */
+static int asys_init_fibril(struct fibril *fibril, struct thread_info *ti,
+ struct asys_input *inp)
+{
+ unsigned long *stack_bottom;
+
+ INIT_LIST_HEAD(&fibril->run_list);
+ fibril->ti = ti;
+
+ /* XXX sanity check syscall_nr */
+ fibril->eip = sys_call_table[inp->syscall_nr];
+ /* this mirrors copy_thread()'s use of task_pt_regs() */
+ fibril->esp = (unsigned long)thread_info_pt_regs(ti) -
+ ((inp->nr_args + 1) * sizeof(long));
+
+ /*
+ * now setup the stack so that our syscall handler gets its arguments
+ * and we return to asys_teardown_stack.
+ */
+ stack_bottom = (unsigned long *)fibril->esp;
+ stack_bottom[0] = (unsigned long)asys_teardown_stack;
+ /* XXX compat */
+ if (copy_from_user(&stack_bottom[1], inp->args,
+ inp->nr_args * sizeof(long)))
+ return -EFAULT;
+
+ return 0;
+}
+
+asmlinkage long sys_asys_submit(struct asys_input __user *user_inp,
+ unsigned long nr_inp)
+{
+ struct asys_input inp;
+ struct asys_result *res;
+ struct asys_call *call;
+ struct thread_info *ti;
+ unsigned long i;
+ long err = 0;
+
+ /* Allocate a fibril for the submitter's thread_info */
+ if (current->fibril == NULL) {
+ current->fibril = kzalloc(sizeof(struct fibril), GFP_KERNEL);
+ if (current->fibril == NULL)
+ return -ENOMEM;
+
+ INIT_LIST_HEAD(&current->fibril->run_list);
+ current->fibril->state = TASK_RUNNING;
+ current->fibril->ti = current_thread_info();
+ }
+
+ for (i = 0; i < nr_inp; i++) {
+
+ if (copy_from_user(&inp, &user_inp[i], sizeof(inp))) {
+ err = -EFAULT;
+ break;
+ }
+
+ res = kmalloc(sizeof(struct asys_result), GFP_KERNEL);
+ if (res == NULL) {
+ err = -ENOMEM;
+ break;
+ }
+
+ /* XXX kzalloc to init call.fibril.per_cpu, add helper */
+ call = kzalloc(sizeof(struct asys_call), GFP_KERNEL);
+ if (call == NULL) {
+ kfree(res);
+ err = -ENOMEM;
+ break;
+ }
+
+ ti = alloc_thread_info(tsk);
+ if (ti == NULL) {
+ kfree(res);
+ kfree(call);
+ err = -ENOMEM;
+ break;
+ }
+
+ err = asys_init_fibril(&call->fibril, ti, &inp);
+ if (err) {
+ kfree(res);
+ kfree(call);
+ free_thread_info(ti);
+ break;
+ }
+
+ res->comp.cookie = inp.cookie;
+ call->result = res;
+ ti->task = current;
+
+ sched_new_runnable_fibril(&call->fibril);
+ schedule();
+ }
+
+ return i ? i : err;
+}

2007-01-30 21:58:24

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH 0 of 4] Generic AIO by scheduling stacks



On Tue, 30 Jan 2007, Zach Brown wrote:
>
> This very rough patch series introduces a different way to provide AIO support
> for system calls.

Yee-haa!

I looked at this approach a long time ago, and basically gave up because
it looked like too much work.

I heartily approve, although I only gave the actual patches a very cursory
glance. I think the approach is the proper one, but the devil is in the
details. It might be that the stack allocation overhead or some other
subtle fundamental problem ends up making this impractical in the end, but
I would _really_ like for this to basically go in.

One of the biggest issues I see is signalling. You mention it, and I
think that's going to be one of the big issues.

It won't matter at all for a certain class of calls (a lot of the people
who want to do AIO really end up doing non-interruptible things, and
signalling is a non-issue), but not only is it going to matter for some
others, we will almost certainly want to have a way to not just signal a
task, but a single "fibril" (and let me say that I'm not convinced about
your naming, but I don't hate it either ;)

But from a quick overview of the patches, I really don't see anything
fundamentally wrong. It needs some error checking and some limiting (I
_really_ don't think we want a regular user starting a thousand fibrils
concurrently), but it actually looks much less invasive than I thought it
would be.

So yay! Consider me at least preliminarily very happy.

Linus

2007-01-30 22:23:25

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH 0 of 4] Generic AIO by scheduling stacks



On Tue, 30 Jan 2007, Linus Torvalds wrote:
>
> But from a quick overview of the patches, I really don't see anything
> fundamentally wrong. It needs some error checking and some limiting (I
> _really_ don't think we want a regular user starting a thousand fibrils
> concurrently), but it actually looks much less invasive than I thought it
> would be.

Side note (and maybe this was obvious to people already): I would suggest
that the "limiting" not be done the way fork() is limited ("return EAGAIN
if you go over a limit") but be done as a per-task counting semaphore
(down() on submit, up() on fibril exit).

So we should limit these to basically have some maximum concurrency
factor, but rather than consider it an error to go over it, we'd just cap
the concurrency by default, so that people can freely use asynchronous
interfaces without having to always worry about what happens if their
resources run out..

However, that also implies that we should probably add a "flags" parameter
to "async_submit()" and have a FIBRIL_IMMEDIATE flag (or a timeout) or
something to tell the kernel to rather return EAGAIN than wait. Sometimes
you don't want to block just because you already have too much work.

Hmm?
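
A rough sketch of what that limiting might look like (the per-task "asys_sem"
semaphore and the FIBRIL_IMMEDIATE handling here are just assumptions for the
example, none of this is from the posted patches):

#include <linux/sched.h>
#include <linux/errno.h>
#include <asm/semaphore.h>

/*
 * Sketch only: cap the number of in-flight fibrils with a per-task
 * counting semaphore.  "asys_sem" and FIBRIL_IMMEDIATE are hypothetical.
 */
static int asys_get_slot(struct task_struct *tsk, unsigned long flags)
{
	if (flags & FIBRIL_IMMEDIATE)
		return down_trylock(&tsk->asys_sem) ? -EAGAIN : 0;

	/* block until a slot frees up, or a signal arrives */
	return down_interruptible(&tsk->asys_sem);
}

/* called when a fibril's system call finally completes */
static void asys_put_slot(struct task_struct *tsk)
{
	up(&tsk->asys_sem);
}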

Linus

2007-01-30 22:47:34

by Zach Brown

[permalink] [raw]
Subject: Re: [PATCH 0 of 4] Generic AIO by scheduling stacks

> I looked at this approach a long time ago, and basically gave up because
> it looked like too much work.

Indeed, your mention of it in that thread.. a year ago?.. is what got
this notion sitting in the back of my head. I didn't like it at
first, but it grew on me.

> I heartily approve, although I only gave the actual patches a very cursory
> glance. I think the approach is the proper one, but the devil is in the
> details. It might be that the stack allocation overhead or some other
> subtle fundamental problem ends up making this impractical in the end, but
> I would _really_ like for this to basically go in.

As for efficiency and overhead, I hope to get some time with the team
that works on the Giant Database Software Whose Name We Shall Not
Speak. That'll give us some non-trivial loads to profile.

> It won't matter at all for a certain class of calls (a lot of the people
> who want to do AIO really end up doing non-interruptible things, and
> signalling is a non-issue), but not only is it going to matter for some
> others, we will almost certainly want to have a way to not just signal a
> task, but a single "fibril" (and let me say that I'm not convinced about
> your naming, but I don't hate it either ;)

Yeah, no doubt. I'm wildly open to discussion here. (and yeah, me
either, but I don't care much about the name. I got tired of
qualifying overloaded uses of 'stack' or 'thread', that's all :)).

> But from a quick overview of the patches, I really don't see anything
> fundamentally wrong. It needs some error checking and some limiting (I
> _really_ don't think we want a regular user starting a thousand fibrils
> concurrently), but it actually looks much less invasive than I thought it
> would be.

I think we'll also want to flesh out the submission and completion
interface so that we don't find ourselves frustrated with it in
another 5 years. What's there now is just scaffolding to support the
interesting kernel-internal part. No doubt the kevent thread will
come into play here.

- z

2007-01-30 22:53:44

by Zach Brown

[permalink] [raw]
Subject: Re: [PATCH 0 of 4] Generic AIO by scheduling stacks

> So we should limit these to basically have some maximum concurrency
> factor, but rather than consider it an error to go over it, we'd just cap
> the concurrency by default, so that people can freely use asynchronous
> interfaces without having to always worry about what happens if their
> resources run out..

Yeah, call it the socket transmit queue model :). Maybe tuned by a
ulimit?

I don't have very strong opinions about the specific mechanics of
limiting concurrent submissions, as long as they're there. Some
folks in Oracle complain about having one more thing to have to tune,
but the alternative seems worse.

> However, that also implies that we should probably add a "flags" parameter
> to "async_submit()" and have a FIBRIL_IMMEDIATE flag (or a timeout) or
> something to tell the kernel to rather return EAGAIN than wait. Sometimes
> you don't want to block just because you already have too much work.

EAGAIN or the initial number of submissions completed before the one
that ran over the limit, perhaps. Sure. Nothing too controversial
here :). I have this kind of stuff queued up for worrying about
once the internal mechanics are stronger.

- z

2007-01-30 22:54:36

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH 0 of 4] Generic AIO by scheduling stacks



On Tue, 30 Jan 2007, Zach Brown wrote:
>
> I think we'll also want to flesh out the submission and completion interface
> so that we don't find ourselves frustrated with it in another 5 years. What's
> there now is just scaffolding to support the interesting kernel-internal part.
> No doubt the kevent thread will come into play here.

Actually, the thing I like about kernel micro-threads (and that "fibril"
name is starting to grow on me) is that I'm hoping we might be able to use
them for that kevent thing too. And perhaps some other issues (ACPI has
some "events" that might work with synchronously scheduled threads).

IOW, synchronous threading does have its advantages..

Btw, I noticed that you didn't Cc Ingo. Definitely worth doing. Not just
because he's basically the normal scheduler maintainer, but also because
he's historically been involved in things like the async filename lookup
that the in-kernel web server thing used. EXACTLY the kinds of things
where fibrils actually give you *much* nicer interfaces.

Ingo - see linux-kernel for the announcement and WIP patches.

Linus

2007-01-30 23:46:24

by Zach Brown

[permalink] [raw]
Subject: Re: [PATCH 0 of 4] Generic AIO by scheduling stacks

> Btw, I noticed that you didn't Cc Ingo. Definitely worth doing. Not just
> because he's basically the normal scheduler maintainer, but also because
> he's historically been involved in things like the async filename lookup
> that the in-kernel web server thing used.

Yeah, that was dumb. I had him in the cc: initially, then thought it
was too large and lobbed a bunch off. My mistake.

Ingo, I'm interested in your reaction to the i386-specific mechanics
here (the thread_info copies terrify me) and the general notion of
how to tie this cleanly into the scheduler.

- z

2007-01-31 02:04:32

by Benjamin Herrenschmidt

[permalink] [raw]
Subject: Re: [PATCH 0 of 4] Generic AIO by scheduling stacks

> - We would now have some measure of task_struct concurrency. Read that twice,
> it's scary. As two fibrils execute and block in turn they'll each be
> referencing current->. It means that we need to audit task_struct to make sure
> that paths can handle racing as its scheduled away. The current implementation
> *does not* let preemption trigger a fibril switch. So one only has to worry
> about racing with voluntary scheduling of the fibril paths. This can mean
> moving some task_struct members under an accessor that hides them in a struct
> in task_struct so they're switched along with the fibril. I think this is a
> manageable burden.

That's the one scaring me in fact ... Maybe it will end up being an easy
one but I don't feel too comfortable... we didn't create fibril-like
things for threads, instead, we share PIDs between tasks. I wonder if
the sane approach would be to actually create task structs (or have a
pool of them pre-created sitting there for performances) and add a way
to share the necessary bits so that syscalls can be run on those
spin-offs.

Ben.


2007-01-31 02:07:39

by Benjamin Herrenschmidt

[permalink] [raw]
Subject: Re: [PATCH 0 of 4] Generic AIO by scheduling stacks

On Tue, 2007-01-30 at 15:45 -0800, Zach Brown wrote:
> > Btw, I noticed that you didn't Cc Ingo. Definitely worth doing. Not just
> > because he's basically the normal scheduler maintainer, but also because
> > he's historically been involved in things like the async filename lookup
> > that the in-kernel web server thing used.
>
> Yeah, that was dumb. I had him in the cc: initially, then thought it
> was too large and lobbed a bunch off. My mistake.
>
> Ingo, I'm interested in your reaction to the i386-specific mechanics
> here (the thread_info copies terrify me) and the general notion of
> how to tie this cleanly into the scheduler.

Thread info copies aren't such a big deal, we do that for irq stacks
already afaik

Ben.


2007-01-31 02:46:40

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH 0 of 4] Generic AIO by scheduling stacks



On Wed, 31 Jan 2007, Benjamin Herrenschmidt wrote:

> > - We would now have some measure of task_struct concurrency. Read that twice,
> > it's scary. As two fibrils execute and block in turn they'll each be
> > referencing current->. It means that we need to audit task_struct to make sure
> > that paths can handle racing as its scheduled away. The current implementation
> > *does not* let preemption trigger a fibril switch. So one only has to worry
> > about racing with voluntary scheduling of the fibril paths. This can mean
> > moving some task_struct members under an accessor that hides them in a struct
> > in task_struct so they're switched along with the fibril. I think this is a
> > manageable burden.
>
> That's the one scaring me in fact ... Maybe it will end up being an easy
> one but I don't feel too comfortable...

We actually have almost zero "interesting" data in the task-struct.

All the real meat of a task has long since been split up into structures
that can be shared for threading anyway (ie signal/files/mm/etc).

Which is why I'm personally very comfy with just re-using task_struct
as-is.

NOTE! This is with the understanding that we *never* do any preemption.
The whole point of the microthreading as far as I'm concerned is exactly
that it is cooperative. It's not preemptive, and it's emphatically *not*
concurrent (ie you'd never have two fibrils running at the same time on
separate CPU's).

If you want preemptive or concurrent CPU usage, you use separate threads.
The point of AIO scheduling is very much inherent in its name: it's for
filling up CPU's when there's IO.

So the theory (and largely practice) is that you want to use real threads
to fill your CPU's, but then *within* those threads you use AIO to make
sure that each thread actually uses the CPU efficiently and doesn't just
block with nothing to do.

So with the understanding that this is neither CPU-concurrent nor
preemptive (*within* a fibril group - obviously the thread itself gets
both preempted and concurrently run with other threads), I don't worry at
all about sharing "struct task_struct".

Does that mean that we might not have some cases where we'd need to make
sure we do things differently? Of course not. Something might show up. But
this actually makes it very clear what the difference between "struct
thread_struct" and "struct task_struct" is. One is shared between
fibrils, the other isn't.

Linus

2007-01-31 03:02:27

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH 0 of 4] Generic AIO by scheduling stacks



On Tue, 30 Jan 2007, Linus Torvalds wrote:
>
> Does that mean that we might not have some cases where we'd need to make
> sure we do things differently? Of course not. Something might show up. But
> this actually makes it very clear what the difference between "struct
> thread_struct" and "struct task_struct" is. One is shared between
> fibrils, the other isn't.

Btw, this is also something where we should just disallow certain system
calls from being done through the asynchronous method.

Notably, clone/fork(), execve() and exit() are all things that we probably
simply shouldn't allow as "AIO" events.

The process handling ones are obvious: they are very much about the shared
"struct task_struct", so they rather clearly should only done "natively".

More interesting is the question about "close()", though. Currently we
have an optimization (fget/fput_light) that basically boils down to "we
know we are the only owners". That optimization becomes more "interesting"
with AIO - we need to disable it when fibrils are active (because other
fibrils or the main thread can do it), but we can still keep it for the
non-fibril case.

So we can certainly allow close() to happen in a fibril, but at the same
time, this is an area where just the *existence* of fibrils means that
certain other decisions that were thread-related may be modified to be
aware of the micro-threads too.

I suspect there are rather few of those, though. The only one I can think
of is literally the fget/fput_light() case, but there could be others.
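
Roughly, the fast-path test would just grow one more condition.  This is only
a sketch of the idea, not the real fs/file_table.c code, and the
task_has_fibrils() helper is made up (though checking the task's fibril
pointer from these patches would be one conservative way to implement it):

#include <linux/sched.h>
#include <linux/file.h>

/* hypothetical helper: has this task ever started running fibrils? */
static inline int task_has_fibrils(struct task_struct *tsk)
{
	return tsk->fibril != NULL;
}

/*
 * Sketch only: the lockless "sole owner" shortcut is valid only when
 * nothing else in the task can race with us on current->files.
 */
static inline int fd_lookup_can_skip_refcount(void)
{
	return atomic_read(&current->files->count) == 1 &&
	       !task_has_fibrils(current);
}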

Linus

2007-01-31 05:17:05

by Benjamin Herrenschmidt

[permalink] [raw]
Subject: Re: [PATCH 0 of 4] Generic AIO by scheduling stacks


> NOTE! This is with the understanding that we *never* do any preemption.
> The whole point of the microthreading as far as I'm concerned is exactly
> that it is cooperative. It's not preemptive, and it's emphatically *not*
> concurrent (ie you'd never have two fibrils running at the same time on
> separate CPU's).

That makes it indeed much less worrisome...

> If you want preemptive of concurrent CPU usage, you use separate threads.
> The point of AIO scheduling is very much inherent in its name: it's for
> filling up CPU's when there's IO.

Ok, I see, that's in fact pretty similar to some task switching hack I
did about 10 years ago on MacOS to have "asynchronous" IO code be
implemented linearly :-)

Makes lots of sense imho.

Ben.


2007-01-31 05:37:24

by Nick Piggin

[permalink] [raw]
Subject: Re: [PATCH 0 of 4] Generic AIO by scheduling stacks

Linus Torvalds wrote:
>
> On Wed, 31 Jan 2007, Benjamin Herrenschmidt wrote:
>
>
>>>- We would now have some measure of task_struct concurrency. Read that twice,
>>>it's scary. As two fibrils execute and block in turn they'll each be
>>>referencing current->. It means that we need to audit task_struct to make sure
>>>that paths can handle racing as its scheduled away. The current implementation
>>>*does not* let preemption trigger a fibril switch. So one only has to worry
>>>about racing with voluntary scheduling of the fibril paths. This can mean
>>>moving some task_struct members under an accessor that hides them in a struct
>>>in task_struct so they're switched along with the fibril. I think this is a
>>>manageable burden.
>>
>>That's the one scaring me in fact ... Maybe it will end up being an easy
>>one but I don't feel too comfortable...
>
>
> We actually have almost zero "interesting" data in the task-struct.
>
> All the real meat of a task has long since been split up into structures
> that can be shared for threading anyway (ie signal/files/mm/etc).
>
> Which is why I'm personally very comfy with just re-using task_struct
> as-is.
>
> NOTE! This is with the understanding that we *never* do any preemption.
> The whole point of the microthreading as far as I'm concerned is exactly
> that it is cooperative. It's not preemptive, and it's emphatically *not*
> concurrent (ie you'd never have two fibrils running at the same time on
> separate CPU's).

So using stacks to hold state is (IMO) the logical choice to do async
syscalls, especially once you have a look at some of the other AIO
stuff going around.

I always thought that the AIO people didn't do this because they wanted
to avoid context switch overhead.

So now if we introduce the context switch overhead back, why do we need
just another scheduling primitive? What's so bad about using threads? The
upside is that almost everything is already there and working, and also
they don't have any of these preemption or concurrency restrictions.

The only thing I saw in Zach's post against the use of threads is that
some kernel API would change. But surely if this is the showstopper then
there must be some better argument than sys_getpid()?!

Aside from that, I'm glad that someone is looking at this way for AIO,
because I really don't like some aspects in the other approach.

--
SUSE Labs, Novell Inc.

2007-01-31 05:52:01

by Nick Piggin

[permalink] [raw]
Subject: Re: [PATCH 0 of 4] Generic AIO by scheduling stacks

Nick Piggin wrote:
> Linus Torvalds wrote:
>
>>
>> On Wed, 31 Jan 2007, Benjamin Herrenschmidt wrote:
>>
>>
>>>> - We would now have some measure of task_struct concurrency. Read that
>>>> twice, it's scary. As two fibrils execute and block in turn they'll each
>>>> be referencing current->. It means that we need to audit task_struct to
>>>> make sure that paths can handle racing as its scheduled away. The current
>>>> implementation *does not* let preemption trigger a fibril switch. So one
>>>> only has to worry about racing with voluntary scheduling of the fibril
>>>> paths. This can mean moving some task_struct members under an accessor
>>>> that hides them in a struct in task_struct so they're switched along with
>>>> the fibril. I think this is a manageable burden.
>>>
>>>
>>> That's the one scaring me in fact ... Maybe it will end up being an easy
>>> one but I don't feel too comfortable...
>>
>>
>>
>> We actually have almost zero "interesting" data in the task-struct.
>>
>> All the real meat of a task has long since been split up into
>> structures that can be shared for threading anyway (ie
>> signal/files/mm/etc).
>>
>> Which is why I'm personally very comfy with just re-using task_struct
>> as-is.
>>
>> NOTE! This is with the understanding that we *never* do any
>> preemption. The whole point of the microthreading as far as I'm
>> concerned is exactly that it is cooperative. It's not preemptive, and
>> it's emphatically *not* concurrent (ie you'd never have two fibrils
>> running at the same time on separate CPU's).
>
>
> So using stacks to hold state is (IMO) the logical choice to do async
> syscalls, especially once you have a look at some of the other AIO
> stuff going around.
>
> I always thought that the AIO people didn't do this because they wanted
> to avoid context switch overhead.
>
> So now if we introduce the context switch overhead back, why do we need
> just another scheduling primitive? What's so bad about using threads? The
> upside is that almost everything is already there and working, and also
> they don't have any of these preemption or concurrency restrictions.

In other words, while I share the appreciation for this clever trick of
using cooperative switching between these little thriblets, I don't
actually feel it is very elegant to then have to change the kernel so
much in order to handle them.

I would be fascinated to see where such a big advantage comes from using
these rather than threads. Maybe we can then improve threads not to suck
so much and everybody wins.

--
SUSE Labs, Novell Inc.

2007-01-31 06:07:18

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH 0 of 4] Generic AIO by scheduling stacks



On Wed, 31 Jan 2007, Nick Piggin wrote:
>
> I always thought that the AIO people didn't do this because they wanted
> to avoid context switch overhead.

I don't think that scheduling overhead was ever really the reason, at
least not the primary one, and at least not on Linux. Sure, we can
probably make cooperative thread switching a bit faster than even
VM-sharing thread switching (maybe), but it's not going to be *that* big
an issue.

Afaik, the bigger issues were about setup costs (but also purely semantic
- it was hard to do AIO semantics with threads).

And memory costs. The "one stack page per outstanding AIO" may end up
still being too expensive, but threads were even more so.

[ Of course, that used to also be the claim by all the people who thought
we couldn't do native kernel threads for "normal" threading either, and
should go with the n*m setup. Shows how much they knew ;^]

But I've certainly _personally_ always wanted to do AIO with threads. I
wanted to do it with regular threads (ie the "clone()" kind). It didn't
fly. But I think we can possibly lower both the setup costs and the memory
costs with the cooperative approach, to the point where maybe this one is
more palatable and workable.

And maybe it also solves some of the scalability worries (threads have ID
space and scheduling setup things that essentially go away by just not
doing them - which is what the fibrils simply wouldn't have).

(Sadly, some of the people who really _use_ AIO are the database people,
and they really only care about a particularly stupid and trivial case:
pure reads and writes. A lot of other loads care about much more complex
things: filename lookups etc, that traditional AIO cannot do at all, and
that you really want something more thread-like for. But those other loads
get kind of swamped by the DB needs, which are much tighter and trivial
enough that you don't "need" a real thread for them).

Linus

2007-01-31 07:58:03

by Andi Kleen

[permalink] [raw]
Subject: Re: [PATCH 4 of 4] Introduce aio system call submission and completion system calls

Zach Brown <[email protected]> writes:

> This finally does something useful with the notion of being able to schedule
> stacks as fibrils under a task_struct. Again, i386-specific and in need of
> proper layering with archs.
>
> sys_asys_submit() is added to let userspace submit asynchronous system calls.
> It specifies the system call number and arguments. A fibril is constructed for
> each call. Each starts with a stack which executes the given system call
> handler and then returns to a function which records the return code of the
> system call handler. sys_asys_await_completion() then lets userspace collect
> these results.

Do you have any numbers on how this compares cycle-wise to just doing
clone+syscall+exit in user space?

If the difference is not too big might it be easier to fix
clone+syscall to be more efficient than teach all the rest
of the kernel about fibrils?
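
(For reference, the userspace shape of that comparison would be roughly the
following -- purely an illustration of the clone-per-call model, not a
benchmark:)

#define _GNU_SOURCE
#include <sched.h>
#include <signal.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/syscall.h>
#include <sys/wait.h>

static volatile long result;

/* the child's whole life: one system call, store the result, exit */
static int run_one_call(void *arg)
{
	result = syscall(SYS_getpid);	/* stand-in for the real call */
	return 0;
}

int main(void)
{
	char *stack = malloc(16384);
	pid_t child;

	/* share the VM so the parent sees "result" directly */
	child = clone(run_one_call, stack + 16384,
		      CLONE_VM | CLONE_FS | CLONE_FILES | SIGCHLD, NULL);
	waitpid(child, NULL, 0);

	printf("result %ld\n", result);
	free(stack);
	return 0;
}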

-Andi

2007-01-31 08:45:36

by Ingo Molnar

[permalink] [raw]
Subject: Re: [PATCH 0 of 4] Generic AIO by scheduling stacks


* Linus Torvalds <[email protected]> wrote:

> [ Of course, that used to also be the claim by all the people who
> thought we couldn't do native kernel threads for "normal" threading
> either, and should go with the n*m setup. Shows how much they knew
> ;^]
>
> But I've certainly _personally_ always wanted to do AIO with threads.
> I wanted to do it with regular threads (ie the "clone()" kind). It
> didn't fly. But I think we can possibly lower both the setup costs and
> the memory costs with the cooperative approach, to the point where
> maybe this one is more palatable and workable.

as you know i've been involved in all the affected IO and scheduling
disciplines in question: i hacked on scheduling, 1:1 threading, on Tux,
i even optimized kaio under TPC workloads, and various other things. So
i believe to have seen most of these things first-hand and thus i
(pretend to) have no particular prejudice or other subjective
preference. In the following, rather long (and boring) analysis i touch
upon both 1:1 threading, light threads, state machines and AIO.

The first and most important thing is that i think there are only /two/
basic, fundamental IO approaches that matter to kernel design:

1) the most easy to program one.

2) the fastest one.

1), the most easily programmed model: this is tasks and synchronous IO.
Samba (and Squid, and even Apache most of the time ...) is still using
plain old /processes/ and is doing fine! Most of the old LWP arguments
are not really true anymore, TLB switches are very fast meanwhile (per
hardware improvements) and our scheduler is still pretty light.
Furthermore, for certain specialized environments that are able isolate
the programmer from the risks and complexities of thread programming,
threading can be somewhat faster (such as Java or C#). But for the
general C/C++ based server environment even threading is pure hell. (We
know it from the kernel that programming a threaded design is /hard/,
needs alot of discipline and takes alot more resources than any of the
saner models.)

2) the fastest model: is a pure, minimal state-machine tailored to the
task/workload we want to do. Tux does that (and i'm keeping it uptodate
for current 2.6 kernels too), and it still beats the pants off anything
user-space.

But reality is that most people care more about programmability: 300
MB/sec or 500 MB/sec web throughput limit (or a serving limit 30K
requests in parallel or 60K requests in parallel - whatever the
performance metric you care about is) isnt really making any difference
in all but the most specialized environments - /because/ the price to
pay for it is much worse programmability (and hence much worse
flexibility). [ And this is all about basic economy: if it's 10 times
harder to program something and what takes 1 month to program in one
model takes 10 months to program in the other model - but in 10 months
the hardware already got another 50% faster, likely offsetting the
advantage of that 'other' model to begin with ... ]

Note that often the speed difference between task based and state based
designs is alot smaller. But God is the programming complexity
difference huge. Not even huge but /HUGE/. All our programming languages
are procedural and stack based. (Not only is there a basic lack of
state-based programming languages, it's a product of our brain
structure: state machines are not really what the human mind is
structured for, but i digress.)

Now where do all these LWP, fibre, fibril, micro-thread or N:M concepts
fit? Most of the time they are just a /weakening/ of the #1 concept. And
that's why they will lose out, because #1 is all about programmability
and they dont offer anything new: because they cannot. Either you go for
programmability or you go for performance. There is /no/ middle ground
for us in the kernel! (User-space can do the middle ground thing, but
the kernel must only be structured around the two most fundamental
concepts that exist. [NOTE: more about KAIO later.])

Having a 1:1 relationship between user-space and kernel-space context is
a strong concept, and on modern CPUs we perform /process/ context
switches in 650 nsecs, we do user thread context switches in 500 nsecs,
and kernel<->kernel thread context switches in 350 nsecs. That's pretty
damn fast.

The big problem is, the M:N concepts (and i consider micro-threads or
'schedulable stacks' to be of that type) tend to operate the following
way: they build up hype by being "faster" - but the performance
advantage only comes from weakening #1! (for example by not making the
contexts generally schedulable, bindable, load-balancable, suspendable,
migratable, etc.) But then a few years later they grow (by virtue of
basic pressure from programmers, who /want/ a clean easily programmable
concept #1) the /very same/ features that they "optimized away" in the
beginning. With the difference that now all that infrastructure they
left out initially is replicated in user-space, in a poorer and
inconsistent way, blowing up cache-size, and slowing the /sane/ things
down in the end. (and then i havent even begun about the nightmare of
extending the debug infrastructure to light threads, ABI dependencies,
etc, etc...)

One often repeated (because pretty much only) performance advantage of
'light threads' is context-switch performance between user-space
threads. But reality is, nobody /cares/ about being able to
context-switch between "light user-space threads"! Why? Because there
are only two reasons why such a high-performance context-switch would
occur:

1) there's contention between those two tasks. Wonderful: now two
artificial threads are running on the /same/ CPU and they are even
contending each other. Why not run a single context on a single CPU
instead and only get contended if /another/ CPU runs a conflicting
context?? While this makes for nice "pthread locking benchmarks",
it is not really useful for anything real.

2) there has been an IO event. The thing is, for IO events we enter the
kernel no matter what - and we'll do so for the next 10 years at
minimum. We want to abstract away the hardware, we want to do
reliable resource accounting, we want to share hardware resources,
we want to rate-limit, etc., etc. While in /theory/ you could handle
IO purely from user-space, in practice you dont want to do that. And
if we accept the premise that we'll enter the kernel anyway, there's
zero performance difference between scheduling right there in the
kernel, or returning back to user-space to schedule there. (in fact
i submit that the former is faster). Or if we accept the theoretical
possibility of 'perfect IO hardware' that implements /all/ the
features that the kernel wants (in a secure and generic way, and
mind you, such IO hardware does not exist yet), then /at most/ the
performance advantage of user-space doing the scheduling is the
overhead of a null syscall entry. Which is a whopping 100 nsecs on
modern CPUs! That's roughly the latency of a /single/ DRAM access!

Furthermore, 'light thread' concepts can no way approach the performance
of #2 state-machines: if you /know/ what the structure of your context
is, and you can program it in a specialized state-machine way, there's
just so many shortcuts possible that it's not even funny.

> And maybe it also solves some of the scalability worries (threads have
> ID space and scheduling setup things that essentially go away by just
> not doing them - which is what the fibrils simply wouldn't have).

our PID management is very fast - i've never seen it show up in
profiles. (We could make it even more scalable if the need arises, for
example right now we still maintain the luxury of having a globally
increasing and tightly compressed PID/TID space.)

until a few days ago we even had a /global lock/ in the thread/task
creation and destruction hotpath since 2.5.43 (when oprofile support was
added, more than 4 years ago, the oprofile exit notifier lock) and
nobody really noticed or cared!

so i believe there are only two directions that make sense in the long
run:

1) improve our basic #1 design gradually. If something is a bottleneck,
if the scheduler has grown too fat, cut some slack. If micro-threads
or fibrils offer anything nice for our basic thread model: integrate
it into the kernel. (Dont worry about having to enter the kernel for
that, we /want/ that isolation! Resource control /needs/ a secure and
trustable context. Good debuggability /needs/ a secure and trustable
context. Hardware makers will make damn sure that such hardware-based
isolation is fast as lightning. SYSENTER will continue to be very
fast.) In any case, unless someone can answer the issues i raised
above, please dont mess with the basic 1:1 task model we have. It's
really what we want to have.

2) enable crazy people to do IO in the #2 way. For this i think the most
programmable interface is KAIO - because the 'context' /only/
involves the IO entity itself, which is simple enough for our brain
to wrap itself around. (and the hardware itself has the concept of
'in flight IO' anyway, so people are forced to learn about it and to
deal with it anyway, even in the synchronous IO model. So there's
quite some mindshare to build upon.) This also happens to realize
/most/ of the performance advantages that a state-machine like Tux
tends to have. (In that sense i dont see epoll to be conceptually
different from KAIO, although internally within the kernel it's quite
a bit different.)

now, KAIO 'behind the hood' (within the kernel) is like /very/ hard to
program in a fully state-driven manner - but we are crazy folks who
/might/ be able to pull it off. Initially we should just simulate the
really hard bits via a pool of in-kernel synchronous threads. Some IO
disciplines are already state-driven internally (networking most
notably, but also timers), so there KAIO can be implemented 'natively'.

And it will be /faster/ than micro-threads based AIO!

but frankly, KAIO is the most i can see a normal, sane human programmer
being able to deal with. Any more exposure to state-machines will just
drive them crazy, and they'll lynch us (or more realistically: ignore
us) if we try anything more. And with KAIO they still have the /option/
to make all their user-space functionality state-driven. Also, abstract,
fully managed programming environments like Java can hide KAIO
complexities by implementing their synchronous primitives using KAIO.

Ingo

2007-01-31 10:50:48

by Xavier Bestel

[permalink] [raw]
Subject: Re: [PATCH 0 of 4] Generic AIO by scheduling stacks

On Tue, 2007-01-30 at 19:02 -0800, Linus Torvalds wrote:
> Btw, this is also something where we should just disallow certain system
> calls from being done through the asynchronous method.

Does that mean that doing an AIO-disabled syscall will wait for all in-
flight AIO syscalls to finish ?

Xav


2007-01-31 17:15:49

by Zach Brown

[permalink] [raw]
Subject: Re: [PATCH 4 of 4] Introduce aio system call submission and completion system calls


On Jan 31, 2007, at 12:58 AM, Andi Kleen wrote:

> Do you have any numbers how this compares cycle wise to just doing
> clone+syscall+exit in user space?

Not yet, no. Release early, release often, and all that. I'll throw
something together.

- z

2007-01-31 17:22:35

by Andi Kleen

[permalink] [raw]
Subject: Re: [PATCH 4 of 4] Introduce aio system call submission and completion system calls

On Wednesday 31 January 2007 18:15, Zach Brown wrote:
>
> On Jan 31, 2007, at 12:58 AM, Andi Kleen wrote:
>
> > Do you have any numbers how this compares cycle wise to just doing
> > clone+syscall+exit in user space?
>
> Not yet, no. Release early, release often, and all that. I'll throw
> something together.

So what was the motivation for doing this then? Its only point
is to have smaller startup costs for AIO than clone+fork without
fixing the VFS code to be a state machine, right?

I'm personally unclear if it's really less work to teach a lot of
code in the kernel about a new thread abstraction than changing VFS.

Your patches don't look that complicated yet but you openly
admitted you waved away many of the more tricky issues (like
signals etc.) and I bet there are yet-unknown side effects
of this too that will need more changes.

I would expect a VFS solution to be the fastest of any at least.

I'm not sure the fibrils thing will be that much faster than
a possibly somewhat fast pathed for this case clone+syscall+exit.

-Andi

2007-01-31 17:38:45

by Zach Brown

[permalink] [raw]
Subject: Re: [PATCH 0 of 4] Generic AIO by scheduling stacks

>> - We would now have some measure of task_struct concurrency. Read
>> that twice,
>> it's scary.

> That's the one scaring me in fact ... Maybe it will end up being an
> easy
> one but I don't feel too comfortable...

Indeed, that was my first reaction too. I dismissed the idea for a
good six months after initially realizing that it implied sharing
journal_info, etc.

But when I finally sat down and started digging through the
task_struct members and, after quickly dismissing involuntary
preemption of the fibrils, it didn't seem so bad. I haven't done an
exhaustive audit yet (and I won't advocate merging until I have) but
I haven't seen any train wrecks.

> we didn't create fibril-like
> things for threads, instead, we share PIDs between tasks. I wonder if
> the sane approach would be to actually create task structs (or have a
> pool of them pre-created sitting there for performances) and add a way
> to share the necessary bits so that syscalls can be run on those
> spin-offs.

Maybe, if it comes to that. I have some hopes that sharing by
default and explicitly marking the bits that we shouldn't share will
be good enough.

- z

2007-01-31 17:48:19

by Zach Brown

[permalink] [raw]
Subject: Re: [PATCH 0 of 4] Generic AIO by scheduling stacks

> Does that mean that we might not have some cases where we'd need to
> make
> sure we do things differently? Of course not. Something might show up.

Might, and has. For a good time, take journal_info out of
per_call_chain() in the patch set and watch it helpfully and loudly
wet itself. There really really are bits of thread_struct which are
strictly thread-local-storage, of a sort, for a kernel call path.
Sharing them, even only through cooperative scheduling, is fatal.
link_count is another obvious one.

They're also the only ones I've bothered to discover so far :).

> But
> this actually makes it very clear what the difference between "struct
> thread_struct" and "struct task_struct" are. One is shared between
> fibrils, the other isn't.

Indeed.

Right now the per-fibril uses of task_struct members are left inline
in task_struct and are copied on fibril switches.

We *could* put them in thread_info, at the cost of stack pressure, or
hang them off task_struct in their own struct to avoid the copies, at
the cost of indirection. I didn't like imposing a cost on paths that
don't use fibrils, though, so I left them inline.
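
To make that concrete with stand-in types (nothing below is the actual
patch code), the per-call-chain members called out earlier in the thread
(journal_info and link_count) would be swapped as part of a fibril switch
roughly like this:

    /* Stand-ins for the real task_struct/fibril; illustration only. */
    struct stand_in_task {
            void *journal_info;     /* journalling state for this call chain */
            int   link_count;       /* symlink recursion depth in the VFS    */
            /* ... everything else stays shared between fibrils ...          */
    };

    struct stand_in_fibril {
            void *journal_info;     /* parked copy while the fibril sleeps   */
            int   link_count;
    };

    static void switch_per_call_chain(struct stand_in_task *tsk,
                                      struct stand_in_fibril *prev,
                                      struct stand_in_fibril *next)
    {
            /* Park the outgoing call chain's private state... */
            prev->journal_info = tsk->journal_info;
            prev->link_count   = tsk->link_count;

            /* ...and install the incoming one's, so the journalling code
             * never sees a transaction started by a different fibril. */
            tsk->journal_info = next->journal_info;
            tsk->link_count   = next->link_count;
    }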

(I think you know all this. I'm clarifying for the peanut gallery, I
hope.)

- z

2007-01-31 17:51:33

by Benjamin LaHaise

[permalink] [raw]
Subject: Re: [PATCH 0 of 4] Generic AIO by scheduling stacks

On Wed, Jan 31, 2007 at 09:38:11AM -0800, Zach Brown wrote:
> Indeed, that was my first reaction too. I dismissed the idea for a
> good six months after initially realizing that it implied sharing
> journal_info, etc.
>
> But when I finally sat down and started digging through the
> task_struct members and, after quickly dismissing involuntary
> preemption of the fibrils, it didn't seem so bad. I haven't done an
> exhaustive audit yet (and I won't advocate merging until I have) but
> I haven't seen any train wrecks.

I'm still of the opinion that you cannot do this without creating actual
threads. That said, there is room for setting up the task_struct beforehand
without linking it into the system lists. The reason I don't think this
approach works (and I looked at it a few times) is that many things end
up requiring special handling: things like permissions, signals, FPU state,
segment registers.... The list ends up looking exactly the way task_struct
is, making the actual savings very small.

What the fibrils approach is useful for is the launching of the thread
initially, as you don't have to retain things like the current FPU state,
change segment registers, etc. Changing the stack is cheap, the rest of
the work is not.

-ben
--
"Time is of no importance, Mr. President, only life is important."
Don't Email: <[email protected]>.

2007-01-31 18:00:16

by Zach Brown

[permalink] [raw]
Subject: Re: [PATCH 0 of 4] Generic AIO by scheduling stacks

> Btw, this is also something where we should just disallow certain
> system
> calls from being done through the asynchronous method.

Yeah. Maybe just a bitmap built from __NR_ constants? I don't know
if we can do it in a way that doesn't require arch maintainer's
attention.

It seems like it would be nice to avoid putting a test in the
handlers themselves, and leave it up to the aio syscall submission
processing.
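
As a rough illustration of that bitmap idea (every name below is invented
for the example, nothing here is from the patch set), the check could boil
down to something like:

    /* Hypothetical deny-list of syscall numbers for asys submission. */
    #define NR_syscalls_max 512
    #define BITS_PER_LONG   (8 * sizeof(unsigned long))

    static unsigned long asys_denied_calls[NR_syscalls_max / BITS_PER_LONG];

    static void asys_deny(unsigned int nr)
    {
            asys_denied_calls[nr / BITS_PER_LONG] |= 1UL << (nr % BITS_PER_LONG);
    }

    /* Checked once at submission time, before any fibril is built. */
    static int asys_call_allowed(unsigned int nr)
    {
            if (nr >= NR_syscalls_max)
                    return 0;
            return !(asys_denied_calls[nr / BITS_PER_LONG] &
                     (1UL << (nr % BITS_PER_LONG)));
    }

Each arch would presumably seed the table from its own __NR_ values
(asys_deny(__NR_fork), asys_deny(__NR_execve), ...), and submission would
fail the call with -EINVAL or -ENOSYS instead of ever reaching the handler.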

> More interesting is the question about "close()", though. Currently we
> have an optimization (fget/fput_light) that basically boils down to
> "we
> know we are the only owners". That optimization becomes more
> "interesting"
> with AIO - we need to disable it when fibrils are active (because
> other
> fibrils or the main thread can do it), but we can still keep it for
> the
> non-fibril case.

I'll take a look, thanks for pointing it out.

- z

2007-01-31 18:22:04

by Zach Brown

[permalink] [raw]
Subject: Re: [PATCH 0 of 4] Generic AIO by scheduling stacks

> The only thing I saw in Zach's post against the use of threads is that
> some kernel API would change. But surely if this is the showstopper
> then
> there must be some better argument than sys_getpid()?!

Haha, yeah, that's the silly example I keep throwing around :). I
guess it does leave a little too much of the exercise up to the reader.

Perhaps a less goofy example are the uses of current->ioprio and
current->io_context?

If you create and destroy threads around each operation then you're
going to be creating and destroying an io_context around each op
instead of getting a reference on a pre-existing context in
additional ops. ioprio is inherited it seems, though, so that's not
so bad.

If you have a pool of threads and you want to update the ioprio for
future IOs, you now have to sync up the pool's ioprio with new
desired priority.

It's all solvable, sure. Get an io_context ref in copy_process().
Share ioprio instead of inheriting it. Have a fun conversation with
Jens about the change in behaviour this implies. Broadcasting to
threads to update ioprio isn't exactly rocket science.

But with the fibril model the user doesn't have to know to worry about
the inconsistencies and we kernel developers don't have to worry
about pro-actively stamping them out. A series of sync write and
ioprio setting calls behaves exactly the same as that series of calls
issued sequentially as "async" calls. That's worth aiming for, I think.
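
A tiny user-space sketch of the behaviour being argued for (the asys
submission wrapper is hypothetical; only ioprio_set() is real):

    #include <unistd.h>
    #include <sys/syscall.h>

    #ifndef __NR_ioprio_set
    #define __NR_ioprio_set 289             /* i386 number, illustration only */
    #endif

    #define IOPRIO_WHO_PROCESS      1
    #define IOPRIO_CLASS_BE         2
    #define IOPRIO_CLASS_SHIFT      13

    int main(void)
    {
            /* Drop this task to the lowest best-effort priority. */
            syscall(__NR_ioprio_set, IOPRIO_WHO_PROCESS, 0,
                    (IOPRIO_CLASS_BE << IOPRIO_CLASS_SHIFT) | 7);

            /*
             * With fibrils the async write below runs on the very same
             * task_struct, so it sees the new ioprio exactly as a
             * synchronous write() issued here would.  A helper-thread
             * pool would have to be told about the change separately.
             */
            /* asys_submit(__NR_write, fd, buf, len);   hypothetical wrapper */
            return 0;
    }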

- z

2007-01-31 19:24:13

by Zach Brown

[permalink] [raw]
Subject: Re: [PATCH 4 of 4] Introduce aio system call submission and completion system calls


On Jan 31, 2007, at 9:21 AM, Andi Kleen wrote:

> On Wednesday 31 January 2007 18:15, Zach Brown wrote:
>>
>> On Jan 31, 2007, at 12:58 AM, Andi Kleen wrote:
>>
>>> Do you have any numbers how this compares cycle wise to just doing
>>> clone+syscall+exit in user space?
>>
>> Not yet, no. Release early, release often, and all that. I'll throw
>> something together.
>
> So what was the motivation for doing this then?

Most fundamentally? Providing AIO system call functionality at a
much lower maintenance cost. The hope is that the cost of adopting
these fibril things will be lower than the cost of having to touch a
code path that wants AIO support.

I simply don't believe that it's cheap to update code paths to
support non-blocking state machines. As just one example of a
looming cost, consider the retry-based buffered fs AIO patches that
exist today. Their requirement to maintain these precisely balanced
code paths that know to only return -EIOCBRETRY once they're at a
point where retries won't access current-> seems.. unsustainable to
me. This stems from the retries being handled off in the aio kernel
threads which have their own task_struct. fs/aio.c goes to the
trouble of migrating ->mm from the submitting task_struct, but
nothing else. Continually adjusting this finely balanced
relationship between paths that return -EIOCBRETRY and the fields of
task_struct that fs/aio.c knows to share with the submitting context
seems unacceptably fragile.

Even with those buffered IO patches we still only get non-blocking
behaviour at a few specific blocking points in the buffered IO path.
It's nothing like the guarantee of non-blocking submission returns
that the fibril-based submission guarantees.

> Its only point
> is to have smaller startup costs for AIO than clone+fork without
> fixing the VFS code to be a state machine, right?

Smaller startup costs and fewer behavioural differences. Did that
message to Nick about ioprio and io_context resonate with you at all?

> I'm personally unclear if it's really less work to teach a lot of
> code in the kernel about a new thread abstraction than changing VFS.

Why are we limiting the scope of moving to a state machine just to
the VFS? If you look no further than some hypothetical AIO iscsi/aoe/
nbd/whatever target you obviously include networking. Probably splice
() if you're aggressive :).

Let's be clear. I would be thrilled if AIO was implemented by native
non-blocking handler implementations. I don't think it will happen.
Not because we don't think it sounds great on paper, but because it's
a hugely complex endeavor that would take development and maintenance
effort away from the task of keeping basic functionality working.

So the hope with fibrils is that we lower the barrier to getting AIO
syscall support across the board at an acceptable cost.

It doesn't *stop* us from migrating very important paths (storage,
networking) to wildly optimized AIO implementations. But it also
doesn't force AIO support to wait for that.

> Your patches don't look that complicated yet but you openly
> admitted you waved away many of the more tricky issues (like
> signals etc.) and I bet there are yet-unknown side effects
> of this too that will need more changes.

To quibble, "waved away" implies that they've been dismissed. That's
not right. It's a work in progress, so yes, there will be more
fiddly details discovered and addressed over time. The hope is that
when it's said and done it'll still be worth merging. If at some
point it gets to be too much, well, at least we'll have this work to
reference as a decisive attempt.

> I'm not sure the fibrils thing will be that much faster than
> a possibly somewhat fast pathed for this case clone+syscall+exit.

I'll try and get some numbers for you sooner rather than later.

Thanks for being diligent, this is exactly the kind of hard look I
want this work to get.

- z

2007-01-31 19:26:06

by Zach Brown

[permalink] [raw]
Subject: Re: [PATCH 0 of 4] Generic AIO by scheduling stacks

> without linking it into the system lists. The reason I don't think
> this
> approach works (and I looked at it a few times) is that many things
> end
> up requiring special handling: things like permissions, signals,
> FPU state,
> segment registers....

Can you share a specific example of the special handling required?

- z

2007-01-31 19:29:07

by Zach Brown

[permalink] [raw]
Subject: Re: [PATCH 0 of 4] Generic AIO by scheduling stacks


On Jan 31, 2007, at 2:50 AM, Xavier Bestel wrote:

> On Tue, 2007-01-30 at 19:02 -0800, Linus Torvalds wrote:
>> Btw, this is also something where we should just disallow certain
>> system
>> calls from being done through the asynchronous method.
>
> Does that mean that doing an AIO-disabled syscall will wait for all
> in-
> flight AIO syscalls to finish ?

That seems unlikely. I imagine we'd just return EINVAL or ENOSYS or
something to that effect.

- z

2007-01-31 20:05:18

by Benjamin LaHaise

[permalink] [raw]
Subject: Re: [PATCH 0 of 4] Generic AIO by scheduling stacks

On Wed, Jan 31, 2007 at 11:25:30AM -0800, Zach Brown wrote:
> >without linking it into the system lists. The reason I don't think
> >this
> >approach works (and I looked at it a few times) is that many things
> >end
> >up requiring special handling: things like permissions, signals,
> >FPU state,
> >segment registers....
>
> Can you share a specific example of the special handling required?

Take FPU state: memory copies and RAID xor functions use MMX/SSE and
require that the full task state be saved and restored.

Task priority is another. POSIX AIO lets you specify request priority, and
it really is needed for realtime workloads where things like keepalive
must be processed at a higher priority. This is especially important on
embedded systems which don't have a surplus of CPU cycles.

-ben
--
"Time is of no importance, Mr. President, only life is important."
Don't Email: <[email protected]>.

2007-01-31 20:20:45

by Joel Becker

[permalink] [raw]
Subject: Re: [PATCH 0 of 4] Generic AIO by scheduling stacks

On Tue, Jan 30, 2007 at 10:06:48PM -0800, Linus Torvalds wrote:
> (Sadly, some of the people who really _use_ AIO are the database people,
> and they really only care about a particularly stupid and trivial case:
> pure reads and writes. A lot of other loads care about much more complex
> things: filename lookups etc, that traditional AIO cannot do at all, and
> that you really want something more thread-like for. But those other loads
get kind of swamped by the DB needs, which are much tighter and trivial
> enough that you don't "need" a real thread for them).

While certainly not an exhaustive list, DB people love their
reads and writes, but are pining for network reads and writes. They
also are very excited about async poll(), connect(), and accept(). One
of the big problems today is that you can either sleep for your I/O in
io_getevents() or for your connect()/accept() in poll()/epoll(), but
there is nowhere you can sleep for all of them at once. That's why the
aio list continually proposes aio_poll() or returning aio events
via epoll().
Fibril based syscalls would allow async connect()/accept() and
the rest of networking, plus one stop shopping for completions.

Joel


--

"Born under a bad sign.
I been down since I began to crawl.
If it wasn't for bad luck,
I wouldn't have no luck at all."

Joel Becker
Principal Software Developer
Oracle
E-mail: [email protected]
Phone: (650) 506-8127

2007-01-31 20:41:55

by Zach Brown

[permalink] [raw]
Subject: Re: [PATCH 0 of 4] Generic AIO by scheduling stacks

> Take FPU state: memory copies and RAID xor functions use MMX/SSE and
> require that the full task state be saved and restored.

Sure, that much is obvious. I was hoping to see what FPU state
juggling actually requires. I'm operating under the assumption that
it won't be *terrible*.

> Task priority is another. POSIX AIO lets you specify request
> priority, and
> it really is needed for realtime workloads where things like keepalive
> must be processed at a higher priority.

Yeah. A first-pass approximation might be to have threads with asys
system calls grouped by priority. Leaving all that priority handling
to the *task* scheduler, instead of the dirt-stupid fibril
"scheduler", would be great. If we can get away with it. I don't
have a good feeling for what portion of the world actually cares
about this, or to what degree.

- z

2007-02-01 08:38:25

by Ingo Molnar

[permalink] [raw]
Subject: Re: [PATCH 2 of 4] Introduce i386 fibril scheduling


* Zach Brown <[email protected]> wrote:

> This patch introduces the notion of a 'fibril'. It's meant to be a
> lighter kernel thread. [...]

as per my other email, i dont really like this concept. This is the
killer:

> [...] There can be multiple of them in the process of executing for a
> given task_struct, but only one can every be actively running at a
> time. [...]

there's almost no scheduling cost from being able to arbitrarily
schedule a kernel thread - but there are /huge/ benefits in it.

would it be hard to redo your AIO patches based on a pool of plain
simple kernel threads?

We could even extend the scheduling properties of kernel threads so that
they could also be 'companion threads' of any given user-space task.
(i.e. they'd always schedule on the same CPU as that user-space task)

I bet most of the real benefit would come from co-scheduling them on the
same CPU. But this should be a performance property, not a basic design
property. (And i also think that having a limited per-CPU pool of AIO
threads works better than having a per-user-thread pool - but again this
is a detail that can be easily changed, not a fundamental design
property.)

Ingo

2007-02-01 11:07:55

by Suparna Bhattacharya

[permalink] [raw]
Subject: Re: [PATCH 4 of 4] Introduce aio system call submission and completion system calls

On Wed, Jan 31, 2007 at 11:23:39AM -0800, Zach Brown wrote:
> On Jan 31, 2007, at 9:21 AM, Andi Kleen wrote:
>
> >On Wednesday 31 January 2007 18:15, Zach Brown wrote:
> >>
> >>On Jan 31, 2007, at 12:58 AM, Andi Kleen wrote:
> >>
> >>>Do you have any numbers how this compares cycle wise to just doing
> >>>clone+syscall+exit in user space?
> >>
> >>Not yet, no. Release early, release often, and all that. I'll throw
> >>something together.
> >
> >So what was the motivation for doing this then?
>
> Most fundamentally? Providing AIO system call functionality at a
> much lower maintenance cost. The hope is that the cost of adopting
> these fibril things will be lower than the cost of having to touch a
> code path that wants AIO support.
>
> I simply don't believe that it's cheap to update code paths to
> support non-blocking state machines. As just one example of a
> looming cost, consider the retry-based buffered fs AIO patches that
> exist today. Their requirement to maintain these precisely balanced
> code paths that know to only return -EIOCBRETRY once they're at a
> point where retries won't access current-> seems.. unsustainable to
> me. This stems from the retries being handled off in the aio kernel
> threads which have their own task_struct. fs/aio.c goes to the
> trouble of migrating ->mm from the submitting task_struct, but
> nothing else. Continually adjusting this finely balanced
> relationship between paths that return -EIOCBRETRY and the fields of
> task_struct that fs/aio.c knows to share with the submitting context
> seems unacceptably fragile.

Wooo ...hold on ... I think this is swinging out of perspective :)

I have said some of this before, but let me try again.

As you already discovered when going down the fibril path, there are
two kinds of accesses to current-> state, (1) common state
for a given call chain (e.g. journal info etc), and (2) for
various validations against the caller's process (uid, ulimit etc).

(1) is not an issue when it comes to execution in background threads
(the VFS already uses background writeback for example).

As for (2), such checks need to happen upfront at the time of IO submission,
so again are not an issue.

This is aside from access to the caller's address space, a familiar
concept which the AIO threads use. If there is any state that
relates to address space access, then it belongs in the ->mm struct, rather
than in current (and we should fix that if we find anything which isn't
already there).

I don't see any other reason why IO paths should be assuming that they are
running in the original caller's context, midway through doing the IO. If
that were the case background writeouts and readaheads could be fragile as
well (or ptrace). The reason it isn't is because of this conceptual division of
responsibility.

Sure, having explicit separation of submission checks as an interface
would likely make this clearer and cleaner, but I'm just
pointing out that usage of current-> state isn't and shouldn't be arbitrary
when it comes to filesystem IO paths. We should be concerned in any case
if that starts happening.

Of course, this is fundamentally connected to the way filesystem IO is
designed to work, and may not necessarily apply to all syscalls.

When you want to make any and every syscall asynchronous, then indeed
the challenge is magnified and that is where it could get scary. But that
isn't the problem the current AIO code is trying to tackle.

>
> Even with those buffered IO patches we still only get non-blocking
> behaviour at a few specific blocking points in the buffered IO path.
> It's nothing like the guarantee of non-blocking submission returns
> that the fibril-based submission guarantees.

This one is a better reason, and why I have thought of fibrils (or the
equivalent alternative of enhancing kernel theads to become even lighter)
as an interesting fallback option to implement AIO for cases which we
haven't (maybe some of which are too complicated) gotten around to
supporting natively. Especially if it could be coupled with some clever
tricks to keep stack space to be minimal (I have wondered whether any of
the ideas from similar user-level efforts like Capriccio, or laio would help).

>
> > Its only point
> >is to have smaller startup costs for AIO than clone+fork without
> >fixing the VFS code to be a state machine, right?
>
> Smaller startup costs and fewer behavioural differences. Did that
> message to Nick about ioprio and io_context resonate with you at all?
>
> >I'm personally unclear if it's really less work to teach a lot of
> >code in the kernel about a new thread abstraction than changing VFS.
>
> Why are we limiting the scope of moving to a state machine just to
> the VFS? If you look no further than some hypothetical AIO iscsi/aoe/
> nbd/whatever target you obviously include networking. Probably splice
> () if you're aggressive :).
>
> Let's be clear. I would be thrilled if AIO was implemented by native
> non-blocking handler implementations. I don't think it will happen.
> Not because we don't think it sounds great on paper, but because it's
> a hugely complex endeavor that would take development and maintenance
> effort away from the task of keeping basic functionality working.
>
> So the hope with fibrils is that we lower the barrier to getting AIO
> syscall support across the board at an acceptable cost.
>
> It doesn't *stop* us from migrating very important paths (storage,
> networking) to wildly optimized AIO implementations. But it also
> doesn't force AIO support to wait for that.
>
> >Your patches don't look that complicated yet but you openly
> >admitted you waved away many of the more tricky issues (like
> >signals etc.) and I bet there are yet-unknown side effects
> >of this too that will need more changes.
>
> To quibble, "waved away" implies that they've been dismissed. That's
> not right. It's a work in progress, so yes, there will be more
> fiddly details discovered and addressed over time. The hope is that
> when it's said and done it'll still be worth merging. If at some
> point it gets to be too much, well, at least we'll have this work to
> reference as a decisive attempt.
>
> >I'm not sure the fibrils thing will be that much faster than
> >a possibly somewhat fast pathed for this case clone+syscall+exit.
>
> I'll try and get some numbers for you sooner rather than later.
>
> Thanks for being diligent, this is exactly the kind of hard look I
> want this work to get.

BTW, I like the way you are approaching this with a cautiously
critical eye cognizant of lurking details/issues, despite the obvious
(and justified) excitement/eureka feeling. AIO _is_ hard !

Regards
Suparna

>
> - z
>
> --
> To unsubscribe, send a message with 'unsubscribe linux-aio' in
> the body to [email protected]. For more info on Linux AIO,
> see: http://www.kvack.org/aio/
> Don't email: <[email protected]>

--
Suparna Bhattacharya ([email protected])
Linux Technology Center
IBM Software Lab, India

2007-02-01 13:04:39

by Ingo Molnar

[permalink] [raw]
Subject: Re: [PATCH 2 of 4] Introduce i386 fibril scheduling


* Ingo Molnar <[email protected]> wrote:

> * Zach Brown <[email protected]> wrote:
>
> > This patch introduces the notion of a 'fibril'. It's meant to be a
> > lighter kernel thread. [...]
>
> as per my other email, i dont really like this concept. This is the
> killer:

let me clarify this: i very much like your AIO patchset in general, in
the sense that it 'completes' the AIO implementation: finally everything
can be done via it, greatly increasing its utility and hopefully its
penetration. This is the most important step, by far.

what i dont really like /the particular/ concept above - the
introduction of 'fibrils' as a hard distinction of kernel threads. They
are /almost/ kernel threads, but still by being different they create
alot of duplication and miss out on a good deal of features that kernel
threads have naturally.

It kind of hurts to say this because i'm usually quite concept-happy -
one can easily get addicted to the introduction of new core kernel
concepts :-) But i really, really think we dont want to do fibrils but
we want to do kernel threads, and i havent really seen a discussion
about why they shouldnt be done via kernel threads.

Nor have i seen a discussion that whatever threading concept we use for
AIO within the kernel, it is really a fallback thing, not the primary
goal of "native" KAIO design. The primary goal of KAIO design is to
arrive at a state machine - and for one of the most important IO
disciplines, networking, that is reality already. (For filesystem events
i doubt we will ever be able to build an IO state machine - but there
are lots of crazy folks out there so it's not fundamentally impossible,
just very, very hard.)

so my suggestions center around the notion of extending kernel threads
to support the features you find important in fibrils:

> would it be hard to redo your AIO patches based on a pool of plain
> simple kernel threads?
>
> We could even extend the scheduling properties of kernel threads so
> that they could also be 'companion threads' of any given user-space
> task. (i.e. they'd always schedule on the same CPu as that user-space
> task)
>
> I bet most of the real benefit would come from co-scheduling them on
> the same CPU. But this should be a performance property, not a basic
> design property. (And i also think that having a limited per-CPU pool
> of AIO threads works better than having a per-user-thread pool - but
> again this is a detail that can be easily changed, not a fundamental
> design property.)

but i'm willing to be convinced of the opposite as well, as always. (I'm
real good at quickly changing my mind, especially when i'm embarrassingly
wrong about something. So please fire away and dont hold back.)

Ingo

2007-02-01 13:19:14

by Christoph Hellwig

[permalink] [raw]
Subject: Re: [PATCH 2 of 4] Introduce i386 fibril scheduling

On Thu, Feb 01, 2007 at 02:02:34PM +0100, Ingo Molnar wrote:
> what i dont really like /the particular/ concept above - the
> introduction of 'fibrils' as a hard distinction of kernel threads. They
> are /almost/ kernel threads, but still by being different they create
> alot of duplication and miss out on a good deal of features that kernel
> threads have naturally.
>
> It kind of hurts to say this because i'm usually quite concept-happy -
> one can easily get addicted to the introduction of new core kernel
> concepts :-) But i really, really think we dont want to do fibrils but
> we want to do kernel threads, and i havent really seen a discussion
> about why they shouldnt be done via kernel threads.

I tend to agree. Note that one thing we should be doing one day (not
only if we want to use it for aio) is to make kernel threads more
lightweight. There is a lot of baggage we keep around in task_struct
and co that only makes sense for threads that have a user space part and
aren't or shouldn't be needed for a purely kernel-resident thread.

2007-02-01 13:54:50

by Ingo Molnar

[permalink] [raw]
Subject: Re: [PATCH 2 of 4] Introduce i386 fibril scheduling


* Christoph Hellwig <[email protected]> wrote:

> I tend to agree. Note that one thing we should be doing one day (not
> only if we want to use it for aio) is to make kernel threads more
> lightweight. There is a lot of baggage we keep around in task_struct
> and co that only makes sense for threads that have a user space part
> and aren't or shouldn't be needed for a purely kernel-resident thread.

yeah. I'm totally open to such efforts. I'd also be most happy if this
was primarily driven via the KAIO effort: i.e. to implement it via
kernel threads and then to benchmark the hell out of it. I volunteer to
fix whatever fat kernel thread handling has left.

and if people agree with me that 'native' state-machine driven KAIO is
where we want to ultimately achieve (it is certainly the best performing
implementation) then i dont see the point in fibrils as an interim
mechanism anyway. Lets just hide AIO complexities from userspace via
kernel threads, and optimize this via two methods: by making kernel
threads faster, and by simultaneously and gradually converting as much
KAIO code to a native state machine - which would not need any kind of
kernel thread help anyway.

(plus as i mentioned previously, co-scheduling kernel threads with
related user space threads on the same CPU might be something useful too
- not just for KAIO, and we could add that too.)

also, we context-switch kernel threads in 350 nsecs on current hardware
and the -rt kernel is certainly happy with that and runs all hardirqs
and softirqs in separate kernel thread contexts. There's not /that/ much
fat left to cut off - and if there's something more to optimize there
then there are a good number of projects interested in that, not just
the KAIO effort :)

Ingo

2007-02-01 17:13:51

by Mark Lord

[permalink] [raw]
Subject: Re: [PATCH 2 of 4] Introduce i386 fibril scheduling

Ingo Molnar wrote:
>
> also, we context-switch kernel threads in 350 nsecs on current hardware
> and the -rt kernel is certainly happy with that and runs all hardirqs

Ingo, how relevant is that "350 nsecs on current hardware" claim?

I don't mean that in a bad way, but my own experience suggests that
most people doing real hard RT (or tight soft RT) are not doing it
on x86 architectures. But rather on lowly 1GHz (or less) ARM based
processors and the like.

For RT issues, those are the platforms I care more about,
as those are the ones that get embedded into real-time devices.

??

Cheers

2007-02-01 18:05:14

by Ingo Molnar

[permalink] [raw]
Subject: Re: [PATCH 2 of 4] Introduce i386 fibril scheduling


* Mark Lord <[email protected]> wrote:

> >also, we context-switch kernel threads in 350 nsecs on current
> >hardware and the -rt kernel is certainly happy with that and runs all
> >hardirqs
>
> Ingo, how relevant is that "350 nsecs on current hardware" claim?
>
> I don't mean that in a bad way, but my own experience suggests that
> most people doing real hard RT (or tight soft RT) are not doing it on
> x86 architectures. But rather on lowly 1GHz (or less) ARM based
> processors and the like.

it's not relevant to those embedded boards, but it's relevant to the AIO
discussion, which centers around performance.

> For RT issues, those are the platforms I care more about, as those are
> the ones that get embedded into real-time devices.

yeah. Nevertheless if you want to use -rt on your desktop (under Fedora
4/5/6) you can track an rpmized+distroized full kernel package quite
easily, via 3 easy commands:

cd /etc/yum.repos.d
wget http://people.redhat.com/~mingo/realtime-preempt/rt.repo

yum install kernel-rt.x86_64 # on x86_64
yum install kernel-rt # on i686

which is closely tracking latest upstream -git. (for example, the
current kernel-rt-2.6.20-rc7.1.rt3.0109.i686.rpm is based on
2.6.20-rc7-git1, so if you want to run a kernel rpm that has all of
Linus' latest commits from yesterday, this might be for you.)

it's rumored to be a quite smooth kernel ;-) So in this sense, because
this also runs on all my testboxes by default, it matters on modern
hardware too, at least to me. Today's commodity hardware is tomorrow's
embedded hardware. If a kernel is good on today's colorful desktop
hardware then it will be perfect for tomorrow's embedded hardware.

Ingo

2007-02-01 19:50:35

by Trond Myklebust

[permalink] [raw]
Subject: Re: [PATCH 4 of 4] Introduce aio system call submission and completion system calls

On Thu, 2007-02-01 at 16:43 +0530, Suparna Bhattacharya wrote:
> Wooo ...hold on ... I think this is swinging out of perspective :)
>
> I have said some of this before, but let me try again.
>
> As you already discovered when going down the fibril path, there are
> two kinds of accesses to current-> state, (1) common state
> for a given call chain (e.g. journal info etc), and (2) for
> various validations against the caller's process (uid, ulimit etc).
>
> (1) is not an issue when it comes to execution in background threads
> (the VFS already uses background writeback for example).
>
> As for (2), such checks need to happen upfront at the time of IO submission,
> so again are not an issue.

Wrong! These checks can and do occur well after the time of I/O
submission in the case of remote filesystems with asynchronous writeback
support.

Consider, for instance, the cases where the server reboots and loses all
state. Then there is the case of failover and/or migration events, where
the entire filesystem gets moved from one server to another, and again
you may have to recover state, etc...

> I don't see any other reason why IO paths should be assuming that they are
> running in the original caller's context, midway through doing the IO. If
> that were the case background writeouts and readaheads could be fragile as
> well (or ptrace). The reason it isn't is because of this conceptual division of
> responsibility.

The problem with this is that the security context is getting
progressively more heavy as we add more and more features. In addition
to the original uid/gid/fsuid/fsgid/groups, we now have stuff like
keyrings to carry around. Then there is all the context needed to
support selinux,...
In the end, you end up recreating most of struct task_struct...

Cheers
Trond

2007-02-01 20:08:25

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH 2 of 4] Introduce i386 fibril scheduling



On Thu, 1 Feb 2007, Ingo Molnar wrote:
>
> there's almost no scheduling cost from being able to arbitrarily
> schedule a kernel thread - but there are /huge/ benefits in it.

That's a singularly *stupid* argument.

Of course scheduling is fast. That's the whole *point* of fibrils. They
still schedule. Nobody claimed anything else.

Bringing up RT kernels and scheduling latency is idiotic. It's like saying
"we should do this because the sky is blue". Sure, that's true, but what
the *hell* does Rayleigh scattering have to do with anything?

The cost has _never_ been scheduling. That was never the point. Why do you
even bring it up? Only to make an argument that makes no sense?

The cost of AIO is

- maintenance. It's a separate code-path, and it's one that simply doesn't
fit into anything else AT ALL. It works (mostly) for simple things, ie
reads and writes, but even there, it's really adding a lot of crud that
we could do without.

- setup and teardown costs: both in CPU and in memory. These are the big
costs. It's especially true since a lot of AIO actually ends up cached.
The user program just wants the data - 99% of the time it's likely to
be there, and the whole point of AIO is to get at it cheaply, but not
block if it's not there.

So your scheduling arguments are inane. They totally miss the point. They
have nothing to do with *anything*.

Ingo: everybody *agrees* that scheduling is cheap. Scheduling isn't the
issue. Scheduling isn't even needed in the perfect path where the AIO
didn't need to do any real IO (and that _is_ the path we actually would
like to optimize most).

So instead of talking about totally irrelevant things, please keep your
eyes on the ball.

So I claim that the ball is here:

- cached data (and that is *especially* true of some of the more
interesting things we can do with a more generic AIO thing: path
lookup, inode filling (stat/fstat) etc usually has hit-rates in the 99%
range, but missing even just 1% of the time can be deadly, if the miss
costs you a hundred msec of not doing anything else!)

Do the math. A "stat()" system call generally takes on the order of a
couple of microseconds. But if it misses even just 1% of the time (and
takes 100 msec when it does that, because there is other IO also
competing for the disk arm), ON AVERAGE it takes 1ms. (That arithmetic
is spelled out right after this list.)

So what you should aim for is improving that number. The cached case
should hopefully still be in the microseconds, and the uncached case
should be nonblocking for the caller.

- setup/teardown costs. Both memory and CPU. This is where the current
threads simply don't work. The setup cost of doing a clone/exit is
actually much higher than the cost of doing the whole operation, most
of the time. Remember: caches still work.

- maintenance. Clearly AIO will always have some special code, but if we
can move the special code *away* from filesystems and networking and
all the thousands of device drivers, and into core kernel code, we've
done something good. And if we can extend it from just pure read/write
into just about *anything*, then people will be happy.
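
Spelling out the back-of-the-envelope numbers from the first point above
(the 2 usec cached cost is an assumed figure; the 1% / 100 msec miss case
comes from the mail):

    #include <stdio.h>

    int main(void)
    {
            double cached_us = 2.0;         /* typical cached stat()          */
            double miss_us   = 100000.0;    /* 100 msec waiting for the disk  */
            double miss_rate = 0.01;        /* only 1% of calls miss          */

            double avg_us = (1.0 - miss_rate) * cached_us + miss_rate * miss_us;
            printf("average stat() latency: ~%.0f usec\n", avg_us);  /* ~1 ms */
            return 0;
    }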

So stop blathering about scheduling costs, RT kernels and interrupts.
Interrupts generally happen a few thousand times a second. This is
something you want to do a *million* times a second, without any IO
happening at all except for when it has to.

Linus

2007-02-01 20:26:36

by bert hubert

[permalink] [raw]
Subject: Re: [PATCH 4 of 4] Introduce aio system call submission and completion system calls

On Tue, Jan 30, 2007 at 01:39:45PM -0700, Zach Brown wrote:

> sys_asys_submit() is added to let userspace submit asynchronous system calls.
> It specifies the system call number and arguments. A fibril is constructed for
> each call. Each starts with a stack which executes the given system call
> handler and then returns to a function which records the return code of the
> system call handler. sys_asys_await_completion() then lets userspace collect
> these results.

Zach,

Do you have any userspace code that can be used to get started experimenting
with your fibril based AIO stuff?

I want to try it on from a userspace perspective.

I'm confident I could hack it up from scratch, but I'm sure you'll have some
test code lying around.

I scanned the discussion so far, but I might've missed any references to
userspace code so far.

Thanks!

--
http://www.PowerDNS.com Open source, database driven DNS Software
http://netherlabs.nl Open and Closed source services

2007-02-01 21:30:14

by Zach Brown

[permalink] [raw]
Subject: Re: [PATCH 4 of 4] Introduce aio system call submission and completion system calls

> Do you have any userspace code that can be used to get started
> experimenting
> with your fibril based AIO stuff?

I only have a goofy little test app so far:

http://www.zabbo.net/~zab/aio-walk-tree.c

It's not to be taken too seriously :)

> I want to try it on from a userspace perspective.

Frankly, I'm not sure its ready for that yet. You're welcome to give
it a try, but it's early enough that you're sure to hit problems
almost immediately.

- z

2007-02-01 21:52:56

by Zach Brown

[permalink] [raw]
Subject: Re: [PATCH 2 of 4] Introduce i386 fibril scheduling

> let me clarify this: i very much like your AIO patchset in general, in
> the sense that it 'completes' the AIO implementation: finally
> everything
> can be done via it, greatly increasing its utility and hopefully its
> penetration. This is the most important step, by far.

We violently agree on this :).

> what i dont really like /the particular/ concept above - the
> introduction of 'fibrils' as a hard distinction of kernel threads.
> They
> are /almost/ kernel threads, but still by being different they create
> alot of duplication and miss out on a good deal of features that
> kernel
> threads have naturally.

I might quibble with some of the details, but I understand your
fundamental concern. I do. I don't get up each morning *thrilled*
by the idea of having to update lockdep, sysrq-t, etc, to understand
these fibril things :). The current fibril switch isn't nearly as
clever as the lock-free task scheduling switch. It'd be nice if we
didn't have to do that work to optimize the hell out of it, sure.

> It kind of hurts to say this because i'm usually quite concept-happy -
> one can easily get addicted to the introduction of new core kernel
> concepts :-)

:)

> so my suggestions center around the notion of extending kernel threads
> to support the features you find important in fibrils:
>
>> would it be hard to redo your AIO patches based on a pool of plain
>> simple kernel threads?

It'd certainly be doable to throw together a credible attempt to
service "asys" system call submission with full-on kernel threads.
That seems like reasonable due diligence to me. If full-on threads
are almost as cheap, great. If fibrils are so much cheaper that they
seem to warrant investing in, great.

I am concerned about the change in behaviour if we fall back to full
kernel threads, though. I really, really, want aio syscalls to
behave just like sync ones.

Would your strategy be to update the syscall implementations to share
data in task_struct so that there isn't as significant a change in
behaviour? (sharing current->ioprio, instead of just inheriting it,
for example.). We'd be betting that there would be few of these and
that they'd be pretty reasonable to share?

- z

2007-02-01 22:19:43

by Zach Brown

[permalink] [raw]
Subject: Re: [PATCH 4 of 4] Introduce aio system call submission and completion system calls

> Wooo ...hold on ... I think this is swinging out of perspective :)

I'm sorry, but I don't. I think using the EIOCBRETRY method in
complicated code paths requires too much maintenance cost to justify
its benefits. We can agree to disagree on that judgement :).

- z

2007-02-01 22:24:12

by Benjamin LaHaise

[permalink] [raw]
Subject: Re: [PATCH 2 of 4] Introduce i386 fibril scheduling

On Thu, Feb 01, 2007 at 01:52:13PM -0800, Zach Brown wrote:
> >let me clarify this: i very much like your AIO patchset in general, in
> >the sense that it 'completes' the AIO implementation: finally
> >everything
> >can be done via it, greatly increasing its utility and hopefully its
> >penetration. This is the most important step, by far.
>
> We violently agree on this :).

There is also the old kernel_thread based method that should probably be
compared, especially if pre-created threads are thrown into the mix. Also,
since the old days, a lot of thread scaling issues have been fixed that
could even make userland threads more viable.

> Would your strategy be to update the syscall implementations to share
> data in task_struct so that there isn't as significant a change in
> behaviour? (sharing current->ioprio, instead of just inheriting it,
> for example.). We'd be betting that there would be few of these and
> that they'd be pretty reasonable to share?

Priorities cannot be shared, as they have to adapt to the per-request
priority when we get down to the nitty gritty of POSIX AIO, as otherwise
realtime issues like keepalive transmits will be handled incorrectly.

-ben
--
"Time is of no importance, Mr. President, only life is important."
Don't Email: <[email protected]>.

2007-02-01 22:38:38

by Zach Brown

[permalink] [raw]
Subject: Re: [PATCH 2 of 4] Introduce i386 fibril scheduling

> Priorities cannot be shared, as they have to adapt to the per-request
> priority when we get down to the nitty gitty of POSIX AIO, as
> otherwise
> realtime issues like keepalive transmits will be handled incorrectly.

Well, maybe not *blind* sharing. But something more than the
disconnect threads currently have with current->ioprio.

Today an existing kernel thread would most certainly ignore a
sys_ioprio_set() in the submitter and then handle an aio syscall with
an old current->ioprio.

Something more smart than that is all I'm on about.

- z

2007-02-02 03:30:05

by Suparna Bhattacharya

[permalink] [raw]
Subject: Re: [PATCH 4 of 4] Introduce aio system call submission and completion system calls

On Thu, Feb 01, 2007 at 02:18:55PM -0800, Zach Brown wrote:
> >Wooo ...hold on ... I think this is swinging out of perspective :)
>
> I'm sorry, but I don't. I think using the EIOCBRETRY method in
> complicated code paths requires too much maintenance cost to justify
> its benefits. We can agree to disagree on that judgement :).

I don't disagree about limitations of the EIOCBRETRY approach, nor do I
recommend it for all sorts of complicated code paths. It is only good as
an approximation for specific blocking points involving idempotent
behaviour, and I was trying to emphasise that that is *all* it was ever
intended for. I certainly do not see it as a viable path to make all syscalls
asynchronous, or to address all blocking points in filesystem IO.

And I do like the general direction of your approach, only need to think
deeper about the details like how to reduce stack per IO request so this
can scale better. So we don't disagree as much as you think :)

The point where we seem to disagree is that I think there is goodness in
maintaining the conceptual clarity between what parts of the operation assume
that it is executing in the original submitters context. For the IO paths
this is what allows things like readahead and writeback to work and to cluster
operations which may end up to/from multiple submitters. This means that
if there is some context that is needed thereafter it could be associated
with the IO request (as an argument or in some other manner), so that this
division is still maintained.

Regards
Suparna

>
> - z
>
> --
> To unsubscribe, send a message with 'unsubscribe linux-aio' in
> the body to [email protected]. For more info on Linux AIO,
> see: http://www.kvack.org/aio/
> Don't email: <[email protected]>

--
Suparna Bhattacharya ([email protected])
Linux Technology Center
IBM Software Lab, India

2007-02-02 07:12:34

by bert hubert

[permalink] [raw]
Subject: Re: [PATCH 4 of 4] Introduce aio system call submission and completion system calls

On Thu, Feb 01, 2007 at 01:29:41PM -0800, Zach Brown wrote:

> >I want to try it on from a userspace perspective.
>
> Frankly, I'm not sure its ready for that yet. You're welcome to give
> it a try, but it's early enough that you're sure to hit problems
> almost immediately.

I'm counting on it - what I want to taste is if the concept is a match for
the things I want to do.

Thanks!

--
http://www.PowerDNS.com Open source, database driven DNS Software
http://netherlabs.nl Open and Closed source services

2007-02-02 07:14:06

by Suparna Bhattacharya

[permalink] [raw]
Subject: Re: [PATCH 4 of 4] Introduce aio system call submission and completion system calls

On Thu, Feb 01, 2007 at 11:50:06AM -0800, Trond Myklebust wrote:
> On Thu, 2007-02-01 at 16:43 +0530, Suparna Bhattacharya wrote:
> > Wooo ...hold on ... I think this is swinging out of perspective :)
> >
> > I have said some of this before, but let me try again.
> >
> > As you already discovered when going down the fibril path, there are
> > two kinds of accesses to current-> state, (1) common state
> > for a given call chain (e.g. journal info etc), and (2) for
> > various validations against the caller's process (uid, ulimit etc).
> >
> > (1) is not an issue when it comes to execution in background threads
> > (the VFS already uses background writeback for example).
> >
> > As for (2), such checks need to happen upfront at the time of IO submission,
> > so again are not an issue.
>
> Wrong! These checks can and do occur well after the time of I/O
> submission in the case of remote filesystems with asynchronous writeback
> support.
>
> Consider, for instance, the cases where the server reboots and loses all
> state. Then there is the case of failover and/or migration events, where
> the entire filesystem gets moved from one server to another, and again
> you may have to recover state, etc...
>
> > I don't see any other reason why IO paths should be assuming that they are
> > running in the original caller's context, midway through doing the IO. If
> > that were the case background writeouts and readaheads could be fragile as
> > well (or ptrace). The reason it isn't is because of this conceptual division of
> > responsibility.
>
> The problem with this is that the security context is getting
> progressively more heavy as we add more and more features. In addition
> to the original uid/gid/fsuid/fsgid/groups, we now have stuff like
> keyrings to carry around. Then there is all the context needed to
> support selinux,...

Isn't that kind of information supposed to be captured in nfs_open_context ?
Which is associated with the open file instance ...

I know this has been a traditional issue with network filesystems, and I
haven't kept up with the latest code and decisions in that respect, but how
would you do background writeback if there is an assumption of running in
the context of the original submitter ?

Regards
Suparna

> In the end, you end up recreating most of struct task_struct...
>
> Cheers
> Trond
>
> --
> To unsubscribe, send a message with 'unsubscribe linux-aio' in
> the body to [email protected]. For more info on Linux AIO,
> see: http://www.kvack.org/aio/
> Don't email: <[email protected]>

--
Suparna Bhattacharya ([email protected])
Linux Technology Center
IBM Software Lab, India

2007-02-02 07:45:38

by Andi Kleen

[permalink] [raw]
Subject: Re: [PATCH 4 of 4] Introduce aio system call submission and completion system calls


> Isn't that kind of information supposed to be captured in nfs_open_context ?
> Which is associated with the open file instance ...

Or a refcounted struct cred. Which would be needed for strict POSIX thread
semantics likely anyways. But there never was enough incentive to go down
that path and it would be likely somewhat slow.

>
> I know this has been a traditional issue with network filesystems, and I
> haven't kept up with the latest code and decisions in that respect, but how
> would you do background writeback if there is an assumption of running in
> the context of the original submitter ?

AFAIK (Trond will hopefully correct me if I'm wrong) in the special case of
NFS there isn't much problem because the server does the (passive) authentication
and there is no background writeback from server to client. The client just
does the usual checks at open time and then forgets about it. The server
threads don't have own credentials but just check those of others.

I can't think of any cases where you would need to do authentication
in the client for every read() or write()

Overall the arguments for reusing current don't seem to be strong to me.

-Andi

2007-02-02 10:50:50

by Ingo Molnar

[permalink] [raw]
Subject: Re: [PATCH 2 of 4] Introduce i386 fibril scheduling


* Linus Torvalds <[email protected]> wrote:

> So stop blathering about scheduling costs, RT kernels and interrupts.
> Interrupts generally happen a few thousand times a second. This is
> soemthing you want to do a *million* times a second, without any IO
> happening at all except for when it has to.

we might be talking past each other.

i never suggested every aio op should create/destroy a kernel thread!

My only suggestion was to have a couple of transparent kernel threads
(not fibrils) attached to a user context that does asynchronous
syscalls! Those kernel threads would be 'switched in' if the current
user-space thread blocks - so instead of having to 'create' any of them
- the fast path would be to /switch/ them to under the current
user-space, so that user-space processing can continue under that other
thread!

That means that in the 'switch kernel context' fastpath it simply needs
to copy the blocked thread's user-space ptregs (~64 bytes) to its own
kernel stack, and then it can do a return-from-syscall without
user-space noticing the switch! Never would we really see the cost of
kernel thread creation. We would never see that cost in the fully cached
case (no other thread is needed then), nor would we see it in the
blocking-IO case, due to pooling. (there are some other details related
to things like the FPU context, but you get the idea.)

Let me quote Zach's reply to my suggestions:

| It'd certainly be doable to throw together a credible attempt to
| service "asys" system call submission with full-on kernel threads.
| That seems like reasonable due diligence to me. If full-on threads
| are almost as cheap, great. If fibrils are so much cheaper that they
| seem to warrant investing in, great.

that's all i wanted to see being considered!

Please ignore my points about scheduling costs - i only talked about
them at length because the only fundamental difference between kernel
threads and fibrils is their /scheduling/ properties. /Not/ the
setup/teardown costs - those are not relevant /precisely/ because they
can be pooled and because they happen relatively rarely, compared to the
cached case. The 'switch to the blocked thread's ptregs' operation also
involves a context-switch under this design. That's why i was talking
about scheduling so much: the /only/ true difference between fibrils and
kernel threads is their /scheduling/.

I believe this is the point where your argument fails:

> - setup/teardown costs. Both memory and CPU. This is where the current
> threads simply don't work. The setup cost of doing a clone/exit is
> actually much higher than the cost of doing the whole operation,
> most of the time.

you are comparing apples to oranges - i never said we should
create/destroy a kernel thread for every async op. That would be insane!

what we need to support asynchronous system-calls is the ability to pick
up an /already created/ kernel thread from a pool of per-task kernel
threads and to switch it to under the current user-space and return to
the user-space stack with that new kernel thread running. (The other,
blocked kernel thread stays blocked and is returned into the pool of
'pending' AIO kernel threads.) And this only needs to happen in the
'cachemiss' case anyway. In the 'cached' case no other kernel thread
would be involved at all, the current one just falls straight through
the system-call.

my argument is that the whole notion of cutting this at the kernel stack
and thread info level and making fibrils in essence a separate
scheduling entity is wrong, wrong, wrong. Why not use plain kernel
threads for this?

[ finally, i think you totally ignored my main argument, state machines.
The networking stack is a full and very nice state machine. It's
kicked from user-space, and zillions of small contexts (sockets) are
living on without any of the originating tasks having to be involved.
So i'm still holding to the fundamental notion that within the kernel
this form of AIO is a nice but /secondary/ mechanism. If a subsystem
is able to pull it off, it can implement asynchrony via a state
machine - and it will outperform any thread based AIO. Or not. We'll
see. For something like the VFS i doubt we'll see (and i doubt we
/want/ to see) a 'native' state-machine implementation.

this is btw. quite close to the Tux model of doing asynchronous block
IO and asynchronous VFS events such as asynchronous open(). Tux uses a
pool of kernel threads to pass blocking work to, while not holding up
the 'main' thread. But the main Tux speedup comes from having a native
state machine for all the networking IO. ]

Ingo

2007-02-02 12:21:43

by Andi Kleen

[permalink] [raw]
Subject: Re: [PATCH 2 of 4] Introduce i386 fibril scheduling

Ingo Molnar <[email protected]> writes:

> and for one of the most important IO
> disciplines, networking, that is reality already.

Not 100% -- a few things in TCP/IP at least are blocking still.
Mostly relatively obscure things though.

Also the sockets model is currently incompatible with direct zero-copy RX/TX,
which needs fixing.

-Andi

2007-02-02 12:22:39

by Andi Kleen

[permalink] [raw]
Subject: Re: [PATCH 2 of 4] Introduce i386 fibril scheduling

Christoph Hellwig <[email protected]> writes:
>
> I tend to agree. Note that there is one thing we should be doing one
> day (not only if we want to use it for aio), and that is to make kernel
> threads more lightweight. There is a lot of baggage we keep around in
> task_struct and co that only makes sense for threads that have a user
> space part and isn't or shouldn't be needed for a purely kernel-resident
> thread.

I suspect you will get a lot of this for free from the current namespace
efforts.

-Andi

2007-02-02 15:57:32

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH 2 of 4] Introduce i386 fibril scheduling



On Fri, 2 Feb 2007, Ingo Molnar wrote:
>
> My only suggestion was to have a couple of transparent kernel threads
> (not fibrils) attached to a user context that does asynchronous
> syscalls! Those kernel threads would be 'switched in' if the current
> user-space thread blocks - so instead of having to 'create' any of them
> - the fast path would be to /switch/ them to under the current
> user-space, so that user-space processing can continue under that other
> thread!

But in that case, you really do end up with "fibrils" anyway.

Because those fibrils are what would be the state for the blocked system
calls when they aren't scheduled.

We may have a few hundred thousand system calls a second (maybe that's not
actually reasonable, but it should be what we *aim* for), and 99% of them
will hopefully hit the cache and never need any separate IO, but even if
it's just 1%, we're talking about thousands of threads.

I do _not_ think that it's reasonable to have thousands of threads' worth of state
around just "in case". Especially if all those threadlets are then
involved in signals etc - something that they are totally uninterested in.

I think it's a lot more reasonable to have just the kernel stack page for
"this was where I was when I blocked". IOW, a fibril-like thing. You need
some data structure to set up the state *before* you start doing any
threads at all, because hopefully the operation will be totally
synchronous, and no secondary thread is ever really needed!

What I like about fibrils is that they should be able to handle the cached
case well: the case where no "real" scheduling (just the fibril stack
switches) takes place.

Now, most traditional DB loads would tend to use AIO only when they "know"
that real IO will take place (the AIO call itself will probably be
O_DIRECT most of the time). So I suspect that a lot of those users will
never really have the cached case, but one of my hopes is to be able to do
exactly the things that we have *not* done well: asynchronous file opens
and pathname lookups, which is very common in a file server.

If done *really* right, a perfectly normal app could do things like
asynchronous stat() calls to fill in the readdir results. In other words,
what *I* would like to see is the ability to have something *really*
simple like "ls" use this, without it actually being a performance hit
for the common case where everything is cached.

Have you done "ls -l" on a big uncached directory where the inodes
are all over the disk lately? You can hear the disk whirr. THAT is the
kind of "normal user" thing I'd like to be able to fix, and the db case is
actually secondary. The DB case is much much more limited (ok, so somebody
pointed out that they want slightly more than just read/write, but
still.. We're talking "special code".)

> [ finally, i think you totally ignored my main argument, state machines.

I ignored your argument, because it's not really relevant. The fact that
networking (and TCP in particular) has state machines is because it is a
packetized environment. Nothing else is. Think pathname lookup etc. They
are all *fundamentally* environments with a call stack.

So the state machine argument is totally bogus - it results in a
programming model that simply doesn't match the *normal* setup. You want
the kernel programming model to appear "linear" even when it isn't,
because it's too damn hard to think nonlinearly.

Yes, we could do pathname lookup with that kind of insane setup too. But
it would be HORRID!

Linus

2007-02-02 19:48:12

by Alan

[permalink] [raw]
Subject: Re: [PATCH 2 of 4] Introduce i386 fibril scheduling

This one got shelved while I sorted other things out as it warranted a
longer look. Some comments follow, but firstly can we please bury this
"fibril" name. The constructs Zach is using appear to be identical to
co-routines, and they've been called that in computer science literature
for fifty years. They are one of the great and somehow forgotten ideas.
(and I admit I've used them extensively in past things where it's
wonderful for multi-player gaming, so I'm a convert already).

The stuff however isn't as free as you make out. Current kernel logic
knows about various things being "safe" but with fibrils you have to
address additional questions such as "What happens if I issue an I/O and
change priority". You also have an 800lb gorilla hiding behind a tree
waiting for you in privilege and permission checking.

Right now current->*u/gid is safe across a syscall start to end, with an
asynchronous setuid all hell breaks loose. I'm not saying we shouldn't do
this, in fact we'd be able to do some of the utterly moronic POSIX thread
uid handling in kernel space if we did, just that it isn't free. We have
locking rules defined by the magic serializing construct called
"the syscall" and you break those.

I'd expect the odd other gorilla waiting to mug you as well and the ones
nobody has thought of will be the worst 8)

The number of co-routines and stacks can be dealt with in two ways - you use
small stacks allocated when you create a fibril, or you grab a page, use
separate IRQ stacks, and either fail creation with -ENOBUFS etc. (which
drops work on user space) or block (for which cases ??), which also means
an overhead on co-routine exit. That can be tunable, and for embedded easily
tuned right down.

Traditional co-routines have clear notions of being able to create a
co-routine, stack them and fire up specific ones. In part this is done
because many things expressed in this way know what to fire up next. It's
also a very clean way to express a driver problem with a lot of state.

Essentially, a co-routine is simply making "%esp" roughly the same as
the C++ world's "self".

You get some other funny things from co-routines which are very powerful,
very dangerous, or plain insane depending upon your view of life. One big
one is the ability for real men (and women) to do stuff like this,
because you don't need to keep the context attached to the same task.

send_reset_command(dev);
wait_for_irq_event(dev->irq);
/* co-routine continues in IRQ context here */
clean_up_reset_command(dev);
exit_irq_event();
/* co-routine continues out of IRQ context here */
send_identify_command(dev);

Notice we just dealt with all the IRQ stack problems the moment an IRQ is
a co-routine transfer 8)

Ditto with timers, although for the kernel that might not be smart as we
have a lot of timers.

Less insanely you can create a context, start doing stuff in it and then
pass it to someone else, local variables, state and all. This one is
actually rather useful for avoiding a lot of the 'D' state crap in the
kernel.

For example we have driver code that sleeps uninterruptibly because its
too hard to undo the mess and get out of the current state if it is
interrupted. In the world of sending other people co-routines you just do
this

coroutine_set(MUST_COMPLETE);

and in exit

foreach(coroutine)
	if (coroutine->flags & MUST_COMPLETE)
		inherit_coroutine(init, coroutine);

and obviously you don't pass any over that will then not do the right
thing before accessing user space (well unless implementing
'read_for_someone_else()' or other strange syscalls - like ptrace...)

Other questions really relate to the scheduling - Zach, do you intend
schedule_fibrils() to be a call that code makes directly, or something
invoked only from schedule()?


Linus will now tell me I'm out of my tree...


Alan (who used to use Co-routines in real languages on 36bit
computers with 9bit bytes before learning C)

2007-02-02 20:14:43

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH 2 of 4] Introduce i386 fibril scheduling



On Fri, 2 Feb 2007, Alan wrote:
>
> This one got shelved while I sorted other things out as it warranted a
> longer look. Some comments follow, but firstly can we please bury this
> "fibril" name. The constructs Zach is using appear to be identical to
> co-routines, and they've been called that in computer science literature
> for fifty years. They are one of the great and somehow forgotten ideas.
> (and I admit I've used them extensively in past things where its
> wonderful for multi-player gaming so I'm a convert already).

Well, they are indeed coroutines, but they are coroutines in the same
sense any "CPU scheduler" ends up being a coroutine.

They are NOT the generic co-routine that some languages support natively.
So I think trying to call them coroutines would be even more misleading
than calling them fibrils.

In other words the whole *point* of the fibril is that you can do
"coroutine-like stuff" while using a "normal functional linear programming
paradigm".

Wouldn't you agree?

(I love the concept of coroutines, but I absolutely detest what the code
ends up looking like. There's a good reason why people program mostly in
linear flow: that's how people think consciously - even if it's obviously
not how the brain actually works).

And we *definitely* don't want to have a coroutine programming interface
in the kernel. Not in C.

> The stuff however isn't as free as you make out. Current kernel logic
> knows about various things being "safe" but with fibrils you have to
> address additional questions such as "What happens if I issue an I/O and
> change priority". You also have an 800lb gorilla hiding behind a tree
> waiting for you in priviledge and permission checking.

This is why I think it should be 100% clear that things happen in process
context. That just answers everything. If you want to synchronize with
async events and change IO priority, you should do exactly that:

wait_for_async();
ioprio(newpriority);

and that "solves" that problem. Leave it to user space.

> Right now current->*u/gid is safe across a syscall start to end, with an
> asynchronous setuid all hell breaks loose. I'm not saying we shouldn't do
> this, in fact we'd be able to do some of the utterly moronic poxix thread
> uid handling in kernel space if we did, just that it isn't free. We have
> locking rules defined by the magic serializing construct called
> "the syscall" and you break those.

I agree. As mentioned, we probably will have fallout.

> The number of co-routines and stacks can be dealt with two ways - you use
> small stacks allocated when you create a fibril, or you grab a page, use
> separate IRQ stacks and either fail creation with -ENOBUFS etc which
> drops work on user space, or block (for which cases ??) which also means
> an overhead on co-routine exits. That can be tunable, for embedded easily
> tuned right down.

Right. It should be possible to just say "use a max parallelism factor of
5", and if somebody submits a hundred AIO calls and they all block, when
it hits #6, it will just do it synchronously.

Basically, what I'm hoping can come out of this (and this is a simplistic
example, but perhaps exactly *because* of that it hopefully also shows
that we can actually make *simple* interfaces for complex asynchronous
things):

	struct one_entry *prev = NULL;
	struct dirent *de;

	while ((de = readdir(dir)) != NULL) {
		struct one_entry *entry = malloc(..);

		/* Add it to the list, fill in the name */
		entry->next = prev;
		prev = entry;
		strcpy(entry->name, de->d_name);

		/* Do the stat lookup async */
		async_stat(de->d_name, &entry->stat_buf);
	}
	wait_for_async();

	.. Ta-daa! All done ..

and it *should* allow us to do all the stat lookup asynchronously.

Done right, this should basically be no slower than doing it with a real
stat() if everything was cached. That would kind of be the holy grail
here.

> You get some other funny things from co-routines which are very powerful,
> very dangerous, or plain insane

You forgot "very hard to think about".

We DO NOT want coroutines in general. It's clever, but it's
(a) impossible to do without language support that C doesn't have, or
some really really horrid macro constructs that really only work for
very specific and simple cases.
(b) very non-intuitive unless you've worked with coroutines a lot (and
almost nobody has)

> Linus will now tell me I'm out of my tree...

I don't think you're wrong in theory, I just think that in practice,
within the confines of (a) existing code, (b) existing languages, and (c)
existing developers, we really REALLY don't want to expose coroutines as
such.

But if you wanted to point out that what we want to do is get the
ADVANTAGES of coroutines, without actually having to program them as such,
then yes, I agree 100%. But we shouldn't call them coroutines, because the
whole point is that as far as the user interface is concerned, they don't
look like that. In the kernel, they just look like normal linear
programming.

Linus

2007-02-02 21:06:26

by Davide Libenzi

[permalink] [raw]
Subject: Re: [PATCH 2 of 4] Introduce i386 fibril scheduling

On Fri, 2 Feb 2007, Linus Torvalds wrote:

> > You get some other funny things from co-routines which are very powerful,
> > very dangerous, or plain insane
>
> You forgot "very hard to think about".
>
> We DO NOT want coroutines in general. It's clever, but it's
> (a) impossible to do without language support that C doesn't have, or
> some really really horrid macro constructs that really only work for
> very specific and simple cases.
> (b) very non-intuitive unless you've worked with coroutines a lot (and
> almost nobody has)

Actually, coroutines are not too bad to program once you have a
total-coverage async scheduler to run them. The attached (very sketchy)
example uses libpcl ( http://www.xmailserver.org/libpcl.html ) and epoll
as scheduler (but here you can really use anything). You can implement
coroutines in many ways, from C preprocessor macros up to anything, but in
the libpcl case they are simply switched stacks. Like fibrils are supposed
to be. The problem is that in order to make a real-life coroutine-based
application work, you need everything that can put you to sleep (syscalls
or any external library call you have no control over)
implemented in an async way. And what I ended up doing is exactly what Zab
did inside the kernel. In my case a dynamic pool of (userspace) threads
servicing any non-native potentially pre-emptive call, and signaling the
result to a pollable fd (pipe in my case) that is integrated in the epoll
(poll/select whatever) scheduler.
I personally find Zab's idea a really good one, since it allows for a generic
kernel async implementation, w/out the burden of dirtying kernel code
paths with AIO knowledge. Be it fibrils or real kthreads, it is IMO
definitely worth a very close look.
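
For readers who have not seen the pattern, a stripped-down sketch of the
userspace scheme described above (an editor's illustration, not the attached
cotest.c): a worker thread runs the blocking call and signals completion
through a pipe, which the epoll loop treats like any other event source.

#include <pthread.h>
#include <sys/epoll.h>
#include <unistd.h>
#include <stdio.h>

static int comp_pipe[2];	/* completion channel: worker -> scheduler */

struct async_req {
	long cookie;		/* caller-chosen request id */
	long result;		/* filled in by the worker */
};

/* Worker: do the potentially blocking work, then signal the pollable fd. */
static void *worker(void *arg)
{
	struct async_req *req = arg;

	req->result = 42;	/* stand-in for the real blocking call */
	write(comp_pipe[1], &req, sizeof(req));
	return NULL;
}

int main(void)
{
	struct epoll_event ev = { .events = EPOLLIN };
	struct async_req *done, r = { .cookie = 1 };
	pthread_t tid;
	int epfd;

	pipe(comp_pipe);
	epfd = epoll_create(16);
	ev.data.fd = comp_pipe[0];
	epoll_ctl(epfd, EPOLL_CTL_ADD, comp_pipe[0], &ev);

	pthread_create(&tid, NULL, worker, &r);

	/* The scheduler loop: completions show up as ordinary pollable events. */
	epoll_wait(epfd, &ev, 1, -1);
	read(comp_pipe[0], &done, sizeof(done));
	printf("cookie %ld done, result %ld\n", done->cookie, done->result);

	pthread_join(tid, NULL);
	return 0;
}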




- Davide


Attachments:
cotest.c (2.10 kB)

2007-02-02 21:10:07

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH 2 of 4] Introduce i386 fibril scheduling



On Fri, 2 Feb 2007, Davide Libenzi wrote:
>
> Actually, coroutines are not too bad to program once you have a
> total-coverage async scheduler to run them.

No, no, I don't disagree at all. In fact, I agree emphatically.

It's just that you need the scheduler to run them, in order to not "see"
them as coroutines. Then, you can program everything *as if* it was just a
regular declarative linear language with multiple threads.

And that gets us the same programming interface as we always have, and
people can forget about the fact that in a very real sense, they are using
coroutines with the scheduler just keeping track of it all for them.

After all, that's what we do between processes *anyway*. You can
technically see the kernel as one big program that uses coroutines and the
scheduler just keeping track of every coroutine instance. It's just that I
doubt that any kernel programmer really thinks in those terms. You *think*
in terms of "threads".

Linus

2007-02-02 21:18:18

by Alan

[permalink] [raw]
Subject: Re: [PATCH 2 of 4] Introduce i386 fibril scheduling

> They are NOT the generic co-routine that some languages support natively.
> So I think trying to call them coroutines would be even more misleading
> than calling them fibrils.

It's actually pretty damned close to the Honeywell B co-routine package, with
a kernel twist, to be honest.

> ends up looking like. There's a good reason why people program mostly in
> linear flow: that's how people think consciously - even if it's obviously
> not how the brain actually works).

The IRQ example below is an example of how it linearizes - so it cuts
both ways like most tools, admittedly one of the blades is at the handle
end in this case ...

> Basically, what I'm hoping can come out of this (and this is a simplistic
> example, but perhaps exactly *because* of that it hopefully also shows
> that we can actually make *simple* interfaces for complex asynchronous
> things):
>
> struct one_entry *prev = NULL;
> struct dirent *de;
>
> while ((de = readdir(dir)) != NULL) {
> struct one_entry *entry = malloc(..);
>
> /* Add it to the list, fill in the name */
> entry->next = prev;
> prev = entry;
> strcpy(entry->name, de->d_name);
>
> /* Do the stat lookup async */
> async_stat(de->d_name, &entry->stat_buf);
> }
> wait_for_async();

The brown and sticky will hit the rotating air impeller pretty hard if you
are not very careful about how that ends up scheduled. It's one thing to
exploit the ability to pull all the easy lookups out in advance, and
another, having created all the parallelism, to turn it into sane disk
scheduling and wakeups without a scaling hit. But you do at least have the
opportunity to exploit it I guess.

> > You get some other funny things from co-routines which are very powerful,
> > very dangerous, or plain insane
>
> You forgot "very hard to think about".

I'm not sure handing a fibril off to another task is that hard to think
about. It's not easy to turn it around as an async_exit() keeping the
other fibrils around, because of the mass of rules and behaviours tied to
process exit, but it's perhaps not impossible.

Other minor evil. If we use fibrils we need to be careful we
know in advance how many fibrils an operation needs so we don't deadlock
on them in critical places like writeout paths when we either hit the per
task limit or we have no page for another stack.

Alan

2007-02-02 21:30:30

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH 2 of 4] Introduce i386 fibril scheduling



On Fri, 2 Feb 2007, Alan wrote:
>
> The brown and sticky will hit the rotating air impeller pretty hard if you
> are not very careful about how that ends up scheduled

Why do you think that?

With cooperative scheduling (like the example Zach posted), there is
absolutely no "brown and sticky" wrt any CPU usage. Which is why
cooperative scheduling is a *good* thing. If you want to blow up your
1024-node CPU cluster, you'd do it with "real threads".

Also, with sane default limits of fibrils per process (say, in the 5-10
range), it also ends up being good for IO. No "insane" IO bombs, but an easy
way for users to just get a reasonable amount of IO parallelism without
having to use threading (which is hard).

So, best of both worlds.

Yes, *of*course* you want to have limits on outstanding work. And yes, a
database server would set those limits much higher ("Only a thousand
outstanding IO requests? Can we raise that to ten thousand, please?") than
a regular process ("default: 5, and the super-user can raise it for you if
you're good").

But there really shouldn't be any downsides.

(Of course, there will be downsides. I'm sure there will be. But I don't
see any really serious and obvious ones).

> Other minor evil. If we use fibrils we need to be careful we
> know in advance how many fibrils an operation needs so we don't deadlock
> on them in critical places like writeout paths when we either hit the per
> task limit or we have no page for another stack.

Since we'd only create fibrils on a system call entry level, and system
calls are independent, how would you do that anyway?

Once a fibril has been created, it will *never* depend on any other fibril
resources ever again. At least not in any way that any normal non-fibril
call wouldn't already do as far as I can see.

Linus

2007-02-02 22:27:38

by Ingo Molnar

[permalink] [raw]
Subject: Re: [PATCH 2 of 4] Introduce i386 fibril scheduling


* Linus Torvalds <[email protected]> wrote:

> On Fri, 2 Feb 2007, Ingo Molnar wrote:
> >
> > My only suggestion was to have a couple of transparent kernel threads
> > (not fibrils) attached to a user context that does asynchronous
> > syscalls! Those kernel threads would be 'switched in' if the current
> > user-space thread blocks - so instead of having to 'create' any of them
> > - the fast path would be to /switch/ them to under the current
> > user-space, so that user-space processing can continue under that other
> > thread!
>
> But in that case, you really do end up with "fibrils" anyway.
>
> Because those fibrils are what would be the state for the blocked
> system calls when they aren't scheduled.
>
> We may have a few hundred thousand system calls a second (maybe that's
> not actually reasonable, but it should be what we *aim* for), and 99%
> of them will hopefully hit the cache and never need any separate IO,
> but even if it's just 1%, we're talking about thousands of threads.
>
> I do _not_ think that it's reasonable to have thousands of threads
> state around just "in case". Especially if all those threadlets are
> then involved in signals etc - something that they are totally
> uninterested in.
>
> I think it's a lot more reasonable to have just the kernel stack page
> for "this was where I was when I blocked". IOW, a fibril-like thing.

ok, i think i noticed another misunderstanding. The kernel thread based
scheme i'm suggesting would /not/ 'switch' to another kernel thread in
the cached case, by default. It would just execute in the original
context (as if it were a synchronous syscall), and the switch to a
kernel thread from the pool would only occur /if/ the context is about
to block. (this 'switch' thing would be done by the scheduler)
User-space gets back an -EAIO error code immediately and transparently -
but already running under the new kernel thread.

i.e. in the fully cached case there would be no scheduling at all - in
fact no thread pool is needed at all.

regarding cost:

the biggest memory resource cost of a kernel thread (assuming it has no
real user-space context) /is/ its kernel stack page, which is 4K or 8K.
The task struct takes ~1.5K. Once we have a ready kernel thread around,
it's quite cheap to 'flip' it to under any arbitrary user-space context:
change its thread_info->task pointer to the user-space context's task
struct, copy the mm pointer, the fs pointer to the "worker thread",
switch the thread_info, update ptregs - done. Hm?

Note: such a 'flip' would only occur when the original context blocks,
/not/ on every async syscall.

regarding CPU resource costs, i dont think there should be significant
signal overhead, because the original task is still only one instance,
and the kernel thread that is now running with the blocked kernel stack
is not part of the signal set. (Although it might make sense to make
such async syscalls interruptible, just like any syscall.)

The 'pool' of kernel threads doesnt even have to be per-task, it can be
a natural per-CPU thing - and its size will grow/shrink [with a low
update frequency] depending on how much AIO parallelism there is in the
workload. (But it can also be strictly per-user-context - to make sure
that a proper ->mm ->fs, etc. is set up and that when the async system
calls execute they have all the right context info.)

and note the immediate scheduling benefits: if an app (say like
OpenOffice) is single-threaded but has certain common ops coded as async
syscalls, then if any of those syscalls blocks it could utilize
/more than one/ CPU. I.e. we could 'spread' a single-threaded app's
processing to multiple cores/hardware-threads /without/ having to
multi-thread the app in an intrusive way. I.e. this would be a
finegrained threading of syscalls, executed as coroutines in essence.
With fibrils all sorts of scheduling limitations occur and no
parallelism is possible.

in fact an app could also /trigger/ the execution of a syscall in a
different context - to create parallelism artificially - without any
blocking event. So we could do:

cookie1 = sys_async(sys_read, params);
cookie2 = sys_async(sys_write, params);

[ ... calculation loop ... ]

wait_on_async_syscall(cookie1);
wait_on_async_syscall(cookie2);

or something like that. Without user-space having to create threads
itself, etc. So basically, we'd make kernel threads more useful, and
we'd make threading safer - by only letting syscalls thread.

> What I like about fibrils is that they should be able to handle the
> cached case well: the case where no "real" scheduling (just the fibril
> stack switches) takes place.

the cached case (when a system call would not block at all) would not
necessitate any switch to another kernel thread at all - the task just
executes its system call as if it were synchronous!

that's the nice thing: we can do this switch-to-another-kernel-thread
magic thing right in the scheduler when we block - and the switched-to
thread will magically return to user-space (with a -EAIO return code) as
if nothing happened (while the original task blocks). I.e. under this
scheme i'm suggesting we have /zero/ setup cost in the cached case. The
optimistic case just falls through and switches to nothing else. Any
switching cost only occurs in the slowpath - and even that cost is very
low.

once a kernel thread that ran off with the original stack finishes the
async syscall and wants to return the return code, this can be gathered
via a special return-code ringbuffer that notifies finished syscalls. (A
magic cookie is associated to every async syscall.)
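
For concreteness, a tiny sketch of what such a user-visible completion ring
might look like; the layout, the names and the handle_completion() hook are
all hypothetical, not an existing ABI:

#define ASYNC_RING_SLOTS	1024

struct async_completion {
	unsigned long	cookie;	/* cookie handed back at submission time */
	long		ret;	/* the syscall's return code */
};

struct async_ring {
	volatile unsigned int	head;	/* advanced by the kernel */
	unsigned int		tail;	/* advanced by user-space */
	struct async_completion	slots[ASYNC_RING_SLOTS];
};

/* app-defined callback, also hypothetical */
extern void handle_completion(unsigned long cookie, long ret);

/* user-space side: reap whatever has completed so far */
static void reap_completions(struct async_ring *ring)
{
	while (ring->tail != ring->head) {
		struct async_completion *c =
			&ring->slots[ring->tail % ASYNC_RING_SLOTS];

		handle_completion(c->cookie, c->ret);
		ring->tail++;
	}
}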

> So the state machine argument is totally bogus - it results in a
> programming model that simply doesn't match the *normal* setup. You
> want the kernel programming model to appear "linear" even when it
> isn't, because it's too damn hard to think nonlinearly.
>
> Yes, we could do pathname lookup with that kind of insane setup too.
> But it would be HORRID!

yeah, but i guess not nearly as horrid as writing a new OS from scratch
;-)

seriously, i very much think and agree that programming state machines
is hard and not desired in most of the kernel. But it can be done, and
sometimes (definitely not in the common case) it's /cleaner/ than
functional programming. I've programmed an HTTP and an FTP in-kernel
server via a state machine and it worked better than i initially
expected. It needs different thinking but there /are/ people around with
that kind of thinking, so we just cannot exclude the possibility. [ It's
just that such people usually dedicate their brain to mental
fantasies^H^H^Hexercises called 'Higher Mathematics' :-) ]

> [...] The fact that networking (and TCP in particular) has state
> machines is because it is a packetized environment.

rough ballpark figures: for things like webserving or fileserving (or
mailserving), networking sockets are the reason for context-blocking
events in 90% of the cases (mostly due to networking latency). 9% of the
blocking happens due to plain block IO, and 1% happens due to VFS
metadata (inode, directory, etc.) blocking.

( in Tux i had to handle /all/ of these sources of blocking because even
1% kills your performance if you do a hundred thousand requests per
second - but in terms of design weight, networking is pretty damn
important. )

and interestingly, modern IO frameworks tend to gravitate towards a
packetized environment as well. I.e. i dont think state machines are
/that/ unimportant.

Ingo

2007-02-02 22:36:54

by Alan

[permalink] [raw]
Subject: Re: [PATCH 2 of 4] Introduce i386 fibril scheduling

> > The brown and sticky will hit the rotating air impeller pretty hard if you
> > are not very careful about how that ends up scheduled
>
> Why do you think that?
>
> With cooperative scheduling (like the example Zach posted), there is
> absolutely no "brown and sticky" wrt any CPU usage. Which is why
> cooperative scheduling is a *good* thing. If you want to blow up your
> 1024-node CPU cluster, you'd do it with "real threads".

You end up with a lot more things running asynchronously. In the current
world we see a series of requests for attributes and hopefully we do
readahead and all is neatly ordered. If fibrils are not ordered the same
way then we could make it worse as we might not pick the right readahead
for example.

> Since we'd only create fibrils on a system call entry level, and system
> calls are independent, how would you do that anyway?

If we stick to that limit it ought to be ok. We've been busy slapping
people who call sys_*, except for internal magic like kernel_thread.

2007-02-02 22:49:32

by Ingo Molnar

[permalink] [raw]
Subject: Re: [PATCH 2 of 4] Introduce i386 fibril scheduling


* Linus Torvalds <[email protected]> wrote:

> With cooperative scheduling (like the example Zach posted), there is
> absolutely no "brown and sticky" wrt any CPU usage. Which is why
> cooperative scheduling is a *good* thing. If you want to blow up your
> 1024-node CPU cluster, you'd do it with "real threads".

i'm not worried about the 1024-node cluster case.

i also fully agree that in some cases /not/ going parallel and having a
cooperative relationship between execution contexts can be good.

but if the application /has/ identified fundamental parallelism, we
/must not/ shut that parallelism off by /designing/ this interface to
use the fibril thing which is a limited cooperative, single-CPU entity.
I cannot over-emphasise how wrong that feels to me. Cooperativeness
isnt bad, but it should be an /optional/ thing, not hardcoded into the
design!

If the application tells us: "gee, you can execute this syscall in
parallel!" (which AIO /is/ about after all), and if we have idle
cores/hardware-threads nearby, it would be the worst thing to not
execute that in parallel if the syscall blocks or if the app asks for
that syscall to be executed in parallel right away, even in the cached
case.

if we were in the 1.2 days i might agree that fibrils are perhaps easier
on the kernel, but today the Linux kernel doesnt even use this
cooperativeness anywhere. We have all the hard work done already. The
full kernel is threaded. We can execute an arbitrary number of kernel
contexts off a single user context, we can execute parallel syscalls and
we scale very well doing so.

all that is needed is this new facility and some scheduler hacking to
enable "transparent, kernel-side threading". That enables AIO,
coroutines and more. It brings threading to a whole new level, because
it makes it readily and gradually accessible to single-threaded apps
too.

[ and if we are worried about the 1024 CPU cluster (or about memory use)
then we could limit such threads to only overlap in a limited number,
etc. Just like we'd have to do with fibrils anyway. But with fibrils
we /force/ single-threadedness, which, i'm quite sure, is just about
the worst thing we can do. ]

Ingo

2007-02-02 22:49:34

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH 2 of 4] Introduce i386 fibril scheduling



On Fri, 2 Feb 2007, Ingo Molnar wrote:
>
> Note: such a 'flip' would only occur when the original context blocks,
> /not/ on every async syscall.

Right.

So can you take a look at Zach's fibril idea again? Because that's exactly
what it does. It basically sets a flag, saying "flip to this when you
block or yield". Of course, it's a bit bigger than just a flag, since it
needs to describe what to flip to, but that's the basic idea.

Now, if you want to make fibrils *also* then actually use a separate
thread, that's an extension. But you were arguing as if they should use
threads to begin with, and that sounds stupid. Now you seem to retract it,
since you say "only if you need to block".

THAT'S THE POINT. That's what makes fibrils cooperative. The "only if you
block" is really what makes a fibril be something else than a regular
thread.

Linus

2007-02-02 23:02:05

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH 2 of 4] Introduce i386 fibril scheduling



On Fri, 2 Feb 2007, Ingo Molnar wrote:
>
> but if the application /has/ identified fundamental parallelism, we
> /must not/ shut that parallelism off by /designing/ this interface to
> use the fibril thing which is a limited cooperative, single-CPU entity.

Right. We should for example encourage people to use some kind of
parallelizing construct.

I know! We could even call them "threads", so as to give people the idea that
they are independent smaller entities in a thicker "rope", and we could
call that bigger entity a "task" or "process", since it "processes" data.

Or is that just too far out?

Linus

2007-02-02 23:19:11

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH 2 of 4] Introduce i386 fibril scheduling



On Fri, 2 Feb 2007, Linus Torvalds wrote:
> On Fri, 2 Feb 2007, Ingo Molnar wrote:
> >
> > but if the application /has/ identified fundamental parallelism, we
> > /must not/ shut that parallelism off by /designing/ this interface to
> > use the fibril thing which is a limited cooperative, single-CPU entity.
>
> Right. We should for example encourage people to use some kind of
> paralellizing construct.
>
> I know! We could even call them "threads", so to give people the idea that
> they are independent smaller entities in a thicker "rope", and we could
> call that bigger entity a "task" or "process", since it "processes" data.
>
> Or is that just too far out?

So the above was obviously tongue-in-cheek, but you should really think
about the context here.

We're discussing doing *single* system calls. There is absolutely zero
point to try to parallelize the work over multiple CPU's or threads. We're
literally talking about doing things where the actual CPU cost is in the
hundreds of nanoseconds, and where traditionally a rather noticeable part
of the cost is not the code itself, but the high cost of taking a system
call trap, and saving all the register state.

When parallelising "real work", I absolutely agree with you: we should use
threads. But you need to look at what it is we parallelize here, and ask
yourself why we're doing what we're doing, and why people aren't *already*
just using a separate thread for it.

Linus

2007-02-02 23:37:14

by Davide Libenzi

[permalink] [raw]
Subject: Re: [PATCH 2 of 4] Introduce i386 fibril scheduling

On Fri, 2 Feb 2007, Ingo Molnar wrote:

> in fact an app could also /trigger/ the execution of a syscall in a
> different context - to create parallelism artificially - without any
> blocking event. So we could do:
>
> cookie1 = sys_async(sys_read, params);
> cookie2 = sys_async(sys_write, params);
>
> [ ... calculation loop ... ]
>
> wait_on_async_syscall(cookie1);
> wait_on_async_syscall(cookie2);
>
> or something like that. Without user-space having to create threads
> itself, etc. So basically, we'd make kernel threads more useful, and
> we'd make threading safer - by only letting syscalls thread.

Since I still think that the many-thousands potential async operations
coming from network sockets are better handled with a classical event
mechanism [1], and since smooth integration of new async syscalls into the
standard POSIX infrastructure is IMO a huge win, I think we need to have a
"bridge" to allow async completions being detectable through a pollable
(by the mean of select/poll/epoll whatever) device.
In that way you can handle async operations with the best mechanism that
is fit for them, and gather them in a single async scheduler.



[1] Unless you really want to have thousands of kthreads/fibrils lingering
on the system.



- Davide


2007-02-02 23:52:25

by Alan

[permalink] [raw]
Subject: Re: [PATCH 2 of 4] Introduce i386 fibril scheduling

> When parallelising "real work", I absolutely agree with you: we should use
> threads. But you need to look at what it is we parallelize here, and ask
> yourself why we're doing what we're doing, and why people aren't *already*
> just using a separate thread for it.

Because it's a pain in the arse and because it's very hard to self-tune. If
you've got async_anything then the thread/fibril/synchronous/whatever
decision can be made kernel-side based upon expected cost and other
tradeoffs, even if it's as dumb as per-syscall or per-syscall/filp type
guessing.

Alan

2007-02-03 00:02:15

by Davide Libenzi

[permalink] [raw]
Subject: Re: [PATCH 2 of 4] Introduce i386 fibril scheduling

On Fri, 2 Feb 2007, Davide Libenzi wrote:

> On Fri, 2 Feb 2007, Ingo Molnar wrote:
>
> > in fact an app could also /trigger/ the execution of a syscall in a
> > different context - to create parallelism artificially - without any
> > blocking event. So we could do:
> >
> > cookie1 = sys_async(sys_read, params);
> > cookie2 = sys_async(sys_write, params);
> >
> > [ ... calculation loop ... ]
> >
> > wait_on_async_syscall(cookie1);
> > wait_on_async_syscall(cookie2);
> >
> > or something like that. Without user-space having to create threads
> > itself, etc. So basically, we'd make kernel threads more useful, and
> > we'd make threading safer - by only letting syscalls thread.
>
> Since I still think that the many-thousands potential async operations
> coming from network sockets are better handled with a classical event
> mechanism [1], and since smooth integration of new async syscalls into the
> standard POSIX infrastructure is IMO a huge win, I think we need to have a
> "bridge" to allow async completions being detectable through a pollable
> (by the mean of select/poll/epoll whatever) device.
> In that way you can handle async operations with the best mechanism that
> is fit for them, and gather them in a single async scheduler.

To clarify further, below are the API and the use case of my userspace
implementation. The guasi_fd() gives you back a pollable (POLLIN) fd to be
integrated in your preferred event retrieval interface. Once the fd is
signaled, you can fetch your completed requests using guasi_fetch() and
schedule work based on that.
The GUASI implementation uses pthreads, but it is clear that an in-kernel
async syscall implementation can take wiser decisions, and optimize the
heck out of it (locks, queues, ...).




- Davide



/*
* Example of async pread using GUASI
*/
static long guasi_wrap__pread(void *priv, long const *params) {

	return (long) pread((int) params[0], (void *) params[1],
			    (size_t) params[2], (off_t) params[3]);
}

guasi_req_t guasi__pread(guasi_t hctx, void *priv, void *asid, int prio,
			 int fd, void *buf, size_t size, off_t off) {

	return guasi_submit(hctx, priv, asid, prio, guasi_wrap__pread, 4,
			    (long) fd, (long) buf, (long) size, (long) off);
}


---
/*
* guasi by Davide Libenzi (generic userspace async syscall implementation)
* Copyright (C) 2003 Davide Libenzi
*
* This program is free software; you can redistribute it and/or modify
* it under the terms of the GNU General Public License as published by
* the Free Software Foundation; either version 2 of the License, or
* (at your option) any later version.
*
* This program is distributed in the hope that it will be useful,
* but WITHOUT ANY WARRANTY; without even the implied warranty of
* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
* GNU General Public License for more details.
*
* You should have received a copy of the GNU General Public License
* along with this program; if not, write to the Free Software
* Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA
*
* Davide Libenzi <[email protected]>
*
*/

#if !defined(_GUASI_H)
#define _GUASI_H


#define GUASI_MAX_PARAMS 16

#define GUASI_STATUS_PENDING 1
#define GUASI_STATUS_ACTIVE 2
#define GUASI_STATUS_COMPLETE 3


typedef long (*guasi_syscall_t)(void *, long const *);

typedef struct s_guasi { } *guasi_t;
typedef struct s_guasi_req { } *guasi_req_t;

struct guasi_reqinfo {
	void	*priv;		/* Call private data. Passed to guasi_submit */
	void	*asid;		/* Async request ID. Passed to guasi_submit */
	long	result;		/* Return code of "proc" passed to guasi_submit */
	long	error;		/* errno */
	int	status;		/* GUASI_STATUS_* */
};



guasi_t guasi_create(int min_threads, int max_threads, int max_priority);
void guasi_free(guasi_t hctx);
int guasi_fd(guasi_t hctx);
guasi_req_t guasi_submit(guasi_t hctx, void *priv, void *asid, int prio,
			 guasi_syscall_t proc, int nparams, ...);
int guasi_fetch(guasi_t hctx, guasi_req_t *reqs, int nreqs);
int guasi_req_info(guasi_req_t hreq, struct guasi_reqinfo *rinf);
void guasi_req_free(guasi_req_t hreq);

#endif
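
To make the flow concrete, here is a hedged usage sketch against the header
above; it reuses the guasi__pread() wrapper from the earlier example, assumes
the header is installed as guasi.h and that guasi_fetch() returns the number
of completed requests, and it skips all error handling:

#include <fcntl.h>
#include <poll.h>
#include <stdio.h>
#include "guasi.h"

int main(void)
{
	char buf[4096];
	struct guasi_reqinfo rinf;
	struct pollfd pfd;
	guasi_req_t req;
	guasi_t ctx;
	int fd;

	ctx = guasi_create(2, 8, 1);	/* min threads, max threads, max priority */
	fd = open("/etc/hosts", O_RDONLY);

	/* submit an async pread, tagged with an application cookie of 0x1 */
	guasi__pread(ctx, NULL, (void *) 0x1, 0, fd, buf, sizeof(buf), 0);

	/* completions become visible as POLLIN on the context's fd */
	pfd.fd = guasi_fd(ctx);
	pfd.events = POLLIN;
	poll(&pfd, 1, -1);

	if (guasi_fetch(ctx, &req, 1) == 1) {
		guasi_req_info(req, &rinf);
		printf("asid %p: result %ld errno %ld\n",
		       rinf.asid, rinf.result, rinf.error);
		guasi_req_free(req);
	}

	guasi_free(ctx);
	return 0;
}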


2007-02-03 00:03:38

by Ingo Molnar

[permalink] [raw]
Subject: Re: [PATCH 2 of 4] Introduce i386 fibril scheduling


* Linus Torvalds <[email protected]> wrote:

> THAT'S THE POINT. That's what makes fibrils cooperative. The "only if
> you block" is really what makes a fibril be something else than a
> regular thread.

Well, in my picture, 'only if you block' is a pure thread utilization
decision: bounce a piece of work to another thread if this thread cannot
complete it. (if the kernel is lucky enough that the user context told
it "it's fine to do that".)

it is 'incidental parallelism' instead of 'intentional parallelism', but
the random and unpredictable nature of it doesnt change anything about
the fundamental fact: we start a new thread of execution in essence.

Typically it will be rare in a workload as it will be driven by
cachemisses, but for example in DB workloads the 'cachemiss' will be the
/common case/ - because the DB manages the cache itself.

And how to run a thread of execution is a fundamental /scheduling/
decision: it is the acceptance of and the adaptation to the cost of work
migration - if no forced wait happens then often it's cheaper to execute
all work locally and serially.

[ in fact, such a mechanism doesnt even always have to be driven from
the scheduler itself: such a 'bounce current work to another thread'
event could occur when we detect that a pagecache page is missing and
that we have to do a ->readpage, etc. Tux does that since 1999: the
cutoff for 'bounce work' was when a soft cache (the pagecache or the
dentry cache) was missed - not when we went into the IO path. This has
the advantage that the Tux cachemiss threads could do /all/ the IO
preparation and IO completion on the same CPU and in one go - while
the user context was able to continue executing. ]

But this is also a function of hardware: for example on a Transputer i'd
bounce off all such work immediately (even if it's a sys_time()
syscall), all the time, even if fully cached, no exceptions, because the
hardware is such that another CPU can pick it up in the next cycle.

while we definitely dont want to bounce short-lived cached syscalls to
another thread, for longer ones or ones which we /expect/ to block we
might want to do it like that straight away. [Especially on a multi-core
CPU that has a shared L2 cache (and doubly so on a HT/SMT CPU that has a
shared L1 cache).]

i dont see anything here that mandates (or even strongly supports) the
notion of cooperative scheduling. The moment a context sees a 'cache
miss', it is totally fair to potentially distribute it to other CPUs. It
wont run for a long time and it will be totally cache-cold when the 'IO
done' event occurs - hence we should schedule it where the IO event
occurred. Which might easily be the same CPU where the user context is
running right now (we prefer last-run CPUs on wakeups), but not
necessarily - it's a general scheduling decision.

> > Note: such a 'flip' would only occur when the original context
> > blocks, /not/ on every async syscall.
>
> Right.
>
> So can you take a look at Zach's fibril idea again? Because that's
> exactly what it does. It basically sets a flag, saying "flip to this
> when you block or yield". Of course, it's a bit bigger than just a
> flag, since it needs to describe what to flip to, but that's the basic
> idea.

i know Zach's code ... i really do. Even if i didnt look at the code
(which i did), Jonathan Corbet did a very nice writeup about fibrils on
LWN.net two days ago, which i've read as well:

http://lwn.net/Articles/219954/

So there's no misunderstanding on my side i think.

> Now, if you want to make fibrils *also* then actually use a separate
> thread, that's an extension.

oh please, Linus. I /did/ suggest this as an extension to Zach's idea!
Look at the Subject line - i'm reacting to the specific fibril code of
Zach. I wrote this:

| as per my other email, i dont really like this concept. This is the
| killer:
|
| > [...] There can be multiple of them in the process of executing for
| > a given task_struct, but only one can every be actively running at a
| > time. [...]
|
| there's almost no scheduling cost from being able to arbitrarily
| schedule a kernel thread - but there are /huge/ benefits in it.
|
| would it be hard to redo your AIO patches based on a pool of plain
| simple kernel threads?

see http://lkml.org/lkml/2007/2/1/40.

Ingo

2007-02-03 00:24:01

by bert hubert

[permalink] [raw]
Subject: Re: [PATCH 2 of 4] Introduce i386 fibril scheduling

On Fri, Feb 02, 2007 at 03:17:57PM -0800, Linus Torvalds wrote:

> threads. But you need to look at what it is we parallelize here, and ask
> yourself why we're doing what we're doing, and why people aren't *already*
> just using a separate thread for it.

Partially this is for the bad reason that creating "i/o threads" (or even
processes) has a bad stigma to it, and additionally has always felt crummy.

On the first reason, the 'pain' of creating threads is actually rather
minor, so this feeling may have been wrong. The main thing is that you don't
wantonly create a thousand i/o threads, whereas you conceivably might want
to have a thousand outstanding i/o requests. At least I know I want to have
that ability.

Secondly, the actual mechanics of i/o processes aren't trivial, and feel
wasteful with lots of additional copying, or in the case of threads,
queueing and posting.

Bert

--
http://www.PowerDNS.com Open source, database driven DNS Software
http://netherlabs.nl Open and Closed source services

2007-02-03 00:56:56

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH 2 of 4] Introduce i386 fibril scheduling



On Sat, 3 Feb 2007, Ingo Molnar wrote:
>
> Well, in my picture, 'only if you block' is a pure thread utilization
> decision: bounce a piece of work to another thread if this thread cannot
> complete it. (if the kernel is lucky enough that the user context told
> it "it's fine to do that".)

Sure, you can do it that way too. But at that point, your argument that we
shouldn't do it with fibrils is wrong: you'd still need basically the
exact same setup that Zach does in his fibril stuff, and the exact same
hook in the scheduler, testing the exact same value ("do we have a pending
queue of work").

So at that point, you really are arguing about a rather small detail in
the implementation, I think.

Which is fair enough.

But I actually think the *bigger* argument and problems are elsewhere,
namely in the interface details. Notably, I think the *real* issues end up being
how we handle synchronization, and how we handle signalling. Those are in
many ways (I think) more important than whether we actually can schedule
these trivial things on multiple CPU's concurrently or not.

For example, I think serialization is potentially a much more expensive
issue. Could we, for example, allow users to serialize with these things
*without* having to go through the expense of doing a system call? Again,
I'm thinking of the case of no IO happening, in which case there also
won't be any actual threading taking place, in which case it's a total
waste of time to do a system call at all.

And trying to do that actually has implications for the interfaces (like
possibly returning a zero cookie for the async() system call if it was
doable totally synchronously?)
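
As a sketch of how that could look from user-space (async() and async_wait()
are purely hypothetical names, echoing the interface being discussed, and
nothing here is a real system call):

#include <sys/stat.h>
#include <sys/syscall.h>

extern long async(long nr, ...);	/* hypothetical submission call */
extern int async_wait(long cookie);	/* hypothetical completion wait */

static void stat_maybe_async(const char *path, struct stat *st)
{
	long cookie = async(SYS_stat, path, st);

	if (cookie == 0)
		return;		/* completed synchronously; *st already filled in */

	async_wait(cookie);	/* real IO happened, reap the completion */
}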

Signal handling is similar: I actually think that an "async()" system call
should be interruptible within the context of the caller, since we would
want to *try* to execute it synchronously. That automatically means that
we have semantic meaning for fibrils and signal handling.

Finally, can we actually get POSIX aio semantics with this? Can we
implement the current aio_xyzzy() system calls using this same feature?
And most importantly - does it perform well enough that we really can do
that?

THOSE are to me bigger questions than what happens inside the kernel, and
whether we actually end up using another thread if we end up doing it
non-synchronously.

Linus

2007-02-03 07:10:18

by Suparna Bhattacharya

[permalink] [raw]
Subject: Re: [PATCH 2 of 4] Introduce i386 fibril scheduling

On Fri, Feb 02, 2007 at 04:56:22PM -0800, Linus Torvalds wrote:
>
> On Sat, 3 Feb 2007, Ingo Molnar wrote:
> >
> > Well, in my picture, 'only if you block' is a pure thread utilization
> > decision: bounce a piece of work to another thread if this thread cannot
> > complete it. (if the kernel is lucky enough that the user context told
> > it "it's fine to do that".)
>
> Sure, you can do it that way too. But at that point, your argument that we
> shouldn't do it with fibrils is wrong: you'd still need basically the
> exact same setup that Zach does in his fibril stuff, and the exact same
> hook in the scheduler, testing the exact same value ("do we have a pending
> queue of work").
>
> So at that point, you really are arguing about a rather small detail in
> the implementation, I think.
>
> Which is fair enough.
>
> But I actually think the *bigger* argument and problems are elsewhere,
> namely in the interface details. Notably, I think the *real* issues end up
> how we handle synchronization, and how we handle signalling. Those are in
> many ways (I think) more important than whether we actually can schedule
> these trivial things on multiple CPU's concurrently or not.
>
> For example, I think serialization is potentially a much more expensive
> issue. Could we, for example, allow users to serialize with these things
> *without* having to go through the expense of doing a system call? Again,
> I'm thinking of the case of no IO happening, in which case there also
> won't be any actual threading taking place, in which case it's a total
> waste of time to do a system call at all.
>
> And trying to do that actually has implications for the interfaces (like
> possibly returning a zero cookie for the async() system call if it was
> doable totally synchronously?)

This would be useful - the application wouldn't have to set up state
to remember for handling completions for operations that complete
synchronously. I know the Samba folks would like that.

The laio_syscall implementation (Lazy asynchronous IO) seems to have
experimented with such an interface:
http://www.usenix.org/events/usenix04/tech/general/elmeleegy.html

Regards
Suparna

>
> Signal handling is similar: I actually think that a "async()" system call
> should be interruptible within the context of the caller, since we would
> want to *try* to execute it synchronously. That automatically means that
> we have semantic meaning for fibrils and signal handling.
>
> Finally, can we actually get POSIX aio semantics with this? Can we
> implement the current aio_xyzzy() system calls using this same feature?
> And most importantly - does it perform well enough that we really can do
> that?
>
> THOSE are to me bigger questions than what happens inside the kernel, and
> whether we actually end up using another thread if we end up doing it
> non-synchronously.
>
> Linus
>

--
Suparna Bhattacharya ([email protected])
Linux Technology Center
IBM Software Lab, India

2007-02-03 08:37:52

by Ingo Molnar

[permalink] [raw]
Subject: Re: [PATCH 2 of 4] Introduce i386 fibril scheduling


* Linus Torvalds <[email protected]> wrote:

> On Sat, 3 Feb 2007, Ingo Molnar wrote:
> >
> > Well, in my picture, 'only if you block' is a pure thread
> > utilization decision: bounce a piece of work to another thread if
> > this thread cannot complete it. (if the kernel is lucky enough that
> > the user context told it "it's fine to do that".)
>
> Sure, you can do it that way too. But at that point, your argument
> that we shouldn't do it with fibrils is wrong: you'd still need
> basically the exact same setup that Zach does in his fibril stuff, and
> the exact same hook in the scheduler, testing the exact same value
> ("do we have a pending queue of work").

did i ever say a single word of complaint about those bits? Those are
not an issue to me. They can be applied to kernel threads just as much.

As i babbled in the very first email about this topic:

| 1) improve our basic #1 design gradually. If something is a
| bottleneck, if the scheduler has grown too fat, cut some slack. If
| micro-threads or fibrils offer anything nice for our basic thread
| model: integrate it into the kernel.

i should have said explicitly that to flip user-space from one kernel
thread to another one (upon blocking or per request) is a nice thing and
we should integrate that into the kernel's thread model.

But really, being a scheduler guy i was much more concerned about the
duplication and problems caused by the fibril concept itself - which
duplication and complexity makes up 80% of Zach's submitted patchset.
For example this bit:

[PATCH 3 of 4] Teach paths to wake a specific void * target

would totally go away if we used kernel threads for this. In the fibril
approach this is where the mess starts. Either a 'normal' wakeup has to
wake up all fibrils, or we have to make damn sure that a target which in
reality is a fibril is never woken via wake_up/wake_up_process.

( Furthermore, i tried to include user-space micro-threads in the
argument as well, which Evgeniy Polyakov raised not so long ago related
to the kevent patchset. All these micro-thread things are of a similar
genre. )

i totally agree that the API /should/ be the main focus - but i didn't
pick the topic and most of the patchset's current size is due to the IMO
avoidable fibril concept.

regarding the API, i don't really agree with the current form and design
of Zach's interface.

fundamentally, the basic entity of this thing should be a /system call/,
not the artificial fibril thing:

+struct asys_call {
+	struct asys_result *result;
+	struct fibril fibril;
+};

i.e. the basic entity should be something that represents a system call,
with its up to 6 arguments, the later return code, state, flags and two
list entries:

struct async_syscall {
	unsigned long nr;
	unsigned long args[6];
	long err;
	unsigned long state;
	unsigned long flags;
	struct list_head list;
	struct list_head wait_list;
	unsigned long __pad[2];
};

(64 bytes on 32-bit, 128 bytes on 64-bit)

furthermore, i think this API should be fundamentally vectored and
fundamentally async, and hence could solve another issue as well:
submitting many little pieces of work of different IO domains in one go.

[ detail: there should be no traditional signals used at all (Zach's
stuff doesnt use them, and correctly so), only if the async syscall
that is performed generates a signal. ]

The normal and most optimal workflow should be a user-space ring-buffer
of these constant-size struct async_syscall entries:

struct async_syscall ringbuffer[1024];

LIST_HEAD(submitted);
LIST_HEAD(pending);
LIST_HEAD(completed);

the 3 list heads are both known to the kernel and to user-space, and are
actively managed by both. The kernel drives the execution of the async
system calls based on the 'submitted' list head (until it empties it)
and moves them over to the 'pending' list. User-space can complete async
syscalls based on the 'completed' list. (but a syscall can optionally be
marked as 'autocomplete' as well via the 'flags' field, in that case
it's not moved to the 'completed' list but simply removed from the
'pending' list. This can be useful for system calls that have some
implicit notification effect.)

( Note: optionally, a helper kernel-thread, when it finishes processing
a syscall, could also asynchronously check the 'submitted' list and
pick up new work. That would allow the submission of new syscalls
without any entry into the kernel. So for example on an SMT system,
this could in essence result in one CPU running in pure user-space
submitting async syscalls via the ringbuffer, while another CPU would
be running pure kernel-space, executing those entries. )

another crucial bit is the waiting on pending work. But because every
pending syscall entity is either already completed or has a real kernel
thread associated with it, that bit is mostly trivial: user-space can
wait on 'any' pending syscall to complete, or it could wait for a
specific list of syscalls to complete (using the ->wait_list). It could
also wait on 'a minimum number of N syscalls to complete' - to create
batching of execution. And of course it can periodically check the
'completed' list head if it has a constant and highly parallel flow of
workload - that way the 'waiting' does not actually have to happen most
of the time.
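
To make the intended flow concrete, here is a minimal user-space sketch
of the submission side. It assumes the struct async_syscall layout above
and a trivial list_head definition; submit_read() and the exact way the
entry gets handed to the kernel are made up for illustration only:

#include <string.h>
#include <sys/syscall.h>

struct list_head { struct list_head *next, *prev; };

struct async_syscall {
	unsigned long nr;
	unsigned long args[6];
	long err;
	unsigned long state;
	unsigned long flags;
	struct list_head list;
	struct list_head wait_list;
	unsigned long __pad[2];
};

static struct async_syscall ringbuffer[1024];

/* fill one constant-size entry; user space would then link it onto the
 * 'submitted' list and the kernel (or a worker picking up new work on
 * its own) would move it to 'pending' and finally 'completed', with
 * ->err holding the system call's return code */
static void submit_read(struct async_syscall *e, int fd, void *buf,
			unsigned long len)
{
	memset(e, 0, sizeof(*e));
	e->nr = SYS_read;
	e->args[0] = fd;
	e->args[1] = (unsigned long)buf;
	e->args[2] = len;
}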

Looks like we can hit many birds with this single stone: AIO, vectored
syscalls, finegrained system-call parallelism. Hm?

Ingo

2007-02-03 09:39:13

by Matt Mackall

[permalink] [raw]
Subject: Re: [PATCH 2 of 4] Introduce i386 fibril scheduling

On Sat, Feb 03, 2007 at 09:23:08AM +0100, Ingo Molnar wrote:
> The normal and most optimal workflow should be a user-space ring-buffer
> of these constant-size struct async_syscall entries:
>
> struct async_syscall ringbuffer[1024];
>
> LIST_HEAD(submitted);
> LIST_HEAD(pending);
> LIST_HEAD(completed);

It's wrong to call this a ring buffer as things won't be completed in
any particular order. So you'll need a fourth list head for which
buffer elements are free. At which point, you might as well leave it
entirely up to the application to manage the allocation of
async_syscall structs. It may know it only needs two, or ten thousand,
or five per client...

--
Mathematics is the supreme nostalgia of our time.

2007-02-03 10:19:29

by Ingo Molnar

[permalink] [raw]
Subject: Re: [PATCH 2 of 4] Introduce i386 fibril scheduling


* Matt Mackall <[email protected]> wrote:

> On Sat, Feb 03, 2007 at 09:23:08AM +0100, Ingo Molnar wrote:
> > The normal and most optimal workflow should be a user-space ring-buffer
> > of these constant-size struct async_syscall entries:
> >
> > struct async_syscall ringbuffer[1024];
> >
> > LIST_HEAD(submitted);
> > LIST_HEAD(pending);
> > LIST_HEAD(completed);
>
> It's wrong to call this a ring buffer as things won't be completed in
> any particular order. [...]

yeah, i realized this when i sent the mail. I wanted to say 'array of
elements' - and it's clear from these list heads that it's fully out of
order. (it should be an array so that the pages of those entries can be
pinned and that completion can be manipulated from any context,
anytime.)

(the queueing i described closely resembles Tux's "Tux syscall request"
handling scheme.)

> [...] So you'll need a fourth list head for which buffer elements are
> free. At which point, you might as well leave it entirely up to the
> application to manage the allocation of async_syscall structs. It may
> know it only needs two, or ten thousand, or five per client...

sure - it should be variable but still the array should be compact, and
should be registered with the kernel. That way security checks can be
done once, the pages can be pinned, accessed anytime, etc.

Ingo

2007-02-03 14:11:55

by George Spelvin

[permalink] [raw]
Subject: Re: [PATCH 2 of 4] Introduce i386 fibril scheduling

First of all, may I say, this is a wonderful piece of work.
It absolutely reeks of The Right Thing. Well done!

However, while I need to study it in a lot more detail, I think Ingo's
implementation ideas make a lot more immediate sense. It's the same
idea that I thought up.

Let me make it concrete. When you start an async system call:
- Preallocate a second kernel stack, but don't do anything
with it. There should probably be a per-CPU pool of
preallocated threads to avoid too much allocation and
deallocation.
- Also at this point, do any resource limiting.
- Set the (normally NULL) "thread blocked" hook pointer to
point to a handler, as explained below.
- Start down the regular system call path.
- In the fast-path case, the system call completes without blocking and
we set up the completion structure and return to user space.
We may want to return a special value to user space to tell it that
there's no need to call asys_await_completion. I think of it as the
Amiga's IOF_QUICK.
- Also, when returning, check and clear the thread-blocked hook.

Note that we use one (cache-hot) stack for everything and do as little
setup as possible on the fast path.

However, if something blocks, it hits the slow path:
- If something would block the thread, the scheduler invokes the
thread-blocked hook before scheduling a new thread.
- The hook copies the necessary state to a new (preallocated) kernel
stack, which takes over the original caller's identity, so it can return
immediately to user space with an "operation in progress" indicator.
- The scheduler hook is also cleared.
- The original thread is blocked.
- The new thread returns to user space and execution continues.

- The original thread completes the system call. It may block again,
but as its block hook is now clear, no more scheduler magic happens.

- When the operation completes and returns to sys_asys_submit(), it
notices that its scheduler hook is no longer set. Thus, this is a
kernel-only worker thread, and it fills in the completion structure,
places itself back in the available pool, and commits suicide.
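
As a very rough sketch of that slow path (all names here - blocked_hook,
async_submission, check_blocked_hook() - are hypothetical, not from the
posted patches):

struct async_submission {
	/* set at submission time, cleared after its first use */
	void (*blocked_hook)(struct async_submission *);
	struct task_struct *spare;	/* preallocated from the per-CPU pool */
};

/* called by the scheduler just before the current task would block */
static void check_blocked_hook(struct async_submission *sub)
{
	if (!sub || !sub->blocked_hook)
		return;			/* normal blocking, no magic */

	/* copy the caller's identity onto the spare stack so it can
	 * return "in progress" to user space; the original stack keeps
	 * running the system call to completion */
	sub->blocked_hook(sub);

	/* one-shot: if the call blocks again later, nothing happens */
	sub->blocked_hook = NULL;
}

When the call finally finishes, finding the hook already consumed is
what tells the (now worker-only) thread to fill in the completion and
put itself back in the pool, as described above.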

Now, there is no chance that we will ever implement kernel state machines
for every little ioctl. However, there may be some "async fast paths"
state machines that we can use. If we're in a situation where we can
complete the operation without a kernel thread at all, then we can
detect the "would block" case (probably in-line, but you could
use a different scheduler hook function) and set up the state machine
structure. Then return "operation in progress" and let the I/O
complete in its own good time.

Note that you don't need to implement all of a system call as an explicit
state machine; only its completion. So, for example, you could do
indirect block lookups via an implicit (stack-based) state machine,
but the final I/O via an explicit one. And you could do this only for
normal block devices and not NFS. You only need to convert the hot
paths to the explicit state machine form; the bulk of the kernel code
can use separate kernel threads to do async system calls.

I'm also in the "why do we need fibrils?" camp. I'm studying the code,
and looking for a reason, but using the existing thread abstraction
seems best. If you encountered some fundamental reason why kernel threads
were Really Really Hard, then maybe it's worth it, but it's a new entity,
and entia non sunt multiplicanda praeter necessitatem.


One thing you could do for real-time tasks, in addition to the
non-blocking flag (return EAGAIN from asys_submit rather than blocking),
is an "atomic" flag that would avoid blocking to preallocate
the additional kernel thread! Then you'd really be guaranteed no
long delays, ever.

2007-02-04 05:12:36

by Davide Libenzi

[permalink] [raw]
Subject: Re: [PATCH 2 of 4] Introduce i386 fibril scheduling

On Tue, 30 Jan 2007, Zach Brown wrote:

> + /*
> + * XXX The idea is to copy all but the actual call stack. Obviously
> + * this is wildly arch-specific and belongs abstracted out.
> + */
> + *next->ti = *ti;
> + *thread_info_pt_regs(next->ti) = *thread_info_pt_regs(ti);

arch copy_thread_info()?



> + current->per_call = next->per_call;

Pointer instead of structure copy? percall_clone()/percall_free()?



> +	/* always switch to a runnable fibril if we aren't being preempted */
> +	if (unlikely(!(preempt_count() & PREEMPT_ACTIVE) &&
> +		     !list_empty(&prev->runnable_fibrils))) {
> +		schedule_fibril(prev);
> +		/*
> +		 * finish_fibril_switch() drops the rq lock and enables
> +		 * premption, but the popfl disables interrupts again. Watch
> +		 * me learn how context switch locking works before your very
> +		 * eyes! XXX This will need to be fixed up by throwing
> +		 * together something like the prepare_lock_switch() path the
> +		 * scheduler does. Guidance appreciated!
> +		 */
> +		local_irq_enable();
> +		return;
> +	}

Yes, please (prepare/finish) ... ;)



- Davide


2007-02-04 05:12:53

by Davide Libenzi

[permalink] [raw]
Subject: Re: [PATCH 4 of 4] Introduce aio system call submission and completion system calls

On Tue, 30 Jan 2007, Zach Brown wrote:

> +void asys_task_exiting(struct task_struct *tsk)
> +{
> +	struct asys_result *res, *next;
> +
> +	list_for_each_entry_safe(res, next, &tsk->asys_completed, item)
> +		kfree(res);
> +
> +	/*
> +	 * XXX this only works if tsk->fibril was allocated by
> +	 * sys_asys_submit(), not if its embedded in an asys_call. This
> +	 * implies that we must forbid sys_exit in asys_submit.
> +	 */
> +	if (tsk->fibril) {
> +		BUG_ON(!list_empty(&tsk->fibril->run_list));
> +		kfree(tsk->fibril);
> +		tsk->fibril = NULL;
> +	}
> +}

What happens to lingering fibrils? Better keep track of both runnable ones
and sleepers, and do proper cleanup.



> +asmlinkage long sys_asys_submit(struct asys_input __user *user_inp,
> +				unsigned long nr_inp)
> +{
> +	struct asys_input inp;
> +	struct asys_result *res;
> +	struct asys_call *call;
> +	struct thread_info *ti;
> +	unsigned long i;
> +	long err = 0;
> +
> +	/* Allocate a fibril for the submitter's thread_info */
> +	if (current->fibril == NULL) {
> +		current->fibril = kzalloc(sizeof(struct fibril), GFP_KERNEL);
> +		if (current->fibril == NULL)
> +			return -ENOMEM;
> +
> +		INIT_LIST_HEAD(&current->fibril->run_list);
> +		current->fibril->state = TASK_RUNNING;
> +		current->fibril->ti = current_thread_info();
> +	}

Why do we need the "special" submission fibril?




> +	for (i = 0; i < nr_inp; i++) {
> +
> +		if (copy_from_user(&inp, &user_inp[i], sizeof(inp))) {
> +			err = -EFAULT;
> +			break;
> +		}
> +
> +		res = kmalloc(sizeof(struct asys_result), GFP_KERNEL);
> +		if (res == NULL) {
> +			err = -ENOMEM;
> +			break;
> +		}
> +
> +		/* XXX kzalloc to init call.fibril.per_cpu, add helper */
> +		call = kzalloc(sizeof(struct asys_call), GFP_KERNEL);
> +		if (call == NULL) {
> +			kfree(res);
> +			err = -ENOMEM;
> +			break;
> +		}
> +
> +		ti = alloc_thread_info(tsk);
> +		if (ti == NULL) {
> +			kfree(res);
> +			kfree(call);
> +			err = -ENOMEM;
> +			break;
> +		}
> +
> +		err = asys_init_fibril(&call->fibril, ti, &inp);
> +		if (err) {
> +			kfree(res);
> +			kfree(call);
> +			free_thread_info(ti);
> +			break;
> +		}
> +
> +		res->comp.cookie = inp.cookie;
> +		call->result = res;
> +		ti->task = current;
> +
> +		sched_new_runnable_fibril(&call->fibril);
> +		schedule();
> +	}
> +
> +	return i ? i : err;
> +}

Streamline error path (kfree(NULL) is OK):

	err = -ENOMEM;
	a = alloc();
	b = alloc();
	c = alloc();
	if (!a || !b || !c)
		goto error;
	...
error:
	kfree(c);
	kfree(b);
	kfree(a);
	return err;
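
Applied to the allocation sequence quoted above, the loop body might
collapse to something like this (just a sketch; it assumes
free_thread_info() wants an explicit NULL check, unlike kfree()):

		res = kmalloc(sizeof(struct asys_result), GFP_KERNEL);
		call = kzalloc(sizeof(struct asys_call), GFP_KERNEL);
		ti = alloc_thread_info(tsk);
		if (!res || !call || !ti) {
			err = -ENOMEM;
			goto error;
		}

		err = asys_init_fibril(&call->fibril, ti, &inp);
		if (err)
			goto error;

		...
		continue;

error:
		if (ti)
			free_thread_info(ti);
		kfree(call);
		kfree(res);
		break;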


- Davide


2007-02-04 05:13:06

by Davide Libenzi

[permalink] [raw]
Subject: Re: [PATCH 0 of 4] Generic AIO by scheduling stacks

On Tue, 30 Jan 2007, Zach Brown wrote:

> This very rough patch series introduces a different way to provide AIO support
> for system calls.

Zab, great stuff!
I've found a little time to take a look at the patches and throw some
comments at you.
Keep in mind though, that the last time I seriously looked at this stuff,
sched.c was like 2K lines of code, and now it's 7K. Enough said ;)




> Right now to provide AIO support for a system call you have to express your
> interface in the iocb argument struct for sys_io_submit(), teach fs/aio.c to
> translate this into some call path in the kernel that passes in an iocb, and
> then update your code path implement with either completion-based (EIOCBQUEUED)
> or retry-based (EIOCBRETRY) AIO with the iocb.
>
> This patch series changes this by moving the complexity into generic code such
> that a system call handler would provide AIO support in exactly the same way
> that it supports a synchronous call. It does this by letting a task have
> multiple stacks executing system calls in the kernel. Stacks are switched in
> schedule() as they block and are made runnable.

As I said in another email, I think this is a *great* idea. It allows for
generic async execution, while leaving the core kernel AIO unaware. Of
course, to be usable, a lot of details need to be polished, and
performance evaluated.




> We start in sys_asys_submit(). It allocates a fibril for this executing
> submission syscall and hangs it off the task_struct. This lets this submission
> fibril be scheduled along with the later asys system calls themselves.

Why do you need this "special" fibril for the submission task?



> The specific switching mechanics of this implementation rely on the notion of
> tracking a stack as a full thread_info pointer. To make the switch we transfer
> the non-stack bits of the thread_info from the old fibril's ti to the new
> fibril's ti. We update the book keeping in the task_struct to
> consider the new thread_info as the current thread_info for the task. Like so:
>
> *next->ti = *ti;
> *thread_info_pt_regs(next->ti) = *thread_info_pt_regs(ti);
>
> current->thread_info = next->ti;
> current->thread.esp0 = (unsigned long)(thread_info_pt_regs(next->ti) + 1);
> current->fibril = next;
> current->state = next->state;
> current->per_call = next->per_call;
>
> Yeah, messy. I'm interested in aggressive feedback on how to do this sanely.
> Especially from the perspective of worrying about all the archs.

Maybe an arch-specific copy_thread_info()? Or, since there's a 1:1
relationship, just merging them.



> Did everyone catch that "per_call" thing there? That's to switch members of
> task_struct which are local to a specific call. link_count, journal_info, that
> sort of thing. More on that as we talk about the costs later.

Yes ;) Brutally copying the structure over does not look good IMO. Better
to keep a pointer and swap them. A clone_per_call() and free_per_call()
might be needed.




> Eventually the timer fires and the hrtimer code path wakes the fibril:
>
> - if (task)
> - wake_up_process(task);
> + if (wake_target)
> + wake_up_target(wake_target);
>
> We've doctored try_to_wake_up() to be able to tell if its argument is a
> task_struct or one of these fibril targets. In the fibril case it calls
> try_to_wake_up_fibril(). It notices that the target fibril does need to be
> woken and sets it TASK_RUNNING. It notices that it it's current in the task so
> it puts the fibril on the task's fibril run queue and wakes the task. There's
> grossness here. It needs the task to come through schedule() again so that it
> can find the new runnable fibril instead of continuing to execute its current
> fibril. To this end, wake-up marks the task's current ti TIF_NEED_RESCHED.

Fine IMO. Better keep scheduling code localized inside schedule().



> - With get AIO support for all syscalls. Every single one. (Well, please, no
> asys sys_exit() :)). Buffered IO, sendfile, recvmsg, poll, epoll, hardware
> crypto ioctls, open, mmap, getdents, the entire splice API, etc.

Eeek, ... poll, epoll :)
That might solve the async() <-> POSIX bridge the other way around: the
collector becomes the async() events fetcher, instead of the reverse.
Will work just fine ...



> - We wouldn't multiply testing and maintenance burden with separate AIO paths.
> No is_sync_kiocb() testing and divergence between returning or calling
> aio_complete(). No auditing to make sure that EIOCBRETRY only being returned
> after any significant references of current->. No worries about completion
> racing from the submission return path and some aio_complete() being called
> from another context. In this scheme if your sync syscall path isn't broken,
> your AIO path stands a great chance of working.

This is the *big win* of the whole thing IMO.



> - AIO syscalls which *don't* block see very little overhead. They'll allocate
> stacks and juggle the run queue locks a little, but they'll execute in turn on
> the submitting (cache-hot, presumably) processor. There's room to optimize
> this path, too, of course.

Stack allocation can be optimized/cached, as someone else already said.



> - The 800lb elephant in the room. It uses a full stack per blocked operation.
> I believe this is a reasonable price to pay for the flexibility of having *any*
> call pending. It rules out some loads which would want to keep *millions* of
> operations pending, but I humbly submit that a load rarely takes that number of
> concurrent ops to saturate a resource. (think of it this way: we've gotten
> this far by having to burn a full *task* to have *any* syscall pending.) While
> not optimal, it opens to door to a lot of functionality without having to
> rewrite the kernel as a giant non-blocking state machine.

This should not be a huge problem IMO. High latency operations like
network sockets can be handled with standard I/O events interfaces like
poll/epoll, so I do not expect to have a huge number of fibrils around.
The number of fibrils will be proportional to the number of active
connections, not to the total number of connections.



> It should be noted that my very first try was to copy the used part of stacks
> in to and out of one full allocated stack. This uses less memory per blocking
> operation at the cpu cost of copying the used regions. And it's a terrible
> idea, for at least two reasons. First, to actually get the memory overhead
> savings you allocate at stack switch time. If that allocation can't be
> satisfied you are in *trouble* because you might not be able to switch over to
> a fibril that is trying to free up memory. Deadlock city. Second, it means
> that *you can't reference on-stack data in the wake-up path*. This is a
> nightmare. Even our trivial sys_nanosleep() example would have had to take its
> hrtimer_sleeper off the stack and allocate it. Never mind, you know, basically
> every single user of <linux/wait.h>. My current thinking is that it's just
> not worth it.

Agreed. Most definitely not worth it, for the reasons above.



> - We would now have some measure of task_struct concurrency. Read that twice,
> it's scary. As two fibrils execute and block in turn they'll each be
> referencing current->. It means that we need to audit task_struct to make sure
> that paths can handle racing as its scheduled away. The current implementation
> *does not* let preemption trigger a fibril switch. So one only has to worry
> about racing with voluntary scheduling of the fibril paths. This can mean
> moving some task_struct members under an accessor that hides them in a struct
> in task_struct so they're switched along with the fibril. I think this is a
> manageable burden.

That seems the correct policy in any case.



> - Signals. I have no idea what behaviour we want. Help? My first guess is
> that we'll want signal state to be shared by fibrils by keeping it in the
> task_struct. If we want something like individual cancellation, we'll augment
> signal_pending() with some some per-fibril test which will cause it to return
> from TASK_INTERRUPTIBLE (the only reasonable way to implement generic
> cancellation, I'll argue) as it would have if a signal was pending.

Fibril should IMO use current thread signal policies. I think a signal
should hit (wake) any TASK_INTERRUPTIBLE fibril, if the current thread
policies mandate that. I'd keep a list_head of currently scheduled-out
TASK_INTERRUPTIBLE fibrils, and I'd make them runnable when a signal is
delivered to the thread (wake_target bit #1 set to mean wake-all-interruptible-fibrils?).
The other thing is signal_pending(). The sigpending flag test is not going
to work as is (cleared at the first do_signal). Setting a bit in each
fibril would mean walking the whole TASK_INTERRUPTIBLE fibril list. Maybe
a sequential signal counter in task_struct, matched by one in the fibril.
A signal would increment the task_struct counter, and a fibril
schedule-out would save the task_struct counter to the fibril. The
signal_pending() for a fibril is a compare of the two. Or something
similar.
In general, I think it'd make sense to have a fibril-based implementation
and a kthread-based one, and compare the messiness :) of the two with
respect to complexity and performance.
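
As a tiny sketch of that counter idea (the field names below are made up,
nothing like this exists in the posted patches):

/* task_struct grows:   unsigned long sig_seq;       bumped on each signal */
/* struct fibril grows: unsigned long sig_seq_seen;  saved at schedule-out */

static inline int fibril_signal_pending(struct task_struct *tsk,
					struct fibril *f)
{
	/* a signal arrived since this fibril was last scheduled out */
	return tsk->sig_seq != f->sig_seq_seen;
}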



- Davide

2007-02-04 20:00:09

by Davide Libenzi

[permalink] [raw]
Subject: Re: [PATCH 0 of 4] Generic AIO by scheduling stacks

On Sat, 3 Feb 2007, Davide Libenzi wrote:

> > - Signals. I have no idea what behaviour we want. Help? My first guess is
> > that we'll want signal state to be shared by fibrils by keeping it in the
> > task_struct. If we want something like individual cancellation, we'll augment
> > signal_pending() with some some per-fibril test which will cause it to return
> > from TASK_INTERRUPTIBLE (the only reasonable way to implement generic
> > cancellation, I'll argue) as it would have if a signal was pending.
>
> Fibril should IMO use current thread signal policies. I think a signal
> should hit (wake) any TASK_INTERRUPTIBLE fibril, if the current thread
> policies mandate that. I'd keep a list_head of currently scheduled-out
> TASK_INTERRUPTIBLE fibrils, and I'd make them runnable when a signal is
> delivered to the thread (wake_target bit #1 set to mean wake-all-interruptable-fibrils?).
> The other thing is signal_pending(). The sigpending flag test is not going
> to work as is (cleared at the first do_signal). Setting a bit in each
> fibril would mean walking the whole TASK_INTERRUPTIBLE fibril list. Maybe
> a sequential signal counter in task_struct, matched by one in the fibril.
> A signal would increment the task_struct counter, and a fibril
> schedule-out would save the task_struct counter to the fibril. The
> signal_pending() for a fibril is a compare of the two. Or something
> similar.

Another thing linked to signals that was not talked about is cancellation
of an in-flight request. We want to give the ability to cancel an
in-flight request, with something like async_cancel(cookie). In my
userspace library I simply disable SA_RESTART on SIGUSR2, and I do a
pthread_kill() on the thread servicing the request. But this will IMO have
other implications (linked to signal delivery) in a kernel fibril-based
implementation, something to think about.



- Davide


2007-02-05 16:45:19

by Zach Brown

[permalink] [raw]
Subject: Re: [PATCH 2 of 4] Introduce i386 fibril scheduling

> Other questions really relate to the scheduling - Zach do you intend
> schedule_fibrils() to be a call code would make or just from
> schedule() ?

I'd much rather keep the current sleeping API in as much as is
possible. So, yeah, if we can get schedule() to notice and behave
accordingly I'd prefer that. In the current code it's keyed off
finding a stack allocation hanging off of current->. If the caller
didn't care about guaranteeing non-blocking submission then we
wouldn't need that.. we could use a thread_info flag bit, or
something. Avoiding that allocation in the cached case would be nice.

> Alan (who used to use Co-routines in real languages on 36bit
> computers with 9bit bytes before learning C)

Yes, don't despair, I'm not co-routine ignorant. In fact, I'm almost
positive it was you who introduced them to me at some point in the
previous millennium ;).

- z

2007-02-05 17:02:35

by Zach Brown

[permalink] [raw]
Subject: Re: [PATCH 2 of 4] Introduce i386 fibril scheduling

> ok, i think i noticed another misunderstanding. The kernel thread based
> scheme i'm suggesting would /not/ 'switch' to another kernel thread in
> the cached case, by default. It would just execute in the original
> context (as if it were a synchronous syscall), and the switch to a
> kernel thread from the pool would only occur /if/ the context is about
> to block. (this 'switch' thing would be done by the scheduler)

Yeah, this is what I imagined when you described doing this with
threads instead of these 'fibril' things.

It sounds like you're suggesting that we keep the 1:1 relationship
between task_struct and thread_info. That would avoid the risks that
the current fibril approach brings. It insists that all of
task_struct is shared between concurrent fibrils (even if only
between blocking points). As I understand what Ingo is suggesting,
we'd instead only explicitly share the fields that we migrate (copy
or get a reference) as we move the stack from the submitting
task_struct to a waiting task_struct as the submission blocks.

We trade initial effort to make things safe in the presence of
universal sharing for effort to introduce sharing as people notice
deficient behaviour. If that's the way we prefer to go, I'm cool
with that. I might have gone slightly nuts in preferring *identical*
sync and async behaviour.

The fast path would look almost identical to the existing fibril
switch. We'd just have a few more fields to sync up between the two
task_structs.

Ingo, am I getting this right? This sounds pretty straightforward
to prototype from the current patches. I can certainly give it a try.

> it's quite cheap to 'flip' it to under any arbitrary user-space
> context:
> change its thread_info->task pointer to the user-space context's task
> struct, copy the mm pointer, the fs pointer to the "worker thread",
> switch the thread_info, update ptregs - done. Hm?

Or maybe you're talking about having concurrently executing
thread_info's pointing to the user-space submitting task_struct?
That really does sound like the current fibril approach, with even
more sharing of thread_info's that might be executing on other cpus?

Either way, I want to give it a try. If we can measure it performing
reasonably in the cached case then I think everyone's happy?

> is not part of the signal set. (Although it might make sense to make
> such async syscalls interruptible, just like any syscall.)

I think we all agree that they have to be interruptible by now,
right? If for no other reason than to interrupt pending poll with no
timeout, say, as the task exits..

> The 'pool' of kernel threads doesn't even have to be per-task, it can be
> a natural per-CPU thing

Yeah, absolutely.

- z

2007-02-05 17:13:30

by Zach Brown

[permalink] [raw]
Subject: Re: [PATCH 2 of 4] Introduce i386 fibril scheduling

> Since I still think that the many-thousands potential async operations
> coming from network sockets are better handled with a classical event
> mechanism [1], and since smooth integration of the new async syscall into
> the standard POSIX infrastructure is IMO a huge win, I think we need to
> have a "bridge" to allow async completions being detectable through a
> pollable (by the mean of select/poll/epoll whatever) device.

Ugh, I'd rather not if we don't have to.

It seems like you could get this behaviour from issuing a poll/select
(really?)/epoll as one of the async calls to complete. (And you
mention this in a later email? :))

Part of my thinking on this is that we might want it to be really
trivial to create and wait on groups of ops.. maybe as a context.
One of the things posix AIO wants is the notion of submitting and
waiting on a group of ops as a "list". That sounds like we might be
able to implement it by issuing ops against a context, created as
part of the submission, and then waiting for it to drain.

Being able to wait on that with file->poll() obviously requires
juggling file-> associations which sounds like more weight than we
might want. Or it'd be optional and we'd get more moving parts and
divergent paths to test.

So, sure, it's possible and not terribly difficult, but I'd rather
avoid it if people can be convinced to get the same behaviour by
issuing an async instance of their favourite readiness syscall.

- z

2007-02-05 17:44:43

by Zach Brown

[permalink] [raw]
Subject: Re: [PATCH 2 of 4] Introduce i386 fibril scheduling

> But really, being a scheduler guy i was much more concerned about the
> duplication and problems caused by the fibril concept itself - which
> duplication and complexity makes up 80% of Zach's submitted patchset.
> For example this bit:
>
> [PATCH 3 of 4] Teach paths to wake a specific void * target
>
> would totally go away if we used kernel threads for this.

Uh, would it? Are you talking about handing off the *task_struct*
that it was submitted under to each worker thread that inherits the
stack?

I guess I hadn't considered going that far. I had somehow
constructed a block in my mind that we couldn't release the
task_struct from the submitting task. But maybe we can be clever
enough with the task_struct updating that userspace wouldn't notice a
significant change.

Hmm.

> i totally agree that the API /should/ be the main focus - but i didn't
> pick the topic and most of the patchset's current size is due to the IMO
> avoidable fibril concept.

I, too, totally agree. I didn't even approach the subject for
exactly the reason you allude to -- I wanted to get the hard parts of
the kernel side right first.

> regarding the API, i don't really agree with the current form and design
> of Zach's interface.

Haha, well, yes, of course. You couldn't have thought that the
dirt-stupid sys_asys_wait_for_completion() was anything more than simple
scaffolding to test the kernel bits.

> fundamentally, the basic entity of this thing should be a /system call/,
> not the artificial fibril thing:
>
> +struct asys_call {
> +	struct asys_result *result;
> +	struct fibril fibril;
> +};

You picked a weird struct to highlight here. struct asys_input seems
more related to the stuff you go on to discuss below. This asys_call
struct is a relatively irrelevant internal detail of how
asys_teardown_stack() gets from a fibril to the pre-allocated
completion state once the call has returned.

> The normal and most optimal workflow should be a user-space ring-buffer
> of these constant-size struct async_syscall entries:
>
> struct async_syscall ringbuffer[1024];
>
> LIST_HEAD(submitted);
> LIST_HEAD(pending);
> LIST_HEAD(completed);

I strongly disagree here, and I'm hoping you're not as keen on this
now -- your reply to Matt gives me hope.

As mentioned, that they complete out-of-order leads, at least, to
having separate submission and completion rings. I'm not sure a
submission ring makes any sense given the goal of processing the
calls in submission and only creating threads if it blocks. A simple
copy of an array of these input structs sounds fine to me.

When I think about the completion side I tend to hope we can end up
with something like what VJ talked about in his net channels work.
producer/consumer rings with head and tail pointers in different
cache lines. AFAIK the kevent work has headed in that direction, but
I haven't kept up. Uli has certainly mentioned it in his 'ec' (event
channels) proposals.

The posix AIO list completion and, sadly, signals on completion need
to be considered, too.

Honestly, though, I'm not worried about this part. We'll easily come
to an agreement. I'm just not going to distract myself with it until
we're happy with the scheduler side.

> Looks like we can hit many birds with this single stone: AIO, vectored
> syscalls, finegrained system-call parallelism. Hm?

Hmm, indeed. Some flags could let userspace tell the kernel not to
bother with all this threading/concurrency/aio nonsense and just
issue them serially. It'll sound nuts in these days of cheap
syscalls and vsyscall helpers, but some Oracle folks might love this
for issuing a gettimeofday() pair around syscalls they want to profile.

I hadn't considered that as a potential property of this interface.

- z

2007-02-05 17:54:39

by Zach Brown

[permalink] [raw]
Subject: Re: [PATCH 2 of 4] Introduce i386 fibril scheduling

>
>
>> + current->per_call = next->per_call;
>
> Pointer instead of structure copy?

Sure, there are lots of trade-offs there, but the story changes if we
keep the 1:1 relationship between task_struct and thread_info.

- z

2007-02-05 18:33:57

by Davide Libenzi

[permalink] [raw]
Subject: Re: [PATCH 2 of 4] Introduce i386 fibril scheduling

On Mon, 5 Feb 2007, Zach Brown wrote:

> > Since I still think that the many-thousands potential async operations
> > coming from network sockets are better handled with a classical event
> > mechanism [1], and since smooth integration of the new async syscall into the
> > standard POSIX infrastructure is IMO a huge win, I think we need to have a
> > "bridge" to allow async completions being detectable through a pollable
> > (by the mean of select/poll/epoll whatever) device.
>
> Ugh, I'd rather not if we don't have to.
>
> It seems like you could get this behaviour from issuing a
> poll/select(really?)/epoll as one of the async calls to complete. (And you
> mention this in a later email? :))

Yes, no need for the above. We can just host a poll/epoll in an async()
operation, and demultiplex once that gets ready.



- Davide


2007-02-05 18:52:39

by Davide Libenzi

[permalink] [raw]
Subject: Re: [PATCH 2 of 4] Introduce i386 fibril scheduling

On Mon, 5 Feb 2007, Zach Brown wrote:

> > The 'pool' of kernel threads doesnt even have to be per-task, it can be
> > a natural per-CPU thing
>
> Yeah, absolutely.

Hmmm, so we issue an async sys_read(); what will get_file(fd) return for
a per-CPU kthread executing such a syscall? Unless we teach context_switch()
to do an inherit-trick for "files" (and even then, it won't work if
we switch from another context). And is that all that's needed?
IMO it's got to be either a per-process thread pool or a fibril approach.
Or we need some sort of enter_context()/leave_context() (adopt mm, files,
...) to have a per-CPU kthread to be able to execute the syscall from the
async() caller context. Hmmm?



- Davide


2007-02-05 19:21:27

by Zach Brown

[permalink] [raw]
Subject: Re: [PATCH 2 of 4] Introduce i386 fibril scheduling

> Or we need some sort of enter_context()/leave_context() (adopt mm, files,
> ...) to have a per-CPU kthread to be able to execute the syscall from the
> async() caller context.

I believe that's what Ingo is hoping for, yes.

- z

2007-02-05 19:26:28

by Davide Libenzi

[permalink] [raw]
Subject: Re: [PATCH 2 of 4] Introduce i386 fibril scheduling

On Mon, 5 Feb 2007, Zach Brown wrote:

> > The normal and most optimal workflow should be a user-space ring-buffer
> > of these constant-size struct async_syscall entries:
> >
> > struct async_syscall ringbuffer[1024];
> >
> > LIST_HEAD(submitted);
> > LIST_HEAD(pending);
> > LIST_HEAD(completed);
>
> I strongly disagree here, and I'm hoping you're not as keen on this now --
> your reply to Matt gives me hope.
>
> As mentioned, that they complete out-of-order leads, at least, to having
> separate submission and completion rings. I'm not sure a submission ring
> makes any sense given the goal of processing the calls in submission and only
> creating threads if it blocks. A simple copy of an array of these input
> structs sounds fine to me.

The "result" of one async operation is basically a cookie and a result
code. Eight or sixteen bytes at most. IMO, before going wacko designing
complex shared userspace-kernel result buffers, I think it'd be better to
measure the worth of the thing ;)



- Davide


2007-02-05 19:38:06

by Davide Libenzi

[permalink] [raw]
Subject: Re: [PATCH 2 of 4] Introduce i386 fibril scheduling

On Mon, 5 Feb 2007, Zach Brown wrote:

> > Or we need some sort of enter_context()/leave_context() (adopt mm, files,
> > ...) to have a per-CPU kthread to be able to execute the syscall from the
> > async() caller context.
>
> I believe that's what Ingo is hoping for, yes.

Ok, but then we should ask ourselves if it's really worth having a
per-CPU pool (that will require quite a few changes to the current way
of doing things), or a per-process pool (that would basically work as is).
What advantage does a per-CPU pool give us?
Setup cost? Not really IMO. Thread creation is pretty cheap, and a typical
process using async will have a pretty huge lifespan (compared to the pool
creation cost).
Configurability scores for a per-process pool, because it may allow each
process (eventually) to size its own.
What's the real point in favour of a per-CPU pool that justifies all the
changes that will have to be done in order to adopt such a concept?



- Davide


2007-02-05 19:42:10

by Zach Brown

[permalink] [raw]
Subject: Re: [PATCH 2 of 4] Introduce i386 fibril scheduling

> The "result" of one async operation is basically a cookie and a result
> code. Eight or sixteen bytes at most.

s/basically/minimally/

Well, yeah. The patches I sent had:

struct asys_completion {
	long return_code;
	unsigned long cookie;
};

That's as stupid as it gets.

> IMO, before going wacko designing
> complex shared userspace-kernel result buffers, I think it'd be better
> measuring the worth-value of the thing ;)

Obviously, yes.

The potential win is to be able to have one place to wait for
collection from multiple sources. Some of them might want more data
per event. They can always indirect out via a cookie pointer, sure,
but at insanely high message rates (10gige small messages) one might
not want that.

See also: the kevent thread.

- z

2007-02-05 20:10:07

by Davide Libenzi

[permalink] [raw]
Subject: Re: [PATCH 2 of 4] Introduce i386 fibril scheduling

On Mon, 5 Feb 2007, Zach Brown wrote:

> > The "result" of one async operation is basically a cookie and a result
> > code. Eight or sixteen bytes at most.
>
> s/basically/minimally/
>
> Well, yeah. The patches I sent had:
>
> struct asys_completion {
> 	long return_code;
> 	unsigned long cookie;
> };
>
> That's as stupid as it gets.

No, that's *really* it ;)
The cookie you pass, and the return code of the syscall.
Is there other data transferred? Sure, but that data is transferred during
the syscall processing, and handled by the syscall (filling up a sys_read
buffer just for example).




> > IMO, before going wacko designing
> > complex shared userspace-kernel result buffers, I think it'd be better
> > measuring the worth-value of the thing ;)
>
> Obviously, yes.
>
> The potential win is to be able to have one place to wait for collection from
> multiple sources. Some of them might want more data per event. They can
> always indirect out via a cookie pointer, sure, but at insanely high message
> rates (10gige small messages) one might not want that.

Did I miss something? The async() syscall will allow (with few
restrictions) executing whatever syscall in an async fashion. A syscall
returns a result code (long). Plus, you need to pass back the
userspace-provided cookie of course. A cookie is very likely a direct
pointer to the userspace session the async syscall applies to, so a
"(my_session *) results[i].cookie" will bring you directly on topic.
Collection of multiple sources? What do you mean? What's wrong with:

int async_wait(struct asys_completion *results, int nresults);

Is saving an 8/16 byte double copy worth going wacko in designing shared
userspace/kernel buffers, when the syscall that lies behind an
asys_completion is prolly touching KBs of RAM during its execution?
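
To make that concrete, the collection loop implied here might look like
the sketch below, assuming the async_wait() prototype above and an
application-defined struct my_session handed in as the cookie at
submission time (handle_result() is the application's own code):

struct asys_completion results[64];
int i, n;

n = async_wait(results, 64);
for (i = 0; i < n; i++) {
	struct my_session *s = (struct my_session *)results[i].cookie;

	handle_result(s, results[i].return_code);
}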




- Davide


2007-02-05 20:22:07

by Zach Brown

[permalink] [raw]
Subject: Re: [PATCH 2 of 4] Introduce i386 fibril scheduling

> No, that's *really* it ;)

For syscalls, sure.

The kevent work incorporates Uli's desire to have more data per
event. Have you read his OLS stuff? It's been a while since I did
so I've lost the details of why he cares to have more.

Let me say it again, maybe a little louder this time: I'm not
interested in worrying about this aspect of the API until the
scheduler mechanics are more solidified.

- z

2007-02-05 20:39:44

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH 2 of 4] Introduce i386 fibril scheduling



On Mon, 5 Feb 2007, Davide Libenzi wrote:
>
> No, that's *really* it ;)
>
> The cookie you pass, and the return code of the syscall.
> Is there other data transferred? Sure, but that data is transferred during
> the syscall processing, and handled by the syscall (filling up a sys_read
> buffer just for example).

Indeed. One word is *exactly* what a normal system call returns too.

That said, normally we have a user-space library layer to turn that into
the "errno + return value" thing, and in the case of async() calls we
very basically wouldn't have that. So either:

- we'd need to do it in the kernel (which is actually nasty, since
different system calls have slightly different semantics - some don't
return any error value at all, and negative numbers are real numbers)

- we'd have to teach user space about the "negative errno" mechanism, in
which case one word really is always enough.

Quite frankly, I much prefer the second alternative. The "negative errno"
thing has not only worked really really well inside the kernel, it's so
obviously 100% superior to the standard UNIX "-1 + errno" approach that
it's not even funny.

To see why "negative errno" is better, just look at any threaded program,
or look at any program that does multiple calls and needs to save the
errno not from the last one, but from some earlier one (eg, it does a
"close()" in between returning *its* error, and the real operation that
we care about).
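
A tiny userspace-side illustration (purely for the sake of example) of
why one self-contained word is so convenient - the whole error state
travels with the value, and nothing global like errno has to be saved
across intervening calls or shared between threads:

#include <stdio.h>
#include <string.h>

/* 'ret' is a kernel-style result word, e.g. as handed back by an async
 * completion: >= 0 is the real return value, < 0 is -errno */
static void report(long ret)
{
	if (ret < 0)
		fprintf(stderr, "call failed: %s\n", strerror(-ret));
	else
		printf("call returned %ld\n", ret);
}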

> Did I miss something? The async() syscall will allow (with few
> restrictions) to execute whatever syscall in an async fashion. An syscall
> returns a result code (long). Plus, you need to pass back the
> userspace-provided cookie of course.

HOWEVER, they get returned differently. The cookie gets returned
immediately, the system call result gets returned in-memory only after the
async thing has actually completed.

I would actually argue that it's not the kernel that should generate any
cookie, but that user-space should *pass*in* the cookie it wants to, and
the kernel should consider it a pointer to a 64-bit entity which is the
return code.

In other words, the only "cookie" we need is literally the pointer to the
results. And that's obviously something that the user space has to set up
anyway.

So how about making the async interface be:

// returns negative for error
// zero for "synchronous"
// positive kernel "wait for me" cookie for success
long sys_async_submit(
	unsigned long flags,
	long *user_result_ptr,
	long syscall,
	unsigned long *args);

and the "args" thing would literally just fill up the registers.

The real downside here is that it's very architecture-specific this way,
and that means that x86-64 (and other 64-bit ones) would need to have
emulation layers for the 32-bit ones, but they likely need to do that
*anyway*, so it's probably not a huge downside. The alternative is to:

- make a new architecture-independent system call enumeration for the
async interface

- make everything use 64-bit values.

Now, making an architecture-independent system call enumeration may
actually make sense regardless, because it would allow sys_async() to have
its own system call table and put the limitations and rules for those
system calls there, instead of depending on the per-architecture system
call table that tends to have some really architecture-specific details.

Hmm?

Linus

2007-02-05 20:42:47

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH 2 of 4] Introduce i386 fibril scheduling



On Mon, 5 Feb 2007, Zach Brown wrote:
>
> For syscalls, sure.
>
> The kevent work incorporates Uli's desire to have more data per event. Have
> you read his OLS stuff? It's been a while since I did so I've lost the
> details of why he cares to have more.

You'd still do that as _arguments_ to the system call, not as the return
value.

Also, quite frankly, I tend to find Uli over-designs things. The whole
disease of "make things general" is a CS disease that some people take to
the extreme.

The good thing about generic code is not that it solves some generic
problem. The good thing about generics is that they mean that you can
_avoid_ solving other problems AND HAVE LESS CODE. But some people seem to
think that "generic" means that you have to have tons of code to handle
all the possible cases, and that *completely* misses the point.

We want less code. The whole (and really, the _only_) point of the
fibrils, at least as far as I'm concerned, is to *not* have special code
for aio_read/write/whatever.

Linus

2007-02-05 21:09:21

by Davide Libenzi

[permalink] [raw]
Subject: Re: [PATCH 2 of 4] Introduce i386 fibril scheduling

On Mon, 5 Feb 2007, Linus Torvalds wrote:

> Indeed. One word is *exactly* what a normal system call returns too.
>
> That said, normally we have a user-space library layer to turn that into
> the "errno + return value" thing, and in the case of async() calls we
> very basically wouldn't have that. So either:
>
> - we'd need to do it in the kernel (which is actually nasty, since
> different system calls have slightly different semantics - some don't
> return any error value at all, and negative numbers are real numbers)
>
> - we'd have to teach user space about the "negative errno" mechanism, in
> which case one word really is alwats enough.
>
> Quite frankly, I much prefer the second alternative. The "negative errno"
> thing has not only worked really really well inside the kernel, it's so
> obviously 100% superior to the standard UNIX "-1 + errno" approach that
> it's not even funny.

Currently it's in the syscall wrapper. Couldn't we have it in the
asys_teardown_stack() stub?



> HOWEVER, they get returned differently. The cookie gets returned
> immediately, the system call result gets returned in-memory only after the
> async thing has actually completed.
>
> I would actually argue that it's not the kernel that should generate any
> cookie, but that user-space should *pass*in* the cookie it wants to, and
> the kernel should consider it a pointer to a 64-bit entity which is the
> return code.

Yes. Let's have userspace "mark" the async operation. IMO the
cookie should be something transparent to the kernel.
Like you said though, that'd require compat code (unless we fix the size).



- Davide


2007-02-05 21:22:27

by Zach Brown

[permalink] [raw]
Subject: Re: [PATCH 2 of 4] Introduce i386 fibril scheduling

> - we'd need to do it in the kernel (which is actually nasty, since
> different system calls have slightly different semantics - some don't
> return any error value at all, and negative numbers are real numbers)
>
> - we'd have to teach user space about the "negative errno" mechanism, in
> which case one word really is always enough.
>
> Quite frankly, I much prefer the second alternative. The "negative errno"
> thing has not only worked really really well inside the kernel, it's so
> obviously 100% superior to the standard UNIX "-1 + errno" approach that
> it's not even funny.

I agree, and I imagine you'd have a hard time finding someone who
actually *likes* the errno convention :)

> I would actually argue that it's not the kernel that should generate any
> cookie, but that user-space should *pass*in* the cookie it wants to, and
> the kernel should consider it a pointer to a 64-bit entity which is the
> return code.

Yup. That's how the current code (and epoll, and fs/aio.c, and..) work.

Cancelation comes into this discussion, I think. Hopefully it's
reasonable to expect userspace to be able to manage cookies well
enough that they can use them to issue cancels and only hit the ops
they intend to. It means we have to give them the tools to
differentiate between a racing completion and cancelation so they can
reuse a cookie at the right time, but that doesn't sound fatal.

> - make everything use 64-bit values.

This would be my preference.

> Now, making an architecture-independent system call enumeration may
> actually make sense regardless, because it would allow sys_async() to have
> its own system call table and put the limitations and rules for those
> system calls there, instead of depending on the per-architecture system
> call table that tends to have some really architecture-specific details.

Maybe, sure. I don't have a lot of insight into this. Hopefully
some arch maintainers can jump in?

- z

2007-02-05 21:31:39

by Kent Overstreet

[permalink] [raw]
Subject: Re: [PATCH 2 of 4] Introduce i386 fibril scheduling

> > HOWEVER, they get returned differently. The cookie gets returned
> > immediately, the system call result gets returned in-memory only after the
> > async thing has actually completed.
> >
> > I would actually argue that it's not the kernel that should generate any
> > cookie, but that user-space should *pass*in* the cookie it wants to, and
> > the kernel should consider it a pointer to a 64-bit entity which is the
> > return code.
>
> Yes. Let's have the userspace to "mark" the async operation. IMO the
> cookie should be something transparent to the kernel.
> Like you said though, that'd require compat-code (unless we fix the size).

You don't need an explicit cookie if you're passing in a pointer to
the return code; it doesn't really save you anything to do so. Say
you've got a bunch of user threads (with or without stacks, it doesn't
matter).

struct asys_ret {
	int ret;
	struct thread *p;
};

struct asys_ret r;
r.p = me;

async_read(fd, buf, nbytes, &r);

Then you just have your async_getevents return the same pointers you
passed in, and your main event loop gets pointers to its threads for
free.

It seems cleaner to do it this way vs. returning structs with the
actual return code and a cookie, as threads get the return code
exactly where they want it.

Keep in mind that the epoll way (while great for epoll, I do love it)
makes sense because it doesn't have to deal with any sort of return
codes.

My only other point is that you really do want a bulk asys_submit
instead of doing a syscall per async syscall; one of the great wins of
this approach is that heavily IO-driven apps can batch up syscalls.

2007-02-05 21:36:24

by bert hubert

[permalink] [raw]
Subject: Re: [PATCH 2 of 4] Introduce i386 fibril scheduling

On Fri, Feb 02, 2007 at 03:37:09PM -0800, Davide Libenzi wrote:

> Since I still think that the many-thousands potential async operations
> coming from network sockets are better handled with a classical event
> mechanism [1], and since smooth integration of the new async syscall into the
> standard POSIX infrastructure is IMO a huge win, I think we need to have a
> "bridge" to allow async completions being detectable through a pollable
> (by the mean of select/poll/epoll whatever) device.

> [1] Unless you really want to have thousands of kthreads/fibrils lingering
> on the system.

From my end as an application developer, yes please. Either make it
perfectly ok to have thousands of outstanding asynchronous system calls (I
work with thousands of separate sockets), or allow me to select/poll/epoll
on the "async fd".

Alternatively, something like SIGIO ('SIGASYS'?) might be considered, but,
well, the fd might be easier.

In fact, perhaps the communication channel might simply *be* an fd. Queueing
up syscalls sounds remarkably like sending datagrams.

Bert

--
http://www.PowerDNS.com Open source, database driven DNS Software
http://netherlabs.nl Open and Closed source services

2007-02-05 21:44:44

by David Miller

[permalink] [raw]
Subject: Re: [PATCH 2 of 4] Introduce i386 fibril scheduling

From: Davide Libenzi <[email protected]>
Date: Mon, 5 Feb 2007 10:24:34 -0800 (PST)

> Yes, no need for the above. We can just host a poll/epoll in an async()
> operation, and demultiplex once that gets ready.

I can hear Evgeniy crying 8,000 miles away.

I strongly encourage a lot of folks commenting in this thread to
familiarize themselves with kevent and how it handles this stuff. I
see a lot of suggestions for things he has totally implemented and
solved already in kevent.

I'm not talking about Zach's fibril's, I'm talking about the interface
aspects of these discussions.

2007-02-05 21:57:47

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH 2 of 4] Introduce i386 fibril scheduling



On Mon, 5 Feb 2007, bert hubert wrote:
>
> From my end as an application developer, yes please. Either make it
> perfectly ok to have thousands of outstanding asynchronous system calls (I
> work with thousands of separate sockets), or allow me to select/poll/epoll
> on the "async fd".

No can do.

Allocating an fd is actually too expensive, exactly because a lot of these
operations are supposed to be a few hundred ns, and taking locks is simply
a bad idea.

But if you want to, we could have a *separate* "convert async cookie to
fd" so that you can poll for it, or something.

I doubt very many people want to do that. It would tend to simply be nicer
to do

async(poll);
async(waitpid);
async(.. wait for anything else ..)

followed by a

wait_for_async();

That's just a much NICER approach, I would argue. And it automatically
and very naturally solves the "wait for different kinds of events"
question, in a way that "poll()" never did (except by turning all events
into file descriptors or signals).

> Alternatively, something like SIGIO ('SIGASYS'?) might be considered, but,
> well, the fd might be easier.

Again. NO WAY. Signals are just damn expensive. At most, it would be an
option again, but if you want high performance, signals simply aren't very
good. They are also a nice way to make your user-space code very racy.

> In fact, perhaps the communication channel might simply *be* an fd. Queueing
> up syscalls sounds remarkably like sending datagrams.

I'm the first to say that file descriptors is the UNIX way, but so are
processes, and I think this is MUCH better done as a "process" interface.
In other words, instead of doing it as a filedescriptor, do it as a
"micro-fork/exec", and have the "wait()" equivalent. It's just that we
don't fork a "real process", and we don't exec a "real program", we just
exec a single system call.

If you think of it in those terms, it all makes sense *without* any file
descriptors what-so-ever, and the "wait_for_async()" interface also makes
a ton of sense (it really *is* "waitpid()" for the system call).
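
A sketch of that usage pattern (the exact async()/wait_for_async()
signatures, and the variables in the calls, are assumptions for
illustration, not a proposed ABI):

/* Fire off several unrelated operations as "micro-forked" syscalls
 * (arguments as in the ordinary poll()/waitpid() calls)... */
long poll_op = async(__NR_poll, fds, nfds, timeout);
long wait_op = async(__NR_waitpid, pid, &status, 0);

/* ...then reap whichever finishes first, waitpid()-style. */
long result;
long done = wait_for_async(&result);    /* assumed: returns a cookie */
if (done == poll_op)
        handle_ready_fds(fds, result);
else if (done == wait_op)
        handle_child_exit(status);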

Linus

2007-02-05 22:07:47

by bert hubert

[permalink] [raw]
Subject: Re: [PATCH 2 of 4] Introduce i386 fibril scheduling

On Mon, Feb 05, 2007 at 01:57:15PM -0800, Linus Torvalds wrote:

> I doubt very many people want to do that. It would tend to simply be nicer
> to do
>
> async(poll);

Yeah - I saw that technique being mentioned later on in the thread, and it
would work, I think.

To make up for the waste of time, some other news. I asked Matt Dillon of
DragonflyBSD why he removed asynchronous system calls from his OS, and he
told me it was because of the problems he had implementing them in the
kernel:

There were two basic problems: First, it added a lot of overhead when
most system calls are either non-blocking anyway (like getpid()).
Second and more importantly it was very, very difficult to recode the
system calls that COULD block to actually be asynchronous in the kernel.
I spent some time recoding nanosleep() to operate asynchronously and it
was a huge mess.

Aside from that, they did not discover any skeletons hidden in the closet,
although from mailing list traffic, I gather the asynchronous system calls
didn't see a lot of use. If I understand it correctly, for a number of years
they emulated asynchronous system calls using threads.

We'd be sidestepping the need to update all syscalls via 'fibrils' of
course.

> If you think of it in those terms, it all makes sense *without* any file
> descriptors what-so-ever, and the "wait_for_async()" interface also makes
> a ton of sense (it really *is* "waitpid()" for the system call).

It has me excited in any case. Once anything even remotely testable appears
(Zach tells me not to try the current code), I'll work it into MTasker
(http://ds9a.nl/mtasker) and make it power a nameserver that does async i/o,
for use with very very large zones that aren't preloaded.

Bert

--
http://www.PowerDNS.com Open source, database driven DNS Software
http://netherlabs.nl Open and Closed source services

2007-02-05 22:15:40

by Zach Brown

[permalink] [raw]
Subject: Re: [PATCH 2 of 4] Introduce i386 fibril scheduling

> It has me excited in any case. Once anything even remotely testable appears
> (Zach tells me not to try the current code), I'll work it into MTasker
> (http://ds9a.nl/mtasker) and make it power a nameserver that does async i/o,
> for use with very very large zones that aren't preloaded.

I'll be sure to let you know :)

- z

2007-02-05 22:35:12

by Davide Libenzi

[permalink] [raw]
Subject: Re: [PATCH 2 of 4] Introduce i386 fibril scheduling

On Mon, 5 Feb 2007, Linus Torvalds wrote:

> On Mon, 5 Feb 2007, bert hubert wrote:
> >
> > From my end as an application developer, yes please. Either make it
> > perfectly ok to have thousands of outstanding asynchronous system calls (I
> > work with thousands of separate sockets), or allow me to select/poll/epoll
> > on the "async fd".
>
> No can do.
>
> Allocating an fd is actually too expensive, exactly because a lot of these
> operations are supposed to be a few hundred ns, and taking locks is simply
> a bad idea.
>
> But if you want to, we could have a *separate* "convert async cookie to
> fd" so that you can poll for it, or something.
>
> I doubt very many people want to do that. It would tend to simply be nicer
> to do
>
> async(poll);
> async(waitpid);
> async(.. wait for anything else ..)
>
> followed by a
>
> wait_for_async();
>
> That's just a much NICER approach, I would argue. And it automatically
> and very naturally solves the "wait for different kinds of events"
> question, in a way that "poll()" never did (except by turning all events
> into file descriptors or signals).

Bert, that was the first suggestion I gave to Zab. But then I realized
that a multiplexed poll/epoll can be "hosted" in an async op, just like
Linus showed above. Will work just fine IMO.




- Davide


2007-02-06 00:15:35

by Davide Libenzi

[permalink] [raw]
Subject: Re: [PATCH 2 of 4] Introduce i386 fibril scheduling

On Mon, 5 Feb 2007, David Miller wrote:

> From: Davide Libenzi <[email protected]>
> Date: Mon, 5 Feb 2007 10:24:34 -0800 (PST)
>
> > Yes, no need for the above. We can just host a poll/epoll in an async()
> > operation, and demultiplex once that gets ready.
>
> I can hear Evgeniy crying 8,000 miles away.
>
> I strongly encourage a lot of folks commenting in this thread to
> familiarize themselves with kevent and how it handles this stuff. I
> see a lot of suggestions for things he has totally implemented and
> solved already in kevent.

David, I'm sorry but I only briefly looked at the work Evgeniy did on
kevent. So excuse me if I say something broken in the next few sentences.
Zab's async syscall interface is a pretty simple one. It accepts the
syscall number, the parameters for the syscall, and a cookie. It returns a
syscall result code, and your cookie (that's the meat of it, at least).
IMO its interface should be optimized for what it does.
Could this submission/retrieval be folded into a "generic"
submission/retrieval API? Sure you can. But then you end up having
submission/event structures with 17 members, only 3 of which are valid at
any given time. The API becomes more difficult to use IMO, because suddenly
you have to know which fields are valid for each event you're
submitting/fetching.
IMHO, genericity can be built in userspace, *if* one really wants it and,
of course, provided that the OS gives you the tools to build it.
The problem before was that it was hard to bridge something like poll/epoll
with other "by nature" sync operations. Evgeniy's kevent is one attempt,
Zab's async is another one. IMO the async syscall is a *very powerful* one,
since it allows for *total coverage* async support w/out "plugs" all over
the kernel paths.
But as Zab said, the kernel implementation is more important ATM.




- Davide

2007-02-06 00:28:22

by Scot McKinley

[permalink] [raw]
Subject: Re: [PATCH 2 of 4] Introduce i386 fibril scheduling


As Joel mentioned earlier, from an Oracle perspective, one of the key
things we are looking for is a nice clean *common* wait point. We don't
really care whether this common wait point is the old libaio:async-poll,
epoll, or "wait_for_async". And if "wait_for_async" has the added
benefit of scaling, all the better.

However, it is desirable for that common wait-routine to have the
ability to return explicit completions, instead of requiring a follow-on
call to some other query/wait for events/completions for each of the
different types of async submission done (poll, pid, i/o, ...).
Obviously not a "must-have", but desirable.

It is also desirable (if possible) to have immediate completions (either
immediate errs or async submissions that complete synchronously)
communicated at submission time, instead of via the common wait-routine.

Finally, it is agreed that neg-errno is a much better approach for the
return code. The threading/concurrency issues associated w/ the current
unix errno have always been a buggy area for Oracle Networking code.

Regards, -Scot

Linus Torvalds wrote:

>On Mon, 5 Feb 2007, bert hubert wrote:
>
>
>>From my end as an application developer, yes please. Either make it
>>perfectly ok to have thousands of outstanding asynchronous system calls (I
>>work with thousands of separate sockets), or allow me to select/poll/epoll
>>on the "async fd".
>>
>>
>
>No can do.
>
>Allocating an fd is actually too expensive, exactly because a lot of these
>operations are supposed to be a few hundred ns, and taking locks is simply
>a bad idea.
>
>But if you want to, we could have a *separate* "convert async cookie to
>fd" so that you can poll for it, or something.
>
>I doubt very many people want to do that. It would tend to simply be nicer
>to do
>
> async(poll);
> async(waitpid);
> async(.. wait for anything else ..)
>
>followed by a
>
> wait_for_async();
>
>That's just a much NICER approach, I would argue. And it automatically
>and very naturally solves the "wait for different kinds of events"
>question, in a way that "poll()" never did (except by turning all events
>into file descriptors or signals).
>
>
>
>>Alternatively, something like SIGIO ('SIGASYS'?) might be considered, but,
>>well, the fd might be easier.
>>
>>
>
>Again. NO WAY. Signals are just damn expensive. At most, it would be an
>option again, but if you want high performance, signals simply aren't very
>good. They are also a nice way to make your user-space code very racy.
>
>
>
>>In fact, perhaps the communication channel might simply *be* an fd. Queueing
>>up syscalls sounds remarkably like sending datagrams.
>>
>>
>
>I'm the first to say that file descriptors is the UNIX way, but so are
>processes, and I think this is MUCH better done as a "process" interface.
>In other words, instead of doing it as a filedescriptor, do it as a
>"micro-fork/exec", and have the "wait()" equivalent. It's just that we
>don't fork a "real process", and we don't exec a "real program", we just
>exec a single system call.
>
>If you think of it in those terms, it all makes sense *without* any file
>descriptors what-so-ever, and the "wait_for_async()" interface also makes
>a ton of sense (it really *is* "waitpid()" for the system call).
>
> Linus
>

2007-02-06 00:32:43

by Davide Libenzi

[permalink] [raw]
Subject: Re: [PATCH 2 of 4] Introduce i386 fibril scheduling

On Mon, 5 Feb 2007, Davide Libenzi wrote:

> On Mon, 5 Feb 2007, Linus Torvalds wrote:
>
> > Indeed. One word is *exactly* what a normal system call returns too.
> >
> > That said, normally we have a user-space library layer to turn that into
> > the "errno + return value" thing, and in the case of async() calls we
> > very basically wouldn't have that. So either:
> >
> > - we'd need to do it in the kernel (which is actually nasty, since
> > different system calls have slightly different semantics - some don't
> > return any error value at all, and negative numbers are real numbers)
> >
> > - we'd have to teach user space about the "negative errno" mechanism, in
> > which case one word really is always enough.
> >
> > Quite frankly, I much prefer the second alternative. The "negative errno"
> > thing has not only worked really really well inside the kernel, it's so
> > obviously 100% superior to the standard UNIX "-1 + errno" approach that
> > it's not even funny.
>
> Currently it's in the syscall wrapper. Couldn't we have it in the
> asys_teardown_stack() stub?

Eeeek, that was something *really* stupid I said :D



- Davide


2007-02-06 00:48:20

by David Miller

[permalink] [raw]
Subject: Re: [PATCH 2 of 4] Introduce i386 fibril scheduling

From: Scot McKinley <[email protected]>
Date: Mon, 05 Feb 2007 16:27:44 -0800

> As Joel mentioned earlier, from an Oracle perspective, one of the key
> things we are looking for is a nice clean *common* wait point.

How much investigation have the Oracle folks (besides Zach :-) done
into Evgeniy's kevent interfaces and how much feedback have they given
to him?

I know it sounds like I'm being a pain in the ass, but it saddens
me that there is this whole large body of work implemented to solve
a problem, that the maintainer keeps posting patch sets, and that the whole
discussion has gone silent.

I'd be quiet if there were some well formulated objections to his work
being posted, but people are posting nothing. So either it's a
perfect API or people aren't giving it the attention and consideration
it deserves.

2007-02-06 00:49:13

by Joel Becker

[permalink] [raw]
Subject: Re: [PATCH 2 of 4] Introduce i386 fibril scheduling

On Mon, Feb 05, 2007 at 04:27:44PM -0800, Scot McKinley wrote:
> Finally, it is agreed that neg-errno is a much better approach for the
> return code. The threading/concurrency issues associated w/ the current
> unix errno has always been buggy area for Oracle Networking code.

As Scot knows, when Oracle started using the current io_submit(2)
and io_getevents(2), -errno was a big win.

Joel

--

"Born under a bad sign.
I been down since I began to crawl.
If it wasn't for bad luck,
I wouldn't have no luck at all."

Joel Becker
Principal Software Developer
Oracle
E-mail: [email protected]
Phone: (650) 506-8127

2007-02-06 13:41:15

by Al Boldi

[permalink] [raw]
Subject: Re: [PATCH 2 of 4] Introduce i386 fibril scheduling

Linus Torvalds wrote:
> On Mon, 5 Feb 2007, Zach Brown wrote:
> > For syscalls, sure.
> >
> > The kevent work incorporates Uli's desire to have more data per event.
> > Have you read his OLS stuff? It's been a while since I did so I've lost
> > the details of why he cares to have more.
>
> You'd still do that as _arguments_ to the system call, not as the return
> value.
>
> Also, quite frankly, I tend to find Uli over-designs things. The whole
> disease of "make things general" is a CS disease that some people take to
> extreme.
>
> The good thing about generic code is not that it solves some generic
> problem. The good thing about generics is that they mean that you can
> _avoid_ solving other problems AND HAVE LESS CODE.

Yes, that would be generic code, in the pure sense.

> But some people seem to
> think that "generic" means that you have to have tons of code to handle
> all the possible cases, and that *completely* misses the point.

That would be generic code too, but by way of functional awareness. This is
sometimes necessary, as no pure generic code has been found.

What's important is not the generic code, but rather the correct abstraction
of the problem-domain, regardless of its implementation, as that can be
conveniently hidden behind the interface.

> We want less code. The whole (and really, the _only_) point of the
> fibrils, at least as far as I'm concerned, is to *not* have special code
> for aio_read/write/whatever.

What we want is correct code, and usually that means less code in the long
run.

So, instead of allowing the implementation to dictate the system design, it
may be advisable to concentrate on the design first, to achieve an abstract
interface that is realized by an implementation second.


Thanks!

--
Al

2007-02-06 20:25:13

by Davide Libenzi

[permalink] [raw]
Subject: Re: [PATCH 2 of 4] Introduce i386 fibril scheduling

On Mon, 5 Feb 2007, Kent Overstreet wrote:

> > > HOWEVER, they get returned differently. The cookie gets returned
> > > immediately, the system call result gets returned in-memory only after the
> > > async thing has actually completed.
> > >
> > > I would actually argue that it's not the kernel that should generate any
> > > cookie, but that user-space should *pass*in* the cookie it wants to, and
> > > the kernel should consider it a pointer to a 64-bit entity which is the
> > > return code.
> >
> > Yes. Let's have userspace "mark" the async operation. IMO the
> > cookie should be something transparent to the kernel.
> > Like you said though, that'd require compat-code (unless we fix the size).
>
> You don't need an explicit cookie if you're passing in a pointer to
> the return code, it doesn't really save you anything to do so. Say
> you've got a bunch of user threads (with or without stacks, it doesn't
> matter).
>
> struct asys_ret {
> int ret;
> struct thread *p;
> };
>
> struct asys_ret r;
> r.p = me;
>
> async_read(fd, buf, nbytes, &r);

Hmm, are you working for Symbian? Because that's exactly how they track
pending async operations (the address of a status variable - wrapped in a
class of course, this being Symbian) ;)
That's another way of doing it, IMO no better and no worse than letting
userspace select an explicit cookie. You still have to have the
compat code though, either way.



- Davide


2007-02-06 20:46:37

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH 2 of 4] Introduce i386 fibril scheduling



On Mon, 5 Feb 2007, Kent Overstreet wrote:
>
> You don't need an explicit cookie if you're passing in a pointer to
> the return code, it doesn't really save you anything to do so. Say
> you've got a bunch of user threads (with or without stacks, it doesn't
> matter).
>
> struct asys_ret {
> int ret;
> struct thread *p;
> };
>
> struct asys_ret r;
> r.p = me;
>
> async_read(fd, buf, nbytes, &r);

That's horrible. It means that "r" cannot have automatic linkage (since
the stack will be *gone* by the time we need to fill in "ret"), so now you
need to track *two* pointers: "me" and "&r".

Wouldn't it be much better to just track one (both in user space and in
kernel space).

In kernel space, the "one pointer" would be the fibril pointer (which
needs to have all the information necessary for completing the operation
anyway), and in user space, it would be better to have just the cookie be
a pointer to the place where you expect the return value (since you need
both anyway).

I think the point here (for *both* the kernel and user space) would be to
try to keep the interfaces really easy to use. For the kernel, it means
that we don't ever pass anything new around: the "fibril" pointer is
basically defined by the current execution thread.

And for user space, it means that we pass the _one_ thing around that we
need for both identifying the async operation to the kernel (the "cookie")
for wait or cancel, and the place where we expect the return value to be
found (which in turn can _easily_ represent a whole "struct aiocb *",
since the return value obviously has to be embedded in there anyway).
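
Read as a sketch (the wrapper names and the embedding struct below are
assumptions for illustration), the cookie handed to the kernel is simply
the address of the slot the return value lands in, and userspace recovers
its own bookkeeping from that address:

struct my_aiocb {
        /* ... application bookkeeping ... */
        long ret;               /* the kernel writes the result here */
};

struct my_aiocb *cb = next_free_aiocb();        /* assumed helper */

/* Submission: the cookie *is* &cb->ret. */
async_read(fd, buf, nbytes, &cb->ret);          /* assumed wrapper */

/* Completion: the kernel hands the same cookie back, and the
 * enclosing aiocb falls out with offsetof() arithmetic (<stddef.h>). */
long *slot = wait_for_async();                  /* assumed: returns the cookie */
struct my_aiocb *finished = (struct my_aiocb *)
        ((char *)slot - offsetof(struct my_aiocb, ret));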

Linus

2007-02-06 21:16:34

by David Miller

[permalink] [raw]
Subject: Re: [PATCH 2 of 4] Introduce i386 fibril scheduling

From: Linus Torvalds <[email protected]>
Date: Tue, 6 Feb 2007 12:46:11 -0800 (PST)

> And for user space, it means that we pass the _one_ thing around that we
> need for both identifying the async operation to the kernel (the "cookie")
> for wait or cancel, and the place where we expect the return value to be
> found (which in turn can _easily_ represent a whole "struct aiocb *",
> since the return value obviously has to be embedded in there anyway).

I really think that Evgeniy's kevent is a good event notification
mechanism for anything, including AIO.

Events are events, applications want a centralized way to receive and
process them.

It's already implemented, and if there are tangible problems with it,
Evgeniy has been excellent at responding to criticism and implementing
suggested changes to the interfaces.

2007-02-06 21:29:15

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH 2 of 4] Introduce i386 fibril scheduling



On Tue, 6 Feb 2007, David Miller wrote:
>
> I really think that Evgeniy's kevent is a good event notification
> mechanism for anything, including AIO.
>
> Events are events, applications want a centralized way to receive and
> process them.

Don't be silly. AIO isn't an event. AIO is an *action*.

The event part is hopefully something that doesn't even *happen*.

Why do people ignore this? Look at a web server: I can pretty much
guarantee that 99% of all filesystem accesses are cached, and doing them
as "events" would be a total and utter waste of time.

You want to do them synchronously, as fast as possible, and you do NOT
want to see them as any kind of asynchronous events.

Yeah, in 1% of all cases it will block, and you'll want to wait for them.
Maybe the kevent queue works then, but if it needs any more setup than the
nonblocking case, that's a big no.

Linus

2007-02-06 21:32:07

by David Miller

[permalink] [raw]
Subject: Re: [PATCH 2 of 4] Introduce i386 fibril scheduling

From: Linus Torvalds <[email protected]>
Date: Tue, 6 Feb 2007 13:28:34 -0800 (PST)

> Yeah, in 1% of all cases it will block, and you'll want to wait for them.
> Maybe the kevent queue works then, but if it needs any more setup than the
> nonblocking case, that's a big no.

So the idea is to just run it to completion if it won't block and use
a fibril if it would?

kevent could support something like that too.

2007-02-06 21:47:43

by Eric Dumazet

[permalink] [raw]
Subject: Re: [PATCH 2 of 4] Introduce i386 fibril scheduling

David Miller wrote:
> From: Linus Torvalds <[email protected]>
> Date: Tue, 6 Feb 2007 13:28:34 -0800 (PST)
>
>> Yeah, in 1% of all cases it will block, and you'll want to wait for them.
>> Maybe the kevent queue works then, but if it needs any more setup than the
>> nonblocking case, that's a big no.
>
> So the idea is to just run it to completion if it won't block and use
> a fibril if it would?
>
> kevent could support something like that too.

It seems to me that kevent was designed to handle many event sources on a
single endpoint, like epoll (but with different internals): a typical load of
thousands of socket/pipe providers glued into one queue.

In the fibril case, I guess a thread won't have many fibrils lying around...

Also, kevent needs an fd lookup/fput to retrieve some queued events, and that
may be a performance hit for the AIO case (fget/fput in a multi-threaded
program costs some atomic ops).

2007-02-06 21:50:40

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH 2 of 4] Introduce i386 fibril scheduling



On Tue, 6 Feb 2007, David Miller wrote:
>
> So the idea is to just run it to completion if it won't block and use
> a fibril if it would?

That's not how the patches work right now, but yes, I at least personally
think that it's something we should aim for (ie the interface shouldn't
_require_ us to always wait for things even if perhaps an early
implementation might make everything be delayed at first)

Linus

2007-02-06 22:29:25

by Zach Brown

[permalink] [raw]
Subject: Re: [PATCH 2 of 4] Introduce i386 fibril scheduling

> That's not how the patches work right now, but yes, I at least personally
> think that it's something we should aim for (ie the interface shouldn't
> _require_ us to always wait for things even if perhaps an early
> implementation might make everything be delayed at first)

I agree that we shouldn't require a separate syscall just to get the
return code from ops that didn't block.

It doesn't seem like much of a stretch to imagine a setup where we
can specify completion context as part of the submission itself.

declare_empty_ring(ring);
struct submission sub;

sub.ring = &ring;
sub.nr = SYS_fstat64;
sub.args = ...

ret = submit(&sub, 1);
if (ret == 0) {
        wait_for_elements(&ring, 1);
        printf("stat gave %d\n", ring[ring->head].rc);
}

You get the idea, it's just an outline.

wait_for_elements() could obviously check the ring before falling
back to kernel sync. I'm pretty keen on the notion of producer/
consumer rings where userspace writes the head as it plucks
completions and the kernel writes the tail as it adds them.
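
A sketch of the ring layout that suggests (the field names here are
assumptions, not anything from the patches):

/* Userspace advances 'head' as it consumes completions; the kernel
 * advances 'tail' as it appends them.  The whole thing lives in
 * user-mapped memory. */
struct completion_ring {
        unsigned int head;              /* written by userspace */
        unsigned int tail;              /* written by the kernel */
        unsigned int nr_slots;          /* power of two */
        struct {
                void *cookie;
                long rc;
        } slots[];
};

/* Cheap userspace check before falling back to a blocking wait. */
static inline int ring_has_completions(const struct completion_ring *r)
{
        return r->head != r->tail;
}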

We might want per-call ring pointers, instead of per submission, to
help submitters wait for a group of ops to complete without having to
do their own tracking on event completion. That only makes sense if
we have the waiting mechanics let you only be woken as the number of
events in the ring crosses some threshold. Which I think we want
anyway.

We'd be trading building up a specific completion state with syscalls
for some complexity during submission that pins (and kmaps on
completion) the user pages. Submission could return failure if
pinning these new pages would push us over some rlimit. We'd have to
be *awfully* careful not to let userspace corrupt (munmap?) the ring
and confuse the hell out of the kernel.

Maybe not worth it, but if we *really* cared about making the non-
blocking case almost identical to the sync case and wanted to use the
same interface for batch submission and async completion then this
seems like a possibility.

- z

2007-02-06 22:45:52

by Kent Overstreet

[permalink] [raw]
Subject: Re: [PATCH 2 of 4] Introduce i386 fibril scheduling

On 2/6/07, Linus Torvalds <[email protected]> wrote:
> On Mon, 5 Feb 2007, Kent Overstreet wrote:
> >
> > struct asys_ret {
> > int ret;
> > struct thread *p;
> > };
> >
> > struct asys_ret r;
> > r.p = me;
> >
> > async_read(fd, buf, nbytes, &r);
>
> That's horrible. It means that "r" cannot have automatic linkage (since
> the stack will be *gone* by the time we need to fill in "ret"), so now you
> need to track *two* pointers: "me" and "&r".

You'd only allocate r on the stack if that stack is going to be around
later; i.e. if you're using user threads. Otherwise, you just allocate
it in some struct containing your aiocb or whatever.

> And for user space, it means that we pass the _one_ thing around that we
> need for both identifying the async operation to the kernel (the "cookie")
> for wait or cancel, and the place where we expect the return value to be
> found (which in turn can _easily_ represent a whole "struct aiocb *",
> since the return value obviously has to be embedded in there anyway).
>
> Linus

The "struct aiocb" isn't something you have to or necessarily want to
keep around. It's the way the current aio interface works (which I've
coded to), but I don't really see the point. All it really contains is
the syscall arguments, but once the syscall's in progress there's no
reason the kernel has to refer back to it; similarly for userspace,
it's just another struct that userspace has to keep track of and free
at some later time.

In fact, that's the only sane way you can have a ring for submitted
system calls, as otherwise elements of the ring are getting freed in
essentially random order.

I don't see the point in having a ring for completed events, since
it's at most two pointers per completion; quite a bit less data being
sent back than for submissions.

-----

The trouble with differentiating between calls that block and calls
that don't is that you completely lose the ability to batch syscalls
together; this is potentially a major win of an asynchronous
interface.

An app can have a bunch of cheap, fast user space threads servicing
whatever; as they run, they can push their system calls onto a global
stack. When no more can run, it does a giant asys_submit (something
similar to io_submit), then the io_getevents equivalent, running the
user threads that had their syscalls complete.
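
A sketch of that outer loop (every type and helper here is a placeholder,
and the asys_submit()/asys_getevents() signatures are assumptions):

struct async_call calls[MAX_BATCH];     /* hypothetical per-thread call records */
struct async_event events[MAX_BATCH];
int i, n, nr_calls = 0;

/* Run user threads until every runnable one has parked itself on a
 * syscall it pushed into calls[]. */
while (run_ready_user_threads(calls, &nr_calls))
        ;

asys_submit(calls, nr_calls);           /* assumed: one bulk submission */
n = asys_getevents(events, nr_calls);   /* assumed: one bulk reap */

for (i = 0; i < n; i++)
        wake_user_thread(events[i].cookie, events[i].rc);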

This doesn't mean you can't run synchronously the syscalls that
wouldn't block, or that you have to allocate a fibril for every
syscall - but for servers that care more about throughput than
latency, this is potentially a big win, in cache effects if nothing
else.

(And this doesn't prevent you from having a different syscall that
submits an asynchronous syscall, but runs it right away if it was able
to without blocking).

2007-02-06 23:05:21

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH 2 of 4] Introduce i386 fibril scheduling



On Tue, 6 Feb 2007, Kent Overstreet wrote:
>
> The "struct aiocb" isn't something you have to or necessarily want to
> keep around.

Oh, don't get me wrong - the _only_ reason for "struct aiocb" would be
backwards compatibility. The point is, we'd need to keep that
compatibility to be useful - otherwise we just end up having to duplicate
the work (do _both_ fibrils _and_ the in-kernel AIO).

> I don't see the point in having a ring for completed events, since
> it's at most two pointers per completion; quite a bit less data being
> sent back than for submissions.

I'm certainly personally perfectly happy with the kernel not remembering
any completed events at all - once it's done, it's done and forgotten. So
doing

async(mycookie)
wait_for_async(mycookie)

could actually return with -ECHILD (or similar error).

In other words, if you see it as a "process interface" (instead of as a
"filedescriptor interface"), I'd suggest automatic reaping of the fibril
children. I do *not* think we want the equivalent of zombies - if only
because they are just a lot of work to reap, and potentially a lot of
memory to keep around.

Linus

2007-02-06 23:23:52

by Davide Libenzi

[permalink] [raw]
Subject: Re: [PATCH 2 of 4] Introduce i386 fibril scheduling

On Tue, 6 Feb 2007, Kent Overstreet wrote:

> The trouble with differentiating between calls that block and calls
> that don't is you completely loose the ability to batch syscalls
> together; this is potentially a major win of an asynchronous
> interface.

It doesn't necessarily have to, once you extend the single return code to a
vector:

struct async_submit {
        void *cookie;
        int sysc_nbr;
        int nargs;
        long args[ASYNC_MAX_ARGS];
        int async_result;
};

int async_submit(struct async_submit *a, int n);

And async_submit() can mark each one ->async_result with -EASYNC (syscall
has been batched), or another code (syscall completed w/out schedule).
IMO, once you get a -EASYNC for a syscall, you *have* to retire the result.
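
A sketch of the caller side under that convention (consume_result(), NR
and the batch-building code are placeholders, not a defined API):

struct async_submit batch[NR];
int i, outstanding = 0;

/* ... fill in cookie, sysc_nbr, nargs and args for each entry ... */
async_submit(batch, NR);

for (i = 0; i < NR; i++) {
        if (batch[i].async_result == -EASYNC)
                outstanding++;          /* must be retired via the wait path */
        else
                consume_result(batch[i].cookie, batch[i].async_result);
}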



- Davide


2007-02-06 23:39:37

by Joel Becker

[permalink] [raw]
Subject: Re: [PATCH 2 of 4] Introduce i386 fibril scheduling

On Tue, Feb 06, 2007 at 03:23:47PM -0800, Davide Libenzi wrote:
> struct async_submit {
> void *cookie;
> int sysc_nbr;
> int nargs;
> long args[ASYNC_MAX_ARGS];
> int async_result;
> };
>
> int async_submit(struct async_submit *a, int n);
>
> And async_submit() can mark each one ->async_result with -EASYNC (syscall
> has been batched), or another code (syscall completed w/out schedule).
> IMO, once you get a -EASYNC for a syscall, you *have* to retire the result.

There are pains here, though. On every submit, you have to walk
the entire vector just to know what did or did not complete. I've seen
this in other APIs (eg, async_result would be -EAGAIN for lack of
resources to start this particular fibril). Userspace submit ends up
always walking the array of submissions twice - once to prep them, and
once to check if they actually went async. For longer lists of I/Os,
this is expensive.

Joel

--

"Too much walking shoes worn thin.
Too much trippin' and my soul's worn thin.
Time to catch a ride it leaves today
Her name is what it means.
Too much walking shoes worn thin."

Joel Becker
Principal Software Developer
Oracle
E-mail: [email protected]
Phone: (650) 506-8127

2007-02-06 23:56:19

by Davide Libenzi

[permalink] [raw]
Subject: Re: [PATCH 2 of 4] Introduce i386 fibril scheduling

On Tue, 6 Feb 2007, Joel Becker wrote:

> On Tue, Feb 06, 2007 at 03:23:47PM -0800, Davide Libenzi wrote:
> > struct async_submit {
> > void *cookie;
> > int sysc_nbr;
> > int nargs;
> > long args[ASYNC_MAX_ARGS];
> > int async_result;
> > };
> >
> > int async_submit(struct async_submit *a, int n);
> >
> > And async_submit() can mark each one ->async_result with -EASYNC (syscall
> > has been batched), or another code (syscall completed w/out schedule).
> > IMO, once you get a -EASYNC for a syscall, you *have* to retire the result.
>
> There are pains here, though. On every submit, you have to walk
> the entire vector just to know what did or did not complete. I've seen
> this in other APIs (eg, async_result would be -EAGAIN for lack of
> resources to start this particular fibril). Userspace submit ends up
> always walking the array of submissions twice - once to prep them, and
> once to check if they actually went async. For longer lists of I/Os,
> this is expensive.

Async syscall submissions are a _one time_ thing. It's not like a live fd
that you can push inside epoll and avoid the multiple O(N) passes.
First of all, the number of syscalls that you'd submit in a vectored way
is limited. It does not depend on the total number of connections, but on
the number of syscalls that you are actually able to submit in parallel.
Note that it's not a trivial task to extract a level of parallelism long
enough to make walking through the submission array painful. Think about
the trivial web server case. A remote HTTP client asks for one page, and you
may think of batching a few ops together (like a stat, open, send headers,
and sendfile for example), but those cannot be vectored since they have to
complete in order. The stat might even trigger a different response to the
HTTP client. You need the open() fd to submit the send-headers and sendfile.
IMO there are no scalability problems in a multiple submission/retrieval
API like the above (or any variation of it).



- Davide


2007-02-07 00:06:52

by Joel Becker

[permalink] [raw]
Subject: Re: [PATCH 2 of 4] Introduce i386 fibril scheduling

On Tue, Feb 06, 2007 at 03:56:14PM -0800, Davide Libenzi wrote:
> Async syscall submissions are a _one time_ things. It's not like a live fd
> that you can push inside epoll and avoid the multiple O(N) passes.
> First of all, the amount of syscalls that you'd submit in a vectored way
> are limited. They do not depend on the total number of connections, but on

I regularly see apps that want to submit 1000 I/Os at once.
Every submit. But it's all against one or two file descriptors. So, if
you return to userspace, they have to walk all 1000 async_results every
time, just to see which completed and which didn't. And *then* go wait
for the ones that didn't. If they just wait for them all, they aren't
spinning cpu on the -EASYNC operations.
I'm not saying that "don't return a completion if we can
non-block it" is inherently wrong or not a good idea. I'm saying that
we need a way to flag them efficiently.

Joel

--

Life's Little Instruction Book #80

"Slow dance"

Joel Becker
Principal Software Developer
Oracle
E-mail: [email protected]
Phone: (650) 506-8127

2007-02-07 00:24:14

by Davide Libenzi

[permalink] [raw]
Subject: Re: [PATCH 2 of 4] Introduce i386 fibril scheduling

On Tue, 6 Feb 2007, Joel Becker wrote:

> On Tue, Feb 06, 2007 at 03:56:14PM -0800, Davide Libenzi wrote:
> > Async syscall submissions are a _one time_ things. It's not like a live fd
> > that you can push inside epoll and avoid the multiple O(N) passes.
> > First of all, the amount of syscalls that you'd submit in a vectored way
> > are limited. They do not depend on the total number of connections, but on
>
> I regularly see apps that want to submit 1000 I/Os at once.
> Every submit. But it's all against one or two file descriptors. So, if
> you return to userspace, they have to walk all 1000 async_results every
> time, just to see which completed and which didn't. And *then* go wait
> for the ones that didn't. If they just wait for them all, they aren't
> spinning cpu on the -EASYNC operations.
> I'm not saying that "don't return a completion if we can
> non-block it" is inherently wrong or not a good idea. I'm saying that
> we need a way to flag them efficiently.

To how many "sessions" those 1000 *parallel* I/O operations refer to?
Because, if you batch them in an async fashion, they have to be parallel.
Without the per-async operation status code, you'll need to wait a result
*for each* submitted syscall, even the ones that completed syncronously.
Open questions are:

- Is the 1000 *parallel* syscall vectored submission case common?

- Is it more expensive to forcibly have to wait and fetch a result even
for in-cache syscalls, or it's faster to walk the submission array?



- Davide


2007-02-07 00:45:17

by Joel Becker

[permalink] [raw]
Subject: Re: [PATCH 2 of 4] Introduce i386 fibril scheduling

On Tue, Feb 06, 2007 at 04:23:52PM -0800, Davide Libenzi wrote:
> To how many "sessions" those 1000 *parallel* I/O operations refer to?
> Because, if you batch them in an async fashion, they have to be parallel.

They're independent. Of course they have to be parallel, that's
what I/O wants.

> Without the per-async operation status code, you'll need to wait a result
> *for each* submitted syscall, even the ones that completed syncronously.

You are right, but it's more efficient in some cases.

> Open questions are:
>
> - Is the 1000 *parallel* syscall vectored submission case common?

Sure is for I/O. It's the majority of the case. If you have
1000 blocks to send out, you want them all down at the request queue at
once, where they can merge.

> - Is it more expensive to forcibly have to wait and fetch a result even
> for in-cache syscalls, or it's faster to walk the submission array?

Not everything is in-cache. Databases will be doing O_DIRECT
and will expect that 90% of their I/O calls will block. Why should they
have to iterate this list every time? If this is the API, they *have*
to. If there's an efficient way to get "just the ones that didn't
block", then it's not a problem.

Joel


--

"The real reason GNU ls is 8-bit-clean is so that they can
start using ISO-8859-1 option characters."
- Christopher Davis ([email protected])

Joel Becker
Principal Software Developer
Oracle
E-mail: [email protected]
Phone: (650) 506-8127

2007-02-07 01:15:08

by Davide Libenzi

[permalink] [raw]
Subject: Re: [PATCH 2 of 4] Introduce i386 fibril scheduling

On Tue, 6 Feb 2007, Joel Becker wrote:

> > - Is it more expensive to forcibly have to wait and fetch a result even
> > for in-cache syscalls, or it's faster to walk the submission array?
>
> Not everything is in-cache. Databases will be doing O_DIRECT
> and will expect that 90% of their I/O calls will block. Why should they
> have to iterate this list every time? If this is the API, they *have*
> to. If there's an efficient way to get "just the ones that didn't
> block", then it's not a problem.

If that's what is wanted, then the async_submit() API can detect the
synchronous completion and drop a result inside the result queue
immediately. It means that an immediately following async_wait() will find
some completions right away. Or:

struct async_submit {
        void *cookie;
        int sysc_nbr;
        int nargs;
        long args[ASYNC_MAX_ARGS];
};
struct async_result {
        void *cookie;
        long result;
};

int async_submit(struct async_submit *a, struct async_result *r, int n);

Where "r" will store the ones that completed syncronously. I mean, there
are really many ways to do this.
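
A sketch of how a caller might use that variant, assuming async_submit()
returns the number of synchronously completed entries it dropped into "r"
(that return convention, NR and consume_result() are assumptions):

struct async_submit subs[NR];
struct async_result sync_done[NR];
int i, n;

/* ... fill in subs[] ... */
n = async_submit(subs, sync_done, NR);  /* assumed: n = sync completions */

for (i = 0; i < n; i++)                 /* only the ones that didn't block */
        consume_result(sync_done[i].cookie, sync_done[i].result);

/* the other NR - n results arrive later through async_wait() */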
I think ATM the core kernel implementation should be the focus, because
IMO we've just scratched the surface of the potential problems that something
like this can raise (scheduling, signaling, cleanup, cancel - just to
name a few).



- Davide


2007-02-07 01:22:08

by Kent Overstreet

[permalink] [raw]
Subject: Re: [PATCH 2 of 4] Introduce i386 fibril scheduling

On 2/6/07, Linus Torvalds <[email protected]> wrote:
> On Tue, 6 Feb 2007, Kent Overstreet wrote:
> >
> > The "struct aiocb" isn't something you have to or necessarily want to
> > keep around.
>
> Oh, don't get me wrong - the _only_ reason for "struct aiocb" would be
> backwards compatibility. The point is, we'd need to keep that
> compatibility to be useful - otherwise we just end up having to duplicate
> the work (do _both_ fibrils _and_ the in-kernel AIO).

Bah, I was unclear here, sorry. I was talking about the userspace interface.

Right now, with the aio interface, io_submit passes in an array of
pointers to struct iocb; there's nothing that says the kernel will be
done with the structs when io_submit returns, so while userspace is
free to reuse the array of pointers, it can't free the actual iocbs
until they complete.

This is slightly stupid, for a couple reasons, and if we're making a
new pair of syscalls it'd be better to do it slightly differently.

What you want is for the async_submit syscall (or whatever it's
called) to pass in an array of structs, and for the kernel to not
reference them after async_submit returns. This is easy; after
async_submit returns, each syscall in the array is either completed
(if it could be without blocking), or in progress, and there's no
reason to need the arguments again.

It also means that the kernel has to copy in only a single userspace
buffer, instead of one buffer per syscall; as Joel mentions, there are
plenty of apps that will be doing 1000s of syscalls at once. From a
userspace perspective it's awesome: it simplifies coding and means you
have to hit the heap that much less.
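
The kernel-side consequence, sketched with placeholder names (user_array,
nr and the queueing step are placeholders; only kmalloc()/copy_from_user()
are real APIs): the whole batch is pulled in with one copy, and the
caller's array is never referenced again once submission returns.

/* Sketch only: one bulk copy of the submission array. */
struct async_submit *batch;

batch = kmalloc(nr * sizeof(*batch), GFP_KERNEL);
if (!batch)
        return -ENOMEM;
if (copy_from_user(batch, user_array, nr * sizeof(*batch))) {
        kfree(batch);
        return -EFAULT;
}
/* ... queue each entry; userspace is free to reuse its array now ... */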

2007-02-07 01:24:40

by Kent Overstreet

[permalink] [raw]
Subject: Re: [PATCH 2 of 4] Introduce i386 fibril scheduling

> If that's what is wanted, then the async_submit() API can detect the
> syncronous completion soon, and drop a result inside the result-queue
> immediately. It means that an immediately following async_wait() will find
> some completions soon. Or:
>
> struct async_submit {
> void *cookie;
> int sysc_nbr;
> int nargs;
> long args[ASYNC_MAX_ARGS];
> };
> struct async_result {
> void *cookie;
> long result:
> };
>
> int async_submit(struct async_submit *a, struct async_result *r, int n);
>
> Where "r" will store the ones that completed syncronously. I mean, there
> are really many ways to do this.

That interface (modifying async_submit to pass in the size of the
result array) would work great.

2007-02-07 01:30:40

by Joel Becker

[permalink] [raw]
Subject: Re: [PATCH 2 of 4] Introduce i386 fibril scheduling

On Tue, Feb 06, 2007 at 05:15:02PM -0800, Davide Libenzi wrote:
> I think ATM the core kernel implementation should be the focus, because

Yeah, I was thinking the same thing. I originally posted just
to make the point :-)

Joel

--

Life's Little Instruction Book #99

"Think big thoughts, but relish small pleasures."

Joel Becker
Principal Software Developer
Oracle
E-mail: [email protected]
Phone: (650) 506-8127

2007-02-07 06:16:50

by Michael K. Edwards

[permalink] [raw]
Subject: Re: [PATCH 2 of 4] Introduce i386 fibril scheduling

On 2/6/07, Joel Becker <[email protected]> wrote:
> Not everything is in-cache. Databases will be doing O_DIRECT
> and will expect that 90% of their I/O calls will block. Why should they
> have to iterate this list every time? If this is the API, they *have*
> to. If there's an efficient way to get "just the ones that didn't
> block", then it's not a problem.

It's usually efficient, especially in terms of programmer effort, for
the immediate path to resemble as nearly as possible what you would
have done with the synchronous equivalent. (If there's some value in
parallelizing the query across multiple CPUs, you probably don't want
the kernel guessing how to partition it.) But what's efficient for
the delayed path is to be tightly bound to the arrival of the AIO
result, and to do little more than schedule it into the appropriate
event queue or drop it if it is stale. The immediate and delayed
paths will often share part, but not all, of their implementation, and
most of the shared part is probably data structure setup that can
precede the call itself. The rest of the delayed path is where the
design effort should go, because it's the part that has the sort of
complex impact on system performance that is hard for application
programmers to think clearly about.

Oracle isn't the only potential userspace user of massively concurrent
AIO with a significant, but not dominant, fraction of cache hits. I'm
familiar with a similar use case in network monitoring, in which one
would like to implement the attribute tree and query translation more
or less as a userspace filesystem, while leaving both the front-end
caching and the back-end throttling, retries, etc. to in-kernel state
machines. When 90% of the data requested by the front end (say, a
Python+WxWidgets GUI) is available from the VFS cache, only the other
10% should actually carry the AIO overhead.

Let's look at that immediately available fraction from the GUI
programmer's perspective. He wants to look up some attributes from a
whole batch of systems, and wants to present all immediately available
results to the user, with the rest grayed out or something. Each
request for data that is available from cache should result
immediately in a call to his (perhaps bytecode-language) callback,
which fills in a slot in the data structure that he's going to present
wholesale. There's no reason why the immediate version of the
callback should be unable to allocate memory, poke at thread-local
structures, etc.; and in practice there's little to be gained by
parallelizing this fraction (or even aggressively delivering AIOs that
complete quickly) because you'd need to thread-safe that data
structure, which probably isn't worth it in performance and certainly
isn't in programmer effort and likelihood of Heisenbugs.

Delayed results, on the other hand, probably have to use the GUI's
event posting mechanism to queue the delivered data (probably in a
massaged form) into a GUI update thread. Hence the delayed callback
can be delivered in some totally other context if it's VM- and
scheduler-efficient to do so; it's probably just doing a couple of
memcpys and a sem_post or some such. The only reason it isn't a
totally separate chunk of code is that it uses the same context layout
as the immediate path, and may have to poke at some of the same
pre-allocated places to update completion statistics, etc.

(I implemented something similar to this in userspace using Python
generators for the closure-style callbacks, in the course of rewriting
a GUI that had a separate protocol translator process in place of the
userspace filesystem. The thread pool that serviced responses from
the protocol translator operated much like Zach's fibrils, and used a
sort of lookup by request cookie to retrieve the closure and feed it
the result, which had the side effect of posting the appropriate
event. It worked, fast, and it was more or less comprehensible to
later maintainers despite the use of Python's functional features,
because the AIO delivery was kept separate from both the plain-vanilla
immediate-path code and the GUI-idiom event queue processing.)

The broader issue here is that only the application developer really
knows how the AIO results ought to be funneled into the code that
wants them, which could be a database query engine or a GUI update
queue or Murphy knows what. This "application AIO closure" step is
distinct from the application-neutral closure that needs to run in a
kernel "fibril" (extracting stat() results from an NFS response, or
whatever). So it seems to me that applications ought to be able to
specify a userspace closure to be executed on async I/O completion (or
timeout, error, etc.), and this closure should be scheduled
efficiently on completion of the kernel bit.

The delayed path through the userspace closure would partly resemble a
signal handler in that it shouldn't touch thread or heap context, just
poke at pre-allocated process-global memory locations and/or
synchronization primitives. (A closer parallel, for those familiar
with it, would be the "event handlers" of O/S's with cooperative
multitasking and a single foreground application; MacOS 6.x with
MultiFinder and PalmOS 4.x come to mind.)

What if we share a context+stack page between kernel and userspace to
be used by both the kernel "I/O completion" closure and the userspace
"event handler" closure? After all, these are the pieces that
cooperatively multitask with one another. Pop the kernel AIO closure
scheduler into the tasklet queue right after the softirq tasklet --
surely 99% of "fibrils" would become runnable due to something that
happens in a softirq, and it would make at least as much sense to run
there as in the task's schedule() path. The event handler would be
scheduled in most respects like a signal handler in a POSIX threaded
process -- running largely in the context of some application thread
(on syscall exit or by preemption), and limited in the set of APIs it
can call.

In this picture, the ideal peristalsis would usually be ISR exit path
-> softirq -> kernel closure (possibly not thread-like at all, just a
completion scheduled from a tasklet) -> userspace closure ->
application thread. The kernel and userspace closures could actually
share a stack page which also contains the completion context for
both. Linus's async_stat() example is a good one, I think. Here is
somewhat fuller userspace code, without the syntactic sugar that could
easily be used to make the callbacks more closure-ish:

/* linux/aeiou.h */
typedef void (*aeiou_stat_cb_t) (int, struct aeiou_stat *);

struct aeiou_stat __ALIGN_ME_PROPERLY__ {
        aeiou_stat_cb_t cb; /* userspace completion hook */
        struct stat stat_buf;
        union {
                int filedes;
                char name[NAME_MAX+1];
        } u;
#ifdef __KERNEL__
        ... completion context for the kernel AIO closure ...
#endif
};

/* The returned pointer is the cookie for all */
/* subsequent aeiou calls in this request group. */
void *__aeiou_alloc_aeiou_stat(size_t uctx_bytes);

#define aeiou_begin(ktype, utype, field) \
        ((utype *)__aeiou_alloc_##ktype(offsetof(utype, field)))

/* foo.c */
struct one_entry {
        ... closure context for the userspace event handler ...
        struct aeiou_stat s;
};

static void my_cb(int is_delayed, struct aeiou_stat *as) {
        struct one_entry *my_context = container_of(as, struct one_entry, s);
        ... code that runs in userspace "event handler" context ...
}

...

struct one_entry *entry = aeiou_begin(aeiou_stat, struct one_entry, s);
struct dirent *de;

entry->s.cb = my_cb;
/* set up some process-global data structure to hold */
/* the results of this burst of async_stat calls */

while ((de = readdir(dir)) != NULL) {
        strcpy(entry->s.u.name, de->d_name);
        /* set up any additional application context */
        /* in *entry for this individual async_stat call */

        aeiou_stat(entry);
}
/* application tracks outstanding AIOs using data structure */
/* there could also be an aeiou_checkprogress(entry) */
...
aeiou_end(entry);

(The use of "aeiou_stat" rather than a more general class of async I/O
calls is for illustration purposes.)

If the stat data is immediately available when aeiou_stat() is called,
the struct stat gets filled in and the callback is run immediately in
the current stack context. If not, the contents of *entry are copied
to a new page (possibly using COW VM magic), and the syscall returns.
On the next trip through the scheduler (or when a large enough batch
of AIOs have been queued to be worth initiating them at the cost of
shoving the userspace code out of cache), the kernel closures are set
up in the opaque trailer to aeiou_stat in the copies, and the AIOs are
initiated.

The signature of aeiou_stat is deliberately limited to a single
pointer, since all of its arguments are likely to be interesting to
one or both closures. There is no need to pass the offset to the
kernel parameter sub-struct into calls after the initial aeiou_begin;
the kernel has to check the validity of the "entry" pointer/cookie
anyway, so it had best keep track of the enclosing allocation bounds,
offset to the syscall parameter structure, etc. in a place where
userspace can't alter it. Both kernel and userspace closures
eventually run with their stack in the shared page, after the closure
context area. The userspace closure has to respect
signal-handler-like limitations on its powers if is_delayed is true;
it will run in the right process context but has no particular thread
context and can't call anything that could block or allocate memory.

I think this sort of interface might work well for both GUI event
frameworks and real-time streaming media playback/mixing, which are
two common ways for AIO to enter the mere userspace programmer's
sphere of concern (and also happen to be areas where I have some
expertise). Would it work for the Oracle use case?

Cheers,
- Michael

2007-02-07 09:17:47

by Michael K. Edwards

[permalink] [raw]
Subject: Re: [PATCH 2 of 4] Introduce i386 fibril scheduling

Man, I should have edited that down before sending it. Hopefully this
is clearer:

- The usual programming model for AIO completion in GUIs, media
engines, and the like is an application callback. Data that is
available immediately may be handled quite differently from data that
arrives after a delay, and usually the only reason for both code paths
to be in the same callback is shared code to maintain counters, etc.
associated with the AIO batch. These shared operations, and the other
things one might want to do in the delayed path, needn't be able to
block or allocate memory.

- AIO requests that are serviced from cache ought to immediately
invoke the callback, in the same thread context as the caller, fixing
up the stack so that the callback returns to the instruction following
the syscall. That way the "immediate completion" path through the
callback can manipulate data structures, allocate memory, etc. just as
if it had followed a synchronous call.

- AIO requests that need data not in cache should probably be
batched in order to avoid evicting the userspace AIO submission loop,
the immediate completion branch of the callback, and their data
structures from cache on every miss. If you can use VM copy-on-write
tricks to punt a page of AIO request parameters and closure context
out to another CPU for immediate processing without stomping on your
local caches, great.

- There's not much point in delivering AIO responses all the way
to userspace until the AIO submission loop is done, because they're
probably going to be handled through some completely different event
queue mechanism in the delayed path through the callback. Trying to
squeeze a few AIO responses into the same data structure as if they
had been in cache is likely to create race conditions or impose
needless locking overhead on the otherwise serialized immediate
completion branch.

- The result of the external AIO may arrive on a different CPU
with something completely else in foreground; but in real use cases
it's probably a different thread of the same process. If you can use
the closure context page as the stack page for the kernel bit of the
AIO completion, and then use it again from userspace as the stack page
for the application bit, then the whole ISR -> softirq -> kernel
closure -> application closure path has minimal system impact.

- The delayed path through the application callback can't block
and can't touch data structures that are thread-local or may be in an
incoherent state at this juncture (called during a more or less
arbitrary ISR exit path, a bit like a signal handler). That's OK,
because it's probably just massaging the AIO response into fields of a
preallocated object dangling off of a global data structure and doing
a sem_post or some such. (It might even just drop it if it's stale.)

- As far as I can tell (knowing little about the scheduler per
se), these kernel closures aren't much like Zach's "fibrils"; they'd
be invoked from a tasklet chained more or less immediately after the
softirq dispatch tasklet. I have no idea whether the cost of finding
the appropriate kernel closure(s) associated with the data that
arrived in the course of a softirq, pulling them over to the CPU where
the softirq just ran, and popping out to userspace to run the
application closure is exorbitant, or if it's even possible to force a
process switch from inside a tasklet that way.

Hope this helps, and sorry for the noise,
- Michael

2007-02-07 09:37:20

by Michael K. Edwards

[permalink] [raw]
Subject: Re: [PATCH 2 of 4] Introduce i386 fibril scheduling

An idiot using my keyboard wrote:
> - AIO requests that are serviced from cache ought to immediately
> invoke the callback, in the same thread context as the caller, fixing
> up the stack so that the callback returns to the instruction following
> the syscall. That way the "immediate completion" path through the
> callback can manipulate data structures, allocate memory, etc. just as
> if it had followed a synchronous call.

Or, of course:
        if (async_stat(entry) == 0) {
                ... immediate completion code path ...
        }

Ugh. But I think the discussion about the delayed path still holds.

- Michael

2007-02-09 22:33:57

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH 0 of 4] Generic AIO by scheduling stacks


Ok, here's another entry in this discussion.

This is a *really* small patch. Yes, it adds 174 lines, and yes it's
actually x86 (32-bit) only, but about half of it is totally generic, and
*all* of it is almost ludicrously simple.

There's no new assembly language. The one-liner addition to
"syscall_table.S" is just adding the system call entry stub. It's all in
C, and none of it is even hard to understand.

It's architecture-specific, because different architectures do the whole
"fork()" entry path differently, and this is basically a "delayed fork()",
not really an AIO thing at all.

So what this does, very simply is:

- on system call entry just save away the pt_regs pointer needed to do a
fork (on some architectures, this means that you need to use a longer
system call entry that saves off all registers - on x86 even that isn't
an issue)

- save that away as a magic cookie in the task structure

- do the system call

- IF the system call blocks, we call the architecture-specific
"schedule_async()" function before we even get any scheduler locks, and
it can just do a fork() at that time, and let the *child* return to the
original user space. The process that already started doing the system
call will just continue to do the system call.

- when the system call is done, we check to see if it was done totally
synchronously or not. If we ended up doing the clone(), we just exit
the new thread.

Now, I agree that this is a bit ugly in some of the details: in
particular, it means that if the system call blocks, we will literally
return as a *different* thread to user space. If you care, you shouldn't
use this interface, or come up with some way to make it work nicely (doing
it this way meant that I could just re-use all the clone/fork code as-is).

Also, it actually does take the hit of creating a full new thread. We
could optimize that a bit. But at least the cached case has basically
*zero* overhead: we literally end up doing just a few extra CPU
instructions to copy the arguments around etc, but no locked cycles, no
memory allocations, no *nothing*.

So I actually like this, because it means that while we slow down real IO,
we don't slow down the cached cases at all.

Final warning: I didn't do any cancel/wait crud. It doesn't even return
the thread ID as it is now. And I only hooked up "stat64()" as an example.
So this really is just a total toy. But it's kind of funny how simple it
was, once I started thinking about how I could do this in some clever way.

I even added comments, so a lot of the few new added lines aren't even
code!

Linus

---

diff --git a/arch/i386/kernel/process.c b/arch/i386/kernel/process.c
index c641056..0909724 100644
--- a/arch/i386/kernel/process.c
+++ b/arch/i386/kernel/process.c
@@ -698,6 +698,71 @@ struct task_struct fastcall * __switch_to(struct task_struct *prev_p, struct tas
return prev_p;
}

+/*
+ * This gets called when an async event causes a schedule.
+ * We should try to
+ *
+ * (a) create a new thread
+ * (b) within that new thread, return to the original
+ * user mode call-site.
+ * (c) clear the async event flag, since it is now no
+ * longer relevant.
+ *
+ * If anything fails (a resource issue etc), we just do
+ * the async system call as a normal synchronous event!
+ */
+#define CLONE_ALL (CLONE_VM | CLONE_FS | CLONE_FILES | CLONE_SIGHAND | CLONE_PARENT | CLONE_THREAD)
+#define FAILED_CLONE ((struct pt_regs *)1)
+void schedule_async(void)
+{
+ struct pt_regs *regs = current->async_cookie;
+ int retval;
+
+ if (regs == FAILED_CLONE)
+ return;
+
+ current->async_cookie = NULL;
+ /*
+ * This is magic. The child will return through "ret_from_fork()" to
+ * where the original thread started it all. It's not the same thread
+ * any more, and we don't much care. The "real thread" has now become
+ * the async worker thread, and will exit once the async work is done.
+ */
+ retval = do_fork(CLONE_ALL, regs->esp, regs, 0, NULL, NULL);
+
+ /*
+ * If it failed, we could just restore the async_cookie and try again
+ * on the next scheduling event.
+ *
+ * But it's just better to set it to some magic value to indicate
+ * "do not try this again". If it failed once, we shouldn't waste
+ * time trying it over and over again.
+ *
+ * Any non-NULL value will tell "do_async()" at the end that it was
+ * done "synchronously".
+ */
+ if (retval < 0)
+ current->async_cookie = FAILED_CLONE;
+}
+
+asmlinkage int sys_async(struct pt_regs regs)
+{
+ void *async_cookie;
+ unsigned long syscall, flags;
+ int __user *status;
+ unsigned long __user *user_args;
+
+ /* Pick out the do_async() arguments.. */
+ async_cookie = &regs;
+ syscall = regs.ebx;
+ flags = regs.ecx;
+ status = (int __user *) regs.edx;
+ user_args = (unsigned long __user *) regs.esi;
+
+ /* ..and call the generic helper routine */
+ return do_async(async_cookie, syscall, flags, status, user_args);
+}
+
asmlinkage int sys_fork(struct pt_regs regs)
{
return do_fork(SIGCHLD, regs.esp, &regs, 0, NULL, NULL);
diff --git a/arch/i386/kernel/syscall_table.S b/arch/i386/kernel/syscall_table.S
index 2697e92..647193c 100644
--- a/arch/i386/kernel/syscall_table.S
+++ b/arch/i386/kernel/syscall_table.S
@@ -319,3 +319,4 @@ ENTRY(sys_call_table)
.long sys_move_pages
.long sys_getcpu
.long sys_epoll_pwait
+ .long sys_async /* 320 */
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 4463735..e14b11b 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -844,6 +844,13 @@ struct task_struct {

struct mm_struct *mm, *active_mm;

+ /*
+ * The scheduler uses this to determine if the current call is a
+ * standalone thread or just an async system call that hasn't
+ * had its real thread created yet.
+ */
+ void *async_cookie;
+
/* task state */
struct linux_binfmt *binfmt;
long exit_state;
@@ -1649,6 +1656,12 @@ extern int sched_create_sysfs_power_savings_entries(struct sysdev_class *cls);

extern void normalize_rt_tasks(void);

+/* Async system call support */
+extern long do_async(void *async_cookie, unsigned int syscall, unsigned long flags,
+ int __user *status, unsigned long __user *user_args);
+extern void schedule_async(void);
+
+
#endif /* __KERNEL__ */

#endif
diff --git a/kernel/Makefile b/kernel/Makefile
index 14f4d45..13bda9f 100644
--- a/kernel/Makefile
+++ b/kernel/Makefile
@@ -8,7 +8,7 @@ obj-y = sched.o fork.o exec_domain.o panic.o printk.o profile.o \
signal.o sys.o kmod.o workqueue.o pid.o \
rcupdate.o extable.o params.o posix-timers.o \
kthread.o wait.o kfifo.o sys_ni.o posix-cpu-timers.o mutex.o \
- hrtimer.o rwsem.o latency.o nsproxy.o srcu.o
+ hrtimer.o rwsem.o latency.o nsproxy.o srcu.o async.o

obj-$(CONFIG_STACKTRACE) += stacktrace.o
obj-y += time/
diff --git a/kernel/async.c b/kernel/async.c
new file mode 100644
index 0000000..29b14f3
--- /dev/null
+++ b/kernel/async.c
@@ -0,0 +1,71 @@
+/*
+ * kernel/async.c
+ *
+ * Create a light-weight kernel-level thread.
+ */
+#include <linux/sched.h>
+#include <linux/syscalls.h>
+
+#include <asm/uaccess.h>
+
+/* Fake "generic" system call pointer type */
+typedef asmlinkage long (*master_syscall_t)(unsigned long arg, ...);
+
+#define ASYNC_SYSCALL(syscall, param) \
+ { (master_syscall_t) (syscall), (param) }
+
+static struct async_call {
+ master_syscall_t fn;
+ int args;
+} call_descriptor[] = {
+ ASYNC_SYSCALL(sys_stat64, 2),
+};
+
+long do_async(
+ void *async_cookie,
+ unsigned int syscall,
+ unsigned long flags,
+ int __user *status,
+ unsigned long __user *user_args)
+{
+ int ret, size;
+ struct async_call *desc;
+ unsigned long args[6];
+
+ if (syscall >= ARRAY_SIZE(call_descriptor))
+ return -EINVAL;
+
+ desc = call_descriptor + syscall;
+ if (!desc->fn)
+ return -EINVAL;
+
+ if (desc->args > ARRAY_SIZE(args))
+ return -EINVAL;
+
+ size = sizeof(unsigned long)*desc->args;
+ if (copy_from_user(args, user_args, size))
+ return -EFAULT;
+
+ /* We don't nest async calls! */
+ if (current->async_cookie)
+ return -EINVAL;
+ current->async_cookie = async_cookie;
+
+ ret = desc->fn(args[0], args[1], args[2], args[3], args[4], args[5]);
+ put_user(ret, status);
+
+ /*
+ * Did we end up doing part of the work in a separate thread?
+ *
+ * If so, the async thread-creation already returned in the
+ * original parent, and cleared out the async_cookie. We're
+ * now just in the worker thread, and should just exit. Our
+ * job here is done.
+ */
+ if (!current->async_cookie)
+ do_exit(0);
+
+ /* We did it synchronously - return 0 */
+ current->async_cookie = 0;
+ return 0;
+}
diff --git a/kernel/fork.c b/kernel/fork.c
index d57118d..6f38c46 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -1413,6 +1413,18 @@ long do_fork(unsigned long clone_flags,
return nr;
}

+/*
+ * Architectures that don't have async support get this
+ * dummy async thread scheduler callback.
+ *
+ * They had better not set task->async_cookie in the
+ * first place, so this should never get called!
+ */
+void __attribute__ ((weak)) schedule_async(void)
+{
+ BUG();
+}
+
#ifndef ARCH_MIN_MMSTRUCT_ALIGN
#define ARCH_MIN_MMSTRUCT_ALIGN 0
#endif
diff --git a/kernel/sched.c b/kernel/sched.c
index cca93cc..cc73dee 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -3436,6 +3436,17 @@ asmlinkage void __sched schedule(void)
}
profile_hit(SCHED_PROFILING, __builtin_return_address(0));

+ /* Are we running within an async system call? */
+ if (unlikely(current->async_cookie)) {
+ /*
+ * If so, we now try to start a new thread for it, but
+ * not for a preemption event or a scheduler timeout
+ * triggering!
+ */
+ if (!(preempt_count() & PREEMPT_ACTIVE) && current->state != TASK_RUNNING)
+ schedule_async();
+ }
+
need_resched:
preempt_disable();
prev = current;

2007-02-09 23:11:58

by Davide Libenzi

[permalink] [raw]
Subject: Re: [PATCH 0 of 4] Generic AIO by scheduling stacks

On Fri, 9 Feb 2007, Linus Torvalds wrote:

>
> Ok, here's another entry in this discussion.

That's another way to do it. But you end up creating/destroying a new
thread for every request. It may perform just fine.
Another, even simpler way IMO, is to just have a plain per-task kthread
pool, and a queue. An async_submit() drops a request in the queue, and
wakes the request queue-head where the kthreads are sleeping. One kthread
picks up the request, services it, drops a result in the result queue, and
wakes the results queue-head (where async_fetch() callers are sleeping).
Cancellation is not a problem here (by means of sending a signal to the
service kthread). Also, no problem with arch-dependent code. This is a 1:1
match of what my userspace implementation does.
Of course, no hot-path optimizations are performed here, and you need a few
context switches more than necessary.
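
A rough userspace sketch of that scheme (all names invented; the "syscall"
is just a function pointer here, and pool/thread setup is omitted):

#include <pthread.h>

struct areq {
        long (*fn)(void *);             /* stand-in for the syscall */
        void *arg;
        long result;
        struct areq *next;
};

struct apool {
        pthread_mutex_t lock;
        pthread_cond_t more_work, more_results;
        struct areq *work, *done;       /* request queue, result queue */
};

/* One of the pool's service threads. */
static void *service_thread(void *p)
{
        struct apool *pool = p;

        for (;;) {
                pthread_mutex_lock(&pool->lock);
                while (!pool->work)
                        pthread_cond_wait(&pool->more_work, &pool->lock);
                struct areq *req = pool->work;
                pool->work = req->next;
                pthread_mutex_unlock(&pool->lock);

                req->result = req->fn(req->arg);        /* service it */

                pthread_mutex_lock(&pool->lock);
                req->next = pool->done;
                pool->done = req;
                pthread_cond_signal(&pool->more_results);
                pthread_mutex_unlock(&pool->lock);
        }
        return NULL;
}

/* async_submit(): drop a request in the queue, wake a service thread. */
static void async_submit(struct apool *pool, struct areq *req)
{
        pthread_mutex_lock(&pool->lock);
        req->next = pool->work;
        pool->work = req;
        pthread_cond_signal(&pool->more_work);
        pthread_mutex_unlock(&pool->lock);
}

/* async_fetch(): sleep until a result is available, then return it. */
static struct areq *async_fetch(struct apool *pool)
{
        pthread_mutex_lock(&pool->lock);
        while (!pool->done)
                pthread_cond_wait(&pool->more_results, &pool->lock);
        struct areq *req = pool->done;
        pool->done = req->next;
        pthread_mutex_unlock(&pool->lock);
        return req;
}
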
Let's have Zach (Ingo's support to Zach would be great) play with the
optimized version, and then we can maybe bench the three to see if the
more complex code that the optimized version requires gets a payback on
the performance side.

/me thinks it likely will



- Davide


2007-02-09 23:37:17

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH 0 of 4] Generic AIO by scheduling stacks



On Fri, 9 Feb 2007, Davide Libenzi wrote:
>
> That's another way to do it. But you end up creating/destroying a new
> thread for every request. It may perform just fine.

Well, I actually wanted to add a special CLONE_ASYNC flag, because I
think we could do it better if we know it's a particularly limited special
case. But that's really just a "small implementation detail", and I don't
know how big a deal it is. I didn't want to obscure the basic idea with
anything bigger.

I agree that the create/destroy is a big overhead, but at least it's now
only done when we actually end up doing some IO (and _after_ we've started
the IO, of course - that's when we block), so compared to doing it up
front, I'm hoping that it's not actually that horrid.

The "fork-like" approach also means that it's very flexible. It's not
really even limited to doing simple system calls any more: you *could*,
for example, decide that since you already have the thread, and now that
it's asynchronous, you'd actually return to user space (to let user space
"complete" whatever asynchronous action it wanted to complete).

> Another, even simpler way IMO, is to just have a plain per-task kthread
> pool, and a queue.

Yes, that is actually quite doable with basically the same interface. It's
literally a "small decision" inside of "schedule_async()" on how it
actually would want to handle the case of "hey, we now have concurrent
work to be done".

But I actually don't think a per-task kthread pool is necessarily a good
idea. If a thread pool works for this, then it should have worked for
regular thread create/destroy loads too - ie there really is little reason
to special-case the "async system call" case.

NOTE! I'm also not at all sure that we actually want to waste real threads
on this. My patch is in no way meant to be an "exclusive alternative" to
fibrils. Quite the reverse, actually: I _like_ those synchronous fibrils,
but I didn't like how Zach did the overhead of creating them up-front,
because I really would like the cached case to be totally *synchronous*.

So I wrote my patch with a "schedule_async()" implementation that just
creates a full-sized thread, but I actually wanted very much to try to
make it use fibrils that are allocated on-demand too. I was just too lazy.

So the patch is really meant as a "ok, this is how easy it is to make the
thread allocation be 'on-demand' instead of 'up-front'". The actual
_policy_ on how thread allocation is done isn't even interesting to me, to
some degree. I think Zach's fibrils would work fine, a thread pool would
work fine, and just the silly outright "new thread for everything" that
the example patch actually used may also possibly work well enough.

It's one reason I liked my patch. It was not only small and simple, it
really is very flexible, I think. It's also totally independent of how
you actually end up _executing_ the async requests.

(In fact, you could easily make it a config option whether you support any
asynchronous behaviour AT ALL. The "async()" system call might still be
there, but it would just return "0" all the time, and do the actual work
synchronously).
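
A sketch of that degenerate variant, reusing the do_async() prototype from
the patch earlier in the thread (run_call() is an invented stand-in for the
descriptor-table dispatch):

long do_async(void *async_cookie, unsigned int syscall, unsigned long flags,
              int __user *status, unsigned long __user *user_args)
{
        int ret = run_call(syscall, user_args);

        put_user(ret, status);
        return 0;       /* always "completed synchronously" */
}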

Linus

2007-02-10 00:06:30

by Eric Dumazet

[permalink] [raw]
Subject: Re: [PATCH 0 of 4] Generic AIO by scheduling stacks

Linus Torvalds wrote:
> Ok, here's another entry in this discussion.

>
> - IF the system call blocks, we call the architecture-specific
> "schedule_async()" function before we even get any scheduler locks, and
> it can just do a fork() at that time, and let the *child* return to the
> original user space. The process that already started doing the system
> call will just continue to do the system call.


Well, I guess if the original program was mono-threaded, and the syscall used
fget_light(), we might have a problem here if the child tries a close(). So you
may have to disable the fget_light() magic if an async call is the originator of
the syscall.

Eric

2007-02-10 00:15:05

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH 0 of 4] Generic AIO by scheduling stacks



On Sat, 10 Feb 2007, Eric Dumazet wrote:
>
> Well, I guess if the original program was mono-threaded, and the syscall used
> fget_light(), we might have a problem here if the child tries a close(). So you
> may have to disable the fget_light() magic if an async call is the originator of
> the syscall.

Yes. All the issues that I already brought up with Zach's patches are
still there. This doesn't really change any of them. Any optimization that
checks for "am I single-threaded" will need to be aware of pending and
running async things.

With my patch, any _running_ async things will always be seen as normal
clones, but the pending ones won't. So you'd need to effectively change
anything that looks like

        if (atomic_read(&current->mm->count) == 1)
                .. do some simplified version ..

into

        if (!current->async_cookie && atomic_read(..) == 1)
                .. do the simplified thing ..

to make it safe.

I think we only do it for fget_light and some VM TLB simplification, so it
shouldn't be a big burden to check.

Side note: the real issues still remain. The interfaces, and the
performance testing.

Linus

2007-02-10 10:47:15

by bert hubert

[permalink] [raw]
Subject: Re: [PATCH 0 of 4] Generic AIO by scheduling stacks

On Fri, Feb 09, 2007 at 02:33:01PM -0800, Linus Torvalds wrote:

> - IF the system call blocks, we call the architecture-specific
> "schedule_async()" function before we even get any scheduler locks, and
> it can just do a fork() at that time, and let the *child* return to the
> original user space. The process that already started doing the system
> call will just continue to do the system call.

Ah - cool. The average time we have to wait is probably far greater than the
fork overhead - milliseconds versus microseconds.

However, and there probably is a good reason for this, why isn't it possible
to do it the other way around, and have the *child* do the work and the
original return to userspace?

Would confuse me a lot less in any case.

Bert

--
http://www.PowerDNS.com Open source, database driven DNS Software
http://netherlabs.nl Open and Closed source services

2007-02-10 18:19:19

by Davide Libenzi

[permalink] [raw]
Subject: Re: [PATCH 0 of 4] Generic AIO by scheduling stacks

On Sat, 10 Feb 2007, bert hubert wrote:

> On Fri, Feb 09, 2007 at 02:33:01PM -0800, Linus Torvalds wrote:
>
> > - IF the system call blocks, we call the architecture-specific
> > "schedule_async()" function before we even get any scheduler locks, and
> > it can just do a fork() at that time, and let the *child* return to the
> > original user space. The process that already started doing the system
> > call will just continue to do the system call.
>
> Ah - cool. The average time we have to wait is probably far greater than the
> fork overhead, microseconds versus milliseconds.
>
> However, and there probably is a good reason for this, why isn't it possible
> to do it the other way around, and have the *child* do the work and the
> original return to userspace?

If the parent is going to schedule(), someone above has already dropped
the parent's task_struct inside a wait queue, so the *parent* will be the
wakeup target [1].
Linus' take on generic AIO is a neat one, but IMO continuous fork/exits
are going to be expensive. Even if the task is going to sleep, that does
not mean that the parent (well, in Linus' case, the child actually) does
not have more stuff to feed to async(). IMO the frequency of AIO
submission and retrieval can get pretty high (hence the frequency of
fork/exit), and there might be a price to pay for it in the end.
IMO one solution, following the non-fibril way, may be:

- Keep a pool of per-process threads (a per-process pool already has stuff
like "files" correctly set up, just for example - no need to
teach everything around the kernel about the "async" special case)

- When a schedule happens on the submission thread, we get a thread
(task_struct really) from the available pool

- We set up the return IP of the submission (now going to sleep) thread to an
async_complete (or whatever name) stub. This will drop a result in a
queue, and wake the async_wait (or whatever name) wait queue head

- We may want to swap at least the PID (signals, ...?) between the two, so
even if we're re-emerging with a new task_struct, the TID will be the same

- We make the "returning" thread come back to userspace through some
special helper a la ret_from_fork (ret_from_async?)

- We also want to keep a record (hash?) of userspace cookies and threads
currently servicing them, so that we can implement cancel (send signal)

Open issues:

- What if the pool becomes empty since all threads are stuck under schedule()?
o Grow the pool (and delay-shrink it at quieter times)?
o Make the caller really sleep?
o Fall back to queue-request mode?

- Look at the Devil hiding in the details and showing up many times during
the process

Yup, I can see Zach having a lot of fun with it ;)



[1] Well, you could add a list_head to the task_struct, and teach the
add-to-waitqueue to drop a reference to all the wait queue entries
hosting the task_struct. Then walk&fix (likely only one entry) when
you swap the submission thread context (thread_info, per_call stuff, ...)
over a service thread task_struct. At that point you can re-emerge
with the same task_struct. Pretty nasty though.


- Davide


2007-02-10 18:45:32

by Davide Libenzi

[permalink] [raw]
Subject: Re: [PATCH 0 of 4] Generic AIO by scheduling stacks

On Fri, 9 Feb 2007, Linus Torvalds wrote:

> > Another, even simpler way IMO, is to just have a plain per-task kthread
> > pool, and a queue.
>
> Yes, that is actually quite doable with basically the same interface. It's
> literally a "small decision" inside of "schedule_async()" on how it
> actually would want to handle the case of "hey, we now have concurrent
> work to be done".

For the queue approach, I meant async_submit() to simply add the
request (cookie, syscall number and params) to the queue, without trying
to execute the syscall. Once you're inside schedule(), "stuff" has already
partially happened, and you cannot have the same request re-initiated by a
different thread.



> But I actually don't think a per-task kthread pool is necessarily a good
> idea. If a thread pool works for this, then it should have worked for
> regular thread create/destroy loads too - ie there really is little reason
> to special-case the "async system call" case.

A per-process thread pool already has things correctly inherited, so we
don't need to add special "adopt" routines for things like "files" and such.



> NOTE! I'm also not at all sure that we actually want to waste real threads
> on this. My patch is in no way meant to be an "exclusive alternative" to
> fibrils. Quite the reverse, actually: I _like_ those synchronous fibrils,
> but I didn't like how Zach did the overhead of creating them up-front,
> because I really would like the cached case to be totally *synchronous*.

I'm not advocating threads against fibrils. The use of threads may make
things easier under certain POVs (fewer ad-hoc changes to mainline). The
ideal would be to have a look at both and see Pros&Cons under different
POVs (performance, code impact, etc..).



- Davide


2007-02-10 19:03:07

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH 0 of 4] Generic AIO by scheduling stacks



On Sat, 10 Feb 2007, Davide Libenzi wrote:
>
> For the queue approach, I meant async_submit() to simply add the
> request (cookie, syscall number and params) to the queue, without trying
> to execute the syscall. Once you're inside schedule(), "stuff" has already
> partially happened, and you cannot have the same request re-initiated by a
> different thread.

But that makes it impossible to do things synchronously, which I think is
a *major* mistake.

The whole (and really _only_) point of my patch was really the whole
"synchronous call" part. I'm personally of the opinion that if you cannot
handle the cached case as fast as just doing the system call directly,
then the whole thing is almost pointless.

Things that take a long time we already have good methods for. "epoll" and
"kevent" are always going to be the best way to handle the "we have ten
thousand events outstanding". There simply isn't any question about it.
You can *never* handle ten thousand long-running events efficiently with
threads - even if you ignore all the CPU overhead, you're going to have a
much bigger memory (and thus *cache*) footprint.

So anybody who wants to use AIO to do those kinds of long-running async
things is out to lunch. It's not the point.

You use the AIO stuff for things that you *expect* to be almost
instantaneous. Even if you actually start ten thousand IO's in one go, and
they all do IO, you would hopefully expect that the first ones start
completing before you've even submitted them all. If that's not true,
then you'd just be better off using epoll.

Also, if you can make the cached case as fast as just doing the direct
system call itself, that just changes the whole equation for using it. You
suddenly don't have any downsides. You can start using the async
interfaces in places you simply couldn't otherwise, or in places where
you'd have to do a lot of performance tuning ("it makes sense under this
particular load because I actually need to get 100 IO's going at the same
time to saturate the disk").

So the "do cached things synchronously" really is important. Just because
that makes a whole complicated optimization question go away: you
basically *always* win for normal stuff.

Linus

2007-02-10 19:36:09

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH 0 of 4] Generic AIO by scheduling stacks



On Sat, 10 Feb 2007, Linus Torvalds wrote:
>
> But that makes it impossible to do things synchronously, which I think is
> a *major* mistake.
>
> The whole (and really _only_) point of my patch was really the whole
> "synchronous call" part. I'm personally of the opinion that if you cannot
> handle the cached case as fast as just doing the system call directly,
> then the whole thing is almost pointless.

Side note: one of the nice things with "do it synchronously if you can" is
that it also likely would allow us to do a reasonable job at "self-tuning"
things in the kernel. With my async approach, we get notified only when we
block, so it's easy (for example) to have a simple counter that
automatically adapts to the number of outstanding IO's, in a way that it's
_not_ if we do things at submit time when we won't even know whether it
will block or not.

As a trivial example: we actually see what *kind* of blocking it is. Is it
blocking interruptibly ("long wait") or uninterruptibly ("disk wait")? So
by the time schedule_async() is called, we actually have some more
information about the situation, and we can even do different things
(possibly based on just hints that the user and/or system maintainer gives
us; ie you can tune the behaviour from _outside_ by setting different
limits, for example).
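
A sketch of the kind of policy hook this enables (purely illustrative, not
part of the patch; async_disk_waits and async_disk_limit are invented
names):

/*
 * Decide, at the moment we know we're about to block, whether spawning
 * an async thread is worth it for this particular sleep.
 */
static int async_worth_a_thread(void)
{
        /* Interruptible sleep: the "long wait" case (network, pipes) -
         * better left to epoll-style interfaces. */
        if (current->state == TASK_INTERRUPTIBLE)
                return 0;

        /* Uninterruptible "disk wait": usually short, but adapt to how
         * many are already outstanding. */
        return atomic_read(&async_disk_waits) < async_disk_limit;
}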

Linus

2007-02-10 21:00:06

by Davide Libenzi

[permalink] [raw]
Subject: Re: [PATCH 0 of 4] Generic AIO by scheduling stacks

On Sat, 10 Feb 2007, Linus Torvalds wrote:

> On Sat, 10 Feb 2007, Davide Libenzi wrote:
> >
> > For the queue approach, I meant async_submit() to simply add the
> > request (cookie, syscall number and params) to the queue, without trying
> > to execute the syscall. Once you're inside schedule(), "stuff" has already
> > partially happened, and you cannot have the same request re-initiated by a
> > different thread.
>
> But that makes it impossible to do things synchronously, which I think is
> a *major* mistake.

Yes! That's what I said when I described the method. No synco fast-paths.
At that point you could implement the full-queued method in userspace.



> The whole (and really _only_) point of my patch was really the whole
> "synchronous call" part. I'm personally of the opinion that if you cannot
> handle the cached case as fast as just doing the system call directly,
> then the whole thing is almost pointless.
>
> Things that take a long time we already have good methods for. "epoll" and
> "kevent" are always going to be the best way to handle the "we have ten
> thousand events outstanding". There simply isn't any question about it.
> You can *never* handle ten thousand long-running events efficiently with
> threads - even if you ignore all the CPU overhead, you're going to have a
> much bigger memory (and thus *cache*) footprint.

Think about the old-fashioned web server, using epoll to handle thousands
of connections. You'll be hosting an epoll_wait() over an async
thread/fibril. Now a burst of 500 connections suddenly becomes "hot", and
you start looping through those 500 hot connections trying to handle them.
You'll need to stat/open/read (let's assume a trivial, non-cached HTTP
server) the file pointed to by the URL's doc, and those had better be
handled in an async fashion, otherwise you'll starve the others and pay a
huge performance price. You can multiplex using a state machine or
coroutines, for example. Using coroutines, your epoll dispatching loop ends
up doing something like:

struct conn {
        coroutine_t co;
        int res;
        int skfd;
        ...
};

void events_dispatch(struct epoll_event *events, int n) {

        for (i = 0; i < n; i++) {
                struct conn *c = (struct conn *) events[i].data;
                co_call(c->co);
        }
}

Note that co_call() will make the coroutine re-emerge from the last
co_resume() it issued.
Your code doesn't need to be coroutine/async aware, once you wrap the
possibly-blocking calls with something like:

int my_stat(struct conn *c, const char *path, struct stat *buf) {
        /* "c" is the cookie */
        if ((c->res = async_submit(c, __NR_stat, path, buf)) == EASYNC)
                /* co_resume() will bounce back to the scheduler loop */
                co_resume();
        return c->res;
}

Now, the *main* loop will be the async_wait() driven one:

struct async_result {
        void *cookie;
        long result;
};

n = async_wait(ares, nares);
for (i = 0; i < n; i++) {
        if (ares[i].cookie == epoll_special_cookie)
                events_dispatch(...);
        else {
                struct conn *c = (struct conn *) ares[i].cookie;
                c->res = ares[i].result;
                co_call(c->co);
        }
}

Many of the async submissions will complete in a synco way, but many of
them will require a reschedule and service-thread attention. Of that burst
of 500, you can expect 100 or more to require async service. Right away,
not sometime later. According to Oracle, 1000 or more requests, 90% of
which can be expected to block, can be fired at a time.
It is true that we need to have a fast synco path, but it is also true
that we must not suck in the non-synco path, because the submitting thread
has something else to do than simply issue one async request (it likely
has *many* of them to push).
There we go, I broke my "No more than 20 lines" rule again :)




- Davide


2007-02-11 00:56:06

by David Miller

[permalink] [raw]
Subject: Re: [PATCH 0 of 4] Generic AIO by scheduling stacks

From: Linus Torvalds <[email protected]>
Date: Fri, 9 Feb 2007 14:33:01 -0800 (PST)

> So I actually like this, because it means that while we slow down
> real IO, we don't slow down the cached cases at all.

Even if you have everything, every page, every log file, in the page
cache, everything talking over the network wants to block.

Will you create a thread every time tcp_sendmsg() hits the send queue
limits?

Add some logging to tcp_sendmsg() on a busy web server if you do not
believe me :-)

The idea is probably excellent for operations on real files, but it's
going to stink badly for networking stuff.

2007-02-11 02:50:54

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH 0 of 4] Generic AIO by scheduling stacks



On Sat, 10 Feb 2007, David Miller wrote:
>
> Even if you have everything, every page, every log file, in the page
> cache, everything talking over the network wants to block.
>
> Will you create a thread every time tcp_sendmsg() hits the send queue
> limits?

No. You use epoll() for those.

> The idea is probably excellent for operations on real files, but it's
> going to stink badly for networking stuff.

And I actually talked about that in one of the emails already. There is no
way you can beat an event-based thing for things that _are_ event-based.
That means mainly networking.

For things that aren't event-based, but based on real IO (ie filesystems
etc), event models *suck*. They suck because the code isn't amenable to it
in the first place (ie anybody who thinks that a filesystem is like a
network stack and can be done as a state machine with packets is just
crazy!).

So you would be crazy to make a web server that uses this to handle _all_
outstanding IO. Network connections are often slow, and you can have tens
of thousands outstanding (and some may be outstanding for hours until they
time out, if ever). But that's the whole point: you can easily mix the
two, as given in several examples already (ie you can easily make the main
loop itself basically do just

        for (;;) {
                async(epoll);           /* wait for networking events */
                async_wait();           /* wait for epoll _or_ any of the outstanding file IO events */
                handle_completed_events();
        }

and it's actually a lot better than an event model, exactly because now
you can handle events _and_ non-events well (a pure event model requires
that _everything_ be an event, which works fine for some things, but works
really badly for other things).

There's a reason why a lot of UNIX system calls are blocking: they just
don't make sense as event models, because there is no sensible half-way
point that you can keep track of (filename lookup is the most common
example).

Linus

2007-02-12 00:23:09

by Alan

[permalink] [raw]
Subject: Re: [PATCH 0 of 4] Generic AIO by scheduling stacks

> I think we only do it for fget_light and some VM TLB simplification, so it
> shouldn't be a big burden to check.

And all the permission management stuff that relies on one thread not
being able to manipulate the uid/gid of another to race security checks.

Alan

2007-02-14 17:10:09

by James Antill

[permalink] [raw]
Subject: Re: [PATCH 0 of 4] Generic AIO by scheduling stacks

On Sat, 10 Feb 2007 18:49:56 -0800, Linus Torvalds wrote:

> And I actually talked about that in one of the emails already. There is no
> way you can beat an event-based thing for things that _are_ event-based.
> That means mainly networking.
>
> For things that aren't event-based, but based on real IO (ie filesystems
> etc), event models *suck*. They suck because the code isn't amenable to it
> in the first place (ie anybody who thinks that a filesystem is like a
> network stack and can be done as a state machine with packets is just
> crazy!).
>
> So you would be crazy to make a web server that uses this to handle _all_
> outstanding IO. Network connections are often slow, and you can have tens
> of thousands outstanding (and some may be outstanding for hours until they
> time out, if ever). But that's the whole point: you can easily mix the
> two, as given in several examples already (ie you can easily make the main
> loop itself basically do just

I don't see any replies to this, so here's my 2¢. The simple model of
what a webserver does when sending static data is:

1. local_disk_fd = open()
2. fstat(local_disk_fd)
3. TCP_CORK on
4. send_headers();
5. LOOP
5a. sendfile(network_con_fd, local_disk_fd)
5b. epoll(network_con_fd)
6. TCP_CORK off

...and here's my personal plan (again, somewhat simplified), which I
think will be "better":

7. helper_proc_pipe_fd = DO open() + fstat()
8. read_stat_event_data(helper_proc_pipe_fd)
9. TCP_CORK on network_con_fd
10. send_headers(network_con_fd);
11. LOOP
11a. splice(helper_proc_pipe_fd, network_con_fd)
11b. epoll(network_con_fd && helper_proc_pipe_fd)
12. TCP_CORK off network_con_fd

...where the "helper proc" is doing splice() from disk to the pipe, on the
other end. This, at least in theory, gives you an async webserver and zero
copy disk to network[1]. My assumption is that Evgeniy's aio_sendfile()
could fit into that model pretty easily, and would be faster.
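
A rough sketch of the two splice() halves of that plan (the fork/pipe setup
of step 7, error handling, and the epoll re-arming of step 11b are all
omitted; the function names are invented):

#define _GNU_SOURCE
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

/* Helper proc side: pump the whole file into the pipe. Only the helper
 * ever blocks on disk here. */
static int pump_file_to_pipe(int disk_fd, int pipe_wr, loff_t size)
{
        loff_t off = 0;

        while (off < size) {
                ssize_t n = splice(disk_fd, &off, pipe_wr, NULL,
                                   size - off, SPLICE_F_MOVE | SPLICE_F_MORE);
                if (n <= 0)
                        return -1;
        }
        return 0;
}

/* Main loop side (step 11a): move whatever the helper has produced from
 * the pipe to the socket, without blocking on the socket. */
static ssize_t pump_pipe_to_socket(int pipe_rd, int sock_fd)
{
        return splice(pipe_rd, NULL, sock_fd, NULL, 65536,
                      SPLICE_F_MOVE | SPLICE_F_NONBLOCK);
}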

However, from what you've said above you're only trying to help #1 and #2
(which are likely to be cached in the app anyway), and apps
that want to sendfile() to the network either do horrible hacks like
lighttpd's "AIO"[2], do a read+write copy loop with AIO, or don't use AIO.


[1] And allows things like IO limiting, which aio_sendfile() won't.

[2] http://illiterat.livejournal.com/2989.html

--
James Antill -- [email protected]
http://www.and.org/and-httpd/ -- $2,000 security guarantee
http://www.and.org/vstr/