2007-02-28 21:46:43

by Ingo Molnar

Subject: [patch 00/12] Syslets, Threadlets, generic AIO support, v5


this is the v5 release of the syslet/threadlet subsystem:

http://redhat.com/~mingo/syslet-patches/

this release took 4 days to get out, but there were a couple of key
changes that needed some time to settle down:

- ported the code from v2.6.20 to current -git (v2.6.21-rc2 should be
fine as a base)

- 64-bit support in terms of an x86_64 port. Jens has updated the FIO
syslet code to work on 64-bit too. (kernel/async.c was pretty 64-bit
clean already; it needed minimal changes for basic x86_64 support.)

- 32-bit user-space on 64-bit kernel compat support. 32-bit syslet and
threadlet binaries work fine on 64-bit kernels.

- various cleanups and simplifications

the v4->v5 delta is:

17 files changed, 327 insertions(+), 271 deletions(-)

amongst the plans for v6 are cleanups/simplifications to the syslet
engine API - a number of suggestions have already been made for that.

the linecount increase in v5 is mostly due to the x86_64 port. The ABI
had to change again - see the async-test userspace code for details.

the x86_64 patch is a bit monolithic at the moment; i'll split it up
further in v6.

As always, comments, suggestions, reports are welcome!

Ingo


2007-02-28 21:48:10

by Ingo Molnar

Subject: [patch 01/12] syslets: add async.h include file, kernel-side API definitions

From: Ingo Molnar <[email protected]>

add include/linux/async.h which contains the kernel-side API
declarations.

it also provides NOP stubs for the !CONFIG_ASYNC_SUPPORT case.

Signed-off-by: Ingo Molnar <[email protected]>
Signed-off-by: Arjan van de Ven <[email protected]>
---
include/linux/async.h | 88 ++++++++++++++++++++++++++++++++++++++++++++++++++
1 file changed, 88 insertions(+)

Index: linux/include/linux/async.h
===================================================================
--- /dev/null
+++ linux/include/linux/async.h
@@ -0,0 +1,88 @@
+#ifndef _LINUX_ASYNC_H
+#define _LINUX_ASYNC_H
+
+#include <linux/completion.h>
+#include <linux/compiler.h>
+#include <linux/syslet.h>
+#include <asm/unistd.h>
+
+/*
+ * The syslet subsystem - asynchronous syscall execution support.
+ *
+ * Syslet-subsystem internal definitions:
+ */
+
+/*
+ * The kernel-side copy of a syslet atom - with arguments expanded:
+ */
+struct syslet_atom {
+ unsigned long flags;
+ unsigned long nr;
+ long __user *ret_ptr;
+ struct syslet_uatom __user *next;
+ unsigned long args[6];
+ syscall_fn_t *call_table;
+ unsigned int nr_syscalls;
+};
+
+/*
+ * The 'async head' is the thread which has user-space context (ptregs)
+ * 'below it' - this is the one that can return to user-space:
+ */
+struct async_head {
+ spinlock_t lock;
+ struct task_struct *user_task;
+
+ struct list_head ready_async_threads;
+ struct list_head busy_async_threads;
+
+ struct mutex completion_lock;
+ long events_left;
+ wait_queue_head_t wait;
+
+ struct async_head_user __user *ahu;
+
+ unsigned long __user *new_stackp;
+ unsigned long new_ip;
+ unsigned long restore_stack;
+ unsigned long restore_ip;
+ struct completion start_done;
+ struct completion exit_done;
+};
+
+/*
+ * The 'async thread' is either a newly created async thread or it is
+ * an 'ex-head' - it cannot return to user-space and only has kernel
+ * context.
+ */
+struct async_thread {
+ struct task_struct *task;
+ unsigned long user_stack;
+ unsigned long user_ip;
+ struct async_head *ah;
+
+ struct list_head entry;
+
+ unsigned int exit;
+};
+
+/*
+ * Generic kernel API definitions:
+ */
+#ifdef CONFIG_ASYNC_SUPPORT
+extern void async_init(struct task_struct *t);
+extern void async_exit(struct task_struct *t);
+extern void __async_schedule(struct task_struct *t);
+#else /* !CONFIG_ASYNC_SUPPORT */
+static inline void async_init(struct task_struct *t)
+{
+}
+static inline void async_exit(struct task_struct *t)
+{
+}
+static inline void __async_schedule(struct task_struct *t)
+{
+}
+#endif /* !CONFIG_ASYNC_SUPPORT */
+
+#endif

2007-02-28 21:48:17

by Ingo Molnar

Subject: [patch 02/12] syslets: add syslet.h include file, user API/ABI definitions

From: Ingo Molnar <[email protected]>

add include/linux/syslet.h which contains the user-space API/ABI
declarations. Add the new header to include/linux/Kbuild as well.

Signed-off-by: Ingo Molnar <[email protected]>
Signed-off-by: Arjan van de Ven <[email protected]>
---
include/linux/Kbuild | 1
include/linux/syslet.h | 155 +++++++++++++++++++++++++++++++++++++++++++++++++
2 files changed, 156 insertions(+)

Index: linux/include/linux/Kbuild
===================================================================
--- linux.orig/include/linux/Kbuild
+++ linux/include/linux/Kbuild
@@ -141,6 +141,7 @@ header-y += sockios.h
header-y += som.h
header-y += sound.h
header-y += synclink.h
+header-y += syslet.h
header-y += telephony.h
header-y += termios.h
header-y += ticable.h
Index: linux/include/linux/syslet.h
===================================================================
--- /dev/null
+++ linux/include/linux/syslet.h
@@ -0,0 +1,155 @@
+#ifndef _LINUX_SYSLET_H
+#define _LINUX_SYSLET_H
+/*
+ * The syslet subsystem - asynchronous syscall execution support.
+ *
+ * Started by Ingo Molnar:
+ *
+ * Copyright (C) 2007 Red Hat, Inc., Ingo Molnar <[email protected]>
+ *
+ * User-space API/ABI definitions:
+ */
+
+#ifndef __user
+# define __user
+#endif
+
+/*
+ * This is the 'Syslet Atom' - the basic unit of execution
+ * within the syslet framework. A syslet atom always represents
+ * a single system-call plus its arguments, and has conditions
+ * attached to it that allow the construction of larger
+ * programs from these atoms. User-space variables can be used
+ * (for example a loop index) via the special sys_umem*() syscalls.
+ *
+ * Arguments are implemented via pointers to arguments. This not
+ * only increases the flexibility of syslet atoms (multiple syslets
+ * can share the same variable for example), but is also an
+ * optimization: copy_uatom() will only fetch syscall parameters
+ * up until the point it meets the first NULL pointer. 50% of all
+ * syscalls have 2 or less parameters (and 90% of all syscalls have
+ * 4 or less parameters).
+ *
+ * [ Note: since the argument array is at the end of the atom, and the
+ * kernel will not touch any argument beyond the final NULL one, atoms
+ * might be packed more tightly. (the only special case exception to
+ * this rule would be SKIP_TO_NEXT_ON_STOP atoms, where the kernel will
+ * jump a full syslet_uatom number of bytes.) ]
+ */
+struct syslet_uatom {
+ u32 flags;
+ u32 nr;
+ u64 ret_ptr;
+ u64 next;
+ u64 arg_ptr[6];
+ /*
+ * User-space can put anything in here, kernel will not
+ * touch it:
+ */
+ u64 private;
+};
+
+/*
+ * Flags to modify/control syslet atom behavior:
+ */
+
+/*
+ * Immediately queue this syslet asynchronously - do not even
+ * attempt to execute it synchronously in the user context:
+ */
+#define SYSLET_ASYNC 0x00000001
+
+/*
+ * Never queue this syslet asynchronously - even if synchronous
+ * execution causes context-switching:
+ */
+#define SYSLET_SYNC 0x00000002
+
+/*
+ * Do not queue the syslet in the completion ring when done.
+ *
+ * ( the default is that the final atom of a syslet is queued
+ * in the completion ring. )
+ *
+ * Some syscalls generate implicit completion events of their
+ * own.
+ */
+#define SYSLET_NO_COMPLETE 0x00000004
+
+/*
+ * Execution control: conditions upon the return code
+ * of the just executed syslet atom. 'Stop' means syslet
+ * execution is stopped and the atom is put into the
+ * completion ring:
+ */
+#define SYSLET_STOP_ON_NONZERO 0x00000008
+#define SYSLET_STOP_ON_ZERO 0x00000010
+#define SYSLET_STOP_ON_NEGATIVE 0x00000020
+#define SYSLET_STOP_ON_NON_POSITIVE 0x00000040
+
+#define SYSLET_STOP_MASK \
+ ( SYSLET_STOP_ON_NONZERO | \
+ SYSLET_STOP_ON_ZERO | \
+ SYSLET_STOP_ON_NEGATIVE | \
+ SYSLET_STOP_ON_NON_POSITIVE )
+
+/*
+ * Special modifier to 'stop' handling: instead of stopping the
+ * execution of the syslet, the linearly next atom is executed.
+ * (Normal execution flows along atom->next, and execution stops
+ * if atom->next is NULL or a stop condition becomes true.)
+ *
+ * This is what allows true branches of execution within syslets.
+ */
+#define SYSLET_SKIP_TO_NEXT_ON_STOP 0x00000080
+
+/*
+ * This is the (per-user-context) descriptor of the async completion
+ * ring. This gets passed in to sys_async_exec():
+ */
+struct async_head_user {
+ /*
+ * Current completion ring index - managed by the kernel:
+ */
+ u64 kernel_ring_idx;
+ /*
+ * User-side ring index:
+ */
+ u64 user_ring_idx;
+
+ /*
+ * Ring of pointers to completed async syslets - i.e. syslets that
+ * generated a cachemiss and went async, with sys_async_exec()
+ * returning NULL to the submitting user context.
+ * Syslets that were executed synchronously (cached) are not
+ * queued here.
+ *
+ * Note: the final atom that generated the exit condition is
+ * queued here. Normally this would be the last atom of a syslet.
+ */
+ u64 completion_ring_ptr;
+
+ /*
+ * Ring size in bytes:
+ */
+ u64 ring_size_bytes;
+
+ /*
+ * The head task can become a cachemiss thread later on
+ * too, if it blocks - so it needs its separate thread
+ * stack and start address too:
+ */
+ u64 head_stack;
+ u64 head_ip;
+
+ /*
+ * Newly started async kernel threads will take their
+ * user stack and user start address from here. User-space
+ * code has to check for new_thread_stack going to NULL
+ * and has to refill it with a new stack if that happens.
+ */
+ u64 new_thread_stack;
+ u64 new_thread_ip;
+};
+
+#endif
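
[ Illustration, not part of the patch: given the u64-based ABI above, a
  minimal user-space sketch for filling in a syslet atom could look like
  the following. The init_uatom()/u64_ptr() helper names are made up for
  illustration; the reference user-space code is the async-test code
  mentioned in the announcement. u32/u64 are taken to be the usual
  fixed-size integer types (e.g. typedef'd from uint32_t/uint64_t in
  user-space). ]

	#include <string.h>

	static u64 u64_ptr(void *p)
	{
		/* pointers are carried in u64 fields in this ABI: */
		return (u64)(unsigned long)p;
	}

	static void
	init_uatom(struct syslet_uatom *a, u32 nr,
		   void *arg0, void *arg1, void *arg2,
		   long *ret_ptr, u32 flags, struct syslet_uatom *next)
	{
		memset(a, 0, sizeof(*a));
		a->nr = nr;
		a->flags = flags;
		a->ret_ptr = u64_ptr(ret_ptr);
		a->next = u64_ptr(next);
		/*
		 * Unused arg_ptr[] slots stay NULL: copy_uatom() stops
		 * fetching syscall arguments at the first NULL pointer.
		 */
		a->arg_ptr[0] = u64_ptr(arg0);
		a->arg_ptr[1] = u64_ptr(arg1);
		a->arg_ptr[2] = u64_ptr(arg2);
	}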

2007-02-28 21:48:37

by Ingo Molnar

Subject: [patch 03/12] syslets: generic kernel bits

From: Ingo Molnar <[email protected]>

add the kernel generic bits - these are present even if !CONFIG_ASYNC_SUPPORT.

Signed-off-by: Ingo Molnar <[email protected]>
Signed-off-by: Arjan van de Ven <[email protected]>
---
fs/exec.c | 4 ++++
include/linux/sched.h | 23 ++++++++++++++++++++++-
kernel/capability.c | 3 +++
kernel/exit.c | 7 +++++++
kernel/fork.c | 5 +++++
kernel/sched.c | 9 +++++++++
kernel/sys.c | 36 ++++++++++++++++++++++++++++++++++++
7 files changed, 86 insertions(+), 1 deletion(-)

Index: linux/fs/exec.c
===================================================================
--- linux.orig/fs/exec.c
+++ linux/fs/exec.c
@@ -1444,6 +1444,10 @@ static int coredump_wait(int exit_code)
tsk->vfork_done = NULL;
complete(vfork_done);
}
+ /*
+ * Make sure we exit our async context before waiting:
+ */
+ async_exit(tsk);

if (core_waiters)
wait_for_completion(&startup_done);
Index: linux/include/linux/sched.h
===================================================================
--- linux.orig/include/linux/sched.h
+++ linux/include/linux/sched.h
@@ -83,12 +83,12 @@ struct sched_param {
#include <linux/timer.h>
#include <linux/hrtimer.h>
#include <linux/task_io_accounting.h>
+#include <linux/async.h>

#include <asm/processor.h>

struct exec_domain;
struct futex_pi_state;
-
/*
* List of flags we want to share for kernel threads,
* if only because they are not used by them anyway.
@@ -997,6 +997,12 @@ struct task_struct {
/* journalling filesystem info */
void *journal_info;

+/* async syscall support: */
+ struct async_thread *at, *async_ready;
+ struct async_head *ah;
+ struct async_thread __at;
+ struct async_head __ah;
+
/* VM state */
struct reclaim_state *reclaim_state;

@@ -1055,6 +1061,21 @@ struct task_struct {
#endif
};

+/*
+ * Is an async syscall being executed currently?
+ */
+#ifdef CONFIG_ASYNC_SUPPORT
+static inline int async_syscall(struct task_struct *t)
+{
+ return t->async_ready != NULL;
+}
+#else /* !CONFIG_ASYNC_SUPPORT */
+static inline int async_syscall(struct task_struct *t)
+{
+ return 0;
+}
+#endif /* !CONFIG_ASYNC_SUPPORT */
+
static inline pid_t process_group(struct task_struct *tsk)
{
return tsk->signal->pgrp;
Index: linux/kernel/capability.c
===================================================================
--- linux.orig/kernel/capability.c
+++ linux/kernel/capability.c
@@ -178,6 +178,9 @@ asmlinkage long sys_capset(cap_user_head
int ret;
pid_t pid;

+ if (async_syscall(current))
+ return -ENOSYS;
+
if (get_user(version, &header->version))
return -EFAULT;

Index: linux/kernel/exit.c
===================================================================
--- linux.orig/kernel/exit.c
+++ linux/kernel/exit.c
@@ -26,6 +26,7 @@
#include <linux/ptrace.h>
#include <linux/profile.h>
#include <linux/mount.h>
+#include <linux/async.h>
#include <linux/proc_fs.h>
#include <linux/mempolicy.h>
#include <linux/taskstats_kern.h>
@@ -890,6 +891,12 @@ fastcall NORET_TYPE void do_exit(long co
schedule();
}

+ /*
+ * Note: async threads have to exit their context before the MM
+ * exit (due to the coredumping wait):
+ */
+ async_exit(tsk);
+
tsk->flags |= PF_EXITING;

if (unlikely(in_atomic()))
Index: linux/kernel/fork.c
===================================================================
--- linux.orig/kernel/fork.c
+++ linux/kernel/fork.c
@@ -22,6 +22,7 @@
#include <linux/personality.h>
#include <linux/mempolicy.h>
#include <linux/sem.h>
+#include <linux/async.h>
#include <linux/file.h>
#include <linux/key.h>
#include <linux/binfmts.h>
@@ -1056,6 +1057,7 @@ static struct task_struct *copy_process(

p->lock_depth = -1; /* -1 = no lock */
do_posix_clock_monotonic_gettime(&p->start_time);
+ async_init(p);
p->security = NULL;
p->io_context = NULL;
p->io_wait = NULL;
@@ -1623,6 +1625,9 @@ asmlinkage long sys_unshare(unsigned lon
struct uts_namespace *uts, *new_uts = NULL;
struct ipc_namespace *ipc, *new_ipc = NULL;

+ if (async_syscall(current))
+ return -ENOSYS;
+
check_unshare_flags(&unshare_flags);

/* Return -EINVAL for all unsupported flags */
Index: linux/kernel/sched.c
===================================================================
--- linux.orig/kernel/sched.c
+++ linux/kernel/sched.c
@@ -38,6 +38,7 @@
#include <linux/vmalloc.h>
#include <linux/blkdev.h>
#include <linux/delay.h>
+#include <linux/async.h>
#include <linux/smp.h>
#include <linux/threads.h>
#include <linux/timer.h>
@@ -3455,6 +3456,14 @@ asmlinkage void __sched schedule(void)
}
profile_hit(SCHED_PROFILING, __builtin_return_address(0));

+ prev = current;
+ if (unlikely(prev->async_ready)) {
+ if (prev->state && !(preempt_count() & PREEMPT_ACTIVE) &&
+ (!(prev->state & TASK_INTERRUPTIBLE) ||
+ !signal_pending(prev)))
+ __async_schedule(prev);
+ }
+
need_resched:
preempt_disable();
prev = current;
Index: linux/kernel/sys.c
===================================================================
--- linux.orig/kernel/sys.c
+++ linux/kernel/sys.c
@@ -941,6 +941,9 @@ asmlinkage long sys_setregid(gid_t rgid,
int new_egid = old_egid;
int retval;

+ if (async_syscall(current))
+ return -ENOSYS;
+
retval = security_task_setgid(rgid, egid, (gid_t)-1, LSM_SETID_RE);
if (retval)
return retval;
@@ -987,6 +990,9 @@ asmlinkage long sys_setgid(gid_t gid)
int old_egid = current->egid;
int retval;

+ if (async_syscall(current))
+ return -ENOSYS;
+
retval = security_task_setgid(gid, (gid_t)-1, (gid_t)-1, LSM_SETID_ID);
if (retval)
return retval;
@@ -1057,6 +1063,9 @@ asmlinkage long sys_setreuid(uid_t ruid,
int old_ruid, old_euid, old_suid, new_ruid, new_euid;
int retval;

+ if (async_syscall(current))
+ return -ENOSYS;
+
retval = security_task_setuid(ruid, euid, (uid_t)-1, LSM_SETID_RE);
if (retval)
return retval;
@@ -1120,6 +1129,9 @@ asmlinkage long sys_setuid(uid_t uid)
int old_ruid, old_suid, new_suid;
int retval;

+ if (async_syscall(current))
+ return -ENOSYS;
+
retval = security_task_setuid(uid, (uid_t)-1, (uid_t)-1, LSM_SETID_ID);
if (retval)
return retval;
@@ -1160,6 +1172,9 @@ asmlinkage long sys_setresuid(uid_t ruid
int old_suid = current->suid;
int retval;

+ if (async_syscall(current))
+ return -ENOSYS;
+
retval = security_task_setuid(ruid, euid, suid, LSM_SETID_RES);
if (retval)
return retval;
@@ -1214,6 +1229,9 @@ asmlinkage long sys_setresgid(gid_t rgid
{
int retval;

+ if (async_syscall(current))
+ return -ENOSYS;
+
retval = security_task_setgid(rgid, egid, sgid, LSM_SETID_RES);
if (retval)
return retval;
@@ -1269,6 +1287,9 @@ asmlinkage long sys_setfsuid(uid_t uid)
{
int old_fsuid;

+ if (async_syscall(current))
+ return -ENOSYS;
+
old_fsuid = current->fsuid;
if (security_task_setuid(uid, (uid_t)-1, (uid_t)-1, LSM_SETID_FS))
return old_fsuid;
@@ -1298,6 +1319,9 @@ asmlinkage long sys_setfsgid(gid_t gid)
{
int old_fsgid;

+ if (async_syscall(current))
+ return -ENOSYS;
+
old_fsgid = current->fsgid;
if (security_task_setgid(gid, (gid_t)-1, (gid_t)-1, LSM_SETID_FS))
return old_fsgid;
@@ -1373,6 +1397,9 @@ asmlinkage long sys_setpgid(pid_t pid, p
struct task_struct *group_leader = current->group_leader;
int err = -EINVAL;

+ if (async_syscall(current))
+ return -ENOSYS;
+
if (!pid)
pid = group_leader->pid;
if (!pgid)
@@ -1496,6 +1523,9 @@ asmlinkage long sys_setsid(void)
pid_t session;
int err = -EPERM;

+ if (async_syscall(current))
+ return -ENOSYS;
+
write_lock_irq(&tasklist_lock);

/* Fail if I am already a session leader */
@@ -1739,6 +1769,9 @@ asmlinkage long sys_setgroups(int gidset
struct group_info *group_info;
int retval;

+ if (async_syscall(current))
+ return -ENOSYS;
+
if (!capable(CAP_SETGID))
return -EPERM;
if ((unsigned)gidsetsize > NGROUPS_MAX)
@@ -2080,6 +2113,9 @@ asmlinkage long sys_prctl(int option, un
{
long error;

+ if (async_syscall(current))
+ return -ENOSYS;
+
error = security_task_prctl(option, arg2, arg3, arg4, arg5);
if (error)
return error;

2007-02-28 21:48:54

by Ingo Molnar

Subject: [patch 05/12] syslets: core, documentation

From: Ingo Molnar <[email protected]>

Add Documentation/syslet-design.txt with a high-level description
of the syslet concepts.

Signed-off-by: Ingo Molnar <[email protected]>
Signed-off-by: Arjan van de Ven <[email protected]>
---
Documentation/syslet-design.txt | 137 ++++++++++++++++++++++++++++++++++++++++
1 file changed, 137 insertions(+)

Index: linux/Documentation/syslet-design.txt
===================================================================
--- /dev/null
+++ linux/Documentation/syslet-design.txt
@@ -0,0 +1,137 @@
+Syslets / asynchronous system calls
+===================================
+
+started by Ingo Molnar <[email protected]>
+
+Goal:
+-----
+
+The goal of the syslet subsystem is to allow user-space to execute
+arbitrary system calls asynchronously. It does so by allowing user-space
+to execute "syslets" which are small scriptlets that the kernel can execute
+both securely and asynchronously without having to exit to user-space.
+
+the core syslet concepts are:
+
+The Syslet Atom:
+----------------
+
+The syslet atom is a small, fixed-size (80 bytes) piece of
+user-space memory, which is the basic unit of execution within the syslet
+framework. A syslet atom represents a single system-call and its arguments.
+It also has condition flags attached to it that allow the
+construction of larger programs (syslets) from these atoms.
+
+Arguments to the system call are implemented via pointers to arguments.
+This not only increases the flexibility of syslet atoms (multiple syslets
+can share the same variable for example), but is also an optimization:
+copy_uatom() will only fetch syscall parameters up until the point it
+meets the first NULL pointer. 50% of all syscalls have 2 or less
+parameters (and 90% of all syscalls have 4 or less parameters).
+
+ [ Note: since the argument array is at the end of the atom, and the
+ kernel will not touch any argument beyond the first NULL one, atoms
+ might be packed more tightly. (the only special case exception to
+ this rule would be SKIP_TO_NEXT_ON_STOP atoms, where the kernel will
+ jump a full syslet_uatom number of bytes.) ]
+
+The Syslet:
+-----------
+
+A syslet is a program, represented by a graph of syslet atoms. The
+syslet atoms are chained to each other either via the atom->next pointer,
+or via the SYSLET_SKIP_TO_NEXT_ON_STOP flag.
+
+Running Syslets:
+----------------
+
+Syslets can be run via the sys_async_exec() system call, which takes
+the first atom of the syslet as an argument. The kernel does not need
+to be told about the other atoms - it will fetch them on the fly as
+execution goes forward.
+
+A syslet might either be executed 'cached', or it might generate a
+'cachemiss'.
+
+'Cached' syslet execution means that the whole syslet was executed
+without blocking. The system-call returns the submitted atom's address
+in this case.
+
+If a syslet blocks while the kernel executes a system-call embedded in
+one of its atoms, the kernel will keep working on that syscall in
+parallel, but it immediately returns to user-space with a NULL pointer,
+so the submitting task can submit other syslets.
+
+Completion of asynchronous syslets:
+-----------------------------------
+
+Completion of asynchronous syslets is done via the 'completion ring',
+which is a ringbuffer of syslet atom pointers in user-space memory,
+provided by user-space as an argument to the sys_async_exec() syscall.
+The kernel fills in the ringbuffer starting at index 0, and user-space
+must clear out these pointers. Once the kernel reaches the end of
+the ring it wraps back to index 0. The kernel will not overwrite
+non-NULL pointers (but will return an error), and thus user-space has
+to make sure it completes all events it asked for.
+
+Waiting for completions:
+------------------------
+
+Syslet completions can be waited for via the sys_async_wait()
+system call - which takes the number of events it should wait for as
+a parameter. This system call will also return if the number of
+pending events goes down to zero.
+
+Sample Hello World syslet code:
+
+--------------------------->
+/*
+ * Set up a syslet atom:
+ */
+static void
+init_atom(struct syslet_uatom *atom, int nr,
+ void *arg_ptr0, void *arg_ptr1, void *arg_ptr2,
+ void *arg_ptr3, void *arg_ptr4, void *arg_ptr5,
+ void *ret_ptr, unsigned long flags, struct syslet_uatom *next)
+{
+ atom->nr = nr;
+ atom->arg_ptr[0] = arg_ptr0;
+ atom->arg_ptr[1] = arg_ptr1;
+ atom->arg_ptr[2] = arg_ptr2;
+ atom->arg_ptr[3] = arg_ptr3;
+ atom->arg_ptr[4] = arg_ptr4;
+ atom->arg_ptr[5] = arg_ptr5;
+ atom->ret_ptr = ret_ptr;
+ atom->flags = flags;
+ atom->next = next;
+}
+
+int main(int argc, char *argv[])
+{
+ unsigned long int fd_out = 1; /* standard output */
+ char *buf = "Hello Syslet World!\n";
+ unsigned long size = strlen(buf);
+ struct syslet_uatom atom, *done;
+
+ async_head_init();
+
+ /*
+ * Simple syslet consisting of a single atom:
+ */
+ init_atom(&atom, __NR_sys_write, &fd_out, &buf, &size,
+ NULL, NULL, NULL, NULL, SYSLET_ASYNC, NULL);
+ done = sys_async_exec(&atom);
+ if (!done) {
+ sys_async_wait(1);
+ if (completion_ring[curr_ring_idx] == &atom) {
+ completion_ring[curr_ring_idx] = NULL;
+ printf("completed an async syslet atom!\n");
+ }
+ } else {
+ printf("completed an cached syslet atom!\n");
+ }
+
+ async_head_exit();
+
+ return 0;
+}
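
[ Illustration, not part of the patch: the completion-ring protocol
  described above (the kernel fills slots, refuses to overwrite non-NULL
  entries and wraps at the end; user-space clears consumed slots) could
  be consumed by a sketch like this. The drain_completions() and
  handle_completion() names are made up; 'ring' points at the ring that
  ahu->completion_ring_ptr was set to, and 'ring_slots' is its size in
  slots. ]

	static void
	drain_completions(struct async_head_user *ahu, u64 *ring,
			  unsigned long ring_slots)
	{
		unsigned long idx = ahu->user_ring_idx;

		while (ring[idx]) {
			struct syslet_uatom *done;

			done = (struct syslet_uatom *)(unsigned long)ring[idx];
			handle_completion(done);	/* application-specific */

			/* clear the slot so the kernel can reuse it: */
			ring[idx] = 0;
			if (++idx == ring_slots)
				idx = 0;
		}
		ahu->user_ring_idx = idx;
	}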

2007-02-28 21:49:24

by Ingo Molnar

Subject: [patch 04/12] syslets: core code

From: Ingo Molnar <[email protected]>

the core syslet / async system calls infrastructure code.

Is built only if CONFIG_ASYNC_SUPPORT is enabled.

Signed-off-by: Ingo Molnar <[email protected]>
Signed-off-by: Arjan van de Ven <[email protected]>
---
kernel/Makefile | 1
kernel/async.c | 989 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++
2 files changed, 990 insertions(+)

Index: linux/kernel/Makefile
===================================================================
--- linux.orig/kernel/Makefile
+++ linux/kernel/Makefile
@@ -10,6 +10,7 @@ obj-y = sched.o fork.o exec_domain.o
kthread.o wait.o kfifo.o sys_ni.o posix-cpu-timers.o mutex.o \
hrtimer.o rwsem.o latency.o nsproxy.o srcu.o

+obj-$(CONFIG_ASYNC_SUPPORT) += async.o
obj-$(CONFIG_STACKTRACE) += stacktrace.o
obj-y += time/
obj-$(CONFIG_DEBUG_MUTEXES) += mutex-debug.o
Index: linux/kernel/async.c
===================================================================
--- /dev/null
+++ linux/kernel/async.c
@@ -0,0 +1,989 @@
+/*
+ * kernel/async.c
+ *
+ * The syslet and threadlet subsystem - asynchronous syscall and
+ * user-space code execution support.
+ *
+ * Started by Ingo Molnar:
+ *
+ * Copyright (C) 2007 Red Hat, Inc., Ingo Molnar <[email protected]>
+ *
+ * This file is released under the GPLv2.
+ *
+ * This code implements asynchronous syscalls via 'syslets'.
+ *
+ * Syslets consist of a set of 'syslet atoms' which are residing
+ * purely in user-space memory and have no kernel-space resource
+ * attached to them. These atoms can be linked to each other via
+ * pointers. Besides the fundamental ability to execute system
+ * calls, syslet atoms can also implement branches, loops and
+ * arithmetic.
+ *
+ * Thus syslets can be used to build small autonomous programs that
+ * the kernel can execute purely from kernel-space, without having
+ * to return to any user-space context. Syslets can be run by any
+ * unprivileged user-space application - they are executed safely
+ * by the kernel.
+ *
+ * "Threadlets" are the user-space equivalent of syslets: small
+ * functions of execution that user-space attempts/expects to execute
+ * without scheduling. If the threadlet nevertheless blocks, the kernel
+ * creates a real thread from it, and that thread is put aside sleeping.
+ * The 'head' context (the context that never blocks) returns to the
+ * original function that called the threadlet. Once the sleeping thread
+ * wakes up again (after it got for whatever it was waiting - IO, timeout,
+ * etc.) the function continues executing asynchronously, as a thread.
+ * A user-space completion ring connects these asynchronous function calls
+ * back to the head context.
+ */
+#include <linux/syscalls.h>
+#include <linux/syslet.h>
+#include <linux/delay.h>
+#include <linux/async.h>
+#include <linux/sched.h>
+#include <linux/init.h>
+#include <linux/err.h>
+
+#include <asm/uaccess.h>
+#include <asm/unistd.h>
+
+/*
+ * An async 'cachemiss context' is either busy, or it is ready.
+ * If it is ready, the 'head' might switch its user-space context
+ * to that ready thread anytime - so that if the ex-head blocks,
+ * one ready thread can become the next head and can continue to
+ * execute user-space code.
+ */
+static void
+__mark_async_thread_ready(struct async_thread *at, struct async_head *ah)
+{
+ list_del(&at->entry);
+ list_add_tail(&at->entry, &ah->ready_async_threads);
+ if (list_empty(&ah->busy_async_threads))
+ wake_up(&ah->wait);
+}
+
+static void
+mark_async_thread_ready(struct async_thread *at, struct async_head *ah)
+{
+ spin_lock(&ah->lock);
+ __mark_async_thread_ready(at, ah);
+ spin_unlock(&ah->lock);
+}
+
+static void
+__mark_async_thread_busy(struct async_thread *at, struct async_head *ah)
+{
+ list_del(&at->entry);
+ list_add_tail(&at->entry, &ah->busy_async_threads);
+}
+
+static void
+mark_async_thread_busy(struct async_thread *at, struct async_head *ah)
+{
+ spin_lock(&ah->lock);
+ __mark_async_thread_busy(at, ah);
+ spin_unlock(&ah->lock);
+}
+
+static void
+__async_thread_init(struct task_struct *t, struct async_thread *at,
+ struct async_head *ah)
+{
+ INIT_LIST_HEAD(&at->entry);
+ at->exit = 0;
+ at->task = t;
+ at->ah = ah;
+
+ t->at = at;
+}
+
+static void
+async_thread_init(struct task_struct *t, struct async_thread *at,
+ struct async_head *ah)
+{
+ spin_lock(&ah->lock);
+ __async_thread_init(t, at, ah);
+ __mark_async_thread_ready(at, ah);
+ spin_unlock(&ah->lock);
+}
+
+static void
+async_thread_exit(struct async_thread *at, struct task_struct *t)
+{
+ struct async_head *ah = at->ah;
+
+ spin_lock(&ah->lock);
+ list_del_init(&at->entry);
+ if (at->exit)
+ complete(&ah->exit_done);
+ t->at = NULL;
+ at->task = NULL;
+ spin_unlock(&ah->lock);
+}
+
+static struct async_thread *
+pick_ready_cachemiss_thread(struct async_head *ah)
+{
+ struct list_head *head = &ah->ready_async_threads;
+
+ if (list_empty(head))
+ return NULL;
+
+ return list_entry(head->next, struct async_thread, entry);
+}
+
+void __async_schedule(struct task_struct *t)
+{
+ struct async_thread *new_async_thread;
+ struct async_thread *async_ready;
+ struct async_head *ah = t->ah;
+ struct task_struct *new_task;
+
+ WARN_ON(!ah);
+ spin_lock(&ah->lock);
+
+ new_async_thread = pick_ready_cachemiss_thread(ah);
+ if (!new_async_thread)
+ goto out_unlock;
+
+ async_ready = t->async_ready;
+ WARN_ON(!async_ready);
+ t->async_ready = NULL;
+
+ new_task = new_async_thread->task;
+
+ move_user_context(new_task, t);
+ if (ah->restore_stack) {
+ set_task_stack_reg(new_task, ah->restore_stack);
+ WARN_ON(!ah->restore_ip);
+ task_ip_reg(new_task) = ah->restore_ip;
+ /*
+ * The return code 0 is needed to tell the
+ * head user-context that the threadlet went async:
+ */
+ task_ret_reg(new_task) = 0;
+ }
+
+ new_task->at = NULL;
+ t->ah = NULL;
+ new_task->ah = ah;
+ ah->user_task = new_task;
+
+ wake_up_process(new_task);
+
+ __async_thread_init(t, async_ready, ah);
+ __mark_async_thread_busy(t->at, ah);
+
+ out_unlock:
+ spin_unlock(&ah->lock);
+}
+
+static void async_schedule(struct task_struct *t)
+{
+ if (t->async_ready)
+ __async_schedule(t);
+}
+
+static long __exec_atom(struct task_struct *t, struct syslet_atom *atom)
+{
+ struct async_thread *async_ready_save;
+ long ret;
+
+ /*
+ * If user-space expects the syscall to schedule then
+ * (try to) switch user-space to another thread straight
+ * away and execute the syscall asynchronously:
+ */
+ if (unlikely(atom->flags & SYSLET_ASYNC))
+ async_schedule(t);
+ /*
+ * Does user-space want synchronous execution for this atom?:
+ */
+ async_ready_save = t->async_ready;
+ if (unlikely(atom->flags & SYSLET_SYNC))
+ t->async_ready = NULL;
+
+ if (unlikely(atom->nr >= atom->nr_syscalls))
+ return -ENOSYS;
+
+ ret = atom->call_table[atom->nr](atom->args[0], atom->args[1],
+ atom->args[2], atom->args[3],
+ atom->args[4], atom->args[5]);
+
+ if (atom->ret_ptr && put_user(ret, atom->ret_ptr))
+ return -EFAULT;
+
+ if (t->ah)
+ t->async_ready = async_ready_save;
+
+ return ret;
+}
+
+/*
+ * Arithmetic syscall: add a value to a user-space memory location.
+ *
+ * Generic C version - in case the architecture has not implemented it
+ * in assembly.
+ */
+asmlinkage __attribute__((weak)) long
+sys_umem_add(unsigned long __user *uptr, unsigned long inc)
+{
+ unsigned long val, new_val;
+
+ if (get_user(val, uptr))
+ return -EFAULT;
+ /*
+ * inc == 0 means 'read memory value':
+ */
+ if (!inc)
+ return val;
+
+ new_val = val + inc;
+ if (__put_user(new_val, uptr))
+ return -EFAULT;
+
+ return new_val;
+}
+
+/*
+ * Open-coded because this is a very hot codepath during syslet
+ * execution and every cycle counts ...
+ *
+ * [ NOTE: it's an explicit fastcall because optimized assembly code
+ * might depend on this. There are some kernels that disable regparm,
+ * so let's not break those if possible. ]
+ */
+fastcall __attribute__((weak)) long
+copy_uatom(struct syslet_atom *atom, struct syslet_uatom __user *uatom)
+{
+ unsigned long __user *arg_ptr;
+ long ret = 0;
+
+ if (!access_ok(VERIFY_READ, uatom, sizeof(*uatom)))
+ return -EFAULT;
+
+ ret = __get_user(atom->nr, &uatom->nr);
+ ret |= __get_user(atom->ret_ptr, (long __user **)&uatom->ret_ptr);
+ ret |= __get_user(atom->flags, (unsigned long __user *)&uatom->flags);
+ ret |= __get_user(atom->next,
+ (struct syslet_uatom __user **)&uatom->next);
+
+ memset(atom->args, 0, sizeof(atom->args));
+
+ ret |= __get_user(arg_ptr, (unsigned long __user **)&uatom->arg_ptr[0]);
+ if (!arg_ptr)
+ return ret;
+ if (!access_ok(VERIFY_READ, arg_ptr, sizeof(*arg_ptr)))
+ return -EFAULT;
+ ret |= __get_user(atom->args[0], arg_ptr);
+
+ ret |= __get_user(arg_ptr, (unsigned long __user **)&uatom->arg_ptr[1]);
+ if (!arg_ptr)
+ return ret;
+ if (!access_ok(VERIFY_READ, arg_ptr, sizeof(*arg_ptr)))
+ return -EFAULT;
+ ret |= __get_user(atom->args[1], arg_ptr);
+
+ ret |= __get_user(arg_ptr, (unsigned long __user **)&uatom->arg_ptr[2]);
+ if (!arg_ptr)
+ return ret;
+ if (!access_ok(VERIFY_READ, arg_ptr, sizeof(*arg_ptr)))
+ return -EFAULT;
+ ret |= __get_user(atom->args[2], arg_ptr);
+
+ ret |= __get_user(arg_ptr, (unsigned long __user **)&uatom->arg_ptr[3]);
+ if (!arg_ptr)
+ return ret;
+ if (!access_ok(VERIFY_READ, arg_ptr, sizeof(*arg_ptr)))
+ return -EFAULT;
+ ret |= __get_user(atom->args[3], arg_ptr);
+
+ ret |= __get_user(arg_ptr, (unsigned long __user **)&uatom->arg_ptr[4]);
+ if (!arg_ptr)
+ return ret;
+ if (!access_ok(VERIFY_READ, arg_ptr, sizeof(*arg_ptr)))
+ return -EFAULT;
+ ret |= __get_user(atom->args[4], arg_ptr);
+
+ ret |= __get_user(arg_ptr, (unsigned long __user **)&uatom->arg_ptr[5]);
+ if (!arg_ptr)
+ return ret;
+ if (!access_ok(VERIFY_READ, arg_ptr, sizeof(*arg_ptr)))
+ return -EFAULT;
+ ret |= __get_user(atom->args[5], arg_ptr);
+
+ return ret;
+}
+
+/*
+ * Should the next atom run, depending on the return value of
+ * the current atom - or should we stop execution?
+ */
+static int run_next_atom(struct syslet_atom *atom, long ret)
+{
+ switch (atom->flags & SYSLET_STOP_MASK) {
+ case SYSLET_STOP_ON_NONZERO:
+ if (!ret)
+ return 1;
+ return 0;
+ case SYSLET_STOP_ON_ZERO:
+ if (ret)
+ return 1;
+ return 0;
+ case SYSLET_STOP_ON_NEGATIVE:
+ if (ret >= 0)
+ return 1;
+ return 0;
+ case SYSLET_STOP_ON_NON_POSITIVE:
+ if (ret > 0)
+ return 1;
+ return 0;
+ }
+ return 1;
+}
+
+static struct syslet_uatom __user *
+next_uatom(struct syslet_atom *atom, struct syslet_uatom *uatom, long ret)
+{
+ /*
+ * If the stop condition is false then continue
+ * to atom->next:
+ */
+ if (run_next_atom(atom, ret))
+ return atom->next;
+ /*
+ * Special-case: if the stop condition is true and the atom
+ * has SKIP_TO_NEXT_ON_STOP set, then instead of
+ * stopping we skip to the atom directly after this atom
+ * (in linear address-space).
+ *
+ * This, combined with the atom->next pointer and the
+ * stop condition flags is what allows true branches and
+ * loops in syslets:
+ */
+ if (atom->flags & SYSLET_SKIP_TO_NEXT_ON_STOP)
+ return uatom + 1;
+
+ return NULL;
+}
+
+/*
+ * If user-space requested a completion event then put the last
+ * executed uatom into the completion ring:
+ */
+static long
+completion_event(struct async_head *ah, struct task_struct *t,
+ void __user *event, struct async_head_user __user *ahu)
+{
+ unsigned long ring_size_bytes, max_ring_idx, kernel_ring_idx;
+ struct syslet_uatom __user *slot_val = NULL;
+ u64 __user *completion_ring, *ring_slot;
+
+ WARN_ON(!t->at);
+ WARN_ON(t->ah);
+
+ if (!access_ok(VERIFY_WRITE, ahu, sizeof(*ahu)))
+ return -EFAULT;
+
+ if (__get_user(completion_ring,
+ (u64 __user **)&ahu->completion_ring_ptr))
+ return -EFAULT;
+ if (__get_user(ring_size_bytes,
+ (unsigned long __user *)&ahu->ring_size_bytes))
+ return -EFAULT;
+ if (!ring_size_bytes)
+ return -EINVAL;
+
+ max_ring_idx = ring_size_bytes / sizeof(u64);
+ if (ring_size_bytes != max_ring_idx * sizeof(u64))
+ return -EINVAL;
+ /*
+ * We pre-check the ring pointer, so that in the fastpath
+ * we can use __get_user():
+ */
+ if (!access_ok(VERIFY_WRITE, completion_ring, ring_size_bytes))
+ return -EFAULT;
+
+ mutex_lock(&ah->completion_lock);
+ /*
+ * Asynchronous threads can complete in parallel, so use the
+ * head-lock to serialize:
+ */
+ if (__get_user(kernel_ring_idx,
+ (unsigned long __user *)&ahu->kernel_ring_idx))
+ goto fault_unlock;
+ if (kernel_ring_idx >= max_ring_idx)
+ goto err_unlock;
+
+ ring_slot = completion_ring + kernel_ring_idx;
+ if (__get_user(slot_val, (struct syslet_uatom __user **)ring_slot))
+ goto fault_unlock;
+ /*
+ * User-space submitted more work than what fits into the
+ * completion ring - do not stomp over it silently and signal
+ * the error condition:
+ */
+ if (slot_val)
+ goto err_unlock;
+
+ slot_val = event;
+ if (__put_user(slot_val, (struct syslet_uatom __user **)ring_slot))
+ goto fault_unlock;
+ /*
+ * Update the ring index:
+ */
+ kernel_ring_idx++;
+ if (kernel_ring_idx == max_ring_idx)
+ kernel_ring_idx = 0;
+
+ if (__put_user(kernel_ring_idx, &ahu->kernel_ring_idx))
+ goto fault_unlock;
+
+ /*
+ * See whether the async-head is waiting and needs a wakeup:
+ */
+ if (ah->events_left) {
+ if (!--ah->events_left) {
+ /*
+ * We first unlock the mutex - to reduce the size
+ * of the critical section. We have a safe
+ * reference to 'ah':
+ */
+ mutex_unlock(&ah->completion_lock);
+ wake_up(&ah->wait);
+ goto out;
+ }
+ }
+
+ mutex_unlock(&ah->completion_lock);
+ out:
+ return 0;
+
+ fault_unlock:
+ mutex_unlock(&ah->completion_lock);
+
+ return -EFAULT;
+
+ err_unlock:
+ mutex_unlock(&ah->completion_lock);
+
+ return -EINVAL;
+}
+
+/*
+ * This is the main syslet atom execution loop. This fetches atoms
+ * and executes them until it runs out of atoms or until the
+ * exit condition becomes false:
+ */
+static struct syslet_uatom __user *
+exec_atom(struct async_head *ah, struct task_struct *t,
+ struct syslet_uatom __user *uatom,
+ struct async_head_user __user *ahu,
+ syscall_fn_t *call_table,
+ unsigned int nr_syscalls)
+{
+ struct syslet_uatom __user *last_uatom;
+ struct syslet_atom atom;
+ long ret;
+
+ atom.call_table = call_table;
+ atom.nr_syscalls = nr_syscalls;
+
+ run_next:
+ if (unlikely(copy_uatom(&atom, uatom)))
+ return ERR_PTR(-EFAULT);
+
+ last_uatom = uatom;
+ ret = __exec_atom(t, &atom);
+ if (unlikely(signal_pending(t)))
+ goto stop;
+ if (need_resched())
+ cond_resched();
+
+ uatom = next_uatom(&atom, uatom, ret);
+ if (uatom)
+ goto run_next;
+ stop:
+ /*
+ * We do completion only in async context:
+ */
+ if (t->at && !(atom.flags & SYSLET_NO_COMPLETE)) {
+ if (completion_event(ah, t, last_uatom, ahu))
+ return ERR_PTR(-EFAULT);
+ }
+
+ return last_uatom;
+}
+
+static long
+cachemiss_loop(struct async_thread *at, struct async_head *ah,
+ struct task_struct *t)
+{
+ for (;;) {
+ mark_async_thread_busy(at, ah);
+ set_task_state(t, TASK_INTERRUPTIBLE);
+ if (unlikely(t->ah || at->exit || signal_pending(t)))
+ break;
+ mark_async_thread_ready(at, ah);
+ schedule();
+ }
+ t->state = TASK_RUNNING;
+
+ async_thread_exit(at, t);
+
+ if (at->exit)
+ do_exit(0);
+
+ if (!t->ah) {
+ /*
+ * Cachemiss threads return to one given
+ * user-space instruction address and stack
+ * pointer:
+ */
+ set_task_stack_reg(t, at->user_stack);
+ task_ip_reg(t) = at->user_ip;
+
+ return -1;
+ }
+ return 0;
+}
+
+/*
+ * This is what a newly created cachemiss thread executes for the
+ * first time: initialize, pick up the user stack/IP addresses from
+ * the head and then execute the cachemiss loop. If the cachemiss
+ * loop returns then we return back to user-space:
+ */
+static long cachemiss_thread(void *data)
+{
+ struct pt_regs *head_regs, *regs;
+ struct task_struct *t = current;
+ struct async_head *ah = data;
+ struct async_thread *at;
+ int ret;
+
+ at = &t->__at;
+ async_thread_init(t, at, ah);
+
+ /*
+ * Clone the head thread's user-space ptregs over,
+ * now that we are in kernel-space:
+ */
+ head_regs = task_pt_regs(ah->user_task);
+ regs = task_pt_regs(t);
+
+ *regs = *head_regs;
+ ret = get_user(at->user_stack, ah->new_stackp);
+ WARN_ON(ret);
+ /*
+ * Clear the stack pointer, signalling to user-space that
+ * this thread stack has been used up:
+ */
+ ret = put_user(0, ah->new_stackp);
+ WARN_ON(ret);
+
+ complete(&ah->start_done);
+
+ return cachemiss_loop(at, ah, t);
+}
+
+/**
+ * sys_async_thread - do work as an async cachemiss thread again
+ *
+ * @event: completion event
+ * @ahu: async head
+ *
+ * If an async thread has returned back to user-space (due to say
+ * a signal) then it is a 'busy' thread during that period. It
+ * can again offer itself into the cachemiss pool by calling this
+ * syscall:
+ */
+asmlinkage long
+sys_async_thread(void __user *event, struct async_head_user __user *ahu)
+{
+ struct task_struct *t = current;
+ struct async_thread *at = t->at;
+ struct async_head *ah = t->__at.ah;
+
+ /*
+ * Only async threads are allowed to do this:
+ */
+ if (!ah || t->ah)
+ return -EINVAL;
+
+ /*
+ * A threadlet might want to signal a completion event:
+ */
+ if (event) {
+ /*
+ * threadlet - make sure the stack is never used
+ * again by this thread:
+ */
+ set_task_stack_reg(t, 0x11111111);
+ task_ip_reg(t) = 0x22222222;
+
+ if (completion_event(ah, t, event, ahu))
+ return -EFAULT;
+ }
+ /*
+ * If a cachemiss threadlet calls sys_async_thread()
+ * then we first have to mark it ready:
+ */
+ if (at) {
+ mark_async_thread_ready(at, ah);
+ } else {
+ at = &t->__at;
+ WARN_ON(!at->ah);
+
+ async_thread_init(t, at, ah);
+ }
+
+ return cachemiss_loop(at, at->ah, t);
+}
+
+/*
+ * Initialize the in-kernel async head, based on the user-space async
+ * head:
+ */
+static long
+async_head_init(struct task_struct *t, struct async_head_user __user *ahu)
+{
+ struct async_head *ah;
+
+ ah = &t->__ah;
+
+ spin_lock_init(&ah->lock);
+ INIT_LIST_HEAD(&ah->ready_async_threads);
+ INIT_LIST_HEAD(&ah->busy_async_threads);
+ init_waitqueue_head(&ah->wait);
+ mutex_init(&ah->completion_lock);
+ ah->events_left = 0;
+ ah->ahu = NULL;
+ ah->new_stackp = NULL;
+ ah->new_ip = 0;
+ ah->restore_stack = 0;
+ ah->restore_ip = 0;
+ ah->user_task = t;
+ t->ah = ah;
+
+ return 0;
+}
+
+/*
+ * If the head cache-misses then it will become a cachemiss
+ * thread after having finished its current syslet. If it
+ * returns to user-space after that point (to handle a signal
+ * for example) then it will need a thread stack of its own:
+ */
+static long init_head(struct async_head *ah, struct task_struct *t,
+ struct async_head_user __user *ahu)
+{
+ unsigned long head_stack, head_ip;
+
+ if (get_user(head_stack, (unsigned long __user *)&ahu->head_stack))
+ return -EFAULT;
+ if (get_user(head_ip, (unsigned long __user *)&ahu->head_ip))
+ return -EFAULT;
+ t->__at.user_stack = head_stack;
+ t->__at.user_ip = head_ip;
+
+ return async_head_init(t, ahu);
+}
+
+/*
+ * Simple limit and pool management mechanism for now:
+ */
+static long
+refill_cachemiss_pool(struct async_head *ah, struct task_struct *t,
+ struct async_head_user __user *ahu)
+{
+ unsigned long new_ip;
+ int pid, ret;
+
+ init_completion(&ah->start_done);
+ ah->new_stackp = (unsigned long __user *)&ahu->new_thread_stack;
+ ret = get_user(new_ip, (unsigned long __user *)&ahu->new_thread_ip);
+ WARN_ON(ret);
+ ah->new_ip = new_ip;
+
+ pid = create_async_thread(cachemiss_thread, (void *)ah,
+ CLONE_VM | CLONE_FS | CLONE_FILES | CLONE_SIGHAND |
+ CLONE_THREAD | CLONE_SYSVSEM);
+ if (pid < 0)
+ return pid;
+
+ wait_for_completion(&ah->start_done);
+ ah->new_stackp = NULL;
+ ah->new_ip = 0;
+
+ return 0;
+}
+
+/**
+ * sys_async_exec - execute a syslet.
+ *
+ * returns the uatom that was last executed, if the kernel was able to
+ * execute the syslet synchronously, or NULL if the syslet became
+ * asynchronous. (in the latter case syslet completion will be notified
+ * via the completion ring)
+ *
+ * (Various errors might also be returned via the usual negative numbers.)
+ */
+static struct syslet_uatom __user *
+__sys_async_exec(struct syslet_uatom __user *uatom,
+ struct async_head_user __user *ahu,
+ syscall_fn_t *call_table,
+ unsigned int nr_syscalls)
+{
+ struct syslet_uatom __user *ret;
+ struct task_struct *t = current;
+ struct async_head *ah = t->ah;
+ struct async_thread *at = &t->__at;
+
+ /*
+ * Do not allow recursive calls of sys_async_exec():
+ */
+ if (async_syscall(t))
+ return ERR_PTR(-ENOSYS);
+
+ if (!uatom || !ahu || !ahu->new_thread_stack)
+ return ERR_PTR(-EINVAL);
+
+ if (unlikely(!ah)) {
+ ret = (void *)init_head(ah, t, ahu);
+ if (ret)
+ return ret;
+ ah = t->ah;
+ }
+
+ if (unlikely(list_empty(&ah->ready_async_threads))) {
+ ret = (void *)refill_cachemiss_pool(ah, t, ahu);
+ if (ret)
+ return ret;
+ }
+
+ t->async_ready = at;
+ ah->ahu = ahu;
+
+ ret = exec_atom(ah, t, uatom, ahu, call_table, nr_syscalls);
+
+ /*
+ * Are we still executing as head?
+ */
+ if (t->ah) {
+ t->async_ready = NULL;
+
+ return ret;
+ }
+
+ /*
+ * We got turned into a cachemiss thread,
+ * enter the cachemiss loop:
+ */
+ set_task_state(t, TASK_INTERRUPTIBLE);
+ mark_async_thread_ready(at, ah);
+
+ return ERR_PTR(cachemiss_loop(at, ah, t));
+}
+
+asmlinkage struct syslet_uatom __user *
+sys_async_exec(struct syslet_uatom __user *uatom,
+ struct async_head_user __user *ahu)
+{
+ return __sys_async_exec(uatom, ahu, sys_call_table, NR_syscalls);
+}
+
+#ifdef CONFIG_COMPAT
+
+asmlinkage struct syslet_uatom __user *
+compat_sys_async_exec(struct syslet_uatom __user *uatom,
+ struct async_head_user __user *ahu)
+{
+ return __sys_async_exec(uatom, ahu, compat_sys_call_table,
+ compat_NR_syscalls);
+}
+
+#endif
+
+/**
+ * sys_async_wait - wait for async completion events
+ *
+ * This syscall waits for @min_wait_events syslet completion events
+ * to finish or for all async processing to finish (whichever
+ * comes first).
+ */
+asmlinkage long
+sys_async_wait(unsigned long min_wait_events, unsigned long user_ring_idx,
+ struct async_head_user __user *ahu)
+{
+ struct task_struct *t = current;
+ struct async_head *ah = t->ah;
+ unsigned long kernel_ring_idx;
+
+ /*
+ * Do not allow async waiting:
+ */
+ if (async_syscall(t))
+ return -ENOSYS;
+ if (!ah)
+ return -EINVAL;
+
+ mutex_lock(&ah->completion_lock);
+ if (get_user(kernel_ring_idx,
+ (unsigned long __user *)&ahu->kernel_ring_idx))
+ goto err_unlock;
+ /*
+ * Account any completions that happened since user-space
+ * checked the ring:
+ */
+ ah->events_left = min_wait_events - (kernel_ring_idx - user_ring_idx);
+ mutex_unlock(&ah->completion_lock);
+
+ return wait_event_interruptible(ah->wait,
+ list_empty(&ah->busy_async_threads) || ah->events_left <= 0);
+
+ err_unlock:
+ mutex_unlock(&ah->completion_lock);
+ return -EFAULT;
+}
+
+asmlinkage long
+sys_threadlet_on(unsigned long restore_stack,
+ unsigned long restore_ip,
+ struct async_head_user __user *ahu)
+{
+ struct task_struct *t = current;
+ struct async_head *ah = t->ah;
+ struct async_thread *at = &t->__at;
+ long ret;
+
+ /*
+ * Do not allow recursive calls of sys_threadlet_on():
+ */
+ if (t->async_ready || t->at)
+ return -EINVAL;
+
+ if (unlikely(!ah)) {
+ ret = init_head(ah, t, ahu);
+ if (ret)
+ return ret;
+ ah = t->ah;
+ }
+
+ if (unlikely(list_empty(&ah->ready_async_threads))) {
+ ret = refill_cachemiss_pool(ah, t, ahu);
+ if (ret)
+ return ret;
+ }
+
+ t->async_ready = at;
+ ah->restore_stack = restore_stack;
+ ah->restore_ip = restore_ip;
+
+ ah->ahu = ahu;
+
+ return 0;
+}
+
+asmlinkage long sys_threadlet_off(void)
+{
+ struct task_struct *t = current;
+ struct async_head *ah = t->ah;
+
+ /*
+ * Are we still executing as head?
+ */
+ if (ah) {
+ t->async_ready = NULL;
+
+ return 1;
+ }
+
+ /*
+ * We got turned into a cachemiss thread,
+ * return to user-space, which can do
+ * the notification, etc:
+ */
+ return 0;
+}
+
+static void __notify_async_thread_exit(struct async_thread *at,
+ struct async_head *ah)
+{
+ list_del_init(&at->entry);
+ at->exit = 1;
+ init_completion(&ah->exit_done);
+ wake_up_process(at->task);
+}
+
+static void stop_cachemiss_threads(struct async_head *ah)
+{
+ struct async_thread *at;
+
+repeat:
+ spin_lock(&ah->lock);
+ list_for_each_entry(at, &ah->ready_async_threads, entry) {
+
+ __notify_async_thread_exit(at, ah);
+ spin_unlock(&ah->lock);
+
+ wait_for_completion(&ah->exit_done);
+
+ goto repeat;
+ }
+
+ list_for_each_entry(at, &ah->busy_async_threads, entry) {
+
+ __notify_async_thread_exit(at, ah);
+ spin_unlock(&ah->lock);
+
+ wait_for_completion(&ah->exit_done);
+
+ goto repeat;
+ }
+ spin_unlock(&ah->lock);
+}
+
+static void async_head_exit(struct async_head *ah, struct task_struct *t)
+{
+ stop_cachemiss_threads(ah);
+ WARN_ON(!list_empty(&ah->ready_async_threads));
+ WARN_ON(!list_empty(&ah->busy_async_threads));
+ WARN_ON(spin_is_locked(&ah->lock));
+
+ t->ah = NULL;
+}
+
+/*
+ * fork()-time initialization:
+ */
+void async_init(struct task_struct *t)
+{
+ t->at = NULL;
+ t->async_ready = NULL;
+ t->ah = NULL;
+ t->__at.ah = NULL;
+ t->__at.user_stack = 0;
+}
+
+/*
+ * do_exit()-time cleanup:
+ */
+void async_exit(struct task_struct *t)
+{
+ struct async_thread *at = t->at;
+ struct async_head *ah = t->ah;
+
+ /*
+ * If head does a sys_exit() then the final schedule() must
+ * not be passed on to another cachemiss thread:
+ */
+ t->async_ready = NULL;
+
+ if (unlikely(at))
+ async_thread_exit(at, t);
+
+ if (unlikely(ah))
+ async_head_exit(ah, t);
+}
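
[ Illustration, not part of the patch: the submission-side semantics of
  sys_async_exec() and sys_async_wait() above - a non-NULL return means
  the syslet ran 'cached', a NULL return means it went asynchronous and
  will show up in the completion ring - suggest a user-space loop along
  these lines. The sys_async_exec()/sys_async_wait() wrappers are assumed
  to be provided by the test code, and error returns are ignored for
  brevity; names are illustrative. ]

	static void
	submit_batch(struct syslet_uatom **atoms, int nr,
		     struct async_head_user *ahu)
	{
		long pending = 0;
		int i;

		for (i = 0; i < nr; i++) {
			struct syslet_uatom *done;

			done = sys_async_exec(atoms[i], ahu);
			if (!done) {
				/* went async - completion arrives via the ring */
				pending++;
			}
			/* else: ran synchronously, result is in *ret_ptr already */
		}
		if (pending)
			sys_async_wait(pending, ahu->user_ring_idx, ahu);
	}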

2007-02-28 21:49:28

by Ingo Molnar

Subject: [patch 08/12] syslets: x86, add move_user_context() method

From: Ingo Molnar <[email protected]>

add the move_user_context() method to move the user-space
context of one kernel thread to another kernel thread.
User-space might notice the changed TID, but execution,
stack and register contents (general purpose and FPU) are
still the same.

An architecture must implement this interface before it can turn
CONFIG_ASYNC_SUPPORT on.

Signed-off-by: Ingo Molnar <[email protected]>
Signed-off-by: Arjan van de Ven <[email protected]>
---
arch/i386/kernel/process.c | 21 +++++++++++++++++++++
include/asm-i386/system.h | 7 +++++++
2 files changed, 28 insertions(+)

Index: linux/arch/i386/kernel/process.c
===================================================================
--- linux.orig/arch/i386/kernel/process.c
+++ linux/arch/i386/kernel/process.c
@@ -839,6 +839,27 @@ unsigned long get_wchan(struct task_stru
}

/*
+ * Move user-space context from one kernel thread to another.
+ * This includes registers and FPU state. Callers must make
+ * sure that neither task is running user context at the moment:
+ */
+void
+move_user_context(struct task_struct *new_task, struct task_struct *old_task)
+{
+ struct pt_regs *old_regs = task_pt_regs(old_task);
+ struct pt_regs *new_regs = task_pt_regs(new_task);
+ union i387_union *tmp;
+
+ *new_regs = *old_regs;
+ /*
+ * Flip around the FPU state too:
+ */
+ tmp = new_task->thread.i387;
+ new_task->thread.i387 = old_task->thread.i387;
+ old_task->thread.i387 = tmp;
+}
+
+/*
* sys_alloc_thread_area: get a yet unused TLS descriptor index.
*/
static int get_free_idx(void)
Index: linux/include/asm-i386/system.h
===================================================================
--- linux.orig/include/asm-i386/system.h
+++ linux/include/asm-i386/system.h
@@ -33,6 +33,13 @@ extern struct task_struct * FASTCALL(__s
"2" (prev), "d" (next)); \
} while (0)

+/*
+ * Move user-space context from one kernel thread to another.
+ * This includes registers and FPU state for now:
+ */
+extern void
+move_user_context(struct task_struct *new_task, struct task_struct *old_task);
+
#define _set_base(addr,base) do { unsigned long __pr; \
__asm__ __volatile__ ("movw %%dx,%1\n\t" \
"rorl $16,%%edx\n\t" \

2007-02-28 21:49:52

by Ingo Molnar

Subject: [patch 10/12] syslets: x86: enable ASYNC_SUPPORT

From: Ingo Molnar <[email protected]>

enable CONFIG_ASYNC_SUPPORT on x86.

Signed-off-by: Ingo Molnar <[email protected]>
Signed-off-by: Arjan van de Ven <[email protected]>
---
arch/i386/Kconfig | 4 ++++
1 file changed, 4 insertions(+)

Index: linux/arch/i386/Kconfig
===================================================================
--- linux.orig/arch/i386/Kconfig
+++ linux/arch/i386/Kconfig
@@ -55,6 +55,10 @@ config ZONE_DMA
bool
default y

+config ASYNC_SUPPORT
+ bool
+ default y
+
config SBUS
bool

2007-02-28 21:50:18

by Ingo Molnar

Subject: [patch 07/12] syslets: x86, add create_async_thread() method

From: Ingo Molnar <[email protected]>

add the create_async_thread() way of creating kernel threads:
these threads first execute a kernel function, and when they
return from it they continue execution in user-space.

An architecture must implement this interface before it can turn
CONFIG_ASYNC_SUPPORT on.

Signed-off-by: Ingo Molnar <[email protected]>
Signed-off-by: Arjan van de Ven <[email protected]>
---
arch/i386/kernel/entry.S | 25 +++++++++++++++++++++++++
arch/i386/kernel/process.c | 31 +++++++++++++++++++++++++++++++
include/asm-i386/processor.h | 17 +++++++++++++++++
include/asm-i386/unistd.h | 10 ++++++++++
4 files changed, 83 insertions(+)

Index: linux/arch/i386/kernel/entry.S
===================================================================
--- linux.orig/arch/i386/kernel/entry.S
+++ linux/arch/i386/kernel/entry.S
@@ -1034,6 +1034,31 @@ ENTRY(kernel_thread_helper)
CFI_ENDPROC
ENDPROC(kernel_thread_helper)

+ENTRY(async_thread_helper)
+ CFI_STARTPROC
+ /*
+ * Allocate space on the stack for pt-regs.
+ * sizeof(struct pt_regs) == 64, and we've got 8 bytes on the
+ * kernel stack already:
+ */
+ subl $64-8, %esp
+ CFI_ADJUST_CFA_OFFSET 64-8
+ movl %edx,%eax
+ push %edx
+ CFI_ADJUST_CFA_OFFSET 4
+ call *%ebx
+ addl $4, %esp
+ CFI_ADJUST_CFA_OFFSET -4
+
+ movl %eax, PT_EAX(%esp)
+
+ GET_THREAD_INFO(%ebp)
+
+ jmp syscall_exit
+ CFI_ENDPROC
+ENDPROC(async_thread_helper)
+
+
.section .rodata,"a"
#include "syscall_table.S"

Index: linux/arch/i386/kernel/process.c
===================================================================
--- linux.orig/arch/i386/kernel/process.c
+++ linux/arch/i386/kernel/process.c
@@ -355,6 +355,37 @@ int kernel_thread(int (*fn)(void *), voi
EXPORT_SYMBOL(kernel_thread);

/*
+ * This gets run with %ebx containing the
+ * function to call, and %edx containing
+ * the "args".
+ */
+extern void async_thread_helper(void);
+
+/*
+ * Create an async thread
+ */
+int create_async_thread(long (*fn)(void *), void * arg, unsigned long flags)
+{
+ struct pt_regs regs;
+
+ memset(&regs, 0, sizeof(regs));
+
+ regs.ebx = (unsigned long) fn;
+ regs.edx = (unsigned long) arg;
+
+ regs.xds = __USER_DS;
+ regs.xes = __USER_DS;
+ regs.xfs = __KERNEL_PDA;
+ regs.orig_eax = -1;
+ regs.eip = (unsigned long) async_thread_helper;
+ regs.xcs = __KERNEL_CS | get_kernel_rpl();
+ regs.eflags = X86_EFLAGS_IF | X86_EFLAGS_SF | X86_EFLAGS_PF | 0x2;
+
+ /* Ok, create the new task.. */
+ return do_fork(flags, 0, &regs, 0, NULL, NULL);
+}
+
+/*
* Free current thread data structures etc..
*/
void exit_thread(void)
Index: linux/include/asm-i386/processor.h
===================================================================
--- linux.orig/include/asm-i386/processor.h
+++ linux/include/asm-i386/processor.h
@@ -472,6 +472,11 @@ extern void prepare_to_copy(struct task_
*/
extern int kernel_thread(int (*fn)(void *), void * arg, unsigned long flags);

+/*
+ * create an async thread:
+ */
+extern int create_async_thread(long (*fn)(void *), void * arg, unsigned long flags);
+
extern unsigned long thread_saved_pc(struct task_struct *tsk);
void show_trace(struct task_struct *task, struct pt_regs *regs, unsigned long *stack);

@@ -504,6 +509,18 @@ unsigned long get_wchan(struct task_stru
#define KSTK_EIP(task) (task_pt_regs(task)->eip)
#define KSTK_ESP(task) (task_pt_regs(task)->esp)

+/*
+ * Register access methods for async syscall support.
+ *
+ * Note, task_stack_reg() must not be an lvalue, hence this macro:
+ */
+#define task_stack_reg(t) \
+ ({ unsigned long __esp = task_pt_regs(t)->esp; __esp; })
+#define set_task_stack_reg(t, new_stack) \
+ do { task_pt_regs(t)->esp = (new_stack); } while (0)
+#define task_ip_reg(t) task_pt_regs(t)->eip
+#define task_ret_reg(t) task_pt_regs(t)->eax
+

struct microcode_header {
unsigned int hdrver;
Index: linux/include/asm-i386/unistd.h
===================================================================
--- linux.orig/include/asm-i386/unistd.h
+++ linux/include/asm-i386/unistd.h
@@ -1,6 +1,8 @@
#ifndef _ASM_I386_UNISTD_H_
#define _ASM_I386_UNISTD_H_

+#include <linux/linkage.h>
+
/*
* This file contains the system call numbers.
*/
@@ -330,6 +332,14 @@

#define NR_syscalls 320

+#ifndef __ASSEMBLY__
+
+typedef asmlinkage long (*syscall_fn_t)(long, long, long, long, long, long);
+
+extern syscall_fn_t sys_call_table[NR_syscalls];
+
+#endif
+
#define __ARCH_WANT_IPC_PARSE_VERSION
#define __ARCH_WANT_OLD_READDIR
#define __ARCH_WANT_OLD_STAT

2007-02-28 21:50:38

by Ingo Molnar

Subject: [patch 11/12] syslets: x86, wire up the syslet system calls

From: Ingo Molnar <[email protected]>

wire up the new syslet / async system calls and thus make them
available to user-space.

Signed-off-by: Ingo Molnar <[email protected]>
Signed-off-by: Arjan van de Ven <[email protected]>
---
arch/i386/kernel/syscall_table.S | 6 ++++++
include/asm-i386/unistd.h | 8 +++++++-
2 files changed, 13 insertions(+), 1 deletion(-)

Index: linux/arch/i386/kernel/syscall_table.S
===================================================================
--- linux.orig/arch/i386/kernel/syscall_table.S
+++ linux/arch/i386/kernel/syscall_table.S
@@ -319,3 +319,9 @@ ENTRY(sys_call_table)
.long sys_move_pages
.long sys_getcpu
.long sys_epoll_pwait
+ .long sys_async_exec /* 320 */
+ .long sys_async_wait
+ .long sys_umem_add
+ .long sys_async_thread
+ .long sys_threadlet_on
+ .long sys_threadlet_off /* 325 */
Index: linux/include/asm-i386/unistd.h
===================================================================
--- linux.orig/include/asm-i386/unistd.h
+++ linux/include/asm-i386/unistd.h
@@ -327,10 +327,16 @@
#define __NR_move_pages 317
#define __NR_getcpu 318
#define __NR_epoll_pwait 319
+#define __NR_async_exec 320
+#define __NR_async_wait 321
+#define __NR_umem_add 322
+#define __NR_async_thread 323
+#define __NR_threadlet_on 324
+#define __NR_threadlet_off 325

#ifdef __KERNEL__

-#define NR_syscalls 320
+#define NR_syscalls 326

#ifndef __ASSEMBLY__

2007-02-28 21:50:18

by Ingo Molnar

[permalink] [raw]
Subject: [patch 09/12] syslets: x86, mark async unsafe syscalls

From: Ingo Molnar <[email protected]>

mark clone(), fork(), iopl(), ioperm(), modify_ldt() and the vm86 calls
as not available for async execution. They all need an intact user-space
context beneath them to work.
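
The guard is purely mechanical. As an illustration only (sys_vfork is
not touched by this patch), the same pattern on another ptregs-based
call would look like:

asmlinkage int sys_vfork(struct pt_regs regs)
{
        /* async threads have no user-space context to fork from: */
        if (async_syscall(current))
                return -ENOSYS;

        return do_fork(CLONE_VFORK | CLONE_VM | SIGCHLD, regs.esp, &regs, 0, NULL, NULL);
}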

Signed-off-by: Ingo Molnar <[email protected]>
Signed-off-by: Arjan van de Ven <[email protected]>
---
arch/i386/kernel/ioport.c | 6 ++++++
arch/i386/kernel/ldt.c | 3 +++
arch/i386/kernel/process.c | 6 ++++++
arch/i386/kernel/vm86.c | 6 ++++++
4 files changed, 21 insertions(+)

Index: linux/arch/i386/kernel/ioport.c
===================================================================
--- linux.orig/arch/i386/kernel/ioport.c
+++ linux/arch/i386/kernel/ioport.c
@@ -62,6 +62,9 @@ asmlinkage long sys_ioperm(unsigned long
struct tss_struct * tss;
unsigned long *bitmap;

+ if (async_syscall(current))
+ return -ENOSYS;
+
if ((from + num <= from) || (from + num > IO_BITMAP_BITS))
return -EINVAL;
if (turn_on && !capable(CAP_SYS_RAWIO))
@@ -139,6 +142,9 @@ asmlinkage long sys_iopl(unsigned long u
unsigned int old = (regs->eflags >> 12) & 3;
struct thread_struct *t = &current->thread;

+ if (async_syscall(current))
+ return -ENOSYS;
+
if (level > 3)
return -EINVAL;
/* Trying to gain more privileges? */
Index: linux/arch/i386/kernel/ldt.c
===================================================================
--- linux.orig/arch/i386/kernel/ldt.c
+++ linux/arch/i386/kernel/ldt.c
@@ -233,6 +233,9 @@ asmlinkage int sys_modify_ldt(int func,
{
int ret = -ENOSYS;

+ if (async_syscall(current))
+ return -ENOSYS;
+
switch (func) {
case 0:
ret = read_ldt(ptr, bytecount);
Index: linux/arch/i386/kernel/process.c
===================================================================
--- linux.orig/arch/i386/kernel/process.c
+++ linux/arch/i386/kernel/process.c
@@ -750,6 +750,9 @@ struct task_struct fastcall * __switch_t

asmlinkage int sys_fork(struct pt_regs regs)
{
+ if (async_syscall(current))
+ return -ENOSYS;
+
return do_fork(SIGCHLD, regs.esp, &regs, 0, NULL, NULL);
}

@@ -759,6 +762,9 @@ asmlinkage int sys_clone(struct pt_regs
unsigned long newsp;
int __user *parent_tidptr, *child_tidptr;

+ if (async_syscall(current))
+ return -ENOSYS;
+
clone_flags = regs.ebx;
newsp = regs.ecx;
parent_tidptr = (int __user *)regs.edx;
Index: linux/arch/i386/kernel/vm86.c
===================================================================
--- linux.orig/arch/i386/kernel/vm86.c
+++ linux/arch/i386/kernel/vm86.c
@@ -209,6 +209,9 @@ asmlinkage int sys_vm86old(struct pt_reg
struct task_struct *tsk;
int tmp, ret = -EPERM;

+ if (async_syscall(current))
+ return -ENOSYS;
+
tsk = current;
if (tsk->thread.saved_esp0)
goto out;
@@ -239,6 +242,9 @@ asmlinkage int sys_vm86(struct pt_regs r
int tmp, ret;
struct vm86plus_struct __user *v86;

+ if (async_syscall(current))
+ return -ENOSYS;
+
tsk = current;
switch (regs.ebx) {
case VM86_REQUEST_IRQ:

2007-02-28 21:50:38

by Ingo Molnar

[permalink] [raw]
Subject: [patch 06/12] x86: split FPU state from task state

From: Arjan van de Ven <[email protected]>

Split the FPU save area from the task struct. This allows easy migration
of FPU context, and it's generally cleaner. It also allows the following
two (future) optimizations:

1) allocate the right size for the actual cpu rather than 512 bytes always
2) only allocate when the application actually uses the FPU, i.e. in the
first lazy FPU trap. This could save memory for apps that never touch the
FPU (a rough sketch follows below).
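
Illustrative only - a function like this is not part of the patch, it
just sketches what the lazy allocation in 2) could look like on top of
it:

/* allocate the i387 area on first FPU use (error handling simplified) */
static int lazy_alloc_i387(struct task_struct *tsk)
{
        if (tsk->thread.i387)
                return 0;

        tsk->thread.i387 = kmem_cache_alloc(task_i387_cachep, GFP_KERNEL);
        if (!tsk->thread.i387)
                return -ENOMEM;

        init_fpu(tsk);
        return 0;
}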

Signed-off-by: Arjan van de Ven <[email protected]>
Signed-off-by: Ingo Molnar <[email protected]>
---
arch/i386/kernel/i387.c | 96 ++++++++++++++++++++---------------------
arch/i386/kernel/process.c | 56 +++++++++++++++++++++++
arch/i386/kernel/traps.c | 10 ----
include/asm-i386/i387.h | 6 +-
include/asm-i386/processor.h | 6 ++
include/asm-i386/thread_info.h | 6 ++
kernel/fork.c | 7 ++
7 files changed, 123 insertions(+), 64 deletions(-)

Index: linux/arch/i386/kernel/i387.c
===================================================================
--- linux.orig/arch/i386/kernel/i387.c
+++ linux/arch/i386/kernel/i387.c
@@ -31,9 +31,9 @@ void mxcsr_feature_mask_init(void)
unsigned long mask = 0;
clts();
if (cpu_has_fxsr) {
- memset(&current->thread.i387.fxsave, 0, sizeof(struct i387_fxsave_struct));
- asm volatile("fxsave %0" : : "m" (current->thread.i387.fxsave));
- mask = current->thread.i387.fxsave.mxcsr_mask;
+ memset(&current->thread.i387->fxsave, 0, sizeof(struct i387_fxsave_struct));
+ asm volatile("fxsave %0" : : "m" (current->thread.i387->fxsave));
+ mask = current->thread.i387->fxsave.mxcsr_mask;
if (mask == 0) mask = 0x0000ffbf;
}
mxcsr_feature_mask &= mask;
@@ -49,16 +49,16 @@ void mxcsr_feature_mask_init(void)
void init_fpu(struct task_struct *tsk)
{
if (cpu_has_fxsr) {
- memset(&tsk->thread.i387.fxsave, 0, sizeof(struct i387_fxsave_struct));
- tsk->thread.i387.fxsave.cwd = 0x37f;
+ memset(&tsk->thread.i387->fxsave, 0, sizeof(struct i387_fxsave_struct));
+ tsk->thread.i387->fxsave.cwd = 0x37f;
if (cpu_has_xmm)
- tsk->thread.i387.fxsave.mxcsr = 0x1f80;
+ tsk->thread.i387->fxsave.mxcsr = 0x1f80;
} else {
- memset(&tsk->thread.i387.fsave, 0, sizeof(struct i387_fsave_struct));
- tsk->thread.i387.fsave.cwd = 0xffff037fu;
- tsk->thread.i387.fsave.swd = 0xffff0000u;
- tsk->thread.i387.fsave.twd = 0xffffffffu;
- tsk->thread.i387.fsave.fos = 0xffff0000u;
+ memset(&tsk->thread.i387->fsave, 0, sizeof(struct i387_fsave_struct));
+ tsk->thread.i387->fsave.cwd = 0xffff037fu;
+ tsk->thread.i387->fsave.swd = 0xffff0000u;
+ tsk->thread.i387->fsave.twd = 0xffffffffu;
+ tsk->thread.i387->fsave.fos = 0xffff0000u;
}
/* only the device not available exception or ptrace can call init_fpu */
set_stopped_child_used_math(tsk);
@@ -152,18 +152,18 @@ static inline unsigned long twd_fxsr_to_
unsigned short get_fpu_cwd( struct task_struct *tsk )
{
if ( cpu_has_fxsr ) {
- return tsk->thread.i387.fxsave.cwd;
+ return tsk->thread.i387->fxsave.cwd;
} else {
- return (unsigned short)tsk->thread.i387.fsave.cwd;
+ return (unsigned short)tsk->thread.i387->fsave.cwd;
}
}

unsigned short get_fpu_swd( struct task_struct *tsk )
{
if ( cpu_has_fxsr ) {
- return tsk->thread.i387.fxsave.swd;
+ return tsk->thread.i387->fxsave.swd;
} else {
- return (unsigned short)tsk->thread.i387.fsave.swd;
+ return (unsigned short)tsk->thread.i387->fsave.swd;
}
}

@@ -171,9 +171,9 @@ unsigned short get_fpu_swd( struct task_
unsigned short get_fpu_twd( struct task_struct *tsk )
{
if ( cpu_has_fxsr ) {
- return tsk->thread.i387.fxsave.twd;
+ return tsk->thread.i387->fxsave.twd;
} else {
- return (unsigned short)tsk->thread.i387.fsave.twd;
+ return (unsigned short)tsk->thread.i387->fsave.twd;
}
}
#endif /* 0 */
@@ -181,7 +181,7 @@ unsigned short get_fpu_twd( struct task_
unsigned short get_fpu_mxcsr( struct task_struct *tsk )
{
if ( cpu_has_xmm ) {
- return tsk->thread.i387.fxsave.mxcsr;
+ return tsk->thread.i387->fxsave.mxcsr;
} else {
return 0x1f80;
}
@@ -192,27 +192,27 @@ unsigned short get_fpu_mxcsr( struct tas
void set_fpu_cwd( struct task_struct *tsk, unsigned short cwd )
{
if ( cpu_has_fxsr ) {
- tsk->thread.i387.fxsave.cwd = cwd;
+ tsk->thread.i387->fxsave.cwd = cwd;
} else {
- tsk->thread.i387.fsave.cwd = ((long)cwd | 0xffff0000u);
+ tsk->thread.i387->fsave.cwd = ((long)cwd | 0xffff0000u);
}
}

void set_fpu_swd( struct task_struct *tsk, unsigned short swd )
{
if ( cpu_has_fxsr ) {
- tsk->thread.i387.fxsave.swd = swd;
+ tsk->thread.i387->fxsave.swd = swd;
} else {
- tsk->thread.i387.fsave.swd = ((long)swd | 0xffff0000u);
+ tsk->thread.i387->fsave.swd = ((long)swd | 0xffff0000u);
}
}

void set_fpu_twd( struct task_struct *tsk, unsigned short twd )
{
if ( cpu_has_fxsr ) {
- tsk->thread.i387.fxsave.twd = twd_i387_to_fxsr(twd);
+ tsk->thread.i387->fxsave.twd = twd_i387_to_fxsr(twd);
} else {
- tsk->thread.i387.fsave.twd = ((long)twd | 0xffff0000u);
+ tsk->thread.i387->fsave.twd = ((long)twd | 0xffff0000u);
}
}

@@ -298,8 +298,8 @@ static inline int save_i387_fsave( struc
struct task_struct *tsk = current;

unlazy_fpu( tsk );
- tsk->thread.i387.fsave.status = tsk->thread.i387.fsave.swd;
- if ( __copy_to_user( buf, &tsk->thread.i387.fsave,
+ tsk->thread.i387->fsave.status = tsk->thread.i387->fsave.swd;
+ if ( __copy_to_user( buf, &tsk->thread.i387->fsave,
sizeof(struct i387_fsave_struct) ) )
return -1;
return 1;
@@ -312,15 +312,15 @@ static int save_i387_fxsave( struct _fps

unlazy_fpu( tsk );

- if ( convert_fxsr_to_user( buf, &tsk->thread.i387.fxsave ) )
+ if ( convert_fxsr_to_user( buf, &tsk->thread.i387->fxsave ) )
return -1;

- err |= __put_user( tsk->thread.i387.fxsave.swd, &buf->status );
+ err |= __put_user( tsk->thread.i387->fxsave.swd, &buf->status );
err |= __put_user( X86_FXSR_MAGIC, &buf->magic );
if ( err )
return -1;

- if ( __copy_to_user( &buf->_fxsr_env[0], &tsk->thread.i387.fxsave,
+ if ( __copy_to_user( &buf->_fxsr_env[0], &tsk->thread.i387->fxsave,
sizeof(struct i387_fxsave_struct) ) )
return -1;
return 1;
@@ -343,7 +343,7 @@ int save_i387( struct _fpstate __user *b
return save_i387_fsave( buf );
}
} else {
- return save_i387_soft( &current->thread.i387.soft, buf );
+ return save_i387_soft( &current->thread.i387->soft, buf );
}
}

@@ -351,7 +351,7 @@ static inline int restore_i387_fsave( st
{
struct task_struct *tsk = current;
clear_fpu( tsk );
- return __copy_from_user( &tsk->thread.i387.fsave, buf,
+ return __copy_from_user( &tsk->thread.i387->fsave, buf,
sizeof(struct i387_fsave_struct) );
}

@@ -360,11 +360,11 @@ static int restore_i387_fxsave( struct _
int err;
struct task_struct *tsk = current;
clear_fpu( tsk );
- err = __copy_from_user( &tsk->thread.i387.fxsave, &buf->_fxsr_env[0],
+ err = __copy_from_user( &tsk->thread.i387->fxsave, &buf->_fxsr_env[0],
sizeof(struct i387_fxsave_struct) );
/* mxcsr reserved bits must be masked to zero for security reasons */
- tsk->thread.i387.fxsave.mxcsr &= mxcsr_feature_mask;
- return err ? 1 : convert_fxsr_from_user( &tsk->thread.i387.fxsave, buf );
+ tsk->thread.i387->fxsave.mxcsr &= mxcsr_feature_mask;
+ return err ? 1 : convert_fxsr_from_user( &tsk->thread.i387->fxsave, buf );
}

int restore_i387( struct _fpstate __user *buf )
@@ -378,7 +378,7 @@ int restore_i387( struct _fpstate __user
err = restore_i387_fsave( buf );
}
} else {
- err = restore_i387_soft( &current->thread.i387.soft, buf );
+ err = restore_i387_soft( &current->thread.i387->soft, buf );
}
set_used_math();
return err;
@@ -391,7 +391,7 @@ int restore_i387( struct _fpstate __user
static inline int get_fpregs_fsave( struct user_i387_struct __user *buf,
struct task_struct *tsk )
{
- return __copy_to_user( buf, &tsk->thread.i387.fsave,
+ return __copy_to_user( buf, &tsk->thread.i387->fsave,
sizeof(struct user_i387_struct) );
}

@@ -399,7 +399,7 @@ static inline int get_fpregs_fxsave( str
struct task_struct *tsk )
{
return convert_fxsr_to_user( (struct _fpstate __user *)buf,
- &tsk->thread.i387.fxsave );
+ &tsk->thread.i387->fxsave );
}

int get_fpregs( struct user_i387_struct __user *buf, struct task_struct *tsk )
@@ -411,7 +411,7 @@ int get_fpregs( struct user_i387_struct
return get_fpregs_fsave( buf, tsk );
}
} else {
- return save_i387_soft( &tsk->thread.i387.soft,
+ return save_i387_soft( &tsk->thread.i387->soft,
(struct _fpstate __user *)buf );
}
}
@@ -419,14 +419,14 @@ int get_fpregs( struct user_i387_struct
static inline int set_fpregs_fsave( struct task_struct *tsk,
struct user_i387_struct __user *buf )
{
- return __copy_from_user( &tsk->thread.i387.fsave, buf,
+ return __copy_from_user( &tsk->thread.i387->fsave, buf,
sizeof(struct user_i387_struct) );
}

static inline int set_fpregs_fxsave( struct task_struct *tsk,
struct user_i387_struct __user *buf )
{
- return convert_fxsr_from_user( &tsk->thread.i387.fxsave,
+ return convert_fxsr_from_user( &tsk->thread.i387->fxsave,
(struct _fpstate __user *)buf );
}

@@ -439,7 +439,7 @@ int set_fpregs( struct task_struct *tsk,
return set_fpregs_fsave( tsk, buf );
}
} else {
- return restore_i387_soft( &tsk->thread.i387.soft,
+ return restore_i387_soft( &tsk->thread.i387->soft,
(struct _fpstate __user *)buf );
}
}
@@ -447,7 +447,7 @@ int set_fpregs( struct task_struct *tsk,
int get_fpxregs( struct user_fxsr_struct __user *buf, struct task_struct *tsk )
{
if ( cpu_has_fxsr ) {
- if (__copy_to_user( buf, &tsk->thread.i387.fxsave,
+ if (__copy_to_user( buf, &tsk->thread.i387->fxsave,
sizeof(struct user_fxsr_struct) ))
return -EFAULT;
return 0;
@@ -461,11 +461,11 @@ int set_fpxregs( struct task_struct *tsk
int ret = 0;

if ( cpu_has_fxsr ) {
- if (__copy_from_user( &tsk->thread.i387.fxsave, buf,
+ if (__copy_from_user( &tsk->thread.i387->fxsave, buf,
sizeof(struct user_fxsr_struct) ))
ret = -EFAULT;
/* mxcsr reserved bits must be masked to zero for security reasons */
- tsk->thread.i387.fxsave.mxcsr &= mxcsr_feature_mask;
+ tsk->thread.i387->fxsave.mxcsr &= mxcsr_feature_mask;
} else {
ret = -EIO;
}
@@ -479,7 +479,7 @@ int set_fpxregs( struct task_struct *tsk
static inline void copy_fpu_fsave( struct task_struct *tsk,
struct user_i387_struct *fpu )
{
- memcpy( fpu, &tsk->thread.i387.fsave,
+ memcpy( fpu, &tsk->thread.i387->fsave,
sizeof(struct user_i387_struct) );
}

@@ -490,10 +490,10 @@ static inline void copy_fpu_fxsave( stru
unsigned short *from;
int i;

- memcpy( fpu, &tsk->thread.i387.fxsave, 7 * sizeof(long) );
+ memcpy( fpu, &tsk->thread.i387->fxsave, 7 * sizeof(long) );

to = (unsigned short *)&fpu->st_space[0];
- from = (unsigned short *)&tsk->thread.i387.fxsave.st_space[0];
+ from = (unsigned short *)&tsk->thread.i387->fxsave.st_space[0];
for ( i = 0 ; i < 8 ; i++, to += 5, from += 8 ) {
memcpy( to, from, 5 * sizeof(unsigned short) );
}
@@ -540,7 +540,7 @@ int dump_task_extended_fpu(struct task_s
if (fpvalid) {
if (tsk == current)
unlazy_fpu(tsk);
- memcpy(fpu, &tsk->thread.i387.fxsave, sizeof(*fpu));
+ memcpy(fpu, &tsk->thread.i387->fxsave, sizeof(*fpu));
}
return fpvalid;
}
Index: linux/arch/i386/kernel/process.c
===================================================================
--- linux.orig/arch/i386/kernel/process.c
+++ linux/arch/i386/kernel/process.c
@@ -648,7 +648,7 @@ struct task_struct fastcall * __switch_t

/* we're going to use this soon, after a few expensive things */
if (next_p->fpu_counter > 5)
- prefetch(&next->i387.fxsave);
+ prefetch(&next->i387->fxsave);

/*
* Reload esp0.
@@ -927,3 +927,57 @@ unsigned long arch_align_stack(unsigned
sp -= get_random_int() % 8192;
return sp & ~0xf;
}
+
+
+
+struct kmem_cache *task_struct_cachep;
+struct kmem_cache *task_i387_cachep;
+
+struct task_struct * alloc_task_struct(void)
+{
+ struct task_struct *tsk;
+ tsk = kmem_cache_alloc(task_struct_cachep, GFP_KERNEL);
+ if (!tsk)
+ return NULL;
+ tsk->thread.i387 = kmem_cache_alloc(task_i387_cachep, GFP_KERNEL);
+ if (!tsk->thread.i387)
+ goto error;
+ WARN_ON((unsigned long)tsk->thread.i387 & 15);
+ return tsk;
+
+error:
+ kfree(tsk);
+ return NULL;
+}
+
+void memcpy_task_struct(struct task_struct *dst, struct task_struct *src)
+{
+ union i387_union *ptr;
+ ptr = dst->thread.i387;
+ *dst = *src;
+ dst->thread.i387 = ptr;
+ memcpy(dst->thread.i387, src->thread.i387, sizeof(union i387_union));
+}
+
+void free_task_struct(struct task_struct *tsk)
+{
+ kmem_cache_free(task_i387_cachep, tsk->thread.i387);
+ tsk->thread.i387=NULL;
+ kmem_cache_free(task_struct_cachep, tsk);
+}
+
+
+void task_struct_slab_init(void)
+{
+ /* create a slab on which task_structs can be allocated */
+ task_struct_cachep =
+ kmem_cache_create("task_struct", sizeof(struct task_struct),
+ ARCH_MIN_TASKALIGN, SLAB_PANIC, NULL, NULL);
+ task_i387_cachep =
+ kmem_cache_create("task_i387", sizeof(union i387_union), 32,
+ SLAB_PANIC | SLAB_MUST_HWCACHE_ALIGN, NULL, NULL);
+}
+
+
+/* the very init task needs a static allocated i387 area */
+union i387_union init_i387_context;
Index: linux/arch/i386/kernel/traps.c
===================================================================
--- linux.orig/arch/i386/kernel/traps.c
+++ linux/arch/i386/kernel/traps.c
@@ -1157,16 +1157,6 @@ void __init trap_init(void)
set_trap_gate(19,&simd_coprocessor_error);

if (cpu_has_fxsr) {
- /*
- * Verify that the FXSAVE/FXRSTOR data will be 16-byte aligned.
- * Generates a compile-time "error: zero width for bit-field" if
- * the alignment is wrong.
- */
- struct fxsrAlignAssert {
- int _:!(offsetof(struct task_struct,
- thread.i387.fxsave) & 15);
- };
-
printk(KERN_INFO "Enabling fast FPU save and restore... ");
set_in_cr4(X86_CR4_OSFXSR);
printk("done.\n");
Index: linux/include/asm-i386/i387.h
===================================================================
--- linux.orig/include/asm-i386/i387.h
+++ linux/include/asm-i386/i387.h
@@ -34,7 +34,7 @@ extern void init_fpu(struct task_struct
"nop ; frstor %1", \
"fxrstor %1", \
X86_FEATURE_FXSR, \
- "m" ((tsk)->thread.i387.fxsave))
+ "m" ((tsk)->thread.i387->fxsave))

extern void kernel_fpu_begin(void);
#define kernel_fpu_end() do { stts(); preempt_enable(); } while(0)
@@ -60,8 +60,8 @@ static inline void __save_init_fpu( stru
"fxsave %[fx]\n"
"bt $7,%[fsw] ; jnc 1f ; fnclex\n1:",
X86_FEATURE_FXSR,
- [fx] "m" (tsk->thread.i387.fxsave),
- [fsw] "m" (tsk->thread.i387.fxsave.swd) : "memory");
+ [fx] "m" (tsk->thread.i387->fxsave),
+ [fsw] "m" (tsk->thread.i387->fxsave.swd) : "memory");
/* AMD K7/K8 CPUs don't save/restore FDP/FIP/FOP unless an exception
is pending. Clear the x87 state here by setting it to fixed
values. safe_address is a random variable that should be in L1 */
Index: linux/include/asm-i386/processor.h
===================================================================
--- linux.orig/include/asm-i386/processor.h
+++ linux/include/asm-i386/processor.h
@@ -407,7 +407,7 @@ struct thread_struct {
/* fault info */
unsigned long cr2, trap_no, error_code;
/* floating point info */
- union i387_union i387;
+ union i387_union *i387;
/* virtual 86 mode info */
struct vm86_struct __user * vm86_info;
unsigned long screen_bitmap;
@@ -420,11 +420,15 @@ struct thread_struct {
unsigned long io_bitmap_max;
};

+
+extern union i387_union init_i387_context;
+
#define INIT_THREAD { \
.vm86_info = NULL, \
.sysenter_cs = __KERNEL_CS, \
.io_bitmap_ptr = NULL, \
.fs = __KERNEL_PDA, \
+ .i387 = &init_i387_context, \
}

/*
Index: linux/include/asm-i386/thread_info.h
===================================================================
--- linux.orig/include/asm-i386/thread_info.h
+++ linux/include/asm-i386/thread_info.h
@@ -102,6 +102,12 @@ static inline struct thread_info *curren

#define free_thread_info(info) kfree(info)

+#define __HAVE_ARCH_TASK_STRUCT_ALLOCATOR
+extern struct task_struct * alloc_task_struct(void);
+extern void free_task_struct(struct task_struct *tsk);
+extern void memcpy_task_struct(struct task_struct *dst, struct task_struct *src);
+extern void task_struct_slab_init(void);
+
#else /* !__ASSEMBLY__ */

/* how to get the thread information struct from ASM */
Index: linux/kernel/fork.c
===================================================================
--- linux.orig/kernel/fork.c
+++ linux/kernel/fork.c
@@ -84,6 +84,8 @@ int nr_processes(void)
#ifndef __HAVE_ARCH_TASK_STRUCT_ALLOCATOR
# define alloc_task_struct() kmem_cache_alloc(task_struct_cachep, GFP_KERNEL)
# define free_task_struct(tsk) kmem_cache_free(task_struct_cachep, (tsk))
+# define memcpy_task_struct(dst, src) *dst = *src;
+
static struct kmem_cache *task_struct_cachep;
#endif

@@ -138,6 +140,8 @@ void __init fork_init(unsigned long memp
task_struct_cachep =
kmem_cache_create("task_struct", sizeof(struct task_struct),
ARCH_MIN_TASKALIGN, SLAB_PANIC, NULL, NULL);
+#else
+ task_struct_slab_init();
#endif

/*
@@ -176,7 +180,8 @@ static struct task_struct *dup_task_stru
return NULL;
}

- *tsk = *orig;
+ memcpy_task_struct(tsk, orig);
+
tsk->thread_info = ti;
setup_thread_stack(tsk, orig);

2007-02-28 21:51:54

by Ingo Molnar

[permalink] [raw]
Subject: [patch 12/12] syslets: x86_64: add syslet/threadlet support

From: Ingo Molnar <[email protected]>

add the arch specific bits of syslet/threadlet support to x86_64.

Signed-off-by: Ingo Molnar <[email protected]>
---
arch/x86_64/Kconfig | 4 ++
arch/x86_64/ia32/ia32entry.S | 20 ++++++++++-
arch/x86_64/kernel/entry.S | 72 ++++++++++++++++++++++++++++++++++++++++-
arch/x86_64/kernel/process.c | 11 ++++++
include/asm-x86_64/processor.h | 16 +++++++++
include/asm-x86_64/system.h | 12 ++++++
include/asm-x86_64/unistd.h | 29 +++++++++++++++-
7 files changed, 160 insertions(+), 4 deletions(-)

Index: linux/arch/x86_64/Kconfig
===================================================================
--- linux.orig/arch/x86_64/Kconfig
+++ linux/arch/x86_64/Kconfig
@@ -36,6 +36,10 @@ config ZONE_DMA32
bool
default y

+config ASYNC_SUPPORT
+ bool
+ default y
+
config LOCKDEP_SUPPORT
bool
default y
Index: linux/arch/x86_64/ia32/ia32entry.S
===================================================================
--- linux.orig/arch/x86_64/ia32/ia32entry.S
+++ linux/arch/x86_64/ia32/ia32entry.S
@@ -368,6 +368,14 @@ quiet_ni_syscall:
PTREGSCALL stub32_vfork, sys_vfork, %rdi
PTREGSCALL stub32_iopl, sys_iopl, %rsi
PTREGSCALL stub32_rt_sigsuspend, sys_rt_sigsuspend, %rdx
+ /*
+ * sys_async_thread() and sys_async_exec() both take 2 parameters,
+ * none of which is ptregs - but the syscalls rely on being able to
+ * modify ptregs. So we put ptregs into the 3rd parameter - so it's
+ * unused and it also does not mess up the first 2 parameters:
+ */
+ PTREGSCALL stub32_compat_async_exec, compat_sys_async_exec, %rdx
+ PTREGSCALL stub32_compat_async_thread, sys_async_thread, %rdx

ENTRY(ia32_ptregs_common)
popq %r11
@@ -394,6 +402,9 @@ END(ia32_ptregs_common)

.section .rodata,"a"
.align 8
+.globl compat_sys_call_table
+compat_sys_call_table:
+.globl ia32_sys_call_table
ia32_sys_call_table:
.quad sys_restart_syscall
.quad sys_exit
@@ -714,9 +725,16 @@ ia32_sys_call_table:
.quad compat_sys_get_robust_list
.quad sys_splice
.quad sys_sync_file_range
- .quad sys_tee
+ .quad sys_tee /* 315 */
.quad compat_sys_vmsplice
.quad compat_sys_move_pages
.quad sys_getcpu
.quad sys_epoll_pwait
+ .quad stub32_compat_async_exec /* 320 */
+ .quad sys_async_wait
+ .quad sys_umem_add
+ .quad stub32_compat_async_thread
+ .quad sys_threadlet_on
+ .quad sys_threadlet_off /* 325 */
+.globl ia32_syscall_end
ia32_syscall_end:
Index: linux/arch/x86_64/kernel/entry.S
===================================================================
--- linux.orig/arch/x86_64/kernel/entry.S
+++ linux/arch/x86_64/kernel/entry.S
@@ -410,6 +410,14 @@ END(\label)
PTREGSCALL stub_rt_sigsuspend, sys_rt_sigsuspend, %rdx
PTREGSCALL stub_sigaltstack, sys_sigaltstack, %rdx
PTREGSCALL stub_iopl, sys_iopl, %rsi
+ /*
+ * sys_async_thread() and sys_async_exec() both take 2 parameters,
+ * none of which is ptregs - but the syscalls rely on being able to
+ * modify ptregs. So we put ptregs into the 3rd parameter - so it's
+ * unused and it also does not mess up the first 2 parameters:
+ */
+ PTREGSCALL stub_async_thread, sys_async_thread, %rdx
+ PTREGSCALL stub_async_exec, sys_async_exec, %rdx

ENTRY(ptregscall_common)
popq %r11
@@ -430,7 +438,7 @@ ENTRY(ptregscall_common)
ret
CFI_ENDPROC
END(ptregscall_common)
-
+
ENTRY(stub_execve)
CFI_STARTPROC
popq %r11
@@ -990,6 +998,68 @@ child_rip:
ENDPROC(child_rip)

/*
+ * Create an async kernel thread.
+ *
+ * C extern interface:
+ * extern long create_async_thread(long (*fn)(void *), void * arg, unsigned long flags)
+ *
+ * asm input arguments:
+ * rdi: fn, rsi: arg, rdx: flags
+ */
+ENTRY(create_async_thread)
+ CFI_STARTPROC
+ FAKE_STACK_FRAME $async_child_rip
+ SAVE_ALL
+
+ # rdi: flags, rsi: usp, rdx: will be &pt_regs
+ movq %rdx,%rdi
+ movq $-1, %rsi
+ movq %rsp, %rdx
+
+ xorl %r8d,%r8d
+ xorl %r9d,%r9d
+
+ # clone now
+ call do_fork
+ movq %rax,RAX(%rsp)
+ xorl %edi,%edi
+
+ /*
+ * It isn't worth checking for a reschedule here, so internally to the
+ * x86_64 port you can rely on create_async_thread() not rescheduling
+ * the child before returning; this avoids the need for hacks such as
+ * forking off the per-CPU idle tasks separately.
+ * [Hopefully no generic code relies on the reschedule -AK]
+ */
+ RESTORE_ALL
+ UNFAKE_STACK_FRAME
+ ret
+ CFI_ENDPROC
+ENDPROC(create_async_thread)
+
+async_child_rip:
+ CFI_STARTPROC
+
+ movq %rdi, %rax
+ movq %rsi, %rdi
+ call *%rax
+
+ /*
+ * Fix up the PDA - we might return with sysexit:
+ */
+ RESTORE_TOP_OF_STACK %r11
+
+ /*
+ * return to user-space:
+ */
+ movq %rax, RAX(%rsp)
+ RESTORE_REST
+ jmp int_ret_from_sys_call
+
+ CFI_ENDPROC
+ENDPROC(async_child_rip)
+
+/*
* execve(). This function needs to use IRET, not SYSRET, to set up all state properly.
*
* C extern interface:
Index: linux/arch/x86_64/kernel/process.c
===================================================================
--- linux.orig/arch/x86_64/kernel/process.c
+++ linux/arch/x86_64/kernel/process.c
@@ -418,6 +418,17 @@ void release_thread(struct task_struct *
}
}

+/*
+ * Move user-space context from one kernel thread to another.
+ * Callers must make sure that neither task is running user context
+ * at the moment:
+ */
+void
+move_user_context(struct task_struct *new_task, struct task_struct *old_task)
+{
+ *task_pt_regs(new_task) = *task_pt_regs(old_task);
+}
+
static inline void set_32bit_tls(struct task_struct *t, int tls, u32 addr)
{
struct user_desc ud = {
Index: linux/include/asm-x86_64/processor.h
===================================================================
--- linux.orig/include/asm-x86_64/processor.h
+++ linux/include/asm-x86_64/processor.h
@@ -322,6 +322,11 @@ extern void prepare_to_copy(struct task_
extern long kernel_thread(int (*fn)(void *), void * arg, unsigned long flags);

/*
+ * create an async thread:
+ */
+extern long create_async_thread(long (*fn)(void *), void * arg, unsigned long flags);
+
+/*
* Return saved PC of a blocked thread.
* What is this good for? it will be always the scheduler or ret_from_fork.
*/
@@ -332,6 +337,17 @@ extern unsigned long get_wchan(struct ta
#define KSTK_EIP(tsk) (task_pt_regs(tsk)->rip)
#define KSTK_ESP(tsk) -1 /* sorry. doesn't work for syscall. */

+/*
+ * Register access methods for async syscall support.
+ *
+ * Note, task_stack_reg() must not be an lvalue, hence this macro:
+ */
+#define task_stack_reg(t) \
+ ({ unsigned long __rsp = task_pt_regs(t)->rsp; __rsp; })
+#define set_task_stack_reg(t, new_stack) \
+ do { task_pt_regs(t)->rsp = (new_stack); } while (0)
+#define task_ip_reg(t) task_pt_regs(t)->rip
+#define task_ret_reg(t) task_pt_regs(t)->rax

struct microcode_header {
unsigned int hdrver;
Index: linux/include/asm-x86_64/system.h
===================================================================
--- linux.orig/include/asm-x86_64/system.h
+++ linux/include/asm-x86_64/system.h
@@ -20,6 +20,8 @@
#define __EXTRA_CLOBBER \
,"rcx","rbx","rdx","r8","r9","r10","r11","r12","r13","r14","r15"

+struct task_struct;
+
/* Save restore flags to clear handle leaking NT */
#define switch_to(prev,next,last) \
asm volatile(SAVE_CONTEXT \
@@ -42,7 +44,15 @@
[thread_info] "i" (offsetof(struct task_struct, thread_info)), \
[pda_pcurrent] "i" (offsetof(struct x8664_pda, pcurrent)) \
: "memory", "cc" __EXTRA_CLOBBER)
-
+
+
+/*
+ * Move user-space context from one kernel thread to another.
+ * This includes registers and FPU state for now:
+ */
+extern void
+move_user_context(struct task_struct *new_task, struct task_struct *old_task);
+
extern void load_gs_index(unsigned);

/*
Index: linux/include/asm-x86_64/unistd.h
===================================================================
--- linux.orig/include/asm-x86_64/unistd.h
+++ linux/include/asm-x86_64/unistd.h
@@ -619,8 +619,21 @@ __SYSCALL(__NR_sync_file_range, sys_sync
__SYSCALL(__NR_vmsplice, sys_vmsplice)
#define __NR_move_pages 279
__SYSCALL(__NR_move_pages, sys_move_pages)
+#define __NR_async_exec 280
+__SYSCALL(__NR_async_exec, stub_async_exec)
+#define __NR_async_wait 281
+__SYSCALL(__NR_async_wait, sys_async_wait)
+#define __NR_umem_add 282
+__SYSCALL(__NR_umem_add, sys_umem_add)
+#define __NR_async_thread 283
+__SYSCALL(__NR_async_thread, stub_async_thread)
+#define __NR_threadlet_on 284
+__SYSCALL(__NR_threadlet_on, sys_threadlet_on)
+#define __NR_threadlet_off 285
+__SYSCALL(__NR_threadlet_off, sys_threadlet_off)

-#define __NR_syscall_max __NR_move_pages
+#define __NR_syscall_max __NR_threadlet_off
+#define NR_syscalls __NR_syscall_max

#ifndef __NO_STUBS
#define __ARCH_WANT_OLD_READDIR
@@ -654,6 +667,20 @@ __SYSCALL(__NR_move_pages, sys_move_page
#include <linux/types.h>
#include <asm/ptrace.h>

+typedef asmlinkage long (*syscall_fn_t)(long, long, long, long, long, long);
+
+extern syscall_fn_t sys_call_table[NR_syscalls];
+
+#ifdef CONFIG_COMPAT
+
+extern syscall_fn_t compat_sys_call_table[];
+extern syscall_fn_t ia32_syscall_end;
+extern syscall_fn_t ia32_sys_call_table;
+
+#define compat_NR_syscalls (&ia32_syscall_end - compat_sys_call_table)
+
+#endif
+
asmlinkage long sys_iopl(unsigned int level, struct pt_regs *regs);
asmlinkage long sys_ioperm(unsigned long from, unsigned long num, int turn_on);
struct sigaction;

2007-03-01 03:26:30

by Kevin O'Connor

[permalink] [raw]
Subject: Re: [patch 02/12] syslets: add syslet.h include file, user API/ABI definitions

On Wed, Feb 28, 2007 at 10:41:17PM +0100, Ingo Molnar wrote:
> From: Ingo Molnar <[email protected]>
>
> add include/linux/syslet.h which contains the user-space API/ABI
> declarations. Add the new header to include/linux/Kbuild as well.

Hi Ingo,

I'd like to propose a simpler userspace API for syslets. I believe
this revised API is just as capable as yours (anything done purely in
kernel space with the existing API can also be done with this one).

An "atom" would look like:

struct syslet_uatom {
        u32 nr;
        u64 ret_ptr;
        u64 next;
        u64 arg_nr;
        u64 args[6];
};

The nr, ret_ptr, and next fields would be unchanged. The args
array would directly store the arguments to the system call. To
optimize the case where only a few arguments are necessary, an
explicit argument count would be set in the arg_nr field.

The above is very similar to what Linus and Davide described as a
"single submission" syslet interface - it differs only with the
addition of the next parameter. As with your API, a null next field
would immediately stop the syslet. So, a user wishing to run a single
system call asynchronously could use the above interface with the next
field set to null.
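
For illustration, submitting a single asynchronous read under this
layout might look roughly like the sketch below. It assumes the
simplified single-argument sys_async_exec() suggested further down in
this mail (as a user-space wrapper), so it is not the v5 ABI:

/* the atom must stay valid until the syslet completes */
static long submit_async_read(struct syslet_uatom *atom, long *ret,
                              int fd, void *buf, size_t len)
{
        atom->nr      = __NR_read;
        atom->ret_ptr = (u64)(unsigned long)ret;
        atom->next    = 0;              /* null next: stop after this atom */
        atom->arg_nr  = 3;
        atom->args[0] = fd;
        atom->args[1] = (u64)(unsigned long)buf;
        atom->args[2] = len;

        return sys_async_exec(atom);
}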

Of course, the above lacks the syscall return testing capabilities in
your atoms. To obtain that capability, one could add a new syscall:

long sys_syslet_helper(long flags, long *ptr, long inc, u64 new_next)

The above is effectively a combination of sys_umem_add and the "flags"
field from the existing syslet_uatom. The system call would only be
available from syslets. It would add "inc" to the integer stored in
"ptr" and return the result. The "flags" field could optionally
contain one of:
SYSLET_BRANCH_ON_NONZERO
SYSLET_BRANCH_ON_ZERO
SYSLET_BRANCH_ON_NEGATIVE
SYSLET_BRANCH_ON_NON_POSITIVE
If the flag were set and the return result of the syscall met the
specified condition, then the code would arrange for the calling
syslet to branch to "new_next" instead of the normal "next".
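
To make the branching concrete, an error-check atom under this scheme
could be built as sketched below; __NR_syslet_helper and the neighbouring
atoms are hypothetical names, not part of the posted patches:

/* test the value an earlier open() stored at 'fd_slot' and divert the
 * chain to 'err_atom' if it is <= 0; inc is 0, so the value is only
 * tested, not modified */
static void init_error_check(struct syslet_uatom *check, u64 *fd_slot,
                             struct syslet_uatom *next_atom,
                             struct syslet_uatom *err_atom)
{
        check->nr      = __NR_syslet_helper;
        check->ret_ptr = 0;
        check->next    = (u64)(unsigned long)next_atom; /* taken on success */
        check->arg_nr  = 4;
        check->args[0] = SYSLET_BRANCH_ON_NON_POSITIVE; /* flags */
        check->args[1] = (u64)(unsigned long)fd_slot;   /* ptr */
        check->args[2] = 0;                             /* inc */
        check->args[3] = (u64)(unsigned long)err_atom;  /* new_next */
}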

I would also change the event ring notification system. Instead of
building that support into all syslets, one could introduce an "add to
head" syscall specifically for that purpose. If done this way,
userspace could arrange for this new sys_addtoring call to always be
the last uatom executed in a syslet. This would make the support
optional - those userspace applications that prefer to use a futex or
signal as an event system could arrange to have those system calls as
the last one in the chain instead. With this change, the
sys_async_exec would simplify to:

long sys_async_exec(struct syslet_uatom *uatom);
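
Coming back to the notification point above, a futex-based completion
would then simply be the last atom of the chain, roughly (values are
illustrative):

/* final "wake the submitter" atom, using futex as the notification */
static void init_completion_atom(struct syslet_uatom *done, u32 *word)
{
        done->nr      = __NR_futex;
        done->ret_ptr = 0;
        done->next    = 0;                      /* end of the chain */
        done->arg_nr  = 3;
        done->args[0] = (u64)(unsigned long)word;
        done->args[1] = FUTEX_WAKE;
        done->args[2] = 1;                      /* wake one waiter */
}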

As above, I believe this API has as much power as the existing system.
The general idea is to make the system call / atoms simpler and use
more atoms when building complex chains.

For example, the open & stat case could be done with a chain like the
following:

atom1: &atom3->args[1] = sys_open(...)
atom2: sys_syslet_helper(SYSLET_BRANCH_ON_NON_POSITIVE,
                         &atom3->args[1], 0, atom4)
atom3: sys_stat([arg1 filled above], ...)
atom4: sys_futex(...) // alert parent of completion

It is also possible to use sys_syslet_helper to push a return value to
multiple syslet parameters (for example, propagating an fd from open
to multiple reads). For example:

atom1: &atom3->args[1] = sys_open(...)
atom2: &atom4->args[1] = sys_syslet_helper(0, &atom3->args[1], 0, 0)
atom3: sys_read([arg1 filled in atom1], ...)
atom4: sys_read([arg1 filled in atom2], ...)
...

Although this is a bit ugly, I must wonder how many times one would
build chains complex enough to require it.

Cheers,
-Kevin