2007-12-06 23:20:45

by Zach Brown

Subject: syslets v7: back to basics


The following patches are a substantial refactoring of the syslet code. I'm
branding them as the v7 release of the syslet infrastructure, though they
represent a significant change in focus.

My current focus is to see the most fundamental functionality brought to
maturity. To me, this means getting an ABI that is used by applications through
glibc on x86 and PPC64. Only once that is ready should we distract ourselves
with advanced complexity.

To that end, this patch series differs from v6 in significant ways:

* syslets are initiated by providing syslet arguments to sys_indirect().

* uatoms, threadlets, and the kaio changes are postponed until they can be
justified and rebuilt on more complete infrastructure. (I'm not saying
these shouldn't or won't be pursued. I'm saying that we should get the
simplest piece working first.)

* the code is clarified and commented, the patches are bisectable and pass
checkpatch.

The use of sys_indirect() and the move away from 'atom's simplified the ABI
considerably. I've put a trivial example in a syslet-userspace git tree:

git://git.kernel.org/pub/scm/linux/kernel/git/zab/syslets-userspace.git

This git repository will grow more tests and documentation over time.

The patches sent with this mail are based on the v6 indirect patches, which
are not included here. The full syslets patch series, including the indirect
patches, is available in a few forms:

broken out patch series:
http://www.kernel.org/pub/linux/kernel/people/zab/broken-out/syslets/

in a 'syslets' git branch off of current linux-2.6.git:
git://git.kernel.org/pub/scm/linux/kernel/git/zab/linux-2.6.git syslets

git tree of the guilt .git/patches directory:
git://git.kernel.org/pub/scm/linux/kernel/git/zab/guilt-series.git

The patches were barely tested on i386 and x86_64.

There are both implementation details and design problems left. My hope is
that we can address these in the coming weeks.

- Do we stop the user from initiating more syslets than fit in the ring?
- Do we worry now about the hashed ring mutexes scaling poorly? (They will.)
- What are the semantics of ptrace()ing a syslet submission which blocks?
- How should applications deal with waiting syslet tasks with stale data
in their task_struct? (syslet, setuid, syslet..)
- Issuing a syslet is an implicit sys_clone(), will apps pass in clone flags?
- Are the u32 ring index reads and writes atomic for supported architectures?

Any feedback on these questions would be greatly appreciated.
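
On the first question, one option that needs no kernel help is for the
application to count its own in-flight syslets. The following is a minimal
sketch of that bookkeeping, assuming the application reserves a slot before
every submission and releases it either when the call completes synchronously
or when its completion is reaped from the ring. The struct and function names
are hypothetical and do not appear in the patches.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical userspace bookkeeping; not part of the syslet ABI. */
struct syslet_budget {
	uint32_t elements;	/* ring capacity, must be a power of two */
	uint32_t reserved;	/* free-running count of slots reserved */
	uint32_t released;	/* free-running count of slots released */
};

static void budget_init(struct syslet_budget *b, uint32_t elements)
{
	assert(elements != 0 && (elements & (elements - 1)) == 0);
	b->elements = elements;
	b->reserved = 0;
	b->released = 0;
}

/* Unsigned subtraction stays correct when the counters wrap. */
static uint32_t budget_in_flight(const struct syslet_budget *b)
{
	return b->reserved - b->released;
}

/* Call before submitting; refuse if the ring could overflow. */
static bool budget_reserve(struct syslet_budget *b)
{
	if (budget_in_flight(b) >= b->elements)
		return false;
	b->reserved++;
	return true;
}

/* Call when a submission completed inline or its completion was reaped. */
static void budget_release(struct syslet_budget *b)
{
	assert(budget_in_flight(b) > 0);
	b->released++;
}
```

This only protects a cooperative application from itself. It says nothing
about whether the kernel should also enforce the limit.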

I'm particularly interested in hearing from people who are trying to use
syslets in their applications. This will involve awkward wrappers instead of
glibc calls for now, and your machine may explode, but hopefully the chance to
influence the design of syslets would make it worth the effort.

Finally, I carried the enormous cc: list for this mail over from previous
syslet releases. If you want to be removed or added to the list for future
syslet releases, please do let me know.

- z


2007-12-06 23:20:32

by Zach Brown

Subject: [PATCH 1/6] indirect: use asmlinkage in i386 syscall table prototype

call_indirect() was using the wrong calling convention for the system call
handlers, so the handlers would receive mixed-up arguments.

Signed-off-by: Zach Brown <[email protected]>

diff --git a/include/asm-x86/indirect_32.h b/include/asm-x86/indirect_32.h
index a1b72ac..e3dea8e 100644
--- a/include/asm-x86/indirect_32.h
+++ b/include/asm-x86/indirect_32.h
@@ -15,8 +15,8 @@ struct indirect_registers {

static inline long call_indirect(struct indirect_registers *regs)
{
- extern long (*sys_call_table[]) (__u32, __u32, __u32, __u32, __u32, __u32);
-
+ extern asmlinkage long (*sys_call_table[])(long, long, long,
+ long, long, long);
return sys_call_table[INDIRECT_SYSCALL(regs)](regs->ebx, regs->ecx,
regs->edx, regs->esi,
regs->edi, regs->ebp);
--
1.5.2.2

2007-12-06 23:20:59

by Zach Brown

Subject: [PATCH 2/6] syslet: asm-generic support to disable syslets

This provides an implementation of the architecture dependent portion of
syslets which disables syslet operations.

This patch is an incomplete demonstration. All asm-*/syslet*.h files would
include these files until their architectures provide implementations which
enable syslet support.

Signed-off-by: Zach Brown <[email protected]>

diff --git a/include/asm-generic/syslet-abi.h b/include/asm-generic/syslet-abi.h
new file mode 100644
index 0000000..5d19971
--- /dev/null
+++ b/include/asm-generic/syslet-abi.h
@@ -0,0 +1,11 @@
+#ifndef _ASM_GENERIC_SYSLET_ABI_H
+#define _ASM_GENERIC_SYSLET_ABI_H
+
+/*
+ * I'm assuming that a u64 ip and u64 esp won't be enough for all
+ * archs, so I just let each arch define its own.
+ */
+struct syslet_frame {
+};
+
+#endif
diff --git a/include/asm-generic/syslet.h b/include/asm-generic/syslet.h
new file mode 100644
index 0000000..de9a750
--- /dev/null
+++ b/include/asm-generic/syslet.h
@@ -0,0 +1,34 @@
+#ifndef _ASM_GENERIC_SYSLET_H
+#define _ASM_GENERIC_SYSLET_H
+
+/*
+ * This provider of the arch-specific syslet APIs is used when an architecture
+ * doesn't support syslets.
+ */
+
+/* this stops the other functions from ever being called */
+static inline int syslet_frame_valid(struct syslet_frame *frame)
+{
+ return 0;
+}
+
+static inline void set_user_frame(struct task_struct *task,
+ struct syslet_frame *frame)
+{
+ BUG();
+}
+
+static inline void move_user_context(struct task_struct *dest,
+ struct task_struct *src)
+{
+ BUG();
+}
+
+static inline int create_syslet_thread(long (*fn)(void *),
+ void *arg, unsigned long flags)
+{
+ BUG();
+ return 0;
+}
+
+#endif
diff --git a/include/asm-x86/syslet-abi.h b/include/asm-x86/syslet-abi.h
new file mode 100644
index 0000000..14a7182
--- /dev/null
+++ b/include/asm-x86/syslet-abi.h
@@ -0,0 +1 @@
+#include <asm-generic/syslet-abi.h>
diff --git a/include/asm-x86/syslet.h b/include/asm-x86/syslet.h
new file mode 100644
index 0000000..583d810
--- /dev/null
+++ b/include/asm-x86/syslet.h
@@ -0,0 +1 @@
+#include <asm-generic/syslet.h>
--
1.5.2.2

2007-12-06 23:21:25

by Zach Brown

Subject: [PATCH 3/6] syslet: introduce abi structs

This patch adds the architecture independent structures of the
syslet ABI.
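
To make the ring convention concrete, here is a minimal userspace sketch of
the "wrapping" index scheme: both indexes run freely over the full u32 range
and are masked only when used to address a slot. The struct layout mirrors
the one in this patch, with stdint types standing in for u32 and u64.
ring_alloc(), ring_produce() and ring_consume() are hypothetical helpers.
ring_produce() only stands in for the kernel's completion path so that the
example is self-contained. The sketch also assumes that an _Atomic uint32_t
has the same size and layout as a plain u32, which holds on the usual Linux
targets but is not guaranteed by the C standard.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>

struct syslet_completion {
	uint64_t status;
	uint64_t caller_data;
};

struct syslet_ring {
	_Atomic uint32_t kernel_head;	/* advanced by the producer */
	uint32_t user_tail;		/* advanced by the consumer */
	uint32_t elements;		/* power of two */
	uint32_t wait_group;
	struct syslet_completion comp[];
};

/* Hypothetical helper: allocate a zeroed ring with 'elements' slots. */
static struct syslet_ring *ring_alloc(uint32_t elements)
{
	struct syslet_ring *ring;

	assert(elements != 0 && (elements & (elements - 1)) == 0);
	ring = calloc(1, sizeof(*ring) + elements * sizeof(ring->comp[0]));
	if (ring) {
		atomic_init(&ring->kernel_head, 0);
		ring->elements = elements;
	}
	return ring;
}

/* Stand-in for the kernel side: store the completion, then publish it. */
static void ring_produce(struct syslet_ring *ring, uint64_t status,
			 uint64_t caller_data)
{
	uint32_t head = atomic_load_explicit(&ring->kernel_head,
					     memory_order_relaxed);
	struct syslet_completion *c = &ring->comp[head & (ring->elements - 1)];

	c->status = status;
	c->caller_data = caller_data;
	/* release ordering: the slot is visible before the new head */
	atomic_store_explicit(&ring->kernel_head, head + 1,
			      memory_order_release);
}

/* Userspace side: returns false when the ring is empty. */
static bool ring_consume(struct syslet_ring *ring,
			 struct syslet_completion *out)
{
	/* acquire ordering: read the head before reading the slot */
	uint32_t head = atomic_load_explicit(&ring->kernel_head,
					     memory_order_acquire);

	if (head == ring->user_tail)
		return false;
	*out = ring->comp[ring->user_tail & (ring->elements - 1)];
	ring->user_tail++;
	return true;
}
```

Because the indexes are never reduced modulo the ring size, head == tail
always means empty, and no slot has to be sacrificed to tell a full ring
from an empty one.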

Signed-off-by: Zach Brown <[email protected]>

diff --git a/include/linux/syslet-abi.h b/include/linux/syslet-abi.h
new file mode 100644
index 0000000..a8bc1a3
--- /dev/null
+++ b/include/linux/syslet-abi.h
@@ -0,0 +1,34 @@
+#ifndef _LINUX_SYSLET_ABI_H
+#define _LINUX_SYSLET_ABI_H
+
+#include <asm/syslet-abi.h> /* for struct syslet_frame */
+
+struct syslet_args {
+ u64 completion_ring_ptr;
+ u64 caller_data;
+ struct syslet_frame frame;
+};
+
+struct syslet_completion {
+ u64 status;
+ u64 caller_data;
+};
+
+/*
+ * The ring follows the "wrapping" convention as described by Andrew at:
+ * http://lkml.org/lkml/2007/4/11/276
+ * The head is updated by the kernel as completions are added and the
+ * tail is updated by userspace as completions are removed.
+ *
+ * The number of elements must be a power of two and the ring must be
+ * aligned to a u64.
+ */
+struct syslet_ring {
+ u32 kernel_head;
+ u32 user_tail;
+ u32 elements;
+ u32 wait_group;
+ struct syslet_completion comp[0];
+};
+
+#endif
--
1.5.2.2

2007-12-06 23:21:41

by Zach Brown

Subject: [PATCH 4/6] syslets: add indirect args

This adds the syslet indirect args to the indirect_params union.

This is broken, but it lets us simply demonstrate the rest of the syslet
universe around the indirect argument passing convention.

A caller could well want to perform a syscall that uses indirect arguments as a
syslet. Maybe we turn indirect_params into a struct that contains a union for
arguments which can never be used concurrently. This needs wider discussion.
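
The problem can be shown in a few lines of userspace C. This is only a
sketch: the types are mock copies of the kernel ones, the two-field
syslet_frame is a placeholder (each arch defines its own), load_params()
stands in for the copy_from_user() in sys_indirect(), and
indirect_params_split is just one possible shape for the struct-plus-union
idea, not something from these patches.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Placeholder frame; the real layout is arch specific. */
struct syslet_frame {
	uint64_t ip;
	uint64_t sp;
};

struct syslet_args {
	uint64_t completion_ring_ptr;
	uint64_t caller_data;
	struct syslet_frame frame;
};

/* Layout after this patch: both kinds of arguments share storage. */
union indirect_params {
	struct {
		int flags;
	} file_flags;
	struct syslet_args syslet;
};

/* Hypothetical alternative: syslet args sit beside the union. */
struct indirect_params_split {
	union {
		struct {
			int flags;
		} file_flags;
	} u;
	struct syslet_args syslet;
};

/* Stand-in for copy_from_user() into zeroed per-task storage. */
static void load_params(void *dst, size_t dst_len, const void *src,
			size_t src_len)
{
	assert(src_len <= dst_len);
	memset(dst, 0, dst_len);
	memcpy(dst, src, src_len);
}

static bool syslet_args_present(const union indirect_params *params)
{
	return params->syslet.completion_ring_ptr != 0;
}

static bool split_syslet_args_present(const struct indirect_params_split *p)
{
	return p->syslet.completion_ring_ptr != 0;
}
```

With the union, a caller that passes nothing but a non-zero file_flags.flags
is treated as if it had asked for a syslet, because the flag bytes alias
completion_ring_ptr. With the split layout the same input leaves the syslet
arguments zero.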

Signed-off-by: Zach Brown <[email protected]>

diff --git a/include/linux/indirect.h b/include/linux/indirect.h
index 97f9ac4..5d5abd7 100644
--- a/include/linux/indirect.h
+++ b/include/linux/indirect.h
@@ -3,6 +3,7 @@
#define _LINUX_INDIRECT_H

#include <asm/indirect.h>
+#include <linux/syslet-abi.h>


/* IMPORTANT:
@@ -14,6 +15,7 @@ union indirect_params {
struct {
int flags;
} file_flags;
+ struct syslet_args syslet;
};

#define INDIRECT_PARAM(set, name) current->indirect_params.set.name
--
1.5.2.2

2007-12-06 23:21:57

by Zach Brown

Subject: [PATCH 5/6] syslets: add generic syslets infrastructure

The indirect syslet arguments specify where to store the completion and what
function in userspace to return to once the syslet has been executed. The
details of how we pass the indirect syslet arguments need work.

We parse the indirect syslet arguments in sys_indirect() before we call
the given system call. If they're OK we mark the task as ready to become
a syslet. We make sure that there is a child task waiting.

We call into kernel/syslet.c from the scheduler when we try to block a task
which has been marked as ready. A child task is woken and returns to
userspace.

We store the result of the system call in the userspace ring back up in
sys_indirect() as the system call finally finishes. At that point the original
task returns to the frame that userspace provided in the indirect syslet args.

This generic infrastructure relies on architecture specific routines to create
a new child task, move userspace state from one kernel task to another, and to
set up the userspace return frame in ptregs. The asm-generic code makes syslet
submission fail with -EINVAL until an architecture provides the needed routines.

This is a simplification of Ingo's more involved syslet and threadlet
infrastructure which was built around 'uatoms'. Enough code has changed that
it wasn't appropriate to bring the previous Signed-off-by lines forward.
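
From the caller's side, the flow above comes down to one decision on the
value that sys_indirect() returns. The sketch below assumes a raw wrapper
that hands back the kernel's return value unchanged, with no glibc-style
errno translation. syslet_completed_inline() is a hypothetical helper. The
ESYSLETPENDING value is the one this patch adds to asm-generic/errno.h.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define ESYSLETPENDING 132	/* from the errno.h hunk in this patch */

/*
 * Returns true if the system call finished without blocking, in which
 * case *status holds its result. Returns false if the call blocked and
 * its status will arrive later as a completion in the ring.
 */
static bool syslet_completed_inline(long raw, long *status)
{
	assert(status != NULL);

	if (raw == -ESYSLETPENDING)
		return false;
	*status = raw;
	return true;
}
```

Note that an ordinary failure such as -EINVAL from argument validation
takes the inline path too, so the caller still has to check the status for
errors.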

Signed-off-by: Zach Brown <[email protected]>

diff --git a/fs/exec.c b/fs/exec.c
index 282240a..942262f 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -51,6 +51,7 @@
#include <linux/tsacct_kern.h>
#include <linux/cn_proc.h>
#include <linux/audit.h>
+#include <linux/syslet.h>

#include <asm/uaccess.h>
#include <asm/mmu_context.h>
@@ -1614,6 +1615,8 @@ static int coredump_wait(int exit_code)
complete(vfork_done);
}

+ kill_syslet_tasks(tsk);
+
if (core_waiters)
wait_for_completion(&startup_done);
fail:
diff --git a/include/asm-generic/errno.h b/include/asm-generic/errno.h
index e8852c0..26674c4 100644
--- a/include/asm-generic/errno.h
+++ b/include/asm-generic/errno.h
@@ -106,4 +106,7 @@
#define EOWNERDEAD 130 /* Owner died */
#define ENOTRECOVERABLE 131 /* State not recoverable */

+/* for syslets */
+#define ESYSLETPENDING 132 /* syslet syscall now pending */
+
#endif
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 61a4b83..a134966 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1182,6 +1182,13 @@ struct task_struct {

/* Additional system call parameters. */
union indirect_params indirect_params;
+
+ /* task waiting to return to userspace if we block as a syslet */
+ spinlock_t syslet_lock;
+ struct list_head syslet_tasks;
+ unsigned syslet_ready:1,
+ syslet_return:1,
+ syslet_exit:1;
};

/*
diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
index addb39f..1a44838 100644
--- a/include/linux/syscalls.h
+++ b/include/linux/syscalls.h
@@ -55,6 +55,7 @@ struct compat_timeval;
struct robust_list_head;
struct getcpu_cache;
struct indirect_registers;
+struct syslet_ring;

#include <linux/types.h>
#include <linux/aio_abi.h>
@@ -615,6 +616,8 @@ asmlinkage long sys_fallocate(int fd, int mode, loff_t offset, loff_t len);
asmlinkage long sys_indirect(struct indirect_registers __user *userregs,
void __user *userparams, size_t paramslen,
int flags);
+asmlinkage long sys_syslet_ring_wait(struct syslet_ring __user *ring,
+ unsigned long user_idx);

int kernel_execve(const char *filename, char *const argv[], char *const envp[]);

diff --git a/include/linux/syslet.h b/include/linux/syslet.h
new file mode 100644
index 0000000..734b98f
--- /dev/null
+++ b/include/linux/syslet.h
@@ -0,0 +1,18 @@
+#ifndef _LINUX_SYSLET_H
+#define _LINUX_SYSLET_H
+
+#include <linux/syslet-abi.h>
+#include <asm/syslet.h>
+
+void syslet_init(struct task_struct *tsk);
+void kill_syslet_tasks(struct task_struct *cur);
+void syslet_schedule(struct task_struct *cur);
+int syslet_pre_indirect(void);
+int syslet_post_indirect(int status);
+
+static inline int syslet_args_present(union indirect_params *params)
+{
+ return params->syslet.completion_ring_ptr;
+}
+
+#endif
diff --git a/kernel/Makefile b/kernel/Makefile
index bcad37d..7a7dfbe 100644
--- a/kernel/Makefile
+++ b/kernel/Makefile
@@ -9,7 +9,7 @@ obj-y = sched.o fork.o exec_domain.o panic.o printk.o profile.o \
rcupdate.o extable.o params.o posix-timers.o \
kthread.o wait.o kfifo.o sys_ni.o posix-cpu-timers.o mutex.o \
hrtimer.o rwsem.o latency.o nsproxy.o srcu.o \
- utsname.o notifier.o
+ utsname.o notifier.o syslet.o

obj-$(CONFIG_SYSCTL) += sysctl_check.o
obj-$(CONFIG_STACKTRACE) += stacktrace.o
diff --git a/kernel/exit.c b/kernel/exit.c
index 549c055..831b2f9 100644
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -44,6 +44,7 @@
#include <linux/resource.h>
#include <linux/blkdev.h>
#include <linux/task_io_accounting_ops.h>
+#include <linux/syslet.h>

#include <asm/uaccess.h>
#include <asm/unistd.h>
@@ -949,6 +950,12 @@ fastcall NORET_TYPE void do_exit(long code)
schedule();
}

+ /*
+ * syslet threads have to exit their context before the MM exit (due to
+ * the coredumping wait).
+ */
+ kill_syslet_tasks(tsk);
+
tsk->flags |= PF_EXITING;
/*
* tsk->flags are checked in the futex code to protect against
diff --git a/kernel/fork.c b/kernel/fork.c
index 8ca1a14..4b1efb9 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -51,6 +51,7 @@
#include <linux/random.h>
#include <linux/tty.h>
#include <linux/proc_fs.h>
+#include <linux/syslet.h>

#include <asm/pgtable.h>
#include <asm/pgalloc.h>
@@ -1123,6 +1124,8 @@ static struct task_struct *copy_process(unsigned long clone_flags,
p->blocked_on = NULL; /* not blocked yet */
#endif

+ syslet_init(p);
+
/* Perform scheduler related setup. Assign this task to a CPU. */
sched_fork(p, clone_flags);

diff --git a/kernel/indirect.c b/kernel/indirect.c
index bc60850..3bcfaff 100644
--- a/kernel/indirect.c
+++ b/kernel/indirect.c
@@ -3,6 +3,8 @@
#include <linux/unistd.h>
#include <asm/asm-offsets.h>

+/* XXX would we prefer to generalize this somehow? */
+#include <linux/syslet.h>

asmlinkage long sys_indirect(struct indirect_registers __user *userregs,
void __user *userparams, size_t paramslen,
@@ -17,6 +19,24 @@ asmlinkage long sys_indirect(struct indirect_registers __user *userregs,
if (copy_from_user(&regs, userregs, sizeof(regs)))
return -EFAULT;

+ if (paramslen > sizeof(union indirect_params))
+ return -EINVAL;
+
+ if (copy_from_user(&current->indirect_params, userparams, paramslen)) {
+ result = -EFAULT;
+ goto out;
+ }
+
+ /* We need to come up with a better way to allow and forbid syscalls */
+ if (unlikely(syslet_args_present(&current->indirect_params))) {
+ result = syslet_pre_indirect();
+ if (result == 0) {
+ result = call_indirect(&regs);
+ result = syslet_post_indirect(result);
+ }
+ goto out;
+ }
+
switch (INDIRECT_SYSCALL (&regs))
{
#define INDSYSCALL(name) __NR_##name
@@ -24,16 +44,12 @@ asmlinkage long sys_indirect(struct indirect_registers __user *userregs,
break;

default:
- return -EINVAL;
+ result = -EINVAL;
+ goto out;
}

- if (paramslen > sizeof(union indirect_params))
- return -EINVAL;
-
- result = -EFAULT;
- if (!copy_from_user(&current->indirect_params, userparams, paramslen))
- result = call_indirect(&regs);
-
+ result = call_indirect(&regs);
+out:
memset(&current->indirect_params, '\0', paramslen);

return result;
diff --git a/kernel/sched.c b/kernel/sched.c
index 59ff6b1..27e799d 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -63,6 +63,7 @@
#include <linux/reciprocal_div.h>
#include <linux/unistd.h>
#include <linux/pagemap.h>
+#include <linux/syslet.h>

#include <asm/tlb.h>
#include <asm/irq_regs.h>
@@ -3612,6 +3613,14 @@ asmlinkage void __sched schedule(void)
struct rq *rq;
int cpu;

+ prev = current;
+ if (unlikely(prev->syslet_ready)) {
+ if (prev->state && !(preempt_count() & PREEMPT_ACTIVE) &&
+ (!(prev->state & TASK_INTERRUPTIBLE) ||
+ !signal_pending(prev)))
+ syslet_schedule(prev);
+ }
+
need_resched:
preempt_disable();
cpu = smp_processor_id();
diff --git a/kernel/syslet.c b/kernel/syslet.c
new file mode 100644
index 0000000..6fb2eb9
--- /dev/null
+++ b/kernel/syslet.c
@@ -0,0 +1,462 @@
+#include <linux/sched.h>
+#include <linux/wait.h>
+#include <linux/mutex.h>
+#include <linux/completion.h>
+#include <linux/err.h>
+#include <linux/jhash.h>
+#include <linux/list.h>
+#include <linux/syslet.h>
+
+#include <asm/uaccess.h>
+
+/*
+ * XXX todo:
+ * - do we need all this '*cur = current' nonsense?
+ * - try to prevent userspace from submitting too much.. lazy user ptr read?
+ * - explain how to deal with waiting threads with stale data in current
+ * - how does userspace tell that a syslet completion was lost?
+ * provide an -errno argument to the userspace return function?
+ */
+
+/*
+ * These structs are stored on the kernel stack of tasks which are waiting to
+ * return to userspace. They are linked into their parent's list of syslet
+ * children stored in 'syslet_tasks' in the parent's task_struct.
+ */
+struct syslet_task_entry {
+ struct task_struct *task;
+ struct list_head item;
+};
+
+/*
+ * syslet_ring doesn't have any kernel-side storage. Userspace allocates them
+ * in their address space and initializes their fields and then passes them to
+ * the kernel.
+ *
+ * These hashes provide the kernel-side storage for the wait queues which
+ * sys_syslet_ring_wait() uses and the mutex which completion uses to serialize
+ * the (possibly blocking) ordered writes of the completion and kernel head
+ * index into the ring.
+ *
+ * We choose the bucket that supports a given ring by hashing a u32 that
+ * userspace sets in the ring.
+ */
+#define SYSLET_HASH_BITS (CONFIG_BASE_SMALL ? 4 : 8)
+#define SYSLET_HASH_NR (1 << SYSLET_HASH_BITS)
+#define SYSLET_HASH_MASK (SYSLET_HASH_NR - 1)
+static wait_queue_head_t syslet_waitqs[SYSLET_HASH_NR];
+static struct mutex syslet_muts[SYSLET_HASH_NR];
+
+static wait_queue_head_t *ring_waitqueue(struct syslet_ring __user *ring)
+{
+ u32 group;
+
+ if (get_user(group, &ring->wait_group))
+ return ERR_PTR(-EFAULT);
+ else
+ return &syslet_waitqs[jhash_1word(group, 0) & SYSLET_HASH_MASK];
+}
+
+static struct mutex *ring_mutex(struct syslet_ring __user *ring)
+{
+ u32 group;
+
+ if (get_user(group, (u32 __user *)&ring->wait_group))
+ return ERR_PTR(-EFAULT);
+ else
+ return &syslet_muts[jhash_1word(group, 0) & SYSLET_HASH_MASK];
+}
+
+/*
+ * This is called for new tasks and for child tasks which might copy
+ * task_struct from their parent. So we clear the syslet indirect args,
+ * too, just to be clear.
+ */
+void syslet_init(struct task_struct *tsk)
+{
+ memset(&tsk->indirect_params.syslet, 0, sizeof(struct syslet_args));
+ spin_lock_init(&tsk->syslet_lock);
+ INIT_LIST_HEAD(&tsk->syslet_tasks);
+ tsk->syslet_ready = 0;
+ tsk->syslet_return = 0;
+ tsk->syslet_exit = 0;
+}
+
+static struct task_struct *first_syslet_task(struct task_struct *parent)
+{
+ struct syslet_task_entry *entry;
+
+ assert_spin_locked(&parent->syslet_lock);
+
+ if (!list_empty(&parent->syslet_tasks)) {
+ entry = list_first_entry(&parent->syslet_tasks,
+ struct syslet_task_entry, item);
+ return entry->task;
+ } else
+ return NULL;
+}
+
+/*
+ * XXX it's not great to wake up potentially lots of tasks under the lock
+ */
+/*
+ * We ask all the waiting syslet tasks to exit before we ourselves will
+ * exit. The tasks remove themselves from the list and wake our process
+ * with the lock held to be sure that we're still around when they wake us.
+ */
+void kill_syslet_tasks(struct task_struct *cur)
+{
+ struct syslet_task_entry *entry;
+
+ spin_lock(&cur->syslet_lock);
+
+ list_for_each_entry(entry, &cur->syslet_tasks, item) {
+ entry->task->syslet_exit = 1;
+ wake_up_process(entry->task);
+ }
+
+ while (!list_empty(&cur->syslet_tasks)) {
+ set_task_state(cur, TASK_INTERRUPTIBLE);
+ if (list_empty(&cur->syslet_tasks))
+ break;
+ spin_unlock(&cur->syslet_lock);
+ schedule();
+ spin_lock(&cur->syslet_lock);
+ }
+ spin_unlock(&cur->syslet_lock);
+
+ set_task_state(cur, TASK_RUNNING);
+}
+
+/*
+ * This task is cloned off of a syslet parent as the parent calls
+ * syslet_pre_indirect() from sys_indirect(). That parent waits for us to
+ * complete a completion struct on their stack.
+ *
+ * This task then waits until its parent tells it to return to user space on
+ * its behalf when the parent gets in to schedule().
+ *
+ * The parent in schedule will set this task's ptregs frame to return to the
+ * sys_indirect() call site in user space. Our -ESYSLETPENDING return code is
+ * given to userspace to indicate that the status of their system call
+ * will be delivered to the ring.
+ */
+struct syslet_task_args {
+ struct completion *comp;
+ struct task_struct *parent;
+};
+static long syslet_thread(void *data)
+{
+ struct syslet_task_args args;
+ struct task_struct *cur = current;
+ struct syslet_task_entry entry = {
+ .task = cur,
+ .item = LIST_HEAD_INIT(entry.item),
+ };
+
+ args = *(struct syslet_task_args *)data;
+
+ spin_lock(&args.parent->syslet_lock);
+ list_add_tail(&entry.item, &args.parent->syslet_tasks);
+ spin_unlock(&args.parent->syslet_lock);
+
+ complete(args.comp);
+
+ /* wait until the scheduler tells us to return to user space */
+ for (;;) {
+ set_task_state(cur, TASK_INTERRUPTIBLE);
+ if (cur->syslet_return || cur->syslet_exit ||
+ signal_pending(cur))
+ break;
+ schedule();
+ }
+ set_task_state(cur, TASK_RUNNING);
+
+ spin_lock(&args.parent->syslet_lock);
+ list_del(&entry.item);
+ /* our parent won't exit until it tests the list under the lock */
+ if (list_empty(&args.parent->syslet_tasks))
+ wake_up_process(args.parent);
+ spin_unlock(&args.parent->syslet_lock);
+
+ /* just exit if we weren't asked to return to userspace */
+ if (!cur->syslet_return)
+ do_exit(0);
+
+ /* inform userspace that their call will complete in the ring */
+ return -ESYSLETPENDING;
+}
+
+static int create_new_syslet_task(struct task_struct *cur)
+{
+ struct syslet_task_args args;
+ struct completion comp;
+ int ret;
+
+ init_completion(&comp);
+ args.comp = &comp;
+ args.parent = cur;
+
+ ret = create_syslet_thread(syslet_thread, &args,
+ CLONE_VM | CLONE_FS | CLONE_FILES |
+ CLONE_SIGHAND | CLONE_THREAD |
+ CLONE_SYSVSEM);
+ if (ret >= 0) {
+ wait_for_completion(&comp);
+ ret = 0;
+ }
+
+ return ret;
+}
+
+/*
+ * This is called by sys_indirect() when it sees that syslet args have
+ * been provided. We validate the arguments and make sure that there is
+ * a task waiting. If everything works out we tell the scheduler that it
+ * can call syslet_schedule() by setting syslet_ready.
+ */
+int syslet_pre_indirect(void)
+{
+ struct task_struct *cur = current;
+ struct syslet_ring __user *ring;
+ u32 elements;
+ int ret;
+
+ /* Not sure if returning -EINVAL on unsupported archs is right */
+ if (!syslet_frame_valid(&cur->indirect_params.syslet.frame)) {
+ ret = -EINVAL;
+ goto out;
+ }
+
+ ring = (struct syslet_ring __user __force *)(unsigned long)
+ cur->indirect_params.syslet.completion_ring_ptr;
+ if (get_user(elements, &ring->elements)) {
+ ret = -EFAULT;
+ goto out;
+ }
+
+ if (!is_power_of_2(elements)) {
+ ret = -EINVAL;
+ goto out;
+ }
+
+ /*
+ * Racing to test this list outside the lock as the final task removes
+ * itself is OK. It should be very rare, and all it results in is
+ * syslet_schedule() finding the list empty and letting the task block.
+ */
+ if (list_empty(&cur->syslet_tasks)) {
+ ret = create_new_syslet_task(cur);
+ if (ret)
+ goto out;
+ } else
+ ret = 0;
+
+ cur->syslet_ready = 1;
+out:
+ return ret;
+}
+
+/*
+ * This is called by sys_indirect() after it has called the given system
+ * call handler. If we didn't block then we just return the status of the
+ * system call to userspace.
+ *
+ * If we did block, however, then userspace got a -ESYSLETPENDING long ago.
+ * We need to deliver the status of the system call into the syslet ring
+ * and then return to the function in userspace which the caller specified
+ * in the frame in the syslet args. schedule() already set that up
+ * when we blocked. All we have to do is return to userspace.
+ *
+ * The return code from this function is lost. It could become the
+ * argument to the userspace return function which would let us tell
+ * userspace when we fail to copy the status into the ring.
+ */
+int syslet_post_indirect(int status)
+{
+ struct syslet_ring __user *ring;
+ struct syslet_completion comp;
+ struct task_struct *cur = current;
+ struct syslet_args *args = &cur->indirect_params.syslet;
+ wait_queue_head_t *waitq;
+ struct mutex *mutex;
+ u32 kidx;
+ u32 mask;
+ int ret;
+
+ /* we didn't block, just return the status to userspace */
+ if (cur->syslet_ready) {
+ cur->syslet_ready = 0;
+ return status;
+ }
+
+ ring = (struct syslet_ring __force __user *)(unsigned long)
+ args->completion_ring_ptr;
+
+ comp.status = status;
+ comp.caller_data = args->caller_data;
+
+ mutex = ring_mutex(ring);
+ if (IS_ERR(mutex))
+ return PTR_ERR(mutex);
+
+ waitq = ring_waitqueue(ring);
+ if (IS_ERR(waitq))
+ return PTR_ERR(waitq);
+
+ if (get_user(mask, &ring->elements))
+ return -EFAULT;
+
+ if (!is_power_of_2(mask))
+ return -EINVAL;
+ mask--;
+
+ mutex_lock(mutex);
+
+ ret = -EFAULT;
+ if (get_user(kidx, (u32 __user *)&ring->kernel_head))
+ goto out;
+
+ if (copy_to_user(&ring->comp[kidx & mask], &comp, sizeof(comp)))
+ goto out;
+
+ /*
+ * Make sure that the completion is stored before the index which
+ * refers to it. Notice that this means that userspace has to worry
+ * about issuing a read memory barrier after it reads the index.
+ */
+ smp_wmb();
+
+ kidx++;
+ if (put_user(kidx, &ring->kernel_head))
+ ret = -EFAULT;
+ else
+ ret = 0;
+out:
+ mutex_unlock(mutex);
+ if (ret == 0 && waitqueue_active(waitq))
+ wake_up(waitq);
+ return ret;
+}
+
+/*
+ * We're called by the scheduler when it sees that a task is about to block and
+ * has syslet_ready. Our job is to hand userspace's state off to a waiting
+ * task and tell it to return to userspace. That tells userspace that the
+ * system call that we're executing blocked and will complete in the future.
+ *
+ * The indirect syslet arguments specify the userspace instruction and stack
+ * that the child should return to.
+ */
+void syslet_schedule(struct task_struct *cur)
+{
+ struct task_struct *child = NULL;
+
+ spin_lock(&cur->syslet_lock);
+
+ child = first_syslet_task(cur);
+ if (child) {
+ move_user_context(child, cur);
+ set_user_frame(cur, &cur->indirect_params.syslet.frame);
+ cur->syslet_ready = 0;
+ child->syslet_return = 1;
+ }
+
+ spin_unlock(&cur->syslet_lock);
+
+ if (child)
+ wake_up_process(child);
+}
+
+/*
+ * Userspace calls this when the ring is empty. We return to userspace
+ * when the kernel head and user tail indexes are no longer equal, meaning
+ * that the kernel has stored a new completion.
+ *
+ * The ring is stored entirely in user space. We don't have a system call
+ * which initializes kernel state to go along with the ring.
+ *
+ * So we have to read the kernel head index from userspace. In the common
+ * case this will not fault or block and will be a very fast simple
+ * pointer dereference.
+ *
+ * However, we need a way for the kernel completion path to wake us when
+ * there is a new event. We hash a field of the ring into buckets of
+ * wait queues for this.
+ *
+ * This relies on aligned u32 reads and writes being atomic with regard
+ * to other reads and writes, which I sure hope is true on linux's
+ * architectures. I'm crossing my fingers.
+ */
+asmlinkage long sys_syslet_ring_wait(struct syslet_ring __user *ring,
+ unsigned long user_idx)
+{
+ wait_queue_head_t *waitq;
+ struct task_struct *cur = current;
+ DEFINE_WAIT(wait);
+ u32 kidx;
+ int ret;
+
+ /* XXX disallow async waiting */
+
+ waitq = ring_waitqueue(ring);
+ if (IS_ERR(waitq)) {
+ ret = PTR_ERR(waitq);
+ goto out;
+ }
+
+ /*
+ * We have to be careful not to miss wake-ups by setting our
+ * state before testing the condition. Testing our condition includes
+ * copying the index from userspace, which can modify our state which
+ * can mask a wake-up setting our state.
+ *
+ * So we very carefully copy the index. We use the blocking copy
+ * to fault the index in and detect bad pointers. We only proceed
+ * with the test and sleeping if the non-blocking copy succeeds.
+ *
+ * In the common case the non-blocking copy will succeed and this
+ * will be very fast indeed.
+ */
+ for (;;) {
+ prepare_to_wait(waitq, &wait, TASK_INTERRUPTIBLE);
+ ret = __copy_from_user_inatomic(&kidx, &ring->kernel_head,
+ sizeof(u32));
+ if (ret) {
+ set_task_state(cur, TASK_RUNNING);
+ ret = copy_from_user(&kidx, &ring->kernel_head,
+ sizeof(u32));
+ if (ret) {
+ ret = -EFAULT;
+ break;
+ }
+ continue;
+ }
+
+ if (kidx != user_idx)
+ break;
+ if (signal_pending(cur)) {
+ ret = -ERESTARTSYS;
+ break;
+ }
+
+ schedule();
+ }
+
+ finish_wait(waitq, &wait);
+out:
+ return ret;
+}
+
+static int __init syslet_module_init(void)
+{
+ unsigned long i;
+
+ for (i = 0; i < SYSLET_HASH_NR; i++) {
+ init_waitqueue_head(&syslet_waitqs[i]);
+ mutex_init(&syslet_muts[i]);
+ }
+
+ return 0;
+}
+module_init(syslet_module_init);
--
1.5.2.2

2007-12-06 23:22:22

by Zach Brown

Subject: [PATCH 6/6] syslets: add both 32bit and 64bit x86 syslet support

This adds the architecture-specific routines needed by syslets for x86.

The syslet thread creation routines create a new thread which executes
a kernel function and then returns to userspace instead of exiting.

move_user_context() and set_user_frame() let the scheduler modify a child
thread so that it returns to userspace at the same place that a blocking
system call would have when it finished. This currently performs a very
expensive copy of the fpu state. Intel is working on a more robust patch
which allocates the i387 state off of thread_struct. When that is ready
this can just juggle pointers to transfer the fpu state.

The syslets infrastructure needs to work with ptregs for the task which
is in sys_indirect(). So we add a PTREGSCALL stub around sys_indirect()
in x86_64.

Finally, we wire up sys_syslet_ring_wait().

Signed-off-by: Zach Brown <[email protected]>

diff --git a/arch/x86/kernel/entry_32.S b/arch/x86/kernel/entry_32.S
index dc7f938..66a121d 100644
--- a/arch/x86/kernel/entry_32.S
+++ b/arch/x86/kernel/entry_32.S
@@ -1025,6 +1025,30 @@ ENTRY(kernel_thread_helper)
CFI_ENDPROC
ENDPROC(kernel_thread_helper)

+ENTRY(syslet_thread_helper)
+ CFI_STARTPROC
+ /*
+ * Allocate space on the stack for pt-regs.
+ * sizeof(struct pt_regs) == 64, and we've got 8 bytes on the
+ * kernel stack already:
+ */
+ subl $64-8, %esp
+ CFI_ADJUST_CFA_OFFSET 64-8
+ movl %edx,%eax
+ push %edx
+ CFI_ADJUST_CFA_OFFSET 4
+ call *%ebx
+ addl $4, %esp
+ CFI_ADJUST_CFA_OFFSET -4
+
+ movl %eax, PT_EAX(%esp)
+
+ GET_THREAD_INFO(%ebp)
+
+ jmp syscall_exit
+ CFI_ENDPROC
+ENDPROC(syslet_thread_helper)
+
#ifdef CONFIG_XEN
ENTRY(xen_hypervisor_callback)
CFI_STARTPROC
diff --git a/arch/x86/kernel/entry_64.S b/arch/x86/kernel/entry_64.S
index 3a058bb..08e34f6 100644
--- a/arch/x86/kernel/entry_64.S
+++ b/arch/x86/kernel/entry_64.S
@@ -412,6 +412,7 @@ END(\label)
PTREGSCALL stub_rt_sigsuspend, sys_rt_sigsuspend, %rdx
PTREGSCALL stub_sigaltstack, sys_sigaltstack, %rdx
PTREGSCALL stub_iopl, sys_iopl, %rsi
+ PTREGSCALL stub_indirect, sys_indirect, %r8

ENTRY(ptregscall_common)
popq %r11
@@ -994,6 +995,71 @@ child_rip:
ENDPROC(child_rip)

/*
+ * Create a syslet kernel thread. This differs from a thread created with
+ * kernel_thread() in that it returns to userspace after it finishes executing
+ * its given kernel function.
+ *
+ * C extern interface:
+ * extern int create_syslet_thread(long (*fn)(void *),
+ * void * arg, unsigned long flags)
+ *
+ * asm input arguments:
+ * rdi: fn, rsi: arg, rdx: flags
+ */
+ENTRY(create_syslet_thread)
+ CFI_STARTPROC
+ FAKE_STACK_FRAME $syslet_child_rip
+ SAVE_ALL
+
+ # rdi: flags, rsi: usp, rdx: will be &pt_regs
+ movq %rdx,%rdi
+ movq $-1, %rsi
+ movq %rsp, %rdx
+
+ xorl %r8d,%r8d
+ xorl %r9d,%r9d
+
+ # clone now
+ call do_fork
+ movq %rax,RAX(%rsp)
+ xorl %edi,%edi
+
+ /*
+ * It isn't worth to check for reschedule here,
+ * so internally to the x86_64 port you can rely on kernel_thread()
+ * not to reschedule the child before returning, this avoids the need
+ * of hacks for example to fork off the per-CPU idle tasks.
+ * [Hopefully no generic code relies on the reschedule -AK]
+ */
+ RESTORE_ALL
+ UNFAKE_STACK_FRAME
+ ret
+ CFI_ENDPROC
+ENDPROC(create_syslet_thread)
+
+syslet_child_rip:
+ CFI_STARTPROC
+
+ movq %rdi, %rax
+ movq %rsi, %rdi
+ call *%rax
+
+ /*
+ * Fix up the PDA - we might return with sysexit:
+ */
+ RESTORE_TOP_OF_STACK %r11
+
+ /*
+ * return to user-space:
+ */
+ movq %rax, RAX(%rsp)
+ RESTORE_REST
+ jmp int_ret_from_sys_call
+
+ CFI_ENDPROC
+ENDPROC(syslet_child_rip)
+
+/*
* execve(). This function needs to use IRET, not SYSRET, to set up all state properly.
*
* C extern interface:
diff --git a/arch/x86/kernel/process_32.c b/arch/x86/kernel/process_32.c
index 7b89958..7bf2836 100644
--- a/arch/x86/kernel/process_32.c
+++ b/arch/x86/kernel/process_32.c
@@ -394,6 +394,39 @@ int kernel_thread(int (*fn)(void *), void * arg, unsigned long flags)
EXPORT_SYMBOL(kernel_thread);

/*
+ * This gets run with %ebx containing the
+ * function to call, and %edx containing
+ * the "args".
+ */
+void syslet_thread_helper(void);
+
+/*
+ * Create a syslet kernel thread. This differs from a thread created with
+ * kernel_thread() in that it returns to userspace after it finishes executing
+ * its given kernel function.
+ */
+int create_syslet_thread(long (*fn)(void *), void *arg, unsigned long flags)
+{
+ struct pt_regs regs;
+
+ memset(&regs, 0, sizeof(regs));
+
+ regs.ebx = (unsigned long)fn;
+ regs.edx = (unsigned long)arg;
+
+ regs.xds = __USER_DS;
+ regs.xes = __USER_DS;
+ regs.xfs = __KERNEL_PERCPU;
+ regs.orig_eax = -1;
+ regs.eip = (unsigned long)syslet_thread_helper;
+ regs.xcs = __KERNEL_CS | get_kernel_rpl();
+ regs.eflags = X86_EFLAGS_IF | X86_EFLAGS_SF | X86_EFLAGS_PF | 0x2;
+
+ /* Ok, create the new task.. */
+ return do_fork(flags, 0, &regs, 0, NULL, NULL);
+}
+
+/*
* Free current thread data structures etc..
*/
void exit_thread(void)
@@ -852,6 +885,32 @@ unsigned long get_wchan(struct task_struct *p)
}

/*
+ * This expensive hack will go away once thread->i387 is allocated instead of
+ * embedded in task_struct. Intel is working on it.
+ */
+static union i387_union i387_tmp[NR_CPUS] __cacheline_aligned_in_smp;
+
+/*
+ * Move user-space context from one kernel thread to another.
+ * This includes registers and FPU state. Callers must make
+ * sure that neither task is running user context at the moment:
+ */
+void move_user_context(struct task_struct *dest, struct task_struct *src)
+{
+ struct pt_regs *old_regs = task_pt_regs(src);
+ struct pt_regs *new_regs = task_pt_regs(dest);
+ union i387_union *tmp;
+
+ *new_regs = *old_regs;
+
+ tmp = &i387_tmp[get_cpu()];
+ *tmp = dest->thread.i387;
+ dest->thread.i387 = src->thread.i387;
+ src->thread.i387 = *tmp;
+ put_cpu();
+}
+
+/*
* sys_alloc_thread_area: get a yet unused TLS descriptor index.
*/
static int get_free_idx(void)
diff --git a/arch/x86/kernel/process_64.c b/arch/x86/kernel/process_64.c
index 6309b27..9fb050d 100644
--- a/arch/x86/kernel/process_64.c
+++ b/arch/x86/kernel/process_64.c
@@ -436,6 +436,16 @@ void release_thread(struct task_struct *dead_task)
}
}

+/*
+ * Move user-space context from one kernel thread to another.
+ * Callers must make sure that neither task is running user context
+ * at the moment:
+ */
+void move_user_context(struct task_struct *dest, struct task_struct *src)
+{
+ *task_pt_regs(dest) = *task_pt_regs(src);
+}
+
static inline void set_32bit_tls(struct task_struct *t, int tls, u32 addr)
{
struct user_desc ud = {
diff --git a/arch/x86/kernel/syscall_table_32.S b/arch/x86/kernel/syscall_table_32.S
index 92095b2..5a532cf 100644
--- a/arch/x86/kernel/syscall_table_32.S
+++ b/arch/x86/kernel/syscall_table_32.S
@@ -325,3 +325,4 @@ ENTRY(sys_call_table)
.long sys_eventfd
.long sys_fallocate
.long sys_indirect /* 325 */
+ .long sys_syslet_ring_wait
diff --git a/include/asm-x86/syslet-abi.h b/include/asm-x86/syslet-abi.h
index 14a7182..06e7528 100644
--- a/include/asm-x86/syslet-abi.h
+++ b/include/asm-x86/syslet-abi.h
@@ -1 +1,9 @@
-#include <asm-generic/syslet-abi.h>
+#ifndef __ASM_X86_SYSLET_ABI_H
+#define __ASM_X86_SYSLET_ABI_H
+
+struct syslet_frame {
+ u64 ip;
+ u64 sp;
+};
+
+#endif
diff --git a/include/asm-x86/syslet.h b/include/asm-x86/syslet.h
index 583d810..75e4532 100644
--- a/include/asm-x86/syslet.h
+++ b/include/asm-x86/syslet.h
@@ -1 +1,32 @@
-#include <asm-generic/syslet.h>
+#ifndef __ASM_X86_SYSLET_H
+#define __ASM_X86_SYSLET_H
+
+#include "syslet-abi.h"
+
+/* These are provided by kernel/entry.S and kernel/process.c */
+void move_user_context(struct task_struct *dest, struct task_struct *src);
+int create_syslet_thread(long (*fn)(void *),
+ void *arg, unsigned long flags);
+
+static inline int syslet_frame_valid(struct syslet_frame *frame)
+{
+ return frame->ip && frame->sp;
+}
+
+#ifdef CONFIG_X86_32
+static inline void set_user_frame(struct task_struct *task,
+ struct syslet_frame *frame)
+{
+ task_pt_regs(task)->eip = frame->ip;
+ task_pt_regs(task)->esp = frame->sp;
+}
+#else
+static inline void set_user_frame(struct task_struct *task,
+ struct syslet_frame *frame)
+{
+ task_pt_regs(task)->rip = frame->ip;
+ task_pt_regs(task)->rsp = frame->sp;
+}
+#endif
+
+#endif
diff --git a/include/asm-x86/unistd_32.h b/include/asm-x86/unistd_32.h
index 8ee0b20..0857c4d 100644
--- a/include/asm-x86/unistd_32.h
+++ b/include/asm-x86/unistd_32.h
@@ -331,10 +331,11 @@
#define __NR_eventfd 323
#define __NR_fallocate 324
#define __NR_indirect 325
+#define __NR_syslet_ring_wait 326

#ifdef __KERNEL__

-#define NR_syscalls 326
+#define NR_syscalls 327

#define __ARCH_WANT_IPC_PARSE_VERSION
#define __ARCH_WANT_OLD_READDIR
diff --git a/include/asm-x86/unistd_64.h b/include/asm-x86/unistd_64.h
index 66eab33..e01f5dc 100644
--- a/include/asm-x86/unistd_64.h
+++ b/include/asm-x86/unistd_64.h
@@ -636,7 +636,9 @@ __SYSCALL(__NR_eventfd, sys_eventfd)
#define __NR_fallocate 285
__SYSCALL(__NR_fallocate, sys_fallocate)
#define __NR_indirect 286
-__SYSCALL(__NR_indirect, sys_indirect)
+__SYSCALL(__NR_indirect, stub_indirect)
+#define __NR_syslet_ring_wait 287
+__SYSCALL(__NR_syslet_ring_wait, sys_syslet_ring_wait)

#ifndef __NO_STUBS
#define __ARCH_WANT_OLD_READDIR
--
1.5.2.2

2007-12-07 12:06:51

by Evgeniy Polyakov

Subject: Re: [PATCH 5/6] syslets: add generic syslets infrastructure

Hi Zach.

On Thu, Dec 06, 2007 at 03:20:18PM -0800, Zach Brown ([email protected]) wrote:
> +/*
> + * XXX todo:
> + * - do we need all this '*cur = current' nonsense?
> + * - try to prevent userspace from submitting too much.. lazy user ptr read?
> + * - explain how to deal with waiting threads with stale data in current
> + * - how does userspace tell that a syslet completion was lost?
> + * provide an -errno argument to the userspace return function?
> + */
> +
> +/*
> + * These structs are stored on the kernel stack of tasks which are waiting to
> + * return to userspace. They are linked into their parent's list of syslet
> + * children stored in 'syslet_tasks' in the parent's task_struct.
> + */
> +struct syslet_task_entry {
> + struct task_struct *task;
> + struct list_head item;
> +};
> +
> +/*
> + * syslet_ring doesn't have any kernel-side storage. Userspace allocates them
> + * in their address space and initializes their fields and then passes them to
> + * the kernel.
> + *
> + * These hashes provide the kernel-side storage for the wait queues which
> + * sys_syslet_ring_wait() uses and the mutex which completion uses to serialize
> + * the (possibly blocking) ordered writes of the completion and kernel head
> + * index into the ring.
> + *
> + * We choose the bucket that supports a given ring by hashing a u32 that
> + * userspace sets in the ring.
> + */
> +#define SYSLET_HASH_BITS (CONFIG_BASE_SMALL ? 4 : 8)
> +#define SYSLET_HASH_NR (1 << SYSLET_HASH_BITS)
> +#define SYSLET_HASH_MASK (SYSLET_HASH_NR - 1)
> +static wait_queue_head_t syslet_waitqs[SYSLET_HASH_NR];
> +static struct mutex syslet_muts[SYSLET_HASH_NR];

Why do you care about hashed tables scalability and not using trees?

--
Evgeniy Polyakov

2007-12-07 18:24:41

by Zach Brown

Subject: Re: [PATCH 5/6] syslets: add generic syslets infrastructure


>> +/*
>> + * syslet_ring doesn't have any kernel-side storage. Userspace allocates them
>> + * in their address space and initializes their fields and then passes them to
>> + * the kernel.
>> + *
>> + * These hashes provide the kernel-side storage for the wait queues which
>> + * sys_syslet_ring_wait() uses and the mutex which completion uses to serialize
>> + * the (possibly blocking) ordered writes of the completion and kernel head
>> + * index into the ring.
>> + *
>> + * We choose the bucket that supports a given ring by hashing a u32 that
>> + * userspace sets in the ring.
>> + */
>> +#define SYSLET_HASH_BITS (CONFIG_BASE_SMALL ? 4 : 8)
>> +#define SYSLET_HASH_NR (1 << SYSLET_HASH_BITS)
>> +#define SYSLET_HASH_MASK (SYSLET_HASH_NR - 1)
>> +static wait_queue_head_t syslet_waitqs[SYSLET_HASH_NR];
>> +static struct mutex syslet_muts[SYSLET_HASH_NR];
>
> Why do you care about hashed tables scalability and not using trees?

Well, this notion of letting tasks safely complete to any ring they can
address is just a possibility. We might decide that it's not worth it.
This implementation was an easy example that borrows from the way
futexes do similar work.

I like it because you could have, say, different processes completing
into a ring in shared memory.

If we do allow this kind of flexible ring specification, it's not at all
clear that trees would be the best way to address the scalability
limits. There are lots of possibilities, including locking the page
lock of the page which holds the head index.

- z

2007-12-08 12:43:18

by Simon Holm Thøgersen

Subject: Re: [PATCH 1/6] indirect: use asmlinkage in i386 syscall table prototype


tor, 06 12 2007 kl. 15:20 -0800, skrev Zach Brown:
> call_indirect() was using the wrong calling convention for the system call
> handlers. system call handlers would get mixed up arguments.
>
> Signed-off-by: Zach Brown <[email protected]>
>
> diff --git a/include/asm-x86/indirect_32.h b/include/asm-x86/indirect_32.h
> index a1b72ac..e3dea8e 100644
> --- a/include/asm-x86/indirect_32.h
> +++ b/include/asm-x86/indirect_32.h
> @@ -15,8 +15,8 @@ struct indirect_registers {
>
> static inline long call_indirect(struct indirect_registers *regs)
> {
> - extern long (*sys_call_table[]) (__u32, __u32, __u32, __u32, __u32, __u32);
> -
> + extern asmlinkage long (*sys_call_table[])(long, long, long,
This should be something like below instead, otherwise gcc won't parse
asmlinkage as being an attribute of the function signature.
extern long (asmlinkage *sys_call_table[])(long, long, long,
I don't know if it has changed with recent gcc versions; this works for
me with 4.2.0.
> + long, long, long);
> return sys_call_table[INDIRECT_SYSCALL(regs)](regs->ebx, regs->ecx,
> regs->edx, regs->esi,
> regs->edi, regs->ebp);


Simon Holm Thøgersen

2007-12-08 12:54:54

by Simon Holm Thøgersen

Subject: [PATCH] Fix casting on architectures with 32-bit pointers/longs.


tor, 06 12 2007 kl. 15:20 -0800, skrev Zach Brown:
> The following patches are a substantial refactoring of the syslet code. I'm
> branding them as the v7 release of the syslet infrastructure, though they
> represent a significant change in focus.
>
> My current focus is to see the most fundamental functionality brought to
> maturity. To me, this means getting an ABI that is used by applications through
> glibc on x86 and PPC64. Only once that is ready should we distract ourselves
> with advanced complexity.
>
> To that end, this patch series differs from v6 in significant ways:
>
> * syslets are initiated by providing syslet arguments to sys_indirect().
>
> * uatoms, threadlets, and the kaio changes are postponed until they can be
> justified and rebuilt on more complete infrastructure. (I'm not saying
> these shouldn't or won't be pursued. I'm saying that we should get the
> simplest piece working first.)
>
> * the code is clarified and commented, the patches are bisectable and pass
> checkpatch.
>
> The use of sys_indirect() and the move from 'atom's simplified the ABI
> considerably. I've put a trivial example in a syslet-userspace git tree:
>
> git://git.kernel.org/pub/scm/linux/kernel/git/zab/syslets-userspace.git
>

Signed-off-by: Simon Holm Thøgersen <[email protected]>
---

diff --git a/basic.c b/basic.c
index 418a1a3..5938d85 100644
--- a/basic.c
+++ b/basic.c
@@ -42,7 +42,7 @@ int main(int argc, char **argv)
params.syslet.frame.sp = (u64)(long)memalign(pagesize, pagesize);

memset(&params, 0, sizeof(params));
- params.syslet.frame.ip = (u64)syslet_return_func;
+ params.syslet.frame.ip = (u64)(long)syslet_return_func;
params.syslet.frame.sp = (u64)(long)memalign(pagesize, pagesize);
params.syslet.ring_ptr = (u64)(long)ring;

@@ -55,7 +55,7 @@ int main(int argc, char **argv)
pid, my_pid);
}

- params.syslet.frame.ip = (u64)syslet_return_func;
+ params.syslet.frame.ip = (u64)(long)syslet_return_func;
params.syslet.frame.sp = (u64)(long)memalign(pagesize, pagesize);
params.syslet.ring_ptr = (u64)(long)ring;
params.syslet.caller_data = CALLER_DATA;

2007-12-08 21:22:53

by Zach Brown

Subject: Re: [PATCH 1/6] indirect: use asmlinkage in i386 syscall table prototype


>> + extern asmlinkage long (*sys_call_table[])(long, long, long,
> This should be something like below instead, otherwise gcc won't parse
> asmlinkage as being an attribute of the function signature.
> extern long (asmlinkage *sys_call_table[])(long, long, long,

Yeah, thanks for pointing these out. As it happened, Jens beat you to
it :).

Both problems have been fixed in the git repositories for the guilt
series and userspace tools, respectively. You can always fetch the most
recent versions from there.

- z

2007-12-10 19:46:36

by Jens Axboe

Subject: Re: syslets v7: back to basics

On Thu, Dec 06 2007, Zach Brown wrote:
>
> The following patches are a substantial refactoring of the syslet code. I'm
> branding them as the v7 release of the syslet infrastructure, though they
> represent a significant change in focus.

[snip]

In case anyone is interested in playing with this, I updated the fio
syslet engine to this interface. Not very well tested yet, but it seems
to work.

I have to say that I like the new interface a lot better, it's a lot
simpler to work with. I was able to cut about 33% of the code that
handles queueing IO and retrieving events (maybe even more, summed up
loosely). I'm still not a big fan of the indirect stuff, but that's
minor.

You can get fio by doing a

$ git clone git://git.kernel.dk/fio.git

or just grabbing the latest snapshot from

http://brick.kernel.dk/snaps/fio-git-latest.tar.gz

--
Jens Axboe

2007-12-10 21:30:29

by Phillip Susi

Subject: Re: syslets v7: back to basics

Zach Brown wrote:
> The following patches are a substantial refactoring of the syslet code. I'm
> branding them as the v7 release of the syslet infrastructure, though they
> represent a significant change in focus.
>
> My current focus is to see the most fundamental functionality brought to
> maturity. To me, this means getting an ABI that is used by applications through
> glibc on x86 and PPC64. Only once that is ready should we distract ourselves
> with advanced complexity.

I pulled from your tree to look over the patches, and noticed that it
looks like several commits were merged improperly. It looks like they
were auto merged or something from an email, and the commit message
contains the email headers, rather than just the commit message in the
body. This leads to the shortlog showing entries that start with
"Return-Path:".

I was hoping to find at least some initial information on the overall
design in Documentation/ but don't see any. Have you written any yet
that I could take a look at elsewhere maybe?

Some of the things I was trying to figure out is does each syslet get
its own stack, and schedule only at a few well defined points, and if
so, would it then be fair to characterize them as kernel mode fibers?

2007-12-10 22:16:19

by Zach Brown

Subject: Re: syslets v7: back to basics


> I pulled from your tree to look over the patches, and noticed that it
> looks like several commits were merged improperly. It looks like they
> were auto merged or something from an email, and the commit message
> contains the email headers, rather than just the commit message in the
> body. This leads to the shortlog showing entries that start with
> "Return-Path:".

These are patches that guilt imported from email messages. It didn't
strip the headers and I didn't care to. I'll try to in the future, it
isn't a big deal.

> I was hoping to find at least some initial information on the overall
> design in Documentation/ but don't see any. Have you written any yet
> that I could take a look at elsewhere maybe?

No, but it's coming. I'd like to have some robust documentation so that
Ulrich can help me understand what more he'd need to support POSIX AIO
with syslets from glibc.

> Some of the things I was trying to figure out is does each syslet get
> its own stack,

Yes. Each blocking operation has a thread that is performing the
operation synchronously. The benefit is that the thread is only created
if the operation blocks. If it doesn't block then it's a normal system
call invocation. You don't have to manage threads and communicate the
arguments and results of system calls amongst threads for the case where
it never blocks.

> and schedule only at a few well defined points

No, every blocking point is considered a scheduling point.

> , and if
> so, would it then be fair to characterize them as kernel mode fibers?

I'm not sure what exactly you mean by kernel mode fibers (I can guess,
but I'd rather not). From the answer to the last question, though,
I'm going to guess that it might not be the most apt characterization.

- z

2008-01-09 02:04:31

by Rusty Russell

Subject: Re: [PATCH 5/6] syslets: add generic syslets infrastructure

On Friday 07 December 2007 10:20:18 Zach Brown wrote:
> The indirect syslet arguments specify where to store the completion and
> what function in userspace to return to once the syslet has been executed.
> The details of how we pass the indirect syslet arguments need help.

Hi Zach,

Firstly, why not just specify an address for the return value and be done
with it? This infrastructure seems overkill, and you can always extend later
if required.

Secondly, you really should allow integration with an eventfd so you don't
make the posix AIO mistake of providing a poll-incompatible interface.

Finally, and probably most alarmingly, AFAICT randomly changing TID will break
all threaded programs, which means this won't be fitted into existing code
bases, making it YA niche Linux-only API 8(

Cheers,
Rusty.

2008-01-09 03:00:22

by Zach Brown

Subject: Re: [PATCH 5/6] syslets: add generic syslets infrastructure


> Firstly, why not just specify an address for the return value and be done
> with it? This infrastructure seems overkill, and you can always extend later
> if required.

Sorry, which infrastructure?

Providing the function and stack to return to? Sure, I could certainly
entertain the idea of not having syslet tasks return to userspace in the
first pass. Ingo sure seemed excited by the idea.

Or do you mean the syscall return value ending up in the userspace
completion event ring? That's mostly about being able to wait for
pending syslets to complete.

> Secondly, you really should allow integration with an eventfd so you don't
> make the posix AIO mistake of providing a poll-incompatible interface.

Yeah, this seems straightforward enough that I haven't made it an
initial priority. I'm sure it will be helpful for people who are stuck
integrating with entrenched software that wants to wait for pollable fds.

For more flexible software, though, it's compelling to now be able to
aggregate waiting for completion of the existing waiting syscalls (poll,
epoll_wait, futexes, whatever) by issuing them as concurrent syslets.

> Finally, and probably most alarmingly, AFAICT randomly changing TID will break
> all threaded programs, which means this won't be fitted into existing code
> bases, making it YA niche Linux-only API 8(

Yeah, this still needs to be investigated. I haven't yet and I haven't
heard of anyone else trying their hand at it.

In the YANLOA mode apps would know that executing syslets is an implicit
clone() and would act accordingly. "8(", indeed.

I wonder if there isn't an opportunity to add a clone() flag which
juggles the association between TIDs and task_structs. I don't relish
the idea of investigating the life cycles of task_struct references that
derive from TIDs and seeing how those would race with a syslet blocking
and cloning, but, well, maybe that's what needs to be done.

This all isn't my area of expertise, though, sadly. It would be swell
if someone wanted to look into it before I'm forced to learn yet another
weird corner of the kernel.

- z

2008-01-09 03:49:20

by Rusty Russell

Subject: Re: [PATCH 5/6] syslets: add generic syslets infrastructure

On Wednesday 09 January 2008 14:00:04 Zach Brown wrote:
> > Firstly, why not just specify an address for the return value and be
> > done with it? This infrastructure seems overkill, and you can always
> > extend later if required.
>
> Sorry, which infrastructure?
>
> Providing the function and stack to return to? Sure, I could certainly
> entertain the idea of not having syslet tasks return to userspace in the
> first pass. Ingo sure seemed excited by the idea.
>
> Or do you mean the syscall return value ending up in the userspace
> completion event ring? That's mostly about being able to wait for
> pending syslets to complete.

The latter. A ring is optimal for processing a huge number of requests, but
if you're really going to be firing off syslet threads all over the place
you're not going to be optimal anyway. And being able to point the return
value to the stack or into some datastructure is way nicer to code (zero
setup == easy to understand and easy to convert).

For notification, see below.

> > Secondly, you really should allow integration with an eventfd so you
> > don't make the posix AIO mistake of providing a poll-incompatible
> > interface.
>
> Yeah, this seems straightforward enough that I haven't made it an
> initial priority. I'm sure it will be helpful for people who are stuck
> integrating with entrenched software that wants to wait for pollable fds.

Unfortunately, waiting for someone to write a killer app which uses your new
API is the road to disappointment. The real target is convincing the handful
of important apps (Samba, Apache, ...) to #ifdef around some small piece of
code in order to get performance. And a mere single design wart could mean
that never happens. Look at epoll, it's probably been the most successful
and it's still damn niche.

> For more flexible software, though, it's compelling to now be able to
> aggregate waiting for completion of the existing waiting syscalls (poll,
> epoll_wait, futexes, whatever) by issuing them as concurrent syslets.

Is replacing epoll with syslets really going to win, even if you're writing
apps from scratch? Anyway a fast notification mechanism is a different
problem than syslets, and should be separated.

> > Finally, and probably most alarmingly, AFAICT randomly changing TID will
> > break all threaded programs, which means this won't be fitted into
> > existing code bases, making it YA niche Linux-only API 8(
>
> I wonder if there isn't an opportunity to add a clone() flag which
> juggles the association between TIDs and task_structs. I don't relish
> the idea of investigating the life cycles of task_struct references that
> derive from TIDs and seeing how those would race with a syslet blocking
> and cloning, but, well, maybe that's what needs to be done.

This must be solved, yet all avenues seem crawling with worms. Redirecting
find_task_by_pid() to find the original and converting all the places where
we return tids to userspace? Swapping tids when we clone? Duplicate tids,
with only the non-syslet one being returned from find_task_by_pid()?

> This all isn't my area of expertise, though, sadly. It would be swell
> if someone wanted to look into it before I'm forced to learn yet another
> weird corner of the kernel.

Let's just tell Ingo it's impossible to solve :)

Rusty.

2008-01-09 18:17:14

by Zach Brown

Subject: Re: [PATCH 5/6] syslets: add generic syslets infrastructure


>> Or do you mean the syscall return value ending up in the userspace
>> completion event ring? That's mostly about being able to wait for
>> pending syslets to complete.
>
> The latter. A ring is optimal for processing a huge number of requests, but
> if you're really going to be firing off syslet threads all over the place
> you're not going to be optimal anyway. And being able to point the return
> value to the stack or into some datastructure is way nicer to code (zero
> setup == easy to understand and easy to convert).

One of Linus' rhetorical requirements for the syslets work is that we be
able to wait for the result without spending overhead building up state
in some completion context. The userland rings in the current syslet
patches achieve that and don't seem outrageously complicated.

I have a hard time getting worked up about this particular piece of the
puzzle, though. If people are excited about *only* providing a pollable
fd to collect the syslet completions, well, great, whatever.

> This must be solved, yet all avenues seem crawling with worms.

Yup.

- z

2008-01-09 22:05:25

by Rusty Russell

Subject: Re: [PATCH 5/6] syslets: add generic syslets infrastructure

On Thursday 10 January 2008 05:16:57 Zach Brown wrote:
> > The latter. A ring is optimal for processing a huge number of requests,
> > but if you're really going to be firing off syslet threads all over the
> > place you're not going to be optimal anyway. And being able to point the
> > return value to the stack or into some datastructure is way nicer to code
> > (zero setup == easy to understand and easy to convert).
>
> One of Linus' rhetorical requirements for the syslets work is that we be
> able to wait for the result without spending overhead building up state
> in some completion context. The userland rings in the current syslet
> patches achieve that and don't seem outrageously complicated.

I'd have to read his original statement, but eventfd doesn't build up state,
so I think it qualifies.

YA incompatible userspace notification system just doesn't excite me though.

Cheers,
Rusty.

2008-01-09 23:00:56

by Linus Torvalds

Subject: Re: [PATCH 5/6] syslets: add generic syslets infrastructure



On Thu, 10 Jan 2008, Rusty Russell wrote:
>
> I'd have to read his original statement, but eventfd doesn't build up state,
> so I think it qualifies.

How about you guys battle it out by giving an example program using the
interface?

Here's a favourite really simple load of mine:

- do the equivalent of "ls -lR" or "find /usr" as quickly as possible,
without playing "sort by the inode numbers" games (that don't work in
general, but are great for some filesystems)

Do this on a directory that isn't newly created, but has had files
added and removed over time (so that the return order of "readdir()"
isn't dense and sorted in the inode tables already). The classic
example is "ls -l /usr/bin" or similar.

This is actually a fairly hard load, because there's two easily tested
cases that are both equally important: hot-cache (no IO at all, just CPU),
and cold-cache (all about trying to get concurrent IO on inode lookups
going, to get the disk elevator working in the *absence* of any sorting).

And almost all of the operations are operations that right now have no
asynchronous model except for full threads (ie neither filename nor inode
lookup have any aio_xyz() things).

How can we do something like *that*? It's about as simple an IO test
program as you can imagine, while I'd argue that it is still a reasonably
"realistic" thing to do, and interesting for asynchronous operations.

How would you do something like this, striving to allow overlap of IO, and
getting (hopefully) the block layer to create bigger request sizes?

To make it simple, let's make the *only* operation we care about being
asynchronous be that "lstat()". And instead of printing out each file,
just add up the sizes or something (ie make this approximate "du -sh").

But the important thing is that if things are cached, it should be the
same speed as this trivial program, ie the async interfaces shouldn't slow
things down. But if things require IO, hopefully we can get at least some
reasonable number of concurrent IO's going, and make the program not
seek back and forth quite so much.

Try this with and without a "echo 3 > /proc/sys/vm/drop_caches" in
between. Both are real and relevant usage cases, I think.

And *simplicity* of the end result really does matter. If it's not simple
to use, people won't use it.

Linus

---
#include <sys/types.h>
#include <sys/stat.h>
#include <unistd.h>
#include <dirent.h>
#include <string.h>
#include <stdlib.h>
#include <stdio.h>

static void find(const char *base, int baselen);

static void handle(const char *name, int namelen)
{
	struct stat st;

	if (lstat(name, &st) < 0)
		return;
	printf("%8lu %s\n", (unsigned long)st.st_size, name);
	if (S_ISDIR(st.st_mode))
		find(name, namelen);
}

static void find(const char *base, int baselen)
{
	DIR *dir;
	char *name = malloc(baselen + 255 + 2);

	if (!name)
		return;
	memcpy(name, base, baselen);
	name[baselen] = '/';
	dir = opendir(base);
	if (dir) {
		struct dirent *de;
		while ((de = readdir(dir)) != NULL) {
			const char *p = de->d_name;
			int len = strlen(p);
			if (len > 255)
				continue;
			if (p[0] == '.') {
				if (len == 1)
					continue;
				if (len == 2 && p[1] == '.')
					continue;
			}
			memcpy(name + baselen + 1, de->d_name, len+1);
			handle(name, baselen + 1 + len);
		}
		closedir(dir);
	}
	free(name);
}

int main(int argc, char **argv)
{
	find(".",1);
	return 0;
}
}

2008-01-09 23:15:46

by Linus Torvalds

Subject: Re: [PATCH 5/6] syslets: add generic syslets infrastructure



On Wed, 9 Jan 2008, Linus Torvalds wrote:
>
> Try this with and without a "echo 3 > /proc/sys/vm/drop_caches" in
> between. Both are real and relevant usage cases, I think.

Side note, for me the difference on my home directory for the cached vs
uncached case is 5 seconds vs 5 minutes. I like the 5 sec, I'd like to
improve on that 5 min.

Is this test a bit *too* simplistic? Probably. But I think that if we can
come up with an interface that works ok for that test, it at least signals
*something*.

Linus

2008-01-09 23:16:06

by Davide Libenzi

Subject: Re: [PATCH 5/6] syslets: add generic syslets infrastructure

On Thu, 10 Jan 2008, Rusty Russell wrote:

> On Thursday 10 January 2008 05:16:57 Zach Brown wrote:
> > > The latter. A ring is optimal for processing a huge number of requests,
> > > but if you're really going to be firing off syslet threads all over the
> > > place you're not going to be optimal anyway. And being able to point the
> > > return value to the stack or into some datastructure is way nicer to code
> > > (zero setup == easy to understand and easy to convert).
> >
> > One of Linus' rhetorical requirements for the syslets work is that we be
> > able to wait for the result without spending overhead building up state
> > in some completion context. The userland rings in the current syslet
> > patches achieve that and don't seem outrageously complicated.
>
> I'd have to read his original statement, but eventfd doesn't build up state,
> so I think it qualifies.

I think you and Zach are talking about different issues ;)
Eventfd was born for two different reasons. First, to let userspace
signal an event to a poll/select/epoll based listener. This can also be
done with pipes, but eventfd has a few advantages over pipes (3-4 times
faster and *a lot* less memory footprint). Second, as a generic way for
kernel subsystems to signal completions to a poll/select/epoll userspace
listener. And this is what is used in the new KAIO eventfd feature (the
patch was like 5 lines IIRC).
This allows KAIO events to be signaled to poll/select/epoll in a pretty
easy way, using a simple extension of the AIO API.
What we originally talked about with Ingo, when the first syslet code
came up, was the ability to do the reverse thing. That is, host an
epoll_wait() inside a syslet, and gather the completion using whatever
the syslet code was/is going to use for it.
Given that 1) the eventfd completion patch was trivial and immediately
available and 2) the future of the whole syslet concept was/is still
unclear, I believe it makes/made sense. If syslets become mainline,
userspace will then be able to select the event-completion "hosting"
method that best suits its needs. That is, either AIO completion hosted
inside an epoll_wait() via an eventfd, or an epoll_wait() hosted inside
a syslet.




- Davide
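
A minimal sketch of the first use described above, userspace signaling an
event to a poll()-based listener through an eventfd. The helper names
(signal_events, wait_events, eventfd_demo) are made up for illustration;
only eventfd(), poll(), read(), write() and close() are real interfaces.
It does not show the KAIO or syslet sides of the discussion.

```c
#include <assert.h>
#include <poll.h>
#include <stdint.h>
#include <sys/eventfd.h>
#include <sys/types.h>
#include <unistd.h>

/* Add n to the eventfd counter; returns 0 on success, -1 on failure. */
static int signal_events(int efd, uint64_t n)
{
	return write(efd, &n, sizeof(n)) == (ssize_t)sizeof(n) ? 0 : -1;
}

/*
 * Wait up to timeout_ms for the eventfd to become readable, then read
 * (and thereby reset) its counter.  Returns the counter value, or 0 on
 * timeout or error.
 */
static uint64_t wait_events(int efd, int timeout_ms)
{
	struct pollfd pfd = { .fd = efd, .events = POLLIN };
	uint64_t count = 0;

	if (poll(&pfd, 1, timeout_ms) <= 0 || !(pfd.revents & POLLIN))
		return 0;
	if (read(efd, &count, sizeof(count)) != (ssize_t)sizeof(count))
		return 0;
	return count;
}

/* Signal twice, then collect both signals with a single wakeup. */
static uint64_t eventfd_demo(void)
{
	int efd = eventfd(0, 0);
	uint64_t got;

	if (efd < 0)
		return 0;
	signal_events(efd, 2);
	signal_events(efd, 3);
	got = wait_events(efd, 1000);
	/* the read above reset the counter, so nothing is left pending */
	assert(wait_events(efd, 0) == 0);
	close(efd);
	return got;
}
```

The listener only ever holds one descriptor and one 64-bit counter, which
is the sense in which an eventfd does not build up per-event state: the
two writes are coalesced into a single readable value.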

2008-01-09 23:48:03

by Zach Brown

Subject: Re: [PATCH 5/6] syslets: add generic syslets infrastructure

Linus Torvalds wrote:
>
> On Thu, 10 Jan 2008, Rusty Russell wrote:
>> I'd have to read his original statement, but eventfd doesn't build up state,
>> so I think it qualifies.
>
> How about you guys battle it out by giving an example program using the
> interface?
>
> Here's a favourite really simple load of mine:
>
> - do the equivalent of "ls -lR" or "find /usr" as quickly as possible,
> without playing "sort by the inode numbers" games (that don't work in
> general, but are great for some filesystems)
>
> Do this on a directory that isn't newly created, but has had files
> added and removed over time (so that the return order of "readdir()"
> isn't dense and sorted in the inode tables already). The classic
> example is "ls -l /usr/bin" or similar.

Sure, that's straightforward enough. We've all written little test
apps for variants of this load in the past, anyway. (It was one of the
first things I did for fibrils, Ingo had a variant for syslets which
read small file data too, Chris has a syslet mode in his 'acp' util, etc.)

I was going to send out a patch series pretty soon which includes
cleanups (I think) of the sys_indirect() infrastructure. I can throw
together this little test app along with that.

- z

2008-01-10 01:18:38

by Rusty Russell

Subject: Re: [PATCH 5/6] syslets: add generic syslets infrastructure

On Thursday 10 January 2008 09:58:10 Linus Torvalds wrote:
> On Thu, 10 Jan 2008, Rusty Russell wrote:
> > I'd have to read his original statement, but eventfd doesn't build up
> > state, so I think it qualifies.
>
> How about you guys battle it out by giving an example program using the
> interface?

Nice idea.

> And *simplicity* of the end result really does matter. If it's not simple
> to use, people won't use it.

Completely agreed, but async is always more complex than sync.

e.g. your malloc()-and-overwrite trick here assumes it's serial, as do the
stack vars. Even before anything's happened we've massively increased the
number of mallocs :( Maybe someone else can see a neater way?

Below is an async-ready version (I've assumed you still want each dir grouped
together). It's already slower (best 0m3.842s vs best 0m3.659s):

#include <sys/types.h>
#include <sys/stat.h>
#include <unistd.h>
#include <dirent.h>
#include <string.h>
#include <stdlib.h>
#include <stdio.h>

static void find(const char *base, int baselen);

/* One stat request; each owns its name so requests can be in flight. */
struct result
{
	struct result *next;
	struct stat st;
	unsigned int namelen;
	char name[];
};

/* Build "base/sub" in a freshly allocated result; NULL if out of memory. */
static struct result *new_result(const char *base, int baselen,
				 const char *sub, int sublen)
{
	struct result *r;

	r = malloc(sizeof(*r) + baselen + sublen + 2);
	if (!r)
		return NULL;
	memcpy(r->name, base, baselen);
	r->name[baselen] = '/';
	memcpy(r->name + baselen + 1, sub, sublen+1);
	r->namelen = baselen + sublen + 1;

	return r;
}

/* Print a completed result; directories are queued on *dirs for later. */
static void process(struct result *r, struct result **dirs)
{
	/* st_size is an off_t; cast so it matches %lu on every arch */
	printf("%8lu %s\n", (unsigned long)r->st.st_size, r->name);
	if (S_ISDIR(r->st.st_mode)) {
		r->next = *dirs;
		*dirs = r;
	} else
		free(r);
}

/* -1 = fail, 0 = success, 1 = in progress. */
static int handle_async(struct result *r)
{
	/* Ours is sync. */
	return lstat(r->name, &r->st);
}

static void process_pending(struct result *pending, struct result **dirs)
{
	/* Since we're sync, pending will be NULL. Otherwise call
	   pending as they complete. */
}

static void find(const char *base, int baselen)
{
	DIR *dir;
	struct result *r, *pending = NULL, *dirs = NULL;

	dir = opendir(base);
	if (dir) {
		struct dirent *de;
		while ((de = readdir(dir)) != NULL) {
			const char *p = de->d_name;
			int len = strlen(p);
			/* skip "." and ".." */
			if (p[0] == '.') {
				if (len == 1)
					continue;
				if (len == 2 && p[1] == '.')
					continue;
			}
			r = new_result(base, baselen, p, len);
			if (!r)
				continue;
			switch (handle_async(r)) {
			case 0:
				process(r, &dirs);
				break;
			case -1:
				free(r);
				break;
			case 1:
				r->next = pending;
				pending = r;
				break;
			}
		}
		closedir(dir);
		process_pending(pending, &dirs);
		/* recurse only after the whole directory is printed */
		while (dirs) {
			find(dirs->name, dirs->namelen);
			r = dirs;
			dirs = dirs->next;
			free(r);
		}
	}
}

int main(int argc, char **argv)
{
	find(".",1);
	return 0;
}
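
For illustration only, here is one way the two stubs above could be filled
in so that the "1 = in progress" path is actually exercised. It uses one
plain pthread per lstat() as a stand-in for a syslet, since the syslet ABI
itself is not something this sketch can assume; the worker and ret fields,
stat_worker(), report() and stat_all() are all hypothetical additions, and
the directory queueing from the program above is left out to keep it short.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/stat.h>

/* Hypothetical variant of struct result carrying completion state. */
struct result {
	struct result *next;
	struct stat st;
	pthread_t worker;	/* stand-in for a syslet */
	int ret;		/* lstat() return value, valid after join */
	char name[];
};

static void *stat_worker(void *arg)
{
	struct result *r = arg;

	r->ret = lstat(r->name, &r->st);
	return NULL;
}

static void report(const struct result *r)
{
	printf("%8lu %s\n", (unsigned long)r->st.st_size, r->name);
}

/* -1 = fail, 0 = success, 1 = in progress. */
static int handle_async(struct result *r)
{
	/* If no thread can be started, fall back to the sync call. */
	if (pthread_create(&r->worker, NULL, stat_worker, r) != 0)
		return lstat(r->name, &r->st);
	return 1;
}

/* Join every pending request; returns how many lstat() calls succeeded. */
static int process_pending(struct result *pending)
{
	int ok = 0;

	while (pending) {
		struct result *r = pending;

		pending = r->next;
		pthread_join(r->worker, NULL);
		if (r->ret == 0) {
			report(r);
			ok++;
		}
		free(r);
	}
	return ok;
}

/* Issue lstat() for n names at once; returns how many succeeded. */
static int stat_all(const char *const *names, int n)
{
	struct result *pending = NULL;
	int ok = 0;

	for (int i = 0; i < n; i++) {
		size_t len = strlen(names[i]);
		struct result *r = malloc(sizeof(*r) + len + 1);

		if (!r)
			continue;
		memcpy(r->name, names[i], len + 1);
		switch (handle_async(r)) {
		case 0:
			report(r);
			ok++;
			free(r);
			break;
		case -1:
			free(r);
			break;
		case 1:
			r->next = pending;
			pending = r;
			break;
		}
	}
	return ok + process_pending(pending);
}
```

Each request needs its own heap-allocated name and stat buffer for as long
as it is in flight, which is the extra malloc cost mentioned above. A thread
per lstat() is far too heavy for real use; it is only here to make the
pending list do something.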

2008-01-10 05:41:43

by Jeff Garzik

Subject: Re: [PATCH 5/6] syslets: add generic syslets infrastructure

So my radical ultra tired rant of the week...

Rather than adding sys_indirect and syslets as is,

* admit that this is beginning to look like a new ABI. explore the
freedoms that that avenue opens...

* (even more radical) I wonder what a tiny, SANE register-based
bytecode interface might look like. Have a single page shared between
kernel and userland, for each thread. Userland fills that page with
bytecode, for a virtual machine with 256 registers -- where
instructions roughly equate to syscalls.

The common case -- a single syscall like open(2) -- would be a single
byte bytecode, plus a couple VM register stores. The result is stored
in another VM register.

But this format enables more complex cases, where userland programs can
pass strings of syscalls into the kernel, and let them execute until
some exceptional condition occurs. Results would be stored in VM
registers (or userland addresses stored in VM registers...).

This sort of interface would
* be fast

* equate to the current syscall regime (easy to get existing APIs
going... hopefully equivalent to glibc switching to a strange new
SYSENTER variant)

* be flexible enough to support a simple implementation today

* be flexible enough to enable experiments into syscall parallelism (aka
VM instruction parallelism <grin>)

* be flexible enough to enable experiments into syscall batching

One would probably want to add some simple logic opcodes in addition to
opcodes for syscalls and such -- but this should not turn into Forth or
Parrot or Java :)

Thus, this new ABI can easily and immediately support all existing
syscalls, while enabling

Now to come up with a good programming API and model(s) to match this
parallel, batched kernel<->userland interface...

Jeff, very tired and delirious, so feel free to laugh at this,
but I've been pondering this for a while
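
To make the idea above a little more concrete, here is a toy userspace
model of such an interface. Everything in it is an assumption made for
illustration: the opcode names, the convention that arguments live in r0..r2
and the result lands in r255, the OP_MOV operand encoding, and the rule that
execution stops at the first negative result. It runs in an ordinary
process and says nothing about how a kernel-side version would be built.

```c
#include <assert.h>
#include <fcntl.h>
#include <stddef.h>
#include <stdint.h>
#include <unistd.h>

/* Hypothetical opcodes: one byte each, syscall arguments in r0..r2. */
enum { OP_HALT, OP_MOV, OP_OPEN, OP_READ, OP_CLOSE };

#define NREGS	256
#define R_RES	255	/* assumed: every syscall opcode stores its result here */

struct vm {
	intptr_t reg[NREGS];
};

/*
 * Run until OP_HALT, the end of the code, or the first failing call
 * (the "exceptional condition").  Returns the number of instructions
 * that completed.
 */
static size_t vm_run(struct vm *vm, const uint8_t *code, size_t len)
{
	intptr_t *r = vm->reg;
	size_t pc = 0, done = 0;

	while (pc < len) {
		switch (code[pc++]) {
		case OP_HALT:
			return done;
		case OP_MOV:	/* OP_MOV dst src: the only multi-byte opcode */
			if (pc + 2 > len)
				return done;
			r[code[pc]] = r[code[pc + 1]];
			pc += 2;
			done++;
			continue;
		case OP_OPEN:
			r[R_RES] = open((const char *)r[0], (int)r[1]);
			break;
		case OP_READ:
			r[R_RES] = read((int)r[0], (void *)r[1], (size_t)r[2]);
			break;
		case OP_CLOSE:
			r[R_RES] = close((int)r[0]);
			break;
		default:
			r[R_RES] = -1;
			break;
		}
		if (r[R_RES] < 0)
			return done;
		done++;
	}
	return done;
}

/* Batch open + read + close of /dev/zero; returns bytes read or -1. */
static intptr_t vm_demo(void)
{
	static const uint8_t prog[] = {
		OP_OPEN,
		OP_MOV, 0, R_RES,	/* r0 = fd */
		OP_MOV, 1, 20,		/* r1 = buffer address */
		OP_MOV, 2, 21,		/* r2 = length */
		OP_READ,
		OP_MOV, 10, R_RES,	/* r10 = bytes read */
		OP_CLOSE,
		OP_HALT,
	};
	char buf[8];
	struct vm vm = { { 0 } };

	/* the "couple of VM register stores" done before submitting */
	vm.reg[0] = (intptr_t)"/dev/zero";
	vm.reg[1] = O_RDONLY;
	vm.reg[20] = (intptr_t)buf;
	vm.reg[21] = sizeof(buf);

	if (vm_run(&vm, prog, sizeof(prog)) != 7)
		return -1;
	return vm.reg[10];
}
```

The single-syscall case is just a one-byte program such as { OP_OPEN }
after two register stores, while the longer program shows a string of
calls executed in one submission, with the fd passed from open to read
through a register rather than through userland.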