2004-01-14 22:48:38

by Perez-Gonzalez, Inaky

Subject: [RFC/PATCH] FUSYN Realtime & Robust mutexes for Linux try 2.1

Hi All

This code proposes an implementation of kernel-based mutexes,
taking ideas from the current futex-based implementations
(namely NPTL), intending to solve their limitations (see the
documentation) and adding features such as real-time behavior,
priority inheritance and protection, deadlock detection and
robustness.

Please look at the first patch, which contains the documentation,
for information on how it is implemented.

High level changelog since release 2.0:

- Fix docs, document the convoy phenomenon.

- Fix bugs in maintaining the state of the ufulock when
switching in and out of "fast-path mode".

- Add a few ioctl-like commands for querying the state of the
ufulock, as well as for destroying it.

- Add KCO mode, where the fast path is not available. This can
be used by architectures that don't have
compare-and-exchange; it is also the foundation for
priority protection. All tests pass and it seems to work
[although I didn't have much time to go over it].

- Ported to 2.6.1 (basing it on mm2 to get the kgdb goodies);
it patches vanilla OK; however, you will get a couple of rejects
in kernel/Makefile -- it expects a line with kthread.o [just
make sure the line with all the fuqueue stuff is there].

Still to-do:

- Ugly bug triggered by running rtnptl-test-misc::str03 with:

$ (ulimit -s 32; time ./str03 -b5 -d5 > /tmp/kk)

[triggers an inconsistency warning on NPTL patched with the
rtnptl patch].

- Finish implementing priority protection; this requires the
changes needed to implement priority inheritance with SCHED_NORMAL
tasks -- I need to find a way to have a kind of 'effective'
priority vs. the original one that is not too invasive.

- Add a flag to the timeout information that is passed in, to
indicate whether it is relative or absolute and, while we are
at it, the clock source. This way we avoid a costly kernel
access in pthread_mutex_timedlock() and friends.

- Need to rename VFULOCK_KCO to VFULOCK_WP, to reduce confusion
with the KCO mode [that was _a_ mistake].

- Wipe out debug stuff (when not needed any more)

- Research using a single-bit mode for the fast path on arches
without compare-and-exchange but with test-and-set bit.

- Add CONFIG option to compile out parts or all.

- Move the __fuqueue_wait_cancel() call into try_to_wake_up()?

- unlock()-to-task, as suggested by Jamie...is it really needed?

- Look more into safe static allocation as proposed by Scott
Wood -- the basic idea has a bunch of problems, mainly that it
kills caching, among others, but it needs further research.

Did I miss anything?

The patch is split in the following parts:

1/10: documentation files
2/10: modifications to the core
3/10: Support for i386
4/10: Support for ia64
5/10: kernel fuqueues
6/10: user space/kernel space tracker
7/10: user space fuqueues
8/10: kernel fulocks
9/10: stub for priority protection
10/10: user space fulocks

We have a site at http://developer.osdl.org/dev/robustmutexes
with references to all the releases, test code and NPTL
modifications (rtnptl) to use this code. As well, the patch
is there in a single file, in case you don't want to apply
the pieces manually.


2004-01-14 22:49:30

by Perez-Gonzalez, Inaky

Subject: [RFC/PATCH] FUSYN 3/10: Support for i386

arch/i386/kernel/entry.S | 5 +++
include/asm-i386/fulock.h | 70 ++++++++++++++++++++++++++++++++++++++++++++++
include/asm-i386/unistd.h | 7 +++-
3 files changed, 81 insertions(+), 1 deletion(-)

--- linux/arch/i386/kernel/entry.S:1.1.1.10 Tue Jan 13 21:13:29 2004
+++ linux/arch/i386/kernel/entry.S Wed Jan 14 11:11:48 2004
@@ -1048,5 +1048,10 @@
.long sys_utimes
.long sys_fadvise64_64
.long sys_ni_syscall /* sys_vserver */
+ .long sys_ufulock_lock
+ .long sys_ufulock_unlock /* 275 */
+ .long sys_ufulock_consistency
+ .long sys_ufuqueue_wait
+ .long sys_ufuqueue_wake /* 278 */

syscall_table_size=(.-sys_call_table)
--- /dev/null Wed Jan 14 14:39:30 2004
+++ linux/include/asm-i386/fulock.h Wed Jan 7 12:46:28 2004
@@ -0,0 +1,70 @@
+
+/*
+ * Fast User real-time/pi/pp/robust/deadlock SYNchronization
+ * (C) 2002-2003 Intel Corp
+ * Inaky Perez-Gonzalez <[email protected]>.
+ *
+ * Licensed under the FSF's GNU Public License v2 or later.
+ *
+ * Based on normal futexes (futex.c), (C) Rusty Russell.
+ * Please refer to Documentation/fusyn.txt for more info.
+ */
+
+#ifndef __asm_i386_fulock_h__
+#define __asm_i386_fulock_h__
+
+ /* fulock value / state; anything that is not this is a PID that
+ * currently owns the fulock. */
+
+enum vfulock {
+ VFULOCK_UNLOCKED = 0x00000000, /* Unlocked */
+ VFULOCK_HEALTHY = VFULOCK_UNLOCKED, /* KCO mode: the lock is healthy */
+ VFULOCK_KCO = 0xfffffffd, /* kernel controls ownership */
+ VFULOCK_DEAD = 0xfffffffe, /* dead, kernel controls ownership */
+ VFULOCK_NR = 0xffffffff /* fulock is not-recoverable */
+};
+
+
+#ifdef __KERNEL__
+
+/**
+ * [User usable] Atomic compare and swap.
+ *
+ * Used for locking a vfulock.
+ *
+ * @value Pointer to the value to compare and swap.
+ * @old_value Value that *value has to have for the swap to occur.
+ * @new_value New value to set if *value == old_value.
+ * @return !0 if the swap succeeded. 0 if failed.
+ */
+static inline
+unsigned vfulock_acas (volatile unsigned *value,
+ unsigned old_value, unsigned new_value)
+{
+ unsigned result;
+ asm __volatile__ (
+ "lock cmpxchg %3, %1"
+ : "=a" (result), "+m" ((*value))
+ : "a" (old_value), "r" (new_value)
+ : "memory");
+ return result == old_value;
+}
+
+
+/**
+ * Set an ufulock's associated value.
+ *
+ * @vfulock: Pointer to the vfulock value associated with the ufulock.
+ * @value: New value to assign.
+ *
+ * Wrapper for arch-specific idiosyncrasies when setting a value that
+ * is shared across different address mappings.
+ */
+static inline
+void vfulock_set (volatile unsigned *vfulock, unsigned value)
+{
+ *vfulock = value;
+}
+
+#endif /* #ifdef __KERNEL__ */
+#endif /* #ifndef __asm_i386_fulock_h__ */
--- linux/include/asm-i386/unistd.h:1.1.1.5 Mon Oct 20 13:55:33 2003
+++ linux/include/asm-i386/unistd.h Fri Nov 21 13:26:01 2003
@@ -279,8 +279,13 @@
#define __NR_utimes 271
#define __NR_fadvise64_64 272
#define __NR_vserver 273
+#define __NR_ufulock_lock 274
+#define __NR_ufulock_unlock 275
+#define __NR_ufulock_consistency 276
+#define __NR_ufuqueue_wait 277
+#define __NR_ufuqueue_wake 278

-#define NR_syscalls 274
+#define NR_syscalls 279

/* user-visible error numbers are in the range -1 - -124: see <asm-i386/errno.h> */
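
As a rough illustration of the fast path these helpers enable (described
in Documentation/fusyn.txt, patch 1/10), here is a minimal sketch,
assuming this header plus hypothetical ufulock_lock_slow() and
ufulock_unlock_slow() wrappers that enter the kernel via
sys_ufulock_lock()/sys_ufulock_unlock(); it is a sketch only, not code
from the patch:

/* Sketch: user-space fast path over a vfulock (see fusyn.txt).
 * ufulock_lock_slow()/ufulock_unlock_slow() are hypothetical wrappers
 * that enter the kernel; pid is the caller's PID (< VFULOCK_KCO). */
static int vfulock_fast_lock (volatile unsigned *vfulock, unsigned pid)
{
	/* Fast lock: CAS our PID against 0 (VFULOCK_UNLOCKED). */
	if (vfulock_acas (vfulock, VFULOCK_UNLOCKED, pid))
		return 0;			/* uncontended, we own it */
	return ufulock_lock_slow (vfulock);	/* kernel queues us */
}

static int vfulock_fast_unlock (volatile unsigned *vfulock, unsigned pid)
{
	unsigned val = *vfulock;

	if (val == VFULOCK_NR)			/* not recoverable */
		return -1;
	if (val == VFULOCK_KCO || val == VFULOCK_DEAD)
		return ufulock_unlock_slow (vfulock); /* kernel keeps state */
	/* Fast unlock: CAS 0 back in against our PID. */
	if (vfulock_acas (vfulock, pid, VFULOCK_UNLOCKED))
		return 0;
	return ufulock_unlock_slow (vfulock);	/* waiters appeared */
}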

2004-01-14 22:52:41

by Perez-Gonzalez, Inaky

Subject: [RFC/PATCH] FUSYN 2/10: modifications to the core

These are modifications needed to support fuqueues and
fulocks. They are, in a nutshell:

- Move some stuff from kernel/futex.c to futex.h to make it
usable by more people.

- Add necessary fields to the task_struct.

- Hook up to the initialization and cleanup points for process
creation and destruction.

- Hook into the signalling and timeout code for wait
cancellation, as well as for changing task priorities.

- Add error codes for dead-owner and not-recoverable.


include/asm-generic/errno-base.h | 4 +++
include/linux/futex.h | 52 +++++++++++++++++++++++++++++++++++++++
include/linux/init_task.h | 5 +++
include/linux/sched.h | 9 ++++++
kernel/Makefile | 3 +-
kernel/exit.c | 3 ++
kernel/fork.c | 2 +
kernel/futex.c | 39 +++--------------------------
kernel/sched.c | 52 ++++++++++++++++++++++++++++++++++++++-
kernel/signal.c | 16 +++++++++++-
kernel/timer.c | 11 +++++++-
11 files changed, 157 insertions(+), 39 deletions(-)

--- linux/include/linux/futex.h:1.1.1.1 Thu Jul 10 12:27:31 2003
+++ linux/include/linux/futex.h Sun Nov 16 07:21:32 2003
@@ -9,7 +9,11 @@
#define FUTEX_FD (2)
#define FUTEX_REQUEUE (3)

+#ifdef __KERNEL__

+#include <linux/jhash.h>
+
+struct timespec;
asmlinkage long sys_futex(u32 __user *uaddr, int op, int val,
struct timespec __user *utime, u32 __user *uaddr2);

@@ -17,4 +21,52 @@
long do_futex(unsigned long uaddr, int op, int val,
unsigned long timeout, unsigned long uaddr2, int val2);

+/*
+ * Futexes are matched on equal values of this key.
+ * The key type depends on whether it's a shared or private mapping.
+ * Don't rearrange members without looking at futex_hash_key().
+ *
+ * offset is aligned to a multiple of sizeof(u32) (== 4) by definition.
+ * We set bit 0 to indicate if it's an inode-based key.
+ */
+union futex_key {
+ struct {
+ unsigned long pgoff;
+ struct inode *inode;
+ int offset;
+ } shared;
+ struct {
+ unsigned long uaddr;
+ struct mm_struct *mm;
+ int offset;
+ } private;
+ struct {
+ unsigned long word;
+ void *ptr;
+ int offset;
+ } both;
+};
+
+static inline
+u32 futex_hash_key (const union futex_key *key)
+{
+ u32 hash = jhash2((u32*)&key->both.word,
+ (sizeof(key->both.word)+sizeof(key->both.ptr))/4,
+ key->both.offset);
+ return hash;
+}
+
+static inline
+int match_futex_key (const union futex_key *key1, const union futex_key *key2)
+{
+ return (key1->both.word == key2->both.word
+ && key1->both.ptr == key2->both.ptr
+ && key1->both.offset == key2->both.offset);
+}
+
+extern int get_futex_key (unsigned long uaddr, union futex_key *key);
+extern void get_key_refs(union futex_key *key);
+extern void drop_key_refs(union futex_key *key);
+
+#endif /* #ifdef __KERNEL__ */
#endif
--- linux/include/linux/init_task.h:1.1.1.2 Thu Aug 28 12:38:00 2003
+++ linux/include/linux/init_task.h Sun Nov 16 07:21:32 2003
@@ -108,6 +108,11 @@
.proc_lock = SPIN_LOCK_UNLOCKED, \
.switch_lock = SPIN_LOCK_UNLOCKED, \
.journal_info = NULL, \
+ .fuqueue_wait_lock = SPIN_LOCK_UNLOCKED, \
+ .fuqueue_wait = NULL, \
+ .fuqueue_waiter = NULL, \
+ .fulock_olist = olist_INIT(&tsk.fulock_olist), \
+ .fulock_olist_lock = SPIN_LOCK_UNLOCKED, \
}


--- linux/include/linux/sched.h:1.1.1.11 Tue Jan 13 21:13:50 2004
+++ linux/include/linux/sched.h Wed Jan 14 11:13:30 2004
@@ -29,6 +29,7 @@
#include <linux/completion.h>
#include <linux/pid.h>
#include <linux/percpu.h>
+#include <linux/fulock-olist.h> /* olist_t */

struct exec_domain;

@@ -326,7 +327,8 @@
struct sigqueue *sigq; /* signal queue entry. */
};

-
+struct fuqueue;
+struct fuqueue_waiter;
struct io_context; /* See blkdev.h */
void exit_io_context(void);

@@ -464,6 +466,11 @@

unsigned long ptrace_message;
siginfo_t *last_siginfo; /* For ptrace use. */
+ struct fuqueue *fuqueue_wait; /* waiting on this queue */
+ struct fuqueue_waiter *fuqueue_waiter; /* waiter entry for that queue */
+ spinlock_t fuqueue_wait_lock; /* FIXME: locking too heavy -- better solution? */
+ olist_t fulock_olist; /* Fulock ownership list */
+ spinlock_t fulock_olist_lock;
};

static inline pid_t process_group(struct task_struct *tsk)
--- linux/kernel/Makefile:1.1.1.7 Tue Jan 13 21:13:51 2004
+++ linux/kernel/Makefile Wed Jan 14 11:13:33 2004
@@ -7,7 +7,8 @@
sysctl.o capability.o ptrace.o timer.o user.o \
signal.o sys.o kmod.o workqueue.o pid.o \
rcupdate.o intermodule.o extable.o params.o posix-timers.o \
- kthread.o
+ kthread.o \
+ fuqueue.o fulock.o fulock-pp.o vlocator.o ufuqueue.o ufulock.o

obj-$(CONFIG_FUTEX) += futex.o
obj-$(CONFIG_GENERIC_ISA_DMA) += dma.o
--- linux/kernel/exit.c:1.1.1.8 Tue Jan 13 20:54:51 2004
+++ linux/kernel/exit.c Wed Jan 14 11:13:33 2004
@@ -740,6 +740,8 @@

}

+extern void exit_fulocks(struct task_struct *);
+
NORET_TYPE void do_exit(long code)
{
struct task_struct *tsk = current;
@@ -768,6 +770,7 @@
}

acct_process(code);
+ exit_fulocks(tsk);
__exit_mm(tsk);

exit_sem(tsk);
--- linux/kernel/fork.c:1.1.1.11 Tue Jan 13 21:13:51 2004
+++ linux/kernel/fork.c Wed Jan 14 11:13:33 2004
@@ -40,6 +40,7 @@

extern int copy_semundo(unsigned long clone_flags, struct task_struct *tsk);
extern void exit_sem(struct task_struct *tsk);
+extern void init_fulock (struct task_struct *task);

/* The idle threads do not count..
* Protected by write_lock_irq(&tasklist_lock)
@@ -968,6 +969,7 @@
goto bad_fork_cleanup_signal;
if ((retval = copy_namespace(clone_flags, p)))
goto bad_fork_cleanup_mm;
+ init_fulock(p);
retval = copy_thread(0, clone_flags, stack_start, stack_size, p, regs);
if (retval)
goto bad_fork_cleanup_namespace;
--- linux/kernel/futex.c:1.1.1.6 Tue Jan 13 21:13:51 2004
+++ linux/kernel/futex.c Wed Jan 14 11:13:33 2004
@@ -41,31 +41,6 @@

#define FUTEX_HASHBITS 8

-/*
- * Futexes are matched on equal values of this key.
- * The key type depends on whether it's a shared or private mapping.
- * Don't rearrange members without looking at hash_futex().
- *
- * offset is aligned to a multiple of sizeof(u32) (== 4) by definition.
- * We set bit 0 to indicate if it's an inode-based key.
- */
-union futex_key {
- struct {
- unsigned long pgoff;
- struct inode *inode;
- int offset;
- } shared;
- struct {
- unsigned long uaddr;
- struct mm_struct *mm;
- int offset;
- } private;
- struct {
- unsigned long word;
- void *ptr;
- int offset;
- } both;
-};

/*
* We use this hashed waitqueue instead of a normal wait_queue_t, so
@@ -109,9 +84,7 @@
*/
static struct futex_hash_bucket *hash_futex(union futex_key *key)
{
- u32 hash = jhash2((u32*)&key->both.word,
- (sizeof(key->both.word)+sizeof(key->both.ptr))/4,
- key->both.offset);
+ u32 hash = futex_hash_key (key);
return &futex_queues[hash & ((1 << FUTEX_HASHBITS)-1)];
}

@@ -120,9 +93,7 @@
*/
static inline int match_futex(union futex_key *key1, union futex_key *key2)
{
- return (key1->both.word == key2->both.word
- && key1->both.ptr == key2->both.ptr
- && key1->both.offset == key2->both.offset);
+ return match_futex_key (key1, key2);
}

/*
@@ -137,7 +108,7 @@
*
* Should be called with &current->mm->mmap_sem but NOT any spinlocks.
*/
-static int get_futex_key(unsigned long uaddr, union futex_key *key)
+int get_futex_key(unsigned long uaddr, union futex_key *key)
{
struct mm_struct *mm = current->mm;
struct vm_area_struct *vma;
@@ -232,7 +203,7 @@
* NOTE: mmap_sem MUST be held between get_futex_key() and calling this
* function, if it is called at all. mmap_sem keeps key->shared.inode valid.
*/
-static inline void get_key_refs(union futex_key *key)
+inline void get_key_refs(union futex_key *key)
{
if (key->both.ptr != 0) {
if (key->both.offset & 1)
@@ -246,7 +217,7 @@
* Drop a reference to the resource addressed by a key.
* The hash bucket spinlock must not be held.
*/
-static void drop_key_refs(union futex_key *key)
+inline void drop_key_refs(union futex_key *key)
{
if (key->both.ptr != 0) {
if (key->both.offset & 1)
--- linux/kernel/sched.c:1.1.1.15 Tue Jan 13 21:13:51 2004
+++ linux/kernel/sched.c Wed Jan 14 11:13:33 2004
@@ -658,7 +658,8 @@
int success = 0;
long old_state;
runqueue_t *rq;
-
+
+
repeat_lock_task:
rq = task_rq_lock(p, &flags);
old_state = p->state;
@@ -1761,6 +1762,8 @@

EXPORT_SYMBOL(default_wake_function);

+extern void fuqueue_wait_cancel(struct task_struct *, int);
+
/*
* The core wakeup function. Non-exclusive wakeups (nr_exclusive == 0) just
* wake everything up. If it's an exclusive wakeup (nr_exclusive == small +ve
@@ -1769,6 +1772,9 @@
* There are circumstances in which we can try to wake a task which has already
* started to run but is not in state TASK_RUNNING. try_to_wake_up() returns
* zero in this (rare) case, and we handle it by continuing to scan the queue.
+ *
+ * fuqueue_wait_cancel() needs to hook in here to properly reschedule
+ * priority-inheritance/protected tasks. Check its doc to learn why.
*/
static void __wake_up_common(wait_queue_head_t *q, unsigned int mode,
int nr_exclusive, int sync)
@@ -1780,6 +1786,8 @@
unsigned flags;
curr = list_entry(tmp, wait_queue_t, task_list);
flags = curr->flags;
+ if (unlikely(curr->task->fuqueue_wait != NULL))
+ fuqueue_wait_cancel(curr->task, -EINTR);
if (curr->func(curr, mode, sync) &&
(flags & WQ_FLAG_EXCLUSIVE) &&
!--nr_exclusive)
@@ -2106,6 +2114,46 @@
return pid ? find_task_by_pid(pid) : current;
}

+/**
+ * Unconditionally set the effective priority of a task.
+ *
+ * Not too useful on SCHED_OTHER tasks, btw ...
+ *
+ * @p Pointer to the task in question
+ * @prio New priority to set
+ */
+#warning FIXME: need to play by POSIX rules on prio change and list repositioning because of prio inheritance
+
+void __set_prio (struct task_struct *p, int prio)
+{
+ runqueue_t *rq;
+ prio_array_t *array;
+ long flags;
+ int oldprio;
+
+ rq = task_rq_lock(p, &flags);
+ oldprio = p->prio;
+ array = p->array;
+ if (!array) {
+ p->prio = prio;
+ goto out_unlock;
+ }
+ deactivate_task(p, task_rq(p));
+ p->prio = prio;
+ __activate_task(p, task_rq(p));
+ if (rq->curr == p) {
+ if (p->prio > oldprio)
+ resched_task(rq->curr);
+ }
+ else if (TASK_PREEMPTS_CURR (p, rq))
+ resched_task(rq->curr);
+out_unlock:
+ task_rq_unlock (rq, &flags);
+}
+
+
+extern void fuqueue_chprio (struct task_struct *);
+
/*
* setscheduler - change the scheduling policy and/or RT priority of a thread.
*/
@@ -2185,6 +2233,8 @@
p->prio = MAX_USER_RT_PRIO-1 - p->rt_priority;
else
p->prio = p->static_prio;
+ if (p->fuqueue_wait != NULL)
+ fuqueue_chprio(p);
if (array) {
__activate_task(p, task_rq(p));
/*
--- linux/kernel/signal.c:1.1.1.9 Tue Jan 13 20:54:51 2004
+++ linux/kernel/signal.c Wed Jan 14 11:13:33 2004
@@ -524,6 +524,8 @@
return signr;
}

+extern void fuqueue_wait_cancel (struct task_struct *, int);
+
/*
* Tell a process that it has a new active signal..
*
@@ -547,12 +549,21 @@
* executing another processor and just now entering stopped state.
* By calling wake_up_process any time resume is set, we ensure
* the process will wake up and handle its stop or death signal.
+ *
+ * fuqueue_wait_cancel() needs to hook in here to properly reschedule
+ * priority-inheritance/protected tasks. The reason is that
+ * when we resched a process that has boosted another one, we
+ * need to kick its butt off the CPU (and lower its priority) ASAP
+ * so that 't' can run.
*/
mask = TASK_INTERRUPTIBLE;
if (resume)
mask |= TASK_STOPPED;
- if (!wake_up_state(t, mask))
+ if (!wake_up_state(t, mask)) {
+ if (unlikely (t->fuqueue_wait != NULL))
+ fuqueue_wait_cancel(t, -EINTR);
kick_process(t);
+ }
}

/*
@@ -675,6 +686,9 @@
set_tsk_thread_flag(t, TIF_SIGPENDING);
state |= TASK_INTERRUPTIBLE;
}
+ /* FIXME: I am not that sure we need to cancel here */
+ if (unlikely(t->fuqueue_wait != NULL))
+ fuqueue_wait_cancel(t, -EINTR);
wake_up_state(t, state);

t = next_thread(t);
--- linux/kernel/timer.c:1.1.1.10 Tue Jan 13 21:13:51 2004
+++ linux/kernel/timer.c Wed Jan 14 11:13:34 2004
@@ -966,9 +966,18 @@

#endif

+/*
+ * fuqueue_wait_cancel() needs to hook in here to properly reschedule
+ * priority-inheritance/protected tasks. Check its doc to learn why.
+ */
+extern void fuqueue_wait_cancel(struct task_struct *, int);
+
static void process_timeout(unsigned long __data)
{
- wake_up_process((task_t *)__data);
+ struct task_struct *task = (task_t *) __data;
+ if (unlikely(task->fuqueue_wait != NULL))
+ fuqueue_wait_cancel(task, -ETIMEDOUT);
+ wake_up_process(task);
}

/**
--- linux/include/asm-generic/errno-base.h:1.1.1.1 Thu Jul 10 12:27:27 2003
+++ linux/include/asm-generic/errno-base.h Sun Nov 16 07:21:32 2003
@@ -36,4 +36,8 @@
#define EDOM 33 /* Math argument out of domain of func */
#define ERANGE 34 /* Math result not representable */

+ /* FIXME: ugly hack to avoid conflicts -- need to get better numbers */
+#define EOWNERDEAD 525 /* Mutex owner died */
+#define ENOTRECOVERABLE 526 /* Mutex state is not recoverable */
+
#endif
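
To make the intended use of these two codes concrete, here is a minimal
sketch (not from the patch) of how kernel code using the fulock API
introduced later in this series (patch 8/10) would act on them;
recover_shared_state() is a hypothetical placeholder for whatever repair
the data protected by the lock needs:

/* Sketch, assuming <linux/fulock.h> from patch 8/10: -EOWNERDEAD means
 * the lock WAS acquired but a previous owner died; -ENOTRECOVERABLE
 * means it cannot be acquired at all. */
static int robust_critical_section (struct fulock *fulock)
{
	int result = fulock_lock (fulock, MAX_SCHEDULE_TIMEOUT);

	if (result == -EOWNERDEAD) {		/* we own it: repair, then heal */
		recover_shared_state ();	/* hypothetical repair step */
		fulock_consistency (fulock, fulock_con_heal);
	} else if (result < 0)			/* -ENOTRECOVERABLE, etc. */
		return result;
	/* ... critical section ... */
	return fulock_unlock (fulock, 0);	/* serialized unlock */
}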

2004-01-14 23:01:21

by Perez-Gonzalez, Inaky

Subject: [RFC/PATCH] FUSYN 8/10: kernel fulocks

include/linux/fulock-olist.h | 55 ++++
include/linux/fulock.h | 282 +++++++++++++++++++++
kernel/fulock.c | 562 +++++++++++++++++++++++++++++++++++++++++++
3 files changed, 899 insertions(+)

--- /dev/null Wed Jan 14 14:39:30 2004
+++ linux/include/linux/fulock-olist.h Sun Nov 16 07:21:32 2003
@@ -0,0 +1,55 @@
+
+/*
+ * Fast User real-time/pi/pp/robust/deadlock SYNchronization
+ * (C) 2002-2003 Intel Corp
+ * Inaky Perez-Gonzalez <[email protected]>.
+ *
+ * Licensed under the FSF's GNU Public License v2 or later.
+ *
+ * Based on normal futexes (futex.c), (C) Rusty Russell.
+ * Please refer to Documentation/fusyn.txt for more info.
+ *
+ * Fulock Ownership List
+ *
+ * This file is here to avoid include hell, as sched.h needs it to be
+ * able to embed a fulock ownership list into 'struct
+ * task_struct'. Its sole users are sched.h and fulock.h.
+ *
+ * This is just a redirection of the methods used to manage the list
+ * of fulocks that a task owns in a given moment.
+ *
+ * I don't think anybody in their right mind wants to use a
+ * priority-array list for this, but just in case, and to ease
+ * changing the implementation, I am redirecting it.
+ *
+ * I will pull the plug on this one, search-and-replace all the
+ * olist_*() with plist_*() and wipe this file out as soon as I have a
+ * minute. I consider this low prio.
+ */
+
+#ifndef __linux_fulock_olist_h__
+#define __linux_fulock_olist_h__
+
+#ifdef __KERNEL__
+
+#include <linux/plist.h>
+
+typedef struct plist olist_t;
+
+#define olist_add plist_add
+#define olist_rem plist_rem
+#define olist_init plist_init
+#define olist_INIT(plist) plist_INIT(plist)
+#define olist_for_each plist_for_each
+#define olist_for_each_safe plist_for_each_safe
+#define olist_empty plist_empty
+#define olist_first plist_first
+#define olist_prio plist_prio
+#define olist_chprio plist_chprio
+#if 0
+#define olist_update_prio plist_update_prio
+#define __olist_chprio __plist_chprio
+#endif
+
+#endif /* #ifdef __KERNEL__ */
+#endif /* #ifndef __linux_fulock_olist_h__ */
--- /dev/null Wed Jan 14 14:39:30 2004
+++ linux/include/linux/fulock.h Wed Jan 7 15:20:58 2004
@@ -0,0 +1,282 @@
+
+/*
+ * Fast User real-time/pi/pp/robust/deadlock SYNchronization
+ * (C) 2002-2003 Intel Corp
+ * Inaky Perez-Gonzalez <[email protected]>.
+ *
+ * Licensed under the FSF's GNU Public License v2 or later.
+ *
+ * Based on normal futexes (futex.c), (C) Rusty Russell.
+ * Please refer to Documentation/fusyn.txt for more info.
+ */
+
+#ifndef __linux_fulock_h__
+#define __linux_fulock_h__
+
+#include <asm/fulock.h>
+
+/**
+ * User space provided fulock flags
+ *
+ * fulocks/ufulocks are *always* robust; it is up to the caller to
+ * emulate a hang if robust behavior is not desired. However, the
+ * NOT_RM (not robust mutex) flag is kept and checked on exit, and if
+ * there are owned fulocks, a warning will be printed.
+ */
+#define FULOCK_FL_USER_MK 0xfffff000 /* Flags provided by user space */
+#define FULOCK_FL_PI 0x80000000 /* Priority inherit */
+#define FULOCK_FL_PP 0x40000000 /* Priority protected */
+#define FULOCK_FL_RM_SUN 0x20000000 /* Robust mutex (sun style) */
+#define FULOCK_FL_RM 0x10000000 /* Robust mutex [warns only] */
+#define FULOCK_FL_KCO 0x08000000 /* No fast path, KCO mode always */
+#define FULOCK_FL_ERROR_CHK 0x04000000 /* Perform POSIX error checks */
+/* Priority ceiling masks */
+#define FULOCK_FL_PP_PLC_MK 0x00f00000 /* Policy */
+#define FULOCK_FL_PP_PRIO_MK 0x000ff000 /* Priority */
+
+/** Fulock consistency state */
+enum fulock_st {
+ fulock_st_healthy, /* Normal, healthy */
+ fulock_st_dead, /* Some previous owner died */
+ fulock_st_nr, /* idem, plus cannot be recovered */
+ fulock_st_init /* fulock was re-initialized */
+};
+
+
+/** Fulock consistency action */
+enum fulock_con {
+ fulock_con_nop = 0, /* No-op, just return actual consistency */
+ fulock_con_init, /* Move from any state to initialized */
+ fulock_con_heal, /* Move from dead to healthy */
+ fulock_con_nr, /* Move from dead to not-recoverable */
+ fulock_con_destroy, /* Tell the kernel to forget about it */
+ fulock_con_waiters, /* Do we have waiters? */
+ fulock_con_locked /* Is it locked? */
+};
+
+
+#ifdef __KERNEL__
+
+#include <linux/fuqueue.h>
+#include <linux/plist.h>
+#include <linux/vlocator.h>
+#include <asm/errno.h>
+
+/* Internal fulock flags */
+#define FULOCK_FL_NR 0x00000100 /* not recoverable */
+#define FULOCK_FL_DEAD 0x00000200 /* dead-owner */
+#define FULOCK_FL_NEW 0x00000800 /* recently allocated ufulock */
+#define FULOCK_FL_PP_MK 0x000000ff /* raw priority ceiling */
+
+struct fulock;
+
+/** Operations on a fulock. */
+struct fulock_ops
+{
+ struct fuqueue_ops fuqueue;
+ void (* owner_set) (struct fulock *, struct task_struct *);
+ void (* owner_reset) (struct fulock *);
+ void (* exit) (struct fulock *);
+};
+
+/* In-kernel fulock operations */
+extern struct fulock_ops fulock_ops;
+extern struct fulock_ops ufulock_ops;
+extern struct vlocator_ops ufulock_vops;
+
+
+/** A fulock, mutex usable from the kernel. */
+struct fulock {
+ struct fuqueue fuqueue;
+ struct task_struct *owner;
+ unsigned flags;
+ olist_t olist_node;
+};
+
+
+/** Initialize a @fulock with given ops */
+static __inline__
+void __fulock_init (struct fulock *fulock, struct fulock_ops *ops)
+{
+ __fuqueue_init (&fulock->fuqueue, &ops->fuqueue);
+ fulock->owner = NULL;
+ fulock->flags = 0;
+ olist_init (&fulock->olist_node);
+}
+
+/** Statically initialize a @fulock with given ops */
+#define __fulock_INIT(fulock, ops) { \
+ .fuqueue = __fuqueue_INIT (&(fulock)->fuqueue, \
+ &(ops)->fuqueue), \
+ .owner = NULL, \
+ .flags = 0, \
+ .olist_node = olist_INIT (&(fulock)->olist_node) \
+}
+
+/** Initialize a @fulock for usage within the kernel */
+static __inline__
+void fulock_init (struct fulock *fulock)
+{
+ __fulock_init (fulock, &fulock_ops);
+}
+
+/** Statically initialize a @fulock for usage within the kernel */
+#define fulock_INIT(fulock, ops) __fulock_INIT (fulock, &fulock_ops)
+
+
+/* Primitives for locking/unlocking/waiting/signalling */
+extern int __fulock_lock (struct fulock *, signed long);
+extern int __fulock_unlock (struct fulock *, size_t, int);
+extern int __fulock_consistency (struct fulock *fulock,
+ enum fulock_con consistency);
+extern void __fulock_exit (struct fulock *);
+extern void exit_fulocks (struct task_struct *task);
+
+extern int __fulock_pp_allowed (struct fulock *);
+extern void __fulock_pp_boost (struct fulock *);
+extern void __fulock_pp_unboost (struct fulock *);
+extern int __fulock_check_deadlock (struct task_struct *, struct fulock *);
+
+/** More internal stuff for building on top of fulock */
+extern unsigned __fulock_wait_cancel (struct fuqueue *, struct fuqueue_waiter *);
+extern struct task_struct * __fulock_chprio (struct task_struct *,
+ struct fuqueue *,
+ struct fuqueue_waiter *);
+
+
+/** Check if it is OK for current to lock @fulock */
+static __inline__
+int __fulock_lock_check (struct fulock *fulock)
+{
+ int result = 0;
+ return result;
+}
+
+
+/**
+ * Lock a fulock, maybe wait for it to be available
+ *
+ * @timeout: wait to acquire the fulock for at most @timeout jiffies. If
+ * zero, don't block, just trylock. If MAX_SCHEDULE_TIMEOUT,
+ * block indefinitely until the lock is acquired.
+ *
+ * @returns: See __fulock_lock().
+ *
+ * Can ONLY be called from process context. Note __fulock_lock()
+ * releases the fuqueue lock on return and re-enables IRQs and preemption.
+ */
+static __inline__
+int fulock_lock (struct fulock *fulock, signed timeout)
+{
+ int result;
+ do {
+ spin_lock_irq (&fulock->fuqueue.lock);
+ result = __fulock_lock (fulock, timeout);
+ } while (result == -EAGAIN);
+ return result;
+}
+
+
+/**
+ * Unlock a fulock, wake up waiter(s)
+ *
+ * @f: fulock.
+ * @howmany: Wake up this many waiters; if 0, wake up only one,
+ * forcing a serialization in the acquisition of the
+ * futex, so that no other task (in user or kernel space)
+ * can acquire it.
+ * @returns: Number of tasks woken up, < 0 errno code on error.
+ *
+ * Can be called from any context [I hope].
+ */
+static __inline__
+int fulock_unlock (struct fulock *fulock, size_t howmany)
+{
+ int result;
+ unsigned long flags;
+
+ ftrace ("(%p, %zu)\n", fulock, howmany);
+
+ spin_lock_irqsave (&fulock->fuqueue.lock, flags);
+ result = __fulock_unlock (fulock, howmany, 0);
+ spin_unlock_irqrestore (&fulock->fuqueue.lock, flags);
+ return result;
+}
+
+
+/**
+ * Set the consistency of a fulock
+ *
+ * @f: fulock to set
+ * @consistency: New consistency state to move it to (unless it is
+ * fulock_con_nop, which is a nop and can be used to get
+ * the actual consistency state); see enum fulock_con.
+ *
+ * @returns: 'enum fulock_st' consistency state; < 0 errno code on
+ * error.
+ *
+ * FIXME: this function set is kind of too convoluted, I am afraid.
+ *
+ * Can only be called from process context, as it checks whether
+ * current is the owner.
+ */
+static __inline__
+int fulock_consistency (struct fulock *fulock, enum fulock_con consistency)
+{
+ int result;
+ unsigned long flags;
+ ftrace ("(%p, %d)\n", fulock, consistency);
+
+ spin_lock_irqsave (&fulock->fuqueue.lock, flags);
+ result = __fulock_consistency (fulock, consistency);
+ spin_unlock_irqrestore (&fulock->fuqueue.lock, flags);
+ return result;
+}
+
+
+/** A ufulock, tied to a user-space vm address. */
+struct ufulock {
+ struct fulock fulock;
+ struct vlocator vlocator;
+ struct page *page;
+};
+
+
+/** @Return true if the fulock @f has no waiters. */
+static __inline__
+unsigned __fulock_empty (const struct fulock *f)
+{
+ return __fuqueue_empty (&f->fuqueue);
+}
+
+
+/** [Internal] Make task @task the owner of fulock @f. */
+static __inline__
+void __fulock_owner_set (struct fulock *fulock, struct task_struct *task)
+{
+ ftrace ("(%p, %p [%d])\n", fulock, task, task->pid);
+ CHECK_IRQs();
+
+ fulock->owner = task;
+ _raw_spin_lock (&task->fulock_olist_lock);
+ olist_add (&task->fulock_olist, &fulock->olist_node);
+ _raw_spin_unlock (&task->fulock_olist_lock);
+}
+
+
+/** [Internal] Reset ownership of fulock @f. */
+static __inline__
+void __fulock_owner_reset (struct fulock *fulock)
+{
+ struct task_struct *owner = fulock->owner;
+ ftrace ("(%p)\n", fulock);
+ CHECK_IRQs();
+
+ _raw_spin_lock (&owner->fulock_olist_lock);
+ olist_rem (&owner->fulock_olist, &fulock->olist_node);
+ _raw_spin_unlock (&owner->fulock_olist_lock);
+ fulock->owner = NULL;
+}
+
+#endif /* #ifdef __KERNEL__ */
+#endif /* #ifndef __linux_fulock_h__ */
--- /dev/null Wed Jan 14 14:39:30 2004
+++ linux/kernel/fulock.c Mon Jan 12 14:56:44 2004
@@ -0,0 +1,562 @@
+
+/*
+ * Fast User real-time/pi/pp/robust/deadlock SYNchronization
+ * (C) 2002-2003 Intel Corp
+ * Inaky Perez-Gonzalez <[email protected]>.
+ *
+ * Licensed under the FSF's GNU Public License v2 or later.
+ *
+ * Based on normal futexes (futex.c), (C) Rusty Russell.
+ * Please refer to Documentation/fusyn.txt for more info.
+ */
+
+#include <linux/fulock.h>
+#include <linux/plist.h>
+#include <linux/time.h> /* struct timespec */
+#include <linux/sched.h> /* MAX_SCHEDULE_TIMEOUT */
+#include <linux/errno.h>
+
+/** Get the real priority from the real-time priority of a task. */
+static inline
+unsigned long prio_from_rt_prio (struct task_struct *task)
+{
+ return MAX_USER_RT_PRIO-1 - task->rt_priority;
+}
+
+
+/**
+ * Perform priority inheritance over the ownership chain
+ *
+ * @fulock: a waiter in this fulock was re-prioritized and the wlist
+ * max priority changed; propagate that change. It has to be
+ * locked.
+ *
+ * Updates the position of the @fulock on its owner's ownership list
+ * and if this results on a changed maximum ownership list priority,
+ * propagate it up.
+ *
+ * This is also used for un-boosting when a waiter stops waiting and
+ * the priorities have to be shuffled back.
+ *
+ * WARNING: Needs to be called with IRQs and preemption disabled.
+ */
+void __fulock_pi_boost (struct fulock *fulock)
+{
+ unsigned prio_changed;
+ struct task_struct *owner = fulock->owner;
+ unsigned long new_prio = wlist_prio (&fulock->fuqueue.wlist);
+
+ ftrace ("(%p)\n", fulock);
+ CHECK_IRQs();
+
+ _raw_spin_lock (&owner->fulock_olist_lock);
+ prio_changed = olist_chprio (&owner->fulock_olist, &fulock->olist_node,
+ new_prio);
+ _raw_spin_unlock (&owner->fulock_olist_lock);
+ if (prio_changed) {
+ new_prio = min (prio_from_rt_prio (owner), new_prio);
+ if (new_prio != owner->prio) {
+ ldebug (1, "__set_prio (%d, %lu)\n", owner->pid, new_prio);
+ __set_prio (owner, new_prio);
+ /* Now the priority changed for the owner, so we need
+ * to propagate it */
+ fuqueue_chprio (owner);
+ }
+ }
+}
+
+
+/**
+ * Test if a to-be queued task would deadlock if it waits for a fulock.
+ *
+ * Simple as it is, it looks ugly as hell. Basically, the fulock we
+ * are about to lock will have an owner, so we check the owner; if it
+ * is us, deadlock; if not, we see if the owner is waiting on
+ * anything; if so, we check whether that is a fulock, and if so, who
+ * its owner is; if it is us, then deadlock; if not, start again ...
+ *
+ * Now, the trick is to keep the reference counts, and the locks and
+ * all the crap. A single lock for everything would be *so* beautiful
+ * (and not scalable :).
+ *
+ * @task: the task to check
+ * @fulock: the fulock to check
+ * @returns: 0 if ok, < 0 errno on error.
+ *
+ * This *should* be safe being lock-less [famous last words].
+ *
+ * WARNING: Needs to be called with IRQs and preemption disabled.
+ */
+int __fulock_check_deadlock (struct task_struct *task, struct fulock *fulock)
+{
+ int result = 0;
+ struct fuqueue_ops *ops;
+ struct task_struct *owner;
+ struct fuqueue *fuqueue;
+
+ ftrace ("(%p, %p)\n", task, fulock);
+ CHECK_IRQs();
+
+ /* first fulock to trace is already locked and we won't unlock it */
+ owner = fulock->owner;
+ if (owner == NULL)
+ goto out;
+ result = -EDEADLK;
+ if (owner == task)
+ goto out;
+ get_task_struct (owner);
+next:
+ result = 0;
+ /* Who is the owner waiting for? safely acquire and lock it */
+ _raw_spin_lock (&owner->fuqueue_wait_lock);
+ fuqueue = owner->fuqueue_wait;
+ if (fuqueue == NULL) /* Ok, not waiting */
+ goto out_owner_unlock;
+ if (!_raw_spin_trylock (&fuqueue->lock)) { /* Spin dance... */
+ _raw_spin_unlock (&owner->fuqueue_wait_lock);
+ goto next;
+ }
+ ops = fuqueue->ops;
+ if (ops->get)
+ ops->get (fuqueue);
+ _raw_spin_unlock (&owner->fuqueue_wait_lock);
+ put_task_struct (owner);
+
+ /* Is a fulock whatever the owner is waiting for? */
+ if (ops != &fulock_ops.fuqueue && ops != &ufulock_ops.fuqueue)
+ goto out_fuqueue_unlock;
+ fulock = container_of (fuqueue, struct fulock, fuqueue);
+ owner = fulock->owner; /* Who's the fulock's owner? */
+ if (unlikely (owner == NULL)) /* Released before we locked it? */
+ goto out_fuqueue_unlock;
+ result = -EDEADLK; /* It is us? ooops */
+ if (owner == task)
+ goto out_fuqueue_unlock;
+
+ /* What's the owner waiting for? Proceed to it */
+ get_task_struct (owner);
+ _raw_spin_unlock (&fulock->fuqueue.lock);
+ if (ops->put)
+ ops->put (fuqueue);
+ goto next;
+
+out_owner_unlock:
+ _raw_spin_unlock (&owner->fuqueue_wait_lock);
+ put_task_struct (owner);
+ return result;
+
+out_fuqueue_unlock:
+ _raw_spin_unlock (&fulock->fuqueue.lock);
+ if (ops->put)
+ ops->put (fuqueue);
+out:
+ return result;
+}
+
+
+/**
+ * [Try]lock a fulock
+ *
+ * @f: fulock to [try]lock.
+ * @timeout: Time to wait for the lock to be acquired. If 0, trylock
+ * only; if MAX_SCHEDULE_TIMEOUT, wait forever; else, wait
+ * the given time.
+ * @returns: 0 Acquired the fulock
+ * -EOWNERDEAD Acquired the fulock, some previous owner died.
+ * -ENOTRECOVERABLE Not acquired, fulock is not recoverable
+ * -EBUSY Not acquired, it is locked [and trylock was
+ * requested].
+ * -EAGAIN Not acquired, try again [FIXME: change this,
+ * POSIX uses it for mutexes].
+ * * Not acquired, error.
+ *
+ * Needs f->lock held; on return, it will have been released.
+ *
+ * Note that some operations involving flag reading don't need locking,
+ * because to change those values the calling task needs to be the
+ * owner of the fulock, and once we come out of the wait without errors,
+ * we are the owner.
+ *
+ * WARNING: Needs to be called with IRQs and preemption disabled [the
+ * fuqueue lock acquired with spin_lock_irq()].
+ */
+int __fulock_lock (struct fulock *fulock, signed long timeout)
+{
+ int fulock_is_pp;
+ int result = 0;
+ unsigned prio_changed;
+ struct fulock_ops *ops =
+ container_of (fulock->fuqueue.ops, struct fulock_ops, fuqueue);
+ struct fuqueue_waiter w;
+
+ ftrace ("(%p [flags 0x%x], %ld)\n", fulock, fulock->flags, timeout);
+
+ result = -ENOTRECOVERABLE;
+ if (fulock->flags & FULOCK_FL_NR) /* can we lock? */
+ goto out_unlock;
+ if (fulock->flags & FULOCK_FL_ERROR_CHK) {
+ result = __fulock_check_deadlock (current, fulock);
+ if (result < 0)
+ goto out_unlock;
+ }
+ fulock_is_pp = fulock->flags & FULOCK_FL_PP;
+ if (fulock_is_pp) {
+ result = __fulock_pp_allowed (fulock);
+ if (result)
+ goto out_unlock;
+ }
+ if (fulock->owner == NULL) /* Unlocked? take it */
+ goto its_unlocked;
+ /* Nchts, we have to wait. */
+ result = -EBUSY;
+ if (!timeout)
+ goto out_unlock;
+ prio_changed = __fuqueue_wait_queue (&fulock->fuqueue, &w);
+ if (prio_changed && fulock->flags & FULOCK_FL_PI)
+ __fulock_pi_boost (fulock);
+ return __fuqueue_wait_block (&fulock->fuqueue, &w, timeout);
+
+its_unlocked:
+ result = fulock->flags & FULOCK_FL_DEAD? -EOWNERDEAD : 0;
+ ops->owner_set (fulock, current);
+ if (fulock_is_pp)
+ __fulock_pp_boost (fulock);
+out_unlock:
+ _raw_spin_unlock (&fulock->fuqueue.lock);
+ local_irq_enable();
+ preempt_enable();
+ return result;
+}
+
+
+/**
+ * Unlock a fulock, wake up waiter(s) [internal version]
+ *
+ * @fulock: Address for the fulock (kernel space).
+ * @howmany: Wake up this many waiters; if 0, wake up only one,
+ * forcing a serialization in the acquisition of the
+ * futex, so that no other task (in user or kernel space)
+ * can acquire it.
+ * @code: If waking up in parallel mode, return code to be passed to
+ * the waiters as a result of their wait.
+ * @returns: 0 if the fulock was unlocked/there are no waiters left or
+ * howmany > 0
+ * 1 fulock ownership transferred to 1st waiter, there was
+ * one waiter (now none)
+ * 2 fulock ownership transferred to 1st waiter, there was
+ * more than one waiter
+ *
+ * ufulock_unlock() uses this to update the vfulock, except
+ * if howmany > 0.
+ *
+ * Requires fulock->fuqueue.lock held, IRQs & preempt disabled!!!
+ */
+int __fulock_unlock (struct fulock *fulock, size_t howmany, int code)
+{
+ int result = 0;
+ struct fulock_ops *ops =
+ container_of (fulock->fuqueue.ops, struct fulock_ops, fuqueue);
+ struct task_struct *owner;
+ struct fuqueue_waiter *w;
+
+ ftrace ("(%p, %zu, %d)\n", fulock, howmany, code);
+ CHECK_IRQs();
+
+ if (fulock->owner == NULL) /* Unlocked? */
+ goto out;
+ owner = fulock->owner;
+ ops->owner_reset (fulock);
+ if (__fulock_empty (fulock))
+ goto out_unboost;
+ if (howmany > 0) { /* Parallel unlock */
+ code = code == 0? -EAGAIN : code;
+ while (howmany-- && !__fuqueue_empty (&fulock->fuqueue)) {
+ w = __fuqueue_first (&fulock->fuqueue);
+ __fuqueue_wait_unqueue (w, code);
+ wake_up_process (w->task);
+ }
+ }
+ else { /* Serialized unlock */
+ w = __fuqueue_first (&fulock->fuqueue);
+ ops->owner_set (fulock, w->task);
+ if (fulock->flags & FULOCK_FL_PP)
+ __fulock_pp_boost (fulock);
+ code = fulock->flags & FULOCK_FL_DEAD? -EOWNERDEAD : 0;
+ __fuqueue_wait_unqueue (w, code);
+ wake_up_process (w->task);
+ result = __fulock_empty (fulock)? 1 : 2;
+ }
+ /* Now, once we have done it, we can unboost PI or PP */
+out_unboost:
+ if (fulock->flags & FULOCK_FL_PP)
+ __fulock_pp_unboost (fulock);
+ else if (fulock->flags & FULOCK_FL_PI
+ && owner->prio != prio_from_rt_prio (owner)) {
+ /* We were boosted, undo that */
+ ldebug (1, "__set_prio (%d, %lu)\n", owner->pid, prio_from_rt_prio (owner));
+ __set_prio (owner, prio_from_rt_prio (owner));
+ fuqueue_chprio (owner); /* owner might be waiting */
+ }
+out:
+ return result;
+}
+
+
+/**
+ * Set the consistency of a fulock, get previous.
+ *
+ * @fulock: fulock whose consistency is to be set [fulock->fuqueue.lock
+ * has to be held].
+ * @consistency: New consistency state to move it to (unless it is
+ * fulock_con_nop, which is a nop and can be used to get
+ * the actual consistency state).
+ *
+ * @returns: consistency state
+ */
+int __fulock_consistency (struct fulock *fulock, enum fulock_con consistency)
+{
+ int result;
+ enum fulock_st state;
+
+ ftrace ("(%p, %d)\n", fulock, consistency);
+ CHECK_IRQs();
+
+ /* Gather actual state */
+ if (fulock->flags & FULOCK_FL_NR)
+ state = fulock_st_nr;
+ else if (fulock->flags & FULOCK_FL_DEAD)
+ state = fulock_st_dead;
+ else
+ state = fulock_st_healthy;
+
+ /* Now, what to do? */
+ switch (consistency) {
+ /* Nothing, so just say how are we standing */
+ case fulock_con_nop:
+ result = state;
+ break;
+
+ /* Reinitialize it */
+ case fulock_con_init:
+ fulock->flags &= ~(FULOCK_FL_DEAD | FULOCK_FL_NR);
+ __fulock_unlock (fulock, (size_t) ~0, -ENOENT);
+ result = fulock_st_init;
+ break;
+
+ /* Mark it healthy */
+ case fulock_con_heal:
+ result = -EINVAL;
+ if (state != fulock_st_dead)
+ break;
+ result = -EPERM;
+ if (fulock->owner != current) /* Who are you? */
+ break;
+ fulock->flags &= ~FULOCK_FL_DEAD;
+ result = fulock_st_healthy;
+ break;
+
+ /* Make it not recoverable; wake up every waiter with error;
+ * unlock. */
+ case fulock_con_nr:
+ result = -EINVAL;
+ if (state != fulock_st_dead)
+ break;
+ result = -EPERM;
+ if (fulock->owner != current) /* Who are you? */
+ break;
+ result = __fulock_unlock (fulock, (size_t)~0, -ENOTRECOVERABLE);
+ if (result >= 0) {
+ fulock->flags &= ~FULOCK_FL_DEAD;
+ fulock->flags |= FULOCK_FL_NR;
+ result = fulock_st_nr;
+ }
+ break;
+
+ default:
+ result = -EINVAL;
+ }
+ return result;
+}
+
+
+/**
+ * Set the priority of a fulock waiter.
+ *
+ * @task: task to re-prioritize
+ * @fuqueue: fuqueue of the fulock the task is waiting for [locked]
+ * @w: waiter @task is waiting on in @fuqueue.
+ * @prio: new priority
+ * @returns: NULL if there is no propagation needed, or the task to propagate to.
+ *
+ * This does not set the prio of the process itself!
+ *
+ * This will just reposition it in the wait list and if priority
+ * inheritance is enabled, reposition in the ownership list.
+ *
+ * Now, after repositioning in the wait list, we have to do the same
+ * in the ownership list. If we have set a new maximum, and that
+ * maximum is different to the current task->prio, then we have to
+ * update, so we return a task pointer that needs to be referenced.
+ *
+ * WARNING: Needs to be called with IRQs and preemption disabled.
+ */
+struct task_struct * __fulock_chprio (struct task_struct *task,
+ struct fuqueue *fuqueue,
+ struct fuqueue_waiter *w)
+{
+ unsigned prio_changed;
+ unsigned long new_prio;
+ struct fuqueue_ops *ops;
+ struct fulock *fulock;
+ struct task_struct *owner = NULL;
+
+ ftrace ("(%p [%d], %p, %p)\n", task, task->pid, fuqueue, w);
+ CHECK_IRQs();
+
+ /* Verify this is really a fulock */
+ ops = fuqueue->ops;
+ BUG_ON (ops != &fulock_ops.fuqueue && ops != &ufulock_ops.fuqueue);
+ fulock = container_of (fuqueue, struct fulock, fuqueue);
+ /* Update the wlist of our fulock */
+ __fuqueue_chprio (task, fuqueue, w); /* fuqueue is locked */
+#warning FIXME: trap if in the middle, the task stopped waiting?
+ /* Now, if we ain't PI, there is no point in continuing */
+ if (!(fulock->flags & FULOCK_FL_PI))
+ goto out;
+ /* And if it is unlocked, somebody unlocked just before we
+ * came here, so we'll do nothing */
+ owner = fulock->owner;
+ if (unlikely (owner == NULL))
+ goto out;
+ /* Ok, we have to propagate, reposition in the ownership list,
+ * and if the max prio changed and it is higher than the
+ * owner's priority, then we have to go with him */
+ new_prio = wlist_prio (&fuqueue->wlist);
+ _raw_spin_lock (&owner->fulock_olist_lock);
+ prio_changed = olist_chprio (&owner->fulock_olist, &fulock->olist_node,
+ new_prio);
+ _raw_spin_unlock (&owner->fulock_olist_lock);
+ if (prio_changed) {
+ new_prio = min (prio_from_rt_prio (owner), new_prio);
+ if (new_prio != owner->prio)
+ goto out;
+ }
+ owner = NULL;
+out:
+ return owner;
+}
+
+
+/**
+ * Initialize fulock specific stuff for a task
+ *
+ */
+void init_fulock (struct task_struct *task)
+{
+ __ftrace (0, "(task %p)\n", task);
+
+ spin_lock_init (&task->fuqueue_wait_lock);
+ task->fuqueue_wait = NULL;
+ task->fuqueue_waiter = NULL;
+ spin_lock_init (&task->fulock_olist_lock);
+ olist_init (&task->fulock_olist);
+}
+
+
+/** Release a @fulock as dead because its owner is exiting. */
+void __fulock_exit (struct fulock *fulock)
+{
+ ftrace ("(%p)\n", fulock);
+
+ fulock->flags |= FULOCK_FL_DEAD;
+ if (!(fulock->flags & FULOCK_FL_RM))
+ printk (KERN_WARNING "Task %d [%s] exited holding non-robust "
+ "fulock %p; waiters might block for ever\n",
+ current->pid, current->comm, fulock);
+ __fulock_unlock (fulock, 0, -EOWNERDEAD);
+}
+
+
+/**
+ * When @task exits, release all the fulocks it holds as dead.
+ *
+ * We have to discriminate between ufulocks and fulocks; when it is a
+ * ufulock, we just need to see whether we have to put() the reference
+ * that the owner had or not (when the ownership is successfully passed
+ * to somebody else, we don't have to put it).
+ */
+#warning FIXME: hook up to exec()?
+void exit_fulocks (struct task_struct *task)
+{
+ olist_t *itr;
+ struct fulock *fulock;
+ struct fulock_ops *ops;
+ unsigned long flags;
+
+ if (DEBUG > 0 && !olist_empty (&task->fulock_olist))
+ ftrace ("(%p [%d])\n", task, task->pid);
+
+ /* FIXME: there is a better way to do this, but I feel toooo
+ * thick today -- the problem is fulock->ops->exit() is going
+ * to take fulock_olist_lock to reset the ownership...so
+ * whatever it is, we have to call without holding it. */
+ local_irq_save (flags);
+ preempt_disable();
+ _raw_spin_lock (&task->fulock_olist_lock);
+ while (!olist_empty (&task->fulock_olist)) {
+ itr = olist_first (&task->fulock_olist);
+ fulock = container_of (itr, struct fulock, olist_node);
+ ldebug (7, "task %p [%d] still owns fulock %p\n",
+ task, task->pid, fulock);
+ ops = container_of (fulock->fuqueue.ops,
+ struct fulock_ops, fuqueue);
+ if (ops->fuqueue.get)
+ ops->fuqueue.get (&fulock->fuqueue);
+ _raw_spin_unlock (&task->fulock_olist_lock);
+
+ _raw_spin_lock (&fulock->fuqueue.lock);
+ ops->exit (fulock); /* releases fulock lock */
+ _raw_spin_unlock (&fulock->fuqueue.lock);
+
+ if (ops->fuqueue.put)
+ ops->fuqueue.put (&fulock->fuqueue);
+ _raw_spin_lock (&task->fulock_olist_lock);
+ }
+ _raw_spin_unlock (&task->fulock_olist_lock);
+ local_irq_restore (flags);
+ preempt_enable();
+}
+
+
+/** Cancel @task's wait on @fuqueue and update the wait list priority */
+unsigned __fulock_wait_cancel (struct fuqueue *fuqueue,
+ struct fuqueue_waiter *w)
+{
+ unsigned prio_change;
+ struct fulock *fulock =
+ container_of (fuqueue, struct fulock, fuqueue);
+
+ ftrace ("(%p, %p [%d], %p)\n",
+ fuqueue, w, w->task->pid, w);
+
+ prio_change = __fuqueue_wait_cancel (fuqueue, w);
+ ldebug (2, "prio_change is %u\n", prio_change);
+ if (prio_change && (fulock->flags & FULOCK_FL_PI))
+ __fulock_pi_boost (fulock);
+ return prio_change;
+}
+
+
+/** Fulock operations */
+struct fulock_ops fulock_ops = {
+ .fuqueue = {
+ .get = NULL,
+ .put = NULL,
+ .wait_cancel = __fulock_wait_cancel,
+ .chprio = __fulock_chprio
+ },
+ .owner_set = __fulock_owner_set,
+ .owner_reset = __fulock_owner_reset,
+ .exit = __fulock_exit
+};
+
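
To make the intended in-kernel usage concrete, here is a minimal sketch
(not code from the patch) built only from the wrappers declared in
include/linux/fulock.h above: fulock_init(), fulock_lock() with its
0 == trylock / MAX_SCHEDULE_TIMEOUT == block-until-acquired timeout
convention, and fulock_unlock(); demo_setup()/demo() are hypothetical:

/* Sketch: a plain kernel fulock used as a mutex. */
static struct fulock demo_fulock;

static void demo_setup (void)
{
	fulock_init (&demo_fulock);	/* plain kernel fulock, fulock_ops */
}

static int demo (void)
{
	int result = fulock_lock (&demo_fulock, 0);	/* trylock */

	if (result == -EBUSY)		/* contended: block until we get it */
		result = fulock_lock (&demo_fulock, MAX_SCHEDULE_TIMEOUT);
	if (result < 0 && result != -EOWNERDEAD)
		return result;		/* not acquired */
	/* ... critical section (handle -EOWNERDEAD recovery here) ... */
	return fulock_unlock (&demo_fulock, 0);	/* serialized unlock */
}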

2004-01-14 23:07:02

by Perez-Gonzalez, Inaky

Subject: [RFC/PATCH] FUSYN 4/10: Support for ia64

arch/ia64/kernel/entry.S | 10 +++---
include/asm-ia64/fulock.h | 75 ++++++++++++++++++++++++++++++++++++++++++++++
include/asm-ia64/unistd.h | 8 ++++
3 files changed, 87 insertions(+), 6 deletions(-)

--- linux/arch/ia64/kernel/entry.S:1.1.1.7 Tue Jan 13 20:54:35 2004
+++ linux/arch/ia64/kernel/entry.S Wed Jan 14 11:11:48 2004
@@ -1469,11 +1469,11 @@
data8 sys_fstatfs64
data8 sys_statfs64
data8 sys_ni_syscall
- data8 sys_ni_syscall // 1260
- data8 sys_ni_syscall
- data8 sys_ni_syscall
- data8 sys_ni_syscall
- data8 sys_ni_syscall
+ data8 sys_ufulock_lock // 1260
+ data8 sys_ufulock_unlock
+ data8 sys_ufulock_consistency
+ data8 sys_ufuqueue_wait
+ data8 sys_ufuqueue_wake
data8 sys_ni_syscall // 1265
data8 sys_ni_syscall
data8 sys_ni_syscall
--- /dev/null Wed Jan 14 14:39:30 2004
+++ linux/include/asm-ia64/fulock.h Wed Jan 7 12:46:36 2004
@@ -0,0 +1,75 @@
+
+/*
+ * Fast User real-time/pi/pp/robust/deadlock SYNchronization
+ * (C) 2002-2003 Intel Corp
+ * David P. Howell <[email protected]>
+ *
+ * Licensed under the FSF's GNU Public License v2 or later.
+ *
+ * Based on normal futexes (futex.c), (C) Rusty Russell.
+ * Please refer to Documentation/fusyn.txt for more info.
+ */
+
+#ifndef __asm_ia64_fulock_h__
+#define __asm_ia64_fulock_h__
+
+ /* fulock value / state; anything that is not this is a PID that
+ * currently owns the fulock. */
+
+enum vfulock {
+ VFULOCK_UNLOCKED = 0x00000000, /* Unlocked */
+ VFULOCK_HEALTHY = VFULOCK_UNLOCKED, /* KCO mode: the lock is healthy */
+ VFULOCK_KCO = 0xfffffffd, /* kernel controls ownership */
+ VFULOCK_DEAD = 0xfffffffe, /* dead, kernel controls ownership */
+ VFULOCK_NR = 0xffffffff /* fulock is not-recoverable */
+};
+
+
+#ifdef __KERNEL__
+
+/**
+ * [User usable] Atomic compare and swap.
+ *
+ * Used for locking a vfulock.
+ *
+ * @value Pointer to the value to compare and swap.
+ * @old_value Value that *value has to have for the swap to occur.
+ * @new_value New value to set if *value == old_value.
+ * @return !0 if the swap succeeded. 0 if failed.
+ */
+static inline
+unsigned vfulock_acas (volatile unsigned *value,
+ unsigned old_value, unsigned new_value)
+{
+ unsigned retval;
+
+ /* The following should be the expansion of the cmpxchg_acq() */
+ /* macro from intrinsics.h. Needed due to glibc build */
+ /* issues with including asm/types.h. */
+ asm volatile ("mov ar.ccv=%0;;" :: "rO"(old_value));
+ asm volatile ("cmpxchg4.acq %0=[%1],%2,ar.ccv"
+ : "=r"(retval)
+ : "r"(value),
+ "r"(new_value)
+ : "memory");
+ return retval == old_value;
+}
+
+
+/**
+ * Set an ufulock's associated value.
+ *
+ * @vfulock: Pointer to the vfulock value associated with the ufulock.
+ * @value: New value to assign.
+ *
+ * Wrapper for arch-specific idiosyncrasies when setting a value that
+ * is shared across different address mappings.
+ */
+static inline
+void vfulock_set (volatile unsigned *vfulock, unsigned value)
+{
+ *vfulock = value;
+}
+
+#endif /* #ifdef __KERNEL__ */
+#endif /* #ifndef __asm_ia64_fulock_h__ */
--- linux/include/asm-ia64/unistd.h:1.1.1.6 Mon Oct 20 13:56:48 2003
+++ linux/include/asm-ia64/unistd.h Fri Nov 21 13:26:01 2003
@@ -248,10 +248,16 @@
#define __NR_clock_nanosleep 1256
#define __NR_fstatfs64 1257
#define __NR_statfs64 1258
+/* Hole: 1259 */
+#define __NR_ufulock_lock 1260
+#define __NR_ufulock_unlock 1261
+#define __NR_ufulock_consistency 1262
+#define __NR_ufuqueue_wait 1263
+#define __NR_ufuqueue_wake 1264

#ifdef __KERNEL__

-#define NR_syscalls 256 /* length of syscall table */
+#define NR_syscalls 265 /* length of syscall table */

#if !defined(__ASSEMBLY__) && !defined(ASSEMBLER)

2004-01-14 22:56:33

by Perez-Gonzalez, Inaky

Subject: [RFC/PATCH] FUSYN 9/10: stub for priority protection

fulock-pp.c | 66 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
1 files changed, 66 insertions(+)

--- /dev/null Wed Jan 14 14:39:30 2004
+++ linux/kernel/fulock-pp.c Wed Jan 14 11:13:33 2004
@@ -0,0 +1,66 @@
+
+/*
+ * Fast User real-time/pi/pp/robust/deadlock SYNchronization
+ * (C) 2002-2003 Intel Corp
+ * Inaky Perez-Gonzalez <[email protected]>.
+ *
+ * Licensed under the FSF's GNU Public License v2 or later.
+ *
+ * Based on normal futexes (futex.c), (C) Rusty Russell.
+ * Please refer to Documentation/fusyn.txt for more info.
+ *
+ * Stuff for priority-protected fulocks
+ */
+
+#include <linux/fulock.h>
+#include <linux/plist.h>
+#include <linux/time.h> /* struct timespec */
+#include <linux/sched.h> /* MAX_SCHEDULE_TIMEOUT */
+#include <linux/errno.h>
+#include <linux/fulock.h>
+
+/**
+ * Check if a process is allowed to lock a PP fulock.
+ *
+ * @fulock: fulock to check on; 'current' is the task whose
+ * acquisition is being checked.
+ */
+int __fulock_pp_allowed (struct fulock *fulock)
+{
+ int policy = (fulock->flags & FULOCK_FL_PP_PLC_MK) >> 20;
+ int priority = (fulock->flags & FULOCK_FL_PP_PRIO_MK) >> 12;
+ int prio;
+
+ if (policy != SCHED_NORMAL)
+ prio = MAX_USER_RT_PRIO - 1 - priority;
+ else
+ prio = priority;
+ fulock->flags &= ~FULOCK_FL_PP_PRIO_MK;
+ fulock->flags |= prio & FULOCK_FL_PP_PRIO_MK;
+#warning FIXME: interaction with PI? Compare against static, not dynamic?
+ if (prio > current->prio)
+ return -EINVAL;
+ return 0;
+}
+
+
+/** FIXME */
+void __fulock_pp_boost (struct fulock *fulock)
+{
+#warning FIXME: finish me
+}
+
+
+/**
+ * Remove the boosting in priority of the owner of a
+ * priority-protected @fulock.
+ *
+ * @fulock: fulock whose owner's priority is to be unboosted.
+ *
+ * If the fulock is a priority-protected lock, undo the boost of the
+ * owner's priority to the fulock's priority ceiling.
+ */
+void __fulock_pp_unboost (struct fulock *fulock)
+{
+}
+
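
For reference, a small hypothetical sketch of how a priority ceiling
would be packed into fulock->flags, based only on the FULOCK_FL_PP*
masks in include/linux/fulock.h and the shifts used above (policy in
bits 20-23, priority in bits 12-19); nothing in the posted code builds
the flags this way, so treat it purely as an illustration of the bit
layout:

/* Sketch: encode a PP ceiling in the flags word, matching
 * FULOCK_FL_PP_PLC_MK (bits 20-23), FULOCK_FL_PP_PRIO_MK (bits 12-19)
 * and the shifts in __fulock_pp_allowed(). */
static unsigned fulock_pp_flags (int policy, int priority)
{
	unsigned flags = FULOCK_FL_PP;			/* mark it PP */

	flags |= (policy << 20) & FULOCK_FL_PP_PLC_MK;	/* sched policy */
	flags |= (priority << 12) & FULOCK_FL_PP_PRIO_MK; /* ceiling prio */
	return flags;
}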

2004-01-14 22:54:09

by Perez-Gonzalez, Inaky

Subject: [RFC/PATCH] FUSYN 1/10: documentation files

fusyn.txt | 559 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
1 files changed, 559 insertions(+)

--- /dev/null Wed Jan 14 14:39:30 2004
+++ linux/Documentation/fusyn.txt Mon Jan 5 10:22:44 2004
@@ -0,0 +1,559 @@
+
+FUSYN - Fast User SYNChronization primitives
+
+http://developer.osdl.org/dev/robustmutexes/
+
+I am calling these things FUSYNs to distinguish them from the original
+futexes they are based on, as they behave kind of differently (the
+naming sucks, you are welcome to suggest new names).
+
+This is my second big-time attempt to implement real-time locking in
+the Linux kernel to solve the shortcomings of a locking system based
+on futexes, which are mainly:
+
+ - no robustness support without kernel cooperation
+
+ - no priority inheritance/protection
+
+ - no real-time wake up (priority based, no priority inversion
+ holes, etc)
+
+The main objects to implement are:
+
+ - fuqueues: Priority-sorted wait queues -- other than that, mostly
+ equal to futexes. Usable from kernel and user space.
+
+ - fulock:
+
+ This is a full-blown "mutex" implementation that can be used from
+ kernel and user space (with user-space fast [un]locking in
+ non-contended situations), with robustness (if the owner dies,
+ ownership is passed on), priority inheritance and (FUTURE) priority protection.
+
+ A fulock is just a fuqueue that supports the ownership concept, to
+ allow for robustness and priority inheritance/protection.
+
+ It also supports serialized and non-serialized unlocks [see FAQ].
+
+All the non-blocking calls (wake() and unlock()) can be used from
+interrupt context; the data structures are protected by IRQ-safe spin
+locks. This is heavier weight, but it is needed to properly support
+priority inheritance and to avoid cache-line bouncing of the spin
+locks that protect the structures.
+
+Released files:
+
+http://developer.osdl.org/dev/robustmutexes/
+
+fusyn-2.6.0XXX-2.X.patch Patch file against Linus' tree
+fusyn-test-2.X.tar.gz Sample test library and code
+
+
+Contents:
+---------
+
+- vlocators
+- fuqueues
+- fulocks
+- issues/future work
+- FAQ [some definitions here, try wild search]
+
+
+
+
+VLOCATORS
+---------
+
+This is a structure (struct vlocator) and the associated code to map
+a user-space address to a kernel object that is kept in a hash
+table. It also provides and uses a reference count for the object,
+as well as a hash-table cleanup function.
+
+It uses the 'struct futex_key' as done by Jamie Lokier; the code is in
+include/linux/vlocator.h and kernel/vlocator.c.
+
+Two very simple operations: find an object in 'current's space by
+address and find-or-allocate (vl_find() and vl_locate()).
+
+The cleanup function (or garbage collector) runs periodically and
+releases items with a reference count of zero. This allows the get/put
+operations to be lockless.
+
+
+FUQUEUES
+--------
+
+Fuqueues are just wait queues, like futexes; the differences are in
+the wake-up process, which is done not in FIFO order but by
+priority. Also, if a task changes its priority while waiting, its
+position in the list is updated. The code is in include/linux/fuqueue.h
+and kernel/fuqueue.c.
+
+They consist of a 'struct fuqueue' which has a priority-sorted wait
+list and a lock to protect access to it. They can be used in
+kernel-space as a wait queue.
+
+Entry points:
+
+fuqueue_wait() -- wait on a fuqueue
+fuqueue_wake() -- wake N waiter from a fuqueue with code C.
+
+The code is split into various functions to allow fulocks to use the
+fuqueue stuff for integration. The wlist*_t thing is a reminder of
+something that will go away; it stands for 'wait list' and is set up
+with a #define-based redirection to support different types of sorted
+wait-list implementations (for example, one that is O(1) using a
+priority array -- that is huge). That is going to be deprecated in
+favor of an O(1) priority-sorted list that is not as big (see FUTURE
+WORK).
+
+'struct ufuqueue' is a fuqueue plus the stuff to link it to a possibly
+shared user space address (a vfuqueue) through a vlocator, so it is
+the real futex equivalent. The code is in kernel/ufuqueue.c and just
+consists of the syscall wrappers that associate the proper ufuqueue to
+the vfuqueue and then call the fuqueue layer.
+
+
+FULOCKS
+-------
+
+The mother of the whole thing. Fulocks are a full mutex
+implementation; it is basically the concept of an owner and a list of
+tasks waiting to own the mutex (implemented with a 'struct fuqueue').
+
+The 'struct fulock' holds everything. To support the different modes
+of operation (priority inheritance, priority protection, deadlock
+checking and Sun-mode robustness [see FUTURE WORK]) there is a 'flags'
+member.
+
+There is also an ownership list node, so that all the fulocks a task
+currently owns can be linked to the task (task->fulock_olist).
+
+A fulock also has the concept of 'state': healthy, dead-owner or
+not-recoverable (see the FAQ for the definitions).
+
+The entry points are [kernel/fulock.c, include/linux/fulock.h]:
+
+fulock_lock()
+fulock_unlock()
+fulock_consistency() [for manipulating the state]
+
+A user level fulock (struct ufulock) is a fulock that can be used from
+user space--it is represented by a (potentially shared) memory
+address (a vfulock). A vlocator is used to track it. Implemented in
+kernel/ufulock.c.
+
+The vfulock may have different values that serve to define the state
+of a lock:
+
+0 Unlocked [may be fast-locked]
+PID (< VFULOCK_KCO) Fast-locked by PID, no waiters in the
+ kernel. [May be fast-unlocked].
+VFULOCK_KCO Locked by someone, kernel knows who, waiters
+ in the kernel.
+VFULOCK_DEAD Previous owner died (maybe unlocked or
+ locked), the kernel keeps the status.
+VFULOCK_NR Not recoverable.
+
+Now, when user space goes to lock a ufulock with a fast operation, it
+issues an atomic compare-and-swap of its PID against 0; if it
+succeeds, it is the owner and we are done; if not, it goes to the
+kernel (sys_ufulock_lock()), which will put it to wait [see
+test/src/include/kernel-lock.h:vfulock_lock() in the fusyn-test
+package].
+
+Unlock is fairly similar: if the value is VFULOCK_{KCO,DEAD}, go to
+the kernel, sys_ufulock_unlock(); if VFULOCK_NR, return an error; if
+not, it is a PID and we need to do an atomic compare-and-exchange of
+zero (unlock) against the PID [again, check vfulock_unlock()].
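+
+As an illustration only--the real code is vfulock_lock() and
+vfulock_unlock() in test/src/include/kernel-lock.h of the fusyn-test
+package; here atomic_cmpxchg(ptr, old, new) stands for whatever
+compare-and-exchange primitive the architecture provides (it returns
+the value that was in *ptr), and ufulock_*_slow() stand for the
+sys_ufulock_{lock,unlock}() paths, whose argument lists are omitted:
+
+	int vfulock_lock (volatile unsigned *vfulock, unsigned pid)
+	{
+		if (atomic_cmpxchg (vfulock, 0, pid) == 0)
+			return 0;			/* fast path: we own it */
+		return ufulock_lock_slow (vfulock);	/* contended: wait in the kernel */
+	}
+
+	int vfulock_unlock (volatile unsigned *vfulock, unsigned pid)
+	{
+		unsigned val = *vfulock;
+		if (val == VFULOCK_KCO || val == VFULOCK_DEAD)
+			return ufulock_unlock_slow (vfulock);	/* kernel tracks ownership */
+		if (val == VFULOCK_NR)
+			return -1;				/* not recoverable */
+		if (atomic_cmpxchg (vfulock, pid, 0) == pid)
+			return 0;				/* fast unlock */
+		return ufulock_unlock_slow (vfulock);	/* waiters appeared; let the kernel decide */
+	}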
+
+Now, how it works is fairly simple: the kernel will always maintain
+the vfulock and the corresponding fulock in the 'struct ufulock' in
+sync [vfulock_sync() in kernel/ufulock.c], and will do that every time
+we enter it through one of the fulock system calls
+(sys_ufulock_{[un]lock,consistency}()).
+
+The kernel will use the PID set by the fast-locker to work out who the
+owner is when it doesn't know about it [afterwards it will be
+registered in the kernel--check __fulock_id_owner() for ideas on how
+to avoid collisions due to PID reuse].
+
+Once that is done, what is left is a 'fulock' that can be handled by
+the fulock layer.
+
+Now [uv]fulocks support:
+
+ - Real time: the unlock procedure is real-time in the sense that it
+   is O(1) and the next owner is the highest-priority waiter; as well,
+   the fulock (actually, the vfulock) is never unlocked in the
+   meantime: ownership is transferred directly instead of unlocking
+   the lock, waking up the first waiter and waiting for it to acquire
+   it. This avoids priority inversions caused by lower-priority
+   threads sneaking in from other processors at the worst time.
+
+ - Deadlock checking: complex deadlock scenarios where an
+   ownership/wait chain [see definition in FAQ] is involved are
+   caught if FULOCK_FL_ERROR_CHK is set.
+
+ - Priority change support: when the priority of a waiting task
+   is changed, its position in the list is updated. See below for
+   the effects on priority inheritance.
+
+ - Robustness: when a task that owns a fulock dies and the
+   kernel knew about it (i.e., it was registered in the
+   task->fulock_list), then the fulock is made dead-owner, unlocked
+   and the next waiter gets ownership, with a -EDEADOWNER return
+   code.
+
+   This is always enabled; user space can emulate the
+   hangs+timeouts that would happen if this were not detected.
+
+   If the kernel knew nothing about it (i.e., it was fast-locked),
+   then __fulock_id_owner() will fail to map the PID in the vfulock
+   to an existing task, and the current claimer is considered
+   the owner after the fulock is marked dead-owner.
+
+   Note the comments in __fulock_id_owner() for ideas on how to
+   avoid collisions due to PID reuse.
+
+ - Priority protection: not implemented yet
+
+ - Priority inheritance: when a waiter queues for a fulock that has
+   the FULOCK_FL_PI bit set and its priority is higher than that of
+   the owner, it will boost the owner's priority to its own; this
+   will propagate along an ownership/wait chain (if the owner was
+   itself waiting for a fulock, etc). As well, priority changes will
+   also be propagated.
+
+   This is done with __fulock_pi_boost(), fuqueue_chprio() and
+   __fulock_chprio() [through the fulock's 'struct
+   fuqueue_ops']. The unboosting is done in __fulock_unlock() or
+   __fulock_wait_cancel() [again, through the fulock's 'struct
+   fuqueue_ops'].
+
+
+FUTURE WORK
+-----------
+
+ - fucond: conditional variables; although they can be implemented
+   in user space + fuqueues, doing it in the kernel helps a lot with
+   atomicity issues (and the performance should be much better).
+
+   We tried doing that (see releases up to 1.12 on the website) and
+   generally it sucked because of the code bloat in the kernel, so
+   we decided to extirpate it.
+
+ - rw lock: only the first locker can do the fast operation; the
+   others go to the kernel to sign up. This way ownership is
+   supported. If a reader dies, nothing happens (after all, it is
+   supposed to be read-only access), but we need to keep track of
+   readers dying so they don't hold writers off. If a writer dies,
+   the next locker (reader or writer) gets dead-owner.
+
+   These guys could also get PI and PP this way, as they would be
+   very similar to fulocks, but with two wait lists (one for
+   writers, one for readers) that allow many owners at the
+   same time (when there are readers).
+
+   Maybe add different operation modes to favor writers over readers?
+   FIXME, need to explore.
+
+ - Spinlocks: they could be implemented as a trylock() on a fulock
+   for N loops, after which it'd degenerate into a mutex wait. This
+   way they'd automagically support robustness, PI and PP.
+
+ - Barriers: futexes offer enough functionality for implementing
+   them; however, wake up should be real-time (priority based). Not a
+   real issue though, as in barriers everybody is woken up. It can be
+   done also with fuqueues.
+
+ - Getting rid of the vlocator hash table and doing direct mapping
+   [so that we avoid the O(N) lookup] by storing in user space some
+   sort of pointer to an in-kernel data struct. The pointer has to be
+   "validated", so that user space cannot have the kernel point to
+   some random or potentially dangerous space.
+
+   A way would be to store two values, the pointer itself plus a
+   kernel-crypted copy that can be used to verify it.
+
+   Need more research into this.
+
+ - O(1) priority list: the current plist is not O(1) on addition,
+   because it has to locate the proper position in the list to add
+   at. I plan to modify the plist code to be O(N) where N is the
+   number of priority levels, and as that is fixed at compilation
+   time, it is effectively O(1).
+
+   The idea is to have something similar to a priority array, but
+   instead of having N list heads, we have only the first node of
+   each priority being the list head, and the rest of the nodes in
+   that prio hanging from it.
+
+ - Sun-mode robustness. Solaris implements robustness in a slightly
+   more restrictive way. We want to add a small compatibility layer
+   so both models can be used.
+
+ - Support for architectures that don't have atomic compare and
+   swap. Once priority protection is finished (which will require
+   developing a pure KCO method), it can be solved using that;
+   however, they will lose the fast-lock path--maybe not an issue?
+   How many arches do not implement atomic compare and exchange?
+
+ - That page pinning when waiting for a fulock...
+
+
+FAQ
+---
+
+This set of Q&A is what I use myself to track my ideas and concepts
+(and not to forget why I decided things the way I did).
+
+
+Q: What is PI?
+
+Priority Inheritance: when task A holds resource R and task B claims
+it, and prio (B) > prio (A), then B can force A to take its priority
+so it finishes sooner and B can take the resource ownership. The
+priority boost ends when A releases R.
+
+
+Q: What is PP?
+
+Priority Protection: resources have an associated priority ceiling;
+any task that acquires a resource will have its prio raised to that
+priority ceiling while holding it.
+
+
+Q: What is RM?
+
+Robust Mutex, or robustness, for short: when the owner of a resource
+dies, we want the next owner to know that somebody died while holding
+the resource, so it is able to determine if a cleanup is needed.
+
+
+Q: What is a healthy fulock?
+
+This is a fulock in its normal state, that is: initialized and not in
+dead-owner or not-recoverable states.
+
+
+Q: What is a dead-owner fulock?
+
+A fulock is in dead-owner state when it was locked (some task owned
+it) and the task died without unlocking it.
+
+
+Q: What is a not-recoverable fulock?
+
+A fulock is in not-recoverable state when it went into dead-owner
+state and some task that acquired it in dead-owner state decided that
+it had to be made not-recoverable.
+
+The rationale behind this is that normally you have some lock
+protecting access to some data. When the lock goes dead-owner, the
+task that owned it and died could have died in the middle of updating
+the data, and thus it can be inconsistent. Subsequent owners of the
+lock get it with the dead-owner state, so that they are aware of the
+situation. If any of them can fix it, it can move the lock back to
+healthy state and continue operating, but if there is no way to fix
+the data, it is moved to not-recoverable state.
+
+When moved, all the pending waiters are given an error code (EBADR)
+indicating the new state, so that they can bail out and report up to
+their managers for what to do. As well, new contenders that try to
+acquire the lock will also get the EBADR error code.
+
+The only way to make the fulock healthy again is to reinitialize it.
+
+
+Q: What is a dead-owner dead-lock?
+
+When some task that has to unlock a locked fulock dies and others are
+waiting for it to release the fulock.
+
+
+Q: What is a dead-owner recovery?
+
+When a lock owner dies, the next waiter or the next guy who locks and
+gets ownership gets it with a special code indicating that some
+previous owner died, that the state of the lock is "dead-owner", and
+that recovery of the data structures protected by the lock must be
+done in order to ensure consistency.
+
+Once a fulock is in dead-owner state, it can be moved back to
+normal/healthy or made inconsistent (so only an initialization returns
+it to normal).
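+
+In rough pseudo-code, a user of the lock would handle recovery like
+this (the wrapper names and the consistency constants here are
+illustrative, not the real API; the entry points they stand for are
+fulock_lock() and fulock_consistency()):
+
+	result = fulock_lock (&lock, timeout);
+	switch (result) {
+	case 0:			/* healthy: normal critical section */
+		break;
+	case -EDEADOWNER:	/* a previous owner died holding it */
+		if (data_can_be_repaired (&data)) {
+			repair (&data);
+			fulock_consistency (&lock, CONSISTENCY_HEALTHY);
+		} else {
+			fulock_consistency (&lock, CONSISTENCY_NOT_RECOVERABLE);
+			/* waiters and new contenders now get -EBADR */
+		}
+		break;
+	case -EBADR:		/* not recoverable: only re-initialization helps */
+		/* bail out, report upwards */
+		break;
+	}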
+
+
+Q: Why does the kernel have to set the value of the fulock?
+ Why cannot the value of the fulock after unlock be set by user
+ space?
+
+There is a risk of overwritten values and missed waiters.
+
+For example, task B claims fulock F (locked by task A), so it goes to
+the kernel to wait; now the fulock value is VFULOCK_KCO (kernel
+controlled ownership). Before B reaches the kernel, task C releases
+the fulock on behalf of task A; as there are no waiters yet, the
+unlock returns UNLOCKED and task C has to set the vfulock to UNLOCKED,
+thus overwriting VFULOCK_KCO; as KCO is overwritten, task B is going
+to be deadlocked in the kernel, waiting forever.
+
+Furthermore, it helps guarantee robustness. If the just-woken-up
+task that has to set the value the kernel passes dies before being
+able to do it, you hit a dead-owner dead-lock, because nobody wakes up
+the waiters until somebody tries to lock and realizes the fulock is
+dead.
+
+
+Q: What are the two kinds of unlock?
+
+The two kinds of unlock are serialized and non-serialized.
+
+You need both because the serialized one can be slower, as it might
+force a context switch.
+
+I thought initially that this would show up only in synthetic
+benchmarks (a very tight loop acquiring and releasing the lock against
+some other threads doing the same thing), but I was wrong. Max
+Hailperin pointed out to me that what I was seeing was the "Convoy
+Phenomenon", documented by Mike Blasgen, Jim Gray, Mike Mitoma and Tom
+Price in 1977
+[http://portal.acm.org/citation.cfm?id=850659&jmp=cit&dl=GUIDE&dl=ACM].
+
+After thinking about it, I concluded it would mostly apply in the
+following conditions:
+
+- user space only [in kernel space the lock itself is the spinlock
+ that protects the 'struct fulock'; as well, there is no preemption
+ and we spin]
+
+- non real-time environments/processes (where preemption is likely)
+
+- real-time SMP environments where two tasks might compete for a lock
+  at the _same_ time (so to avoid it in this case, it might be
+  interesting to spin a wee bit before blocking).
+
+Now, in order to gain robustness you need serialization, so userspace
+users are recommended to use serialized wake ups only when:
+
+- they need a *full* and complete robustness guarantee
+
+- they need a real-time priority-inversion protection guarantee (in
+  SMP; not needed for UP).
+
+
+Q: What is a non-serialized unlock?
+
+A non-serialized unlock works by setting the fulock to unlocked and
+waking up as many waiters as desired. The waiters then re-contend for
+the fulock; the winner owns it and the others go back to waiting on it.
+
+Main problem: I can't find a way to guarantee robustness.
+
+- Say we have a fulock with M guys waiting and we wake up N (1 < N <
+ M), a non-serialized wakeup. Thus, there are M - N guys still
+ waiting in the kernel.
+
+ In order for the unlock to be non-serialized, the waker first sets
+ the futex to UNLOCKED.
+
+ Now, how do the woken up processes know that there are still
+ processes in the kernel?
+
+  A solution is to set the lock not to UNLOCKED, but to
+  UNLOCKED+WP (ie: maintain the WP); this way, whoever tries to
+  acquire it will see UNLOCKED+WP and will go down to the kernel to do
+  the lock operation. Also, the lock operation can return a special
+  code that says "contend, but set the lock to pid|WP" to indicate
+  waiters present.
+
+ However, it still does not solve the fact that when setting to
+ UNLOCKED+WP and waking N, if those N die before locking, the
+ waiters go into dead-owner dead-lock.
+
+  When somebody tries to lock that, the kernel should be able to
+  notice that there are waiters and it is unlocked, and thus give
+  way to the first locker with dead-owner recovery --it might be too late.
+
+  Another option might be to tag the woken-up processes before they
+  exit the kernel, so that if they die, do_exit() can trap it (but there
+  are many side issues to this, like how do I make sure that the N who
+  I woke up have gone through it, one has locked, the other N-1 have
+  gone back to sleep, how do I clean it up and stuff like that--it makes
+  it pretty ugly, not to mention how many resources it'd need to tag it).
+
+  Gosh, it is a mess -- I would say that robust mutexes have to
+  require serialized unlocks. Period.
+
+  Let's not even talk about adding a short timer to verify that the
+  thing was locked...shivers
+
+ RESEARCH: tentative ownership: when we wake up some guys who are
+ supposed to go and try to lock again, tie the fulock they should
+ lock to their task_struct and on exit(), check they don't have it
+ there ... [many details need to be worked out].
+
+- It could also be solved by waking up _all_ the waiters; this way no
+  dead-owner dead-lock could ever happen; however, it is a sure recipe
+  for a scheduling storm some day.
+
+
+Q: What is a serialized unlock?
+
+A serialized unlock transfers the ownership of the fulock to the first
+waiter in the kernel.
+
+- Only one waiter can be woken up at the same time with this method.
+
+- It prevents priority inversion (as the fulock stays locked during
+ the whole operation no low priority thread can acquire it in the
+ meantime).
+
+- A dead-owner dead-lock is not possible, because the owner is always
+  known during the operation. As well, if the new owner dies on its
+  way up to user space, its ownership is also known.
+
+It is slower (still not rigorously measured--postulated and seen only
+in some very synthetic benchmarks) because it forces a context switch.
+
+
+Q: What is a vfulock?
+
+It is the address in user space associated with a fulock in kernel
+space.
+
+
+Q: What is owner identification?
+
+Owner identification is a property that the fulocks have: basically,
+they can identify who is the owner based on the vfulock (in user
+space) or the internal kernel data structures that refer to it (if the
+vfulock is VFULOCK_KCO, that means that the kernel tracks the
+ownership); if vfulock < VFULOCK_KCO, it means that the ownership is
+tracked only in user space, and the vfulock is the PID of the owner.
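+
+In rough pseudo-code (a sketch of the idea behind __fulock_id_owner(),
+not the actual function; it assumes the in-kernel fulock keeps an
+owner pointer once the kernel knows about it):
+
+	if (vfulock < VFULOCK_KCO)
+		owner = find_task_by_pid (vfulock);	/* fast-locked: PID is in user space */
+	else
+		owner = fulock->owner;			/* KCO/DEAD: the kernel tracks it */
+	if (owner == NULL) {
+		/* the PID maps to no live task: the fast-path owner died,
+		 * so the fulock goes dead-owner and the claimer takes it */
+	}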
+
+
+Q: What is a kernel-controlled-ownership fulock? (kco)
+
+A fulock whose vfulock's value is VFULOCK_KCO or VFULOCK_DEAD,
+indicating that the ownership of the fulock is tracked by the kernel
+[FIXME: KCO fulocks are also those created with the FULOCK_FL_KCO
+flag--not implemented].
+
+This happens when:
+
+ - The fulock is locked and there are waiters in the kernel
+ - The fulock is dead (and the kernel keeps track of the ownership)
+ - The fulock is a priority-protected fulock
+
+Basically it is a way to indicate that the fastpath for
+locking/unlocking cannot be taken.
+
+
+Q: What is an ownership/wait chain?
+
+The structure that is formed when task A owns lock F and is waiting
+for lock G, owned by task B that is waiting for lock H, that is owned
+by task C that is waiting for lock I ... etc.
+
+When this chain is circular (e.g., lock I is owned by A) then there is
+a deadlock.
\ No newline at end of file

2004-01-14 22:59:15

by Perez-Gonzalez, Inaky

[permalink] [raw]
Subject: [RFC/PATCH] FUSYN 6/10: user space/kernel space tracker

include/linux/vlocator.h | 128 ++++++++++++
kernel/vlocator.c | 465 +++++++++++++++++++++++++++++++++++++++++++++++
2 files changed, 593 insertions(+)

--- /dev/null Wed Jan 14 14:39:30 2004
+++ linux/include/linux/vlocator.h Mon Jan 5 10:23:13 2004
@@ -0,0 +1,128 @@
+
+/*
+ * Fast User real-time/pi/pp/robust/deadlock SYNchronization
+ * (C) 2002-2003 Intel Corp
+ * Inaky Perez-Gonzalez <[email protected]>.
+ *
+ * Licensed under the FSF's GNU Public License v2 or later.
+ *
+ * Based on normal futexes (futex.c), (C) Rusty Russell.
+ * Please refer to Documentation/fusyn.txt for more info.
+ *
+ * Stuff to map user space addresses to kernel space objects in a
+ * controlled way.
+ */
+
+#ifndef __linux_vlocator_h__
+#define __linux_vlocator_h__
+
+#ifdef __KERNEL__
+
+#include <linux/list.h>
+#include <linux/hash.h>
+#include <asm/atomic.h>
+#include <linux/futex.h> /* for the futex hash */
+
+
+/** Track @what by a user level vm address. */
+struct vlocator {
+ atomic_t refcount;
+ struct list_head hash_list;
+ union futex_key key;
+ const struct vlocator_ops *ops;
+};
+
+static __inline__
+void vlocator_init (struct vlocator *vlocator)
+{
+ atomic_set (&vlocator->refcount, 0);
+ INIT_LIST_HEAD (&vlocator->hash_list);
+}
+
+
+/** A single queue in the hash. */
+struct vlocator_queue
+{
+ spinlock_t lock;
+ struct list_head queue;
+ unsigned additions;
+};
+
+static __inline__
+void vlocator_queue_init (struct vlocator_queue *vlq)
+{
+ vlq->lock = SPIN_LOCK_UNLOCKED;
+ INIT_LIST_HEAD (&vlq->queue);
+ vlq->additions = 0;
+}
+
+struct page;
+union futex_key;
+
+/** vlocator operations on the items that are to be located. */
+struct vlocator_ops
+{
+ struct vlocator * (* alloc) (void);
+ void (* create) (struct vlocator *, struct page *,
+ const union futex_key *, unsigned flags);
+ void (* release) (struct vlocator *);
+ void (* free) (struct vlocator *);
+};
+
+
+/** Reference a vlocator @vlocator. */
+#define vl_get(vlocator) \
+do { atomic_inc (&(vlocator)->refcount); } while (0)
+
+
+/** Unreference a @vlocator; return true if it drops to zero. */
+static inline
+unsigned vl_put (struct vlocator *vlocator)
+{
+ return atomic_dec_and_test (&vlocator->refcount);
+}
+
+extern int vl_find (struct page **ppage, struct vlocator **pvl,
+ const struct vlocator_ops *ops,
+ volatile const unsigned __user *uaddr);
+extern int vl_locate (struct page **ppage, struct vlocator **pvl,
+ const struct vlocator_ops *ops,
+ volatile const unsigned __user *uaddr, unsigned flags);
+extern int vl_dispose (const struct vlocator_ops *,
+ volatile const unsigned __user *);
+
+/* Debug stuff */
+
+/*
+ * Helpers for debugging memory allocation [needs to be here so it can
+ * be shared by ufuqueue.c and ufulock.c].
+ */
+
+#if DEBUG >= 1
+static atomic_t allocated_count = ATOMIC_INIT(0);
+static inline
+int allocated_get (void)
+{
+ return atomic_read (&allocated_count);
+}
+static inline
+int allocated_inc (void)
+{
+ atomic_inc (&allocated_count);
+ return allocated_get();
+}
+static inline
+int allocated_dec (void)
+{
+ atomic_dec (&allocated_count);
+ return allocated_get();
+}
+#else
+static inline int allocated_get (void) { return 0; }
+static inline int allocated_inc (void) { return 0; }
+static inline int allocated_dec (void) { return 0; }
+#endif /* DEBUG >= 1 */
+
+
+#endif /* #ifdef __KERNEL__ */
+#endif /* #ifndef __linux_vlocator_h__ */
--- /dev/null Wed Jan 14 14:39:30 2004
+++ linux/kernel/vlocator.c Mon Jan 5 10:27:09 2004
@@ -0,0 +1,465 @@
+
+/*
+ * Fast User real-time/pi/pp/robust/deadlock SYNchronization
+ * (C) 2002-2003 Intel Corp
+ * Inaky Perez-Gonzalez <[email protected]>.
+ *
+ * Licensed under the FSF's GNU Public License v2 or later.
+ *
+ * Based on normal futexes (futex.c), (C) Rusty Russell.
+ * Please refer to Documentation/fusyn.txt for more info.
+ */
+
+#include <linux/time.h>
+#include <linux/sched.h>
+#include <linux/futex.h>
+#include <asm/uaccess.h>
+#include <linux/init.h>
+#include <linux/pagemap.h>
+#include <linux/fuqueue.h> /* For the debug statements */
+#include <linux/vlocator.h>
+
+#define VL_GC_PERIOD 20 /* do garbage collection every 20 seconds */
+#define VL_HASH_BITS 8
+#define VL_HASH_QUEUES (1 << VL_HASH_BITS)
+static struct vlocator_queue __vl_hash[VL_HASH_QUEUES];
+static struct list_head __vl_disposal = LIST_HEAD_INIT (__vl_disposal);
+static spinlock_t __vl_disposal_lock = SPIN_LOCK_UNLOCKED;
+
+/** Get the hash index for futex_key @k. */
+static __inline__
+struct vlocator_queue * vl_hash (const union futex_key *k)
+{
+ unsigned index = futex_hash_key (k) & (VL_HASH_QUEUES - 1);
+ return &__vl_hash[index];
+}
+
+
+/** Search in a queue. */
+static __inline__
+struct vlocator * __vlocator_find (struct vlocator_queue *vlq,
+ const union futex_key *key)
+{
+ struct vlocator *vl = NULL;
+ struct list_head *itr;
+
+ list_for_each (itr, &vlq->queue) {
+ vl = container_of (itr, struct vlocator, hash_list);
+ if (match_futex_key (key, &vl->key))
+ goto out;
+ }
+ vl = NULL;
+out:
+ return vl;
+}
+
+
+/** Search in a queue [backwards - locates faster recent additions]. */
+static inline
+struct vlocator * __vlocator_find_r (struct vlocator_queue *vlq,
+ const union futex_key *key)
+{
+ struct vlocator *vl = NULL;
+ struct list_head *itr;
+
+ list_for_each_prev (itr, &vlq->queue) {
+ vl = container_of (itr, struct vlocator, hash_list);
+ if (match_futex_key (key, &vl->key))
+ goto out;
+ }
+ vl = NULL;
+out:
+ return vl;
+}
+
+
+/** Remove @v from its queue if its reference count is zero (and
+ * return true). */
+static __inline__
+unsigned __vlocator_rem (struct vlocator *v)
+{
+ unsigned result = 0, refcount;
+ refcount = atomic_read (&v->refcount);
+ if (refcount == 0) {
+ list_del (&v->hash_list);
+ result = 1;
+ }
+ return result;
+}
+
+
+/** Append @vlocator to queue @vlq. */
+static __inline__
+void __vlocator_add (struct vlocator_queue *vlq, struct vlocator *v)
+{
+ vlq->additions++;
+ list_add (&v->hash_list, &vlq->queue);
+}
+
+
+/**
+ * Setup all the memory mumbo jumbo we need to access a user space
+ * item: locate the page where the address is and generate a futex_key
+ * for it. Return both referenced [with get_page() and get_key_refs()].
+ *
+ * Shamelessly copied from Jamie's kernel/futex.c:get_futex_key()...
+ * good invention this GPL thingie :)
+ */
+int vl_setup (struct page **ppage, union futex_key *key,
+ volatile const unsigned __user *_uaddr)
+{
+ int result;
+ struct page *page;
+ unsigned long uaddr = (unsigned long) _uaddr;
+ struct mm_struct *mm = current->mm;
+ struct vm_area_struct *vma;
+
+ ftrace ("(%p, %p, %p)\n", ppage, key, _uaddr);
+
+ /* The futex address must be "naturally" aligned. */
+ key->both.offset = uaddr % PAGE_SIZE;
+ if (unlikely ((key->both.offset % sizeof (u32)) != 0))
+ return -EINVAL;
+ uaddr -= key->both.offset;
+
+ /* Generate a key for the vfulock. */
+ down_read (&mm->mmap_sem);
+ /*
+ * The futex is hashed differently depending on whether
+ * it's in a shared or private mapping. So check vma first.
+ */
+ vma = find_extend_vma (mm, uaddr);
+ result = -EFAULT;
+ if (unlikely (!vma))
+ goto error_up;
+ /* Permissions. */
+ result = (vma->vm_flags & VM_IO) ? -EPERM : -EACCES;
+ if (unlikely ((vma->vm_flags & (VM_IO|VM_READ)) != VM_READ))
+ goto error_up;
+ /*
+ * Private mappings are handled in a simple way.
+ *
+ * NOTE: When userspace waits on a MAP_SHARED mapping, even if
+ * it's a read-only handle, it's expected that futexes attach to
+ * the object not the particular process. Therefore we use
+ * VM_MAYSHARE here, not VM_SHARED which is restricted to shared
+ * mappings of _writable_ handles.
+ */
+ result = 0;
+ if (likely (!(vma->vm_flags & VM_MAYSHARE))) {
+ key->private.mm = mm;
+ key->private.uaddr = uaddr;
+ goto out_pin;
+ }
+ /* Linear file mappings are also simple. */
+ key->shared.inode = vma->vm_file->f_dentry->d_inode;
+ key->both.offset++; /* Bit 0 of offset indicates inode-based key. */
+ if (likely (!(vma->vm_flags & VM_NONLINEAR))) {
+ key->shared.pgoff = (((uaddr - vma->vm_start) >> PAGE_SHIFT)
+ + vma->vm_pgoff);
+ goto out_pin;
+ }
+
+ /*
+ * We really need to pin the page in memory, we are about to
+ * modify it.
+ */
+
+ /*
+ * Do a quick atomic lookup first - this is the fastpath.
+ */
+#warning FIXME: OPTIMIZE:
+ /*
+ * We only need to take this path when we don't know the page;
+ * we can hack vl_locate() so that if we know the key without
+ * checking the page and we find the vlocator, we can take the
+ * page from there.
+ */
+out_pin:
+ spin_lock (&mm->page_table_lock);
+ page = follow_page (mm, uaddr, 0);
+ if (likely (page != NULL)) {
+ key->shared.pgoff =
+ page->index << (PAGE_CACHE_SHIFT - PAGE_SHIFT);
+ if (!PageReserved (page))
+ get_page (page);
+ spin_unlock (&mm->page_table_lock);
+ goto out;
+ }
+ spin_unlock (&mm->page_table_lock);
+
+ /*
+ * Do it the general way.
+ */
+ result = get_user_pages (current, mm, uaddr, 1, 0, 0, &page, NULL);
+ if (result < 0)
+ goto error_up;
+ key->shared.pgoff = page->index << (PAGE_CACHE_SHIFT - PAGE_SHIFT);
+out:
+ get_key_refs (key);
+ *ppage = page;
+error_up:
+ up_read (&current->mm->mmap_sem);
+ return result;
+}
+
+
+/**
+ * Locate a vlocator for a certain user address.
+ *
+ * @ppage: Pointer to where to store the pointer to the referenced
+ * page where @uaddr is. NULL if unneeded.
+ * @pvl: Pointer to where to store the pointer to the located
+ * vlocator based structure.
+ * @ops: Pointer to the function ops for this guy (for checking)
+ * @uaddr: Address in user space.
+ * @returns: < 0 errno code on error (no resources held)
+ * 0 if ok and:
+ *
+ * - Located vlocator in *pvl, referenced
+ * - Page pointer in *ppage (unless NULL), referenced
+ * - The key in (*pvl)->key has been referenced with get_key_refs()
+ */
+int vl_find (struct page **ppage, struct vlocator **pvl,
+ const struct vlocator_ops *ops,
+ volatile const unsigned __user *uaddr)
+{
+ int result;
+ struct vlocator_queue *vlq;
+ struct vlocator *vl;
+ union futex_key key;
+ struct page *page;
+ unsigned need_page = 0;
+
+ ftrace ("(%p, %p, %p, %p)\n", ppage, pvl, ops, uaddr);
+
+ result = vl_setup (&page, &key, uaddr);
+ if (result < 0)
+ goto error;
+ /* Is it in the hash already? [do we know about it?] */
+ vlq = vl_hash (&key);
+ spin_lock (&vlq->lock);
+ vl = __vlocator_find (vlq, &key);
+ if (likely (vl == NULL))
+ goto out_unlock;
+ /* Check the type is correct */
+ result = -EFAULT;
+ if (unlikely (vl->ops != ops))
+ goto out_unlock;
+ result = 0;
+ vl_get (vl);
+ *pvl = vl;
+ if (ppage) {
+ *ppage = page;
+ need_page = 1;
+ }
+out_unlock:
+ spin_unlock (&vlq->lock);
+	drop_key_refs (&key); // Undo from ...
+	if (need_page == 0)
+		put_page (page); // ... vl_setup() (maybe)
+error:
+ return result;
+}
+
+
+/**
+ * Locate or allocate a vlocator for a certain user address.
+ *
+ * @ppage: Pointer to where to store the pointer to the referenced
+ * page where @uaddr is. NULL if unneeded.
+ * @pvl: Pointer to where to store the pointer to the allocated
+ * vlocator based structure.
+ * @ops: Pointer to the function ops used for allocating/constructing, etc.
+ * @uaddr: Address in user space.
+ * @flags: flags, for initialization, passed to ops->create()
+ * @returns: < 0 errno code on error (no resources held)
+ * 0 if ok and:
+ *
+ * - Located or allocated vlocator in *pvl, referenced
+ * - Page pointer in *ppage (unless NULL), referenced
+ * - The key in (*pvl)->key has been referenced with get_key_refs()
+ *
+ * FIXME: optimize allocation collision detection; add a modification
+ * count to the hash table; whenever anything is added, inc the
+ * count. If we see the count changed after we droped the lock,
+ * allocated and re-locked, then we need to find_r(), if not, we can
+ * assume nothing changed.
+ */
+int vl_locate (struct page **ppage, struct vlocator **pvl,
+ const struct vlocator_ops *ops,
+ volatile const unsigned __user *uaddr, unsigned flags)
+{
+ int result;
+ struct vlocator_queue *vlq;
+ struct vlocator *vl, *vl_alt;
+ union futex_key key;
+ struct page *page;
+ unsigned need_page = 0;
+ unsigned additions0;
+
+ ftrace ("(%p, %p, %p, %p, %x)\n", ppage, pvl, ops, uaddr, flags);
+
+ result = vl_setup (&page, &key, uaddr);
+ if (result < 0)
+ goto error;
+ /* Is it in the hash already? [do we know about it?] */
+ vlq = vl_hash (&key);
+ spin_lock (&vlq->lock);
+ vl = __vlocator_find (vlq, &key);
+ if (likely (vl != NULL))
+ goto out_check;
+ additions0 = vlq->additions;
+ spin_unlock (&vlq->lock);
+
+ /* Naaah, let's alloc it */
+ result = -ENOMEM;
+ vl = ops->alloc();
+ if (unlikely (vl == NULL))
+ goto error_alloc;
+ result = 0;
+
+	/* Ok, allocated - did somebody add it while we were allocating?
+	 * [we try to speed up by checking if anybody has added anything
+	 * since we dropped the lock--we live with the chance of 4G
+	 * allocations wrapping the counter in the middle *grin*]. */
+ spin_lock (&vlq->lock);
+ if (additions0 == vlq->additions
+ || (vl_alt = __vlocator_find_r (vlq, &key)) == NULL) {
+ ops->create (vl, page, &key, flags);
+ vl->ops = ops;
+ __vlocator_add (vlq, vl);
+ goto out_ref;
+ }
+ /* Allocation collision, get the new one, discard ours */
+ ops->free (vl);
+ vl = vl_alt;
+out_check:
+ result = -EFAULT;
+ if (unlikely (vl->ops != ops)) /* Check the type is correct */
+ goto out_unlock;
+ result = 0;
+out_ref:
+ vl_get (vl);
+ *pvl = vl;
+ if (ppage) {
+ *ppage = page;
+ need_page = 1;
+ }
+out_unlock:
+ spin_unlock (&vlq->lock);
+error_alloc:
+	drop_key_refs (&key); // Undo from ...
+	if (need_page == 0)
+		put_page (page); // ... vl_setup() (maybe)
+error:
+ return result;
+}
+
+
+/**
+ * Dispose a vlocator
+ *
+ * When a vlocator is no longer needed, remove it from the hash table
+ * (add it to a delete list so the garbage collector can properly get
+ * rid of it).
+ */
+int vl_dispose (const struct vlocator_ops *vops,
+ volatile const unsigned __user *uaddr)
+{
+ int result = -ENXIO;
+ struct vlocator_queue *vlq;
+ struct vlocator *vl;
+ struct page *page;
+ union futex_key key;
+
+ result = vl_setup (&page, &key, uaddr);
+ if (result < 0)
+ goto error;
+ vlq = vl_hash (&key);
+ spin_lock (&vlq->lock);
+ vl = __vlocator_find (vlq, &key);
+ if (vl == NULL)
+ goto out_unlock;
+ /* In the hash, move it out */
+ result = 0;
+ list_del (&vl->hash_list);
+ spin_lock (&__vl_disposal_lock);
+ list_add_tail (&vl->hash_list, &__vl_disposal);
+ spin_unlock (&__vl_disposal_lock);
+out_unlock:
+ spin_unlock (&vlq->lock);
+ put_page (page);
+ drop_key_refs (&key);
+error:
+ return result;
+}
+
+
+/** Work structure for the garbage collector */
+static void vl_garbage_collector (void *);
+DECLARE_WORK(vl_workqueue, vl_garbage_collector, NULL);
+
+
+/**
+ * Do garbage collection (called from the work-queue) and re-arm
+ *
+ * The most important action is to clean up the ownership of [u]fulocks
+ * that were assigned to init because the owner died and robustness
+ * was not enabled. This means that only ufulocks get to optionally
+ * support robustness; kernel-based fulocks have to do robustness
+ * always.
+ */
+
+static inline
+void __vl_garbage_collect (struct list_head *list)
+{
+ struct list_head *itr, *nxt;
+ struct vlocator *vl;
+ list_for_each_safe (itr, nxt, list) {
+ vl = container_of (itr, struct vlocator, hash_list);
+ if (__vlocator_rem (vl)) {
+ vl->ops->release (vl);
+ vl->ops->free (vl);
+ }
+ }
+}
+
+static
+void vl_garbage_collector (void *dummy)
+{
+ unsigned cnt;
+ struct vlocator_queue *vlq;
+
+ /* Cleanup the ufulock hash: anything with a zero refcount is wiped */
+ for (cnt = 0; cnt < VL_HASH_QUEUES; cnt++) {
+ vlq = &__vl_hash[cnt];
+ if (list_empty (&vlq->queue)) /* Some cheating always helps... */
+ continue;
+ spin_lock (&vlq->lock);
+ __vl_garbage_collect (&vlq->queue);
+ spin_unlock (&vlq->lock);
+ }
+ /* Cleanup the list of stuff marked for disposal */
+ spin_lock (&__vl_disposal_lock);
+ __vl_garbage_collect (&__vl_disposal);
+ spin_unlock (&__vl_disposal_lock);
+ /* Re-arm */
+ schedule_delayed_work (&vl_workqueue, VL_GC_PERIOD * HZ);
+}
+
+
+/** Initialize the ufu subsystem. */
+static
+int __init subsys_ufu_init (void)
+{
+ unsigned i;
+ for (i = 0; i < sizeof (__vl_hash) / sizeof (__vl_hash[0]); i++)
+ vlocator_queue_init (&__vl_hash[i]);
+	/* Set up the garbage collector to run every VL_GC_PERIOD seconds */
+ schedule_delayed_work (&vl_workqueue, VL_GC_PERIOD * HZ);
+ return 0;
+}
+__initcall (subsys_ufu_init);
+
+

2004-01-14 22:54:13

by Perez-Gonzalez, Inaky

[permalink] [raw]
Subject: [RFC/PATCH] FUSYN 7/10: user space fuqueues

ufuqueue.c | 236 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
1 files changed, 236 insertions(+)

--- /dev/null Wed Jan 14 14:39:30 2004
+++ linux/kernel/ufuqueue.c Wed Nov 19 17:32:07 2003
@@ -0,0 +1,236 @@
+
+/*
+ * Fast User real-time/pi/pp/robust/deadlock SYNchronization
+ * (C) 2002-2003 Intel Corp
+ * Inaky Perez-Gonzalez <[email protected]>.
+ *
+ * Licensed under the FSF's GNU Public License v2 or later.
+ *
+ * Based on normal futexes (futex.c), (C) Rusty Russell.
+ * Please refer to Documentation/fusyn.txt for more info.
+ */
+
+#include <linux/sched.h>
+#include <linux/init.h>
+#include <linux/highmem.h> /* kmap_atomic() */
+#include <linux/vlocator.h>
+#include <linux/fuqueue.h>
+
+extern signed long ufu_timespec2jiffies (const struct timespec __user *);
+static struct fuqueue_ops ufuqueue_ops;
+
+/** Slab for allocation of 'struct ufuqueue's. */
+static kmem_cache_t *ufuqueue_slab;
+
+/** A ufuqueue, tied to a user-space vm address. */
+struct ufuqueue {
+ struct fuqueue fuqueue;
+ struct vlocator vlocator;
+};
+
+
+/** Initialize an @ufuqueue (for slab object construction). */
+static inline
+void ufuqueue_ctor (void *obj, kmem_cache_t *slab, unsigned long flags)
+{
+ struct ufuqueue *ufuqueue = obj;
+ __fuqueue_init (&ufuqueue->fuqueue, &ufuqueue_ops);
+ vlocator_init (&ufuqueue->vlocator);
+}
+
+
+/** Allocate an ufuqueue maintaining debug-allocation counts. */
+static __inline__
+struct vlocator * ufuqueue_alloc (void)
+{
+ struct ufuqueue *ufuqueue;
+ ufuqueue = kmem_cache_alloc (ufuqueue_slab, GFP_KERNEL);
+ if (DEBUG > 0 && ufuqueue != NULL)
+ ldebug (1, "ufuqueue %p allocated, total: %d\n",
+ ufuqueue, allocated_inc());
+ return &ufuqueue->vlocator;
+}
+
+
+/**
+ * Create a ufuqueue that is going to be inserted into the hash
+ * [called from vlocator.c:vl_locate() through the vlocator ops]
+ */
+static
+void ufuqueue_create (struct vlocator *vl, struct page *page,
+ const union futex_key *key, unsigned flags)
+{
+ memcpy (&vl->key, key, sizeof (*key));
+ get_key_refs (&vl->key);
+}
+
+
+/** Release an ufuqueue's resources (drop the futex key reference). */
+static __inline__
+void ufuqueue_release (struct vlocator *vl)
+{
+ drop_key_refs (&vl->key);
+}
+
+
+/** Free an ufuqueue maintaining debug-allocation counts. */
+static __inline__
+void ufuqueue_free (struct vlocator *vl)
+{
+ struct ufuqueue *ufuqueue =
+ container_of (vl, struct ufuqueue, vlocator);
+	if (DEBUG > 0)
+		ldebug (1, "ufuqueue %p freed, total: %d\n",
+			ufuqueue, allocated_dec());
+	kmem_cache_free (ufuqueue_slab, ufuqueue);
+}
+
+
+struct vlocator_ops ufuqueue_vops = {
+ .alloc = ufuqueue_alloc,
+ .create = ufuqueue_create,
+ .release = ufuqueue_release,
+ .free = ufuqueue_free
+};
+
+
+/**
+ * Wait on a fuqueue
+ *
+ * @_vfuqueue: address in user space of the fuqueue
+ * @val: value @_vfuqueue is expected to contain (wait only if it
+ *       still holds it, as with futexes)
+ * @_timeout: timeout information [see sys_ufulock_lock()]
+ *
+ * This is just a thin shell that locates the kernel descriptor for
+ * the fuqueue, checks the value and then hands the work to
+ * __fuqueue_wait_block().
+ */
+asmlinkage
+int sys_ufuqueue_wait (volatile unsigned __user *_vfuqueue,
+ unsigned val, struct timespec __user *_timeout)
+{
+ int result = -EINVAL;
+ struct ufuqueue *ufuqueue;
+ struct page *page;
+ char *page_kmap;
+ volatile unsigned *vfuqueue;
+ unsigned new_val;
+ struct vlocator *vl;
+ signed long timeout;
+ struct fuqueue_waiter w;
+
+ ftrace ("(%p, %x, %p)\n", _vfuqueue, val, _timeout);
+ might_sleep();
+
+ /* ufuqueue: pin pages, get keys, look up/allocate, refcount */
+ result = vl_locate (&page, &vl, &ufuqueue_vops, _vfuqueue, 0);
+ if (unlikely (result < 0))
+ goto out;
+ ufuqueue = container_of (vl, struct ufuqueue, vlocator);
+ /* We are going to lock the ufuqueue, so get the timeout first */
+ timeout = ufu_timespec2jiffies (_timeout);
+ result = (int) timeout;
+ if (timeout < 0)
+ goto out_put_vl;
+
+ spin_lock_irq (&ufuqueue->fuqueue.lock);
+ __fuqueue_wait_queue (&ufuqueue->fuqueue, &w);
+ /* Now, are we ok with the value? */
+ page_kmap = kmap_atomic (page, KM_IRQ0);
+ vfuqueue = (volatile unsigned *)page_kmap + (vl->key.both.offset & ~1);
+ new_val = *vfuqueue;
+ kunmap_atomic (page_kmap, KM_IRQ0);
+ result = -EWOULDBLOCK;
+ if (val != new_val)
+ goto out_unqueue;
+	/* ok, go ahead and wait (it will unlock and restore irqs/preempt) */
+ return __fuqueue_wait_block (&ufuqueue->fuqueue, &w, timeout);
+
+out_unqueue:
+ __fuqueue_wait_unqueue (&w, 0);
+ spin_unlock_irq (&ufuqueue->fuqueue.lock);
+out_put_vl:
+ vl_put (vl);
+ put_page (page);
+out:
+ return result;
+}
+
+
+/**
+ * Wake up waiters of a fuqueue
+ *
+ * @_vfuqueue: pointer to the fuqueue
+ * @howmany: number of waiters to wake up
+ * @code: code to return to the waiters [default it to zero]
+ * @returns: 0 if ok, < 0 errno code on error.
+ */
+asmlinkage
+int sys_ufuqueue_wake (volatile unsigned __user *_vfuqueue,
+ size_t howmany, int code)
+{
+ int result;
+ struct vlocator *vl;
+ struct ufuqueue *ufuqueue;
+
+	ftrace ("(%p, %zu, %d)\n", _vfuqueue, howmany, code);
+ might_sleep();
+
+	/* ufuqueue: get keys, look up or allocate, refcount */
+	result = vl_locate (NULL, &vl, &ufuqueue_vops, _vfuqueue, 0);
+ if (result < 0)
+ goto out;
+ ufuqueue = container_of (vl, struct ufuqueue, vlocator);
+ /* Wake'en up */
+ spin_lock_irq (&ufuqueue->fuqueue.lock);
+ __fuqueue_wake (&ufuqueue->fuqueue, howmany, code);
+ spin_unlock_irq (&ufuqueue->fuqueue.lock);
+ vl_put (vl);
+ result = 0;
+out:
+ return result;
+}
+
+
+/** Initialize the ufuqueue subsystem. */
+static
+int __init subsys_ufuqueue_init (void)
+{
+ ufuqueue_slab = kmem_cache_create ("ufuqueue", sizeof (struct ufuqueue),
+ 0, 0, ufuqueue_ctor, NULL);
+ if (ufuqueue_slab == NULL)
+ panic ("ufuqueue_init(): "
+ "Unable to initialize ufuqueue slab allocator.\n");
+ return 0;
+}
+__initcall (subsys_ufuqueue_init);
+
+
+/* Adaptors for fulock operations */
+static
+void ufuqueue_put (struct fuqueue *fuqueue)
+{
+ struct ufuqueue *ufuqueue =
+ container_of (fuqueue, struct ufuqueue, fuqueue);
+ vl_put (&ufuqueue->vlocator);
+}
+
+static
+void ufuqueue_get (struct fuqueue *fuqueue)
+{
+ struct ufuqueue *ufuqueue =
+ container_of (fuqueue, struct ufuqueue, fuqueue);
+ vl_get (&ufuqueue->vlocator);
+}
+
+/** Ufuqueue operations */
+static
+struct fuqueue_ops ufuqueue_ops = {
+ .get = ufuqueue_get,
+ .put = ufuqueue_put,
+ .wait_cancel = __fuqueue_wait_cancel,
+ .chprio = __fuqueue_chprio
+};
+

2004-01-14 22:56:34

by Perez-Gonzalez, Inaky

[permalink] [raw]
Subject: [RFC/PATCH] FUSYN 5/10: kernel fuqueues

include/linux/fuqueue.h | 451 ++++++++++++++++++++++++++++++++++++++++++++++++
include/linux/plist.h | 197 ++++++++++++++++++++
kernel/fuqueue.c | 219 +++++++++++++++++++++++
3 files changed, 867 insertions(+)

--- /dev/null Wed Jan 14 14:39:30 2004
+++ linux/include/linux/fuqueue.h Wed Jan 14 11:13:25 2004
@@ -0,0 +1,451 @@
+
+/*
+ * Fast User real-time/pi/pp/robust/deadlock SYNchronization
+ * (C) 2002-2003 Intel Corp
+ * Inaky Perez-Gonzalez <[email protected]>.
+ *
+ * Licensed under the FSF's GNU Public License v2 or later.
+ *
+ * Based on normal futexes (futex.c), (C) Rusty Russell.
+ * Please refer to Documentation/fusyn.txt for more info.
+ *
+ * Quick usage guide:
+ *
+ * struct fuqueue f = fuqueue_INIT (&f);
+ *
+ * fuqueue_wait (&f, 4343) Wait for max 4343 jiffies
+ * fuqueue_wake (&f, 5, -EPERM) Wake the first five guys waiting on f
+ * with -EPERM.
+ *
+ * These are simple priority-sorted wait queues that can be used from
+ * the kernel. They provide the foundation for more complex items
+ * (fulocks, fuconds, ufulocks, ufuconds) and use from user-space
+ * (ufuqueues, just like futexes).
+ *
+ * The type wlist_t provides the sorting for the wait list. The type
+ * 'struct fuqueue_waiter' represents a waiting task on a fuqueue. The
+ * 'struct fuqueue_ops' is what allows extensibility for other
+ * synchronization primitives.
+ *
+ * I have just realized that this is too similar to the wait_queues...
+ */
+
+#ifndef __linux_fuqueue_h__
+#define __linux_fuqueue_h__
+
+#ifdef __KERNEL__
+
+#include <linux/spinlock.h>
+#include <linux/plist.h>
+#include <linux/sched.h>
+#include <asm/hardirq.h>
+
+ /*
+ * Debug stuff -- move forward for seeing the real meat
+ *
+ * Levels
+ *
+ * 0 Nothing activated, all compiled out
+ * > 0 Activates assertions, memory allocation tracking
+ * > 1 Activates debug messages
+ * > 2 Activates [most] function traces
+ * > 3 Activates random debug stuff
+ */
+
+#undef DEBUG
+#define DEBUG 0
+
+#if DEBUG > 0
+#if 0 || !defined(__i386__)
+#define __debug_printstr(a...) printk(a) /* Dump to normal console */
+#else /* Dump straight to ttyS0 */
+#include <linux/serial_reg.h>
+#include <linux/stringify.h>
+static inline void __debug_outb (unsigned val, int port) {
+ __asm__ __volatile__ ("outb %b0,%w1" : : "a" (val), "Nd" (port));
+}
+static inline unsigned __debug_inb (int port) {
+ unsigned value;
+ __asm__ __volatile__ ("inb %w1,%b0" : "=a" (value) : "Nd" (port));
+ return value;
+}
+static inline
+void __debug_printstr (const char *str) {
+ const int port = 0x03f8;
+ while (*str) {
+ while (!(__debug_inb (port + UART_LSR) & UART_LSR_THRE));
+ __debug_outb (*str++, port+UART_TX);
+ }
+ __debug_outb ('\r', port + UART_TX);
+}
+#endif
+
+extern spinlock_t __debug_lock;
+
+static inline
+u64 __tsc_read (void)
+{
+ u64 tsc;
+#if defined(__i386__)
+ __asm__ __volatile__("rdtsc" : "=A" (tsc));
+#elif defined (__ia64__)
+ __asm__ __volatile__("mov %0=ar.itc" : "=r" (tsc) : : "memory");
+#else
+#warning "Architecture not supported in __tsc_read()!"
+ tsc = 0;
+#endif
+ return tsc;
+}
+
+#define __debug(a...) \
+do { \
+ /* Dirty: Try to avoid >1 CPUs printing ... will suck */ \
+ char __X_buf[256]; \
+ unsigned __X_len; \
+ unsigned long __X_flags; \
+ __X_len = snprintf (__X_buf, 255, "%Lu: %s:%d: %s[%d:%d] ", \
+ __tsc_read(), __FILE__, __LINE__, __FUNCTION__, \
+ current->pid, current->thread_info->cpu); \
+ snprintf (__X_buf + __X_len, 255 - __X_len, a); \
+ spin_lock_irqsave (&__debug_lock, __X_flags); \
+ __debug_printstr (__X_buf); \
+ spin_unlock_irqrestore (&__debug_lock, __X_flags); \
+} while (0)
+#endif /* #if DEBUG > 0 */
+
+/* The real debug statements */
+
+#if DEBUG > 0
+#define ldebug(l,a...) do { if (DEBUG >= l) __debug (a); } while (0)
+#define debug(a...) ldebug(1,a)
+#define fdebug(f, a...) do { if ((DEBUG >= 2) && f) __debug (a); } while (0)
+#define __ftrace(l,a...) do { if ((l)) __debug (a); } while (0)
+#define ftrace(a...) __ftrace((DEBUG >= 2),a)
+#define assert(c, a...) do { if ((DEBUG >= 0) && !(c)) BUG(); } while (0)
+#else
+#define ldebug(l,a...)
+#define debug(a...)
+#define fdebug(f, a...)
+#define __ftrace(l,a...)
+#define ftrace(a...)
+#define assert(c, a...)
+#endif
+
+
+
+
+/**
+ * Wait list type, O(N) addition, O(1) removal
+ *
+ * This is just a redirection of the methods used to manage the list
+ * of waiters that are waiting for a fuqueue. For hardcore-timing
+ * applications, you might want to use a PALIST instead of a PLIST,
+ * but FIXME, it is not done yet :)
+ */
+
+typedef struct plist wlist_t;
+
+#define FULOCK_WLIST_USE_PLIST
+#ifdef FULOCK_WLIST_USE_PLIST
+#define wlist_add plist_add
+#define wlist_rem plist_rem
+#define __wlist_rem __plist_rem
+#define wlist_first plist_first
+#define wlist_empty plist_empty
+#define wlist_INIT plist_INIT
+#define wlist_init plist_init
+#define wlist_last plist_last
+#define wlist_update_prio plist_update_prio
+#define wlist_prio plist_prio
+#define wlist_chprio plist_chprio
+#define __wlist_chprio __plist_chprio
+#if 0
+#define wlist_for_each_safe plist_for_each_safe
+#endif
+#endif
+
+struct task_struct;
+struct fuqueue;
+
+/** Descriptor of a waiting task */
+struct fuqueue_waiter {
+ wlist_t wlist_node; /* node for the wait list */
+ struct task_struct *task; /* task that is waiting */
+ int result; /* what happened */
+};
+
+/**
+ * Operations on a fuqueue.
+ *
+ * FIXME: is it worth having get/put? maybe they should be enforced
+ * for every fuqueue, this way we don't have to query the ops
+ * structure for the get/put method and if it is there, call
+ * it. We'd have to move the get/put ops over to the vlocator,
+ * but that's not much of a problem.
+ *
+ * The deciding factor is that an atomic operation needs to
+ * lock the whole bus and is not as scalable as testing a ptr
+ * for NULLness.
+ *
+ * For simplicity, probably the idea of forcing the refcount in
+ * the fuqueue makes sense.
+ */
+struct fuqueue_ops {
+ void (* get) (struct fuqueue *);
+ void (* put) (struct fuqueue *);
+ unsigned (* wait_cancel) (struct fuqueue *, struct fuqueue_waiter *);
+ struct task_struct * (* chprio) (
+ struct task_struct *, struct fuqueue *,
+ struct fuqueue_waiter *);
+};
+
+/** A fuqueue, a prioritized wait queue usable from kernel space. */
+struct fuqueue {
+ spinlock_t lock;
+ wlist_t wlist;
+ struct fuqueue_ops *ops;
+};
+
+
+/** Initialize a @fuqueue structure with given @ops */
+static inline
+void __fuqueue_init (struct fuqueue *fuqueue, struct fuqueue_ops *ops)
+{
+ spin_lock_init (&fuqueue->lock);
+ wlist_init (&fuqueue->wlist);
+ fuqueue->ops = ops;
+}
+
+/** Statically initialize a @fuqueue with given @ops. */
+#define __fuqueue_INIT(fuqueue, ops) { \
+ .lock = SPIN_LOCK_UNLOCKED, \
+ .wlist = wlist_INIT (&(fuqueue)->wlist), \
+ .ops = (ops) \
+}
+
+
+/** fuqueue operations for in-kernel usage */
+extern struct fuqueue_ops fuqueue_ops;
+
+
+/** Initialize a @fuqueue for usage within the kernel */
+static inline
+void fuqueue_init (struct fuqueue *fuqueue)
+{
+ __fuqueue_init (fuqueue, &fuqueue_ops);
+}
+
+/** Statically initialize a @fuqueue for usage within the kernel */
+#define fuqueue_INIT(fuqueue) __fuqueue_INIT(fuqueue, &fuqueue_ops)
+
+
+/** Wait for a @fuqueue to be woken for as much as @timeout, @returns
+ * wake up code */
+extern int fuqueue_wait (struct fuqueue *fuqueue, signed long timeout);
+
+/* Cancel the wait on a fuqueue */
+extern void fuqueue_wait_cancel (struct task_struct *, int);
+
+/*
+ * The following are functions to be able to access fuqueue
+ * functionality when building on top of it, like the [u]fulocks,
+ * [u]fuconds and ufuqueues.
+ */
+
+#if DEBUG > 0
+/* BUG_ON() firing? Temporary fix, do you have CONFIG_PREEMPT enabled?
+ * either that or disable DEBUG (force #define it to zero). */
+#define CHECK_IRQs() do { BUG_ON (!in_atomic()); } while (0)
+#else
+#define CHECK_IRQs() do {} while (0)
+#endif
+
+
+/**
+ * Setup @current to wait for a fuqueue.
+ *
+ * This only setups, it does not block.
+ *
+ * @fuqueue: fuqueue to wait on.
+ * @w: waiter structure to fill up and queue.
+ * @returns !0 if the wlist prio changed.
+ *
+ * Fills up @current's fuqueue_wait* info and queues up @w after
+ * filling it.
+ *
+ * WARNING: Call with preempt and local IRQs disabled
+ */
+static inline
+unsigned __fuqueue_wait_queue (struct fuqueue *fuqueue,
+ struct fuqueue_waiter *w)
+{
+ ftrace ("(%p, %p)\n", fuqueue, w);
+ CHECK_IRQs();
+
+ _raw_spin_lock (&current->fuqueue_wait_lock);
+ current->fuqueue_wait = fuqueue;
+ current->fuqueue_waiter = w;
+ _raw_spin_unlock (&current->fuqueue_wait_lock);
+ w->task = current;
+ w->result = INT_MAX;
+ __wlist_chprio (&w->wlist_node, current->prio);
+ return wlist_add (&fuqueue->wlist, &w->wlist_node);
+}
+
+
+/**
+ * Wakes up a single waiter and cleans up its task wait information.
+ *
+ * Needs to be split from __fuqueue_wake_waiter() as
+ * fuqueue_wake_cancel() needs to acquire the locks in a funny way to
+ * prevent issues.
+ *
+ * WARNING: call with preempt and local IRQs disabled!!
+ */
+static inline
+void __fuqueue_wait_unqueue (struct fuqueue_waiter *w, int result)
+{
+ struct task_struct *task = w->task;
+
+ ftrace ("(%p [%d], %d)\n", w, w->task->pid, result);
+ CHECK_IRQs();
+
+ __wlist_rem (&w->wlist_node);
+ _raw_spin_lock (&task->fuqueue_wait_lock);
+ task->fuqueue_wait = NULL;
+ task->fuqueue_waiter->result = result;
+ task->fuqueue_waiter = NULL;
+ _raw_spin_unlock (&task->fuqueue_wait_lock);
+}
+
+
+/**
+ * Wait for a @fuqueue until woken up, @returns wake up code
+ *
+ * Needs to be called with the fuqueue lock held, local IRQs disabled
+ * and preempt disabled. Will release the lock, enable IRQs and
+ * preemtion.
+ */
+static inline
+int __fuqueue_wait_block (struct fuqueue *fuqueue, struct fuqueue_waiter *w,
+ signed long timeout)
+{
+ ftrace ("(%p, %p, %ld)\n", fuqueue, w, timeout);
+ CHECK_IRQs();
+
+ set_current_state (TASK_INTERRUPTIBLE);
+ _raw_spin_unlock (&fuqueue->lock);
+ local_irq_enable();
+ preempt_enable_no_resched();
+
+ /* Wait until we are woken up */
+ timeout = schedule_timeout (timeout);
+ set_current_state (TASK_RUNNING);
+ /*
+ * Now, whoever woke us up had to call first
+ * fuqueue->ops->wait_cancel() through fuqueue_wait_cancel(),
+ * and thus, unqueued us and set up a return code in
+ * w->result. However, if w->result is still pristine as left
+ * by __fuqueue_wait_queue(), that means our waker either
+ * didn't know we were waiting or we have signal(s) pending,
+ * so we do it ourselves.
+ */
+ if (unlikely (w->result == INT_MAX)) {
+ int result = -EINTR;
+ BUG_ON (wlist_empty (&w->wlist_node));
+ if (timeout == 0)
+ result = -ETIMEDOUT;
+ else
+ WARN_ON (!signal_pending (current));
+ fuqueue_wait_cancel (current, result);
+ }
+ return w->result;
+}
+
+
+
+/** @Returns true if the @fuqueue has no waiters. */
+static inline
+unsigned __fuqueue_empty (const struct fuqueue *fuqueue)
+{
+ return wlist_empty (&fuqueue->wlist);
+}
+
+
+/** Return the first waiter on a @fuqueue */
+static inline
+struct fuqueue_waiter * __fuqueue_first (struct fuqueue *fuqueue)
+{
+ return container_of (wlist_first (&fuqueue->wlist),
+ struct fuqueue_waiter, wlist_node);
+}
+
+/** Wake @howmany @fuqueue waiters with @code. */
+static inline
+void __fuqueue_wake (struct fuqueue *fuqueue, size_t howmany, int code)
+{
+ unsigned to_go = howmany;
+ struct fuqueue_waiter *w;
+
+ ftrace ("(%p, %zu, %d)\n", fuqueue, howmany, code);
+
+ while (to_go-- && !__fuqueue_empty (fuqueue)) {
+ w = __fuqueue_first (fuqueue);
+ __fuqueue_wait_unqueue (w, code);
+ wake_up_process (w->task);
+ }
+ if (likely (to_go != howmany))
+ wlist_update_prio (&fuqueue->wlist);
+}
+
+
+/** Wake @howmany waiters waiting on a @fuqueue, with return code (for
+ * them) @code */
+static inline
+int fuqueue_wake (struct fuqueue *fuqueue, size_t howmany, int code)
+{
+ unsigned long flags;
+
+ ftrace ("(%p, %zu, %d)\n", fuqueue, howmany, code);
+
+ spin_lock_irqsave (&fuqueue->lock, flags);
+ __fuqueue_wake (fuqueue, howmany, code);
+ spin_unlock_irqrestore (&fuqueue->lock, flags);
+ return 0;
+}
+
+
+/* See docs in kernel/fuqueue.c */
+extern unsigned __fuqueue_wait_cancel (struct fuqueue *, struct fuqueue_waiter *);
+
+/** Propagation of priority change */
+extern void fuqueue_chprio (struct task_struct *);
+extern void __set_prio (struct task_struct *p, int prio);
+
+
+/**
+ * Set the priority of a fuqueue waiter, repositioning it in the wait
+ * list.
+ *
+ * This does not set the prio of the process itself!
+ *
+ * @task: task to reprioritize (its new priority is already in task->prio)
+ * @fuqueue: fuqueue the task is waiting for [locked]
+ * @w: waiter @task is waiting on in @fuqueue.
+ * @returns: NULL (as there is no propagation).
+ *
+ * Assumptions: prio != task->prio
+ * fuqueue->lock held
+ */
+static inline
+struct task_struct * __fuqueue_chprio (struct task_struct *task,
+ struct fuqueue *fuqueue,
+ struct fuqueue_waiter *w)
+{
+ wlist_chprio (&fuqueue->wlist, &w->wlist_node, task->prio);
+ return NULL;
+}
+
+#endif /* #ifdef __KERNEL__ */
+#endif /* #ifndef __linux_fuqueue_h__ */
--- /dev/null Wed Jan 14 14:39:30 2004
+++ linux/include/linux/plist.h Sun Nov 16 07:21:32 2003
@@ -0,0 +1,197 @@
+/*
+ * Descending-priority-sorted double-linked list
+ *
+ * (C) 2002-2003 Intel Corp
+ * Inaky Perez-Gonzalez <[email protected]>.
+ *
+ * Licensed under the FSF's GNU Public License v2 or later.
+ *
+ * Based on simple lists (include/linux/list.h).
+ *
+ * The head will always contain the highest priority of the nodes in
+ * the list. Addition is O(N), removal is O(1), change of priority is
+ * O(N).
+ *
+ * 0 is highest priority, ~0 is lowest priority.
+ *
+ * No locking is done, up to the caller.
+ */
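+
+/*
+ * Quick usage sketch (illustrative only; 0 is the highest priority):
+ *
+ *	struct plist head = plist_INIT (&head);
+ *	struct plist a = plist_INIT (&a), b = plist_INIT (&b);
+ *
+ *	__plist_chprio (&a, 10);
+ *	__plist_chprio (&b, 5);
+ *	plist_add (&head, &a);		O(N) sorted insertion
+ *	plist_add (&head, &b);		b sorts before a (5 < 10)
+ *	plist_first (&head);		returns &b, the highest priority node
+ *	plist_rem (&head, &b);		O(1) removal, head prio updated
+ */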
+
+#ifndef _LINUX_PLIST_H_
+#define _LINUX_PLIST_H_
+
+#include <linux/list.h>
+
+/* Priority-sorted list */
+struct plist {
+ unsigned prio;
+ struct list_head node;
+};
+
+#define plist_INIT(p) \
+{ \
+ .node = LIST_HEAD_INIT((p)->node), \
+ .prio = ~0 \
+}
+
+/* Initialize a pl */
+static inline
+void plist_init (struct plist *pl)
+{
+ INIT_LIST_HEAD (&pl->node);
+ pl->prio = ~0;
+}
+
+/* Return the first node (and thus, highest priority)
+ *
+ * Assumes the plist is _not_ empty.
+ */
+static inline
+struct plist * plist_first (struct plist *plist)
+{
+ return list_entry (plist->node.next, struct plist, node);
+}
+
+/* Return if the plist is empty. */
+static inline
+unsigned plist_empty (const struct plist *plist)
+{
+ return list_empty (&plist->node);
+}
+
+/* Update the maximum priority
+ *
+ * Return !0 if the plist's maximum priority changes.
+ *
+ * __plist_update_prio() assumes the plist is not empty.
+ */
+static inline
+unsigned __plist_update_prio (struct plist *plist)
+{
+ unsigned prio = plist_first (plist)->prio;
+ if (plist->prio == prio)
+ return 0;
+ plist->prio = prio;
+ return !0;
+}
+
+static inline
+unsigned plist_update_prio (struct plist *plist)
+{
+ unsigned old_prio = plist->prio;
+ /* plist empty, max prio = ~0 */
+ plist->prio = plist_empty (plist)? ~0 : plist_first (plist)->prio;
+ return old_prio != plist->prio;
+}
+
+/* Add a node to the plist [internal]
+ *
+ * pl->prio == ~0 is a special case: it means low priority, go down to
+ * the end of the plist. Note the <; we want to add a new node of a
+ * set of same-priority nodes to the end of that set (FIFO behaviour
+ * within the same priority).
+ */
+static inline
+void __plist_add_sorted (struct plist *plist, struct plist *pl)
+{
+ struct list_head *itr;
+ struct plist *itr_pl;
+
+ if (pl->prio < ~0)
+ list_for_each (itr, &plist->node) {
+ itr_pl = list_entry (itr, struct plist, node);
+ if (pl->prio < itr_pl->prio)
+ break;
+ }
+ else
+ itr = &plist->node;
+ list_add_tail (&pl->node, itr);
+}
+
+/* Add a node to a plist
+ *
+ * Return !0 if the plist's maximum priority changes.
+ */
+static inline
+unsigned plist_add (struct plist *plist, struct plist *pl)
+{
+ __plist_add_sorted (plist, pl);
+ /* Are we setting a higher priority? */
+ if (pl->prio < plist->prio) {
+ plist->prio = pl->prio;
+ return !0;
+ }
+ return 0;
+}
+
+/* Return the priority of a pl node */
+static inline
+unsigned plist_prio (struct plist *pl)
+{
+ return pl->prio;
+}
+
+/* Change the priority of a pl node, without updating plist position */
+static inline
+void __plist_chprio (struct plist *pl, unsigned new_prio)
+{
+ pl->prio = new_prio;
+}
+
+/* Change the priority of a pl node updating the list's max priority.
+ *
+ * Return !0 if the plist's maximum priority changes
+ */
+static inline
+unsigned plist_chprio (struct plist *plist, struct plist *pl,
+ unsigned new_prio)
+{
+ __plist_chprio (pl, new_prio);
+ list_del (&pl->node);
+ __plist_add_sorted (plist, pl);
+ return __plist_update_prio (plist);
+}
+
+/* Remove a pl node from a plist
+ *
+ * Return !0 if the plist's maximum priority changed.
+ */
+static inline
+void __plist_rem (struct plist *pl)
+{
+ list_del_init (&pl->node);
+}
+static inline
+unsigned plist_rem (struct plist *plist, struct plist *pl)
+{
+ __plist_rem (pl);
+ return plist_update_prio (plist);
+}
+
+/* Iterate over a plist - in priority order (from high to low) */
+#define plist_for_each(pos, head) \
+for (pos = container_of ((head)->node.next, struct plist, node), \
+ prefetch (pos->node.next); \
+ pos != (head); \
+ pos = container_of (pos->node.next, struct plist, node), \
+ prefetch (pos->node.next))
+
+#define plist_for_each_safe(pos, n, head) \
+ for (pos = container_of ((head)->node.next, struct plist, node), \
+ n = container_of (pos->node.next, struct plist, node); \
+ pos != (head); \
+ pos = n, \
+ n = container_of (pos->node.next, struct plist, node))
+
+
+/** Return !0 if node @pl is the last one on list @plist. */
+
+static inline
+unsigned plist_last (const struct plist *plist, const struct plist *pl)
+{
+ return pl->node.next == &plist->node;
+}
+
+
+#endif /* #ifndef _LINUX_PLIST_H_ */
+
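For a quick feel of how the plist API above is meant to be used, here is a
sketch only (not part of the patch); the function below and its variable
names are made up, and only the interface declared in plist.h is assumed:

	#include <linux/plist.h>

	/* Made-up example: a wait-list head and two waiters. */
	static struct plist head, waiter_a, waiter_b;

	static void plist_example (void)
	{
		struct plist *top;

		plist_init (&head);
		plist_init (&waiter_a);
		plist_init (&waiter_b);

		/* Lower number means higher priority (0 is highest). */
		__plist_chprio (&waiter_a, 10);
		__plist_chprio (&waiter_b, 3);

		plist_add (&head, &waiter_a);	/* O(N), keeps the list sorted */
		plist_add (&head, &waiter_b);	/* returns !0: max prio changed */

		top = plist_first (&head);	/* &waiter_b, prio 3 */
		plist_rem (&head, top);		/* O(1) unlink + prio update */
	}
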
--- /dev/null Wed Jan 14 14:39:30 2004
+++ linux/kernel/fuqueue.c Tue Dec 9 20:12:24 2003
@@ -0,0 +1,219 @@
+
+/*
+ * Fast User real-time/pi/pp/robust/deadlock SYNchronization
+ * (C) 2002-2003 Intel Corp
+ * Inaky Perez-Gonzalez <[email protected]>.
+ *
+ * Licensed under the FSF's GNU Public License v2 or later.
+ *
+ * Based on normal futexes (futex.c), (C) Rusty Russell.
+ * Please refer to Documentation/fusyn.txt for more info.
+ *
+ * see the doc in linux/fuqueue.h for some info.
+ */
+
+#define DEBUG 8
+
+#include <linux/fuqueue.h>
+#include <linux/sched.h>
+#include <linux/plist.h>
+#include <linux/errno.h>
+
+#if DEBUG > 0
+spinlock_t __debug_lock = SPIN_LOCK_UNLOCKED;
+#endif
+
+
+/**
+ * Wait on a @fuqueue until woken up or @timeout expires; @returns the
+ * wake up code
+ *
+ * WARNING: can only be called from process context
+ */
+int fuqueue_wait (struct fuqueue *fuqueue, signed long timeout)
+{
+ struct fuqueue_waiter w;
+
+ ftrace ("(%p, %ld)\n", fuqueue, timeout);
+
+ spin_lock_irq (&fuqueue->lock);
+ __fuqueue_wait_queue (fuqueue, &w);
+ return __fuqueue_wait_block (fuqueue, &w, timeout); /* unlocks */
+}
+
+
+/**
+ * Change the priority of a waiter.
+ *
+ * @task: task whose priority has to be changed.
+ *
+ * This is the entry point that the scheduler functions call when a
+ * task that is waiting on a fuqueue changes its priority. It safely
+ * determines which fuqueue the task is waiting for and calls that
+ * fuqueue's specific chprio function; if the chprio function decides
+ * that the priority change has to be propagated to another task, it
+ * keeps doing so.
+ *
+ * The task that the chprio function returns has to be returned with a
+ * get_task_struct() reference held.
+ *
+ * Note the weird locking: this is one of the few places that needs to
+ * take the locks in inverse order (most take fuqueue first, task_wait
+ * later--FT); we need TF, so we take T, trylock F, and if that fails,
+ * unlock T and try again.
+ */
+void fuqueue_chprio (struct task_struct *task)
+{
+ struct fuqueue_ops *ops;
+ struct fuqueue *fuqueue;
+ struct fuqueue_waiter *w;
+ int prio = task->prio;
+ unsigned long flags;
+
+ ftrace ("(%p [%d])\n", task, task->pid);
+ CHECK_IRQs();
+
+ local_irq_save (flags);
+ preempt_disable();
+ get_task_struct (task);
+next:
+ /* Who is the task waiting for? safely acquire and lock it */
+ _raw_spin_lock (&task->fuqueue_wait_lock);
+ fuqueue = task->fuqueue_wait;
+ if (fuqueue == NULL) /* Ok, not waiting */
+ goto out_task_unlock;
+ if (!_raw_spin_trylock (&fuqueue->lock)) { /* Spin dance... */
+ _raw_spin_unlock (&task->fuqueue_wait_lock);
+ goto next;
+ }
+ ops = fuqueue->ops;
+ if (ops->get)
+ ops->get (fuqueue);
+
+ w = task->fuqueue_waiter;
+ _raw_spin_unlock (&task->fuqueue_wait_lock);
+ put_task_struct (task);
+
+ /* Ok, we need to propagate the prio change to the fuqueue */
+ ops = fuqueue->ops;
+ task = ops->chprio (task, fuqueue, w);
+ if (task == NULL)
+ goto out_fuqueue_unlock;
+
+ /* Do we need to propagate to the next task */
+ get_task_struct (task);
+ _raw_spin_unlock (&fuqueue->lock);
+ if (ops->put)
+ ops->put (fuqueue);
+ __set_prio (task, prio);
+ goto next;
+
+out_fuqueue_unlock:
+ _raw_spin_unlock (&fuqueue->lock);
+ if (ops->put)
+ ops->put (fuqueue);
+ goto out;
+
+out_task_unlock:
+ _raw_spin_unlock (&task->fuqueue_wait_lock);
+ put_task_struct (task);
+out:
+ local_irq_restore (flags);
+ preempt_enable();
+ return;
+}
+
+
+/** Cancel @task's wait on @fuqueue and update the wait list priority */
+unsigned __fuqueue_wait_cancel (struct fuqueue *fuqueue,
+ struct fuqueue_waiter *w)
+{
+ ftrace ("(%p, %p [%d])\n", fuqueue, w, w->task->pid);
+
+ __wlist_rem (&w->wlist_node);
+ return wlist_update_prio (&fuqueue->wlist);
+}
+
+
+/**
+ * Cancel the wait of a task on a fuqueue and wake it up.
+ *
+ * @task: task whose wait is to be canceled
+ *
+ * Called by:
+ * - signal_wake_up()
+ * - process_timeout()
+ * - __wake_up_common()
+ * - FIXME
+ *
+ * when the task they are about to wake is waiting on a
+ * fuqueue. Safely acquires which fuqueue the task is waiting for,
+ * references it, cleans up the task->fuqueue_wait* information, and
+ * then calls the fuqueue specific wait_cancel() function.
+ *
+ * FIXME: the entry points we get called from don't seem to cover all
+ * the cases; the perfect thing here would be to hook into
+ * try_to_wake_up()--but that is kind of tricky...
+ *
+ * Note that for this function to actually do anything, the task must
+ * be a task waiting on a fuqueue (so that means TASK_INTERRUPTIBLE
+ * and off the runqueues).
+ */
+void fuqueue_wait_cancel (struct task_struct *task, int result)
+{
+ struct fuqueue_ops *ops;
+ struct fuqueue *fuqueue;
+ struct fuqueue_waiter *w;
+ unsigned long flags;
+
+ ftrace ("(%p [%d], %d)\n", task, task->pid, result);
+
+ local_irq_save (flags);
+ preempt_disable();
+ get_task_struct (task);
+retry:
+ /* Who is the task waiting for? safely acquire and lock it */
+ _raw_spin_lock (&task->fuqueue_wait_lock);
+ fuqueue = task->fuqueue_wait;
+ if (fuqueue == NULL) /* Ok, not waiting */
+ goto out_task_unlock;
+ if (!_raw_spin_trylock (&fuqueue->lock)) { /* Spin dance... */
+ _raw_spin_unlock (&task->fuqueue_wait_lock);
+ goto retry;
+ }
+ ops = fuqueue->ops;
+ if (ops->get)
+ ops->get (fuqueue);
+
+ w = task->fuqueue_waiter;
+ w->result = result;
+
+ /* Do the specific cancel op */
+ ops->wait_cancel (fuqueue, w);
+ task->fuqueue_wait = NULL;
+ task->fuqueue_waiter = NULL;
+ _raw_spin_unlock (&task->fuqueue_wait_lock);
+ put_task_struct (task);
+ _raw_spin_unlock (&fuqueue->lock);
+ local_irq_restore (flags);
+ preempt_enable();
+ if (ops->put)
+ ops->put (fuqueue);
+ return;
+
+out_task_unlock:
+ _raw_spin_unlock (&task->fuqueue_wait_lock);
+ put_task_struct (task);
+ local_irq_restore (flags);
+ preempt_enable();
+ return;
+}
+
+
+/** Fuqueue operations for usage within the kernel */
+struct fuqueue_ops fuqueue_ops = {
+ .get = NULL,
+ .put = NULL,
+ .wait_cancel = __fuqueue_wait_cancel,
+ .chprio = __fuqueue_chprio
+};

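The locking comment in fuqueue_chprio() above (take the task's wait lock,
trylock the fuqueue, back off and retry) is the classic way of taking two
locks in the reverse of their usual order without deadlocking. A minimal
sketch of the pattern, with made-up lock names and plain spinlock
primitives (the real code additionally runs with IRQs off and uses the
_raw_ variants):

	#include <linux/spinlock.h>
	#include <asm/processor.h>	/* cpu_relax() */

	/* lock_b is normally taken before lock_a elsewhere, so taking
	 * them a-then-b here needs the trylock/back-off dance to avoid
	 * an AB/BA deadlock. */
	static spinlock_t lock_a = SPIN_LOCK_UNLOCKED;
	static spinlock_t lock_b = SPIN_LOCK_UNLOCKED;

	static void take_both_in_inverse_order (void)
	{
		for (;;) {
			spin_lock (&lock_a);
			if (spin_trylock (&lock_b))
				break;			/* got both */
			spin_unlock (&lock_a);		/* back off, retry */
			cpu_relax ();
		}
		/* ... work holding both locks ... */
		spin_unlock (&lock_b);
		spin_unlock (&lock_a);
	}
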
2004-01-14 23:04:24

by Perez-Gonzalez, Inaky

[permalink] [raw]
Subject: [RFC/PATCH] FUSYN 10/10: user space fulocks

ufulock.c | 1010 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
1 files changed, 1010 insertions(+)

--- /dev/null Wed Jan 14 14:39:30 2004
+++ linux/kernel/ufulock.c Wed Jan 14 13:38:30 2004
@@ -0,0 +1,1010 @@
+
+/*
+ * Fast User real-time/pi/pp/robust/deadlock SYNchronization
+ * (C) 2002-2003 Intel Corp
+ * Inaky Perez-Gonzalez <[email protected]>.
+ *
+ * Licensed under the FSF's GNU Public License v2 or later.
+ *
+ * Based on normal futexes (futex.c), (C) Rusty Russell.
+ * Please refer to Documentation/fusyn.txt for more info.
+ *
+ *
+ * The ufulocks are fulocks with a vlocator to help pinpoint them from
+ * a user space address.
+ *
+ * The vfulock is the value in an address in user space that is
+ * associated to the ufulock. The trick here is to keep them in sync
+ * [__vfulock_sync()]. I think it is doing the right thing when we
+ * know about a ufulock in the kernel and something wipes it to some
+ * crappy value in user space (wild pointer)--however, it cannot be
+ * perfect.
+ *
+ * Mapping the address in user space to the ufulock is done through a
+ * hash table kept by the vlocator.h code - ufulock_locate() does the
+ * full work; give it an address, you get a ufulock. No need to worry
+ * about anything else.
+ *
+ * QUICK ROADMAP:
+ *
+ * sys_ufulock_lock(): [try]lock a fulock from user space
+ * sys_ufulock_unlock(): unlock a fulock from user space
+ * sys_ufulock_consistency(): get/set the consistency of a ufulock
+ * __ufulock_exit(): called from fulock.c:exit_fulocks() when a
+ * process exits and still holds fulocks.
+ * ufulock_gc(): called from ufu-common.c:ufu_gc() to do garbage
+ * collection.
+ *
+ * TODOs:
+ *
+ * - All the #warning FIXME's need to be sorted out
+ *
+ * - There is no need to change the flags of a mutex once it has been
+ * initialized [POSIX doesn't provide tools for it anyway], except
+ * for the prioceiling. Still need to provide a way for user space
+ * to tell kernel space that we are relinquishing a fulock forever
+ * (for pthread_mutex_destroy()) so we can clean it up and reinit with
+ * different flags; as well, maybe call in from pthread_mutex_init()
+ * so we can cache the fulock?
+ */
+
+#warning FIXME: page unpinning still not that done
+/*
+ * The problem is we need to access the page to modify the value on
+ * when a ufulock dies. We could do it on the exit path, but we don't
+ * have a way to locate the page when it is shared on
+ * __ufulock_exit(). We could have the sys_ufulock_lock() return path
+ * do it, but then we don't update the vfulock to VFULOCK_DEAD when
+ * there are no waiters.
+ *
+ * So, until we find a solution for that, pages are still pinned. It
+ * also means we have to pass the page all the way down to
+ * fulock->ops->owner_set(). A solution would get rid of that
+ * too. It's plain ugly and messy.
+ */
+
+#include <linux/sched.h>
+#include <linux/highmem.h>
+#include <linux/plist.h>
+#include <asm/uaccess.h> /* copy_from_user() */
+#include <linux/init.h>
+
+#include <linux/futex.h> /* futex keys */
+#include <linux/vlocator.h>
+#include <linux/workqueue.h>
+#include <linux/fulock.h>
+
+#warning DEBUG: remove me
+//#include <asm/kgdb.h>
+
+#if DEBUG > 5
+#define __DEBUG_get_value(result, _vfulock) \
+do { \
+ unsigned val = *_vfulock; \
+ debug ("result %d * _vfulock %u/%x\n", result, val, val); \
+} while (0)
+#define DEBUG_get_value(result, _vfulock) \
+do { \
+ unsigned val; \
+ get_user (val, _vfulock); \
+ debug ("result %d * _vfulock %u/%x\n", result, val, val); \
+} while (0)
+#else
+#define __DEBUG_get_value(a,b) do {} while (0)
+#define DEBUG_get_value(a,b) do {} while (0)
+#endif
+
+extern signed long ufu_timespec2jiffies (const struct timespec __user *);
+
+/** Slab for allocation of 'struct ufulock's. */
+static kmem_cache_t *ufulock_slab;
+
+
+struct fulock_ops ufulock_ops;
+
+
+/** Set an ufulock ownership and increment its use count. */
+static __inline__
+void __ufulock_owner_set (struct fulock *fulock, struct task_struct *task)
+{
+ struct ufulock *ufulock;
+
+ ftrace ("(%p, %p [%d])\n", fulock, task, task->pid);
+ ufulock = container_of (fulock, struct ufulock, fulock);
+ __fulock_owner_set (fulock, task);
+ vl_get (&ufulock->vlocator);
+}
+
+
+/** Reset an ufulock ownership and drop its use count. */
+static __inline__
+void __ufulock_owner_reset (struct fulock *fulock)
+{
+ struct ufulock *ufulock;
+
+ ufulock = container_of (fulock, struct ufulock, fulock);
+ ftrace ("(%p [ufulock %p])\n", fulock, ufulock);
+ __fulock_owner_reset (fulock);
+ vl_put (&ufulock->vlocator);
+}
+
+
+/** Initialize an @ufulock (for slab object construction). */
+static
+void ufulock_ctor (void *obj, kmem_cache_t *slab, unsigned long flags)
+{
+ struct ufulock *ufulock = obj;
+ __fulock_init (&ufulock->fulock, &ufulock_ops);
+ vlocator_init (&ufulock->vlocator);
+}
+
+
+/** Allocate an ufulock maintaining debug-allocation counts. */
+static __inline__
+struct vlocator * ufulock_alloc (void)
+{
+ struct ufulock *ufulock;
+ ufulock = kmem_cache_alloc (ufulock_slab, GFP_KERNEL);
+ if (DEBUG > 0 && ufulock != NULL)
+ ldebug (2, "ufulock %p allocated, total: %d\n",
+ ufulock, allocated_inc());
+ return &ufulock->vlocator;
+}
+
+
+/**
+ * Create a ufulock that is going to be inserted to the hash
+ * [called from ufu-common.c:ufu_locate()]
+ */
+static
+void ufulock_create (struct vlocator *vl, struct page *page,
+ const union futex_key *key, unsigned flags)
+{
+ struct ufulock *ufulock = container_of (vl, struct ufulock, vlocator);
+ ufulock->fulock.flags = (flags & FULOCK_FL_USER_MK) | FULOCK_FL_NEW;
+ memcpy (&vl->key, key, sizeof (*key));
+ get_key_refs (&vl->key);
+ get_page (page);
+ ufulock->page = page;
+}
+
+
+/** Release @ufulock's resources. */
+static __inline__
+void ufulock_release (struct vlocator *vl)
+{
+ struct ufulock *ufulock = container_of (vl, struct ufulock, vlocator);
+ BUG_ON (ufulock->page == NULL);
+ put_page (ufulock->page);
+ drop_key_refs (&vl->key);
+}
+
+
+/** Free an ufulock maintaining debug-allocation counts. */
+static __inline__
+void ufulock_free (struct vlocator *vl)
+{
+ struct ufulock *ufulock = container_of (vl, struct ufulock, vlocator);
+ kmem_cache_free (ufulock_slab, ufulock);
+ if (DEBUG > 0 && ufulock != NULL)
+ ldebug (2, "ufulock %p freed, total: %d\n",
+ ufulock, allocated_dec());
+}
+
+
+struct vlocator_ops ufulock_vops = {
+ .alloc = ufulock_alloc,
+ .create = ufulock_create,
+ .release = ufulock_release,
+ .free = ufulock_free
+};
+
+
+
+/**
+ * Set @vfulock to @value, maintaining consistency with @fulock.
+ *
+ * We don't care about the previous value because we only call this
+ * function from points where the kernel has control over the ufulock
+ * and the associated vfulock (KCO: Kernel Controlled Ownership).
+ */
+static __inline__
+void __vfulock_update (volatile unsigned *vfulock,
+ const struct fulock *fulock, unsigned value)
+{
+ ftrace ("(%p, %p, %u/%x)\n", vfulock, fulock, value, value);
+
+ if (fulock->flags & FULOCK_FL_DEAD)
+ value = VFULOCK_DEAD;
+ else if (fulock->flags & FULOCK_FL_NR)
+ value = VFULOCK_NR;
+ vfulock_set (vfulock, value);
+ return;
+}
+
+
+/** @returns expected value of the vfulock given the state of the
+ * @fulock. */
+static inline
+unsigned __vfulock_expected_nonkco (struct fulock *fulock)
+{
+ if (fulock->flags & FULOCK_FL_DEAD)
+ return VFULOCK_DEAD;
+ else if (fulock->flags & FULOCK_FL_NR)
+ return VFULOCK_NR;
+ else if (fulock->owner == NULL)
+ return VFULOCK_UNLOCKED;
+ else if (__fulock_empty (fulock))
+ return fulock->owner->pid;
+ else
+ return VFULOCK_KCO;
+}
+
+
+/**
+ * Perform owner identification chores for __vfulock_sync()
+ *
+ * Probably one of the dirtiest functions I've ever written. The thing
+ * that makes it complex is that after we assign ownership, the
+ * target task might start to exit, and we don't know if it has
+ * gone through exit_fulocks() after we added the fulock to its
+ * ownership list, so we have to be paranoid.
+ *
+ * FIXME: false positives if the PID is reused are the rule. Remedies:
+ *
+ * - The chance of that happening will depend on what the reuse rate is,
+ * how frequently we are assigning new PIDs and the probability of
+ * the sucker dying in the middle of it (so a well-coded program
+ * should be fine, except for the OOM killer, or a crazy process
+ * killing -9 randomly...).
+ *
+ * Let's say the probability is P (one in X)
+ *
+ * - Limit PIDs to 64Ki - 3 (to account for NR, DEAD and KCO), create a
+ * 16 bit hash for each task with immutable elements (eg: pointer,
+ * start time...); compose the hash and the 16 bit pid into a 32 bit
+ * thing that the user space uses for fast-locking. When id'ing,
+ * locate the task by PID, hash it out and compare it; if it fails,
+ * the PID was reused, if not, keep on:
+ *
+ * (there are still chances of collision, but I think we have cut
+ * P down by 64Ki)
+ *
+ * - Check the mm mappings. If the task doesn't have the mapping where
+ * the fulock is, well, that was a collision (and that was chance,
+ * probably like shooting up into the sky in the middle of the desert,
+ * the bullet falling back to the ground, hitting this man [who
+ * happened to be wandering around all lost] in the forehead,
+ * killing him, and it happens he is some rotten-rich guy who had
+ * randomly chosen somebody on the phone book to pass on all his
+ * wealth and that one happens to be you).
+ *
+ * - Now, if the task has the same mapping, we assume it is good.
+ *
+ * I think I cannot be more of a pain in the ass, and probably the
+ * mapping thing is overkill; the hash thing looks like more than
+ * enough, and it is pretty cheap (wasn't computing an exact science?
+ * my flaming bullas).
+ *
+ * If that is still not good enough for you, use KCO mode, sacrificing speed.
+ */
+static
+int __fulock_id_owner (struct fulock *fulock, volatile unsigned *vfulock,
+ unsigned owner_pid, int do_lock)
+{
+ int result = EBUSY;
+ struct task_struct *task;
+
+ ftrace ("(%p, %p, %u, %d)\n", fulock, vfulock, owner_pid, do_lock);
+ CHECK_IRQs();
+
+ BUG_ON (fulock->owner != NULL);
+ _raw_read_lock (&tasklist_lock);
+ task = find_task_by_pid (owner_pid);
+ if (task == NULL || task->flags & PF_EXITING)
+ goto dead_unlock;
+ get_task_struct (task);
+ _raw_read_unlock (&tasklist_lock);
+ __ufulock_owner_set (fulock, task);
+ if (unlikely (task->flags & PF_EXITING)) {
+ __ufulock_owner_reset (fulock);
+ put_task_struct (task);
+ goto dead;
+ }
+ put_task_struct (task);
+ return result;
+
+dead_unlock:
+ _raw_read_unlock (&tasklist_lock);
+dead:
+ result = -EOWNERDEAD;
+ vfulock_set (vfulock, VFULOCK_DEAD);
+ fulock->flags |= FULOCK_FL_DEAD;
+ if (do_lock)
+ __ufulock_owner_set (fulock, current);
+ return result;
+}
+
+
+/**
+ * Sync up a ufulock that supports fast-userspace locking with its
+ * associated vfulock, and maybe fast-lock it.
+ *
+ * @uf: The ufulock to sync up.
+ * @vfulock: pointer to a kmapped associated value.
+ * @returns: < 0 errno code on error [we did not acquire it]
+ * -EOWNERDEAD [previous owner died, we got it]
+ * 0 if we locked it to a PID (and thus 'current' is owner)
+ * > 0 if we didn't lock it, need to proceed to fulock
+ *
+ * Call with uf->fulock.fuqueue.lock held!
+ *
+ * This syncs up the vfulock and the ufulock (if the ufulock is new,
+ * sync from vfulock->ufulock, or the opposite). It tries to leave the
+ * ufulock as it if where a kernel fulock, for the fulock layer to
+ * operate on it.
+ *
+ * When we see VFULOCK_UNLOCKED and move to PID, we don't set the
+ * owner, because there will be nobody coming down to the kernel to
+ * reset it [at any effect, it is like if we had done a user space
+ * fast-lock operation].
+ *
+ * The vfulock can only be modified in the following ways:
+ * - In user space:
+ * - when fast-locking: VFULOCK_UNLOCKED to PID
+ * - when fast-unlocking: PID to VFULOCK_UNLOCKED
+ * - when reinitializing from not-recoverable: VFULOCK_NR to VFULOCK_UNLOCKED
+ * - Illegally: any other transition [mistake, error, wild pointer]
+ * - In kernel space: When we have the lock over the ufulock
+ * associated to that vfulock.
+ *
+ * That means that as we are called with the lock held, we can modify
+ * the vfulock at will knowing that nobody is going to touch it as
+ * long as it is not VFULOCK_UNLOCKED (fast-lock) or a PID
+ * (fast-unlock)--the not-recoverable case is not of concern because
+ * as soon as anyone sees it, they bail out, so we never get to modify
+ * it.
+ *
+ * Now, for those legal-from-user-space modifications, we operate
+ * atomically; the while(1) loop and vfulock_acas() protect against them.
+ */
+#warning FIXME: rename uf -> ufulock when stabilized
+static
+int __vfulock_sync_nonkco (struct ufulock *uf, volatile unsigned *vfulock,
+ unsigned do_lock)
+{
+ int result = 0;
+ unsigned val, expected = 0, sync_k2u = 0;
+
+ ftrace ("(%p, %p, %u)\n", uf, vfulock, do_lock);
+
+ if ((uf->fulock.flags & FULOCK_FL_NEW) == 0) {
+ sync_k2u = 1; /* synch kernel-to-user */
+ expected = __vfulock_expected_nonkco (&uf->fulock);
+ }
+ while (1) {
+ val = *vfulock;
+ // kgdb_ts ((unsigned long) uf, *vfulock);
+ if (sync_k2u && val != expected) { /* Unexpected changes? */
+ printk (KERN_WARNING "pid %d (%s) (ufulock %p): "
+ "out of sync, val %u/0x%x, expected %u/0x%x. "
+ "Fixing.\n", current->pid, current->comm, uf,
+ val, val, expected, expected);
+ if (expected == VFULOCK_UNLOCKED && val < VFULOCK_KCO)
+ goto make_kco; /* Legal: locked? */
+ if (expected < VFULOCK_KCO && val == VFULOCK_UNLOCKED)
+ goto fast_lock; /* Legal: unlocked? */
+ if (!vfulock_acas (vfulock, val, expected))
+ continue; /* Illegal: reset it */
+ val = expected;
+ }
+ switch (val) {
+ case VFULOCK_UNLOCKED: /* unlocked, fast-lock it */
+fast_lock:
+ // kgdb_ts ((unsigned long) uf, *vfulock);
+ if (do_lock == 0)
+ goto out;
+ // kgdb_ts ((unsigned long) uf, *vfulock);
+ if (vfulock_acas (vfulock, val, current->pid))
+ goto out;
+ break;
+ case VFULOCK_DEAD: /* idem, but dead owner */
+ uf->fulock.flags |= FULOCK_FL_DEAD;
+ case VFULOCK_KCO: /* kernel controlled */
+ result = EBUSY;
+ goto out_synced;
+ case VFULOCK_NR: /* gee...gone completely */
+ uf->fulock.flags |= FULOCK_FL_NR;
+ result = -ENOTRECOVERABLE;
+ goto out_synced;
+ default: /* This is a PID(locked) no waiters */
+make_kco:
+ if (vfulock_acas (vfulock, val, VFULOCK_KCO))
+ goto id_owner;
+ }
+ }
+id_owner: /* Id owner, mark as dead if it died */
+ result = __fulock_id_owner (&uf->fulock, vfulock, val, do_lock);
+out_synced:
+ uf->fulock.flags &= ~FULOCK_FL_NEW;
+out: /* We are the sole owners of the lock */
+ __DEBUG_get_value (result, vfulock);
+#warning DEBUG: remove me
+ // if (result == 0)
+ // kgdb_ts ((unsigned long) uf, *vfulock);
+ return result;
+}
+
+
+/** @returns expected value of the vfulock given the state of the
+ * @fulock for KCO mode fulocks. */
+static inline
+unsigned __vfulock_expected_kco (struct fulock *fulock)
+{
+ if (fulock->flags & FULOCK_FL_DEAD)
+ return VFULOCK_DEAD;
+ else if (fulock->flags & FULOCK_FL_NR)
+ return VFULOCK_NR;
+ return VFULOCK_HEALTHY;
+}
+
+
+/**
+ * Synchronize a KCO ufulock with the KCO vfulock in user space.
+ *
+ * KCO ufulocks exist only in kernel space when someone owns them (and
+ * when none does, for a while in the cache). When it disappears, we
+ * need to keep the state of the fulock somewhere (if it was healthy,
+ * dead or not-recoverable). We use the vfulock, and VFULOCK_HEALTHY
+ * (for convenience, the same as VFULOCK_UNLOCKED), VFULOCK_DEAD and
+ * VFULOCK_NR.
+ *
+ * The checks are just to be a PITA and catch errors; they will also
+ * fix up stuff (the information in the kernel being the
+ * authoritative reference).
+ */
+static
+int __vfulock_sync_kco (struct ufulock *ufulock, volatile unsigned *vfulock)
+{
+ struct fulock *fulock;
+ int result = EBUSY;
+ unsigned val, new_flags, expected;
+
+ ftrace ("(%p, %p)\n", ufulock, vfulock);
+
+ fulock = &ufulock->fulock;
+ val = *vfulock;
+ new_flags = fulock->flags & ~FULOCK_FL_NEW;
+ if ((fulock->flags & FULOCK_FL_NEW) == 0) {
+ expected = __vfulock_expected_kco (fulock);
+ if (val != expected) {
+ printk (KERN_WARNING "pid %d (%s) (KCO ufulock %p): "
+ "out of sync, val %u/0x%x, expected %u/0x%x. "
+ "Fixing.\n",
+ current->pid, current->comm, ufulock,
+ val, val, expected, expected);
+ __vfulock_update (vfulock, fulock, expected);
+ }
+ }
+ switch (val) {
+ case VFULOCK_HEALTHY:
+ break;
+ case VFULOCK_DEAD:
+ new_flags |= FULOCK_FL_DEAD;
+ break;
+ case VFULOCK_NR:
+ result = -ENOTRECOVERABLE;
+ new_flags |= FULOCK_FL_NR;
+ break;
+ /* Why default to not-recoverable? well somebody
+ * screwed up the value of this lock, so we can't be
+ * certain of what was its previous state, so the
+ * safest thing to do is to kill it left and right.
+ */
+ default:
+ printk (KERN_WARNING "pid %d (%s) (KCO ufulock %p): "
+ "invalid value 0x%x. Defaulting to not-recoverable.\n",
+ current->pid, current->comm, ufulock, val);
+ __vfulock_update (vfulock, fulock, VFULOCK_NR);
+ new_flags |= FULOCK_FL_NR;
+ __fulock_unlock (fulock, (size_t)~0, -ENOTRECOVERABLE);
+ result = -ENOTRECOVERABLE;
+ }
+ fulock->flags = new_flags;
+ return result;
+}
+
+
+/**
+ * Sync up a ufulock with the user space vfulock.
+ *
+ * Depending on the type of locking, we go one way or another.
+ */
+static inline
+int __vfulock_sync (struct ufulock *ufulock, volatile unsigned *vfulock,
+ unsigned do_lock)
+{
+ ftrace ("(%p, %p, %u)\n", ufulock, vfulock, do_lock);
+
+ return ufulock->fulock.flags & FULOCK_FL_KCO?
+ __vfulock_sync_kco (ufulock, vfulock) :
+ __vfulock_sync_nonkco (ufulock, vfulock, do_lock);
+}
+
+
+/** Helper for kmapping the user's vfulock to the kernel. */
+static inline
+volatile unsigned * vfulock_kmap (struct ufulock *ufulock)
+{
+ ftrace ("(%p)\n", ufulock);
+ BUG_ON (ufulock->page == NULL);
+ return (volatile unsigned *)
+ (kmap_atomic (ufulock->page, KM_IRQ0)
+ + (ufulock->vlocator.key.both.offset & ~1));
+}
+
+/** Unmap the user's vfulock from the kernel. */
+static inline
+void vfulock_kunmap (volatile unsigned *vfulock)
+{
+ ftrace ("(%p)\n", vfulock);
+ kunmap_atomic ((void *) (PAGE_MASK & (unsigned long)vfulock), KM_IRQ0);
+}
+
+
+/**
+ * Lock a ufulock, maybe wait for it to be available.
+ *
+ * @returns: 0 if acquired, < 0 errno code on error.
+ * -EAGAIN Not acquired, try again [change this, conflicts
+ * with POSIX]
+ * -EBUSY Not acquired, is locked, didn't block
+ * -EBADR Not acquired, fulock is not recoverable
+ * -ESRCH Acquired, previous owner died, data might be
+ * inconsistent
+ * * Not acquired, error.
+ *
+ * USER CONTEXT ONLY
+ *
+ * What we do is the minimal operations to obtain a fulock that cannot
+ * be distinguished from a kernel-only fulock and then call the fulock
+ * layer for doing the locking.
+ *
+ * For obtaining that fulock, we use the address [ufulock_locate()],
+ * lock it and sync up the vfulock and ufulock [__vfulock_sync()],
+ * then ask the fulock layer to lock for us [__fulock_lock()], if we
+ * have to].
+ */
+int ufulock_lock (struct ufulock *ufulock, unsigned flags,
+ signed long timeout)
+{
+ volatile unsigned *vfulock;
+ int result;
+
+ ftrace ("(%p, %x, %ld)\n", ufulock, flags, timeout);
+ might_sleep();
+
+ /* Check flags */
+ spin_lock_irq (&ufulock->fulock.fuqueue.lock);
+ result = -EINVAL;
+ if (flags != (ufulock->fulock.flags & FULOCK_FL_USER_MK))
+ goto out_unlock;
+ /* sync up kernel and user space, maybe fast-lock */
+try_again:
+ vfulock = vfulock_kmap (ufulock);
+ result = __vfulock_sync (ufulock, vfulock, 1);
+ // kgdb_ts ((unsigned long) ufulock, result);
+ vfulock_kunmap (vfulock);
+ if (result <= 0) /* Error or fast-locked */
+ goto out_unlock;
+ result = __fulock_lock (&ufulock->fulock, timeout);
+ if (result == -EAGAIN) {
+ spin_lock_irq (&ufulock->fulock.fuqueue.lock);
+ goto try_again; /* Non-serialized wake up */
+ }
+ // kgdb_ts ((unsigned long) ufulock, result);
+ return result;
+
+out_unlock:
+ spin_unlock_irq (&ufulock->fulock.fuqueue.lock);
+ // kgdb_ts ((unsigned long) ufulock, result);
+ return result;
+}
+
+
+/**
+ * Lock a ufulock, maybe wait for it to be available.
+ *
+ * @returns: 0 if acquired, < 0 errno code on error.
+ * -EAGAIN Not acquired, try again [change this, conflicts
+ * with POSIX]
+ * -EBUSY Not acquired, is locked, didn't block
+ * -EBADR Not acquired, fulock is not recoverable
+ * -ESRCH Acquired, previous owner died, data might be
+ * inconsistent
+ * * Not acquired, error.
+ *
+ * USER CONTEXT ONLY
+ *
+ * Massage everything, locate an ufulock and lock it with ufulock_lock().
+ */
+asmlinkage
+int sys_ufulock_lock (volatile unsigned __user *_vfulock, unsigned flags,
+ struct timespec __user *_timeout)
+{
+ struct ufulock *ufulock;
+ struct vlocator *vl;
+ signed long timeout;
+ int result = -EINVAL;
+
+ ftrace ("(%p, 0x%x, %p)\n", _vfulock, flags, _timeout);
+
+ /* FIXME: temporary, until PI supports SCHED_NORMAL */
+ if (unlikely (current->policy == SCHED_NORMAL && flags & FULOCK_FL_PI))
+ goto out;
+ /* Pin page, get key, look up/allocate the ufulock, refcount it */
+ result = vl_locate (NULL, &vl, &ufulock_vops, _vfulock, flags);
+ if (unlikely (result < 0))
+ goto out;
+ ufulock = container_of (vl, struct ufulock, vlocator);
+ timeout = ufu_timespec2jiffies (_timeout);
+ result = (int) timeout;
+ if (unlikely (timeout < 0))
+ goto out_put;
+ // kgdb_ts ((unsigned long) _vfulock, (unsigned long) ufulock);
+ result = ufulock_lock (ufulock, flags, timeout);
+out_put:
+ vl_put (vl);
+out:
+ DEBUG_get_value (result, _vfulock);
+ return result;
+}
+
+
+/**
+ * Sync up an ufulock, then unlock it.
+ *
+ * @ufulock: ufulock structure.
+ * @flags: flags from user space for the ufulock
+ * @howmany: Wake up this many waiters; if 0, wake up only one,
+ * forcing a serialization in the acquisition of the
+ * lock, so that no other task (in user or kernel space)
+ * can acquire it in the meantime.
+ * @returns: 0 if ok, < 0 errno code on error.
+ */
+int __ufulock_unlock (struct ufulock *ufulock, unsigned flags,
+ size_t howmany)
+{
+ volatile unsigned *vfulock;
+ int result = -EINVAL;
+ struct fulock *fulock = &ufulock->fulock;
+
+ ftrace ("(%p, %x, %zu)\n", ufulock, flags, howmany);
+ CHECK_IRQs();
+
+ if (flags != (fulock->flags & FULOCK_FL_USER_MK))
+ goto out;
+ vfulock = vfulock_kmap (ufulock);
+ result = __vfulock_sync (ufulock, vfulock, 0);
+ if (result < 0)
+ goto out_kunmap;
+ /* Now do a proper unlock and update the vfulock */
+ if (howmany > 0)
+ __vfulock_update (vfulock, fulock, VFULOCK_UNLOCKED);
+ result = __fulock_unlock (fulock, howmany, 0);
+ if (result < 0)
+ goto out_kunmap;
+ else if (howmany == 0) {
+ unsigned new_val;
+ if (fulock->flags & FULOCK_FL_KCO)
+ new_val = VFULOCK_HEALTHY;
+ else if (result == 0)
+ new_val = VFULOCK_UNLOCKED;
+ else if (result == 1) { /* enable fast-unlock */
+ new_val = fulock->owner->pid;
+ fulock->flags |= FULOCK_FL_NEW;
+ __ufulock_owner_reset (fulock);
+ }
+ else
+ new_val = VFULOCK_KCO;
+ __vfulock_update (vfulock, &ufulock->fulock, new_val);
+ }
+ else if (__fulock_empty (fulock))
+ fulock->flags |= FULOCK_FL_NEW;
+ result = 0;
+out_kunmap:
+ vfulock_kunmap (vfulock);
+out:
+ return result;
+}
+
+
+/**
+ * Unlock an ufulock, waking up waiter(s)
+ *
+ * @_vfulock: Address for the fulock (user space).
+ * @howmany: Number of waiters to wake up [see ufulock_unlock()]
+ * @returns: 0 if ok, < 0 errno code on error.
+ *
+ * USER CONTEXT ONLY
+ *
+ * There is a gloomy thing here. We should be just looking for a
+ * ufulock associated to _vfulock and if not found, assume there are
+ * no waiters and just set *_vfulock to unlocked.
+ *
+ * But by doing that, we can provoke a race condition; if another
+ * thread B just came in, saw it locked, went into the kernel, saw it
+ * unlocked and acquired it, then it would set the vfulock to KCO and
+ * the original thread A in this moment sets the vfulock to unlocked
+ * -- havoc happens.
+ *
+ * So we force ourselves to do exactly what lock() does: we locate();
+ * this way, we also force the cached copy of the fulock to be
+ * refreshed. Then, the setting of the *_vfulock [from inside the
+ * kernel] is protected by the spinlock.
+ */
+asmlinkage
+int sys_ufulock_unlock (volatile unsigned __user *_vfulock, unsigned flags,
+ size_t howmany)
+{
+ struct vlocator *vl;
+ struct ufulock *ufulock;
+ int result = -ENOMEM;
+ unsigned long cpu_flags;
+
+ ftrace ("(%p, 0x%x, %zu)\n", _vfulock, flags, howmany);
+
+ /* Pin page, get key, look up/allocate the ufulock, refcount it */
+ result = vl_locate (NULL, &vl, &ufulock_vops, _vfulock, flags);
+ if (unlikely (result < 0))
+ goto out;
+ ufulock = container_of (vl, struct ufulock, vlocator);
+ spin_lock_irqsave (&ufulock->fulock.fuqueue.lock, cpu_flags);
+ result = __ufulock_unlock (ufulock, flags, howmany);
+ spin_unlock_irqrestore (&ufulock->fulock.fuqueue.lock, cpu_flags);
+ vl_put (vl);
+out:
+ DEBUG_get_value (result, _vfulock);
+ return result;
+}
+
+
+/**
+ * Change the consistency state of a fulock -- just call the fulock
+ * layer to do it and update the vfulock in user space.
+ *
+ * @_vfulock: Location of the fulock in user space
+ * @flags: flags for the fulock
+ * @consistency: consistency to set to (fulock_con_*). Note it is not
+ * possible to set from some states to others.
+ * @returns: >= 0 previous consistency before the change
+ * < 0 errno code on error.
+ *
+ * USER CONTEXT ONLY
+ */
+#warning FIXME: rename to _ctl()
+asmlinkage
+int sys_ufulock_consistency (unsigned __user *_vfulock,
+ unsigned flags, unsigned consistency)
+{
+ struct vlocator *vl;
+ struct fulock *fulock;
+ struct ufulock *ufulock;
+ int result;
+ unsigned new_value;
+ volatile unsigned *vfulock;
+
+ ftrace ("(%p, 0x%x, %u)\n", _vfulock, flags, consistency);
+ might_sleep();
+
+ /* Ugly: special case: destruction; can't really destroy it
+ * (somebody might be using it still), but we can disconnect
+ * it from the hash until the gc destroys it. */
+ if (consistency == fulock_con_destroy) {
+ vl_dispose (&ufulock_vops, _vfulock);
+ return 0;
+ }
+ /* Pin page, get key, look up/allocate the ufulock, refcount it */
+ result = vl_locate (NULL, &vl, &ufulock_vops, _vfulock, flags);
+ if (unlikely (result < 0))
+ goto out;
+ ufulock = container_of (vl, struct ufulock, vlocator);
+ fulock = &ufulock->fulock;
+ spin_lock_irq (&fulock->fuqueue.lock);
+ /* make sure flags are consistent */
+ if (consistency != fulock_con_init) {
+ result = -EINVAL;
+ if (flags != (fulock->flags & FULOCK_FL_USER_MK))
+ goto out_unlock;
+ }
+ /* Ugly: special case: do we have waiters? */
+ if (consistency == fulock_con_waiters) {
+ result = __fulock_empty (fulock)? 0 : 1;
+ goto out_unlock;
+ }
+ /* Ugly: special case: is it locked? */
+ if (consistency == fulock_con_locked) {
+ result = fulock->owner == NULL? 0 : 1;
+ goto out_unlock;
+ }
+ /* Map the user space area */
+ vfulock = vfulock_kmap (ufulock);
+ result = __vfulock_sync (ufulock, vfulock, 0);
+ if (result < 0)
+ goto out_kunmap;
+ /* Set the consistency */
+ result = __fulock_consistency (fulock, consistency);
+ if (result < 0)
+ goto out_kunmap;
+ /* Ok, update the vfulock */
+ switch (result) {
+ case fulock_st_healthy:
+ if (unlikely (fulock->flags & FULOCK_FL_KCO))
+ new_value = VFULOCK_HEALTHY;
+ else if (__fulock_empty (fulock)) {
+ new_value = current->pid;
+ fulock->flags |= FULOCK_FL_NEW;
+ __ufulock_owner_reset (fulock);
+ }
+ else
+ new_value = VFULOCK_KCO;
+ break;
+ case fulock_st_init:
+ new_value = VFULOCK_UNLOCKED;
+ fulock->flags = (flags & FULOCK_FL_USER_MK) | FULOCK_FL_NEW;
+ break;
+ case fulock_st_dead:
+ new_value = VFULOCK_DEAD;
+ break;
+ case fulock_st_nr:
+ new_value = VFULOCK_NR;
+ break;
+ default:
+ WARN_ON(1);
+ new_value = 0; /* shut gcc up */
+ }
+ vfulock_set (vfulock, new_value);
+ __DEBUG_get_value (result, vfulock);
+ result = 0;
+out_kunmap:
+ vfulock_kunmap (vfulock);
+out_unlock:
+ spin_unlock_irq (&fulock->fuqueue.lock);
+ vl_put (vl);
+out:
+ return result;
+}
+
+
+/** Release as dead @ufulock because the owner is exiting. */
+void __ufulock_exit (struct fulock *fulock)
+{
+ struct ufulock *ufulock;
+ volatile unsigned *vfulock;
+
+ ftrace ("(%p)\n", fulock);
+
+ ufulock = container_of (fulock, struct ufulock, fulock);
+ vfulock = vfulock_kmap (ufulock);
+ /* Update the vfulock to VFULOCK_DEAD */
+ __vfulock_update (vfulock, fulock, VFULOCK_DEAD);
+ vfulock_kunmap (vfulock);
+ __fulock_exit (fulock);
+}
+
+
+/** Cancel @task's wait on the ufulock */
+unsigned __ufulock_wait_cancel (struct fuqueue *fuqueue,
+ struct fuqueue_waiter *w)
+{
+ unsigned prio_changed;
+ struct fulock *fulock =
+ container_of (fuqueue, struct fulock, fuqueue);
+ struct ufulock *ufulock =
+ container_of (fulock, struct ufulock, fulock);
+
+ ftrace ("(%p, %p [%d], %p)\n", fuqueue, w, w->task->pid, w);
+
+ prio_changed = __fulock_wait_cancel (fuqueue, w);
+ if ((fulock->flags & FULOCK_FL_KCO) == 0
+ && __fulock_empty (fulock)) {
+ /* Re-enable fast unlock */
+ volatile unsigned *vfulock;
+ unsigned value = VFULOCK_UNLOCKED;
+ if (fulock->owner) {
+ value = fulock->owner->pid;
+ ufulock->fulock.flags |= FULOCK_FL_NEW;
+ __ufulock_owner_reset (fulock);
+ }
+ vfulock = vfulock_kmap (ufulock);
+ __vfulock_update (vfulock, fulock, value);
+ vfulock_kunmap (vfulock);
+ }
+ return prio_changed;
+}
+
+
+/* Adaptors for fulock operations */
+static
+void ufulock_put (struct fuqueue *fuqueue)
+{
+ struct fulock *fulock =
+ container_of (fuqueue, struct fulock, fuqueue);
+ struct ufulock *ufulock =
+ container_of (fulock, struct ufulock, fulock);
+ vl_put (&ufulock->vlocator);
+}
+
+static
+void ufulock_get (struct fuqueue *fuqueue)
+{
+ struct fulock *fulock =
+ container_of (fuqueue, struct fulock, fuqueue);
+ struct ufulock *ufulock =
+ container_of (fulock, struct ufulock, fulock);
+ vl_get (&ufulock->vlocator);
+}
+
+/** ufulock fulock operations */
+struct fulock_ops ufulock_ops = {
+ .fuqueue = {
+ .get = ufulock_get,
+ .put = ufulock_put,
+ .wait_cancel = __ufulock_wait_cancel,
+ .chprio = __fulock_chprio
+ },
+ .owner_set = __ufulock_owner_set,
+ .owner_reset = __ufulock_owner_reset,
+ .exit = __ufulock_exit,
+};
+
+
+/**
+ * Initialize the ufulock subsystem.
+ */
+static
+int __init subsys_ufulock_init (void)
+{
+ ufulock_slab = kmem_cache_create ("ufulock", sizeof (struct ufulock),
+ 0, 0, ufulock_ctor, NULL);
+ if (ufulock_slab == NULL)
+ panic ("ufulock_init(): "
+ "Unable to initialize ufulock slab allocator.\n");
+ return 0;
+}
+__initcall (subsys_ufulock_init);
+
+
+/**
+ * Convert a user 'struct timespec *' to jiffies with a twist
+ *
+ * @_timespec: pointer to user space 'struct timespec'.
+ *
+ * @returns: MAX_SCHEDULE_TIMEOUT
+ * Wait for ever, if @_timespec is (void *) -1.
+ * 0 Don't wait at all (if @_timespec is NULL).
+ * Other: Number of jiffies to wait as specified in
+ * @_timespec.
+ *
+ * I kind of think this is still a little bit twisted -- raise your
+ * hand if you can think of a better arrangement.
+ *
+ * WARNING: might sleep!
+ */
+signed long ufu_timespec2jiffies (const struct timespec __user *_timespec)
+{
+ struct timespec timespec;
+ signed long j;
+
+ might_sleep();
+ if (_timespec == NULL)
+ j = 0;
+ else if (_timespec == (struct timespec *) ~0)
+ j = MAX_SCHEDULE_TIMEOUT;
+ else {
+ if (copy_from_user (&timespec, _timespec, sizeof (timespec)))
+ return -EFAULT;
+ j = timespec_to_jiffies (&timespec);
+ }
+ return j;
+}

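To make the fast path above concrete from the user side: an uncontended
lock is just one compare-and-swap of the vfulock from VFULOCK_UNLOCKED to
the caller's PID, and anything else drops into sys_ufulock_lock(). A rough
sketch only (the constant value, the syscall number and the helper are
placeholders, not taken from the patch; real users go through the
rtnptl-patched NPTL, and a thread would store its kernel thread id rather
than getpid()):

	#include <unistd.h>
	#include <sys/syscall.h>

	#define VFULOCK_UNLOCKED   0	/* placeholder value */
	#define __NR_ufulock_lock  300	/* placeholder syscall number */

	/* Fast path: CAS the vfulock from unlocked to our PID; on any
	 * other value, let the kernel sync the vfulock, queue us by
	 * priority and deal with dead owners. */
	static int my_fulock_lock (volatile unsigned *vfulock, unsigned flags)
	{
		unsigned old = __sync_val_compare_and_swap (vfulock,
							    VFULOCK_UNLOCKED,
							    (unsigned) getpid ());
		if (old == VFULOCK_UNLOCKED)
			return 0;	/* fast-locked, never entered the kernel */
		return syscall (__NR_ufulock_lock, vfulock, flags,
				(void *) ~0ul /* wait for ever */);
	}

The point of the whole exercise is that the uncontended lock and unlock
never enter the kernel; only contention, robustness and priority handling
do.
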
2004-01-15 01:14:00

by Bernhard Kuhn

[permalink] [raw]
Subject: Re: [RFC/PATCH] FUSYN Realtime & Robust mutexes for Linux try 2.1

[email protected] wrote:

> This code proposes an implementation of kernel based mutexes,

Pretty interesting stuff! I will inspect whether I could combine
it with the "real-time interrupts" I recently described
(http://www.linuxdevices.com/articles/AT6105045931.html).

Currently I'm protecting critical areas with "prioritized
spinlocks" that don't provide a priority inversion avoidance
scheme. Having "real" mutexes with priority inheritance
should be pretty helpful to make the kernel hard real-time
aware.

best regards

Bernhard

2004-01-15 04:58:24

by Perez-Gonzalez, Inaky

[permalink] [raw]
Subject: RE: [RFC/PATCH] FUSYN Realtime & Robust mutexes for Linux try 2.1

> From: Bernhard Kuhn [mailto:[email protected]]

> > This code proposes an implementation of kernel based mutexes,
>
> Pretty interesting stuff! I will inspect whether I could combine
> it with the "real-time interrupts" I recently described
> (http://www.linuxdevices.com/articles/AT6105045931.html).

I saw it, nice trick!

> Currently I'm protecting critical areas with "prioritized
> spinlocks" that don't provide a priority inversion avoidance
> scheme. Having "real" mutexes with priority inheritance
> should be pretty helpful to make the kernel hard real-time
> aware.

Scratch, scratch. Well, the mutexes (fulocks) are implemented
using spinlocks; I mean, the spinlock protects the fulock
structure that describes the mutex (wait list, owner, etc),
so I am afraid we'd be chasing our tail.
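
(For reference, the spinlock in question is the one embedded in each
fulock's fuqueue; roughly -- this is only an abbreviated sketch, the real
definitions are in linux/fuqueue.h and linux/fulock.h:)

	struct fuqueue {
		spinlock_t lock;		/* protects the wait list */
		struct plist wlist;		/* priority-sorted waiters */
		struct fuqueue_ops *ops;
	};

	struct fulock {
		struct fuqueue fuqueue;		/* fuqueue.lock guards all this */
		struct task_struct *owner;	/* current owner, if any */
		unsigned flags;			/* FULOCK_FL_* state bits */
	};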

What's special about the prioritized spinlocks? I don't remember
having read about that in the article.

Iñaky Pérez-González -- Not speaking for Intel -- all opinions are my own (and my fault)